
9. External representation

The thematic catalogs have an external representation that allows their content to be transported easily. The format used is XML. The file containing the XML representation of a catalog is named with the .rdf extension: once the XML/RDF conventions are better supported, the catalogs will eventually adopt them, hence the extension. You should not invest too much in the current XML format because it is likely to change drastically in the next few months. Nevertheless, we have found it very convenient to have a text representation of the catalogs, especially for importing data from various sources.

9.1 XML short example  
9.2 XML document encoding  
9.3 XML structure  
9.4 dmoz.org  



9.1 XML short example

Here is a short example of an XML file:

 
<?xml version="1.0" encoding="ISO-8859-1" ?>
<RDF xmlns:rdf="http://www.w3.org/TR/1999/REC-rdf-syntax-19990222#"
     xmlns="http://www.senga.org/">

 <Table>
  <![CDATA[
CREATE TABLE urldemo (
  rowid int(11) DEFAULT '0' NOT NULL auto_increment,
  created datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  modified timestamp(14),
  info enum('active','inactive') DEFAULT 'active',
  url char(128),
  comment char(255),
  UNIQUE cdemo1 (rowid)
)
  ]]>
 </Table>

 <Catalog>
  <navigation>theme</navigation>
  <tablename>urldemo</tablename>
  <name>urltheme</name>
 </Catalog>

 <Category>
  <name>News</name>
  <rowid>12</rowid>
  <parent>1</parent>
 </Category>

 <Link>
  <row>135</row>
  <category>12</category>
 </Link>

 <Record table="urldemo">
  <url>http://www.mediaslink.com/</url>
  <comment>Medias Link</comment>
  <rowid>135</rowid>
 </Record>

 <Sync/>
</RDF>
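
As a rough illustration of how such a file can be read by a program, the following Python sketch parses a trimmed copy of the example (the Table element is omitted) with the standard xml.etree.ElementTree module and counts the top-level elements. It is not part of Catalog: the names SAMPLE, NS, local_name and count_elements are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Default namespace declared on the RDF root element of the example.
NS = "http://www.senga.org/"

# Trimmed copy of the example file (hypothetical constant for this sketch).
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1" ?>
<RDF xmlns:rdf="http://www.w3.org/TR/1999/REC-rdf-syntax-19990222#"
     xmlns="http://www.senga.org/">
 <Catalog>
  <navigation>theme</navigation>
  <tablename>urldemo</tablename>
  <name>urltheme</name>
 </Catalog>
 <Category>
  <name>News</name>
  <rowid>12</rowid>
  <parent>1</parent>
 </Category>
 <Link>
  <row>135</row>
  <category>12</category>
 </Link>
 <Record table="urldemo">
  <url>http://www.mediaslink.com/</url>
  <comment>Medias Link</comment>
  <rowid>135</rowid>
 </Record>
 <Sync/>
</RDF>
"""

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.split("}", 1)[-1]

def count_elements(xml_bytes):
    """Return a Counter of the top-level element names found under the root."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "{%s}RDF" % NS:
        raise ValueError("unexpected root element: %s" % root.tag)
    return Counter(local_name(child.tag) for child in root)

counts = count_elements(SAMPLE)
print(dict(counts))
```

Because the root element declares a default namespace, ElementTree reports every tag as {http://www.senga.org/}Name; the local_name helper removes that prefix so that the names can be compared with those used in this chapter.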



9.2 XML document encoding

The encoding of an XML document is specified in the <?xml ... ?> line at the beginning. Accepted encodings are:

More encodings should become available as the XML manipulation library evolves.
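
To check which encoding a file declares before loading it, a small helper along the following lines may be useful. This is only a sketch in Python and is not part of Catalog: the function name declared_encoding and the regular expression are assumptions made for this illustration. It falls back to UTF-8, the default the XML specification assumes when no encoding is declared, and it ignores byte order marks and UTF-16 input.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(xml_bytes, default="UTF-8"):
    """Return the encoding named in the XML declaration, or `default` if none."""
    match = _DECL.match(xml_bytes)
    return match.group(1).decode("ascii") if match else default

print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1" ?><RDF/>'))
```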



9.3 XML structure

When an element is said to describe a record, it contains child elements whose names are the field names of the record and whose contents are the values of those fields. For instance:

 
<Record table="urldemo">
 <url>http://www.senga.org/</url>
 <comment>Senga</comment>
</Record>

defines a record of the urldemo table with two fields, url and comment, whose values are http://www.senga.org/ and Senga, respectively.
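
The mapping from such an element to a record can be sketched in a few lines of Python. The helper name record_to_dict is hypothetical and only illustrates the convention described above; it is not how Catalog itself loads the data.

```python
import xml.etree.ElementTree as ET

def record_to_dict(element):
    """Map each child element name (field name) to its text (field value)."""
    return {child.tag: (child.text or "") for child in element}

record = ET.fromstring(
    '<Record table="urldemo">'
    "<url>http://www.senga.org/</url>"
    "<comment>Senga</comment>"
    "</Record>"
)
table = record.get("table")       # table named by the table attribute
fields = record_to_dict(record)   # field name -> field value
print(table, fields)
```

The fragment used here declares no namespace, so the tag names are returned unchanged. Inside a complete file, where the root element declares a default namespace, each tag name would carry a {namespace} prefix that has to be stripped first.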

`Table'
Contains a single SQL statement that creates a table.

`Catalog'
Describes a record of the catalog table, see section catalog. The remainder of the file relates to the catalog described in this element. There must be only one Catalog element in a given file.

`Category'
Describes a record of the catalog_category table, see section catalog_category_NAME. The pseudo field parent creates a record in the catalog_category2category table that links the category to its parent, see section catalog_category2category_NAME.

`Link'
Describes a record of the catalog_entry2category table, see section catalog_entry2category_NAME.

`Record'
Describes a record of the table named by the table attribute.

`Symlink'
Describes a record of the catalog_category2category table, see section catalog_category2category_NAME. The info field of the record is automatically set to symbolic.

`Sync'
When this element is seen during the parsing of the file, administrative information is recomputed for the catalog. This should only occur once, at the end of the file.
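
The two structural constraints stated above (only one Catalog element per file, and a single Sync element at the end) can be checked before loading a file. The following Python sketch is an assumption about how such a check might look and is not a tool shipped with Catalog; the names check_structure and local_name are hypothetical.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.split("}", 1)[-1]

def check_structure(xml_text):
    """Return a list of problems found; an empty list means none were detected."""
    names = [local_name(child.tag) for child in ET.fromstring(xml_text)]
    problems = []
    # Only the "more than one Catalog" case is flagged by this sketch.
    if names.count("Catalog") > 1:
        problems.append("more than one Catalog element")
    # Sync should occur once, at the end of the file.
    if names.count("Sync") > 1:
        problems.append("Sync appears more than once")
    if "Sync" in names and names[-1] != "Sync":
        problems.append("Sync is not the last element")
    return problems

good = '<RDF xmlns="http://www.senga.org/"><Catalog/><Record table="t"/><Sync/></RDF>'
bad = '<RDF xmlns="http://www.senga.org/"><Catalog/><Sync/><Catalog/><Sync/><Link/></RDF>'
print(check_structure(good))
print(check_structure(bad))
```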



9.4 dmoz.org

The site http://www.dmoz.org/ provides a dump of its catalog data. The dump uses a custom XML format that resembles RDF but does not actually conform to it. Since the XML format of dmoz.org and the XML format of Catalog are not compatible, the convert_dmoz command is provided to perform the translation. It must be called from the command line.

Since the dmoz.org catalog has specific requirements, a specialized version of Catalog is also provided. If you access Catalog using the CGIDIR/dmoz cgi script instead of CGIDIR/Catalog, you will use this specialized version.

The easiest way to reach it is to start from the home page of Catalog that is installed with the product at http://localhost/Catalog/ and follow the DMOZ Control Panel link. Alternatively you can jump directly to http://localhost/cgi-bin/DMOZ/dmoz?context=ccontrol_panel.

We have loaded a version of dmoz.org that contains approximately 1 500 000 records and around 250 000 categories on a Pentium 450. The result is a MySQL database of about 500 MB, and the load takes about one hour. The response time when navigating the categories is excellent, provided you are using Apache + mod_perl.

Memory usage is around 70 MB during the conversion and 10 MB during the load.

In order to load dmoz.org data using Catalog you must follow the steps listed below. This procedure assumes that you have created a database named dmoz and a catalog named dmoz within this database. A dmoz database is normally created during the installation process. If you do not have it, create it with the following command:

 
mysql -e "create database dmoz"

If you have a database named dmoz, go to the URL http://localhost/cgi-bin/DMOZ/dmoz?context=ccontrol_panel and a catalog named dmoz will automatically be created.

After a while, you will want to load a newer version of the dmoz.org data. This can be done using the same commands. The problem is that the catalog is unavailable to users while the reload runs, because the existing data is removed before the new data is loaded. Catalog does not currently support user-transparent reloading. Instead, we suggest you follow these steps:



This document was generated by root on October, 27 2004 using texi2html