Fwd: Template expansion inconsistency - Xmldatadumps-l

23 Feb 2014

Dear Nemo,

Thanks for enlightening me regarding <title>.  I did not know that it
was intended to be a compound of namespace word and `page_title'
field.

Still, I have some thoughts on this matter.

1) importDump.php

As of WP-MIRROR 0.6, `importDump.php' is no longer used.

The disadvantage of `importDump.php' is that it is slow.  Importation
of `enwiki' takes about two months, which is greater than the interval
between XML dumps.

The advantage of `importDump.php' is that it handles any idiosyncrasy
(such as compound <title> entries) in the XML dumps.

2) mwxml2sql

As of WP-MIRROR 0.6,  `mwxml2sql' is used to convert the XML dump into
a set of SQL dumps (for the `page', `revision', `text' tables) which
can then be directly loaded into the underlying database tables.

The advantage of `mwxml2sql' is that it is very fast.  When used in
conjunction with MySQL 5.5 fast index creation, it lets one load
`enwiki' in 80% less time.

The disadvantage is that it faithfully copies the <title> field into
the SQL statement that INSERTs the `page_title' field.  We now know
that, as a result, MediaWiki cannot find pages in the Template and
other namespaces, and renders links to them as red-links.
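
To make the mismatch concrete, here is a minimal Python sketch (not
code from `mwxml2sql' or WP-MIRROR) of splitting a compound <title>
using the <ns> value.  The `NAMESPACE_NAMES' table and the
`split_title' helper are hypothetical names introduced only for
illustration; a real dump lists its namespaces in <siteinfo>.

```python
# Hypothetical subset of the namespace table, for illustration only.
# A real dump declares its namespaces in the <siteinfo> element.
NAMESPACE_NAMES = {0: "", 10: "Template", 14: "Category"}


def split_title(full_title, ns):
    """Return a bare page_title for a compound <title> and its <ns> number."""
    prefix = NAMESPACE_NAMES.get(ns, "")
    # Strip the namespace word only when it really is the prefix.
    if prefix and full_title.startswith(prefix + ":"):
        full_title = full_title[len(prefix) + 1:]
    # MediaWiki stores page_title with underscores in place of spaces.
    return full_title.replace(" ", "_")


print(split_title("Template:Infobox person", 10))
```

Titles in a namespace missing from the table are passed through with
only the space-to-underscore conversion, so an incomplete table does
not damage them.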

3) First Normal Form

One issue in the back of my mind concerns the recent changes in the
XML schema. As of `export-0.6.xsd.gz' we note that ``Version 0.6 adds
a separate namespace tag''.  To my mind, the presence of the <ns>
field should obviate the need to include a namespace word (e.g.
`Category:', `Template:', etc.) within the <title> field.

The principle is known as first normal form (1NF) which basically
means that the contents of a field should be atomic rather than
compound.

4) Solution

Granted that the objective is to faithfully mirror the WMF database
tables, the issue before us is this:  Where along the tool chain
should the patch be made?

a) My instinct is to correct the issue upstream (in the XML dump
generation phase):
    The WMF `page_namespace' field should be copied to the <ns> field.
    The WMF `page_title' field should be copied to the <title> field.
    This adheres to the principles of database normalization.
b) Second best would be to patch WP-MIRROR 0.7 to normalize the XML
dump prior to feeding it into `mwxml2sql'.  This I have done.
c) Third best would be to patch `mwxml2sql'.  This I also favor, but
I would like some guidance from its author, Ariel Glenn, before I
start hacking.
d) A last resort would be to write an SQL query to clean up compound
`page_title' entries in the mirror's database.  But I really would
rather not load unnormalized data in the first place.
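
For reference, option (b) can be sketched roughly as follows.  This is
an illustrative Python sketch under stated assumptions, not the actual
WP-MIRROR 0.7 patch: the `SAMPLE' document, `local' and `normalize'
are hypothetical names, and the sample is a heavily trimmed dump with
no xmlns declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed dump used only for this illustration.
SAMPLE = """<mediawiki>
  <siteinfo>
    <namespaces>
      <namespace key="0" />
      <namespace key="10">Template</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Template:Citation needed</title>
    <ns>10</ns>
  </page>
  <page>
    <title>Main Page</title>
    <ns>0</ns>
  </page>
</mediawiki>"""


def local(tag):
    """Drop any '{uri}' XML-namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def normalize(xml_text):
    """Strip the namespace word from each <title>, guided by <ns>."""
    root = ET.fromstring(xml_text)
    # Build the number -> name table from <siteinfo>.
    names = {}
    for el in root.iter():
        if local(el.tag) == "namespace":
            names[int(el.get("key"))] = el.text or ""
    for page in root.iter():
        if local(page.tag) != "page":
            continue
        fields = {local(child.tag): child for child in page}
        title, ns = fields.get("title"), fields.get("ns")
        if title is None or ns is None:
            continue
        prefix = names.get(int(ns.text), "")
        if prefix and title.text.startswith(prefix + ":"):
            title.text = title.text[len(prefix) + 1:]
    return root


root = normalize(SAMPLE)
titles = [el.text for el in root.iter() if local(el.tag) == "title"]
print(titles)
```

The sketch parses the whole document in memory, which is fine for a
sample but not for `enwiki'; a real filter would need to stream the
dump (for example with `xml.etree.ElementTree.iterparse') and write
each page out as it goes.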

Sincerely Yours,
Kent

On 2/22/14, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
...
  wp mirror, 22/02/2014 23:40:
  Still, it would be nice if the dump files could be fixed.
 Fixed? <title> is the full page name as it's supposed to be. Either
 you're doing something wrong with the import, or the import
 script/special page has a bug (not uncommon, but needs a bug report with
 steps to reproduce). I see nothing to blame on the export side.

 Nemo

 _______________________________________________
 Xmldatadumps-l mailing list
 Xmldatadumps-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l