Dear Nemo,
Thanks for enlightening me regarding <title>. I did not know that it
was intended to be a compound of namespace word and `page_title'
field.
Still, I have some thoughts on this matter.
1) importDump.php
As of WP-MIRROR 0.6, `importDump.php' is no longer used.
The disadvantage of `importDump.php' is that it is slow. Importation
of `enwiki' takes about two months, which is greater than the interval
between XML dumps.
The advantage of `importDump.php' is that it handles any idiosyncrasy
(such as compound <title> entries) in the XML dumps.
2) mwxml2sql
As of WP-MIRROR 0.6, `mwxml2sql' is used to convert the XML dump into
a set of SQL dumps (for the `page', `revision', `text' tables) which
can then be directly loaded into the underlying database tables.
The advantage of `mwxml2sql' is that it is very fast. When used
in conjunction with MySQL 5.5 fast index creation, one can load
`enwiki' in 80% less time.
The disadvantage is that it faithfully copies the <title> field into
the SQL statement for INSERTing the `page_title' field. We now know
that this results in pages from the Template and other namespaces
being not found by MediaWiki, which then renders them as red-links.
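To illustrate the failure mode, here is a minimal sketch in Python. It is a toy stand-in, not MediaWiki or `mwxml2sql' code: the dictionary plays the role of the `page' table, keyed by (`page_namespace', `page_title'), and the helper `load_pages' and the sample entry are hypothetical. It assumes that a lookup uses the title without the namespace word and with spaces written as underscores.

```python
# Toy stand-in for the `page' table, keyed by (page_namespace, page_title).
# `load_pages' and the sample data are hypothetical, for illustration only.
def load_pages(entries, strip_prefix):
    table = {}
    for ns, prefix, title in entries:
        # Optionally drop the namespace word, e.g. "Template:".
        if strip_prefix and prefix and title.startswith(prefix + ":"):
            title = title[len(prefix) + 1:]
        # Assumption: titles are stored with underscores instead of spaces.
        table[(ns, title.replace(" ", "_"))] = True
    return table

# One <page> as it appears in the dump: <ns>10</ns>, compound <title>.
entries = [(10, "Template", "Template:Infobox person")]

faithful = load_pages(entries, strip_prefix=False)   # <title> copied as-is
normalized = load_pages(entries, strip_prefix=True)  # namespace word removed

lookup = (10, "Infobox_person")
print(lookup in faithful)    # False: the page looks missing (red-link)
print(lookup in normalized)  # True
```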
3) First Normal Form
One issue in the back of my mind concerns the recent changes in the
XML schema. As of `export-0.6.xsd.gz' we note that ``Version 0.6 adds
a separate namespace tag''. To my mind, the presence of the <ns>
field should obviate the need to include a namespace word (e.g.
`Category:', `Template:', etc.) within the <title> field.
The principle is known as first normal form (1NF) which basically
means that the contents of a field should be atomic rather than
compound.
4) Solution
Granted that the objective is to faithfully mirror the WMF database
tables, the issue before us is this: where along the tool chain
should the patch be made?
a) My instinct is to correct the issue upstream (the XML dump generation phase).
The WMF `page_namespace' field should be copied to the <ns> field.
The WMF `page_title' field should be copied to the <title> field.
Adhere to principles of database normalization.
b) Second best would be to patch WP-MIRROR 0.7 to normalize the XML
dump prior to feeding it into `mwxml2sql'. This I have done.
c) Third best would be to patch `mwxml2sql'. This I also favor, but
I would like some guidance from its author, Ariel Glenn, before I start
hacking.
d) A last resort would be to write an SQL query to clean up compound
`page_title' entries in the mirror's database. But I really would
rather not load unnormalized data in the first place.
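For concreteness, the normalization in option (b) might look roughly like the sketch below. This is not the actual WP-MIRROR 0.7 patch. The function names `local' and `normalize_titles' are hypothetical, and the sketch assumes that the dump lists its namespace words as <namespace key="..."> elements under <siteinfo> and that each <page> carries both <title> and <ns>. It parses the whole document in memory, so a real dump the size of `enwiki' would need a streaming parser instead.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Return the tag name without any XML namespace URI."""
    return tag.rsplit("}", 1)[-1]

def normalize_titles(xml_text):
    """Strip the namespace word from each <title>, guided by <ns>."""
    root = ET.fromstring(xml_text)

    # Map namespace id -> namespace word, e.g. 10 -> "Template".
    prefixes = {}
    for el in root.iter():
        if local(el.tag) == "namespace" and el.get("key") is not None:
            prefixes[int(el.get("key"))] = el.text or ""

    for page in root.iter():
        if local(page.tag) != "page":
            continue
        title = ns = None
        for child in page:
            if local(child.tag) == "title":
                title = child
            elif local(child.tag) == "ns":
                ns = child
        if title is None or ns is None or not ns.text:
            continue  # nothing to normalize without both fields
        prefix = prefixes.get(int(ns.text), "")
        text = title.text or ""
        # Main-namespace pages have an empty prefix and are left alone.
        if prefix and text.startswith(prefix + ":"):
            title.text = text[len(prefix) + 1:]
    return root
```

The result can be written back out with `ET.tostring(root)' before it is handed to the next tool in the chain. Note that ElementTree may change how the XML namespace is serialized, so the output should be checked against whatever the downstream tool expects.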
Sincerely Yours,
Kent
On 2/22/14, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
wp mirror, 22/02/2014 23:40:
Still, it would be nice if the dump files could
be fixed.
Fixed? <title> is the full page name as it's supposed to be. Either
you're doing something wrong with the import, or the import
script/special page has a bug (not uncommon, but needs a bug report with
steps to reproduce). I see nothing to blame on the export side.
Nemo
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l