[Xmldatadumps-l] Fwd: Template expansion inconsistency

23 Feb 2014


Dear Nemo,
Thanks for enlightening me regarding <title>.  I did not know that it
was intended to be a compound of namespace word and `page_title'
field.
Still, I have some thoughts on this matter.
1) importDump.php
As of WP-MIRROR 0.6, `importDump.php' is no longer used.
The disadvantage of `importDump.php' is that it is slow.  Importation
of `enwiki' takes about two months, which is greater than the interval
between XML dumps.
The advantage of `importDump.php' is that it handles any idiosyncrasy
(such as compound <title> entries) in the XML dumps.
2) mwxml2sql
As of WP-MIRROR 0.6, `mwxml2sql' is used to convert the XML dump into
a set of SQL dumps (for the `page', `revision', `text' tables) which
can then be directly loaded into the underlying database tables.
The advantage of `mwxml2sql' is that it is very fast.  And, when used
in conjunction with MySQL 5.5 fast index creation, one can load
`enwiki' using 80% less time.
The disadvantage is that it faithfully copies the <title> field into
the SQL statement for INSERTing the `page_title' field.  We now know
that this results in pages from the Template and other namespaces
being not found by MediaWiki, which then renders them as red-links.
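A minimal sketch of the mismatch, in Python, assuming a small hand-written
namespace table (the `NAMESPACES' dictionary and the `split_title' helper are
hypothetical names, not part of `mwxml2sql' or WP-MIRROR).  It shows how a
compound <title> would have to be split so that the namespace number and the
bare title land in separate columns:

```python
# Hypothetical namespace table: a small subset of the numbers and names
# that a dump's <siteinfo><namespaces> section lists for a wiki.
NAMESPACES = {0: "", 10: "Template", 14: "Category"}


def split_title(title, namespaces=NAMESPACES):
    """Split a compound <title> into (page_namespace, page_title)."""
    prefix, sep, rest = title.partition(":")
    if sep:
        for ns_id, ns_name in namespaces.items():
            if ns_name and ns_name == prefix:
                # MediaWiki stores page_title with underscores, not spaces.
                return ns_id, rest.replace(" ", "_")
    # No recognised namespace prefix: treat as a main-namespace page,
    # even if the title itself happens to contain a colon.
    return 0, title.replace(" ", "_")
```

With this table, `Template:Citation needed' splits into namespace 10 and
title `Citation_needed', whereas copying the whole string into `page_title'
leaves a row that a lookup by (namespace, title) would presumably miss.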
3) First Normal Form
One issue in the back of my mind concerns the recent changes in the
XML schema. As of `export-0.6.xsd.gz' we note that ``Version 0.6 adds
a separate namespace tag''.  To my mind, the presence of the <ns>
field should obviate the need to include a namespace word (e.g.
`Category:', `Template:', etc.) within the <title> field.
The principle is known as first normal form (1NF) which basically
means that the contents of a field should be atomic rather than
compound.
4) Solution
Granted that the objective is to faithfully mirror the WMF database
tables, the issue before us is this: where along the tool chain
should the patch be made?
a) My instinct is to correct the issue upstream (the XML dump generation phase).
    The WMF `page_namespace' field should be copied to the <ns> field.
    The WMF `page_title' field should be copied to the <title> field.
    Adhere to principles of database normalization.
b) Second best would be to patch WP-MIRROR 0.7 to normalize the XML
dump prior to feeding it into `mwxml2sql'.  This I have done.
c) Third best would be to patch `mwxml2sql'.  This I also favor, but
would like some guidance from its author, Ariel Glenn, before I start
hacking.
d) A last resort would be to write an SQL query to clean up compound
`page_title' entries in the mirror's database. But I really would
rather not load unnormalized data in the first place.
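Option b) can be sketched roughly as follows.  This is an illustration in
Python, not the actual WP-MIRROR 0.7 patch: the function names are
hypothetical, and it assumes the dump lists its namespaces under
<siteinfo><namespaces> and that every <page> carries both <title> and <ns>.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace-uri}' part that ElementTree prepends to tags."""
    return tag.rsplit("}", 1)[-1]


def normalize_titles(xml_text):
    """Return the dump with namespace prefixes stripped from <title>."""
    root = ET.fromstring(xml_text)
    # Map namespace number -> name, as listed under <siteinfo><namespaces>.
    names = {}
    for el in root.iter():
        if local(el.tag) == "namespace" and el.get("key") is not None:
            names[int(el.get("key"))] = el.text or ""
    for page in root.iter():
        if local(page.tag) != "page":
            continue
        fields = {local(child.tag): child for child in page}
        title, ns = fields.get("title"), fields.get("ns")
        if title is None or ns is None:
            continue
        # Strip "Name:" only when it matches the page's own <ns> number.
        prefix = names.get(int(ns.text), "")
        if prefix and title.text.startswith(prefix + ":"):
            title.text = title.text[len(prefix) + 1:]
    return ET.tostring(root, encoding="unicode")
```

The sketch parses the whole document in memory, which would not be practical
for `enwiki'; a real filter would need to stream the dump (for example with
`ET.iterparse').  Keying the strip on <ns> rather than on the text before the
first colon avoids mangling main-namespace titles that contain a colon.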
Sincerely Yours,
Kent
On 2/22/14, Federico Leva (Nemo) nemowiki@gmail.com wrote:
> ...
> wp mirror, 22/02/2014 23:40:
>> ...
>> Still, it would be nice if the dump files could be fixed.
>
> Fixed? <title> is the full page name as it's supposed to be. Either
> you're doing something wrong with the import, or the import
> script/special page has a bug (not uncommon, but needs a bug report with
> steps to reproduce). I see nothing to blame on the export side.
>
> Nemo

Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
