Dear Nemo,
Thanks for enlightening me regarding <title>. I did not know that it was intended to be a compound of namespace word and `page_title' field.
Still, I have some thoughts on this matter.
1) importDump.php
As of WP-MIRROR 0.6, `importDump.php' is no longer used.
The disadvantage of `importDump.php' is that it is slow. Importing `enwiki' takes about two months, which is longer than the interval between XML dumps.
The advantage of `importDump.php' is that it handles any idiosyncrasy (such as compound <title> entries) in the XML dumps.
2) mwxml2sql
As of WP-MIRROR 0.6, `mwxml2sql' is used to convert the XML dump into a set of SQL dumps (for the `page', `revision', `text' tables) which can then be directly loaded into the underlying database tables.
The advantage of `mwxml2sql' is that it is very fast. When it is used in conjunction with MySQL 5.5 fast index creation, one can load `enwiki' in 80% less time.
The disadvantage is that it faithfully copies the <title> field into the SQL statement for INSERTing the `page_title' field. We now know that this results in pages from the Template and other namespaces not being found by MediaWiki, which then renders links to them as red-links.
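To illustrate the failure, here is a minimal sketch. It assumes that MediaWiki resolves a page by the pair (`page_namespace', `page_title'), with the namespace word absent from `page_title'. SQLite stands in for MySQL, the two-column table is a simplified stand-in for the real `page' table, and the function name `page_exists' is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the mirror's MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_namespace INTEGER, page_title TEXT)")

# Row as it ends up when the compound <title> is copied verbatim into page_title.
conn.execute("INSERT INTO page VALUES (10, 'Template:Infobox')")


def page_exists(ns, title):
    """Look a page up by (namespace number, title), as assumed above."""
    row = conn.execute(
        "SELECT 1 FROM page WHERE page_namespace = ? AND page_title = ?",
        (ns, title),
    ).fetchone()
    return row is not None


# Lookup with the bare title misses the row, so the link would be red.
print(page_exists(10, "Infobox"))
# Only the compound title matches what was loaded.
print(page_exists(10, "Template:Infobox"))
```

The first lookup fails and the second succeeds, which is the mismatch described above.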
3) First Normal Form
One issue in the back of my mind concerns the recent changes in the XML schema. As of `export-0.6.xsd.gz' we note that ``Version 0.6 adds a separate namespace tag''. To my mind, the presence of the <ns> field should obviate the need to include a namespace word (e.g. `Category:', `Template:', etc.) within the <title> field.
The principle is known as first normal form (1NF), which basically means that the contents of a field should be atomic rather than compound.
4) Solution
Granted that the objective is to faithfully mirror the WMF database tables, the issue before us is this: Where along the tool chain should the patch be made?
a) My instinct is to correct the issue upstream (the XML dump generation phase). The WMF `page_namespace' field should be copied to the <ns> field. The WMF `page_title' field should be copied to the <title> field. Adhere to principles of database normalization.
b) Second best would be to patch WP-MIRROR 0.7 to normalize the XML dump prior to feeding it into `mwxml2sql'. This I have done.
c) Third best would be to patch `mwxml2sql'. This I also favor, but I would like some guidance from its author, Ariel Glenn, before I start hacking.
d) A last resort would be to write an SQL query to clean up compound `page_title' entries in the mirror's database. But I would really rather not load unnormalized data in the first place.
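The normalization in option (b) can be sketched roughly as follows. This is not the WP-MIRROR code, only an illustration under stated assumptions: the sample dump is hypothetical and omits the XML namespace declaration that real export-0.6 dumps carry, the names `strip_namespace_prefix' and `normalize_dump' are made up for the example, and a real `enwiki' dump would need a streaming parser rather than loading the whole text at once. Spaces in titles are left untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of an export-0.6 dump
# (XML namespace declaration omitted for brevity).
SAMPLE = """<mediawiki>
  <siteinfo>
    <namespaces>
      <namespace key="0" />
      <namespace key="10">Template</namespace>
      <namespace key="14">Category</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Template:Infobox</title><ns>10</ns></page>
  <page><title>Physics</title><ns>0</ns></page>
  <page><title>Category:Physics</title><ns>14</ns></page>
</mediawiki>"""


def strip_namespace_prefix(title, ns, prefixes):
    """Return the title without its namespace word, if one is present."""
    prefix = prefixes.get(ns, "")
    if prefix and title.startswith(prefix + ":"):
        return title[len(prefix) + 1:]
    return title


def normalize_dump(xml_text):
    """Yield (ns, bare_title) for every <page> in the dump text."""
    root = ET.fromstring(xml_text)
    # Map namespace number -> namespace word, taken from <siteinfo>.
    prefixes = {
        int(n.get("key")): (n.text or "")
        for n in root.iter("namespace")
    }
    for page in root.iter("page"):
        ns = int(page.findtext("ns"))
        title = page.findtext("title")
        yield ns, strip_namespace_prefix(title, ns, prefixes)


for ns, title in normalize_dump(SAMPLE):
    print(ns, title)
```

Reading the namespace words from <siteinfo> rather than hard-coding them keeps the sketch independent of the wiki's language, since each dump lists its own localized namespace names.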
Sincerely yours,
Kent
On 2/22/14, Federico Leva (Nemo) nemowiki@gmail.com wrote:
wp mirror, 22/02/2014 23:40:
Still, it would be nice if the dump files could be fixed.
Fixed? <title> is the full page name as it's supposed to be. Either you're doing something wrong with the import, or the import script/special page has a bug (not uncommon, but needs a bug report with steps to reproduce). I see nothing to blame on the export side.
Nemo
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l