Ran into a weird dump problem yesterday, which has me wondering if there's a problem with my artificially created xml for upload. Here's what happens. I have a script that builds wiki pages from an external source and embeds them in xml for upload via importDump.php. The script can be toggled to either generate a single page or a bunch of them. The same script failed to load some pages that load just fine if you specify them individually, but importDump.php does NOT crash during the import.
I suspect that there is something wrong with the upstream items, but I can't find it. The Brown Univ XML validator complains about the following:
line 3, ecoliwiki20070730123135.xml: error (1102): tag uses GI for an undeclared element: mediawiki
line 166616, ecoliwiki20070730123135.xml: error (1012): reference to undeclared entity:
line 166616, ecoliwiki20070730123135.xml: error (1003): entity (or its expansion) is invalid:
line 166616, ecoliwiki20070730123135.xml: error (1012): reference to undeclared entity:
line 166616, ecoliwiki20070730123135.xml: error (1003): entity (or its expansion) is invalid:
line 184234, ecoliwiki20070730123135.xml: error (402): EOF encountered; no doctype declaration found: mediawiki
but I'm pretty sure these are all red herrings. So...is there a validator out there I should be using? Is there another reason why a record might be skipped? I know that if the timestamps are earlier than the last version there's a problem, but these were all loading into empty Category pages.
Jim
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
Jim Hu wrote:
Ran into a weird dump problem yesterday, which has me wondering if there's a problem with my artificially created xml for upload. Here's what happens. I have a script that builds wiki pages from an external source and embeds them in xml for upload via importDump.php. The script can be toggled to either generate a single page or a bunch of them. The same script failed to load some pages that load just fine if you specify them individually, but importDump.php does NOT crash during the import.
I understand the failure is in your building script. What does it load? Wiki pages from Special:Export? How does this script handle the XML? Are you using an XML library? Search and replace?
I suspect that there is something wrong with the upstream items, but I can't find it. The Brown Univ XML validator complains about the following:
I guess these have to do with not having a DOCTYPE declaration
On Jul 30, 2007, at 3:49 PM, Platonides wrote:
Jim Hu wrote:
Ran into a weird dump problem yesterday, which has me wondering if there's a problem with my artificially created xml for upload. Here's what happens. I have a script that builds wiki pages from an external source and embeds them in xml for upload via importDump.php. The script can be toggled to either generate a single page or a bunch of them. The same script failed to load some pages that load just fine if you specify them individually, but importDump.php does NOT crash during the import.
I understand the failure is in your building script. What does it load? Wiki pages from Special:Export? How does this script handle the XML? Are you using an XML library? Search and replace?
Yes, I'm sure it's my hacky script ; ). It's based on search and replace. I make a template for the <page> container with tags like MW template params, e.g. {{{}}}, and I replace things in that and write to a file. I have a pair of functions that get called each time I make a page.
function xml_page_template() {
	return '<page>
	<title>{{{TITLE}}}</title>
	<id>{{{PAGEID}}}</id>
	<revision>
		<id>1</id>
		<timestamp>{{{TIMESTAMP}}}</timestamp>
		<contributor>
			<username>{{{USERNAME}}}</username>
			<id>{{{UID}}}</id>
		</contributor>
		<comment>Automated import of articles</comment>
		<text xml:space="preserve">{{{TEXT}}}</text>
	</revision>
</page>';
}#end function xml_page_template
function make_page($title, $text) {
	global $xml_user, $uid, $change_count;
	$page = xml_page_template();
	$page = str_replace("{{{TITLE}}}", fix_title($title), $page);
	$page = str_replace("{{{PAGEID}}}", $change_count, $page);
	$page = str_replace("{{{TIMESTAMP}}}", gmdate("Y-m-d").'T'.gmdate("H:i:s")."Z", $page);
	$page = str_replace("{{{USERNAME}}}", $xml_user, $page);
	$page = str_replace("{{{UID}}}", $uid, $page);
	$page = str_replace("{{{TEXT}}}", htmlentities($text), $page);
	return $page;
}
The scripts that call these are responsible for generating the parameters passed to them. From what I could tell with earlier versions of MW, the page id wasn't being used for import, so I just put a counter in to help me debug the XML.
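For comparison, here is a minimal sketch of the same page builder using PHP's XMLWriter extension, which handles escaping by itself so the output is always well-formed. fix_title(), $xml_user, $uid and $change_count are assumed from the surrounding script; the function name is just illustrative.

function make_page_xmlwriter($title, $text) {
	global $xml_user, $uid, $change_count;
	$w = new XMLWriter();
	$w->openMemory();
	$w->setIndent(true);
	$w->startElement('page');
	$w->writeElement('title', fix_title($title));  // fix_title() assumed from the original script
	$w->writeElement('id', $change_count);
	$w->startElement('revision');
	$w->writeElement('id', $change_count);         // distinct per page, unlike a hardcoded 1
	$w->writeElement('timestamp', gmdate('Y-m-d\TH:i:s\Z'));
	$w->startElement('contributor');
	$w->writeElement('username', $xml_user);
	$w->writeElement('id', $uid);
	$w->endElement(); // contributor
	$w->writeElement('comment', 'Automated import of articles');
	$w->startElement('text');
	$w->writeAttribute('xml:space', 'preserve');
	$w->text($text);                               // text() escapes &, < and > itself
	$w->endElement(); // text
	$w->endElement(); // revision
	$w->endElement(); // page
	return $w->outputMemory();
}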
Jim
I suspect that there is something wrong with the upstream items, but I can't find it. The Brown Univ XML validator complains about the following:
I guess these have to do with not having a DOCTYPE declaration
That's what I thought. Is there supposed to be one there? I don't see one from the output of Special:Export, which is my model for building the xml.
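For what it's worth, Special:Export emits no DOCTYPE at all; the root element carries an XML Schema reference instead, which is what a schema-aware validator looks for. A 0.3-era export header looks roughly like this (version and language will vary):

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
           version="0.3" xml:lang="en">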
Yes, but wouldn't that affect all pages? I think it's not used. Dumping from wiki A into wiki B, the revision ids would be different anyway. Same for page uids. Primary keys from external dbs should not be assumed, right?
Jim
On Jul 30, 2007, at 4:29 PM, Platonides wrote:
<revision>
	<id>1</id>
You have the id hardcoded. This doesn't seem A Good Thing (tm). On dumps with several pages, there will be repeated revision ids. Don't know if it is really used, but...
Jim Hu wrote:
Yes, but wouldn't that affect all pages? I think it's not used. Dumping from wiki A into wiki B, the revision ids would be different anyway. Same for page uids. Primary keys from external dbs should not be assumed, right?
Jim
It /shouldn't/ be used, and certainly not copied to the db. But your XML database is inconsistent :-)
# FIXME: Use original rev_id optionally
# FIXME: blah blah blah
Doesn't seem used. But fixing it wouldn't harm either.
Jim Hu wrote:
Ran into a weird dump problem yesterday, which has me wondering if there's a problem with my artificially created xml for upload. Here's what happens. I have a script that builds wiki pages from an external source and embeds them in xml for upload via importDump.php. The script can be toggled to either generate a single page or a bunch of them. The same script failed to load some pages that load just fine if you specify them individually, but importDump.php does NOT crash during the import.
I suspect that there is something wrong with the upstream items, but I can't find it. The Brown Univ XML validator complains about the following:
line 3, ecoliwiki20070730123135.xml: error (1102): tag uses GI for an undeclared element: mediawiki
That sounds like you didn't include a schema declaration (dunno what your thingy takes, maybe it's doctype only?)
There's an XML Schema description file -- you can use any XML Schema validator, such as one of the sample programs packaged with the Apache Xerces Java library, to run over your .xml file.
In theory, anyway. :)
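A minimal sketch of the same check from PHP, assuming the schema has been saved locally as export-0.3.xsd next to the dump:

// Validate a generated dump against the MediaWiki export schema.
// Assumes the XSD was downloaded, e.g. from
// http://www.mediawiki.org/xml/export-0.3.xsd
$dom = new DOMDocument();
$dom->load('ecoliwiki20070730123135.xml');
if ($dom->schemaValidate('export-0.3.xsd')) {
	echo "valid\n";
} else {
	echo "invalid -- see the libxml warnings above for line numbers\n";
}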
line 166616, ecoliwiki20070730123135.xml: error (1012): reference to undeclared entity:
line 166616, ecoliwiki20070730123135.xml: error (1003): entity (or its expansion) is invalid:
line 166616, ecoliwiki20070730123135.xml: error (1012): reference to undeclared entity:
line 166616, ecoliwiki20070730123135.xml: error (1003): entity (or its expansion) is invalid:
The only predefined named character reference entities in XML are &lt;, &gt;, &amp;, &quot; and &apos;.
For any other characters that you really intend to be interpreted *as the character*, use decimal or hexadecimal numeric character references -- eg &#160; or &#xA0; for a non-breaking space.
For things you want to appear *as the HTML character reference*, you need to escape the & as &amp; -- for instance, write &amp;nbsp; for "&nbsp;" -- to produce correct XML.
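A quick demonstration of the difference in PHP -- htmlspecialchars() escapes only characters whose entities XML predefines, while htmlentities() emits HTML-only names that an XML parser rejects:

$text = "café & crème";

// htmlentities() produces HTML named entities -- undefined in plain XML:
echo htmlentities($text, ENT_QUOTES, 'UTF-8'), "\n";
// caf&eacute; &amp; cr&egrave;me   <-- &eacute;/&egrave; break the import

// htmlspecialchars() touches only & < > " ' -- all XML-safe:
echo htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), "\n";
// café &amp; crème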
line 184234, ecoliwiki20070730123135.xml: error (402): EOF encountered; no doctype declaration found: mediawiki
but I'm pretty sure these are all red herrings. So...is there a validator out there I should be using?
-- brion vibber (brion @ wikimedia.org)