On Mon, Jun 15, 2009 at 2:42 AM, Felipe Ortega <glimmer_phoenix@yahoo.es> wrote:
Hello, all.
For (yet) unknown reasons, the latest complete dump files (pages-meta-history.xml) for some languages are flawed. Certain revision items are missing the rev_user info. Even though there are only 3 or 4 such items per dump, that is enough to break either the parsing process or the later SQL load into the DB.
So far, the last 3 dumps of DE Wikipedia and 20090603 from FR Wikipedia have presented this error.
I have updated both WikiXRay parsers: http://meta.wikimedia.org/wiki/WikiXRay_parser http://meta.wikimedia.org/wiki/WikiXRay_parser_research
They now check whether each parsed revision item is complete before creating the SQL. If it's flawed, it's omitted and logged to an error file for later inspection.
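As a rough illustration of that kind of completeness check (a minimal sketch only, not the actual WikiXRay code): the function names and the dict layout below are made up for the example, and of the required fields only rev_user is named in this thread; rev_id and rev_timestamp are assumptions.

```python
import io

# Hypothetical list of required fields; only rev_user comes from this thread.
REQUIRED_FIELDS = ("rev_id", "rev_user", "rev_timestamp")


def missing_fields(revision):
    """Return the required fields that are absent or empty in a parsed revision."""
    return [name for name in REQUIRED_FIELDS if revision.get(name) in (None, "")]


def filter_complete(revisions, error_log):
    """Keep complete revisions; write one line per flawed revision to error_log."""
    complete = []
    for revision in revisions:
        missing = missing_fields(revision)
        if missing:
            # Flawed item: skip it, but record it for later inspection.
            error_log.write(
                "rev_id=%s missing: %s\n" % (revision.get("rev_id"), ", ".join(missing))
            )
        else:
            complete.append(revision)
    return complete


# Usage: the second revision has no rev_user, so it is logged instead of kept.
revisions = [
    {"rev_id": 1, "rev_user": 42, "rev_timestamp": "2009-06-03T00:00:00Z"},
    {"rev_id": 2, "rev_timestamp": "2009-06-03T00:05:00Z"},
]
log = io.StringIO()  # stands in for the error file
kept = filter_complete(revisions, log)
```

Only the rows in kept would then go on to SQL generation; the log keeps enough detail to look up the skipped revisions in the original dump.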
This is only a guess, but I would speculate that these items were intentionally removed through the recently created RevisionDelete system, which allows selective removal of bad content, including user names or edit summaries, without removing the entire edit. It is mostly used when the edit summary or user name contains privacy-violating information, e.g. "User:That bastard Ortega lives at 1234 Someplace Ave."
My guess is that if the username was suppressed in that way then it would also remove the user info from the dump.
-Robert Rohde