On Mon, Jun 15, 2009 at 2:42 AM, Felipe Ortega<glimmer_phoenix(a)yahoo.es> wrote:
Hello, all.
For (as yet) unknown reasons, the latest complete dump files (pages-meta-history.xml) in
some languages are flawed. Certain revision items are missing the rev_user info. Even
though only 3 or 4 items are affected, that is enough to break either the parsing process
or the later SQL load into the DB.
So far, the last 3 dumps of DE Wikipedia and the 20090603 dump of FR Wikipedia have shown
this error.
I have updated both WikiXRay parsers:
http://meta.wikimedia.org/wiki/WikiXRay_parser
http://meta.wikimedia.org/wiki/WikiXRay_parser_research
They now check whether each parsed revision item is complete before creating the
SQL. If it's flawed, it's omitted and logged to an error file for later inspection.
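
A minimal sketch of that kind of completeness check, in Python. This is not the
WikiXRay code: the field list and the function names is_complete and
split_revisions are hypothetical, and only rev_user is named in this message.
It assumes each parsed revision is a dict and that a missing field is either
absent or None.

```python
# Hypothetical field names; only rev_user is mentioned in the message.
REQUIRED_FIELDS = ("rev_id", "rev_page", "rev_user", "rev_timestamp")


def is_complete(rev):
    """Return True when every required field is present and not None."""
    # "is not None" keeps a legitimate value of 0 from being rejected.
    return all(rev.get(field) is not None for field in REQUIRED_FIELDS)


def split_revisions(revisions):
    """Separate complete revisions from flawed ones.

    Complete items can go on to SQL generation; flawed ones can be
    written to an error log instead of being loaded.
    """
    complete, flawed = [], []
    for rev in revisions:
        (complete if is_complete(rev) else flawed).append(rev)
    return complete, flawed
```

The flawed list would then be written to the error file rather than turned
into SQL.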
This is only a guess, but I would speculate that these items were
intentionally removed through the recently created RevisionDelete
system that allows selective removal of bad content, including user
names or edit summaries, without removing the entire edit. It is
mostly used when the edit summary/user name contains privacy-violating
information, e.g. "User:That bastard Ortega lives at 1234 Someplace
Ave."
My guess is that suppressing the username in that way would also
remove the user info from the dump.
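
One way to test this guess would be to look at the contributor element of
the affected revisions. The sketch below is in Python and rests on
assumptions: that the export XML marks a suppressed contributor with a
deleted="deleted" attribute (worth confirming against the actual dump), and
that the revision fragment is passed in without the XML namespace that real
dump files carry. The function name contributor_missing is hypothetical.

```python
import xml.etree.ElementTree as ET


def contributor_missing(revision_xml):
    """Return True when a <revision> fragment carries no usable user info."""
    rev = ET.fromstring(revision_xml)
    contributor = rev.find("contributor")
    if contributor is None:
        # No contributor element at all.
        return True
    if contributor.get("deleted") == "deleted":
        # Assumed marker for a suppressed user name.
        return True
    # Neither a registered user name nor an IP address is present.
    return contributor.find("username") is None and contributor.find("ip") is None
```

If the flawed items in the DE and FR dumps all show the suppression marker,
that would support the RevisionDelete explanation.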
-Robert Rohde