At first sight these appear to be related to redaction (i.e. oversighted/deleted data).
Richard Farmbrough.
On 06/05/2011 21:16, Felipe Ortega wrote:
Hi Ariel.
I've been noticing since last year (when I introduced an error-logging service in WikiXRay) that there are several malformed revision items in the dump files. These cause exceptions when trying to insert tuples into the DB without coherent values.
I still receive these errors in the new dumps, so I think there is indeed some issue that you should check.
Most of the time, the missing value is for <rev_user>. I'm not sure about the cause, but perhaps the dump process is facing a high load on the target server and this causes blanks to be inserted instead of the actual value.
You can take a look at these malformed items in the following error log file (taken from chunk 10 produced in March 2011):
http://gsyc.es/~jfelipe/tmp/error10_wx_enwiki_032011
All chunks from March 2011 (and previous dumps) contained these errors.
The fraction of these erroneous entries is still very low compared to the size of the whole dump, so it doesn't affect the accuracy of global studies. All the same, it might cause some trouble if one is looking for a particular revision in the complete collection (I haven't checked explicitly, but it looks like there is no pattern in these errors and they are produced randomly).
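For anyone who wants to check a chunk independently, here is a rough sketch of how revisions without a usable user value could be listed. It is not the WikiXRay code. It assumes the usual export layout, where each <revision> has an <id> and a <contributor> holding <username>/<id> or <ip>, and it assumes that redacted contributors appear as <contributor deleted="deleted" />. The function name and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}revision'."""
    return tag.rsplit("}", 1)[-1]


def find_revisions_without_user(stream):
    """Yield (revision_id, reason) for revisions lacking contributor data.

    Hypothetical helper: the element and attribute names follow the
    assumed export layout described in the text above.
    """
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "revision":
            continue
        rev_id = None
        contributor = None
        for child in elem:  # direct children of <revision> only
            name = local(child.tag)
            if name == "id":
                rev_id = (child.text or "").strip()
            elif name == "contributor":
                contributor = child
        if contributor is None:
            yield rev_id, "no contributor element"
        elif contributor.get("deleted") is not None:
            # Assumed marker for oversighted/deleted user information.
            yield rev_id, "contributor redacted"
        else:
            values = [(c.text or "").strip() for c in contributor
                      if local(c.tag) in ("username", "ip", "id")]
            if not any(values):
                yield rev_id, "empty contributor"
        elem.clear()  # release the revision's children once inspected


# Made-up sample; a real run would pass an open dump file instead.
SAMPLE = """<mediawiki>
<page><title>T</title><id>1</id>
<revision><id>10</id><contributor><username>A</username><id>5</id></contributor></revision>
<revision><id>11</id><contributor deleted="deleted" /></revision>
<revision><id>12</id><contributor><ip>1.2.3.4</ip></contributor></revision>
<revision><id>13</id><contributor><username></username></contributor></revision>
<revision><id>14</id></revision>
</page></mediawiki>"""

if __name__ == "__main__":
    for rev_id, reason in find_revisions_without_user(
            io.BytesIO(SAMPLE.encode("utf-8"))):
        print(rev_id, reason)
```

Separating the "redacted" case from the "empty" and "missing" cases would help tell whether the blanks come from oversighted data or from a problem in the dump process itself.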
Let me know if you need more info that can be of help to solve this issue.
Best, Felipe.
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l