Hi Ariel.
I've been noticing since last year (when I introduced an error-logging service in WikiXRay) that there are several malformed revision items in the dump files. This causes exceptions when trying to insert tuples without coherent values into the DB.
I still receive these errors in the new dumps, so I think there is indeed some issue that you should check.
Most of the time, the missing value is <rev_user>. I'm not sure about the cause, but perhaps the dump process is facing a high load on the target server and this causes blanks to be inserted instead of the actual value.
You can take a look at these malformed items in the following error log file (taken from chunk 10 produced in March 2011):
http://gsyc.es/~jfelipe/tmp/error10_wx_enwiki_032011
All chunks from March 2011 (and previous dumps) contained these errors.
The fraction of these erroneous entries is still very low, compared to the size of the whole dump, so it doesn't affect the accuracy of global studies. All the same, it might cause some trouble in case one is looking for a particular revision in the complete collection (I haven't checked explicitly, but it looks like there is no pattern in these errors and they are produced randomly).
Let me know if you need more info that could help solve this issue.
Best, Felipe.
At first sight these appear to be related to redaction (i.e. oversighted/deleted data).
Richard Farmbrough.
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
On 06/05/11 22:16, Felipe Ortega wrote:
complete collection (I haven't checked explicitly, but it looks like there is no pattern in these errors and they are produced randomly).
Let me know if you need more info that can be of help to solve this issue.
Best
Maybe those usernames were RevDeleted? See http://en.wikipedia.org/wiki/?oldid=233494693, where the username is not shown on the wiki either: "This is an old revision of this page, as edited by (Username or IP removed) at 07:55, 22 August 2008."
So the dumps are right.
----- Original message ----
From: Platonides platonides@gmail.com
To: Felipe Ortega glimmer_phoenix@yahoo.es
CC: xmldatadumps-l@lists.wikimedia.org
Sent: Fri, 6 May 2011 23:39
Subject: Re: [Xmldatadumps-l] Malformed revision items
Interesting. I hadn't thought about this possibility and it's a good explanation.
Felipe.
On Sat, 07-05-2011 at 16:46 +0100, Felipe Ortega wrote:
Please excuse my delay in replying.
Indeed these revisions were oversighted and the username hidden. A number of fields, including the edit summary and the username, can be hidden by oversighters; you'll want to adjust your code to account for these. Typically the XML tag will have the contents 'missing' (although this can also occur for other reasons).
Ariel
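A minimal Python sketch of the adjustment suggested above, assuming a hidden field shows up either as an element carrying a deleted="deleted" attribute, as an empty element, or not at all. The function names, the sample XML, and the choice of None (stored as NULL in the DB) as the default are illustrative assumptions, not WikiXRay code; real dump files also declare an XML namespace, which this sample omits.

```python
import xml.etree.ElementTree as ET


def field_text(parent, tag):
    """Return the text of parent/tag, or None if it is absent, empty, or flagged deleted."""
    elem = parent.find(tag) if parent is not None else None
    if elem is None or elem.get("deleted") == "deleted":
        return None
    text = (elem.text or "").strip()
    return text if text else None


def parse_revision(rev):
    """Build a DB-ready dict from a <revision> element, tolerating hidden fields."""
    contributor = rev.find("contributor")
    # A contributor flagged as deleted has no usable children: treat it as absent.
    if contributor is not None and contributor.get("deleted") == "deleted":
        contributor = None
    user_id = field_text(contributor, "id")
    user_text = field_text(contributor, "username")
    if user_text is None:
        # Anonymous edits carry an <ip> element instead of <username>.
        user_text = field_text(contributor, "ip")
    return {
        "rev_id": int(rev.findtext("id")),
        "rev_user": int(user_id) if user_id is not None else None,
        "rev_user_text": user_text,
        "rev_comment": field_text(rev, "comment"),
    }


# Hypothetical sample: one normal revision and one with hidden user and summary.
SAMPLE = """<page>
  <revision>
    <id>1</id>
    <contributor><username>Alice</username><id>42</id></contributor>
    <comment>typo fix</comment>
  </revision>
  <revision>
    <id>2</id>
    <contributor deleted="deleted" />
    <comment deleted="deleted" />
  </revision>
</page>"""

rows = [parse_revision(rev) for rev in ET.fromstring(SAMPLE).iter("revision")]
for row in rows:
    print(row)
```

With None as the default, the insert no longer fails on the hidden revision, provided the corresponding DB columns accept NULL.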
----- Original message ----
From: Ariel T. Glenn ariel@wikimedia.org
To: Felipe Ortega glimmer_phoenix@yahoo.es
CC: Platonides platonides@gmail.com; xmldatadumps-l@lists.wikimedia.org
Sent: Mon, 16 May 2011 10:07
Subject: Re: [Xmldatadumps-l] Malformed revision items
Yep, I'll use some kind of default value for missing fields.
Thanks for your answers.
Felipe.