Hi all,
in the dump file wikidatawiki-20140912-pages-articles.xml.bz2
I seem to find some items with a key "description", some with "descriptions".
For example, near the beginning of the file:
Q15 seems to have key "description"
Q17 seems to have key "descriptions"
This is rather unhelpful when running e.g. my stats script.
a) Can someone please confirm that I'm not crazy? I mean, in this instance.
b) Is this a bug, or a feature?
c) If a bug, is it already fixed for the next dump? Which key will it be?
(If a feature: why?)
Thanks, Magnus
Oh, I just noticed: same with "links" and "sitelinks".
Is the problem with the different representation of empty values still in there?
links: [] vs. links: ""
Lukas
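For scripts that have to cope with both spellings of an empty value, a minimal Python sketch follows. It assumes the item has already been parsed into a dict and that an empty "links" value should be treated as an empty mapping. The helper name links_or_empty and the sample items are made up for illustration and are not part of any Wikidata tooling.

```python
def links_or_empty(item, key="links"):
    """Return item[key], mapping the empty spellings [] and "" to {}."""
    value = item.get(key, {})
    # Assumption: both [] and "" mean "no links" in the dump.
    if value == [] or value == "":
        return {}
    return value


# Hypothetical sample items, not taken from the dump.
print(links_or_empty({"links": []}))
print(links_or_empty({"links": ""}))
print(links_or_empty({"links": {"enwiki": "Africa"}}))
```

Non-empty values are passed through untouched, so the caller still has to handle whatever shape the real entries have.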
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Haven't checked, really; I'll now ignore the XML dumps, which are obviously broken for the time being, and use the JSON dumps.
Speaking of which, the last one seems to have failed; 20141006.json.gz is stuck at 700 bytes, for two days now.
Hey,
> Speaking of which, the last one seems to have failed; 20141006.json.gz is
> stuck at 700 bytes, for two days now.
It's our new state of the art compression algorithm.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
On Wed, Oct 8, 2014 at 10:34 PM, Magnus Manske magnusmanske@googlemail.com wrote:
> Speaking of which, the last one seems to have failed; 20141006.json.gz is stuck at 700 bytes, for two days now.
Marius just prepared a fix. Will be deployed later tonight.
Cheers Lydia
Thanks!
I just found someone to restart the dump process, now that the dump creation script has been fixed. The next JSON dump should be available around noon (CEST) :)
Cheers,
Marius
On 08.10.2014 at 21:29, Magnus Manske wrote:
> Oh, I just noticed: same with "links" and "sitelinks".
Sounds like some revisions are using the old internal format, while some use the canonical external representation. That shouldn't happen...
-- daniel
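Until the dumps are consistent, a consumer script could paper over the two key spellings reported in this thread. The following Python sketch assumes each item is already parsed into a dict. The mapping OLD_TO_NEW and the function rename_old_keys are hypothetical names, and only the two key pairs mentioned above are covered. It renames top-level keys only and does not convert the values, whose layout may also differ between the two formats.

```python
# Old-format key -> new-format key, limited to the pairs reported
# in this thread; not taken from the Wikibase source.
OLD_TO_NEW = {
    "description": "descriptions",
    "links": "sitelinks",
}


def rename_old_keys(item):
    """Return a copy of item with old-format top-level keys renamed."""
    renamed = {}
    for key, value in item.items():
        renamed[OLD_TO_NEW.get(key, key)] = value
    return renamed


# Hypothetical sample item, not taken from the dump.
old_item = {"description": {"en": "continent"}, "links": {}}
print(rename_old_keys(old_item))
```

Items that already use the new keys pass through unchanged, so the function can be applied to every item regardless of which format it came in.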
I managed to do the task at hand by switching to JSON dumps (because that's the new, officially supported, long-term-stable Wikidata dump format, right? Right???), so no hurry there.
Maybe the XML dump process was run in the middle of the switch to the new format, or got a stale cache for some items?
On Thu, Oct 9, 2014 at 3:19 PM, Magnus Manske magnusmanske@googlemail.com wrote:
> I managed to do the task at hand by switching to JSON dumps (because that's the new, officially supported, long-term-stable Wikidata dump format, right? Right???), so no hurry there.
>
> Maybe the XML dump process was run in the middle of the switch to the new format, or got a stale cache for some items?
It looks like the switch happened in the middle of a dump creation, so this one is a mix of the old and the new format. The ones after that should be all new format. And yay for switching to JSON!
Cheers Lydia
Different keys can still be found in the current XML dump, wikidatawiki-20141009-pages-articles.xml.bz2. I'm not sure whether this bug is also present in the dump with history.
page_id, wd_id, keys
111, Q15, ['aliases', 'claims', 'descriptions', 'id', 'labels', 'sitelinks', 'type']
137, Q24, ['aliases', 'claims', 'description', 'entity', 'label', 'links']
31500, Q28119, ['aliases', 'description', 'entity', 'label', 'links']
225144, ?, ['entity', 'redirect']
Lukas
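A listing like the one above can be condensed into a tally of key sets, which shows how many items use each serialization. This is a rough Python sketch, assuming the items have already been extracted from the dump and parsed into dicts (the XML parsing itself is not shown). The function names and the sample items are hypothetical.

```python
from collections import Counter


def key_signature(item):
    """Sorted tuple of an item's top-level keys."""
    return tuple(sorted(item))


def count_signatures(items):
    """Count how many items share each top-level key signature."""
    return Counter(key_signature(item) for item in items)


# Hypothetical stand-ins for parsed items; values are omitted.
items = [
    {"id": None, "type": None, "labels": None, "descriptions": None},
    {"entity": None, "label": None, "description": None, "links": None},
    {"entity": None, "label": None, "description": None, "links": None},
]
for signature, count in count_signatures(items).most_common():
    print(count, list(signature))
```

More than one signature in the output for ordinary items would indicate that the dump still mixes formats.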