Hi,
The recent Wikidata JSON dumps again contain huge amounts of broken JSON where empty maps are serialized as [] instead of using {}. Just grep for
"claims":[] or "aliases":[] or any other key that requires a map
to find many examples. The scope of the problem is massive. Basically all entity documents that include some empty map are broken, which is almost every entity document in http://dumps.wikimedia.org/other/wikidata/20150803.json.gz. Concretely, there are around 15.7 million entities with [] for aliases.
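For readers who want more than a grep, here is a minimal sketch of how the affected entities could be counted per key. It assumes the dump is a JSON array with one entity document per line (each line ending in a comma), and the list of map-valued keys in MAP_KEYS is an illustrative assumption rather than an authoritative schema. The function names are hypothetical.

```python
import gzip
import json
from collections import Counter

# Keys expected to hold JSON objects (maps). Assumed for illustration only.
MAP_KEYS = ("labels", "descriptions", "aliases", "claims", "sitelinks")


def count_empty_list_maps(lines):
    """Count, per key, the entities where a map-valued key is serialized as []."""
    counts = Counter()
    for raw in lines:
        # Assumption: one entity per line, with a trailing comma between entities.
        line = raw.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the enclosing array brackets and blank lines
        entity = json.loads(line)
        for key in MAP_KEYS:
            if key in entity and entity[key] == []:
                counts[key] += 1
    return counts


def count_in_dump(path):
    """Run the count over a gzipped dump file, streaming it line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_empty_list_maps(handle)
```

Calling count_in_dump on a local copy of the dump file would return a Counter such as one keyed by "aliases" and "claims"; the file is streamed, so it does not need to fit in memory.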
This is critically breaking the consumption of Wikidata content for all model-based JSON parsers, including Wikidata Toolkit.
The bug used to occur only in XML dumps, but now also affects the JSON dumps in the same way. In previous JSON dumps, the problem was avoided by omitting empty maps altogether (no keys, no values), which is better because it allows implementations to fall back to the obvious default. This is still done in the Web API, e.g., https://www.wikidata.org/wiki/Special:EntityData/Q12062430.json
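As a consumer-side workaround, the "obvious default" could be applied before the document reaches a model-based parser. The following is only a sketch under assumptions: MAP_KEYS is an illustrative list of map-valued keys, and normalize_entity is a hypothetical helper, not part of Wikidata Toolkit or any other library.

```python
# Keys expected to hold JSON objects (maps). Assumed for illustration only.
MAP_KEYS = ("labels", "descriptions", "aliases", "claims", "sitelinks")


def normalize_entity(entity):
    """Return a shallow copy in which missing or []-valued map keys become {}."""
    fixed = dict(entity)
    for key in MAP_KEYS:
        value = fixed.get(key)
        # A missing key, an explicit null, and an empty list all mean "empty map".
        if value is None or value == []:
            fixed[key] = {}
    return fixed
```

This treats an omitted key and a wrongly serialized [] in the same way, so a parser that expects a map sees {} in both cases. Non-empty values are left untouched.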
It would be nice to test the export code before deploying it.
Regards,
Markus
On Tue, Aug 4, 2015 at 12:20 PM, Markus Krötzsch markus@semantic-mediawiki.org wrote:
Sorry for that. Adam and Marius are working on a fix right now. They'll report back in a bit.
Cheers Lydia
Please see https://gerrit.wikimedia.org/r/#/c/229099/ and https://gerrit.wikimedia.org/r/#/c/229100/ for the change to master and the currently deployed branch. This will be merged and backported today, and a new dump will be created.
I'm also going to follow this up by writing some more integration tests for our JSON dumps to spot this kind of thing!
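One possible shape for such a check, sketched in Python from the consumer's point of view rather than as the actual Wikibase test code: every map-valued key in a serialized entity must either be absent or hold a JSON object. MAP_KEYS and find_map_type_violations are hypothetical names chosen for illustration.

```python
import json

# Keys expected to hold JSON objects (maps). Assumed for illustration only.
MAP_KEYS = ("labels", "descriptions", "aliases", "claims", "sitelinks")


def find_map_type_violations(entity_json):
    """Return (key, type name) pairs for map keys that are present but not objects."""
    entity = json.loads(entity_json)
    return [
        (key, type(entity[key]).__name__)
        for key in MAP_KEYS
        if key in entity and not isinstance(entity[key], dict)
    ]
```

A test could serialize an entity that has no aliases or claims and assert that this function returns an empty list for the result.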
On 4 August 2015 at 11:26, Lydia Pintscher lydia.pintscher@wikimedia.de wrote:
-- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata
Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On 04.08.2015 12:33, Addshore wrote:
That's great news! Many thanks.
Markus
-- Addshore