On Tue, Aug 4, 2015 at 12:20 PM, Markus Krötzsch
<markus(a)semantic-mediawiki.org> wrote:
Hi,
The recent Wikidata JSON dumps again contain huge amounts of broken JSON
where empty maps are serialized as [] instead of using {}. Just grep for
"claims":[]
or
"aliases":[]
or
any other key that requires a map
to find many examples. The scope of the problem is massive. Basically all
entity documents that include some empty map are broken, which is almost
every entity document in
http://dumps.wikimedia.org/other/wikidata/20150803.json.gz. Concretely,
there are around 15.7 million entities with [] for aliases.
This is critically breaking the consumption of Wikidata content for all
model-based JSON parsers, including Wikidata Toolkit.
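For consumers that need to read the affected dump before a fix lands, one possible workaround is to rewrite the stray [] back to {} before handing each document to a model-based parser. The following is only a sketch: the MAP_KEYS list, the function names, and the one-entity-per-line layout of the dump are assumptions made for illustration, not taken from the export code or from Wikidata Toolkit.

```python
import json

# Assumption: top-level keys of an entity document that should hold maps.
MAP_KEYS = ("labels", "descriptions", "aliases", "claims", "sitelinks")


def normalize_entity(entity):
    """Return a shallow copy where an empty list under a map key becomes {}."""
    fixed = dict(entity)
    for key in MAP_KEYS:
        if fixed.get(key) == []:
            fixed[key] = {}
    return fixed


def parse_dump_line(line):
    """Parse one dump line, assuming one entity per line inside a JSON array.

    Returns None for blank lines and for the enclosing "[" and "]" lines.
    """
    text = line.strip().rstrip(",")
    if text in ("", "[", "]"):
        return None
    return normalize_entity(json.loads(text))
```

Only empty lists are rewritten, so a non-empty list under one of these keys would still surface as an error in the downstream parser rather than being silently altered.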
The bug used to occur only in XML dumps, but now also affects the JSON dumps
in the same way. In previous JSON dumps, the problem was avoided by omitting
empty maps altogether (no keys, no values), which is better because it
allows implementations to fall back to the obvious default. This is still
done in the Web API, e.g.,
https://www.wikidata.org/wiki/Special:EntityData/Q12062430.json
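To illustrate why omitting the key is easier on consumers, here is a minimal sketch of a reader that falls back to an empty map when "aliases" is absent. The helper name aliases_for is hypothetical, and the alias layout (a language code mapped to a list of records with a "value" field) is assumed here.

```python
def aliases_for(entity, language):
    """Return the alias strings of an entity for one language code."""
    # A missing "aliases" key falls back to the obvious default: an empty map.
    aliases = entity.get("aliases", {})
    # Assumed layout: each language maps to a list of {"language", "value"} records.
    return [record["value"] for record in aliases.get(language, [])]
```

With the key omitted this returns an empty list, whereas "aliases":[] makes the same code raise an AttributeError, because a list has no get method.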
It would be nice to test the export code before deploying it.
Sorry for that. Adam and Marius are working on a fix right now.
They'll report back in a bit.
Cheers
Lydia
--
Lydia Pintscher -
http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg
under the number 23855 Nz. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.