The JSON dump is the preferred format if you want to process the entity data. From the
JSON dumps, you get the current entities (items and properties) in the
current canonical format, for further processing.
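As an illustration, here is a rough sketch of how one might stream entities from
such a dump. It assumes the current layout (one large JSON array with one entity
per line); the file name is only an example:

    # Sketch: stream items/properties from a Wikidata JSON dump.
    # Assumes one entity per line inside a single JSON array (current
    # layout, not guaranteed to stay that way).
    import bz2
    import json

    def iter_entities(path):
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip().rstrip(",")
                if line in ("[", "]", ""):
                    continue  # skip the surrounding array brackets
                yield json.loads(line)

    for entity in iter_entities("wikidata-20150608-all.json.bz2"):
        # entity["id"] is e.g. "Q42" for items or "P31" for properties
        label = entity.get("labels", {}).get("en", {}).get("value")
        print(entity["id"], label)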
The XML dumps are an "opaque" exchange format for MediaWiki page content. They
are designed to allow content from pages in one wiki to be imported into another
wiki(*), including old revisions. They can also be used for backups, since they
provide a future-proof way to store your wiki's content. But the format of the
page content in the XML dumps is not strictly specified: it can be wikitext,
JSON data, or anything else, depending on the page's content model. The JSON you
find embedded in the XML dumps of Wikidata may or may not be compatible with the
format in the JSON file, and is subject to change without notice. It is not
designed for processing by third parties.
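To illustrate the "opaque" part: a streaming parser of the XML dump only sees a
text blob per revision, whose format depends on the page. A rough sketch (the
namespace URI varies with the export schema version, and the file name is an
example):

    # Sketch: stream page text from a MediaWiki XML dump.
    # The text payload may be wikitext, JSON, or anything else --
    # the dump itself does not normalize it.
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # varies by dump version

    def iter_page_texts(path):
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()  # free memory for already-processed pages

    for title, text in iter_page_texts("wikidatawiki-pages-articles.xml"):
        print(title, len(text))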
Wikidata XML dumps will continue to be generated for all pages, including
history, as is done for all Wikimedia projects. However, this process often
breaks due to the large size of these dumps. If you want to process Wikidata
items, you should use the JSON dumps.
HTH
-- daniel
(*) this is usually disabled for wikibase entities, to avoid ID conflicts.
On 14.06.2015 at 03:38, gnosygnu wrote:
According to
http://www.wikidata.org/wiki/Wikidata:Database_download,
the JSON dump is listed as the recommended dump format. Also, at the
time of writing, the JSON dump has been generated regularly every
week, whereas the XML dump has been delayed for 2+ months.
Going forward, will both dumps continue to be supported? Or will the
XML dump be phased out and only the JSON dump remain? Or are these
plans still to be determined based on upcoming changes to the dumping
infrastructure as per
https://phabricator.wikimedia.org/T88728?
If the JSON dump is to be the sole data format, is there any way to
address the following omissions?
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.