Hi Daniel,
Thanks for the quick and helpful reply. I was hoping that the XML dumps could be changed, but I understand now that the JSON dumps are the recommended format.
> To avoid downloading redundant information, you can use one of the wikidatawiki-20161120-stub-* dumps instead of the full page dumps.
This is useful, but unfortunately it won't suffice. Wikidata also has pages that are wikitext (for example, https://www.wikidata.org/wiki/Wikidata:WikiProject_Names). These wikitext pages are in the XML dumps, but are in neither the stub dumps nor the JSON dumps. I use these wikitext pages to try to reproduce Wikidata in its entirety, so for now it looks like both the XML dumps and the JSON dumps will be required.
At any rate, thanks again for the excellent reply.
On Sat, Nov 26, 2016 at 12:25 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Hi gnosygnu!
The JSON in the XML dumps is the raw contents of the storage backend. It can't be changed retroactively, and re-encoding everything on the fly would be too expensive. Also, the JSON embedded in the XML files is not officially supported as a stable interface of Wikibase. The JSON format in the XML files can change without notice, and you may encounter different representations even within the same dump.
I recommend using the JSON dumps; they contain our data in canonical form. To avoid downloading redundant information, you can use one of the wikidatawiki-20161120-stub-* dumps instead of the full page dumps. These don't contain the actual page content, just metadata.
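For readers who want to pull just the property datatypes out of a JSON dump, here is a minimal sketch. It assumes the dump is a single JSON array with one entity per line (each line ending in a comma) and that property entities carry a top-level "datatype" field. The function names `iter_entities` and `collect_property_datatypes` are hypothetical, not part of any Wikibase tooling.

```python
import json


def iter_entities(lines):
    """Yield one entity dict per line of a JSON dump.

    Assumes the dump is one JSON array with a single entity per line,
    so the opening '[' and closing ']' lines are skipped and the
    trailing comma on each entity line is stripped before parsing.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def collect_property_datatypes(lines):
    """Build a {property id: datatype} map from property entities only."""
    datatypes = {}
    for entity in iter_entities(lines):
        if entity.get("type") == "property" and "datatype" in entity:
            datatypes[entity["id"]] = entity["datatype"]
    return datatypes


# Tiny in-memory stand-in for a dump file (illustrative data only).
sample_lines = [
    "[",
    '{"id": "P41", "type": "property", "datatype": "commonsMedia"},',
    '{"id": "Q38", "type": "item"}',
    "]",
]
print(collect_property_datatypes(sample_lines))
```

With a real dump, `lines` would be an open (and probably decompressed) file handle, so the whole dump never has to be held in memory.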
Caveat: there is currently no dump that contains the JSON of old revisions of entities in canonical form. You can only get them individually from Special:EntityData, e.g. https://www.wikidata.org/wiki/Special:EntityData/Q23.json?oldid=30279
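As a rough illustration of the per-revision lookup, the sketch below only builds the URL, following the pattern of the example above. The helper name `entity_data_url` is hypothetical, and the `oldid` parameter is taken from that example rather than from any API reference.

```python
from urllib.parse import urlencode

# Base path taken from the example URL in this message.
ENTITY_DATA_BASE = "https://www.wikidata.org/wiki/Special:EntityData"


def entity_data_url(entity_id, revision_id=None, fmt="json"):
    """Return the Special:EntityData URL for an entity.

    If revision_id is given, it is passed as the 'oldid' query
    parameter, mirroring the example URL quoted in this thread.
    """
    url = f"{ENTITY_DATA_BASE}/{entity_id}.{fmt}"
    if revision_id is not None:
        url += "?" + urlencode({"oldid": revision_id})
    return url


print(entity_data_url("Q23", 30279))
```

Fetching old revisions this way means one request per revision, so it is only practical for a limited number of entities.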
HTH -- daniel
On 26.11.2016 at 02:13, gnosygnu wrote:
Hi everyone. I have a question about the Wikidata xml dump, but I'm posting this question here, because it looks more related to Wikidata.
In short, it seems that the "pages-articles.xml" does not include the datatype property for snaks. For example, the xml dump does not list a datatype for Q38 (Italy) and P41 (flag image). In contrast, the json dump does list a datatype of "commonsMedia".
Can this datatype property be included in future xml dumps? The alternative would be to download two large and redundant dumps (xml and json) in order to reconstruct a Wikidata instance.
More information is provided below the break. Let me know if you need anything else.
Thanks.
Here's an excerpt from the xml data dump for Q38 (Italy) and P41 (flag image). Notice that there is no "datatype" property:
// https://dumps.wikimedia.org/wikidatawiki/20161120/wikidatawiki-20161120-page...
"mainsnak": {
  "snaktype": "value",
  "property": "P41",
  "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
  "datavalue": {
    "value": "Flag of Italy.svg",
    "type": "string"
  }
},
Meanwhile, the API and the JSON dump list a datatype property of "commonsMedia":
// https://www.wikidata.org/w/api.php?action=wbgetentities&ids=q38
// https://dumps.wikimedia.org/wikidatawiki/entities/20161114/wikidata-20161114...
"P41": [{
  "mainsnak": {
    "snaktype": "value",
    "property": "P41",
    "datavalue": {
      "value": "Flag of Italy.svg",
      "type": "string"
    },
    "datatype": "commonsMedia"
  },
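The difference between the two excerpts can be bridged with a lookup from property id to datatype. The sketch below is only an illustration: `add_datatype` is a hypothetical helper, and the `property_datatypes` map is assumed to have been built separately (for instance from the JSON dump or the API).

```python
import copy


def add_datatype(snak, property_datatypes):
    """Return a copy of snak with 'datatype' filled in when it is missing.

    property_datatypes is an assumed {property id: datatype} map; snaks
    whose property is not in the map are returned unchanged.
    """
    enriched = copy.deepcopy(snak)
    if "datatype" not in enriched:
        datatype = property_datatypes.get(enriched["property"])
        if datatype is not None:
            enriched["datatype"] = datatype
    return enriched


# The mainsnak as it appears in the xml dump excerpt (no "datatype").
xml_snak = {
    "snaktype": "value",
    "property": "P41",
    "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
    "datavalue": {"value": "Flag of Italy.svg", "type": "string"},
}

# Assumed lookup table; only the one entry known from the excerpts.
property_datatypes = {"P41": "commonsMedia"}

print(add_datatype(xml_snak, property_datatypes)["datatype"])
```

This works only because a datatype belongs to the property rather than to the individual snak, so one map entry per property is enough.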
As far as I can tell, the Turtle (ttl) dump does not list a datatype property either, but this may be because I don't understand its format.
wd:Q38 p:P41 wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D .
wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D a wikibase:Statement, wikibase:BestRank ;
    wikibase:rank wikibase:NormalRank ;
    ps:P41 http://commons.wikimedia.org/wiki/Special:FilePath/Flag%20of%20Italy.svg ;
    pq:P580 "1946-06-19T00:00:00Z"^^xsd:dateTime ;
    pqv:P580 wdv:204e90b1bce9f96d6d4ff632a8da0ecc .
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.