Hi Daniel,
Thanks for the quick and helpful reply. I was hoping that the XML
dumps could be changed, but I understand now that the JSON dumps are
the recommended format.
> To avoid downloading redundant information, you can use one of the
> wikidatawiki-20161120-stub-* dumps instead of the full page dumps
This is useful, but unfortunately it won't suffice. Wikidata also has
pages which are wikitext (for example,
https://www.wikidata.org/wiki/Wikidata:WikiProject_Names). These
wikitext pages are in the XML dumps, but in neither the stub dumps nor
the JSON dumps. I actually do use these Wikidata wikitext entries to
try to reproduce Wikidata in its entirety. So for now, it looks like
both XML dumps and JSON dumps will be required.
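For what it's worth, pulling only the wikitext pages out of the XML dump could look roughly like the sketch below. It assumes the usual MediaWiki export layout, where each <page> holds a <title> and a <revision> with <model> and <text> children. The function names and the inline sample are made up for illustration, and a real run would read the (decompressed) dump file instead.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_wikitext_pages(xml_stream):
    """Yield (title, text) for every page whose content model is wikitext."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, model, text = None, None, None
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "model":
                model = node.text
            elif name == "text":
                text = node.text or ""
        if model == "wikitext":
            yield title, text
        elem.clear()  # drop the page's children; real dumps are very large


# Tiny hypothetical sample standing in for the real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Q38</title>
    <revision>
      <model>wikibase-item</model>
      <text>{"type":"item","id":"Q38"}</text>
    </revision>
  </page>
  <page>
    <title>Wikidata:WikiProject Names</title>
    <revision>
      <model>wikitext</model>
      <text>Example wikitext</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_wikitext_pages(io.BytesIO(SAMPLE)))
```

Entity pages would then come from the JSON dump, and only the pages yielded here would need to be taken from the XML dump.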
At any rate, thanks again for the excellent reply.
On Sat, Nov 26, 2016 at 12:25 PM, Daniel Kinzler
<daniel.kinzler(a)wikimedia.de> wrote:
Hi gnosygnu!
The JSON in the XML dumps is the raw contents of the storage backend. It can't
be changed retroactively, and re-encoding everything on the fly would be too
expensive. Also, the JSON embedded in the XML files is not officially supported
as a stable interface of Wikibase. The JSON format in the XML files can change
without notice, and you may encounter different representations even within the
same dump.
I recommend using the JSON dumps; they contain our data in canonical form. To
avoid downloading redundant information, you can use one of the
wikidatawiki-20161120-stub-* dumps instead of the full page dumps. These don't
contain the actual page content, just meta-data.
Caveat: there is currently no dump that contains the JSON of old revisions of
entities in canonical form. You can only get them individually from
Special:EntityData, e.g.
<https://www.wikidata.org/wiki/Special:EntityData/Q23.json?oldid=30279>
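As a rough sketch of the per-revision lookup described above, a small helper can build the Special:EntityData URL for an entity and an optional old revision. The function name and constant are hypothetical. The URL shape and the "oldid" parameter are taken from the example link. Nothing is fetched here.

```python
from urllib.parse import urlencode

ENTITY_DATA_BASE = "https://www.wikidata.org/wiki/Special:EntityData"


def entity_data_url(entity_id, revision_id=None, fmt="json"):
    """Build a Special:EntityData URL, optionally pinned to an old revision."""
    url = f"{ENTITY_DATA_BASE}/{entity_id}.{fmt}"
    if revision_id is not None:
        # "oldid" follows the example link above.
        url += "?" + urlencode({"oldid": revision_id})
    return url


url = entity_data_url("Q23", 30279)
```

Fetching every old revision this way means one request per revision, so it is probably only practical for a small number of entities.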
HTH
-- daniel
On 26.11.2016 at 02:13, gnosygnu wrote:
Hi everyone. I have a question about the Wikidata xml dump, but I'm
posting this question here, because it looks more related to Wikidata.
In short, it seems that the "pages-articles.xml" does not include the
datatype property for snaks. For example, the xml dump does not list a
datatype for Q38 (Italy) and P41 (flag image). In contrast, the json
dump does list a datatype of "commonsMedia".
Can this datatype property be included in future xml dumps? The
alternative would be to download two large and redundant dumps (xml
and json) in order to reconstruct a Wikidata instance.
More information is provided below the break. Let me know if you need
anything else.
Thanks.
----
Here's an excerpt from the xml data dump for Q38 (Italy) and P41 (flag
image). Notice that there is no "datatype" property:
// https://dumps.wikimedia.org/wikidatawiki/20161120/wikidatawiki-20161120-pag…
"mainsnak": {
  "snaktype": "value",
  "property": "P41",
  "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
  "datavalue": {
    "value": "Flag of Italy.svg",
    "type": "string"
  }
},
Meanwhile, the API and the JSON dump list a datatype property of
"commonsMedia":
// https://www.wikidata.org/w/api.php?action=wbgetentities&ids=q38
// https://dumps.wikimedia.org/wikidatawiki/entities/20161114/wikidata-2016111…
"P41": [{
  "mainsnak": {
    "snaktype": "value",
    "property": "P41",
    "datavalue": {
      "value": "Flag of Italy.svg",
      "type": "string"
    },
    "datatype": "commonsMedia"
  },
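To illustrate the difference between the two excerpts, here is a minimal sketch of a workaround: copying a datatype onto each main snak from a separate property-to-datatype map. It assumes the claims from the XML dump are keyed by property id, as in the JSON dump excerpt, and that such a map can be built elsewhere (for example from the property entities). The function name and the hand-written map are hypothetical, and qualifier and reference snaks are not handled.

```python
import copy


def add_datatypes(claims, property_datatypes):
    """Return a copy of claims whose main snaks carry a 'datatype' key."""
    result = copy.deepcopy(claims)
    for statements in result.values():
        for statement in statements:
            snak = statement["mainsnak"]
            datatype = property_datatypes.get(snak["property"])
            if datatype is not None:
                # Keep any datatype that is already present.
                snak.setdefault("datatype", datatype)
    return result


# Main snak as it appears in the XML dump excerpt (no "datatype").
xml_dump_claims = {
    "P41": [{
        "mainsnak": {
            "snaktype": "value",
            "property": "P41",
            "datavalue": {"value": "Flag of Italy.svg", "type": "string"},
        }
    }]
}

# Hypothetical lookup table; it would have to be built separately.
property_datatypes = {"P41": "commonsMedia"}

patched = add_datatypes(xml_dump_claims, property_datatypes)
```

This only approximates the JSON dump output. It does not address the point that the embedded JSON can change without notice.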
As far as I can tell, the Turtle (ttl) dump does not list a datatype
property either, but this may be because I don't understand its
format.
wd:Q38 p:P41 wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D .
wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D a wikibase:Statement,
    wikibase:BestRank ;
  wikibase:rank wikibase:NormalRank ;
  ps:P41 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag%20of%20Italy.svg> ;
  pq:P580 "1946-06-19T00:00:00Z"^^xsd:dateTime ;
  pqv:P580 wdv:204e90b1bce9f96d6d4ff632a8da0ecc .
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.