Hi everyone. I have a question about the Wikidata XML dump, but I'm posting it here because it seems more closely related to Wikidata.
In short, it seems that the "pages-articles.xml" dump does not include the datatype property for snaks. For example, the XML dump does not list a datatype for the P41 (flag image) statement on Q38 (Italy). In contrast, the JSON dump does list a datatype of "commonsMedia".
Can this datatype property be included in future XML dumps? The alternative would be to download two large, largely redundant dumps (XML and JSON) in order to reconstruct a Wikidata instance.
More information is provided below the break. Let me know if you need anything else.
Thanks.
----
Here's an excerpt from the XML data dump for Q38 (Italy) and P41 (flag image). Notice that there is no "datatype" property:

// https://dumps.wikimedia.org/wikidatawiki/20161120/wikidatawiki-20161120-page...

    "mainsnak": {
      "snaktype": "value",
      "property": "P41",
      "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
      "datavalue": {
        "value": "Flag of Italy.svg",
        "type": "string"
      }
    },
Meanwhile, the API and the JSON dump list a datatype property of "commonsMedia":

// https://www.wikidata.org/w/api.php?action=wbgetentities&ids=q38
// https://dumps.wikimedia.org/wikidatawiki/entities/20161114/wikidata-20161114...

    "P41": [{
      "mainsnak": {
        "snaktype": "value",
        "property": "P41",
        "datavalue": {
          "value": "Flag of Italy.svg",
          "type": "string"
        },
        "datatype": "commonsMedia"
      },
As far as I can tell, the Turtle (TTL) dump does not list a datatype property either, but that may be because I don't understand its format.

    wd:Q38 p:P41 wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D .

    wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D a wikibase:Statement, wikibase:BestRank ;
      wikibase:rank wikibase:NormalRank ;
      ps:P41 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag%20of%20Italy.svg> ;
      pq:P580 "1946-06-19T00:00:00Z"^^xsd:dateTime ;
      pqv:P580 wdv:204e90b1bce9f96d6d4ff632a8da0ecc .
Hi gnosygnu!
The JSON in the XML dumps is the raw contents of the storage backend. It can't be changed retroactively, and re-encoding everything on the fly would be too expensive. Also, the JSON embedded in the XML files is not officially supported as a stable interface of Wikibase. The JSON format in the XML files can change without notice, and you may encounter different representations even within the same dump.
I recommend using the JSON dumps; they contain our data in canonical form. To avoid downloading redundant information, you can use one of the wikidatawiki-20161120-stub-* dumps instead of the full page dumps. These don't contain the actual page content, just metadata.
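To illustrate the first point, a minimal sketch of streaming the canonical JSON dump and reading the datatype it carries might look like this (Python; it assumes the usual one-entity-per-line layout of the .json.gz dump, and the filename is only an example):

    import gzip
    import json

    def iter_entities(path):
        # The .json.gz dump is one big JSON array with one entity per line;
        # skip the wrapping brackets, strip trailing commas, and parse each line.
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip().rstrip(",")
                if line in ("", "[", "]"):
                    continue
                yield json.loads(line)

    # Example filename; substitute the dump you actually downloaded.
    for entity in iter_entities("wikidata-20161114-all.json.gz"):
        if entity["id"] == "Q38":
            for statement in entity.get("claims", {}).get("P41", []):
                # The canonical JSON carries the datatype on every main snak.
                print(statement["mainsnak"]["datatype"])  # -> commonsMedia
            break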
Caveat: there is currently no dump that contains the JSON of old revisions of entities in canonical form. You can only get them individually from Special:EntityData, e.g. https://www.wikidata.org/wiki/Special:EntityData/Q23.json?oldid=30279
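A quick sketch of fetching such an old revision (Python, using the URL pattern above):

    import json
    import urllib.request

    # Old entity revisions are only available individually, via Special:EntityData.
    url = "https://www.wikidata.org/wiki/Special:EntityData/Q23.json?oldid=30279"
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))

    # The .json output wraps the entity in an "entities" map keyed by ID.
    print(list(data["entities"].keys()))  # e.g. ['Q23']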
HTH -- daniel
Hi Daniel,
Thanks for the quick and helpful reply. I was hoping that the XML dumps could be changed, but I understand now that the JSON dumps are the recommended format.
To avoid downloading redundant information, you can use one of the wikidatawiki-20161120-stub-* dumps instead of the full page dumps
This is useful, but unfortunately it won't suffice. Wikidata also has pages which are wikitext (for example, https://www.wikidata.org/wiki/Wikidata:WikiProject_Names). These wikitext pages are in the XML dumps, but aren't in the stub dumps or the JSON dumps. I actually do use these Wikidata wikitext pages to try to reproduce Wikidata in its entirety. So for now, it looks like both the XML dumps and the JSON dumps will be required.
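Concretely, a rough sketch of what I mean would be to keep only the wikitext pages from pages-articles.xml and take the entity data from the canonical JSON dump (Python; this assumes the standard MediaWiki export schema with a per-revision <model> element, and the filename is only an example):

    import xml.etree.ElementTree as ET

    def localname(tag):
        # Tags are namespaced, e.g. "{http://www.mediawiki.org/xml/export-0.10/}page"
        return tag.rsplit("}", 1)[-1]

    def iter_wikitext_pages(path):
        for _, elem in ET.iterparse(path, events=("end",)):
            if localname(elem.tag) != "page":
                continue
            fields = {localname(e.tag): e.text for e in elem.iter()}
            if fields.get("model") == "wikitext":
                yield fields.get("title"), fields.get("text")
            elem.clear()  # free the finished <page> subtree to keep memory down

    # Example filename; substitute the dump you actually downloaded.
    for title, text in iter_wikitext_pages("wikidatawiki-20161120-pages-articles.xml"):
        print(title)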
At any rate, thanks again for the excellent reply.
On 27.11.2016 at 01:15, gnosygnu wrote:
This is useful, but unfortunately it won't suffice. Wikidata also has pages which are wikitext (for example, https://www.wikidata.org/wiki/Wikidata:WikiProject_Names). These wikitext pages are in the XML dumps, but aren't in the stub dumps nor the JSON dumps. I actually do use these Wikidata wikitext entries to try to reproduce Wikidata in its entirety.
If you are also using the same software (Wikibase on MediaWiki), the XML dumps should Just Work (tm). The idea of the XML dumps is that the "text" blobs are opaque to 3rd parties, but will continue to work with future versions of MediaWiki & friends (with a compatible configuration - which is rather tricky).
If you are also using the same software (Wikibase on MediaWiki), the XML dumps should Just Work (tm). The idea of the XML dumps is that the "text" blobs are opaque to 3rd parties, but will continue to work with future versions of MediaWiki & friends (with a compatible configuration - which is rather tricky).
Not sure I follow. Even from a Wikibase on MediaWiki perspective, the XML dumps are still incomplete (since they're missing mainsnak.datatype).
For example, consider the following:
* You download only the pages-articles.xml dump from https://dumps.wikimedia.org/wikidatawiki/latest/
* You load it into MediaWiki
* You then create a module modeled on the Wikidata module from Russian Wikipedia: https://ru.wikipedia.org/w/index.php?title=Module:Wikidata&action=edit
One line of that module specifically checks the datatype: "if datatype and datatype == 'commonsMedia' then". That condition always evaluates to false, even when you are looking at an entity (Q38: Italy) and a property (P41: flag image) whose datatype is "commonsMedia", because the XML dump does not include "mainsnak.datatype".
From a user standpoint, this means that if you're trying to set up a local version of Russian Wikipedia and Wikidata, no country infobox will show the country's flag (the line of code above falls back to text instead of the image).
The only way around this is to supplement the XML dump with the JSON dump. But then you'll need to download two large dumps and somehow merge them. (I don't know if MediaWiki has a facility to load the JSON dump, much less merge it.)
Anyway, I understand that there are technical complications with trying to add mainsnak.datatype to the XML dumps. But if this never gets resolved, then the current situation basically offers two unsatisfying options:
* Have an XML dump which is 99.9% complete but still missing key info (mainsnak.datatype)
* Try to merge the JSON dump into the XML dump (which MediaWiki may not be able to do)
Hope this makes sense.
Thanks.
On 28.11.2016 at 16:31, gnosygnu wrote:
Not sure I follow. Even from a Wikibase on MediaWiki perspective, the XML dumps are still incomplete (since they're missing mainsnak.datatype).
The datatype is implicit; it can be derived from the property ID. You can find it by looking at the Property page's JSON.
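Outside of Wikibase, that lookup can be sketched roughly like this (Python; property entities carry their own top-level "datatype" field in their JSON, and the helper and variable names here are only illustrative):

    import json

    def property_datatypes(property_json_blobs):
        # property_json_blobs: the raw JSON text of each Property page,
        # e.g. collected while walking the property namespace of pages-articles.xml.
        types = {}
        for blob in property_json_blobs:
            prop = json.loads(blob)
            # Property entities state their own datatype at the top level.
            types[prop["id"]] = prop["datatype"]
        return types

    # Illustrative usage: annotate a snak parsed from the raw XML-dump blob.
    # datatypes = property_datatypes(blobs_from_property_namespace)
    # snak["datatype"] = datatypes[snak["property"]]  # e.g. P41 -> "commonsMedia"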
The XML dumps are complete by definition, since they contain a raw copy of the primary data blob. All other data is derived from this. However, since they are "raw", they are not easy for consumers to process, and we make no guarantees regarding the raw data format.
We include the data type in the statements of the canonical JSON dumps for convenience, and we are planning to add more such conveniences to the JSON output. That does not make the XML dumps incomplete.
Your use case is special since you want canonical JSON *and* wikitext. I'm afraid you will have to process both kinds of dumps.
One line of the file specifically checks for datatype: "if datatype and datatype == 'commonsMedia' then". This line always evaluates to false, even though you are looking at an entity (Q38: Italy) and property (P41: flag image) which does have a datatype for "commonsMedia" (since the XML dump does not have "mainsnak.datatype").
That is incorrect. datatype will always be set in Lua, even if it is not present in the XML. Remember that it is not present in the primary blob on Wikidata either. Wikibase will look it up internally, from the wb_property_info table, and make that information available to Lua.
When the XML file is loaded, a lot of secondary information is extracted into database tables for this kind of use: all the labels and descriptions go into the wb_terms table, property types go into wb_property_info, links to other items go into pagelinks, and so on.
Actually, you may have to run refreshLinks.php or rebuildall.php after doing the XML import; I'm no longer sure which is needed when. But the point is: the XML dump contains all the information needed to reconstruct the content. This is true for wikitext as well as for Wikibase JSON data. All derived information is extracted upon import and is made available via the respective APIs, including Lua, just like on Wikidata.
The datatype is implicit, it can be derived from the property ID. You can find it by looking at the Property page's JSON. ...
Thanks for all the info. I see my error. I didn't realize that mainsnak.datatype was inferred. I assumed it would have to be embedded directly in the XML dump's JSON (partly because it is embedded directly in the JSON dump's JSON).
The rest of your points make sense. Thanks again for taking the time to clarify.
On 28.11.2016 at 17:34, gnosygnu wrote:
The datatype is implicit, it can be derived from the property ID. You can find it by looking at the Property page's JSON. ...
Thanks for all the info. I see my error. I didn't realize that mainsnak.datatype was inferred. I assumed it would have to be embedded directly in the XML's JSON (partly because it is embedded directly in the JSON's dump JSON)
The rest of your points make sense. Thanks again for taking the time to clarify.
If you have problems accessing the datatype from Lua or elsewhere, let me know. There may be issues with the import process.
It's always cool to see that people use our data and our software!
If you have problems accessing the datatype from Lua or elsewhere, let me know.
Honestly, I haven't tried. Just so you know, I'm the developer of XOWA, which is an offline wiki app in Java. As such, I'm accessing the Wikidata data directly, not through the Wikibase code. (If you're curious, I also use it to recreate Wikidata locally. See: http://xowa.org/home/file/screenshot_wikidata.png)
Going forward, I'll double-check that my Wikidata issues are not related to my not using Wikibase. Again, my thanks to you for clearing that up.
It's always cool to see that people use our data and our software!
Yup. Wikidata is very cool in concept and in practice. It's amazing to have a single, multi-lingual, verifiable repository of facts / details -- all free and open-content. Kudos to you and your team for the excellent work!