The datatype is implicit; it can be derived from the property ID. You can find it by looking at the Property page's JSON. ...
Thanks for all the info. I see my error. I didn't realize that mainsnak.datatype was inferred. I assumed it would have to be embedded directly in the XML dump's JSON (partly because it is embedded directly in the JSON dump's JSON).
The rest of your points make sense. Thanks again for taking the time to clarify.
On Mon, Nov 28, 2016 at 11:22 AM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 28.11.2016 at 16:31, gnosygnu wrote:
If you are also using the same software (Wikibase on MediaWiki), the XML dumps should Just Work (tm). The idea of the XML dumps is that the "text" blobs are opaque to 3rd parties, but will continue to work with future versions of MediaWiki & friends (with a compatible configuration - which is rather tricky).
Not sure I follow. Even from a Wikibase on MediaWiki perspective, the XML dumps are still incomplete (since they're missing mainsnak.datatype).
The datatype is implicit; it can be derived from the property ID. You can find it by looking at the Property page's JSON.
The XML dumps are complete by definition, since they contain a raw copy of the primary data blob. All other data is derived from this. However, since they are "raw", they are not easy to process by consumers, and we make no guarantees regarding the raw data format.
We include the data type in the statements of the canonical JSON dumps for convenience. We are planning to add more things to the JSON output for convenience. That does not make the XML dumps incomplete.
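To make the "implicit datatype" point concrete, here is a minimal Python sketch of deriving mainsnak.datatype for statements loaded from an XML dump. The property-to-datatype map is an assumption for illustration (in practice you would build it from each Property page's JSON, e.g. via Special:EntityData); the function name add_datatype is hypothetical, not part of any Wikibase API.

```python
# Hypothetical sketch: enrich XML-dump statements with the datatype that the
# canonical JSON dumps include for convenience. The map below is illustrative;
# a real one would be built from the Property pages' JSON.

property_datatypes = {
    "P41": "commonsMedia",   # flag image
    "P31": "wikibase-item",  # instance of
}

def add_datatype(statement, datatypes=property_datatypes):
    """Set mainsnak.datatype from the property ID if it is missing."""
    snak = statement["mainsnak"]
    snak.setdefault("datatype", datatypes[snak["property"]])
    return statement

# A statement as it might look when parsed from the raw XML blob,
# i.e. without the datatype field.
stmt = {"mainsnak": {"property": "P41", "snaktype": "value"}}
add_datatype(stmt)
```

After the call, stmt["mainsnak"]["datatype"] is "commonsMedia", matching what the JSON dumps carry inline.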
Your use case is special since you want canonical JSON *and* wikitext. I'm afraid you will have to process both kinds of dumps.
One line of the file specifically checks for datatype: "if datatype and datatype == 'commonsMedia' then". This line always evaluates to false, even when looking at an entity (Q38: Italy) and a property (P41: flag image) whose datatype is "commonsMedia" (since the XML dump does not include "mainsnak.datatype").
That is incorrect. datatype will always be set in Lua, even if it is not present in the XML. Remember that it is not present in the primary blob on Wikidata either. Wikibase will look it up internally, from the wb_property_info table, and make that information available to Lua.
When loading the XML file, a lot of secondary information is extracted into database tables for this kind of use, e.g. all the labels and descriptions go into the wb_terms table, property types go into wb_property_info, links to other items go to page_links, etc.
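The extraction step Daniel describes can be sketched as follows. This is an illustrative Python model only: the tuple layout and function name are assumptions that loosely mirror the wb_terms and wb_property_info tables, not Wikibase's actual import code.

```python
# Hypothetical sketch of secondary-data extraction on XML import: derived
# information is pulled out of the raw entity JSON into lookup tables.
# Shapes here only loosely mirror wb_terms / wb_property_info.

def extract_secondary_data(entity):
    terms = []          # rows destined for a wb_terms-like table
    property_info = {}  # property ID -> datatype, as in wb_property_info

    for lang, label in entity.get("labels", {}).items():
        terms.append((entity["id"], "label", lang, label["value"]))
    for lang, desc in entity.get("descriptions", {}).items():
        terms.append((entity["id"], "description", lang, desc["value"]))
    if entity.get("type") == "property":
        property_info[entity["id"]] = entity["datatype"]
    return terms, property_info

# A property entity as it appears in the raw data blob.
p41 = {
    "id": "P41",
    "type": "property",
    "datatype": "commonsMedia",
    "labels": {"en": {"language": "en", "value": "flag image"}},
}
terms, info = extract_secondary_data(p41)
```

Once such a table exists, the datatype check in the Lua module can succeed even though the XML text blob never contained mainsnak.datatype: the lookup goes through the extracted table, just as Wikibase consults wb_property_info.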
Actually, you may have to run refreshLinks.php or rebuildall.php after doing the XML import; I'm no longer sure which is needed when. But the point is: the XML dump contains all information needed to reconstruct the content. This is true for wikitext as well as for Wikibase JSON data. All derived information is extracted upon import, and is made available via the respective APIs, including Lua, just like on Wikidata.
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata