On 28.11.2016 at 16:31, gnosygnu wrote:
>> If you are
>> also using the same software (Wikibase on MediaWiki), the XML dumps
>> should Just Work (tm). The idea of the XML dumps is that the "text" blobs are
>> opaque to 3rd parties, but will continue to work with future versions of
>> MediaWiki & friends (with a compatible configuration - which is rather tricky).
> Not sure I follow. Even from a Wikibase on MediaWiki perspective, the
> XML dumps are still incomplete (since they're missing
> mainsnak.datatype).
The datatype is implicit, it can be derived from the property ID. You can find
it by looking at the Property page's JSON.
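That lookup can be sketched roughly as follows. The JSON below is a trimmed, hypothetical excerpt of what a Property page's canonical JSON looks like (e.g. via Special:EntityData/P41.json); the real response contains far more, but the top-level "datatype" field is the relevant part:

```python
import json

# Trimmed, illustrative excerpt of a Property page's canonical JSON.
property_json = json.loads("""
{
  "id": "P41",
  "type": "property",
  "datatype": "commonsMedia"
}
""")

def datatype_of(prop):
    # The snak's datatype is implicit: derive it from the property's own JSON.
    return prop["datatype"]

print(datatype_of(property_json))  # commonsMedia
```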
The XML dumps are complete by definition, since they contain a raw copy of the
primary data blob. All other data is derived from this. However, since they are
"raw", they are not easy to process by consumers, and we make no guarantees
regarding the raw data format.
We include the data type in the statements of the canonical JSON dumps for
convenience. We are planning to add more things to the JSON output for
convenience. That does not make the XML dumps incomplete.
Your use case is special, since you want canonical JSON *and* wikitext. I'm
afraid you will have to process both kinds of dumps.
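For the JSON side, the dump can be streamed one entity at a time. This is a hedged sketch, assuming the documented framing of the dump file (a single JSON array with one entity per line, lines terminated by commas):

```python
import json

# Stream entities from a canonical JSON dump without loading the whole file.
# Handles the assumed framing: leading "[", trailing "]", comma-ended lines.
def iter_entities(lines):
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        yield json.loads(line)

# Illustrative stand-in for a few lines of a dump file:
dump = ['[', '{"id": "Q38", "type": "item"},',
        '{"id": "P41", "type": "property"}', ']']
print([e["id"] for e in iter_entities(dump)])  # ['Q38', 'P41']
```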
> One line of the file specifically checks for datatype: "if datatype
> and datatype == 'commonsMedia' then". This line always evaluates to
> false, even though you are looking at an entity (Q38: Italy) and
> property (P41: flag image) which does have a datatype for
> "commonsMedia" (since the XML dump does not have
> "mainsnak.datatype").
That is incorrect. datatype will always be set in Lua, even if it is not present
in the XML. Remember that it is not present in the primary blob on Wikidata
either. Wikibase will look it up internally, from the wb_property_info table,
and make that information available to Lua.
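What Wikibase does here happens server-side in PHP, but the idea can be sketched in Python; the table contents below are illustrative, with a plain dict standing in for wb_property_info:

```python
# Illustrative sketch: the raw snak from the primary blob carries no
# "datatype"; Wikibase fills it in from wb_property_info (here a dict
# mapping pi_property_id -> pi_type) before handing the data to Lua.
property_info = {"P41": "commonsMedia"}

def augment_snak(snak):
    out = dict(snak)
    out.setdefault("datatype", property_info[out["property"]])
    return out

raw_snak = {"snaktype": "value", "property": "P41"}  # as stored in the blob
print(augment_snak(raw_snak)["datatype"])  # commonsMedia
```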
When loading the XML file, a lot of secondary information is extracted into
database tables for this kind of use, e.g. all the labels and descriptions go
into the wb_terms table, property types go into wb_property_info, links to other
items go into pagelinks, etc.
Actually, you may have to run refreshLinks.php or rebuildall.php after doing the
XML import; I'm no longer sure which is needed in which case. But the point is: the
XML dump contains all information needed to reconstruct the content. This is
true for wikitext as well as for Wikibase JSON data. All derived information is
extracted upon import, and is made available via the respective APIs, including
Lua, just like on Wikidata.
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.