On 4 May 2013 17:12, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 04.05.2013 12:05, Jona Christopher Sahnwaldt wrote:
On 26 April 2013 17:15, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
*internal* JSON representation, which is different from what the API returns, and may change at any time without notice.
Somewhat off-topic: I didn't know you have different JSON representations. I'm curious and I'd be happy about a few quick answers...
- How many are there? Just the two, internal and external?
Yes, these two.
- Which JSON representations do the API and the XML dump provide? Will they do so in the future?
The XML dump provides the internal representation (since it's a dump of the raw page content). The API uses the external representation.
This is pretty much dictated by the nature of the dumps and the API, so it will stay that way. However, we plan to add more types of dumps, including:
- a plain JSON dump (using the external representation)
- an RDF/XML dump
It's not certain yet when, or even if, we'll provide these, but we are considering it.
- Are the API and XML dump representations stable? Or should we expect some changes?
The internal representation is unstable and subject to changes without notice. In fact, it may even change to something other than JSON. I don't think it's even documented anywhere outside the source code.
The external representation is pretty stable, though not final yet. We will definitely make additions to it, and some (hopefully minor) structural changes may be necessary. We'll try to stay largely backwards compatible, but can't promise full stability yet.
Also, the external representation uses the API framework for generating the actual JSON, and may be subject to changes imposed by that framework.
Unfortunately, this means that there are currently no dumps with a reliable representation of our data. You need to either a) use the API, b) use the unstable internal JSON, or c) wait for "real" data dumps.
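For option a), a request for the external representation could be assembled roughly as follows. This is only a sketch: `build_entity_url` is a made-up helper name, and it only builds the URL without sending anything. `wbgetentities` is the API module that returns entity data, and the item IDs are just examples.

```python
from urllib.parse import urlencode

# Public Wikidata API endpoint.
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_entity_url(item_ids):
    """Build a wbgetentities URL for a batch of item IDs (hypothetical helper)."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(item_ids),  # several IDs are separated by "|"
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_entity_url(["Q64", "Q42"]))
```

Requesting several items per call reduces the number of requests, but for millions of items it is still a lot of traffic.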
Thanks for the clarification. Not the best news, but not terribly bad either.
We will produce a DBpedia release pretty soon, so I don't think we can wait for the "real" dumps. The inter-language links are an important part of DBpedia, so we have to extract data from almost all Wikidata items. I don't think it's sensible to make ~10 million calls to the API to download the external JSON format, so we will have to use the XML dumps and thus the internal format. But I think it's not a big deal that it's not that stable: we parse the JSON into an AST anyway. It just means that we will have to use a more abstract AST, which I was planning to do anyway. As long as the semantics of the internal format remain more or less the same - it still contains the labels, the language links, the properties, etc. - it's no big deal if the syntax changes, even if it's not JSON anymore.
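To illustrate what I mean by a more abstract AST, here is a minimal sketch in Python (not our actual extraction code). The `Item` class, the parser registry and the key names `"label"` and `"links"` are assumptions made up for this example; the real internal format may use different names. The point is only that the rest of the code depends on `Item`, not on the serialization.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Item:
    """Format-independent view of one Wikidata item (hypothetical model)."""
    labels: Dict[str, str] = field(default_factory=dict)
    language_links: Dict[str, str] = field(default_factory=dict)


def parse_internal_json(raw: str) -> Item:
    """Map one serialized item to an Item.

    The key names "label" and "links" are assumed for illustration only.
    """
    data = json.loads(raw)
    return Item(
        labels=dict(data.get("label", {})),
        language_links=dict(data.get("links", {})),
    )


# One parser per serialization. If the internal syntax changes, or stops
# being JSON, only a new entry is needed here; consumers of Item stay as-is.
PARSERS: Dict[str, Callable[[str], Item]] = {
    "internal-json": parse_internal_json,
}


def parse_item(raw: str, fmt: str = "internal-json") -> Item:
    """Dispatch to the parser registered for the given format name."""
    return PARSERS[fmt](raw)


if __name__ == "__main__":
    sample = '{"label": {"en": "Berlin"}, "links": {"dewiki": "Berlin"}}'
    print(parse_item(sample))
```

If the syntax changes, we would only have to write a new parser function and register it.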
Christopher
-- daniel