On 4 May 2013 17:12, Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:
On 04.05.2013 12:05, Jona Christopher Sahnwaldt wrote:
On 26 April 2013 17:15, Daniel Kinzler wrote:
*internal* JSON representation, which is
different from what the API returns,
and may change at any time without notice.
Somewhat off-topic: I didn't know you have different JSON
representations. I'm curious, and I'd be happy about a few quick answers:
- How many are there? Just the two, internal and external?
Yes, these two.
- Which JSON representation do the API and the
XML dump provide? Will that stay the same
in the future?
The XML dump provides the internal representation (since it's a dump of the raw
page content). The API uses the external representation.
This is pretty much dictated by the nature of the dumps and the API, so it will
stay that way. However, we plan to add more types of dumps, including:
* a plain JSON dump (using the external representation)
* an RDF/XML dump
It's not certain yet when or even if we'll provide these, but we are considering it.
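For concreteness, the gap between the two representations might look roughly like this. The field names below are simplified assumptions for illustration, not the actual Wikibase schemas:

```python
# Hypothetical, simplified illustration of how the same item might be
# shaped differently in the internal vs. external representation.
# These are assumed field names, NOT the actual Wikibase schemas.
internal = {
    "label": {"en": "Berlin"},           # terse, bare language -> string mapping
    "links": {"enwiki": "Berlin"},
}
external = {
    # richer, self-describing structures as emitted by the API framework
    "labels": {"en": {"language": "en", "value": "Berlin"}},
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Berlin"}},
}

def get_label(entity, lang):
    """Read a label from either shape, hiding the syntactic difference."""
    if "label" in entity:                # assumed internal-style key
        return entity["label"][lang]
    value = entity["labels"][lang]       # assumed external-style key
    return value["value"] if isinstance(value, dict) else value
```

A consumer that funnels both shapes through one accessor like `get_label` is insulated from which representation a given dump or API response happens to use.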
- Are the API and XML dump representations
stable? Or should we expect changes?
The internal representation is unstable and subject to changes without notice.
In fact, it may even change to something other than JSON. I don't think it's
even documented anywhere outside the source code.
The external representation is pretty stable, though not final yet. We will
definitely make additions to it, and some (hopefully minor) structural changes
may be necessary. We'll try to stay largely backwards compatible, but can't
promise full stability yet.
Also, the external representation uses the API framework for generating the
actual JSON, and may be subject to changes imposed by that framework.
Unfortunately, this means that there are currently no dumps with a reliable
representation of our data. You have to either a) use the API, b) use the
unstable internal JSON, or c) wait for "real" data dumps.
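As a sketch of option a), one could batch item ids into `wbgetentities` requests against the standard `api.php` endpoint (the batch limit of 50 ids per request for normal users is an assumption here, and only the URL construction is shown, not the actual fetching):

```python
from urllib.parse import urlencode

# Sketch of option (a): requesting the external JSON for a batch of items.
# With an assumed limit of 50 ids per request, ~10 million items would
# still mean roughly 200,000 API calls.
API = "https://www.wikidata.org/w/api.php"

def entity_url(ids):
    """Build a wbgetentities request URL for the given item ids."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(ids),
        "format": "json",
    }
    return API + "?" + urlencode(params)

# e.g. entity_url(["Q1", "Q2"])
```

Even batched, that request volume illustrates why a dump-based approach is more practical for a full extraction.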
Thanks for the clarification. Not the best news, but not terribly bad either.
We will produce a DBpedia release pretty soon; I don't think we can
wait for the "real" dumps. The inter-language links are an important
part of DBpedia, so we have to extract data from almost all Wikidata
items. I don't think it's sensible to make ~10 million calls to the
API to download the external JSON format, so we will have to use the
XML dumps and thus the internal format. But I think it's not a big
deal that it's not that stable: we parse the JSON into an AST anyway.
It just means that we will have to use a more abstract AST, which I
was planning to do anyway. As long as the semantics of the internal
format remain more or less the same - it will contain the labels,
the language links, the properties, etc. - it's no big deal if the
syntax changes, even if it's not JSON anymore.
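The dump-based plan above might be sketched like this. This is a minimal sketch: the XML namespace, the tag layout, and the internal key names are assumptions about the dump format, and the adapter itself is hypothetical:

```python
import json
import xml.etree.ElementTree as ET

# Sketch of the dump-based approach: pull the raw page text (the internal
# JSON) out of an XML dump page and map it onto a small abstract record,
# so that later syntax changes only touch this one adapter.
# The namespace URI and key names are assumptions, not the actual format.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"

def parse_entity(page_xml):
    """Extract the internal JSON from one <page> element and normalize it."""
    text = ET.fromstring(page_xml).find(f"{NS}revision/{NS}text").text
    raw = json.loads(text)
    # Accept either assumed key variant, emitting one abstract shape.
    return {
        "labels": raw.get("label") or raw.get("labels") or {},
        "sitelinks": raw.get("links") or raw.get("sitelinks") or {},
    }
```

If the internal syntax changes, or even stops being JSON, only `parse_entity` has to be rewritten; everything downstream keeps consuming the same abstract record.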