2013/1/25 Daniel Kinzler <daniel.kinzler@wikimedia.de>
Hi!

I thought about the RDF export a bit, and I think we should break this up into
several steps for better tracking. Here is what I think needs to be done:

 
Daniel,
I am replying on Wikidata-l, and adding Tpt (since he started working on something similar), hoping to get more input on the open list.

I especially hope that Markus and maybe Jeroen can provide insight from the experience with Semantic MediaWiki.

Just to reiterate internally: in my opinion we should learn from the experience that SMW gained here, but we should not immediately try to create common code for this case. The first step should be to create something that works for Wikibase, and then analyze whether we can refactor some code on both Wikibase and SMW and then have a common library that both build on. This will give us two running systems that can be tested against while refactoring. But starting the other way around -- designing a common library, developing it for both Wikibase and SMW, while keeping SMW's constraints in mind -- will be much more expensive in terms of resources. I guess we agree on the end result -- share as much code as possible. But please let us not *start* with that goal, but rather aim first at the goal "Get an RDF export for Wikidata". (This is especially true because Wikibase is basically reified all the way through, something SMW does not have to deal with.)

In Semantic MediaWiki, the relevant parts of the code are (if I get it right):

SMWSemanticData is roughly what we call Wikibase::Entity 

includes/export/SMW_ExportController.php - SMWExportController - main object responsible for creating serializations. Used for configuration, and then calls the SMWExporter on the relevant data (which it collects itself) and applies the defined SMWSerializer on the returned SMWExpData.

includes/export/SMW_Exporter.php -  SMWExporter - takes a SMWSemanticData object and returns a SMWExpData object, which is optimized for being exported
includes/export/SMW_Exp_Data.php -  SMWExpData - holds the data that is needed for export
includes/export/SMW_Exp_Element.php - several classes used to represent the data in SMWExpData. Note that there is some interesting interplay happening with DataItems and DataValues here.

includes/export/SMW_Serializer.php - SMWSerializer - abstract class for different serializers
includes/export/SMW_Serializer_RDFXML.php - SMWRDFXMLSerializer - responsible for creating the RDF/XML serialization
includes/export/SMW_Serializer_Turtle.php - SMWTurtleSerializer - responsible for creating the Turtle serialization

special/URIResolver/SMW_SpecialURIResolver.php - SMWURIResolver - Special page that deals with content negotiation.
special/Export/SMW_SpecialOWLExport.php - SMWSpecialOWLExport - Special page that serializes a single item.
maintenance/SMW_dumpRDF.php - calling the serialization code to create a dump of the whole wiki, or of certain entity types. Basically configures a SMWExportController and lets it do its job.

There are some smart ideas in the way the ExportController and Exporter are called by both the dump script and the single-item serializer, which allow them to scale to almost any size.

Remember that unlike SMW, Wikibase contains mostly reified knowledge. Here is the spec of how to translate the internal Wikibase representation to RDF: http://meta.wikimedia.org/wiki/Wikidata/Development/RDF

The other major influence is obviously the MediaWiki API, with its (almost) clean separation of results and serialization formats. While we can take some inspiration here as well, the issue is that RDF is a graph-based model and the MediaWiki API is really built for trees. Therefore I am afraid that we cannot reuse much here.

Note that this does not mean that the API cannot be used to access the data about entities, but merely that the API answers with tree-based objects, most prominently the JSON objects described here: http://meta.wikimedia.org/wiki/Wikidata/Data_model/JSON

So, after this lengthy prelude, let's get to the Todos that Daniel suggests:

* A low-level serializer for RDF triples, with namespace support. Would be nice
if it had support for different forms of output (xml, n3, etc). I suppose we can
just use an existing one, but it needs to be found and tried.


Re reuse: the thing is that, to the best of my knowledge, PHP RDF packages are quite heavyweight (because they also contain parsers, not just serializers, and often enough SPARQL processors, support for blank nodes, etc.), and it is rare that they support the kind of high-throughput streaming that we would require for the complete dump (i.e. there is obviously no point in first putting all triples into a graph model and then calling the model->serialize() method; this needs too much memory). There are also some optimizations that we can use (regarding the ordering of triples, the use of namespaces, some assumptions about the whole dump, etc.). I will ask the Semantic Web mailing list about that, but I don't have much hope.

The corresponding classes in SMW are the SMWSerializer classes.
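
To make the streaming requirement concrete, here is a rough sketch of the kind of minimal interface I have in mind -- the names are made up, nothing of this exists yet:

  <?php
  // Purely hypothetical sketch of a minimal streaming triple serializer
  // (compare the SMWSerializer classes); all names are invented.
  interface RdfTripleSerializer {

      // Emit the document header and declare the given namespaces
      // (a prefix => URI map).
      public function startDocument( array $namespaces );

      // Emit a single triple. Called once per triple, so nothing has to
      // be kept in memory between calls.
      public function serializeTriple( $subject, $predicate, $object );

      // Emit the document footer and flush any buffered output.
      public function finishDocument();
  }

One implementation per output format (RDF/XML, Turtle/N3, ...) would then be enough, and both the dump script and Special:EntityData could write through it triple by triple.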

 
* A high level RDF serializer that processes Entity objects. It should be possible
to use this in streaming mode, i.e. it needs separate functions for generating
the document header and footer in addition to the actual Entities.


This corresponds to the SMWExporter and parts of the SMWExportController classes.
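
Again only a sketch with invented names, building on the low-level interface above -- the important part is that header, entities, and footer are separate calls:

  <?php
  // Hypothetical sketch; none of these names exist yet.
  class EntityRdfSerializer {

      protected $tripleSerializer;

      public function __construct( RdfTripleSerializer $tripleSerializer ) {
          $this->tripleSerializer = $tripleSerializer;
      }

      public function startDump() {
          // Namespace declarations would come from the RDF spec linked above.
          $this->tripleSerializer->startDocument( array() );
      }

      // Called once per entity, so a dump can stream millions of entities
      // without ever building a full graph model in memory.
      public function serializeEntity( Entity $entity ) {
          foreach ( $this->entityToTriples( $entity ) as $triple ) {
              list( $s, $p, $o ) = $triple;
              $this->tripleSerializer->serializeTriple( $s, $p, $o );
          }
      }

      public function finishDump() {
          $this->tripleSerializer->finishDocument();
      }

      protected function entityToTriples( Entity $entity ) {
          // Would implement the mapping described in the RDF spec linked
          // above, including the reified statements; stubbed out here.
          return array();
      }
  }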

 
* Support for pulling in extra information on demand, e.g. back-links or
property definitions.


SMWExportController provides most of these supporting tasks.
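
For the dump case this mostly means remembering what still has to be written out. Just to illustrate the idea (again a made-up name):

  <?php
  // Hypothetical sketch: remember which properties were used while dumping
  // items, so their definitions (labels, datatypes) can be appended at the end.
  class PendingPropertyTracker {

      protected $pending = array();

      public function notePropertyUse( $propertyId ) {
          $this->pending[$propertyId] = true;
      }

      public function getPendingPropertyIds() {
          return array_keys( $this->pending );
      }
  }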
 
* A maintenance script for generating dumps. It should at least be able to
generate a dump of either all entities, or one kind of entity (e.g. items). And
it should also be able to dump a given list of entities.


Surprisingly, creating a dump of all entities, or of one kind of entity, is quite different from providing a dump of a given list of entities, because when you create a dump of everything you can make some assumptions that save you from keeping a lot of state. Therefore this item should be split into two (or even three) subitems:
* A script to create a dump of all entities
* (A script to create a dump of all entities of a given kind)
* A script to create a dump of a list of entities

Personally I think the last item has a rather low priority, because it can be so easily simulated.
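
To illustrate what the full-dump script boils down to (using the made-up names from the sketches above; getAllEntityIds() and loadEntity() are placeholders for whatever lookup service we end up with):

  <?php
  // Hypothetical skeleton of a dumpRdf maintenance script
  // (compare SMW_dumpRDF.php).
  $serializer = new EntityRdfSerializer( new TurtleTripleSerializer() );

  $serializer->startDump();

  // Stream over all entity ids (or only those of one type); no global
  // graph model is built, so memory use stays flat.
  foreach ( getAllEntityIds() as $entityId ) {
      $serializer->serializeEntity( loadEntity( $entityId ) );
  }

  $serializer->finishDump();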
 
* Special:EntityData needs a plug-in interface so the RDF serializer can be used
from there.


Or call the exporter. This special page corresponds to SMWSpecialOWLExport.
 
* Special:EntityData should support format selection using file extension syntax
(e.g. Q1234.n3 vs. Q1234.json).


That is a nice solution which works with Wikibase and was not available in SMW.
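
Something along these lines (purely illustrative, the default format is just a guess):

  <?php
  // Hypothetical sketch: split a subpage like "Q1234.n3" into the entity id
  // and the requested serialization format; no extension means the default.
  function parseEntityDataSubpage( $subpage ) {
      $parts = explode( '.', $subpage, 2 );
      $entityId = $parts[0];
      $format = isset( $parts[1] ) ? $parts[1] : 'json';
      return array( $entityId, $format );
  }

  // parseEntityDataSubpage( 'Q1234.n3' ) => array( 'Q1234', 'n3' )
  // parseEntityDataSubpage( 'Q1234' )    => array( 'Q1234', 'json' )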

 
* Similarly, Special:EntityData should support a "pretty" syntax for showing
specific revisions, e.g. Q1234.json@81762345.


I really never understood why you considered this one so important. Let's keep it as an item, but for me its priority is really low.

 
* Special:EntityData should support content negotiation (using redirects).


Basically what SMWURIResolver provides, but it can be a bit nicer thanks to the file extension suffixes.
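
Roughly: map the Accept header to one of the supported formats and then redirect (HTTP 303) to the corresponding file-extension URL. A very naive sketch -- the format names are invented, and a real implementation would have to parse the Accept header properly, q-values and all:

  <?php
  // Hypothetical sketch of the Accept-header-to-format mapping.
  function negotiateEntityDataFormat( $acceptHeader ) {
      if ( strpos( $acceptHeader, 'text/turtle' ) !== false ) {
          return 'ttl';
      }
      if ( strpos( $acceptHeader, 'application/rdf+xml' ) !== false ) {
          return 'rdf';
      }
      return 'json';
  }

  // e.g. Special:EntityData/Q1234 requested with "Accept: text/turtle"
  // would be redirected to Special:EntityData/Q1234.ttl with a 303 status.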

 
Did I miss anything?


Probably, just as I did.

I'd like to see if we get some input here, and then we can extract the items from it and start implementing them.

Already available is the following special page:
 
-- daniel

--
Daniel Kinzler, Software Architect
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
