2013/1/25 Daniel Kinzler <daniel.kinzler(a)wikimedia.de>
Hi!
I thought about the RDF export a bit, and I think we should break this up into several steps for better tracking. Here is what I think needs to be done:
Daniel,
I am replying on Wikidata-l, and adding Tpt (since he started working on something similar), hoping to get more input on the open list.
I especially hope that Markus and maybe Jeroen can provide insight from the
experience with Semantic MediaWiki.
Just to reiterate internally: in my opinion we should learn from the
experience that SMW made here, but we should not immediately try to create
common code for this case. First step should be to create something that
works for Wikibase, and then analyze if we can refactor some code on both
Wikibase and SMW and then have a common library that both build on. This
will give us two running systems that can be tested against while
refactoring. But starting the other way around -- designing a common
library, developing it for both Wikibase and SMW, while keeping SMW's
constraints in mind -- will be much more expensive in terms of resources. I
guess we agree on the end result -- share as much code as possible. But
please let us not *start* with that goal, but rather aim first at the goal
"Get an RDF export for Wikidata". (This is especially true because of the
fact that Wikibase is basically reified all the way through, something SMW
does not have to deal with).
In Semantic MediaWiki, the relevant parts of the code are (if I get it
right):
SMWSemanticData is roughly what we call Wikibase::Entity
includes/export/SMW_ExportController.php - SMWExportController - main
object responsible for creating serializations. Used for configuration, and
then calls the SMWExporter on the relevant data (which it collects itself)
and applies the defined SMWSerializer on the returned SMWExpData.
includes/export/SMW_Exporter.php - SMWExporter - takes a SMWSemanticData
object and returns a SMWExpData object, which is optimized for being
exported
includes/export/SMW_Exp_Data.php - SMWExpData - holds the data that is
needed for export
includes/export/SMW_Exp_Element.php - several classes used to represent the
data in SMWExpData. Note that there is some interesting interplay happening
with DataItems and DataValues here.
includes/export/SMW_Serializer.php - SMWSerializer - abstract class for
different serializers
includes/export/SMW_Serializer_RDFXML.php - SMWRDFXMLSerializer - responsible for creating the RDF/XML serialization
includes/export/SMW_Serializer_Turtle.php - SMWTurtleSerializer - responsible for creating the Turtle serialization
special/URIResolver/SMW_SpecialURIResolver.php - SMWURIResolver - Special
page that deals with content negotiation.
special/Export/SMW_SpecialOWLExport.php - SMWSpecialOWLExport - Special
page that serializes a single item.
maintenance/SMW_dumpRDF.php - calling the serialization code to create a dump of the whole wiki, or of certain entity types. Basically configures an SMWExportController and lets it do its job.
There are some smart ideas in the way the ExportController and Exporter are called by both the dump script and the single-item serializer, which allow the export to scale to almost any size.
Remember that unlike SMW, Wikibase contains mostly reified knowledge. Here
is the spec of how to translate the internal Wikibase representation to
RDF:
http://meta.wikimedia.org/wiki/Wikidata/Development/RDF
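To make the reification point concrete, here is a tiny Python sketch contrasting the two shapes. All prefixes and IDs below are hypothetical placeholders of my own; the actual vocabulary is defined in the spec linked above:

```python
# Illustration only: the prefixes (wd:, wdt:, p:, ps:, pq:) and property IDs
# are hypothetical placeholders, not the vocabulary the RDF spec defines.

# SMW-style direct triple: "Berlin -- country --> Germany".
direct = [("wd:Q64", "wdt:P17", "wd:Q183")]

# Wikibase-style reified claim: the statement itself becomes a node (_:stmt1)
# that can carry qualifiers, references, and a rank of its own.
reified = [
    ("wd:Q64",  "p:P17",   "_:stmt1"),   # entity -> statement node
    ("_:stmt1", "ps:P17",  "wd:Q183"),   # statement -> main value
    ("_:stmt1", "pq:P000", "wd:Q000"),   # statement -> a qualifier (placeholder)
]
```

One fact thus costs one triple in SMW but several in Wikibase, which is why the export cannot simply reuse SMW's flat property-value mapping.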
The other major influence is obviously the MediaWiki API, with its (almost) clean separation of results and serialization formats. While we can draw inspiration here as well, the issue is that RDF is a graph-based model and the MediaWiki API is really built for a tree. Therefore I am afraid that we cannot reuse much here.
Note that this does not mean that the API cannot be used to access the data about entities, but merely that the API answers with tree-based objects, most prominently the JSON objects described here:
http://meta.wikimedia.org/wiki/Wikidata/Data_model/JSON
So, after this lengthy prelude, let's get to the Todos that Daniel suggests:
* A low-level serializer for RDF triples, with namespace support. Would be nice if it had support for different forms of output (xml, n3, etc). I suppose we can just use an existing one, but it needs to be found and tried.
Re reuse: the thing is that, to the best of my knowledge, PHP RDF packages are quite heavyweight (because they also contain parsers, not just serializers, and often enough SPARQL processors, support for blank nodes, etc.), and it is rare that they support the kind of high-throughput streaming that we would require for the complete dump (i.e. there is obviously no point in first putting all triples into a graph model and then calling the model->serialize() method; this needs too much memory). There are also some optimizations that we can use (reordering of triples, use of namespaces, some assumptions about the whole dump, etc.). I will ask the Semantic Web mailing list about that, but I don't have much hope.
The corresponding classes in SMW are the SMWSerializer classes.
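For illustration, a minimal sketch of the kind of streaming serializer meant here (class and method names are mine, not an existing package): each triple goes straight to the output stream, with a small prefix table for namespace abbreviation, so memory use stays constant no matter how large the dump gets.

```python
class StreamingTurtleWriter:
    """Minimal streaming Turtle-style writer: no in-memory graph model,
    constant memory use regardless of dump size."""

    def __init__(self, out, prefixes):
        self.out = out
        self.prefixes = prefixes  # e.g. {"wd": "http://www.wikidata.org/entity/"}

    def start(self):
        # Document header: prefix declarations, emitted once.
        for prefix, uri in self.prefixes.items():
            self.out.write("@prefix %s: <%s> .\n" % (prefix, uri))
        self.out.write("\n")

    def _term(self, term):
        # Abbreviate full URIs using the prefix table where possible.
        for prefix, uri in self.prefixes.items():
            if term.startswith(uri):
                return "%s:%s" % (prefix, term[len(uri):])
        if term.startswith("http://") or term.startswith("https://"):
            return "<%s>" % term
        return term  # already a literal or a prefixed name

    def triple(self, s, p, o):
        # Written immediately; nothing is buffered in a graph model.
        self.out.write("%s %s %s .\n" % (self._term(s), self._term(p), self._term(o)))

    def finish(self):
        self.out.flush()
```

The same writer could then back both the full-dump script and the single-entity special page, which is exactly the property that makes the SMW design scale.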
* A high-level RDF serializer that processes Entity objects. It should be possible to use this in streaming mode, i.e. it needs separate functions for generating the document header and footer in addition to the actual Entities.
This corresponds to the SMWExporter and parts of the SMWExportController classes.
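A sketch of what that streaming interface could look like. The names and the dict-shaped entity are my assumptions, standing in for Wikibase::Entity, whose real structure is defined by the data model:

```python
class EntityRdfSerializer:
    """Sketch of a high-level serializer usable in streaming mode:
    header, any number of entities, and footer are separate calls."""

    def __init__(self, triple_writer):
        self.writer = triple_writer  # anything with a .triple(s, p, o) method

    def start_document(self):
        # Emit document-level triples (dump metadata, ontology imports, ...).
        self.writer.triple("dump:this", "rdf:type", "schema:Dataset")

    def serialize_entity(self, entity):
        # 'entity' here is a plain dict {"id": ..., "labels": {lang: text}},
        # a stand-in for the real Wikibase::Entity object.
        subject = "wd:" + entity["id"]
        for lang, text in entity.get("labels", {}).items():
            self.writer.triple(subject, "rdfs:label", '"%s"@%s' % (text, lang))

    def end_document(self):
        pass  # footer hook; nothing needed for line-oriented formats
```

The dump script would call start_document() once, serialize_entity() per entity, then end_document(); the special page would do the same for a single entity.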
* Support for pulling in extra information on demand, e.g. back-links or property definitions.
SMWExportController provides most of these supporting tasks.
* A maintenance script for generating dumps. It should at least be able to generate a dump of either all entities, or one kind of entity (e.g. items). And it should also be able to dump a given list of entities.
Surprisingly, creating a dump of all entities or of one kind of entities is quite different from providing a dump of a given list of entities, because whenever you create a dump of everything you can make some assumptions that save you from keeping a lot of state. Therefore this item should be split into two (or even three) subitems:
* A script to create a dump of all entities
* (A script to create a dump of all entities of a given kind)
* A script to create a dump of a list of entities
Personally I think the last item has a rather low priority, because it can
be so easily simulated.
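A sketch of why the two cases differ (the function names and the dict shape are illustrative only): when dumping everything, every property definition comes past anyway, so no bookkeeping is needed; when dumping a selected list, you must track which referenced properties still need their definitions emitted.

```python
def dump_all(entities, emit):
    # Full dump: every entity (including the properties) is iterated anyway,
    # so we can stream with no extra state at all.
    for entity in entities:
        emit(entity)

def dump_selection(ids, load_entity, emit):
    # Partial dump: remember which referenced properties were already
    # emitted, so each definition appears exactly once.
    emitted_props = set()
    for entity_id in ids:
        entity = load_entity(entity_id)
        emit(entity)
        for prop_id in entity.get("properties", []):
            if prop_id not in emitted_props:
                emitted_props.add(prop_id)
                emit(load_entity(prop_id))
```

The state set grows with the number of distinct properties referenced, which is exactly the cost the full dump avoids.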
* Special:EntityData needs a plug-in interface so the RDF serializer can be used from there.
Or call the exporter. This special page corresponds to SMWSpecialOWLExport.
* Special:EntityData should support format selection using file extension syntax (e.g. Q1234.n3 vs. Q1234.json).
That is a nice solution which works with Wikibase and was not available in
SMW.
* Similarly, Special:EntityData should support a "pretty" syntax for showing specific revisions, e.g. Q1234.json@81762345.
I never really understood why you considered this one so important. Let's keep it as an item, but for me its priority is really low.
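Both suffix syntaxes could be handled by one small parser on the subpage title; here is a sketch (the exact parsing rules are my assumption, not a settled design):

```python
import re

def parse_entity_data_title(title):
    """Split e.g. 'Q1234.json@81762345' into (entity_id, format, revision).
    Format and revision are optional: 'Q1234' -> ('Q1234', None, None).
    The grammar here is an assumption for illustration."""
    m = re.match(r'^([A-Za-z]\d+)(?:\.([a-z0-9]+))?(?:@(\d+))?$', title)
    if not m:
        return None
    entity_id, fmt, rev = m.groups()
    return (entity_id, fmt, int(rev) if rev else None)
```

The revision part then costs almost nothing once the extension parsing exists, which may be an argument for keeping the item despite its low priority.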
* Special:EntityData should support content negotiation (using redirects).
Basically what SMWURIResolver provides, but can be a bit nicer due to the
file extension suffixes.
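As a sketch of how the Accept header could map onto those suffix URLs (the type list, URL scheme, and use of a redirect target are assumptions for illustration):

```python
# Assumed mapping from MIME types to the file-extension URLs; an entry of
# None means "redirect to the normal wiki page" for browsers asking for HTML.
FORMATS = [
    ("application/rdf+xml", "rdf"),
    ("text/turtle",         "ttl"),
    ("application/json",    "json"),
    ("text/html",           None),
]

def negotiate(accept_header, entity_id):
    """Pick a redirect target for Special:EntityData/<id> from the Accept
    header. Simplified: ignores q-values, takes the first matching type."""
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for mime, ext in FORMATS:
        if mime in accepted or "*/*" in accepted:
            if ext is None:
                return "/wiki/" + entity_id
            return "/wiki/Special:EntityData/%s.%s" % (entity_id, ext)
    return "/wiki/Special:EntityData/%s.json" % entity_id  # fallback
```

A real implementation would honor q-values and answer with a 303 redirect, but the suffix URLs do make the redirect targets pleasantly simple.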
Did I miss anything?
Probably, just as I did.
I'd like to see if we get some input here, and then we can extract the
items from it and start with implementing them.
Already available is the following special page:
http://wikidata-test-repo.wikimedia.de/wiki/Special:EntityData/Q3
-- daniel
--
Daniel Kinzler, Software Architect
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 |
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.