Hi,
I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in English
and Wolof). File creation takes a few hours (about three on my machine),
depending on what exactly is exported.
For your convenience, I have created some example exports based on
yesterday's dumps. These can be found at [2]. There are three Turtle
files: site links only, labels/descriptions/aliases only, statements
only. The fourth file is a preliminary version of the Wikibase ontology
that is used in the exports.
The export format is based on our earlier proposal [3], but it adds a
lot of details that had not been specified there yet (namespaces,
references, ID generation, compound datavalue encoding, etc.). Details
might still change, of course. We might provide regular dumps at another
location once the format is stable.
As a side effect of these activities, the wda toolkit [1] is also
getting more convenient to use. Creating code for exporting the data
into other formats is quite easy.
Features and known limitations of the wda RDF export:
(1) All current Wikidata datatypes are supported. Commons-media data is
correctly exported as URLs (not as strings).
(2) One-pass processing. Dumps are processed only once, even though this
means that we may not know the types of all properties when we first
need them: the script queries wikidata.org to find missing information
(see the sketch after this list). This is only relevant when exporting
statements.
(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it).
(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.
(5) Some data excluded. Data that cannot currently be edited is not
exported, even if it is found in the dumps. Examples include statement
ranks and timezones for time datavalues. I also currently exclude labels
and descriptions for simple English, formal German, and informal Dutch,
since these would pollute the label space for English, German, and Dutch
without adding much benefit (other than possibly for simple English
descriptions, I cannot see any case where these languages should ever
have different Wikidata texts at all).
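(Regarding (2), here is a hypothetical sketch of the kind of lookup
involved, using the public wbgetentities API. This is not the actual wda
code; the function name, the example property, the happy-path error
handling, and Python 3 are all assumptions.)

    # Hypothetical sketch, not the wda code: look up a property's datatype
    # via the public wbgetentities API when the dump has not yet told us.
    import json
    import urllib.request

    def property_datatype(pid):
        url = ("https://www.wikidata.org/w/api.php"
               "?action=wbgetentities&format=json&props=datatype&ids=" + pid)
        with urllib.request.urlopen(url) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        return data["entities"][pid]["datatype"]

    # Example: property_datatype("P18") should return "commonsMedia".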
Feedback is welcome.
Cheers,
Markus
[1] https://github.com/mkroetzsch/wda
Run "python wda-export.data.py --help" for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF
--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529 http://korrekt.org/
On 12/08/13 17:07, Paul A. Houle wrote:
> My feelings are strong towards one-line-per-fact.
> Large RDF data sets have validity problems, and the difficulty of
> convincing publishers that this matters indicates that this situation
> will continue.
I hope that the Wikidata export the script creates is actually valid,
but please feel free to report a bug if you find any problems. Tomorrow
I will update the online files I created, to make sure they are based on
the latest code as well.
> I’ve thought a bit about the problem of the “streaming converter
> from Turtle to N-Triples”. It’s true that this can be done in a
> streaming manner most of the time, but there is a stack that can get
> infinitely deep in Turtle so you can’t say, strictly, that memory
> consumption is bounded.
Yes, it is not a "streaming algorithm" in a formal sense, just a
streaming-style algorithm. But in practice, it would most likely work
quite well.
> It’s also very unclear to me how exactly to work around broken
> records and restart the parser in the general case. It’s not hard to
> mock up examples where a simple recovery mechanism works, but I dread
> the thought of developing one for commercial use where I’d probably be
> playing whack-a-mole for edge cases for years.
The beauty of a "quirks mode" Turtle parser is that there are no
requirements on it. If the Turtle is broken, then anything is better
than rejecting it altogether. The state of the art seems to be to give
up and return no triples, not even the ones that were well-formed higher
up in the file (which would also help to find the location of the error
...). A first improvement would be to keep the finished triples that
have been recognized so far. Then you can think about restarting
methods. And of course, while one can construct cases that seem confusing
(at least to the human eye), most errors in real Turtle documents are
missing escapes, missing terminators, or missing entities (unexpected
terminators), one at a time. It seems we can do a fairly reasonable
recovery in each case (and of course there is always "Too many errors,
giving up").
...
(I have nothing to say about compression)
> I am very much against blank nodes for ‘wiki-ish’ data that is shared
> between systems. The fact that Freebase reifies “blank nodes” as CVTs
> means that we can talk about them on the outside, reason about them,
> and then name them in order to interact with them on the live Freebase
> system. By their nature, blank nodes defy the “anyone, anything,
> anywhere” concept because they can’t be referred to. In the case of OWL
> that’s a feature not a bug because you can really close the world
> because nobody can add anything to a lisp-list without introducing a new
> node. Outside tiny tiny T-Boxes (say SUMO size), internal DSLs like
> SPIN, or expressing algebraic sorts of things (i.e. describe the mixed
> eigenstates of quarks in some Hadron), the mainstream of linked data
> doesn’t use them.
I think I agree. I also create proper URIs for all auxiliary objects and
confine bnodes to OWL axioms, which are always small standard patterns
that express one axiom.
> Personally I’d like to see the data published in Quad form and have the
> reification data expressed in the context field. As much as possible,
> the things in the (?s ?p ?o) fields should make sense as facts. Ideally
> you could reuse one ?c node for a lot of facts, such as if a number of
> them came in one transaction. You could ask for the ?c fields (show me
> all estimates for the population of Berlin from 2000 to the present and
> who made them) or you could go through the file of facts and pick the
> ?c’s that provide the point of view that you need the system to have.
This has been discussed before and there was a decision in the past that
the reification-based format should be implemented first, and that a
named-graph based format should be implemented as well, but later. So
hopefully we will have both at some time. For now, the RDF export is
still experimental, and has no concrete uses (or planned uses that I
know of) yet, so the priority of extensions/changes in the grand scheme
of things is rather low. This might change, of course.
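For illustration only, here is a sketch of what such a named-graph export
could look like with rdflib; the URIs are placeholders, not the actual
Wikidata vocabulary or a committed format:

    # Illustration only, placeholder URIs: the plain fact lives in a named
    # graph (the ?c context), and provenance is attached to that context.
    from rdflib import ConjunctiveGraph, Literal, Namespace

    EX = Namespace("http://example.org/")
    ds = ConjunctiveGraph()
    ref = EX["ref1"]                          # the ?c context node
    ds.get_context(ref).add((EX["Q64"], EX["population"], Literal(3500000)))
    ds.get_context(EX["meta"]).add((ref, EX["statedIn"], EX["someSource"]))
    ds.serialize(destination="example.nq", format="nquads")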
Cheers,
Markus
--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529 http://korrekt.org/
My feelings are strong towards one-line-per-fact.
Large RDF data sets have validity problems, and the difficulty of convincing publishers that this matters indicates that this situation will continue.
I’ve thought a bit about the problem of the “streaming converter from Turtle to N-Triples”. It’s true that this can be done in a streaming manner most of the time, but there is a stack that can get infinitely deep in Turtle so you can’t say, strictly, that memory consumption is bounded.
It’s also very unclear to me how exactly to work around broken records and restart the parser in the general case. It’s not hard to mock up examples where a simple recovery mechanism works, but I dread the thought of developing one for commercial use where I’d probably be playing whack-a-mole for edge cases for years.
There was a gap of quite a few years in the late ’90s when there weren’t usable open-source web browsers because a practical web browser had to: (1) read broken markup, and (2) render it exactly the same as Netscape 3. Commercial operations can get things like this done by burning out programmers, who finally show up at a standup meeting one day, smash their laptop and stomp out. It’s not so easy in the open source world where you’re forced to use carrots and not sticks.
So far as compression v. inner format I also have some thoughts because for every product I’ve made in the last few years I always tried a few different packaging methods before releasing something to final.
Gzip eats up a lot of the ‘markupbloat’ in N-Triples because recently used IRIs and prefixes will be in the dictionary. The minus is that the dictionary isn’t very big, so the contents of the dictionary itself are bloated; there isn’t much entropy there, but the same markupbloat gets repeated hundreds of times; if you just put the prefixes in a hash table that might be more like 1000 bytes total to represent that. When you prefix-compress RDF and gzip it then, you’ve got the advantage that the dictionary contains more entropy than it would otherwise. Even though gzip is not cutting out so much markup bloat, it is compressing off a better model of the document so you get better results.
As has been pointed out, sorting helps. If you sort in ?s ?p ?o . order it helps, partially because the sorting itself removes entropy (There are N! possible unsorted files and only one sorted one) and obviously the dictionary is being set up to roll together common ?s and ?s ?p’s the way turtle does.
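A quick way to see this effect on synthetic data (illustration only, not a benchmark; absolute numbers will vary):

    # Illustration only: compare gzip/bz2 sizes of unsorted vs. sorted fake
    # N-Triples. Sorting groups repeated subjects and predicates together.
    import bz2, gzip, random

    subjects = ["<http://example.org/Q%d>" % i for i in range(1000)]
    preds = ["<http://example.org/P%d>" % i for i in range(20)]
    triples = ['%s %s "v%d" .' % (random.choice(subjects), random.choice(preds), i)
               for i in range(50000)]

    unsorted_blob = "\n".join(triples).encode("utf-8")
    sorted_blob = "\n".join(sorted(triples)).encode("utf-8")

    for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
        print(name, len(compress(unsorted_blob)), len(compress(sorted_blob)))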
Bzip’s ability to work like a Markov chain with the element of chance taken out is usually more effective at compression than gzip is, but I’ve noticed some exceptions. In the original :BaseKB products, all of the nodes looked like
<http://rdf.basekb.com/ns/m.112az>
I found my ?s ?p ?o sorted data compressed better with gzip than bzip, and perhaps the structure of the identifiers had something to do with it.
A big advantage of bzip is that the block-based nature of the compression means that blocks can be compressed and decompressed in parallel (pbzip2 is a drop-in replacement for bzip2), so the possible top speed of decompressing bzip data is in principle unlimited, even though bzip is a more expensive algorithm. Hadoop 1.1.0+ can even automatically decompress a bzip2 file and split the result into separate mappers. Generally system performance is better if you read data out of pre-split gzip, but it is just so easy to load a big bz2 in HDFS and point a lot of transistors at it.
I am very much against blank nodes for ‘wiki-ish’ data that is shared between systems. The fact that Freebase reifies “blank nodes” as CVTs means that we can talk about them on the outside, reason about them, and then name them in order to interact with them on the live Freebase system. By their nature, blank nodes defy the “anyone, anything, anywhere” concept because they can’t be referred to. In the case of OWL that’s a feature not a bug because you can really close the world because nobody can add anything to a lisp-list without introducing a new node. Outside tiny tiny T-Boxes (say SUMO size), internal DSLs like SPIN, or expressing algebraic sorts of things (i.e. describe the mixed eigenstates of quarks in some Hadron), the mainstream of linked data doesn’t use them.
Personally I’d like to see the data published in Quad form and have the reification data expressed in the context field. As much as possible, the things in the (?s ?p ?o) fields should make sense as facts. Ideally you could reuse one ?c node for a lot of facts, such as if a number of them came in one transaction. You could ask for the ?c fields (show me all estimates for the population of Berlin from 2000 to the present and who made them) or you could go through the file of facts and pick the ?c’s that provide the point of view that you need the system to have.
Hello,
Can you help me understand the scope of a Wikidata entry please?
What is this Wikidata entry for?
http://www.wikidata.org/wiki/Q272619
Is it for the person Norman Cook and all of his aliases?
Should that title be Fatboy Slim or Norman Cook?
Is it ok that it has different titles in different languages?
Do there have to be separate Wikipedia pages before we can create separate
Wikidata entities for the separate concepts?
In MusicBrainz there are three artists that point to the 'Norman Cook'
Wikipedia page:
http://musicbrainz.org/artist/3150be04-f42f-43e0-ab5c-77965a4f7a7d
http://musicbrainz.org/artist/34c63966-445c-4613-afe1-4f0e1e53ae9a
http://musicbrainz.org/artist/ba81eb4a-0c89-489f-9982-0154b8083a28
Should they all be pointing at the same Wikidata entry too?
Is it ok that there is only a single MusicBrainz identifier in Wikidata?
How is that identifier chosen?
The problem that we are experiencing is that our Triplestore is merging
all these concepts together into a single entity and I am trying to work
out where to break the equivalence, or if it is even a problem.
Thanks!
nick.
John Erling Blad, 12/08/2013 11:43:
> I was quite sure the "no" option was removed from preference after a
> discussion about this language code.
Seems not; probably that requires deleting the "no" language file and
making it into a dummy language? Dunno.
Nemo
>
> On 8/12/13, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
>> John Erling Blad, 12/08/2013 01:30:
>>> You can't use "no" as a language in ULS, but you can use setlang and
>>> uselang with "no" if I remember correctly. All messages are aliased to
>>> "nb" if the language is "nb". Also at nowiki will the messages for
>>> "nb" be used, and this is an accepted solution. Previously
>>> no.wikidata.org redirected with a setlang=no and that created a lot of
>>> confusion as we then had two different language codes depending on how
>>> the page was opened. There are also bots that use the site id to
>>> generate a language code and that will create a "no" language code.
>>
>> This answers the question indirectly: as far as I know,
>> language-dependent content can, currently, be entered only in your
>> interface language. However, both no and nb are available in preferences
>> and you may also encounter
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=37459
>>
>> Nemo
>>
>>>
>>> On 8/10/13, Markus Krötzsch wrote:
>>>>
>>>> What I wonder is: if users choose to enter a "no" label on Wikidata,
>>>> what is the language setting that they see? Does this say "Norwegian
>>>> (any variant)" or what? That's what puzzles me. I know that a Wikipedia
>>>> can allow multiple languages (or dialects) to coexist, but in the
>>>> Wikidata language selector I thought you can only select "real"
>>>> languages, not "language groups".
>>
>
[Sorry for cross-posting]
Yes, I agree that the OmegaWiki community should be involved in the
discussions, and I have pointed GerardM to our proposals and discussions
whenever possible, using him as a liaison. We also looked, and keep
looking, at the OmegaWiki data model to see what we are missing.
Our latest proposal is different from OmegaWiki in two major points:
* our primary goal is to provide support for structured data in the
Wiktionaries. We do not plan to be the main resource ourselves, where
readers come to look up something; we merely provide structured data
that a Wiktionary may or may not use. This parallels the role Wikidata
has with regard to Wikipedia. This also highlights the difference
between Wikidata and OmegaWiki, since OmegaWiki's goal is "to create a
dictionary of all words of all languages, including lexical, terminological
and ontological information."
* a smaller difference is the data model. Wikidata's latest proposal to
support Wiktionary is centered around lexemes, and we do not assume that
there is such a thing as a language-independent defined meaning. But no
matter what model we end up with, it is important to ensure that the bulk
of the data can flow freely between the projects; even though we might
disagree on this point in the modeling, a wide exchange of data remains
possible.
We tried to keep notes on the discussion we had today:
<http://epl.wikimedia.org/p/WiktionaryAndWikidata>
My major take-home messages are:
* the proposal needs more visual elements, especially a mock-up or sketch
of what it would look like and how it could be used on the Wiktionaries
* there is no generally accepted place for a discussion that involves all
Wiktionary projects. Still, my initial decision to have the discussion on
the Wikidata wiki was not a good one, and it should and will be moved to
Meta.
Having said that, the current proposal for the data model of how to support
Wiktionary with Wikidata seems to have garnered a lot of support so far. So
this is what I will continue building upon. Further comments are extremely
welcome. You can find it here:
<http://www.wikidata.org/wiki/Wikidata:Wiktionary>
As said, it will be moved to Meta, as soon as the requested mockups and
extensions are done.
Cheers,
Denny
2013/8/10 Samuel Klein <meta.sj(a)gmail.com>
> Hello,
>
> > On Fri, Aug 9, 2013 at 6:13 PM, JP Béland <lebo.beland(a)gmail.com> wrote:
> >> I agree. We also need to include the Omegawiki community.
>
> Agreed.
>
> On Fri, Aug 9, 2013 at 12:22 PM, Laura Hale <laura(a)fanhistory.com> wrote:
> > Why? The question of moving them into the WMF fold was pretty much no,
> > because the project has an overlapping purpose with Wiktionary,
>
> This is not actually the case.
> There was overwhelming community support for adopting Omegawiki - at
> least simply providing hosting. It stalled because the code needed a
> security and style review, and Kip (the lead developer) was going to
> put some time into that. The OW editors and dev were very interested
> in finding a way forward that involved Wikidata and led to a combined
> project with a single repository of terms, meanings, definitions and
> translations.
>
> Recap: The page describing the OmegaWiki project satisfies all of the
> criteria for requesting WMF adoption.
> * It is well-defined on Meta http://meta.wikimedia.org/wiki/Omegawiki
> * It describes an interesting idea clearly aligned with expanding the
> scope of free knowledge
> * It is not a 'competing' project to Wiktionaries; it is an idea that
> grew out of the Wiktionary community, has been developed for years
> alongside it, and shares many active contributors and linguaphiles.
> * It started an RfC which garnered 85% support for adoption.
> http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki
>
> Even if the current OW code is not used at all for a future Wiktionary
> update -- and this idea was proposed and taken seriously by the OW
> devs -- their community of contributors should be part of discussions
> about how to solve the Wiktionary problem that they were the first to
> dedicate themselves to.
>
> Regards,
> Sam.
>
>
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.
Hi,
If there is someone at Wikimania interested in participating in the talks
about the future support of Wiktionary in Wikidata, we will be having a
discussion about the several proposals.
http://wikimania2013.wikimedia.org/wiki/Support_of_Wiktionary_in_Wikidata
Date : Saturday, 10 Aug, 11:30 am - 1:00 pm
Place: Y520 (block Y, 5th floor)
See you there,
Micru
Over time people have gotten the message that you shouldn't write XML
like
System.out.println("<blah>"+someString+"</blah>")
because it is something that usually ends in tears.
Although (most) RDF toolkits are like XML toolkits in that they choke on
invalid data, people who write RDF seem to have little concern for whether
or not it is valid. This cultural problem is one of the reasons why RDF has
seemed to catch on so slowly. If you tell somebody their XML is invalid,
they'll feel like they have to do something about it, but people don't seem
to take any action when they hear that the 20 GB file they published is
trash.
As a general practice you should use real RDF tools to write RDF files.
This adds some overhead, but it's generally not hard and it gives you a
pretty good chance you'll get valid output. ;-)
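For example, with a library such as rdflib, the escaping and syntax are
handled for you (a minimal sketch; the URIs and the output file name are
placeholders):

    # Minimal sketch: let an RDF library (here rdflib) do the serialization
    # instead of concatenating strings. URIs are placeholders.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.bind("ex", EX)
    g.add((EX["Q64"], EX["label"], Literal("Berlin", lang="de")))
    g.serialize(destination="out.ttl", format="turtle")  # well-formed Turtle out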
Lately I've been working on this system
https://github.com/paulhoule/infovore/wiki
which is intended to deal with exactly this situation on a large scale.
The "Parallel Super Eyeball 3" (3 means triple, PSE 4 is a hypothetical
tool that does the same for quads) tool physically separates valid and
invalid triples so you can use the valid triples while being aware of what
invalid data tried to sneak in.
Early next week I'm planning on rolling out ":BaseKB Now", which will be
filtered Freebase data, processed automatically on a weekly basis. I've
got a project in the pipeline that is going to require Wikipedia categories
(I had better get them fast before they go away) and another large 4D
metamemomic data set for which Wikidata Phase I will be a Rosetta Stone, so
support for those data sets is on my critical path.
-----Original Message-----
From: Sebastian Hellmann
Sent: Friday, August 9, 2013 10:44 AM
To: Discussion list for the Wikidata project.
Cc: Dimitris Kontokostas ; Jona Christopher Sahnwaldt
Subject: Re: [Wikidata-l] Wikidata RDF export available
Hi Markus,
we just had a look at your Python code and created a dump. We are still
getting a syntax error for the Turtle dump.
I saw that you did not use a mature framework for serializing the
Turtle. Let me explain the problem:
Over the last four years, I have seen about two dozen people (undergraduates
and PhD students, as well as post-docs) implement "simple" serializers
for RDF.
They all failed.