Re: [Wikidata-l] Wikidata RDF export available

10 Aug 2013

Dear Sebastian,

On 10/08/13 12:18, Sebastian Hellmann wrote:
...
  Hi Markus!
 Thank you very much.

 Regarding your last email:
 Of course, I am aware of your arguments in your last email, that the
 dump is not "official". Nevertheless, I am expecting you and others to
 code (or supervise) similar RDF dumping projects in the future.

 Here are two really important things to consider:

 1. Always use a mature RDF framework for serializing: ...

Statements that involve "always" are easy to disagree with. An important 
part of software engineering is to achieve one's goals with optimal 
investment of resources. If you work on larger and more long-term 
projects, you will start to appreciate that the theoretically "best" or 
"cleanest" solution is not always the one that leads to a successful 
project. To the contrary, such a viewpoint can even make it harder to 
work in a "messy" surrounding, full of tools and data that do not quite 
adhere to the high ideals that one would like everyone (on the Web!) to 
have. You can see good example of this in HTML evolution.

Turtle is *really* easy to parse in a robust and fault-tolerant way. I 
am tempted to write a little script that sanitizes Turtle input in a 
streaming fashion by discarding garbage triples. Can't take more than a 
weekend to do that, don't you think? But I already have plans this 
weekend :-)

...

 2. Use NTriples or one-triple-per-line Turtle:
 (Turtle supports IRIs and unicode, compare)
 curl
 http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
 bzcat | head
 curl
 http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 |
 bzcat | head

 one-triple-per-line let's you
 a) find errors easier and
 b) aids further processing, e.g. calculate the outdegree of subjects:
 curl
 http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
 bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed
's/<//;s/>//' |
 awk '{count[$1]++}END{for(j in count) print "<" j ">"
"\t"count [j]}'

 Furthermore:
 - Parsers can treat one-triple-per-line more robust, by just skipping lines
 - compression size is the same
 - alphabetical ordering of data works well (e.g. for GitHub diffs)
 - you can split the files in several smaller files easily 
See above. Why not write a little script that streams a Turtle file and 
creates one-triple-per-line output? This could be done with very little 
memory overhead in a streaming fashion. Both nested and line-by-line 
Turtle have their advantages and disadvantages, but one can trivially be 
converted into the other whereas the other cannot be converted back easily.

Of course we will continue to improve our Turtle quality, but there will 
always be someone who would prefer a slightly different format. One will 
always have to draw a line somewhere.

...

 Blank nodes have some bad properties:
 - some databases react weird to them and they sometimes fill up indexes
 and make the DB slow (depends on the implementations of course, this is
 just my experience )
 - make splitting one-triple-per-line more difficult
 - difficult for SPARQL to resolve recursively
 - see http://videolectures.net/iswc2011_mallea_nodes/ or
 http://web.ing.puc.cl/~marenas/publications/iswc11.pdf 
Does this relate to Wikidata or are we getting into general RDF design 
discussions here (wrong list)? Wikidata uses blank nodes only for 
serialising OWL axioms, and there is no alternative in this case.

...

 Turtle prefixes:
 Why do you think they are a "good thing"? They are disputed as sometimes
 as a premature feature. They do make data more readable, but nobody is
 going to read 4.4 GB of Turtle. 
If you want to fight against existing W3C standards, this is really not 
the right list. I have not made Turtle, and I won't defend its design 
here. But since you asked: I think readability is a good thing.

...
  By the way, you can always convert it to turtle
easily:
 curl
 http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 |
 bzcat | head -100  | rapper -i turtle -o turtle -I - - file 
If conversion is so easy, it does not seem worthwhile to have much of a 
discussion about this at all.

Cheers,

Markus

...

 Am 10.08.2013 12:44, schrieb Markus Krötzsch:
  Good morning. I just found a bug that was caused
by a bug in the
 Wikidata dumps (a value that should be a URI was not). This led to a
 few dozen lines with illegal qnames of the form "w: ". The updated
 script fixes this.

 Cheers,

 Markus

 On 09/08/13 18:15, Markus Krötzsch wrote:
  Hi Sebastian,

 On 09/08/13 15:44, Sebastian Hellmann wrote:
  Hi Markus,
 we just had a look at your python code and created a dump. We are still
 getting a syntax error for the turtle dump. 
 You mean "just" as in "at around 15:30 today" ;-)? The code is under
 heavy development, so changes are quite frequent. Please expect things
 to be broken in some cases (this is just a little community project, not
 part of the official Wikidata development).

 I have just uploaded a new statements export (20130808) to
 http://semanticweb.org/RDF/Wikidata/ which you might want to try.

 I saw, that you did not use a mature framework for serializing the
 turtle. Let me explain the problem:

 Over the last 4 years, I have seen about two dozen people
 (undergraduate
 and PhD students, as well as Post-Docs) implement "simple" serializers
 for RDF.

 They all failed.

 This was normally not due to the lack of skill, but due to the lack of
 missing time. They wanted to do it quick, but they didn't have the time
 to implement it correctly in the long run.
 There are some really nasty problems ahead like encoding or special
 characters in URIs. I would direly advise you to:

 1. use a Python RDF framework
 2. do some syntax tests on the output, e.g. with "rapper"
 3. use a line by line format, e.g. use turtle without prefixes and just
 one triple per line (It's like NTriples, but with Unicode) 
 Yes, URI encoding could be difficult if we were doing it manually. Note,
 however, that we are already using a standard library for URI encoding
 in all non-trivial cases, so this does not seem to be a very likely
 cause of the problem (though some non-zero probability remains). In
 general, it is not unlikely that there are bugs in the RDF somewhere;
 please consider this export as an early prototype that is meant for
 experimentation purposes. If you want an official RDF dump, you will
 have to wait for the Wikidata project team to get around doing it (this
 will surely be based on an RDF library). Personally, I already found the
 dump useful (I successfully imported some 109 million triples of some
 custom script into an RDF store), but I know that it can require some
 tweaking.

 We are having a problem currently, because we tried to convert the dump
 to NTriples (which would be handled by a framework as well) with
 rapper.
 We assume that the error is an extra "<" somewhere (not confirmed) and
 we are still searching for it since the dump is so big.... 
 Ok, looking forward to hear about the results of your search. A good tip
 for checking such things is to use grep. I did a quick grep on my
 current local statements export to count the numbers of < and > (this
 takes less than a minute on my laptop, including on-the-fly
 decompression). Both numbers were equal, making it unlikely that there
 is any unmatched < in the current dumps. Then I used grep to check that
 < and > only occur in the statements files in lines with "commons" URLs.
 These are created using urllib, so there should never be any < or > in
 them.

  so we can not provide a detailed bug report. If
we had one triple per
 line, this would also be easier, plus there are advantages for stream
 reading. bzip2 compression is very good as well, no need for prefix
 optimization. 
 Not sure what you mean here. Turtle prefixes in general seem to be a
 Good Thing, not just for reducing the file size. The code has no easy
 way to get rid of prefixes, but if you want a line-by-line export you
 could subclass my exporter and overwrite the methods for incremental
 triple writing so that they remember the last subject (or property) and
 create full triples instead. This would give you a line-by-line export
 in (almost) no time (some uses of [...] blocks in object positions would
 remain, but maybe you could live with that).

 Best wishes,

 Markus

 All the best,
 Sebastian

 Am 03.08.2013 23:22, schrieb Markus Krötzsch:
> Update: the first bugs in the export have already been discovered --
> and fixed in the script on github. The files I uploaded will be
> updated on Monday when I have a better upload again (the links file
> should be fine, the statements file requires a rather tolerant Turtle
> string literal parser, and the labels file has a malformed line that
> will hardly work anywhere).
>
> Markus
>
> On 03/08/13 14:48, Markus Krötzsch wrote:
>> Hi,
>>
>> I am happy to report that an initial, yet fully functional RDF export
>> for Wikidata is now available. The exports can be created using the
>> wda-export-data.py script of the wda toolkit [1]. This script
>> downloads
>> recent Wikidata database dumps and processes them to create
>> RDF/Turtle
>> files. Various options are available to customize the output
>> (e.g., to
>> export statements but not references, or to export only texts in
>> English
>> and Wolof). The file creation takes a few (about three) hours on my
>> machine depending on what exactly is exported.
>>
>> For your convenience, I have created some example exports based on
>> yesterday's dumps. These can be found at [2]. There are three Turtle
>> files: site links only, labels/descriptions/aliases only, statements
>> only. The fourth file is a preliminary version of the Wikibase
>> ontology
>> that is used in the exports.
>>
>> The export format is based on our earlier proposal [3], but it adds a
>> lot of details that had not been specified there yet (namespaces,
>> references, ID generation, compound datavalue encoding, etc.).
>> Details
>> might still change, of course. We might provide regular dumps at
>> another
>> location once the format is stable.
>>
>> As a side effect of these activities, the wda toolkit [1] is also
>> getting more convenient to use. Creating code for exporting the data
>> into other formats is quite easy.
>>
>> Features and known limitations of the wda RDF export:
>>
>> (1) All current Wikidata datatypes are supported. Commons-media
>> data is
>> correctly exported as URLs (not as strings).
>>
>> (2) One-pass processing. Dumps are processed only once, even though
>> this
>> means that we may not know the types of all properties when we first
>> need them: the script queries wikidata.org to find missing
>> information.
>> This is only relevant when exporting statements.
>>
>> (3) Limited language support. The script uses Wikidata's internal
>> language codes for string literals in RDF. In some cases, this might
>> not
>> be correct. It would be great if somebody could create a mapping from
>> Wikidata language codes to BCP47 language codes (let me know if you
>> think you can do this, and I'll tell you where to put it)
>>
>> (4) Limited site language support. To specify the language of linked
>> wiki sites, the script extracts a language code from the URL of the
>> site. Again, this might not be correct in all cases, and it would be
>> great if somebody had a proper mapping from Wikipedias/Wikivoyages to
>> language codes.
>>
>> (5) Some data excluded. Data that cannot currently be edited is not
>> exported, even if it is found in the dumps. Examples include
>> statement
>> ranks and timezones for time datavalues. I also currently exclude
>> labels
>> and descriptions for simple English, formal German, and informal
>> Dutch,
>> since these would pollute the label space for English, German, and
>> Dutch
>> without adding much benefit (other than possibly for simple English
>> descriptions, I cannot see any case where these languages should ever
>> have different Wikidata texts at all).
>>
>> Feedback is welcome.
>>
>> Cheers,
>>
>> Markus
>>
>> [1] https://github.com/mkroetzsch/wda
>>      Run "python wda-export.data.py --help" for usage instructions
>> [2] http://semanticweb.org/RDF/Wikidata/
>> [3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF
>>
>
>
> _______________________________________________
> Wikidata-l mailing list
> Wikidata-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>

 --
 Dipl. Inf. Sebastian Hellmann
 Department of Computer Science, University of Leipzig
 Events:
 * NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, Extended
 Deadline: *July 18th*)
 * LSWT 23/24 Sept, 2013 in Leipzig (http://aksw.org/lswt)
 Venha para a Alemanha como PhD: http://bis.informatik.uni-leipzig.de/csf
 Projects: http://nlp2rdf.org , http://linguistics.okfn.org ,
 http://dbpedia.org/Wiktionary , http://dbpedia.org
 Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
 Research Group: http://aksw.org 

-- 
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529               http://korrekt.org/

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

Re: [Wikidata-l] Wikidata RDF export available