Re: [Wikidata-l] Wikidata RDF export available

9 Aug 2013


Hi Markus,
we just had a look at your Python code and created a dump. We are still 
getting a syntax error for the Turtle dump.
I saw that you did not use a mature framework for serializing the 
Turtle. Let me explain the problem:
Over the last 4 years, I have seen about two dozen people (undergraduate 
and PhD students, as well as Post-Docs) implement "simple" serializers 
for RDF.
They all failed.
This was normally due not to a lack of skill, but to a lack of time. 
They wanted to do it quickly, but they did not have the time to 
implement it correctly in the long run.
There are some really nasty problems ahead, such as encoding or special 
characters in URIs. I would strongly advise you to:
1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with "rapper"
3. use a line-by-line format, e.g. Turtle without prefixes and with just 
one triple per line (it is like N-Triples, but with Unicode)
We currently have a problem because we tried to convert the dump to 
N-Triples with rapper (a conversion that a framework would handle as 
well). We assume that the error is an extra "<" somewhere (not 
confirmed), and we are still searching for it because the dump is so 
big, so we cannot provide a detailed bug report. With one triple per 
line, finding it would be easier, and there are also advantages for 
stream reading. bzip2 compresses such files very well, so there is no 
need for prefix optimization.
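
To illustrate why one triple per line helps, here is a minimal sketch 
that streams a bzip2-compressed dump and reports the line numbers of 
lines that do not look like a single triple. It assumes a line-based 
format as suggested in point 3. The function names and the regular 
expression are my own illustration, not part of the wda toolkit, and the 
pattern is only a rough heuristic, not a conforming parser.

```python
import bz2
import re

# Rough shape of one triple per line: subject, predicate, object, final dot.
# Heuristic only: good enough to locate a stray "<", not a full grammar.
IRI = r'<[^<>"\s]*>'
BNODE = r'_:\S+'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?'
TRIPLE = re.compile(
    r'^(?:%s|%s)\s+%s\s+(?:%s|%s|%s)\s*\.$'
    % (IRI, BNODE, IRI, IRI, BNODE, LITERAL)
)


def find_suspicious_lines(lines, limit=10):
    """Yield (line_number, text) for lines that do not look like one triple."""
    found = 0
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith('#'):
            continue  # skip blank lines and comments
        if not TRIPLE.match(text):
            yield number, text
            found += 1
            if found >= limit:
                return


def scan_dump(path, limit=10):
    """Stream a .bz2 file line by line without unpacking it to disk."""
    with bz2.open(path, mode='rt', encoding='utf-8') as handle:
        return list(find_suspicious_lines(handle, limit))
```

Calling scan_dump on a dump file would return the first few offending 
lines with their line numbers, which is already enough for a bug report. 
With a multi-line Turtle file, a check this simple is not possible.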
All the best,
Sebastian
On 03.08.2013 23:22, Markus Krötzsch wrote:
...
Update: the first bugs in the export have already been discovered -- 
and fixed in the script on github. The files I uploaded will be 
updated on Monday when I have a better upload connection again (the links file 
should be fine, the statements file requires a rather tolerant Turtle 
string literal parser, and the labels file has a malformed line that 
will hardly work anywhere).
Markus
On 03/08/13 14:48, Markus Krötzsch wrote:
...
Hi,
I am happy to report that an initial, yet fully functional RDF export
for Wikidata is now available. The exports can be created using the
wda-export-data.py script of the wda toolkit [1]. This script downloads
recent Wikidata database dumps and processes them to create RDF/Turtle
files. Various options are available to customize the output (e.g., to
export statements but not references, or to export only texts in English
and Wolof). The file creation takes a few (about three) hours on my
machine depending on what exactly is exported.
For your convenience, I have created some example exports based on
yesterday's dumps. These can be found at [2]. There are three Turtle
files: site links only, labels/descriptions/aliases only, statements
only. The fourth file is a preliminary version of the Wikibase ontology
that is used in the exports.
The export format is based on our earlier proposal [3], but it adds a
lot of details that had not been specified there yet (namespaces,
references, ID generation, compound datavalue encoding, etc.). Details
might still change, of course. We might provide regular dumps at another
location once the format is stable.
As a side effect of these activities, the wda toolkit [1] is also
getting more convenient to use. Creating code for exporting the data
into other formats is quite easy.
Features and known limitations of the wda RDF export:
(1) All current Wikidata datatypes are supported. Commons-media data is
correctly exported as URLs (not as strings).
(2) One-pass processing. Dumps are processed only once, even though this
means that we may not know the types of all properties when we first
need them: the script queries wikidata.org to find missing information.
This is only relevant when exporting statements.
(3) Limited language support. The script uses Wikidata's internal
language codes for string literals in RDF. In some cases, this might not
be correct. It would be great if somebody could create a mapping from
Wikidata language codes to BCP47 language codes (let me know if you
think you can do this, and I'll tell you where to put it)
(4) Limited site language support. To specify the language of linked
wiki sites, the script extracts a language code from the URL of the
site. Again, this might not be correct in all cases, and it would be
great if somebody had a proper mapping from Wikipedias/Wikivoyages to
language codes.
(5) Some data excluded. Data that cannot currently be edited is not
exported, even if it is found in the dumps. Examples include statement
ranks and timezones for time datavalues. I also currently exclude labels
and descriptions for simple English, formal German, and informal Dutch,
since these would pollute the label space for English, German, and Dutch
without adding much benefit (other than possibly for simple English
descriptions, I cannot see any case where these languages should ever
have different Wikidata texts at all).
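
The mappings asked for in (3) and (4) could be sketched roughly as
follows. This is only an illustration: the table entries are a few
Wikimedia codes that are known to differ from standard language tags,
the target tags should be checked against the IANA registry before use,
and the names to_bcp47 and site_language are hypothetical, not functions
of the wda toolkit. How the script itself extracts the code from the
site URL is not shown here; the host-name approach below is an
assumption.

```python
from urllib.parse import urlparse

# Illustrative entries only: Wikimedia language codes that differ from
# the corresponding BCP 47 tags. Not a complete or authoritative table.
WIKIMEDIA_TO_BCP47 = {
    "als": "gsw",
    "bat-smg": "sgs",
    "be-x-old": "be-tarask",
    "fiu-vro": "vro",
    "roa-rup": "rup",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}

# Site families whose host name starts with a language code (assumption).
SITE_FAMILIES = (".wikipedia.org", ".wikivoyage.org")


def to_bcp47(code):
    """Map a Wikimedia language code to a BCP 47 tag, defaulting to itself."""
    code = code.lower()
    return WIKIMEDIA_TO_BCP47.get(code, code)


def site_language(url):
    """Guess the language of a wiki site from its host name, or None."""
    host = (urlparse(url).hostname or "").lower()
    for family in SITE_FAMILIES:
        if host.endswith(family):
            return to_bcp47(host[: -len(family)])
    return None
```

For example, site_language("http://zh-yue.wikipedia.org/wiki/X") gives
"yue", while a URL outside the listed site families gives None rather
than a guessed code.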
Feedback is welcome.
Cheers,
Markus
[1] https://github.com/mkroetzsch/wda
     Run "python wda-export-data.py --help" for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF

Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events:
* NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, Extended 
Deadline: *July 18th*)
* LSWT 23/24 Sept, 2013 in Leipzig (http://aksw.org/lswt)
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
