On 12/08/13 17:56, Nicolas Torzec wrote:
With respect to the RDF export I'd advocate for:
- an RDF format with one fact per line.
- the use of a mature/proven RDF generation framework.
Optimizing too early based on a limited and/or biased view of the potential use cases may not be a good idea in the long run. I'd rather keep it simple and standard at the data publishing level, and let consumers access the data easily and optimize processing to their needs.
RDF has several official, standardised syntaxes, and one of them is Turtle. Using it is not a form of optimisation, just a choice of syntax. Every tool I have ever used for serious RDF work (triple stores, libraries, even OWL tools) supports any of the standard RDF syntaxes *just as well*. I do see that each format has advantages in some respects and disadvantages in others (I agree with most of the arguments that have been put forward). But would it not be better to first take a look at the actual content rather than debating the syntactic formatting now? As I said, this is not the final syntax anyway; the final exports will be created with different code in a different programming language.
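To make this concrete, here is a minimal sketch (Python with rdflib; the item and label are only for illustration) that parses a tiny Turtle document and re-serializes the very same graph as N-Triples, one fact per line:

from rdflib import Graph

turtle_doc = """
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
wd:Q42 rdfs:label "Douglas Adams"@en .
"""

g = Graph()
g.parse(data=turtle_doc, format="turtle")

# The same graph, serialized as N-Triples: no prefixes, one fact per
# line, trivially splittable for line-oriented processing.
print(g.serialize(format="nt"))

Any standards-compliant tool reads both versions identically; the choice of syntax does not change the data.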
Also, I should not have to run a preprocessing step to filter out the pieces of data that do not follow the standard…
To the best of our knowledge, there are no such pieces in the current dump. We should try to keep this conversation somewhat related to the actual Wikidata dump that is created by the current version of the Python script on GitHub (I will also upload a dump again tomorrow; currently, you can only get the dump by running the script yourself). I know I suggested that one could parse Turtle in a robust way (which I still think one can), but I am not suggesting for a moment that this should be necessary for using Wikidata dumps in the future. I am committed to fixing any error as it is found, but so far I have not received much input in that direction.
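For what it's worth, the kind of check I have in mind is no more than letting a standard parser read the whole file and report the first problem it hits; a rough sketch (Python with rdflib; the file name is just a placeholder):

from rdflib import Graph

g = Graph()
try:
    g.parse("wikidata-dump.ttl", format="turtle")
    print(f"Dump parsed cleanly: {len(g)} triples.")
except Exception as err:  # rdflib raises parser-specific errors
    print(f"Found non-standard data: {err}")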
Note that I also understand the need for a format that groups all the facts about a subject into one record and serializes them one record per line. It sometimes makes life easier for bulk processing of large datasets. But that's a different discussion.
As I said: advantages and disadvantages. This is why we will probably have all desired formats at some point. But someone needs to start somewhere.
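To illustrate the record-per-line idea, here is a quick sketch (Python with rdflib; the file names and the JSON layout are made up for illustration) that groups all triples by subject and writes one JSON object per line:

import json
from collections import defaultdict

from rdflib import Graph

g = Graph()
g.parse("wikidata-dump.ttl", format="turtle")  # placeholder file name

# Group all facts about each subject into a single record.
records = defaultdict(lambda: defaultdict(list))
for s, p, o in g:
    records[str(s)][str(p)].append(str(o))

# One record per line: easy to split, stream, and bulk-process.
with open("wikidata-records.jsonl", "w") as out:
    for subject, props in records.items():
        out.write(json.dumps({"subject": subject, "properties": props}) + "\n")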
Markus
-- Nicolas Torzec.
On 8/12/13 1:49 AM, "Markus Krötzsch" <markus@semantic-mediawiki.org> wrote:
On 11/08/13 22:29, Tom Morris wrote:
On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it.
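To give an idea of what the reification looks like, here is a sketch (Python with rdflib; the property IRIs, the value item, and the qualifier are invented placeholders, not the vocabulary the script actually uses):

from rdflib import BNode, Graph, Literal, Namespace

WD = Namespace("http://www.wikidata.org/entity/")
EX = Namespace("http://example.org/vocab#")  # placeholder vocabulary

g = Graph()
stmt = BNode()  # auxiliary node standing for one Wikidata statement
g.add((WD.Q42, EX.someProperty, stmt))        # item -> statement node
g.add((stmt, EX.value, WD.Q12345))            # the statement's main value (illustrative item)
g.add((stmt, EX.startDate, Literal("2010")))  # a qualifier on the statement

print(g.serialize(format="nt"))

A single statement with one qualifier already becomes three triples, which is why simple one-triple-per-fact processing breaks down.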
A single fact per line seems like a pretty convenient format to me. What format do you recommend that's easier to process?
I'd suggest some custom format that at least keeps each single data value on one line. For example, in RDF, you have to do two joins to find all items that have a property with a date in the year 2010. Even with a line-by-line format, you will not be able to grep for this. So I think a less normalised representation would be nicer for direct text-based processing; I would probably prefer a format where one statement is encoded on one line. But it really depends on what you want to do. Maybe you could also remove some data to obtain something that is easier to process.
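A sketch of those two joins (Python with rdflib; the dump file name is a placeholder, and it assumes dates are stored as literals starting with the year, which may not match the actual encoding):

from rdflib import Graph, Literal

g = Graph()
g.parse("wikidata-dump.ttl", format="turtle")  # placeholder file name

# Join 1: statement/value nodes carrying a date literal from 2010.
stmts_2010 = {s for s, p, o in g
              if isinstance(o, Literal) and str(o).startswith("2010")}

# Join 2: items that point at one of those nodes.
items = {s for s, p, o in g if o in stmts_2010}
print(f"{len(items)} items have a statement with a date in 2010")

In a format with one complete statement per line, the same question would be a single pass over the file, or even a grep.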
Markus
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l