On 12/08/13 17:07, Paul A. Houle wrote:
I have strong feelings in favor of one-line-per-fact. Large RDF data sets have validity problems, and the difficulty of convincing publishers that this matters suggests the situation will continue.
I hope that the Wikidata export that the script creates is actually valid, but please feel free to report a bug if you find any. Tomorrow I will update the online files I created, to make sure they are based on the latest code as well.
I’ve thought a bit about the problem of the “streaming converter from Turtle to N-Triples”. It’s true that this can be done in a streaming manner most of the time, but Turtle has a stack that can get arbitrarily deep, so strictly speaking you can’t say that memory consumption is bounded.
Yes, it is not a "streaming algorithm" in the formal sense, just a streaming-style algorithm. But in practice it would most likely work quite well.
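To illustrate what a streaming-style splitter could look like, here is a toy Python sketch (the function name `iter_statements` and the example document are my own invention, not part of any existing tool). It cuts a Turtle document into top-level statements by watching for the terminating '.' outside strings, IRIs, and brackets; memory use is bounded by the largest single statement, which nesting can make arbitrarily large, exactly as you say.

```python
def iter_statements(turtle_text):
    """Yield top-level Turtle statements one at a time (toy sketch).

    Scans for the terminating '.' outside string literals, IRIs, and
    [ ] / ( ) nesting. Memory use is bounded by the largest single
    statement, which nesting can make arbitrarily large, so this is
    streaming-style rather than formally streaming. Comments,
    escapes, and triple-quoted strings are ignored for brevity.
    """
    buf = []
    depth = 0
    in_string = in_iri = False
    for ch in turtle_text:
        buf.append(ch)
        if in_string:
            in_string = ch != '"'   # no \" handling in this toy version
        elif in_iri:
            in_iri = ch != '>'
        elif ch == '"':
            in_string = True
        elif ch == '<':
            in_iri = True
        elif ch in '[(':
            depth += 1
        elif ch in '])':
            depth -= 1
        elif ch == '.' and depth == 0:
            stmt = ''.join(buf).strip()
            if stmt != '.':
                yield stmt
            buf = []

doc = ('<http://ex.org/a> <http://ex.org/p> "x" .\n'
       '<http://ex.org/b> <http://ex.org/q> '
       '[ <http://ex.org/r> [ <http://ex.org/s> "deep" ] ] .')

for stmt in iter_statements(doc):
    print(stmt)
```

Each yielded statement can then be handed to a real Turtle parser on its own, which is what makes per-statement error recovery possible in the first place.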
It’s also very unclear to me how exactly to work around broken records and restart the parser in the general case. It’s not hard to mock up examples where a simple recovery mechanism works, but I dread the thought of developing one for commercial use, where I’d probably be playing whack-a-mole with edge cases for years.
The beauty of a "quirks mode" Turtle parser is that there are no requirements on it. If the Turtle is broken, then anything is better than rejecting it altogether. The state of the art seems to be to give up and return no triples, not even the ones that were well-formed higher up in the file (which would also help to locate the error ...). A first improvement would be to keep the finished triples that have been recognized so far. Then one can think about restart strategies. And of course, while one can contrive cases that seem confusing (at least to the human eye), most errors in real Turtle documents are missing escapes, missing terminators, or missing entities (unexpected terminators), one at a time. It seems we can do a fairly reasonable recovery in each case (and there is always "Too many errors, giving up").
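To make the "quirks mode" idea concrete, here is a minimal Python sketch (the function `parse_quirks` and its regex are hypothetical simplifications handling only IRIs and plain literals, one statement per line). It keeps every well-formed triple, records the line numbers of broken ones, and only gives up after a configurable error budget:

```python
import re

# Rough one-line N-Triples pattern: subject, predicate, object, final '.'.
# Hypothetical simplification: IRIs and unescaped plain literals only.
TRIPLE = re.compile(
    r'^(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>|"[^"]*")\s*\.\s*$')

def parse_quirks(lines, max_errors=100):
    """Keep every well-formed triple and record the position of every
    broken one, instead of rejecting the whole file on the first error."""
    triples, errors = [], []
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        m = TRIPLE.match(line)
        if m:
            triples.append(m.groups())
        else:
            errors.append((lineno, line))
            if len(errors) >= max_errors:
                break   # "Too many errors, giving up"
    return triples, errors

data = [
    '<http://ex.org/a> <http://ex.org/p> "good" .',
    '<http://ex.org/b> <http://ex.org/p> "missing terminator"',  # broken
    '<http://ex.org/c> <http://ex.org/p> <http://ex.org/d> .',
]
triples, errors = parse_quirks(data)
print(len(triples), "triples kept;", "errors at lines",
      [n for n, _ in errors])
```

Real Turtle needs the statement splitter from above rather than line splitting, but the recovery policy (keep what parsed, report where it broke, bounded error budget) stays the same.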
... (I have nothing to say about compression)
I am very much against blank nodes for ‘wiki-ish’ data that is shared between systems. The fact that Freebase reifies “blank nodes” as CVTs means that we can talk about them on the outside, reason about them, and then name them in order to interact with them on the live Freebase system. By their nature, blank nodes defy the “anyone, anything, anywhere” concept because they can’t be referred to. In the case of OWL that’s a feature, not a bug, because it lets you really close the world: nobody can add anything to a Lisp-style list without introducing a new node. Outside of tiny T-Boxes (say, SUMO size), internal DSLs like SPIN, or expressing algebraic sorts of things (e.g., describing the mixed eigenstates of quarks in some hadron), the mainstream of linked data doesn’t use them.
I think I agree. I also create proper URIs for all auxiliary objects and confine bnodes to OWL axioms, which are always small standard patterns that express one axiom.
Personally I’d like to see the data published in quad form, with the reification data expressed in the context field. As much as possible, the things in the (?s ?p ?o) fields should make sense as facts. Ideally you could reuse one ?c node for a lot of facts, for instance when a number of them came in one transaction. You could ask for the ?c fields (show me all estimates for the population of Berlin from 2000 to the present and who made them), or you could go through the file of facts and pick the ?c’s that provide the point of view that you need the system to have.
This has been discussed before, and the decision at the time was to implement the reification-based format first and a named-graph-based format later. So hopefully we will have both at some point. For now, the RDF export is still experimental and has no concrete uses (or planned uses that I know of) yet, so the priority of extensions/changes in the grand scheme of things is rather low. This might change, of course.
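For what it's worth, a rough Python sketch of the quad idea (all URIs, numbers, and property names below are entirely made up for illustration): each fact carries a context field, one context node is shared by all facts from the same "transaction", and the provenance hangs off the context node as ordinary attribute data.

```python
# Facts as (subject, predicate, object, context) quads.
# All identifiers and values here are invented for illustration.
facts = [
    ("ex:Berlin", "ex:population", 3382169, "ctx:1"),
    ("ex:Berlin", "ex:population", 3501872, "ctx:2"),
    ("ex:Berlin", "ex:mayor", "ex:KlausWowereit", "ctx:2"),
]

# One ?c node is reused for every fact from the same transaction;
# the provenance is attached to the context node, not to each fact.
contexts = {
    "ctx:1": {"ex:pointInTime": 2000, "ex:source": "ex:CensusOffice"},
    "ctx:2": {"ex:pointInTime": 2011, "ex:source": "ex:CensusOffice"},
}

# "Show me all estimates for the population of Berlin and who made them."
for s, p, o, c in facts:
    if s == "ex:Berlin" and p == "ex:population":
        meta = contexts[c]
        print(o, meta["ex:pointInTime"], meta["ex:source"])
```

The same query against the reification-based export would have to join four rdf:Statement triples per fact first, which is the practical difference between the two formats.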
Cheers,
Markus