On 12/08/13 17:07, Paul A. Houle wrote:
I have strong feelings in favor of one-line-per-fact. Large RDF data sets have validity problems, and the difficulty of convincing publishers that this matters suggests the situation will continue.
I hope that the Wikidata export that the script creates is actually valid, but please feel free to report a bug if you find any. Tomorrow I will update the online files I created, to make sure they are based on the latest code as well.
I’ve thought a bit about the problem of the “streaming converter from Turtle to N-Triples”. It’s true that this can be done in a streaming manner most of the time, but Turtle has a stack that can get arbitrarily deep, so strictly speaking you can’t say that memory consumption is bounded.
Yes, it is not a "streaming algorithm" in the formal sense, just a streaming-style algorithm. But in practice it would most likely work quite well.
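To illustrate what a streaming-style splitter could look like, here is a toy Python sketch (the function name `iter_statements` and the example document are my own invention, not part of any existing tool). It cuts a Turtle document into top-level statements by watching for the terminating '.' outside strings, IRIs, and brackets; memory use is bounded by the largest single statement, which nesting can make arbitrarily large, exactly as you say.

```python
def iter_statements(turtle_text):
    """Yield top-level Turtle statements one at a time (toy sketch).

    Scans for the terminating '.' outside string literals, IRIs, and
    [ ] / ( ) nesting. Memory use is bounded by the largest single
    statement, which nesting can make arbitrarily large, so this is
    streaming-style rather than formally streaming. Comments,
    escapes, and triple-quoted strings are ignored for brevity.
    """
    buf = []
    depth = 0
    in_string = in_iri = False
    for ch in turtle_text:
        buf.append(ch)
        if in_string:
            in_string = ch != '"'   # no \" handling in this toy version
        elif in_iri:
            in_iri = ch != '>'
        elif ch == '"':
            in_string = True
        elif ch == '<':
            in_iri = True
        elif ch in '[(':
            depth += 1
        elif ch in '])':
            depth -= 1
        elif ch == '.' and depth == 0:
            stmt = ''.join(buf).strip()
            if stmt != '.':
                yield stmt
            buf = []

doc = ('<http://ex.org/a> <http://ex.org/p> "x" .\n'
       '<http://ex.org/b> <http://ex.org/q> '
       '[ <http://ex.org/r> [ <http://ex.org/s> "deep" ] ] .')

for stmt in iter_statements(doc):
    print(stmt)
```

Each yielded statement can then be handed to a real Turtle parser on its own, which is what makes per-statement error recovery possible in the first place.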
It’s also very unclear to me how exactly to work around broken records and restart the parser in the general case. It’s not hard to mock up examples where a simple recovery mechanism works, but I dread the thought of developing one for commercial use, where I’d probably be playing whack-a-mole with edge cases for years.
The beauty of a "quirks mode" Turtle parser is that there are no requirements on it. If the Turtle is broken, then anything is better than rejecting it altogether. The state of the art seems to be to give up and return no triples, not even the ones that were well-formed higher up in the file (which would also help to locate the error ...). A first improvement would be to keep the finished triples that have been recognized so far. Then one can think about restart strategies. And of course, while one can contrive cases that seem confusing (at least to the human eye), most errors in real Turtle documents are missing escapes, missing terminators, or missing entities (unexpected terminators), one at a time. It seems we can do a fairly reasonable recovery in each case (and there is always "Too many errors, giving up").
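To make the "quirks mode" idea concrete, here is a minimal Python sketch (the function `parse_quirks` and its regex are hypothetical simplifications handling only IRIs and plain literals, one statement per line). It keeps every well-formed triple, records the line numbers of broken ones, and only gives up after a configurable error budget:

```python
import re

# Rough one-line N-Triples pattern: subject, predicate, object, final '.'.
# Hypothetical simplification: IRIs and unescaped plain literals only.
TRIPLE = re.compile(
    r'^(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>|"[^"]*")\s*\.\s*$')

def parse_quirks(lines, max_errors=100):
    """Keep every well-formed triple and record the position of every
    broken one, instead of rejecting the whole file on the first error."""
    triples, errors = [], []
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        m = TRIPLE.match(line)
        if m:
            triples.append(m.groups())
        else:
            errors.append((lineno, line))
            if len(errors) >= max_errors:
                break   # "Too many errors, giving up"
    return triples, errors

data = [
    '<http://ex.org/a> <http://ex.org/p> "good" .',
    '<http://ex.org/b> <http://ex.org/p> "missing terminator"',  # broken
    '<http://ex.org/c> <http://ex.org/p> <http://ex.org/d> .',
]
triples, errors = parse_quirks(data)
print(len(triples), "triples kept;", "errors at lines",
      [n for n, _ in errors])
```

Real Turtle needs the statement splitter from above rather than line splitting, but the recovery policy (keep what parsed, report where it broke, bounded error budget) stays the same.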
... (I have nothing to say about compression)
I am very much against blank nodes for ‘wiki-ish’ data that is shared between systems. The fact that Freebase reifies “blank nodes” as CVTs means that we can talk about them on the outside, reason about them, and then name them in order to interact with them on the live Freebase system. By their nature, blank nodes defy the “anyone, anything, anywhere” concept because they can’t be referred to. In the case of OWL that’s a feature, not a bug, because it lets you really close the world: nobody can add anything to a Lisp-style list without introducing a new node. Outside of tiny T-Boxes (say, SUMO size), internal DSLs like SPIN, or expressing algebraic sorts of things (e.g., describing the mixed eigenstates of quarks in some hadron), the mainstream of linked data doesn’t use them.
I think I agree. I also create proper URIs for all auxiliary objects and confine bnodes to OWL axioms, which are always small standard patterns that express one axiom.
Personally I’d like to see the data published in quad form, with the reification data expressed in the context field. As much as possible, the things in the (?s ?p ?o) fields should make sense as facts. Ideally you could reuse one ?c node for a lot of facts, for instance when a number of them came in one transaction. You could ask for the ?c fields (show me all estimates for the population of Berlin from 2000 to the present and who made them), or you could go through the file of facts and pick the ?c’s that provide the point of view that you need the system to have.
This has been discussed before, and the decision at the time was to implement the reification-based format first and a named-graph-based format later. So hopefully we will have both at some point. For now, the RDF export is still experimental and has no concrete uses (or planned uses that I know of) yet, so the priority of extensions/changes in the grand scheme of things is rather low. This might change, of course.
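For what it's worth, a rough Python sketch of the quad idea (all URIs, numbers, and property names below are entirely made up for illustration): each fact carries a context field, one context node is shared by all facts from the same "transaction", and the provenance hangs off the context node as ordinary attribute data.

```python
# Facts as (subject, predicate, object, context) quads.
# All identifiers and values here are invented for illustration.
facts = [
    ("ex:Berlin", "ex:population", 3382169, "ctx:1"),
    ("ex:Berlin", "ex:population", 3501872, "ctx:2"),
    ("ex:Berlin", "ex:mayor", "ex:KlausWowereit", "ctx:2"),
]

# One ?c node is reused for every fact from the same transaction;
# the provenance is attached to the context node, not to each fact.
contexts = {
    "ctx:1": {"ex:pointInTime": 2000, "ex:source": "ex:CensusOffice"},
    "ctx:2": {"ex:pointInTime": 2011, "ex:source": "ex:CensusOffice"},
}

# "Show me all estimates for the population of Berlin and who made them."
for s, p, o, c in facts:
    if s == "ex:Berlin" and p == "ex:population":
        meta = contexts[c]
        print(o, meta["ex:pointInTime"], meta["ex:source"])
```

The same query against the reification-based export would have to join four rdf:Statement triples per fact first, which is the practical difference between the two formats.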
Cheers,
Markus