On 26-08-2016 16:58, Stas Malyshev wrote:
Hi!
I think in terms of the dump, /replacing/ the Turtle dump with the N-Triples dump would be a good option. (Not sure if that's what you were suggesting?)
No, I'm suggesting having both. Turtle is easier to comprehend and also more compact for download, etc. (though I didn't check how big the difference is - compressed it may not be that big).
I would argue that human readability is not so important for a dump. It matters for dereferenced documents, sure, but less so for a bulk download.
Also I'd expect that when [G|B]Zipped, the difference would not justify having both (my guess is that the compressed N-Triples file should end up within +25% of the size of the compressed Turtle file, but that's purely a guess; obviously worth trying it to see!).
But yep, I get both points.
Since every N-Triples file is also valid Turtle, there seems little pressing need to have both: existing tools expecting Turtle shouldn't have a problem with N-Triples.
That depends on whether these tools actually understand RDF - some might be more simplistic (with text-based formats, you can achieve a lot even with dumber tools). But that definitely might be an option too. I'm not sure if it's the best one, but it's a possibility, so we may discuss it too.
I'd imagine that anyone processing Turtle would be using a full-fledged Turtle parser? A dumb tool would have to be pretty smart to do anything useful with Turtle, I think, and it would not seem wise to parse Turtle's precise syntax with ad hoc text processing; see the sketch below. But you never know [1]. :)
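For illustration, here is a sketch of the same two triples in each syntax (the labels for Q42 are real, but the formatting is mine and not meant to match the actual dump output):

  # Turtle: prefix declarations plus multi-line predicate/object lists
  @prefix wd: <http://www.wikidata.org/entity/> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  wd:Q42 rdfs:label "Douglas Adams"@en ,
      "Douglas Adams"@de .

  # N-Triples: one self-contained triple per line, with full IRIs
  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .
  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@de .

A line-oriented tool (grep, sort, split, etc.) can process the N-Triples directly, whereas handling the Turtle means tracking prefixes and statements that continue across lines.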
Of course if providing both is easy, then there's no reason not to provide both.
(Also, just to put the idea out there: perhaps (also) have N-Quads, where the fourth element indicates the document from which the RDF graph can be dereferenced. This can be useful for a tool that, e.g., just wants to know which online documents a given triple can be found in, without crawling those documents individually.)
What do you mean by "document" - like an entity? That may be a problem since some data - like references and values, or property definitions - can be used by more than one entity. So it's not that trivial to extract all the data regarding one entity from the dump. You can do it via export, e.g.: http://www.wikidata.org/entity/Q42?flavor=full - but that doesn't extract it from the dump, it just generates it.
If it's problematic, then for sure it can be skipped as a feature. I'm mainly just floating the idea.
Perhaps to motivate the feature briefly: for a while we worked a lot on a search engine over RDF data ingested from the open Web. Since we were ingesting data from the Web, treating everything as one giant RDF graph was not an option: we needed to keep track of which RDF triples came from which Web documents, for a variety of reasons. This simple notion of provenance was easy to maintain when we crawled the individual documents ourselves, because we knew which documents we were taking triples from. But we could rarely, if ever, use dumps, because they did not provide such information.
In this view, Wikidata is a website publishing RDF like any other.
It is useful in such applications to know the online RDF documents in which a triple can be found. The document could be the entity IRI itself, or it could be a physical location like:
http://www.wikidata.org/entity/Q13794921.ttl
Mainly it needs to be an IRI that can be resolved by HTTP to a document containing the triple. Ideally the quads would also cover all triples in that document. Even more ideally, the dumps would somehow cover all the information that could be obtained from crawling the RDF documents on Wikidata, including all HTTP redirects, and so forth.
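To make this concrete, a single such N-Quad might look as follows (a hypothetical sketch: the fourth element here follows the .ttl document pattern above, but the actual document-naming scheme would be whatever Wikidata settles on):

  # triple plus the IRI of the document it can be dereferenced from
  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en <http://www.wikidata.org/entity/Q42.ttl> .

A consumer could then group quads on the fourth element to reconstruct the per-document graphs, or dereference the document IRI to verify a triple at its source.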
At the same time, I understand this is not a priority and there's probably no immediate need for N-Quads or for publishing redirects. The need for this is rather abstract at the moment, so perhaps it is best left until it becomes more concrete.
tl;dr: N-Triples or N-Triples + Turtle sounds good. N-Quads would be a bonus if easy to do.
Best, Aidan
[1] http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtm...