On 26-08-2016 16:58, Stas Malyshev wrote:
Hi!
I think in terms of the dump, /replacing/ the Turtle dump with the N-Triples dump would be a good option. (Not sure if that's what you were suggesting?)
No, I'm suggesting having both. Turtle is easier to comprehend and also more compact for download, etc. (though I didn't check how big the difference is - compressed it may not be that big).
I would argue that human readability is not so important for a dump. It matters for dereferenced documents, sure, but less so for a bulk download.
Also I'd expect that when [G|B]Zipped, the difference would not justify having both (my guess is that the compressed N-Triples file should end up within +25% of the size of the compressed Turtle file, but that's purely a guess; obviously worth trying it to see!).
But yep, I get both points.
Since every N-Triples file is also valid Turtle, there seems little pressing need to have both: existing tools expecting Turtle shouldn't have a problem with N-Triples.
That depends on whether these tools actually understand RDF - some might be more simplistic (with text-based formats, you can achieve a lot even with dumber tools). But that definitely might be an option too. I'm not sure if it's the best one, but it's a possibility, so we may discuss it too.
I'd imagine that anyone processing Turtle would be using a full-fledged Turtle parser? A dumb tool would have to be pretty smart to do anything useful with Turtle, I think, and it would not seem wise to parse Turtle's precise syntax with ad hoc text processing; see the sketch below. But you never know [1]. :)
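For illustration, here is a sketch of the same two triples in each syntax (the labels for Q42 are real, but the formatting is mine and not meant to match the actual dump output):

  # Turtle: prefix declarations plus multi-line predicate/object lists
  @prefix wd: <http://www.wikidata.org/entity/> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  wd:Q42 rdfs:label "Douglas Adams"@en ,
      "Douglas Adams"@de .

  # N-Triples: one self-contained triple per line, with full IRIs
  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .
  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@de .

A line-oriented tool (grep, sort, split, etc.) can process the N-Triples directly, whereas handling the Turtle means tracking prefixes and statements that continue across lines.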
Of course if providing both is easy, then there's no reason not to provide both.
(Also, just to put the idea out there: perhaps (also) have N-Quads, where the fourth element indicates the document from which the RDF graph can be dereferenced. This can be useful for a tool that, e.g., just wants to know which online documents a given triple can be found in, without crawling those documents individually.)
What do you mean by "document" - like an entity? That may be a problem since some data - like references and values, or property definitions - can be used by more than one entity. So it's not that trivial to extract all the data regarding one entity from the dump. You can do it via export, e.g.: http://www.wikidata.org/entity/Q42?flavor=full - but that doesn't extract it from the dump, it just generates it.
If it's problematic, then for sure it can be skipped as a feature. I'm mainly just floating the idea.
Perhaps to motivate the feature briefly: for a while we worked a lot on a search engine over RDF data ingested from the open Web. Since we were ingesting data from the Web, treating everything as one giant RDF graph was not an option: we needed to keep track of which RDF triples came from which Web documents, for a variety of reasons. This simple notion of provenance was easy to maintain when we crawled the individual documents ourselves, because we knew which documents we were taking triples from. But we could rarely, if ever, use dumps, because they did not provide such information.
In this view, Wikidata is a website publishing RDF like any other.
It is useful in such applications to know the online RDF documents in which a triple can be found. The document could be the entity IRI itself, or it could be a physical location like:
http://www.wikidata.org/entity/Q13794921.ttl
Mainly it needs to be an IRI that can be resolved by HTTP to a document containing the triple. Ideally the quads would also cover all triples in that document. Even more ideally, the dumps would somehow cover all the information that could be obtained from crawling the RDF documents on Wikidata, including all HTTP redirects, and so forth.
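To make this concrete, a single such N-Quad might look as follows (a hypothetical sketch: the fourth element here follows the .ttl document pattern above, but the actual document-naming scheme would be whatever Wikidata settles on):

  # triple plus the IRI of the document it can be dereferenced from
  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en <http://www.wikidata.org/entity/Q42.ttl> .

A consumer could then group quads on the fourth element to reconstruct the per-document graphs, or dereference the document IRI to verify a triple at its source.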
At the same time, I understand this is not a priority and there's probably no immediate need for N-Quads or for publishing redirects. The need for this is rather abstract at the moment, so perhaps it is best left until it becomes more concrete.
tl;dr: N-Triples or N-Triples + Turtle sounds good. N-Quads would be a bonus if easy to do.
Best, Aidan
[1] http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtm...