Hi!
I was thinking recently about various data processing scenarios in Wikidata, and I think there's one case we don't have good coverage for.
TLDR: One of the things we could do to make it easier to work with the data is to make an ntriples (line-based) RDF dump format available.
If you need to process a lot of data (like all enwiki sitelinks, etc.), the Query Service is not very efficient, due to its limits and the sheer volume of data. We could increase the limits, but not by much - I don't think we can allow a 30-minute processing task to hog the service's resources all to itself. We have some ways to mitigate this in theory, but in practice they'll take time to be implemented and deployed.
The other approach would be dump processing, which would work in most scenarios. The problem is that we have two forms of dump right now - JSON and TTL (Turtle) - and neither is easy to process without tools that have a deep understanding of the format. For JSON, we have Wikidata Toolkit, but it can't ingest RDF/Turtle, and it also has some entry barrier to getting everything running even when the operation that needs to be done is trivial.
So I was thinking - what if we also had an ntriples RDF dump? The difference between ntriples and Turtle is that ntriples is line-based and fully expanded, which means every line can be understood on its own without any context. This makes it possible to process the dump with the most basic text processing tools, or with any software that can read a line of text and apply a regexp to it. The downside of ntriples is that it's really verbose, but compression will take care of most of that, and storing another 10-15G or so should not be a huge deal. Also, the current code already knows how to generate an ntriples dump (in fact, almost all unit tests use this format internally) - we just need to create a job that actually generates it.
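For example, pulling all enwiki sitelinks out of such a dump could be as simple as this (a rough, untested sketch - the dump filename is made up, and it assumes sitelinks show up as schema:about triples with the Wikipedia page URL as the subject):

  import gzip
  import re

  # Match lines like:
  # <https://en.wikipedia.org/wiki/X> <http://schema.org/about> <http://www.wikidata.org/entity/Q42> .
  pattern = re.compile(
      r'^<(https://en\.wikipedia\.org/wiki/[^>]+)> '
      r'<http://schema\.org/about> '
      r'<(http://www\.wikidata\.org/entity/Q\d+)> \.'
  )

  with gzip.open('wikidata-all.nt.gz', 'rt', encoding='utf-8') as dump:
      for line in dump:
          m = pattern.match(line)
          if m:
              sitelink, entity = m.groups()
              print(entity, sitelink)

No RDF library needed - which is the whole point.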
Of course, with the right tools you can generate an ntriples dump from either the Turtle one or the JSON one (Wikidata Toolkit can do the latter, IIRC), but that's one more moving part, which makes things harder and introduces potential for inconsistencies and surprises.
So, what do you think - would having an ntriples RDF dump for Wikidata help things?
Hi Stas,
I think in terms of the dump, /replacing/ the Turtle dump with the N-Triples dump would be a good option. (Not sure if that's what you were suggesting?)
As you already mentioned, N-Triples is easier to process with typical unix command-line tools and scripts, etc. But also any (RDF 1.1) N-Triples file should be valid Turtle, so I don't see a convincing need to have both: existing tools expecting Turtle shouldn't have a problem with N-Triples.
(Also just to put the idea out there of perhaps (also) having N-Quads where the fourth element indicates the document from which the RDF graph can be dereferenced. This can be useful for a tool that, e.g., just wants to quickly refresh a single graph from the dump, or more generally that wants to keep track of a simple and quick notion of provenance: "this triple was found in this Web document".)
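(Just to sketch what that would look like on the wire - the fourth element here is purely illustrative, since picking the right document IRI is exactly the open question:

  <http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en <https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl> .

i.e., N-Triples plus one extra IRI per line naming a Web document from which the triple can be re-fetched.)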
Cheers, Aidan
Hi!
> I think in terms of the dump, /replacing/ the Turtle dump with the N-Triples dump would be a good option. (Not sure if that's what you were suggesting?)
No, I'm suggesting having both. Turtle is easier to comprehend and also more compact for download, etc. (though I didn't check how big the difference is - compressed it may not be that big).
> ... to have both: existing tools expecting Turtle shouldn't have a problem with N-Triples.
That depends on whether these tools actually understand RDF - some might be more simplistic (with text-based formats, you can achieve a lot even with dumber tools). But that definitely might be an option too. I'm not sure if it's the best one, but it's a possibility, so we may discuss it too.
> (Also just to put the idea out there of perhaps (also) having N-Quads where the fourth element indicates the document from which the RDF graph can be dereferenced. This can be useful for a tool that, e.g., just ...
What do you mean by "document" - like an entity? That may be a problem since some data - like references and values, or property definitions - can be used by more than one entity. So it's not that trivial to extract all data regarding one entity from the dump. You can do it via export, e.g.: http://www.wikidata.org/entity/Q42?flavor=full - but that doesn't extract it, it just generates it.
On 26-08-2016 16:58, Stas Malyshev wrote:
> Hi!
>> I think in terms of the dump, /replacing/ the Turtle dump with the N-Triples dump would be a good option. (Not sure if that's what you were suggesting?)
> No, I'm suggesting having both. Turtle is easier to comprehend and also more compact for download, etc. (though I didn't check how big the difference is - compressed it may not be that big).
I would argue that human readability is not so important for a dump? For dereferenced documents sure, but less so for a dump perhaps.
Also, I'd expect that when [G|B]Zipped the difference would not justify having both (my guess is that the compressed N-Triples file should end up within +25% of the size of the compressed Turtle file, but that's purely a guess; obviously worth trying it to see!).
But yep, I get both points.
>> ... to have both: existing tools expecting Turtle shouldn't have a problem with N-Triples.
> That depends on whether these tools actually understand RDF - some might be more simplistic (with text-based formats, you can achieve a lot even with dumber tools). But that definitely might be an option too. I'm not sure if it's the best one, but it's a possibility, so we may discuss it too.
I'd imagine that anyone processing Turtle would be using a full-fledged Turtle parser? A dumb tool would have to be pretty smart to do anything useful with the Turtle I think. And it would not seem wise to parse the precise syntax of Turtle that way. But you never know [1]. :)
Of course if providing both is easy, then there's no reason not to provide both.
>> (Also just to put the idea out there of perhaps (also) having N-Quads where the fourth element indicates the document from which the RDF graph can be dereferenced. This can be useful for a tool that, e.g., just ...
> What do you mean by "document" - like an entity? That may be a problem since some data - like references and values, or property definitions - can be used by more than one entity. So it's not that trivial to extract all data regarding one entity from the dump. You can do it via export, e.g.: http://www.wikidata.org/entity/Q42?flavor=full - but that doesn't extract it, it just generates it.
If it's problematic, then for sure it can be skipped as a feature. I'm mainly just floating the idea.
Perhaps to motivate the feature briefly: for a while we worked a lot on a search engine over RDF data ingested from the open Web. Since we were ingesting data from the Web, treating it as one giant RDF graph was not a possibility: we needed to keep track of which RDF triples came from which Web documents, for a variety of reasons. This simple notion of provenance was easy to maintain when we crawled the individual documents ourselves, because we knew which documents we were taking triples from. But we could rarely, if ever, use dumps, because they did not provide such information.
In this view, Wikidata is a website publishing RDF like any other.
It is useful in such applications to know the online RDF documents in which a triple can be found. The document could be the entity, or it could be a physical location like:
http://www.wikidata.org/entity/Q13794921.ttl
Mainly it needs to be an IRI that can be resolved by HTTP to a document containing the triple. Ideally the quads would also cover all triples in that document. Even more ideally, the dumps would somehow cover all the information that could be obtained from crawling the RDF documents on Wikidata, including all HTTP redirects, and so forth.
At the same time, I understand this is not a priority and there's probably no immediate need for N-Quads or publishing redirects. The need for this is rather abstract at the moment, so it's perhaps best left until it becomes more concrete.
tl;dr: N-Triples or N-Triples + Turtle sounds good. N-Quads would be a bonus if easy to do.
Best, Aidan
[1] http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtm...
Hi!
> Of course if providing both is easy, then there's no reason not to provide both.
Technically it's quite easy - you just run the same script with different options. So the only question is what is useful.
> It is useful in such applications to know the online RDF documents in which a triple can be found. The document could be the entity, or it could be a physical location like: ...
That's where the tricky part is: many triples won't have a specific document there, since they may appear in many documents. Of course, if you merge all these documents into a dump, the triple would appear only once (we have special deduplication code to take care of that), but then it's impossible to track it back to a specific document. So I understand the idea, and see how it may be useful, but I don't see a real way to implement it now.
Hello Stas,
+1 for .nt RDF dump of WD due to (as you also said) easier processing!
Regards, Fariz
Hi Stas,
Out of curiosity, can you give an example of triples that do not originate from a single Wikidata item / property?
For me, Turtle dumps are processable only by RDF tools, while nt-like dumps can be processed both by RDF tools and by other kinds of scripts, so I find the former redundant.
I also support the creation of .nt dumps as they are far easier to process than .ttl ones.
If compressed .nt dumps are less than 20% bigger than .ttl ones, I don't see the point of keeping the .ttl dumps, as N-Triples files can be parsed with a Turtle parser.
Having provenance information as suggested by Aidan is definitely a good idea. For triples shared by multiple pages (probably only the descriptions of complex data values, because property descriptions could have as their context the document describing the property), there are two possible ways:
- use as context the document of each entity using the value. This would lead to as many quads as there are entities using the value.
- use a "special" context. As the description of a value should not change, there is no need to be able to retrieve new content about it in the future. But this leads to the creation of a context IRI that is probably not dereferenceable.
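To make the two options concrete, a shared date value could look something like this (prefixes used for brevity and the value hash invented, so this is only an illustration; reusing the value node IRI as the "special" context is just one possible choice):

  Option 1 - one quad per entity using the value:
    wdv:abc123 wikibase:timeValue "1952-03-11T00:00:00Z"^^xsd:dateTime wd:Q42 .
    wdv:abc123 wikibase:timeValue "1952-03-11T00:00:00Z"^^xsd:dateTime wd:Q12345 .

  Option 2 - a single "special" context:
    wdv:abc123 wikibase:timeValue "1952-03-11T00:00:00Z"^^xsd:dateTime wdv:abc123 .

The first repeats shared values but keeps every context dereferenceable; the second stays compact but its context would not resolve to a document.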
Cheers,
Thomas
Hi!
> Out of curiosity, can you give an example of triples that do not originate from a single Wikidata item / property?
All references and values can be shared between items. E.g. if two items refer to the same date, they will refer to the same value node. The same goes for references with the same properties - e.g. a reference URL pointing to the same address. These nodes do not have their own documents - in Wikibase and Wikidata it's not possible to address individual values/references - but they are also not linked to a single entity.
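To illustrate (prefixes are used for brevity and the item/statement IDs and value hash are made up - the real dump spells out full IRIs):

  wds:Q123-statement-1 psv:P569 wdv:abc123 .
  wds:Q456-statement-2 psv:P569 wdv:abc123 .
  wdv:abc123 wikibase:timeValue "1952-03-11T00:00:00Z"^^xsd:dateTime .

Both statements point at the same wdv: node, so there is no single entity document you could call "the" document for that node.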
On 27/08/16 10:56, Markus Kroetzsch wrote:
> On 26.08.2016 22:32, Aidan Hogan wrote: ...
>> tl;dr: N-Triples or N-Triples + Turtle sounds good. N-Quads would be a bonus if easy to do.
> +1 to all of this
> Best,
> Markus
Also, if we are adding new dump formats, it might be worth considering better compression as well, particularly for fully expanded formats like n-triples. Would, for example, .7z compression give significantly better results than .bz2 on this data?
Neil
Hi!
Looks like the feedback on the idea has been positive (thanks to everybody who participated!), so I've made a task to track it:
https://phabricator.wikimedia.org/T144103