Hoi, Indeed $70 buys 2 TB nowadays, but who is the public? Have you done a breakdown of how many people have that much spare capacity for this?
Is Wikidata really for institutional users, or is it for a community? My biggest criticism of Wikidata is that it is severely underfunded and that its users get the short end of the stick. Wikidata would be much more active and relevant if its users and their needs received proper attention, rather than only student and community projects to serve its needs. Thanks, GerardM
On 23 December 2016 at 01:40, Ruben Verborgh <Ruben.Verborgh@ugent.be> wrote:
Hi Markus,
> I am not sure but would guess that my 1h20min query has not received
> much more than 100MB of data.
That might be possible (we'd need to verify), but then it means the remaining time was spent computing, which shows the query plan or its execution was highly inefficient. So it is not an inherent limitation of TPF, but rather an implementation problem that is fixable.
> - Caching: Ruben's argument about this being a "finite set of URLs to
> cache" is misleading, and, I believe, not true.
Note that my argument was about a finite number of _non-empty_ fragments.
> For example, there are infinitely many different strings and for each
> you could look for all Wikidata entities with this label
True, but empty is a very easy case, as there are many shortcuts to take here.
> (2) It is misleading since the number of triple patterns is astronomic
> even when restricting to URLs and literals that exist in Wikidata:
> -> Number of Item-Property-Item triple patterns with one variable: 23M x 829 x 2 + 23M x 23M = 529 * 10^12 (0.5 Quadrillion)
Another calculation: each triple belongs to 2^3 = 8 fragments (variable or constant for each component), so there are at most about 11 billion non-empty fragments. Taking the gzipped version of https://query.wikidata.org/bigdata/ldf (about 2 KB) as an average, caching the first page of every fragment on disk would total roughly 22 terabytes. Big, but not insurmountable: $70 buys 2 TB nowadays. (The arithmetic is written out in the sketch below.)
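To make those two back-of-envelope numbers easy to check, here is the arithmetic written out as a small Python sketch. The 23M items, 829 properties, and 2 KB average page size are taken from the messages above; the total triple count of roughly 1.4 billion is my own assumption about the size of the Wikidata dump at the time, chosen so that it reproduces the 11 billion figure.

# Back-of-envelope estimates from the exchange above, written out so the
# arithmetic can be checked. Only TRIPLES is my own assumption; the other
# numbers come from the thread.

ITEMS = 23_000_000        # Wikidata items (Markus's figure)
PROPERTIES = 829          # Wikidata properties (Markus's figure)
TRIPLES = 1_400_000_000   # total triples in the dump (assumed, ~late 2016)
PAGE_GZIP_BYTES = 2_000   # average gzipped first page of a fragment (~2 KB)

# Markus: Item-Property-Item patterns with exactly one variable.
#   variable in subject or object position: ITEMS * PROPERTIES * 2
#   variable in predicate position:         ITEMS * ITEMS
patterns_one_variable = ITEMS * PROPERTIES * 2 + ITEMS * ITEMS
print(f"one-variable patterns: {patterns_one_variable:.2e}")            # ~5.3e14

# Ruben: each triple appears in at most 2**3 = 8 fragments (each position
# either kept or replaced by a variable), so non-empty fragments <= 8 * TRIPLES.
nonempty_fragments = 8 * TRIPLES
print(f"non-empty fragments (upper bound): {nonempty_fragments:.2e}")   # ~1.1e10

# Caching the gzipped first page of every non-empty fragment:
cache_terabytes = nonempty_fragments * PAGE_GZIP_BYTES / 1e12
print(f"cache size: ~{cache_terabytes:.0f} TB")                          # ~22 TB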
> So the situation of "what can I cache in practice" vs. "what would I
> need to cache in theory" is the same as in SPARQL: there is far too much to cache to ever get anywhere near completeness.
But we don't need to cache everything; caching just the most frequently requested fragments already helps. Caching is about lowering load for the server, not about completely removing that load.
A 9-month study of DBpedia's TPFs showed that 60% of fragments were cached [1], and that is with the expiration set to only 1 hour.
I don't have the statistics for a SPARQL endpoint, but I would imagine the hit rate to be much lower, given that a hit only occurs when the exact same query is asked with the exact same syntax.
> - Can caching still help? In a real-world setting, you would like to
> update your data once in a while. The real question for caching is which percentage of your requests are served from cache *during this time*.
One could:
– be relaxed about caching, and allow an hour of delay;
– do active cache invalidation.
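As an illustration of what those two options could look like in practice, here is a small sketch of the HTTP response headers a TPF server might send. The header semantics are standard HTTP, but the function and the example ETag value are hypothetical, not taken from the actual Wikidata or LDF server configuration.

def cache_headers(policy: str) -> dict:
    """Response headers for a TPF fragment under a given caching policy."""
    if policy == "relaxed":
        # Option 1: accept up to an hour of staleness; any HTTP cache
        # (browser, proxy, CDN) may reuse the fragment for 3600 seconds.
        return {"Cache-Control": "public, max-age=3600"}
    if policy == "active-invalidation":
        # Option 2: caches may store the fragment but must revalidate on
        # every request; the server keeps answering "304 Not Modified"
        # until the dataset version behind the ETag changes.
        return {"Cache-Control": "public, no-cache",
                "ETag": '"wikidata-2016-12-19"'}   # hypothetical version tag
    raise ValueError(f"unknown policy: {policy}")

print(cache_headers("relaxed"))
print(cache_headers("active-invalidation"))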
> In any case, the relative benefit of caching will surely be much smaller
> for TPF (where you cache an easy-to-compute but large answer) than for SPARQL endpoints (where you cache a hard-to-compute but often small answer).
Disagree, because the benefit strongly depends on the probability that a given request will be asked again, which is much higher for TPF: its requests are finer-grained, so the same fragment is shared by many different queries.
> Also note that the total number of SPARQL queries one needs to cache is
> much (much!) smaller than the number of TPF replies one would cache for answering them.
We were talking about real-world settings; what is the chance that two independent parties ask the exact same SPARQL query (presuming they don't just try the examples)?
> - Are request numbers no problem because requests are simple? This
> really depends on the database implementation behind the service.
Very true. Since TPF requests are so simple, you don't even need an actual DBMS to serve them; most of our servers use simple HDT files.
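To make concrete how little work one such request involves, here is a minimal, self-contained sketch of the single operation a TPF server performs: match one triple pattern and return one page of results plus the total count that clients need for planning. This is my own illustration rather than the actual server code, and it uses a tiny in-memory list where production servers would use an HDT file.

from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy data standing in for a full dump such as Wikidata's.
TRIPLES: List[Triple] = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P106", "wd:Q36180"),
    ("wd:Q1339", "wdt:P31", "wd:Q5"),
]

def fragment(subject: Optional[str], predicate: Optional[str],
             obj: Optional[str], page: int = 1, page_size: int = 100):
    """Return (one page of matching triples, total match count).

    None in a position means "variable". Matching a single pattern and
    paging the result is all the per-request work a TPF server does;
    join planning happens on the client.
    """
    matches = [t for t in TRIPLES
               if (subject is None or t[0] == subject)
               and (predicate is None or t[1] == predicate)
               and (obj is None or t[2] == obj)]
    start = (page - 1) * page_size
    return matches[start:start + page_size], len(matches)

# e.g. everything that is an instance of (P31) human (Q5):
page_of_matches, total = fragment(None, "wdt:P31", "wd:Q5")
print(total, page_of_matches)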
> But there is also the notion of "thrashing" a database, where you
> destroy performance not by difficult queries but by large numbers of simple ones.
For this, I'm happy with Stas' measurements, and caching is also going to help out.
> Moreover, other components of the server could also be challenged, such
> as logging or other types of filtering that happen before you even get to the DB.
True, you'd want to keep those as light as possible.
> I realise that I am being rather critical here
That's great and rest assured that it is challenging for me to find replies, so the criticism is certainly fair!
> we must cross-check if it is realistic to assume that this approach
> could become useful for Wikidata
The Wikidata case might be peculiar, in that there seem to be no budget issues for hosting a public SPARQL endpoint. So if that's the case, TPF might have low added value, except for federated queries.
> I have supported this work as a research work, and I still do
Thanks, and I believe we already have some valuable lessons for further research on this. It might be that Wikidata is better served by more-than-TPF approaches.
> but the impression we got so far in the "wild" is really not very
> encouraging.
True, I'm disappointed myself by the hiccups mentioned in this thread. However, we must distinguish between inherent TPF limitations and limitations of the current client implementation. I hope to revisit this with an updated client after some months, to better assess what is really a TPF limitation and what is not.
Best,
Ruben
[1] http://linkeddatafragments.org/publications/eswc2015-workshops.pdf