Hi Markus,
I am not sure but would guess that my 1h20min query
has not received much more than 100MB of data.
That might be possible (we'd need to verify),
but then this means the other time was spent computing,
which shows the query plan or execution was highly inefficient.
So not an inherent limitation of TPF then,
but rather an implementation problem that is fixable.
* Caching: Ruben's argument about this being a
"finite set of URLs to cache" is misleading, and, I believe, not true.
Note that my argument was about
a finite number of _non-empty_ fragments.
For example, there are infinitely many different
strings, and for each you could look for all Wikidata entities with this label.
True, but empty is a very easy case,
as there are many shortcuts to take here.
(2) It is misleading since the number of triple
patterns is astronomical even when restricting to URLs and literals that exist in Wikidata:
-> Number of Item-Property-Item triple patterns with one variable:
23M x 829 x 2 + 23M x 23M = 529 * 10^12
(0.5 Quadrillion)
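For what it's worth, that arithmetic checks out; a quick sanity check (taking the 23 million items and 829 properties quoted above as given):

```python
items = 23_000_000      # approximate number of Wikidata items (from above)
properties = 829        # approximate number of Wikidata properties (from above)

# Item-Property-Item patterns with exactly one variable:
#   ?s P o  and  s P ?o  ->  items * properties * 2
#   s ?p o               ->  items * items
one_variable = items * properties * 2 + items * items

print(one_variable)         # on the order of 5.3 * 10^14
print(one_variable // 10**12)  # 529 trillion-ish, i.e. ~0.5 quadrillion
```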
Another way to calculate: each triple belongs to
2^3 = 8 fragments (each component is either a constant or a variable),
so the number of non-empty fragments is at most
8 times the number of triples, which comes to just 11 billion.
Taking the gzipped version of
https://query.wikidata.org/bigdata/ldf (2 kB)
as an average, it would be a total of 22 terabytes
to cache the first page of every fragment on disk.
Big, but not insurmountable: $70 buys 2 TB of disk nowadays.
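Spelling that estimate out (the triple count is my assumption here, back-derived from the 11 billion figure, i.e. 11e9 / 8):

```python
triples = 1_375_000_000           # assumed Wikidata triple count (= 11e9 / 8)
fragments_per_triple = 2 ** 3     # each component: constant or variable

# Upper bound: every triple contributes to at most 8 fragments.
non_empty_fragments = triples * fragments_per_triple

page_size = 2_000                 # ~2 kB gzipped first page, per the sample above
total_bytes = non_empty_fragments * page_size

print(non_empty_fragments)        # 11_000_000_000
print(total_bytes / 10**12)       # 22.0 (terabytes)
```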
So the situation of "what can I cache in
practice" vs. "what would I need to cache in theory" is the same as in
SPARQL: there is far too much to cache to ever get anywhere near completeness.
But we don't need to cache everything,
just the most frequently requested fragments.
Caching is about lowering load for the server,
not about completely removing that load.
A 9-month study of DBpedia's TPFs showed
that 60% of fragments were cached [1],
and that's with the expiration set to only 1 hour.
I don't have the statistics for a SPARQL endpoint,
but I would imagine the hit rate to be much lower,
given that a hit only occurs when the exact same query
is asked with the exact same syntax.
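To illustrate the syntax sensitivity: a plain HTTP cache keys on the literal request URL, so two equivalent SPARQL queries that differ only in variable names miss each other, while TPF requests for the same pattern can normalize to one key. A minimal sketch (the URLs are illustrative, not Wikidata's actual endpoints):

```python
from urllib.parse import urlencode

def sparql_cache_key(endpoint, query):
    # An HTTP cache sees only the literal request: whitespace,
    # capitalization, and variable names all change the key.
    return endpoint + "?" + urlencode({"query": query})

q1 = "SELECT ?s WHERE { ?s <http://ex.org/p> <http://ex.org/o> }"
q2 = "SELECT ?x WHERE { ?x <http://ex.org/p> <http://ex.org/o> }"  # equivalent
assert sparql_cache_key("https://example.org/sparql", q1) != \
       sparql_cache_key("https://example.org/sparql", q2)   # cache miss

def tpf_cache_key(base, s=None, p=None, o=None):
    # A TPF request is fully determined by its (s, p, o) pattern;
    # sorting the parameters makes equivalent requests identical.
    params = {"subject": s, "predicate": p, "object": o}
    return base + "?" + urlencode(sorted(
        (k, v) for k, v in params.items() if v is not None))

k1 = tpf_cache_key("https://example.org/ldf", p="http://ex.org/p", o="http://ex.org/o")
k2 = tpf_cache_key("https://example.org/ldf", o="http://ex.org/o", p="http://ex.org/p")
assert k1 == k2   # same pattern, same key -> cache hit
```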
* Can caching still help? In a real-world setting, you
would like to update your data once in a while. The real question for caching is which
percentage of your requests are served from cache *during this time*.
One could:
– be relaxed about caching, and allow an hour of delay;
– do active cache invalidation.
In any case, the relative benefit of caching will
surely be much smaller for TPF (where you cache an easy-to-compute but large answer) than
for SPARQL endpoints (where you cache a hard-to-compute but often small answer).
Disagree, because the benefit strongly depends
on the probability that a given request will be asked again,
which is much higher for TPF due to its finer granularity.
Also note that the total number of SPARQL queries one
needs to cache is much (much!) smaller than the number of TPF replies one would cache for
answering them.
We were talking about real-world settings;
what is the chance that two independent parties
ask the exact same SPARQL query
(presuming they don't just try the examples)?
* Are request numbers no problem because requests are
simple? This really depends on the database implementation behind the service.
Very true. Since TPF requests are so simple,
you don't even need an actual DBMS to serve them;
most of our servers use simple HDT files.
But there is also the notion of "thrashing"
a database, where you destroy performance not by difficult queries but by large numbers of
simple ones.
For this, I'm happy with Stas' measurements,
and caching is also going to help out.
Moreover, other components of the server could also be
challenged, such as logging or other types of filtering that happen before you even get to
the DB.
True, you'd want to keep those as light as possible.
I realise that I am being rather critical here
That's great, and rest assured that
it is challenging for me to find replies,
so the criticism is certainly fair!
we must cross-check if it is realistic to assume that
this approach could become useful for Wikidata
The Wikidata case might be peculiar,
in that there seem to be no budget issues
for hosting a public SPARQL endpoint.
So if that's the case, TPF might have low added value,
except for federated queries.
I have supported this work as a research work, and I
still do
Thanks, and I believe we have some valuable lessons already
for further research on this.
It might be that Wikidata is better served by more-than-TPF approaches.
but the impression we got so far in the
"wild" is really not very encouraging.
True, I'm disappointed myself by the hiccups mentioned in this thread.
However, we must distinguish between inherent TPF limitations
and limitations of the current client implementation.
I hope to revisit this with an updated client after some months,
to better assess what is really a TPF limitation and what is not.
Best,
Ruben
[1]
http://linkeddatafragments.org/publications/eswc2015-workshops.pdf