Hi all!
I'd like to get some input on a tricky problem regarding caching in memcached
(or the accelerator cache). For Wikidata, we often need to look up the label
(name) of an item in a given language - on Wikidata itself, as well as on wikis
that use Wikidata. So, if something somewhere references Q5, we need to somehow
look up the label "Human" in English, "Mensch" in German, etc.
The typical access pattern is to look up labels for a dozen or so items in a
handful of languages in order to generate a single response. In some situations
we can batch that into a single lookup, but at other times, we do not have
sufficient context, and it will be one label at a time. Also, some items (and
thus, their labels) are referenced a lot more often than others.
Anyway, to get the labels, we can fetch the full data item (several KB of data)
from the page content store (external store). This is what we currently do, and
it's quite bad: full data items are cached in memcached and shared between
wikis, but it's still a lot of data to move, and may swamp memcached.
Alternatively, we can fetch the labels from the wb_terms table - we have a
mechanism for that, but no caching layer.
And now, the point of this email: how to make a caching layer for lookups to the
wb_terms table?
Naively, we could just give each label (one per item and language) a cache key,
and put it into memcached. I'm not sure this would improve performance much,
and it would mean massive key overhead for memcached; also, putting *all*
labels into memcached would likely swamp it (we have on the order of 100
million labels and descriptions).
We could put all "terms" for a given entity under one cache key, shared between
wikis. But then we'd still be moving a lot of pointless data around. Or we
could group labels using some hashing mechanism, but then we could not take
advantage of the fact that some items are used a lot more often than others.
I'd like a good mechanism to cache just the 1000 or so most used labels,
preferably locally in the accelerator cache. Would it be possible to make a
"compartment" with LRU semantics in our caching infrastructure? As far as I
know, all of a memcached server, or all of an APC instance, acts as a single LRU
cache.
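To illustrate what I mean by a "compartment": a small fixed-size map with LRU
eviction, local to the process. A minimal sketch in Python (hypothetical code,
just to pin down the semantics - not something that exists in our codebase):

```python
from collections import OrderedDict

class LabelLRU:
    """A fixed-size compartment with LRU eviction,
    e.g. for the ~1000 hottest labels."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as LRU order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def set(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LabelLRU(capacity=2)
cache.set("Q5-en", "Human")
cache.set("Q5-de", "Mensch")
cache.get("Q5-en")             # touch Q5-en so it is most recent
cache.set("Q64-en", "Berlin")  # full: evicts Q5-de, the least recently used
```

The question is whether anything in our memcached/APC setup can give us
eviction scoped like this, rather than across the whole instance.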
In order to make it less likely for rarely used labels to hog the cache, I can
think of two strategies:
1) Use a low expiry time. With a TTL of 1 hour, only labels that get requested
(and thus re-cached) at least once an hour will tend to be found in the cache.
2) Use randomized writes: put things into the cache only 1/3 of the time; this
makes it more likely for frequently used labels to end up in the cache. I'm no
good at probabilities and statistics, but I'd love to discuss this option with
someone who can actually calculate how well this might work.
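A back-of-the-envelope model (my own simplification - it assumes independent
admission decisions and ignores eviction pressure): if each write is admitted
with probability p, a label accessed k times within one TTL window is cached
with probability 1 - (1 - p)^k:

```python
def hit_probability(accesses_per_ttl, admit_prob=1/3):
    """Chance that at least one of a label's accesses within one TTL
    window was admitted to the cache, i.e. that the label is cached
    at the end of the window. Assumes independent admissions."""
    return 1 - (1 - admit_prob) ** accesses_per_ttl

# A label fetched once per TTL window is cached ~33% of the time;
# one fetched ten times per window, ~98% of the time.
cold = hit_probability(1)
hot = hit_probability(10)
```

So under this model, hot labels are almost always cached, while one-off
lookups only take up cache space a third of the time - but someone should
check whether the independence assumption holds up in practice.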
So, which strategy should we use? We have:
* Full item data from external store + memcached
* Individual simple database queries, no cache
* DB query + memcached, low duration, one key per label
* DB query + memcached, randomized, one key per label
* Group cache entries by item (similar to caching full entities)
* ...
Are there other options, or other aspects that should be considered? Which
strategy would you recommend?
-- daniel
A few things to note:
* APC is not LRU, it just detects expired items on get() and clears
everything when full (
https://groups.drupal.org/node/397938)
* APC has a low max keys config on production, so using key-per-item would
require that to change
* Implementing LRU groups for BagOStuff would require heavy CAS use and
would definitely be bad over the wire (and not great locally either)
Just how high is the label lookup traffic? Do we profile this?
If it is super high, I'd suggest the following as a possibility:
a) Install a tiny redis instance on each app server.
b) Have a sorted set in redis containing (label key => score) and individual
redis keys for label strings (with label keys). Label keys would be like
P33-en. The sorted set and string values would use a common key prefix in
redis. The sorted-set key would mention the max size.
c) The cache get() method would use the normal redis GET command. Once every 10
times it could send a Lua command to bump the label key's score in the
sorted set (via ZADD) to that of the highest score + 1 (found via ZRANGE key -1
-1 WITHSCORES).
d) The cache set() method would be a no-op except once every 10 times. When it
does anything, it would send a Lua command to remove the lowest-scored key
if there is no room (ZREMRANGEBYRANK key 0 0) and, in any case, add the label
key with a score equal to the highest score + 1. It would also store the label
string in the separate key for that value with a TTL (likewise deleting it on
eviction). The sorted-set TTL would be set to max(current TTL, new value
TTL).
e) Cache misses would fetch from the DB rather than text store
If high traffic causes flooding, the "10" number can be tweaked (or
eliminated), or the "highest score + 1" logic could be tweaked to insert new
labels with a score that's better than only 3/8 of the entries rather than all
of them (borrowing from MySQL's midpoint insertion strategy). The above method
only uses O(log N) redis operations.
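For what it's worth, the bookkeeping in (b)-(d) is easy to model in-process.
This is just a sketch of the logic, not redis-py code: a plain dict stands in
for the sorted set, TTLs are omitted, and in production each step would map to
the GET/ZADD/ZRANGE/ZREMRANGEBYRANK commands above, bundled into Lua scripts
for atomicity:

```python
import random

class ScoredLabelCache:
    """In-process model of the proposed redis scheme: a scored set
    defines eviction order, a separate key->value map holds the label
    strings. TTL handling is left out of the model."""

    def __init__(self, max_size=1000, bump_prob=0.1):
        self.max_size = max_size
        self.bump_prob = bump_prob  # the "once every 10 times" knob
        self.scores = {}            # label key -> score (the sorted set)
        self.values = {}            # label key -> label string

    def _top_score(self):
        # ZRANGE key -1 -1 WITHSCORES in redis
        return max(self.scores.values(), default=0)

    def get(self, key):
        value = self.values.get(key)
        if value is not None and random.random() < self.bump_prob:
            # bump to highest score + 1 (ZADD in redis)
            self.scores[key] = self._top_score() + 1
        return value

    def set(self, key, value, force=False):
        if not force and random.random() >= self.bump_prob:
            return  # no-op most of the time
        if key not in self.scores and len(self.scores) >= self.max_size:
            # evict the lowest-scored key (ZREMRANGEBYRANK key 0 0)
            victim = min(self.scores, key=self.scores.get)
            del self.scores[victim]
            del self.values[victim]
        self.scores[key] = self._top_score() + 1
        self.values[key] = value

cache = ScoredLabelCache(max_size=2, bump_prob=0.1)
cache.set("P33-en", "country", force=True)  # score 1
cache.set("P33-de", "Land", force=True)     # score 2
cache.set("Q5-en", "Human", force=True)     # full: evicts P33-en (lowest)
```

The min() scan here is O(N), of course; that's exactly what the sorted set
buys us in real redis, where the same eviction is O(log N).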
Such a thing could probably be useful for at least a few more use cases I'd
bet.
--
Sent from the Wikipedia Developers mailing list archive at
Nabble.com.