Hi all!
I'd like to get some input on a tricky problem regarding caching in memcached (or the accelerator cache). For Wikidata, we often need to look up the label (name) of an item in a given language, both on Wikidata itself and on wikis that use Wikidata. So, if something somewhere references Q5, we need to look up the label "Human" in English, "Mensch" in German, etc.
The typical access pattern is to look up labels for a dozen or so items in a handful of languages in order to generate a single response. In some situations we can batch that into a single lookup, but at other times we do not have sufficient context, and it will be one label at a time. Also, some items (and thus their labels) are referenced a lot more than others.
Anyway, to get the labels, one option is to fetch the full data item (several KB of data) from the page content store (external store). This is what we currently do, and it's quite bad. Full data items are cached in memcached and shared between wikis, but it's still a lot of data to move, and it may swamp memcached.
Alternatively, we can fetch the labels from the wb_terms table - we have a mechanism for that, but no caching layer.
And now, the point of this email: how to make a caching layer for lookups to the wb_terms table?
Naively, we could just give each label (one per item and language) its own cache key and put it into memcached. I'm not sure this would improve performance much, and it would mean massive overhead for memcached; also, putting *all* labels into memcached would likely swamp it (we have on the order of 100 million labels and descriptions).
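To make the one-key-per-label idea concrete, here is a minimal sketch in Python (not our actual code). It assumes a cache-aside pattern, with a plain dict standing in for memcached and a callback standing in for the wb_terms query; the key layout `wb-label:<item>:<lang>` and the names `cache_key`, `get_labels` and `fetch_from_db` are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

LabelKey = Tuple[str, str]  # (item id, language code), e.g. ("Q5", "en")


def cache_key(item_id: str, lang: str) -> str:
    # Hypothetical key layout, one key per item and language.
    return f"wb-label:{item_id}:{lang}"


def get_labels(
    wanted: Iterable[LabelKey],
    cache: Dict[str, str],
    fetch_from_db: Callable[[List[LabelKey]], Dict[LabelKey, str]],
) -> Dict[LabelKey, str]:
    """Return labels for all wanted (item, language) pairs that exist."""
    pairs = list(dict.fromkeys(wanted))  # de-duplicate, keep order
    found: Dict[LabelKey, str] = {}
    missing: List[LabelKey] = []
    for pair in pairs:
        key = cache_key(*pair)
        if key in cache:
            found[pair] = cache[key]
        else:
            missing.append(pair)
    if missing:
        # One batched database lookup for everything the cache did not have.
        for pair, label in fetch_from_db(missing).items():
            cache[cache_key(*pair)] = label
            found[pair] = label
    return found
```

When we have enough context to batch, all misses go to the database in one query; when we don't, the same function is simply called with a single pair.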
We could put all "terms" for a given entity under one cache key, shared between wikis. But then we'd still be moving a lot of pointless data around. Or we could group using some hashing mechanism. But then we would not be able to take advantage of the fact that some items are used a lot more often than others.
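A rough sketch of what grouping by hash could look like, assuming a fixed number of buckets; the function name `bucket_key`, the key prefix and the bucket count of 1024 are all invented for the example.

```python
import zlib


def bucket_key(item_id: str, num_buckets: int = 1024) -> str:
    # crc32 gives the same value in every process, unlike Python's
    # built-in hash() for strings, so all servers agree on the bucket.
    bucket = zlib.crc32(item_id.encode("utf-8")) % num_buckets
    return f"wb-terms-bucket:{bucket}"
```

Every item that hashes to the same bucket lives in the same cache entry, so a heavily used item and a rarely used one are fetched and evicted together. That is exactly the drawback mentioned above.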
I'd like a good mechanism to cache just the 1000 or so most used labels, preferably locally in the accelerator cache. Would it be possible to make a "compartment" with LRU semantics in our caching infrastructure? As far as I know, all of a memcached server, or all of an APC instance, acts as a single LRU cache.
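To illustrate the semantics I have in mind for such a "compartment", here is a minimal process-local sketch in Python. It only shows the behaviour (a bounded store that evicts the least recently used entry); it says nothing about how this would be done in memcached or APC, and the class name `LabelLRU` and the default size are assumptions.

```python
from collections import OrderedDict


class LabelLRU:
    """Bounded cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        # A hit makes this entry the most recently used one.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            # Drop the entry that has gone unused the longest.
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)
```

Note that a plain LRU like this still admits every label once, so a burst of rarely used labels can push frequently used ones out.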
In order to make it less likely for rarely used labels to hog the cache, I can think of two strategies:
1) Use a low expiry time. If set to 1 hour, only stuff accessed every hour stays in the cache.
2) Use randomized writing: put things into the cache only 1/3 of the time; this makes it more likely that the labels which do end up in the cache are frequently used ones... I'm no good at probabilities and statistics, but I'd love to discuss this option with someone who can actually calculate how well this might work.
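As a starting point for the calculation in 2), here is a small sketch of the basic arithmetic in Python. It assumes that each cache miss independently writes to the cache with probability p, and that nothing is evicted or expires during the period we look at; both are simplifications, and the function names are made up.

```python
def p_cached(n_requests: int, p_write: float = 1 / 3) -> float:
    """Probability that a label has been written to the cache at least
    once after n_requests misses, each writing with probability p_write."""
    return 1.0 - (1.0 - p_write) ** n_requests


def expected_db_queries_until_cached(p_write: float = 1 / 3) -> float:
    """Mean number of database lookups before the label is cached
    (mean of a geometric distribution with success probability p_write)."""
    return 1.0 / p_write
```

Under these assumptions, a label requested only once has a 1/3 chance of taking up space in the cache, while one requested ten times is cached with a probability of about 98%. The price is that a popular label costs about three database lookups on average instead of one before it gets cached.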
So, which strategy should we use? We have:
* Full item data from external store + memcached
* Individual simple database queries, no cache
* DB query + memcached, low duration, one key per label
* DB query + memcached, randomized, one key per label
* Group cache entries by item (similar to caching full entities)
* ...
Are there other options, or other aspects that should be considered? Which strategy would you recommend?
-- daniel