Hi all!
I'd like to get some input on a tricky problem regarding caching in memcached
(or the accelerator cache). For Wikidata, we often need to look up the label
(name) of an item in a given language - on Wikidata itself, as well as on wikis
that use Wikidata. So, if something somewhere references Q5, we need to somehow
look up the label "Human" in English, "Mensch" in German, etc.
The typical access pattern is to look up labels for a dozen or so items in a
handful of languages in order to generate a single response. In some situations
we can batch that into a single lookup, but at other times, we do not have
sufficient context, and it will be one label at a time. Also, some items (and
thus, their labels) are referenced a lot more often than others.
Anyway, to get the labels, we can fetch the full data item (several KB of data)
from the page content store (external store). This is what we currently do, and
it's quite bad: full data items are cached in memcached and shared between
wikis, but it's still a lot of data to move, and may swamp memcached.
Alternatively, we can fetch the labels from the wb_terms table - we have a
mechanism for that, but no caching layer.
And now, the point of this email: how to make a caching layer for lookups to the
wb_terms table?
Naively, we could just give each label (one per item and language) a cache key,
and put it into memcached. I'm not sure this would improve performance much,
and it would mean massive key overhead for memcached; also, putting *all*
labels into memcached would likely swamp it (we have on the order of 100
million labels and descriptions).
We could put all "terms" for a given entity under one cache key, shared between
wikis. But then we'd still be moving a lot of pointless data around. Or we
could group labels using some hashing mechanism, but then we could not take
advantage of the fact that some items are used a lot more often than others.
I'd like a good mechanism to cache just the 1000 or so most used labels,
preferably locally in the accelerator cache. Would it be possible to make a
"compartment" with LRU semantics in our caching infrastructure? As far as I
know, all of a memcached server, or all of an APC instance, acts as a single LRU
cache.
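To illustrate what I mean by a "compartment": a small fixed-size map with LRU
eviction, local to the process. A minimal sketch in Python (hypothetical code,
just to pin down the semantics - not something that exists in our codebase):

```python
from collections import OrderedDict

class LabelLRU:
    """A fixed-size compartment with LRU eviction,
    e.g. for the ~1000 hottest labels."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as LRU order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def set(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LabelLRU(capacity=2)
cache.set("Q5-en", "Human")
cache.set("Q5-de", "Mensch")
cache.get("Q5-en")             # touch Q5-en so it is most recent
cache.set("Q64-en", "Berlin")  # full: evicts Q5-de, the least recently used
```

The question is whether anything in our memcached/APC setup can give us
eviction scoped like this, rather than across the whole instance.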
In order to make it less likely for rarely used labels to hog the cache, I can
think of two strategies:
1) Use a low expiry time. With a TTL of 1 hour, only labels that get requested
(and thus re-cached) at least once an hour will tend to be found in the cache.
2) Use randomized writes: put things into the cache only 1/3 of the time; this
makes it more likely for frequently used labels to end up in the cache. I'm no
good at probabilities and statistics, but I'd love to discuss this option with
someone who can actually calculate how well this might work.
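A back-of-the-envelope model (my own simplification - it assumes independent
admission decisions and ignores eviction pressure): if each write is admitted
with probability p, a label accessed k times within one TTL window is cached
with probability 1 - (1 - p)^k:

```python
def hit_probability(accesses_per_ttl, admit_prob=1/3):
    """Chance that at least one of a label's accesses within one TTL
    window was admitted to the cache, i.e. that the label is cached
    at the end of the window. Assumes independent admissions."""
    return 1 - (1 - admit_prob) ** accesses_per_ttl

# A label fetched once per TTL window is cached ~33% of the time;
# one fetched ten times per window, ~98% of the time.
cold = hit_probability(1)
hot = hit_probability(10)
```

So under this model, hot labels are almost always cached, while one-off
lookups only take up cache space a third of the time - but someone should
check whether the independence assumption holds up in practice.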
So, which strategy should we use? We have:
* Full item data from external store + memcached
* Individual simple database queries, no cache
* DB query + memcached, low duration, one key per label
* DB query + memcached, randomized, one key per label
* Group cache entries by item (similar to caching full entities)
* ...
Are there other options, or other aspects that should be considered? Which
strategy would you recommend?
-- daniel
A few things to note:
* APC is not LRU, it just detects expired items on get() and clears
everything when full (
https://groups.drupal.org/node/397938)
* APC has a low max keys config on production, so using key-per-item would
require that to change
* Implementing LRU groups for BagOStuff would require heavy CAS use and
would definitely be bad over the wire (and not great locally either)
Just how high is the label lookup traffic? Do we profile this?
If it is super high, I'd suggest the following as a possibility:
a) Install a tiny redis instance on each app server.
b) Have a sorted set in redis containing (label key => score) and individual
redis keys for label strings (with label keys). Label keys would be like
P33-en. The sorted set and string values would use a common key prefix in
redis. The sorted-set key would mention the max size.
c) The cache get() method would use the normal redis GET command. Once every 10
times it could send a Lua command to bump the label key's score in the
sorted set (via ZADD) to that of the highest score + 1 (found via ZRANGE key -1
-1 WITHSCORES).
d) The cache set() method would be a no-op except once every 10 times. When it
does anything, it would send a Lua command to remove the lowest-scored key
if there is no room (ZREMRANGEBYRANK key 0 0) and, in any case, add the label
key with a score equal to the highest score + 1. It would also store the label
string in the separate key for that value with a TTL (likewise deleting it on
eviction). The sorted-set TTL would be set to max(current TTL, new value
TTL).
e) Cache misses would fetch from the DB rather than text store
If high traffic causes flooding, the "10" number can be tweaked (or
eliminated), or the "highest score + 1" logic could be tweaked to insert new
labels with a score that's better than only 3/8 of the entries rather than all
of them (borrowing from MySQL's midpoint insertion strategy). The above method
only uses O(log N) redis operations.
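For what it's worth, the bookkeeping in (b)-(d) is easy to model in-process.
This is just a sketch of the logic, not redis-py code: a plain dict stands in
for the sorted set, TTLs are omitted, and in production each step would map to
the GET/ZADD/ZRANGE/ZREMRANGEBYRANK commands above, bundled into Lua scripts
for atomicity:

```python
import random

class ScoredLabelCache:
    """In-process model of the proposed redis scheme: a scored set
    defines eviction order, a separate key->value map holds the label
    strings. TTL handling is left out of the model."""

    def __init__(self, max_size=1000, bump_prob=0.1):
        self.max_size = max_size
        self.bump_prob = bump_prob  # the "once every 10 times" knob
        self.scores = {}            # label key -> score (the sorted set)
        self.values = {}            # label key -> label string

    def _top_score(self):
        # ZRANGE key -1 -1 WITHSCORES in redis
        return max(self.scores.values(), default=0)

    def get(self, key):
        value = self.values.get(key)
        if value is not None and random.random() < self.bump_prob:
            # bump to highest score + 1 (ZADD in redis)
            self.scores[key] = self._top_score() + 1
        return value

    def set(self, key, value, force=False):
        if not force and random.random() >= self.bump_prob:
            return  # no-op most of the time
        if key not in self.scores and len(self.scores) >= self.max_size:
            # evict the lowest-scored key (ZREMRANGEBYRANK key 0 0)
            victim = min(self.scores, key=self.scores.get)
            del self.scores[victim]
            del self.values[victim]
        self.scores[key] = self._top_score() + 1
        self.values[key] = value

cache = ScoredLabelCache(max_size=2, bump_prob=0.1)
cache.set("P33-en", "country", force=True)  # score 1
cache.set("P33-de", "Land", force=True)     # score 2
cache.set("Q5-en", "Human", force=True)     # full: evicts P33-en (lowest)
```

The min() scan here is O(N), of course; that's exactly what the sorted set
buys us in real redis, where the same eviction is O(log N).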
Such a thing could probably be useful for at least a few more use cases I'd
bet.
--
Sent from the Wikipedia Developers mailing list archive at
Nabble.com.