> Date: Tue, 10 Jul 2012 13:28:38 -0400
> From: Daniel Kinzler <daniel@brightbyte.de>
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>,
> Wikidata-intern <wikidata-intern@wikimedia.de>
> Subject: [Wikitech-l] Wikidata: Splitting the Cache by Language
> Message-ID: <4FFC6646.2080707@brightbyte.de>
> Content-Type: text/plain; charset=UTF-8
>
> Hi all!
>
> At the hackathon in DC, I have been discussing options for caching Wikidata
> content with a few people, primarily RobLa and Ryan Lane. Here's a quick
> overview of the problem, and what I took away from the conversation.
>
> So, what we need is this:
>
> * Wikidata objects are multi-lingual
> * Wikidata generates language specific views of the data objects ("items" and
> other "entities") as HTML.
> * To provide good response times, we need to cache that generated HTML on some
> level, but that cache needs to be specific to (i.e. varying on) the user's language.
> * Anonymous users must have a way to set their preferred language, so they are
> able to read pages on Wikidata in their native language.
>
> Now, there are (at least) two levels of caching to be considered:
>
> * the parser cache, containing the HTML for every page's content. This cache
> uses keys based on the ParserOptions, which includes a field for specifying the
> language. However, currently only a "canonical" version of the rendered page is
> cached, disregarding the user language.
That's not entirely true. In cases where the page differs depending on
the user's language ({{int:...}}, or pages using the Babel extension,
depending on that extension's config), we already split the parser
cache based on user language.
Those might seem like small cases, but I would guess (to pull a number
out of nowhere, so probably wrong) that at least 75% of pages on
Commons have their parser cache split by user language.
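
(To illustrate the mechanism described above: the parser cache key only
varies on user language for pages that actually consulted it. A minimal
sketch with invented names - this is not MediaWiki's actual keying code:)

    <?php
    // Hypothetical sketch of split-on-demand parser cache keying. A page
    // that used {{int:...}} or Babel records 'userlang' as a "used
    // option", and only then does its key vary per language.
    function parserCacheKey( $pageId, $touched, array $usedOptions, $userLang ) {
        $parts = array( 'pcache', $pageId, $touched );
        if ( in_array( 'userlang', $usedOptions ) ) {
            $parts[] = "userlang={$userLang}"; // the split happens here
        }
        return implode( ':', $parts );
    }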
>
> * the squid cache, containing full HTML pages for each URL. Currently, our squid
> setup doesn't support language versions of the same page; it does however
> support multiple versions of the same page in general - a mechanism that is
> already in use for caching different output for different mobile devices.
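
(As a side note, that per-URL variant mechanism is driven by response
headers telling the squids which request attributes to key on. A hedged
sketch of the application side, assuming OutputPage::addVaryHeader() and
the X-Vary-Options convention; the cookie name here is just an example,
not necessarily what the mobile setup actually uses:)

    <?php
    // Sketch: ask the cache layer to store a separate variant of this
    // URL per device class, as the mobile setup does. $out is an
    // OutputPage; treat the exact option syntax as an assumption.
    $out->addVaryHeader( 'Cookie', array( 'string-contains=mf_useformat' ) );
    // Emitted roughly as:
    //   Vary: Cookie
    //   X-Vary-Options: Cookie;string-contains=mf_useformat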
>
> For each cache, we have to decide whether we want to split the cache by
> language, or not use that cache for wikidata. After some discussion with RobLa
> and Ryan, we came to the following conclusion:
>
> * We will not use the parser cache for data pages (normal wikitext pages on the
> Wikidata site, e.g. talk pages, will use it). The reasons are:
> a) memory for the parser cache is relatively scarce; multiplying the number of
> cached objects by the number of languages (dozens, if not hundreds) is not a
> good idea.
> b) generating HTML from Wikidata objects (which are JSON-like array structures)
> is much quicker than generating HTML from wikitext
> c) Wikidata will have relatively few direct page views; it will mainly be used
> via Wikipedia pages or the API.
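
(To illustrate point (b) above: rendering an entity is essentially a
walk over an array, picking values for the requested language, with no
wikitext parsing involved. A toy sketch; the real Wikidata structures
and renderer look different:)

    <?php
    // Toy renderer: turn a JSON-like item array into HTML for one
    // language. No parsing step, just array lookups - hence much
    // cheaper than rendering wikitext.
    function renderItem( array $item, $lang ) {
        $label = isset( $item['labels'][$lang] )
            ? $item['labels'][$lang]
            : $item['labels']['en']; // fall back to content language
        $desc = isset( $item['descriptions'][$lang] )
            ? $item['descriptions'][$lang] : '';
        return '<h1>' . htmlspecialchars( $label ) . '</h1>'
            . '<p>' . htmlspecialchars( $desc ) . '</p>';
    }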
>
> Splitting the parser cache by language may be desired in the future, depending
> on what caching mechanisms will be available for the parser cache, and how
> expensive HTML generation actually is. We agreed however that this could be left
> for later and only needs to be addressed in case of actual performance problems.
>
> * We will split the squid cache by language, using a cookie that specifies the
> user's language preference (for both logged in users and anons). The same URL is
> used for all language versions of the page (this keeps purging simple). The
> reasons are:
> a) bypassing caching completely would open us up to DDoS attacks, or even
> unintentional DoS by spiders, etc.
> b) the number of pages in the cache need not be large, since only a small
> subset will be actively viewed.
> c) squid memory is relatively plentiful; the experience with splitting the
> cache by mobile device shows that this is feasible.
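
(A sketch of the cookie half of this, assuming a cookie named "uselang";
the actual cookie name and wiring were not decided in this thread:)

    <?php
    // Sketch: persist an anon user's language choice in a cookie and
    // tell the squid layer to vary cached variants of this URL on it.
    // "uselang" as the cookie name is an assumption for illustration.
    $request->response()->setcookie( 'uselang', $langCode );
    $out->addVaryHeader( 'Cookie', array( 'string-contains=uselang' ) );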
>
> Note that logged in users currently bypass the squid cache. This may change in
> the future, but the above scheme is oblivious to this: it will cache based on
> the language cookie. Looking more closely, there are four cases to consider:
>
> * anonymous with language cookie: vary cache on the language in the cookie, keep
> URL unchanged
> * logged in with language cookie: once squid cache is implemented for logged in
> users, vary cache on the language in the cookie, keep URL unchanged. Until then,
> bypass the cache.
> * anonymous with no cookie set (first access, or the user agent doesn't support
> cookies): this should use the primary content language (English). Squids could
> also detect the user language somehow, set the cookie, and only then pass the
> request to the app server - this would be nice, but puts app logic into the
> squids. Can be added later if desired; MediaWiki is oblivious to this.
> * logged in with no cookie: skip the cache. The response will depend on the user's
> language preference stored in the database, caching it without the language
> cookie to vary on would poison the cache. We could try to come up with a scheme
> to mitigate the overhead this generates, but that only becomes relevant once
> squid layer caching is supported for logged in users at all.
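
(The four cases above boil down to a small decision table. A sketch,
with invented helper parameters; $squidForLoggedIn stands in for a
future "squid caching for logged-in users" switch that does not exist:)

    <?php
    // Sketch of the four-way decision described above.
    function cacheBehaviour( $isLoggedIn, $hasLangCookie, $squidForLoggedIn ) {
        if ( !$isLoggedIn ) {
            // anons are cacheable either way; without a cookie they get
            // the primary content language (English)
            return $hasLangCookie
                ? 'cache, vary on language cookie'
                : 'cache, content language';
        }
        if ( !$hasLangCookie ) {
            // language comes from the DB preference; caching this
            // response would poison the cache
            return 'bypass cache';
        }
        return $squidForLoggedIn ? 'cache, vary on language cookie' : 'bypass cache';
    }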
>
> So, that's it: no parser cache, split squid cache, use cookie smartly.
>
> What do you think? Does that sound good to you?
>
> -- daniel
Does this imply anon users can somehow have a language set via a
cookie? As it stands, you need to be logged in to have a lang
preference. Are there concrete plans to have squid servers start
partially caching pages for logged-in users? (And if so, are we
including the content area of the page in that?)
If there are no concrete plans to have squid start caching logged-in
views, and only logged-in users have a lang preference, it doesn't seem
to make a lot of sense to split the squid cache based on language, as
that would catch 0% of the hits.
If there is concern that the parser cache would be too diluted by this,
why not cache in the normal object cache via $wgMemc->set? We certainly
cache all sorts of stuff there, much of it not even very expensive to
generate.
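
(What's being suggested might look roughly like this; the key structure
and TTL are made up, and renderItem() is the toy renderer sketched
earlier:)

    <?php
    // Sketch: cache rendered entity HTML in the general object cache
    // instead of the parser cache, keyed by page id and language.
    $key = wfMemcKey( 'wikidata', 'rendered-entity', $pageId, $langCode );
    $html = $wgMemc->get( $key );
    if ( $html === false ) {
        $html = renderItem( $item, $langCode );
        $wgMemc->set( $key, $html, 3600 ); // cache for an hour
    }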
--
-bawolff