Re: [Wikitech-l] Wikidata: Splitting the Cache by Language

13 Jul 2012

On 10/07/12 19:28, Daniel Kinzler wrote:
...
  Hi all!

 At the hackathon in DC, I have been discussing options for caching Wikidata
 content with a few people, primarily RobLa and Ryan Lane. Here's a quick
 overview of the problem, and what I took away from the conversation.

 So, what we need is this:

 * Wikidata objects are multi-lingual
 * Wikidata generates language specific views of the data objects ("items" and
 other "entities") as HTML.
 * To provide good response times, we need to cache that generated HTML on some
 level, but that cache needs to be specific to (i.e. varying on) the user's language.
 * anonymous users must have a way to set their preferred language, so they are
 able to read pages on wikidata in their native language.

 Now, there are (at least) two levels of caching to be considered:

 * the parser cache, containing the HTML for every page's content. This cache
 uses keys based on the ParserOptions, which includes a field for specifying the
 language. However, currently only a "canonical" version of the rendered page is
 cached, disregarding the user language.
That's not true. You are confusing content language (which is what you
usually want inside a page) with interface language. If you have German
as your language preference on an English wiki, you will see pages cached
with lang=de (as soon as they have language-specific content, such as a TOC
or edit links), and you share that parser cache entry only with other
German-speaking users.
I expect very few users do that currently, though, so their current
impact would be negligible.
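
To illustrate, here is a rough sketch in Python (not MediaWiki's actual PHP
code; the key format, option names and function name are made up for this
example) of a cache key that varies on the user language only when the
rendering actually depended on it:

```python
# Hypothetical sketch: names and key format are assumptions, not MediaWiki's API.

def parser_cache_key(page_id, options, used_options):
    """Build a cache key that varies only on options the rendering used."""
    parts = ["pcache", str(page_id)]
    for name in sorted(used_options):
        parts.append(f"{name}={options[name]}")
    return ":".join(parts)


options_de = {"userlang": "de", "thumbsize": 180}
options_en = {"userlang": "en", "thumbsize": 180}

# A page without language-specific output shares one entry across languages.
plain_de = parser_cache_key(42, options_de, used_options=set())
plain_en = parser_cache_key(42, options_en, used_options=set())

# A page with a TOC or edit links used the user language, so the cache splits.
toc_de = parser_cache_key(42, options_de, used_options={"userlang"})
toc_en = parser_cache_key(42, options_en, used_options={"userlang"})
```

With this scheme the number of cached variants grows only for pages (and
languages) that are really requested in a non-default language.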

...
  * the squid cache, containing full HTML pages for each URL. Currently, our squid
 setup doesn't support language versions of the same page; it does however
 support multiple versions of the same page in general - a mechanism that is
 already in use for caching different output for different mobile devices.

 For each cache, we have to decide whether we want to split the cache by
 language, or not use that cache for wikidata. After some discussion with RobLa
 and Ryan, we came to the following conclusion:

 * We will not use the parser cache for data pages (normal wikitext pages on the
 wikidata site, e.g. talk pages, will use it). The reasons are:
 a) memory for the parser cache is relatively scarce; multiplying the number of
 cached objects by the number of languages (dozens, if not hundreds) is not a
 good idea.
 b) generating HTML from Wikidata objects (which are JSON-like array structures)
 is much quicker than generating HTML from wikitext
 c) Wikidata will have relatively few direct page views; it will mainly be used
 via Wikipedia pages or the API.
I'm not sure what you mean by this. Are you talking about:
a) Sites with a "wikidata widget" (such as an infobox)?
b) A special page showing wikidata items?
c) A wiki for storing wikidata information?

If (a), it makes no sense to have the infobox in your native language
and, e.g., the sidebar labels still in the content language.
(Even then, you could do that without partitioning the parser cache; see
the code for edit links.)

If (b), special pages don't use the parser cache...

If (c), see below.
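
For case (a), the edit-link approach boils down to caching language-neutral
HTML with placeholders and filling them in at output time. A minimal sketch
in Python, assuming a made-up placeholder syntax and a tiny message table
(neither is the real MediaWiki format):

```python
import re

# Assumed message table; a real wiki has far more languages and messages.
MESSAGES = {
    "en": {"editsection": "edit"},
    "de": {"editsection": "Bearbeiten"},
}

# Hypothetical placeholder syntax stored in the cached, language-neutral HTML.
PLACEHOLDER = re.compile(r"<!--msg:([a-z]+)-->")


def localize(cached_html, lang):
    """Fill message placeholders after the cache lookup, per request."""
    table = MESSAGES.get(lang, MESSAGES["en"])
    # Unknown message keys fall back to the key itself.
    return PLACEHOLDER.sub(lambda m: table.get(m.group(1), m.group(1)), cached_html)


cached = '<h2>History [<a href="?action=edit"><!--msg:editsection--></a>]</h2>'
html_de = localize(cached, "de")
html_en = localize(cached, "en")
```

Only one entry is cached per page; the per-language work is a cheap string
substitution on each request.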

...
  Splitting the parser cache by language may be desired in the future, depending
 on what caching mechanisms will be available for the parser cache, and how
 expensive HTML generation actually is. We agreed however that this could be left
 for later and only needs to be addressed in case of actual performance problems.

 * We will split the squid cache by language, using a cookie that specifies the
 user's language preference (for both logged in users and anons). The same URL is
 used for all language versions of the page (this keeps purging simple). The
 reasons are: (...)

The parser cache already supports per-language splitting. You MAY want
to ignore the anon language cookie for "normal" wikis, in anticipation
of a lower cache hit rate. But it SHOULD be used on multilingual wikis,
especially Commons, where there's heavy use of in-page language
customization and the parser cache is already heavily split anyway.
Incubator will probably benefit a lot from allowing anons to view the
skin in their own language. Not sure about meta.
For Wikidata, I expect the same cross-wiki usage as Wikimedia Commons,
so the anon cookie should be obeyed too.
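
As a toy model of the squid-level split discussed above (one URL, several
language variants selected by a cookie, one purge per URL), consider this
Python sketch. The cookie name "lang", the class name and the default
language are assumptions for illustration, not the actual squid setup:

```python
# Hypothetical sketch of a front-end cache split by a language cookie.

class LanguageSplitCache:
    def __init__(self, default_lang="en"):
        self.default_lang = default_lang
        # url -> {lang -> html}; grouping by URL keeps purging one operation.
        self.variants = {}

    def lang_for(self, cookies):
        # "lang" is an assumed cookie name; anons without it get the default.
        return cookies.get("lang", self.default_lang)

    def get(self, url, cookies):
        return self.variants.get(url, {}).get(self.lang_for(cookies))

    def put(self, url, cookies, html):
        self.variants.setdefault(url, {})[self.lang_for(cookies)] = html

    def purge(self, url):
        # A single purge drops every language variant of the page.
        self.variants.pop(url, None)


cache = LanguageSplitCache()
cache.put("/wiki/Q42", {"lang": "de"}, "<html>de</html>")
cache.put("/wiki/Q42", {}, "<html>en</html>")
hit_de = cache.get("/wiki/Q42", {"lang": "de"})
miss_fr = cache.get("/wiki/Q42", {"lang": "fr"})
cache.purge("/wiki/Q42")
after_purge = cache.get("/wiki/Q42", {"lang": "de"})
```

Keeping the same URL for all variants means an edit needs only one purge,
at the price of a lower hit rate per variant.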
