Hi all!
At the hackathon in DC, I have been discussing options for caching Wikidata content with a few people, primarily RobLa and Ryan Lane. Here's a quick overview of the problem, and what I took away from the conversation.
So, what we need is this:
* Wikidata objects are multi-lingual
* Wikidata generates language specific views of the data objects ("items" and other "entities") as HTML.
* To provide good response times, we need to cache that generated HTML on some level, but that cache needs to be specific to (i.e. varying on) the user's language.
* anonymous users must have a way to set their preferred language, so they are able to read pages on wikidata in their native language.
Now, there are (at least) two levels of caching to be considered:
* the parser cache, containing the HTML for every page's content. This cache uses keys based on the ParserOptions, which includes a field for specifying the language. However, currently only a "canonical" version of the rendered page is cached, disregarding the user language.
* the squid cache, containing full HTML pages for each URL. Currently, our squid setup doesn't support language versions of the same page; it does however support multiple versions of the same page in general - a mechanism that is already in use for caching different output for different mobile devices.
For each cache, we have to decide whether we want to split the cache by language, or not use that cache for wikidata. After some discussion with RobLa and Ryan, we came to the following conclusion:
* We will not use the parser cache for data pages (normal wikitext pages on the wikidata site, e.g. talk pages, will use it). The reasons are:
a) memory for the parser cache is relatively scarce; multiplying the number of cached objects by the number of languages (dozens, if not hundreds) is not a good idea.
b) generating HTML from Wikidata objects (which are JSON-like array structures) is much quicker than generating HTML from wikitext.
c) Wikidata will have relatively few direct page views; it will mainly be used via Wikipedia pages or the API.
Splitting the parser cache by language may be desired in the future, depending on what caching mechanisms will be available for the parser cache, and how expensive HTML generation actually is. We agreed however that this could be left for later and only needs to be addressed in case of actual performance problems.
* We will split the squid cache by language, using a cookie that specifies the user's language preference (for both logged in users and anons). The same URL is used for all language versions of the page (this keeps purging simple). The reasons are:
a) bypassing caching completely would open us up to DDoS attacks, or even unintentional DoS by spiders, etc.
b) the number of pages in the cache need not be large; only a small subset will be actively viewed.
c) squid memory is relatively plentiful, and the experience with splitting the cache by mobile device shows that this is feasible.
Note that logged in users currently bypass the squid cache. This may change in the future, but the above scheme is oblivious to this: it will cache based on the language cookie. Looking more closely, there are four cases to consider (a rough sketch of the resulting logic follows after the list):
* anonymous with language cookie: vary the cache on the language in the cookie, keep the URL unchanged.
* logged in with language cookie: once squid caching is implemented for logged in users, vary the cache on the language in the cookie and keep the URL unchanged. Until then, bypass the cache.
* anonymous with no cookie set (first access, or the user agent doesn't support cookies): this should use the primary content language (English). The squids could also detect the user language somehow, set the cookie, and only then pass the request to the app server; this would be nice, but it puts app logic into the squids. It can be added later if desired; MediaWiki is oblivious to this.
* logged in with no cookie: skip the cache. The response will depend on the user's language preference stored in the database, and caching it without the language cookie to vary on would poison the cache. We could try to come up with a scheme to mitigate the overhead this generates, but that only becomes relevant once squid-layer caching is supported for logged in users at all.
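To make this concrete, here is a rough sketch of that decision logic in plain PHP. It is illustrative only: the helper function is made up, and the real decision would of course live partly in the squid configuration, not in MediaWiki.

  <?php
  // Illustrative only: the function and return format are made up; the
  // real decision would partly live in the squid configuration.
  function decideCaching( $isLoggedIn, $languageCookie ) {
      if ( $isLoggedIn ) {
          // Logged in users currently bypass the squid cache entirely.
          // Once squid caching for logged in users exists, the language
          // cookie could be used just like for anons.
          return array( 'useSquidCache' => false, 'lang' => $languageCookie );
      }
      if ( $languageCookie !== null ) {
          // Anonymous user with a language cookie: cacheable, with the
          // cache entry varying on the cookie value; same URL for all.
          return array( 'useSquidCache' => true, 'lang' => $languageCookie );
      }
      // Anonymous user without a cookie: fall back to the primary content
      // language (English) and cache that version.
      return array( 'useSquidCache' => true, 'lang' => 'en' );
  }

On the MediaWiki side, the main requirement is that the response is marked as varying on that cookie (e.g. via the Vary header), so squid never serves a German page to an English reader.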
So, that's it: no parser cache, split squid cache, use cookie smartly.
What do you think? Does that sound good to you?
-- daniel
On 10/07/12 19:28, Daniel Kinzler wrote:
- To provide good response times, we need to cache that generated HTML on some
level, but that cache needs to be specific to (i.e. varying on) the user's language.
- anonymous users must have a way to set their preferred language, so they are
able to read pages on wikidata in their native language.
Have you considered generating a PHP template and just cache that? Then all hits would be served directly by the very simple PHP template?
Well we have no support for that yet :-(
On 10.07.2012 13:58, Antoine Musso wrote:
On 10/07/12 19:28, Daniel Kinzler wrote:
- To provide good response times, we need to cache that generated HTML on some
level, but that cache needs to be specific to (i.e. varying on) the user's language.
- anonymous users must have a way to set their preferred language, so they are
able to read pages on wikidata in their native language.
Have you considered generating a PHP template and just cache that? Then all hits would be served directly by the very simple PHP template?
Well we have no support for that yet :-(
Could you elaborate on that idea? I'm not sure I fully understand what you are suggesting.
-- daniel
On 10/07/12 23:14, Daniel Kinzler wrote:
On 10.07.2012 13:58, Antoine Musso wrote:
Have you considered generating a PHP template and just cache that? Then all hits would be served directly by the very simple PHP template?
Could you elaborate on that idea? I'm not sure I fully understand what you are suggesting.
I'm not sure if that is what he meant, but there exists an idea of creating a new level of parser cache: when a page is parsed, the parser would create and cache an HTML template that is static, plus a mapping to the non-static objects, and then when a local version is needed it would just replace the objects without recreating the HTML.
So, for example, if you parse the wikitext "Now is {{LOCALMONTHNAME}}", the parser would produce something like "<p>Now is f93c57b0-cb27-11e1-9b23-0800200c9a66</p>" (static template) and f93c57b0-cb27-11e1-9b23-0800200c9a66 => "{{LOCALMONTHNAME}}" (mapping), and then when a local version is needed it would just parse "{{LOCALMONTHNAME}}" and insert the result into the template.
An additional advantage of this is that in addition to localized data it could be applied to any separate textual block. So, for example, wikitext
"{{Infobox}}
Blah blah"
would actually be parsed like
"<p>f73d0fd0-cb28-11e1-9b23-0800200c9a66</p> <p>Blah blah</p>"
f73d0fd0-cb28-11e1-9b23-0800200c9a66 => "{{Infobox}}"
And then when you change the template, it would be re-parsed and inserted into all the articles that use it, instead of completely re-parsing every article.
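Just to make the idea concrete, something like this (very rough PHP; the function names and the placeholder scheme are invented, this is not existing parser code):

  <?php
  // Stage 1 (done once and cached): replace the dynamic fragments with
  // placeholders and render the rest to a static HTML template.
  function buildTemplate( $wikitext, array $dynamicFragments, $parse ) {
      $map = array();
      foreach ( $dynamicFragments as $fragment ) {
          $placeholder = 'ph-' . md5( $fragment ); // stand-in for a uuid
          $map[$placeholder] = $fragment;
          $wikitext = str_replace( $fragment, $placeholder, $wikitext );
      }
      return array(
          'html' => call_user_func( $parse, $wikitext ), // static part
          'map'  => $map,                                // placeholder => wikitext
      );
  }

  // Stage 2 (done per request / per language): parse only the fragments
  // and splice them into the cached template, without re-parsing the page.
  function renderFromTemplate( array $cached, $parse ) {
      $html = $cached['html'];
      foreach ( $cached['map'] as $placeholder => $fragment ) {
          $html = str_replace( $placeholder, call_user_func( $parse, $fragment ), $html );
      }
      return $html;
  }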
However, a big problem with this is that the parser would have to somehow be aware of when a piece of wikitext affects other wikitext and when it does not. For example, it would somehow have to know that something like
"{{Table start}} ||blah blah {{Table end}}"
could not fit in the model and that these templates could not be replaced with uuids.
On 11.07.2012 03:30, Nikola Smolenski wrote:
On 10/07/12 23:14, Daniel Kinzler wrote:
On 10.07.2012 13:58, Antoine Musso wrote:
Have you considered generating a PHP template and just cache that? Then all hits would be served directly by the very simple PHP template?
Could you elaborate on that idea? I'm not sure I fully understand what you are suggesting.
I'm not sure if that is what he meant, but there exists an idea of creating a new level of parser cache: when a page is parsed, the parser would create and cache an HTML template that is static, plus a mapping to the non-static objects, and then when a local version is needed it would just replace the objects without recreating the HTML.
I can imagine that this may be an advantage over re-parsing wikitext. We, however, generate HTML from a JSON-like data structure. I'd expect this to be faster than processing another level of templating...
-- daniel
On 10/07/12 19:28, Daniel Kinzler wrote:
- We will not use the parser cache for data pages (normal wikitext pages on the
wikidata site, e.g. talk pages, will use it). The reasons are: a) memory for the parser cache is relatively scarce, multiplying the number of cached objects by the number of languages (dozens, if not hundreds) is not a good idea.
I don't know if you have mentioned this, but it isn't an all-or-nothing situation. For example, the cache could contain only the few most-viewed languages. Or it could be limited in size, with the oldest entries deleted when new ones are cached. This would ensure that the most visited pages are the ones displayed most quickly to visitors.
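For illustration, a size-limited cache of that kind could be as simple as this (nothing MediaWiki-specific, just to show the "drop the oldest entry" policy, keyed by page and language):

  <?php
  // Illustration only, not actual parser cache code: a size-limited cache
  // keyed by (page, language) that drops the oldest entry when full.
  class BoundedRenderCache {
      private $maxEntries;
      private $entries = array();

      public function __construct( $maxEntries ) {
          $this->maxEntries = $maxEntries;
      }

      public function get( $pageId, $lang ) {
          $key = $pageId . '!' . $lang;
          return isset( $this->entries[$key] ) ? $this->entries[$key] : null;
      }

      public function set( $pageId, $lang, $html ) {
          $key = $pageId . '!' . $lang;
          unset( $this->entries[$key] );
          if ( count( $this->entries ) >= $this->maxEntries ) {
              // PHP arrays keep insertion order, so the first key is the oldest.
              reset( $this->entries );
              unset( $this->entries[key( $this->entries )] );
          }
          $this->entries[$key] = $html;
      }
  }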
On 10/07/12 19:28, Daniel Kinzler wrote:
- We will split the squid cache by language, using a cookie that specifies the
user's language preference (for both logged in users and anons). The same URL is used for all language versions of the page (this keeps purging simple). The reasons are:
Uhh. If the same URL is used for all language versions of the page, does this mean that Google bot for example will not be able to see non-English pages?
On 11.07.2012 04:04, Nikola Smolenski wrote:
On 10/07/12 19:28, Daniel Kinzler wrote:
- We will split the squid cache by language, using a cookie that specifies the
user's language preference (for both logged in users and anons). The same URL is used for all language versions of the page (this keeps purging simple). The reasons are:
Uhh. If the same URL is used for all language versions of the page, does this mean that Google bot for example will not be able to see non-English pages?
good call!
Hmmm... Denny? What do we do about this? Use the non-canonical language-specific URLs to set the user language for anons?
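Just thinking aloud (we didn't discuss this in DC): we could keep the cookie for humans but also advertise language-specific URLs, e.g. using the uselang parameter, as alternates that crawlers can follow. Something like this (purely illustrative, the item URL below is made up):

  <?php
  // Purely illustrative: emit crawler-visible alternates for each language,
  // pointing at uselang-qualified URLs of the same page.
  $baseUrl = 'https://wikidata.example.org/wiki/Q42';
  foreach ( array( 'en', 'de', 'fr' ) as $code ) {
      printf(
          "<link rel=\"alternate\" hreflang=\"%s\" href=\"%s?uselang=%s\" />\n",
          htmlspecialchars( $code ),
          htmlspecialchars( $baseUrl ),
          htmlspecialchars( $code )
      );
  }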
-- daniel
On 10/07/12 19:28, Daniel Kinzler wrote:
Hi all!
At the hackathon in DC, I have been discussing options for caching Wikidata content with a few people, primarily RobLa and Ryan Lane. Here's a quick overview of the problem, and what I took away from the conversation.
So, what we need is this:
- Wikidata objects are multi-lingual
- Wikidata generates language specific views of the data objects ("items" and
other "entities") as HTML.
- To provide good response times, we need to cache that generated HTML on some
level, but that cache needs to be specific to (i.e. varying on) the user's language.
- anonymous users must have a way to set their preferred language, so they are
able to read pages on wikidata in their native language.
Now, there are (at least) two levels of caching to be considered:
- the parser cache, containing the HTML for every page's content. This cache
uses keys based on the ParserOptions, which includes a field for specifying the language. However, currently only a "canonical" version of the rendered page is cached, disregarding the user language.
That's not true. You are confusing content language (which is what you usually want inside a page) with interface language. If you have German as your language preference on an English wiki, you will see pages cached with lang=de (as soon as they have language-specific content, such as a TOC or edit links), and sharing that parser cache only with other German-speaking users. I expect very few users do that currently, though, so their current impact would be negligible.
- the squid cache, containing full HTML pages for each URL. Currently, our squid
setup doesn't support language versions of the same page; it does however support multiple versions of the same page in general - a mechanism that is already in use for caching different output for different mobile devices.
For each cache, we have to decide whether we want to split the cache by language, or not use that cache for wikidata. After some discussion with RobLa and Ryan, we came to the following conclusion:
- We will not use the parser cache for data pages (normal wikitext pages on the
wikidata site, e.g. talk pages, will use it). The reasons are: a) memory for the parser cache is relatively scarce; multiplying the number of cached objects by the number of languages (dozens, if not hundreds) is not a good idea. b) generating HTML from Wikidata objects (which are JSON-like array structures) is much quicker than generating HTML from wikitext. c) Wikidata will have relatively few direct page views; it will mainly be used via Wikipedia pages or the API.
I'm not sure what you mean by this. Are you talking about a) sites with a "wikidata widget" (such as an infobox)? b) A special page showing wikidata items? c) A wiki for storing wikidata information.
If (a), it makes no sense to have the infobox in your native language and e.g. the sidebar labels still in the content language. (Even then, you could do that without partitioning the parser cache; see the code for edit links.)
If (b), special pages don't use the parser cache...
If (c), see below.
Splitting the parser cache by language may be desired in the future, depending on what caching mechanisms will be available for the parser cache, and how expensive HTML generation actually is. We agreed however that this could be left for later and only needs to be addressed in case of actual performance problems.
- We will split the squid cache by language, using a cookie that specifies the
user's language preference (for both logged in users and anons). The same URL is used for all language versions of the page (this keeps purging simple). The reasons are:
(...)
The parser cache already supports per-language splitting. You MAY want to ignore the anon language cookie for "normal" wikis, in anticipation of a lower cache hit ratio. But it SHOULD be used on multilingual wikis. Especially for Commons, where there's heavy use of in-page language customization, and the parser cache is already heavily split anyway. Incubator will probably benefit a lot from allowing anons to view the skin in their own language. Not sure about meta. For Wikidata, I expect the same cross-wiki usage as Wikimedia Commons, so the anon cookie should be obeyed too.
On 12.07.2012 20:33, Platonides wrote:
- the parser cache, containing the HTML for every page's content. This cache
uses keys based on the ParserOptions, which includes a field for specifying the language. However, currently only a "canonical" version of the rendered page is cached, disregarding the user language.
That's not true. You are confusing content language (which is what you usually want inside a page) with interface language. If you have German as your language preference on an English wiki, you will see pages cached with lang=de (as soon as they have language-specific content, such as a TOC or edit links), and sharing that parser cache only with other German-speaking users.
Are you sure? That's how I THOUGHT this works, but looking through the code, this doesn't seem to be the case, at least not when editing. In WikiPage::doEdit, the ParserOptions are created with the "canonical" keyword, not the user object, at least...
I didn't *quite* figure out how this works, though, or why this is done. Can you explain? See my questions at the very bottom for more details.
b) generating HTML from Wikidata objects (which are JSON-like array structures) is much quicker than generating HTML from wikitext. c) Wikidata will have relatively few direct page views; it will mainly be used via Wikipedia pages or the API.
I'm not sure what you mean by this. Are you talking about a) sites with a "wikidata widget" (such as an infobox)? b) A special page showing wikidata items? c) A wiki for storing wikidata information.
a) will be viewed very frequently, while b) and c) will be used rarely.
If (a), it makes no sense to have the infobox in your native language and e.g. the sidebar labels still in the content language. (Even then, you could do that without partitioning the parser cache; see the code for edit links.)
Right. For rendering infoboxes, an entirely different rendering mechanism will be used, based on parser functions. This will always use the content language.
If (b), special pages don't use the parser cache...
Items are page content. They are shown as page content on regular wiki pages. No special page is required.
But the rendering of these data-content wiki pages will depend on the user interface language.
If (c), see below.
Right, that's what wikidata is and what we are talking about.
The parser cache already supports per-language splitting. You MAY want to ignore the anon language cookie for "normal" wikis, in anticipation of a lower cache hit ratio. But it SHOULD be used on multilingual wikis.
absolutely.
Especially for Commons, where there's heavy use of in-page language customization, and the parser cache is already heavily split anyway. Incubator will probably benefit a lot from allowing anons to view the skin in their own language. Not sure about meta. For Wikidata, I expect the same cross-wiki usage as Wikimedia Commons, so the anon cookie should be obeyed too.
Yes.
But it's my impression that the mechanism for user-language specific parser cache keys is currently broken. Or we did something wrong to break it. Is there some documentation on this? I'm especially interested in:
a) how and why the "canonical" ParserOptions are used upon save, and how that impacts the parser cache
b) how ParserOptions::getUsedOptions() interacts with the rest of the system, and how it is or should be populated.
thanks!
-- daniel
On 13/07/12 20:42, Daniel Kinzler wrote:
But it's my impression that the mechanism for user-language specific parser cache keys is currently broken. Or we did something wrong to break it. Is there some documentation on this? I'm especially interested in:
b) how ParserOptions::getUsedOptions() interacts with the rest of the system, and how it is or should be populated.
The used options are magically populated. You don't need to do anything. When the ParserOptions object is created, no option has been used yet. Then the page is parsed, and both the parser and extensions call the ParserOptions methods and can conditionally base their output on the results. They MUST access the options through the ParserOptions; using e.g. $wgUser->getOption() would be a sin. When the ParserOutput is going to be stored, ParserOptions::getUsedOptions() is called, which simply returns a list of the methods (those that depend on user preferences) that have been called on the ParserOptions, which it has been dutifully tracking.
To get the user language, you want to use $popts->getUserLangObj(). Nothing more is needed. Remember that you need to use that object and not e.g. wfMsg().
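For example, a parser function hook that wants user-language output would look roughly like this (the hook function and the message key are invented; the point is the ParserOptions access, which gets tracked automatically):

  <?php
  // Illustration: "wfMyExtensionRender" and the message key are made up.
  // Going through ParserOptions means the use of the user language is
  // recorded, and so ends up splitting the parser cache key accordingly.
  function wfMyExtensionRender( Parser $parser ) {
      $lang = $parser->getOptions()->getUserLangObj();
      return wfMessage( 'myextension-greeting' )->inLanguage( $lang )->text();
  }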
a) how and why the "canonical" ParserOptions are used upon save, and how that impacts the parser cache
The "canonical" or not (WikiPage::makeParserOptions) is only used to determine if the ParserOptions should be created with user settings or defaults. The page is always rendered on save with the 'canonical' options to keep the page table links consistent (bug 14404). Then, when the user visits the page after save, it is rerendered with its own preferences if the just-cached general ones doesn't fit his options.
Regards