Wikitech-l July 2012

wikitech-l@lists.wikimedia.org

128 participants
115 discussions

Wikidata: Splitting the Cache by Language

by Daniel Kinzler

Hi all! At the hackathon in DC, I have been discussing options for caching Wikidata content with a few people, primarily RobLa and Ryan Lane. Here's a quick overview of the problem, and what I took away from the conversation. So, what we need is this: * Wikidata objects are multi-lingual * Wikidata generates language specific views of the data objects ("items" and other "entities") as HTML. * To provide good response times, we need to cache that generated HTML on some level, but that cache needs to be specific to (i.e. varying on) the user's language. * anonymous users must have a way to set their preferred language, so they are able to read pages on wikidata in their native language. Now, there are (at least) two levels of caching to be considered: * the parser cache, containing the HTML for every page's content. This cache uses keys based on the ParserOptions, which includes a field for specifying the language. However, currently only a "canonical" version of the rendered page is cached, disregarding the user language. * the squid cache, containing full HTML pages for each URL. Currently, our squid setup doesn't support language versions of the same page; it does however support multiple versions of the same page in general - a mechanism that is already in use for caching different output for different mobile devices. For each cache, we have to decide whether we want to split the cache by language, or not use that cache for wikidata. After some discussion with RobLa and Ryan, we came to the following conclusion: * We will not use the parser cache for data pages (normal wikitext pages on the wikidata site, e.g. talk pages, will use it). The reasons are: a) memory for the parser cache is relatively scarce, multiplying the number of cached objects by the number of languages (dozens, if not hundreds) is not a good idea. b) generating HTML from Wikidata objects (which are JSON-like array structures) is much quicker than generating HTML from wikitext c) Wikidata will have relatively view direct page view, it will mainly be used via Wikipedia pages or the API. Splitting the parser cache by language may be desired in the future, depending on what caching mechanisms will be available for the parser cache, and how expensive HTML generation actually is. We agreed however that this could be left for later and only needs to be addressed in case of actual performance problems. * We will split the squid cache by language, using a cookie that specifies the user's language preference (for both logged in users and anons). The same URL is used for all language versions of the page (this keeps purging simple). The reasons are: a) bypassing caching completely would open us up to DDoS attacks, or even unintentional DoS by spiders, etc. b) the amount of pages in the cache need not be large, only a small subset will be actively viewed. c) squid memory is relatively plenty, the experience with splitting the cache by mobile device shows that this is feasible. Note that logged in users currently bypass the squid cache. This may change in the future, but the above scheme is oblivious to this: it will cache based on the language cookie. Looking more closely, there is four cases to consider: * anonymous with language cookie: vary cache on the language in the cookie, keep URL unchanged * logged in with language cookie: once squid cache is implemented for logged in users, vary cache on the language in the cookie, keep URL unchanged. Until then, bypass the cache. * anonymous with no cookie set (first access, or the user agent doesn't support cookies): this should use the primary content language (english). Squids could also detect the user language somehow, set the cookie, and only then pass the request to the app server - this would be nice, but puts app logic into the squids. Can be added later if desired, MediaWiki is oblivious to this. * logged in with no cookie: skip cache. the response will depend on the user's language preference stored in the database, caching it without the language cookie to vary on would poison the cache. We could try to come up with a scheme to mitigate the overhead this generates, but that only becomes relevant once squid layer caching is supported for logged in users at all. So, that's it: no parser cache, split squid cache, use cookie smartly. What do you think? Does that sound good to you? -- daniel

11 years, 10 months

WikiMania?

by Derric Atzrott

Out of curiosity, is there anyone else here at WikiMania from the mailing list? Thank you, Derric Atzrott Computer Specialist Alizee Pathology

11 years, 10 months

midsummer updates from GSoC students?

by Sumana Harihareswara

Akshay Chugh on the Convention Extension: http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/061362.html (looks like the Wikimania 2013 organizers are working with him to make it work for them!) Ashish Dubey on realtime collaborative editing: http://ashishdubey.info/blog/2012/7/5/realtimeve-first-demo.html (new demo!) Aaron Pramana on the watchlist: http://mw-watchlist.tumblr.com/ Nischay Nahata on SMW optimization: https://greensmw.wordpress.com/ I'd love to hear more from: Robin Pepermans on Incubator: https://gerrit.wikimedia.org/r/#/q/owner:SPQRobin,n,z Platonides on the native uploading app: http://thread.gmane.org/gmane.org.wikimedia.wikilovesmonuments/2641/ Harry Burt on TranslateSVG: https://www.mediawiki.org/wiki/Extension:TranslateSvg/2.0 Ankur Anand on UploadWizard: https://gerrit.wikimedia.org/r/#/dashboard/144 Suhas Rao, on OpenStackManager: https://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/suhasmo… Robin, Platonides, Harry, Ankur, and Suhas: please reply to this thread with updates on how you've been doing. If you're running into any obstacles, please speak up! -- Sumana Harihareswara Engineering Community Manager Wikimedia Foundation

11 years, 10 months

Wikidata: Splitting the Cache by Language

by bawolff

> Message: 3 > Date: Tue, 10 Jul 2012 13:28:38 -0400 > From: Daniel Kinzler <daniel(a)brightbyte.de> > To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>, > Wikidata-intern <wikidata-intern(a)wikimedia.de> > Subject: [Wikitech-l] Wikidata: Splitting the Cache by Language > Message-ID: <4FFC6646.2080707(a)brightbyte.de> > Content-Type: text/plain; charset=UTF-8 > > Hi all! > > At the hackathon in DC, I have been discussing options for caching Wikidata > content with a few people, primarily RobLa and Ryan Lane. Here's a quick > overview of the problem, and what I took away from the conversation. > > So, what we need is this: > > * Wikidata objects are multi-lingual > * Wikidata generates language specific views of the data objects ("items" and > other "entities") as HTML. > * To provide good response times, we need to cache that generated HTML on some > level, but that cache needs to be specific to (i.e. varying on) the user's language. > * anonymous users must have a way to set their preferred language, so they are > able to read pages on wikidata in their native language. > > Now, there are (at least) two levels of caching to be considered: > > * the parser cache, containing the HTML for every page's content. This cache > uses keys based on the ParserOptions, which includes a field for specifying the > language. However, currently only a "canonical" version of the rendered page is > cached, disregarding the user language. That's not entirely true. In cases where the page differs depending on language ( {{int:...}}, Pages with the babel extension depending on config of that extension), we split the parser cache based on user language Well those might seem like small cases, I would guess (to pull a number out of nowhere, so probably wrong) that at least 75% of pages on commons have the parser cache split by user language. > > * the squid cache, containing full HTML pages for each URL. Currently, our squid > setup doesn't support language versions of the same page; it does however > support multiple versions of the same page in general - a mechanism that is > already in use for caching different output for different mobile devices. > > For each cache, we have to decide whether we want to split the cache by > language, or not use that cache for wikidata. After some discussion with RobLa > and Ryan, we came to the following conclusion: > > * We will not use the parser cache for data pages (normal wikitext pages on the > wikidata site, e.g. talk pages, will use it). The reasons are: > a) memory for the parser cache is relatively scarce, multiplying the number of > cached objects by the number of languages (dozens, if not hundreds) is not a > good idea. > b) generating HTML from Wikidata objects (which are JSON-like array structures) > is much quicker than generating HTML from wikitext > c) Wikidata will have relatively view direct page view, it will mainly be used > via Wikipedia pages or the API. > > Splitting the parser cache by language may be desired in the future, depending > on what caching mechanisms will be available for the parser cache, and how > expensive HTML generation actually is. We agreed however that this could be left > for later and only needs to be addressed in case of actual performance problems. > > * We will split the squid cache by language, using a cookie that specifies the > user's language preference (for both logged in users and anons). The same URL is > used for all language versions of the page (this keeps purging simple). The > reasons are: > a) bypassing caching completely would open us up to DDoS attacks, or even > unintentional DoS by spiders, etc. > b) the amount of pages in the cache need not be large, only a small subset will > be actively viewed. > c) squid memory is relatively plenty, the experience with splitting the cache by > mobile device shows that this is feasible. > > Note that logged in users currently bypass the squid cache. This may change in > the future, but the above scheme is oblivious to this: it will cache based on > the language cookie. Looking more closely, there is four cases to consider: > > * anonymous with language cookie: vary cache on the language in the cookie, keep > URL unchanged > * logged in with language cookie: once squid cache is implemented for logged in > users, vary cache on the language in the cookie, keep URL unchanged. Until then, > bypass the cache. > * anonymous with no cookie set (first access, or the user agent doesn't support > cookies): this should use the primary content language (english). Squids could > also detect the user language somehow, set the cookie, and only then pass the > request to the app server - this would be nice, but puts app logic into the > squids. Can be added later if desired, MediaWiki is oblivious to this. > * logged in with no cookie: skip cache. the response will depend on the user's > language preference stored in the database, caching it without the language > cookie to vary on would poison the cache. We could try to come up with a scheme > to mitigate the overhead this generates, but that only becomes relevant once > squid layer caching is supported for logged in users at all. > > So, that's it: no parser cache, split squid cache, use cookie smartly. > > What do you think? Does that sound good to you? > > -- daniel Does this imply anon users can some how have a language set via a cookie? As it stands you need to be logged in to have a lang preference. Are there concrete plans to have squid servers start partially caching pages for logged in users? (And if so are we including the content area of the page with that?). If there are no concrete plans to have squid start caching logged in views, and only logged in users have a lang preference, it doesn't seem to make a lot of sense to split the squid cache based on language as that would catach 0% of the hits. If there is concern that parser cache would be too diluted by this, why not cache in the normal $wgMemc->set cache. We certainly cache all sorts of stuff there much of it not even very expensive to generate. -- -bawolff

11 years, 10 months

Re: [Wikitech-l] midsummer updates from GSoC students?

by Harry Burt

Okay, so my last email doesn't seem to have made it to the list, so I'm going to rehash it and update a few bits here and there :) --- Hey all. I'm currently about 40% of the way through my own scheduled hours on my Google Summer of Code project TranslateSvg [1],so it's a good time to take a step back and survey the scene. The original project plan consisted largely of five parts: three main "phases" plus an introduction and a long wrapup period. At this point, I'm more or less where I should be: the introduction, phases 1 and 2 completed; 3 in progress; wrapup not yet started. You can see what it means to say "phases 1 and 2 completed" by taking a look at this video [2] (you may need to turn your sound up), which follows a user (me pretending to be French) translating a file into his/her own language. A wizard or guide to make the interface, which is borrowed from the Translate extension, more intuitive to newbies is in the works, as is the addition of a "color" property to help with recolouring text after translation. Reuse onwiki for this visitor could now as simple as [[File:Picturebook 1.svg|thumb|lang=fr|Caption.]] . TranslateSvg currently supports about 76% of all the ~150,000 translatable SVG files on Wikimedia Commons; now the basic import structure is complete, I'm working on pushing that figure up towards 99%. Problems thus far have been minimal and easily worked around with the help of my mentor Max Semenik and developer Niklas Laxström, one of the primary authors of the Translate extension, which TranslateSvg layers on top of. I hope to have a live working demonstration wiki installation up and running within the next fortnight, after phase 3 is complete. I'd love to hear any comments either then or now :) Thanks, Harry -- Harry Burt (User:Jarry1250) GSoC student [1] https://www.mediawiki.org/wiki/Extension:TranslateSvg/2.0 [2] https://commons.wikimedia.org/wiki/File:TranslateSvg_phases_1_and_2.ogv

11 years, 10 months

Gerrit evaluation process

by Rob Lanphier

Hi everyone, As you know, when we moved to Git, we decided we would retire our home grown "Code Review" extension for MediaWiki. Having collectively not had a lot of experience with Git-based code review tools, we decided to try Gerrit. Moving to Git was a very deliberate decision that was discussed over many years, Gerrit wasn't discussed in much detail until after the migration. As discussed earlier on this list, we'd like to revisit the Gerrit decision, and either commit to it for the next couple years or so, or put in place a plan to migrate to some other system. Here's the plan: let's discuss the alternatives here for the next three weeks, keeping this page up-to-date based on the discussion: http://www.mediawiki.org/wiki/Git/Gerrit_evaluation Brion has graciously agreed to act as the judge and jury, and make the call which way to go. He's in a good position to be a reasonably neutral but also very informed party. He also doesn't have the conflict of interest that I have (I've got an admitted pro-Gerrit bias just due to not wanting to take on more work for my team's backlog). You'll notice that the wiki page above has a "The case for Gerrit" and "The case against Gerrit". Chad is lead editor the pro-Gerrit case, so the section "The case for Gerrit" is not NPOV, it's ChadPOV. :) Similarly, the anti-Gerrit case lead editor is David Schoonover. You shouldn't let that stop you from editing that page though. In fact, the anti-Gerrit section was entirely written by Ori Livneh as of this writing. Chad is coming to San Francisco the week of August 6. We're planning to have a meeting to finalize the decision when he's in town. However, that will hopefully be a rubber stamp of whatever consensus emerges on list, rather than something we're still trying to hash out. So, if you'd like to see us move off of Gerrit, now is your chance. Rob

11 years, 10 months

Re: [Wikitech-l] [MediaWiki-commits] [Gerrit] Add explanation for Varnish "hashing" in light of advisories... - change (operations...varnish[testing/persistent])

by Siebrand Mazeland

Hi mark We're telling people who commit multiple commits to the same repo to squash them before submitting. Yesterday I got around 550 emails from multiple commits by you. Could you please look into how that can be reduced? Thanks. -- Siebrand Mazeland M: +31 6 50 69 1239 Skype: siebrand Op 11 jul. 2012 om 07:35 heeft "Mark Bergsma (Code Review)" <gerrit(a)wikimedia.org> het volgende geschreven: > Mark Bergsma has submitted this change and it was merged. > > Change subject: Add explanation for Varnish "hashing" in light of advisories. > ...................................................................... > > > Add explanation for Varnish "hashing" in light of advisories. > --- > M doc/sphinx/phk/index.rst > A doc/sphinx/phk/varnish_does_not_hash.rst > 2 files changed, 146 insertions(+), 0 deletions(-) > > Approvals: > Mark Bergsma: Verified; Looks good to me, approved > > > -- > To view, visit https://gerrit.wikimedia.org/r/15035 > To unsubscribe, visit https://gerrit.wikimedia.org/r/settings > > Gerrit-MessageType: merged > Gerrit-Change-Id: I3f98fc1777fca10fe9fe2c02cdb6324784f50948 > Gerrit-PatchSet: 1 > Gerrit-Project: operations/debs/varnish > Gerrit-Branch: testing/persistent > Gerrit-Owner: Mark Bergsma <mark(a)wikimedia.org> > Gerrit-Reviewer: Mark Bergsma <mark(a)wikimedia.org> > > _______________________________________________ > MediaWiki-commits mailing list > MediaWiki-commits(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

11 years, 10 months

Upstream keyword in bugzilla

by Niklas Laxström

I think the upstream keyword in bugzilla is useless. Can we replace it with a field which takes an url to the upstream bug report? -Niklas -- Niklas Laxström

11 years, 10 months

Wikidata Infoboxes

by jmcclure＠hypergrove.com

With regard to wikidata infoboxes, it is either * that the usual technical APIs implemented as parser functions must be learned by every Wikipedian * or that simple interwiki transclusions are entered via the emerging WYSIWYG tools for Wikipedia With regard to the content of wikidata infoboxes, it is either * that nothing is annotated -- all such (semantic web) queries are driven resolutely towards wikidata * or that infobox triples are annotated via HTML/a -- article content can optionally have HTML/a markup In short it is either * no wikipedias can be considered part of the semantic web * or all wikipedias stand at the center of the semantic web

11 years, 10 months

Take a lesson from dreamhoststatus.com

by jidanni＠jidanni.org

On http://wikitech.wikimedia.org/view/Main_Page there is a link to "Current status" which doesn't show what is currently causing Wikipedia to be down. I suggest you take a lesson from http://www.dreamhost.com/ adding a separate http://www.dreamhoststatus.com/ and make lots of links to it...

11 years, 10 months

← Newer
1
...
5
6
7
8
9
10
11
12
Older →

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Wikitech-l July 2012