(I'm going to use "local wiki" here for what Peter is calling "distant wiki", and "foreign wiki" for what he's calling "home wiki". This seems to better match the terminology we use for Commons.)
On Tue, May 25, 2010 at 7:41 AM, Peter17 peter017@gmail.com wrote:
> Yes. The shared database would be only for invalidating the cache when a template is edited. In my 3rd (preferred) solution, the templates are still fetched through the API. External wikis can transclude them and cache them for an arbitrary time, as ForeignAPIRepo does.
> Ok, I will keep this in mind. Parsing the template on the home wiki seems necessary because it can use other templates hosted on that wiki to render correctly... I think it is the most logical way to do it, isn't it?
I think parsing the template on the local wiki is better, because it gives you more flexibility. For instance, it can use local {{SITENAME}} and so forth. {{CONTENTLANG}} would be especially useful, if we're assuming that templates will be transcluded to many languages.
This doesn't mean that it has to use the local wiki's templates. There would be two ways to approach this:
1) Just don't use the local wiki's templates. Any template calls from the foreign wiki's template should go to the foreign wiki, not the local wiki. If this is being done over the API, then as an optimization, you could have the foreign wiki send back all templates that will be required, not just the actual template requested.
2) Use the local wiki's templates, and assume that the template on the foreign wiki is designed to be used remotely and will only call local templates when it's really desired. This gives even more flexibility if the foreign template is designed for this use, but it makes it harder to use templates that aren't designed for foreign use.
At first glance, it seems to me that (1) is the best -- do all parsing on the local wiki, but use templates from the foreign wiki. This will cause errors if the local wiki doesn't have the necessary extensions installed, like ParserFunctions, but it gives more flexibility overall.
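To make (1) a bit more concrete, here's a rough sketch of what the fetching side could look like. The function names are made up, and the API parameters (in particular generator=templates) are written from memory, so check them against a live api.php; a purpose-built API module on the foreign wiki could fold the two requests into a single response, which is the optimization mentioned above.

<?php
// Rough sketch only: fetch a template and the templates it transcludes
// from the foreign wiki's api.php, so the *local* parser can expand it
// with local magic words like {{SITENAME}} and {{CONTENTLANG}}.

function apiQuery( $apiBase, array $params ) {
	$params += array( 'action' => 'query', 'format' => 'json' );
	$json = file_get_contents( $apiBase . '?' . http_build_query( $params ) );
	return json_decode( $json, true );
}

function fetchForeignTemplate( $apiBase, $title ) {
	// 1) The requested template itself.
	$main = apiQuery( $apiBase, array(
		'titles' => $title,
		'prop'   => 'revisions',
		'rvprop' => 'content',
	) );

	// 2) Everything it transcludes, using 'templates' as a generator.
	$deps = apiQuery( $apiBase, array(
		'titles'    => $title,
		'generator' => 'templates',
		'prop'      => 'revisions',
		'rvprop'    => 'content',
	) );

	// Collect title => wikitext pairs for the local parser to use.
	$texts = array();
	foreach ( array( $main, $deps ) as $result ) {
		if ( !isset( $result['query']['pages'] ) ) {
			continue;
		}
		foreach ( $result['query']['pages'] as $page ) {
			if ( isset( $page['revisions'][0]['*'] ) ) {
				$texts[$page['title']] = $page['revisions'][0]['*'];
			}
		}
	}
	return $texts;
}

// e.g. fetchForeignTemplate( 'http://en.wikipedia.org/w/api.php',
//                            'Template:Infobox' );

The local parser would then be told to look titles up in that array before (or instead of) the local template namespace.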
Another issue here is performance. Parsing is one of the most expensive operations MediaWiki does. Nobody's going to care much if foreign sites request a bunch of templates that can be served out of Squid, but if there are lots of foreign sites that are requesting giant infoboxes and those have to be parsed by Wikimedia servers, Domas is going to come along with an axe pretty soon and everyone's sites will break. Better to head that off at the pass.
> Mmmh... sorry, I'm not really sure I understand... My suggestion is to use a shared database that would store the remote calls, not the content of the pages... In my mind, fetching the distant pages would be done through the API, not by directly accessing the distant database. External wikis will soon be able to access our images very easily with $wgUseInstantCommons, but that is still not direct database access...
What you're proposing is that Wikimedia servers do this on a cache miss:
1) An application server sends an HTTP request to a Squid with If-Modified-Since.
2) The Squid checks its cache, finds it's a miss, and passes the request to another Squid.
3) The other Squid checks its cache, finds it's a miss, and passes the request to a second application server.
4) The second application server loads up the MediaWiki API and sends a request to a database server.
5) The database server returns the result to the second application server.
6) The second application server returns the result to the Squids, which cache it and return it to the first application server.
7) The first application server caches the result in the database.
What I'm proposing is that they do this:
1) An application server sends a database query to a database server (maybe even using an already-open connection).
2) The database server returns the result.
Having Wikimedia servers send HTTP requests to each other instead of just doing database queries does not sound like a great idea to me. You're hitting several extra servers for no reason, including extra requests to an application server. On top of that, you're caching stuff in the database which is already *in* the database! FileRepo does this the Right Way, and you should definitely look at how that works. It uses polymorphism to use the database if possible, else the API.
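For illustration only, the same pattern applied to templates might look roughly like this. The class and method names are invented for the example (modelled on ForeignDBRepo/ForeignAPIRepo), not actual MediaWiki classes:

<?php
// Not real MediaWiki code -- just the shape of the FileRepo pattern:
// one abstract repo, two interchangeable backends, chosen by config.

abstract class ForeignTemplateRepo {
	// Return the wikitext of a template, or null if it doesn't exist.
	abstract public function getTemplateText( $title );
}

// For wikis on the same farm: talk to the other wiki's database
// directly, no HTTP involved.
class ForeignDBTemplateRepo extends ForeignTemplateRepo {
	private $wikiId;

	public function __construct( $wikiId ) {
		$this->wikiId = $wikiId;
	}

	public function getTemplateText( $title ) {
		// Assumes wfGetDB() accepts a wiki ID as its third argument,
		// as it does for cross-wiki access within a farm.
		$dbr = wfGetDB( DB_SLAVE, array(), $this->wikiId );
		// ...load the latest revision text of $title from $dbr...
		return null; // placeholder for the sketch
	}
}

// For third parties: go through api.php on the foreign wiki, the way
// ForeignAPIRepo does for images.
class ForeignAPITemplateRepo extends ForeignTemplateRepo {
	private $apiBase;

	public function __construct( $apiBase ) {
		$this->apiBase = $apiBase;
	}

	public function getTemplateText( $title ) {
		// ...HTTP request to $this->apiBase, as in the earlier sketch...
		return null; // placeholder for the sketch
	}
}

The calling code just asks its configured repo for a template and never needs to know which backend it got, which is exactly how Wikimedia servers would end up doing plain database queries while third parties fall back to HTTP.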
However, someone like Tim Starling should be consulted for a definitive performance assessment; don't rely on my word alone.
On Tue, May 25, 2010 at 9:11 AM, church.of.emacs.ml church.of.emacs.ml@googlemail.com wrote:
> Yes. When I think about this a bit more, it makes sense to parse on the home wiki, because otherwise (a) you couldn't include other remote templates or (b) you would need one API call per included template. Neither is feasible.
Just have it return all needed templates at once if you want to minimize round-trips.
> However, you'd have to make sure that each distant wiki uses only a fair share of the home wiki server's resources. E.g. set a limit on inclusions (that limit would have to be enforced on the home wiki's side) and disallow infinite loops (they're always fun).
This is probably not enough. I really doubt Wikimedia is going to let a sizable fraction of its CPU time go to foreign template use. Serving images or plain old wikitext from Squid cache is very cheap, so that's not a big deal, but large-scale parsing will be too much, I suspect. (But again, ask Tim about this.)
> Do I understand this correctly... you can either access a foreign repository via the API (if you're on another server) or directly via the database (if you're on the same wiki farm)? Very cool stuff.
Yes, that's how FileRepo works.
On Tue, May 25, 2010 at 9:22 AM, Platonides Platonides@gmail.com wrote:
> He can internally call the API of the other wiki via FauxRequest.
How will that interact with different configuration settings? I thought FauxRequest only handles requests to the current wiki.
> I'm afraid that it will produce the opposite. A third party downloads an XML dump for offline use, but it doesn't work because it needs a dozen templates from Meta (in the worst case, templates from a dozen other wikis).
My point is that ideally, you'd be able to copy-paste enwiki pages and then get the templates to work by configuring them to be fetched from enwiki. Even more ideally, you might want to fetch the enwiki templates as of the point in time your page was downloaded, in case the templates changed syntax (and also to allow indefinite caching).
But I guess that's much better handled by just using a proper export, and having the templates included in that, so never mind.
On Tue, May 25, 2010 at 9:30 AM, Platonides Platonides@gmail.com wrote:
> Infinite loops could only happen if both wikis can fetch from each other. A simple solution would be to pass along with the query who originally requested it. If the home wiki calls a different wiki, it would blame the one who asked for it (or maybe build up a wiki + template path).
An even simpler solution would be to only set up one wiki to allow this kind of foreign template request, the way Commons is set up now. But that might be limiting.
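And if the path-passing route is taken, the check itself is trivial. A sketch, with the path format and function names invented for the example:

<?php
// Sketch of the "pass the path along" idea: each hop appends its own
// wiki ID to a path the request carries, and a wiki refuses to serve
// a request whose path already contains it (a loop) or that has grown
// too long (a cheap depth limit).

function checkTransclusionPath( $path, $ownWikiId, $maxDepth = 5 ) {
	$hops = ( $path === '' ) ? array() : explode( '|', $path );

	if ( in_array( $ownWikiId, $hops ) ) {
		return false; // we're already in the chain: a loop
	}
	if ( count( $hops ) >= $maxDepth ) {
		return false; // chain too long
	}
	return true;
}

// Before forwarding a template request to yet another wiki, a wiki
// appends itself to the chain it received:
function extendTransclusionPath( $path, $ownWikiId ) {
	return ( $path === '' ) ? $ownWikiId : $path . '|' . $ownWikiId;
}

The depth limit also doubles as the per-request inclusion cap mentioned earlier.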