Hello to all!
I'm a French student and I am participating in Google Summer of Code this year, working on MediaWiki!
My mentor is Roan Kattouw (Catrope) and my subject is "Reasonably efficient interwiki transclusion". You can see my application page here: [1].
I have already discussed this with my mentor, and together we have prepared a draft about my project: [2]. It sums up the current situation and includes some proposals.
It is now open for comments, so could you please read it and let me know your remarks and suggestions, on this list and/or on the talk page?
Thanks in advance
[1] http://www.mediawiki.org/wiki/User:Peter17/GSoc_2010 [2] http://www.mediawiki.org/wiki/User:Peter17/Reasonably_efficient_interwiki_tr...
-- Peter Potrowl http://www.mediawiki.org/wiki/User:Peter17
2010/5/24 Peter17 peter017@gmail.com
My mentor is Roan Kattouw (Catrope) and my subject is "Reasonably efficient interwiki transclusion". You can see my application page here: [1].
Thanks Peter! I'll follow your project with great interest. Nevertheless, a suggestion: take Labeled Section Transclusion into account from the beginning! It's a most interesting extension, with lots of possible uses, but - unluckily and wrongly - it's only seen as a "Wikisource tool" :-( . Obviously you know that a template, Iwpage, which works on Wikisource, has recently been doing a limited form of interwiki transclusion.
Alex
On Mon, May 24, 2010 at 17:44, Peter17 peter017@gmail.com wrote:
My mentor is Roan Kattouw (Catrope) and my subject is "Reasonably efficient interwiki transclusion". You can see my application page here: [1].
The title of the subject is a bit confusing. "Interwiki", for better or worse, refers to interlanguage links.
Consider changing it to "cross-wiki" or something.
-- אָמִיר אֱלִישָׁע אַהֲרוֹנִי Amir Elisha Aharoni
"We're living in pieces, I want to live in peace." - T. Moore
On Mon, May 24, 2010 at 11:18 AM, Amir E. Aharoni amir.aharoni@mail.huji.ac.il wrote:
The title of the subject is a bit confusing. "Interwiki", for better or worse, refers to interlanguage links.
Consider changing it to "cross-wiki" or something.
No it doesn't. Interwiki links don't have to be interlanguage links. Interlanguage links are a subset of interwiki links... those that happen to also be language codes.
-Chad
On Mon, May 24, 2010 at 18:48, Chad innocentkiller@gmail.com wrote:
On Mon, May 24, 2010 at 11:18 AM, Amir E. Aharoni amir.aharoni@mail.huji.ac.il wrote:
The title of the subject is a bit confusing. "Interwiki", for better or worse, refers to interlanguage links.
No it doesn't. Interwiki links don't have to be interlanguage links. Interlanguage links are a subset of interwiki links... those that happen to also be language codes.
You are right, but that's why I wrote "for better or worse": I'd gladly call them "interlanguage", but very frequently people say "interwiki" and mean "interlanguage". Consider the name of http://meta.wikimedia.org/wiki/Pywikipediabot/interwiki.py , for example.
http://www.mediawiki.org/wiki/User:Peter17/Reasonably_efficient_interwiki_tr...
It seems it doesn't work so well. It was inadvertently broken for wikitext transclusions when the interwiki prefix points to the nice URL. See the 'wgEnableScaryTranscluding and Templates/Images?' thread on mediawiki-l.
Hi,
On 05/24/2010 04:44 PM, Peter17 wrote:
It is now open for comments, so could you please read it and let me know your remarks and suggestions, on this list and/or on the talk page?
First of all, let me tell you that I'm really excited about this project. It may very well revolutionize the way we organize templates on Wikimedia and on other wiki farms.
Some notes:
1. You propose a shared database. If I interpret this correctly, it only works inside a set of wikis on the same server farm and doesn't include external wikis. For example, English Wikipedia could transclude templates from Meta Wiki, but not from Wikia. In contrast, $wgForeignFileRepos works for external wikis (which is much better).
2. Parsing the wikitext at the home wiki makes it more difficult to use site magic words, e.g. {{CONTENTLANGUAGE}}. You'd have to pass each and every one as a template parameter (e.g. {{homewiki::templatename|lang={{CONTENTLANGUAGE}}}}).
Kind regards,
--Church of emacs
On Mon, May 24, 2010 at 7:42 PM, church.of.emacs.ml church.of.emacs.ml@googlemail.com wrote:
Some notes:
- You propose a shared database. If I interpret this correctly, it only works inside a set of wikis on the same server farm and doesn't include external wikis. For example, English Wikipedia could transclude templates from Meta Wiki, but not from Wikia. In contrast, $wgForeignFileRepos works for external wikis (which is much better).
If it's done right, you should be able to put various backends on it just like the FileRepo code. Bug 20646 is a good start towards something like this, I think. Being able to store API URLs or database connection info inside an iw_meta field would be awesome for this (and has lots of other applications as well).
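To illustrate (just a sketch, not working code: bug 20646 isn't implemented, and iw_api / iw_meta are hypothetical column names):

# Hypothetical: look up the interwiki prefix and decide how to reach that wiki.
$dbr = wfGetDB( DB_SLAVE );
$row = $dbr->selectRow( 'interwiki', '*', array( 'iw_prefix' => $prefix ), __METHOD__ );
if ( $row && $row->iw_meta ) {
    # Same server farm: iw_meta could hold serialized DB connection info,
    # so template text can be read straight from that wiki's database.
    $source = array( 'type' => 'db', 'params' => unserialize( $row->iw_meta ) );
} elseif ( $row && $row->iw_api ) {
    # External wiki: fall back to HTTP requests against its api.php.
    $source = array( 'type' => 'api', 'url' => $row->iw_api );
}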
-Chad
On 5/24/2010 6:42 PM, church.of.emacs.ml wrote:
- You propose a shared database. If I interpret this correctly, it only [...]
I would suggest not going the shared database route unless the code can be fixed so that shared databases actually work with all of the DB backends.
On Mon, May 24, 2010 at 8:27 PM, Q overlordq@gmail.com wrote:
I would suggest not going the shared database route unless the code can be fixed so that shared databases actually work with all of the DB backends.
I don't see why it shouldn't be easy to get it working with all DB backends. But in any case, for Wikimedia use, a shared database backend is pretty much a must. Having the application servers make HTTP requests to each other to retrieve templates rather than accessing the database directly is just silly, and is going to perform badly. Ideally the code should generalize to work with external wikis too, so that third parties can benefit from our templates as they do from our images. Maybe someday, a copy-pasted Wikipedia article will actually work . . . I can dream.
2010/5/25 Platonides Platonides@gmail.com:
It seems it doesn't work so well. It was inadvertently broken for wikitext transclusions when the interwiki prefix points to the nice URL. See the 'wgEnableScaryTranscluding and Templates/Images?' thread on mediawiki-l.
Well, in my tests, images are included fine because I enabled $wgUseInstantCommons. As I wrote, "the parameters are totally ignored": they are indeed not substituted.
2010/5/25 Chad innocentkiller@gmail.com:
If it's done right, you should be able to put various backends on it just like the FileRepo code. Bug 20646 is a good start to something like this I think. Being able to store API urls or database connection info inside a iw_meta field would be awesome for this (and has lots of other applications as well).
-Chad
Yes. The shared database would be only for invalidating the cache when a template is edited. In my 3rd (preferred) solution, the templates are still fetched through the API. External wikis can transclude them and cache them for an arbitrary time, as ForeignAPIRepo does.
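Something like this is the kind of check I have in mind (a rough sketch only; the transclusion_tracking table and fetchViaApi() are invented names):

# The shared database is only consulted to see whether the cached copy is stale.
$dbr = wfGetDB( DB_SLAVE );   # in reality, a connection to the shared database
$touched = $dbr->selectField(
    'transclusion_tracking',  # hypothetical table, updated by the home wiki on edit
    'tt_touched',
    array( 'tt_wiki' => $prefix, 'tt_title' => $templateName ),
    __METHOD__
);
if ( $cachedTimestamp !== false && $cachedTimestamp >= $touched ) {
    return $cachedText;       # the cached expansion is still fresh
}
# Otherwise refetch the template through the home wiki's API and re-cache it.
return fetchViaApi( $prefix, $templateName );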
2010/5/25 church.of.emacs.ml church.of.emacs.ml@googlemail.com:
Parsing the wikitext at the home wiki makes it more difficult to use site magic words, e.g. {{CONTENTLANGUAGE}}. You'd have to pass each and every one as a template parameter (e.g. {{homewiki::templatename|lang={{CONTENTLANGUAGE}}}}).
OK, I will keep this in mind. Parsing the template on the home wiki seems necessary because it can use other templates hosted on that wiki to render correctly... I think it is the most logical way to do it, isn't it?
2010/5/25 Aryeh Gregor Simetrical+wikilist@gmail.com:
I don't see why it shouldn't be easy to get it working with all DB backends. But in any case, for Wikimedia use, a shared database backend is pretty much a must. Having the application servers make HTTP requests to each other to retrieve templates rather than accessing the database directly is just silly, and is going to perform badly. Ideally the code should generalize to work with external wikis too, so that third parties can benefit from our templates as they do from our images. Maybe someday, a copy-pasted Wikipedia article will actually work . . . I can dream.
Mmmh... sorry, I'm not really sure I understand... My suggestion is to use a shared database that would store the remote calls, not the content of the pages... In my mind, fetching the distant pages would be done through the API, not by accessing the distant database directly. External wikis will soon be able to access our images very easily with $wgUseInstantCommons, but that is still not direct database access...
Thanks for your remarks.
About the question from Alex about transcluding sections: is it possible to request only a section through the API? I searched for this but didn't find anything :(
-- Peter Potrowl http://www.mediawiki.org/wiki/User:Peter17
About the question from Alex about transcluding sections: is it possible to request only a section through the API? I searched for this but didn't find anything :(
-- Peter Potrowl
Ask ThomasV; #lst is something he cares about particularly and knows to the deepest level! I guess he has run into the same troubles as you.
Alex
On Tue, May 25, 2010 at 7:41 AM, Peter17 peter017@gmail.com wrote:
Mmmh... sorry, I'm not really sure I understand... My suggestion is to use a shared database that would store the remote calls, not the content of the pages... In my mind, fetching the distant pages would be done through the API, not by accessing the distant database directly. External wikis will soon be able to access our images very easily with $wgUseInstantCommons, but that is still not direct database access...
That's not scalable on Wikimedia sites. Making external HTTP requests to other wikis' APIs just isn't fast enough; you must use the database for remote wiki information in the WMF. I suggest taking a deeper look at how FileRepo does things. The abstract class FileRepo handles the high-level stuff while LocalRepo, ForeignDBRepo and ForeignAPIRepo handle the specific implementations for things like getting thumbnails or metadata.
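Roughly the same shape could work here; a sketch only (the TemplateRepo class names are invented, nothing of this exists in core):

abstract class TemplateRepo {
    /** Return the raw wikitext of a template, or false if it cannot be fetched. */
    abstract public function fetchTemplateText( $title );
}

# A ForeignDBTemplateRepo sibling would read the foreign wiki's tables directly
# via wfGetDB( DB_SLAVE, array(), $wikiID ); this is only the API-based variant.
class ForeignAPITemplateRepo extends TemplateRepo {
    private $apiBase;   # e.g. 'http://www.mediawiki.org/w/api.php'

    public function __construct( $apiBase ) {
        $this->apiBase = $apiBase;
    }

    public function fetchTemplateText( $title ) {
        $url = $this->apiBase . '?action=query&prop=revisions&rvprop=content&format=json&titles='
            . urlencode( $title );
        $json = Http::get( $url );
        if ( $json === false ) {
            return false;
        }
        $data = json_decode( $json, true );
        $page = reset( $data['query']['pages'] );
        return isset( $page['revisions'][0]['*'] ) ? $page['revisions'][0]['*'] : false;
    }
}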
-Chad
On 05/25/2010 01:41 PM, Peter17 wrote:
2010/5/25 church.of.emacs.ml church.of.emacs.ml@googlemail.com:
Parsing the wikitext at the home wiki makes it more difficult to use site magic words, e.g. {{CONTENTLANGUAGE}}. You'd have to pass each and every one as a template parameter (e.g. {{homewiki::templatename|lang={{CONTENTLANGUAGE}}}}).
OK, I will keep this in mind. Parsing the template on the home wiki seems necessary because it can use other templates hosted on that wiki to render correctly... I think it is the most logical way to do it, isn't it?
Yes. When I think about this a bit more, it makes sense to parse on the home wiki, because otherwise (a) you couldn't include other remote templates or (b) you would need one API call per included template. Both not feasible. However, you'd have to worry that each distant wiki uses only a fair amount of the home wiki server's resources. E.g. set a limit of inclusions (that limit would have to be on the home-wiki-server-side) and disallow infinite loops (they're always fun).
What do you propose for linking? If a template on the home wiki links to [[Foobar]], should that be an interwiki link to [[homewiki:Foobar]], or a local link in the distant wiki? In any case, there should be a way of differentiating home-wiki and distant-wiki references (links, inclusions).
On 05/25/2010 02:25 PM, Chad wrote:
That's not scalable on Wikimedia sites. Making external HTTP requests to other wikis' APIs just isn't fast enough; you must use the database for remote wiki information in the WMF. I suggest taking a deeper look at how FileRepo does things. The abstract class FileRepo handles the high-level stuff while LocalRepo, ForeignDBRepo and ForeignAPIRepo handle the specific implementations for things like getting thumbnails or metadata.
Do I understand this correctly... you can either access a foreign repository via the API (if you're on another server) or directly via the database (if you're on the same wiki farm)? Very cool stuff.
Regards, Church of emacs
Aryeh Gregor wrote:
On Mon, May 24, 2010 at 8:27 PM, Q overlordq@gmail.com wrote:
I would suggest not going the shared database route unless the code can be fixed so that shared databases actually work with all of the DB backends.
I don't see why it shouldn't be easy to get it working with all DB backends. But in any case, for Wikimedia use, a shared database backend is pretty much a must. Having the application servers make HTTP requests to each other to retrieve templates rather than accessing the database directly is just silly, and is going to perform badly.
He can internally call the api from the other wiki via FauxRequest.
Ideally the code should generalize to work with external wikis too, so that third parties can benefit from our templates as they do from our images. Maybe someday, a copy-pasted Wikipedia article will actually work . . . I can dream.
I'm afraid that it will produce the opposite. A third party downloads an XML dump for offline use but it doesn't work because it needs a dozen templates from Meta (in the worst case, templates from a dozen other wikis).
church.of.emacs.ml wrote:
However, you'd have to worry that each distant wiki uses only a fair amount of the home wiki server's resources. E.g. set a limit of inclusions (that limit would have to be on the home-wiki-server-side) and disallow infinite loops (they're always fun).
Infinite loops could only happen if both wikis can fetch from each other. A simple solution would be to pass along with the query who originally requested it. If the home wiki then calls a different wiki, it would blame the one that asked for it (or maybe build up a wiki + template path).
What do you propose for linking? If a template on the home wiki links to [[Foobar]], should that be an interwiki link to [[homewiki:Foobar]], or a local link in the distant wiki? In any case, there should be a way of differentiating home-wiki and distant-wiki references (links, inclusions).
The link itself could be generated partly with local data and partly with remote data, e.g. a remote template containing "[[Flag of {{{city}}}]]" called with a city parameter.
On 25 May 2010 15:30, Platonides Platonides@gmail.com wrote:
church.of.emacs.ml wrote:
However, you'd have to worry that each distant wiki uses only a fair amount of the home wiki server's resources. E.g. set a limit of inclusions (that limit would have to be on the home-wiki-server-side) and disallow infinite loops (they're always fun).
Infinite loops could only happen if both wikis can fetch from each other. A simple solution would be to pass along with the query who originally requested it. If the home wiki then calls a different wiki, it would blame the one that asked for it (or maybe build up a wiki + template path).
Or the request could have something like a depth counter, to stop requests that need more than N iterations. So if you get a request with depth > 20, you can ignore that request. This doesn't stop an evil wiki from passing a false depth level, but the idea of interwiki is a network built on top of the web of wikis you trust, so you would not add an evil wiki.
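In the API module serving the transclusion request that could look like this (just a sketch; the iwdepth parameter is made up):

global $wgRequest;
# The calling wiki forwards its own depth, incremented by one per hop.
$depth = $wgRequest->getInt( 'iwdepth', 0 );
if ( $depth > 20 ) {
    # Refuse to recurse any further: the chain of transclusions is too deep.
    $this->dieUsage( 'Interwiki transclusion depth limit exceeded', 'toodeep' );
}
# When this wiki in turn fetches a template from yet another wiki,
# it would pass iwdepth = $depth + 1 along with that request.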
(I'm going to use "local wiki" here for what Peter is calling "distant wiki", and "foreign wiki" for what he's calling "home wiki". This seems to better match the terminology we use for Commons.)
On Tue, May 25, 2010 at 7:41 AM, Peter17 peter017@gmail.com wrote:
Yes. The shared database would be only for invalidating the cache when a template is edited. In my 3rd (preferred) solution, the templates are still fetched through the API. External wikis can transclude them and cache them for an arbitrary time, as ForeignAPIRepo does.
OK, I will keep this in mind. Parsing the template on the home wiki seems necessary because it can use other templates hosted on that wiki to render correctly... I think it is the most logical way to do it, isn't it?
I think parsing the template on the local wiki is better, because it gives you more flexibility. For instance, it can use local {{SITENAME}} and so forth. {{CONTENTLANG}} would be especially useful, if we're assuming that templates will be transcluded to many languages.
This doesn't mean that it has to use the local wiki's templates. There would be two ways to approach this:
1) Just don't use the local wiki's templates. Any template calls from the foreign wiki's template should go to the foreign wiki, not the local wiki. If this is being done over the API, then as an optimization, you could have the foreign wiki send back all templates that will be required, not just the actual template requested.
2) Use the local wiki's templates, and assume that the template on the foreign wiki is designed to be used remotely and will only call local templates when it's really desired. This gives even more flexibility if the foreign template is designed for this use, but it makes it harder to use templates that aren't designed for foreign use.
At first glance, it seems to me that (1) is the best -- do all parsing on the local wiki, but use templates from the foreign wiki. This will cause errors if the local wiki doesn't have necessary extensions installed, like ParserFunctions, but it gives more flexibility overall.
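To make the optimization in (1) concrete, the batched answer might look something like this (purely illustrative; no such batching exists today):

# Hypothetical response: the foreign wiki returns the requested template
# together with every template it transcludes, in one round-trip.
$response = array(
    'templates' => array(
        'Template:Infobox'     => '... wikitext ...',
        'Template:Infobox/row' => '... wikitext ...',
        'Template:Navbar'      => '... wikitext ...',
    ),
    'touched' => '2010-05-25T12:00:00Z',  # lets the local wiki cache and invalidate it
);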
Another issue here is performance. Parsing is one of the most expensive operations MediaWiki does. Nobody's going to care much if foreign sites request a bunch of templates that can be served out of Squid, but if there are lots of foreign sites that are requesting giant infoboxes and those have to be parsed by Wikimedia servers, Domas is going to come along with an axe pretty soon and everyone's sites will break. Better to head that off at the pass.
Mmmh... sorry, I'm not really sure I understand... My suggestion is to use a shared database that would store the remote calls, not the content of the pages... In my mind, fetching the distant pages would be done through the API, not by accessing the distant database directly. External wikis will soon be able to access our images very easily with $wgUseInstantCommons, but that is still not direct database access...
What you're proposing is that Wikimedia servers do this on a cache miss:
1) An application server sends an HTTP request to a Squid with If-Modified-Since.
2) The Squid checks its cache, finds it's a miss, and passes the request to another Squid.
3) The other Squid checks its cache, finds it's a miss, and passes the request to a second application server.
4) The second application server loads up the MediaWiki API and sends a request to a database server.
5) The database server returns the result to the second application server.
6) The second application server returns the results to the Squids, which cache it and return it to the first application server.
7) The first application server caches the result in the database.
What I'm proposing is that they do this:
1) An application server sends a database query to a database server (maybe even using an already-open connection).
2) The database server returns the result.
Having Wikimedia servers send HTTP requests to each other instead of just doing database queries does not sound like a great idea to me. You're hitting several extra servers for no reason, including extra requests to an application server. On top of that, you're caching stuff in the database which is already *in* the database! FileRepo does this the Right Way, and you should definitely look at how that works. It uses polymorphism to use the database if possible, else the API.
However, someone like Tim Starling should be consulted for a definitive performance assessment; don't rely on my word alone.
On Tue, May 25, 2010 at 9:11 AM, church.of.emacs.ml church.of.emacs.ml@googlemail.com wrote:
Yes. When I think about this a bit more, it makes sense to parse on the home wiki, because otherwise (a) you couldn't include other remote templates or (b) you would need one API call per included template. Both not feasible.
Just have it return all needed templates at once if you want to minimize round-trips.
However, you'd have to worry that each distant wiki uses only a fair amount of the home wiki server's resources. E.g. set a limit of inclusions (that limit would have to be on the home-wiki-server-side) and disallow infinite loops (they're always fun).
This is probably not enough. I really doubt Wikimedia is going to let a sizable fraction of its CPU time go to foreign template use. Serving images or plain old wikitext from Squid cache is very cheap, so that's not a big deal, but large-scale parsing will be too much, I suspect. (But again, ask Tim about this.)
Do I understand this correctly... you can either access a foreign repository via the API (if you're on another server) or directly via the database (if you're on the same wiki farm)? Very cool stuff.
Yes, that's how FileRepo works.
On Tue, May 25, 2010 at 9:22 AM, Platonides Platonides@gmail.com wrote:
He can internally call the api from the other wiki via FauxRequest.
How will that interact with different configuration settings? I thought FauxRequest only handles requests to the current wiki.
I'm afraid that it will produce the opposite. A third party downloads an XML dump for offline use but it doesn't work because it needs a dozen templates from Meta (in the worst case, templates from a dozen other wikis).
My point is that ideally, you'd be able to copy-paste enwiki pages and then get the templates to work by configuring them to be fetched from enwiki. Even more ideally, you might want to fetch the enwiki templates as of the point in time your page was downloaded, in case the templates changed syntax (and also to allow indefinite caching).
But I guess that's much better handled by just using a proper export, and having the templates included in that, so never mind.
On Tue, May 25, 2010 at 9:30 AM, Platonides Platonides@gmail.com wrote:
Infinite loops could only happen if both wikis can fetch from each other. A simple solution would be to pass along with the query who originally requested it. If the home wiki then calls a different wiki, it would blame the one that asked for it (or maybe build up a wiki + template path).
An even simpler solution would be to only set up one wiki to allow this kind of foreign template request, the way Commons is set up now. But that might be limiting.
2010/5/25 Aryeh Gregor Simetrical+wikilist@gmail.com:
Having Wikimedia servers send HTTP requests to each other instead of just doing database queries does not sound like a great idea to me. You're hitting several extra servers for no reason, including extra requests to an application server. On top of that, you're caching stuff in the database which is already *in* the database! FileRepo does this the Right Way, and you should definitely look at how that works. It uses polymorphism to use the database if possible, else the API.
However, someone like Tim Starling should be consulted for a definitive performance assessment; don't rely on my word alone.
This is true if, indeed, all parsing is done on the distant wiki. However, if parsing is done on the home wiki, you're not simply requesting data that's ready-baked in the DB, so API calls make sense. I'm also not convinced this would be a huge performance problem because it'd only be done on parse (thanks to parser cache), but like you I trust Tim's verdict more than mine. Contrary to what Platonides suggested, you cannot use FauxRequest to do cross-wiki API requests.
To the point of whether parsing on the distant wiki makes more sense: I guess there are points to be made both ways. I originally subscribed to the idea of parsing on the home wiki so that expanding the same template with the same arguments would always result in the same (preprocessed) wikitext, but I do see how parsing on the local wiki would help for stuff like {{SITENAME}} and {{CONTENTLANG}}.
Roan Kattouw (Catrope)
On Tue, May 25, 2010 at 2:58 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
This is true if, indeed, all parsing is done on the distant wiki. However, if parsing is done on the home wiki, you're not simply requesting data that's ready-baked in the DB, so API calls make sense.
That's true -- if parsing is done on the foreign wiki, then you'd have to do API calls or something, not read from the DB. Another reason to avoid that. :)
I'm also not convinced this would be a huge performance problem because it'd only be done on parse (thanks to parser cache), but like you I trust Tim's verdict more than mine.
Templates will often miss the parser cache, because different invocations will use different parameters. Even *with* the parser cache, parsing is *still* one of the most expensive operations Wikimedia does, so I'm not so sanguine.
2010/5/25 Aryeh Gregor Simetrical+wikilist@gmail.com:
Templates will often miss the parser cache, because different invocations will use different parameters. Even *with* the parser cache, parsing is *still* one of the most expensive operations Wikimedia does, so I'm not so sanguine.
I wasn't talking about the templates themselves hitting the parser cache, but about the pages that use them. Of course the number of pages using interwiki transclusion over time plus the edit rate of those pages could grow to become a problem.
Also note that you wouldn't technically be parsing, just preprocessing on the home wiki, which is certain to be less expensive (how much less I don't know), and that you'd be doing this on some wiki anyway, so only the overhead involved in HTTP, the API and initializing the parser is relevant; the actual cost of the operation is not, because you're doing it someplace either way (of course this only applies intra-WMF, not to external clients).
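(For what it's worth, that preprocessing step is essentially what action=expandtemplates already exposes; a rough sketch of the remote call, with all the caching and interwiki plumbing left out, and Template:Foo just an example name:)

# Ask the home wiki to preprocess a template invocation and hand back the
# expanded wikitext, which the local wiki then parses itself.
$url = 'http://www.mediawiki.org/w/api.php?action=expandtemplates&format=json&text='
    . urlencode( '{{Template:Foo|bar=baz}}' );
$data = json_decode( Http::get( $url ), true );
$expanded = $data['expandtemplates']['*'];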
Roan Kattouw (Catrope)
On Tue, May 25, 2010 at 8:58 PM, Roan Kattouw roan.kattouw@gmail.comwrote:
To the point of whether parsing on the distant wiki makes more sense: I guess there are points to be made both ways. I originally subscribed to the idea of parsing on the home wiki so that expanding the same template with the same arguments would always result in the same (preprocessed) wikitext, but I do see how parsing on the local wiki would help for stuff like {{SITENAME}} and {{CONTENTLANG}}.
Why not mix it? Take other templates etc. from the source wiki and set magic stuff like time / contentlang to target wiki values.
Marco
On Tue, May 25, 2010 at 3:48 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
Also note that you wouldn't technically be parsing, just preprocessing on the home wiki, which is certain to be less expensive (how much less I don't know)
This is a good point.
and that you'd be doing this on some wiki anyway, so only the overhead involved in HTTP, the API and initializing the parser is relevant; the actual cost of the operation is not, because you're doing it someplace either way (of course this only applies intra-WMF, not to external clients).
External clients are what I'm worried about. It's a nonissue for intra-Wikimedia use, but if external clients start using a lot of CPU by using Wikimedia servers for parsing, I expect them to get shut down, and no one wants that.
On Tue, May 25, 2010 at 4:09 PM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Why not mix it? Take other templates etc. from the source wiki and set magic stuff like time / contentlang to target wiki values.
That's what I suggested, basically. Use templates from the foreign wiki, but do the actual parsing locally, so you get local values for magic words and so on.
Aryeh Gregor wrote:
OK, I will keep this in mind. Parsing the template on the home wiki seems necessary because it can use other templates hosted on that wiki to render correctly... I think it is the most logical way to do it, isn't it?
I think parsing the template on the local wiki is better, because it gives you more flexibility. For instance, it can use local {{SITENAME}} and so forth. {{CONTENTLANG}} would be especially useful, if we're assuming that templates will be transcluded to many languages.
There are imho fewer variables set by the caller wiki, which could be passed with the query.
This doesn't mean that it has to use the local wiki's templates. There would be two ways to approach this:
- Just don't use the local wiki's templates. Any template calls from
the foreign wiki's template should go to the foreign wiki, not the local wiki. If this is being done over the API, then as an optimization, you could have the foreign wiki send back all templates that will be required, not just the actual template requested.
- Use the local wiki's templates, and assume that the template on the
foreign wiki is designed to be used remotely and will only call local templates when it's really desired. This gives even more flexibility if the foreign template is designed for this use, but it makes it harder to use templates that aren't designed for foreign use.
At first glance, it seems to me that (1) is the best -- do all parsing on the local wiki, but use templates from the foreign wiki. This will cause errors if the local wiki doesn't have necessary extensions installed, like ParserFunctions, but it gives more flexibility overall.
Using (1), you could still allow calling the local template by using {{msg:xyz}}.
Another issue here is performance. Parsing is one of the most expensive operations MediaWiki does. Nobody's going to care much if foreign sites request a bunch of templates that can be served out of Squid, but if there are lots of foreign sites that are requesting giant infoboxes and those have to be parsed by Wikimedia servers, Domas is going to come along with an axe pretty soon and everyone's sites will break. Better to head that off at the pass.
Probably time to revive the native preprocessor project. We may want to have both ways implemented, with one falling back on the other.
What you're proposing is that Wikimedia servers do this on a cache miss:
1) An application server sends an HTTP request to a Squid with If-Modified-Since. [...]
7) The first application server caches the result in the database.
For an intra-Wikimedia query, they could directly ask an Apache; they could even send the query to localhost. Using the API seems like completely the right approach for remote users; it can later be refined to add more backends.
Anyway, I don't think the API request would be cacheable by the Squids, so it would be passed directly to an application server.
On Tue, May 25, 2010 at 9:22 AM, Platonides Platonides@gmail.com wrote:
He can internally call the api from the other wiki via FauxRequest.
How will that interact with different configuration settings? I thought FauxRequest only handles requests to the current wiki.
Mmh, right. And we rely too much on globals to have two MediaWiki instances running in the same PHP environment :(
I'm afraid that it will produce the opposite. A third party downloads an XML dump for offline use but it doesn't work because it needs a dozen templates from Meta (in the worst case, templates from a dozen other wikis).
My point is that ideally, you'd be able to copy-paste enwiki pages and then get the templates to work by configuring them to be fetched from enwiki. Even more ideally, you might want to fetch the enwiki templates as of the point in time your page was downloaded, in case the templates changed syntax (and also to allow indefinite caching).
They would need to prepend the interwiki prefix to all template invocations. That sounds like a case for an option to import templates from the foreign wiki the first time they are used, keeping them automatically updated as long as they are not modified locally (skipping the need for the interwiki prefix on the template, but then it would conflict with local templates).
But I guess that's much better handled by just using a proper export, and having the templates included in that, so never mind.
Yes. Perhaps they could have a Special:ImportFromRemote to do one-click imports.
On Tue, May 25, 2010 at 9:30 AM, Platonides wrote:
Infinite loops could only happen if both wikis can fetch from each other. A simple solution would be to pass along with the query who originally requested it. If the home wiki then calls a different wiki, it would blame the one that asked for it (or maybe build up a wiki + template path).
An even simpler solution would be to only set up one wiki to allow this kind of foreign template request, the way Commons is set up now. But that might be limiting.
That's how I'd deploy it. But the code should be robust enough to handle the infinite loops that Peter presents.
On Tue, May 25, 2010 at 5:50 PM, Platonides Platonides@gmail.com wrote:
But I guess that's much better handled by just using a proper export, and having the templates included in that, so never mind.
Yes. Perhaps they could have a Special:ImportFromRemote to do one-click imports.
And this is different from interwiki imports via the normal Special:Import how?
-Chad
On Tue, May 25, 2010 at 5:50 PM, Platonides Platonides@gmail.com wrote:
There are imho fewer variables set by the caller wiki, which could be passed with the query.
I don't get what you're saying here.
For an intra-Wikimedia query, they could directly ask an Apache; they could even send the query to localhost. Using the API seems like completely the right approach for remote users; it can later be refined to add more backends.
Anyway, I don't think the API request would be cacheable by the Squids, so it would be passed directly to an application server.
That's even worse. At least if it's cacheable, you have a *chance* of not hitting an Apache or the DB.
That's how I'd deploy it. But the code should be robust enough to handle the infinite loops that Peter presents.
I don't object to that, but I don't think it's essential.
On 2010-05-25 23:41, Peter17 wrote:
2010/5/25 Platonides Platonides@gmail.com:
It seems it doesn't work so well. It was inadvertently broken for wikitext transclusions when the interwiki prefix points to the nice URL. See the 'wgEnableScaryTranscluding and Templates/Images?' thread on mediawiki-l.
Well, in my tests, images are included fine because I enabled $wgUseInstantCommons. As I wrote, "the parameters are totally ignored": they are indeed not substituted.
I found it a little surprising that $wgUploadPath needed to be an absolute path for this to work. I had imagined that as part of the transclusion the img URLs would have been transformed into the necessary remote wiki URL.
2010/5/26 Jim Tittsler jt@onnz.net:
I found it a little surprising that $wgUploadPath needed to be an absolute path for this to work. I had imagined that as part of the transclusion the img URLs would have been transformed into the necessary remote wiki URL.
I didn't set $wgUploadPath, just $wgUseInstantCommons = true; the image URLs are actually transformed into remote URLs:
I work on my own local wiki, whose address is http://localhost/mediawiki/, and transcluding {{mediawikiwiki::User:Peter17}}, which contains [[File:Exquisite-network.png]], produces: <a href="http://www.mediawiki.org/wiki/File:Exquisite-network.png" class="image"><img alt="Exquisite-network.png" src="http://upload.wikimedia.org/wikipedia/commons/e/e1/Exquisite-network.png" width="128" height="128" /></a>, so it actually points to the MediaWiki.org image description page and the Commons image.
@Peter: here is a recent thread on the mediawiki-api list about the API and sections: http://lists.wikimedia.org/pipermail/mediawiki-api/2010-May/subject.html
There is no mention of the labelled sections used by the #lst extension... :-( ... but remember ThomasV's name as a reference.
Alex
Peter17 wrote:
I didn't set $wgUploadPath, just $wgUseInstantCommons = true; the image URLs are actually transformed into remote URLs:
I work on my own local wiki, whose address is http://localhost/mediawiki/, and transcluding {{mediawikiwiki::User:Peter17}}, which contains [[File:Exquisite-network.png]], produces: <a href="http://www.mediawiki.org/wiki/File:Exquisite-network.png" class="image"><img alt="Exquisite-network.png" src="http://upload.wikimedia.org/wikipedia/commons/e/e1/Exquisite-network.png" width="128" height="128" /></a>, so it actually points to the MediaWiki.org image description page and the Commons image.
I think his point is that the URLs will be wrong unless $wgUploadPath is a full URL (it is set as a full URL on WMF wikis).
I have updated my proposal with a fourth version [1]
I am still waiting for comments from Tim Starling. I have contacted him on IRC for this.
[1] http://www.mediawiki.org/wiki/User:Peter17/Reasonably_efficient_interwiki_tr...)
-- Peter Potrowl http://www.mediawiki.org/wiki/User:Peter17
* Roan Kattouw roan.kattouw@gmail.com [Tue, 25 May 2010 20:58:54 +0200]:
2010/5/25 Aryeh Gregor Simetrical+wikilist@gmail.com:
Having Wikimedia servers send HTTP requests to each other instead of just doing database queries does not sound like a great idea to me. You're hitting several extra servers for no reason, including extra requests to an application server. On top of that, you're caching stuff in the database which is already *in* the database! FileRepo does this the Right Way, and you should definitely look at how that works. It uses polymorphism to use the database if possible, else the API.
However, someone like Tim Starling should be consulted for a definitive performance assessment; don't rely on my word alone.
This is true if, indeed, all parsing is done on the distant wiki. However, if parsing is done on the home wiki, you're not simply requesting data that's ready-baked in the DB, so API calls make sense. I'm also not convinced this would be a huge performance problem because it'd only be done on parse (thanks to parser cache), but like you I trust Tim's verdict more than mine. Contrary to what Platonides suggested, you cannot use FauxRequest to do cross-wiki API requests.
To the point of whether parsing on the distant wiki makes more sense: I guess there are points to be made both ways. I originally subscribed to the idea of parsing on the home wiki so that expanding the same template with the same arguments would always result in the same (preprocessed) wikitext, but I do see how parsing on the local wiki would help for stuff like {{SITENAME}} and {{CONTENTLANG}}.
Having something like FarmRequest or FarmApi classes would be a great thing for wiki farms (I run a small one). It would probably also help to unify the remote vs. local farm code. Though, a Farm should probably become an object containing wiki configurations. Currently, farms are a bit "hackish". Dmitriy
Dmitriy Sintsov wrote:
Having something like FarmRequest or FarmApi classes would be a great thing for wiki farms (I run a small one). It would probably also help to unify the remote vs. local farm code. Though, a Farm should probably become an object containing wiki configurations. Currently, farms are a bit "hackish". Dmitriy
^_^ "hackish" isn't that bad in some sense. I'm currently experimenting with some farm code that works completely outside of MediaWiki rather than as a extension sitting inside of it. Using a sandbox it can get access to the MediaWiki install and extract info from it in a secure way which couldn't be extracted as easily from the api. The system works more like a MediaWiki virtual machine than a MediaWiki installation turned WikiFarm. The result is a farm free of mapping issues which can give MediaWiki hostees much more control over the installation then they could on a normal WikiFarm, including the ability for different wiki on the wiki farm to run completely different versions of MediaWiki and upgrade independently, and have control over their own list of installed extensions. ;) In fact this works using complete raw unmodified MediaWiki source code. I have a few "source" directories with MediaWiki source, they don't have any changes to them, and then end up being run in the VM thinking they are a complete installation modified with all the stuff they need to run. ^_^ Tricking MediaWiki into thinking it's a single installation sitting on it's own from the outside is definitely "hackish". In any case, Farm{Request,Api} is a nice and interesting idea.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
Daniel Friesen wrote:
^_^ "hackish" isn't that bad in some sense. I'm currently experimenting with some farm code that works completely outside of MediaWiki rather than as a extension sitting inside of it. Using a sandbox it can get access to the MediaWiki install and extract info from it in a secure way which couldn't be extracted as easily from the api. The system works more like a MediaWiki virtual machine than a MediaWiki installation turned WikiFarm. The result is a farm free of mapping issues which can give MediaWiki hostees much more control over the installation then they could on a normal WikiFarm, including the ability for different wiki on the wiki farm to run completely different versions of MediaWiki and upgrade independently, and have control over their own list of installed extensions. ;) In fact this works using complete raw unmodified MediaWiki source code. I have a few "source" directories with MediaWiki source, they don't have any changes to them, and then end up being run in the VM thinking they are a complete installation modified with all the stuff they need to run. ^_^ Tricking MediaWiki into thinking it's a single installation sitting on it's own from the outside is definitely "hackish". In any case, Farm{Request,Api} is a nice and interesting idea.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
namespaces?
"Dmitriy Sintsov" questpc@rambler.ru wrote in message news:830714463.1275562997.168145444.10411@mcgi21.rambler.ru...
Having something like FarmRequest or FarmApi classes would be a great thing for wiki farms (I run a small one). It would probably also help to unify the remote vs. local farm code. Though, a Farm should probably become an object containing wiki configurations. Currently, farms are a bit "hackish". Dmitriy
One way to achieve this would be to develop the MediaWiki class to actually be what it originally promised: an object representing a wiki, of which there can in principle be more than one instantiated at any one time. Configuration options could determine how the MediaWiki object accesses data, and consequently what sub-entities it is able to produce.
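As a very rough sketch of the shape this could take (nothing like this exists yet; the method names are invented):

# Hypothetical: one MediaWiki object per wiki, each carrying its own
# configuration instead of relying on global state.
class MediaWiki {
    private $config;   # per-wiki settings, e.g. array( 'wgDBname' => 'enwiki', ... )
    private $out;      # this wiki's OutputPage, instead of the global $wgOut

    public function __construct( array $config ) {
        $this->config = $config;
        $this->out = new OutputPage();
    }

    public function getSetting( $name ) {
        return $this->config[$name];
    }

    public function getOut() {
        return $this->out;
    }
}

# Two wikis living side by side in one process:
$enwiki = new MediaWiki( array( 'wgDBname' => 'enwiki' ) );
$meta   = new MediaWiki( array( 'wgDBname' => 'metawiki' ) );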
--HM
Platonides wrote:
namespaces?
For the sandboxing? No, I wanted to use runkit but had issues installing it, so I ended up messing with PHP's horrid proc_open to sandbox it in another process that acts as the VM in case my system needs to extract info from the wiki (not for virtualizing the actual wiki; that is done in-process in a different, less wasteful way). I do have 5.3, but I'm not sure how I'd use PHP namespaces for that, especially without modifying MediaWiki. The only source modification I want to make at all is locally backporting any patch I commit to trunk to fix the issues with using special wiki configurations.
* Happy-melon happy-melon@live.com [Fri, 4 Jun 2010 00:33:30 +0100]:
One way to achieve this would be to develop the MediaWiki class to actually be what it originally promised: an object representing a wiki, of which there can in principle be more than one instantiated at any one time. Configuration options could determine how the MediaWiki object accesses data, and consequently what sub-entities it is able to produce.
The current MediaWiki class has some shortcomings. For example, when I tried to set up URL rendering my own way, without using mod_rewrite, I "cloned" and "refactored" index.php. The problem was with the following call:
# warning: although instances of OutputPage and others are passed,
# they are sometimes used as "fixed" wg* globals in other classes
# so you cannot pass a non-global here, or use the different names
# of passed instances
$MW->initialize( $wgTitle, $wgArticle, $wgOut, $wgUser, $wgRequest );
First, I made an instance of OutputPage with a variable name different from the default $wgOut, and the same for $wgArticle. The engine didn't work as expected: it was still looking for the default names here and there, so I was forced to use the default $wgOut and $wgArticle names. But then there is no real encapsulation, and there is no point in passing these as method parameters.
I'd imagine that an "emulated" request or API call through the local farm could be done really fast, while a real remote interwiki call would be done in the usual way (via the API). Dmitriy
"Dmitriy Sintsov" questpc@rambler.ru wrote in message news:1006208056.1275619880.71836632.61224@mcgi66.rambler.ru...
The current MediaWiki class has some shortcomings. [...] The engine didn't work as expected: it was still looking for the default names here and there, so I was forced to use the default $wgOut and $wgArticle names. But then there is no real encapsulation, and there is no point in passing these as method parameters.
Indeed; it does need a lot of work; doing it properly would probably deprecate all the state globals ($wg(Title|Parser|Article|Out|Request) etc); replacing them with member variables of the MediaWiki class. How other classes would access those variables is an interesting question; I could see an Article::getWiki()->getOut() chain, but that won't work for static functions. It would be a major overhaul, but would probably kill several birds with one stone.
--HM
* Happy-melon happy-melon@live.com [Fri, 4 Jun 2010 10:03:14 +0100]:
"Dmitriy Sintsov" questpc@rambler.ru wrote in message news:1006208056.1275619880.71836632.61224@mcgi66.rambler.ru...
- Happy-melon happy-melon@live.com [Fri, 4 Jun 2010 00:33:30
+0100]:
One way to achieve this would be to develop the MediaWiki class to actually be what it originally promised: an object representing a wiki, of
which
there can in principle be more than one instantiated at any one
time.
Configuration options could determine how the MediaWiki object
accesses
data, and consequently what sub-entities it is able to produce.
Current MediaWiki class has some shortcomings. For example, when
I've
tried to setup rendering urls in my very own way and not using mod_rewrite, I've "cloned" and "refactored" index.php. The problem
was
with the following call:
# warning: although instances of OutputPage and others are passed, # they are sometimes used as "fixed" wg* globals in other classes # so you cannot pass a non-global here, or use the different names # of passed instances $MW->initialize( $wgTitle, $wgArticle, $wgOut, $wgUser, $wgRequest
);
First, I've made an instance of OutputPage with variable name
different
from default $wgOut. And $wgArticle, too. The engine didn't work as expected, it still was looking for the default names here and there.
I
was forced to use default wgOut and wgArticle names. But, then,
there
is
no real incapsulation and there is no point to pass these as method parameters..
I'd imagine that "emulated" request or api through the local farm
can
be
done really fast, while real remote interwiki call would be done in usual way (api). Dmitriy
Indeed; it does need a lot of work; doing it properly would probably deprecate all the state globals ($wg(Title|Parser|Article|Out|Request) etc); replacing them with member variables of the MediaWiki class. How
other
classes would access those variables is an interesting question; I
could
see an Article::getWiki()->getOut() chain, but that won't work for static functions. It would be a major overhaul, but would probably kill several birds with one stone.
Hundreds of extensions would break :-( Compatibility is a huge burden. A cruder but simpler approach would be to save these globals in some context data structure and introduce a Farm->switch() method that saves and replaces all the globals. Much less of the core would have to be changed then. That is a bit more unreliable and risky, but the code is fragile anyway (from my experience, one typo can sometimes cause dreaded errors). Dmitriy
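P.S. Roughly what I mean by Farm->switch(), as a sketch (the Farm class and these methods are invented names, nothing of this exists):

class Farm {
    private $configs;          # wiki ID => array of global overrides ( 'wgDBname', 'wgSitename', ... )
    private $saved = array();  # globals of the wiki we switched away from

    public function __construct( array $configs ) {
        $this->configs = $configs;
    }

    # Swap the listed globals for those of another wiki in the farm.
    public function switchTo( $wikiID ) {
        foreach ( $this->configs[$wikiID] as $name => $value ) {
            $this->saved[$name] = isset( $GLOBALS[$name] ) ? $GLOBALS[$name] : null;
            $GLOBALS[$name] = $value;
        }
    }

    # Restore the globals saved by the last switchTo() call.
    public function restore() {
        foreach ( $this->saved as $name => $value ) {
            $GLOBALS[$name] = $value;
        }
        $this->saved = array();
    }
}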
--------------------------------------------------
From: "Dmitriy Sintsov" questpc@rambler.ru
Sent: Friday, June 04, 2010 11:01 AM
To: "Happy-melon" happy-melon@live.com; "Wikimedia developers" wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] Reasonably efficient interwiki transclusion
Hundreds of extensions would break :-( Compatibility is a huge burden. A cruder but simpler approach would be to save these globals in some context data structure and introduce a Farm->switch() method that saves and replaces all the globals. Much less of the core would have to be changed then. That is a bit more unreliable and risky, but the code is fragile anyway (from my experience, one typo can sometimes cause dreaded errors). Dmitriy
MW 2.0? :-D
You wouldn't need to remove the globals, at least immediately; you'd retain them as aliases for the relevant variables of the 'main' wiki; assuming that it makes sense to define one primary wiki, which it usually does.
--HM
Daniel Friesen wrote:
I wanted to use runkit but had issues installing it, so I ended up messing with PHP's horrid proc_open to sandbox it in another process that acts as the VM in case my system needs to extract info from the wiki (not for virtualizing the actual wiki; that is done in-process in a different, less wasteful way).
I was able to get runkit running and run MediaWiki inside it. Sara Golemon hasn't cared about it for years, but the patches are all at http://pecl.php.net/bugs. It should be quite easy to make it work on 5.2 (which was the latest version at the time).