For the pywikipedia-l listeners just tuning in: the toolserver is overloaded with interwiki bots, and we want to reduce that load. To that end, we want to switch to a single bot that runs all the interwiki updates from the toolserver.
On 16 January 2012 09:19, Merlijn van Deen <valhallasw@arctus.nl> wrote:
The only reasonable action we can take to reduce memory consumption is to let the OS do its job in freeing memory: use one process to track pages that have to be corrected (using the database, if possible), and one process to do the actual fixing (interwiki.py). This should be reasonably easy to implement, i.e. use a pywikibot page generator to generate a list of pages, use a database layer to track the interlanguage links, and popen('interwiki.py <page>') if the situation is fixable.
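A minimal sketch of that split: the page-tracking side is reduced here to a hypothetical needs_interwiki_fix() check (the real one would sit on the toolserver database), and the fixing side is nothing more than a popen of interwiki.py per title, so all memory goes back to the OS as soon as each child exits.

    import subprocess

    def needs_interwiki_fix(title):
        """Hypothetical database-backed check: does the interwiki graph
        around this page have missing links, and no double links?"""
        raise NotImplementedError  # to be filled in from the toolserver DB

    def dispatch(titles):
        """Spawn one short-lived interwiki.py per fixable page."""
        for title in titles:
            if needs_interwiki_fix(title):
                subprocess.call(['python', 'interwiki.py', title])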
I took some time yesterday to work out some details on this - see http://piratepad.net/T29Uj4j1U4 . It boils down to this:
1) generation of a list of pages to work on: from the database, if possible
2) dispatching interwiki.py based on that list of pages and handling logging
3) interwiki.py itself
My suggestion is to split these tasks, creating a simple interface (e.g. WSGI) between 1) and 2), and using subprocesses from 2) to 3).
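For the 1) to 2) interface, WSGI could be as thin as 'give me the next N titles, one per line'. A rough sketch, where next_titles() is hypothetical and would read from the page list built in step 1, and the port is arbitrary:

    from wsgiref.simple_server import make_server

    def next_titles(count):
        """Hypothetical: pop up to `count` page titles from the work list
        built in step 1 (i.e. from the database)."""
        return []

    def application(environ, start_response):
        # hand out the next batch of titles as plain text, one per line
        body = '\n'.join(next_titles(50)).encode('utf-8')
        start_response('200 OK',
                       [('Content-Type', 'text/plain; charset=utf-8')])
        return [body]

    if __name__ == '__main__':
        make_server('localhost', 8042, application).serve_forever()

The dispatcher in 2) then only has to poll that URL and popen interwiki.py for whatever titles it gets back.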
Yesterday I worked mainly on speeding up the startup of interwiki.py, so that we can spawn one process per Page.
On the Toolserver side, I would appreciate any comments/work/existing work on the creation of an interwiki graph from the database - there are already tools that suggest images based on interwiki links, so this code should be around and hopefully be adaptable. The only goal of this process would be to create a list of starting pages interwiki.py can use, i.e. graphs with one or more missing links but without any double links.
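I do not have that code at hand, but as a starting point, something along these lines could run against the replicas. The host and database naming below follows the usual toolserver conventions and may need adjusting, the lang-to-dbname mapping is naive, and the classification only looks one hop out from the starting page (interwiki.py does the full graph walk anyway):

    import os
    import MySQLdb

    def langlinks(dbname, title):
        """Return {lang: title} for the interlanguage links stored on one
        main-namespace page (standard MediaWiki langlinks/page schema)."""
        conn = MySQLdb.connect(host='%s-p.db.toolserver.org' % dbname,
                               db='%s_p' % dbname,
                               read_default_file=os.path.expanduser('~/.my.cnf'),
                               charset='utf8')
        try:
            cur = conn.cursor()
            cur.execute("""SELECT ll_lang, ll_title
                             FROM langlinks JOIN page ON ll_from = page_id
                            WHERE page_namespace = 0 AND page_title = %s""",
                        (title.replace(' ', '_'),))
            return dict((lang, t.replace(' ', '_'))
                        for lang, t in cur.fetchall())
        finally:
            conn.close()

    def classify(start_lang, title):
        """'conflict' -> some language is claimed by two different titles,
           'fixable'  -> a member is missing a link to another member,
           'complete' -> nothing for interwiki.py to do here."""
        pool = {start_lang: title.replace(' ', '_')}
        pool.update(langlinks('%swiki' % start_lang, title))
        links = {}
        for lang, t in pool.items():
            links[lang] = langlinks('%swiki' % lang, t)
            for other_lang, other_title in links[lang].items():
                if other_lang in pool and pool[other_lang] != other_title:
                    return 'conflict'
        for lang in pool:
            for other in pool:
                if other != lang and other not in links[lang]:
                    return 'fixable'
        return 'complete'

Only the 'fixable' pages would be handed to interwiki.py; the 'conflict' ones need a human anyway.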
On the Pywikipedia side, some thoughts on running interwiki.py in a new process would be welcome: e.g. how can we improve startup time ('kill all the regexps!') and effectively spawn multiple processes? What parameters (throttles?) should be tuned, et cetera.
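As a concrete starting point for the 'multiple processes' part, the dispatcher could simply keep a bounded number of children alive and scale the put throttle with the number of workers, so that N parallel bots together still edit at a polite rate. This assumes pywikipedia's global -pt:/-putthrottle: option; the numbers are placeholders:

    import subprocess
    import time

    MAX_WORKERS = 4     # interwiki.py processes running in parallel
    PUT_THROTTLE = 30   # seconds between edits, per process

    def run_all(titles):
        """Keep at most MAX_WORKERS interwiki.py children alive at a time."""
        running = []
        for title in titles:
            while len(running) >= MAX_WORKERS:
                running = [p for p in running if p.poll() is None]
                time.sleep(1)
            running.append(subprocess.Popen(
                ['python', 'interwiki.py', '-pt:%d' % PUT_THROTTLE, title]))
        for p in running:
            p.wait()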
Best,
Merlijn
On 21/01/12 13:13, Merlijn van Deen wrote:
On the Toolserver side, I would appreciate any comments/work/existing work on the creation of an interwiki graph from the database - there are already tools that suggest images based on interwiki links, so this code should be around and hopefully be adaptable. The only goal of this process would be to create a list of starting pages interwiki.py can use, i.e. graphs with one or more missing links but without any double links.
http://toolserver.org/~platonides/InterwikiPool/InterwikiPool.php shows, for a given page, the other entries in that interwiki pool (as well as a little summary of the differences).
On the Pywikipedia side, some thoughts on running interwiki.py in a new process would be welcome: e.g. how can we improve startup time ('kill all the regexps!') and effectively spawn multiple processes? What parameters (throttles?) should be tuned, et cetera.
It doesn't need to be one process per page. The same process could, e.g., run 10 interwikis instead.
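On the dispatcher side that would look something like this, assuming interwiki.py accepts several page titles on one command line:

    import subprocess

    BATCH = 10  # pages handed to each interwiki.py process

    def run_batched(titles):
        """Amortise the startup cost over a batch of pages per process."""
        titles = list(titles)
        for i in range(0, len(titles), BATCH):
            subprocess.call(['python', 'interwiki.py'] + titles[i:i + BATCH])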