We've been receiving messages from this domain at unblock(a)toolserver.org
and they appear to be related to this:
Viral advertising for some film. In reality, each message arrives with a
crapload of images attached that serve no purpose for us.
Can we just block this whole domain from sending mail to toolserver
accounts? It's a nuisance, and the messages are quite large.
The Linux servers have now been online for nearly two weeks. While some
short- and medium-running SGE tasks are already running there, the number
of long-running tasks is near zero (see graphs at ); in contrast, the load
on willow is still high.
It would be nice if more of you could try to move tasks away from the
Solaris boxes to the Linux boxes (or better: make your tasks so
architecture-independent that they run on either).
Please note that in ~2 months -arch=* will replace -arch=sol as the
default, so you should start checking whether your tools run on Linux (and
if not, how that can be fixed).
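If you drive your job submissions from a script, pinning the architecture
is just the usual SGE resource flag. A minimal sketch in Python, assuming
the standard qsub interface (the script name, and whether our grid uses
exactly these resource values, are assumptions on my part):

    import subprocess

    # Submit one job pinned to Solaris (the current default behaviour) and
    # one that may land on any architecture (what the new default means).
    subprocess.check_call(["qsub", "-l", "arch=sol", "myjob.sh"])
    subprocess.check_call(["qsub", "-l", "arch=*", "myjob.sh"])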
A word to the pywikipedia framework users: at the moment it is unclear
whether the old Python Unicode bug is fixed on our installation (see ).
Testing (and commenting) is very welcome, but do not run a bot unsupervised.
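I don't know which bug the (elided) link above refers to, but if it is the
old narrow- vs. wide-Unicode-build issue that has bitten pywikipedia before
(an assumption on my part), you can check what a given host's Python was
built with in one line:

    import sys

    # 0x10ffff means a wide (UCS-4) build; 0xffff means a narrow (UCS-2)
    # build, the variant that has trouble with non-BMP characters.
    print(hex(sys.maxunicode))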
P.S.: If you are still not using SGE, consider it!
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
I noticed today that "TweetmemeBot", a web crawler, was making lots of
requests to my tools and ignoring the robots.txt file. The thing
scrapes at a very high rate, sometimes over 30 requests per minute.
Other tool maintainers (especially geohack) may want to investigate
whether this is also impacting their tools, and explicitly block the
user agent if it is.
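For tools that run as plain CGI scripts, one way to do that blocking is
right at the top of the script. A minimal sketch in Python; matching on a
User-Agent substring is just one possible policy, not a description of how
geohack or anyone else is actually set up:

    #!/usr/bin/env python
    import os
    import sys

    # Reject the misbehaving crawler before doing any real work.
    if "TweetmemeBot" in os.environ.get("HTTP_USER_AGENT", ""):
        sys.stdout.write("Status: 403 Forbidden\r\n")
        sys.stdout.write("Content-Type: text/plain\r\n\r\n")
        sys.stdout.write("Blocked: this crawler ignores robots.txt.\n")
        sys.exit(0)

    # ... normal tool code continues here ...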
Yesterday Merlissimo and I successfully tested the installation of the new
SGE version for the toolserver. The last step is now to install the new
version on the live system. For that, the SGE service needs to be stopped
completely on the cluster, the old version has to be removed, and the new
one has to be installed. We plan to do this on
Thursday, 5 July, between 17:30 and 22:30 UTC.
During this time SGE will not work at all. There will be no restarting (and
no migration) of stopped jobs after the update.
After the update is done, we will start to use the two Linux boxes for
tools too (I will send details then).
Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885
Occasionally, "toolserver people" (both programmers and users) talk
about joining up tools. Wouldn't it be great if we could use one or
several toolserver tools, and "mash-up" their output to create
something new and useful? And wouldn't it be even better if the users
could do this directly, across tools, without programmers having to
hardcode each combination?
Some tools already support machine-readable output, and some tools
already use others to perform a specific function. But these are
hardcoded, output formats are often crude (tabbed text, not that there's
anything wrong with that in principle), runtimes add up, and so on.
So, I went ahead and, as a first step towards a pipeline setup called
wpipe ("w" for "wiki", as you no doubt have guessed), implemented an
asset tracker. Here, an asset is a dataset, ideally a JSON file (I
also created a simple structure to hold lists of wiki pages, along
with arbitrary metadata). Each asset is tracked in a database,
accessible via a unique numeric identifier, and its data is stored in
a file. Assets can be created and queried via the toolserver command line,
as well as via a web interface.
The usual steps in asset creation involve (a sketch in code follows the list):
1. reserve (gets a new, unique ID)
2. start (sets a flag that asset data creation has begun)
3. done (stores the data, either by creating a file directly or by passing
the data to be stored)
4. fail (marks the asset if there was an error during data creation)
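A minimal sketch of that lifecycle driven from Python over HTTP; the base
URL, endpoint names and parameters here are invented for illustration (the
real ones are on the documentation page linked below):

    import json
    import urllib.parse
    import urllib.request

    BASE = "http://toolserver.org/~magnus/wpipe"  # hypothetical base URL

    def call(endpoint, **params):
        # Helper: POST form-encoded parameters, decode the JSON reply.
        data = urllib.parse.urlencode(params).encode()
        with urllib.request.urlopen(BASE + "/" + endpoint, data) as resp:
            return json.load(resp)

    asset_id = call("reserve")["id"]    # 1. get a new, unique ID
    call("start", id=asset_id)          # 2. flag that data creation began
    try:
        pages = {"pages": [{"title": "Berlin", "ns": 0}]}
        call("done", id=asset_id, data=json.dumps(pages))  # 3. store data
    except Exception:
        call("fail", id=asset_id)       # 4. record the failure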
* All assets and associated data are public.
* Creation and last-access time (write or read the actual data) are
tracked, so unused assets can be removed to conserve storage.
* Currently, data creation is limited to toolserver IP addresses (yes,
I know, it can be gamed. Like the rest of the wiki world.)
* The suggested JSON format (see the sketch after this list) should be
flexible enough for most tools dealing with lists of wiki pages, but any
text-based format will work for specialist uses.
* The asset system can be used by command-line tools and web tools alike.
* Existing tools should be simple to adapt; if a tool takes a list of
page names, and language/project information, then using asset IDs as
an alternative source should be straightforward.
* A pipeline of tools could be started asynchronously, and as each one
finishes, the next one in the chain could be run, all from the user's browser.
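For concreteness, here is one plausible shape for such a page-list asset.
This is only an illustration of the idea, not the documented format:

    {
      "meta": {"source": "CatScan rewrite", "language": "de",
               "project": "wikipedia"},
      "pages": [
        {"title": "Berlin", "ns": 0},
        {"title": "Hamburg", "ns": 0}
      ]
    }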
The main web API, and documentation page, is here:
That page also links to a generic "asset browser":
As an appetizer, and feasibility demo, I adapted my own "CatScan
rewrite" to use assets as an optional output. This can be done by
copying the normal "CatScan rewrite" URL parameters and pasting them
into the starter page.
This is intended as a generic starter page for command-line-enabled
tools. At the moment, only "CatScan rewrite" is available, but I plan
to add others. If you have tools you would like to add, I'll be happy
to help you set that up.
Right now, a tool is started by this page via "nohup &"; that could
change to the job submission system, if that's possible from the web
servers, but right now it seems overly complicated (runtime
estimation? memory estimation? SQL server access? whatnot).
The web page then returns the reserved output asset ID while the
actual tool is running; another tool could thus be "watching"
asynchronously, by polling the status every few seconds.
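Such a watcher could look something like this; again, the status URL and
field names are invented for the sketch:

    import json
    import time
    import urllib.request

    STATUS = "http://toolserver.org/~magnus/wpipe/status?id={id}"  # hypothetical

    def wait_for_asset(asset_id, interval=5):
        # Poll the asset's status every few seconds until it settles.
        while True:
            with urllib.request.urlopen(STATUS.format(id=asset_id)) as resp:
                status = json.load(resp)["status"]
            if status in ("done", "fail"):
                return status
            time.sleep(interval)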
Of course, this whole shebang doesn't make sense unless others are
willing to join in, with work on this core, or at least by enabling
some of their tools; so please, if you are even slightly interested in
a generic data exchange mechanism between tools, potentially leading
to a pipeline-able ecosystem, by all means step forward!
Overall, running the Wikipedia-OpenStreetMap server on the toolserver
cluster (in particular on the server Ptolemy) has worked reasonably well,
and it hasn't needed much maintenance attention recently. It also seems to
have handled serving map tiles to the OSM gadget in various Wikipedias,
including the German, Spanish and Russian ones, pretty well, and more
recently also WIWOSM.
However, a number of issues remain that have never been resolved
entirely satisfactorily on Ptolemy:
1) There is still a memory leak in tirex-master, as well as a creep in
CPU usage over time. This has to some degree been worked around by simply
restarting Tirex every 12 hours from a cron job, which largely contains
the memory leak and, to a lesser degree, the CPU usage creep. It does
mean, however, that the request queues are dropped every time.
2) The socket between tirex and mod_tile / render_list always gets
closed before the successful acknowledgment can be sent from tirex.
This means the requester can't tell whether the rendering was successful.
In mod_tile this results in HTTP 404 errors being returned for tiles that
need rendering, instead of the freshly rendered tiles being returned on
the fly. In render_list I got around the issue by simply always assuming
the render request was successful and reconnecting to the socket for each
request.
3) The socket between tirex and mod_tile / render_list refuses
connection for a (random) subset of connection requests. This results in
quite a number of rendering requests from mod_tile being dropped as it
can't connect to the tirex socket. In render_list, I could again work
around the problem by sleeping for n seconds and then retrying the
connection until it eventually succeeds.
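render_list itself is C, but the workaround for both socket problems is
just a generic reconnect-and-retry pattern. A sketch of the idea in Python
(the socket path, timing and protocol details are made up, not Ptolemy's
actual configuration):

    import socket
    import time

    TIREX_SOCKET = "/var/run/tirex/master.sock"  # assumed path

    def send_render_request(payload, retry_delay=5):
        # Reconnect for every request (issue 2) and retry refused
        # connections until one succeeds (issue 3); assume success once
        # the request is written, since there is no reliable ack.
        while True:
            try:
                s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
                s.connect(TIREX_SOCKET)
                s.sendall(payload)
                s.close()
                return
            except OSError:
                time.sleep(retry_delay)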
4) The performance of PostgreSQL is still below what can be expected
from a server of Ptolemy's specs. Clustering the rendering tables by the
geometry column several months ago brought performance up to a level
where it can now more or less keep up with the limited re-rendering load
put on the server by Wikipedia, but low-zoom rendering is still
exceptionally slow. It can also barely keep up with replication from the
OSM servers, and not infrequently falls behind during busy times. Other
servers with much slower I/O performance, on the other hand, seem to have
no problem keeping up with diff imports.
None of these issues is directly critical, but it would be nice to solve them.
In addition to the OpenStreetMap server hosted by the toolserver, the
Wikimedia Foundation is planning to host its own OpenStreetMap tile
server, at least for the mobile Wikipedia client, but presumably also for
inclusion in the "standard" Wikipedias that are currently served by the
toolserver. If I understand correctly, they have already purchased the
hardware and are waiting with provisioning until someone can puppetize
the OSM tile-server rendering stack.
Secondly, if I understand it correctly, the toolserver cluster is slowly
moving back from Solaris to Debian.
The OpenStreetMap license switch from CC-BY-SA to ODbL for the raw map
data will soon be completed (the database has now mostly been purged of
data for which the OSMF did not get permission to relicense). My
understanding is that the OSMF will likely recommend that everyone do a
fresh import of a new ODbL-licensed planet for legal reasons, rather than
apply the soon-to-be ODbL-licensed diffs to a CC-BY-SA base planet.
One question is: would this forced full re-import of the OpenStreetMap
database, which last time took approximately 4 days, be a good
opportunity to change things in the setup? For example, could it be used
to migrate from Solaris to Debian? Some of the above-mentioned issues
might be solved by a change of OS. Furthermore, upgrading the OSM
rendering stack should be easier and better tested there.
Unfortunately, key components in Debian Squeeze appear to be quite old,
including PostgreSQL at version 8.4, Mapnik at version 0.7, and the Boost
library (which would be needed to compile Mapnik 2.0); they are
sufficiently recent in Debian Sid or Ubuntu 12.04, so on Squeeze they
would likely need backports or self-compilation.
However, one issue is that Ptolemy is in production use, serving map
tiles to several major Wikipedias, so a replacement would be necessary
before it could be taken down for upgrades. At least for the tiles
(I am not sure what the requirement is for WIWOSM), the main task is
serving them; (re-)rendering would be suspended for several days during
an import anyway. If I understand it correctly, the tiles are currently
stored on the toolserver SAN, so all that would be needed is an Apache
web server with mod_tile installed.
So the main question is, would this be a good time to change things? Are
there any other problems / issues that need fixing or improving? Or
should we simply re-import a fresh ODbL planet into the existing setup,
once the license change has completed?
I was taking a look at our dumps in user-store and none of them are
compressed, which shocked me. I know a lot of people use pywikipedia to
parse the dumps, and I know it can handle bz2 files. Any reason we don't
just make them all bz2?
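Reading a compressed dump from Python is a one-line change anyway; a small
sketch using only the standard library (the file name is just an example):

    import bz2

    # Stream a compressed XML dump line by line, without ever writing
    # the decompressed data to disk.
    with bz2.open("dewiki-pages-articles.xml.bz2", "rt",
                  encoding="utf-8") as f:
        for line in f:
            if "<title>" in line:
                print(line.strip())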
I know a lot of bot creators and authors are curious about how to get
going on Wikimedia Labs. They're also curious about whether it makes
sense to put bots on Labs, and if so, how.
In early June, at the Berlin hackathon, Anomie (Brad Jorsch) talked with
Mutante (Daniel Zahn) at my request to try to figure this out. He wrote
the summary below on June 2nd, and I should have forwarded it far sooner
- my apologies for the delay (I thought it would need more editing and
clarification than I ended up doing).
This guide seems useful to me. Please do make corrections and updates,
and then we can put it up at labsconsole.wikimedia.org for people who
want this sort of HOWTO.
A couple things I want to annotate:
* Wikimedia Labs does not have Toolserver-style database replication
right now. That is on the WMF to-do list, but there is no predicted date
in the roadmap at https://www.mediawiki.org/wiki/Wikimedia_Labs . Given
what TParis and Brad say below, that's a prerequisite to the success of
any large-scale bot migration.
* It seems like Ops would be interested in having some bots run on the
WMF cluster, after sufficient code and design review. (I infer this from
the 5th step mentioned below and from my conversations with Daniel Zahn
and Ryan Lane.) I don't think there's currently a process for this sort
of promotion -- I suppose it would work a little like
Thanks to Brad for this summary!
Here is my report on the Labs discussion. It turns out that in a way we
weren't thinking big enough, or in another way we were thinking too big.
What Ops (as represented by Daniel Zahn) would prefer as far as
Wikipedia-editing bots go is one Labs project managed by someone (or
some group) who isn't Ops, and that person/group would take care of
giving people accounts within that project to run bots. I guess the
general model is vaguely along the lines of how the Toolserver is run by
WMDE and they give accounts to people. I may have found myself a major
project here. ;)
But I did also get details on how to get yourself a project on Labs, if
you have reason to, and some clarification on how the whole thing works.
In light of the above it might not be that useful for someone wanting to
move a bot, but in general it's still useful.
Step one is to get a Labs/Gerrit account via
Step two is to put together a proposal (i.e. a paragraph or two saying
what it is you plan to do) and do one of the following:
1. Go on IRC (I guess #wikimedia-labs or #wikimedia-dev or #mediawiki),
find one of the Ops people, and pitch your proposal.
2. Go on the labs-l mailing list and make your proposal.
3. File it as an "enhancement request" bug in Bugzilla.
4. Talk to an Ops person in some other way, e.g. on their talk page or
(You should probably try to get Ops as a whole to decide which of the
above they'd really prefer, or whether they'd prefer using [[mw:Developer
access]] or something like it instead.) If they decide they like your
project (Daniel says the default is currently "yes" unless they have a
reason to say "no"), they'll set it up with you as an admin over it.
Step three then is to go log into labsconsole, and you can create a new
"instance" (virtual server) within your project. I'm told there are
various generic configurations to choose from, that vary mainly in how
much virtual RAM and disk space they have. Then you can log into your
instance and configure it however you need. I guess it will be running
Ubuntu, based on what I was told about how the Puppet configuration works.
At this level, it is either possible or on the roadmap to get various things:
* Public IP addresses. [possible right now]
* Ability to have a Git repository automagically cloned, turned into a
deb package, and installed in a private Wikimedia-specific Ubuntu
package repository (which can then be installed from). [possible now]
* A UDP feed of recent changes, similar to the feed available over IRC
from [[meta:IRC/Channels#Recent changes]] but without having to mess
with IRC; in fact, it's the same data that is used to generate those
IRC feeds (a listener sketch follows this list). [roadmap?]
* Access to a replicated database something like they have on the
Toolserver (i.e. no access to private data). [roadmap]
* Other stuff? I get the impression Ops is open to ideas to make Labs
more useful to people.
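Should the UDP feed materialize, consuming it would be about this simple;
the host and port are placeholders, the real ones would come from Ops:

    import socket

    # Listen on the (hypothetical) port the recent-changes feed is sent
    # to and print each change line as it arrives.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 9999))  # placeholder port
    while True:
        data, _addr = sock.recvfrom(65535)
        print(data.decode("utf-8", errors="replace").strip())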
BTW, I mentioned Labs to TParis and the rest of the guys over here, and
they say that a replicated copy of the database like the Toolserver has
is absolutely essential before most people could move their stuff from
the Toolserver to Labs.
As far as resources go,
* Each Labs project can request one or more public IP addresses to
assign to their instances. More than one public IP needs a good
reason; maybe that will change when IPv6 is fully supported. Instances
cannot share IP addresses. But all instances (from all projects) are
on a VLAN, so it should be possible to set up one instance as a
gateway to NAT the rest of the instances. No one has bothered to
actually do this yet. Note that accessing Wikipedia requires a public
IP; there is no "back door".
* Each Labs project has a limit on the number of open files (or
something like that) at any one time. No one has ever had to change
the current default limit on this, yet.
* There is something of a limit on disk I/O, e.g., if someone had a lot
of instances all accessing terabyte-sized files all the time, that
would blow things up. This might be a global limitation across all
instances.
* I guess most other resources are limited on a per-instance basis, for
example available disk space and RAM.
You *could* stop here, although Ops would really like you to continue on
to step 4.
Step four, once you've gotten your instance configuration figured out
and working, is to write a Puppet configuration that basically instructs
Puppet how to duplicate your setup from scratch. (How to do this will
probably turn out to be a whole howto on its own, once someone who has
time to write it up figures out the details.) At a high level, it
consists of stuff like which Ubuntu packages to install, which extra
files (also stored in the Puppet config in Git) to copy in, and so on.
This Puppet configuration needs to be put into Git/Gerrit somewhere.
Ops, or someone else with the ability to "+2", will then have to approve
it, after which you can use it just like the generic configs back in
Step 3 to create more instances. If you need to change the
configuration, it's the same deal: push the new version to Gerrit and
have someone "+2" it.
The major advantage of getting things Puppetized is that you can clone
your instance at the click of a button, and if the virtual-server
equivalent of a hard drive crash were to occur you could be up and
running on a new instance pretty much instantly. Handing administration
of the project on to someone else would also be easier, because
everything would be documented instead of being a random mish-mash whose
setup no one really understands. It's also a requirement for going on to
production.
This is not necessarily an all-or-nothing thing. For example, the list
of packages installed and configuration file settings could be
puppetized while you still log in to update your bot code manually
instead of having that managed through Puppet. But I guess it would need
to be fully completed before you could move to production.
Step five, if you want, is to move from labs to production. The major
advantages are the possibility to get even more resources (maybe even a
dedicated server rather than a virtual one) and the possibility to get
access to private data from the MediaWiki databases. I guess at this
point all configuration changes would have to go through updating the
Puppet configuration via Gerrit.
I don't know if you were listening when we were discussing downtime, but
it doesn't seem to be that much of a concern on Labs as long as you're
not running something that will bring all of Wikipedia to a halt if it
goes down for an hour or two. Compare this to how the Toolserver was
effectively down for almost three weeks at the end of March, and
everyone survived somehow (see [[Wikipedia:Village pump
(technical)/Archive 98#Toolserver replication lag]]).
Hope that's all clear and accurate!