I know a lot of bot creators and authors are curious about how to get
going on Wikimedia Labs. They're also curious about whether it makes
sense to put bots on Labs, and if so, how.
In early June, at the Berlin hackathon, Anomie (Brad Jorsch) talked with
Mutante (Daniel Zahn) at my request to try to figure this out. Brad
wrote the summary below on June 2nd, and I should have forwarded it far sooner
- my apologies for the delay (I thought it would need more editing and
clarification than I ended up doing).
This guide seems useful to me. Please do make corrections and updates,
and then we can put it up at labsconsole.wikimedia.org
for people who
want this sort of HOWTO.
A couple things I want to annotate:
* Wikimedia Labs does not have Toolserver-style database replication
right now. That is on the WMF to-do list, but there is no predicted date
in the roadmap at https://www.mediawiki.org/wiki/Wikimedia_Labs . Per
what TParis and Brad say below, that's a prerequisite to the success of
moving most bots from the Toolserver to Labs.
* It seems like Ops would be interested in having some bots run on the
WMF cluster, after sufficient code and design review. (I infer this from
the 5th step mentioned below and from my conversations with Daniel Zahn
and Ryan Lane.) I don't think there's currently a process for this sort
of promotion -- I suppose it would work a little like
Thanks to Brad for this summary!
Engineering Community Manager
Here is my report on the Labs discussion. It turns out that in a way we
weren't thinking big enough, or in another way we were thinking too big.
What Ops (as represented by Daniel Zahn) would prefer as far as
Wikipedia-editing bots would be one Labs project managed by someone (or
some group) who isn't Ops, and that person/group would take care of
giving people accounts within that project to run bots. I guess the
general model is vaguely along the lines of how the Toolserver is run by
WMDE and they give accounts to people. I may have found myself a major
project here. ;)
But I did also get detail on how to get yourself a project on Labs, if
you have reason to, and some clarification on how the whole thing works.
In light of the above it might not be that useful for someone wanting to
move a bot, but in general it's still useful.
Step one is to get a Labs/Gerrit account via
Step two is to put together a proposal (i.e. a paragraph or two saying
what it is you plan to do) and do one of the following:
1. Go on IRC (I guess #wikimedia-labs or #wikimedia-dev or #mediawiki),
find one of the Ops people, and pitch your proposal.
2. Go on the labs-l mailing list and make your proposal.
3. File it as an "enhancement request" bug in Bugzilla.
4. Talk to an Ops person in some other way, e.g. on their talk page or
(You should probably try to get Ops as a whole to decide which of the
above they'd really prefer, or if they'd prefer using [[mw:Developer
access]] or something like it instead.) If they decide they like your
project (Daniel says the default answer is currently "yes" unless they
have a reason to say "no"), they'll set it up with you as an admin over
the project.
Step three is to log into labsconsole, where you can create a new
"instance" (virtual server) within your project. I'm told there are
various generic configurations to choose from, that vary mainly in how
much virtual RAM and disk space they have. Then you can log into your
instance and configure it however you need. I guess it will be running
Ubuntu, based on what I was told about how the Puppet configuration
works.
At this level, it is either possible now or on the roadmap to get
various things:
* Public IP addresses. [possible right now]
* Ability to have a Git repository automagically cloned, turned into a
deb package, and installed in a private Wikimedia-specific Ubuntu
package repository (which can then be installed from). [possible now]
* A UDP feed of recent changes, similar to the feed available over IRC
from [[meta:IRC/Channels#Recent changes]] but without having to mess
with IRC. In fact, it's the same data that is used to generate those
IRC feeds. [roadmap?]
* Access to a replicated database something like they have on the
Toolserver (i.e. no access to private data). [roadmap]
* Other stuff? I get the impression Ops is open to ideas to make Labs
more useful to people.
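To give a feel for what consuming that recent-changes feed might look like, here is a minimal sketch in Python. Caveat: the port number is a made-up placeholder, and the line format is an assumption based on the IRC feed it mirrors (plain text decorated with mIRC-style color codes), not a documented spec.

```python
import re
import socket

# mIRC-style color/formatting codes used in the recent-changes data
# (the same data that drives the IRC feeds on irc.wikimedia.org).
COLOR_RE = re.compile(r'\x03\d{0,2}(?:,\d{1,2})?|[\x02\x0f\x16\x1f]')

def strip_formatting(line):
    """Remove mIRC color/formatting codes, leaving the plain text."""
    return COLOR_RE.sub('', line)

def listen(port=9390):
    """Print recent-changes lines arriving over UDP.

    The port here is a hypothetical placeholder; whatever port the
    Labs feed actually used would go here instead.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', port))
    while True:
        data, _addr = sock.recvfrom(4096)
        print(strip_formatting(data.decode('utf-8', errors='replace')))
```

The nice part, compared to the IRC feed, is that there is no connection state to maintain: each edit arrives as one self-contained datagram.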
BTW, I mentioned Labs to TParis and the rest of the guys over here, and
they say that a replicated copy of the database like the Toolserver has
is absolutely essential before most people could move their stuff from
the Toolserver to Labs.
As far as resources go,
* Each Labs project can request one or more public IP addresses to
assign to their instances. More than one public IP needs a good
reason; maybe that will change when IPv6 is fully supported. Instances
cannot share IP addresses. But all instances (from all projects) are
on a VLAN, so it should be possible to set up one instance as a
gateway to NAT the rest of the instances. No one has bothered to
actually do this yet. Note that accessing Wikipedia requires a public
IP; there is no "back door".
* Each Labs project has a limit on the number of open files (or
something like that) at any one time. No one has ever needed to raise
the default limit yet.
* There is something of a limit on disk I/O, e.g., if someone has a lot
of instances all accessing terabyte-sized files all the time, that
would blow things up. This might be a global limitation on all of Labs.
* I guess most other resources are limited on a per-instance basis, for
example available disk space and RAM.
You *could* stop here, although Ops would really like you to continue on
to step 4.
Step four, once you've gotten your instance configuration figured out
and working, is to write a Puppet configuration that basically instructs
Puppet how to duplicate your setup from scratch. (How to do this will
probably turn out to be a whole howto on its own, once someone who has
time to write it up figures out the details.) At a high level, it
consists of stuff like which Ubuntu packages to install, which extra
files (also stored in the Puppet config in Git) to copy in, and so on.
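For flavor, a manifest along those lines might look something like the sketch below. Everything in it (the class name, the package list, the paths) is a hypothetical placeholder, not anything actually in the Wikimedia Puppet repository.

```puppet
# Hypothetical example only: names and paths are placeholders.
class mybot {
    # Ubuntu packages the bot needs
    package { ['python', 'python-simplejson']:
        ensure => present,
    }

    # An extra file stored alongside the Puppet config in Git
    file { '/etc/mybot.conf':
        source => 'puppet:///files/mybot/mybot.conf',
        owner  => 'root',
        mode   => '0444',
    }
}
```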
This Puppet configuration needs to be put into Git/Gerrit somewhere.
And then Ops or someone else with the ability to "+2" will have to
approve it, and then you can use it just like the generic configs back
in Step 3 to create more instances. If you need to change the
configuration, same deal with pushing the new version to Gerrit and
having someone "+2" it.
The major advantage of getting things Puppetized is that you can clone
your instance at the click of a button, and if the virtual-server
equivalent of a hard drive crash were to occur you could be up and
running on a new instance pretty much instantly. And handing
administration of the project on to someone else would be easier,
because everything would be documented instead of being a random
mish-mash whose setup no one really understands. It's also a
requirement for going on to production.
This is not necessarily an all-or-nothing thing. For example, the list
of packages installed and configuration file settings could be
puppetized while you still log in to update your bot code manually
instead of having that managed through Puppet. But I guess it would need
to be fully completed before you could move to production.
Step five, if you want, is to move from labs to production. The major
advantages are the possibility to get even more resources (maybe even a
dedicated server rather than a virtual one) and the possibility to get
access to private data from the MediaWiki databases. I guess at this
point all configuration changes would have to go through updating the
Puppet configuration via Gerrit.
I don't know if you were listening when we were discussing downtime, but
it doesn't seem to be that much of a concern on Labs as long as you're
not running something that will bring all of Wikipedia to a halt if it
goes down for an hour or two. Compare this to how the Toolserver was
effectively down for almost three weeks at the end of March, and
everyone survived somehow (see [[Wikipedia:Village pump
(technical)/Archive 98#Toolserver replication lag]]).
Hope that's all clear and accurate!