Toolserver-l July 2012

toolserver-l@lists.wikimedia.org

21 participants
21 discussions

by Christopher David Howie

We've been receiving messages from this domain at unblock(a)toolserver.org and they appear to be related to this: <http://www.reddit.com/r/WTF/comments/l44r1/i_just_got_this_email_at_work_i_…>. Viral advertising for some film. In reality, it's a message with a crapload of images attached serving no purpose for us. Can we just block this whole domain from sending mail to toolserver accounts? It's a nuisance, and the messages are quite large. -- Chris Howie http://www.chrishowie.com http://en.wikipedia.org/wiki/User:Crazycomputers If you correspond with me on a regular basis, please read this document: http://www.chrishowie.com/email-preferences/ PGP fingerprint: 2B7A B280 8B12 21CC 260A DF65 6FCE 505A CF83 38F5 ------------------------------------------------------------------------ IMPORTANT INFORMATION/DISCLAIMER This document should be read only by those persons to whom it is addressed. If you have received this message it was obviously addressed to you and therefore you can read it. Additionally, by sending an email to ANY of my addresses or to ANY mailing lists to which I am subscribed, whether intentionally or accidentally, you are agreeing that I am "the intended recipient," and that I may do whatever I wish with the contents of any message received from you, unless a pre-existing agreement prohibits me from so doing. This overrides any disclaimer or statement of confidentiality that may be included on your message.

11 years, 2 months

Please move more longrunning-tasks to linux

by DaB.

Hello all,

the Linux servers have now been online for nearly 2 weeks. While some short- and medium-running SGE tasks are already running there, the number of long-running tasks is near zero (see the graphs at [1]); in contrast, the load on willow is still quite high. It would be nice if more of you could try to move tasks away from the Solaris boxes to the Linux boxes (or better: make the task so independent that it runs on both architectures).

Please note that in ~2 months -arch=* will replace -arch=sol as the default, so you should slowly begin to check whether your tools run on Linux (and, if not, how that can be fixed).

A word to the pywikipedia-framework users: at the moment it is unclear whether the old python-unicode-bug is fixed on our installation (see [2]). Testing (and commenting) is very welcome, but do not run a bot unsupervised.

Sincerely, DaB.

P.S.: If you are still not using SGE, consider it!

[1] http://munin.toolserver.org/Miscellaneous/turnera/index.html#sge
[2] https://jira.toolserver.org/browse/TS-1466

-- Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885

11 years, 7 months

Help from roots for file undeletion wanted

by Liangent

https://jira.toolserver.org/browse/TS-1459 People in #wikimedia-toolserver suggest I should drop a line here and shouldn't wait long, and this is it. -Liangent

11 years, 9 months

Tweetmeme web crawler

by Carl (CBM)

I noticed today that "TweetmemeBot", a web crawler, was making lots of requests to my tools and ignoring the robots.txt file. The thing scrapes at a very high rate, sometimes over 30 requests per minute. Other tool maintainers (especially geohack) may want to investigate whether this is also impacting their tools, and explicitly block the user agent if it is. - Carl
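For tool maintainers who want to block the crawler at the application level, here is a minimal Python sketch of a user-agent filter. It assumes the tool is served as a WSGI application and that the crawler's User-Agent header contains the substring "TweetmemeBot"; the function names are hypothetical, and the exact header the bot sends should be checked in your own access logs first.

```python
# Substrings (lowercase) that mark a request as coming from an unwanted crawler.
# "tweetmemebot" is taken from the crawler name mentioned above; the real
# User-Agent header may differ, so verify it against your access logs.
BLOCKED_AGENT_SUBSTRINGS = ("tweetmemebot",)


def is_blocked_agent(user_agent, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Return True if the User-Agent header contains a blocked substring."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in blocked)


def block_agents(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so that blocked user agents get a 403 response."""

    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT"), blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated access by this client is not permitted.\n"]
        # Everyone else is passed through to the wrapped tool unchanged.
        return app(environ, start_response)

    return middleware
```

A tool would be wrapped once at startup, for example `application = block_agents(application)`. Matching on a substring rather than the full header keeps the filter working if the bot changes its version number.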

11 years, 9 months

Update of SGE on Thursday 5. July

by DaB.

Hello all,

yesterday Merlissimo and I successfully tested the installation of the SGE version for the toolserver. The last step is now to install the new version on the live system. For that, the SGE service needs to stop completely on the cluster, the old version has to be removed and the new one has to be installed.

We plan to do this on Thursday 5. July between 17:30 and 22:30 UTC. During this time no SGE will work. There will be no restarting (and no migration) of stopped things after the update.

After the update is done, we will start to use the 2 Linux boxes for tools too (I will send details then).

Sincerely, DaB.

-- Userpage: [[:w:de:User:DaB.]] — PGP: 2B255885

11 years, 9 months

Wpipe

by Magnus Manske

Occasionally, "toolserver people" (both programmers and users) talk about joining up tools. Wouldn't it be great if we could use one or several toolserver tools, and "mash-up" their output to create something new and useful? And wouldn't it be even better if the users could do this directly, across tools, without programmers hard-connecting tools?

Some tools already support machine-readable output, and some tools already use others to perform a specific function. But these connections are hardcoded, output formats are often crude (tabbed text, not that there's anything wrong with that in principle), runtimes add up, and so on.

So, I went ahead and, as a first step towards a pipeline setup called wpipe ("w" for "wiki", as you no doubt have guessed), implemented an asset tracker. Here, an asset is a dataset, ideally a JSON file (I also created a simple structure to hold lists of wiki pages, along with arbitrary metadata). Each asset is tracked in a database, accessible via a unique numeric identifier, and its data is stored in a file. Assets can be created and queried via the toolserver command line, as well as via a web interface.

The usual steps in asset creation involve:
1. reserve (gets a new, unique ID)
2. start (sets a flag that the asset data creation has begun)
3. done (stores data, either by creating a file directly, or by passing the data to be stored)
4. fail (if there was an error during data creation)

Some points:
* All assets and associated data are public.
* Creation and last-access time (write or read of the actual data) are tracked, so unused assets can be removed to conserve storage.
* Currently, data creation is limited to toolserver IP addresses (yes, I know, it can be gamed. Like the rest of the wiki world.)
* The suggested JSON format should be flexible enough for most tools dealing with lists of wiki pages, but any text-based format will work for specialist uses.
* The asset system can be used by command-line tools and web tools alike.
* Existing tools should be simple to adapt; if a tool takes a list of page names, and language/project information, then using asset IDs as an alternative source should be straightforward.
* A pipeline of tools could be started asynchronously, and their progress could be tracked via JavaScript; once a tool has finished, the next one in the chain could be run, all from the user's browser.

The main web API, and documentation page, is here: http://toolserver.org/~magnus/wpipe/

That page also links to a generic "asset browser": http://toolserver.org/~magnus/wpipe/asset_info.php

As an appetizer, and feasibility demo, I adapted my own "CatScan rewrite" to use assets as an optional output. This can be done by copying the normal "CatScan rewrite" URL parameters and pasting them here: http://toolserver.org/~magnus/wpipe/toolwrap.php

This is intended as a generic starter page for command-line-enabled tools. At the moment, only "CatScan rewrite" is available, but I plan to add others. If you have tools you would like to add, I'll be happy to help you set that up.

Right now, a tool is started by this page via "nohup &"; that could change to the job submission system, if that's possible from the web servers, but right now it seems overly complicated (runtime estimation? memory estimation? sql server access? whatnot). The web page then returns the reserved output asset ID while the actual tool is running; another tool could thus be "watching" asynchronously, by polling the status every few seconds.

Of course, this whole shebang doesn't make sense unless others are willing to join in, with work on this core, or at least by enabling some of their tools; so please, if you are even slightly interested in a generic data exchange mechanism between tools, potentially leading to a pipeline-able ecosystem, by all means step forward!

Cheers, Magnus
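The reserve / start / done / fail life cycle described in this message can be sketched in a few lines of Python. This is an illustration only, not the wpipe implementation: it keeps everything in memory, whereas the message says wpipe tracks assets in a database and stores data in files, and the class, method and state names are hypothetical.

```python
import json
from enum import Enum


class State(Enum):
    RESERVED = "reserved"   # ID handed out, nothing running yet
    RUNNING = "running"     # data creation has begun
    DONE = "done"           # data stored successfully
    FAILED = "failed"       # data creation ended with an error


class AssetTracker:
    """In-memory stand-in for an asset tracker (hypothetical names)."""

    def __init__(self):
        self._next_id = 1
        self._assets = {}

    def reserve(self):
        """Step 1: hand out a new, unique numeric ID."""
        asset_id = self._next_id
        self._next_id += 1
        self._assets[asset_id] = {"state": State.RESERVED, "data": None}
        return asset_id

    def start(self, asset_id):
        """Step 2: flag that data creation has begun."""
        self._require(asset_id, State.RESERVED)
        self._assets[asset_id]["state"] = State.RUNNING

    def done(self, asset_id, data):
        """Step 3: store the finished data as JSON text."""
        self._require(asset_id, State.RUNNING)
        self._assets[asset_id]["data"] = json.dumps(data)
        self._assets[asset_id]["state"] = State.DONE

    def fail(self, asset_id):
        """Step 4: record that data creation failed."""
        self._require(asset_id, State.RUNNING)
        self._assets[asset_id]["state"] = State.FAILED

    def status(self, asset_id):
        """What a watching tool would poll every few seconds."""
        return self._assets[asset_id]["state"]

    def read(self, asset_id):
        """Return the stored data; only valid once the asset is done."""
        self._require(asset_id, State.DONE)
        return json.loads(self._assets[asset_id]["data"])

    def _require(self, asset_id, expected):
        actual = self._assets[asset_id]["state"]
        if actual is not expected:
            raise ValueError(
                f"asset {asset_id} is {actual.value}, expected {expected.value}"
            )
```

A producing tool would call `reserve()`, `start()` and then `done()` or `fail()`, while a downstream tool polls `status()` and calls `read()` once the state is `State.DONE`. Rejecting out-of-order transitions is a design choice of this sketch; the message does not say how wpipe handles them.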

11 years, 9 months

OpenStreetMap on the toolserver (Ptolemy)

by Kai Krueger

Hello everyone,

Overall, running the Wikipedia-OpenStreetMap server on the toolserver cluster (in particular on server Ptolemy) has worked reasonably well and hasn't needed much maintenance attention recently. It also seems to have handled serving map tiles to the OSM-gadget in various Wikipedias, including the German, Spanish and Russian Wikipedias amongst others, pretty well, and more recently also WIWOSM.

However, there remain a number of issues that have never been resolved entirely satisfactorily on Ptolemy:

1) There is still a memory leak in tirex-master as well as a creep in CPU usage over time. This has to some degree been solved by simply restarting Tirex every 12 hours in a cron job, very much limiting the scope of the memory leak and to a lesser degree also the CPU usage creep. This however means that the request queues are dropped every time.

2) The socket between tirex and mod_tile / render_list always gets closed before the successful acknowledgment can be sent from tirex. This means the requester can't tell if the rendering was successful. In mod_tile this results in returning HTTP 404 errors for tiles that need rendering, instead of returning the tiles that were rendered on the fly. In render_list I got around the issue by simply always assuming the render request was successful and reconnecting to the socket for each request.

3) The socket between tirex and mod_tile / render_list refuses connection for a (random) subset of connection requests. This results in quite a number of rendering requests from mod_tile being dropped as it can't connect to the tirex socket. In render_list, I could again work around the problem by sleeping for n seconds and then retrying the connection until it eventually succeeds.

4) The performance of postgresql is still below what can be expected from a server with the specs of Ptolemy. While clustering the rendering tables by the geometry column several months ago brought the performance up to a level where it can now more or less keep up with the limited re-rendering load put on the server from Wikipedia, low-zoom rendering is still exceptionally slow. Also it can barely keep up with replication from the OSM servers, and not infrequently drops behind during busy times. Other servers with much slower I/O performance, on the other hand, seem to have no problem keeping up with diff-imports.

None of these issues are directly critical, but they would be nice to solve.

Next to the OpenStreetMap server hosted by the toolserver, the Wikimedia Foundation is planning on hosting their own OpenStreetMap tile-server, at least for the mobile Wikipedia client, but presumably also for inclusion in the "standard" Wikipedias that are currently served by the toolserver. If I understand correctly, they have already purchased the hardware and are awaiting provisioning until someone can puppetize the OSM tileserver rendering stack.

Secondly, if I understand it correctly, the toolserver cluster is slowly moving back from Solaris to Debian.

The OpenStreetMap switch of license from CC-BY-SA to ODbL for the raw map data is soon to be completed. (The database has now mostly been purged of data for which the OSMF did not get permission to relicense.) My understanding is that the OSMF will likely recommend that everyone do a fresh import of a new ODbL-licensed planet for legal purposes, rather than apply the soon-to-be ODbL-licensed diffs to a base CC-BY-SA database.

One question is, would this forced full re-import of the OpenStreetMap database, which last time took approximately 4 days, be a good opportunity to change things in the setup? For example, could this be used to migrate from Solaris to Debian? Some of the above-mentioned issues might be solved by a change of OS. Furthermore, upgrading the OSM rendering stack should be easier and better tested there. Unfortunately, key components in Debian Squeeze appear to be quite old, including postgres with version 8.4, mapnik with version 0.7 and the boost library (which would be necessary to compile mapnik 2.0), so they would likely need backports or self-compilation. (They are sufficiently recent in Debian Sid or Ubuntu 12.04.)

However, an issue is that ptolemy is in productive use in several key Wikipedias, serving the map tiles. So a replacement would be necessary before being able to take it down for upgrades. At least for the tiles (I am not sure what the requirement is for WIWOSM), the main issue is serving tiles. (Re-)rendering would be suspended for several days anyway during an import. If I understand it correctly, the tiles are currently stored on the toolserver SAN. All that would be needed would be an apache webserver with mod_tile installed.

So the main question is, would this be a good time to change things? Are there any other problems / issues that need fixing or improving? Or should we simply re-import a fresh ODbL planet into the existing setup, once the license change has completed?

Kai
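The workaround mentioned in point 3, sleeping for n seconds and retrying until the connection succeeds, can be sketched in Python as follows. This is not the render_list code; the function names are hypothetical, and treating the tirex socket as a Unix domain socket at a path of your choosing is an assumption made for illustration.

```python
import socket
import time


def retry_connect(connect, delay_seconds=1.0, max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failed attempts.

    connect is any zero-argument callable that raises OSError on failure
    (ConnectionRefusedError is a subclass of OSError). With max_attempts=None
    the loop retries indefinitely, mirroring the workaround described above.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay_seconds)


def open_unix_socket(path):
    """Open a stream connection to a Unix domain socket (assumed socket type)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
    except OSError:
        sock.close()  # do not leak the descriptor on a refused connection
        raise
    return sock
```

Usage would look like `retry_connect(lambda: open_unix_socket(socket_path), delay_seconds=2)`, where `socket_path` is whatever path your installation uses. Setting `max_attempts` is a safety net that the described workaround does not have; without it, a permanently dead socket would make the caller wait forever.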

11 years, 9 months

Dumps

by John

I was taking a look at our dumps in user-store and none of them are compressed, and I was shocked by that. I know a lot of people use pywikipedia to parse the dumps, and I know it can handle bz2 files. Is there any reason we don't just make them all bz2? John
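For scripts that read dumps without pywikipedia, Python's standard `bz2` module can stream a compressed file line by line, so the dump never has to be unpacked on disk. The sketch below is only an illustration: the function name is hypothetical, and it assumes an XML dump in which each page starts with a `<page>` tag.

```python
import bz2


def count_pages(dump_path):
    """Count <page> elements in a bz2-compressed XML dump, streaming the file."""
    count = 0
    # "rt" opens the compressed file in text mode; lines are decompressed
    # on the fly, so memory use stays small even for very large dumps.
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if "<page>" in line:
                count += 1
    return count
```

Reading this way trades some CPU time for disk space, which is the usual argument for keeping large dumps compressed.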

11 years, 9 months

An issue

by Mohamed ElGedawy

Hello. I filed this issue: <https://jira.toolserver.org/browse/TS-1453>. Could a developer please approve it as soon as possible so that I can continue my work?

11 years, 9 months

Getting started with Wikimedia Labs, especially for bot authors

by Sumana Harihareswara

I know a lot of bot creators and authors are curious about how to get going on Wikimedia Labs. They're also curious about whether it makes sense to put bots on Labs, and if so, how. In early June, at the Berlin hackathon, Anomie (Brad Jorsch) talked with Mutante (Daniel Zahn) at my request to try to figure this out. He wrote the summary below on June 2nd, and I should have forwarded it far sooner - my apologies for the delay (I thought it would need more editing and clarification than I ended up doing).

This guide seems useful to me. Please do make corrections and updates, and then we can put it up at labsconsole.wikimedia.org for people who want this sort of HOWTO.

A couple of things I want to annotate:

* Wikimedia Labs does not have Toolserver-style database replication right now. That is on the WMF to-do list but there is no predicted date in the roadmap at https://www.mediawiki.org/wiki/Wikimedia_Labs . Given what TParis and Brad say below, that's a prerequisite to the success of Tool Labs.
* It seems like Ops would be interested in having some bots run on the WMF cluster, after sufficient code and design review. (I infer this from the 5th step mentioned below and from my conversations with Daniel Zahn and Ryan Lane.) I don't think there's currently a process for this sort of promotion -- I suppose it would work a little like https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment .

Thanks to Brad for this summary!

best,
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

Hi Sumana,

Here is my report on the Labs discussion. It turns out that in a way we weren't thinking big enough, or in another way we were thinking too big. What Ops (as represented by Daniel Zahn) would prefer as far as Wikipedia-editing bots would be one Labs project managed by someone (or some group) who isn't Ops, and that person/group would take care of giving people accounts within that project to run bots. I guess the general model is vaguely along the lines of how the Toolserver is run by WMDE and they give accounts to people. I may have found myself a major project here. ;)

But I did also get detail on how to get yourself a project on Labs, if you have reason to, and some clarification on how the whole thing works. In light of the above it might not be that useful for someone wanting to move a bot, but in general it's still useful.

Step one is to get a Labs/Gerrit account via https://www.mediawiki.org/wiki/Developer_access .

Step two is to put together a proposal (i.e. a paragraph or two saying what it is you plan to do) and do one of the following:

1. Go on IRC (I guess #wikimedia-labs or #wikimedia-dev or #mediawiki), find one of the Ops people, and pitch your proposal.
2. Go on the labs-l mailing list and make your proposal.
3. File it as an "enhancement request" bug in Bugzilla.
4. Talk to an Ops person in some other way, e.g. on their talk page or via email.

(You should probably try to get Ops as a whole to decide which of the above they'd really prefer, or if they'd prefer using [[mw:Developer access]] or something like it instead.)

If they decide they like your project (Daniel says the default is currently "yes" unless they have a reason to say "no"), they'll set it up with you as an admin over that project.

Step three then is to go log into labsconsole, and you can create a new "instance" (virtual server) within your project. I'm told there are various generic configurations to choose from, that vary mainly in how much virtual RAM and disk space they have. Then you can log into your instance and configure it however you need. I guess it will be running Ubuntu, based on what I was told about how the Puppet configuration works.

At this level, it is either possible or on the road map to get various helpful services:

* Public IP addresses. [possible right now]
* Ability to have a Git repository automagically cloned, turned into a deb package, and installed in a private Wikimedia-specific Ubuntu package repository (which can then be installed from). [possible now]
* A UDP feed of recent changes, similar to the feed available over IRC from [[meta:IRC/Channels#Recent changes]] but without having to mess with IRC. In fact, it's the same data that is used to generate those IRC feeds. [roadmap?]
* Access to a replicated database something like they have on the Toolserver (i.e. no access to private data). [roadmap]
* Other stuff? I get the impression Ops is open to ideas to make Labs more useful to people.

BTW, I mentioned Labs to TParis and the rest of the guys over here, and they say that a replicated copy of the database like the Toolserver has is absolutely essential before most people could move their stuff from the Toolserver to Labs.

As far as resources go,

* Each Labs project can request one or more public IP addresses to assign to their instances. More than one public IP needs a good reason; maybe that will change when IPv6 is fully supported. Instances cannot share IP addresses. But all instances (from all projects) are on a VLAN, so it should be possible to set up one instance as a gateway to NAT the rest of the instances. No one has bothered to actually do this yet. Note that accessing Wikipedia requires a public IP; there is no "back door".
* Each Labs project has a limit on the number of open files (or something like that) at any one time. No one has ever had to change the current default limit on this, yet.
* There is something of a limit on disk I/O, e.g., if someone has a lot of instances all accessing terabyte-sized files all the time, that would blow things up. This might be a global limitation on all projects.
* I guess most other resources are limited on a per-instance basis. For example, disk space available and RAM available.

You *could* stop here, although Ops would really like you to continue on to step 4.

Step four, once you've gotten your instance configuration figured out and working, is to write a Puppet configuration that basically instructs Puppet how to duplicate your setup from scratch. (How to do this will probably turn out to be a whole howto on its own, once someone who has time to write it up figures out the details.) At a high level, it consists of stuff like which Ubuntu packages to install, which extra files (also stored in the Puppet config in Git) to copy in, and so on.

This Puppet configuration needs to get put into Git/Gerrit somewhere. And then Ops or someone else with the ability to "+2" will have to approve it, and then you can use it just like the generic configs back in Step 3 to create more instances. If you need to change the configuration, same deal with pushing the new version to Gerrit and having someone "+2" it.

The major advantage of getting things Puppetized is that you can clone your instance at the click of a button, and if the virtual-server equivalent of a hard drive crash were to occur you could be up and running on a new instance pretty much instantly. And handing administration of the project on to someone else would be easier, because everything would be documented instead of being a random mish-mash whose setup no one really understands. It's also a requirement for going on to production.

This is not necessarily an all-or-nothing thing. For example, the list of packages installed and configuration file settings could be puppetized while you still log in to update your bot code manually instead of having that managed through Puppet. But I guess it would need to be fully completed before you could move to production.

Step five, if you want, is to move from labs to production. The major advantages are the possibility to get even more resources (maybe even a dedicated server rather than a virtual one) and the possibility to get access to private data from the MediaWiki databases. I guess at this point all configuration changes would have to go through updating the Puppet configuration via Gerrit.

I don't know if you were listening when we were discussing downtime, but it doesn't seem to be that much of a concern on Labs as long as you're not running something that will bring all of Wikipedia to a halt if it goes down for an hour or two. Compare this to how the Toolserver was effectively down for almost three weeks at the end of March, and everyone survived somehow (see [[Wikipedia:Village pump (technical)/Archive 98#Toolserver replication lag]]).

Hope that's all clear and accurate!

Brad

11 years, 9 months
