Hello everyone,
Overall, running the Wikipedia-OpenStreetMap server on the toolserver
cluster (in particular on server Ptolemy) has worked reasonably well and
hasn't needed much maintenance attention recently. It also seems to have
handled serving map tiles to the OSM-gadget in various Wikipedias,
including the German, Spanish and Russian Wikipedias amongst others,
pretty well, and more recently also WIWOSM.
However, there remain a number of issues that have never been resolved
entirely satisfactorily on Ptolemy:
1) There is still a memory leak in tirex-master as well as a creep in
CPU usage over time. This has to some degree been solved by simply
restarting Tirex every 12 hours in a cron job, very much limiting the
scope of the memory leak and to a lesser degree also the CPU usage
creep. This, however, means that the request queues are dropped on every
restart.
2) The socket between tirex and mod_tile / render_list always gets
closed before the successful acknowledgment can be sent from tirex.
This means the requester can't tell if the rendering was successful. In
mod_tile this results in returning HTTP 404 errors for tiles that need
rendering, instead of returning the tiles that were rendered on the fly.
In render_list I got around the issue by simply always assuming the
render request was successful and reconnecting to the socket for each
request.
3) The socket between tirex and mod_tile / render_list refuses
connection for a (random) subset of connection requests. This results in
quite a number of rendering requests from mod_tile being dropped as it
can't connect to the tirex socket. In render_list, I could again work
around the problem by sleeping for n seconds and then retrying the
connection until it eventually succeeds.
4) The performance of PostgreSQL is still below what can be expected
from a server with Ptolemy's specs. While clustering the rendering
tables by the geometry column several months ago brought the
performance up to a level where it can now more or less keep up with the
limited re-rendering load put on the server from Wikipedia, low-zoom
rendering is still exceptionally slow. It can also barely keep up with
replication from the OSM servers, and not infrequently falls behind
during busy times. Other servers with much slower I/O performance, on
the other hand, seem to have no problem keeping up with diff-imports.
None of these issues is directly critical, but they would be nice to solve.
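As a rough illustration of the workaround described in 3), a retry loop
along the following lines could be used. This is only a sketch in Python,
not the actual render_list code: the function names, the assumption that
the tirex socket is a Unix stream socket, and the placeholder socket path
in the usage note are all hypothetical.

```python
import socket
import time


def connect_with_retry(connect, delay_seconds=1.0, max_attempts=None,
                       sleep=time.sleep):
    """Call connect() until it succeeds, sleeping after each refusal.

    connect      -- zero-argument callable returning a connected socket
                    (hypothetical helper, see open_unix_socket below)
    max_attempts -- None retries forever, as in the workaround described
                    above; an integer gives up after that many attempts
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except ConnectionRefusedError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Wait n seconds before trying the socket again.
            sleep(delay_seconds)


def open_unix_socket(path):
    """Open a fresh stream connection to the Unix socket at `path`.

    Assumption: the renderer listens on a Unix stream socket.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
    except OSError:
        sock.close()
        raise
    return sock
```

A caller would open a new connection per request, for example
connect_with_retry(lambda: open_unix_socket("/path/to/socket"), 5),
where the path is a placeholder. Retrying forever hides the refused
connections but does nothing about the missing acknowledgment from 2).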
In addition to the OpenStreetMap server hosted by the toolserver, the
Wikimedia Foundation is planning on hosting their own OpenStreetMap
tile server, at least for the mobile Wikipedia client, but presumably
also for inclusion in the "standard" Wikipedias that are currently served
by the toolserver. If I understand correctly, they have already purchased
the hardware and are awaiting provisioning until someone can puppetize
the OSM tileserver rendering stack.
Secondly, if I understand it correctly, the toolserver cluster is slowly
moving back from Solaris to Debian.
Now, with the OpenStreetMap switch of license from CC-BY-SA to ODbL for
the raw map data soon to be completed (the database has now mostly been
purged of data for which the OSMF did not get permission to relicense),
my understanding is that the OSMF will likely recommend that everyone do
a fresh import of a new ODbL-licensed planet for legal purposes, rather
than apply the soon-to-be ODbL-licensed diffs to a base CC-BY-SA
database.
One question is, would this forced full re-import of the OpenStreetMap
database, which last time took approximately 4 days, be a good
opportunity to change things in the setup? For example could this be
used to migrate from Solaris to Debian? Some of the above mentioned
issues might be solved by a change of OS. Furthermore, upgrading the OSM
rendering stack should be easier on Debian, where it is better tested.
Unfortunately, key components in Debian Squeeze appear to be quite old,
including postgres with version 8.4, mapnik with version 0.7 and the
boost library (which would be necessary to compile mapnik 2.0), so they
would likely need backports or self-compilation. (They are sufficiently
recent in Debian Sid or Ubuntu 12.04.)
However, an issue is that Ptolemy is in production use, serving the map
tiles for several key Wikipedias. So a replacement would be necessary
before being able to take it down for upgrades. At least for the tiles
(I am not sure what the requirement is for WIWOSM), the main requirement
is to keep serving them. (Re-)rendering would be suspended for several
days anyway during an import. If I understand it correctly, the tiles are
currently stored on the toolserver SAN. All that would be needed would be
an Apache webserver with mod_tile installed.
So the main question is, would this be a good time to change things? Are
there any other problems / issues that need fixing or improving? Or
should we simply re-import a fresh ODbL planet into the existing setup,
once the license change has completed?
Kai