Hello everyone,
Overall, running the Wikipedia-OpenStreetMap server on the toolserver
cluster (in particular on server Ptolemy) has worked reasonably well
and has not needed much maintenance attention recently. It also seems
to have handled serving map tiles to the OSM-gadget in various
Wikipedias, including the German, Spanish and Russian ones among
others, pretty well, and more recently also WIWOSM.
However, there remain a number of issues that have never been
resolved entirely satisfactorily on Ptolemy:
1) There is still a memory leak in tirex-master as well as a creep
in CPU usage over time. This has to some degree been solved by
simply restarting Tirex every 12 hours in a cron job, very much
limiting the scope of the memory leak and to a lesser degree also
the CPU usage creep. This however means that the request queues are
dropped every time.
2) The socket between tirex and mod_tile / render_list always gets
closed before the successful acknowledgment can be sent from
tirex. This means the requester can't tell if the rendering was
successful. In mod_tile this results in returning HTTP 404 errors
for tiles that need rendering, instead of returning the tiles that
were rendered on the fly. In render_list I got around the issue by
simply always assuming the render request was successful and
reconnecting to the socket for each request.
3) The socket between tirex and mod_tile / render_list refuses
connection for a (random) subset of connection requests. This
results in quite a number of rendering requests from mod_tile being
dropped as it can't connect to the tirex socket. In render_list, I
could again work around the problem by sleeping for n seconds and
then retrying the connection until it eventually succeeds.
4) The performance of PostgreSQL is still below what can be expected
from a server with the specs of Ptolemy. While clustering the
rendering tables by the geometry column several months ago brought
the performance up to a level where it can now more or less keep up
with the limited re-rendering load put on the server by Wikipedia,
low-zoom rendering is still exceptionally slow. It can also barely
keep up with replication from the OSM servers, and not infrequently
falls behind during busy times. Other servers with much slower I/O
performance, on the other hand, seem to have no problem keeping up
with diff imports.
None of these issues is directly critical, but they would be nice to
solve.
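For reference, the retry workaround from 3) can be sketched roughly as
follows. This is only an illustrative Python sketch, not the actual
render_list code: the function name connect_with_retry and its
parameters are made up for the example, and unlike the workaround
described above it gives up after a fixed number of attempts rather
than retrying indefinitely.

```python
import time


def connect_with_retry(connect, delay_seconds=1.0, max_attempts=10,
                       sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is exhausted.

    connect is any callable that opens the connection and returns it,
    raising ConnectionRefusedError when the socket refuses. All names
    here are hypothetical and only serve the illustration.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionRefusedError as err:
            last_error = err
            # Sleep only if another attempt will follow.
            if attempt < max_attempts:
                sleep(delay_seconds)
    # Every attempt was refused: re-raise the last error to the caller.
    raise last_error
```

Here connect would be a small function that opens the tirex socket and
returns it. The sleep parameter exists only so the delay can be
replaced when trying the function out.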
Next to the OpenStreetMap server hosted by the toolserver, the
Wikimedia Foundation is planning on hosting their own OpenStreetMap
tile server, at least for the mobile Wikipedia client, but
presumably also for inclusion in the "standard" Wikipedias that are
currently served by the toolserver. If I understand correctly, they
have already purchased the hardware and are holding off on
provisioning until someone can puppetize the OSM tile server
rendering stack.
Secondly, if I understand it correctly, the toolserver cluster is
slowly moving back from Solaris to Debian.
Now, with the OpenStreetMap switch of license from CC-BY-SA to ODbL
for the raw map data soon to be completed (the database has now
mostly been purged of data for which the OSMF did not get permission
to relicense), my understanding is that the OSMF will likely
recommend that everyone do a fresh import of a new ODbL-licensed
planet for legal reasons, rather than apply the soon-to-be
ODbL-licensed diffs to a CC-BY-SA base database.
One question is, would this forced full re-import of the
OpenStreetMap database, which last time took approximately 4 days,
be a good opportunity to change things in the setup? For example,
could this be used to migrate from Solaris to Debian? Some of the
above-mentioned issues might be solved by a change of OS.
Furthermore, upgrading the OSM rendering stack should be easier and
better tested on Debian. Unfortunately, key components in Debian
Squeeze appear to be quite old, including postgres at version 8.4,
mapnik at version 0.7 and the boost library (which would be needed to
compile mapnik 2.0), so they would likely need backports or
self-compilation. (They are sufficiently recent in Debian Sid and
Ubuntu 12.04.)
However, one issue is that Ptolemy is in production use, serving map
tiles to several key Wikipedias. So a replacement would be necessary
before it can be taken down for upgrades. At least for the tiles (I
am not sure what the requirement is for WIWOSM), the main issue is
serving them. (Re-)rendering would be suspended for several days
anyway during an import. If I understand it correctly, the tiles are
currently stored on the toolserver SAN, so all that would be needed
is an Apache web server with mod_tile installed.
So the main question is, would this be a good time to change things?
Are there any other problems / issues that need fixing or improving?
Or should we simply re-import a fresh ODbL planet into the existing
setup, once the license change has completed?
Kai