[Labs-l] Building our ZIM farm @wmflabs

Emmanuel Engelhart kelson at kiwix.org
Thu Mar 5 11:34:35 UTC 2015


Hi

Following Yuvi's and Andrew's invitation, I am writing this email to explain 
what I want to do with wmflabs and to share my first experiences.

== Context ==

Most people still don't have free or cheap broadband access that would let 
them fully enjoy reading the Wikimedia web sites. Through Kiwix and openZIM, 
a Wikimedia CH program, we have been working for almost ten years on 
solutions to bring Wikimedia content "offline".

We have built a multi-platform reader (www.kiwix.org) and have created 
ZIM, a file format to store web site snapshots (www.openzim.org). As a 
result, Kiwix is currently the most successful solution to access 
Wikipedia offline.

== Problem ==

However, one of the project's weak points is that we still do not manage to 
generate fresh snapshots (ZIM files) often enough. Generating ZIM snapshots 
periodically (we want to provide a fresh version each month) for 800+ 
projects requires considerable hardware resources.

This might look like a detail, but it is not. The lack of up-to-date 
snapshots blocks many efforts within our movement to advertise our offline 
offer more broadly. As a consequence, too few people are aware of it, as the 
last Wikimedia readership update reported. Another side effect is that every 
few months, volunteer developers get the idea of building a new offline 
reader based on the XML dumps (the only up-to-date snapshots we currently 
provide), which is close to a dead-end approach.

== Goal ==

Our goal with wmflabs is to have a sustainable and efficient solution to 
build new ZIM files for all our projects once a month (for each project, one 
with thumbnails and one without). This is both a requirement for and a part 
of a broader initiative whose purpose is to increase awareness of our 
"offline offer". Other tasks include storing all the ZIM files on Wikimedia 
servers (we currently store only part of them on download.wikimedia.org) and 
improving their accessibility by making them more visible (WPAR, for 
example, has customised its sidebar to provide direct access).

== Needs ==

Building a ZIM file from a MediaWiki is done using a tool called mwoffliner, 
a scraper based on both the Parsoid and MediaWiki APIs. After scraping and 
rewriting the content, mwoffliner stores it in a directory. The content is 
then self-sufficient (no online dependencies) and can be packed in one step 
into a ZIM file using a tool called zimwriterfs.
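The two-step pipeline above can be sketched as follows. This is only an illustration, not our actual scripts: the exact flag names for mwoffliner and zimwriterfs are assumptions here, so check each tool's `--help` output before relying on them.

```python
import subprocess

def scrape_command(mw_url: str, output_dir: str) -> list[str]:
    """Step 1: mwoffliner scrapes the wiki (via the Parsoid and MediaWiki
    APIs) into a self-contained HTML directory. Flag names are assumed."""
    return ["mwoffliner", f"--mwUrl={mw_url}", f"--outputDirectory={output_dir}"]

def pack_command(html_dir: str, zim_file: str, title: str) -> list[str]:
    """Step 2: zimwriterfs packs that directory into a single ZIM file.
    Flag names are assumed."""
    return [
        "zimwriterfs",
        "--welcome=index.html",
        f"--title={title}",
        html_dir,
        zim_file,
    ]

def build_zim(mw_url: str, html_dir: str, zim_file: str, title: str,
              dry_run: bool = False) -> list[list[str]]:
    """Return the two pipeline steps in order; run them unless dry_run."""
    commands = [scrape_command(mw_url, html_dir),
                pack_command(html_dir, zim_file, title)]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

The point is simply that the two stages are independent: the HTML directory produced by the scrape is self-sufficient, so the pack step can run later, or on another machine.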

To run this software, you ideally need:
* Some bandwidth
* Low network latency (lots of HTTP requests)
* Fast storage
* Plenty of storage (~100 GB per million articles)
* Many cores for compression (ZIM, ZIP and picture optimisation)
* Time (~400,000 articles can be dumped per day on one machine)

My guess is that we need a total of around a dozen VMs and 1.5 TB of 
storage.
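As a sanity check, the per-article figures above can be combined directly. This is illustrative arithmetic only: it ignores the second (no-thumbnails) pass per project and all per-project overhead, which is why the VM count in practice is higher than the raw machine-day figure suggests.

```python
# Back-of-envelope arithmetic from the rough figures above.
gb_per_million_articles = 100   # "~100 GB per million articles"
articles_per_day = 400_000      # "~400,000 articles can be dumped per day"
total_storage_gb = 1_500        # the 1.5 TB storage estimate

# How many articles the estimated storage budget covers:
articles_covered = total_storage_gb / gb_per_million_articles * 1_000_000

# Machine-days needed to scrape that many articles once:
machine_days = articles_covered / articles_per_day

print(f"~{articles_covered / 1e6:.0f} million articles "
      f"in {total_storage_gb} GB, ~{machine_days:.1f} machine-days per pass")
```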

== Current achievements ==

We currently have 3 x-large VMs in our "MWoffliner" project:
https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwoffliner

With them we are able to provide, once a month, ZIM files for all instances 
of Wikivoyage, Wikinews, Wikiquote, Wikiversity, Wikibooks, Wikispecies, 
Wikisource and Wiktionary, plus a few minor Wikipedias.

Here is some feedback about our first months with wmflabs:
* WMFlabs is a great tool; it is fully in the Wikimedia spirit and it works.
* Support on IRC is efficient and friendly.
* We faced a little instability in December, but instances seem to be 
stable now.
* The documentation on the wikitech wiki seems fairly complete, but its 
overall presentation is, in my opinion, too chaotic; getting started might 
be easier with a more user-friendly presentation.
* The MediaWiki Semantic & OpenStackManager sync/cache/cookie problems are 
a little annoying.
* Overall VM performance looks good, although it suffers from sporadic 
instabilities (bandwidth unavailable, all processes stuck in "kernel time", 
slow storage).

In general, wmflabs does the job; we are satisfied and think it is a 
well-suited solution for our project.

== Next steps ==

We want to complete our effort and mirror the biggest Wikipedia projects. 
Unfortunately, we have reached the limits of traditional wmflabs usage. We 
need more quota, and to experiment with the NFS storage, because an x-large 
instance is not able to mirror more than 1.5 million articles at a time. 
How might that be made possible?

Thank you for your help.

Regards
Emmanuel

-- 
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication
