Fwd: Building our ZIM farm @wmflabs - Offline-l

6 Mar 2015


Might be of interest to this ML!
-------- Forwarded Message --------
Subject: Building our ZIM farm @wmflabs
Date: Thu, 05 Mar 2015 12:34:35 +0100
From: Emmanuel Engelhart kelson@kiwix.org
To: labs-l@lists.wikimedia.org
Hi
Following Yuvi's and Andrew's invitation, I am writing this email to
explain what I want to do with wmflabs and to share my first experiences
with you.
== Context ==
Most people still don't have the free or cheap broadband access needed
to fully enjoy reading Wikimedia web sites. With Kiwix and openZIM, a
WikimediaCH program, we have been working for almost ten years on
solutions to bring Wikimedia content "offline".
We have built a multi-platform reader (www.kiwix.org) and have created
ZIM, a file format to store web site snapshots (www.openzim.org). As a
result, Kiwix is currently the most successful solution to access
Wikipedia offline.
== Problem ==
However, one of the weak points of the project is that we still don't
manage to generate fresh snapshots (ZIM files) often enough.
Generating ZIM snapshots periodically (we want to provide a fresh
version each month) for 800+ projects needs a lot of hardware resources.
This might look like a detail but it's not. The lack of up-to-date
snapshots holds back many actions within our movement to advertise our
offer more broadly. As a consequence, too few people are aware of it, as
reported in the last Wikimedia readership update. Another side effect is
that every few months, volunteer developers get the idea to build a new
offline reader based on the XML dumps (the only up-to-date snapshots we
provide for now), which is close to a dead-end approach.
== Goal ==
Our goal with wmflabs is to have a sustainable and efficient solution to
build, once a month, new ZIM files for all our projects (for each
project, one with thumbnails and one without). This is both a
requirement for and a part of a broader initiative whose purpose is to
increase awareness of our "offline offer". Other tasks include, for
example, storing all the ZIM files on Wikimedia servers (we currently
only store part of them on download.wikimedia.org) and improving their
accessibility by making them more visible (WPAR has, for example,
customised their sidebar to provide direct access).
== Needs ==
Building a ZIM file from a MediaWiki is done using a tool called
mwoffliner, which is a scraper based on both the Parsoid & MediaWiki
APIs. After scraping and rewriting the content, mwoffliner stores it in
a directory. At the end, the content is self-sufficient (without online
dependencies) and can then be packed in one step into a ZIM file (using
a tool called zimwriterfs).
To run this software, you need:
* A little bandwidth
* Low network latency (lots of HTTP requests)
* Fast storage
* Plenty of storage (~100 GB per million articles)
* Many cores for compression (ZIM, ZIP and picture optimisation)
* Time (~400,000 articles can be dumped per day on a machine)
My guess is that we need a total of around a dozen VMs and 1.5 TB of
storage.
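
As a rough illustration of the arithmetic behind the two per-article
figures in the list above (~100 GB per million articles, ~400,000
articles per day per machine), here is a minimal Python sketch. It is
not the method used to arrive at the dozen-VM / 1.5 TB guess; the
function name, the 30-day cycle default and the example article count
are assumptions made only for illustration.

```python
import math

# Rules of thumb taken from the list above.
GB_PER_MILLION_ARTICLES = 100
ARTICLES_PER_MACHINE_DAY = 400_000


def estimate(articles, days_available=30):
    """Return (storage_gb, machine_days, machines) for one dump.

    `days_available` is an assumed length of the build cycle; the
    default of 30 stands in for "once a month".
    """
    storage_gb = articles / 1_000_000 * GB_PER_MILLION_ARTICLES
    machine_days = articles / ARTICLES_PER_MACHINE_DAY
    # Machines needed so that scraping finishes within the cycle.
    machines = math.ceil(machine_days / days_available)
    return storage_gb, machine_days, machines


if __name__ == "__main__":
    # 1,500,000 is an illustrative input, not a measured project size.
    storage_gb, machine_days, machines = estimate(1_500_000)
    print(f"storage: {storage_gb:.0f} GB")
    print(f"scraping time: {machine_days:.2f} machine-days")
    print(f"machines for a 30-day cycle: {machines}")
```

The sketch only covers scraping time and on-disk size of one dump; it
ignores compression time, the second (no-thumbnail) variant of each
project, and any overlap between projects, so real needs may well be
higher.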
== Current achievements ==
We currently have 3 x-large VMs in our "MWoffliner" project:
https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwoffliner
With them we are able to provide, once a month, ZIM files for all
instances of Wikivoyage, Wikinews, Wikiquote, Wikiversity, Wikibooks,
Wikispecies, Wikisource, Wiktionary and a few minor Wikipedias.
Here is some feedback on our first months with wmflabs:
* WMFlabs is a great tool, it's fully in the Wikimedia spirit and it works.
* Support on IRC is efficient and friendly
* We faced some instability in December but instances seem to be
stable now
* The documentation on the wikitech wiki seems to be pretty complete,
but the overall presentation is in my opinion too chaotic, and getting
started might be easier with a more user-friendly presentation.
* Semantic MediaWiki & OpenStackManager sync/cache/cookie problems are a
little bit annoying
* Overall VM performance looks good although suffering from sporadic
instabilities (bandwidth not available, all the processes stuck in
"kernel time", slow storage).
In general, wmflabs does the job; we are satisfied and think it is a
suitable solution for our project.
== Next steps ==
We want to complete our effort and mirror the biggest Wikipedia
projects. Unfortunately, we have reached the limits of a traditional
usage of wmflabs. We need more quota and need to experiment with the NFS
storage, because an x-large instance is not able to mirror more than 1.5
million articles at a time. How might that be made possible?
Thank you for your help.
Regards
Emmanuel
-- 
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication