[Labs-l] Building our ZIM farm @wmflabs

Andrew Bogott abogott at wikimedia.org
Mon Mar 9 16:41:09 UTC 2015


Emmanuel --

Erik M just pointed out that there is a similar effort towards this goal 
happening in production -- so maybe you can catch up with those folks 
and see whether you can contribute, or whether you can get what you 
need from their cluster?  I'm cc'ing Gabriel in hopes that he can 
collaborate with you or refer you to the folks who are doing the 
actual work.

https://phabricator.wikimedia.org/T17017
https://phabricator.wikimedia.org/T91853


-Andrew




On 3/5/15 5:34 AM, Emmanuel Engelhart wrote:
> Hi
>
> Following Yuvi's and Andrew's invitation, I'm writing this email to 
> explain what I want to do with wmflabs and to share my first 
> experiences with you.
>
> == Context ==
>
> Most people still don't have free and cheap broadband access that 
> would let them fully enjoy reading Wikimedia web sites. With Kiwix 
> and openZIM, a WikimediaCH program, we have been working for almost 
> ten years on solutions to bring Wikimedia content "offline".
>
> We have built a multi-platform reader (www.kiwix.org) and created 
> ZIM, a file format for storing web site snapshots (www.openzim.org). 
> As a result, Kiwix is currently the most successful solution for 
> accessing Wikipedia offline.
>
> == Problem ==
>
> However, one of the weak points of the project is that we still 
> don't manage to generate fresh snapshots (ZIM files) often enough. 
> Generating ZIM snapshots of 800+ projects periodically (we want to 
> provide a fresh version each month) requires a lot of hardware 
> resources.
>
> This might look like a detail, but it's not. The lack of up-to-date 
> snapshots holds back many actions within our movement to advertise 
> our offer more broadly. As a consequence, too few people are aware 
> of it, as the last Wikimedia readership update reported. Another 
> side effect is that every few months, volunteer developers get the 
> idea of building a new offline reader based on the XML dumps (the 
> only up-to-date snapshots we provide for now), which is close to a 
> dead-end approach.
>
> == Goal ==
>
> Our goal with wmflabs is to have a sustainable and efficient 
> solution to build, once a month, new ZIM files for all our projects 
> (for each project, one with thumbnails and one without). This is at 
> the same time a requirement for, and a part of, a broader initiative 
> whose purpose is to increase awareness of our "offline offer". Other 
> tasks include, for example, storing all the ZIM files on Wikimedia 
> servers (we currently store only part of them on 
> download.wikimedia.org) and improving their accessibility by making 
> them more visible (WPAR, for example, has customised its sidebar to 
> provide direct access).
>
> == Needs ==
>
> Building a ZIM file from a MediaWiki is done using a tool called 
> mwoffliner, a scraper based on both the Parsoid and MediaWiki APIs. 
> After scraping and rewriting the content, mwoffliner stores it in a 
> directory. At that point the content is self-sufficient (no online 
> dependencies) and can be packed in one step into a ZIM file (using a 
> tool called zimwriterfs); a sketch of this two-step pipeline follows.
>
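> Here is a minimal sketch in Python of how such a run could be 
> driven. The exact mwoffliner and zimwriterfs flag names below are 
> assumptions for illustration (check each tool's --help for the real 
> ones), and the URLs and paths are placeholders:
>
>   import subprocess
>
>   def build_zim(mw_url, parsoid_url, html_dir, zim_path):
>       # Step 1: scrape the wiki and rewrite its content into a
>       # self-sufficient directory (flag names are assumed).
>       subprocess.check_call([
>           "mwoffliner",
>           "--mwUrl", mw_url,            # MediaWiki base URL
>           "--parsoidUrl", parsoid_url,  # Parsoid endpoint
>           "--outputDirectory", html_dir,
>       ])
>       # Step 2: pack the directory into a single ZIM file.
>       subprocess.check_call([
>           "zimwriterfs",
>           "--welcome", "index.html",    # entry page inside html_dir
>           "--language", "eng",
>           html_dir,
>           zim_path,
>       ])
>
>   build_zim("https://en.wikivoyage.org",
>             "https://parsoid.example.org",
>             "/srv/dump/wikivoyage_en",
>             "/srv/zim/wikivoyage_en.zim")
>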
> To run this software, you'd better have:
> * A bit of bandwidth
> * Low network latency (lots of HTTP requests)
> * Fast storage
> * Plenty of storage (~100 GB per million articles)
> * Many cores for compression (ZIM, ZIP and picture optimisation)
> * Time (~400,000 articles can be dumped per day on one machine)
>
> My guess is that we need a total of around a dozen VMs and 1.5 TB 
> of storage; a rough back-of-the-envelope check follows.
>
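> The arithmetic behind that guess, assuming roughly 15 million 
> articles in total across all wikis (that article count is an 
> assumption chosen for illustration):
>
>   # Back-of-the-envelope sizing from the figures above.
>   articles = 15_000_000              # assumed total across all wikis
>   storage_gb = articles / 1_000_000 * 100  # ~100 GB/million -> 1500 GB
>   machine_days = articles / 400_000        # ~400k/day -> 37.5 days
>   wall_clock_days = machine_days / 12      # ~3 days with a dozen VMs
>   print(storage_gb, machine_days, wall_clock_days)
>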
> == Current achievements ==
>
> We currently have 3 x-large VMs in our "MWoffliner" project:
> https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwoffliner
>
> With them we are able to provide, once a month, ZIM files for all 
> instances of Wikivoyage, Wikinews, Wikiquote, Wikiversity, 
> Wikibooks, Wikispecies, Wikisource and Wiktionary, as well as a few 
> minor Wikipedias.
>
> Here is some feedback about our first months with wmflabs:
> * WMFlabs is a great tool; it's fully in the Wikimedia spirit and it 
> works.
> * Support on IRC is efficient and friendly.
> * We faced a bit of instability in December, but instances seem to 
> be stable now.
> * The documentation on the wikitech wiki seems pretty complete, but 
> the overall presentation is, in my opinion, too chaotic; getting 
> started would be easier with a more user-friendly presentation.
> * MediaWiki Semantic & OpenStackManager sync/cache/cookie problems 
> are a little annoying.
> * Overall VM performance looks good, although it suffers from 
> sporadic instabilities (bandwidth not available, all processes stuck 
> in "kernel time", slow storage).
>
> In general, wmflabs does the job; we are satisfied and think it is 
> a well-suited solution for our project.
>
> == Next steps ==
>
> We want to complete our effort and mirror the biggest Wikipedia 
> projects. Unfortunately, we have reached the limits of traditional 
> wmflabs usage. We need a larger quota, and we need to experiment 
> with NFS storage, because an x-large instance is not able to mirror 
> more than 1.5 million articles at a time. How might that be made 
> possible?
>
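> To give an idea of the scale gap (the English Wikipedia article 
> count here is an assumption from around early 2015):
>
>   # Why one x-large instance is not enough for the big wikis.
>   enwiki_articles = 4_700_000      # assumed English Wikipedia size
>   per_instance_limit = 1_500_000   # observed x-large ceiling
>   print(enwiki_articles / per_instance_limit)  # ~3.1 instances' worth
>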
> Thank you for your help.
>
> Regards
> Emmanuel
>



