Hi
For the first time, we have managed to release a complete dump of all encyclopedic articles of the English Wikipedia, *with thumbnails*.
This ZIM file is 40 GB in size and contains the current 4.5 million articles with their 3.5 million pictures: http://download.kiwix.org/zim/wikipedia_en_all.zim.torrent
This ZIM file is directly and easily usable on many types of devices: Android smartphones and Windows/OS X/Linux PCs with Kiwix, or Symbian phones with Wikionboard.
You don't need a modern computer with a powerful CPU. You can, for example, create a (read-only) Wikipedia mirror on a Raspberry Pi for ~100 USD using our dedicated ZIM web server, kiwix-serve. A demo is available here: http://library.kiwix.org/wikipedia_en_all/
As always, we also provide a packaged version (for the main PC operating systems) which includes the full-text search index, the ZIM file and the binaries: http://download.kiwix.org/portable/wikipedia_en_all.zip.torrent
Also worth noting: this file was generated in less than two weeks, thanks to several recent innovations:
- The Parsoid cluster, which gives us HTML output with additional semantic RDF tags
- mwoffliner, a Node.js script able to dump pages using the MediaWiki API (and the Parsoid API); a minimal illustration follows this list
- zimwriterfs, a tool able to compile any local HTML directory into a ZIM file
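Below is a minimal sketch (not the actual mwoffliner code) of the kind of request such a script can make: it fetches the rendered HTML of one article through the standard MediaWiki action API, using only Node's built-in https module. The article title and the console output are placeholders.

  // Hedged illustration: fetch the rendered HTML of one article via the
  // MediaWiki action API (action=parse). Not taken from mwoffliner itself.
  var https = require('https');

  var title = 'Zurich'; // placeholder article title
  var url = 'https://en.wikipedia.org/w/api.php' +
            '?action=parse&format=json&prop=text&page=' + encodeURIComponent(title);

  https.get(url, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      // The API returns the rendered article HTML under parse.text['*'].
      var html = JSON.parse(body).parse.text['*'];
      console.log(html.length + ' characters of HTML for "' + title + '"');
    });
  });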
We now have an efficient way to generate new ZIM files. Consequently, we will work to industrialize and automate the ZIM file generation process, which is probably the oldest and most important problem we still face at Kiwix.
All this would not have been possible without the support of:
- Wikimedia CH and the "ZIM autobuild" project
- Wikimedia France and the Afripedia project
- Gwicke from the WMF Parsoid dev team
By the way, we need additional developers with JavaScript/Node.js skills to fix a few issues in mwoffliner:
- Recreate the "table of contents" based on the HTML DOM (*); a possible approach is sketched after this list
- Scrape the MediaWiki ResourceLoader so that it keeps working offline (***)
- Scrape categories (**)
- Localize the script (*)
- Improve overall performance by introducing workers (**)
- Create nodezim, a Node.js binding for libzim, and use it (***; also requires compilation and C++ skills)
- Evaluate the work needed to merge mwoffliner and the new WMF PDF renderer (***)
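For the first item above, here is a hedged sketch of one possible approach: walk the already-rendered article DOM (obtained, for instance, with a library such as jsdom) and rebuild a flat table of contents from the h2/h3 headings. The function name buildToc and the id scheme are purely illustrative, not part of mwoffliner.

  // Hypothetical helper: rebuild a flat table of contents from headings.
  function buildToc(document) {
    var toc = document.createElement('ul');
    var headings = document.querySelectorAll('h2, h3');
    for (var i = 0; i < headings.length; i++) {
      var h = headings[i];
      // Make sure each heading has an id so the TOC entry can link to it.
      if (!h.id) {
        h.id = 'toc_' + i;
      }
      var li = document.createElement('li');
      var a = document.createElement('a');
      a.href = '#' + h.id;
      a.textContent = h.textContent;
      li.appendChild(a);
      toc.appendChild(li);
    }
    return toc;
  }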
Emmanuel
Hey Emmanuel,
Congratulations!
Really awesome work done by you and your team :)
Best Regards
Shahid
Brilliant. Congrats to everyone who is working on this! What is needed to scrape categories?
On 02/03/2014 01:33, Samuel Klein wrote:
Brilliant. Congrats to everyone who is working on this! What is needed to scrape categories?
0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages), download the list of categories they belong to with the MediaWiki API (a sketch of this step follows this message).
1 - For each dumped page, implement the HTML rendering of the category list at the bottom.
2 - For each category page, get the content HTML rendering from Parsoid, then compute and render sorted lists of articles and sub-categories in a similar fashion to the online version (with multiple pages if necessary).
All of this must be integrated into the Node.js script, and the category graph must be stored in Redis.
Emmanuel
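A hedged sketch of step 0 above, assuming the standard MediaWiki API and the node redis client: for one dumped page, ask which categories it belongs to and record the relation in Redis as a set. The key layout ("categories:" plus the page title) is only an illustrative convention, not the scheme actually used by mwoffliner.

  // Illustration of step 0: fetch the categories of one page and store
  // the page -> categories edges in Redis.
  var https = require('https');
  var redis = require('redis'); // npm install redis

  var client = redis.createClient(); // assumes a local Redis instance
  var title = 'Zurich'; // placeholder page title
  var url = 'https://en.wikipedia.org/w/api.php?action=query&format=json' +
            '&prop=categories&cllimit=max&titles=' + encodeURIComponent(title);

  https.get(url, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      var pages = JSON.parse(body).query.pages;
      Object.keys(pages).forEach(function (id) {
        (pages[id].categories || []).forEach(function (cat) {
          // The inverse of this graph is what the category pages will need.
          client.sadd('categories:' + title, cat.title);
        });
      });
      client.quit();
    });
  });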
On 02.03.2014 11:08, Emmanuel Engelhart wrote:
All of this must be integrated into the Node.js script, and the category graph must be stored in Redis.
What about the internal structure inside ZIM, which uses category pages (like in the wiki) for the text and a list of pointers to the pages inside the ZIM file to implement each category?
http://openzim.org/wiki/Category_Handling
/Manuel
On 04/03/2014 00:01, Manuel Schneider wrote:
What about the internal structure inside ZIM, which uses category pages (like in the wiki) for the text and a list of pointers to the pages inside the ZIM file to implement each category?
I'm not 100% sure I understand your question, but it's necessary to store the category graph as a hash table before compiling everything into a ZIM file. That's why I mentioned Redis.
In addition (though this is not mandatory to enjoy the categories), it would be great to do the normalisation and implementation work needed to store the category graph in a structured manner and avoid storing the lists in HTML pages. This is still something we have on the roadmap.
Emmanuel
On 04.03.2014 12:10, Emmanuel Engelhart wrote:
I'm not 100% sure I understand your question, but it's necessary to store the category graph as a hash table before compiling everything into a ZIM file. That's why I mentioned Redis.
You're talking about the ZIM creation process; I was talking about how to store categories in the ZIM file: so first the hash table in Redis, then the result should be those HTML pages in namespace U and the pointers in namespaces V and W.
/Manuel
This is super-exciting - can't wait to play with it :-). Congratulations!
Erik
Amazing... 64 GB cards are dropping in price... but I wonder if it could fit on 32 GB. Great work!!
Sean
Emmanuel Engelhart, 01/03/2014 18:01:
For the first time, we have managed to release a complete dump of all encyclopedic articles of the English Wikipedia, *with thumbnails*.
This ZIM file is 40 GB in size and contains the current 4.5 million articles with their 3.5 million pictures: http://download.kiwix.org/zim/wikipedia_en_all.zim.torrent
Fantastic; I was impatiently waiting for this moment. I suppose you need reseeders?
In the last few months we've seen far more ZIM files than in the past. One day all this will work so seamlessly that it will just be a periodic job on some WMF server for all Wikimedia projects, without Kelson's manual awesomeness. Six years without HTML dumps is a long time.
Thanks to GWicke for helping Kelson exploit Parsoid for additional good byproducts.
Nemo
This is excellent! Thanks for this update.
A.
By the way, are these new, improved tools documented anywhere? http://kiwix.org/wiki/Development does not seem to point in the right direction.
I'd like to have a page to which I can refer people who want to create ZIM files for Kiwix.
Cheers,
Asaf
On 07/03/2014 19:25, Asaf Bartov wrote:
By the way, are these new, improved tools documented anywhere? http://kiwix.org/wiki/Development does not seem to point in the right direction.
The usage is pretty straightforward (for IT people) and IMO everything necessary is explained in the READMEs:
- mwoffliner: https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
- zimwriterfs: https://sourceforge.net/p/kiwix/other/ci/master/tree/zimwriterfs/
NB: The goal is not that everybody creates their own full Wikipedia ZIM file. The goal is that we (Wikimedia) provide these files often enough to always have up-to-date ZIM content (so at least once per month). Thus, the challenge is now to set up an infrastructure similar to the one which creates the XML dumps.
Emmanuel
PS: We really want to publish a post on blog.wikimedia.org (so in English). If someone volunteers to write it, I would really appreciate the help.