Hi
For the first time, we have managed to release a complete dump of all encyclopedic articles of the English Wikipedia, *with thumbnails*.
This ZIM file is 40 GB in size and contains the current 4.5 million articles with their 3.5 million pictures: http://download.kiwix.org/zim/wikipedia_en_all.zim.torrent
This ZIM file is directly and easily usable on many types of devices: Android smartphones and Windows/OS X/Linux PCs with Kiwix, or Symbian phones with Wikionboard.
You don't need a modern computer with a powerful CPU. You can, for example, create a (read-only) Wikipedia mirror on a Raspberry Pi for ~100 USD using our dedicated ZIM web server, kiwix-serve. A demo is available here: http://library.kiwix.org/wikipedia_en_all/
As always, we also provide a packaged version (for the main PC operating systems) which includes the full-text search index, the ZIM file and the binaries: http://download.kiwix.org/portable/wikipedia_en_all.zip.torrent
Also worth noting: this file was generated in less than two weeks, thanks to several recent innovations:
- The Parsoid cluster, which gives us HTML output with additional semantic RDF tags
- mwoffliner, a Node.js script able to dump pages using the MediaWiki API (and the Parsoid API); a minimal illustration follows this list
- zimwriterfs, a tool able to compile any local HTML directory into a ZIM file
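Below is a minimal sketch (not the actual mwoffliner code) of the kind of request such a script can make: it fetches the rendered HTML of one article through the standard MediaWiki action API, using only Node's built-in https module. The article title and the console output are placeholders.

  // Hedged illustration: fetch the rendered HTML of one article via the
  // MediaWiki action API (action=parse). Not taken from mwoffliner itself.
  var https = require('https');

  var title = 'Zurich'; // placeholder article title
  var url = 'https://en.wikipedia.org/w/api.php' +
            '?action=parse&format=json&prop=text&page=' + encodeURIComponent(title);

  https.get(url, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      // The API returns the rendered article HTML under parse.text['*'].
      var html = JSON.parse(body).parse.text['*'];
      console.log(html.length + ' characters of HTML for "' + title + '"');
    });
  });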
We now have an efficient way to generate new ZIM files. Consequently, we will work to industrialize and automate the ZIM file generation process, which is probably the oldest and most important problem we still face at Kiwix.
All this would not have been possible without the support of:
- Wikimedia CH and the "ZIM autobuild" project
- Wikimedia France and the Afripedia project
- Gwicke from the WMF Parsoid dev team
By the way, we need additional developers with JavaScript/Node.js skills to fix a few issues in mwoffliner:
- Recreate the "table of contents" based on the HTML DOM (*); a possible approach is sketched after this list
- Scrape the MediaWiki ResourceLoader so that it keeps working offline (***)
- Scrape categories (**)
- Localize the script (*)
- Improve overall performance by introducing workers (**)
- Create nodezim, a Node.js binding for libzim, and use it (***; also requires compilation and C++ skills)
- Evaluate the work needed to merge mwoffliner and the new WMF PDF renderer (***)
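For the first item above, here is a hedged sketch of one possible approach: walk the already-rendered article DOM (obtained, for instance, with a library such as jsdom) and rebuild a flat table of contents from the h2/h3 headings. The function name buildToc and the id scheme are purely illustrative, not part of mwoffliner.

  // Hypothetical helper: rebuild a flat table of contents from headings.
  function buildToc(document) {
    var toc = document.createElement('ul');
    var headings = document.querySelectorAll('h2, h3');
    for (var i = 0; i < headings.length; i++) {
      var h = headings[i];
      // Make sure each heading has an id so the TOC entry can link to it.
      if (!h.id) {
        h.id = 'toc_' + i;
      }
      var li = document.createElement('li');
      var a = document.createElement('a');
      a.href = '#' + h.id;
      a.textContent = h.textContent;
      li.appendChild(a);
      toc.appendChild(li);
    }
    return toc;
  }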
Emmanuel
Hey Emmanuel,
Congratulations!
Really awesome work done by you and your team :)
Best Regards
Shahid
Brilliant. Congrats to everyone who is working on this! What is needed to scrape categories?
On 02/03/2014 01:33, Samuel Klein wrote:
Brilliant. Congrats to everyone who is working on this! What is needed to scrape categories?
0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages), download the list of categories they belong to with the MediaWiki API (a sketch of this step follows this message).
1 - For each dumped page, implement the HTML rendering of the category list at the bottom.
2 - For each category page, get the content HTML rendering from Parsoid, then compute and render sorted lists of articles and sub-categories in a similar fashion to the online version (with multiple pages if necessary).
All of this must be integrated into the Node.js script, and the category graph must be stored in Redis.
Emmanuel
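A hedged sketch of step 0 above, assuming the standard MediaWiki API and the node redis client: for one dumped page, ask which categories it belongs to and record the relation in Redis as a set. The key layout ("categories:" plus the page title) is only an illustrative convention, not the scheme actually used by mwoffliner.

  // Illustration of step 0: fetch the categories of one page and store
  // the page -> categories edges in Redis.
  var https = require('https');
  var redis = require('redis'); // npm install redis

  var client = redis.createClient(); // assumes a local Redis instance
  var title = 'Zurich'; // placeholder page title
  var url = 'https://en.wikipedia.org/w/api.php?action=query&format=json' +
            '&prop=categories&cllimit=max&titles=' + encodeURIComponent(title);

  https.get(url, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      var pages = JSON.parse(body).query.pages;
      Object.keys(pages).forEach(function (id) {
        (pages[id].categories || []).forEach(function (cat) {
          // The inverse of this graph is what the category pages will need.
          client.sadd('categories:' + title, cat.title);
        });
      });
      client.quit();
    });
  });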
On 02.03.2014 11:08, Emmanuel Engelhart wrote:
All of this must be integrated into the Node.js script, and the category graph must be stored in Redis.
What about the internal structure inside ZIM, which uses category pages (like in the wiki) for the text and a list of pointers to the pages inside the ZIM file to implement each category?
http://openzim.org/wiki/Category_Handling
/Manuel
On 04/03/2014 00:01, Manuel Schneider wrote:
What about the internal structure inside ZIM, which uses category pages (like in the wiki) for the text and a list of pointers to the pages inside the ZIM file to implement each category?
I'm not 100% sure I understand your question, but it's necessary to store the category graph as a hash table before compiling everything into a ZIM file. That's why I mentioned Redis.
In addition (though this is not mandatory to enjoy the categories), it would be great to do the normalisation and implementation work needed to store the category graph in a structured manner and avoid storing the lists in HTML pages. This is still something we have on the roadmap.
Emmanuel
On 04.03.2014 12:10, Emmanuel Engelhart wrote:
I'm not 100% sure I understand your question, but it's necessary to store the category graph as a hash table before compiling everything into a ZIM file. That's why I mentioned Redis.
You're talking about the ZIM creation process; I was talking about how to store categories in the ZIM file: so first the hash table in Redis, then the result should be those HTML pages in namespace U and the pointers in namespaces V and W.
/Manuel
This is super-exciting - can't wait to play with it :-). Congratulations!
Erik
Amazing... 64 GB cards are dropping in price... but I wonder if it could fit on 32 GB. Great work!!
Sean
Emmanuel Engelhart, 01/03/2014 18:01:
For the first time, we have managed to release a complete dump of all encyclopedic articles of the English Wikipedia, *with thumbnails*.
This ZIM file is 40 GB in size and contains the current 4.5 million articles with their 3.5 million pictures: http://download.kiwix.org/zim/wikipedia_en_all.zim.torrent
Fantastic; I was impatiently waiting for this moment. I suppose you need reseeders?
In the last few months we've seen far more ZIM files than in the past. One day all this will work so seamlessly that it will just be a periodic job on some WMF server for all Wikimedia projects, without Kelson's manual awesomeness. Six years without HTML dumps is a long time.
Thanks to GWicke for helping Kelson exploit Parsoid for additional good byproducts.
Nemo
This is excellent! Thanks for this update.
A.
By the way, are these new, improved tools documented anywhere? http://kiwix.org/wiki/Development does not seem to point in the right direction.
I'd like to have a page to which I can refer people who want to create ZIM files for Kiwix.
Cheers,
Asaf
On 07/03/2014 19:25, Asaf Bartov wrote:
By the way, are these new, improved tools documented anywhere? http://kiwix.org/wiki/Development does not seem to point in the right direction.
The usage is pretty straightforward (for IT people) and IMO everything necessary is explained in the READMEs:
- mwoffliner: https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
- zimwriterfs: https://sourceforge.net/p/kiwix/other/ci/master/tree/zimwriterfs/
NB: The goal is not that everybody creates their own full Wikipedia ZIM file. The goal is that we (Wikimedia) provide these files often enough to always have up-to-date ZIM content (so at least once per month). Thus, the challenge is now to set up an infrastructure similar to the one which creates the XML dumps.
Emmanuel
PS: We really want to publish a post on blog.wikimedia.org (so in English). If someone volunteers to write it, I would really appreciate the help.