On 04/03/2014 00:01, Manuel Schneider wrote:
> On 02.03.2014 11:08, Emmanuel Engelhart wrote:
>> On 02/03/2014 01:33, Samuel Klein wrote:
>>> Brilliant. Congrats to everyone who is working on this! What is needed to scrape categories?
>> 0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages), download the list of categories they belong to (with the MW API).
>> 1 - For each dumped page, implement the HTML rendering of the category list at the bottom.
>> 2 - For each category page, get the HTML content rendering from Parsoid, then compute and render sorted lists of articles and sub-categories in a similar fashion to the online version (split over multiple pages if necessary).
>>
>> All of this must be integrated in the nodejs script, and the category graph must be stored in Redis.
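A minimal sketch of step 0, assuming Node.js's built-in https module and the public MediaWiki API of en.wikipedia.org; API continuation (clcontinue) and error handling are left out, so this is an illustration rather than what the nodejs script actually does:

var https = require('https');

// Fetch the categories a page belongs to, via action=query&prop=categories.
function fetchCategories(title, callback) {
    var url = 'https://en.wikipedia.org/w/api.php' +
        '?action=query&prop=categories&cllimit=max&format=json' +
        '&titles=' + encodeURIComponent(title);
    https.get(url, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
            var pages = JSON.parse(body).query.pages;
            // One entry per requested title, keyed by page id.
            Object.keys(pages).forEach(function (id) {
                callback(null, pages[id].categories || []);
            });
        });
    }).on('error', callback);
}

fetchCategories('Paris', function (err, categories) {
    if (err) { throw err; }
    categories.forEach(function (c) { console.log(c.title); });
});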
> What about the internal structure inside the ZIM file, which uses category pages (as in the wiki) for the text, plus a list of pointers to the pages inside the ZIM file to implement the category?
I'm not sure I understand your question 100%, but the category graph has to be stored as a hash table before everything is compiled into a ZIM file. That's why I mentioned Redis.
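For example, with the callback-style node_redis client, the graph could be kept in Redis while scraping; the key prefixes below ('page:cats:', 'cat:members:') are made up for illustration:

var redis = require('redis');
var client = redis.createClient();

// Record one edge of the category graph, in both directions, so that
// category pages can later list their members.
function addEdge(pageTitle, categoryTitle, callback) {
    client.sadd('page:cats:' + pageTitle, categoryTitle, function (err) {
        if (err) { return callback(err); }
        client.sadd('cat:members:' + categoryTitle, pageTitle, callback);
    });
}

// When compiling the ZIM file, get the sorted members of a category
// in order to render its page.
function getCategoryMembers(categoryTitle, callback) {
    client.smembers('cat:members:' + categoryTitle, function (err, members) {
        if (err) { return callback(err); }
        callback(null, members.sort());
    });
}

Redis sets are used here instead of a literal hash table; either works as the intermediate store, the important point being that the whole graph lives in Redis until the ZIM compilation step.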
In addition (though this is not required to enjoy the categories), it would be great to do the normalisation and implementation work needed to store the category graph in a structured manner, instead of keeping the lists in HTML pages. This is still on our roadmap.
Emmanuel