On 02.03.2014 11:08, Emmanuel Engelhart wrote:
On 02/03/2014 01:33, Samuel Klein wrote:
Brilliant. Congrats to everyone who is working on this! What is needed to scrape categories?
0 - For all dumped pages (so at least NS_MAIN and NS_CATEGORY pages), download the list of categories they belong to (with the MW API).
1 - For each dumped page, implement the HTML rendering of the category list at the bottom.
2 - For each category page, get the HTML rendering of the content from Parsoid, then compute and render sorted lists of articles and sub-categories, similar to the online version (with multiple pages if necessary).
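Step 0 above could look roughly like the sketch below. It is only an illustration, not code from the actual script: the function names and the example API URL are made up, while `action=query`, `prop=categories`, `titles` and `cllimit` are standard MediaWiki API parameters. The continuation handling assumes the API returns a `continue` object that can be passed back as extra parameters.

```javascript
// Hypothetical helpers: names are illustrative, not taken from the real dump script.

// Build a MediaWiki API URL asking for the categories of a batch of titles.
// `continuation` is the "continue" object of a previous response, if any (assumption).
function buildCategoryQueryUrl(apiUrl, titles, continuation = {}) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'categories',
    titles: titles.join('|'),
    cllimit: 'max',
    format: 'json',
    ...continuation,
  });
  return `${apiUrl}?${params.toString()}`;
}

// Merge one API response into a Map of page title -> Set of category titles.
// Object.values() copes with "pages" being keyed by page id or being an array.
function mergeCategories(response, pageToCategories = new Map()) {
  const pages = Object.values((response.query && response.query.pages) || {});
  for (const page of pages) {
    const known = pageToCategories.get(page.title) || new Set();
    for (const category of page.categories || []) {
      known.add(category.title);
    }
    pageToCategories.set(page.title, known);
  }
  return pageToCategories;
}
```

The caller would fetch the URL, pass the parsed JSON to `mergeCategories`, and repeat with the returned continuation until the response no longer contains one.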
All of this must be integrated into the Node.js script, and the category graph must be stored in Redis.
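One possible way to store the category graph, sketched under assumptions: each category gets a Redis set holding its members, filled with `SADD`. The key prefix `cat:members:` and both function names are invented for this example, and the `Category:` prefix test assumes the English namespace name. The commands are produced as plain arrays so that any Redis client can send them.

```javascript
// Hypothetical sketch: turn a page -> categories Map into Redis SADD commands,
// one set per category. The "cat:members:" key prefix is a made-up naming scheme.
function toRedisCommands(pageToCategories) {
  const commands = [];
  for (const [page, categories] of pageToCategories) {
    for (const category of categories) {
      commands.push(['SADD', `cat:members:${category}`, page]);
    }
  }
  return commands;
}

// Split the members read back from one category set into sorted
// sub-categories and articles, ready for rendering the category page.
function splitMembers(members) {
  const sorted = [...members].sort((a, b) => a.localeCompare(b));
  return {
    subcategories: sorted.filter((title) => title.startsWith('Category:')),
    articles: sorted.filter((title) => !title.startsWith('Category:')),
  };
}
```

Reading a set back (for example with `SMEMBERS`) and passing the result to `splitMembers` gives the two lists needed for step 2; splitting them across multiple pages would be a further step.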
What about the internal structure inside ZIM, which uses category pages (as in the wiki) for the text, plus a list of pointers to the pages inside the ZIM file to implement the category?
http://openzim.org/wiki/Category_Handling
/Manuel