Hi Bruno,
I have been using the WikiExtractor for this task: https://github.com/attardi/wikiextractor
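In case it is useful, here is a minimal sketch of reading the extractor's output afterwards. It assumes the default plain-text output, where each article is wrapped in a <doc id="..." url="..." title="..."> ... </doc> block. The helper name iter_docs and the sample text are made up for illustration and are not part of WikiExtractor.

```python
import re

# Assumption: each article in the extracted output looks like
#   <doc id="..." url="..." title="...">
#   ...article text...
#   </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(extracted):
    """Yield (id, title, text) for each <doc> block (hypothetical helper)."""
    for match in DOC_RE.finditer(extracted):
        yield match.group("id"), match.group("title"), match.group("text").strip()


# Made-up sample standing in for one extracted output file.
sample = (
    '<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">\n'
    'Anarchism\n\nAnarchism is a political philosophy.\n'
    '</doc>\n'
    '<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">\n'
    'Autism\n\nAutism is a neurodevelopmental condition.\n'
    '</doc>\n'
)

docs = list(iter_docs(sample))
for doc_id, title, text in docs:
    print(doc_id, title, len(text))
```

For real output you would read each extracted file and pass its contents to iter_docs instead of the sample string.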
Hope this helps. Cheers,
Marco
On 2/22/16 23:32, wiki-research-l-request@lists.wikimedia.org wrote:
Date: Mon, 22 Feb 2016 23:12:08 +0100
From: "Federico Leva (Nemo)" <nemowiki@gmail.com>
To: Research into Wikimedia content and communities <wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] "Quick" request
Message-ID: <56CB87B8.9050008@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Bruno Goncalves, 22/02/2016 22:58:
There used to be official HTML dumps (https://dumps.wikimedia.org/other/static_html_dumps/), but they haven't been updated in almost a decade :)
The job is effectively done by Kiwix now: http://download.kiwix.org/zim/wikipedia/
For instance:
wikipedia_en_all_nopic_2015-05.zim    17-May-2015 10:27    15G
There are several tools to extract the HTML from a ZIM file: http://www.openzim.org/wiki/Readers
Nemo