Re: [Wiki-research-l] "Quick" request

23 Feb 2016


Hi Bruno,
I have been using the WikiExtractor for this task:
https://github.com/attardi/wikiextractor
Hope this helps.
Cheers,
Marco
On 2/22/16 23:32, wiki-research-l-request@lists.wikimedia.org wrote:
...
Date: Mon, 22 Feb 2016 23:12:08 +0100
From: "Federico Leva (Nemo)" <nemowiki@gmail.com>
To: Research into Wikimedia content and communities
   <wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] "Quick" request
Message-ID: <56CB87B8.9050008@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Bruno Goncalves, 22/02/2016 22:58:
...
...
There used to be official HTML dumps
https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't
been updated in almost a decade :)
The job is effectively done by Kiwix now.
http://download.kiwix.org/zim/wikipedia/
For instance:
    wikipedia_en_all_nopic_2015-05.zim        17-May-2015 10:27   15G
There are several tools to extract the HTML from a ZIM file:
http://www.openzim.org/wiki/Readers
Nemo
