On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
On 2018-05-03 20:54, Aidan Hogan wrote:
I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.
The Kiwix project provides HTML dumps of Wikipedia for offline reading: http://www.kiwix.org/downloads/
In case you need pure HTML rather than the ZIM file format, you could check out mwoffliner [1], the tool used to generate ZIM files. It dumps HTML files locally before generating the ZIM file. Although HTML is only an intermediate format for the tool, it can be kept if you wish. See [2] for more information about the options the tool accepts.
I'm not sure whether it's possible to instruct the tool to stop immediately after the pages have been dumped, thus avoiding the creation of the ZIM file altogether. But you could work around that by watching the verbose output (enabled with the '--verbose' option) to see when the dump has completed and then stopping the tool manually, along the lines of the rough invocation below.
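For reference, a typical run might look something like the following; the '--mwUrl' and '--adminEmail' parameter names are from memory and may have changed, so please check parameterList.js [2] for the authoritative list:

    mwoffliner --mwUrl=https://en.wikipedia.org/ --adminEmail=you@example.org --verbose

The HTML files should appear in the tool's output directory while it runs, so once the verbose log shows that article dumping has finished you can interrupt the process before ZIM creation starts.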
In case of any doubts about using the tool, feel free to reach out.
References:
[1] https://github.com/openzim/mwoffliner
[2] https://github.com/openzim/mwoffliner/blob/master/lib/parameterList.js