On Thu, 16 Dec 2010 07:50:56 +0200, Andrew Dunbar hippytrail@gmail.com wrote:
On 15 December 2010 20:24, Manuel Schneider manuel.schneider@wikimedia.ch wrote:
Hi Andrew,
maybe you'd like to check out ZIM: this is a standardized file format for compressed HTML dumps, currently focused on Wikimedia content.
There is some C++ code around to read and write ZIM files, and several projects use it, e.g. the WP1.0 project, the Israeli and Kenyan Wikipedia Offline initiatives, and more. The Wikimedia Foundation is also in the process of adopting the format so that it can provide ZIM files from Wikimedia wikis in the future.
This is very interesting and I'll be watching it. Where do the HTML dumps come from?
I do the HTML dumps on my own, using a customized version of the dumpHTML extension and additional scripts.
Emmanuel