The 2015-05 enwiki nopic dump is a great resource for getting bulk article text – much better in my experience than using scripts that try to strip it out of XML dumps, or wrestling with a full MW+Parsoid system.
I see from threads earlier in the year that the goal is monthly ZIM dumps.
Any projections for when that might be achieved, or perhaps just when the process that succeeded in creating the 2015-05 dump(s) might be repeated as another one-off?
- Gordon
Dear Gordon
On 25.07.2015 01:38, Gordon Mohr wrote:
The 2015-05 enwiki nopic dump is a great resource for getting bulk article text – much better in my experience than using scripts that try to strip it out of XML dumps, or wrestling with a full MW+Parsoid system.
Thank you. Are you using it for research purposes?
I see from threads earlier in the year that the goal is monthly ZIM dumps.
Any projections for when that might be achieved, or perhaps just when the process that succeeded in creating the 2015-05 dump(s) might be repeated as another one-off?
Fixing that problem is my top priority, and we are getting better and better, as you can see for yourself at http://download.kiwix.org/zim/. Unfortunately, we are working with limited hardware resources, and the software we use to create these snapshots (mwoffliner) is still a little bit buggy.
WPEN being the hardest snapshot to generate, it is also the one that suffers most from these problems.
That said, I think we will achieve full monthly updates within the next few months, and in any case I plan a new WPEN snapshot for August.
Kind regards,
Emmanuel
We also have an experimental set of Parsoid HTML dumps available at http://dumps.wikimedia.org/htmldumps/dumps/. This is currently a one-off run, but I do hope that we will be able to run this once per week. Please see https://phabricator.wikimedia.org/T93396 for more information & feedback.
Gabriel
I've used the 2015-05 WPEN ZIM dump to get full text for experiments with topic modeling, specifically the Doc2Vec ("Paragraph Vectors") algorithm available in the open-source Python library 'gensim'.
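(For anyone curious, the gist of that pipeline is something like the rough sketch below. The file name and parameters are just placeholders, and it uses recent gensim naming rather than what the 2015-era library expected, so treat it as illustrative, not my actual code.)

    # A rough sketch of a Doc2Vec run over pre-extracted article plain text.
    # Assumes one article per line in 'articles.txt' (an illustrative filename);
    # parameter names follow recent gensim releases.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess

    def read_corpus(path):
        with open(path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield TaggedDocument(simple_preprocess(line), [i])

    corpus = list(read_corpus("articles.txt"))
    model = Doc2Vec(vector_size=100, min_count=5, epochs=20)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Infer a vector for unseen text and look up the most similar articles.
    query = model.infer_vector(simple_preprocess("offline wikipedia reader"))
    print(model.dv.most_similar([query], topn=5))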
I'll also likely use it to get test/seed text (mostly article abstracts) for my "wiki reference in tiny chunks" project, Thunkpedia. (For prior iterations, I've used either DBpedia long-abstracts or bulk scraping of the HTTP APIs, but I expect grabbing the first-sections of zim-dump articles will dominate those options in every relevant dimension.)
There are a lot of hackish scripts floating around for coercing plain text out of the XML dumps, but since the ZIM dump already has the semantically significant templates expanded, it could (and probably should) be the preferred text source for many projects. My code to iterate over (or dump) article plain text will be on GitHub at some point.
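(To give a sense of what that looks like, here's a minimal sketch of pulling one article out of a ZIM file and stripping it to plain text. The filename, the 'A/...' article path, and the choice of python-libzim plus BeautifulSoup are all assumptions on my part; any ZIM reader and HTML-to-text step would do.)

    # Sketch: read one article from a ZIM file and reduce it to plain text.
    from libzim.reader import Archive
    from bs4 import BeautifulSoup

    zim = Archive("wikipedia_en_all_nopic_2015-05.zim")   # illustrative filename
    entry = zim.get_entry_by_path("A/Alan_Turing")        # hypothetical article path
    html = bytes(entry.get_item().content).decode("utf-8")
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
    print(text[:500])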
I see other (so far non-en) 2015-08 dumps starting to appear, so looking forward to the WPEN biggie whenever it arrives. Thanks!
- Gordon