I've used the 2015-05 wp en zim dump to get full-text for experiments
with topic-modeling – specifically the Doc2Vec ("Paragraph Vectors")
algorithm available in the open-source Python library 'gensim'.
I'll also likely use it to get test/seed text (mostly article
abstracts) for my "wiki reference in tiny chunks" project, Thunkpedia.
(For prior iterations, I've used either DBpedia long-abstracts or bulk
scraping of the HTTP APIs, but I expect grabbing the first-sections of
zim-dump articles will dominate those options in every relevant
dimension.)
There are a lot of hackish scripts floating around for coercing text
from XML dumps, but since the zim dump already has
semantically-significant templates expanded, it could (and probably
should) be the preferred text source for many projects. My code to
iterate (or dump) article plain-text will be on github at some point.
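A minimal sketch of the first-section idea, assuming each article has already been read out of the ZIM file as an HTML string (reading the ZIM container itself is not shown here). This is not the code mentioned above, only an illustration using the Python standard library. The names `FirstSectionText` and `first_section_text` are hypothetical, and treating the first `<h2>` as the end of the lead section is an assumption about how the dumped article HTML is laid out.

```python
from html.parser import HTMLParser


class FirstSectionText(HTMLParser):
    """Collect visible text up to the first <h2>.

    Assumption: the lead section (abstract) ends at the first <h2>.
    """

    # Elements whose contents are dropped: the page title, scripts,
    # styles, tables (e.g. infoboxes) and footnote markers.
    SKIP = {"h1", "script", "style", "table", "sup"}
    # Elements after which a space is inserted so paragraphs don't fuse.
    BLOCK = {"p", "div", "li", "br"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "h2":
            self.done = True  # first section heading: stop collecting
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag in self.SKIP:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK and self.skip_depth == 0:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.done and self.skip_depth == 0:
            self.parts.append(data)


def first_section_text(html):
    """Return whitespace-normalized plain text of the lead section."""
    parser = FirstSectionText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling `first_section_text(article_html)` on each article would give one plain-text abstract per article. The sketch trusts the markup to be well nested; an unclosed skipped element would suppress the text that follows it, so real dump HTML may need a more forgiving parser.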
I see other (so far non-en) 2015-08 dumps starting to appear, so
looking forward to the WPEN biggie whenever it arrives. Thanks!
- Gordon
On Thu, Jul 30, 2015 at 10:55 AM, Emmanuel Engelhart <kelson(a)kiwix.org> wrote:
Dear Gordon
On 25.07.2015 01:38, Gordon Mohr wrote:
The 2015-05 enwiki nopic dump is a great resource for getting bulk
article text – much better in my experience than using scripts that
try to strip it out of XML dumps, or wrestling with a full MW+Parsoid
system.
Thank you. Do you use it for research purposes?
I see from threads earlier in the year that the
goal is monthly ZIM dumps.
Any projections for when that might be achieved, or perhaps just when
the process that succeeded in creating the 2015-05 dump(s) might be
repeated as another one-off?
Fixing that problem is my top priority, and we are getting better and better,
as you can see for yourself at
http://download.kiwix.org/zim/. Unfortunately, we have limited hardware
resources, and the software that generates these snapshots (mwoffliner) is
still a little bit buggy.
WPEN being the "worst" snapshot to generate, it is also the one that
suffers the most from these problems.
That said, I think we will achieve full monthly updates in the coming months,
and in any case I plan a new snapshot of WPEN in August.
Kind regards
Emmanuel
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication