Hi,
I was wondering if there is any place where I can find text-only (without markup, etc.) versions of Wikipedia suitable for NLP tasks? I've been able to find a couple of old ones for the English Wikipedia, but I would like to analyze other languages as well (Mandarin, Arabic, etc.).
Of course, any pointers to software I could use to convert the usual XML dumps to plain text would also be great.
Best,
Bruno
******************************************* Bruno Miguel Tavares Gonçalves, PhD Homepage: www.bgoncalves.com Email: bgoncalves@gmail.com *******************************************
VisualEditor uses Parsoid to convert wiki markup to HTML. It could then be possible to strip the HTML with a standard library. https://m.mediawiki.org/wiki/Parsoid
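Something along these lines (an untested sketch in Python, standard library only; the class and function names are just placeholders of mine) could handle the tag stripping once you have Parsoid's HTML for a page:

# Minimal sketch: strip tags from an HTML string using only the standard library.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data, skipping the contents of <script>/<style>."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)

def html_to_text(html_string):
    parser = TextExtractor()
    parser.feed(html_string)
    return parser.text()

A proper HTML library (lxml, BeautifulSoup) would cope better with messy markup, but for Parsoid's fairly clean output the standard library may well be enough.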
There are some alternative parsers listed here, but I have no idea how well any of them perform or scale. https://m.mediawiki.org/wiki/Alternative_parsers
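For what it's worth, one of the Python parsers on that list, mwparserfromhell, has a strip_code() method; if I have the API right, going from a dump's wikitext to plain text would be roughly:

# Rough sketch, assuming the third-party mwparserfromhell package is installed.
# `wikitext` would come from the <text> element of a page in the XML dump.
import mwparserfromhell

wikitext = "'''Example''' article with a [[link|piped link]] and {{a template}}."
wikicode = mwparserfromhell.parse(wikitext)
print(wikicode.strip_code())  # roughly: "Example article with a piped link and ."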
Would love to hear if anyone has a better answer. Obviously a plain text dump or even an HTML dump could save a good amount of processing.
Cheers, Scott
Thanks for the suggestions. I'll take a look.
There used to be official HTML dumps https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't been updated in almost a decade :) HTML or plain-text dumps would be a boon for the NLP world.
Best,
B
Bruno Goncalves, 22/02/2016 22:58:
There used to be official HTML dumps https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't been updated in almost a decade :)
The job is effectively done by Kiwix now. http://download.kiwix.org/zim/wikipedia/ For instance: wikipedia_en_all_nopic_2015-05.zim 17-May-2015 10:27 15G
There are several tools to extract the HTML from a ZIM file: http://www.openzim.org/wiki/Readers
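Once the HTML is out of the ZIM file, getting to plain text is the easy part. A rough sketch (Python with the third-party lxml package; the directory names are purely illustrative, not anything Kiwix produces):

# Rough sketch: walk a directory of HTML pages already extracted from a ZIM
# file with one of the readers above, and write plain-text versions of them.
import os
from lxml import html

def dump_plain_text(src_dir, dst_dir):
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            src_path = os.path.join(root, name)
            with open(src_path, "rb") as fh:
                tree = html.fromstring(fh.read())
            text = tree.text_content()
            rel = os.path.relpath(src_path, src_dir)
            dst_path = os.path.join(dst_dir, os.path.splitext(rel)[0] + ".txt")
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            with open(dst_path, "w", encoding="utf-8") as out:
                out.write(text)

dump_plain_text("extracted_zim_html", "plain_text")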
Nemo
The job is effectively done by Kiwix now. http://download.kiwix.org/zim/wikipedia/ For instance: wikipedia_en_all_nopic_2015-05.zim 17-May-2015 10:27 15G
Hmm... it seems like they are all several months old?
Bruno Goncalves, 23/02/2016 00:19:
wikipedia_en_all_nopic_2015-05.zim 17-May-2015 10:27 15G
Humm... It seems like they are all several months old?
As you can see, Kelson recently focused on other things like the "wp1" releases. ZIM dump production is now orders of magnitude easier than it was years ago with the dumpHTML methods, so if you have a cogent need for a more recent dump you can tell Kelson (cc'd) and he'll probably be able to help.
Feel free to send patches as well. ;-) https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
Nemo