VisualEditor uses Parsoid to convert wiki markup to HTML. It should then be possible to strip the tags from that HTML with a standard library. https://m.mediawiki.org/wiki/Parsoid
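
As a rough sketch of that stripping step, Python's standard-library html.parser is enough once you have Parsoid's HTML output (the input string below is just a stand-in for that output):

from html.parser import HTMLParser

class TagStripper(HTMLParser):
    # Accumulate only the text nodes, discarding all markup.
    def __init__(self):
        super().__init__()
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)

# Stand-in for the HTML Parsoid would produce for one article.
html = "<p>An <b>example</b> paragraph.</p>"

stripper = TagStripper()
stripper.feed(html)
print(stripper.get_text())  # -> "An example paragraph."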

There are some alternative parsers listed here, but I have no idea how well any of them perform or scale.
https://m.mediawiki.org/wiki/Alternative_parsers

I'd love to hear if anyone has a better answer. Obviously, a plain-text dump or even an HTML dump would save a good amount of processing.
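
For what it's worth, pulling the raw wikitext out of the usual XML dumps needs nothing beyond the standard library; a minimal sketch (the filename is just an example, and the tag names follow the MediaWiki export schema):

import xml.etree.ElementTree as ET

def iter_pages(path):
    # Stream (title, wikitext) pairs out of a MediaWiki XML dump
    # without loading the whole file into memory.
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "page":  # drop the schema namespace
            title = elem.findtext("{*}title")
            text = elem.findtext("{*}revision/{*}text") or ""
            yield title, text
            elem.clear()  # release the finished subtree to keep memory flat

# Filename is just an example; use whichever dump you downloaded.
for title, wikitext in iter_pages("enwiki-latest-pages-articles.xml"):
    print(title, len(wikitext))

The <text> payload is still wikitext, of course, which is exactly why something like Parsoid or one of the parsers above is needed to get plain text out of it.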

Cheers,
Scott


On Mon, Feb 22, 2016, 15:18 Bruno Goncalves <bgoncalves@gmail.com> wrote:
Hi,

I was wondering whether there is any place where I can find text-only versions of Wikipedia (without markup, etc.) suitable for NLP tasks? I've been able to find a couple of old ones for the English Wikipedia, but I would like to analyze different languages (Mandarin, Arabic, etc.).

Of course, any pointers to software that I can use to convert the usual XML dumps to text would be great as well. 

Best,

Bruno

*******************************************
Bruno Miguel Tavares Gonçalves, PhD
Homepage: www.bgoncalves.com
Email: bgoncalves@gmail.com
*******************************************
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Dr. Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/