Hi,
I was wondering if there is any place where I can find text-only (without markup, etc.) versions of Wikipedia suitable for NLP tasks? I've been able to find a couple of old ones for the English Wikipedia, but I would like to analyze other languages as well (Mandarin, Arabic, etc.).
Of course, any pointers to software I could use to convert the usual XML dumps to plain text would also be great.
Best,
Bruno
******************************************* Bruno Miguel Tavares Gonçalves, PhD Homepage: www.bgoncalves.com Email: bgoncalves@gmail.com *******************************************
VisualEditor uses Parsoid to convert wiki markup to HTML. It could then be possible to strip the HTML with a standard library. https://m.mediawiki.org/wiki/Parsoid
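Something along these lines (an untested sketch in Python, standard library only; the class and function names are just placeholders of mine) could handle the tag stripping once you have Parsoid's HTML for a page:

# Minimal sketch: strip tags from an HTML string using only the standard library.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data, skipping the contents of <script>/<style>."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)

def html_to_text(html_string):
    parser = TextExtractor()
    parser.feed(html_string)
    return parser.text()

A proper HTML library (lxml, BeautifulSoup) would cope better with messy markup, but for Parsoid's fairly clean output the standard library may well be enough.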
There are some alternative parsers listed here, but I have no idea how well any of them perform or scale. https://m.mediawiki.org/wiki/Alternative_parsers
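For what it's worth, one of the Python parsers on that list, mwparserfromhell, has a strip_code() method; if I have the API right, going from a dump's wikitext to plain text would be roughly:

# Rough sketch, assuming the third-party mwparserfromhell package is installed.
# `wikitext` would come from the <text> element of a page in the XML dump.
import mwparserfromhell

wikitext = "'''Example''' article with a [[link|piped link]] and {{a template}}."
wikicode = mwparserfromhell.parse(wikitext)
print(wikicode.strip_code())  # roughly: "Example article with a piped link and ."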
Would love to hear if anyone has a better answer. Obviously a plain text dump or even an HTML dump could save a good amount of processing.
Cheers, Scott
Thanks for the suggestions. I'll take a look.
There used to be official HTML dumps https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't been updated in almost a decade :) HTML or plain-text dumps would be a boon for the NLP world.
Best,
B
Bruno Goncalves, 22/02/2016 22:58:
There used to be official HTML dumps https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't been updated in almost a decade :)
The job is effectively done by Kiwix now. http://download.kiwix.org/zim/wikipedia/ For instance: wikipedia_en_all_nopic_2015-05.zim 17-May-2015 10:27 15G
There are several tools to extract the HTML from a ZIM file: http://www.openzim.org/wiki/Readers
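Once the HTML is out of the ZIM file, getting to plain text is the easy part. A rough sketch (Python with the third-party lxml package; the directory names are purely illustrative, not anything Kiwix produces):

# Rough sketch: walk a directory of HTML pages already extracted from a ZIM
# file with one of the readers above, and write plain-text versions of them.
import os
from lxml import html

def dump_plain_text(src_dir, dst_dir):
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            src_path = os.path.join(root, name)
            with open(src_path, "rb") as fh:
                tree = html.fromstring(fh.read())
            text = tree.text_content()
            rel = os.path.relpath(src_path, src_dir)
            dst_path = os.path.join(dst_dir, os.path.splitext(rel)[0] + ".txt")
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            with open(dst_path, "w", encoding="utf-8") as out:
                out.write(text)

dump_plain_text("extracted_zim_html", "plain_text")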
Nemo
The job is effectively done by Kiwix now. http://download.kiwix.org/zim/wikipedia/ For instance: wikipedia_en_all_nopic_2015-05.zim 17-May-2015 10:27 15G
Hmm... it seems like they are all several months old?
Bruno Goncalves, 23/02/2016 00:19:
wikipedia_en_all_nopic_2015-05.zim 17-May-2015 10:27 15G
Humm... It seems like they are all several months old?
As you can see, Kelson recently focused on other things like the "wp1" releases. ZIM dump production is now orders of magnitude easier than it was years ago with the dumpHTML methods, so if you have a cogent need for a more recent dump you can tell Kelson (cc'd) and he'll probably be able to help.
Feel free to send patches as well. ;-) https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
Nemo