Re: [Wiki-research-l] "Quick" request

22 Feb 2016


Thanks for the suggestions. I'll take a look.
There used to be official HTML dumps
https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't been
updated in almost a decade :) HTML or Plain Text dumps would be a boon for
the NLP world.
Best,
B
*******************************************
Bruno Miguel Tavares Gonçalves, PhD
Homepage: www.bgoncalves.com
Email: bgoncalves@gmail.com
*******************************************
On Mon, Feb 22, 2016 at 11:10 AM, Scott Hale computermacgyver@gmail.com
wrote:
...
Visual Editor uses Parsoid to convert markup to HTML. It could then be
possible to strip the HTML with a standard library.
https://m.mediawiki.org/wiki/Parsoid
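
For illustration, here is a minimal sketch of the "strip the HTML with a standard library" step, using Python's built-in html.parser. It assumes the HTML has already been obtained somehow (for example from Parsoid); the names TextExtractor and html_to_text are hypothetical, and real article HTML would likely need extra handling for tables, references and infoboxes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and <script>/<style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text on a fragment returns its text with the tags removed and whitespace collapsed. Since it relies only on the standard library, it should behave the same for any language edition, as long as the input is decoded to a Python string first.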
There are some alternative parsers listed here, but I have no idea how
well any of them perform or scale.
https://m.mediawiki.org/wiki/Alternative_parsers
Would love to hear if anyone has a better answer. Obviously a plain text
dump or even an HTML dump could save a good amount of processing.
Cheers,
Scott
On Mon, Feb 22, 2016, 15:18 Bruno Goncalves bgoncalves@gmail.com wrote:
...
Hi,
I was wondering if there is any place where I can find text-only versions
of Wikipedia (without markup, etc.) suitable for NLP tasks? I've been
able to find a couple of old ones for the English Wikipedia, but I would
like to analyze different languages (Mandarin, Arabic, etc.).
Of course, any pointers to software that I can use to convert the usual
XML dumps to text would be great as well.
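
As a rough sketch of the first half of that conversion, the standard library can stream through an XML dump and yield each page's title and raw text. This assumes the usual layout of page elements containing a title and a revision with a text element; iter_pages is a hypothetical name, and the yielded text is still wiki markup, so a separate markup-stripping step would be needed afterwards.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a dump file path or file object.

    Pages are parsed incrementally and cleared after use, so memory use
    stays small even for large dumps.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free the finished page's subtree
```

A compressed dump could be passed in by opening it with bz2.open first, since iterparse accepts any binary file object.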
Best,
Bruno


Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Dr. Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/

