Re: [Wiki-research-l] "Quick" request

22 Feb 2016

VisualEditor uses Parsoid to convert wiki markup to HTML. It should then be
possible to strip the HTML with a standard library.
https://m.mediawiki.org/wiki/Parsoid
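
A rough sketch of the stripping step, assuming you already have a page's
HTML as a string (for example, Parsoid output). It uses only Python's
standard-library html.parser; the names TextExtractor and html_to_text are
made up for illustration, and I have not tried this at dump scale.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b> &amp; all</p>"))
```

This keeps every text node, so tables, reference lists and image captions
would still end up in the output unless they are filtered separately.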

There are some alternative parsers listed here, but I have no idea how
well any of them perform or scale.
https://m.mediawiki.org/wiki/Alternative_parsers

Would love to hear if anyone has a better answer. Obviously a plain text
dump or even an HTML dump could save a good amount of processing.

Cheers,
Scott

On Mon, Feb 22, 2016, 15:18 Bruno Goncalves <bgoncalves(a)gmail.com> wrote:

  Hi,

 I was wondering if there is any place where I can find text-only versions
 of Wikipedia (without markup, etc.) suitable for NLP tasks? I've been
 able to find a couple of old ones for the English Wikipedia, but I would
 like to analyze different languages (Mandarin, Arabic, etc.).

 Of course, any pointers to software that I can use to convert the usual
 XML dumps to text would be great as well.

 Best,

 Bruno

 *******************************************
 Bruno Miguel Tavares Gonçalves, PhD
 Homepage: www.bgoncalves.com
 Email: bgoncalves(a)gmail.com
 *******************************************
 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
 -- 
Dr. Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
