If you have concerns regarding wikitext syntax, I'd suggest you use the
cirrus dumps:
They're written in elasticsearch bulk format, which is relatively easy to
parse (one JSON doc per line):
- odd lines are elasticsearch metadata
- even lines are the articles
You'll find two files per wiki: a "content" dump for the content namespaces and
a "general" dump for the other namespaces (help, talk, ...).
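For example, a minimal Python sketch for walking one of those files (the
filename below is an assumption, adjust it to the wiki and dump date you
actually downloaded):

import gzip
import json

# Assumed filename pattern for a downloaded "content" dump; adjust as needed.
DUMP = "enwiki-20161010-cirrussearch-content.json.gz"

with gzip.open(DUMP, "rt", encoding="utf-8") as dump:
    for metadata_line in dump:
        article_line = next(dump)              # lines come in (metadata, article) pairs
        metadata = json.loads(metadata_line)   # elasticsearch bulk header (page id, ...)
        article = json.loads(article_line)     # the article document itself
        print(metadata, article.get("title"))
        break  # drop this break to walk the whole dump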
You'll have access to the following fields (a small extraction sketch follows the list):
- the text field: the plain-text representation of the article, with templates transcluded
- the auxiliary_text field: a flat text representation of infoboxes and other auxiliary content
- the source_text field: the original source with wikitext syntax
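For example, a small sketch pulling these fields out of one decoded document
(field names as listed above; auxiliary_text is typically a list of strings):

def extract_fields(article):
    # article is one decoded JSON document from the dump parsing sketch above
    return {
        "title": article.get("title"),
        "text": article.get("text"),                      # rendered text, templates transcluded
        "auxiliary_text": article.get("auxiliary_text"),  # infoboxes and other auxiliary content
        "source_text": article.get("source_text"),        # original wikitext
    }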
You can see what it looks like by adding the ?action=cirrusDump parameter to
any wiki article URL.
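For instance, a quick sketch with the requests library (article chosen
arbitrarily; inspect the returned JSON to see the exact layout on your wiki):

import requests

# Fetch the cirrus document for a single article and eyeball its fields.
resp = requests.get("https://en.wikipedia.org/wiki/Earth",
                    params={"action": "cirrusDump"})
print(resp.json())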
If you want to work with elastic to compute your term stats, you'll find
useful information in this blog post.
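If you'd rather compute tf-idf offline from the dump instead of through
elastic, here is a minimal sketch assuming scikit-learn, where texts is a
list of the extracted text fields and "planet" is just an illustrative word:

from sklearn.feature_extraction.text import TfidfVectorizer

# texts: plain-text articles collected with the dump parsing sketch above
texts = ["first article plain text ...", "second article plain text ..."]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)   # sparse matrix, shape (n_articles, n_terms)
vocab = vectorizer.vocabulary_            # term -> column index

word = "planet"
if word in vocab:
    print(tfidf[0, vocab[word]])          # tf-idf score of the word in the first article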
On 10/10/2016 at 07:56, Sumit Asthana wrote:
I'm trying to do text processing on a subset of Wikipedia
articles (about 300k) to calculate tf-idf scores in those articles and
look for certain word occurrences.
I'm using the Wikipedia dump to extract the subset of articles.
Through a simple script I can scrape the dump and extract the articles, but
they're in Wikitext syntax.
I'd like to know whether the noise added by Wikitext syntax would be
significant or not. Should I parse the articles to reduce them to bare
text content, or is there a way to ignore the Wikitext syntax?
Please note that parsing looks like a much harder job for my use case,
as I need only a subset of articles and I'm unable to find a utility
that returns only the text content of a chosen set of articles from the dump.
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE