Hi all,

I'm trying to do text processing on a subset of Wikipedia articles (about 300k) to calculate tf-idf scores in those articles and look for certain word occurrences.
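
For the tf-idf step itself I'm planning to use scikit-learn, roughly like this (a rough sketch, assuming I already have the articles as plain text; the 'texts' list and the word "example" are just placeholders):

    # Rough sketch of the tf-idf step, assuming plain-text articles are
    # already available; "example" is just a placeholder word.
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["plain text of article one ...",
             "plain text of article two ..."]    # would be the ~300k articles

    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    tfidf = vectorizer.fit_transform(texts)      # sparse matrix: articles x terms

    col = vectorizer.vocabulary_.get("example")  # column for the word of interest
    if col is not None:
        print(tfidf[0, col])                     # tf-idf of "example" in article 0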

I'm using the Wikipedia dump to extract the subset of articles. With a simple script I can go through the dump and pull out the articles, but they come out in Wikitext syntax.
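
The extraction part is roughly along these lines (a simplified sketch of my script; the function and variable names are just for illustration):

    # Simplified version of the extraction script; the real one does a
    # bit more filtering. Element names follow the MediaWiki XML export.
    import bz2
    import xml.etree.ElementTree as ET

    def iter_articles(dump_path, wanted_titles):
        """Yield (title, wikitext) for the pages I care about."""
        with bz2.open(dump_path, "rb") as f:
            title, text = None, None
            for _, elem in ET.iterparse(f):
                tag = elem.tag.rsplit("}", 1)[-1]   # drop the export-schema namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text":
                    text = elem.text
                elif tag == "page":
                    if title in wanted_titles and text:
                        yield title, text           # text is still raw Wikitext here
                    title, text = None, None
                    elem.clear()                    # keep memory bounded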

I'd like to know whether the noise added by the Wikitext syntax would be significant. Should I parse the articles down to bare text content, or is there a way to ignore the Wikitext syntax while processing?

Please note that parsing looks like a much harder job for my use case, since I need only a subset of articles and I'm unable to find a utility that returns only the text content of a chosen set of articles from the dump.
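
If parsing each extracted article myself is the way to go, the per-article cleanup I had in mind is something like the following (using mwparserfromhell; I'm not sure how much markup noise it actually leaves behind):

    # Possible per-article cleanup if I do go the parsing route.
    import mwparserfromhell

    def to_plain_text(wikitext):
        """Strip templates, links, refs etc. and return approximate plain text."""
        return mwparserfromhell.parse(wikitext).strip_code()

    print(to_plain_text("'''Hello''' [[world|World]] {{citation needed}}"))
    # prints roughly: Hello World
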
--
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE,
IIT Patna