Hi all,
I'm trying to do text processing on a subset of Wikipedia articles (about 300k) to calculate tf-idf scores for those articles and look for certain word occurrences.
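For context, this is roughly what I have in mind for the tf-idf part once I have plain text per article. It's just a sketch using scikit-learn's TfidfVectorizer; the `articles` dict and the word "physics" are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder: title -> plain article text
articles = {"Article A": "some plain text ...", "Article B": "more text ..."}

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(articles.values())  # sparse (n_articles, n_terms) matrix

# Look up the tf-idf score of a particular word in each article
word = "physics"
if word in vectorizer.vocabulary_:
    col = vectorizer.vocabulary_[word]
    scores = tfidf[:, col].toarray().ravel()
```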
I'm using the Wikipedia dump to extract the subset of articles. With a simple script I can scan the dump and pull out the articles, but they're in Wikitext syntax.
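For reference, my extraction script is essentially a streaming pass over the dump XML, something like this (simplified; `wanted_titles` is the subset I care about):

```python
import bz2
import xml.etree.ElementTree as ET

def extract_articles(dump_path, wanted_titles):
    """Yield (title, wikitext) for pages in wanted_titles from a .xml.bz2 dump."""
    with bz2.open(dump_path, "rb") as f:
        title = None
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "text" and title in wanted_titles:
                yield title, elem.text or ""   # raw Wikitext, not plain text
            elif tag == "page":
                elem.clear()  # free memory as we go
                title = None
```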
I'd like to know whether the noise added by the Wikitext syntax would be significant. Should I parse the articles down to their bare text content, or is there a way to ignore the Wikitext syntax while processing?
Please note that parsing looks like a much harder job for my use case, as I need only a subset of articles and I'm unable to find a utility that returns just the text content of a chosen set of articles from the dump.