If you have concerns regarding wikitext syntax, I'd suggest you use the
cirrus dumps:
They're written in elasticsearch bulk format, which is relatively easy to
parse (one JSON doc per line):
- odd lines are elasticsearch metadata
- even lines are the articles
You'll find two files per wiki: a "content" dump for the content namespaces and
a "general" dump for the other namespaces (help, talk, ...).
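For example, a minimal Python sketch for walking one of those files (the
filename below is an assumption, adjust it to the wiki and dump date you
actually downloaded):

import gzip
import json

# Assumed filename pattern for a downloaded "content" dump; adjust as needed.
DUMP = "enwiki-20161010-cirrussearch-content.json.gz"

with gzip.open(DUMP, "rt", encoding="utf-8") as dump:
    for metadata_line in dump:
        article_line = next(dump)              # lines come in (metadata, article) pairs
        metadata = json.loads(metadata_line)   # elasticsearch bulk header (page id, ...)
        article = json.loads(article_line)     # the article document itself
        print(metadata, article.get("title"))
        break  # drop this break to walk the whole dump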
You'll have access to the following fields (a small extraction sketch follows the list):
- the text field: the plain-text representation of the article, with templates transcluded
- the auxiliary_text field: a flat text representation of infoboxes and other auxiliary content
- the source_text field: the original source with wikitext syntax
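For example, a small sketch pulling these fields out of one decoded document
(field names as listed above; auxiliary_text is typically a list of strings):

def extract_fields(article):
    # article is one decoded JSON document from the dump parsing sketch above
    return {
        "title": article.get("title"),
        "text": article.get("text"),                      # rendered text, templates transcluded
        "auxiliary_text": article.get("auxiliary_text"),  # infoboxes and other auxiliary content
        "source_text": article.get("source_text"),        # original wikitext
    }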
You can see what it looks like by adding the ?action=cirrusDump parameter to
any wiki article URL.
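For instance, a quick sketch with the requests library (article chosen
arbitrarily; inspect the returned JSON to see the exact layout on your wiki):

import requests

# Fetch the cirrus document for a single article and eyeball its fields.
resp = requests.get("https://en.wikipedia.org/wiki/Earth",
                    params={"action": "cirrusDump"})
print(resp.json())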
If you want to work with elastic to compute your term stats, you'll find
useful information in this blog post.
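If you'd rather compute tf-idf offline from the dump instead of through
elastic, here is a minimal sketch assuming scikit-learn, where texts is a
list of the extracted text fields and "planet" is just an illustrative word:

from sklearn.feature_extraction.text import TfidfVectorizer

# texts: plain-text articles collected with the dump parsing sketch above
texts = ["first article plain text ...", "second article plain text ..."]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)   # sparse matrix, shape (n_articles, n_terms)
vocab = vectorizer.vocabulary_            # term -> column index

word = "planet"
if word in vocab:
    print(tfidf[0, vocab[word]])          # tf-idf score of the word in the first article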
On 10/10/2016 at 07:56, Sumit Asthana wrote:
I'm trying to do text processing on a subset of Wikipedia
articles (about 300k) to calculate tf-idf scores in those articles and
look for certain word occurrences.
I'm using the Wikipedia dump to extract the subset of articles.
Through a simple script I can scrape the dump and extract the articles, but
they're in Wikitext syntax.
I'd like to know whether the noise added by Wikitext syntax would be
significant or not. Should I parse the articles to reduce them to bare
text content, or is there a way to ignore the Wikitext syntax?
Please note that parsing looks like a much harder job for my use case,
as I need only a subset of articles and I'm unable to find a utility
that returns only the text content of a chosen set of articles from the dump.
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE