Hi,

If you're concerned about wikitext syntax, I'd suggest using the cirrus dumps [1].

They're written in the Elasticsearch bulk format [2], which is relatively easy to parse (one JSON document per line):

- odd lines are Elasticsearch metadata

- even lines are articles
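
As a rough illustration of the layout above, here is a minimal Python sketch that pairs each metadata line with the article line that follows it. The function names (iter_bulk_docs, iter_dump) and the sample lines are made up for this example, and the sketch assumes the dump file is gzip-compressed; adjust it to whatever you actually download.

```python
import gzip
import json


def iter_bulk_docs(lines):
    """Yield (metadata, article) pairs from bulk-format lines.

    Lines come in pairs: a metadata line followed by an article line.
    """
    it = iter(lines)
    for meta_line in it:
        if not meta_line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        doc_line = next(it, None)
        if doc_line is None:
            break  # truncated file: metadata line without an article
        yield json.loads(meta_line), json.loads(doc_line)


def iter_dump(path):
    """Stream pairs from a dump file (assumed here to be gzip-compressed)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from iter_bulk_docs(f)


# Hypothetical sample lines, only to show the shape of the data.
sample = [
    '{"index": {"_type": "page", "_id": "1"}}',
    '{"title": "First page", "text": "Plain text of the first page."}',
    '{"index": {"_type": "page", "_id": "2"}}',
    '{"title": "Second page", "text": "Plain text of the second page."}',
]

for meta, doc in iter_bulk_docs(sample):
    print(meta["index"]["_id"], doc["title"])
```

Streaming the file this way avoids loading a whole dump into memory, which matters for the larger wikis.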

You'll find two files per wiki: a "content" dump for the content namespaces and a "general" dump for the other namespaces (help, talk, ...).

You'll have access to:

- the text field: the plain-text representation of the article, with templates transcluded

- the auxiliary_text: a flat text representation of infoboxes and other tables

- the source_text: the original source text with wikitext syntax
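
Since you only need a subset of articles, a small filter over the parsed article documents may be enough. The sketch below is only an assumption of how that could look: select_fields is a made-up helper, it expects each article to be a dict carrying a "title" key alongside the three fields listed above, and the sample documents are invented.

```python
def select_fields(docs, wanted_titles,
                  fields=("text", "auxiliary_text", "source_text")):
    """Return {title: {field: value}} for the articles whose title is wanted.

    `docs` is any iterable of article dicts (the even lines of the dump,
    already decoded from JSON). Missing fields come back as None.
    """
    wanted = set(wanted_titles)
    selected = {}
    for doc in docs:
        title = doc.get("title")
        if title in wanted:
            selected[title] = {name: doc.get(name) for name in fields}
    return selected


# Invented documents, just to exercise the helper.
docs = [
    {"title": "A", "text": "plain text of A", "source_text": "'''A''' wikitext"},
    {"title": "B", "text": "plain text of B", "source_text": "'''B''' wikitext"},
]

subset = select_fields(docs, ["A"])
print(subset["A"]["text"])
```

For tf-idf you would most likely feed the "text" values to your tokenizer and ignore source_text entirely.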

You can see what it looks like by adding the parameter ?action=cirrusDump to the URL of any wiki article [3].
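
If you want to build such URLs for a list of titles, something like the following sketch could work. The helper name cirrus_dump_url and its host parameter are made up for this example; it only builds the URL and does not fetch anything.

```python
from urllib.parse import quote, urlencode


def cirrus_dump_url(title, host="en.wikipedia.org"):
    """Build the ?action=cirrusDump URL for one article title."""
    # Wiki URLs use underscores instead of spaces; quote() escapes the rest.
    path = quote(title.replace(" ", "_"))
    query = urlencode({"action": "cirrusDump"})
    return f"https://{host}/wiki/{path}?{query}"


print(cirrus_dump_url("Basque pelota ball"))
```

This is handy for spot-checking a few articles; for 300k of them the dump files remain the better option.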

If you want to use Elasticsearch to compute your term stats, you'll find useful information in this blog post [4].


[1] https://dumps.wikimedia.org/other/cirrussearch/current/

[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html

[3] https://en.wikipedia.org/wiki/Basque_pelota_ball?action=cirrusDump

[4] https://www.elastic.co/blog/loading-wikipedia


On 10/10/2016 at 07:56, Sumit Asthana wrote:
Hi all,

I'm trying to do text processing on a subset of Wikipedia articles (about 300k) to calculate tf-idf scores in those articles and look for certain word occurrences.

I'm using the Wikipedia dump to extract the subset of articles. Through a simple script I can scrape the dump and extract articles but they're in Wikitext syntax.

I'd like to know if the noise added by Wikitext syntax would be significant or not? Should I go for parsing of articles to reduce them to bare text content or is there a way to ignore the Wikitext syntax while processing?

Please note that parsing looks like a much harder job for my use case, as I need only a subset of articles and I'm unable to find a utility that returns only the text content of a chosen set of articles from the dump.
--
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE,
IIT Patna


_______________________________________________
AI mailing list
AI@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai