Hi,
If you are concerned about wikitext syntax, I'd suggest using the cirrus dumps [1].
They are written in the elasticsearch bulk format [2], which is relatively easy to parse (one JSON document per line):
- odd lines are elasticsearch metadata
- even lines are articles
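The odd/even layout described above can be read with a few lines of Python. This is only a minimal sketch: the function names, the sample lines and the exact shape of the metadata object are assumptions made for illustration, and it assumes the dump files are gzip-compressed.

```python
import gzip
import json
from typing import Iterable, Iterator, Tuple


def iter_bulk_pairs(lines: Iterable[str]) -> Iterator[Tuple[dict, dict]]:
    """Yield (metadata, article) pairs from lines in bulk format.

    Line 1, 3, 5, ... hold the metadata; line 2, 4, 6, ... hold the article.
    """
    non_empty = (line for line in lines if line.strip())
    for meta_line in non_empty:
        doc_line = next(non_empty, None)
        if doc_line is None:
            break  # metadata line with no article after it (truncated input)
        yield json.loads(meta_line), json.loads(doc_line)


def iter_dump(path: str) -> Iterator[Tuple[dict, dict]]:
    """Stream pairs from a dump file (assumed here to be gzip-compressed)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_bulk_pairs(handle)


# Illustrative sample only; real metadata and article documents have more fields.
SAMPLE_LINES = [
    '{"index": {"_type": "page", "_id": "12"}}',
    '{"title": "Example", "text": "Some plain text."}',
]

for meta, article in iter_bulk_pairs(SAMPLE_LINES):
    print(meta["index"]["_id"], article["title"])
```

Because the pairs are streamed, the whole dump never has to fit in memory, which matters when you only want a subset of the articles.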
You'll find two files per wiki: a "content" dump for the content namespaces and a "general" dump for the other namespaces (help, talk, ...).
You'll have access to:
- the text field: the text representation of the article, with templates transcluded
- the auxiliary_text field: a flat text representation of infoboxes and other tables
- the source_text field: the original source text, with wikitext syntax
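For a tf-idf use case, the fields listed above could be picked out of each article document roughly as follows. Again a sketch under assumptions: the helper names are hypothetical, the "title" key is assumed to be present in each document, and auxiliary_text is handled whether it arrives as a single string or as a list of strings.

```python
from typing import Iterable, Iterator, Set, Tuple


def plain_text(doc: dict, include_auxiliary: bool = True) -> str:
    """Return the wikitext-free text of one article document."""
    parts = [doc.get("text") or ""]
    if include_auxiliary:
        aux = doc.get("auxiliary_text") or []
        if isinstance(aux, str):  # assumption: may be a string or a list
            aux = [aux]
        parts.extend(aux)
    return "\n".join(part for part in parts if part)


def select_articles(docs: Iterable[dict], wanted_titles: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, plain text) only for the articles in the chosen subset."""
    for doc in docs:
        if doc.get("title") in wanted_titles:
            yield doc["title"], plain_text(doc)


# Illustrative documents only; the values are made up.
docs = [
    {"title": "A", "text": "alpha", "auxiliary_text": ["box one", "box two"],
     "source_text": "'''alpha'''"},
    {"title": "B", "text": "beta"},
]
print(list(select_articles(docs, {"B"})))
```

Leaving source_text out is deliberate here: it is the only one of the three fields that still carries wikitext markup.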
You can see what it looks like by adding the parameter ?action=cirrusDump to the URL of any wiki article [3].
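If you prefer to inspect a single article from a script rather than a browser, something like the following could work. It is a sketch: the function names and the User-Agent string are made up, and it assumes the response body is JSON without relying on its exact structure.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def cirrus_dump_url(title: str, host: str = "en.wikipedia.org") -> str:
    """Build the article URL with the ?action=cirrusDump parameter appended."""
    page = quote(title.replace(" ", "_"))
    return f"https://{host}/wiki/{page}?action=cirrusDump"


def fetch_cirrus_dump(title: str, host: str = "en.wikipedia.org"):
    """Download and decode the dump for one article (needs network access)."""
    request = Request(
        cirrus_dump_url(title, host),
        headers={"User-Agent": "cirrus-dump-example/0.1"},  # placeholder value
    )
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


print(cirrus_dump_url("Basque pelota ball"))
```

This is only meant for spot checks; for 300k articles the dump files are the better source.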
If you want to use elasticsearch to compute your term statistics, you'll find useful information in this blog post [4].
[1] https://dumps.wikimedia.org/other/cirrussearch/current/
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.ht...
[3] https://en.wikipedia.org/wiki/Basque_pelota_ball?action=cirrusDump
[4] https://www.elastic.co/blog/loading-wikipedia
On 10/10/2016 at 07:56, Sumit Asthana wrote:
Hi all,
I'm trying to do text processing on a subset of Wikipedia articles (about 300k) to calculate tf-idf scores for those articles and look for certain word occurrences.
I'm using the Wikipedia dump to extract the subset of articles. Through a simple script I can scrape the dump and extract articles but they're in Wikitext syntax.
I'd like to know if the noise added by Wikitext syntax would be significant or not? Should I go for parsing of articles to reduce them to bare text content or is there a way to ignore the Wikitext syntax while processing?
Please note that parsing looks like a much harder job for my use case, as I need only a subset of articles and I'm unable to find a utility which returns only the text content of a chosen set of articles from the dump.
--
Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE, IIT Patna
AI mailing list AI@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/ai