Hi,
If you're concerned about wikitext syntax, I'd suggest using the
cirrus dumps [1].
They're written in the Elasticsearch bulk format [2], which is relatively
easy to parse (one JSON doc per line):
- odd lines are Elasticsearch metadata
- even lines are articles
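To make the odd/even layout concrete, here is a rough sketch of a reader
that pairs each metadata line with the article line that follows it. The
function names and the file name in the usage note are my own placeholders,
and I'm assuming the dump is gzip-compressed; adjust as needed.

```python
import gzip
import json
from typing import Iterable, Iterator, Tuple


def iter_bulk_pairs(lines: Iterable[str]) -> Iterator[Tuple[dict, dict]]:
    """Yield (metadata, article) pairs from lines in bulk format.

    Lines alternate: first the Elasticsearch metadata, then the article.
    """
    meta = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        doc = json.loads(raw)
        if meta is None:
            meta = doc  # odd line: metadata
        else:
            yield meta, doc  # even line: the article itself
            meta = None


def iter_dump(path: str) -> Iterator[Tuple[dict, dict]]:
    """Stream pairs from a dump file (assumed to be gzip-compressed)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from iter_bulk_pairs(fh)
```

Something like `for meta, article in iter_dump("some-cirrus-dump.json.gz")`
(hypothetical file name) then streams the dump without loading it all
into memory.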
You'll find 2 files per wiki: a "content" dump for content namespaces and
a general dump for other namespaces (help, talk, ...).
For each article you'll have access to:
- text: the text representation of the article, with templates
transcluded
- auxiliary_text: a flat text representation of infoboxes and other
tables
- source_text: the original source text, with wikitext syntax
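Since you only need a subset of articles, a small filter over the article
docs may be enough. The sketch below is only illustrative: the function
name is made up, I'm assuming each article doc carries a "title" field,
and that auxiliary_text may be either a string or a list of strings.

```python
from typing import Dict, Iterable, Set


def plain_text_for(
    articles: Iterable[dict],
    wanted_titles: Set[str],
    include_auxiliary: bool = False,
) -> Dict[str, str]:
    """Map title -> plain text for the articles whose title is wanted."""
    selected: Dict[str, str] = {}
    for article in articles:
        title = article.get("title")  # assumed field name
        if title not in wanted_titles:
            continue
        parts = [article.get("text") or ""]
        if include_auxiliary:
            aux = article.get("auxiliary_text") or []
            if isinstance(aux, str):
                aux = [aux]  # normalise to a list of strings
            parts.extend(aux)
        selected[title] = "\n".join(p for p in parts if p)
    return selected
```

Using the text field rather than source_text is what keeps the wikitext
markup out of your term counts.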
You can see what it looks like by appending the parameter
?action=cirrusDump to the URL of any wiki article [3].
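If you want to look at a handful of articles this way from a script,
something along these lines should work. It's a sketch only: the helper
names and the User-Agent string are placeholders of mine, and I'm assuming
the response body is JSON.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def cirrus_dump_url(title: str, host: str = "en.wikipedia.org") -> str:
    """Build the ?action=cirrusDump URL for an article title."""
    page = quote(title.replace(" ", "_"), safe="")
    return f"https://{host}/wiki/{page}?{urlencode({'action': 'cirrusDump'})}"


def fetch_cirrus_dump(title: str, host: str = "en.wikipedia.org"):
    """Fetch and decode the dump for one article (assumes a JSON body)."""
    request = Request(
        cirrus_dump_url(title, host),
        # Placeholder User-Agent; put your own contact details here.
        headers={"User-Agent": "cirrus-dump-example/0.1 (you@example.org)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

This is fine for spot checks; for 300k articles the dump files are the
better route.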
If you want to work with Elasticsearch to compute your term stats, you'll
find useful information in this blog post [4].
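If running Elasticsearch is more than you need, the term stats can also be
computed directly from the plain text. The following is a rough sketch in
plain Python, not what the blog post describes: the tokenizer is
deliberately naive, and the weighting (relative term frequency times
log(N / document frequency)) is just one common tf-idf variant.

```python
import math
import re
from collections import Counter
from typing import Dict, List

TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (naive)."""
    return TOKEN_RE.findall(text.lower())


def tf_idf(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return per-document tf-idf scores for a mapping doc_id -> text."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_docs = len(counts)
    scores: Dict[str, Dict[str, float]] = {}
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[doc_id] = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in term_counts.items()
        }
    return scores
```

A term that appears in every document gets a score of 0 with this
weighting, which is usually what you want for very common words.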
[1] https://dumps.wikimedia.org/other/cirrussearch/current/
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.h…
[3] https://en.wikipedia.org/wiki/Basque_pelota_ball?action=cirrusDump
[4] https://www.elastic.co/blog/loading-wikipedia
On 10/10/2016 at 07:56, Sumit Asthana wrote:
Hi all,
I'm trying to do text processing on a subset of Wikipedia
articles (about 300k) to calculate tf-idf scores in those articles and
look for certain word occurrences.
I'm using the Wikipedia dump to extract the subset of articles.
Through a simple script I can scrape the dump and extract articles but
they're in Wikitext syntax.
I'd like to know if the noise added by Wikitext syntax would be
significant or not? Should I go for parsing of articles to reduce them
to bare text content or is there a way to ignore the Wikitext syntax
while processing?
Please note that parsing looks like a much harder job for my use case,
as I need only a subset of articles and I'm unable to find a utility
which returns only the text content of a chosen set of articles from the
dump.
--
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE,
IIT Patna
_______________________________________________
AI mailing list
AI(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai