Hi,
Thanks for the quick answers, and for the useful link.
My previous e-mail was not detailed enough; sorry about that. Let me clarify:
- I don't need to crawl the entire Wikipedia, only (for example) articles in a category. ~1,000 articles would be a good start, and I definitely won't be going above ~40,000 articles.
- For every article in the data set, I need to follow every interlanguage link, and get the article creation date (i.e. the creation date of [[en:Brad Pitt]], [[fr:Brad Pitt]], [[it:Brad Pitt]], etc.). As far as I can tell, this means that I need one query for every language link.
The data are reasonably easy to get through the API. If my queries risk overloading the server, I am obviously happy to go through the toolserver (once my account gets approved!).
Robin Ryder
----
Postdoctoral researcher
CEREMADE - Paris Dauphine and CREST - INSEE
On 24.09.2010, 14:32 Robin wrote:
I would like to collect data on interlanguage links for academic research purposes. I really do not want to use the dumps, since I would need to download dumps of all language Wikipedias, which would be huge. I have written a script which goes through the API, but I am wondering how often it is acceptable for me to query the API. Assuming I do not run parallel queries, do I need to wait between each query? If so, how long?
Crawling all the Wikipedias is not an easy task either. Probably, toolserver.org would be more suitable. What data do you need, exactly?
--
Best regards,
Max Semenik ([[User:MaxSem]])
2010/9/24 Robin Ryder robin.ryder@ensae.fr:
- I don't need to crawl the entire Wikipedia, only (for example) articles in a category. ~1,000 articles would be a good start, and I definitely won't be going above ~40,000 articles.
- For every article in the data set, I need to follow every interlanguage link, and get the article creation date (i.e. the creation date of [[en:Brad Pitt]], [[fr:Brad Pitt]], [[it:Brad Pitt]], etc.). As far as I can tell, this means that I need one query for every language link.
Unfortunately, this is true. You can't use a generator because those don't work with interwiki titles, and you can't query multiple titles in one request because prop=revisions only allows that in get-only-the-latest-revision mode (and you want the earliest revision).
Hitting the API repeatedly without waiting between requests and without making parallel requests is considered acceptable usage AFAIK, but I do think that the Toolserver would better suit your needs.
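For what it's worth, here is a minimal sketch of that one-query-per-language-link approach against the standard action=query API; the User-Agent string and the "Brad Pitt" example are placeholders, and langlinks continuation is omitted for brevity:

import time
import requests

HEADERS = {"User-Agent": "interlang-research/0.1 (contact: your@email)"}

def langlinks(title, lang="en"):
    # One request per article: list every interlanguage link on the page.
    r = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "langlinks", "titles": title,
                "lllimit": "max", "format": "json"},
        headers=HEADERS)
    page = next(iter(r.json()["query"]["pages"].values()))
    return [(ll["lang"], ll["*"]) for ll in page.get("langlinks", [])]

def creation_date(title, lang):
    # One request per link: the earliest revision's timestamp is the
    # article's creation date (rvdir=newer, rvlimit=1).
    r = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "revisions", "titles": title,
                "rvlimit": 1, "rvdir": "newer", "rvprop": "timestamp",
                "format": "json"},
        headers=HEADERS)
    page = next(iter(r.json()["query"]["pages"].values()))
    return page["revisions"][0]["timestamp"]

for lang, title in [("en", "Brad Pitt")] + langlinks("Brad Pitt"):
    print(lang, title, creation_date(title, lang))
    time.sleep(1)  # sequential requests, no parallelism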
Roan Kattouw (Catrope)
On 9/24/2010 8:49 AM, Robin Ryder wrote:
Hi,
Thanks for the quick answers, and for the useful link.
My previous e-mail was not detailed enough; sorry about that. Let me clarify:
- I don't need to crawl the entire Wikipedia, only (for example) articles in a category. ~1,000 articles would be a good start, and I definitely won't be going above ~40,000 articles.
- For every article in the data set, I need to follow every interlanguage link, and get the article creation date (i.e. the creation date of [[en:Brad Pitt]], [[fr:Brad Pitt]], [[it:Brad Pitt]], etc.). As far as I can tell, this means that I need one query for every language link.
The data are reasonably easy to get through the API. If my queries risk overloading the server, I am obviously happy to go through the toolserver (once my account gets approved!).
The first part is easy to do if accuracy doesn't matter. Precision and recall are often around 50% for categories in Wikipedia, so if you really care about being right you have to construct your own categories, and it helps to have a synoptic view. Often you can get that view from Freebase and DBpedia, but I'm increasingly coming around to indexing Wikipedia directly because, for the things I care about, I can do better than DBpedia... Freebase does add some special value because they do gardening, data cleaning, data mining, hand edits and other things that clean up the mess.
Secondly, it's not hard at all to run, say, 200k requests against the API over the span of a few days. I think you could get your creation dates from the history records.
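For scale, assuming roughly five interlanguage links per article, 40,000 articles comes to about 200,000 requests; run sequentially at one request per second, that is about 55 hours, or a bit over two days.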