Hi,
I would like to collect data on interlanguage links for academic research purposes. I really do not want to use the dumps, since I would need to download dumps of all language Wikipedias, which would be huge. I have written a script which goes through the API, but I am wondering how often it is acceptable for me to query the API. Assuming I do not run parallel queries, do I need to wait between each query? If so, how long?
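To make this concrete, here is a stripped-down sketch of the kind of loop I mean (Python 3, standard library only; the one-second pause, the maxlag value of 5 and the example titles are just my own guesses, which is exactly what I am unsure about):

import json
import time
import urllib.parse
import urllib.request

API = "http://en.wikipedia.org/w/api.php"  # one wiki; the real run would loop over languages
UA = "interlanguage-link-survey/0.1 (contact address here)"  # placeholder contact info

def langlinks(title):
    # Ask for the interlanguage links of a single page.
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "maxlag": 5,      # back off politely when replication lag is high
        "format": "json",
    })
    req = urllib.request.Request(API + "?" + params, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

for title in ("Paris", "Bayes' theorem"):
    print(langlinks(title))
    time.sleep(1)         # serial requests with a pause -- is one second enough?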
Thanks in advance for your answers,
Robin Ryder
----
Postdoctoral researcher
CEREMADE - Paris Dauphine and CREST - INSEE
On 24.09.2010, 14:32 Robin wrote:
I would like to collect data on interlanguage links for academic research purposes. I really do not want to use the dumps, since I would need to download dumps of all language Wikipedias, which would be huge. I have written a script which goes through the API, but I am wondering how often it is acceptable for me to query the API. Assuming I do not run parallel queries, do I need to wait between each query? If so, how long?
Crawling all the Wikipedias is not an easy task either. Probably, toolserver.org would be more suitable. What data do you need, exactly?
Hi,
You don't need the full dumps. Look at (for example) the tr.wp dump that is running at the moment:
http://download.wikimedia.org/trwiki/20100924/
you'll see the text dumps and also dumps of various SQL tables. Look at the one that is labelled "Wiki interlanguage link records."
You ought to be able to reasonably download those for all of the 'pedias that you are interested in; it will certainly be better than trawling with the API. They have (if I understand correctly what you are asking) just the data you want.
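For instance, something along these lines (a rough Python sketch; the language codes, dump dates and exact file names below are illustrative and need to be checked against the directory listings, since each wiki is dumped on its own schedule):

import urllib.request

# Placeholder (language, dump date) pairs -- check http://download.wikimedia.org/<lang>wiki/
WIKIS = [("tr", "20100924"), ("fr", "20100915")]

for lang, date in WIKIS:
    name = f"{lang}wiki-{date}-langlinks.sql.gz"
    url = f"http://download.wikimedia.org/{lang}wiki/{date}/{name}"
    print("fetching", url)
    urllib.request.urlretrieve(url, name)   # saves the gzipped SQL dump locally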
Cheers, Robert
On Fri, Sep 24, 2010 at 1:19 PM, Max Semenik maxsem.wiki@gmail.com wrote:
Crawling all the Wikipedias is not an easy task either. Probably, toolserver.org would be more suitable. What data do you need, exactly?
Full dumps are not required for retrieving interlanguage links. For example, the latest fr dump contains a dedicated file for them: http://download.wikimedia.org/frwiki/20100915/frwiki-20100915-langlinks.sql....
It will be a lot faster to download this file (only 75M) than to make more than 1 million API calls for the fr wiki.
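If you would rather not import it into MySQL, here is a quick-and-dirty sketch of pulling the (ll_from, ll_lang, ll_title) rows straight out of the gzipped file (Python; the regular expression copes with backslash-escaped quotes but is a simplification, not a full SQL parser, and the exact file name should be taken from the dump page):

import gzip
import re

# One (ll_from,'ll_lang','ll_title') tuple inside the bulk INSERT statements.
ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")

with gzip.open("frwiki-20100915-langlinks.sql.gz", "rt",
               encoding="utf-8", errors="replace") as dump:
    for line in dump:
        if line.startswith("INSERT INTO"):
            for page_id, lang, title in ROW.findall(line):
                print(page_id, lang, title)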
Nico