Hi Tarun, the Research: page you cite lists our TokTrack data set (English Wikipedia, up to 2016) and
the WikiWho API (for data up to the present), which already provides this persistence data at the
token level, so you might use that instead of computing it yourself. There are, of course, certain
rate restrictions on our side as well, but I'm sure we can accommodate you: you would not need to
issue a request for every revision ever written, only one per article. Let me know if you need any
assistance.
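A minimal sketch of how such a per-article query could be processed. The response shape shown here (an `all_tokens` list with `editor` and `o_rev_id` fields) is an assumption modelled on the WikiWho beta API and should be checked against its current documentation; the HTTP call itself is stubbed with a sample payload so the parsing logic is self-contained:

```python
from collections import Counter

# Hypothetical shape of a WikiWho "all content" response for one article
# (field names are assumptions -- verify against the live API docs).
sample_response = {
    "article_title": "Example",
    "all_tokens": [
        {"str": "zurich",   "editor": "1001", "o_rev_id": 42},
        {"str": "research", "editor": "1001", "o_rev_id": 42},
        {"str": "quality",  "editor": "2002", "o_rev_id": 57},
    ],
}

def tokens_per_editor(response):
    """Count surviving tokens attributed to each editor in one article."""
    return Counter(tok["editor"] for tok in response["all_tokens"])

counts = tokens_per_editor(sample_response)
print(counts)  # editor "1001": 2 surviving tokens, editor "2002": 1
```

Aggregating these per-article counters across all articles would then give the per-editor persistence totals without touching individual revisions.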
Best,
Fabian
-----Original Message-----
From: Wiki-research-l <wiki-research-l-bounces(a)lists.wikimedia.org> On Behalf Of
Chadha Tarun (ID SIS)
Sent: Wednesday, October 14, 2020 3:13 PM
To: wiki-research-l(a)lists.wikimedia.org
Subject: [Wiki-research-l] Large number of queries to the Wikipedia api
Dear fellow wiki-researchers,
Greetings from Zurich!
Jérôme (in cc) and I are working on a research problem that would greatly benefit
from your expert insights.
Our goal: we want to predict a measure of the average "quality" of the edits
made at the user level based on a set of covariates. To achieve this, we need to compute a
measure of edit quality, and then aggregate those measures at the editor level. To do so,
we envision relying on Aaron Halfaker's "word persistence" method [1] by
querying the Wikipedia API [2].
Our main issue: we are dealing with approximately 20 million edits in this project. If we
do all these queries serially (and assuming 4-5 seconds per query), then we would need
approximately 2.5 years to complete the job!
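For concreteness, the back-of-the-envelope arithmetic behind that estimate (assuming an average of 4.5 seconds per query, the midpoint of the 4-5 s range):

```python
EDITS = 20_000_000            # total edits to score
SECONDS_PER_QUERY = 4.5       # assumed average API round-trip time
SECONDS_PER_YEAR = 365 * 24 * 3600

total_seconds = EDITS * SECONDS_PER_QUERY   # 90 million seconds
years = total_seconds / SECONDS_PER_YEAR
print(f"{years:.1f} years")  # -> 2.9 years
```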
Our question for you: how do you guys typically handle such computationally intensive data
processing tasks?
One option to speed this up is to run several parallel processes to query the server. Does
anybody know whether there is a formal limit on the number of connections a single IP can
open to the API, and for how long? We also worry that opening several hundred connections
at the same time may adversely affect the availability of the server for others...
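One way to keep such parallelism polite is a small fixed-size worker pool rather than hundreds of simultaneous connections. A sketch of that structure, with the actual HTTP call stubbed out so it runs standalone; in practice the hypothetical `fetch_quality` would issue the API request and should respect the server's rate limits (e.g. MediaWiki's `maxlag` parameter):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_quality(title):
    """Stub for the real per-article API call; replace the body with an
    HTTP request that honours the API's rate-limiting guidance."""
    time.sleep(0.01)              # simulate network latency
    return title, len(title)      # placeholder "quality" value

titles = ["Zurich", "Wikipedia", "Statistics"]

# max_workers bounds how many connections are open at once, so the
# load on the server stays modest no matter how long the title list is.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fetch_quality, titles))

print(results)
```

With a pool of, say, 4-8 workers the 20 million article-level queries still take a long while, but the footprint on the server stays bounded and predictable.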
We thank you in advance for your help and insights!
Sincerely,
Tarun & Jérôme @ ETH Zurich
[1]
https://meta.wikimedia.org/wiki/Research:Content_persistence
[2]
https://en.wikipedia.org/w/api.php
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l