If APIs are not an option, you can also consider calculating word
persistence yourself -- see
This shifts your challenge from API limits to handling the very large
history dumps, but you might find that more appealing.
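If it helps, a minimal sketch of walking a pages-meta-history dump with the
mwxml library might look like the following; the dump path is a placeholder,
and the calls should be double-checked against the mwxml documentation:

# Minimal sketch: iterate over all revisions in a MediaWiki history dump.
# Assumes the mwxml library (pip install mwxml). The path is a placeholder;
# real pages-meta-history dumps are huge and usually bz2/7z-compressed,
# so decompress or stream them accordingly.
import mwxml

with open("enwiki-pages-meta-history.xml") as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:
        for revision in page:
            text = revision.text or ""
            # revision.id, revision.user, and the text are the inputs a
            # word-persistence computation would need.
            print(page.title, revision.id, len(text.split()))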
On Wed, Oct 14, 2020 at 9:31 AM Flöck, Fabian <Fabian.Floeck(a)gesis.org>
wrote:
Hi Tarun, the Research: page you cite lists our
TokTrack data set (until
2016, EN.WP) and the WikiWho API (for data up to now), which provides this
persistence data at the token level already, so you might use that instead of
computing it yourself. Of course, there are also certain API restrictions
on our side, but I'm sure we can accommodate you, since you do not need to
make a request for every revision ever written, but only one per
article. Let me know if you need any assistance.
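In case a code example helps, a minimal sketch of pulling per-token data for
one article could look like this; please verify the endpoint path,
parameters, and response keys against the current documentation at
https://api.wikiwho.net/, as they may differ by API version:

# Hedged sketch: fetch per-token provenance for one article from the
# WikiWho API. The endpoint, parameters, and response keys here are
# illustrative -- check the current API docs before relying on them.
import requests

article = "Zurich"  # example article title
url = "https://api.wikiwho.net/en/api/v1.0.0-beta/all_content/%s/" % article
resp = requests.get(url, params={"o_rev_id": "true", "editor": "true",
                                 "token_id": "true", "in": "true",
                                 "out": "true"}, timeout=60)
resp.raise_for_status()
data = resp.json()
# Each token record should carry its originating revision plus the
# revisions where it was removed/reinserted -- the persistence signal.
tokens = data.get("all_tokens", [])
print(len(tokens), "tokens for", article)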
Best,
Fabian
-----Original Message-----
From: Wiki-research-l <wiki-research-l-bounces(a)lists.wikimedia.org> On
behalf of Chadha Tarun (ID SIS)
Sent: Wednesday, October 14, 2020 15:13
To: wiki-research-l(a)lists.wikimedia.org
Subject: [Wiki-research-l] Large number of queries to the Wikipedia api
Dear fellow wiki-researchers,
Greetings from Zurich!
Together with Jérôme (in cc), we are working on a research problem that
would definitely benefit from your expert insights.
Our goal: we want to predict a measure of the average "quality" of each
user's edits based on a set of covariates. To achieve this,
we need to compute a measure of edit quality, and then aggregate those
measures at the editor level. To do so, we envision relying on Aaron
Halfaker's "word persistence" method [1] by querying the Wikipedia API
[2].
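For concreteness, a simplified sketch of that computation (illustrative, not
Halfaker's exact implementation) could fetch consecutive revisions of an
article via the standard revisions query and count how many later revisions
each added word survives:

# Simplified word-persistence sketch: fetch consecutive revisions of one
# article via the MediaWiki API and count how many later revisions each
# added word survives.
import difflib
import requests

API = "https://en.wikipedia.org/w/api.php"
resp = requests.get(API, params={
    "action": "query", "format": "json", "prop": "revisions",
    "titles": "Zurich",  # example article
    "rvprop": "ids|content", "rvslots": "main", "rvlimit": 10,
}, headers={"User-Agent": "persistence-sketch/0.1 (research contact)"},
    timeout=60)
page = next(iter(resp.json()["query"]["pages"].values()))
texts = [r["slots"]["main"]["*"] for r in reversed(page["revisions"])]

word_sets = [set(t.split()) for t in texts]  # oldest revision first
for i in range(1, len(texts)):
    # Words inserted or substituted by revision i were added by it.
    before, after = texts[i - 1].split(), texts[i].split()
    matcher = difflib.SequenceMatcher(None, before, after)
    added = [w for tag, _, _, j1, j2 in matcher.get_opcodes()
             if tag in ("insert", "replace") for w in after[j1:j2]]
    # Crude persistence: in how many later revisions does each added
    # word still appear?
    score = sum(w in word_sets[k]
                for w in added for k in range(i + 1, len(texts)))
    print("rev %d: %d words added, persistence score %d"
          % (i, len(added), score))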
Our main issue: we are dealing with approximately 20 million edits in this
project. If we do all these queries serially (and assuming 4-5 seconds per
query), then we would need approximately 2.5 years to complete the job!
Our question for you: how do you guys typically handle such
computationally intensive data processing tasks?
One option to speed this up is to run several parallel processes to query
the server. Does anybody know whether there is a formal limit on the number
of connections a single IP can open to the API, and for how long? We also
worry that opening several hundred connections at the same time may
adversely affect the availability of the server for others...
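For illustration, a bounded, polite version of that parallel approach might
look like the sketch below. The maxlag parameter itself is documented API
behavior for backing off when the servers are lagged, but the worker count,
maxlag value, and retry policy here are our illustrative assumptions, not
documented limits:

# Sketch of polite parallelism: a few workers, a shared session, a
# descriptive User-Agent, and the maxlag parameter so requests back off
# when the servers are lagged.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API = "https://en.wikipedia.org/w/api.php"
session = requests.Session()
session.headers["User-Agent"] = "edit-quality-research/0.1 (contact email)"

def fetch_revisions(title, retries=3):
    params = {"action": "query", "format": "json", "prop": "revisions",
              "titles": title, "rvprop": "ids|content", "rvslots": "main",
              "rvlimit": 50, "maxlag": 5}
    for _ in range(retries):
        data = session.get(API, params=params, timeout=60).json()
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(5)  # server is lagged; wait and retry
            continue
        return data
    raise RuntimeError("gave up on %s after %d maxlag retries"
                       % (title, retries))

titles = ["Zurich", "Bern", "Geneva"]  # example batch
with ThreadPoolExecutor(max_workers=4) as pool:  # keep parallelism modest
    for data in pool.map(fetch_revisions, titles):
        pass  # hand each response to the persistence computation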
We thank you in advance for your help and insights!
Sincerely,
Tarun & Jérôme @ ETH Zurich
[1]
https://meta.wikimedia.org/wiki/Research:Content_persistence
[2]
https://en.wikipedia.org/w/api.php
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation