Dear fellow wiki-researchers,
Greetings from Zurich!
Jérôme (in cc) and I are working on a research problem that would greatly benefit from your expert insights.
Our goal: we want to predict a measure of the average "quality" of edits at the user level from a set of covariates. To achieve this, we need to compute a quality measure for each edit and then aggregate these measures at the editor level. We envision relying on Aaron Halfaker's "word persistence" method [1], computed by querying the Wikipedia API [2].
Our main issue: this project involves approximately 20 million edits. If we ran all these queries serially, at 4-5 seconds per query, completing the job would take roughly 2.5-3 years!
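For the curious, the back-of-the-envelope arithmetic behind that figure (assuming the midpoint of our observed 4-5 s per query):

```python
# Serial-runtime estimate: 20 million edits at ~4.5 s per API round trip.
EDITS = 20_000_000
SECONDS_PER_QUERY = 4.5           # midpoint of the observed 4-5 s range
SECONDS_PER_YEAR = 365 * 24 * 3600

years = EDITS * SECONDS_PER_QUERY / SECONDS_PER_YEAR
print(f"{years:.1f} years")       # prints "2.9 years"
```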
Our question for you: how do you typically handle such computationally intensive data-processing tasks?
One option to speed this up is to run several parallel processes that query the server. Does anybody know whether there is a formal limit on the number of concurrent connections a single IP may open to the API, and for how long? We also worry that opening several hundred connections at once could degrade the server's availability for other users...
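To make the question concrete, here is a rough sketch of the scheme we have in mind (our own assumption, not an official recommendation): a small, fixed pool of worker threads sharing one rate limiter, so the total request rate stays bounded regardless of how many edits are queued. The `fetch` callable is injected, so it could be any client for /w/api.php; the pool size and rate below are placeholder values.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Allow at most `rate` calls per second across all threads."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_time = time.monotonic()

    def wait(self):
        with self.lock:
            now = time.monotonic()
            if now < self.next_time:
                time.sleep(self.next_time - now)
            self.next_time = max(now, self.next_time) + self.interval

def process_edits(rev_ids, fetch, workers=8, rate=10.0):
    """Fetch every revision with `workers` threads, <= `rate` requests/s in total."""
    limiter = RateLimiter(rate)

    def task(rev_id):
        limiter.wait()
        return fetch(rev_id)   # e.g. an API request for this revision

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, rev_ids))
```

With, say, 8 workers at 10 requests/s overall, the 20 million queries would take on the order of three weeks rather than years; whether that total rate is acceptable to the API operators is exactly what we would like to know.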
We thank you in advance for your help and insights!

Sincerely,
Tarun & Jérôme @ ETH Zurich
[1] https://meta.wikimedia.org/wiki/Research:Content_persistence
[2] https://en.wikipedia.org/w/api.php