John,
Assuming you find non-existing pages by title, you can query up to 500
titles in a single request (50 without the higher API limits granted to
bots) [0], which means you would need only 200 requests to get 100K
titles (2,000 at the lower limit). Do those requests serially (not in
parallel) and I doubt you will hit any rate limit, while still being
considerate of the servers. You were probably hitting multiple 404s in
parallel, which is not ideal.
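To make the batching concrete, here is a rough Python sketch of the serial approach. It assumes the English Wikipedia endpoint, `formatversion=2` JSON responses, and a batch size of 50 (raise it only if your account has the higher limit). The function names, the one-second pause, and the User-Agent string are placeholders of mine, not anything prescribed by the API.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # assumption: regular client; 500 needs higher API limits


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_query_url(titles):
    """Build one action=query URL that asks about several titles at once."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "titles": "|".join(titles),
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def missing_titles(response):
    """Return the titles the API flagged as missing in a parsed response."""
    pages = response.get("query", {}).get("pages", [])
    return [page["title"] for page in pages if page.get("missing")]


def find_missing(titles, pause_seconds=1.0):
    """Check all titles serially, one batch per request, pausing in between."""
    missing = []
    for batch in chunked(titles, BATCH_SIZE):
        request = urllib.request.Request(
            build_query_url(batch),
            # Hypothetical User-Agent; replace with your own contact details.
            headers={"User-Agent": "example-title-checker/0.1 (you@example.org)"},
        )
        with urllib.request.urlopen(request) as reply:
            missing.extend(missing_titles(json.load(reply)))
        time.sleep(pause_seconds)  # stay serial and unhurried
    return missing
```

Note that the API reports titles in their normalized form, so the returned strings may differ slightly from what you sent (for example in the capitalization of the first letter). For larger batches a POST request would avoid very long URLs.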
As an alternative, we could create a tool [1] inside our infrastructure
that generates daily dumps of all page titles, which other people might
find interesting too. After all, it would be a single, if slow, SQL
query per day. Here is a query that would work (note that you would get
the titles encoded in MediaWiki format): [2]
[0] <https://www.mediawiki.org/wiki/API:Query#Specifying_pages>
[1] <https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas>
[2] SELECT page_title FROM enwiki_p.page WHERE page_namespace = 0
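If you go the dump route, the titles in `page_title` use underscores instead of spaces, and on English Wikipedia the first letter is capitalized. Here is a small Python sketch of how names could be matched against such a list. The helper names are made up for illustration, and the normalization is deliberately simplified (real MediaWiki title normalization covers more cases).

```python
def to_db_key(title):
    """Approximate the stored form: trimmed, first letter upper, underscores."""
    title = " ".join(title.split())  # collapse runs of whitespace
    if not title:
        return title
    return (title[0].upper() + title[1:]).replace(" ", "_")


def from_db_key(db_key):
    """Turn a stored title back into its display form."""
    return db_key.replace("_", " ")


def missing_from_dump(names, dump_titles):
    """Return the names that have no matching title in the dump."""
    existing = set(dump_titles)
    return [name for name in names if to_db_key(name) not in existing]
```

With the whole title list loaded into a set, checking 100K names is a local lookup and needs no API requests at all.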
On Wed, Feb 27, 2019 at 3:01 PM John Bohannon <john.bohannon(a)gmail.com> wrote:
Hello!
I'm hoping to get advice on how we should approach the following challenge...
I am building a public website that will provide information that is automatically
harvested from online news articles about the work of scientists. The goal is to make it
easier to create and maintain scientific content on Wikipedia.
Here's some news about the project:
https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer
And here is the prototype of the site:
https://quicksilver.primer.ai
What I am working on now is a self-updating version of this site.
The goal is to provide daily refreshed information for scientists most likely to be
missing from Wikipedia.
For now I am focusing on English-language news and English-language Wikipedia. Eventually
this will expand to other languages.
The ~100 scientists shown on any given day are selected from ~100k scientists that the
system is tracking for news updates.
So here's the challenge:
To choose the 100 scientists most in need of an update on Wikipedia, we need to query
Wikipedia each day for the 100k scientists to see if they have an article yet, and if so
to get its content (to check if we have new information).
I am getting throttled by the Wikipedia servers. 100k is a lot of queries.
What is the most polite, sanctioned method for programmatic access to Wikipedia for a
daily job on this scale?
Many thanks for help/advice!
John Bohannon
http://johnbohannon.org
--
Jaime Crespo
<http://wikimedia.org>