Re: [Analytics] Wikipedia throttling

27 Feb 2019

Hi John,

While dumps are a good place to start, you need daily updates, so dumps
alone won't do.  I would suggest a Lambda architecture
<https://en.wikipedia.org/wiki/Lambda_architecture> using dumps and
EventStreams <https://wikitech.wikimedia.org/wiki/EventStreams>.  The
streams are pushed out via a very simple protocol, Server-sent events, and
there are examples there for Python and JavaScript clients.  We publish
streams that deliver real-time updates for pages created, deleted, moved,
etc.  You can find all streams here <https://stream.wikimedia.org/?doc>.  When you
subscribe you can specify multiple streams
<https://wikitech.wikimedia.org/wiki/EventStreams#Stream_selection>.  In
your case, if you watched the pages created, deleted, and undeleted, you
could determine the existence of the 100k scientists you're looking for on
a real-time basis.
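
As a rough sketch of the idea above (not an official client): start from a set
of existing titles built from a dump, then apply each create, delete, or
undelete event to it.  The field names `page_title` and `meta.stream`, the
stream name constants, and the helper names `normalize` and `apply_event` are
assumptions for illustration; check them against the stream schemas before
relying on them.

```python
from typing import Set

# Assumed values of meta.stream for the three streams mentioned above.
CREATE = "mediawiki.page-create"
DELETE = "mediawiki.page-delete"
UNDELETE = "mediawiki.page-undelete"


def normalize(title: str) -> str:
    """Use underscores instead of spaces so titles compare consistently."""
    return title.strip().replace(" ", "_")


def apply_event(existing: Set[str], tracked: Set[str], event: dict) -> bool:
    """Update `existing` in place from one page event.

    `existing` is the set of tracked titles known to have an article
    (seeded from a dump); `tracked` is the full set of titles of interest.
    Returns True if the event changed `existing`.
    """
    title = normalize(event.get("page_title", ""))
    if title not in tracked:
        return False  # not one of the pages we care about
    stream = event.get("meta", {}).get("stream")
    if stream in (CREATE, UNDELETE) and title not in existing:
        existing.add(title)
        return True
    if stream == DELETE and title in existing:
        existing.remove(title)
        return True
    return False
```

Titles in `tracked` and `existing` would need to be stored in the same
normalized form.  Page moves are ignored here and would need their own
handling.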

Right now there is no server-side filtering so you'll have to consume the
full stream and filter client side.  Let us know if you have other
questions or if this proposal doesn't work for some reason.
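
To illustrate client-side filtering, here is a minimal sketch that parses
Server-sent events from an iterable of text lines and keeps only events for
tracked titles on one wiki.  The helper names `iter_sse_json` and
`is_relevant`, and the event fields `meta.domain` and `page_title`, are
assumptions for illustration; the examples on the EventStreams page are the
better reference for a real client.

```python
import json
from typing import Iterable, Iterator, Set

# Several streams can be requested at once, separated by commas.
STREAM_URL = (
    "https://stream.wikimedia.org/v2/stream/"
    "page-create,page-delete,page-undelete"
)


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each Server-sent event.

    An event is a run of "data:" lines terminated by a blank line.
    Other fields (event:, id:) and comment lines (":") are ignored.
    """
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The protocol strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data:
            payload, data = "\n".join(data), []
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip payloads that are not valid JSON
            yield event


def is_relevant(event: dict, tracked: Set[str],
                domain: str = "en.wikipedia.org") -> bool:
    """Keep only events for tracked titles on the given wiki."""
    return (
        event.get("meta", {}).get("domain") == domain
        and event.get("page_title") in tracked
    )
```

In practice `lines` would come from a long-lived HTTP response for
`STREAM_URL`, decoded line by line, with a descriptive User-Agent header set
on the request.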

On Wed, Feb 27, 2019 at 9:06 AM Chico Venancio <chicocvenancio(a)gmail.com>
wrote:

...
 Using the dumps https://meta.wikimedia.org/wiki/Data_dumps is the best
 way to go through that many pages daily.

 Chico Venancio
 (+55 98) 9 8800 2743

 On Wed, Feb 27, 2019 at 11:01, John Bohannon <john.bohannon(a)gmail.com>
 wrote:

 Hello!

 I'm hoping to get advice on how we should approach the following
 *challenge*...

 I am building a public website that will provide information that is
 automatically harvested from online news articles about the work of
 scientists. The goal is to make it easier to create and maintain scientific
 content on Wikipedia.

 Here's some news about the project:
 https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer

 And here is the prototype of the site:  https://quicksilver.primer.ai

 What I am working on now is a self-updating version of this site.

 The goal is to provide daily refreshed information for scientists most
 likely to be missing from Wikipedia.

 For now I am focusing on English-language news and English-language
 Wikipedia. Eventually this will expand to other languages.

 The ~100 scientists shown on any given day are selected from ~100k
 scientists that the system is tracking for news updates.

 So here's the *challenge*:

 To choose the 100 scientists most in need of an update on Wikipedia, we
 need to query Wikipedia each day for the 100k scientists to see if they
 have an article yet, and if so to get its content (to check if we have new
 information).

 I am getting throttled by the Wikipedia servers. 100k is a lot of queries.

 What is the most polite, sanctioned method for programmatic access to
 Wikipedia for a daily job on this scale?

 Many thanks for help/advice!

 John Bohannon
 http://johnbohannon.org
 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics
