Hello!
I'm hoping to get advice on how we should approach the following challenge...
I am building a public website that will provide information that is automatically harvested from online news articles about the work of scientists. The goal is to make it easier to create and maintain scientific content on Wikipedia.
Here's some news about the project: https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer
And here is the prototype of the site: https://quicksilver.primer.ai
What I am working on now is a self-updating version of this site.
The goal is to provide daily refreshed information for scientists most likely to be missing from Wikipedia.
For now I am focusing on English-language news and English-language Wikipedia. Eventually this will expand to other languages.
The ~100 scientists shown on any given day are selected from ~100k scientists that the system is tracking for news updates.
So here's the challenge:
To choose the 100 scientists most in need of an update on Wikipedia, we need to query Wikipedia each day for the 100k scientists to see whether each one has an article yet and, if so, to get its content (to check whether we have new information).
I am getting throttled by the Wikipedia servers. 100k is a lot of queries.
What is the most polite, sanctioned method for programmatic access to Wikipedia for a daily job on this scale?
Many thanks for help/advice!
John Bohannon http://johnbohannon.org
Using the dumps https://meta.wikimedia.org/wiki/Data_dumps is the best way to go through that many pages daily.
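A minimal sketch of that approach in Python, using the all-titles-in-ns0 dump (scientists.txt is a stand-in for wherever the ~100k tracked names live; titles are normalized to MediaWiki's underscore format):

import gzip
import urllib.request

# Every main-namespace enwiki title, one per line, refreshed with each dump run.
DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz"

def to_db_title(name):
    # MediaWiki stores titles with underscores and an uppercase first letter.
    title = name.strip().replace(" ", "_")
    return title[:1].upper() + title[1:]

def missing_scientists(names):
    # Return the tracked names that have no main-namespace article yet.
    wanted = {to_db_title(n) for n in names}
    urllib.request.urlretrieve(DUMP_URL, "enwiki-titles.gz")
    existing = set()
    with gzip.open("enwiki-titles.gz", "rt", encoding="utf-8") as fh:
        for line in fh:
            title = line.rstrip("\n")
            if title in wanted:
                existing.add(title)
    return wanted - existing

if __name__ == "__main__":
    with open("scientists.txt", encoding="utf-8") as fh:
        names = [line.strip() for line in fh if line.strip()]
    for title in sorted(missing_scientists(names)):
        print(title)

This only answers the existence question; for article content you would still pull the pages-articles dump or fetch individual pages.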
Chico Venancio (+55 98) 9 8800 2743
Hi John,
While dumps are how you could start, you need daily updates, so dumps alone won't do. I would suggest a Lambda architecture https://en.wikipedia.org/wiki/Lambda_architecture using dumps and EventStreams https://wikitech.wikimedia.org/wiki/EventStreams. The streams are pushed out via a very simple protocol, Server-Sent Events, and there are examples there for Python and JS clients. We publish streams that deliver real-time updates for pages created, deleted, moved, etc.; you can find all of them here: https://stream.wikimedia.org/?doc. When you subscribe you can specify multiple streams https://wikitech.wikimedia.org/wiki/EventStreams#Stream_selection. In your case, if you watched the page-create, page-delete, and page-undelete streams, you could track the existence of the 100k scientists you're looking for in real time.
Right now there is no server-side filtering, so you'll have to consume the full stream and filter client-side. Let us know if you have other questions or if this proposal doesn't work for some reason.
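A minimal sketch of that consumption loop with the Python requests library (the field names database, page_title, page_namespace and meta.stream follow the published event schemas, but do check them against https://stream.wikimedia.org/?doc; the tracked set is a stand-in for the real ~100k titles):

import json
import requests

# page-create, page-delete and page-undelete can share one SSE connection.
STREAM_URL = ("https://stream.wikimedia.org/v2/stream/"
              "page-create,page-delete,page-undelete")

def watch(tracked_titles):
    # Yield (stream, title) for tracked main-namespace titles on enwiki.
    resp = requests.get(STREAM_URL, stream=True,
                        headers={"Accept": "text/event-stream"})
    for raw in resp.iter_lines(decode_unicode=True):
        if not raw or not raw.startswith("data:"):
            continue  # skip SSE ids, comments and keep-alives
        event = json.loads(raw[len("data:"):].lstrip())
        if event.get("database") != "enwiki" or event.get("page_namespace") != 0:
            continue  # no server-side filtering yet, so filter client-side
        title = event.get("page_title", "").replace(" ", "_")
        if title in tracked_titles:
            yield event["meta"]["stream"], title

if __name__ == "__main__":
    tracked = {"Marie_Curie"}  # replace with the real tracked-scientist set
    for stream, title in watch(tracked):
        print(stream, title)

In the Lambda setup, the daily dump gives you the full baseline and this stream keeps the existence table current between dumps.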
John,
Assuming you find non-existent pages by title, you can query up to 500 titles in a single request [0], which means you would need only 200 requests to check 100k titles. Do those requests serially (not in parallel) and I doubt you will hit any rate limit, while still being conscious of server load. You were probably hitting multiple 404s in parallel, which is not ideal.
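A sketch of that batched lookup with Python requests (note the titles limit is 50 per request for normal clients and 500 with apihighlimits, e.g. a bot account, per [0]; the User-Agent value is a placeholder):

import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "quicksilver-checker/0.1 (contact: you@example.org)"}
BATCH = 50  # raise to 500 if the account has apihighlimits

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def check_titles(titles):
    # Return (existing, missing) sets of titles, following redirects.
    session = requests.Session()
    session.headers.update(HEADERS)
    existing, missing = set(), set()
    for batch in chunks(sorted(titles), BATCH):
        resp = session.get(API, params={
            "action": "query",
            "format": "json",
            "formatversion": 2,
            "redirects": 1,
            "titles": "|".join(batch),
        })
        resp.raise_for_status()
        for page in resp.json()["query"]["pages"]:
            (missing if page.get("missing") else existing).add(page["title"])
    return existing, missing

Running the 2,000 (or 200) batches serially keeps the load polite, and the same action=query request can also carry prop=revisions to pull content for the pages that do exist.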
As an alternative, creating a tool [1] inside our infrastructure that generates daily dumps of all page titles is also a possibility that people might find interesting. After all, it would be a single, if slow, SQL query per day. Here is a query [2] that would work (note that you would get the titles encoded in MediaWiki format).
[0] https://www.mediawiki.org/wiki/API:Query#Specifying_pages
[1] https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas
[2] SELECT page_title FROM enwiki_p.page WHERE page_namespace = 0
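A sketch of running [2] as a daily job from a Toolforge tool with pymysql (the replica host name below is an assumption, check [1] for the current one; ~/replica.my.cnf is the credentials file Toolforge provisions for each tool account):

import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # verify against [1]
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
    cursorclass=pymysql.cursors.SSCursor,  # stream rows instead of buffering millions of titles
)
with conn.cursor() as cur, open("enwiki-titles.txt", "w", encoding="utf-8") as out:
    cur.execute("SELECT page_title FROM page WHERE page_namespace = 0")
    for (title,) in cur:
        # page_title comes back as bytes from MediaWiki's binary schema
        out.write(title.decode("utf-8") + "\n")
conn.close()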