I need to issue a bunch of Page.get() requests in parallel. My understanding is that pywikibot uses the requests library, which is incompatible with asyncio, so that's out. So what do people use? Threading (https://docs.python.org/3.9/library/threading.html)? Or, I see there's an asyncio-friendly requests port (https://github.com/rdbhost/yieldfromRequests). Is there a way to make pywikibot use that?
I've not checked in the v3+ version, but there used to be a preload page generator that batch-retrieved pages from the API. Then pass the preloaded page objects on to the parallel-processing part.
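In current versions that is pywikibot.pagegenerators.PreloadingGenerator. A minimal sketch of the pattern (the titles and the groupsize value are illustrative, not from this thread):

    import pywikibot
    from pywikibot import pagegenerators

    site = pywikibot.Site('en', 'wikipedia')
    titles = ['Python (programming language)', 'MediaWiki']  # illustrative
    pages = (pywikibot.Page(site, title) for title in titles)

    # Content is fetched in batches: one API request per batch of pages
    # rather than one request per Page.get().
    for page in pagegenerators.PreloadingGenerator(pages, groupsize=50):
        print(page.title(), len(page.text))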
Thanks for the replies. I've got something basically working with concurrent.futures, so I guess I'll go with that.
I see there's a max_queue_size setting available in user_config.py. I assume that is related to one or another of these examples?
On Mar 27, 2023, at 5:24 PM, info@gno.de wrote:
Pywikibot uses requests for its I/O methods. I have had a look at several similar libraries with asyncio support, but none of them were long-term supported.
It is good advice to ensure that pages are preloaded. Anyway, Pywikibot does not use asyncio (yet), but it uses threads to save pages asynchronously. The common BaseBot.treat() or BaseBot.treat_page() cannot be used asynchronously because they are not thread-safe.
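For example, Page.save() takes an asynchronous flag that hands the edit to Pywikibot's internal put queue and returns immediately. A minimal sketch (the target page and edit summary are made up for illustration):

    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')
    page = pywikibot.Page(site, 'Wikipedia:Sandbox')  # illustrative target
    page.text += '\ntest'
    # Returns at once; a background thread performs the actual API call.
    page.save(summary='demo edit', asynchronous=True)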
You can find concurrent programming examples within the framework. weblinkchecker, for example, uses threads to retrieve web pages in parallel. archivebot is able to process all pages from a generator in parallel using concurrent.futures. Other examples with concurrent.futures can be found in the login, preload_sites, fixing_redirects, and watchlist scripts.
I hope that helps a bit.
Best xqt
Mostly for the sake of the archives, I ended up with this (https://github.com/roysmith/dyk-tools/blob/60a0b1d5c6c5f7310a541f30388b898b4a906b10/dyk_tools/web/core.py#L69), which is pretty much straight out of the example in the Python library manual:
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def _is_approved(nom):
        return nom.is_approved()

    # nom_list comes from the surrounding application code.
    approved_noms = []
    unapproved_noms = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Map each future back to the nomination it was submitted for.
        future_to_nom = {
            executor.submit(_is_approved, nom): nom
            for nom in nom_list.nominations()
        }
        for future in as_completed(future_to_nom):
            nom = future_to_nom[future]
            if future.result():
                approved_noms.append(nom)
            else:
                unapproved_noms.append(nom)
One of the nagging questions in my mind as I was exploring this was whether APISite is thread-safe. I haven't found anything in the pywikibot docs which says one way or the other, but apparently it is.
I don't have a good feel for what max_workers should be. For what I'm doing, 10 seems to work well, taking about 2-3 seconds to process 70 nominations. The largest number of nominations I would ever expect to see is around 200. According to the library docs, max_workers defaults to min(32, os.cpu_count() + 4) on Python 3.8+. I don't actually have a clue what that works out to on a Toolforge k8s instance, nor do I have any idea how the production enwiki would like it if I threw 100 parallel API requests at it all at once. So for now, I'll just hardwire it to 10.
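For reference, the default can be computed directly (a quick illustrative snippet, not part of my tool):

    import os

    # ThreadPoolExecutor's default max_workers since Python 3.8.
    print(min(32, (os.cpu_count() or 1) + 4))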
Hi,
On 3/27/23 15:57, Roy Smith wrote:
I need to issue a bunch of Page.get() requests in parallel.
Please don't. From https://www.mediawiki.org/wiki/API:Etiquette#Request_limit:
"Making your requests in series rather than in parallel, by waiting for one request to finish before sending a new request, should result in a safe request rate."
Instead of making parallel requests, you should make batched requests, which is how the preloading stuff Xqt mentioned works.
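Concretely, batching means one API request serves many pages, with the requests issued in series. A minimal sketch using APISite.preloadpages(), which (as far as I know) is what PreloadingGenerator uses under the hood; the titles and groupsize value are illustrative:

    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')
    pages = [pywikibot.Page(site, t) for t in ('Earth', 'Moon', 'Sun')]

    # One batched API request per group of pages, sent one after another.
    for page in site.preloadpages(pages, groupsize=50):
        print(page.title(), len(page.text))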
-- Kunal / Legoktm