https://bugzilla.wikimedia.org/show_bug.cgi?id=55889
Web browser: ---
Bug ID: 55889
Summary: Thread-safe versions of the Generators
Product: Pywikibot
Version: core (2.0)
Hardware: All
OS: All
Status: NEW
Severity: enhancement
Priority: Unprioritized
Component: General
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: crangasi2001@yahoo.com
Classification: Unclassified
Mobile Platform: ---
Currently, the generator functions use yield, which is not thread-safe. PWB should offer thread-safe versions, using one of the many interesting suggestions from http://www.dabeaz.com/generators/Generators.pdf (or any other method :P)
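One of the simplest recipes from those slides is to wrap a generator so that every next() call is serialized behind a lock. A minimal sketch of that idea (the class name is made up, this is not an existing PWB API):

    import threading

    class ThreadSafeIterator(object):
        """Serialize next() calls on a wrapped generator/iterator."""

        def __init__(self, iterable):
            self._it = iter(iterable)
            self._lock = threading.Lock()

        def __iter__(self):
            return self

        def __next__(self):  # Python 3
            with self._lock:
                return next(self._it)

        next = __next__  # Python 2 compatibility

Several worker threads could then safely share e.g. ThreadSafeIterator(site.allpages()).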
Merlijn van Deen valhallasw@arctus.nl changed:
CC: added valhallasw@arctus.nl
--- Comment #1 from Merlijn van Deen valhallasw@arctus.nl --- What is the goal you want to achieve by this? Remember that threads in Python are useless for computations, due to the GIL.
--- Comment #2 from Strainu crangasi2001@yahoo.com --- As I understand it, I/O happens outside of the GIL. Since API requests are the most time-consuming part of many of my bots (more precisely, the connection to the servers), being able to issue requests from several threads should improve performance somewhat (as long as the throttling is not too aggressive).
I've noticed that the preloading limit is now only 50 pages, which makes this problem even more pressing when working with many small pages. Threading would probably also help for things like image upload/download.
If it helps, we could run some tests to see whether performance improves for a simple file downloader.
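For example, a rough benchmark along these lines (standard library only, the URLs are placeholders): since each download spends most of its time waiting on the server, and the GIL is released during that wait, running the same fetches from a small thread pool should cut the wall-clock time noticeably.

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    # Placeholder URLs; replace with real file URLs for an actual test.
    URLS = ['https://upload.wikimedia.org/example-%d.jpg' % i for i in range(20)]

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return len(resp.read())

    start = time.time()
    sequential = [fetch(u) for u in URLS]                # sequential baseline
    print('sequential: %.1fs' % (time.time() - start))

    start = time.time()
    with ThreadPoolExecutor(max_workers=5) as pool:      # threaded run
        threaded = list(pool.map(fetch, URLS))
    print('threaded:   %.1fs' % (time.time() - start))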
--- Comment #3 from Merlijn van Deen valhallasw@arctus.nl --- I see. There are already some features in place, but we may not be using asynchronous requests at all the points where they could be useful.
First of all, connections should be re-used - this is already a feature in the httplib2 library.
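To illustrate: httplib2 keeps per-host connections alive inside a single Http object, so the saving comes from routing all requests through the same instance (the URL is just an example):

    import httplib2

    h = httplib2.Http()  # one instance -> connections are kept alive per host
    for title in ('Foo', 'Bar', 'Baz'):
        resp, content = h.request(
            'https://en.wikipedia.org/w/api.php?action=query'
            '&titles=%s&format=json' % title)
        print(title, resp.status, len(content))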
The next layer, comms.threadedhttp, supports asynchronous requests ('futures' would be a closer term: basically, you create a request and then wait for a lock to be released). However, I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.
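The pattern is roughly the following (a generic sketch, not the actual threadedhttp code, and the names are invented): the caller enqueues a request object and goes on with other work; a worker thread performs the HTTP call; the caller only blocks when it finally asks for the result.

    import queue
    import threading
    import urllib.request

    class HttpFuture(object):
        def __init__(self, url):
            self.url = url
            self.response = None
            self.done = threading.Event()

        def wait(self):
            self.done.wait()  # block until the worker has filled in .response
            return self.response

    requests_q = queue.Queue()

    def worker():
        while True:
            fut = requests_q.get()
            with urllib.request.urlopen(fut.url) as resp:
                fut.response = resp.read()
            fut.done.set()

    threading.Thread(target=worker, daemon=True).start()

    fut = HttpFuture('https://en.wikipedia.org/w/api.php'
                     '?action=query&meta=siteinfo&format=json')
    requests_q.put(fut)  # the request starts in the background
    # ... other work can happen here ...
    body = fut.wait()    # block only when the result is actually needed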
For saving pages, which (I think) is the most relevant place for async requests, we already have support: requests whose reply does not need further handling can be executed asynchronously; see Page.put_async.
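For reference, a usage sketch (the wiki and page title are hypothetical; the callback signature, page plus exception-or-None, is what I believe the put queue uses, but double-check it):

    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')
    page = pywikibot.Page(site, 'User:ExampleBot/Sandbox')

    def report(saved_page, err):
        # Called from the background put queue once the save has been
        # processed; err is None on success, otherwise the exception raised.
        if err is None:
            print('saved', saved_page.title())
        else:
            print('save of %s failed: %s' % (saved_page.title(), err))

    # The save is queued and performed by a background thread, so the
    # main thread can immediately continue with the next page.
    page.put_async('new wikitext', 'demo edit', callback=report)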
For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).
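A possible shape for that (only a sketch, the helper name is made up): a background thread pulls items or batches from the wrapped generator into a small bounded queue, so the next one is already being fetched while the caller is still working on the current one.

    import queue
    import threading

    _END = object()

    def prefetching(generator, buffer_size=1):
        """Yield items from `generator`, fetching ahead in a background thread."""
        buf = queue.Queue(maxsize=buffer_size)

        def producer():
            for item in generator:
                buf.put(item)
            buf.put(_END)

        threading.Thread(target=producer, daemon=True).start()
        while True:
            item = buf.get()
            if item is _END:
                return
            yield item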
--- Comment #4 from Strainu crangasi2001@yahoo.com --- (In reply to comment #3)
> The next layer, comms.threadedhttp, supports asynchronous requests. [...] I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.
I've noticed that while writing the answer to Gerard's questions today :)
> For saving pages, which (I think) is the most relevant place for async requests, we already have support: requests whose reply does not need further handling can be executed asynchronously; see Page.put_async.
I've experimented with put_async, with mixed results. When the upload works, it's mostly OK; however, when one request hits an error (like a 504 from the server), it just keeps retrying again and again, keeping the thread blocked.
Instead, the request should probably be de-queued and processed, and, if a callback has been registered, the callback should be called so that the bot can re-queue the request itself. This, however, could cause trouble if the order of the requests is important: the bot can receive a callback, but AFAIK it cannot remove already-queued requests. Also, what happens if no callback has been registered? Should we simply re-queue the request? I don't have a perfect solution at this time, but it is a point that should be considered.
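Something along these lines, for instance (entirely hypothetical; page is a pywikibot.Page, and it ignores the ordering problem mentioned above):

    def requeue_on_error(saved_page, err):
        # Let the bot decide what to do with a failed save instead of the
        # queue blocking on endless retries.
        if err is not None:
            pywikibot.output('save of %s failed (%s), queueing one retry'
                             % (saved_page.title(), err))
            saved_page.put_async(saved_page.text, 'retrying earlier save')

    page.put_async('new wikitext', 'bot edit', callback=requeue_on_error)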
Another possible issue, which PWB can't really do much about, is that one can get a 504 even when the save actually succeeded, making the re-queueing pointless. I don't have a good solution for that either, but we could consult with the Wikimedia developers.
> For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).
This should be especially useful if it can be controlled by the user. Do you have any ideas on how to do this?
I think there were some good ideas brought up on this bug. Should we start a thread on the mailing list so we can gather more input on this?
Merlijn van Deen valhallasw@arctus.nl changed:
Priority: Unprioritized -> Low
Summary: Thread-safe versions of the Generators -> Improve support for asynchronous requests (saving/preloading pages)
Ricordisamoa ricordisamoa@live.it changed:
Depends on: added 55220
--- Comment #5 from Gerrit Notification Bot gerritadmin@wikimedia.org --- Change 172023 had a related patch set uploaded by John Vandenberg: Asynchronous HTTP requests
https://gerrit.wikimedia.org/r/172023
Gerrit Notification Bot gerritadmin@wikimedia.org changed:
Status: NEW -> PATCH_TO_REVIEW