jayvdb added a subscriber: jayvdb.
jayvdb added a comment.
fwiw, Change 172023 made threadedhttp.Request more usable and more accessible: the new function http.fetch returns a threadedhttp.Request object, whereas the only previous function, http.request, returned unicode. (It also means we can replace threadedhttp/httplib2 with another library more easily.)
Regarding the initial focus of this task, thread-safe generators: it would be easy to add a tool (in pywikibot/tools.py) that wraps any generator with a semaphore. Threaded apps could then wrap just the outer generator, allowing multiple consumers of it, without every generator needing locking, which would slow down unthreaded apps. I agree this is low priority, as it is simple to use a managed 'worker' model like weblinkchecker.py does, where a single thread hands out tasks to worker threads.
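The wrapper suggested above could look something like this. This is a hypothetical sketch (the class name and its placement in pywikibot/tools.py are assumptions, not an existing pywikibot API); it serializes access to the underlying generator with a lock so several consumer threads can share one generator:

```python
import threading


class ThreadsafeGenerator:
    """Hypothetical wrapper that makes a generator safe to consume
    from multiple threads by guarding next() with a lock."""

    def __init__(self, gen):
        self._gen = iter(gen)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only one thread at a time may advance the underlying
        # generator; StopIteration propagates to each consumer.
        with self._lock:
            return next(self._gen)
```

Because only the outer generator is wrapped, unthreaded scripts pay no locking cost for the inner generators.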
! In T57889#599738, @Strainu wrote:
(In reply to comment #3)
The next layer, comms.threadedhttp, supports asynchronous requests. [...] I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.
I've noticed that while writing the answer to Gerard's questions today :)
For saving pages, which (I think) is the most relevant place for async requests, we already have support: requests whose reply does not have to be handled can be processed asynchronously - see Page.put_async.
I've experimented with put_async with mixed results. When the upload works, it's mostly OK; however, when one request hits an error (like a 504 from the server), it just keeps retrying again and again, keeping the thread blocked.
I would like to experiment with having an async thread pool available to avoid this being a deal breaker. Another approach is to move failed requests from the main async thread to a 'failed request' thread, which manages them differently and escalates repeated failures so that it eventually kills the job if the error rate is too high. My first excursion in this area is https://gerrit.wikimedia.org/r/#/c/176691/ , to explore whether there are bugs in the existing multiple-threads implementation.
Instead, the request should probably be de-queued and processed, and, if a callback has been registered, the callback should be called so that the bot can re-queue the request. This, however, could cause trouble if the order of the requests is important: the bot can receive a callback, but AFAIK it cannot remove already-queued requests. Also, what happens if no callback has been registered? Should we simply re-queue the request? I don't have a perfect solution at this time, but this is a point that should be considered.
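The callback idea above can be sketched as a queue-draining loop that hands failures back to the caller. Everything here (function name, parameters) is hypothetical, just to make the trade-off concrete; note how the no-callback fallback of blindly re-queueing risks retrying a permanently broken request forever, which is the open question raised above:

```python
import queue


def process_queue(q, handle, on_error=None):
    """Drain a request queue. On failure, invoke the registered
    callback (if any) so the bot can decide whether to re-queue.
    A None item is used as a stop sentinel. Illustrative sketch."""
    while True:
        req = q.get()
        if req is None:
            break
        try:
            handle(req)
        except Exception as exc:
            if on_error is not None:
                # The bot may inspect exc and call q.put(req) to retry,
                # drop the request, or abort the run.
                on_error(req, exc)
            else:
                # No callback registered: naive re-queue. This can
                # loop forever on a persistent error such as a 504.
                q.put(req)
```

With a callback, ordering is still a problem: by the time the callback re-queues a failed request, later requests may already have been processed.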
comms.http now allows for additional callbacks, which can be experimented with to develop failover/resending strategies, etc.
Another possible issue, which PWB can't really do much about, is that one can get a 504 even if the save was successful, making the re-queueing useless. I don't have a good solution for that either, but we could consult with the Wikimedia developers.
For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).
This should be especially useful if it can be controlled by the user. Do you have any ideas on how to do this?
It would be good if preloading could be set by command-line options, so operators can override a script's default settings for workloads where the script's default preloading settings are not ideal.
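The look-ahead idea quoted above (fetch the (i+1)-th item or batch while the i-th is still being consumed) can be sketched with a bounded queue fed by a background thread. This is a generic illustration, not pywikibot's PreloadingGenerator; the function name and buffer_size parameter are assumptions:

```python
import queue
import threading


def lookahead(gen, buffer_size=1):
    """Yield items from *gen* while a background thread fetches
    ahead, keeping up to *buffer_size* items prefetched so the
    consumer rarely waits on the producer. Illustrative sketch."""
    q = queue.Queue(maxsize=buffer_size)
    _END = object()

    def producer():
        for item in gen:
            q.put(item)  # blocks once buffer_size items are pending
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _END:
            return
        yield item
```

buffer_size is exactly the kind of knob that could be exposed as a command-line option, so operators can tune prefetch depth per workload.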
Also, Wikidata tasks are now regularly slowed down because they use at least two sites (the wikibase server and the client), regularly flipping between them. The same problem exists to a lesser extent with shared media host (Wikimedia Commons) + client site scripts.
TASK DETAIL https://phabricator.wikimedia.org/T57889
To: jayvdb Cc: pywikipedia-bugs, valhallasw, Strainu, jayvdb, GWicke
pywikipedia-bugs@lists.wikimedia.org