On Fri, May 29, 2015 at 9:36 PM, Brad Jorsch (Anomie) <bjorsch@wikimedia.org> wrote:
> On Thu, May 28, 2015 at 2:39 PM, John Mark Vandenberg <jayvdb@gmail.com> wrote:
>> T96942 (https://phabricator.wikimedia.org/T96942) was reported by pywikibot devs almost as soon as we detected that the test wikis were failing in our travis-ci tests.
> At 20:22 in the timezone of the main API developer (me).
>> It was 12 hours before a MediaWiki API fix was submitted to Gerrit,
> 09:31, basically first thing in the morning for the main API developer. There's really not much to complain about there.
In the proposed deploy sequence, 12 hours is a serious amount of time.
So if a shorter deploy process is implemented, we need to find ways to get bug reports to you sooner, and ensure you are not the only one who can notice and fix bugs related to API breakages, etc.
>> and it took four additional *days* to get merged.
> That part does suck.
On a positive note, the API breakage this week was rectified much more quickly, and pywikibot test builds are green again:
https://phabricator.wikimedia.org/T100775
https://travis-ci.org/wikimedia/pywikibot-core/builds/64631025
>> This also doesn't give clients sufficient time to work around MediaWiki's wonderful intentional API breakages, e.g. rawcontinue, which completely broke pywikibot and needed a large chunk of code rewritten urgently, both for pywikibot core and for the much older and harder-to-fix pywikibot compat, which is still used in processes that wiki communities rely on.
> The continuation change hasn't actually broken anything yet.
Hmm. You still don't appreciate that you actually really truly fair-dinkum broke pywikibot...? Warnings are part of the API, and adding or changing them can break clients; warnings are one of the more brittle parts of the API.
The impact on pywikibot core users wasn't so apparent, as the pywikibot core devs fixed the problem when it hit the test servers, and the fix was merged before it hit the production servers. Not all users had pulled the latest pywikibot core code, and they informed us their bots were broken, but as far as I am aware we didn't get any pywikibot core bug reports submitted, because (often after mentioning problems on IRC) their problems disappeared after they ran 'git pull'.
However pywikipedia / compat isn't actively maintained, and it broke badly in production, with some scripts being broken for over a month:
https://phabricator.wikimedia.org/T74667 https://phabricator.wikimedia.org/T74749
pywikibot core is gradually improving its understanding of the API warning system, but it isn't well supported yet; as a result, pywikibot generally reports warnings to the user. IIRC the API warning system can send multiple distinct warnings as a single string, with each warning separated by only a newline, which is especially nasty for user agents to 'understand' (though this may only be in older versions of the API; I'm not sure).
So adding a new warning to the API can result in the same warning appearing many, many times on the user's console and in logs, and thousands of warnings on the screen send users into panic mode.
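To illustrate (a minimal sketch, not pywikibot's actual code; the helper name and the dedup cache are mine), a client that wants to avoid that flood has to split each module's warning blob on newlines and suppress the repeats itself:

    _seen_warnings = set()

    def report_api_warnings(result):
        # Each module's warnings arrive as one newline-joined string,
        # under the '*' key in the older JSON response format.
        for module, data in result.get('warnings', {}).items():
            for line in data.get('*', '').splitlines():
                line = line.strip()
                if line and (module, line) not in _seen_warnings:
                    _seen_warnings.add((module, line))
                    print('API warning (%s): %s' % (module, line))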
I strongly recommend fixing the warning system before using it again as aggressively as was done for rawcontinue. e.g. it would be nice if the API emitted codes for each warning scenario (without reusing codes for similar scenarios), so that we don't need to do string matching to detect & discard expected warnings, and so that you can i18n the messages without breaking clients. (I think there is already a phab task for this.)
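Purely as a sketch of what I mean (the 'code' field and its values below are invented, not an existing API format): with stable codes a client could whitelist the warnings it already handles and surface only the rest, with no string matching at all:

    EXPECTED_WARNING_CODES = {'deprecated-continue'}  # codes we already handle

    def unexpected_warnings(result):
        return [w for w in result.get('warnings', [])
                if w.get('code') not in EXPECTED_WARNING_CODES]

    # The imagined response shape:
    sample = {'warnings': [
        {'code': 'deprecated-continue',
         'text': 'Formatting of continuation data will be changing soon.'}]}
    assert unexpected_warnings(sample) == []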
I also strongly recommend that Wikimedia gets heavily involved in decommissioning pywikibot compat bots on Wikimedia servers, and any other very old unmaintained clients, so that the API can be aggressively updated without breaking the many bots still using compat. pywikibot devs did some initial work with WMF staff at Lyon on this front, and we need to keep that moving ahead.
> Unless pywikibot was treating warnings as errors and that's what broke it?
Yes, some of the code was raising an exception when it detected an API warning.
However, another part of the breakage was that the JSON structure of the new rawcontinue warnings was not what was expected. Some pywikipedia / compat code assumed that the presence of warnings implied there would be a warning related to the module used, e.g. result['warnings']['allpages'], but that didn't exist because this warning was in result['warnings']['query'].
https://gerrit.wikimedia.org/r/#/c/176910/4/wikipedia.py,cm https://gerrit.wikimedia.org/r/#/c/170075/2/wikipedia.py,cm
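In sketch form (a hypothetical helper, simplified from what those patches actually do), the defensive lookup compat needed looks like this:

    def get_module_warning(result, module):
        # The warning may sit under the submodule key (e.g. 'allpages')
        # or under the top-level 'query' bucket, so check both.
        warnings = result.get('warnings', {})
        for key in (module, 'query'):
            if key in warnings:
                return warnings[key].get('*')
        return None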
> It's coming soon though. Nor should a "large chunk of code" *need* rewriting, just add one parameter to your action=query requests.
I presume you mean we could have just added rawcontinue='' . Our very limited testing at the time (we didn't have much time to fix the bugs before they would hit production), and subsequent testing, indicates that the rawcontinue parameter can be used even with very old versions of MediaWiki; it appears to be silently ignored by earlier versions. If that is true, it would have been good to say very explicitly in the mediawiki-api-announce message that adding rawcontinue='' is backwards compatible, and down to which version it works.
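For anyone following along, the one-parameter fix is roughly this (a sketch using the requests library; the wrapper is mine, and the claim that old servers ignore the unknown parameter is our observation, not documented behaviour):

    import requests

    def api_query(api_url, **params):
        params.setdefault('action', 'query')
        params.setdefault('format', 'json')
        # Keep the pre-'continue' continuation format and silence the
        # warning. Older MediaWiki versions appear to ignore this silently.
        params.setdefault('rawcontinue', '')
        return requests.get(api_url, params=params).json()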
To be safe, we implemented version detection so that rawcontinue is only used on the MediaWiki versions that require it. That was not trivial, as detecting the version means requesting action=query&meta=siteinfo , and that request would itself cause this warning if rawcontinue wasn't used. Do you notice the loop there? In order to avoid using a feature on the wrong version, we need to use the feature. In case you are interested, the merged pywikibot core change was https://gerrit.wikimedia.org/r/#/c/168529/ (there were two other attempts at solving it which were abandoned).
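Stripped to its essentials (a simplified sketch, not the merged change linked above), the gate looks something like the following; note that the probe itself must send rawcontinue to avoid the very warning it is trying to decide about:

    import re
    import requests

    def needs_rawcontinue(api_url):
        data = requests.get(api_url, params={
            'action': 'query', 'meta': 'siteinfo', 'format': 'json',
            'rawcontinue': '',  # the loop: probing the version needs the feature
        }).json()
        # e.g. 'MediaWiki 1.25wmf5'
        generator = data['query']['general']['generator']
        match = re.search(r'MediaWiki 1\.(\d+)', generator)
        # Threshold hedged: the real change had to pin down exactly
        # which versions emit the warning.
        return bool(match) and int(match.group(1)) >= 25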
Thankfully, you eventually reduced the occurrences of those warnings (https://gerrit.wikimedia.org/r/#/c/201193/ ), so that a single simple meta=siteinfo query doesn't cause an unnecessary warning to appear in the result. I am not sure whether we can rely on that, though, as you strenuously assert that meta=siteinfo could require continuation; if it ever actually does, our site info abstraction layer is probably broken.
As a result of this, I ask that more care is given to changes to the output of siteinfo, as that is how clients detect your version, which is a critical step in negotiating the 'protocol version' used. pywikibot tries to use feature detection instead of version detection where sensible, but that is harder to implement and involves firing off more requests to the server to detect what fails, slowing down the bootstrapping sequence for each site.
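As a sketch of that trade-off (a hypothetical probe, not pywikibot's actual abstraction): feature detection for the new continuation protocol means issuing a throwaway query and inspecting which continuation structure comes back, costing one more round-trip per site:

    import requests

    def supports_new_continuation(api_url):
        data = requests.get(api_url, params={
            'action': 'query', 'list': 'allpages', 'aplimit': 1,
            'continue': '', 'format': 'json',
        }).json()
        # Servers that predate the new protocol ignore 'continue' and
        # answer with the old 'query-continue' structure.
        return 'query-continue' not in data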
> Or you're referring to unit tests rather than actual breakage?
Unit tests, development processes, test processes, and production usage. Everything broke. Jenkins was -1'ing all patches uploaded to the pywikibot Gerrit because a very simple doctest failed, so a solution needed to be merged before normal development activities could resume. ;-)
https://lists.wikimedia.org/pipermail/pywikipedia-l/2014-October/009140.html
> But a notice about the warnings was sent to mediawiki-api-announce in September 2014,[1] a bit over a month before the warnings started.[2]
Yeah, we're also to blame for not picking up on this. We are paying a bit more attention to that mailing list now as a result. ;-)
It is hard to know what would have worked better, but the 'it's just a warning' mentality feels like the underlying problem. A single notice and a month's delay is probably not enough, and "About a month from now" might have got a better response than "Sometime during the MediaWiki 1.25 development cycle".
IMO Wikimedia should have been doing impact analysis before flicking the switch, i.e. tracking how many Wikimedia wiki clients were switching to the new format to avoid the warnings during that month, and which user agents were not responding. I understand that is happening now before the final switch to 'continue' as the default, but it would have been useful to start that process before releasing the warning.
>> Another example is the action=help rewrite not being backwards compatible. pywikibot wasn't broken, as it only uses the help module for older MW releases; but it wouldn't surprise me if there were clients parsing the help text, and they would have been broken.
> Comments on that and other proposed changes were requested on mediawiki-api-announce in July 2014,[3] three months before the change[4] was merged. No concerns were raised at the requested location[5] or on the mediawiki-api mailing list.
This change didn't affect pywikibot, so I don't know of any actual impact, and there may have been none, but afaik there was no appropriate notice of this intentional breakage, which removed functionality. An RFC is great, and in that RFC a concern was raised, albeit not backed by an actual usage which would be broken. A real notice of the actual breakage is still needed, even if no concerns were raised during the RFC.
My point is not that these changes were catastrophic - tools were updated, the wikis lived on, and a few pywikibot compat users probably switched to pywikibot core as a result of a month-long breakage.
My point is that shortening deploy timeframes reduces the ability of volunteers to detect and report bugs about breakages, and/or to update their client code.
-- John Vandenberg