On Fri, May 29, 2015 at 9:36 PM, Brad Jorsch (Anomie)
<bjorsch(a)wikimedia.org> wrote:
On Thu, May 28, 2015 at 2:39 PM, John Mark Vandenberg
<jayvdb(a)gmail.com>
wrote:
[T96942
<https://phabricator.wikimedia.org/T96942>] was reported by
pywikibot devs almost as soon as we detected that
the test wikis were failing in our travis-ci tests.
At 20:22 in the timezone of the main API developer (me).
It was 12 hours before a MediaWiki API fix was
submitted to Gerrit,
09:31, basically first thing in the morning for the main API developer.
There's really not much to complain about there.
In the proposed deploy sequence, 12 hours is a serious amount of time.
So if a shorter deploy process is implemented, we need to find ways to
get bug reports to you sooner, and ensure you are not the only one who
can notice and fix bugs related to API breakages, etc.
and it took
four additional *days* to get merged.
That part does suck.
On a positive note, the API breakage this week was rectified much
more quickly, and pywikibot test builds are green again.
https://phabricator.wikimedia.org/T100775
https://travis-ci.org/wikimedia/pywikibot-core/builds/64631025
This also
doesn't give clients sufficient time to work around
MediaWiki's wonderful intentional API breakages. e.g. raw continue,
which completely broke pywikibot and needed a large chunk of code
rewritten urgently, both for pywikibot core and the much older and
harder to fix pywikibot compat, which is still used as part of
processes that wiki communities rely on.
The continuation change hasn't actually broken anything yet.
Hmm. You still don't appreciate that you actually, really, truly,
fair-dinkum broke pywikibot...? Warnings are part of the API, and
adding/changing them can break clients. Warnings are one of the more
brittle parts of the API.
The impact on pywikibot core users wasn't so apparent, as the pywikibot
core devs fixed the problem when it hit the test servers, and the fix was
merged before it hit production servers. Not all users had 'git pull'ed
the latest pywikibot core code, and some informed us their bots were
broken, but as far as I am aware we didn't get any pywikibot core bug
reports submitted, because (often after mentioning problems on IRC)
their problems disappeared after they ran 'git pull'.
However pywikipedia / compat isn't actively maintained, and it broke
badly in production, with some scripts being broken for over a month:
https://phabricator.wikimedia.org/T74667
https://phabricator.wikimedia.org/T74749
pywikibot core is gradually improving its understanding of the API
warning system, but it isn't well supported yet. As a result,
pywikibot generally reports warnings to the user.
IIRC your API warnings system can send multiple distinct warnings as a
single string, with each warning separated by only a newline, which
is especially nasty for user agents to 'understand'. (But this may
only be in older versions of the API - I'm not sure.)
So adding a new warning to the API can result in the same warning
appearing many times on the user's console / logs, and thousands of
warnings on the screen send the users into panic mode.
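To illustrate, here is a rough sketch of what a client has to do with the old-format warnings described above. The structure {"warnings": {"<module>": {"*": "..."}}} and the newline-packing of multiple warnings into one string are assumptions based on this thread, not a documented contract, and the function names are hypothetical:

```python
def extract_warnings(result):
    """Yield (module, warning_text) pairs from an API result,
    splitting newline-packed multi-warning strings."""
    for module, data in result.get('warnings', {}).items():
        # Old-format warnings put the text under the '*' key.
        text = data.get('*', '') if isinstance(data, dict) else str(data)
        for line in text.split('\n'):
            line = line.strip()
            if line:
                yield module, line


def deduplicate(warnings):
    """Report each distinct (module, text) warning only once, so the
    same warning repeated across thousands of requests doesn't flood
    the user's console."""
    seen = set()
    for item in warnings:
        if item not in seen:
            seen.add(item)
            yield item
```

Note that even this still leaves the client string-matching against translatable prose to decide which warnings are expected, which is exactly why stable warning codes would help.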
I strongly recommend fixing the warning system before using it again
aggressively like was done for rawcontinue. e.g. It would be nice if
the API emitted codes for each warning scenario (don't reuse codes for
similar scenarios), so we don't need to do string matching to detect &
discard expected warnings, and you can i18n those messages without
breaking clients. (I think there is already a phab task for this.)
I also strongly recommend that Wikimedia gets heavily involved in
decommissioning pywikibot compat bots on Wikimedia servers, and any
other very old unmaintained clients, so that the API can be
aggressively updated without breaking the many bots still using
compat. pywikibot devs did some initial work with WMF staff at Lyon
on this front, and we need to keep that moving ahead.
Unless pywikibot was treating warnings as errors and
that's what broke it?
Yes, some of the code was raising an exception when it detected an API warning.
However another part of the breakage was that the JSON structure of
the new rawcontinue warnings was not what was expected. Some
pywikipedia / compat code assumed that the presence of warnings
implied that there would be a warning related to the module used, e.g.
result['warnings']['allpages'], but that didn't exist because this
warning was in result['warnings']['query'].
https://gerrit.wikimedia.org/r/#/c/176910/4/wikipedia.py,cm
https://gerrit.wikimedia.org/r/#/c/170075/2/wikipedia.py,cm
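A defensive reader avoids that KeyError by never assuming a particular module key is present. A minimal sketch, assuming the same old-format warnings structure as above (the helper name is hypothetical):

```python
def warnings_for(result, module):
    """Return the warning text filed under `module`, or None if absent,
    without assuming any particular module key exists in 'warnings'."""
    warnings = result.get('warnings')
    if not warnings:
        return None
    entry = warnings.get(module)
    if isinstance(entry, dict):
        # Old-format warnings put the text under the '*' key.
        return entry.get('*')
    return entry
```

With this, a continuation warning filed under 'query' simply yields None for 'allpages' instead of raising.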
It's coming
soon though. Nor should a "large chunk of code" *need* rewriting, just add
one parameter to your action=query requests.
I presume you mean we could have just added rawcontinue='' . Our very
limited testing at the time (we didn't have much time to fix the bugs
before it would hit production), and subsequent testing, indicates
that the rawcontinue parameter can be used even for very old versions
of MediaWiki. It appears to be silently ignored by earlier versions
of MediaWiki. If this is true, it would have been good to mention
very explicitly in the mediawiki-api-announce message that adding
rawcontinue='' is backwards compatible, and with which versions it
works.
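The one-parameter fix quoted above amounts to roughly this (a sketch; whether old MediaWiki versions really ignore the unknown parameter silently is, as noted, only established by our limited testing):

```python
def query_params(**extra):
    """Build parameters for an action=query API request, always opting
    out of the new continuation format by sending rawcontinue=''.
    Older servers that don't know the parameter appear to ignore it."""
    params = {'action': 'query', 'format': 'json', 'rawcontinue': ''}
    params.update(extra)
    return params
```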
To be safe, we implemented version detection so that rawcontinue is
only used on the MediaWiki versions that require it. That was not
trivial, as detecting the version means requesting
action=query&meta=siteinfo, and that would cause this warning if
rawcontinue wasn't used. Do you notice the loop there? In order to
avoid using a feature on the wrong version, we need to use the
feature. In case you are interested, the pywikibot core merged change
was
https://gerrit.wikimedia.org/r/#/c/168529/ (there were two other
attempts at solving it which were abandoned.)
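The shape of that version-detection step, sketched very roughly (the generator string format comes from meta=siteinfo's general.generator field; the 1.24 cut-off used here is purely illustrative, not the exact logic of the merged change):

```python
import re


def mw_version(siteinfo_result):
    """Extract a (major, minor) tuple from a meta=siteinfo response,
    e.g. 'MediaWiki 1.24wmf22' -> (1, 24)."""
    generator = siteinfo_result['query']['general']['generator']
    m = re.match(r'MediaWiki (\d+)\.(\d+)', generator)
    if not m:
        raise ValueError('unrecognised generator: %r' % generator)
    return int(m.group(1)), int(m.group(2))


def needs_rawcontinue(version):
    """Only send rawcontinue to versions new enough to emit the
    continuation warning (illustrative threshold)."""
    return version >= (1, 24)
```

The chicken-and-egg problem is that fetching this siteinfo response itself triggers the warning unless rawcontinue was already sent.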
Thankfully, you eventually reduced the occurrences of those warnings
(
https://gerrit.wikimedia.org/r/#/c/201193/ ), so that a single simple
meta=siteinfo query doesn't cause an unnecessary warning to appear in
the result. I am not sure whether we can rely on that, though, as you
strenuously assert that meta=siteinfo could require continuation - if
so, our site-info abstraction layer is probably broken whenever
meta=siteinfo ever actually does require continuation.
As a result, I ask that more care be given to changes to the
output of siteinfo, as that is how clients detect your version, which
is a critical step in negotiating the 'protocol version' used.
pywikibot tries to use feature detection when sensible instead of
version detection, but that is harder to implement and involves
shooting off more requests to the server to detect what fails, slowing
down the bootstrapping sequence for each site.
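Feature detection, by contrast, looks something like this sketch: send a cheap probe and inspect the response rather than trusting a version number. The fetch callable is injected so the idea can be shown without a live wiki; the warning-sniffing heuristic and function names are assumptions, not pywikibot's actual implementation:

```python
def supports_rawcontinue(fetch):
    """Probe whether the site understands the rawcontinue parameter.

    `fetch(params)` must perform the API request and return the decoded
    JSON. If the server warns about the parameter, treat the feature
    as unsupported.
    """
    result = fetch({'action': 'query', 'meta': 'siteinfo',
                    'format': 'json', 'rawcontinue': ''})
    for module in result.get('warnings', {}).values():
        text = module.get('*', '') if isinstance(module, dict) else str(module)
        if 'rawcontinue' in text:
            return False
    return True
```

Each such probe is an extra round trip, which is the bootstrapping cost mentioned above.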
Or you're referring to unit tests rather than
actual breakage?
Unit tests, development processes, test processes, and production usage.
Everything broke. Jenkins was -1'ing every patch uploaded to the
pywikibot repository in Gerrit because a very simple doctest failed.
As a result, a solution needed to be merged before normal development
activities could resume. ;-)
https://lists.wikimedia.org/pipermail/pywikipedia-l/2014-October/009140.html
But a notice
about the warnings was sent to mediawiki-api-announce in September 2014,[1]
a bit over a month before the warnings started.[2]
[1]:
https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2014-September…
[2]:
https://gerrit.wikimedia.org/r/#/c/160222/
Yeah, we're also to blame for not picking up on this. We are paying a
bit more attention to that mailing list now as a result ;-)
It is hard to know what would have worked better, but the "it's just a
warning" mentality feels like the underlying problem. A single notice
and a month's delay is probably not enough, and "About a month from now"
might have got a better response than "Sometime during the MediaWiki
1.25 development cycle".
IMO Wikimedia should have been doing impact analysis before flicking the
switch, i.e. tracking how many Wikimedia wiki clients were switching
to the new format to avoid the warnings during that month, and which
user agents were not responding. I understand that is happening now
before the final switch to 'continue' as default, but it would have
been useful to start that process before releasing the warning.
Another
example is the action=help rewrite not being backwards
compatible. pywikibot wasn't broken, as it only uses the help module
for older MW releases; but it wouldn't surprise me if there are clients
that were parsing the help text, and they would have been broken.
Comments on that and other proposed changes were requested on
mediawiki-api-announce in July 2014,[3] three months before the change[4]
was merged. No concerns were raised at the requested location[5] or on the
mediawiki-api mailing list.
[3]:
https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2014-July/0000…
[4]:
https://gerrit.wikimedia.org/r/#/c/160798/
[5]:
https://www.mediawiki.org/wiki/API/Architecture_work/Planning#HTMLizing_act…
This change didn't affect pywikibot, so I don't know of any actual
impact, and there may have been none, but afaik there was not
appropriate notice of this intentional breakage, which removed
functionality.
An RFC is great, and in that RFC a concern was raised, albeit not
backed by an actual usage which would be broken. A real notice of
the actual breakage is still needed, even if no concerns were
raised during the RFC.
My point is not that these changes were catastrophic - tools were
updated, the wikis lived on, and a few pywikibot compat users probably
switched to pywikibot core as a result of a month long breakage.
My point is that shortening the deployment timeframes reduces the
ability of volunteers to detect and raise bugs about breakages,
and/or to update their client code.
--
John Vandenberg