On 18/11/11 18:51, Ben Hartshorne wrote:
Hi everyone,
I just posted a
note<http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-br…
the blog about our new external store but wanted to add a few details
here. The deploy went smoothly, and I'm very happy with how the project
progressed overall.
I thought the ES apaches had disappeared some time ago. Maybe those were
just the memcached apaches.
The project originally included recompressing all of the object types in
the external store databases, continuing the work that was started in
2010.
Are you aware of https://bugzilla.wikimedia.org/20757#c9 ?
Are you very sure compressOld won't break anything? Anything that
touches the text table has a big potential for data loss.
I spent some time doing verification that things were behaving as
expected and it turns out they weren't. Upon examining the count of
different data types in the external store content, I found that some types
that are no longer supposed to be used were still getting created. I've
filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the
investigation and resolution of those differences.
I found out the source of those 'gzip,external' entries: AbuseFilter.
Noted at the bug.
During the deploy there was a brief period (about 10 minutes) during which
article saves failed due to the external store databases being in read-only
mode. As expected, some folks showed up in IRC telling us of the
'problem'. After the migration was complete we brainstormed a bit in IRC
about good ways of informing editors of planned maintenance such as this
migration. The regular databases (s3, etc.) have a read-only mode flag so
that the affected wikis show a reasonable error, but the external store
databases are a little different. Because of the way they're spread out,
the outage of a specific database cluster does not affect specific language
projects, but instead affects a specific time range for all wikis.
Additionally, the currently writable external store database affects
article edits on all wikis.
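
To make that last point concrete, here is a rough sketch in Python. It is not MediaWiki's actual code: the class names, the cluster names, and the DB://cluster/id addressing are assumptions for illustration only. It shows why a read-only writable cluster blocks saves on every wiki, while text already stored on that cluster stays readable.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one external store database cluster."""
    name: str
    read_only: bool = False
    blobs: dict = field(default_factory=dict)


class ExternalStore:
    """Toy model: every wiki writes new text to the one writable cluster."""

    def __init__(self, clusters, writable):
        self.clusters = {c.name: c for c in clusters}
        self.writable = writable  # name of the cluster receiving new text

    def save(self, wiki, text):
        cluster = self.clusters[self.writable]
        if cluster.read_only:
            # The wiki name plays no part in the decision: all wikis fail.
            raise RuntimeError(f"save on {wiki} failed: {cluster.name} is read-only")
        blob_id = len(cluster.blobs) + 1
        cluster.blobs[blob_id] = text
        return f"DB://{cluster.name}/{blob_id}"  # assumed address format

    def fetch(self, url):
        # Reads go to whichever cluster held the text when it was saved.
        name, blob_id = url[len("DB://"):].split("/")
        return self.clusters[name].blobs[int(blob_id)]
```

In this model, flipping read_only on the writable cluster makes save() fail for enwiki and dewiki alike, while fetch() keeps working for text stored earlier.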
You could have made everything read-only, too. It's a wider scope than
strictly needed, but I don't think it's that important to keep e.g. the
watchlist changeable if edits don't work.
There were a few suggestions thrown around:
2) make MediaWiki cache the change to conceal the outage from editors. The
idea here is that MediaWiki would notice that the backend database is
currently in read-only mode and would cache the change and write it to the
DB when it returns to read-write mode. There are a number of technical
challenges here, as well as the introduction of another system (the change
cache), but it's an interesting way around the problem, since rather than
addressing how to inform editors of impending maintenance it simply
eliminates the necessity for that communication.
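
A minimal sketch of that idea in Python, under stated assumptions: Backend, ChangeCache and ReadOnlyError are hypothetical names, and the queue lives in memory here, which is exactly the kind of storage that could lose changes in a real deployment. It only shows the control flow of queuing writes while the backend is read-only and replaying them in order afterwards.

```python
from collections import deque


class ReadOnlyError(Exception):
    """Raised by the toy backend while it is in read-only mode."""


class Backend:
    """Hypothetical stand-in for the external store database."""

    def __init__(self):
        self.read_only = False
        self.rows = []

    def write(self, change):
        if self.read_only:
            raise ReadOnlyError("backend is read-only")
        self.rows.append(change)


class ChangeCache:
    """Queues changes while the backend is read-only, replays them later."""

    def __init__(self, backend):
        self.backend = backend
        self.pending = deque()

    def save(self, change):
        # Preserve ordering: never write ahead of changes still queued.
        if self.pending:
            self.pending.append(change)
            return "queued"
        try:
            self.backend.write(change)
            return "stored"
        except ReadOnlyError:
            self.pending.append(change)
            return "queued"

    def flush(self):
        """Replay queued changes in order; stop at the first failure."""
        written = 0
        while self.pending:
            try:
                self.backend.write(self.pending[0])
            except ReadOnlyError:
                break  # leave the change queued rather than dropping it
            self.pending.popleft()
            written += 1
        return written
```

A change is removed from the queue only after the backend accepts it, so a failed flush leaves it pending instead of silently discarding it.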
I don't like it. Some changes get cached unnoticed somewhere (e.g.
memcached), then suddenly they fail to transfer to the end system, and a
few days later a bunch of content magically disappears.
It might be acceptable to store directly in the text table if all ES
clusters are down, although that's probably not in our interest.
3) throw up a banner on the edit page itself. (...)
During the maintenance, we could
change the message to be more visible, or we could take more drastic action
such as disabling the edit or save buttons.
I'm not opposed to advisory edit banners, but don't hide the Save buttons
if saving may well be working (and even less the Edit ones).