Nicolas Dumazet wrote:
See
http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/WikiDump.py?r1=35655…
It should now sort dumps by (LastDumpFailed, Age) in place of the
previous by Age order.
Awesome, thanks! :)
I've updated the live script; note that there won't be any visible
difference for now, as the aborted dumps were already all at the bottom
of the list.
There's a distinction between dumps which have been _aborted_, and those
which have had elements _fail_.
_Aborted_ means that the system broke entirely (probably because the
server it was running on died, or the dump process crashed or was killed
manually). The dump monitor later found the expired lock file and
declared it aborted, without necessarily being sure what went wrong.
These are marked in status.html with <span class="failed">, and are
what's currently caught by Nicolas' change.
Other dumps may run to completion, but still have _elements_ which
failed. You'll see these marked as "Dump complete, 20 items failed", as
in the rash on May 26 (in this case the MediaWiki-generated parts
worked, but the raw SQL parts couldn't contact the server).
These are marked in status.html with <span class='done failed'>, and
don't currently get caught by the change to dbListByAge().
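As a rough sketch of how both kinds of failure could feed the same
(LastDumpFailed, Age) ordering: the helper names, the tuple layout, the
plain 'done' class and the "Dump aborted" text below are assumptions for
illustration, not what WikiDump.py actually does. Only the two "failed"
span classes come from status.html as described above.

```python
import re

def last_dump_failed(status_html):
    """True if the status span carries the 'failed' class.

    Covers both <span class="failed"> (aborted) and
    <span class='done failed'> (completed, but with failed items).
    """
    match = re.search(r"""<span class=(['"])(.*?)\1""", status_html)
    if match is None:
        return False
    return "failed" in match.group(2).split()

def sort_key(entry):
    # entry is an assumed (wiki name, status line, age in days) tuple
    name, status_html, age = entry
    # False sorts before True, so failed dumps end up at the bottom
    return (last_dump_failed(status_html), age)

# Made-up sample data, for illustration only
dumps = [
    ("aawiki", "<span class='done'>Dump complete</span>", 3),
    ("bbwiki", "<span class='done failed'>Dump complete, 20 items failed</span>", 1),
    ("ccwiki", '<span class="failed">Dump aborted</span>', 5),
    ("ddwiki", "<span class='done'>Dump complete</span>", 1),
]
ordered = [name for name, _, _ in sorted(dumps, key=sort_key)]
```

With a check like this, partially failed dumps would sort together with
the aborted ones rather than with the clean runs.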
What might be most useful here is patching up the actual runner process
to retry individual components, perhaps with an exponential time
backoff...
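Something along these lines, as a minimal sketch only: run_with_backoff,
its parameters and the flaky_step example are hypothetical names, not
part of the existing runner. The sleep function is passed in so the
example can record the delays instead of actually waiting.

```python
import time

def run_with_backoff(step, max_attempts=5, base_delay=60.0, sleep=time.sleep):
    """Call step() until it succeeds, doubling the wait after each failure.

    Re-raises the last exception once max_attempts is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2

# Usage: a step that fails twice before succeeding.
# Delays are recorded in a list rather than slept.
attempts = []
delays = []

def flaky_step():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("could not contact the server")
    return "ok"

result = run_with_backoff(flaky_step, sleep=delays.append)
```

Here the step is tried three times, waiting 60 and then 120 seconds
between attempts.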
For problems that are due to something like software configuration (DB
permission change, PHP config bug, etc), just blindly retrying won't
help until the actual problem is fixed -- but retrying *will* work once
it is. If it can alert the site administrators and just hold off until
the fix comes in, that might simplify things... instead of a trail of
identically FAILED dumps in a row, it could just hold in place.
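A sketch of the "hold in place" idea, under the same caveats:
hold_until_fixed, notify and wait are made-up names, and how the alert
is actually delivered is left open. The point is that the admins get one
alert on the first failure, and the step keeps being retried (with a
capped delay) instead of being recorded as yet another failed dump.

```python
def hold_until_fixed(step, notify, wait, base_delay=60.0, max_delay=3600.0):
    """Retry step() until it succeeds, alerting only on the first failure."""
    delay = base_delay
    alerted = False
    while True:
        try:
            return step()
        except Exception as exc:
            if not alerted:
                notify("dump step failing, holding in place: %s" % exc)
                alerted = True
            wait(delay)
            # Back off, but never wait longer than max_delay between tries
            delay = min(delay * 2, max_delay)

# Usage: the "fix" arrives after three failed tries.
tries = []
alerts = []
waits = []

def broken_then_fixed():
    tries.append(1)
    if len(tries) <= 3:
        raise RuntimeError("DB permission denied")
    return "dumped"

outcome = hold_until_fixed(broken_then_fixed, alerts.append, waits.append)
```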
There is some provision for e-mail alerts on failure, but it doesn't
seem to be working in practice; probably a local mail config issue. I'll
see if we can get that sorted out.
-- brion