Nicolas Dumazet wrote:
See http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/WikiDump.py?r1=35655&...
It should now sort dumps by (LastDumpFailed, Age) instead of by Age alone.
Awesome, thanks! :)
I've updated the live script; note that there won't be any visible difference for now, as the aborted dumps were already all at the bottom of the list.
There's a distinction between dumps which have been _aborted_, and those which have had elements _fail_.
_Aborted_ means that the system broke entirely (probably because the server it was running on died, or the dump process crashed or was killed manually). The dump monitor later found the expired lock file and declared it aborted, without necessarily being sure what went wrong.
These are marked in status.html with: <span class="failed">, and are what's currently caught by Nicolas' change.
Other dumps may run to completion but still have _elements_ which failed. You'll see these, such as the rash on May 26, marked as "Dump complete, 20 items failed" (in this case the MediaWiki-generated parts worked, but the raw SQL parts couldn't contact the server).
These are marked in status.html with: <span class='done failed'>
These don't currently get caught by the change to dbListByAge().
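To make the distinction concrete, here is a rough sketch of a sort key that treats both kinds of failure alike. It is not the code in WikiDump.py: the DumpStatus record, its field names, and the helper functions are made up for illustration, and the direction of the ordering (failed entries last, smaller age first) is an assumption.

```python
from collections import namedtuple

# Hypothetical record; the real script keeps its own per-wiki state.
# css_class is the class attribute of the status span, e.g. "failed"
# (aborted) or "done failed" (completed with failed items).
DumpStatus = namedtuple("DumpStatus", ["db", "css_class", "age"])


def has_failed(css_class):
    """True for both aborted dumps and dumps with failed elements."""
    return "failed" in css_class.split()


def order_by_failed_then_age(dumps):
    # Tuples compare element by element and False sorts before True,
    # so clean dumps come first and any kind of failure goes last.
    return sorted(dumps, key=lambda d: (has_failed(d.css_class), d.age))
```

Checking for the "failed" token rather than comparing the whole class string is what lets 'done failed' fall into the same bucket as a plain 'failed'.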
What might be most useful here is patching up the actual runner process to retry individual components, perhaps with an exponential time backoff...
For problems that are due to something like software configuration (DB permission change, PHP config bug, etc.), just blindly retrying won't help until the actual problem is fixed -- but retrying *will* work once it is. If the runner can alert the site administrators and just hold off until the fix comes in, that might simplify things... instead of a trail of identically FAILED dumps in a row, it could just hold in place.
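As a minimal sketch of the retry idea, something along these lines could wrap each component. Nothing here exists in the current runner: run_with_backoff, its parameters, and the alert callback are all hypothetical names, and the delay values are placeholders.

```python
import time


def run_with_backoff(step, max_tries=5, base_delay=60.0, max_delay=3600.0,
                     sleep=time.sleep, alert=None):
    """Call step() until it succeeds or max_tries attempts have failed.

    Returns True on success, False if every attempt raised.
    The wait doubles after each failure, capped at max_delay seconds.
    """
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            step()
            return True
        except Exception as err:
            # Alert only on the first failure, so admins get a single
            # message instead of one per retry.
            if attempt == 1 and alert is not None:
                alert(err)
            if attempt == max_tries:
                return False
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

With a large max_tries and a generous max_delay, this approximates "hold in place until the fix comes in": the component keeps waiting and retrying rather than recording a fresh failed dump each time. The sleep argument is injectable mainly so the function can be exercised without actually waiting.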
There is some provision for e-mail alerts on failure, but it doesn't seem to be working in practice; probably a local mail config issue. I'll see if we can get that sorted out.
-- brion