When a database dump fails, as seen in "20 items failed" on http://download.wikimedia.org/backup-index.html, the dump script continues with the next database in turn. Apparently dumps often fail in groups. I guess this is because the failure doesn't depend on the database itself (e.g. svwiki) but on some other circumstance. When that circumstance is resolved, all dumps succeed.
Right now (20080529), frwiki is being dumped. Previous successful dumps of frwiki were done on 20080514 and 20080420, http://download.wikimedia.org/frwiki/
The last successful dump of svwiki was made on 20080406, but the one on 20080425 failed, and the one on 20080524 was aborted, http://download.wikimedia.org/svwiki/
Hopefully, but only hopefully, the next svwiki dump will succeed in just a few days or weeks. But who knows. Maybe it too will be aborted and the next dump of frwiki will run instead.
What I want is the dump script to be rewritten so it prioritizes those databases (websites) that haven't been successfully dumped in a long time. It seems unfair that the French should have one every fortnight when we Swedes are waiting almost two months.
Other than this, I want the dumps to fail less often. Why do they fail? Has this been investigated? What can be done to help this?
Lars Aronsson wrote: [snip]
What I want is the dump script to be rewritten so it prioritizes those databases (websites) that haven't been successfully dumped in a long time. It seems unfair that the French should have one every fortnight when we Swedes are waiting almost two months.
Currently the system sorts by order of last dump *attempt*, as I recall.
Other than this, I want the dumps to fail less often. Why do they fail? Has this been investigated? What can be done to help this?
There are several common reasons, which get addressed as they get investigated...
* Loss of database connection during run
This was the traditional problem.
Due to the length of runs for big wikis, it became relatively common for the biggest wikis to fail to finish because some DB server broke, or had to be taken down for maintenance, before the dump was done. When the server went down, the process would just die.
The first level of this was worked around last year by improving the reconnection behavior when individual connections would go down.
A second level of failures was then discovered, and was worked around a couple months ago by breaking the text fetching for the slowest parts of the dump into a subprocess which can be restarted, connecting to another server. This allows the system to recover even if the set of available DB servers has changed, since it is able to reload its configuration.
A hanging issue with that code, where the recovery system would get confused and go into a loop instead of bailing out gracefully, was discovered recently and fixed.
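In outline, that restart-and-reload recovery looks something like the sketch below. This is hypothetical code, not the actual dump scripts: the command, the reload hook, and MAX_RESTARTS are invented placeholders.

```python
# A sketch only, not the actual dump code: restart the slow text-fetch step
# as a subprocess when it dies, re-reading configuration so a replacement DB
# server can be picked up instead of looping forever on a dead one.
import subprocess
import time

MAX_RESTARTS = 5

def run_text_fetch(command, reload_config):
    """Run the text-fetch step as a child process; if the child dies (e.g.
    its DB server went away), reload config and start a fresh one, bailing
    out gracefully after too many failures."""
    for attempt in range(MAX_RESTARTS):
        child = subprocess.Popen(command)
        if child.wait() == 0:
            return True  # step finished cleanly
        # The set of available DB servers may have changed, so re-read the
        # configuration before starting a replacement child.
        reload_config()
        time.sleep(30 * (attempt + 1))  # grow the pause between restarts
    return False  # give up after MAX_RESTARTS failures

if __name__ == "__main__":
    def reload_db_config():
        print("re-reading DB server configuration...")

    # Stand-in command for the slow text-fetch step.
    ok = run_text_fetch(["php", "maintenance/fetchText.php"], reload_db_config)
    print("text fetch %s" % ("succeeded" if ok else "failed"))
```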
* Transitory hardware errors
For a while we had several breakages due to benet, the server the dumps ran on, encountering disk errors which hung the system. The machine was replaced some weeks ago.
* Transitory configuration errors
Bad copy of code, broken PHP install, change in MySQL privileges, etc. These will cause a rash of scary-looking "failure!"s in a row, but are easily fixed case by case, and the next runs continue just fine.
* Full disk
This still happens occasionally, breaking dumps until space is freed up; the dump system doesn't have a very good disk-space-management scheme. It can optionally delete old dumps after a few runs, but this is currently disabled as it doesn't distinguish between good and bad dumps. :)
Dumps are currently sharing space with upload backups; we're waiting on delivery of new fileservers with more space available.
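A disk-space policy that does distinguish good from bad dumps might look roughly like this sketch. The "status-complete" marker file is an invented convention here, not something the current system actually writes.

```python
# A sketch of the missing disk-space policy: prune old runs freely, but
# never the newest good ones.
import os
import shutil

def prune_dumps(wiki_dir, keep_good=2):
    """Delete old per-run directories under wiki_dir (names like 20080524),
    keeping the newest keep_good runs that actually completed."""
    runs = sorted(os.listdir(wiki_dir), reverse=True)  # newest first
    good_kept = 0
    for run in runs:
        path = os.path.join(wiki_dir, run)
        if not os.path.isdir(path):
            continue
        # Invented convention: a completed run leaves a marker file.
        succeeded = os.path.exists(os.path.join(path, "status-complete"))
        if succeeded and good_kept < keep_good:
            good_kept += 1  # preserve this good dump
            continue
        shutil.rmtree(path)  # failed run, or a good run past the quota
```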
Note that the dump monitor script is available in our SVN, and patches are welcome:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
-- brion vibber (brion @ wikimedia.org)
2008/5/30 Brion Vibber brion@wikimedia.org:
Lars Aronsson wrote: [snip]
What I want is the dump script to be rewritten so it prioritizes those databases (websites) that haven't been successfully dumped in a long time. It seems unfair that the French should have one every fortnight when we Swedes are waiting almost two months.
Currently the system sorts by order of last dump *attempt*, as I recall.
Is that likely to be changed? Lars' way does seem significantly better...
Thomas Dalton wrote:
2008/5/30 Brion Vibber brion@wikimedia.org:
Lars Aronsson wrote: [snip]
What I want is the dump script to be rewritten so it prioritizes those databases (websites) that haven't been successfully dumped in a long time. It seems unfair that the French should have one every fortnight when we Swedes are waiting almost two months.
Currently the system sorts by order of last dump *attempt*, as I recall.
Is that likely to be changed? Lars' way does seem significantly better...
Patches are welcome. :)
-- brion
On Fri, May 30, 2008 at 12:11 PM, Brion Vibber brion@wikimedia.org wrote:
Thomas Dalton wrote:
Is that likely to be changed? Lars' way does seem significantly better...
Patches are welcome. :)
Asking other people to do your job isn't.
Anthony wrote:
Asking other people to do your job isn't.
This is actually quite rude when you consider the amount of work that the employed development team have to do; there are currently only two paid developers.
You are suggesting that the CTO spend valuable time programming an aesthetic change that would not have any considerable benefit whatsoever; this makes little sense and would only serve to lengthen the time that other projects -- perhaps more desired by the community -- take to surface.
MinuteElectron.
On Sat, May 31, 2008 at 3:58 AM, MinuteElectron minuteelectron@googlemail.com wrote:
Anthony wrote:
Asking other people to do your job isn't.
This is actually quite rude when you consider the amount of work that the employed development team have to do; there are currently only two paid developers.
You are suggesting that the CTO spend valuable time programming an aesthetic change that would not have any considerable benefit whatsoever; this makes little sense and would only serve to lengthen the time that other projects -- perhaps more desired by the community -- take to surface.
MinuteElectron.
Agreed. If every new feature or bug fix in MediaWiki were expected to come from Brion or Tim (the only paid devs), I'm sure we wouldn't have as many great features as we do. This isn't a comment on their abilities; they're both fantastic programmers, better than I am, for certain. It's simply that time in the day is limited, and no one can do everything. This community relies heavily on volunteer input, and Brion asking for a patch is part of that.
-Chad
On Sat, May 31, 2008 at 9:11 AM, Chad innocentkiller@gmail.com wrote:
This community relies heavily on volunteer input, and Brion asking for a patch is part of that.
I didn't see Brion's comment as a serious request for a patch. In fact, I'm not even sure if he thinks the idea is a good one in the first place (or that it's "an aesthetic change that would not have any considerable benefit whatsoever").
Oh well, whatever...
Anthony wrote:
On Sat, May 31, 2008 at 9:11 AM, Chad innocentkiller@gmail.com wrote:
This community relies heavily on volunteer input, and Brion asking for a patch is part of that.
I didn't see Brion's comment as a serious request for a patch. In fact, I'm not even sure if he thinks the idea is a good one in the first place (or that it's "an aesthetic change that would not have any considerable benefit whatsoever").
That particular change is a good idea, but not a high-priority fix (Wikipedia is broken, must be fixed immediately!) so I'm not necessarily going to jump on it that second.
My role isn't to personally do all software development for Wikimedia; it's to make sure that necessary things get done to keep us online. While I do some programming myself, my primary responsibility is increasingly as an architect, project manager, gatekeeper, and mentor.
Our own programming staff is still very small; throw in a couple contract projects and a whole bunch of volunteers with their own individual assignments and areas of interest, and it's really a lot bigger.
When some interesting project exists, I have several possibilities:
* do it myself
* assign it to a staff programmer (Tim :)
* find someone to assign it to as a contract project
* find someone interested in poking at it for the fun and experience
* wait for someone interested to poke at it and be there to help them
It might be tempting to try to take on every project myself, but that's not a good use of Foundation resources! ;)
This is an open source project, and there's a lot of room for people to "scratch an itch" on particular projects that interest them. Being open about our issues, soliciting improvements, and being there to help new programmers learn by doing is how we grow our developer team.
-- brion
On Sun, Jun 1, 2008 at 2:01 PM, Brion Vibber brion@wikimedia.org wrote:
Anthony wrote:
On Sat, May 31, 2008 at 9:11 AM, Chad innocentkiller@gmail.com wrote:
This community relies heavily on volunteer input, and Brion asking for a patch is part of that.
I didn't see Brion's comment as a serious request for a patch. In fact, I'm not even sure if he thinks the idea is a good one in the first place (or that it's "an aesthetic change that would not have any considerable benefit whatsoever").
That particular change is a good idea, but not a high-priority fix (Wikipedia is broken, must be fixed immediately!) so I'm not necessarily going to jump on it that second.
There's currently no valid full-history English Wikipedia database dump available. (There was a completed history dump on 20080103, but I seem to remember it not unzipping properly. If I'm wrong on that, maybe there has been one successful dump, but it's no longer available for download, except maybe on BitTorrent.) I'd consider this pretty important. Obviously "Wikipedia is broken" would be more important, but I assume you mean broken technically and not socially, in which case I'd say that doesn't seem to be the case.
Ordering the dumps so that failed ones get regenerated first is one step that might help mitigate this problem, but ultimately a redesign of the dump system is probably going to be required.
That's my view of the situation, for what it's worth (probably nothing).
On 31/05/2008, MinuteElectron minuteelectron@googlemail.com wrote:
Anthony wrote:
Asking other people to do your job isn't.
This is actually quite rude when you consider the amount of work that the employed development team have to do; there are currently only two paid developers.
Agreed.
You are suggesting that the CTO spend valuable time programming an aesthetic change that would not have any considerable benefit whatsoever; this makes little sense and would only serve to lengthen the time that other projects -- perhaps more desired by the community -- take to surface.
Strongly disagree. Doing dumps in the right order is not aesthetic and would have considerable benefit. Perhaps you've misunderstood the issue?
I've taken a look at the code, and it's a little beyond me to fix, I think (for a start, it's in Python, which I don't really know), but it seems the problem is that the failed dumps aren't being deleted. When the code looks to see when the latest dump was, it does so by looking in the dump directory and seeing what the latest dump in there is; if the failed ones were deleted as soon as it's realised they've failed, the problem with the order would be fixed (and it would free up disc space). There may be parts of the failed dump that have succeeded and it may seem wasteful to just delete them, but, as I understand it, the whole dump has to be redone to redo the failed parts, and if the whole lot is deleted it will be redone almost straight away, so there's only going to be a few hours in which the deleted dump may have been useful.
On Sat, May 31, 2008 at 9:23 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
There may be parts of the failed dump that have succeeded and it may seem wasteful to just delete them, but, as I understand it, the whole dump has to be redone to redo the failed parts, and if the whole lot is deleted it will be redone almost straight away, so there's only going to be a few hours in which the deleted dump may have been useful.
That wouldn't be true for something like the English Wikipedia dump, which usually runs for many days before failing. (Unless I'm misunderstanding what you mean by "parts".)
See http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/WikiDump.py?r1=35655&...
It should now sort dumps by (LastDumpFailed, Age) instead of the previous Age-only order.
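In outline, the new ordering amounts to something like this sketch. The dump_status data and key function are invented for illustration; the real code is the dbListByAge() change in WikiDump.py, and the effective direction depends on which end of the list the runner consumes.

```python
# A minimal sketch of sorting by a (last-dump-failed, age) key, so wikis
# whose previous run failed, and then the longest-waiting wikis, come first.
dump_status = {
    # wiki: (last_dump_failed, days_since_last_dump) -- invented data
    "frwiki": (False, 15),
    "dewiki": (False, 40),
    "svwiki": (True, 53),
}

def priority_key(wiki):
    failed, age = dump_status[wiki]
    # Tuples compare element by element: failed runs sort before clean
    # ones, and larger ages (negated) sort first within each group.
    return (not failed, -age)

queue = sorted(dump_status, key=priority_key)
print(queue)  # ['svwiki', 'dewiki', 'frwiki']
```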
2008/5/31 Anthony wikimail@inbox.org:
On Sat, May 31, 2008 at 9:23 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
There may be parts of the failed dump that have succeeded and it may seem wasteful to just delete them, but, as I understand it, the whole dump has to be redone to redo the failed parts, and if the whole lot is deleted it will be redone almost straight away, so there's only going to be a few hours in which the deleted dump may have been useful.
That wouldn't be true for something like the English Wikipedia dump, which usually runs for many days before failing. (Unless I'm misunderstanding what you mean by "parts".)
Nicolas Dumazet wrote:
See http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/WikiDump.py?r1=35655&...
It should now sort dumps by (LastDumpFailed, Age) instead of the previous Age-only order.
Awesome, thanks! :)
I've updated the live script; note that there's currently going to be no visible difference, as the aborted dumps were already all at the bottom of the list.
There's a distinction between dumps which have been _aborted_, and those which have had elements _fail_.
_Aborted_ means that the system broke entirely (probably because the server it was running on died, or the dump process crashed or was killed manually). The dump monitor later found the expired lock file and declared it aborted, without necessarily being sure what went wrong.
These are marked in status.html with: <span class="failed">, and are what's currently caught by Nicolas' change.
Other dumps may run to completion, but still have _elements_ which failed. You'll see these, such as the rash on May 26, marked as "Dump complete, 20 items failed" (in this case the MediaWiki-generated parts worked, but the raw SQL parts couldn't contact the server).
These are marked in status.html with: <span class='done failed'>
These don't currently get caught by the change to dbListByAge().
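One way the "done failed" case might be folded in is sketched below. This is hypothetical code: the directory layout and function are simplifications, not the actual WikiDump.py internals.

```python
# A hedged sketch: treat any run whose status.html carries the "failed"
# class as unsuccessful, whether or not the run also completed.
import os
import re

def last_dump_failed(wiki_dir):
    """True if the newest run's status.html is marked failed -- this matches
    both <span class="failed"> and <span class='done failed'>."""
    for run in sorted(os.listdir(wiki_dir), reverse=True):  # newest first
        status = os.path.join(wiki_dir, run, "status.html")
        if os.path.exists(status):
            with open(status) as f:
                html = f.read()
            return re.search(r"class=['\"][^'\"]*failed", html) is not None
    return False  # no completed runs yet; let the age ordering decide
```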
What might be most useful here could be patching up the actual runner process to retry individual components, perhaps with an exponential time backoff...
For problems that are due to something like software configuration (DB permission change, PHP config bug, etc), just blindly retrying won't help until the actual problem is fixed -- but retrying *will* work once it is. If it can alert the site administrators and just hold off until the fix comes in, that might simplify things... instead of a trail of identically FAILED dumps in a row, it could just hold in place.
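In outline, such a retry-and-hold scheme might look like this sketch. run_component and alert_admins are invented placeholders here, not existing dump-script hooks.

```python
# A sketch of per-component retries with exponential backoff, holding in
# place and alerting admins once retries stop helping.
import time

def retry_component(run_component, alert_admins, max_tries=6, base_delay=60):
    """Retry one dump step with exponentially growing waits; after max_tries
    failures, alert the admins and hold instead of recording yet another
    FAILED dump in the list."""
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        if run_component():
            return True
        if attempt == max_tries:
            break
        time.sleep(delay)
        delay *= 2  # 60s, 120s, 240s, ... between attempts
    # Probably a configuration problem (DB permissions, PHP install, ...)
    # that blind retries can't fix until a human steps in.
    alert_admins("dump component still failing after %d tries" % max_tries)
    return False
```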
There is some provision for e-mail alerts on failure, but it doesn't seem to be working in practice; probably some local mail config, I'll see if we can get that sorted out.
-- brion
On Fri, May 30, 2008 at 8:35 PM, Anthony wikimail@inbox.org wrote:
Asking other people to do your job isn't.
Brion's job is to do what the Board tells him. If they told him to make fixing the dumps his first priority, he would, but they haven't, so it's *not* his job (or anyone's) to fix this immediately.
On Sat, May 31, 2008 at 1:38 PM, Anthony wikimail@inbox.org wrote:
I didn't see Brion's comment as a serious request for a patch. In fact, I'm not even sure if he thinks the idea is a good one in the first place (or that it's "an aesthetic change that would not have any considerable benefit whatsoever").
When a developer in an open-source project says that patches are welcome, I sure hope that they're serious, because otherwise they're knowingly tempting people to waste time on patches that won't be accepted.