Quick update on dump status:
* Dumps are back up and running on srv31, the old dump batch host.
Please note that unlike the wiki sites themselves, dump activity is *not* considered time-critical -- there is no emergency requirement to get them running as soon as possible.
Getting dumps running again after a few days is nearly as good as getting them running again immediately. Yes, it sucks when it takes longer than we'd like. No, it's not the end of the world.
* Dump runner redesign is in progress.
I've chatted a bit with Tim in the past on rearranging the architecture of the dump system to allow for horizontal scaling, which will make the big history dumps much much faster by distributing the work across multiple CPUs or hosts where it's currently limited to a single thread per wiki.
We seem to be in agreement on the basic arch, and Tomasz is now in charge of making this happen; he'll be poking at infrastructure for this over the next few days -- using his past experience with distributed index build systems at Amazon to guide his research -- and will report to y'all later this week with some more concrete details.
* Dump format changes are in progress.
Robert Rohde's proof-of-concept code for diff-based dumps is in our SVN and available for testing.
We'll be looking at whether this can be integrated and what effect it has on dump performance; currently performance and reliability are our primary concerns, rather than output file size, but the two can intersect, since bzip2 compression is a factor in generation time.
This will be pushed back to later if we don't see an immediate generation-speed improvement, but it's very much a desired project since it will make the full-history dump files much smaller.
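As a rough illustration of the horizontal-scaling idea in the dump runner redesign item above, here is a minimal sketch. It is not the actual design: splitting by contiguous page-ID ranges is an assumption, the names split_ranges, dump_range and dump_wiki are hypothetical, and a thread pool stands in for the multiple CPUs or hosts a real runner would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(first_id, last_id, n_chunks):
    """Split an inclusive page-ID range into at most n_chunks contiguous ranges."""
    total = last_id - first_id + 1
    size = -(-total // n_chunks)  # ceiling division
    return [(start, min(start + size - 1, last_id))
            for start in range(first_id, last_id + 1, size)]


def dump_range(id_range):
    """Hypothetical stand-in for dumping the history of one page range."""
    start, end = id_range
    return [f"<page id={page_id}/>" for page_id in range(start, end + 1)]


def dump_wiki(first_id, last_id, workers=4):
    """Dump independent ranges concurrently, then join them in page order."""
    ranges = split_ranges(first_id, last_id, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the output stays sorted
        # by page ID even if the ranges finish out of order.
        parts = pool.map(dump_range, ranges)
        return [line for part in parts for line in part]
```

The point of the sketch is only that each range can be produced independently and stitched back together in order, which is what would remove the single-thread-per-wiki limit.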
-- brion
"Brion Vibber" brion@wikimedia.org wrote in message news:49A42729.4070806@wikimedia.org...
Quick update on dump status:
- Dumps are back up and running on srv31, the old dump batch host.
Please note that unlike the wiki sites themselves, dump activity is *not* considered time-critical -- there is no emergency requirement to get them running as soon as possible.
...
- Dump runner redesign is in progress.
...
- Dump format changes are in progress.
Brion -- thanks very much for the status report. I think most of us understand that you need to prioritize your resources, particularly when server alarms are going off left and right; just knowing that someone is aware of an issue and has it somewhere on their to-do list is very positive.
Russ
On February 23, Brion Vibber wrote about the development of a new dump process:
I've been needing to reprioritize resources for this for a while; all of us having many other things to do at the same time
I don't really see why this should be. Is there still a shortage of developers? What's the plan to fix that?
On February 24, the dumps started to roll again, but only with 3 parallel processes. This has since been reduced to 2 processes. The oldest dump at the bottom of http://download.wikimedia.org/backup-index.html is now from January 19, which is 7 weeks old. It was bad enough when the cycle was 3-4 weeks during November-January.
On February 24, Brion wrote about restarting the current process:
Please note that unlike the wiki sites themselves, dump activity is *not* considered time-critical -- there is no emergency requirement to get them running as soon as possible.
This is a language I don't understand. If they are "not time-critical" (not at all?) that means they could wait 4 weeks or 4 years. So why are you pretending to produce dumps at all, when you could just switch them off for the coming 3 years? Things just can't be "not time-critical". Every activity that is performed needs to be completed, or it shouldn't be performed. Maybe it can wait 4 hours or 4 days, but 4 weeks is painfully slow and 4 months is almost useless.
I need a weekly dump of current pages in order to keep improving Wikipedia. If I get one every 4 weeks, it means I'm working one week and idling for 3 weeks. I could live with that. But during July-October 2008 I didn't get any dumps, because we were waiting for new storage systems to be installed, and right now I'm not getting any either. So was the new storage indeed not the bottleneck?
This will be pushed back to later if we don't see an immediate generation-speed improvement, but it's very much a desired project since it will make the full-history dump files much smaller.
Is size all that important? I need frequent dumps, not smaller dumps.
On Sun, Mar 8, 2009 at 5:20 PM, Lars Aronsson lars@aronsson.se wrote:
I don't really see why this should be. Is there still a shortage of developers? What's the plan to fix that?
There's always a shortage of developers, in every software project known to man (really, in every open-ended project known to man). You always have limited time that has to be carefully prioritized. This is not fixable until Wikimedia has infinite money.
On 08.03.2009 18:17:18, Aryeh Gregor wrote:
On Sun, Mar 8, 2009 at 5:20 PM, Lars Aronsson lars@aronsson.se wrote:
I don't really see why this should be. Is there still a shortage of developers? What's the plan to fix that?
There's always a shortage of developers, in every software project known to man (really, in every open-ended project known to man). You always have limited time that has to be carefully prioritized. This is not fixable until Wikimedia has infinite money.
Note that infinite money does not get you infinite resources (time, people, ...) ;)
On Mon, Mar 9, 2009 at 10:01 AM, Leon Weber leon@leonweber.de wrote:
Note that infinite money does not get you infinite resources (time, people, ...) ;)
Sure it does, you just infinitely clone Brion.
On Wed, Feb 25, 2009 at 3:58 AM, Brion Vibber brion@wikimedia.org wrote:
Quick update on dump status:
- Dumps are back up and running on srv31, the old dump batch host.
Please note that unlike the wiki sites themselves, dump activity is *not* considered time-critical -- there is no emergency requirement to get them running as soon as possible.
Getting dumps running again after a few days is nearly as good as getting them running again immediately. Yes, it sucks when it takes longer than we'd like. No, it's not the end of the world.
- Dump runner redesign is in progress.
I've chatted a bit with Tim in the past on rearranging the architecture of the dump system to allow for horizontal scaling, which will make the big history dumps much much faster by distributing the work across multiple CPUs or hosts where it's currently limited to a single thread per wiki.
We seem to be in agreement on the basic arch, and Tomasz is now in charge of making this happen; he'll be poking at infrastructure for this over the next few days -- using his past experience with distributed index build systems at Amazon to guide his research -- and will report to y'all later this week with some more concrete details.
Has the dumper been tweaked to remove all hidden revisions, including hidden usernames recently fixed in bug 17792?
-- John Vandenberg
On Mon, Mar 9, 2009 at 2:36 PM, John Vandenberg jayvdb@gmail.com wrote:
Has the dumper been tweaked to remove all hidden revisions, including hidden usernames recently fixed in bug 17792?
That was a bug in contributions, not in deleted revisions. There is no reason to expect that that bug has any relevance to dumps.
On Mon, Mar 9, 2009 at 6:09 PM, Andrew Garrett andrew@werdn.us wrote:
On Mon, Mar 9, 2009 at 2:36 PM, John Vandenberg jayvdb@gmail.com wrote:
Has the dumper been tweaked to remove all hidden revisions, including hidden usernames recently fixed in bug 17792?
That was a bug in contributions, not in deleted revisions. There is no reason to expect that that bug has any relevance to dumps.
I would like to see some affirmative statement that the dump routine has been improved in light of the various items now being suppressed with RevisionDelete, as it is being used to remove items which would previously have been removed via Oversight.
I'm asking because we did have problems with this on the toolserver.
-- John Vandenberg
On Mon, Mar 9, 2009 at 12:37 AM, John Vandenberg jayvdb@gmail.com wrote:
On Mon, Mar 9, 2009 at 6:09 PM, Andrew Garrett andrew@werdn.us wrote:
On Mon, Mar 9, 2009 at 2:36 PM, John Vandenberg jayvdb@gmail.com wrote:
Has the dumper been tweaked to remove all hidden revisions, including hidden usernames recently fixed in bug 17792?
That was a bug in contributions, not in deleted revisions. There is no reason to expect that that bug has any relevance to dumps.
I would like to see some affirmative statement that the dump routine has been improved in light of the various items now being suppressed with RevisionDelete, as it is being used to remove items which would previously have been removed via Oversight.
I'm asking because we did have problems with this on the toolserver.
There were changes to the dumper made in consideration of RevisionDelete. I don't know if anyone has checked to verify if the output is correctly hiding what it is supposed to hide or not.
-Robert Rohde
On Mon, Mar 9, 2009 at 6:57 PM, Robert Rohde rarohde@gmail.com wrote:
On Mon, Mar 9, 2009 at 12:37 AM, John Vandenberg jayvdb@gmail.com wrote:
On Mon, Mar 9, 2009 at 6:09 PM, Andrew Garrett andrew@werdn.us wrote:
On Mon, Mar 9, 2009 at 2:36 PM, John Vandenberg jayvdb@gmail.com wrote:
Has the dumper been tweaked to remove all hidden revisions, including hidden usernames recently fixed in bug 17792?
That was a bug in contributions, not in deleted revisions. There is no reason to expect that that bug has any relevance to dumps.
I would like to see some affirmative statement that the dump routine has been improved in light of the various items now being suppressed with RevisionDelete, as it is being used to remove items which would previously have been removed via Oversight.
I'm asking because we did have problems with this on the toolserver.
There were changes to the dumper made in consideration of RevisionDelete. I don't know if anyone has checked to verify if the output is correctly hiding what it is supposed to hide or not.
Lar, Kylu and I did limited checks on this a while ago using the dumps which had already been created, and all seemed OK. We didn't think to check the hiding of usernames.
-- John Vandenberg
wikitech-l@lists.wikimedia.org