Hello,
the current dump build seems to be dead and perhaps should be killed by hand.
Best regards
Andim
"Andreas Meier" andreasmeier80@gmx.de wrote in message news:4997D645.8050605@gmx.de...
Hello,
the current dump build seems to be dead and perhaps should be killed by hand.
Reported: https://bugzilla.wikimedia.org/show_bug.cgi?id=17535
What is with this? Why are the XML dumps (the primary product of the projects: re-usable content) the absolute effing lowest possible effing priority? Why?
I just finished (I thought) putting together some new software to update iwikis on the wiktionaries. It is set up to read the "langlinks" and "all-titles" part of the dumps. Just as I do that, the dumps fall down. Again. And no-one cares one whit; not even a reply here. (The bug was replied to after 4 days, and *might* be fixed presently, after 9 days?)
My course of action now is to write new code to use thousands of API calls to get the information, albeit as efficiently as I can. When I do that, the chance that it will ever go back to using the dumps is a very close approximation to zero. After all, it will work somewhat better that way.
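To be concrete, the kind of API call I mean looks like this -- a minimal sketch only, not the actual update code; the endpoint, the Python "requests" library and the batching here are illustrative assumptions, though action=query, prop=langlinks and lllimit are standard action-API parameters:

# Illustrative sketch only -- not the actual iwiki update code.
# Assumes the public action API endpoint and the Python "requests" library.
import requests

API = "https://en.wiktionary.org/w/api.php"   # assumed endpoint

def langlinks(title):
    """Yield (language, linked title) pairs for one page via the API."""
    params = {
        "action": "query", "prop": "langlinks", "titles": title,
        "lllimit": "max", "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            for ll in page.get("langlinks", []):
                yield ll["lang"], ll["*"]
        if "continue" not in data:
            break
        params.update(data["continue"])

for lang, linked in langlinks("water"):
    print(lang, linked)

Multiply that by every entry that needs checking and you can see why this ends up as thousands of calls where one langlinks dump file would have done.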
Other people, *many*, *many*, other people are being *forced* to do the same, to maintain their apps and functions based on the WMF data. And there is no chance in hell they will go back to the dump "service" either.
Brion, Tim, et al: you are worried about overall server load? Get the dumps working. This morning. And make it crystal clear that they will not break, and you will be checking them n times a day and they can be utterly, totally, absolutely relied upon.
It's like that. People will use what *works*.
Want people to use the dumps? Make them WORK.
Want everyone to just dynamically crawl the live DB, with whatever screwy lousy inefficiency? Fine, just continue as you are, where that is all that can be relied upon!
Look at the other threads: people asking if they can crawl the English WP at one per second, or maybe what? Is that what you want? That is what you are telling people to do, when the dump "service" says "2009-02-12 06:52:16 pswiki: Dump in progress" at the top on the 22nd of February.
FYI for all others: if you want content dumps of the English Wiktionary, they are available in the usual XML format at
http://devtionary.info/w/dump/xmlu/
at ~09:00 UTC, every day.
With my best regards, Robert
On Tue, Feb 17, 2009 at 7:35 PM, Russell Blau russblau@hotmail.com wrote:
"Andreas Meier" andreasmeier80@gmx.de wrote in message news:4997D645.8050605@gmx.de...
Hello,
the current dump build seems to be dead and perhaps should be killed by hand.
Reported: https://bugzilla.wikimedia.org/show_bug.cgi?id=17535
Hi,
Maybe I should offer a constructive suggestion?
Clearly, trying to do these dumps (particularly "history" dumps) as it is being done from the servers is proving hard to manage
I also realize that you can't just put the set of daily permanent-media backups on line, as they contain lots of user info, plus deleted and oversighted revs, etc.
But would it be possible to put each backup disc (before sending one of the several copies off to its secure storage) in a machine that would filter all the content into a public file (or files)? Then someone else could download each disc (i.e. a 10-15 GB chunk of updates) and sort it into the useful files for general download?
Then someone can produce a current (for example) English 'pedia XML file; and with more work the cumulative history files (if we want that as one file).
There would be delays: each of your permanent-media backup discs has to be loaded on the "filter" system (probably manually, though changers are available), and I don't know how many discs WMF generates per day (;-). Then it has to filter all the revision data, etc. But it would still easily be available to others in 48-72 hours, which beats the present ~6 weeks when the dumps are working.
No shortage of people with a box or two and any number of Tbyte hard drives that might be willing to help, if they can get the raw backups.
Best, Robert
Robert Ullmann wrote:
Hi,
Maybe I should offer a constructive suggestion?
They are better than rants :)
Clearly, trying to do these dumps (particularly "history" dumps) as it is being done from the servers is proving hard to manage
I also realize that you can't just put the set of daily permanent-media backups on line, as they contain lots of user info, plus deleted and oversighted revs, etc.
But would it be possible to put each backup disc (before sending one of the several copies off to its secure storage) in a machine that would filter all the content into a public file (or files)? Then someone else could download each disc (i.e. a 10-15 GB chunk of updates) and sort it into the useful files for general download?
I don't think they move backup copies off to secure storage. They have the db replicated, and the backup discs would just be copies of that same data. (Some sysadmin to confirm?)
Then someone can produce a current (for example) English 'pedia XML file; and with more work the cumulative history files (if we want that as one file).
There would be delays: each of your permanent-media backup discs has to be loaded on the "filter" system (probably manually, though changers are available), and I don't know how many discs WMF generates per day (;-). Then it has to filter all the revision data, etc. But it would still easily be available to others in 48-72 hours, which beats the present ~6 weeks when the dumps are working.
No shortage of people with a box or two and any number of Tbyte hard drives that might be willing to help, if they can get the raw backups.
The problem is that WMF can't provide that raw unfiltered information. Perhaps you could donate a box on the condition that it could only be used for dump processing, but giving out unfiltered data would be too risky.
Hoi, There have been previous offers for developer time and for hardware... Thanks, GerardM
2009/2/23 Platonides Platonides@gmail.com
Robert Ullmann wrote:
Hi,
Maybe I should offer a constructive suggestion?
They are better than rants :)
Clearly, trying to do these dumps (particularly "history" dumps) as it is being done from the servers is proving hard to manage
I also realize that you can't just put the set of daily permanent-media backups on line, as they contain lots of user info, plus deleted and oversighted revs, etc.
But would it be possible to put each backup disc (before sending one of the several copies off to its secure storage) in a machine that would filter all the content into a public file (or files)? Then someone else could download each disc (i.e. a 10-15 GB chunk of updates) and sort it into the useful files for general download?
I don't think they move backup copies off to secure storage. They have the db replicated, and the backup discs would just be copies of that same data. (Some sysadmin to confirm?)
Then someone can produce a current (for example) English 'pedia XML file; and with more work the cumulative history files (if we want that as one file).
There would be delays: each of your permanent-media backup discs has to be loaded on the "filter" system (probably manually, though changers are available), and I don't know how many discs WMF generates per day (;-). Then it has to filter all the revision data, etc. But it would still easily be available to others in 48-72 hours, which beats the present ~6 weeks when the dumps are working.
No shortage of people with a box or two and any number of Tbyte hard drives that might be willing to help, if they can get the raw backups.
The problem is that WMF can't provide that raw unfiltered information. Perhaps you could donate a box on the condition that it could only be used for dump processing, but giving out unfiltered data would be too risky.
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
So, I am very sorry for having needlessly held things up. (I also have a crawler that requests pages changed since the latest xml dump, so that projects I am on can keep a current xml file; we've been running that way for at least a year.)
Ariel
On 23-02-2009 (Mon), at 00:37 +0100, Gerard Meijssen wrote:
Hoi, There have been previous offers for developer time and for hardware... Thanks, GerardM
2009/2/23 Platonides Platonides@gmail.com
Robert Ullmann wrote:
Hi,
Maybe I should offer a constructive suggestion?
They are better than rants :)
Clearly, trying to do these dumps (particularly "history" dumps) as it is being done from the servers is proving hard to manage
I also realize that you can't just put the set of daily permanent-media backups on line, as they contain lots of user info, plus deleted and oversighted revs, etc.
But would it be possible to put each backup disc (before sending one of the several copies off to its secure storage) in a machine that would filter all the content into a public file (or files)? Then someone else could download each disc (i.e. a 10-15 GB chunk of updates) and sort it into the useful files for general download?
I don't think they move backup copies off to secure storage. They have the db replicated, and the backup discs would just be copies of that same data. (Some sysadmin to confirm?)
Then someone can produce a current (for example) English 'pedia XML file; and with more work the cumulative history files (if we want that as one file).
There would be delays: each of your permanent-media backup discs has to be loaded on the "filter" system (probably manually, though changers are available), and I don't know how many discs WMF generates per day (;-). Then it has to filter all the revision data, etc. But it would still easily be available to others in 48-72 hours, which beats the present ~6 weeks when the dumps are working.
No shortage of people with a box or two and any number of Tbyte hard drives that might be willing to help, if they can get the raw backups.
The problem is that WMF can't provide that raw unfiltered information. Perhaps you could donate a box on the condition that it could only be used for dump processing, but giving out unfiltered data would be too risky.
Ariel T. Glenn wrote:
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
So, I am very sorry for having needlessly held things up. (I also have a crawler that requests pages changed since the latest xml dump, so that projects I am on can keep a current xml file; we've been running that way for at least a year.)
Is the source for the new dump system on SVN somewhere?
yep, http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ +)
2009/2/23 Alex mrzmanwiki@gmail.com:
Ariel T. Glenn wrote:
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
So, I am very sorry for having needlessly held things up. (I also have a crawler that requests pages changed since the latest xml dump, so that projects I am on can keep a current xml file; we've been running that way for at least a year.)
Is the source for the new dump system on SVN somewhere?
-- Alex (wikipedia:en:User:Mr.Z-man)
Ariel T. Glenn wrote:
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. [...] The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
Hi Ariel, I hope you find the time and peace you need for this development. It might be a bit worrying if this was handed to you (by Brion? when?) without also handing you the necessary resources. But the internal organization there is not my task.
However, quite independent of your development work, the current system for dumps seems to have stopped on February 12. That's the impression I get from looking at http://download.wikimedia.org/backup-index.html
Despite all its shortcomings (3-4 weeks between dumps, no history dumps for en.wikipedia), the current dump system is very useful. What's not useful is that it was out of service from July to October 2008 and now again appears to be broken since February 12.
Certainly, things do fail. But when they do, and I ask about this on #wikimedia-tech on February 20, a week after things stopped, I don't expect Brion to say "oops". I want him to know about it 12 hours after it happened and to have a plan. Apparently (I'm just guessing from what I hear), srv31 is broken and srv31 was not in the Nagios watchdog system. OK, will this be fixed? When?
Still today, February 23, no explanation has been posted on that dump website or on these mailing lists. That's the real surprise.
I have other issues I want to deal with: mapping extensions, new visionary solutions, new ways to involve new people in creating free knowledge. But if basic planning, routines and resource allocation don't work inside the WMF, then we have to start with the basics. What's wrong there? How can it be helped?
"Lars Aronsson" lars@aronsson.se wrote in message news:Pine.LNX.4.64.0902231202140.1043@localhost.localdomain...
However, quite independent of your development work, the current system for dumps seems to have stopped on February 12. That's the impression I get from looking at http://download.wikimedia.org/backup-index.html
Despite all its shortcomings (3-4 weeks between dumps, no history dumps for en.wikipedia), the current dump system is very useful. What's not useful is that it was out of service from July to October 2008 and now again appears to be broken since February 12.
...
Still today, February 23, no explanation has been posted on that dump website or on these mailing lists. That's the real surprise.
I have to second this. I tried to report this outage several times last week - on IRC, on this mailing list, and on Bugzilla. All reports -- NOT COMPLAINTS, JUST REPORTS -- were met with absolute silence. I fully understand that time and resources are limited, and not everything can be fixed immediately, but at least some acknowledgement of the reports would be appreciated. It is extremely disheartening to members of the user community of what is supposed to be a collaborative project when attempts to contribute by reporting a service outage are ignored.
Russ
"Russell Blau" russblau@hotmail.com wrote in message news:gnuacf$hf0$1@ger.gmane.org...
I have to second this. I tried to report this outage several times last week - on IRC, on this mailing list, and on Bugzilla. All reports -- NOT COMPLAINTS, JUST REPORTS -- were met with absolute silence.
Two updates on this.
1) Brion did respond to the Bugzilla report (albeit two+ days after it was posted), which I overlooked when posting earlier. He said "The box they were running on (srv31) is dead. We'll reassign them over the weekend if we can't bring the box back up."
2) Within the last hour, the server log at http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker was tripped in the data center.
Russ
Thanks for the update Russell!
On Feb 23, 2009, at 10:04 AM, Russell Blau wrote:
"Russell Blau" russblau@hotmail.com wrote in message news:gnuacf$hf0$1@ger.gmane.org...
I have to second this. I tried to report this outage several times last week - on IRC, on this mailing list, and on Bugzilla. All reports -- NOT COMPLAINTS, JUST REPORTS -- were met with absolute silence.
Two updates on this.
- Brion did respond to the Bugzilla report (albeit two+ days after it was posted), which I overlooked when posting earlier. He said "The box they were running on (srv31) is dead. We'll reassign them over the weekend if we can't bring the box back up."
- Within the last hour, the server log at http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker was tripped in the data center.
Russ
Hmm:
On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau russblau@hotmail.com wrote:
- Within the last hour, the server log at http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker was tripped in the data center.
So we conclude that
Feb 12th: a breaker trips, taking four servers offline
(8 days go by, with a number of reports)
Feb 20th: it is noted that srv31 is down (noted that AC is off?)
(3 days go by)
Feb 23rd: the tripped breaker is found, srv31 restarted (and 8+ hours later, the dumps have not resumed)
Really? I mean is this for real?
The sequence ought to be something like: breaker trips, monitor shows within a minute or two that 4 servers are offline, and not scheduled to be. In the next 5 minutes someone looks at the server(s), notes that there is no AC power, walks directly to the panel and resets the breaker. How is this *not* done? I'm sorry, I just don't get it. I've run data centres, and it just is not possible to have servers down for AC power for more than a few minutes unless there is a fault one can't locate. (Or grid down, and running a subset on the generators ;-)
Can someone explain all this? Is the whole thing just completely beyond the resource available to manage it?
Best regards, Robert
On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann rlullmann@gmail.com wrote:
Really? I mean is this for real?
The sequence ought to be something like: breaker trips, monitor shows within a minute or two that 4 servers are offline, and not scheduled to be. In the next 5 minutes someone looks at the server(s), notes that there is no AC power, walks directly to the panel and resets the breaker. How is this *not* done? I'm sorry, I just don't get it. I've run data centres, and it just is not possible to have servers down for AC power for more than a few minutes unless there is a fault one can't locate. (Or grid down, and running a subset on the generators ;-)
Can someone explain all this? Is the whole thing just completely beyond the resource available to manage it?
Constructive suggestions for improvement are far more welcome than complaints and outrage.
If you have no suggestions for improvement, it is perhaps more prudent to express concern that dumps are not working and to wait for a response. This is admittedly less fun than piecing together information and "lining up" those responsible for something not being operational.
On Tue, Feb 24, 2009 at 6:49 AM, Andrew Garrett andrew@werdn.us wrote:
On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann rlullmann@gmail.com wrote:
Really? I mean is this for real?
The sequence ought to be something like: breaker trips, monitor shows within a minute or two that 4 servers are offline, and not scheduled to be. In the next 5 minutes someone looks at the server(s), notes that there is no AC power, walks directly to the panel and resets the breaker. How is this *not* done? I'm sorry, I just don't get it. I've run data centres, and it just is not possible to have servers down for AC power for more than a few minutes unless there is a fault one can't locate. (Or grid down, and running a subset on the generators ;-)
Can someone explain all this? Is the whole thing just completely beyond the resource available to manage it?
Constructive suggestions for improvement are far more welcome than complaints and outrage.
If you have no suggestions for improvement, it is perhaps more prudent to express concern that dumps are not working and to wait for a response. This is admittedly less fun than piecing together information and "lining up" those responsible for something not being operational.
Andrew: this is NOT FUN AT ALL. Do you think it is "fun" to have to complain bitterly and repeatedly because simply reporting critical-down problems elicits little or no reply and no corrective action for days and weeks? Fun? Fun?
Okay, I'll put it this way: the following should be done:
All servers should be monitored, on several levels (ping, various queries, checking processes)
Someone should be "watching" the monitor 24x7. (being right there, or by SMS, whatever ;)
When a server is reported down (in this case hard; won't reply to ping) it should be physically looked at within minutes.
If it has no AC power, the circuit breaker is the first thing to check.
When restarted, the things it was doing should be restarted (this has not been done yet at this writing).
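To put the monitoring items above in concrete terms, the core of it is nothing more exotic than a watchdog along these lines -- a toy sketch, not a substitute for Nagios; the host list and the alert hook are placeholders:

# Toy watchdog sketch; the real thing is what Nagios is for.
# The host names and the alert mechanism are placeholders.
import subprocess, time

HOSTS = ["srv31", "srv32", "srv33", "srv34"]

def is_up(host):
    """One ICMP ping; True if the host answers within two seconds."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ) == 0

def alert(host):
    # Placeholder: SMS, IRC message, page the on-call admin, whatever.
    print("ALERT: %s is not answering ping" % host)

while True:
    for host in HOSTS:
        if not is_up(host):
            alert(host)
    time.sleep(60)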
Now I can say these things as "constructive suggestions", but they are not, of course: they are fundamental operational procedures for a data centre. Please explain to me why I should have to "suggest" them? Eh? I am confused (seriously! I am not being snarky here). What is going on?
best, Robert
Let me ask a separate question (Ariel may be interested in this):
What if we took the regular permanent media backups, and WMF filtered them in house just to remove the classified stuff (;-), and then put them somewhere where others could convert them to the desired format(s)? (Build all-history files, whatever.)
What is the standard backup procedure?
(I ask as I haven't seen any description or reference to it ... :-)
Robert
Robert Ullmann wrote:
All servers should be monitored, on several levels (ping, various queries, checking processes)
Nagios should have been monitoring them.
Someone should be "watching" the monitor 24x7. (being right there, or by SMS, whatever ;)
I don't know if there can be a Nagios "silent" failure, where it doesn't get disconnected from IRC.
When restarted, the things it was doing should be restarted (this has not been done yet at this writing).
The worrying bit is that it seems srv136 will now be working as an Apache server. So, where will the dumps be done?
The worrying bit is that it seems srv136 will now be working as an Apache server. So, where will the dumps be done?
I'm not sure where (or if it has changed), but they are running now .... (:-)
To Ariel Glenn:
On getting them to work better in the future, this is what I would suggest:
First, note that everything except the "all history" dumps presents no problem. It isn't perfect, but it is workable. The biggest "all pages current" dump is enwiki, which takes about a day and a half, and the compressed output file (bz2) still fits neatly on a DVD.
As to the history files, these are the problem; each contains all of the preceding history and they just grow and grow. They must be partitioned somehow. Suggestions have been made concerning alphabetical partitions (very traditional for encyclopaedias ;-); you yourself suggested page id.
I suggest the history be partitioned into "blocks" by *revision ID*
Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in "block 1", and so on. The English Wiktionary at the moment would have 7 blocks; the English Wikipedia would have 273.
The dumps would continue as now up to "all pages current", including the split-stub dump for the history (very important, as it provides the "snapshot" of the DB state). But then when it gets to history, it re-builds the last block done (possibly completing it), and then writes 0-n new ones as needed.
Note that (to pick a random number) "block 71" of the enwiki defined this way *has not changed* in a long time; only the current block(s) need to be (re-)written. The history stays the same. (Of course?!)
If someone somewhere needs a copy of the wiki with all history as of a given date, they can start with the split-stub for that date and read in all the required blocks. But that isn't your problem any more. (;-) They can do that with their disk and servers.
It would probably be best to still sort by page-id order within each block, as they will compress much better that way.
One reason to rebuild the last block (or two) is to filter out deleted and oversighted revisions. Deleted and oversighted revisions older than some specific time (a small number of weeks) would remain. But note that that is true *anyway*, as someone can always look at a 3-month old dump under any method.
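For the bookkeeping this implies, a minimal sketch -- the block size and the two-block rebuild window are assumptions for illustration; nothing more is needed to decide which block files a given run has to rewrite:

# Sketch of the revision-ID block scheme described above.
# BLOCK_SIZE and REBUILD_LAST are illustrative assumptions.
BLOCK_SIZE = 1000000      # revision IDs 0-999,999 -> block 0, and so on
REBUILD_LAST = 2          # redo the newest block(s) to pick up deletions/oversight

def block_of(rev_id):
    """Block number a revision ID falls into."""
    return rev_id // BLOCK_SIZE

def blocks_to_write(prev_max_rev, current_max_rev):
    """Blocks a new dump run has to (re)build; everything older is untouched."""
    first = max(0, block_of(prev_max_rev) - (REBUILD_LAST - 1))
    last = block_of(current_max_rev)
    return list(range(first, last + 1))

# Example: if enwiki grew from revision 272,000,000 to 273,500,000 since the
# previous run, only blocks 271-273 are rewritten; blocks 0-270 are reused.
print(blocks_to_write(272000000, 273500000))   # [271, 272, 273]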
With my best regards, Robert
2009/2/25 Robert Ullmann rlullmann@gmail.com:
I suggest the history be partitioned into "blocks" by *revision ID*
Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in "block 1", and so on. The English Wiktionary at the moment would have 7 blocks; the English Wikipedia would have 273.
One problem with that is that you won't get such good compression ratios. Most of the revisions of a single article are very similar to the revisions before and after it, so they compress down very small. If you break up the articles between different blocks you don't get that advantage (at least, not to the same extent).
But the server space saved by compression would be compensated for by the stability and flexibility provided by this method. This would allow whatever server is controlling the dump process to designate and delegate parallel processes for the same dump, so block 1 could be on server 1 and block 2 could be on server 3. That would give the flexibility to use as many servers as are available for this task more efficiently. If block 200 of en.wp breaks for some reason, you don't have to rebuild the previous 199 blocks; you can just delegate a server to rebuild that single block. That would make the dump process a little more crash-friendly (even though I know we don't want to admit crashes happen :) ). This would also enable the dump time in future runs to be cut drastically. I'd recommend a block size of either 10M revisions or 10% of the database, whichever is larger, for new dumps, to screen out a majority of the deletions. What are your thoughts on this process, Brion (and the rest of the tech team)?
Betacommand
On Wed, Feb 25, 2009 at 9:00 AM, Thomas Dalton thomas.dalton@gmail.comwrote:
2009/2/25 Robert Ullmann rlullmann@gmail.com:
I suggest the history be partitioned into "blocks" by *revision ID*
Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in "block 1", and so on. The English Wiktionary at the moment would have 7 blocks; the English Wikipedia would have 273.
One problem with that is that you won't get such good compression ratios. Most of the revisions of a single article are very similar to the revisions before and after it, so they compress down very small. If you break up the articles between different blocks you don't get that advantage (at least, not to the same extent).
2009/2/25 John Doe phoenixoverride@gmail.com:
But the server space saved by compression would be compensated for by the stability and flexibility provided by this method.
True, I didn't mean to say it was a bad idea, I was just pointing out one disadvantage you may not have considered.
2009/2/25 John Doe phoenixoverride@gmail.com:
Id recommend either 10m or 10% of the database which ever is larger for new dumps to screen out a majority of the deletions. what are your thoughts on this process brion (and the rest of the tech team)?
Another idea: if $revision is deleted/oversighted/however-made-invisible, then find out the block ID for the dump so that only this specific block needs to be re-created in the next dump run. Or better: do not recreate the dump block, but only remove the offending revision(s) from it. Should save a lot of dump preparation time, IMO.
Marco
Marco Schuster wrote:
Another idea: if $revision is deleted/oversighted/however-made-invisible, then find out the block ID for the dump so that only this specific block needs to be re-created in the next dump run. Or better: do not recreate the dump block, but only remove the offending revision(s) from it. Should save a lot of dump preparation time, IMO.
Marco
That's already done. New dumps insert the content from the previous ones (when available; enwiki has a hard time with it).
On Thu, Feb 26, 2009 at 5:08 AM, John Doe phoenixoverride@gmail.com wrote:
But the server space saved by compression would be compensated for by the stability and flexibility provided by this method. This would allow whatever server is controlling the dump process to designate and delegate parallel processes for the same dump.
Not nearly -- we're talking about a 100-fold decrease in compression ratio if we don't compress revisions of the same page adjacent to one another.
-- Andrew Garrett
Hi,
On Thu, Feb 26, 2009 at 2:29 AM, Andrew Garrett andrew@werdn.us wrote:
On Thu, Feb 26, 2009 at 5:08 AM, John Doe phoenixoverride@gmail.com wrote:
But the server space saved by compression would be compensated for by the stability and flexibility provided by this method. This would allow whatever server is controlling the dump process to designate and delegate parallel processes for the same dump.
Not nearly -- we're talking about a 100-fold decrease in compression ratio if we don't compress revisions of the same page adjacent to one another.
-- Andrew Garrett
No, not nearly that bad. Keep in mind that ~10x of the compression is just from having English text and repeated XML tags, etc. (Note the compression ratio of the all-articles dump, which has only one revision of each article.)
If the revisions in each "block" are sorted by pageid, so that the revs of the same article are together, you'll get a very large part of the other 10x factor. Revisions to pages tend to cluster in time (think edits and reverts :-) as one or more people work on an article, or it is of news interest (see "Slumdog Millionaire" ;-) or whatever. You can see this for any given article, like this:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimi...
Look at the first three digits of the revid: when they are the same, the revisions would be in the same "block" (this is assuming 1M revs/block as I suggested). You can check any title you like (remember _ for space, and % escapes for a lot of characters, but a good browser will do that for you in most cases). Since the majority of edits are made to a minority of titles (some version of the 80/20 rule applies), most edits/revisions will be in the same block as a number of others for that page.
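Here is a throwaway way to run that check for any title -- illustrative only; it just asks the action API for revision IDs (prop=revisions, rvprop=ids, which are standard parameters) and counts how many land in each 1M-revision block; the endpoint and the "requests" library are assumptions:

# Quick check of how one page's revisions cluster into 1M-revision blocks.
# Illustrative sketch; assumes the public action API and the "requests" library.
from collections import Counter
import requests

API = "https://en.wikipedia.org/w/api.php"

def rev_blocks(title, block_size=1000000):
    params = {
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "ids", "rvlimit": "max", "format": "json",
    }
    counts = Counter()
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            for rev in page.get("revisions", []):
                counts[rev["revid"] // block_size] += 1
        if "continue" not in data:
            break
        params.update(data["continue"])
    return counts

# Most revisions of a heavily edited page share a handful of blocks.
for block, n in sorted(rev_blocks("Slumdog Millionaire").items()):
    print("block %d: %d revisions" % (block, n))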
So we will get most, but not all, of the other 10X compression ratio.
But even if the compressed blocks are (say) 20% bigger, the win is that once they are some weeks old, they NEVER need to be re-built. Each dump (which should then be about weekly, with the same compute resource, as the queue runs faster ;-) need only build or re-build a few blocks. (And there is no need at all to parallelize any given dump, just run 3-5 different ones in parallel as now.)
best, Robert
Robert Ullmann wrote:
Look at the first three digits of the revid: when they are the same, the revisions would be in the same "block" (this is assuming 1M revs/block as I suggested). You can check any title you like (remember _ for space, and % escapes for a lot of characters, but a good browser will do that for you in most cases). Since the majority of edits are made to a minority of titles (some version of the 80/20 rule applies), most edits/revisions will be in the same block as a number of others for that page.
Not only do you need to keep them in the same block. You also need to keep them inside the compression window. Unless you are going to reorder those 1M revisions to keep revisions to the same article together.
On Thu, Feb 26, 2009 at 4:48 PM, Platonides Platonides@gmail.com wrote:
Not only do you need to keep them in the same block. You also need to keep them inside the compression window. Unless you are going to reorder those 1M revisions to keep revisions to the same article together.
He already said that should be done (each block clustered by page id).
On Tue, Feb 24, 2009 at 5:09 PM, Robert Ullmann rlullmann@gmail.com wrote: <snip>
I suggest the history be partitioned into "blocks" by *revision ID*
Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in "block 1", and so on. The English Wiktionary at the moment would have 7 blocks; the English Wikipedia would have 273.
<snip>
Though there are arguments in favor of this, I think they are outweighed by the fact that one would need to go through every block in order to reconstruct the history of even a single page. In my opinion partitioning on page id is a much better idea since it would keep each page's history in a single place.
-Robert Rohde
--- On Wed, 25/2/09, Robert Ullmann rlullmann@gmail.com wrote:
From: Robert Ullmann rlullmann@gmail.com Subject: Re: [Wikitech-l] Dump processes seem to be dead To: "Wikimedia developers" wikitech-l@lists.wikimedia.org Date: Wednesday, 25 February 2009, 2:09
you yourself suggested page id.
I suggest the history be partitioned into "blocks" by *revision ID*
I've checked some alternatives for slicing the huge dump files into chunks of a more manageable size. I first thought about dividing the blocks by rev_id, like you suggest. Then I realized that it can pose some problems for parsers recovering information, since revisions corresponding to the same page may fall in different dump files.
Once you have passed the page_id tag, you cannot recover it if the process stops due to some error, unless you save breakpoint information so you can pick up from there when you restart the process.
Partitioning by page_id, you can keep all revs of the same page in the same block, while not disturbing algorithms looking for individual revisions.
Yes, the chunks would be slightly bigger, but the difference is not that much with either 7zip or bzip2, and you favor simplicity in the recovery tools.
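For what it's worth, a minimal sketch of such a page_id split over an export stream -- toy code only; the export namespace URI, the input file name and the block size are assumptions, and the output chunks are not schema-complete dump files:

# Toy sketch: route pages from an XML export stream into chunk files by page_id.
# The export namespace, file names and block size are illustrative assumptions.
import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.3/}"   # adjust to the dump's schema version
BLOCK = 500000                                      # pages per chunk, arbitrary

chunks = {}
def chunk_file(page_id):
    n = page_id // BLOCK
    if n not in chunks:
        chunks[n] = open("pages-chunk-%04d.xml" % n, "wb")
    return chunks[n]

with bz2.open("pages-meta-history.xml.bz2", "rb") as src:
    for event, elem in ET.iterparse(src, events=("end",)):
        if elem.tag == NS + "page":
            page_id = int(elem.find(NS + "id").text)   # first <id> under <page> is the page id
            chunk_file(page_id).write(ET.tostring(elem))
            elem.clear()                               # drop the revisions already written

for f in chunks.values():
    f.close()

All revisions of a page stay in one chunk, and a crashed run can restart from the last page_id it wrote instead of starting over.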
Best,
F.
2009/2/24 Robert Ullmann rlullmann@gmail.com:
When a server is reported down (in this case hard; won't reply to ping) it should be physically looked at within minutes.
Is there anyone within minutes of the servers at all times? Aren't they at a remote data centre?
On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Is there anyone within minutes of the servers at all times? Aren't they at a remote data centre?
Isn't Rob on-site?
2009/2/24 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Is there anyone within minutes of the servers at all times? Aren't they at a remote data centre?
Isn't Rob on-site?
He's based somewhere near the data centre, but I'm not sure he's actually there unless there is something which needs his attention. He's certainly not there 24/7 (regrettably, WMF is still using human sysadmins...).
Hoi, Is there also a "Rob" in Amsterdam and Seoul? Thanks, GerardM
2009/2/24 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Is there anyone within minutes of the servers at all times? Aren't they at a remote data centre?
Isn't Rob on-site?
AFAIK there are "hands" in Amsterdam that can be called upon to do stuff as necessary in the centre, like any other hosting customer, but the need is not quite of the same level as Tampa due to the size, servers there, etc. Seoul no longer operates, so this is not an issue.
regards
mark
On Tue, Feb 24, 2009 at 2:55 PM, Gerard Meijssen gerard.meijssen@gmail.comwrote:
Hoi, Is there also a "Rob" in Amsterdam and Seoul? Thanks, GerardM
2009/2/24 Aryeh Gregor Simetrical+wikilist@gmail.com:
On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
Is there anyone within minutes of the servers at all times? Aren't they at a remote data centre?
Isn't Rob on-site?
Most of that hasn't been touched in years, and it seems to be mainly a Python wrapper around the dump scripts in /phase3/maintenance/ which also don't seem to have had significant changes recently. Has anything been done recently (in a very broad sense of the word)? Or at least, has anything been written down about what the plans are?
Nicolas Dumazet wrote:
yep, http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ +)
2009/2/23 Alex mrzmanwiki@gmail.com:
Ariel T. Glenn wrote:
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
So, I am very sorry for having needlessly held things up. (I also have a crawler that requests pages changed since the latest xml dump, so that projects I am on can keep a current xml file; we've been running that way for at least a year.)
Is the source for the new dump system on SVN somewhere?
-- Alex (wikipedia:en:User:Mr.Z-man)
On Mon, Feb 23, 2009 at 11:08 AM, Alex mrzmanwiki@gmail.com wrote:
Most of that hasn't been touched in years, and it seems to be mainly a Python wrapper around the dump scripts in /phase3/maintenance/ which also don't seem to have had significant changes recently. Has anything been done recently (in a very broad sense of the word)? Or at least, has anything been written down about what the plans are?
In a "very broad sense" (and not directly connected to main problems), I wrote a compressor [1] that converts full-text history dumps into an "edit syntax" that provides ~95% compression on the larger dumps while keeping it in a plain text format that could still be searched and processed without needing a full decompression.
That's one of several ways to modify the way dump process operates in order to make the output easier to work with (if it takes ~2 TB to expand enwiki's full history, then that is not practical for most users even if we solve the problem of generating it). It is not necessarily true that my specific technology is the right answer, but various changes in formatting to aid distribution, generation, and use are one of the areas that ought to be considered when reimplementing the dump process.
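As a hedged toy illustration of the general idea (not the actual editsyntax tool linked below): keep the first revision whole and store later ones as plain-text diffs, which stay searchable while dropping most of the redundancy. The names and the tiny example data here are made up:

# Toy illustration of "edit syntax" style storage: full text for the first
# revision, unified diffs for the rest. Not the actual editsyntax tool.
import difflib

def to_edit_syntax(revisions):
    """revisions: list of (rev_id, text) in order. Returns one plain-text blob."""
    out = []
    prev_lines = []
    for i, (rev_id, text) in enumerate(revisions):
        lines = text.splitlines(keepends=True)
        if i == 0:
            out.append("== full %s ==\n" % rev_id)
            out.extend(lines)
        else:
            out.append("== diff %s ==\n" % rev_id)
            out.extend(difflib.unified_diff(prev_lines, lines, lineterm="\n"))
        prev_lines = lines
    return "".join(out)

revs = [
    (100, "The cat sat on the mat.\n"),
    (101, "The cat sat on the mat.\nIt purred.\n"),
    (102, "The big cat sat on the mat.\nIt purred.\n"),
]
print(to_edit_syntax(revs))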
The largest gains are almost certainly going to be in parallelization though. A single monolithic dumper is impractical for enwiki.
-Robert Rohde
[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/editsyntax/
Robert Rohde wrote:
The largest gains are almost certainly going to be in parallelization though. A single monolithic dumper is impractical for enwiki.
-Robert Rohde
Using dumps compressed in blocks, like the ones I used for http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html would allow several processes/computers to write the same dump at different offsets, and to read from the last one at different positions as well.
As sharing a transaction between different servers would be tricky, they should probably dump from the previously dumped page.sql.gz.
<whisper>Patches on bugs 16082 and 16176 to add Export features are awaiting review</whisper>
Ariel,
Thank you for giving some insight into what has been going on behind the scenes. I have a few questions that will hopefully get some answers to those of us eager to help out in any way we can.
What are the planned code changes to speed the process up? Can we help this volunteer with the coding or architectural decisions? How much time do they have to dedicate to it? Some visibility into the fix and timeline would benefit a lot of us. It would also help us know how we can help out!
Thanks again for shedding some light on the issue.
On Feb 22, 2009, at 8:12 PM, Ariel T. Glenn wrote:
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
So, I am very sorry for having needlessly held things up. (I also have a crawler that requests pages changed since the latest xml dump, so that projects I am on can keep a current xml file; we've been running that way for at least a year.)
Ariel
On 23-02-2009 (Mon), at 00:37 +0100, Gerard Meijssen wrote:
Hoi, There have been previous offers for developer time and for hardware... Thanks, GerardM
2009/2/23 Platonides Platonides@gmail.com
Robert Ullmann wrote:
Hi,
Maybe I should offer a constructive suggestion?
They are better than rants :)
Clearly, trying to do these dumps (particularly "history" dumps) as it is being done from the servers is proving hard to manage
I also realize that you can't just put the set of daily permanent-media backups on line, as they contain lots of user info, plus deleted and oversighted revs, etc.
But would it be possible to put each backup disc (before sending one of the several copies off to its secure storage) in a machine that would filter all the content into a public file (or files)? Then someone else could download each disc (i.e. a 10-15 GB chunk of updates) and sort it into the useful files for general download?
I don't think they move backup copies off to secure storage. They have the db replicated, and the backup discs would just be copies of that same data. (Some sysadmin to confirm?)
Then someone can produce a current (for example) English 'pedia XML file; and with more work the cumulative history files (if we want that as one file).
There would be delays: each of your permanent-media backup discs has to be loaded on the "filter" system (probably manually, though changers are available), and I don't know how many discs WMF generates per day (;-). Then it has to filter all the revision data, etc. But it would still easily be available to others in 48-72 hours, which beats the present ~6 weeks when the dumps are working.
No shortage of people with a box or two and any number of Tbyte hard drives that might be willing to help, if they can get the raw backups.
The problem is that WMF can't provide that raw unfiltered information. Perhaps you could donate a box on the condition that it could only be used for dump processing, but giving out unfiltered data would be too risky.
2009/2/23 Ariel T. Glenn ariel@wikimedia.org:
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
In that case, it seems the mistake was assigning what should have been a top-priority task to someone who couldn't actually make it their top priority due to other commitments. If someone is unable to guarantee that they'll have time to do something, they shouldn't be assigned something so time critical.
On 23-02-2009 (Mon), at 19:02 +0000, Thomas Dalton wrote:
2009/2/23 Ariel T. Glenn ariel@wikimedia.org:
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team.
The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out.
In that case, it seems the mistake was assigning what should have been a top-priority task to someone who couldn't actually make it their top priority due to other commitments. If someone is unable to guarantee that they'll have time to do something, they shouldn't be assigned something so time critical.
I asked for it, and that's why it was assigned to me. I should have recognized much sooner that I could not actually get it done and should have brought this to Brion's attention instead of continuing to hang on to it after he brought it to my attention.
Ariel
On 2/23/09 12:13 PM, Ariel T. Glenn wrote:
I asked for it, and that's why it was assigned to me. I should have recognized much sooner that I could not actually get it done and should have brought this to Brion's attention instead of continuing to hang on to it after he brought it to my attention.
I've been needing to reprioritize resources for this for a while; all of us having many other things to do at the same time and lots of folks being out sick during cold/flu season may not sound like a good excuse for this dragging on longer than I'd like the last few weeks, but I'm afraid it's the best I can offer at the moment.
Anyway, rest assured that this remains very much on my mind -- we haven't forgotten that the current dump process sucks and needs to be fixed up.
-- brion
Robert Ullmann:
What is with this?
Wrong list. The Foundation needs to allocate the resources to fix dumps. It hasn't done so; therefore the dumps are still broken. Perhaps you might ask the Foundation why dumps have such a low priority.
- river.
2009/2/22 Robert Ullmann rlullmann@gmail.com:
Want everyone to just dynamically crawl the live DB, with whatever screwy lousy inefficiency? Fine, just continue as you are, where that is all that can be relied upon!
Even if you had the dumps, you have another problem: They're incredibly big and so a bit difficult to parse. So, a small suggestion if the dumps will ever be workin' again: Split the history and current db stuff by alphabet, please.
Marco
PS: Are there any measurements of the traffic generated by people who download the dumps? Have there been any attempts to distribute them via BitTorrent?
On 2/23/09 3:08 AM, Marco Schuster wrote:
Even if you had the dumps, you have another problem: They're incredibly big and so a bit difficult to parse. So, a small suggestion if the dumps will ever be workin' again: Split the history and current db stuff by alphabet, please.
Define alphabet -- how should Chinese and Japanese texts be broken up?
We're much more likely to break them up simply by page ID.
PS: Are there any measurements of the traffic generated by people who download the dumps?
Not currently.
Have there been any attempts to distribute them via BitTorrent?
By third parties, with AFAIK very little usage.
-- brion