The database dumps (http://download.wikimedia.org/backup-index.html) don't seem to have made any progress since 7/1. I realize they can appear stalled in the normal process (http://meta.wikimedia.org/wiki/Data_dumps#Schedule), but in the recent past (as far as I know) they have not been stalled this long without there being something actually wrong.
Are they indeed still stuck (http://lists.wikimedia.org/pipermail/wikitech-l/2008-July/038625.html)? And is there anything I (or other community members) can do about it?
Thank you for your time.
yegg@alum.mit.edu wrote:
The database dumps (http://download.wikimedia.org/backup-index.html) don't seem to have made any progress since 7/1. I realize they can appear stalled in the normal process (http://meta.wikimedia.org/wiki/Data_dumps#Schedule), but in the recent past (as far as I know) they have not been stalled this long without there being something actually wrong.
Are they indeed still stuck (http://lists.wikimedia.org/pipermail/wikitech-l/2008-July/038625.html)?
Yep.
And is there anything I (or other community members) can do about it?
Nope. We just gotta get in and unplug it when we have a moment. Right now it's tending to stick because we're still sharing space between upload backups and download dumps.
Still waiting on the new fileservers -- this server order has been stuck for a loooong time, and we're not very happy about it...
-- brion
The dumps are stuck again (7/19).
An additional problem is that when they are restarted, the pending order is changed, the code is not picking up the oldest *successful* dump to do. en.wikt was almost at the top when the last restart was done, and got moved to somewhere near the bottom
It is getting more painful for us, as we have dozens of tools that work from the XML dumps, and they are all now 6+ weeks out of date.
Maybe we could run en.wikt? Pretty please? Robert
Brion
For the sake of love, and all that is good and holy, what does it take to get an en.wikt dump?
Every time it gets stuck, and you "reset" it, we get "dumped" at least half-way down the queue.
And it seems to be stuck again ...
We can fix this, if there is some way I can possibly be allowed to help?
Best Regards, Robert
On Fri, Jul 11, 2008 at 3:20 PM, yegg@alum.mit.edu wrote:
The database dumps (http://download.wikimedia.org/backup-index.html) don't seem to have made any progress since 7/1. I realize they can appear stalled in the normal process (http://meta.wikimedia.org/wiki/Data_dumps#Schedule), but in the recent past (as far as I know) they have not been stalled this long without there being something actually wrong.
Are they indeed still stuck (http://lists.wikimedia.org/pipermail/wikitech-l/2008-July/038625.html)? And is there anything I (or other community members) can do about it?
Thank you for your time.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Robert Ullmann wrote:
Brion
For the sake of love, and all that is good and holy, what does it take to get an en.wikt dump?
When the big batch of servers we got in two days ago are all unpacked and have their disks installed, we can start shuffling some data around.
At that point, I'll have free disk space necessary to run dumps.
- -- brion
It is good that we will have new disks and it likely won't get stuck; but that doesn't address the primary problem of the length of time these things take. Let me try to be more constructive.
First thing is that the projects are hugely different in size. This causes a fundamental queuing problem: with n threads, and more than n huge tasks in the queue, the threads will all end up doing those. (we recently saw a number of days in which it was working on enwiki, frwiki, dewiki, and jawiki and nothing else). This can be fixed with a thread that is restricted to smaller tasks. Like in a market or a bank, with an express lane. (My bank has one teller only for deposits and withdrawals in 500s and 1000s notes, no other transactions.)
However there are other problems and opportunities. Each project does a number of minor tasks, and then 4 larger ones:
* main articles, current versions * all pages, current * all-history of all pages, bz2 compressed * all-history, re-compressed in 7z
For the enwiki, main articles takes (7/14 numbers) 10 hours, 30 min, all pages 16 hours, 10 min. All-history bz2 was estimated at 67 days when it got stuck. (would have been shorter, as that estimate was right at the start) For jawiki (7/24): main was 48 min, all pages 65 min, all history bz2 3 days, 18 hours, 7z 2 days, 12 hours.
Some observations then:
* the main articles dump is a subset of all pages. The latter might usefully be only a dump of all the pages *not* in the first. * alternatively, the process could dump the first, then copy the file and continue with the others for the second (yes, one has to be careful with the bz2 compression state) * or it could write both at the same time, saving the DB access time if not the compression time * the all-history dump might be only the 7z. Yes, it takes longer than the bz2, but direct to 7z will be much less total time. * alternatively, write both bz2 and 7z at the same time (if we must have the bz2, but I don't see why; methinks anyone would want the 7z) * make the all-history dump(s) separate tasks, in separate queue(s); without them the rest will go very well
Note that the all-history dumps are cumulative: each contains everything that was in the previous, plus all the new versions. We might reconsider whether we want those at all, or make each an incremental. (I'm not sure what these are for exactly) A dump that is taken over a several month period is also hardly a snapshot, from a DB integrity POV it is nearly useless. But no matter.
* the format of the all-history dump could be changed to store only differences (going backward from current) in each XML record * or a variant of the 7z compressor used that knows where to search for the matching strings, rather than a general search; it would then be *much* faster. (as it is an LZ77-class method, this doesn't change the decompressor logic)
Either of these last two would make the all-history dumps at least a couple of orders of magnitude faster.
best regards, Robert
2008/8/2 Robert Ullmann rlullmann@gmail.com
A dump that is taken over a several month period is also hardly a snapshot, from a DB integrity POV it is nearly useless. But no matter.
You could also take one or two DB slaves out of replication for the whole dump period to keep the database consistent and then, after the dump is finished, let it replicate again. Dunno though if that is possible with MySQL.
Marco
Robert Ullmann wrote:
It is good that we will have new disks and it likely won't get stuck; but that doesn't address the primary problem of the length of time these things take. Let me try to be more constructive.
It just paralellizes it ;)
First thing is that the projects are hugely different in size. This causes a fundamental queuing problem: with n threads, and more than n huge tasks in the queue, the threads will all end up doing those. (we recently saw a number of days in which it was working on enwiki, frwiki, dewiki, and jawiki and nothing else). This can be fixed with a thread that is restricted to smaller tasks. Like in a market or a bank, with an express lane. (My bank has one teller only for deposits and withdrawals in 500s and 1000s notes, no other transactions.)
Seems reasonable.
Some observations then:
- the main articles dump is a subset of all pages. The latter might usefully
be only a dump of all the pages *not* in the first.
- alternatively, the process could dump the first, then copy the file and
continue with the others for the second (yes, one has to be careful with the bz2 compression state)
If you mean what i think you mean, it won't work.
- or it could write both at the same time, saving the DB access time if not
the compression time
There's one snapshot of the DB for the articles content. All the metadata is extracted at one point (the stub-* files). Then it is filled with content from last full dump and getting new revisions from db.
- the all-history dump might be only the 7z. Yes, it takes longer than the
bz2, but direct to 7z will be much less total time.
- alternatively, write both bz2 and 7z at the same time (if we must have the
bz2, but I don't see why; methinks anyone would want the 7z)
AFAIK the 7z is reading the bz2. It's much easier to recompress on a different format than recreating the xml. Plus it's much less load on the db servers.
- make the all-history dump(s) separate tasks, in separate queue(s); without
them the rest will go very well
That could work. But note that the difference with metadata such as templatelinks will be even greater.
Note that the all-history dumps are cumulative: each contains everything that was in the previous, plus all the new versions. We might reconsider whether we want those at all, or make each an incremental. (I'm not sure what these are for exactly)
So you would need all dumps since January (the first full, then incremental) to get the status at August? It may be better or worse depending on what you'll do with the data.
A dump that is taken over a several month period is also hardly a snapshot, from a DB integrity POV it is nearly useless. But no matter.
See above. The history dump reflects the status at the beginning. You're getting through a month the contents on the history. There is a difference with the additional metadata, such as template and image usage. Not easy to fix if you wanted to, because even if you dumped them in the same transaction as the revision table, it will contain outdated information to be updated by the job queue.
- the format of the all-history dump could be changed to store only
differences (going backward from current) in each XML record
Has been proposed before for the db store. It was determined that there was little difference with just compressing. Moreover, it would make the process slower, as you would also need to diff the revisions. The worst case would be a history merge, where there're new intermediate revisions, so you need to recover the full contents of each revision (from db/undiffing the last dump) and diff it again.
- or a variant of the 7z compressor used that knows where to search for the
matching strings, rather than a general search; it would then be *much* faster. (as it is an LZ77-class method, this doesn't change the decompressor logic)
Could work. Are you volunteering to write it?
Either of these last two would make the all-history dumps at least a couple of orders of magnitude faster.
best regards, Robert
Knowing little about the current dump generation process, but some about terabyte-scale data handling (actually, we here are well into the petabyte range by now;-), how about this: * Set up the usual MySQL replication slave * At one point in time, disconnect it from the MySQL master, but leave it running in read-only mode * Use that as the dump base
This should result in a single-point-in-time snapshot. Also, it will reduce load to the rest of the system. Not sure if IDs will change internally, though.
Independent of that, * Run several parallel processes on several servers (assuming we have several) * Each process generates the complete history dump of a single article, or a small group of them, bz2ipped to save intermittent disk space * Success/faliure is checked, so each process can be rerun if needed * At the end, all these files are appended into a single bzip2/7zip file
This will need more diskspace while the entire thing is running, as small text files compress less well than larger ones. Also, it eats more CPU cycles, for starting all these processes, and then for re-bzip2ing the intermediate files.
But, it is a lot less error-prone (if a process or a bunch of them fail, just restart them), and it scales better (just throw more machines at it to make it faster; or use apaches during low-traffic hours). Individual processes should be less memory-intensive, so several of them can run on the same machine.
My 2c
Magnus
Magnus Manske wrote:
Knowing little about the current dump generation process, but some about terabyte-scale data handling (actually, we here are well into the petabyte range by now;-), how about this:
- Set up the usual MySQL replication slave
- At one point in time, disconnect it from the MySQL master, but leave
it running in read-only mode
- Use that as the dump base
This should result in a single-point-in-time snapshot.
Why? As i already said, the revision status is a snapshot, it's done in a transaction.
Also, it will reduce load to the rest of the system. Not sure if IDs will change internally, though.
IDs won't change, but you don't need the disconnected slave. Once you have the revisions, you will be querying external storage. That's where the load goes.
Independent of that,
- Run several parallel processes on several servers (assuming we have several)
- Each process generates the complete history dump of a single
article, or a small group of them, bz2ipped to save intermittent disk space
- Success/faliure is checked, so each process can be rerun if needed
- At the end, all these files are appended into a single bzip2/7zip file
The system we use is not exactly that. It's writing compressed data from the compressed reading of last dump and the revision snapshot. It's never using uncompressed data. The little processes would need to know where in the last file is the section they're doing. However, if you knew in which part it was in the old dump... it's worthwhile considering.
This will need more diskspace while the entire thing is running, as small text files compress less well than larger ones. Also, it eats more CPU cycles, for starting all these processes, and then for re-bzip2ing the intermediate files.
Not neccessarily. If the number of files per bzip2 group is large enough, there is almost no difference.
But, it is a lot less error-prone (if a process or a bunch of them fail, just restart them), and it scales better (just throw more machines at it to make it faster; or use apaches during low-traffic hours). Individual processes should be less memory-intensive, so several of them can run on the same machine.
My 2c
Magnus
We are talking very happily here, but what is slowing the dump process? Brion, Tim, there's some profiling information about that? Is it I/O waiting for the revisions fetched for external storage? Is it disk speed when reading/writing? Is it CPU for decompressing previous dump? Is it CPU for compressing? How is dbzip2 helping with it?*
*I thought you were using dbzip2, but i now see mw:Dbzip2 says "dbzip2 is not ready for public use yet" Has it been indefinitely postponed?
On Sat, Aug 2, 2008 at 9:49 PM, Platonides Platonides@gmail.com wrote:
Magnus Manske wrote:
Independent of that,
- Run several parallel processes on several servers (assuming we have several)
- Each process generates the complete history dump of a single
article, or a small group of them, bz2ipped to save intermittent disk space
- Success/faliure is checked, so each process can be rerun if needed
- At the end, all these files are appended into a single bzip2/7zip file
The system we use is not exactly that. It's writing compressed data from the compressed reading of last dump and the revision snapshot. It's never using uncompressed data. The little processes would need to know where in the last file is the section they're doing. However, if you knew in which part it was in the old dump... it's worthwhile considering.
Why is it using the old dump instead of the "real" storage? For performance reasons?
Does that mean that if there's an error in an old dump, it will stay there forever?
How does this cope with deleted revisions?
This will need more diskspace while the entire thing is running, as small text files compress less well than larger ones. Also, it eats more CPU cycles, for starting all these processes, and then for re-bzip2ing the intermediate files.
Not neccessarily. If the number of files per bzip2 group is large enough, there is almost no difference.
Yes. we'd have to find a balance between many fast processes with lots of overhead and few slow ones that, when failing, will set back the dump for weeks.
At work, I'm using a computing farm with several thousand cores, and the suggested time per process is < 2h. May be worth contemplating, even though the technical situation for Wikimedia is very much different.
Magnus
Magnus Manske wrote:
Why is it using the old dump instead of the "real" storage? For performance reasons?
Yes. It's nicer to fill the stub by reading the last dump. Most revisions are already there and in the order you will need them. If it's not, it's retrieved from external storage. From dumpTextPass.php usage: "Use a prior dump file as a text source, to save pressure on the database."
Does that mean that if there's an error in an old dump, it will stay there forever?
Only until the dump generation fails and a new one is created from scratch ;) Any reason for old dumps to be more corruptable than the db blobs?
How does this cope with deleted revisions?
The revision contents are read from the old dump, but the revisions and pages are read from the stub, created from db.
On Sat, Aug 02, 2008 at 09:01:26PM +0100, Magnus Manske wrote:
Knowing little about the current dump generation process, but some about terabyte-scale data handling (actually, we here are well into the petabyte range by now;-), how about this:
- Set up the usual MySQL replication slave
- At one point in time, disconnect it from the MySQL master, but leave
it running in read-only mode
- Use that as the dump base
This should result in a single-point-in-time snapshot. Also, it will reduce load to the rest of the system. Not sure if IDs will change internally, though.
That's roughly equivalent to what Phil Greenspun says that the "SQL studs" at Mass General Hospital do with their backups, though in their case it's breaking a RAID mirror rather than a replication.
Cheers, -- jra
wikitech-l@lists.wikimedia.org