Why was the newest copy of enwiki with the full history removed from the downloads site? I checked around and was only able to find one place with it: http://www.archive.org/details/enwiki-20080103
You'll want the "enwiki-20080103-pages-meta-history.xml.7z" file, which is about 17GB. There is another file that is 130GB, but that is the SAME thing, just compressed with bz2 instead of 7z, making it larger, so don't get that one.
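If you want to work with it without unpacking a couple hundred GB of XML to disk first, something like this works (just a sketch; it assumes the 7z command-line tool is installed and only counts pages):

# Stream-decompress the 7z dump and parse the XML on the fly, so the
# full uncompressed history never has to hit the disk.
import subprocess
import xml.etree.ElementTree as ET

DUMP = "enwiki-20080103-pages-meta-history.xml.7z"

# "7z e -so" extracts to stdout, which we feed straight into the XML parser.
proc = subprocess.Popen(["7z", "e", "-so", DUMP], stdout=subprocess.PIPE)

pages = 0
for event, elem in ET.iterparse(proc.stdout, events=("end",)):
    if elem.tag.rsplit("}", 1)[-1] == "page":
        pages += 1
        elem.clear()  # drop the finished <page> element to keep memory flat
print("pages:", pages)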
Tomasz, I am willing to volunteer my services as a programmer to help with this problem of making full-history enwiki dumps, if it is possible (I can't donate hardware/money). What are the issues that are causing it to be so slow, and what methods are you employing to improve it?
I know that LiveJournal has some sort of live backup system using MySQL and Perl, but couldn't find any details on their presentations. You might be able to ask one of their developers for help, on their LJ blog. Can Wikimedia afford a snapshot server? It doesn't need to be as fast as the others.
In the long run, whatever this system is, it will probably need to be integrated into some sort of backup, because it would be a huge pain if something happened at the data center and you needed to restore from the partial quasi-backups in the current systems.
How does the current dump method work? Are they incremental in the sense that they build up on previous dumps, instead of re-dumping all of the data?
For future dumps, we might have to resort to some form of snapshot server that is fed all updates either from memcaches or mysqls. This allows for a live backup to be performed, so it's useful for not just dumps.
Is it possible to suspend any individual slaves temporarily during off-peak hours to flush the database to disk and then copy the database files to another computer? If not, we may still be able to use "stale" database files copied to another computer, as long as we only use data from them that is at least a few days old, so we know it's been flushed to disk (I'm not sure how mysql flushes the data...).
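Roughly, I'm picturing something like this (just a sketch; host names, credentials and paths here are made up, since I don't know the real setup):

# Pause a dedicated MySQL slave, flush it, copy the raw files, resume.
import subprocess
import mysql.connector  # MySQL Connector/Python, as one way to talk to the slave

conn = mysql.connector.connect(host="db-backup", user="backup", password="...")
cur = conn.cursor()

cur.execute("STOP SLAVE")                   # stop applying replication updates
cur.execute("FLUSH TABLES WITH READ LOCK")  # flush everything and block writes

try:
    # Copy the quiesced database files to another machine.
    subprocess.check_call(["rsync", "-a", "/var/lib/mysql/",
                           "backuphost::mysql-snapshots/"])
finally:
    cur.execute("UNLOCK TABLES")
    cur.execute("START SLAVE")              # resume replication
    cur.close()
    conn.close()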
Of course, this may all be totally off, since I don't know a lot about the current configuration and issues, so I'll take whatever input you have to help work on something better.
While no expert, I'll try to clarify what I can.
Nathan J. Yoder wrote:
I know that LiveJournal has some sort of live backup system using MySQL and Perl, but couldn't find any details on their presentations. You might be able to ask one of their developers for help, on their LJ blog. Can Wikimedia afford a snapshot server? It doesn't need to be as fast as the others.
In the long run, whatever this system is, it will probably need to be integrated into some sort of backup, because it would be a huge pain if something happened at the data center and you needed to restore from the partial quasi-backups in the current systems.
The databases are replicated, so if the master db died, recovering would just be a matter of flipping a switch to promote a slave to master (meanwhile the sites would be read-only). Even if the whole data center exploded, the databases are replicated in Europe (that may not include private data; can someone shed some light on whether they are replicated in Europe, at the office, or nowhere?).
How does the current dump method work? Are they incremental in the sense that they build up on previous dumps, instead of re-dumping all of the data?
Yes. I don't know what was changed by Tomasz, but I doubt he modified that. The new dumps read the latest existing dump and query the mysql ES servers only for new page content.
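In other words, something like this sketch (the names are illustrative, not the real dump scripts):

# Reuse revision text already present in the previous dump; only hit the
# external storage databases for revisions added since that dump.
def revision_text(rev_id, prev_dump_index, es_fetch):
    text = prev_dump_index.get(rev_id)   # text from the previous dump, if any
    if text is not None:
        return text                      # already dumped, just reuse it
    return es_fetch(rev_id)              # new revision: query external storage

def build_dump(all_rev_ids, prev_dump_index, es_fetch, out):
    for rev_id in all_rev_ids:           # revision ids come from the core db
        out.write(revision_text(rev_id, prev_dump_index, es_fetch))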
For future dumps, we might have to resort to some form of snapshot server that is fed all updates either from memcaches or mysqls. This allows for a live backup to be performed, so it's useful for not just dumps.
I don't see how memcaches could be used. A server being fed from the mysqls is just another mysql slave, so that's available now.
Is it possible to suspend any individual slaves temporarily during off-peak hours to flush the database to disk and then copy the database files to another computer? If not, we may still be able to use "stale" database files copied to another computer, as long as we only use data from them that is at least a few days old, so we know it's been flushed to disk (I'm not sure how mysql flushes the data...).
Sure. But database files can't be published (they contain private data) so it's only good for internal backup.
Of course, this may all be totally off, since I don't know a lot about the current configuration and issues, so I'll take whatever input you have to help work on something better.
Asking is the first step :)
Nathan J. Yoder wrote:
Why was the newest copy of enwiki with the full history removed from the downloads site? I checked around and was only able to find one place with it: http://www.archive.org/details/enwiki-20080103
We almost filled the disk on the storage cluster and needed to purge older snapshots. Tim ran a purge on all 2008 snapshots, which is why you don't see them anymore. Thankfully I have archive copies of several wikis saved, including en, that can be restored.
You'll want the "enwiki-20080103-pages-meta-history.xml.7z" file, which is about 17GB. There is another file that is 130GB, but that is the SAME thing, just compressed with bz2 instead of 7z, making it larger, so don't get that one.
Tomasz, I am willing to volunteer my services as a programmer to help with this problem of making full-history enwiki dumps, if it is possible (I can't donate hardware/money). What are the issues that are causing it to be so slow, and what methods are you employing to improve it?
Currently, pulling page text is really, really slow. Even spinning up multiple instances of pullers doesn't really help us much.
I know that LiveJournal has some sort of live backup system using MySQL and Perl, but couldn't find any details on their presentations. You might be able to ask one of their developers for help, on their LJ blog. Can Wikimedia afford a snapshot server? It doesn't need to be as fast as the others.
Very cool. They're right down the street from us, so perhaps a contact could be made.
In the long run, whatever this system is, it will probably need to be integrated into some sort of backup, because it would be a huge pain if something happened at the data center and you needed to restore from the partial quasi-backups in the current systems.
Possibly. We're looking at many different ways of doing backups: mysql slaves, snapshots & xml. Plus we're adding offsite backup to make our emergency recovery even better.
How does the current dump method work? Are they incremental in the sense that they build up on previous dumps, instead of re-dumping all of the data?
Each full history snapshot first checks to see if a previous one has run and only does new work.
For future dumps, we might have to resort to some form of snapshot server that is fed all updates either from memcaches or mysqls. This allows for a live backup to be performed, so it's useful for not just dumps.
Possibly, but the crux of it is simply the page text from external storage. Fetching the meta content, while lengthy, is very short in the grand scheme of things.
Is it possible to suspend any individual slaves temporarily during off-peak hours to flush the database to disk and then copy the database files to another computer? If not, we may still be able to use "stale" database files copied to another computer, as long as we only use data from them that is at least a few days old, so we know it's been flushed to disk (I'm not sure how mysql flushes the data...).
Spinning down a slave won't help us much since external storage is the slowdown. But mirroring that content elsewhere might be the way to go. External storage by itself is just a set of mysql db's. I'm curious to see if there might be a better storage subsystem to optimize for this.
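To make the "set of mysql db's" point concrete: the core database mostly stores pointers like DB://cluster5/123456, and fetching one revision's text means one extra query against the right cluster. A rough sketch (assuming the usual blobs table layout; hosts and credentials are made up):

import mysql.connector  # MySQL Connector/Python

CLUSTER_HOSTS = {"cluster5": "es-cluster5.example"}  # hypothetical mapping

def fetch_external_text(pointer):
    # pointer looks like "DB://cluster5/123456"
    cluster, blob_id = pointer[len("DB://"):].split("/")[:2]
    conn = mysql.connector.connect(host=CLUSTER_HOSTS[cluster],
                                   user="dump", password="...")
    cur = conn.cursor()
    cur.execute("SELECT blob_text FROM blobs WHERE blob_id = %s", (int(blob_id),))
    (blob_text,) = cur.fetchone()
    cur.close()
    conn.close()
    return blob_text  # may still need decompressing, depending on the flags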
Of course, this may all be totally off, since I don't know a lot about the current configuration and issues, so I'll take whatever input you have to help work on something better.
No worries, feel free to find me on freenode to chat more about this and how you can help.
--tomasz
Tomasz Finc wrote:
For future dumps, we might have to resort to some form of snapshot server that is fed all updates either from memcaches or mysqls. This allows for a live backup to be performed, so it's useful for not just dumps.
Possibly, but the crux of it is simply the page text from external storage. Fetching the meta content, while lengthy, is very short in the grand scheme of things.
Is it possible to suspend any individual slaves temporarily during off-peak hours to flush the database to disk and then copy the database files to another computer? If not, we may still be able to use "stale" database files copied to another computer, as long as we only use data from them that is at least a few days old, so we know it's been flushed to disk (I'm not sure how mysql flushes the data...).
Spinning down a slave won't help us much since external storage is the slowdown. But mirroring that content elsewhere might be the way to go. External storage by itself is just a set of mysql db's. I'm curious to see if there might be a better storage subsystem to optimize for this.
AFAIK External Storage is used with direct assignment: a wiki gets assigned an ES cluster and uses it for a long period of time. Thus, from one dump to the next, all the text will be on the same ES cluster with high probability. Could the dump be run on one of the current ES cluster slaves? Moving the old dump might be a bigger problem than getting the articles (though it's much 'easier' to move, since everything is in a single data block and pipelines nicely), but the dumper machine could become an ES slave.
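A quick way to check that assumption would be to look at where the text pointers actually point, e.g. (a sketch; connection details are placeholders and the DB://<cluster>/<id> pointer format is assumed):

# Count which ES clusters a wiki's text pointers reference.
from collections import Counter
import mysql.connector  # MySQL Connector/Python

conn = mysql.connector.connect(host="db-slave.example", user="dump",
                               password="...", database="enwiki")
cur = conn.cursor()
cur.execute("SELECT old_text FROM text"
            " WHERE old_flags LIKE '%external%' LIMIT 100000")

def cluster_of(old_text):
    # rows look like b"DB://cluster5/123456"
    if isinstance(old_text, (bytes, bytearray)):
        old_text = bytes(old_text).decode("utf-8", "replace")
    return old_text.split("/")[2]

print(Counter(cluster_of(row[0]) for row in cur).most_common())
conn.close()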