In recent weeks I have been following the database dumps of some language editions of Wikipedia. I download and analyze a dump, make various improvements, and then wait for the next dump to become available for a new analysis. There are 2 or 3 weeks between dumps. There appear to be two parallel dump processes running continuously: http://download.wikimedia.org/backup-index.html
What takes the most time in each dump is the large file with the complete version history, pages-meta-history.xml.bz2 and pages-meta-history.xml.7z.
This is the largest file even in compressed form, but since it contains every version of every article it also compresses very well, and it expands to an enormous size. I guess that very few people find use for this file. In addition, only a very small portion of its contents changes between two dumps. So we spend a lot of time and effort (and delay other things) in order to create very little for very few users.
I think that this dump should be made incremental. Every week, only that week's additional revisions need to be dumped. These can then be added to the dumps of the previous week, the week before that, and so on, which haven't really changed. This way the dump process could be made much faster, and the two parallel dump processes would complete the cycle in less time, so new dumps of the same project could be made available more frequently.
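To make the idea concrete, here is a rough sketch of what I mean (the names are made up for illustration, not taken from the actual dump scripts):

    # Dump only the revisions added since the previous dump; downloaders can then
    # combine this increment with the older files they already have.
    def dump_increment(revisions, last_dumped_rev_id, out):
        # 'revisions' is assumed to iterate over all revisions in rev_id order
        for rev in revisions:
            if rev.id > last_dumped_rev_id:
                out.write(rev.to_xml())   # same <revision> XML format as today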
Or is it already done this way behind the scenes, and it just isn't visible from the outside?
It already works that way on the backend, pretty much.
We can't make the old increments available forever because of things we are obligated to stop distributing, so incrementals would not be so useful to users.
Lars Aronsson wrote:
Or is it already done this way behind the scenes, and it just isn't visible from the outside?
No.
AFAIK it is done as follows:
Precondition: the last full dump (if not present, treat as empty).
1. Take a snapshot of the wiki status (page table?) and create stub-meta-history.
2. Read stub-meta-history and fill in the page content from the last dump's page contents. If a page's content is not in the previous dump, get it from external storage in a blocking way.
Result: a bzip2-compressed full-history dump. The bzip2 dump is then uncompressed and recompressed with 7zip.
If there's an error in a call to external storage, the process can't be resumed and the dump fails.
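In rough Python-like pseudocode, the flow described above is something like this (the function names are invented for illustration; the real dump scripts are more involved):

    def build_full_history_dump(stub_history, previous_dump, external_storage, out):
        for rev in stub_history:                  # metadata snapshot from step 1
            text = previous_dump.get(rev.id)      # step 2: reuse text from the last dump
            if text is None:
                # not in the previous dump: blocking fetch from external storage;
                # an error here makes the whole (non-resumable) run fail
                text = external_storage.fetch(rev.id)
            out.write(rev.to_xml(text))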
I have recently been thinking about this, and I think it could be done as follows:
Precondition: the last full dump (if not present, treat as empty) and its greatest revid.
1a. Take a snapshot of the wiki status (page table?) and create stub-meta-history.
1b. While reading the revisions, if a revid is greater than the last dump's greatest revid (LDGR), add it to one of N list files (one file per M revisions).
2. Run N processes grabbing those page contents. Store them in a new-format dump (the external storage equivalent), one per revid list file. If one fails, just rerun it.
3. Read stub-meta-history and fill in the page content from the last dump's page contents. If a page text is not in the previous dump, grab it from the list file when its revid > LDGR; otherwise get it from external storage, saving it in a separate file.
Revisions present in neither the last dump nor the incremental dumps will only occur on restored pages; they could still block the process, but since there are far fewer of them, failures become much less likely.
4. Save the new LDGR along with the new bzip2 dump.
By making the M+1 incremental dumps available, together with the smaller stub-meta-history, the latest dump can be recreated from the previous one (= a smaller download).
Wikimedia would still provide the full dumps, but you would only need them the first time.
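To show how I imagine the pieces fitting together, here is a rough sketch (again with invented names, just illustrating the steps above):

    def incremental_dump(stub_history, previous_dump, external_storage,
                         ldgr, n_batches, out):
        # 1b: revisions newer than the last dump's greatest revid, split into N batches
        new_revs = [r for r in stub_history if r.id > ldgr]
        batches = [new_revs[i::n_batches] for i in range(n_batches)]

        # 2: fetch each batch independently; a failed batch can simply be re-run
        fetched = {}
        for batch in batches:                     # could be N parallel processes
            for rev in batch:
                fetched[rev.id] = external_storage.fetch(rev.id)

        # 3: assemble the new full dump from the old dump plus the fresh increments
        for rev in stub_history:
            text = previous_dump.get(rev.id)
            if text is None:
                text = fetched.get(rev.id)
            if text is None:
                # rare case: a revision of a restored page, missing from both sources
                text = external_storage.fetch(rev.id)
            out.write(rev.to_xml(text))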
Comments?
Bleh. Someone pulling increments couldn't build a point-in-time snapshot; they would always need to pull the full dump. And we want people using point-in-time versions of the site, not mangled mixes.
Also, I expect that once 7zipped the increments will not be much smaller than the full dump, especially if partitioned by revid.
OK, this topic has already been discussed in a previous thread. In fact, Brion has already put some thoughts about possible solutions on his blog:
http://leuksman.com/log/2007/10/02/wiki-data-dumps/
Hopefully, we'll finally reach a consensus on a solution.
BTW:
1. Many people do research on the information in the full meta-history dumps. This is why it is so critical.
2. We should guarantee that we have a valid copy of the whole revision history of every page, firstly for backup purposes in case something someday goes wrong with the databases (I hope that never happens), and secondly to allow anyone interested in looking up any of those revisions to do so.
Incremental backups are, in my view, a good idea, but as Gregory has pointed out, it is difficult to provide permanent access to the big initial dump, or to the complete collection of fragments. However, I think it is more a matter of convenience in the creation and recovery process. The problem is that, with a single huge file, there are more chances for an error in the db access to occur.
Regards,
Felipe.
I cannot understand how a diff of the database dump could be similar in size to the full database (when compressed). What fraction of pages are not edited in a month (or in the time between two database dumps)?
There is no need to keep the diff files forever. Provided that there is a way to recreate the full database from a set of recent files, there is no need to keep the older ones. (The only difference I can see is that, from an old dump and old diff files, deleted revisions of a page could be recreated, but that leads to a discussion that is more political than technical.)
AnyFile
On 10/20/07, Any File anysomefile@gmail.com wrote:
I cannot understand how a diff of the database dump could be similar in size to the full database (when compressed). What fraction of pages are not edited in a month (or in the time between two database dumps)?
Compare the size of the 7zipped current-revisions dump with a full-history dump. The compression is *very* effective. In the case of frwiki, for example, the complete history is just 2.3x larger than the most recent revisions alone.
If the incremental files are partitioned by revid, so that each file ends up containing at least one revision from a significant fraction of the articles, it's quite likely that a single incremental set could end up substantially larger than the full dump.
There is no need to keep the diff files forever.
What gave you the impression that I thought there was?
Provided that there is a way to recreate the full database from a set of recent files, there is no need to keep the older ones.
The proposed scheme didn't appear to offer any way to recreate a point-in-time dump except by grabbing the full copy.
Gregory Maxwell wrote:
Bleh. Someone pulling increments couldn't build a point-in-time snapshot; they would always need to pull the full dump. And we want people using point-in-time versions of the site, not mangled mixes.
They'd use the stubs version.
Also, I expect that once 7zipped the increments will not be much smaller than the full dump, especially if partitioned by revid.
I wasn't proposing a file per revid, but a file per N revisions, where N is a number which fits our needs ;-)
On 10/20/07, Platonides Platonides@gmail.com wrote:
Gregory Maxwell wrote:
Bleh. Someone pulling increments couldn't build a point-in-time snapshot; they would always need to pull the full dump. And we want people using point-in-time versions of the site, not mangled mixes.
They'd use the stubs version.
Okay, you didn't mention that.... but please no: I have had a hard enough time explaining to people that the separate SQL dumps aren't consistent with the history dumps.
I don't want to end up in a situation where the only way to get a sane copy of the site is stitching together dozens of files on the recipient's side... people will do it wrong, or just skip building a point-in-time version at all, and make a big mess.
I'd rather go back to having separate metadata and text dumps than end up with people needing to combine an old full dump, N large incremental files, and a new stub dump through a bunch of complex manipulation in order to arrive at a consistent copy of the site.
If we wanted to do that on the back end, fine.
Also, I expect that once 7zipped the increments will not be much smaller than the full dump, especially if partitioned by revid.
I wasn't proposing a file per revid, but a file per N revisions, where N is a number which fits our needs ;-)
Partitioning by revid doesn't necessarily mean one rev per file... and that's certainly not what I thought you were suggesting.
You will screw up compression if you partition by revid (i.e. in groups of revs, failing to keep all revs of a single article in one place). If you don't want to take my word for it, try it yourself.
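Something along these lines would show it (a quick sketch using plain bz2 from the standard library; 7zip behaves similarly in this respect):

    import bz2

    def compressed_size(texts):
        return len(bz2.compress("".join(texts).encode("utf-8")))

    # revisions: a sample of (page_id, rev_id, text) tuples from any dump
    def compare(revisions):
        by_article = sorted(revisions, key=lambda r: (r[0], r[1]))  # all revs of a page together
        by_revid = sorted(revisions, key=lambda r: r[1])            # revs of many pages interleaved
        print("grouped by article:", compressed_size([r[2] for r in by_article]))
        print("ordered by revid:  ", compressed_size([r[2] for r in by_revid]))

Keeping all revisions of one article together lets the compressor exploit the near-duplicate texts; interleaving by revid throws that locality away.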
On 10/20/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 10/20/07, Platonides Platonides@gmail.com wrote:
Gregory Maxwell wrote:
Bleh. Someone pulling increments couldn't build a point-in-time snapshot; they would always need to pull the full dump. And we want people using point-in-time versions of the site, not mangled mixes.
They'd use the stubs version.
Okay, you didn't mention that.... but please no: I have had a hard enough time explaining to people that the separate SQL dumps aren't consistent with the history dumps.
I don't want to end up in a situation where the only way to get a sane copy of the site is stitching together dozens of files on the recipient's side... people will do it wrong, or just skip building a point-in-time version at all, and make a big mess.
I'd rather go back to having separate metadata and text dumps than end up with people needing to combine an old full dump, N large incremental files, and a new stub dump through a bunch of complex manipulation in order to arrive at a consistent copy of the site.
If we wanted to do that on the back end, fine.
Additionally I just don't see a lot of demand for incremental full-history dumps. For research purposes you're generally going to have to download the whole dump anyway, and even if it takes a few days or you have to get someone to make you a few DVD-Rs it's no big deal (*). For mirror/fork purposes you want a live feed and/or some sort of API access.
API access would be great. Reasonably priced live feeds would be great too. But incremental full-history dumps would be a lot of work for little benefit, IMO.
(*) In my experience it takes about 3-5 times as long to uncompress and import the dump as it does to download it, and that's for the .bz2 dump; if something could be done to cut down *that* component, I'd be all for it. I guess incremental dumps would help that part too, though some sort of index file would probably be a better solution.
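For the index idea, I'm picturing something like this (a hypothetical sketch, not an existing tool): in one decompression pass, record where each <page> starts in the uncompressed XML, so later runs can seek straight to the pages they want instead of re-reading the whole thing.

    import bz2

    def build_page_index(dump_path, index_path):
        # one linear pass; offsets refer to the uncompressed XML, so they are useful
        # against a decompressed copy (or via slow seeks into the .bz2 itself)
        offset = 0
        with bz2.BZ2File(dump_path) as dump, open(index_path, "w") as idx:
            for line in dump:
                if b"<page>" in line:
                    idx.write("%d\n" % offset)
                offset += len(line)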
Also, I expect that once 7zipped the increments will not be much smaller than the full dump, especially if partitioned by revid.
I wasn't proposing a file per revid, but a file per N revisions, where N is a number which fits our needs ;-)
Partitioning by revid doesn't necessarily mean one rev per file... and that's certainly not what I thought you were suggesting.
You will screw up compression if you partition by revid (i.e. in groups of revs, failing to keep all revs of a single article in one place). If you don't want to take my word for it, try it yourself.
Very good point, but if you still grouped the revs by article surely it'd be a smaller file with fewer revs. Pathological cases aside, prepending extra data to a file makes the compressed file bigger, not smaller, right?
A key question is what's the demand? Who wants dumps, and for what purposes? Are you willing to pay for them? If 10-15 people each chipped in $10/month toward a dedicated server, the possibilities are fairly endless. Each person could create a custom dump geared toward their particular needs, if necessary.
[snip]
Can you take a quick look at my blog entry of last week on this subject & comment?
http://leuksman.com/log/2007/10/14/incremental-dumps/
-- brion vibber (brion @ wikimedia.org)