OK, this topic has already been discussed in a previous thread. Actually, Brion has
already put down some thoughts about possible solutions on his blog:
http://leuksman.com/log/2007/10/02/wiki-data-dumps/
Hopefully, we'll finally reach a consensus on a solution.
BTW:
1. Many people do research using the info in the whole meta-history dumps. This is why it
is so critical.
2. We should guarantee that we have a valid copy of the whole revision history of every
page, firstly for backup purposes in case (I hope it never happens) something someday
goes wrong with the databases, and secondly to allow
anyone interested in looking up any of those revisions to do so.
Incremental backups are, in my view, a good idea, but as Gregory has pointed out, it is
difficult to provide permanent access to the big initial dump, or to the complete
collection of fragments. However, I think that is more a matter of convenience in the
creation and recovery process. The problem is that, with a single huge file, there are
more chances for an error in the DB access to occur.
Regards,
Felipe.
Gregory Maxwell <gmaxwell(a)gmail.com> wrote: Bleh. Someone pulling increments
couldn't build a point-in-time
snapshot; they would need to always pull the full dump. And we want people
using point-in-time versions of the site, not mangled mixes.
Also, I expect that once 7zipped, the increments will not be much
smaller than the full dump, especially if partitioned by revid.
On 10/19/07, Platonides wrote:
Lars Aronsson wrote:
Or is it already done this way, behind the scenes, only that it
isn't visible from the outside?
No.
AFAIK it is done as follows:
Precondition: The last full dump (if not present, treat as empty).
1- Take a snapshot of the wiki status (page table?) and create
stub-meta-history.
2- Read stub-meta-history and fill in the page content using the last dump's
page contents. If a page's content is not in the previous dump, get it from
external storage in a blocking way.
Result: a bzip2-compressed full-history dump.
The bzip2 dump is then uncompressed and 7zipped.
If there's an error in a call to external storage, the process can't
be resumed and the dump fails.
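As a rough sketch, the current flow described above might look like this (all names here are hypothetical, made up for illustration — the real dumper is not structured this way):

```python
# Hypothetical sketch of the current dump flow: fill stub-meta-history
# with revision text, reusing the previous full dump where possible,
# and write a bzip2-compressed result. All names are illustrative.
import bz2
from dataclasses import dataclass

@dataclass
class Rev:
    revid: int
    page: str

class ExternalStorage:
    """Stands in for the external text storage; a failed fetch
    aborts the whole dump, since the process can't be resumed."""
    def __init__(self, texts):
        self.texts = texts

    def fetch(self, revid):
        return self.texts[revid]  # raises KeyError on a storage error

def build_full_history_dump(stubs, prev_dump, ext_storage, out_path):
    """Step 2: fill the stubs with text and write a bzip2 dump."""
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for rev in stubs:                     # from stub-meta-history
            text = prev_dump.get(rev.revid)   # reuse the last full dump
            if text is None:                  # otherwise block on storage
                text = ext_storage.fetch(rev.revid)
            out.write(f"{rev.page}\t{rev.revid}\t{text}\n")
    # The bzip2 dump would then be decompressed and recompressed with 7zip.
```

Note how a single KeyError in `ExternalStorage.fetch` kills the entire run — that is the failure mode being discussed.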
I have recently been thinking about this, and I think it could be done like this:
Precondition: The last full dump (if not present, treat as empty) and
its greatest revid.
1a- Take a snapshot of the wiki status (page table?) and create
stub-meta-history.
1b- While reading the revisions, if a revid is greater than the
last dump's greatest revid (LDGR), add it to one of N files (one file per M revisions).
2- Run N processes grabbing these page contents. Store them in a
new-format dump (the external storage equivalent), one per revid list
file. If one fails, just rerun it.
3- Read stub-meta-history and fill in the page content using the last dump's
page contents. If a page's text is not in the previous dump, grab it from the
list file if revid > LDGR; otherwise, get it from external storage, saving
it to a different file.
Revisions present in neither the last dump nor the incremental dumps will
occur on restored pages, and can still block the process, but since there
are far fewer of them, failures become much less likely.
4- Save the new dump's LDGR along with the new bzipped dump.
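The incremental steps above (1b through 3) could be sketched roughly as follows. Everything here is a hypothetical illustration of the scheme, not real dumper code; M and N, the list files, and the fallback order are as described in the steps:

```python
# Minimal sketch of the proposed incremental scheme (steps 1b-3).
# All names are hypothetical. Revids newer than the last dump's
# greatest revid (LDGR) are split into list files of M revisions each;
# each file is fetched independently and can simply be rerun on failure.
def partition_new_revids(revids, ldgr, m):
    """Step 1b: group revids greater than LDGR into files of M each."""
    new = sorted(r for r in revids if r > ldgr)
    return [new[i:i + m] for i in range(0, len(new), m)]

def fetch_list_file(revids, ext_storage):
    """Step 2: one worker per list file; if it fails, just rerun it."""
    return {rid: ext_storage[rid] for rid in revids}

def fill_revision(revid, prev_dump, list_files, ext_storage, ldgr):
    """Step 3: prefer the last dump, then the incremental list files,
    and only fall back to external storage (e.g. for restored pages)."""
    if revid in prev_dump:
        return prev_dump[revid]
    if revid > ldgr:
        for lf in list_files:
            if revid in lf:
                return lf[revid]
    return ext_storage[revid]  # rare case: revision in neither dump
```

The point of the design is that the blocking external-storage call in `fill_revision` is only hit for the rare restored-page case, so a failure there is much less likely to kill the run.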
By making the M+1 incremental dumps available, along with the smaller
stub-meta-history, the latest dump can be recreated from the previous one
(= less download size).
Wikimedia would still provide the full dumps, but you would only need
them the first time.
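The recreation step on the downloader's side could then look roughly like this (again purely hypothetical names, under the assumption that the incremental files map revids to text):

```python
# Hypothetical sketch of recreating the latest full dump from the
# previous full dump plus the incremental list files, driven by the
# new (smaller) stub-meta-history snapshot.
def recreate_full_dump(new_stub_revids, prev_dump, incremental_files):
    """Merge the previous dump with the incrementals, then keep only
    the revids listed in the new stub snapshot."""
    texts = dict(prev_dump)            # start from the previous full dump
    for inc in incremental_files:      # apply each incremental list file
        texts.update(inc)
    return {rid: texts[rid] for rid in new_stub_revids if rid in texts}
```

So a downloader pulls only the new stubs and the incrementals, not the whole full dump again.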
Comments?
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l