Nathan J. Yoder wrote:
Why was the newest copy of enwiki with the full history removed from the downloads site? I checked around and was only able to find one place with it: http://www.archive.org/details/enwiki-20080103
We almost filled the disk on the storage cluster and needed to purge older snapshots. Tim ran a purge on all 2008 snapshots, which is why you don't see them anymore. Thankfully I have archive copies saved of several wikis, including en, that can be restored.
You'll want the "enwiki-20080103-pages-meta-history.xml.7z" file, which is about 17GB. There is another file that is 130GB, but that is the same content, just compressed with bz2 instead of 7z, which makes it much larger, so don't get that one.
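If you want to poke at the dump without unpacking the whole XML to disk first, something along these lines should work. It's just a sketch that assumes the 7z command-line tool is installed; it streams the decompressed XML and counts pages and revisions as a sanity check.

    import subprocess
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-20080103-pages-meta-history.xml.7z"

    # "7z x -so" writes the extracted XML to stdout, so no scratch space is
    # needed for the uncompressed dump.
    proc = subprocess.Popen(["7z", "x", "-so", DUMP], stdout=subprocess.PIPE)

    pages = revisions = 0
    for _event, elem in ET.iterparse(proc.stdout):
        tag = elem.tag.rsplit("}", 1)[-1]   # strip the MediaWiki export namespace
        if tag == "revision":
            revisions += 1
        elif tag == "page":
            pages += 1
            elem.clear()                    # free memory; the full history is huge
    print(pages, "pages,", revisions, "revisions")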
Tomasz, I am willing to volunteer my services as a programmer to help with the problem of producing full-history enwiki dumps, if that is possible (I can't donate hardware/money). What are the issues that are causing it to be so slow, and what methods are you employing to improve it?
Currently, pulling page text is really, really slow. Even spinning up multiple instances of pullers doesn't help us much.
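Roughly, the puller pattern looks like the sketch below. The names and the fake 50 ms latency are made up, it's not the actual dump code; the point is just that once the external storage databases saturate, adding more pullers stops helping.

    import queue
    import threading
    import time

    def fetch_text(blob_id):
        # Stand-in for the MySQL lookup against an external storage cluster.
        time.sleep(0.05)                  # pretend the backend takes 50 ms per blob
        return "revision text for %s" % blob_id

    def puller(work, results):
        while True:
            blob_id = work.get()
            if blob_id is None:           # sentinel: no more work
                return
            results.put((blob_id, fetch_text(blob_id)))

    work, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=puller, args=(work, results)) for _ in range(8)]
    for t in threads:
        t.start()
    for blob_id in range(200):
        work.put(blob_id)
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    # In this toy model more threads do help; on the real cluster the external
    # storage databases are the ceiling, which is the problem described above.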
I know that LiveJournal has some sort of live backup system using MySQL and Perl, but I couldn't find any details in their presentations. You might be able to ask one of their developers for help on their LJ blog. Can Wikimedia afford a snapshot server? It doesn't need to be as fast as the others.
Very cool. They're right down the street from us, so perhaps a contact could be made.
In the long run, whatever this system is, it will probably need to be integrated into some sort of backup system, because it would be a huge pain if something happened at the data center and you needed to restore from the partial quasi-backups in the current systems.
Possibly. We're looking at many different ways of incorporating backups: MySQL slaves, snapshots & XML. Plus we're adding offsite backup to make our emergency recovery even better.
How does the current dump method work? Are the dumps incremental, in the sense that they build on previous dumps instead of re-dumping all of the data?
Each full history snapshot first checks to see if a previous one has run and only does new work.
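In rough pseudo-Python the idea is something like this; the helper names are illustrative, not the actual dump code:

    def fetch_from_external_storage(rev_id):
        # Placeholder for the expensive per-revision text lookup.
        return "text of revision %d" % rev_id

    def build_snapshot(revision_ids, previous_dump):
        """previous_dump maps rev_id -> text taken from the last completed snapshot."""
        snapshot = {}
        fetched = 0
        for rev_id in revision_ids:
            if rev_id in previous_dump:
                snapshot[rev_id] = previous_dump[rev_id]   # reuse, no new work
            else:
                snapshot[rev_id] = fetch_from_external_storage(rev_id)
                fetched += 1
        print("reused %d revisions, fetched %d new ones"
              % (len(revision_ids) - fetched, fetched))
        return snapshot

    previous = {1: "old text 1", 2: "old text 2"}
    build_snapshot([1, 2, 3, 4], previous)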
For future dumps, we might have to resort to some form of snapshot server that is fed all updates, either from memcached or from MySQL. This allows a live backup to be performed, so it's useful for more than just dumps.
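Something like the sketch below is what I have in mind: a process that consumes a feed of revision updates and applies them to its own copy, so a consistent dump could be taken from it without touching the production databases. Everything here is hypothetical, with sqlite3 just standing in for whatever storage that server would actually use.

    import sqlite3

    def apply_update(db, update):
        """update is a (rev_id, page_id, text) tuple pushed by the replication feed."""
        db.execute(
            "INSERT OR REPLACE INTO revision (rev_id, page_id, text) VALUES (?, ?, ?)",
            update,
        )
        db.commit()

    db = sqlite3.connect("snapshot.db")
    db.execute("CREATE TABLE IF NOT EXISTS revision "
               "(rev_id INTEGER PRIMARY KEY, page_id INTEGER, text TEXT)")

    # The real thing would be a long-running consumer of the update feed;
    # here we just apply a couple of fake updates.
    for update in [(100, 7, "first revision"), (101, 7, "second revision")]:
        apply_update(db, update)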
Possibly, but the crux of it is simply pulling the page text from external storage. Fetching the metadata, while lengthy, is a small part of the job in the grand scheme of things.
Is it possible to suspend individual slaves temporarily during off-peak hours to flush the database to disk and then copy the database files to another computer? If not, we may still be able to use "stale" database files copied to another computer, as long as we only use data from them that is at least a few days old, so we know it's been flushed to disk (I'm not sure how MySQL flushes the data...).
Spinning down a slave won't help us much, since external storage is the slowdown. But mirroring that content elsewhere might be the way to go. External storage by itself is just a set of MySQL DBs. I'm curious to see if there might be a better storage subsystem to optimize for this.
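Very roughly, the mirroring could work like the sketch below: keep a high-water mark of the blob ids already copied and pull only newer rows on each pass. sqlite3 stands in for the external storage MySQL clusters here just so the sketch runs, and the table and column names are illustrative.

    import sqlite3

    def mirror_new_blobs(source, mirror):
        (high_water,) = mirror.execute(
            "SELECT COALESCE(MAX(blob_id), 0) FROM text").fetchone()
        rows = source.execute(
            "SELECT blob_id, old_text FROM text WHERE blob_id > ?", (high_water,)
        ).fetchall()
        mirror.executemany("INSERT INTO text (blob_id, old_text) VALUES (?, ?)", rows)
        mirror.commit()
        return len(rows)

    source = sqlite3.connect(":memory:")
    mirror = sqlite3.connect(":memory:")
    for db in (source, mirror):
        db.execute("CREATE TABLE text (blob_id INTEGER PRIMARY KEY, old_text TEXT)")
    source.executemany("INSERT INTO text VALUES (?, ?)",
                       [(1, "blob one"), (2, "blob two"), (3, "blob three")])
    print(mirror_new_blobs(source, mirror), "blobs copied on the first pass")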
Of course, this may all be totally off, since I don't know a lot about the current configuration and issues, so I'll take whatever input you have to help work on something better.
No worries, feel free to find me on freenode to chat more about this and how you can help.
--tomasz