Why was the newest copy of enwiki with the full history removed from
the downloads site? I checked around and was only able to find one
place with it:
http://www.archive.org/details/enwiki-20080103
You'll want the "enwiki-20080103-pages-meta-history.xml.7z" file,
which is about 17GB. There is another file that is 130GB, but that is
the same thing, just compressed with bz2 instead of 7z, making it
larger, so don't get that one.
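If you do end up with one of the bz2 variants, you can process it without ever writing the decompressed XML (hundreds of GB for full history) to disk. A minimal sketch using Python's standard `bz2` module — the file name here is just the one mentioned above, substitute whatever you actually downloaded:

```python
import bz2

# Stream a .bz2 dump line by line so the decompressed XML never has to
# exist on disk. Works for any of the *-pages-meta-history.xml.bz2 files.
def iter_dump_lines(path="enwiki-20080103-pages-meta-history.xml.bz2"):
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield line
```

A consumer can then scan for `<page>` / `<revision>` boundaries as the lines stream past, rather than extracting the whole archive first.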
Tomasz, I am willing to volunteer my services as a programmer to help
with the problem of producing full-history enwiki dumps, if that is
possible (I can't donate hardware/money). What are the issues that
are causing it to be so slow, and what methods are you employing to
improve it?
I know that LiveJournal has some sort of live backup system using
MySQL and Perl, but I couldn't find any details in their presentations.
You might be able to ask one of their developers for help on their LJ
blog. Can Wikimedia afford a snapshot server? It doesn't need to be as
fast as the others.
In the long run, whatever this system is, it will probably need to be
integrated into some sort of backup, because it would be a huge pain
if something happened at the data center and you needed to restore
from the partial quasi-backups in the current systems.
How does the current dump method work? Are they incremental in the
sense that they build up on previous dumps, instead of re-dumping all
of the data?
For future dumps, we might have to resort to some form of snapshot
server that is fed all updates from either the memcached or MySQL
servers. That would allow a live backup to be performed, so it would be
useful for more than just dumps.
Is it possible to suspend individual slaves temporarily during off-peak
hours to flush the database to disk and then copy the database
files to another computer? If not, we may still be able to use
"stale" database files copied to another computer, as long as we only
use data from them that is at least a few days old, so we know it has
been flushed to disk (I'm not sure how MySQL flushes its data...).
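For concreteness, the "pause a slave, flush, copy" idea might look something like the command sequence below. This is purely a sketch: none of it is Wikimedia's actual procedure, the paths and hostname are invented, and while STOP SLAVE / FLUSH TABLES WITH READ LOCK / UNLOCK TABLES are standard MySQL statements, a read lock is only a safe cold-copy point for MyISAM — for InnoDB you would want to shut mysqld down cleanly before copying its files.

```python
# Hypothetical command sequence for snapshotting a paused replication slave.
# datadir/dest are made-up examples; adjust for the real setup.
def snapshot_commands(datadir="/var/lib/mysql/", dest="snaphost:/backups/"):
    return [
        'mysql -e "STOP SLAVE"',                   # pause replication
        'mysql -e "FLUSH TABLES WITH READ LOCK"',  # block writes, flush MyISAM tables
        "rsync -a %s %s" % (datadir, dest),        # copy the quiesced files elsewhere
        'mysql -e "UNLOCK TABLES"',                # release the global read lock
        'mysql -e "START SLAVE"',                  # let the slave catch back up
    ]
```

Because the slave just replays the master's binlog, it catches up on its own once replication restarts, so the pause costs replication lag rather than downtime.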
Of course, this may all be totally off, since I don't know a lot about
the current configuration and issues, so I'll take whatever input you
have to help work on something better.
Sebastian Graf wrote:
> Hello Tomasz,
>
> thanks for your quick response.
>
> Unfortunately I am in need not only of *text* but of *English text*,
> since we are currently working on a revisioned indexer.
>
> Are there any english dumps available except the enwiki?
Yup, you can grab
enwikisource
enwikiversity
enwikinews
enwiktionary
enwikiquote
metawiki
commonswiki
--tomasz
Hello everybody,
I work in the computer science department at the University of
Konstanz in Germany. We are working on a revisioned native XML
database. Wikipedia is therefore the optimal playground when it comes
to huge amounts of data, since the XML dump is perfect for our
application.
At the moment I am looking for a new dump of the enwiki which
contains all revisions. I know that this XML has to be really huge,
but that's why we want to use it. Unfortunately I couldn't find any
file called "pages-meta-history" in the enwiki download section. Can
you help me with a dump, or an idea of how to get the data?
greetings
sebastian
--------------------------------------------------
Sebastian Graf
Distributed Systems Lab
University of Konstanz
Phone: +49 7531 88 4319
Mail: sebastian.graf(a)uni-konstanz.de
Hi,
I'm trying to get hold of the Wikipedia dump, in particular
enwiki-latest-pages-meta-history.xml.bz2.
It seems that on the page where it's supposed to be
(http://download.wikipedia.org/enwiki/latest/) it weighs in at 0.6KB,
whereas it used to be 147GB.
What happened to the data, and where did it go?
Also, on the Wikipedia database page
(http://en.wikipedia.org/wiki/Wikipedia_database) I read:
"As of January 17, 2009, it seems that all snapshots of
pages-meta-history.xml.7z hosted
at http://download.wikipedia.org/enwiki/ are missing. The developers at
Wikimedia Foundation are working to address this issue
(http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html).
There are other ways to obtain this file"
I checked the other ways of obtaining the file that they describe; none
worked.
Why did the dumps vanish, and how can I download a copy of them?
Thank you
Greetings,
I noticed that this enwiki dump (http://dumps.wikimedia.org/enwiki/20090520/)
was marked as completed on the 25th, but in fact it is not complete: it is
missing the behemoth (pages-meta-history.xml).
bilal