I'm a student looking to work on MediaWiki during this year's Google
Summer of Code, and one of the ideas I've been interested in is
exploring various formats for the data dumps (and dump work in general).
How useful would dumps from Wikipedia be if they were in SQLite
databases? Would it be useful to have all the dumps as SQLite
(history, stubs, current, etc.)? Or are there certain dumps (current,
for example) that would be particularly useful as databases?
The dumps wouldn't be direct dumps from the MySQL database (unlike the
old SQL dumps); they'd be in a format optimized for data processing
and imports. I'll also write supporting code, such as libraries for
reading the databases.
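To make that concrete, here is a rough sketch of what reading such an
SQLite dump might look like; the schema (a single hypothetical "page"
table) is just an assumption to show the idea:

    import sqlite3

    # Hypothetical schema: a "page" table with page_id, page_namespace
    # and page_title columns, roughly mirroring MediaWiki's page table.
    conn = sqlite3.connect("enwiki-latest-pages.sqlite")
    cur = conn.execute(
        "SELECT page_id, page_title FROM page "
        "WHERE page_namespace = 0 LIMIT 10")
    for page_id, title in cur:
        print(page_id, title)
    conn.close()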
What do you folks think?
Yuvi Panda
Well, that used up all my good luck for the year, but the bz2s are ready
for download. The md5sums are still calculating; give them a couple of
hours to show up. If all continues to go well, we'll have the 7z files
in 4-5 days.
As before, I do not plan to provide a single 350 GB bz2 file for
download, nor a single 7z file.
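Once the md5sums show up, checking a downloaded piece against them is
straightforward; here's a rough sketch (the file name and checksum
below are placeholders only):

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        # Read the (large) file in 1 MB chunks so memory use stays flat.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    expected = "d41d8cd98f00b204e9800998ecf8427e"  # placeholder value
    print(md5_of("enwiki-pages-meta-history1.xml.bz2") == expected)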
Well, that, like many things about dumps, took longer than I would
have liked, but the January enwikipedia run is finally complete.
Unless someone really, really wants them (and then we might talk
off-list about it), I am not going to provide a single file for
download of the history dumps; instead they are available in 15
pieces, both bzip2 and 7z.
The next run is under way; I expect it to take a while as well, as we
are still working bugs out of the system.
Please get your eyeballs on these, if you haven't picked them up
already, and let me know of any issues with the contents.
Hi, has anyone got plans to create individual torrents for "All pages
with complete page edit history (.bz2)"? I downloaded them and it turns
out I have several files that seem to be corrupted. I am unable to
re-download them, but I feel a torrent would be able to repair the
corrupted parts. All of the individual parts of the dumps except the
1st, 8th, 9th, and 10th are complete.
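One way to confirm which pieces are actually damaged (a rough sketch
only; the file name is just an example) is to stream-decompress each
piece and see whether bz2 raises an error:

    import bz2

    def is_intact_bz2(path, chunk_size=1 << 20):
        # A truncated or corrupted piece raises an error before the end
        # of the stream is reached; an intact one reads all the way
        # through.
        try:
            with bz2.open(path, "rb") as f:
                while f.read(chunk_size):
                    pass
        except (OSError, EOFError):
            return False
        return True

    print(is_intact_bz2("enwiki-pages-meta-history8.xml.bz2"))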
I need these dumps because I will analyse revisions in the hope of
better identifying vandalism on the wikis through machine learning. I
do, however, need the database soon so I can process it, as my
assignment is due in about a month.
It seems that the dumps are using an old version of 7zip (4.57 is
visible on the database dump progress page
<http://dumps.wikimedia.org/backup-index.html>). The current stable
version is 9.20, with apparently several optimizations. Should 7zip be
upgraded for the dumps, to see if it runs faster or uses fewer
resources?
After hearing a resounding silence from folks about whether the existing
dumps are running well (perhaps no news is good news?), I fixed up a
small glitch that had caused the zh abstract dumps to fail under 1.17
and started up all jobs. Happy trails.