I'm a student looking to work on MediaWiki during this year's Google
Summer of Code, and one of the ideas I've been interested in is
alternative formats for the data dumps (and dump work in general).
How useful would dumps from Wikipedia be if they were SQLite
databases? Would it be useful to have all the dumps as SQLite
(history, stubs, current, etc.)? Or are there certain dumps (current,
for example) which would be especially useful as databases?
The dumps wouldn't be direct dumps from the MySQL database (unlike the
old SQL dumps); they'd be in a format optimized for data processing
and imports. I'd also write supporting code, such as libraries for
reading the databases.
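As a rough illustration of what such a conversion might look like (the
schema and field names below are hypothetical; the real dumps use the full
MediaWiki XML export schema with many more fields), here is a minimal
sketch that loads page records from a toy dump into SQLite:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature of a pages dump; real dumps are far larger
# and use the full MediaWiki export schema.
SAMPLE_XML = """
<mediawiki>
  <page><id>1</id><title>Main Page</title><text>Welcome!</text></page>
  <page><id>2</id><title>Sandbox</title><text>Testing...</text></page>
</mediawiki>
"""

def load_dump(xml_text, db_path=":memory:"):
    """Parse a (simplified) dump and load it into an SQLite database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT, text TEXT)"
    )
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        conn.execute(
            "INSERT INTO page VALUES (?, ?, ?)",
            (int(page.findtext("id")),
             page.findtext("title"),
             page.findtext("text")),
        )
    conn.commit()
    return conn

conn = load_dump(SAMPLE_XML)
titles = [row[0] for row in conn.execute("SELECT title FROM page ORDER BY id")]
print(titles)  # ['Main Page', 'Sandbox']
```

A real converter would stream the XML (e.g. with iterparse) rather than
holding it in memory, but the idea is the same: once the data is in
SQLite, consumers get indexed queries for free instead of re-parsing XML.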
What do you folks think?
--
Yuvi Panda T
http://yuvi.in/
Well, that used up all my good luck for the year, but the bz2s are ready
for download. The md5sums are still calculating; give them a couple of
hours to show up. If all continues to go well, we'll have the 7z files
in 4-5 days.
As before, I do not plan to provide a single 350 GB bz2 file, nor a
single 7z file, for download.
Happy trails,
Ariel
Hi,
The order on the database dump
progress <http://dumps.wikimedia.org/backup-index.html> page
is quite strange.
Some dumps in progress are not where they should be, and even some completed
dumps are not sorted
(for 03/19: up to 08:40, then from 05:27 down to 00:33, with entries from
08:38 to 08:27, or 00:13, later in the page).
Nico
- 2011-03-19 14:59:21 plwikinews <http://dumps.wikimedia.org/plwikinews/20110319>: Dump complete
- 2011-03-19 14:49:07 fawiki <http://dumps.wikimedia.org/fawiki/20110319>: Dump complete
- 2011-03-19 08:43:09 arwikisource <http://dumps.wikimedia.org/arwikisource/20110319>: Dump complete
- 2011-03-19 08:41:44 vecwikisource <http://dumps.wikimedia.org/vecwikisource/20110319>: Dump complete
- 2011-03-19 08:40:47 brwikisource <http://dumps.wikimedia.org/brwikisource/20110319>: Dump complete
- 2011-03-19 05:27:47 fywiktionary <http://dumps.wikimedia.org/fywiktionary/20110319>: Dump complete
- ...
- 2011-03-19 00:33:09 plwikibooks <http://dumps.wikimedia.org/plwikibooks/20110319>: Dump complete
- 2011-03-20 18:09:16 frwiki <http://dumps.wikimedia.org/frwiki/20110318>: Dump in progress
  - 2011-03-19 11:36:00 in-progress All pages with complete page edit history (.bz2)
    2011-03-20 18:09:16: frwiki 53719 pages (0.488/sec), 7264000 revs (66.039/sec), 92.7% prefetched, ETA 2011-03-30 14:03:59 [max 63349860]
    These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use; most mirror sites won't want or need this.
    - pages-meta-history.xml.bz2 8.1 GB (written)
- 2011-03-20 18:09:15 eswiki <http://dumps.wikimedia.org/eswiki/20110318>: Dump in progress
  - 2011-03-20 10:23:01 in-progress All pages with complete page edit history (.bz2)
    2011-03-20 18:09:15: eswiki 5252 pages (0.188/sec), 1979000 revs (70.745/sec), 94.7% prefetched, ETA 2011-03-27 18:55:13 [max 44960402]
    These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use; most mirror sites won't want or need this.
    - pages-meta-history.xml.bz2 3.4 GB (written)
- 2011-03-19 08:38:52 kbdwiki <http://dumps.wikimedia.org/kbdwiki/20110318>: Dump complete
- ...
- 2011-03-19 08:27:04 azwikisource <http://dumps.wikimedia.org/azwikisource/20110318>: Dump complete
- 2011-03-19 00:13:10 itwikinews <http://dumps.wikimedia.org/itwikinews/20110318>: Dump complete
Well, that, like many things about dumps, took longer than I would have
liked, but the January enwikipedia run is finally complete. Unless
someone really, really wants them (and then we might talk off-list about
it), I am not going to provide a single file for download of the history
dumps; instead they are available in 15 pieces, both bzip2 and 7z
compressed.
The next run is under way; I expect it to take a while as well, as we
are still working bugs out of the system.
Please get your eyeballs on these, if you haven't picked them up
already, and let me know of any issues with the contents.
Thanks,
Ariel
http://dumps.wikimedia.org/enwiki/20110115/
Hi, has anyone got plans to create individual torrents for "All pages with
complete page edit history (.bz2)"? I downloaded them, and it turns out I
have several files that seem to be corrupted. I am unable to re-download
them, but I feel a torrent would be able to fix the corrupted parts. All of
the individual parts of the dumps are complete except the 1st, 8th, 9th, and
10th.
I need these dumps because I will analyse revisions in the hope of better
identifying vandalism on the wikis through machine learning. However, I need
the database processed soon, as my assignment is due in about a month.
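Until torrents exist, one way to spot damaged parts locally is to compare
each file against the md5sums published alongside the dump and to try
decompressing it end to end. A minimal sketch (the dump filenames you pass
in are your own; nothing here is specific to the dump tooling):

```python
import bz2
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def bz2_intact(path, chunk_size=1 << 20):
    """Return True if the whole bz2 file decompresses without error."""
    try:
        with bz2.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        # OSError: corrupt stream; EOFError: truncated download.
        return False
```

Comparing md5sum(path) against the corresponding line in the published
md5sums file catches most transfer damage; the decompression pass catches
truncation that a partial download can leave behind.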
Hi,
It seems that the dumps are using an old version of 7-Zip (4.57 is visible
on the database dump progress <http://dumps.wikimedia.org/backup-index.html>
page).
The current stable version is 9.20, with apparently several optimizations
made since 4.57.
Should 7-Zip be upgraded for the dumps, to see whether it runs faster or
uses fewer resources?
Nico
After hearing a resounding silence from folks about whether the existing
dumps are running well (perhaps no news is good news?), I fixed up a
small glitch which had caused the zh abstract dumps to fail under 1.17
and started up all jobs. Happy trails.
Ariel