Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this).
This part is now fixed; .bz2 failures will not report .7z success on the next run (but could on the current run, while the program is still running).
-- brion vibber (brion @ pobox.com)
If I may make a suggestion/request: one thing I would very much like from the download.wikipedia.org site is an index file (plain text or XML) that indicates the latest valid individual dump files for a given Wikipedia site.
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded; but the later dumps are still useful for most of the files, just not the really large ones). An index file of this type would supersede the http://download.wikipedia.org/enwiki/latest/ directory, and would probably live in http://download.wikipedia.org/enwiki/ .
Given that some dumps occasionally fail, it would be good to make it easier to automate downloading and processing dump files. Dump consumers could then have a cron job that, say once a day, fetched the latest index file and downloaded the dump files they wanted, if those had been updated.
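Purely as a rough sketch of that daily check (assuming Python, and a hypothetical index URL such as http://download.wikipedia.org/enwiki/index.xml and a made-up cache path; the index format itself is only mocked up below), the cron job's first step might look something like:

===================================================================
#!/usr/bin/env python
# Hedged sketch: fetch the (hypothetical) index file once a day and
# notice whether it has changed since the previous run. The index URL
# and cache path are assumptions for illustration only.
import hashlib
import os
import urllib.request

INDEX_URL = "http://download.wikipedia.org/enwiki/index.xml"  # hypothetical
CACHE_PATH = "/var/cache/wikidumps/enwiki-index.xml"          # hypothetical

def fetch_index():
    new_index = urllib.request.urlopen(INDEX_URL).read()
    old_index = b""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            old_index = f.read()
    changed = hashlib.md5(new_index).digest() != hashlib.md5(old_index).digest()
    if changed:
        with open(CACHE_PATH, "wb") as f:
            f.write(new_index)
    return changed, new_index

if __name__ == "__main__":
    changed, index_xml = fetch_index()
    if changed:
        print("Index updated; check individual <dump> entries for new files.")
===================================================================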
It might help here to show a rough mock-up example of the type of file I'm thinking of:
===================================================================
<mediawiki xsi:schemaLocation="http://download.wikipedia.org/xml/export-0.1/"
           version="0.1" xml:lang="en">
  <siteinfo>
    <sitename>English Wikipedia</sitename>
  </siteinfo>
  <dump type="site_stats.sql.gz">
    <desc>A few statistics such as the page count.</desc>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql...</url>
    <size_in_bytes>451</size_in_bytes>
    <timestamp>2006-09-24T16:29:01Z</timestamp>
    <md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
  </dump>
  <dump type="pages-articles.xml.bz2">
    <desc>Articles, templates, image descriptions, and primary meta-pages.</desc>
    <timestamp>2006-09-24T22:12:24Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-pages-articles...</url>
    <md5sum>2742b1b4b131d9a28887823da91cf2a5</md5sum>
    <size_in_bytes>1710328527</size_in_bytes>
  </dump>
.... snip various dump entries ....
  <dump type="pages-meta-history.xml.7z">
    <desc>All pages with complete edit history (.7z)</desc>
    <timestamp>2006-08-16T12:55:00Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060816/enwiki-20060816-pages-meta-his...</url>
    <md5sum>24160a71229bee02bb813825bf7413db</md5sum>
    <size_in_bytes>5132097632</size_in_bytes>
  </dump>
</mediawiki>
===================================================================
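To make the intent concrete, here is a rough sketch (mine, not an existing tool) of how a consumer might parse such an index with Python's standard xml.etree.ElementTree, assuming the final format ends up as valid, namespace-free XML with the element names used in the mock-up above:

===================================================================
# Sketch only: parse the hypothetical index XML into a dict keyed by
# dump type. Assumes valid XML with the element names from the mock-up.
import xml.etree.ElementTree as ET

def parse_index(index_xml):
    root = ET.fromstring(index_xml)
    dumps = {}
    for dump in root.findall("dump"):
        dumps[dump.get("type")] = {
            "desc": dump.findtext("desc"),
            "url": dump.findtext("url"),
            "timestamp": dump.findtext("timestamp"),
            "md5sum": dump.findtext("md5sum"),
            "size_in_bytes": int(dump.findtext("size_in_bytes")),
        }
    return dumps

# e.g. parse_index(index_xml)["pages-meta-history.xml.7z"]["url"]
===================================================================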
... the above file is probably invalid XML and would need to be tweaked and so forth, but hopefully it illustrates the idea (e.g. the pages-articles.xml.bz2 entry is recent, whereas the pages-meta-history.xml.7z file is a month older, but both represent the latest valid dump for that type of file).

Someone who, for example, only wants the "All pages with complete edit history (.7z)" file could download the index once a day. When that entry changed, they would download the file, verify that the size in bytes matches, verify that the MD5 sum matches, and if everything is good, extract the file, perhaps verify locally that it is valid XML, and if all is still good, process the file in an automated way. Also, after every individual dump file was successfully created, the index file would have to be updated (to ensure it was always current).

I think the above information is already on the download.wikipedia.org site, but it is scattered over a number of different places; this would basically unify all of that information into one nice, useful data format.
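As a hedged illustration of that verification step (again assuming the element names from the mock-up, a dict entry as produced by the parsing sketch above, and a hypothetical local file path), the size and MD5 checks might look like:

===================================================================
# Sketch: verify a downloaded dump against its index entry before
# processing it. The file path and entry dict are hypothetical.
import hashlib
import os

def verify_dump(path, entry):
    # 1. Size check is cheap, so do it first.
    if os.path.getsize(path) != entry["size_in_bytes"]:
        return False
    # 2. MD5 check, streamed so multi-GB dumps need not fit in memory.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    return md5.hexdigest() == entry["md5sum"]
===================================================================

If both checks pass, the consumer would then extract the file, optionally confirm it is well-formed XML, and hand it off to whatever automated processing follows.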
All the best, Nick.