On Fri, Jan 8, 2010 at 2:37 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
> Er. I've maintained a non-WMF disaster recovery archive for a long time, though it's no longer completely current since the rsync went away and web fetching is lossy.
> It saved our rear a number of times, rescuing thousands of images from irreparable loss.
While I certainly can't fault your good will, I do find it disturbing that it was necessary. Ideally, Wikimedia should have internal backups of sufficient quality that, for any circumstance short of meteors falling from the heavens, we don't have to depend on what third parties happen to have saved.
> Moreover it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.
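For concreteness, that kind of perceptual matching doesn't need much machinery. Here is a minimal average-hash sketch in Python; the Pillow library, the function names, and the match threshold are illustrative assumptions on my part, not a description of your actual tooling:

from PIL import Image  # assumes the Pillow imaging library is installed

def average_hash(path, hash_size=8):
    # Shrink to hash_size x hash_size grayscale and emit one bit per
    # pixel: 1 if brighter than the mean, else 0. Mild re-encoding or
    # rescaling of the source image leaves most bits unchanged.
    img = Image.open(path).convert("L").resize(
        (hash_size, hash_size), Image.LANCZOS)
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > mean)

def hamming(h1, h2):
    # Count of differing bits; small distances suggest near-duplicates.
    return bin(h1 ^ h2).count("1")

# e.g. hamming(average_hash("a.jpg"), average_hash("b.jpg")) < 5
# usually means the same picture survived a lossy re-encode.

Nothing fancy; the point is that tools like this are cheap to run when they can live next to the image store, which is part of why I'd rather see them hosted within Wikimedia.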
If the goal is some version of "do something useful for Wikimedia", then it actually seems rather bizarre to have the first step be "copy X TB of gradually changing data to privately owned and managed servers". For Wikimedia applications, it would seem much more natural to make tools and technology available to do such things within Wikimedia. That way developers could work on such problems without having to worry about how much disk space they can personally afford. Again, there is nothing wrong with you generously doing such things with your own resources, but ideally running duplicate repositories for the benefit of Wikimedia should be unnecessary.
> There really are use cases. Moreover, making complete copies of the public data available as dumps to the public is a WMF board-supported initiative.
I agree with the goal of making WMF content available, but given that we don't offer any image dump right now, and that a comprehensive dump would be usable by almost no one, I don't think a classic dump is where we should start. Even you don't seem to want that. If I understand correctly, you'd like an easier way to reliably download individual image files; you wouldn't actually want to be presented with some monolithic multi-terabyte tarball each month.
Hence, I would say it makes more sense to discuss ways to make individual images and user-specified subsets of images more easily available. The same gateways that would allow you to keep synchronized could also help other people download individual files. Other goals could see functions like export pages expanded to include an option to download all associated image files at the same time one downloads a set of wikitext.
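As a rough sketch of what the client side of such a gateway might look like, the existing api.php can already enumerate the files used on a page and return their original URLs. The endpoint choice and the lack of continuation handling below are simplifications for illustration:

import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"  # illustrative endpoint

def image_urls(page_title):
    # Ask api.php for every file used on the page (generator=images)
    # and the URL of each original (prop=imageinfo, iiprop=url).
    # Continuation for very long file lists is omitted for brevity.
    params = urllib.parse.urlencode({
        "action": "query",
        "format": "json",
        "generator": "images",
        "gimlimit": "max",
        "titles": page_title,
        "prop": "imageinfo",
        "iiprop": "url",
    })
    with urllib.request.urlopen(API + "?" + params) as resp:
        data = json.load(resp)
    for page in data.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            yield info["url"]

# e.g. for url in image_urls("Example"): fetch and store url

The same query shape would extend to user-specified subsets, e.g. feeding in a list of titles or a category, which is most of what an image-aware export function would need.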
The general point I am trying to make is that if we think about what people really want, and how the files are likely to be used, then there may be better delivery approaches than trying to create huge image dumps.
-Robert Rohde