Re: [Wikitech-l] downloading wikipedia database dumps

9 Jan 2010


I think having access to them on the Commons repository is much easier to
handle. A subset should be good enough.
Handling 11 TB of images and working with all of them takes huge research
capabilities.
Maybe a special API or advanced API functions would give people enough
access while saving the bandwidth and the hassle of handling this
behemoth collection.
bilal
--
Verily, with hardship comes ease.
On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc tfinc@wikimedia.org wrote:
...
William Pietri wrote:
...
On 01/07/2010 01:40 AM, Jamie Morken wrote:
...
I have a
suggestion for wikipedia!!  I think that the database dumps including
the image files should be made available by a wikipedia bittorrent
tracker so that people would be able to download the wikipedia backups
including the images (which currently they can't do) and also so that
wikipedia's bandwidth costs would be reduced. [...]
Is the bandwidth used really a big problem? Bandwidth is pretty cheap
these days, and given Wikipedia's total draw, I suspect the occasional
dump download isn't much of a problem.
No, bandwidth is not really the problem here. I think the core issue is
to have bulk access to images.
There have been a number of these requests in the past, and after talking
back and forth, it has usually turned out that a smaller subset of
the data works just as well.
A good example of this was the Deutsche Fotothek archive made late last
year.
http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )
This provided an easily retrievable high quality subset of our image
data which researchers could use.
Now if we were to snapshot image data and store them for a particular
project the amount of duplicate image data would become significant.
That's because we re-use a ton of image data between projects and
rightfully so.
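
To make the duplication point concrete, here is a minimal sketch that compares the size of per-project snapshots against a single shared collection. The project names, file names, and sizes are made up for illustration; they are not real dump figures.

```python
# Hypothetical data: file name -> size in bytes.
file_sizes = {"A.jpg": 400, "B.png": 250, "C.svg": 50}

# Hypothetical mapping of each project to the image files it uses.
project_files = {
    "enwiki": {"A.jpg", "B.png"},
    "dewiki": {"A.jpg", "C.svg"},
    "frwiki": {"A.jpg", "B.png", "C.svg"},
}


def snapshot_sizes(project_files, file_sizes):
    """Return (total size of per-project snapshots, size of one shared copy)."""
    # Each project snapshot stores its own copy of every file it uses.
    per_project_total = sum(
        file_sizes[name]
        for files in project_files.values()
        for name in files
    )
    # A shared collection stores each distinct file exactly once.
    unique_files = set().union(*project_files.values())
    shared_total = sum(file_sizes[name] for name in unique_files)
    return per_project_total, shared_total


per_project, shared = snapshot_sizes(project_files, file_sizes)
print("per-project snapshots:", per_project, "bytes")
print("single shared copy:   ", shared, "bytes")
print("duplicated data:      ", per_project - shared, "bytes")
```

The more projects re-use the same files, the larger the gap between the two totals becomes.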
If instead we package all of Commons into a tarball, then we get roughly
6 TB of image data, which after numerous conversations has proven to be a
bit more than most people want to process.
So what does everyone think of going down the collections route?
If we provide enough different and up-to-date ones, then we could easily
give people a large but manageable amount of data to work with.
If there is already a page for this, please feel free to point me to
it; otherwise I'll create one.
--tomasz

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
