Re: [Xmldatadumps-l] [Wiki-research-l] Wikipedia dumps downloader

29 Jun 2011

emijrp wrote:
...
  I didn't mean security problems. I just meant files deleted under weird
 terms of service. Commons hosts a lot of images which can be
 problematic, like nudes or copyrighted materials in some jurisdictions.
 They can delete what they want and close every account they want, and
 we will lose the backups. Period. 
Good point.

...
  And we don't only need to keep a copy of every file. We need several
 copies everywhere, not only in the Amazon coolcloud. 
Sure. Relying *just* on Amazon would be very bad.

...
      Wikimedia Foundation has provided image dumps several times in the
     past, and also rsync3 access to some individuals so that they could
     clone it.

 Ah, OK, that is enough (?). Then, you are OK with old-and-broken XML
 dumps, because people can slurp all the pages using an API scraper. 
If everyone who wants it can get it, then it's enough. Not in a very
timely manner, though, but that could be fixed. I'm quite confident
that if RedIRIS rang me tomorrow offering 20 TB for hosting Commons
image dumps, that could be managed without too many problems.

...
      It's like the enwiki history dump. An image dump is complex, and
     even less useful.

 It is not complex, just resource-consuming. If they need to buy another
 10 TB of space and more CPU, they can. $16M were donated last year. They
 just need to put resources in relevant stuff. WMF always says "we host
 the 5th website in the world", I say that they need to act like that.

 Less useful? I hope they don't need such a useless dump for recovering
 images, as happened in the past. 
Yes, that seems sensible. You just need to convince them :)
But note that they are already building another datacenter and developing 
a system that would keep a copy of every upload in both of 
them. They are not so mean.

...
          Community donates images to Commons, community
         donates money every year, and now community needs to develop
         software to extract all the images and pack them,

     There's no *need* for that. In fact, such a script would be trivial
     from the toolserver.

 Ah, OK, only people with a toolserver account may have access to an image
 dump. And you say it is trivial from Toolserver and very complex from
 Wikimedia main servers. 
Come on. Making a script to download all images is trivial from the 
toolserver. It's just not so easy using the API.
The complexity is in making a dump that *anyone* can download. And that's 
a matter of resources, not a technical problem.

...
          and of course, host them in a permanent way. Crazy, right?
     WMF also tries hard to not lose images.
 I hope so, but we remember a case of lost images. 
Yes. That's a reason for making copies, and I support that. But there's 
a difference between "failures happen" and "WMF is not trying to keep 
copies".

...
      We want to provide some redundancy on our own. That's perfectly
     fine, but it's not a requirement.

 That _is_ a requirement. We can't trust Wikimedia Foundation. They lost
 images. They have problems generating English Wikipedia dumps and image
 dumps. They had a hardware failure some months ago in the RAID which
 hosts the XML dumps, and they didn't offer those dumps for months,
 while trying to fix the crash. 

...
  You just don't understand how dangerous the current status is (and it
 was worse in the past). 
The big problem is its huge size. If it were 2 MB, everyone and his 
grandmother would keep a copy.
