[Foundation-l] [Wiki-research-l] Wikipedia dumps downloader

Tue Jun 28 21:10:41 UTC 2011

2011/6/28 Platonides <platonides at gmail.com>

> emijrp wrote:
>
>> Hi;
>>
>> @Derrick: I don't trust Amazon.
>>
>
> I disagree. Note that we only need them to keep a redundant copy of a file.
> If they tried to tamper with the file, we could detect it with the hashes
> (which should be properly secured, that's no problem).
>
>
I didn't mean security problems. I just meant files being deleted under weird
terms of service. Commons hosts a lot of images which can be problematic, like
nudes or materials that are copyrighted in some jurisdictions. They can delete
whatever they want and close any account they want, and we will lose the
backups. Period.

And we don't only need to keep a copy of every file. We need several copies
everywhere, not only in the Amazon coolcloud.

> I'd like having the hashes for the xml dumps content instead of the
> compressed one, though, so it could be easily stored with better compression
> without weakening the integrity check.
>
>
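A minimal sketch of the content-hash idea above, assuming the dump is
compressed with bzip2, gzip, or xz and that SHA-256 is an acceptable digest
(neither is specified in this thread; the function name content_sha256 is
made up for illustration). It hashes the decompressed bytes, so the same dump
gives the same digest whichever compressor is used to store it:

```python
import bz2
import gzip
import hashlib
import lzma

# Map file extensions to stdlib openers; any other file is read as-is.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open, ".xz": lzma.open}


def content_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the decompressed bytes of `path`."""
    opener = next((o for ext, o in OPENERS.items() if path.endswith(ext)), open)
    digest = hashlib.sha256()
    with opener(path, "rb") as stream:
        # Read in fixed-size chunks so large dumps never sit fully in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With this, a dump recompressed from .bz2 to .xz can still be checked against
the originally published digest.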
>  Really, I don't trust Wikimedia
>> Foundation either. They can't and/or they don't want to provide image
>> dumps (which is worse?).
>>
>
> Wikimedia Foundation has provided image dumps several times in the past,
> and also rsync3 access to some individuals so that they could clone it.
>

Ah, OK, that is enough (?). Then, you are OK with old-and-broken XML dumps,
because people can slurp all the pages using an API scraper.

> It's like the enwiki history dump. An image dump is complex, and even less
> useful.
>
>
It is not complex, just resource-consuming. If they need to buy another 10
TB of space and more CPU, they can. $16M were donated last year. They just
need to put resources into relevant stuff. WMF always says "we host the 5th
website in the world"; I say that they need to act like it.

Less useful? I hope they don't need such a useless dump for recovering
images, as happened in the past.

>
>  Community donates images to Commons, community
>> donates money every year, and now community needs to develop a software
>> to extract all the images and pack them,
>>
>
> There's no *need* for that. In fact, such a script would be trivial from the
> toolserver.

Ah, OK, only people with a Toolserver account may have access to an image
dump. And you say it is trivial from the Toolserver and very complex from
the main Wikimedia servers.

>> and of course, host them in a permanent way. Crazy, right?
>>
>
> WMF also tries hard to not lose images.

I hope so, but we remember a case of lost images.

> We want to provide some redundancy on our own. That's perfectly fine, but
> it's not a requirement.

That _is_ a requirement. We can't trust Wikimedia Foundation. They lost
images. They have problems generating English Wikipedia dumps and image
dumps. They had a hardware failure some months ago in the RAID which hosts
the XML dumps, and they didn't offer those dumps for months while trying
to fix the crash.

> Consider that WMF could be automatically deleting page history older than a
> month, or images not used on any article. *That* would be a real problem.
>
>
You just don't understand how dangerous the current situation is (and it was
worse in the past).

>
>  @Milos: Instead of splitting the image dump using the first letter of
>> filenames, I thought about splitting using the upload date (YYYY-MM-DD).
>> So, the first chunks (2005-01-01) will be tiny, and recent ones several
>> GB (a single day).
>>
>> Regards,
>> emijrp
>>
>
> I like that idea since it means the dumps are static. They could be placed
> on tape inside a safe and would not need to be taken out unless data loss
> arises.
>
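
A rough sketch of the upload-date splitting discussed above, assuming each
file comes with an ISO 8601 UTC upload timestamp such as
"2005-01-01T12:34:56Z" (the input format and the function name
chunk_by_upload_date are assumptions for illustration, not part of any
existing dump tool). Each YYYY-MM-DD key is one chunk, and past days never
change once they are over:

```python
from collections import defaultdict
from datetime import datetime


def chunk_by_upload_date(uploads):
    """Group (filename, timestamp) pairs into per-day chunks keyed YYYY-MM-DD."""
    chunks = defaultdict(list)
    for filename, timestamp in uploads:
        # Assumed timestamp format: 2005-01-01T12:34:56Z
        day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date()
        chunks[day.isoformat()].append(filename)
    # Sort by day so the chunks come out in chronological order.
    return dict(sorted(chunks.items()))
```

A chunk for a given day could then be packed into one archive and hashed
once, since files uploaded later would only ever land in newer chunks.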