Hi;
@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia Foundation either. They can't and/or they don't want to provide image dumps (which is worse?). The community donates images to Commons, the community donates money every year, and now the community needs to develop software to extract all the images and pack them, and of course, host them in a permanent way. Crazy, right?
@Milos: Instead of splitting the image dump using the first letter of the filenames, I thought about splitting it using the upload date (YYYY-MM-DD). So the first chunks (2005-01-01) will be tiny, and the recent ones several GB each (for a single day).
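Something like this, roughly (an untested sketch; it assumes we already have (filename, upload timestamp) pairs, e.g. from the image table, with timestamps in MediaWiki's 14-digit format):

import os
import shutil
from datetime import datetime

def chunk_by_upload_date(file_list, src_dir, dest_dir):
    """Copy each image into a per-day directory named YYYY-MM-DD.

    file_list: iterable of (filename, upload_timestamp) pairs, where the
    timestamp is MediaWiki's 14-digit format, e.g. '20050101123456'.
    """
    for name, timestamp in file_list:
        day = datetime.strptime(timestamp, '%Y%m%d%H%M%S').strftime('%Y-%m-%d')
        chunk_dir = os.path.join(dest_dir, day)
        os.makedirs(chunk_dir, exist_ok=True)
        shutil.copy(os.path.join(src_dir, name), chunk_dir)

Once a day has passed, its chunk never changes again, so each archive (and its torrent) stays static forever.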
Regards, emijrp
2011/6/28 Derrick Coetzee dcoetzee@eecs.berkeley.edu
As a Commons admin I've thought a lot about the problem of distributing Commons dumps. As for distribution, I believe BitTorrent is absolutely the way to go, but the torrent will require a small network of dedicated permaseeds (servers that seed indefinitely). These can easily be set up at low cost on Amazon EC2 "small" instances: the disk storage for the archives is free, since small instances include a large (~120 GB) ephemeral storage volume at no additional cost, and the cost of bandwidth can be controlled by configuring the BitTorrent client with either a bandwidth throttle or a transfer cap (or both). In fact, I think all Wikimedia dumps should be available through such a distribution solution, just as all Linux installation media are today.
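As a back-of-the-envelope example (the per-GB egress price here is only an assumption, so check current EC2 pricing), a monthly bandwidth budget translates into an average upload throttle roughly like this:

def throttle_for_budget(monthly_budget_usd, price_per_gb_usd=0.10):
    """Convert a monthly outbound-transfer budget into an average
    upload throttle (KiB/s) to configure in the BitTorrent client.
    price_per_gb_usd is an assumed EC2 egress price, not a quoted one.
    """
    seconds_per_month = 30 * 24 * 3600
    affordable_gib = monthly_budget_usd / price_per_gb_usd
    return affordable_gib * 1024 * 1024 / seconds_per_month

# e.g. throttle_for_budget(20) -> roughly 81 KiB/s of sustained upload
# for a $20/month budget.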
Additionally, it will be necessary to construct (and maintain) useful subsets of Commons media, such as "all media used on the English Wikipedia" or "thumbnails of all images on Wikimedia Commons", which are of particular interest to certain content reusers, since the full set is far too large to be of interest to most reusers. It's on this latter point that I want your feedback: what useful subsets of Wikimedia Commons does the research community want? Thanks for your feedback.
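For instance, the file list for an "all media used on a given set of English Wikipedia articles" subset could be assembled through the regular API, along these lines (a rough, untested sketch; it ignores continuation and the 50-title batch limit for brevity):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = 'https://en.wikipedia.org/w/api.php'

def files_used_on(titles):
    """Return the File: pages embedded on the given enwiki articles."""
    params = urlencode({
        'action': 'query',
        'prop': 'images',
        'titles': '|'.join(titles),
        'imlimit': 'max',
        'format': 'json',
    })
    data = json.loads(urlopen(API + '?' + params).read().decode('utf-8'))
    files = set()
    for page in data['query']['pages'].values():
        for img in page.get('images', []):
            files.add(img['title'])
    return files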
--
Derrick Coetzee
User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
http://www.eecs.berkeley.edu/~dcoetzee/
emijrp wrote:
Hi;
@Derrick: I don't trust Amazon.
I disagree. Note that we only need them to keep a redundant copy of a file. If they tried to tamper with the file, we could detect it with the hashes (which should be properly secured; that's no problem).
I'd like to have the hashes for the XML dump content instead of the compressed file, though, so it could easily be stored with better compression without weakening the integrity check.
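Concretely, I mean hashing the decompressed stream rather than the .bz2 file itself, something like this minimal sketch:

import bz2
import hashlib

def content_sha1(path, chunk_size=1024 * 1024):
    """SHA-1 of the decompressed dump content, so the checksum stays
    valid even if the file is later repacked with a different
    compressor or compression level."""
    digest = hashlib.sha1()
    with bz2.open(path, 'rb') as dump:
        while True:
            block = dump.read(chunk_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()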
Really, I don't trust the Wikimedia Foundation either. They can't and/or they don't want to provide image dumps (which is worse?).
The Wikimedia Foundation has provided image dumps several times in the past, and also rsync access to some individuals so that they could clone it. It's like the enwiki history dump. An image dump is complex, and even less useful.
The community donates images to Commons, the community donates money every year, and now the community needs to develop software to extract all the images and pack them,
There's no *need* for that. In fact, such a script would be trivial to write from the toolserver.
and of course, host them in a permanent way. Crazy, right?
WMF also tries hard not to lose images. We want to provide some redundancy on our own. That's perfectly fine, but it's not a requirement. Consider that WMF could be automatically deleting page history older than a month, or images not used on any article. *That* would be a real problem.
@Milos: Instead of splitting the image dump using the first letter of the filenames, I thought about splitting it using the upload date (YYYY-MM-DD). So the first chunks (2005-01-01) will be tiny, and the recent ones several GB each (for a single day).
Regards, emijrp
I like that idea since it means the dumps are static. They could be placed on tape inside a safe and would not need to be taken out unless data loss arises.
2011/6/28 Platonides platonides@gmail.com
emijrp wrote:
Hi;
@Derrick: I don't trust Amazon.
I disagree. Note that we only need them to keep a redundant copy of a file. If they tried to tamper with the file, we could detect it with the hashes (which should be properly secured; that's no problem).
I didn't mean security problems. I meant files simply being deleted because of weird terms of service. Commons hosts a lot of images which can be problematic, like nudes or materials that are copyrighted in some jurisdictions. They can delete whatever they want and close any account they want, and we will lose the backups. Period.
And we don't only need to keep a copy of every file. We need several copies everywhere, not only in the Amazon coolcloud.
I'd like to have the hashes for the XML dump content instead of the compressed file, though, so it could easily be stored with better compression without weakening the integrity check.
Really, I don't trust the Wikimedia Foundation either. They can't and/or they don't want to provide image dumps (which is worse?).
The Wikimedia Foundation has provided image dumps several times in the past, and also rsync access to some individuals so that they could clone it.
Ah, OK, that is enough (?). Then, you are OK with old-and-broken XML dumps, because people can slurp all the pages using an API scraper.
It's like the enwiki history dump. An image dump is complex, and even less useful.
It is not complex, just resource-consuming. If they need to buy another 10 TB of space and more CPU, they can; $16M was donated last year. They just need to put resources into the relevant stuff. WMF always says "we host the 5th website in the world"; I say that they need to act like it.
Less useful? I hope they won't need such a "useless" dump to recover images, as happened in the past.
The community donates images to Commons, the community donates money every year, and now the community needs to develop software to extract all the images and pack them,
There's no *need* for that. In fact, such a script would be trivial to write from the toolserver.
Ah, OK, so only people with a Toolserver account may have access to an image dump. And you say it is trivial from the Toolserver and very complex from the Wikimedia main servers.
and of course, host them in a permanent way. Crazy, right?
WMF also tries hard not to lose images.
I hope so, but we remember a case of lost images.
We want to provide some redundancy on our own. That's perfectly fine, but it's not a requirement.
That _is_ a requirement. We can't trust the Wikimedia Foundation. They have lost images. They have problems generating the English Wikipedia dumps and image dumps. They had a hardware failure some months ago in the RAID array which hosts the XML dumps, and they didn't offer those dumps for months while trying to fix the crash.
Consider that WMF could be automatically deleting page history older than a month,
or images not used on any article. *That* would be a real problem.
You just don't understand how dangerous the current situation is (and it was worse in the past).
@Milos: Instead of splitting the image dump using the first letter of the filenames, I thought about splitting it using the upload date (YYYY-MM-DD). So the first chunks (2005-01-01) will be tiny, and the recent ones several GB each (for a single day).
Regards, emijrp
I like that idea since it means the dumps are static. They could be placed on tape inside a safe and would not need to be taken out unless data loss arises.
emijrp wrote:
I didn't mean security problems. I meant files simply being deleted because of weird terms of service. Commons hosts a lot of images which can be problematic, like nudes or materials that are copyrighted in some jurisdictions. They can delete whatever they want and close any account they want, and we will lose the backups. Period.
Good point.
And we don't only need to keep a copy of every file. We need several copies everywhere, not only in the Amazon coolcloud.
Sure. Relying *just* on Amazon would be very bad.
The Wikimedia Foundation has provided image dumps several times in the past, and also rsync access to some individuals so that they could clone it.
Ah, OK, that is enough (?). Then, you are OK with old-and-broken XML dumps, because people can slurp all the pages using an API scraper.
If everyone who wants it can get it, then it's enough. Not in a very timely manner, though, but that could be fixed. I'm quite confident that if RedIRIS rang me tomorrow offering 20 TB for hosting Commons image dumps, it could be managed without too many problems.
It's like the enwiki history dump. An image dump is complex, and even less useful.
It is not complex, just resource-consuming. If they need to buy another 10 TB of space and more CPU, they can; $16M was donated last year. They just need to put resources into the relevant stuff. WMF always says "we host the 5th website in the world"; I say that they need to act like it.
Less useful? I hope they won't need such a "useless" dump to recover images, as happened in the past.
Yes, that seems sensible. You just need to convince them :) But note that they are already building another datacenter and developing a system with which they will keep a copy of every upload in both of them. They are not so mean.
The community donates images to Commons, the community donates money every year, and now the community needs to develop software to extract all the images and pack them,
There's no *need* for that. In fact, such a script would be trivial to write from the toolserver.
Ah, OK, so only people with a Toolserver account may have access to an image dump. And you say it is trivial from the Toolserver and very complex from the Wikimedia main servers.
Come on. Making a script to download all the images is trivial from the toolserver; it's just not so easy using the API. The complexity is in making a dump that *anyone* can download. And it's a resources problem, not a technical one.
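Just to illustrate, the API route looks roughly like this (an untested sketch, which also shows why slurping terabytes this way is painful):

import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen, urlretrieve

API = 'https://commons.wikimedia.org/w/api.php'

def download_all_images(dest_dir, batch=500):
    """Walk list=allimages and fetch every original file URL.
    Sketch only: a real run needs rate limiting, resumption and
    error handling on top of this."""
    os.makedirs(dest_dir, exist_ok=True)
    cont = {}
    while True:
        params = {'action': 'query', 'list': 'allimages',
                  'aiprop': 'url', 'ailimit': batch, 'format': 'json'}
        params.update(cont)
        data = json.loads(urlopen(API + '?' + urlencode(params)).read().decode('utf-8'))
        for img in data['query']['allimages']:
            urlretrieve(img['url'], os.path.join(dest_dir, img['name']))
        if 'continue' not in data:
            break
        cont = data['continue']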
and of course, host them in a permanent way. Crazy, right?
WMF also tries hard not to lose images.
I hope so, but we remember a case of lost images.
Yes. That's a reason for making copies, and I support that. But there's a difference between "failures happen" and "WMF is not trying to keep copies".
We want to provide some redundancy on our own. That's perfectly fine, but it's not a requirement.
That _is_ a requirement. We can't trust the Wikimedia Foundation. They have lost images. They have problems generating the English Wikipedia dumps and image dumps. They had a hardware failure some months ago in the RAID array which hosts the XML dumps, and they didn't offer those dumps for months while trying to fix the crash.
You just don't understand how dangerous the current situation is (and it was worse in the past).
The big problem is its huge size. If it were 2 MB, everyone and their grandmother would keep a copy.