WikiTeam[1] has released an update of the chronological archive of all Wikimedia Commons files, up to 2013. It is now at ~34 TB total: https://archive.org/details/wikimediacommons

I wrote to – I think – all the mirrors in the world, but apparently nobody is interested in such a mass of media apart from the Internet Archive (and mirrorservice.org, which took Kiwix). The solution is simple: take a small bite and preserve a copy yourself. One slice only takes one click, from your browser to your torrent client, and typically 20-40 GB on your disk (biggest slice 1400 GB, smallest 216 MB): https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive#Image_tarballs

Nemo

P.s.: Please help spread the word everywhere.

[1] https://github.com/WikiTeam/wikiteam
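For anyone who would rather script it than click through the browser, here is a minimal sketch that lists the slices via the archive.org search API and saves one slice's torrent file; the requests dependency and the assumption that each slice is a separate item in the wikimediacommons collection are mine, so check the collection page for the actual identifiers.

```python
# Minimal sketch: list the slices in the wikimediacommons collection on
# archive.org and save the .torrent file for one of them, ready to hand to
# any BitTorrent client. Assumes each slice is its own item in the
# collection; verify the identifiers on the collection page.
import requests

SEARCH_URL = "https://archive.org/advancedsearch.php"


def list_slices(collection="wikimediacommons", rows=500):
    """Return the item identifiers found in the given archive.org collection."""
    params = {
        "q": "collection:" + collection,
        "fl[]": "identifier",
        "rows": rows,
        "output": "json",
    }
    docs = requests.get(SEARCH_URL, params=params).json()["response"]["docs"]
    return [doc["identifier"] for doc in docs]


def fetch_torrent(identifier, dest_dir="."):
    """Download the item's auto-generated .torrent file and return its path."""
    url = "https://archive.org/download/{0}/{0}_archive.torrent".format(identifier)
    resp = requests.get(url)
    resp.raise_for_status()
    path = "{}/{}.torrent".format(dest_dir, identifier)
    with open(path, "wb") as handle:
        handle.write(resp.content)
    return path


if __name__ == "__main__":
    slices = list_slices()
    print("{} slices found; grabbing the first one as a test".format(len(slices)))
    print(fetch_torrent(slices[0]))
```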
On 01/08/2014 16:42, Federico Leva (Nemo) wrote:
Hello,
Have you thought about contacting companies with massive storage, such as Dropbox? Maybe they will be happy to share a few TB :-]
Antoine Musso wrote:
I believe Amazon has donated space at some point, but I don't know (m)any details. I very briefly searched around and found https://wikitech.wikimedia.org/wiki/Amazon_Public_Data_Sets.
MZMcBride
Yes, "all the mirrors in the world" included Amazon. No reply from them either and I'm not going to write companies who don't have a mirroring program/an explicit interest in the offer. It's appreciated if others do, though.
Nemo
Federico Leva (Nemo) wrote:
Yes, "all the mirrors in the world" included Amazon. No reply from them either and I'm not going to write companies who don't have a mirroring program/an explicit interest in the offer. It's appreciated if others do, though.
Yeah, 34 TB is still a lot of data, unfortunately. I think most people reading this list recognize and appreciate this. (I actually have a draft e-mail from just a few weeks ago about Dispenser requesting 24 TB....)
I'd personally like to see a price breakdown for this project. Doing a bit of quick research, it sounds like storage alone would probably cost around $4,000 USD, but it depends on whether you're buying individual 2 TB drives or larger 20 TB drives. More than this, though, are the ongoing and recurring costs, assuming you want to keep this data online. Is having this (backup) data available online an explicit goal here? Or is the primary goal simply to have an offline backup of this data?
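As a rough illustration of how that $4,000 figure might break down, here is a back-of-the-envelope sketch; the drive capacities, prices, and two-copy redundancy factor are illustrative assumptions, not figures from this thread.

```python
# Back-of-the-envelope storage cost for ~34 TB on commodity drives.
# Per-drive prices are assumptions (rough 2014-era list prices), and the
# redundancy factor assumes keeping two full copies of every tarball.
import math

DATASET_TB = 34
REDUNDANCY = 2  # two copies of everything, e.g. mirrored pairs

drive_options = {
    "2 TB drives": {"capacity_tb": 2, "price_usd": 100},
    "4 TB drives": {"capacity_tb": 4, "price_usd": 170},
}

for name, drive in drive_options.items():
    count = math.ceil(DATASET_TB * REDUNDANCY / drive["capacity_tb"])
    cost = count * drive["price_usd"]
    print("{}: {} drives, roughly ${:,} for raw storage".format(name, count, cost))
```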
In either case (online or offline), a price breakdown would help nearly any volunteer organization (such as a Wikimedia chapter) decide whether to help in this effort. Crowd-sourcing the funding for this project is also a possibility, either via individual donations (Kickstarter, perhaps) or via small grants from various Internet-related or free content-related organizations (EFF, Mozilla, Wikimedia, et al.).
Soliciting money for this project requires a much clearer, more detailed plan. The current shoestring strategy of everyone downloading a piece of the 34 TB is certainly romantic, but it also seems impractical and silly.
MZMcBride
Thanks, MZ, for your suggestion to ask for money. I'm not interested. For those who are: https://meta.wikimedia.org/wiki/Grants:IdeaLab/Commons_tarballs_seedbox
Nemo
When we get Wikimedia Cascadia (name decision still pending) approved and have our legal paperwork in order, we could potentially host a Commons or other Wikimedia backup. I think this would be doable if we can work out the legal issues and WMF approves a GAC request for some cheap storage.
Pine
On Aug 2, 2014 8:17 PM, "Pine W" wiki.pine@gmail.com wrote:
I think this would be doable if we can work out the legal issues and WMF approves a GAC request for some cheap storage.
What legal issues do you envision?
-Jeremy
No offense, I would prefer that Cascadia discuss potential legal issues privately with WMF before we start speculating in public.
There is probably a way to make this successful in the end.
Pine
On Aug 2, 2014 8:25 PM, "Pine W" wiki.pine@gmail.com wrote:
No offense, I would prefer that Cascadia discuss potential legal issues privately with WMF before we start speculating in public.
There is probably a way to make this successful in the end.
I find that rather confusing.
As the legal team's email footers say, they are not your lawyer (and not your chapter's lawyer either).
I can't think of any legal issues you'd encounter besides the ones WMF already deals with. (copyright/trademark/defamation/trade secrets/national security/CDA 230/DMCA/etc.) If you want legal advice on any of those issues then you need to consult counsel outside WMF.
Or maybe there's a concern I haven't imagined yet.
-Jeremy
Yes, those plus a few others. Yes, WMF counsel can't act as Cascadia's counsel, but I would want to see what terms WMF might offer, like indemnifying Cascadia for any issues relating to archiving the Commons content.
Pine
S3 prices have dropped since that IdeaLab page was last modified, so the $2,200 a month quoted there is about $1,200 a month now. If the data doesn't need to be retrieved except in rare cases, putting it on Glacier would drop it closer to $350-400 a month.
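As a rough check on those figures, a minimal sketch; the per-GB rates are assumptions approximating mid-2014 list prices and ignore tiered pricing, request fees, and Glacier retrieval costs.

```python
# Rough monthly cost of keeping the ~34 TB archive in S3 versus Glacier.
# Per-GB rates are assumptions (approximate mid-2014 list prices, flat rate,
# ignoring tiering, request fees, and retrieval costs).
DATASET_GB = 34 * 1024  # ~34 TB expressed in GB

rates_usd_per_gb_month = {
    "S3 standard": 0.030,
    "Glacier": 0.010,
}

for service, rate in rates_usd_per_gb_month.items():
    print("{}: ~${:,.0f} per month".format(service, DATASET_GB * rate))
```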
The issue of mirroring Wikimedia content has been discussed with a number of scholarly institutions engaged in data-rich research, and the response was generally of the "send us the specs, and we will see what we can do" kind.
I would be interested in giving this another go if someone could provide me with those specs, preferably for Wikimedia projects as a whole as well as broken down by individual projects or languages or timestamps etc.
The WikiTeam's Commons archive would make for a good test dataset.
Daniel
--
http://www.naturkundemuseum-berlin.de/en/institution/mitarbeiter/mietchen-da...
https://en.wikipedia.org/wiki/User:Daniel_Mietchen/Publications
http://okfn.org
http://wikimedia.org
Daniel Mietchen, 03/08/2014 03:57:
Ariel keeps https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Requir... up to date. Anything else needed?
Nemo
That seems to be sufficient to get things rolling. I will give it a try. Thanks!

--
http://www.naturkundemuseum-berlin.de/en/institution/mitarbeiter/mietchen-da...
https://en.wikipedia.org/wiki/User:Daniel_Mietchen/Publications
http://okfn.org
http://wikimedia.org