WikiTeam has just finished archiving all Wikimedia Commons files up to 2012 (and some more) on the Internet Archive: https://archive.org/details/wikimediacommons. So far it's about 24 TB of archives, and there are also a hundred torrents you can help seed, ranging from a few hundred MB to over a TB, most around 400 GB. Everything is documented at https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media_tarballs and if you want to help WikiTeam with coding, here are some ideas: https://code.google.com/p/wikiteam/issues/list.
Nemo
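For anyone who wants to grab the archives programmatically rather than via the web interface, a minimal sketch follows. It assumes Python 3 with the requests library and uses archive.org's public advancedsearch.php and /metadata/ JSON endpoints; the item identifiers are simply whatever the collection search returns.

import os
import requests

# Sketch: list the items in the wikimediacommons collection and fetch their
# .torrent files over plain HTTP (error handling and paging beyond 1000 items omitted).
SEARCH_URL = "https://archive.org/advancedsearch.php"
params = {
    "q": "collection:wikimediacommons",
    "fl[]": "identifier",
    "rows": 1000,
    "output": "json",
}
docs = requests.get(SEARCH_URL, params=params).json()["response"]["docs"]
identifiers = [doc["identifier"] for doc in docs]

os.makedirs("torrents", exist_ok=True)
for identifier in identifiers:
    # Each item's file list is published as JSON.
    meta = requests.get(f"https://archive.org/metadata/{identifier}").json()
    for f in meta.get("files", []):
        if f["name"].endswith(".torrent"):
            url = f"https://archive.org/download/{identifier}/{f['name']}"
            with open(os.path.join("torrents", f["name"]), "wb") as out:
                out.write(requests.get(url).content)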
Hi Nemo,
On 13-10-2013 11:09, Federico Leva (Nemo) wrote:
WikiTeam has just finished archiving all Wikimedia Commons files up to 2012 (and some more) on the Internet Archive: https://archive.org/details/wikimediacommons. So far it's about 24 TB of archives, and there are also a hundred torrents you can help seed, ranging from a few hundred MB to over a TB, most around 400 GB. Everything is documented at https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media_tarballs and if you want to help WikiTeam with coding, here are some ideas: https://code.google.com/p/wikiteam/issues/list.
Nice, this was really needed for https://meta.wikimedia.org/wiki/Right_to_fork (although I hope that never happens). I wonder who is going to use it and for what. Are you keeping statistics so we can get an idea of what gets downloaded and how many times? Would it make sense for the WMF to seed some (or all) of these torrents?
Maarten
Maarten Dammers, 13/10/2013 13:50:
Nice, this was really needed for https://meta.wikimedia.org/wiki/Right_to_fork (although I hope that never happens). I wonder who is going to use it and for what. Are you keeping statistics so we can get an idea of what gets downloaded and how many times?
There are already counts; see e.g. the descriptions I linked. This archive is mostly a "just in case", but you never know.
Would it make sense for the WMF to seed some (or all) of these torrents?
Probably not: the WMF has very limited bandwidth even for the XML dumps, so it wouldn't be particularly useful. However, there are many mirrors out there, so convincing some of them to seed a few torrents would be nice. Archive.org has decent bandwidth and no throttling; unless you happen to reach them via the horrible he.net links (especially the transatlantic ones), it's mostly fine: https://monitor.archive.org/weathermap/weathermap.html
Nemo
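If a mirror does decide to seed, something along these lines would do it. This is only a sketch assuming the python-libtorrent (libtorrent-rasterbar) bindings; the session API differs slightly between libtorrent versions, and the torrent path is a hypothetical placeholder.

import time
import libtorrent as lt  # libtorrent-rasterbar Python bindings

ses = lt.session()
info = lt.torrent_info("torrents/example-commons-tarball.torrent")  # hypothetical path
handle = ses.add_torrent({"ti": info, "save_path": "./commons-archives"})

while True:
    s = handle.status()
    state = "seeding" if handle.is_seed() else "downloading"
    print(f"{state}: {s.progress * 100:.1f}% done, {s.num_peers} peers, "
          f"up {s.upload_rate / 1024:.0f} kB/s")
    time.sleep(30)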
Nice work Nemo!
2013/10/13 Federico Leva (Nemo) nemowiki@gmail.com
WikiTeam has just finished archiving all Wikimedia Commons files up to 2012 (and some more) on the Internet Archive: https://archive.org/details/wikimediacommons. So far it's about 24 TB of archives, and there are also a hundred torrents you can help seed, ranging from a few hundred MB to over a TB, most around 400 GB. Everything is documented at https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media_tarballs and if you want to help WikiTeam with coding, here are some ideas: https://code.google.com/p/wikiteam/issues/list.
Nemo
+1 kudos to the whole wikiteam!
On Sun, Oct 13, 2013 at 5:31 PM, Emilio J. Rodríguez-Posada emijrp@gmail.com wrote:
Nice work Nemo!
Nicely done, you guys!
Fabrice
On Oct 13, 2013, at 3:03 PM, Samuel Klein wrote:
+1 kudos to the whole wikiteam!
_______________________________
Fabrice Florin, Product Manager, Wikimedia Foundation
Brilliant! Now we just need someone with a good idea, a fast connection and huge hard drives to do cool stuff with all those images :)
-- Hay
On Mon, Oct 14, 2013 at 1:32 AM, Fabrice Florin fflorin@wikimedia.org wrote:
Nicely done, you guys!
Fabrice
Hoi, The basic problem with all these images at archive.org is the same one Commons has: how do you find that useful image? Commons is where you can contribute, but what about actually using and finding all the great material that is hidden so well? Thanks, GerardM
On 14 October 2013 13:15, Hay (Husky) huskyr@gmail.com wrote:
Brilliant! Now we just need someone with a good idea, a fast connection and huge hard drives to do cool stuff with all those images :)
-- Hay
2013/10/14 Gerard Meijssen gerard.meijssen@gmail.com
Hoi, The basic problem with all these images at archive.org is the same one Commons has: how do you find that useful image? Commons is where you can contribute, but what about actually using and finding all the great material that is hidden so well? Thanks, GerardM
Basically, you can't.
The Internet Archive has this problem in several other areas, like its Wayback Machine: there is no search engine to search the billions of grabbed websites by keyword or whatever.
The Internet Archive is a pile of hard disks and a time capsule of backups; they try to do their best at showing the materials (media players, PDF viewers), but it is not always easy or possible.
Emilio J. Rodríguez-Posada, 14/10/2013 14:18:
The Internet Archive has this problem in several other areas, like its Wayback Machine: there is no search engine to search the billions of grabbed websites by keyword or whatever.
The Internet Archive is a pile of hard disks and a time capsule of backups; they try to do their best at showing the materials (media players, PDF viewers), but it is not always easy or possible.
...and that's why Hay said we need someone with a good idea. :) Now it's easy to download the dataset (though it's not perfect); of course this doesn't automatically make something cool happen with it, except replication of the data in multiple places, which is a good thing in itself.
Nemo
The first step is that we now have stuff in different places. There was a period a few years ago when there weren't *any* backups of Commons images. The next step is that somebody uses these dumps for a new creative project. Maybe someone working at a university with lots of bandwidth and lots of space...
Hoi, While I do agree that it is good to have the data in many places, the Internet Archive on its own moves it to several places as well. Many of us have seen the IA servers at the Library of Alexandria.
While it is OK to find a use for the data at the IA, I would like us to concentrate first and foremost on how we can make better use of the media that is in Commons itself: how we can open it up to more use and make Commons more accessible.
Do realise that when there is a good use for all the data that is at the IA, the same use and more could be made of the larger amount of data that is in Commons itself. Thanks, GerardM
It was not WikiTeam's original intention to create these media tarballs so that researchers would use them from there. We created these tarballs so that everyone in the Wikimedia movement can rest assured that there is a backup copy of their media on the Internet Archive. Trust me, the number of people who will actually use these tarballs is going to be smaller than the number of people editing the smaller wikis combined; nearly everyone is going to use the data on Commons itself. So we can fully focus on improving Commons to make it more data-accessible, without taking the risk of having people work on the tarballs on the Internet Archive for research instead.
That being said, we can't even guarantee that the images in these tarballs are up to date. They should be regarded as a snapshot of each image at the time of download, not a live backup of all the images on Commons. We are looking into creating subsequent tarballs that take new uploads and re-uploads into account, so that Commons is actually backed up.
I guess the way we presented the tarballs on the Internet Archive is enough to deter anyone from conducting research directly on them, unless they do in-depth mining of the data to get what they want, but that is certainly going to be much tougher than mining the information from Commons directly in its current state.
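As a rough illustration of what an incremental run could look like, the sketch below asks the public Commons API for files uploaded after a cutoff date, which could then be fetched and packed into a follow-up tarball. The cutoff date is a hypothetical example, and the real logic in WikiTeam's scripts may well differ.

import requests

API = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "list": "allimages",
    "aisort": "timestamp",
    "aistart": "2013-01-01T00:00:00Z",  # hypothetical snapshot cutoff
    "aiprop": "url|timestamp|sha1",
    "ailimit": "500",
    "format": "json",
}

while True:
    data = requests.get(API, params=params).json()
    for img in data["query"]["allimages"]:
        print(img["timestamp"], img["name"], img["url"])
    if "continue" not in data:
        break
    params.update(data["continue"])  # carry the continuation token forward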
I agree with Gerard of course, but I still think there is no contradiction between the two things: the minute, careful curation and interface-improvement work on Commons, and the mass preservation, analysis and use of it as a dataset. They are also interests for different people with different resources, so there is no competition. For instance, I asked for some help with WikiTeam's software, and that's in Python, while MediaWiki is PHP and JavaScript: by doing both things we are more likely to cover all interests and *reduce* waste of resources. :)
Nemo
On 14 October 2013 13:59, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, While I do agree that it is good to have the data in many places, the Internet Archive on its own moves it to several places as well. Many of us have seen the IA servers at the Library of Alexandria.
While it is OK to find a use for the data at the IA, I would like us to concentrate first and foremost on how we can make better use of the media that is in Commons itself: how we can open it up to more use and make Commons more accessible.
And you need to stop right there. As in, don't express a further opinion until you realise how wrong you are. You can't do any analysis on data that is lost, and non-backed-up data is just data that doesn't know it is lost yet.
Hoi,
Geni, sorry, but there is a difference between there being a backup of Commons within the WMF and there being a dataset of Commons at the IA that is not current. People can do all the analysis they want on the old data and it will not make any difference; it will not make the data that is currently in Commons any more accessible.
We have been told repeatedly that the data at the WMF is secure. Beyond that, the data is like knowing the maximum an insurance policy will pay: you know it will not be enough. It is, however, very much a hypothetical question. How to make Commons usable is a here-and-now issue. Thanks, GerardM
On 14 October 2013 22:22, geni geniice@gmail.com wrote:
And you need to stop right there. As in, don't express a further opinion until you realise how wrong you are. You can't do any analysis on data that is lost, and non-backed-up data is just data that doesn't know it is lost yet.
-- geni
Hello,
I got the torrent file, but there aren't any peers. https://archive.org/details/wikimediacommons-torrents https://archive.org/download/wikimediacommons-torrents/wikimediacommons-torr...
Regards,
Yann
2013/10/13 Federico Leva (Nemo) nemowiki@gmail.com
WikiTeam has just finished archiving all Wikimedia Commons files up to 2012 (and some more) on the Internet Archive: https://archive.org/details/wikimediacommons. So far it's about 24 TB of archives, and there are also a hundred torrents you can help seed, ranging from a few hundred MB to over a TB, most around 400 GB. Everything is documented at https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media_tarballs and if you want to help WikiTeam with coding, here are some ideas: https://code.google.com/p/wikiteam/issues/list.
Nemo
Yann Forget, 16/10/2013 21:29:
Hello,
I got the torrent file, but there aren't any peers. https://archive.org/details/wikimediacommons-torrents https://archive.org/download/wikimediacommons-torrents/wikimediacommons-torr...
What do you mean? archive.org web-seeds them from two separate servers; are you not able to download?
Nemo
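In other words, even with zero peers the torrents should still complete, because archive.org itself acts as a web seed, and the same files can always be fetched directly over HTTP. Below is a small resumable-download sketch in Python with requests; the item identifier and file name are hypothetical placeholders.

import os
import requests

identifier = "wikimediacommons-example-item"  # hypothetical placeholder
filename = "example-tarball.tar"              # hypothetical placeholder
url = f"https://archive.org/download/{identifier}/{filename}"

# Resume with a Range request if a partial file is already on disk.
have = os.path.getsize(filename) if os.path.exists(filename) else 0
headers = {"Range": f"bytes={have}-"} if have else {}

with requests.get(url, headers=headers, stream=True, timeout=60) as r:
    r.raise_for_status()
    mode = "ab" if r.status_code == 206 else "wb"  # 206 means the Range was honoured
    with open(filename, mode) as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)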