Dear ones,
Where might I get or mirror a dump of Commons media files?
It seems worth mentioning on the front page of
It looks like the compressed XML of the ~50M description pages is ~25GB (roughly 500 bytes per page, compressed).
It looks like WikiTeam set up a dump script that posted monthly dumps to the Internet Archive; in 2013 it stopped including the month+year in the title; in 2016 it stopped altogether. https://archive.org/details/wikimediacommons
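For anyone wanting to verify when that collection last got new uploads, here is a rough way to list its most recent items. This is only a sketch using the internetarchive Python package (pip install internetarchive); the field names are standard IA metadata, but treat it as untested:

    # List the ten most recently published items in the IA collection,
    # to see where the monthly dumps stop.
    from internetarchive import search_items

    results = search_items(
        "collection:wikimediacommons",
        fields=["identifier", "publicdate"],
    )
    # The collection holds a few thousand items, so sorting client-side
    # is slow but simple.
    recent = sorted(results, key=lambda r: r.get("publicdate", ""), reverse=True)
    for item in recent[:10]:
        print(item["publicdate"], item["identifier"])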
Samuel Klein via Commons-l, 24/05/19 00:09:
Where might I get or mirror a dump of Commons media files?
From https://archive.org/details/wikimediacommons (updated irregularly; if someone wants to help so that I'm not the single bottleneck, let me know!).
We also have a link to torrents to seed in https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive
Federico
Is there one more recent than 2016?
And yes I'd love to help!
Checking in again, as I have the feeling the bulk of Commons media (~300 TB in all) is not mirrored anywhere right now.
1. WikiTeam, we love you, what do you need to be more effective at archiving WM wikis?
2. Are there any Commons media dumps more recent than 2016? I saw something related mentioned on Phabricator (I think) within the last year, but can't find it now.
It seems like a good plan to ensure a quarterly snapshot is on IPFS [via the Internet Archive] (rough sketch below), and to keep a copy in the moon archive + others.
*Refs*
- WikiTeam page: https://wiki.archiveteam.org/index.php/Wikimedia_Commons (last real update 2014)
- IA WM Commons collection: https://archive.org/details/wikimediacommons?and[]=year%3A%222016%22 (last update 2016)
- Excellent WP Archive: https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive (last real update 2017?)
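To make the IPFS half concrete, pinning a snapshot could be as simple as the following (a sketch, assuming a running kubo/go-ipfs daemon with the ipfs CLI on $PATH; the directory name is just an example):

    # Add a locally mirrored snapshot directory to IPFS and print the
    # root CID that others can pin and fetch.
    import subprocess

    snapshot_dir = "commons-snapshot-2021-Q4"  # hypothetical local mirror
    result = subprocess.run(
        ["ipfs", "add", "--recursive", "--quieter", snapshot_dir],
        capture_output=True, text=True, check=True,
    )
    root_cid = result.stdout.strip()
    print("share and pin this CID:", root_cid)

Anyone else could then mirror it with "ipfs pin add <CID>".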
The power of an iPhone with only a 3G uplink.
The code eats away at the source and leaves no traces, freakin' ay; this is destroying work, so I'm gone.
Bye
On 13/12/21 20:59, Samuel Klein wrote:
- WikiTeam, we love you, what do you need to be more effective at archiving WM wikis?
Mainly, at least one volunteer willing to run the scripts in my stead so that it doesn't always fall on me. This is either a babysitting task or a coding task (making things more reliable), so it tends to be a project in the region of dozens, possibly hundreds, of hours of work.
(You can save some work if you spend a few hundred dollars on good equipment and/or happen to have several TB of fast internet-connected disks a few network hops from SFMIX and the WMF.)
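For a sense of what "running the scripts" means, a run looks roughly like this. A sketch only: it assumes a local checkout of the WikiTeam repository and its dumpgenerator.py entry point, with Commons as an illustrative target:

    # Launch a WikiTeam-style dump of a wiki's pages and media.
    import subprocess

    subprocess.run(
        [
            "python", "dumpgenerator.py",
            "--api=https://commons.wikimedia.org/w/api.php",  # target wiki API
            "--xml",     # export page text/history as XML
            "--images",  # download media files and their descriptions
        ],
        cwd="wikiteam",  # path to a local checkout (assumption)
        check=True,
    )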
There's also the problem that people thought it smart to mirror millions of big Internet Archive files on Commons, so I've become less and less comfortable with the idea of filling IA with hundreds of TB of content that's becoming more and more duplicative. At ArchiveTeam, as a reference, we consider $2000/TB as the cost of an upload to IA. Are 300 more TB of Commons media worth over half a million dollars (300 TB × $2000/TB = $600,000) to the IA's mission? I'm not quite sure; I'd need to check what's in there.
Federico
On Mon, Dec 13, 2021 at 3:06 PM Federico Leva (Nemo) nemowiki@gmail.com wrote:
On 13/12/21 20:59, Samuel Klein wrote:
- WikiTeam, we love you, what do you need to be more effective at archiving WM wikis?
Mainly, at least one volunteer willing to run the scripts in my stead so that it doesn't always fall on me. This is either a babysitting task or a coding task (making things more reliable), so it tends to be a project in the region of dozens, possibly hundreds, of hours of work.
Got it -- that's for all the scripts, not just Commons?
(You can save some work if you spend a few hundred dollars on good equipment and/or happen to have several TB of fast internet-connected disks a few network hops from SFMIX and the WMF.)
Some UC or other uni on Internet2 in the US perhaps...
There's also the problem that people thought it smart to mirror millions of big Internet Archive files on Commons
We could use a dump that didn't include any files that have "source:IA" or an "archived-at" field (rough filtering sketch below).
At ArchiveTeam, as a reference, we consider $2000/TB as the cost of an upload to IA.
Good to know, not cheap. Maybe not the right target for something we plan to replace / re-archive regularly.
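Something like this could be a starting point for that filtered dump, streaming the description-page dump to flag IA-sourced files. A rough sketch: the "source:IA" / "archived-at" markers are the hypothetical ones from this thread, not a fixed schema, and the dump filename is the standard pages-articles one:

    # Stream the compressed commonswiki XML dump and list File: pages
    # whose wikitext mentions an IA-source marker, as exclusion candidates.
    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "commonswiki-latest-pages-articles.xml.bz2"
    MARKERS = ("source:IA", "archived-at")  # hypothetical markers from this thread

    with bz2.open(DUMP, "rb") as f:
        title = None
        for _event, elem in ET.iterparse(f):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the export-format namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
                if title and title.startswith("File:") and any(m in text for m in MARKERS):
                    print(title)  # candidate to exclude from the media mirror
            elif tag == "page":
                elem.clear()  # free each finished page to keep memory bounded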
S.