Bulk download - Wikitech-l - lists.wikimedia.org

List overview All Threads
Download

Bulk download

Re: [Wikitech-l] AbuseFilter error...

Can we help Tor users make...

Mihai Chintoanu

23 Sep 2013 23 Sep '13

3:22 p.m.

Hi everyone, I have a list of about 1.8 million images which I have to download from commons.wikimedia.org. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image? Many thanks in advance. Mihai

Reply

Show replies by date

K. Peachey

23 Sep 23 Sep

3:36 p.m.

On Mon, Sep 23, 2013 at 11:22 PM, Mihai Chintoanu < mihai.chintoanu(a)skobbler.com> wrote:

Hi everyone, I have a list of about 1.8 million images which I have to download from commons.wikimedia.org.

Why?

Reply

Ariel T. Glenn

3:58 p.m.

We have a somewhat out of date off site mirror of images (I'm working on the out of date part). This includes commons. It's accessible by rsync, http, ftp: http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media Thanks again to your.org for hosting that. Are these images used on some particular project? If so we might be able to do better. Ariel Στις 23-09-2013, ημέρα Δευ, και ώρα 15:22 +0200, ο/η Mihai Chintoanu έγραψε:

Hi everyone, I have a list of about 1.8 million images which I have to download from commons.wikimedia.org. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image? Many thanks in advance. Mihai _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply

gnosygnu

24 Sep 24 Sep

3:23 a.m.

Ariel and others have already touched upon this, but just in case you want more details (I'm trying to do something similar): If your images are centered around one wiki (for example, the 1.8 million images are for articles in English Wikipedia), you can use the tarballs at your.org: http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/. The latest set is from 2012-12, but it should form a good base for your images. You'd want the tarballs with a prefix "of enwiki-20121201-remote-media". These are all the images for enwiki whose [[File]] page is hosted by commons ("remote"). In contrast, "local" are for the images that are hosted directly by enwiki. Some rough numbers: - 24 tarball files: each about 90 GB - approximately 2.3 million full-sized originals (not thumbs). Also includes audio, video, pdf, etc. - takes up 2.1 TB of hard disk space - takes about 14 days to download these tarballs (with an 18 Mbps download connection) Hope this helps. On Mon, Sep 23, 2013 at 9:58 AM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:

We have a somewhat out of date off site mirror of images (I'm working on the out of date part). This includes commons. It's accessible by rsync, http, ftp: http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media Thanks again to your.org for hosting that. Are these images used on some particular project? If so we might be able to do better. Ariel Στις 23-09-2013, ημέρα Δευ, και ώρα 15:22 +0200, ο/η Mihai Chintoanu έγραψε:

Hi everyone, I have a list of about 1.8 million images which I have to download from commons.wikimedia.org. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image? Many thanks in advance. Mihai _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply

Jeremy Baron

23 Sep 23 Sep

4:11 p.m.

On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu(a)skobbler.com> wrote:

I have a list of about 1.8 million images which I have to download from

commons.wikimedia.org. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image? You mean full size originals, not thumbs scaled to a certain size, right? You should rsync from a mirror[0] (rsync allows specifying a list of files to copy) and then fill in the missing images from upload.wikimedia.org ; for upload.wikimedia.org I'd say you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate UA string with contact info (email address) so ops can contact you if there's an issue. At the moment there's only one mirror and it's ~6-12 months out of date so there may be a substantial amount to fill in. And of course you should be getting checksums from somewhere (the API?) and verifying them. If your images are all missing from the mirror than it should take around 40 days at 0.5 img/sec but I guess you probably could do it in less than 10 days if you have a fast enough pipe. (depends on if you get a lit of misses or hits) See also [1] but not all of that applies because upload.wikimedia.org isn't MediaWiki. so e.g. no maxlag param. -Jeremy [0] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media [1] https://www.mediawiki.org/wiki/API:Etiquette

Reply

Federico Leva (Nemo)

6:59 p.m.

New subject: [Xmldatadumps-l] Bulk download

Jeremy Baron, 23/09/2013 16:11:

On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu(a)skobbler.com <mailto:mihai.chintoanu@skobbler.com>> wrote:

I have a list of about 1.8 million images which I have to download

from commons.wikimedia.org <http://commons.wikimedia.org>. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image? You mean full size originals, not thumbs scaled to a certain size, right? You should rsync from a mirror[0] (rsync allows specifying a list of files to copy)

I agree that rsync is probably your best bet. Another mirror I'm building is on archive.org, organised by day of upload. You can also request an individual file directly from the zips but that's not super-efficient. https://archive.org/search.php?query=subject%3A%22Wikimedia+Commons%22 Nemo

Reply

C. Scott Ananian

7:39 p.m.

New subject: [Xmldatadumps-l] Bulk download

Federico: providing information on the archive.org mirror (probably on https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media) along with any suggestions or ettiquette for using these would probably be useful. --scott

Reply

Federico Leva (Nemo)

29 Sep 29 Sep

10:29 a.m.

New subject: [Xmldatadumps-l] Bulk download

Scott, sure, I planned to add information about the archive.org copy in several places when I'm done with this first addition (I'm "only" at 18,110,540,394 KB as of now, should be a few TB more to complete 2012). IA invites people to use wget liberally, but I'll also get archive.org torrents produced for those items (over 25 GB it requires an admin). Nemo

Reply

C. Scott Ananian

23 Sep 23 Sep

7:29 p.m.

I added Jeremy's helpful tips to https://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_are_images_… feel free to improve these/reference them from other appropriate places, etc. --scott

Reply

Mihai Chintoanu

24 Sep 24 Sep

9:24 a.m.

Hello, Thank you to all who have taken the time to answer. As more people have asked, here are some details about the project. We want to build a feature in our smartphone app that allows users to read wikipedia articles and we want to make the articles and their images available offline to them, which is why we have to first download this content from wikipedia and wikimedia. We have installed a wikipedia mirror locally and extracted the desired article texts through the API. For the images, we first thought about getting the image dump tarballs. However, the articles (and consequently the images) are spread over more language domains, so this approach would have been both inefficient and too much space consuming. I'll look into the rsync approach. Once again, many thanks for all your suggestions. Mihai -----Original Message----- From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Jeremy Baron Sent: 23 September 2013 17:12 To: Wikimedia developers; Wikipedia Xmldatadumps-l Subject: Re: [Wikitech-l] Bulk download On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu(a)skobbler.com> wrote:

I have a list of about 1.8 million images which I have to download from

commons.wikimedia.org. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image? You mean full size originals, not thumbs scaled to a certain size, right? You should rsync from a mirror[0] (rsync allows specifying a list of files to copy) and then fill in the missing images from upload.wikimedia.org ; for upload.wikimedia.org I'd say you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate UA string with contact info (email address) so ops can contact you if there's an issue. At the moment there's only one mirror and it's ~6-12 months out of date so there may be a substantial amount to fill in. And of course you should be getting checksums from somewhere (the API?) and verifying them. If your images are all missing from the mirror than it should take around 40 days at 0.5 img/sec but I guess you probably could do it in less than 10 days if you have a fast enough pipe. (depends on if you get a lit of misses or hits) See also [1] but not all of that applies because upload.wikimedia.org isn't MediaWiki. so e.g. no maxlag param. -Jeremy [0] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media [1] https://www.mediawiki.org/wiki/API:Etiquette _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply

Tomasz Finc

27 Sep 27 Sep

8:24 a.m.

Thanks for helping to distribute wikipedia more broadly Mihai. Do give Kiwix for Android [1] a shot as it does something very similar to your app. Perhaps you can even collaborate on the project. --tomasz [1] - https://play.google.com/store/apps/details?id=org.kiwix.kiwixmobile On Tue, Sep 24, 2013 at 12:24 AM, Mihai Chintoanu <mihai.chintoanu(a)skobbler.com> wrote:

Hello, Thank you to all who have taken the time to answer. As more people have asked, here are some details about the project. We want to build a feature in our smartphone app that allows users to read wikipedia articles and we want to make the articles and their images available offline to them, which is why we have to first download this content from wikipedia and wikimedia. We have installed a wikipedia mirror locally and extracted the desired article texts through the API. For the images, we first thought about getting the image dump tarballs. However, the articles (and consequently the images) are spread over more language domains, so this approach would have been both inefficient and too much space consuming. I'll look into the rsync approach. Once again, many thanks for all your suggestions. Mihai -----Original Message----- From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Jeremy Baron Sent: 23 September 2013 17:12 To: Wikimedia developers; Wikipedia Xmldatadumps-l Subject: Re: [Wikitech-l] Bulk download On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu(a)skobbler.com> wrote:

I have a list of about 1.8 million images which I have to download from

commons.wikimedia.org. Is there any simple way to do this which doesn't involve an individual HTTP hit for each image? You mean full size originals, not thumbs scaled to a certain size, right? You should rsync from a mirror[0] (rsync allows specifying a list of files to copy) and then fill in the missing images from upload.wikimedia.org ; for upload.wikimedia.org I'd say you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate UA string with contact info (email address) so ops can contact you if there's an issue. At the moment there's only one mirror and it's ~6-12 months out of date so there may be a substantial amount to fill in. And of course you should be getting checksums from somewhere (the API?) and verifying them. If your images are all missing from the mirror than it should take around 40 days at 0.5 img/sec but I guess you probably could do it in less than 10 days if you have a fast enough pipe. (depends on if you get a lit of misses or hits) See also [1] but not all of that applies because upload.wikimedia.org isn't MediaWiki. so e.g. no maxlag param. -Jeremy [0] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media [1] https://www.mediawiki.org/wiki/API:Etiquette _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply

3861

days inactive

3867

days old

wikitech-l@lists.wikimedia.org

Manage subscription

10 comments

8 participants

tags (0)

participants (8)

Ariel T. Glenn
C. Scott Ananian
Federico Leva (Nemo)
gnosygnu
Jeremy Baron
K. Peachey
Mihai Chintoanu
Tomasz Finc