Thanks for helping to distribute Wikipedia more broadly, Mihai. Do give
Kiwix for Android [1] a shot, as it does something very similar to your
app. Perhaps you can even collaborate on the project.
--tomasz
[1] -
On Tue, Sep 24, 2013 at 12:24 AM, Mihai Chintoanu
<mihai.chintoanu(a)skobbler.com> wrote:
Hello,
Thank you to all who have taken the time to answer.
As more people have asked, here are some details about the project. We want to build a
feature in our smartphone app that lets users read Wikipedia articles, and we want to
make the articles and their images available to them offline, which is why we first have
to download this content from Wikipedia and Wikimedia. We have installed a local
Wikipedia mirror and extracted the desired article texts through the API. For the images,
we initially considered getting the image dump tarballs. However, the articles (and
consequently the images) are spread across several language domains, so that approach
would have been both inefficient and too space-consuming.
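For reference, the text extraction is a plain action=query call; a minimal sketch in Python (the mirror endpoint and page title are hypothetical) looks like:

    import requests

    # Hypothetical endpoint of the local MediaWiki mirror.
    API = "http://localhost/w/api.php"

    def fetch_wikitext(title):
        # prop=revisions with rvprop=content returns the latest revision's wikitext.
        params = {
            "action": "query",
            "titles": title,
            "prop": "revisions",
            "rvprop": "content",
            "format": "json",
        }
        r = requests.get(API, params=params, timeout=30)
        r.raise_for_status()
        page = next(iter(r.json()["query"]["pages"].values()))
        return page["revisions"][0]["*"]

    print(fetch_wikitext("Main Page")[:200])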
I'll look into the rsync approach.
Once again, many thanks for all your suggestions.
Mihai
-----Original Message-----
From: wikitech-l-bounces(a)lists.wikimedia.org
[mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Jeremy Baron
Sent: 23 September 2013 17:12
To: Wikimedia developers; Wikipedia Xmldatadumps-l
Subject: Re: [Wikitech-l] Bulk download
On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu(a)skobbler.com> wrote:
I have a list of about 1.8 million images which I have to download from
commons.wikimedia.org. Is there any simple way to do this which doesn't
involve an individual HTTP hit for each image?
You mean full size originals, not thumbs scaled to a certain size, right?
You should rsync from a mirror[0] (rsync allows specifying a list of files
to copy) and then fill in the missing images from upload.wikimedia.org; for
upload.wikimedia.org I'd say you should throttle yourself to 1 cache miss
per second (you can check the headers on a response to see whether it was a
hit or a miss, and back off when you get a miss), and you shouldn't use more
than one or two simultaneous HTTP connections. In any case, make sure you
have an accurate UA string with contact info (an email address) so ops can
contact you if there's an issue.
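A minimal sketch of that fill-in loop in Python might look like the following (assuming the cache reports hit/miss in an X-Cache response header; the file list, UA string, and contact address are all placeholders):

    import time
    import requests

    # Placeholder UA string; put a real contact address here.
    HEADERS = {"User-Agent": "image-fetcher/0.1 (contact: you@example.com)"}

    # Placeholder list: whatever is still missing after the rsync pass.
    missing = [
        ("https://upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg",
         "Example.jpg"),
    ]

    for url, path in missing:
        r = requests.get(url, headers=HEADERS, timeout=60)
        r.raise_for_status()
        with open(path, "wb") as f:
            f.write(r.content)
        # Back off when the response was a cache miss.
        if "hit" not in r.headers.get("X-Cache", "").lower():
            time.sleep(1)

The sequential loop keeps it to a single connection, and sleeping only after misses approximates the one-miss-per-second rate.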
At the moment there's only one mirror and it's ~6-12 months out of date, so
there may be a substantial amount to fill in. And of course you should be
getting checksums from somewhere (the API?) and verifying them. If your
images are all missing from the mirror then it should take around 40 days
at 0.5 img/sec, but I guess you could probably do it in less than 10 days
if you have a fast enough pipe. (It depends on whether you get a lot of
misses or hits.)
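On the checksum point: the API does expose a SHA-1 per file via prop=imageinfo, so verification could be sketched as follows (the file title and local path are placeholders):

    import hashlib
    import requests

    API = "https://commons.wikimedia.org/w/api.php"
    HEADERS = {"User-Agent": "image-fetcher/0.1 (contact: you@example.com)"}

    def commons_sha1(title):
        # prop=imageinfo with iiprop=sha1 returns the hex SHA-1 of the original.
        params = {
            "action": "query",
            "titles": title,
            "prop": "imageinfo",
            "iiprop": "sha1",
            "format": "json",
        }
        r = requests.get(API, params=params, headers=HEADERS, timeout=30)
        r.raise_for_status()
        page = next(iter(r.json()["query"]["pages"].values()))
        return page["imageinfo"][0]["sha1"]

    def local_sha1(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    assert commons_sha1("File:Example.jpg") == local_sha1("Example.jpg")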
See also [1], but not all of that applies because upload.wikimedia.org
isn't MediaWiki, so e.g. there's no maxlag param.
-Jeremy
[0]
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
[1]
https://www.mediawiki.org/wiki/API:Etiquette
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l