Dear Sanjay,
Let me address your question on how to add images to your local copy
of enwiki in part 4 below. But some preliminaries might be in order:
1) ENWIKI. Building a mirror of <http://en.wikipedia.org/> is the
most demanding case. Details on how to do so can be found in the
WP-MIRROR Reference Manual
<http://www.nongnu.org/wp-mirror/manual/wp-mirror-0.5.pdf>. The
enwiki is still growing, so if you decide to download all of the
images, I would suggest purchasing 3T hard drives rather than 2T.
2) SIMPLEWIKI. If you have not already done so, you may wish to take
a look at <http://simple.wikipedia.org/> to see if mirroring the
simplewiki meets your needs. Simple English means shorter sentences
and a smaller vocabulary, and is intended for readers who learned
English as a Second Language (ESL). The most recent dump file
<http://dumps.wikimedia.org/simplewiki/20121209/simplewiki-20121209-pages-articles.xml.bz2>
covers 123k articles and 66k images; once mirrored, these occupy
about 60G of hard disk space.
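If you want to fetch that dump by hand (wp-mirror normally does this
for you), a minimal sketch follows, assuming you have wget and bzip2
installed; the md5sums file name is my guess at the dump directory's
usual contents:

  # download the 20121209 dump using the URL above
  wget http://dumps.wikimedia.org/simplewiki/20121209/simplewiki-20121209-pages-articles.xml.bz2

  # (optional) check it against the published checksums
  wget http://dumps.wikimedia.org/simplewiki/20121209/simplewiki-20121209-md5sums.txt
  grep pages-articles simplewiki-20121209-md5sums.txt | md5sum -c -

  # decompress, keeping the .bz2; the XML is several times larger
  bunzip2 -k simplewiki-20121209-pages-articles.xml.bz2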
3) WP-MIRROR. Wp-mirror is a mirror-building utility. It uses
MediaWiki and imports the articles into databases managed by MySQL.
It downloads full-size images. By default, it builds a mirror of the
`simple' wikipedia, but it can be configured for any set of
wikipedias. It works `out-of-the-box' for GNU/Linux distributions:
Debian 7.0 (wheezy) and Ubuntu 12.10 (quantal). Home page
<http://www.nongnu.org/wp-mirror/>.
As you mentioned that you are `currently running a WAMP solution', I
should point out that WP-MIRROR has not been ported to Windows.
And now to your question:
4) IMAGES. There is a way to download SOME of the images (rather than
all) for the enwiki. Wp-mirror, as part of its duties: 1) splits the
dump file into chunks (x-chunks) of 1000 pages each, 2) scrapes each
x-chunk to find image file names, and 3) generates a shell script
(i-chunk) for downloading the image files referenced in the
corresponding x-chunk. This means that you can run just the i-chunks
that you want.
Example: Running the first 100 i-chunks would download the images for
the first 100,000 pages, which are the oldest, largest, and most
decorated with images.
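To give you a feel for what an i-chunk is, here is a sketch. The
target directory, file names, and exact commands that wp-mirror
writes will differ; only the hashed a/ab directory layout of
Wikimedia's upload server is real:

  #!/bin/sh
  # sketch of an i-chunk: one download per image file named in the
  # corresponding x-chunk (paths and names here are illustrative)
  wget -nc -P /var/lib/mediawiki/images \
      http://upload.wikimedia.org/wikipedia/commons/a/a4/Example_one.jpg
  wget -nc -P /var/lib/mediawiki/images \
      http://upload.wikimedia.org/wikipedia/commons/b/b7/Example_two.png
  # ... roughly one line per image, covering the 1000 pages of the
  # corresponding x-chunk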
So the method might work as follows: 1) install wp-mirror on a laptop
with wheezy or quantal, 2) configure it for enwiki, 3) run it just
long enough to generate the i-chunks, 4) abort wp-mirror, 5) run the
desired i-chunks manually, and 6) move the images over to your WAMP
server.
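Step 5 might then look something like this; I am guessing at the
working directory and the i-chunk naming convention, so check where
wp-mirror keeps its working files on your system (the Reference
Manual covers this):

  # run the first 100 i-chunks, i.e. the images for the first
  # 100,000 pages (file names here are hypothetical)
  cd /path/to/wp-mirror/working/directory
  for f in $(ls *i-chunk* | sort | head -n 100); do
      sh "$f"       # each i-chunk is an ordinary shell script
  done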
5) PERFORMANCE. At the suggestion of Jason Skomorowski, the next
version of wp-mirror will have a number of performance enhancements.
In particular, the i-chunks will make use of HTTP/1.1 persistent
connections [RFC2616]. If you are in a hurry, wp-mirror 0.5 should
work fine; but if you can wait a month or so, version 0.6, when
released, will download image files with far less latency.
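To illustrate the difference in general terms (this is not
wp-mirror's actual code): starting one wget per image sets up and
tears down a TCP connection for every file, whereas handing a list
of URLs to a single wget lets it reuse one persistent connection to
the host:

  # one process per file: a fresh TCP connection each time
  wget http://upload.wikimedia.org/wikipedia/commons/a/a4/One.jpg
  wget http://upload.wikimedia.org/wikipedia/commons/b/b7/Two.png

  # one process, many URLs: wget reuses the connection via HTTP/1.1
  # keep-alive, so the per-file latency drops considerably
  wget -i image-urls.txt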
Let me know if I can be of any help.
Sincerely Yours,
Kent