On Fri, Nov 11, 2011 at 11:18 PM, emijrp <emijrp(a)gmail.com> wrote:
> ---------- Forwarded message ----------
> From: emijrp <emijrp(a)gmail.com>
> Date: 2011/11/11
> Subject: Old English Wikipedia image dump from 2005
> To: wikiteam-discuss(a)googlegroups.com
> Hi all;
> I want to share with you this Archive Team link. It is an old English
> Wikipedia image dump from 2005. One of the last ones, probably, before
> Wikimedia Foundation stopped publishing image dumps. Enjoy.
>  http://www.archive.org/details/wikimedia-image-dump-2005-11
People interested in image dumps may also be interested in my post
relating to the GFDL requirements, which I think mean that images need
to be included in the dumps.
"..the [GFDL] license requires that someone can download a
''complete'' Transparent copy for one year after the last Opaque copy
is distributed. As a result, I believe the BoT needs to ensure that
the dumps are available ''and'' that they can be available for one
year after WMF turns off the lights on the core servers (it allows
'agents' to provide this service). As Wikipedia contains images, the
images are required to be included. .."
discussion continues ..
While I was trying to rsync all the dumps available on the mirror site, I noticed that the latest dump available was from early October. What happened to the newer dumps on the Brazilian mirror? Has it stopped following updates from dumps.wikimedia.org?
BTW, if any of you have been following the mirroring XML dumps talk page, I have just sent an email asking AARNet to mirror our dumps too. They have about 84 TB of space available and are located in Australia. Cross your fingers!
So this has been running all of one day now, and I expect it to break in
wild and crazy ways over the next period while we get the bugs out.
But, throwing caution to the winds...
I'm generating dumps each day for each non-closed non-private project,
of revisions added since the previous day. It uses the standard xml
format, writing out stubs and history files.
This is a sort of poor person's incremental dump. What do I mean by
that? Well... It doesn't contain a list of deletions, page moves,
undeletes. It just dumps the metadata and text for every revision
between X1 (last revision dumped the day before) and X2 (last revision
in db as of the time it's dumped). The reason for that? Dumping a
range of revisions is relatively easy. Accounting for page deletions,
moves etc. since the previous dump is hard, so that is an exercise left
to the reader :-P
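To give a rough idea of how one of these daily files could be consumed,
here is a minimal sketch in Python (my illustration only, not part of the
dump scripts; the file name is made up, and it assumes the content is the
usual MediaWiki export XML, i.e. <page> elements containing <revision>
elements):

import xml.etree.ElementTree as ET

DUMP = "enwiki-daily-pages-meta-hist-incr.xml"  # hypothetical name, already decompressed

def revisions(path):
    # Stream (page title, revision id, timestamp) for every revision in the file.
    title = None
    for _event, elem in ET.iterparse(path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any xmlns prefix
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            # direct children of <revision>: id, timestamp, contributor, text, ...
            fields = {c.tag.rsplit("}", 1)[-1]: c.text for c in elem}
            yield title, fields.get("id"), fields.get("timestamp")
            elem.clear()                    # keep memory use flat
        elif tag == "page":
            elem.clear()

for page_title, rev_id, ts in revisions(DUMP):
    print(f"{ts}\t{rev_id}\t{page_title}")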
Even with these limitations I'm hoping the data will be useful to folks.
These are specifically *not* intended to be kept around forever; we'll
keep some reasonable number, 20-30 of them, and then start tossing old
ones after that.
A note about the timing of the dumps: they run once a day, there's no
progress reporting. An updated index file is published near the end of
the day. Also, we dump content with a delay of 12 hours, to allow
admins to delete things that might contain sensitive information. This
was less of a concern for dumps generated once a week, but daily runs
increase the odds of something bad getting dumped.
And speaking of the index file, it's here:
Guess I'll add some documentation on wikitech too. The code is in my
branch in svn, see
I may well be patching things tomorrow at this time for jobs that failed
to run, so feel free to point out issues, but also don't be surprised by
breakage for a while.
>> Providing multiple terabyte sized files for download doesn't make any
>> kind of sense to me. However, if we get concrete proposals for
>> categories of Commons images people really want and would use, we can
>> put those together. I think this has been said before on wikitech-l if
>> not here.
The Picture of the Year (POTY) collections are truly stunning! I am not
very interested in having terabytes of random snapshots on my computer;
instead, I find smaller "best of the best" collections much more suitable
for the public. That way it will be accessible to people with smaller
amounts of disk space, and they'll be equally impressed.
I doubt there are many people interested in the image tarballs at all;
they're just going for the principle of accessibility. Presumably Wikipedia
has plenty of backup capabilities, and there are enough gurus doing
everything to prevent possible data loss, so offering public backups has no
additional value from this perspective. Most of us probably use the
Wikipedia XML dumps for offline usage or research. I have not yet come
across image research projects requiring tens of terabytes of images to be
successful!
I say, if the POTY downloads are popular according to statistics, why not
compile a couple more years? The thing I'm talking about is hosted at
I am interested in looking at the links between webpages on wikipedia
for scientific research. I have been to
which suggested that the latest pages-articles is likely the one
people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
files, and I can't tell whether one of them would contain only link
information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to
31.55 GB. Is it correct that this contains the current snapshot of all
pages and articles on Wikipedia? (I only ask because this seems
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear
on the method used to denote a link. It would appear that links are
denoted by [[link]] or [[link | word]]. Such patterns would be fairly
easy to find using perl. However, I've noticed some odd cases, such as:
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
first to formulate ...... in his
If I must search through the pages-articles file, and if the [[ ]]
notation is overloaded, is there a description of the patterns that
are used in this file? I.e., a way for me to ensure that I'm only
grabbing links, not figure captions or some other content.
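To make that concrete, here is the rough kind of extraction I have in
mind, sketched in Python rather than perl; the skip list of prefixes is
only my guess, not an official description of how the [[ ]] markup is
used:

import re

# [[target]] or [[target|label]]; the target itself contains no brackets or pipe
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
SKIP_PREFIXES = ("File:", "Image:", "Category:", "Media:")  # assumed, surely incomplete

def article_links(wikitext):
    links = []
    for target in WIKILINK.findall(wikitext):
        target = target.strip()
        if target.startswith(SKIP_PREFIXES):
            continue  # image/file/category uses of [[ ]], not article links
        links.append(target)
    return links

sample = 'See [[William Godwin]] and [[anarchism|anarchist]] writings.'
print(article_links(sample))  # ['William Godwin', 'anarchism']

Even so, nested cases like the File: example above mean a single regex
pass is only an approximation, which is why I'd like a description of the
actual patterns.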
Thanks for your help!
A while back (over 2 years ago, urk!) we had a request for dumps of
titles of things other than articles. I haven't seen that request
repeated, but I'm wondering how useful that would be to folks and which
namespaces we should dump, if we were going to add a few. Article talk