I am interested in looking at the links between webpages on Wikipedia
for scientific research. I have been to a page that suggested the latest
pages-articles dump is likely the one people want. However, I'm unclear
on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
files, and I can't tell whether any of them contains only link
information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to
31.55 GB. Is it correct that this contains the current snapshot of all
pages and articles in Wikipedia? (I only ask because this seems
surprisingly small.)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear
on the method used to denote a link. It would appear that links are
denoted by [[link]] or [[link | word]]. Such patterns would be fairly
easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
first to formulate ...... in his
If I must search through the pages-articles file, and if the [[ ]]
notation is overloaded, is there a description of the patterns that
are used in this file? I.e., a way for me to ensure that I'm grabbing
only links, not figure captions or some other content.
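For concreteness, this is the kind of extraction I have in mind (a rough
Python sketch; the regex and the namespace prefixes are my own guesses,
not anything documented):

    import re

    # Innermost [[...]] spans only, i.e. spans with no nested brackets.
    # The link target may not contain '|', '[' or ']'; any display text
    # after the first '|' is ignored.
    LINK_RE = re.compile(r'\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]')

    # Namespace prefixes to skip -- illustrative, not exhaustive.
    SKIP_PREFIXES = ('file:', 'image:', 'category:')

    def extract_links(wikitext):
        """Return plain article link targets found in a page's wikitext."""
        links = []
        for match in LINK_RE.finditer(wikitext):
            target = match.group(1).strip()
            if target.lower().startswith(SKIP_PREFIXES):
                continue
            links.append(target)
        return links

    # The William Godwin case above: the outer [[File:...]] span contains
    # nested brackets, so it never matches; the inner [[William Godwin]]
    # is still found on its own.
    print(extract_links('[[File:W.jpg|left|thumb|[[William Godwin]], "..."]]'))

One would feed this the text of each <page> element (via a streaming XML
parser, say) rather than running regexes over the whole 31 GB file at once.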
Thanks for your help!
I'd like to know if there is a webpage that is more of a walk-through on
how to create a mirror of Wikipedia. A few people have asked me how
this can be done here at CLEF2011 and I was not able to give them a
good answer.
-- とある白い猫 (To Aru Shiroi Neko)
I've shot all but a couple jobs, and those will die before I go to bed.
Tomorrow after the first round of deployments and fixes has gone around
and the dust has settled, I'll crank the dumps back up until it's time
for the second round.
As people prepare for the "het deployment" of mw 1.18 you are going to
see some interruptions of the dumps while I get these hosts ready for
the switch. As a side effect of one of the configuration changes, the
host running the "large-ish" wikis is temporarily not running any jobs.
I should be starting those jobs again on Monday. The four "small wiki"
processes should be ok.
When we get around to the actual deployment, I will likely stop all jobs
until deployment is complete, as I cannot begin to guess what the output
would look like if the codebase were switched out underneath the dumpers
in the middle of a run.
Just like the scripts to preserve wikis, I'm working on a new script to
download all Wikimedia Commons images, packed by day. But I have limited
spare time. It is sad that volunteers have to do this without any help
from the Foundation.
I also started an effort on meta: (with low activity) to mirror the XML dumps.
If you know about universities or research groups that work with
Wiki[pm]edia XML dumps, they would be a possible successful target to mirror
them.
If you want to download the texts onto your PC, you only need 100 GB free
and to run this Python script.
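The idea is roughly this (a sketch, not the actual script; the file shown
is just the latest pages-articles dump mentioned in the earlier thread):

    import urllib.request

    URL = ("http://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")

    def fetch(url, dest):
        """Stream a dump to disk in 1 MiB chunks, never holding it in memory."""
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            while True:
                chunk = resp.read(1 << 20)
                if not chunk:
                    break
                out.write(chunk)

    fetch(URL, "enwiki-latest-pages-articles.xml.bz2")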
I heard that the Internet Archive saves the XML dumps quarterly or so, but
there has been no official announcement. Also, I heard about the Library of
Congress wanting to mirror the dumps, but no news for a long time.
L'Encyclopédie has an "uptime" of 260 years and growing. Will
Wiki[pm]edia projects reach that?
2011/6/2 Fae <faenwp(a)gmail.com>
> I'm taking part in an images discussion workshop with a number of
> academics tomorrow and could do with a statement about the WMF's long
> term commitment to supporting Wikimedia Commons (and other projects)
> in terms of the public availability of media. Is there an official
> published policy I can point to that includes, say, a 10 year or 100
> year commitment?
> If it exists, this would be a key factor for researchers choosing
> where to share their images with the public.
> Guide to email tags: http://j.mp/faetags
The September en wikipedia dumps are done. Folks who use them, note
that this is the first run with the generation of a pile of smaller
files. The naming scheme, as you will have noticed, has an additional
string: -p<first-page-id-contained>p<last-page-id-contained>. Expect the
specific groupings to change from one run to the next; they're time-based,
rather than based on the number of pages or revisions.
You may notice a gap of a few numbers between files; this would indicate
that those pages were deleted and not included in the dump at all.
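If it helps, spotting those gaps from the file names could look something
like this (a sketch; the two names below are made up for illustration):

    import re

    # The -p<first>p<last> suffix described above.
    RANGE_RE = re.compile(r'-p(\d+)p(\d+)')

    def page_range(filename):
        """Return the (first, last) page ids encoded in a dump file name."""
        m = RANGE_RE.search(filename)
        if m is None:
            raise ValueError("no -p<first>p<last> suffix in %r" % filename)
        return int(m.group(1)), int(m.group(2))

    names = sorted([
        "enwiki-20110901-pages-articles1.xml-p000000010p000002289.bz2",
        "enwiki-20110901-pages-articles2.xml-p000002292p000004575.bz2",
    ])
    prev_last = None
    for name in names:
        first, last = page_range(name)
        if prev_last is not None and first > prev_last + 1:
            # Ids in between belong to deleted pages, absent from the dump.
            print("gap: page ids %d-%d missing" % (prev_last + 1, first - 1))
        prev_last = last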
Since there were no issues with the network, database servers, broken MW
deployments etc., the run finished without any need for restarts of a
particular step; this is probably the fastest we'll ever see it run, in
a little under 8 days.
Any issues, please let me know. I expect people will need a script to
download these files easily; didn't someone on this list have a tool in
the works?
...another dump. August is done, July 7z are done, the last of the May
history and 7z are done. That brings us up to date.
I expect to test new code with production of many small files, as
previously discussed on this list, starting within the next few days.
This test will be for en wikipedia only, as that's the dump that's
hardest to run to completion. The results might be a perfectly good
dump, or not. Even if they are, I do not plan to try running en
wikipedia dumps twice a month, so don't get your hopes up. (Who would
process all that data every two weeks anyway?)