I can partially answer some of the questions.
Firstly, enwiki dumps have the pageid ranges in the file names, which should
be easy to parse so that you can get the pages you need. Whether the
stubs and the main dumps can be mapped one to one, I am not sure; ideally
it would be one to one so that parsing is easier.
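As a minimal sketch of parsing those pageid ranges out of the split-file names: the filename below is only an illustrative example, and the exact naming pattern (the `p<start>p<end>` suffix) is assumed from the enwiki convention described above, so adjust the regex if the real names differ.

```python
import re

# Assumed name pattern for enwiki split dump files, e.g.
#   enwiki-20151002-pages-articles1.xml-p000000010p000002289.bz2
FNAME_RE = re.compile(
    r"(?P<wiki>\w+)-(?P<date>\d{8})-pages-articles(?P<part>\d+)"
    r"\.xml-p(?P<first>\d+)p(?P<last>\d+)\.bz2$"
)

def pageid_range(filename):
    """Return (part number, first pageid, last pageid), or None if
    the filename does not match the assumed pattern."""
    m = FNAME_RE.search(filename)
    if not m:
        return None
    return int(m.group("part")), int(m.group("first")), int(m.group("last"))

print(pageid_range("enwiki-20151002-pages-articles1.xml-p000000010p000002289.bz2"))
# → (1, 10, 2289)
```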
Secondly, for dumps with parts, all of the parts must be finished before they are
made available for download. Also, there is a file called "dumpruninfo.txt" in
each dump directory that gives the status of each individual dump type; you can
parse it to get the status of the dump you need.
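A rough sketch of parsing that file: the semicolon-separated "name:...; status:...; updated:..." line format used below is an assumption about how dumpruninfo.txt is laid out, so verify it against an actual dump directory before relying on it.

```python
def parse_dumpruninfo(text):
    """Parse dumpruninfo.txt content into {job_name: status}.

    Assumes each line looks like
        name:xmlstubsdump; status:done; updated:2015-12-01 09:21:51
    which is a guess at the format; adjust if the real file differs.
    """
    statuses = {}
    for line in text.splitlines():
        # Split "key:value" fields; split on the first ':' only, since
        # the updated timestamp itself contains colons.
        fields = dict(
            part.strip().split(":", 1)
            for part in line.split(";")
            if ":" in part
        )
        if "name" in fields and "status" in fields:
            statuses[fields["name"]] = fields["status"]
    return statuses

sample = "name:xmlstubsdump; status:done; updated:2015-12-01 09:21:51\n"
print(parse_dumpruninfo(sample))
# → {'xmlstubsdump': 'done'}
```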
On a side note, would it be easier for you to parse the status of dumps using an API? I am
currently collecting ideas on T92996, so please do voice out if you wish to have certain
Hope this helps.
On 19 Dec 2015, at 01:28, wp mirror wrote:
I am trying to understand how the XML dumps are split (as seen for enwiki, frwiki,
dewiki, etc.). This is because I would like to write a script that can recognize when a
complete set of, say, `pages-articles' split dumps has been posted (even if the
`pages-meta-history' split dumps are not complete). To that end, I have some
Most wikis with split files (`dewiki', `frwiki', `wikidatawiki', and six
others) are split into four pieces. There is a one-to-one correspondence between the
`pages' and `stub' split files. It is easy to write code for this case.
How are the split dumps for the `enwiki' (and soon the `frwiki' and `dewiki')
named? I notice that the page range of the last `pages' split file changes every
month. There are no page ranges on the `stub' files. There is a many-to-one
correspondence between `pages-meta-history' and `stub-meta-history' split files.
It is harder to write code for this case. It is also not possible to use the
`mwxml2sql' transform tool unless there is a one-to-one correspondence between
`pages' and `stub' files.
How are the dumps split?
There seems to be a one-to-one correspondence between `pages-articles' and
`stub-articles' files. Yet, the `enwiki-20151002' dumps are split in an anomalous
way. The `pages-articles' dumps are split into 28 files, while the
`stub-articles' dumps are split into 27 files. Likewise with the
`pages-meta-current' (28 files) and `stub-meta-current' dumps (27 files). Should
my code be able to handle this as valid, or flag it as a bug?
There is a many-to-one correspondence between `pages-meta-history' and
`stub-meta-history' files. How do we understand this well enough to write code?
When split dumps are generated, are the files posted one-by-one, or atomically as a
complete set? In other words, how do we recognize when a `pages-articles' dump set is
complete, even if the `pages-meta-history' dump set is missing?
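One rough way to sketch the completeness check described above, under the assumption that part numbers in the filenames are the only reliable signal: a set looks complete when its part numbers form a contiguous run starting at 1. As the question itself notes, this heuristic cannot detect a missing final part unless the expected count is known from elsewhere (e.g. the matching stub files), so it is a sketch rather than a definitive check.

```python
import re

# Assumed: split pages-articles files carry a part number like
# "pages-articles3.xml" somewhere in the name.
PART_RE = re.compile(r"pages-articles(\d+)\.xml")

def looks_complete(filenames):
    """Heuristic: True when the part numbers present form the
    contiguous run 1..N. Cannot catch a missing final part."""
    parts = sorted(
        int(m.group(1))
        for f in filenames
        if (m := PART_RE.search(f))
    )
    return bool(parts) and parts == list(range(1, len(parts) + 1))

print(looks_complete([
    "enwiki-20151002-pages-articles1.xml-p1p100.bz2",
    "enwiki-20151002-pages-articles2.xml-p101p200.bz2",
]))
# → True (parts 1 and 2 present, no gap)
```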
On Fri, Dec 4, 2015 at 4:24 AM, Ariel T. Glenn wrote:
On Thu, 03-12-2015 at 15:30 -0700, Bryan White wrote:
I see where almost all the dumps have "Dump
complete" next to them
and the data has been transferred to labs. Problem is, the dumps are
not complete. Is this the new paradigm?... After each stage of the
dump, label them done and then transfer what files were generated?
Wash, rinse and repeat?
Transferring each file that is complete when the rsync runs is the new
paradigm, which has been happening since sometime last month. The
marking of all dumps as 'Dump complete' is a bug from my last deploy 2
days ago; I have to track that down. It should be listing them as
Xmldatadumps-l mailing list