Dear Ariel,
0) Context.
I am trying to understand how the XML dumps are split (as seen for enwiki, frwiki, dewiki, etc.). This is because I would like to write a script that can recognize when a complete set of split dumps (say, `pages-articles') has been posted, even if the `pages-meta-history' split dumps are not yet complete. To that end, I have some questions.
1) Naming.
Most wikis with split files (`dewiki', `frwiki', `wikidatawiki', and six others) are split into four pieces. There is a one-to-one correspondence between the `pages' and `stub' split files. It is easy to write code for this case.
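For concreteness, here is the pairing logic I have in mind for this simple case. The exact filename patterns are my assumption, not a documented spec:

    import re

    # Assumed patterns: <wiki>-<date>-pages-articles<N>.xml.bz2 and
    # <wiki>-<date>-stub-articles<N>.xml.gz, with N = 1..4.
    PAGES_RE = re.compile(r'-pages-articles(\d+)\.xml\.bz2$')
    STUB_RE = re.compile(r'-stub-articles(\d+)\.xml\.gz$')

    def pair_splits(filenames):
        # Pair `pages' and `stub' split files by part number.
        pages, stubs = {}, {}
        for f in filenames:
            if (m := PAGES_RE.search(f)):
                pages[int(m.group(1))] = f
            elif (m := STUB_RE.search(f)):
                stubs[int(m.group(1))] = f
        # One-to-one correspondence: the same part numbers on both sides.
        if pages.keys() != stubs.keys():
            raise ValueError('pages/stub part numbers do not match')
        return {n: (pages[n], stubs[n]) for n in sorted(pages)}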
How are the split dumps for the `enwiki' (and soon the `frwiki' and `dewiki') named? I notice that the page range of the last `pages' split file changes every month. There are no page ranges on the `stub' files. There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' split files. It is harder to write code for this case. It is also not possible to use the `mwxml2sql' transform tool unless there is a one-to-one correspondence between `pages' and `stub' files.
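For this harder case, the following is how I currently parse the names. The `pNNNpNNN' suffix and the extensions are my reading of the directory listings and may be incomplete:

    import re

    # Assumed shape of an enwiki-style split name, e.g. something like
    #   enwiki-20151002-pages-articles1.xml-p000000010p000010000.bz2
    # (the range digits here are illustrative, not an actual file).
    SPLIT_RE = re.compile(
        r'-pages-(articles|meta-current|meta-history)'
        r'(\d+)\.xml-p(\d+)p(\d+)\.(?:bz2|7z)$')

    def parse_split(filename):
        # Return (kind, part number, first page id, last page id),
        # or None if the name does not carry a page range.
        m = SPLIT_RE.search(filename)
        if m is None:
            return None
        kind, part, first, last = m.groups()
        return kind, int(part), int(first), int(last)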
2) Splitting.
How are the dumps split?
There seems to be a one-to-one correspondence between `pages-articles' and `stub-articles' files. Yet, the `enwiki-20151002' dumps are split in an anomalous way. The `pages-articles' dumps are split into 28 files, while the `stub-articles' dumps are split into 27 files. Likewise with the `pages-meta-current' (28 files) and `stub-meta-current' dumps (27 files). Should my code be able to handle this as valid, or flag it as a bug?
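Until I hear otherwise, I am inclined to flag it. A minimal sketch of the check I have in mind, on the assumption that a one-to-one correspondence is intended for these pairs:

    def check_counts(pages_files, stub_files, label):
        # Flag a count mismatch; whether it is valid or a bug is
        # exactly the question above.
        if len(pages_files) != len(stub_files):
            print('%s: %d pages files vs %d stub files -- mismatch'
                  % (label, len(pages_files), len(stub_files)))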
There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' files. What rule governs this mapping, so that a script can reproduce it?
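My working guess, which this question is meant to confirm, is that the digit after `history' in each `pages-meta-history' name identifies the originating `stub-meta-history' part, and that the page ranges within one part tile it with no holes. A sketch under those assumptions:

    from collections import defaultdict

    def group_history(parsed):
        # parsed: iterable of (part, first_page, last_page) tuples
        # taken from `pages-meta-history' names, e.g. via parse_split().
        groups = defaultdict(list)
        for part, first, last in parsed:
            groups[part].append((first, last))
        for part, ranges in groups.items():
            ranges.sort()
            # Guess: ranges within one part should be contiguous.
            for (_, a_last), (b_first, _) in zip(ranges, ranges[1:]):
                if b_first != a_last + 1:
                    print('part %d: gap between p%d and p%d'
                          % (part, a_last, b_first))
        return groups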
3) Posting.
When split dumps are generated, are the files posted one by one, or atomically as a complete set? In other words, how do we recognize when a `pages-articles' dump set is complete, even if the `pages-meta-history' dump set is missing?
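Absent an authoritative signal, the only completeness test I can construct from the filenames alone is contiguity of the page ranges, which cannot confirm the final file. A sketch (written to tolerate a repeated part number, in case the 28-versus-27 anomaly above is valid):

    def looks_complete(parsed):
        # parsed: list of (part, first_page, last_page), one per file.
        # Necessary but not sufficient: without knowing the wiki's
        # maximum page id, the last file cannot be confirmed this way.
        parsed = sorted(parsed)
        parts = sorted({p for p, _, _ in parsed})
        if parts != list(range(1, len(parts) + 1)):
            return False              # a part number is missing
        for (_, _, a_last), (_, b_first, _) in zip(parsed, parsed[1:]):
            if b_first != a_last + 1:
                return False          # hole in the page ranges
        return True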
Sincerely yours,
Kent