Dear Ariel,

0) Context.

I am trying to understand how the XML dumps are split (as seen for enwiki, frwiki, dewiki,, etc.). This is because I would like to write a script that can recognize when a complete set of, say `pages-articles', split dumps has been posted (even if the `pages-meta-history' split dumps are not complete). To that end, I have some questions.

1) Naming.

Most wikis with split files (`dewiki', `frwiki', `wikidatawiki', and six others) are split into four pieces. There is a one-to-one correspondence between the `pages' and `stub' split files. It is easy to write code for this case.

How are the split dumps for the `enwiki' (and soon the `frwiki' and `dewiki') named? I notice that the page range of the last `pages' split file changes every month. There are no pages ranges on the `stub' files. There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' split files. It is harder to write code for this case. It is also not possible to use the `mwxml2sql' transform tool unless there is a one-to-one correspondence between `pages' and `stub' files.

2) Splitting.

How are the dumps split.?

There seems to be a one-to-one correspondence between `pages-articles' and `stub-articles' files. Yet, the `enwiki-20151002' dumps are split in an anomalous way. The `pages-articles' dumps are split into 28 files, while the `stub-articles' dumps are split into 27 files. Likewise with the `pages-meta-current' (28 files) and `stub-meta-current' dumps (27 files). Should my code be able to handle this as valid, or flag it as a bug?

There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' files. How do we understand this well enough to write code?

3) Posting.

When split dumps are generated, are the files posted one-by-one, or atomically as a complete set? In other words, how do we recognize when a `pages-articles' dump set is complete, even if the `pages-meta-history' dump set is missing?

Sincerely Yours,

Kent

On Fri, Dec 4, 2015 at 4:24 AM, Ariel T. Glenn <aglenn@wikimedia.org> wrote:

Στις 03-12-2015, ημέρα Πεμ, και ώρα 15:30 -0700, ο/η Bryan White
έγραψε:

> I see where almost all the dumps have "Dump complete" next to them
> and the data has been transferred to labs. Problem is, the dumps are
> not complete. Is this the new paradigm?... After each stage of the
> dump, label them done and then transfer what files were generated?
> Wash, rinse and repeat?
>
> Bryan

> _______________________________________________

Transferring each file that is complete when the rsync runs is the new
paradigm, which has been happening since sometime last month. The
marking of all dumps as 'Dump complete' is a bug from my last deploy 2
days ago; I have to track that down. It should be listing them as
'Partial Dump'.

Ariel

_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l