dump problem? - Xmldatadumps-l - lists.wikimedia.org


dump problem?


Bryan White

3 Dec 2015

11:30 p.m.

I see where almost all the dumps have "Dump complete" next to them and the data has been transferred to labs. Problem is, the dumps are not complete. Is this the new paradigm?... After each stage of the dump, label them done and then transfer what files were generated? Wash, rinse and repeat? Bryan


Ariel T. Glenn

4 Dec 2015

10:24 a.m.

On 03-12-2015 (Thu) at 15:30 -0700, Bryan White wrote:

I see where almost all the dumps have "Dump complete" next to them and the data has been transferred to labs. Problem is, the dumps are not complete. Is this the new paradigm?... After each stage of the dump, label them done and then transfer what files were generated? Wash, rinse and repeat? Bryan

Transferring each file that is complete when the rsync runs is the new paradigm, which has been happening since sometime last month. The marking of all dumps as 'Dump complete' is a bug from my last deploy 2 days ago; I have to track that down. It should be listing them as 'Partial Dump'. Ariel


wp mirror

18 Dec 2015

6:28 p.m.

Dear Ariel,

0) Context. I am trying to understand how the XML dumps are split (as seen for enwiki, frwiki, dewiki, etc.). This is because I would like to write a script that can recognize when a complete set of, say `pages-articles', split dumps has been posted (even if the `pages-meta-history' split dumps are not complete). To that end, I have some questions.

1) Naming. Most wikis with split files (`dewiki', `frwiki', `wikidatawiki', and six others) are split into four pieces. There is a one-to-one correspondence between the `pages' and `stub' split files. It is easy to write code for this case.

How are the split dumps for the `enwiki' (and soon the `frwiki' and `dewiki') named? I notice that the page range of the last `pages' split file changes every month. There are no page ranges on the `stub' files. There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' split files. It is harder to write code for this case. It is also not possible to use the `mwxml2sql' transform tool unless there is a one-to-one correspondence between `pages' and `stub' files.

2) Splitting. How are the dumps split? There seems to be a one-to-one correspondence between `pages-articles' and `stub-articles' files. Yet, the `enwiki-20151002' dumps are split in an anomalous way. The `pages-articles' dumps are split into 28 files, while the `stub-articles' dumps are split into 27 files. Likewise with the `pages-meta-current' (28 files) and `stub-meta-current' dumps (27 files). Should my code be able to handle this as valid, or flag it as a bug?

There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' files. How do we understand this well enough to write code?

3) Posting. When split dumps are generated, are the files posted one-by-one, or atomically as a complete set? In other words, how do we recognize when a `pages-articles' dump set is complete, even if the `pages-meta-history' dump set is missing?

Sincerely Yours,
Kent
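The 28-versus-27 mismatch described in question 2 could be detected with something like the sketch below. It is only an illustration: the file-name layout in the regular expression (`<wiki>-<date>-<type><part>.xml`, optionally followed by `-p<first>p<last>`, then a compression suffix) is an assumption that the thread does not confirm, and `parts_by_type` and `mismatched_pairs` are hypothetical helper names. It compares only the two pairs that the message describes as one-to-one.

```python
import re
from collections import defaultdict

# ASSUMPTION: file names look like
#   <wiki>-<date>-<type><part>.xml[-p<first>p<last>].<ext>
# This layout is not confirmed anywhere in the thread.
SPLIT_NAME = re.compile(
    r"^(?P<wiki>[a-z_]+)-(?P<date>\d{8})-"
    r"(?P<type>pages-articles|pages-meta-current|pages-meta-history|"
    r"stub-articles|stub-meta-current|stub-meta-history)"
    r"(?P<part>\d+)\.xml"
    r"(?:-p(?P<first>\d+)p(?P<last>\d+))?"
    r"\.(?:bz2|gz|7z)$"
)


def parts_by_type(filenames):
    """Map each dump type to the set of part numbers found in the names."""
    parts = defaultdict(set)
    for name in filenames:
        match = SPLIT_NAME.match(name)
        if match:  # names that do not fit the assumed layout are ignored
            parts[match.group("type")].add(int(match.group("part")))
    return dict(parts)


def mismatched_pairs(filenames):
    """List (pages_type, stub_type, n_pages, n_stub) where part counts differ."""
    parts = parts_by_type(filenames)
    mismatches = []
    for suffix in ("articles", "meta-current"):
        pages = parts.get("pages-" + suffix, set())
        stubs = parts.get("stub-" + suffix, set())
        if len(pages) != len(stubs):
            mismatches.append(
                ("pages-" + suffix, "stub-" + suffix, len(pages), len(stubs))
            )
    return mismatches
```

Whether a reported mismatch should be treated as a bug or as a valid layout is exactly the open question above; the sketch only surfaces it.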


Hydriz Scholz

19 Dec 2015

4:43 a.m.

I can help to partially answer some of the questions.

Firstly, enwiki dumps have the page ID ranges in the file names, which should be easy to parse so that you can find the file containing the page that you need. Whether or not the stubs and the main dumps can be mapped one-to-one, I am not too sure. Ideally they should be one-to-one so that they are easier to parse.

Secondly, for dumps with parts, all of the parts would have to be finished before they are made available for download. Also, there is a file called "dumpruninfo.txt" in each dump that gives the status of each individual type of dump. You can parse that to get the status of the dump that you need.

On a side note, would it be easier for you to parse the status of dumps using an API? I am currently collecting ideas on T92996 [1]; please do speak up if you wish to have certain functionality available.

Hope this helps.

[1]: https://phabricator.wikimedia.org/T92966
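A rough sketch of the "dumpruninfo.txt" approach follows. The thread does not show the file's contents, so the line format used here (`name:<job>; status:<state>; updated:<timestamp>`), the status value `done`, and job names such as `articlesdump` are all assumptions to check against a real file. `parse_dumpruninfo` and `job_is_done` are hypothetical helper names.

```python
def parse_dumpruninfo(text):
    """Parse dumpruninfo.txt text into {job_name: {"status": ..., "updated": ...}}.

    ASSUMPTION: each line holds "key:value" fields separated by ";", e.g.
        name:articlesdump; status:done; updated:2015-12-18 10:00:00
    """
    jobs = {}
    for line in text.splitlines():
        fields = {}
        for chunk in line.split(";"):
            # Split on the first ":" only, so timestamps keep their colons.
            key, sep, value = chunk.strip().partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "name" in fields:
            name = fields.pop("name")
            jobs[name] = fields
    return jobs


def job_is_done(jobs, name):
    """True only if the named job is present and its status is "done" (assumed value)."""
    return jobs.get(name, {}).get("status") == "done"
```

A script could fetch the file for a given wiki and date, call `parse_dumpruninfo` on its text, and start downloading the `pages-articles' files only once the corresponding job reports as done, regardless of the state of the history jobs.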


_______________________________________________ Xmldatadumps-l mailing list Xmldatadumps-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

