Hello, I was wondering how the decision is reached to split enwiki pages-meta-history into, say, N XML files. How is N determined? Is it based on something like "let's try to have X many pages per XML file" or "Y many revisions per XML file" or trying to keep the size (GB) of each XML file roughly equivalent? Or is N just an arbitrary number chosen because it sounds nice? :)
Thanks, Diane
We have N = 27 because more than that overloads the CPUs on the box, with the result that we wind up with a pile of truncated files.
We guess at the number of pages to go into each file, hoping to get roughly the same execution time to produce each piece.
Ariel
I see... and how do you select the pages that go into each XML file?
Thanks! Diane
In order by page id.
Ariel
Right, that makes sense. Ok, that concludes my questioning for now. I shall now resume lurking. :)
Thanks, Diane
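Putting the two answers together: the history dump is cut into contiguous page-id ranges, with the number of pages per piece guessed so that each piece takes about the same time to produce. Below is a minimal sketch of that idea, not the actual dump scripts; using per-page revision counts as the work estimate, and all names in the code, are illustrative assumptions.

# Illustrative sketch only (not the actual dump scripts): split pages,
# kept in page-id order, into N contiguous chunks whose estimated work
# (approximated here by revision counts) comes out roughly equal.

def split_pages(pages, n_files=27):
    """pages: list of (page_id, revision_count) tuples sorted by page_id.
    Returns one (first_page_id, last_page_id) range per output file."""
    total_work = sum(revs for _, revs in pages)
    target = total_work / n_files        # ideal share of work per file
    ranges = []
    chunk_start = None
    chunk_work = 0
    files_left = n_files
    for i, (page_id, revs) in enumerate(pages):
        if chunk_start is None:
            chunk_start = page_id
        chunk_work += revs
        last_page = (i == len(pages) - 1)
        # Close the current chunk once it has roughly its share of the
        # work; the final file simply takes whatever pages remain.
        if (chunk_work >= target and files_left > 1) or last_page:
            ranges.append((chunk_start, page_id))
            chunk_start = None
            chunk_work = 0
            files_left -= 1
    return ranges

# Made-up data: (page_id, revision_count), already ordered by page id.
pages = [(1, 300), (2, 400), (3, 350), (4, 500), (5, 250),
         (6, 300), (7, 450), (8, 200), (9, 250)]
print(split_pages(pages, n_files=3))
# -> [(1, 3), (4, 6), (7, 9)]   (estimated work per piece: 1050, 1050, 900)

In practice the per-page work estimate could come from anywhere (revision counts, sizes, timings from a previous run); the point is just that each output file covers a contiguous page-id range sized to balance the load across the pieces.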