Right, that makes sense. Ok, that concludes my questioning for now. I shall now resume
lurking. :)
Thanks,
Diane
-----Original Message-----
From: Ariel T. Glenn [mailto:ariel@wikimedia.org]
Sent: Wednesday, April 25, 2012 9:46 AM
To: Napolitano, Diane
Cc: Xmldatadumps-l@lists.wikimedia.org
Subject: RE: [Xmldatadumps-l] Question about enwiki pages-meta-history splits
On 25-04-2012, Wed, at 06:34 -0700, Napolitano, Diane
wrote:
I see...and how do you select the pages that go into
each XML file?
In order by page id.
Ariel
Thanks!
Diane
-----Original Message-----
From: Ariel T. Glenn [mailto:ariel@wikimedia.org]
Sent: Wednesday, April 25, 2012 5:57 AM
To: Napolitano, Diane
Cc: Xmldatadumps-l@lists.wikimedia.org
Subject: Re: [Xmldatadumps-l] Question about enwiki pages-meta-history splits
On 24-04-2012, Tue, at 07:17 -0700, Napolitano, Diane
wrote:
Hello, I was wondering how the decision is
reached to split enwiki pages-meta-history into, say, N XML files. How is N determined?
Is it based on something like "let's try to have X many pages per XML file"
or "Y many revisions per XML file" or trying to keep the size (GB) of each XML
file roughly equivalent? Or is N just an arbitrary number chosen because it sounds nice?
:)
We have N = 27 because more than that overloads the CPUs on the box,
with the result that we wind up with a pile of truncated files.
We guess at the number of pages to go into each file hoping to get
roughly the same execution time to produce each piece.
Ariel
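[Editor's note: the bucketing scheme Ariel describes — pages taken in page-id
order, with chunk boundaries guessed so each of the N files takes roughly the
same time to produce — can be sketched as below. This is a hypothetical
illustration, not the actual dump code; it assumes per-page revision counts
are a usable proxy for execution time.]

```python
# Hypothetical sketch of the split described above: pages are processed in
# page-id order, and chunk boundaries are chosen so each of the N output
# files covers roughly the same estimated amount of work. "Work" here is
# approximated by per-page revision counts (an assumption; the real dump
# scripts use their own estimates).

def split_by_estimated_work(rev_counts, n_chunks):
    """rev_counts: revision counts for each page, in page-id order.
    Returns a list of (start_index, end_index) half-open pairs,
    one per chunk, covering the whole list."""
    total = sum(rev_counts)
    target = total / n_chunks  # per-file work budget
    chunks = []
    start = 0
    acc = 0
    for i, count in enumerate(rev_counts):
        acc += count
        # Close a chunk once it reaches the per-chunk target, leaving
        # the final chunk to absorb whatever remains.
        if acc >= target and len(chunks) < n_chunks - 1:
            chunks.append((start, i + 1))
            start = i + 1
            acc = 0
    chunks.append((start, len(rev_counts)))
    return chunks

# Example: 9 pages, a few with heavy revision histories, split 3 ways.
print(split_by_estimated_work([5, 1, 1, 1, 5, 1, 1, 1, 5], 3))
```

Because the boundaries are guesses based on estimated work, the resulting
files end up with similar production times rather than similar page counts,
which matches the behavior described in the thread.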