On 8/14/06, Eric Astor eastor1@swarthmore.edu wrote:
Experiments with page output, actually, for the full Simple English Wikipedia... My experiments also showed compressed HTML for the full page output to be roughly 3x larger than compressed wikitext. That is full page output, so the point is probably lost somewhat - but considering that the full output should be *more* repetitive than the XML-ized markup, I'd still be surprised if the database didn't double in size, at a minimum. Anyone care to run the full database through a conversion to XML and run the experiments? I'm a little short on time lately.
I've converted the first few sections of [[Paris]] (by hand) and done the comparison on them. (The XML isn't perfect, since I didn't convert character entities, for instance, but it suffices.) The wikitext input was 12.7 KB and the XML 17.1 KB, about 35% larger. After ZIPping with normal compression, the gap narrowed to 5.12 KB versus 5.43 KB (about 6%), and with best-compression RAR, the XML was actually a bit smaller (4.47 KB versus 4.61 KB). If we *do* work with compressed databases, then the increase in size from XML looks to be trivial.
(Files available upon request.)
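
For anyone who wants to repeat the comparison on a larger sample, here is a rough Python sketch of the measurement. It is not what I ran (I used ZIP and RAR by hand): zlib stands in for ZIP's DEFLATE, and bz2 and lzma stand in for a "best compression" setting, since RAR isn't in the standard library. The function names and the toy sample strings are made up for illustration, so swap in real wikitext and its XML conversion.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` raw and under three stdlib compressors."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 6)),         # DEFLATE at the default-ish level
        "bz2": len(bz2.compress(data, 9)),           # bzip2 at its highest level
        "lzma": len(lzma.compress(data, preset=9)),  # stand-in for "best" compression, not RAR
    }


def size_ratios(wikitext: bytes, xml: bytes) -> dict:
    """Ratio of XML size to wikitext size per method (1.0 means no growth)."""
    w = compressed_sizes(wikitext)
    x = compressed_sizes(xml)
    return {name: x[name] / w[name] for name in w}


if __name__ == "__main__":
    # Hypothetical toy inputs; replace with a real article and its XML conversion.
    wikitext = ("'''Paris''' is the capital of [[France]].\n" * 200).encode("utf-8")
    xml = ("<p><b>Paris</b> is the capital of <link>France</link>.</p>\n" * 200).encode("utf-8")
    for name, ratio in size_ratios(wikitext, xml).items():
        print(f"{name}: XML is {ratio:.2f}x the size of the wikitext")
```

A toy sample this repetitive compresses far better than real articles would, so only the ratios from real dumps mean anything.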