On Sun, Jun 29, 2008 at 10:03 AM, Lars Aronsson lars@aronsson.se wrote:
I'd like to propose a quality metric: The difference in rank between the article count and the compressed database size.
I think this is a good metric, especially because it's a relative metric (since it's effectively comparing projects against their peers to see how mature they are).
Someone earlier was discussing article sizes, so I hacked up a script to graph the distribution of article sizes:
http://www.toolserver.org/~thebainer/articlesizes/
Most graphs share the same basic shape, with a roughly logarithmic distribution once you get past the initial peak (see the English Wikipedia graph for an example of what I mean), but some differ, and the differences tend to coincide with what has already been observed.
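For anyone who wants to try something similar, here is a minimal sketch of the bucketing step behind such a graph. It is not the script linked above; the function name `size_histogram`, the 500-byte bin width, and the sample sizes are all made up for illustration.

```python
from collections import Counter

def size_histogram(sizes, bin_width=500):
    """Count article sizes (in bytes) in fixed-width buckets.

    Returns a dict mapping each bucket's lower edge to the number of
    articles whose size falls in [lower, lower + bin_width).
    """
    counts = Counter((size // bin_width) * bin_width for size in sizes)
    return dict(sorted(counts.items()))

# Made-up article sizes in bytes, purely illustrative (not real wiki data).
sample_sizes = [120, 480, 950, 1020, 2100, 2300, 2450, 7800]
bin_width = 500
histogram = size_histogram(sample_sizes, bin_width)

# Crude text plot: one '#' per article in each bucket.
for lower, n in histogram.items():
    print(f"{lower:>6}-{lower + bin_width - 1:<6} {'#' * n}")
```

A second peak, like the one described for bot-created stubs, would show up here as a cluster of articles in one bucket away from the main peak.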
The Swedish Wikipedia was (when this table was compiled) the 10th biggest by article count, but the 12th biggest by compressed database size, so its quality is 10 - 12 = -2.
Swedish Wikipedia is distributed in almost exactly the same way as English Wikipedia, with the difference being that its average size is less than half that of En's, at around 1900 bytes.
The Russian Wikipedia was the 11th by article count, but 9th by compressed database size, so its quality is +2. This doesn't mean the Russian Wikipedia is better than the English one, only that it is better than (two of) its peers of similar size.
Not only does the Russian Wikipedia have a high average article size (about 5500 bytes, compared with, for example, English Wikipedia at around 4100 bytes) but its graph, which has multiple peaks, seems to show that, unlike many other projects, it has more mature, medium-size articles than it does stubs.
The Volapük Wikipedia was 15th by article count, but worse than 30th by compressed database size (the table is incomplete), so its quality is worse than -15.
The Volapük Wikipedia has an unusual distribution, with two peaks. One is in the usual place, just below the average size (which is low, at just over 1000 bytes), while the other is at around 2 to 2.5 kB, which corresponds to the size of the geography stubs created by SmeiraBot.
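The rank-difference metric proposed in this thread is simple enough to sketch in a few lines. This assumes you have two tables keyed by project, one of article counts and one of compressed dump sizes; the function name `rank_difference`, the project keys, and the numbers below are placeholders, not real figures.

```python
def rank_difference(article_counts, dump_sizes):
    """Rank by article count minus rank by compressed database size.

    Rank 1 is the largest. A positive result means the project ranks
    higher by size than by article count; a negative result means the
    opposite. Ties are broken by input order.
    """
    def ranks(values):
        ordered = sorted(values, key=values.get, reverse=True)
        return {name: position + 1 for position, name in enumerate(ordered)}

    by_articles = ranks(article_counts)
    by_size = ranks(dump_sizes)
    return {name: by_articles[name] - by_size[name] for name in article_counts}

# Placeholder projects and numbers, purely illustrative.
article_counts = {"aa": 300, "bb": 200, "cc": 100}
dump_sizes = {"aa": 50, "bb": 10, "cc": 30}

quality = rank_difference(article_counts, dump_sizes)
print(quality)
```

With these placeholder numbers, "bb" is 2nd by article count but 3rd by size, so it scores 2 - 3 = -1, the same arithmetic as the Swedish example (10 - 12 = -2).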