[Wikipedia-l] Wikipedia v. Britannica / Statistics

Erik Zachte e.p.zachte at chello.nl
Mon Aug 18 21:13:33 UTC 2003

> Really, how important is it that we be always looking 
> over our shoulders 

Britannica is a well-respected encyclopedia, and rightly so. I think it
won't hurt to set ourselves a goal and learn from the strengths and
weaknesses of the 'competition'. It is not a deathmatch though, just for
honours. I would hate to see Wikipedia push Britannica out of the market
in four years' time.



I would like to change the stats on 'mean article size' and 'number of
articles over x bytes': now they are counted in the traditional
all-inclusive way. I think it would be fairer as a comparison with
Britannica and other printed reference books to count only readable
text, so no HTML/wiki markup, no hidden links. This will lower the
counts a bit (5-10%?), but will counter possible claims that we use
inflated figures.

Word count and page size distribution are on my to-do list. Here also I
would like the counts to be conservative.

> Beware, the remainder of this post deals with the dirty details <

It seems there is no standard method for counting words. Tools like wc
tend to inflate counts, since every run of characters between line
breaks, spaces and tabs is seen as a word, punctuation included.
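
To illustrate the inflation, here is a small Python sketch that mimics
whitespace-delimited counting. The name wc_style_count is made up for
this example, and str.split() only approximates what wc -w does.

```python
def wc_style_count(text: str) -> int:
    """Count whitespace-delimited tokens, roughly the way wc -w does."""
    # str.split() with no arguments splits on any run of whitespace,
    # so stray brackets, operators and markup each count as a "word".
    return len(text.split())


# A formula fragment with only three real items (1.8, 32 and arguably none
# of the single letters) still yields nine tokens.
print(wc_style_count("( F = 1.8 C + 32 ) ;"))  # 9
```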

So I propose the following count:

- all HTML/wiki markup is stripped
- image, external and interwiki links, and the hidden part of internal
links, are stripped (so the focus is on the article proper)
- &#...; HTML character entities are each replaced by a single
character, say 'x'
- numbers (consecutive digits, commas, points) are converted to, say,
'word', so 234,345.56 counts as one
- now all series of 0-9,a-z,A-Z,À-ÿ characters of length >=2 are counted

I know there are one-character words, so I miss a tiny fraction, but
counting those would also encompass single characters in formulas,
calculations and what have you. The error will be minimal.
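
The rules above can be sketched in Python roughly as follows. This is
only one literal reading of them, not the actual statistics script: the
function name count_words and all the regular expressions are
assumptions made for illustration, and wiki table markup, templates and
nested image captions are not handled.

```python
import re


def count_words(wikitext: str) -> int:
    """Conservative word count, following one literal reading of the rules."""
    text = wikitext
    # Image and interwiki links are dropped entirely (assumed shape:
    # "image:" or a 2-3 letter language code before the colon).
    text = re.sub(r"\[\[(?:image|[a-z]{2,3}):[^\]]*\]\]", " ", text,
                  flags=re.IGNORECASE)
    # Internal links: keep only the visible label of [[target|label]].
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # External links, bracketed or bare, are dropped.
    text = re.sub(r"\[(?:https?|ftp)://[^\]]*\]", " ", text)
    text = re.sub(r"(?:https?|ftp)://\S+", " ", text)
    # HTML tags are removed; other markup punctuation ('' == * etc.)
    # cannot form a word in the final step, so it is left alone.
    text = re.sub(r"<[^>]*>", " ", text)
    # Numeric character references such as &#233; become one character.
    text = re.sub(r"&#\w+;", "x", text)
    # A run of digits, commas and points becomes the single token 'word'.
    text = re.sub(r"\d[\d,.]*", "word", text)
    # Count runs of at least two letters or digits.
    return len(re.findall(r"[0-9a-zA-ZÀ-ÿ]{2,}", text))


print(count_words("[[Amsterdam (city)|Amsterdam]] sold 234,345 icecreams"))  # 4
```

Run on the sample sentence below, this literal reading gives 20 rather
than 19. The extra word is possibly the subscript 2, which the sketch
converts to 'word', so the exact figure depends on such details.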

So, for example, the following counts as 19 words (wc counts 29):

Yesterday's temperature was 35.5° Celsius, 95 Fahrenheit ( F = 1.8 C +
32 ) ; fresh water (H <sub>2</sub> O) was in high demand, 234,345
icecreams were sold.


Any comments?

Erik Zachte
