Really, how important is it that we be always looking over our shoulders?
Britannica is a well-respected encyclopedia, and rightly so. I think it won't hurt to set ourselves a goal and learn from the strengths and weaknesses of the 'competition'. It is not a deathmatch though, just for honours. I would hate to see Wikipedia push Britannica out of the market in four years' time.
------
Statistics:
A I would like to change the stats on 'mean article size' and 'number of articles over x bytes': now they are counted in the traditional all-inclusive way. I think it would be fairer as a comparison with Britannica and other printed reference books to count only readable text, so no html/wiki markup, no hidden links. This will lower the counts a bit (5-10%?), but will counter possible claims that we use inflated figures.
B Word count and page size distribution are on my todo list. Here also I would like the counts to be conservative.
Beware, the remainder of this post deals with the dirty details.
It seems there is no standard method for counting words. Tools like wc tend to inflate counts, since every run of characters between line breaks, spaces and tabs is seen as a word.
So I propose the following count:
- all html/wiki markup is stripped
- image, external and interwiki links and hidden part of internal links, are stripped (so the focus is on the article proper)
- &#...; html entities are replaced by a single character, say 'x'
- numbers (consecutive digits, commas, points) are converted to, say, 'word', so 234,345.56 counts as one
- now all series of 0-9,a-z,A-Z,À-ÿ characters of length >=2 are counted
I know there are one-character words, so I miss a tiny fraction, but counting those would also encompass single characters in formulas, calculations and what have you. The error will be minimal.
So, for example, the following counts as 19 words (wc counts 29):
Yesterday's temperature was 35.5° Celsius, 95 Fahrenheit ( F = 1.8 C + 32 ) ; fresh water (H <sub>2</sub> O) was in high demand, 234,345 icecreams were sold.
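The counting rules above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name `conservative_word_count` and all the regular expressions are hypothetical, the wiki-markup handling covers just a few common link forms, and single digits are treated as numbers like any other.

```python
import re

# All patterns are simplified assumptions, not a real wiki parser.
IMAGE_LINK = re.compile(r"\[\[(?:Image|File):[^\]]*\]\]", re.I)
INTERWIKI = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]*\]\]")
EXTERNAL = re.compile(r"\[(?:https?|ftp)://[^\]]*\]")
PIPED_LINK = re.compile(r"\[\[[^\]|]*\|([^\]]*)\]\]")  # [[target|label]]
PLAIN_LINK = re.compile(r"\[\[([^\]]*)\]\]")           # [[target]]
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
QUOTES = re.compile(r"'{2,}")                          # ''italic'', '''bold'''
ENTITY = re.compile(r"&#[0-9A-Za-z]+;")                # &#...; references
NUMBER = re.compile(r"\d[\d.,]*")                      # digits, commas, points
WORD = re.compile(r"[0-9A-Za-z\u00C0-\u00FF]{2,}")     # 0-9,a-z,A-Z,À-ÿ, length >= 2


def conservative_word_count(text):
    # Image, interwiki and external links are dropped entirely.
    for pattern in (IMAGE_LINK, INTERWIKI, EXTERNAL):
        text = pattern.sub(" ", text)
    # Internal links keep only their visible part.
    text = PIPED_LINK.sub(r"\1", text)
    text = PLAIN_LINK.sub(r"\1", text)
    # Remaining markup is stripped.
    text = HTML_TAG.sub("", text)
    text = QUOTES.sub("", text)
    # Entities become one character; each number becomes one 'word'.
    text = ENTITY.sub("x", text)
    text = NUMBER.sub("word", text)
    return len(WORD.findall(text))


SAMPLE = (
    "Yesterday's temperature was 35.5\u00b0 Celsius, 95 Fahrenheit "
    "( F = 1.8 C + 32 ) ; fresh water (H <sub>2</sub> O) was in high "
    "demand, 234,345 icecreams were sold."
)

print(len(SAMPLE.split()))              # whitespace split, as wc does
print(conservative_word_count(SAMPLE))  # count under the rules above
```

On the example sentence a plain whitespace split gives 29, matching wc. This sketch returns 20 rather than 19, because it converts the lone subscript 2 to 'word' and counts it; leaving single digits unconverted would give 19.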
Any comments?
Erik Zachte
Erik Z.-
- image, external and interwiki links and hidden part of internal
links, are stripped (so the focus is on the article proper)
External link text and interwiki link text should not be stripped. This is part of the article proper. Image captions should not be stripped either.
- now all series of 0-9,a-z,A-Z,À-ÿ characters of length >=2 are counted
Single character words, numbers etc. must be counted. This is normal in word counting. Not doing so is not conservative but wrong.
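For comparison, a more inclusive count along these lines might look like the sketch below. It is an assumption-laden illustration, not an agreed method: the name `inclusive_word_count` and the token pattern are hypothetical, a number with internal commas or points is kept as one token, an apostrophe inside a word does not split it, and wiki link handling is left out entirely.

```python
import re

HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
TOKEN = re.compile(
    r"\d+(?:[.,]\d+)*"                                       # 234,345.56 is one token
    r"|[A-Za-z\u00C0-\u00FF]+(?:'[A-Za-z\u00C0-\u00FF]+)*"   # words, e.g. Yesterday's
)


def inclusive_word_count(text):
    # Tags are replaced by a space so that their contents stay separate tokens.
    return len(TOKEN.findall(HTML_TAG.sub(" ", text)))


SAMPLE = (
    "Yesterday's temperature was 35.5\u00b0 Celsius, 95 Fahrenheit "
    "( F = 1.8 C + 32 ) ; fresh water (H <sub>2</sub> O) was in high "
    "demand, 234,345 icecreams were sold."
)

print(inclusive_word_count(SAMPLE))
```

With single characters and all numbers included, this sketch gives 24 for the example sentence, which sits between the conservative figure and the 29 reported by wc.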
Regards,
Erik
wikipedia-l@lists.wikimedia.org