Hi Erik, I'm crossposting this message to wikisource-l in case anyone there
would like to give some input.
The http://stats.wikimedia.org/wikisource/EN/TablesDatabaseWords.htm page seems
to be inaccurate. Apparently your tool counts only words in the main
namespace. That may work for projects like Wikipedia, with their very long
discussion pages in the Project: namespace on some subjects (such as deletion
requests). But it doesn't work for Wikisource, for two main reasons:
1) Some subdomains have custom namespaces for short biographies and lists of
works by author (en, it, pt and others), while some keep them in the main
namespace (fr, de, es and others). This is a minor issue, since the number of
words on those pages is small.
2) Some Wikisources (de, fr and en, according to
http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics ) have a large
amount of content in a custom namespace devoted to the ProofreadPage
Extension ( http://www.mediawiki.org/wiki/Extension:Proofread_Page ). This
content is displayed in the main namespace through page transclusion (see
http://en.wikisource.org/w/index.php?title=35_Sonnets&action=edit for an
example).
Is it possible to include the custom namespaces of all Wikisources in your
automated calculation tool?
[[:m:User:555]]
> From: "John Vandenberg" <jayvdb(a)gmail.com>
> To: "discussion list for Wikisource, the free library" <wikisource-l(a)lists.wikimedia.org>
> Subject: Re: [Wikisource-l] Changing the Wikisource main page
> Date: Sun, 14 Sep 2008 06:03:36 +1000
> A Chinese "word" has more meaning than a Spanish "word". I don't have
> the numbers, but the word "word" is not the same in all languages.
> This makes words a very complex statistic.
>
> --
> John
I may have found a very simple solution: if we agree that a Chinese character is a word as we understand "word", then we have to find out how many characters there are. I ran a test and found that a Chinese character takes 3 octets (in UTF-8). The same statistics tell us that the average size of an article on the Chinese Wikisource is 1957 octets. So there are about 1957/3 = 652.3 words per article. The statistics count (as of May 31, 2008) 29084 articles for the Chinese Wikisource, and 652.3*29084 gives roughly 18.9M words in total.
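For anyone who wants to redo the arithmetic, here is a minimal Python sketch of that estimate. It only reuses the figures quoted above (1957 octets per article, 29084 articles); the function name `estimate_words` is my own, and the sketch assumes every octet of an article belongs to a 3-octet character, which ignores markup, punctuation and ASCII text, so the result is only a rough approximation.

```python
# Rough estimate only: assumes all article bytes are 3-byte UTF-8 characters.
# "字" is just a sample Chinese character used to check the byte length.
BYTES_PER_CJK_CHAR = len("字".encode("utf-8"))


def estimate_words(mean_bytes_per_article, article_count,
                   bytes_per_char=BYTES_PER_CJK_CHAR):
    """Treat each character as one word and scale by the article count."""
    words_per_article = mean_bytes_per_article / bytes_per_char
    return words_per_article * article_count


# Figures quoted from the statistics page (May 31, 2008).
estimate = estimate_words(1957, 29084)
print(f"{BYTES_PER_CJK_CHAR} octets per character")
print(f"{estimate / 1e6:.2f}M words")
```

Run as is, it should report 3 octets per character and a total just under 19M, in line with the figure above.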
The only remaining question is: why does the statistics page give 29.3M as the number of words for the Chinese Wikisource? Is that the number of "groups of letters"?
Anyway, if we accept these figures, the ranking would be:
1. English: 211M words
2. French: 125M
3. Spanish: 41.8M
4. Russian: 22.2M
5. Chinese: 18.9M
6. Polish: 18.2M
7. Portuguese: 15.5M
8. German: 14.4M
9. Italian: 12.0M
10. Arabic: 10.6M