Any progress on this?
I'm asking because I am going to destroy two statistics in the next few weeks:
1) Destroying the "article" count: Portuguese Wikisource will be bumped from
7th to 2nd in article count when I finish importing the A to N articles from
a public domain dictionary. (See more about it at the end of this message.)
2) Destroying the "all pages" from
: I've found
on Internet Archive the 24 volumes of the "Décadas da Ásia" by
and I will be uploading
those DjVu files to Commons and the extracted text to Portuguese
Wikisource in a few hours.
(I've talked with some users about mass upload of OCR-ed texts and no one
has objected so far; for them, this is like the bot creation of stub
articles on Wikipedias. I hope that you all agree with this
opinion and that no one starts a "radical cleanup" proposal like the one on the
Volapuk Wikipedia in December 2007 ;-) . Please note that the ProofreadPage
Statistics page has more stats and graphs than the "all pages" one.)
More about the dictionary and the import
The Portuguese speakers at Distributed Proofreaders @ Project Gutenberg are
proofreading a dictionary and making it publicly available as the work is done
. The "source-code" (the words database) is
also shared http://dicionario-aberto.net/sources.html
. Since it is in the public
domain both in the United States and in Portugal (the country of origin), I'm
going to import it to the Portuguese Wikisource. I've proposed it on
ago and no one has made any objections to it until now.
From A to N that dictionary has more than 85k entries
and no one on
Portuguese Wikisource is opposed to importing one entry per article.
offered to import the dictionary to the Portuguese Wiktionary (
but only one user has expressed support for it and two have opposed;
the opposers suggested changing the "no articles were found" message on
pt.wiktionary a bit, pointing readers to search for words on that dictionary in
The import will start in a few weeks. I'm only waiting for more opinions
on pt.wikisource and pt.wiktionary and for FlaggedRevs to be enabled on
pt.wikisource (since bot-created articles are automatically flagged as
approved and I don't plan to spend ten years flagging those pages by
On Tue, Sep 16, 2008 at 9:43 PM, Syagrius <syagrius(a)caramail.com> wrote:
To: "discussion list for Wikisource, the free library" <
Subject: Re: [Wikisource-l] Changing the Wikisource
Date: Sun, 14 Sep 2008 06:03:36 +1000
A Chinese "word" has more meaning than
a Spanish "word". I don't have
the numbers, but the word "word" is not the same in all languages.
This makes words a very complex statistic.
Wikisource-l mailing list
I may have found a very simple solution: if we agree that a Chinese sign
is a word as we understand "word", then we have to find out how many signs there
are. I ran a test and found that a Chinese sign takes 3 octets. The very same
statistics tell us that the average size of an article on the
Chinese Wikisource is 1957 octets. So an average article holds 1957/3 = 652.3 words. The
statistics count (on May 31, 2008) 29084 articles for the Chinese
Wikisource, and 652.3*29084 gives 18.9M words in total.
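
The arithmetic above can be reproduced with a short Python sketch. It assumes the pages are stored as UTF-8 (the message only says "3 octets", so the encoding is an assumption), and it treats every octet of an article as part of a Chinese sign, ignoring wiki markup and ASCII punctuation. The function names and the sample string are made up for illustration.

```python
# Figures quoted in the message (statistics as of May 31, 2008).
AVG_OCTETS_PER_ARTICLE = 1957
ARTICLE_COUNT = 29084


def utf8_octets_per_sign(sample: str) -> float:
    """Average number of UTF-8 octets per character in the sample."""
    return len(sample.encode("utf-8")) / len(sample)


def estimate_total_signs(avg_octets: float, articles: int,
                         octets_per_sign: float) -> float:
    """Signs per article (octets / octets per sign) times article count."""
    return avg_octets / octets_per_sign * articles


# Hypothetical sample text; these four characters are each 3 octets in UTF-8.
octets_per_sign = utf8_octets_per_sign("維基文庫")
total = estimate_total_signs(AVG_OCTETS_PER_ARTICLE, ARTICLE_COUNT,
                             octets_per_sign)
print(f"{octets_per_sign:.1f} octets per sign")
print(f"about {total / 1e6:.2f}M signs in total")
```

Without rounding the intermediate 652.3, the total comes out just under 19 million, in line with the 18.9M figure above. Because markup and Latin characters also count toward the octet total, this is an upper-bound style estimate rather than a real word count.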
The only question remaining is: why does the statistics page present 29.3M as
the number of words for the Chinese Wikisource? Is that the number of
"groups of letters"?
Anyway, if we accept these figures, we would have: 1. English: 211M words -
2. French: 125M - 3. Spanish: 41.8M - 4. Russian: 22.2M - 5. Chinese:
18.9M - 6. Polish: 18.2M - 7. Portuguese: 15.5M - 8. German: 14.4M - 9.
Italian: 12.0M - 10. Arabic: 10.6M.