Any progress on this?

I'm asking because I am going to destroy two statistics in the next few weeks:

1) Destroying the "article" count: Portuguese Wikisource will be bumped from 7th to 2nd in article count when I finish importing the A to N entries of a public domain dictionary (see more about it at the end of this message).

2) Destroying the "all pages" count at http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics : I've found the 24 volumes of the "Décadas da Ásia" by João de Barros ( http://en.wikipedia.org/wiki/Jo%C3%A3o_de_Barros ) on the Internet Archive, and I will be uploading those DjVu files to Commons and the extracted text to Portuguese Wikisource in a few hours.

(I've talked with some users about mass uploads of OCR-ed texts and no one has objected so far; for them, this is like the bot creation of stub articles on the Wikipedias. I hope that you all agree with this view and that no one starts a "radical cleanup" proposal like the one on the Volapük Wikipedia in December 2007 ;-) . Please note that the ProofreadPage Statistics page has more stats and graphs further up the page than the "all pages" one.)

-----------------------------------
More about the dictionary and the import

The Portuguese speakers at Distributed Proofreaders @ Project Gutenberg are proofreading a dictionary and making it publicly available at http://dicionario-aberto.net as the work is done. The "source code" (the word database) is also shared at http://dicionario-aberto.net/sources.html . Since it is in the public domain both in the United States and in Portugal (the country of origin), I'm going to import it to the Portuguese Wikisource. I proposed this at http://pt.wikisource.org/wiki/Wikisource:Esplanada/Candido_de_Figueiredo_1913 days ago and no one has objected so far.

From A to N, that dictionary has more than 85k entries, and no one on Portuguese Wikisource opposes importing one entry per article. I also offered to import the dictionary to the Portuguese Wiktionary ( http://pt.wiktionary.org/wiki/Wikcion%C3%A1rio:Esplanada#Candido_de_Figueiredo_1913.2C_v.C3.A3o_querer.3F ), but only one user has expressed support and two have opposed it; one of the opponents suggested changing the "no articles were found" message on pt.wiktionary a bit, so that it points readers to search for the word in that dictionary on pt.wikisource.

The import will start in a few weeks. I'm only waiting for more opinions on pt.wikisource and pt.wiktionary, and for FlaggedRevs to be enabled on pt.wikisource (since bot-created articles are automatically flagged as approved and I don't plan to spend ten years flagging those pages by hand :-)

On Tue, Sep 16, 2008 at 9:43 PM, Syagrius <syagrius@caramail.com> wrote:

> From: "John Vandenberg" <jayvdb@gmail.com>
> To: "discussion list for Wikisource, the free library" <wikisource-l@lists.wikimedia.org>
> Subject: Re: [Wikisource-l] Changing the Wikisource main page

> Date: Sun, 14 Sep 2008 06:03:36 +1000

> A Chinese "word" has more meaning than a Spanish "word". I don't have
> the numbers, but the word "word" is not the same in all languages.
> This makes words a very complex statistic.
>
> --
> John
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l

I may have found a very simple solution: if we agree that one Chinese character is a word as we understand "word", then we have to find out how many characters there are. I made a test and found that a Chinese character takes 3 octets. The same statistics tell us that the average size of an article on the Chinese Wikisource is 1957 octets. So there are 1957/3 = 652.3 words per article. The statistics count (as of May 31, 2008) 29084 articles for the Chinese Wikisource, and 652.3*29084 gives 18.9M words in total.
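For anyone who wants to check the arithmetic, here is a minimal Python sketch of the estimate. It assumes the text is stored as UTF-8 (where most Chinese characters take 3 octets) and that every octet of an article belongs to such a character, ignoring wiki markup, ASCII and punctuation. The function and constant names are only illustrative; the two input figures are the ones quoted above.

```python
# Rough estimate only: assumes every octet of an article belongs to a
# 3-octet UTF-8 Chinese character (no wiki markup, ASCII or punctuation).

# Measure the size of one sample character instead of hard-coding it.
BYTES_PER_CHARACTER = len("中".encode("utf-8"))

AVG_BYTES_PER_ARTICLE = 1957  # average article size quoted above
ARTICLE_COUNT = 29084         # Chinese Wikisource articles, May 31, 2008


def estimate_words(avg_bytes, articles, bytes_per_char=BYTES_PER_CHARACTER):
    """Estimate total 'words', counting one character as one word."""
    return avg_bytes / bytes_per_char * articles


total = estimate_words(AVG_BYTES_PER_ARTICLE, ARTICLE_COUNT)
print(f"{BYTES_PER_CHARACTER} octets per character")
print(f"{total / 1e6:.2f}M words (estimated)")
```

Without the intermediate rounding to 652.3, the result comes out just under 19 million, which is consistent with the 18.9M given above.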

The only remaining question is: why does the statistics page give 29.3M as the number of words for the Chinese Wikisource? Is that the number of "groups of letters"?

Anyway, if we accept these figures, we would have: 1. English: 211M words - 2. French: 125M - 3. Spanish: 41.8M - 4. Russian: 22.2M - 5. Chinese: 18.9M - 6. Polish: 18.2M - 7. Portuguese: 15.5M - 8. German: 14.4M - 9. Italian: 12.0M - 10. Arabic: 10.6M.