> From: "John Vandenberg" <jayvdb(a)gmail.com>
> To: "discussion list for Wikisource, the free library" <wikisource-l(a)lists.wikimedia.org>
> Subject: Re: [Wikisource-l] Changing the Wikisource main page
> Date: Sun, 14 Sep 2008 06:03:36 +1000
> A Chinese "word" carries more meaning than a Spanish "word". I don't have
> the numbers, but the word "word" does not mean the same thing in all
> languages. This makes word counts a very complex statistic.
>
> --
> John
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
I may have found a very simple solution: if we agree that a Chinese character is a word as we understand "word", then we have to find out how many characters there are. I ran a test and found that a Chinese character takes 3 octets. The same statistics tell us that the average size of an article on the Chinese Wikisource is 1957 octets. So there are 1957/3 = 652.3 words per article. The statistics count (as of May 31, 2008) 29084 articles for the Chinese Wikisource, and 652.3*29084 gives 18.9M words in total.
The only remaining question is: why does the statistics page give 29.3M as the number of words for the Chinese Wikisource? Is that the number of "groups of letters"?
Anyway, if we accept these figures, we would have: 1. English: 211M words - 2. French: 125M - 3. Spanish: 41.8M - 4. Russian: 22.2M - 5. Chinese: 18.9M - 6. Polish: 18.2M - 7. Portuguese: 15.5M - 8. German: 14.4M - 9. Italian: 12.0M - 10. Arabic: 10.6M.
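The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post, under its own assumptions: every character is taken to be a CJK character stored as 3 octets in UTF-8, and one character is counted as one "word". The function name `estimate_words` is made up for illustration. Without intermediate rounding the product comes out near 18.97M, which the post gives as 18.9M.

```python
# Rough word-count estimate for the Chinese Wikisource, following the post.
# Assumptions: UTF-8 storage, 3 octets per character, one character = one "word".

# The "test" from the post: a common Chinese character encodes to 3 octets.
BYTES_PER_CHAR = len("中".encode("utf-8"))


def estimate_words(mean_article_bytes: float, article_count: int,
                   bytes_per_char: int = BYTES_PER_CHAR) -> float:
    """Estimate total words as (mean octets / octets per character) * articles."""
    return mean_article_bytes / bytes_per_char * article_count


mean_bytes = 1957   # average article size quoted from the statistics page
articles = 29084    # article count as of May 31, 2008

words_per_article = mean_bytes / BYTES_PER_CHAR
total_words = estimate_words(mean_bytes, articles)

print(f"{words_per_article:.1f} words per article")
print(f"{total_words / 1e6:.2f}M words in total")
```

Any 1-octet content in the articles (ASCII punctuation, wiki markup, digits) would change the octets-per-character ratio, so the result should be read as a rough order of magnitude only.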
---------- Forwarded message ----------
From: John at Darkstar <vacuum(a)jeb.no>
Date: Wed, Sep 24, 2008 at 10:27 PM
Subject: [Foundation-l] Old newspapers going to destruction
To: Wikimedia Foundation Mailing List <foundation-l(a)lists.wikimedia.org>
In Norway a university has a large collection of newspapers; the
collection is said to cover around 3000 running meters in the
storehouse. Without the Norwegian and Nordic newspapers, what's left is
international newspapers from the last 150 years. If no one comes up
with a solution, the collection is going to be destroyed (actually burned).
I think the best thing to do is to scan them and make them publicly
available. Of course neither I nor WM Norway can take on such a
task, but if there is some wealthy person out there who might be
able to get involved in such a task, I think it would be a very
worthy gift to mankind (where are the women!) to do such a thing.
When I heard of this I was shocked. Most of us are. I in fact studied
at the university that attempted to burn the newspapers. The plans
have been stalled for now, but a permanent solution has to be found.
John
_______________________________________________
foundation-l mailing list
foundation-l(a)lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
> From: "John Vandenberg" <jayvdb(a)gmail.com>
> To: "discussion list for Wikisource, the free library" <wikisource-l(a)lists.wikimedia.org>
> Subject: Re: [Wikisource-l] Changing the Wikisource main page
> Date: Thu, 11 Sep 2008 09:40:56 +1000
> On Thu, Sep 11, 2008 at 4:04 AM, Syagrius <syagrius(a)caramail.com> wrote:
> > Hello,
> >
> >
> > As Wikipedia decided to change its main page presentation, I think that
> > Wikisource should perhaps do the same. The "war" between the Spanish and
> > the Chinese Wikisources demonstrates that the article count does not
> > reflect the true depth of a Wikisource, as someone can create thousands
> > of very small articles. What should the main page present? I don't think
> > that choosing the number of visitors, as Wikipedia did, would be a good
> > idea, since Wikisource is much less well known than Wikipedia, and the
> > figures may not be reliable. If these stats from Erik Zachte can be
> > trusted (and updated), I would suggest that the number of words
> > http://stats.wikimedia.org/wikisource/EN/TablesDatabaseWords.htm may be
> > the fairest figure to present. The only problem would be: is the number
> > of words given for the Chinese (and also Japanese) Wikisource correct?
>
> It is an interesting statistic, and I haven't investigated the
> algorithm being used. Is there a mathematical description of the
> algorithm used?
>
> My first guess is that we would need to weight it according to the
> entropy of each language. For example, Chinese and Japanese have a
> much higher entropy, so they need to be weighted higher.
>
> > What do you think of this proposition ?
>
> If we are going to change the front page of the portal, it is these
> stats that I would like to see used and improved:
>
> http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
>
> We need to present ourselves as a _serious_ project, doing top quality
> work.
>
> I suggest that we also feature two texts on the main portal each month:
> - one work that is hosted on wikisource.org - i.e. from a language
> which is _not_ on a subdomain
> - one work from a subdomain, from a different sub-domain each month,
> _after_ it has been selected as a featured text on the subdomain.
>
> --
> John Vandenberg
>
For more information about the statistics, I think you could contact Erik Zachte himself. I still think that the number of words is the best solution, because a Wikisource could easily create a lot of "pages" without any proofreading at all. I don't really understand what you mean by "entropy of each language", but the number of words for the Chinese Wikisource seems OK: 29M, against 42M for the Spanish Wikisource...
Xavier
(Enmerkar on French wikisource)
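One possible reading of the "entropy of each language" remark in the quoted message is Shannon entropy measured per character. The thread does not confirm that this is what was meant. Under that reading, a text's size would be weighted by how many bits each character carries on average. The sketch below uses plain character frequencies, which is a crude stand-in for a real language model, and the names `char_entropy` and `weighted_size` are hypothetical, not anything used by the statistics pages.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from character frequencies."""
    counts = Counter(text)
    total = len(text)
    # Sum of -p * log2(p) over every distinct character in the text.
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def weighted_size(text: str) -> float:
    """Hypothetical 'information size': characters times bits per character."""
    return len(text) * char_entropy(text)


# Two toy strings of equal length: the second uses more distinct symbols,
# so each of its characters carries more bits under this measure.
print(char_entropy("aabb"))   # two equally likely symbols
print(char_entropy("abcd"))   # four equally likely symbols
print(weighted_size("abcd"))
```

A script with thousands of distinct characters would tend to score more bits per character than an alphabet of a few dozen letters, which would match the suggestion that Chinese and Japanese be weighted higher.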
---------- Forwarded message ----------
From: Rebecca Hargrave Malamud <webchick(a)invisible.net>
Date: Thu, Sep 11, 2008 at 10:16 AM
Subject: [ol-discuss] Announcement: Internet Archive "Using Digital
Collections" Conference, October 27-28, 2008
To: ol-discuss(a)archive.org
On October 27-28 the Internet Archive will host its annual conference
in San Francisco with the theme "Using Digital Collections". The
meeting is being expanded from its previous one-day format in response
to participant feedback from the 2007 meeting.
This year we will build on prior themes of creating and accessing
digital content by spotlighting how digital collections are being used
to promote scholarship and to bridge institutional and geographic
boundaries. The agenda will also include technical updates on
scanning projects and a meeting of the Open Content Alliance.
A detailed agenda will be distributed in September; in the meantime,
please block out the dates and plan to join us in San Francisco!
Meeting Specifics:
Monday, October 27 - Tuesday, October 28
Golden Gate Club, Presidio of San Francisco
If you have questions or have suggestions for others who should be
invited, please contact Casey Nelson at casey(a)archive.org or Linda
Frueh at linda(a)archive.org.
RSVP by October 1 to casey(a)archive.org
_______________________________________________
Ol-discuss mailing list
Ol-discuss(a)archive.org
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss