Hi Lars,
I should have said: trying to make a qualitative comparison using quantitative measures is quite difficult given the nature of Wikipedias.
To illustrate using the "article count" used on www.wikipedia.org, I had a looksie at two Wikis, Quechua (2,166 articles), and Friulian (2,041 articles). Using this quantitative comparison, they should be about equal in comprehensiveness.
However, after hitting "Random" five times on each, I got the following pages:
Quechua:
http://qu.wikipedia.org/wiki/Chhukruna (stub)
http://qu.wikipedia.org/wiki/Minsk (stub)
http://qu.wikipedia.org/wiki/Yatawaki (stubby, most of the article is just a bullet-point list)
http://qu.wikipedia.org/wiki/Kiru_ismu (stub)
http://qu.wikipedia.org/wiki/T%27aklla (stub)
Friulian:
http://fur.wikipedia.org/wiki/1622 (stub)
http://fur.wikipedia.org/wiki/Timp_coorden%C3%A2t_univers%C3%A2l (short, maybe a Start class on en:)
http://fur.wikipedia.org/wiki/Lauc (full article, longer by word count than the version on en:, about the same size as the one on it: if you remove the large bar graph from the it: version)
http://fur.wikipedia.org/wiki/Toponims_Talian_Furlan_D (list)
http://fur.wikipedia.org/wiki/Islam (full article)
Anecdotal maybe, but clearly once you eyeball it, the Friulian Wikipedia has the edge in terms of comprehensiveness and quality over Quechua. Trouble is, I don't see how this can be algorithmically determined using an automated process. You're going to need humans to look at these things and make the determination, and really, who has time to do that for 200+ wikipedias? Not to mention the accusations of bias, poor article selection, and other such things that will be made.
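To make the eyeball comparison concrete, the ten labels above can be tallied as follows. This is only a sketch of the arithmetic: the labels are copied from the two lists, counting "stubby" as a stub is an assumption, and a sample of five pages per wiki is far too small to support a real conclusion.

```python
from collections import Counter

# Labels copied from the five random pages listed for each wiki.
samples = {
    "Quechua": ["stub", "stub", "stubby", "stub", "stub"],
    "Friulian": ["stub", "short", "full", "list", "full"],
}

# Assumption: a "stubby" page is counted as a stub.
STUB_LABELS = {"stub", "stubby"}


def stub_share(labels):
    """Return the fraction of sampled pages labelled as stubs."""
    counts = Counter(labels)
    stubs = sum(counts[label] for label in STUB_LABELS)
    return stubs / len(labels)


shares = {wiki: stub_share(labels) for wiki, labels in samples.items()}
for wiki, share in shares.items():
    print(f"{wiki}: {share:.0%} stubs in sample")
```

The labelling step is still done by a human here; the code only counts what was already judged by hand.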
Using word counts is also going to be a problem: languages like Norfuk that have lots of small particle words and the like will show inflated word counts compared to languages like Mandarin, which don't have written words in the Western sense; languages like Gaeilge or Cymraeg, where the very notion of "word" is a pretty nebulous one; or languages like Kalaallisut, which compress many meanings and affixes into each and every word.
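A minimal sketch of why a naive word count breaks down across scripts. The function name and the two sample sentences are illustrative assumptions, not taken from any wiki; the counter simply splits on whitespace, which is what a quick automated tool would likely do.

```python
def whitespace_word_count(text: str) -> int:
    """Count 'words' the naive way: split on runs of whitespace."""
    return len(text.split())


# Illustrative samples (assumed for this sketch); both mean
# roughly "I like reading books".
samples = {
    "English": "I like reading books",
    "Mandarin": "我喜欢读书",
}

counts = {name: whitespace_word_count(text) for name, text in samples.items()}
for name, count in counts.items():
    print(f"{name}: {count} whitespace-delimited word(s)")
```

The same content scores four words in one language and one in the other, so any ranking built on this count measures the writing system as much as the article.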
Cheers, Craig Franklin
Date: Thu, 3 May 2007 11:30:53 +0200 (CEST)
From: Lars Aronsson lars@aronsson.se
Subject: Re: [Wikipedia-l] Quality vs Quantity
To: wikipedia-l@lists.wikimedia.org
Message-ID: Pine.LNX.4.64.0705031128240.15940@localhost.localdomain
Content-Type: TEXT/PLAIN; charset=US-ASCII
Craig Franklin wrote:
You show me a way of quantitatively comparing two Wikis using an automated process, and I'll show you a language or Wiki that will break it.
Today the front page www.wikipedia.org measures and compares the number of articles. You can begin to "break" that method. And then you can figure out some method that might perhaps be slightly better. I've suggested two already.
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
I am not sure how easy it would be to determine this, but your examples suggest one possible measure: the number of unique articles started on xx-WP that are subsequently translated on other WPs. This highlights the particular virtues of a world-wide collaboration.
DGG David Goodman
Wikipedia-l mailing list Wikipedia-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikipedia-l
Yes, Lars suggested using the size after compression, which would solve most if not all of the problems you noted.
Mark
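A rough sketch of the "size after compression" idea, assuming it means compressing the article text with a general-purpose compressor and comparing the resulting byte counts. The use of zlib, the function name, and the padded sample text are assumptions made for illustration; the thread does not say which compressor or which text dump was intended.

```python
import zlib


def compressed_size(text: str, level: int = 9) -> int:
    """Return the size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), level))


# Assumed sample: a page padded with the same word over and over.
padded = "stub " * 200
raw_size = len(padded.encode("utf-8"))
packed_size = compressed_size(padded)

print(f"raw: {raw_size} bytes, compressed: {packed_size} bytes")
```

Repetitive or boilerplate text shrinks to almost nothing, so it contributes little to the compressed total. Whether this fully evens out differences between languages and scripts is not something this sketch demonstrates.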