Maybe this is not the most popular topic, but I do want to comment on the news about the Japanese and Polish Wikipedias and their 500,000 articles each. In fact, ja.WP really has 500,000, but pl.WP does not. In an attempt to compare Wikipedia language editions, I clicked the "random article" button, and from a sample of 50 clicks each I estimated how many articles a language edition really has, minus all those pseudo articles.
A pseudo article is, e.g.:
http://pdc.wikipedia.org/wiki/Bikini
http://co.wikipedia.org/wiki/191
http://ksh.wikipedia.org/wiki/Varsseveld
http://pl.wikipedia.org/wiki/Tandil
http://vo.wikipedia.org/wiki/Poplar_Bluff
Many Wikipedias lose, in my calculation, quite a large percentage of their articles. There is one honourable exception: the Japanese Wikipedia, which in 50 clicks showed absolutely no pseudo article. If the Japanese Wikipedia had as sloppy a policy on new articles as many others have, ja.WP would already be close to one million "articles". Pl.WP has about 300,000 real articles, very respectable, but not what it seems to be.
Since the beginning, Wikipedians have reported the number of articles, needing something to tell the media and to be proud of their achievements. They rank Wikipedia language editions by the number of articles. This has caused tragic dynamics: many Wikipedians and Wikipedias are so obsessed with this number that they produce rubbish articles to show off. The Volapük Wikipedia, with more than 100,000 pseudo articles created by a single bot-using user, is only the tip of the iceberg, and when someone called for closing vo.WP, it was supported by an amazing number of users from many language editions: così fan tutte. Wikipedians could and should use their time for more useful article work.
It would be good if the community found a different way to compare or to measure its successes.
Ziko
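(A minimal sketch of this sampling procedure, for anyone who wants to repeat it: it assumes the standard MediaWiki API's list=random query, and is_real() stands in for the human judgment of what counts as a pseudo article; every name here is illustrative, not Ziko's actual tooling.)

    import requests

    def random_titles(lang, n=50):
        # Fetch n random main-namespace titles (rnlimit may be capped per
        # request on some wikis; loop if necessary).
        r = requests.get("https://%s.wikipedia.org/w/api.php" % lang, params={
            "action": "query", "list": "random",
            "rnnamespace": 0, "rnlimit": n, "format": "json",
        })
        return [p["title"] for p in r.json()["query"]["random"]]

    def estimate_real(lang, official_count, is_real, n=50):
        # Official article count, scaled by the sampled share of real articles.
        sample = random_titles(lang, n)
        return official_count * sum(1 for t in sample if is_real(t)) / n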
Ok, I understand numbers 2, 4 and 5 in your list. Number 1 is presumably included for being extremely stubby, but what's the issue with the ksh: page? The only thing I notice is that the text part hasn't got any internal links. But to consider something like that a "non-article", like the co: and pl: examples, seems harsh in the extreme.
http://ksh.wikipedia.org/wiki/Varsseveld - it's not Ripuarian (ksh) but Nedersaksisch; the text is taken directly from nds-nl. Ziko
The depth criterion available here: http://meta.wikimedia.org/wiki/List_of_wikipedias is a good starting point. I quote: "The "Depth" column ((Edits/Articles) × (Non-Articles/Articles) × (Stub-ratio)) is a rough indicator of a Wikipedia's quality, showing how frequently its articles are updated."
Note that indeed Volapük, Polish, Ripuarian and others have very low depth rankings.
Harel
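(The quoted formula as a plain function; the numbers in the example are made up for illustration, not current statistics.)

    def depth(edits, articles, non_articles, stub_ratio):
        # (Edits/Articles) x (Non-Articles/Articles) x (Stub-ratio)
        return (edits / articles) * (non_articles / articles) * stub_ratio

    # e.g. 5M edits, 500k articles, 1.5M non-article pages, stub ratio 0.4:
    print(depth(5_000_000, 500_000, 1_500_000, 0.4))  # -> 12.0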
Alas, judging a language edition by Wikimedia Statistics does not work.
The Indonesian, Asturian and Volapük WPs have the same "depth" (8), but id.WP is a very good WP. How come? There are not so many edits per article on id.WP because it has translated a lot from English. That is a legitimate way to create (good) articles, but it does not need a lot of edits.
Bot activity: indeed, "bot bumps" can often easily be detected in the stats tables. Especially the small Wikipedias (I suppose) show relatively much bot activity due to interwiki linking. On the other hand, pseudo articles can also be created by hand (let a script create them outside WP and then insert them "manually").
Ziko
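(One way to make "bot bump" detection concrete: flag months where the article count jumps far beyond its typical growth. Toy data and threshold below, purely illustrative.)

    def bot_bumps(monthly_counts, factor=3.0):
        # Flag month-over-month growth that dwarfs the median growth.
        growth = [b - a for a, b in zip(monthly_counts, monthly_counts[1:])]
        typical = sorted(growth)[len(growth) // 2]  # median growth
        return [i + 1 for i, g in enumerate(growth)
                if typical > 0 and g > factor * typical]

    print(bot_bumps([100, 120, 140, 165, 400, 420]))  # -> [4], the 165->400 jump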
Hi Ziko. The standard article on pl.wiki is 1.5 kB (in 2008):
http://tools.wikimedia.pl/~warx/dnb/index.xml
500k means nothing to us in itself, but it is a good event for PR, and only that. If we have good PR, we get more new users. I can't see a "count obsession" on pl.wiki. I see other obsessions: copyvio, POV, vandals, trolls, lack of sources, etc. Users on pl.wiki know how others look at them. Believe me, we try to be better, but our community is not huge.
Przykuta
Yes, it's good to remind folks that "article count" is not a good metric as it fails to take into account the cultural norms within the language communities.
For a really startling view of what you are observing, look at the wikistats: ja: (orange) has never had a "bot bump", unlike pl:, where all those jagged jumps (yellow) are bot additions, meaning those articles very likely have never been edited by humans.
http://stats.wikimedia.org/EN/PlotsPngArticlesTotal.htm#p2
-Andrew (User:Fuzheado)
A year ago, some admins on de.wp (including me) tried and failed to replace the "article count" on the de.wp front page with a rather vague statement of "hundreds of thousands of articles", along with the counter of articles with "featured article" status. Not that it would be impossible to game that number as well, but at least I felt it was something worth focusing on and displaying to our visitors.
Mathias
Well... bear in mind that the English Wikipedia also contains quite a lot of bot-created articles, and in fact the English Wikipedia was the first to produce them. The others just followed the idea and started doing it in order to artificially increase their article counts. Polish started doing it when our rank went down due to the mass production of bot-created articles in the Swedish, Italian, French and other Wikipedias.
Compare:
http://pl.wikipedia.org/wiki/Aignerville
and
http://en.wikipedia.org/wiki/Aignerville
or
http://pl.wikipedia.org/wiki/Is%C3%B2vol
and
http://it.wikipedia.org/wiki/Is%C3%B2vol
http://nl.wikipedia.org/wiki/Eksj%C3%B6_(stad)
and
http://pl.wikipedia.org/wiki/Eksj%C3%B6
http://pl.wikipedia.org/wiki/Dystrykt_Set%C3%BAbal
and
http://nn.wikipedia.org/wiki/Set%C3%BAbal
etc...
There is nothing really special about the Polish Wikipedia - many others do exactly the same, including English. We simply had more active coders who knew how to feed the bots. But, as you can see by comparing with other Wikipedias, they sometimes did a really good job, in the sense that many bot-created stubs in the Polish Wikipedia contain more data than their equivalents in, for example, the Swedish or French Wikipedia:
http://fr.wikipedia.org/wiki/Gr%C3%B3dek
http://fr.wikipedia.org/wiki/Drzewica
http://fr.wikipedia.org/wiki/Pszczyna
http://fr.wikipedia.org/wiki/Jas%C5%82o
etc...
Among the big Wikipedias, pl.WP has one of the lowest shares of real articles:
     Articles (official)   Real articles (est.)   Ratio
EN   1,400,000             1,344,000              0.96
DE     696,000               668,160              0.96
FR     613,000               514,920              0.84
JA     466,000               466,000              1.00
IT     408,000               301,920              0.74
PL     467,000               298,880              0.64
ES     326,000               293,400              0.90
NL     404,000               274,720              0.68
SV     272,000               217,600              0.80
PT     338,000               209,560              0.62
RU     233,000               195,720              0.84
ZH     164,000               144,320              0.88
(Most numbers from Jan. 2008; en, de and pt older. The estimates should really be rounded.)
Only 64% real articles in pl.WP, while the much-criticized sv.WP has 80%. But this is not about blaming some Wikipedians; it is about finding out how to compare WPs in a more effective way. The average size (bytes per article) does not work either. Take the article "Berlin" in Upper Sorbian (hsb). It has 3740 bytes. Sounds good, but only 454 bytes (six short sentences) are the actual text; 1823 bytes alone are for the interwikis. This is not manipulation, but you see the difficulties when reading Wikimedia statistics. Even a "geographical stub" with infoboxes, categories and interwikis produces a lot of bytes. It takes a human to evaluate. Ziko
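(A rough way to mechanize that byte arithmetic: split the wikitext into interwiki bytes and the rest. A sketch only; the pattern assumes interwikis are written as [[xx:Title]] with a two- or three-letter prefix, so it can also catch ordinary short namespace links.)

    import re

    IW = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]*\]\]")

    def interwiki_share(wikitext):
        total = len(wikitext.encode("utf-8"))
        iw = sum(len(m.group(0).encode("utf-8")) for m in IW.finditer(wikitext))
        return iw, total, iw / total if total else 0.0

    # For the hsb "Berlin" numbers above, 1823 of 3740 bytes
    # (about 49%) would be interwikis alone.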
Can you explain how this evaluation was done? How do you distinguish between "real" and other articles? In particular, I don't believe the statistics shown for the English Wikipedia. I have a feeling that there are many more bot-created articles in the English Wikipedia than your statistics show.
About a year ago I wanted to evaluate the number of bot-created articles in the Polish Wikipedia, and then evaluate how many of them had been expanded by humans. Unfortunately it was impossible to perform, as the bot owners do not keep records of their activity. Anyway, we checked randomly what happened with the bot-created articles about Polish villages and small towns, which were the very first bot production in our Wikipedia. As I had been strongly opposed, several years earlier, to producing bot-created articles but failed to persuade my fellow Wikipedians, I just wanted to prove that it had indeed been a bad idea. However, the study showed that around 70% of them were effectively expanded by humans. Villagers added quite a lot of useful stuff to these articles, like histories of their villages, pictures of interesting buildings, etc. Can you explain whether these articles are treated as "real" or "not real" in your statistics, and why?
That is described in his first mail: he did "random article" 50 times and used that as a sample.
Well, it is not described - I mean, there are no clear criteria of evaluation mentioned. Does he speak Japanese or Polish? Is it possible to recognize "real" and "unreal" articles without understanding them?
Compare:
http://he.wikipedia.org/wiki/%D7%9C%D7%95%D7%93%D7%96%27
Is it "real" or "unreal" article and why? I have a feeling that it is bot created, but I am no sure about it, as I don't speak Hebrew :-)
And what about this:
http://uk.wikipedia.org/wiki/%D0%A4%D1%96%D0%B3%D1%83%D0%BB%D1%81_%D1%96_%D0...
It is quite long, but I am almost sure that it is bot-created and untouched by any human, because it contains only statistical data and sentences that look machine-generated. I don't speak Ukrainian well, but I understand it a little. Still, these are just my feelings...
It is funny that this article is longer than the similar one on es-Wikipedia, although the Spanish one was edited by humans for sure :-)
http://es.wikipedia.org/wiki/F%C3%ADgols_y_Ali%C3%B1%C3%A1
and moreover - if you check all the Wikipedias that contain an article about Fígols i Alinyà, only the Spanish one looks as if it was edited by a human (but that is just my feeling, I may be wrong).
And this:
http://ta.wikipedia.org/wiki/%E0%AE%B5%E0%AE%BE%E0%AE%B0%E0%AF%8D%E0%AE%9A%E...
Real or not real? I really don't know; probably bot-created :-)
I think that if we want to perform a serious evaluation of "real" and "unreal" articles, it should be based on clear criteria rather than "feelings", done on larger samples (at least 500 articles) and by people who understand what they are reading.
There is Google Translate, and the interwikis help as well. That he.WP article about Łódź I would count as a real article, because there is more information than in a database (links to Holocaust-related articles, something about the 19th century, the economy (textiles)). Indeed, I would like to make a more scientific scheme and apply it to a larger sample; maybe a research group will be established for that. I believe that my method does give a reasonable picture; of course, whether my results say "50,000" real articles or "52,000" is not really a measurable difference. Ziko
PS: By the way, it is fun to browse a foreign-language Wikipedia with the help of Google Translate - not perfect, but it is interesting to see what others write about.
Sorry, but that only shows that your results are not reliable, because they are based on your feelings and on poor-quality machine translations, which can change your feelings in unpredictable ways. I am afraid that the results shown in your table are just a reflection of: a) the quality of the machine translation performed by Google - it is better for Latin- and Germanic-based languages (English, French, Italian, German, Dutch, etc.) and much worse for Slavic, Arabic and East Asian languages; b) your own subconscious attitude toward various nations and Wikipedias - even if you are trying to evaluate them all fairly.
Google Translate sometimes produces really funny results when translating from Polish to English. For example:
"Przyszłość partii przyszłością narodu" ("The future of the party is the future of the nation") is translated as:
"The future of the future of the nation lot" :-)
Or:
"Byłbym spał, gdybym mógł." ("I would sleep, if only I could")
is translated as:
"I would be he lay, if I only could."
http://en.wikipedia.org/wiki/Machine_translation_software_usability#Trustwor...
http://www.nist.gov/speech/tests/mt/2006/doc/mt06eval_official_results.html
I think that a method to distinguish between "real" and "unreal" articles should be based on analysis of the article's history and formal "hard" criteria.
For example, one could set the criterion that if there are at least 4 sentences written by a human, it is a "real" article.
2008/6/28 Ziko van Dijk zvandijk@googlemail.com:
I have discussed my study with many people (one had similar results), but no one was so aggressive, Tomasz.
b) your own subconscious attitude toward various nations and Wikipedias
Is this an accusation?
No, I am just a scientist, so I have a tendency to be sceptical and basic knowledge of the typical mistakes in statistical research: too small a sample, no clear evaluation criteria, and you did not test the experimental error or the reproducibility of your method by comparing results from several runs, asking other people to apply your notion of what a "real" article is.
A sample of 50 articles tested by one person, who certainly has his own attitudes, is not enough to say that this or that Wikipedia is better or worse. Everyone has their own attitudes towards one nation or another; that is a very natural thing. And if there is no clear definition of what a "real" article is and what is not, and Google machine translation was used for the evaluation (which, according to the NIST survey from 2006, was found to be OK in only around 49% of cases), then I am quite sure that your results cannot be taken seriously. You could have a statistical error of at least around 15-20% (if not more), so results of 0.60 or 0.80 are within the experimental error range.
Anyway, it would be interesting to design better-planned experiments to evaluate the quality of Wikipedia articles, but they would certainly have to be done on a larger sample, with some sort of "hard" criteria, or by a group of at least 10 researchers speaking different languages and having different cultural backgrounds if "soft", human-based criteria are used.
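(Tomasz's error range can be checked against the textbook formula for a sampled proportion; at n = 50 the 95% margin is indeed of the order he names.)

    from math import sqrt

    def margin95(p, n):
        # 95% margin of error for a proportion, normal approximation
        return 1.96 * sqrt(p * (1 - p) / n)

    # 50 random articles, observed "real" share 0.64:
    print(margin95(0.64, 50))  # ~0.13, i.e. about +/- 13 percentage points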
First of all, that is for Arabic and Chinese, which probably have the worst Google Translate quality.
Second of all, Google consistently fared better than almost every other system, a surprising feat for a very recently developed system.
Even if machine translation isn't completely accurate, it's often enough to get an idea of the content of the page, and I know I have learned about several topics through reading translated articles from pl.wp.
Mark
2008/6/29 Mark Williamson node.ue@gmail.com:
Second of all, Google consistently fared better than almost every other system, a surprising feat for a very recently developed system.
Not that recently - it's based on SYSTRAN.
- d.
No, it's not.
Mark
Tomasz Ganicz wrote:
And if there is no clear definition of what a "real" article is and what is not,
Apparently it was the 500k article event that caused Ziko to bring the topic up this time. He's frustrated (and so am I) that 500K articles is reported as an achievement, when it is indeed doubtful what quality these articles have. Still, I think he exaggerates the problem.
Earlier this year, when the topic came up on meta, it was because of which languages were featured as the top 10 on www.wikipedia.org, http://meta.wikimedia.org/wiki/Top_Ten_Wikipedias
Since then, the Russian Wikipedia has gained the 10th position and Swedish ("the one with all the stubs") is down to 11th, so there is one problem less to care about. During that discussion, I proposed to use the size of the compressed database dump (pages-articles.xml.bz2) as the official metric, since it both counts the total database size (one long article counts the same as two short ones) and it completely removes the impact of bot-generated articles. The compressed size of the Volapük Wikipedia is very small, because the same patterns appear in many of its numerous articles.
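(The compression effect is easy to see in miniature: repetitive, bot-like text shrinks dramatically under bz2, while diverse content barely shrinks at all. The random bytes below are only a stand-in for varied prose.)

    import bz2, os

    bot_like = b"Foo is a commune in the Bar department of France.\n" * 1000
    varied = os.urandom(len(bot_like))  # stand-in for diverse prose

    print(len(bot_like), len(bz2.compress(bot_like)))  # 50000 -> a few hundred
    print(len(varied), len(bz2.compress(varied)))      # 50000 -> ~50000 or more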
On the talk page, there is a table where this is shown, and you can sort by column by clicking the little boxes, http://meta.wikimedia.org/wiki/Talk:Top_Ten_Wikipedias#What_problem_do_we_wa...
I'd like to propose a quality metric: The difference in rank between the article count and the compressed database size.
The English Wikipedia is the biggest (rank 1), whether you count articles or compressed database size. So its quality is 0.
The Polish Wikipedia was the 4th by article count, but the 7th by compressed database size, for a quality of 4 - 7 = -3.
The Swedish Wikipedia was (when this table was compiled) the 10th biggest by article count, but the 12th biggest by compressed database size, so its quality is 10 - 12 = -2.
The Russian Wikipedia was the 11th by article count, but 9th by compressed database size, so its quality is +2. This doesn't mean the Russian Wikipedia is better than the English one, only that it is better than (two of) its peers of similar size.
The Volapük Wikipedia was the 15th by article count, but worse than 30th by compressed database size (the table is incomplete), so its quality is worse than -15.
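(Lars's proposed metric in a few lines; the figures are toy numbers chosen only to show the sign convention, count rank minus size rank, not real statistics.)

    def quality(article_counts, dump_sizes):
        by_count = sorted(article_counts, key=article_counts.get, reverse=True)
        by_size = sorted(dump_sizes, key=dump_sizes.get, reverse=True)
        return {lang: (by_count.index(lang) + 1) - (by_size.index(lang) + 1)
                for lang in article_counts}

    counts = {"en": 1_400_000, "pl": 467_000, "ru": 233_000}  # toy numbers
    sizes = {"en": 2_000, "pl": 300, "ru": 450}               # compressed MB, toy
    print(quality(counts, sizes))  # {'en': 0, 'pl': -1, 'ru': 1}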
On Sun, Jun 29, 2008 at 10:03 AM, Lars Aronsson lars@aronsson.se wrote:
I'd like to propose a quality metric: The difference in rank between the article count and the compressed database size.
I think this is a good metric, especially because it's a relative metric (since it's effectively comparing projects against their peers to see how mature they are).
Someone earlier was discussing article sizes, so I hacked up a script to graph the distribution of article sizes:
http://www.toolserver.org/~thebainer/articlesizes/
Most graphs share the same basic shape, with a roughly logarithmic distribution once you get past the initial peak (see the English Wikipedia graph for an example of what I mean), but some are different, and it tends to coincide with what has already been observed.
The Swedish Wikipedia was (when this table was compiled) the 10th biggest by article count, but the 12th biggest by compressed database size, so its quality is 10 - 12 = -2.
Swedish Wikipedia is distributed in almost exactly the same way as English Wikipedia, with the difference being that its average size is less than half that of En's, at around 1900 bytes.
The Russian Wikipedia was the 11th by article count, but 9th by compressed database size, so its quality is +2. This doesn't mean the Russian Wikipedia is better than the English one, only that it is better than (two of) its peers of similar size.
Not only does the Russian Wikipedia have a high average article size (about 5500 bytes, compared with, for example, English Wikipedia at around 4100 bytes) but its graph, which has multiple peaks, seems to show that, unlike many other projects, it has more mature, medium-size articles than it does stubs.
The Volapük Wikipedia was the 15th by article count, but the worse than the 30th by compressed database size (the table is incomplete), so its quality is worse than -15.
The Volapük Wikipedia has an unusual distribution, with two peaks. One is in the usual place, just below the average size (which is low, at just over 1000 bytes) while the other is around 2 - 2.5kb, which corresponds to the size of all the geography stubs created by SmeiraBot.
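(A sketch of the kind of plot described here, given a list of article sizes in bytes from a dump; how the sizes are extracted is left out.)

    import matplotlib.pyplot as plt

    def plot_sizes(sizes, label):
        # step histogram up to 20 kB; bot-heavy wikis show an extra peak
        plt.hist(sizes, bins=200, range=(0, 20_000), histtype="step", label=label)
        plt.xlabel("article size (bytes)")
        plt.ylabel("number of articles")
        plt.legend()
        plt.show()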
2008/6/29 Stephen Bain stephen.bain@gmail.com:
Someone earlier was discussing article sizes, so I hacked up a script to graph the distribution of article sizes:
Yes, but the average article size can easily be increased artificially by bot creation of infoboxes, navigation templates, long lists of categories and interwikis, etc.
Take a look, for example, at:
http://pl.wikipedia.org/wiki/Telmisartan
The infobox is around 90% of its content. If you blame the Polish Wikipedia for allowing such articles, I would agree with you :-) The article was created by a human, not by a bot.
A higher average article size in any Wikipedia can easily be achieved just by creating bot-only articles with plenty of statistical data, a huge infobox and several navigation templates. See the example I have already shown:
http://uk.wikipedia.org/wiki/%D0%A4%D1%96%D0%B3%D1%83%D0%BB%D1%81_%D1%96_%D0...
It would be nice to have a tool comparing the "real" size of articles, by which I mean counting the size of the free text only - without all the templates and other non-text stuff.
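(A first cut at such a tool might look like this; it is only a sketch: nested templates are handled, category and interwiki links are dropped, but tables, references and images are ignored, and the category namespace name varies by language, e.g. Kategoria on pl.)

    import re

    def strip_templates(text):
        # drop {{...}} blocks, including nested ones, by brace matching
        out, depth, i = [], 0, 0
        while i < len(text):
            if text.startswith("{{", i):
                depth += 1; i += 2
            elif depth and text.startswith("}}", i):
                depth -= 1; i += 2
            else:
                if depth == 0:
                    out.append(text[i])
                i += 1
        return "".join(out)

    LINKS = re.compile(r"\[\[(?:[Cc]ategory|[a-z]{2,3}(?:-[a-z]+)?):[^\]]*\]\]")

    def prose_size(wikitext):
        # bytes of running text after templates, categories and interwikis
        text = LINKS.sub("", strip_templates(wikitext))
        return len(text.strip().encode("utf-8"))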
Who's to say some bot-created articles can't be useful?
Some may not be very useful, but others are very helpful.
Mark
Stephen Bain wrote:
Swedish Wikipedia is distributed in almost exactly the same way as English Wikipedia, with the difference being that its average size is less than half that of En's, at around 1900 bytes.
In Sweden we have this problem that people in our capital are so arrogant towards the rest of the country. This is of course nothing special about Sweden, it happens in every country. Americans just cannot stand New Yorkers (the big city) and policy makers in Washington DC. People in Russia probably hate people in Moscow. Europe is united in its disliking of the EU bureaucracy headquarters in Brussels. Talking to other Swedes of my age, it is easier to discuss individual streets in San Francisco than streets in a neighboring Swedish city where none of us have been. We know that the U.N. headquarters is on the east side of Manhattan and that Wall Street is at a walking distance from Battery Park. We all tend to look upwards. Swedes look to Stockholm, Paris, London and New York, but more seldom to Latvia or Bangladesh or Malawi.
All Swedish wikipedians know about the Swedish and the English Wikipedia. Of course the English one is much larger. The fact that Ingmar Bergman's English article is three times longer (44K) than the Swedish one (16K) is taken for granted and not as an urgent crisis that needs to be addressed. The Swedish Wikipedia has articles about small Swedish places and people that don't have articles in the English Wikipedia, and so their sizes cannot be compared. It is also well known that the Swedish Wikipedia is slightly larger than our immediate neighboring languages Finnish, Norwegian and Danish.
The idea of comparing the Swedish Wikipedia to the Czech one, which has the same number of speakers (10M), just doesn't occur to anyone, because nobody in Sweden speaks Czech and most would believe you need to know the language in order to understand anything. I guess it is mutual: the Czechs speak their own language plus English and German and a bit of Polish, but nobody would care about Swedish.
It was only recently that the stubbiness of the Swedish Wikipedia, relative to other languages of Wikipedia, started to come into public awareness. Most regular wikipedians are now aware of this. But when we start to talk about improving quality, average Swedes who are occasional contributors still don't understand what this is about. They think we want to compete with traditional printed encyclopedias, and that we are losing our soul as a Net project. Why should any article be deleted, when storage is so cheap? Why require source citations and complete sentences and birth dates? Surely someone else is going to add that later.
It's easy to measure the average size of articles, but much harder to change it. Even if you add 50 bytes to each article, that would only move the average by 50 bytes and not from 2K to 4K. Just by translating from English, you can add 30K to the Swedish article on Ingmar Bergman, but that doesn't move the average size.
What I did was instead to look at the very smallest articles. There's a page [[Special:Shortpages]] for this, but unfortunately it contains both short articles and disambiguation pages. From the database dumps, I could filter out the disambiguations which are 10K on the Swedish Wikipedia, leaving 270K real articles.
I found that 0.1 percent (270 articles) were shorter than 90 bytes. This was better than the Arabic Wikipedia where 0.1 percent are shorter than 62 bytes, Latvian with 78 bytes and Estonian with 81 bytes. But it is far worse than Danish 130 bytes, Polish 160, Czech 213, German 280 and Russian 359 bytes. This was in April. By merging and improving the very shortest articles, the Swedish Wikipedia's 0.1 percent shortest articles now reach 145 bytes.
In the next step, I found 1.0 percent of articles (2700 articles) were shorter than 126 bytes, which was far worse than any other language I looked at. But during April and May this has improved to 171 bytes, which is better than Arabic and Estonian. During all of April, the Swedish Wikipedia didn't grow at all, because stubs were being merged and removed faster than new articles were written. This is when the Russian Wikipedia on May 19 got the 10th position at 283K articles.
By addressing the very shortest articles, we remove the easiest excuse for people to create new very short stubs. They can no longer point to other articles that are only 100 bytes long. After the lower limit is pushed from 120 to 140 bytes, it will continue to push upwards to 160 and 180 bytes.
Some might ask why we don't just remove everything shorter than 200 or 300 bytes. You are welcome to try to propose this, but so far every such proposal has met compact resistance. Instead I found 20 articles saying "X is an island in Y archipelago" and merged them into a nice article on "Y archipelago" with a map, so that everybody can agree that the new style is a real improvement. With that sort of aggregation, it becomes more obvious that island Z is missing or that island W doesn't really belong. Fact checking and quality assurance are improved by merging stubs. It's all about setting an example for the future. The average article size will move slowly over the next few years.
The next medium term goal is to look at the 10 percent shortest articles (27K of them), the "lower decile". For the Swedish Wikipedia, this has moved from 312 bytes in April to 349 bytes in June. For the Lithuanian, Estonian, Arabic, Danish Wikipedias it is around 400 bytes. Norwegian and Icelandic are around 500, Polish at 638. The French, Finnish and Hungarian are around 700. Latvian has 808, Czech 917, German 1081 and the Russian Wikipedia has 10 percent of its articles shorter than 1305 bytes.
If you believe the Polish 638 bytes is really bad, it's just because you haven't looked close enough at Danish and Norwegian. In fact, it's not so much worse than the French 682 bytes. You are welcome to make fun of the Swedish stubs, because it is an area where we can show constant improvement.
This hunt for "substubs" is just one of several quality improvement initiatives currently running on the Swedish Wikipedia.
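(The thresholds quoted above are straightforward to compute once the article sizes are at hand, with disambiguation pages already filtered out as Lars describes.)

    def shortest_threshold(sizes, fraction):
        # size (bytes) below which the given fraction of articles falls
        ordered = sorted(sizes)
        return ordered[int(len(ordered) * fraction)]

    # shortest_threshold(sizes, 0.001) -> the "0.1 percent" figure
    # shortest_threshold(sizes, 0.10)  -> the "lower decile"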
On Mon, Jun 30, 2008 at 12:44 AM, Lars Aronsson lars@aronsson.se wrote:
It was only recently that the stubbiness of the Swedish Wikipedia, relative to other languages of Wikipedia, started to come into public awareness. Most regular wikipedians are now aware of this.
Well, since the graphs are distributed in the same way, one can say that Swedish Wikipedia is just as stubby as the English Wikipedia, if you define a stub relative to the average article size, which I think is probably a better definition than any absolute byte value.
Not only does the Russian Wikipedia have a high average article size (about 5500 bytes, compared with, for example, English Wikipedia at around 4100 bytes) but its graph, which has multiple peaks, seems to show that, unlike many other projects, it has more mature, medium-size articles than it does stubs.
But isn't it that Russian Cyrillic letters all take up 2 bytes of space, while simple Latin letters take only 1 byte?
My guess is that much of that 1400-byte gap will be eaten away when accounting for Cyrillic vs. Latin letters.
Renata
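(Renata's point in concrete terms, assuming the statistics count UTF-8 bytes:)

    text = "Москва"
    print(len(text))                  # 6 characters
    print(len(text.encode("utf-8")))  # 12 bytes: Cyrillic is 2 bytes per letter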
The Polish Wikipedia was the 4th by article count, but the 7th by compressed database size, for a quality of 4 - 7 = -3.
Main space or the whole database? We don't like many edits per article in a short time. Look at:
http://stats.wikimedia.org/PL/TablesWikipediaArticleEditsPL.htm
http://stats.wikimedia.org/PL/TablesWikipediaArticleEditsEN.htm
http://stats.wikimedia.org/PL/TablesWikipediaArticleEditsES.htm
http://stats.wikimedia.org/PL/TablesWikipediaArticleEditsRU.htm
http://stats.wikimedia.org/PL/TablesWikipediaArticleEditsFR.htm
If any (meta) page gets huge, we do everything to reduce its growth in size. So, often we create more subpages.
But - some days ago we talked on IRC with Japanese Wikipedians about comparisons of Wikipedias - not only about the "count of..." but also about quality. We need to know more about our projects.
regards
Przykuta
Tomasz, my impression is that you do not like the results because pl.WP has a poor ratio; that is what you initially complained about. I know - and never denied - that 50 is a small sample; I did it for 53 Wikipedias. I do have criteria, even if I did not list them for you; you have not read my paper. But you immediately accuse me of judging purely on feelings and attitudes towards nations.
For example, for geo stubs I require at least two pieces of information that are not bot-created. http://pl.wikipedia.org/wiki/Abisynia_(powiat_bialski) I checked the first few of Kategoria:Zalążki artykułów o polskich wsiach, and only Abramy would I count as real, because there are two pieces of information about its history (1599 and 1676). I suppose that many of the other 50,159 articles in that category are pseudo articles (they all have that part about the administrative division of 1975-1998). The same can be said about Kategoria:Zalążek artykułu o miejscowości francuskiej, with 35,066 cities in France. Schematic planetoid articles are not real articles either, like the 14,444 in Kategoria:Planetoidy pasa głównego.
So, I can imagine why pl.WP has only 64% real articles according to my sample.
Ziko
Hoi, when you base your statistics on numbers that have a pseudo relevance, the resulting statistics have as a consequence the same pseudo relevance, or less. When, as a result of concentrating on inflating the wrong numbers, the results are called "tragic", as you have done, it is clear that the numbers everybody is concentrating on are the wrong ones.
No amount of quibbling will change this fact. You can increase your sample size as much as you like, you can include all kinds of other factors that have a tangential relation to the numbers considered, but it will not make the results any better. It will not change the numbers the argument is based on; they will not give meaningful results when people try to improve those numbers.
This whole argumentation is based on the metric of the number of articles. More relevant is the number of reads for a project. By and large, there is no way in which those numbers can be manipulated in a way that is detrimental to individual Wikipedias or to Wikipedia in general.
The most tragic part of this whole argument is that it is based on the wrong premises. All of these arguments make no real difference. I can safely argue that the big increase in the number of articles for the Volapük Wikipedia not only provided a large number of new articles, it also increased the visibility of this language and the number of people editing in Volapük; it is a genuine success. The problem is that, for all kinds of reasons, people are of the opinion that it diminishes the success of *their* Wikipedia, and a rich variety of arguments has been used to diminish the success of all the hard work that went into making this happen. THIS IS SAD.
When we use a different metric, particularly the number of people using a project, there is no way that anyone can argue that the numbers are unacceptable. It is obvious that the number of people speaking a language will affect the relative numbers. At the same time, any and all activities that stimulate the number of readers are positive for the Wikipedia and WMF aims. For languages with few speakers, the methods for gaining more readers may be different. When all the intellectual activity is centred on these methods, we will have a positive discussion instead of the current one, which will not bring us anything that I consider worthwhile.
My challenge to you all is to argue that your arguments have any relevance, except for the fact that we have always measured relevance by the number of articles... If your arguments are not convincing, you have all the arguments why we should ditch the number of articles as the yardstick by which we measure the relevance of our Wikipedia projects...
Thanks, GerardM