(adding Analytics, as a relevant group for this discussion.)
I think this is next to meaningless, because the differing bot policies and practices on different wikis skew the data into incoherence.
The (already existing) metric of active-editors-per-million-speakers is, it seems to me, far more robust. Erik Z.'s stats.wikimedia.org offers that metric.
A.
On Sun, Jun 7, 2015 at 3:23 PM, Milos Rancic millosh@gmail.com wrote:
When you get data, at some point you start thinking about quite fringe comparisons. But those can actually yield useful conclusions, as this one did [1].
We did the following:
- Used the number of primary speakers from Ethnologue. (Erik Zachte uses an approximate number of primary + secondary speakers; that could be useful for correcting this data.)
- Categorized languages into logarithmic bands by number of speakers: >=10k, >=100k, >=1M, >=10M, >=100M.
- Took the number of Wikipedia articles in each language and computed the ratio (number of articles / number of speakers).
- Kept only languages with Ethnologue status 1 (national), 2 (provincial) or 3 (wider communication). In fact, we have many projects (more than 100) in languages with a worse status; a number of them are actually threatened or even on the edge of extinction.
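As a rough sketch of the method described above, in Python, with placeholder article and speaker counts (these are illustrative numbers, not the actual Ethnologue or Wikipedia figures):

```python
import math

# Placeholder data: (language, articles, primary speakers) -- illustrative only.
wikis = [
    ("Basque", 250_000, 650_000),
    ("Swedish", 2_000_000, 9_600_000),
    ("English", 4_900_000, 340_000_000),
]

def log_bucket(speakers):
    """Logarithmic band of the speaker count, as a power of ten (>=10k, >=100k, ...)."""
    return 10 ** int(math.log10(speakers))

for lang, articles, speakers in wikis:
    ratio = articles / speakers  # articles per speaker
    print(f"{lang}: band >= {log_bucket(speakers):,}, ratio {ratio:.5f}")
```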
These are preliminary results and I will definitely have to go through all the numbers again. I manually fixed some serious errors, like the English Wikipedia itself being missing from the data :D
Putting the languages into logarithmic categories proved useful, as we can now compare Wikipedias according to their gross capacity (number of speakers). I suppose somebody well versed in statistics could even construct a function to measure how well a project is doing without those strict categories.
It's obvious that the more speakers a language has, the harder it is for the community to keep up the ratio.
So, the winners per category are:
>=1k: Hawaiian, ratio 0.96900
>=10k: Mirandese, ratio 0.18073
>=100k: Basque, ratio 0.38061
>=1M: Swedish, ratio 0.21381
>=10M: Dutch, ratio 0.08305
>=100M: English, ratio 0.01447

However, keep in mind that we removed languages outside status categories 1, 2 and 3. That affected the >=10k band; for example, Upper Sorbian (0.67) does much better than Mirandese. (I will fix this while creating the full report. Obviously, in this case the logarithmic speaker categories matter much more than the language's status.)
It's obvious that we could draw a line from 1:1 for 1-10k speakers to 10:1 for >=100M speakers. But, again, I would like input from somebody more competent.
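One way to make "drawing the line" concrete is a log-linear benchmark that falls from a 1:1 ratio at 10k speakers to 1:10 at 100M speakers; a project's score is its actual ratio divided by the benchmark for its size. The anchor values here are my assumption of what the text suggests, not anything fitted to the data:

```python
import math

# Assumed anchors: ratio 1.0 at 10k speakers, 0.1 at 100M speakers.
LO_SPEAKERS, LO_RATIO = 1e4, 1.0
HI_SPEAKERS, HI_RATIO = 1e8, 0.1

def expected_ratio(speakers):
    """Log-linear interpolation of the benchmark articles/speaker ratio."""
    t = (math.log10(speakers) - math.log10(LO_SPEAKERS)) / (
        math.log10(HI_SPEAKERS) - math.log10(LO_SPEAKERS))
    return 10 ** (math.log10(LO_RATIO) + t * (math.log10(HI_RATIO) - math.log10(LO_RATIO)))

def score(articles, speakers):
    """How a project does relative to the benchmark for its size (1.0 = on the line)."""
    return (articles / speakers) / expected_ratio(speakers)
```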
One very important factor is missing here: the level of development of the speakers. It could be added: GDP (PPP) per capita for the country or countries where the language is spoken would be a useful measurement. And I suppose somebody with statistical knowledge would be able to give us a number that means "ability to create a Wikipedia article".
Completed in such a way, we'd be able to measure the success of particular Wikimedia groups and organizations. Articles per speaker are not the only way to do so; we could use other parameters as well, such as the number of new/active/very active editors, and we could put it on a time scale.
I'll produce some other results. And as a reminder: I'd like to have a formula for "ability to create a Wikipedia article" and then to derive "a particular community's success in creating Wikipedia articles". And, of course, to apply the same to editors.
[1] https://docs.google.com/spreadsheets/d/1TYyhETevEJ5MhfRheRn-aGc4cs_6k45Gwk_i...
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines Wikimedia-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe
Read the rest :P

On Jun 13, 2015 02:43, "Asaf Bartov" abartov@wikimedia.org wrote:
(adding Analytics, as a relevant group for this discussion.)
I think this is next to meaningless, because the differing bot policies and practices on different wikis skew the data into incoherence.
The (already existing) metric of active-editors-per-million-speakers is, it seems to me, a far more robust metric. Erik Z.'s stats.wikimedia.org is offering that metric.
A.
In more detail: it seems those numbers hold up quite well even with bot-generated articles. But I will present everything during the next 10-15 days. Editors per million speakers is a good measure, but not a complete one; there are specific cases where it doesn't reveal a problem that other measurements do.
The ratio of edits per active and very active user would also be interesting to check.
Asaf Bartov, 13/06/2015 02:42:
The (already existing) metric of active-editors-per-million-speakers is, it seems to me, far more robust. Erik Z.'s stats.wikimedia.org offers that metric.
I personally agree with this in general, but Millosh is trying something different in his current quest, i.e. content ingestion and content coverage assessment, also for missing language subdomains. (By the way, I created the category; please add stuff: https://meta.wikimedia.org/wiki/Category:Content_coverage .)
Mere article count tells us very little, and he acknowledged it. Since you added Analytics: maybe when https://phabricator.wikimedia.org/T44259 is fixed we can also do fancy things like joining various tables and counting (countable) articles above a minimum threshold of hits, or something like that.
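That join-and-threshold idea could look something like this sketch; the view counts, article names, and threshold are invented for illustration, since the real numbers would come from the page-view data once it can be joined against the article tables:

```python
# Hypothetical per-article page-view counts; real numbers would come from the
# page-view data once it can be joined against the article tables.
views = {"Alpha": 1200, "Beta": 3, "Gamma": 0, "Delta": 87}
articles = ["Alpha", "Beta", "Gamma", "Delta"]

MIN_HITS = 10  # articles below this are treated as effectively unread

countable = [a for a in articles if views.get(a, 0) >= MIN_HITS]
print(len(countable), "of", len(articles), "articles clear the threshold")
```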
Oh, and the total number of internal links in a wiki is also an interesting metric in many cases: it is often a good indicator of how curated a wiki is globally, while bot-created articles are often orphans. (Locally there might be overlinking, but that's rarely a wiki-wide issue.) I don't remember how reliable the WikiStats numbers are, but they often give a good clue already.
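A minimal sketch of that orphan-share idea, assuming we already have an internal-link edge list (the articles and links here are hypothetical):

```python
from collections import defaultdict

# Hypothetical internal-link edge list: (source article, target article).
articles = {"A", "B", "C", "D", "E"}
links = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A")]

incoming = defaultdict(int)
for src, dst in links:
    if dst in articles:
        incoming[dst] += 1

# Orphans: articles no other article links to -- typical of bot-created pages.
orphans = {a for a in articles if incoming[a] == 0}
orphan_share = len(orphans) / len(articles)
```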
Nemo
I started writing a longer email, but then realized it's better to stick to the most important points, as everything is complex enough anyway. Thus, just the metrics and their applications, nothing else.
A year ago, while reviving my few-years-old idea of opening Wikipedia in 3,000 more languages, I realized that we have a substantial problem. The most numerous communities have ~100 active editors (i.e. with 5+ edits/month) per million speakers. As my hypothesis was that we could have Wikipedias in languages spoken by more than 10,000 people, that would mean that, at best, such a language would have 1 (one) active editor. Thus, something else has to be done... But before that, we have to gather data and form an idea of what that "something" is.
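The arithmetic behind that conclusion, taking the ~100 active editors per million speakers figure above as the assumption:

```python
# Assumption from the text: ~100 active editors (5+ edits/month) per million speakers.
ACTIVE_PER_MILLION = 100

def expected_active_editors(speakers):
    """Best-case active-editor count implied by the per-million rate."""
    return speakers * ACTIVE_PER_MILLION / 1_000_000

print(expected_active_editors(10_000))  # a language with 10k speakers: prints 1.0
```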
My first idea -- somewhere between a desperate one and "we should try something" -- was to ask people from Wikimedia Estonia, Wikimedia Finland and Wikimedia UK to try to reach as many new active users as possible on particular projects. The point is that Scottish Gaelic, Estonian and Finnish are among the top languages in active users per million speakers.
A year later, the Estonians are doing a very good job (the others are doing well too). They are above 100 active users per million speakers, and in a couple of years they could even reach a couple of hundred.
But there is an obvious flaw in this kind of reasoning, and I was aware of it from the beginning: it concerns languages spoken in rich countries, while we'll be dealing with communities on the opposite end of wealth. However, it at least shows that it's possible to increase the relative number of active users in "ideal" situations, which means that ~100 active users per million speakers is not a realistic maximum.
Thanks to the project Wiktionary meets Matica srpska, I am now getting more precise insights into Ethnologue data (don't ask me what the relation is; the explanation was a couple of paragraphs long, inside an email I never sent).
So, a month or so ago I got the first data, and the news was very good: more than 5,000 languages won't die during the next 100 years, and more than 2,500 languages are in very good shape -- if we take for granted that Ethnologue's entries really are languages.
In the meantime, Sylvian mentioned on the Languages list that he is working on the Kichwa Wikipedia. And he noted one important thing: if we are going to have Wikipedias in languages like Kichwa -- and that's likely the prototype for most of the languages we will meet in the future -- we have to adapt to them, not impose unrealistic expectations on them. That's connected to the data, as I want to know what we can expect from them. (A note to self: literacy rate is a very important parameter as well.)
It is also important to be able to follow a particular community's development numerically and give it know-how based on previous successful experiences.
As we got more results from the Ethnologue data, my ambitions rose. Of course, I wanted the number of articles per speaker. I built an approximate correlation between Wikipedia editions and Ethnologue data. Yes, I knew that there are Wikipedia editions with a lot of bot-generated articles. So I cut the data to languages with 5 or more on the Ethnologue language vitality scale, with the condition that the language has to have native speakers, and I got pretty sane results. Yes, the Dutch and Swedish Wikipedias include a lot of bot-generated articles, but the number of articles in those languages is quite reasonable in comparison with the rest of the projects.
There are a few arguments in favor of counting (even bot-generated) articles:
- First, the most important flaw in analyzing such data is looking at a snapshot rather than at development. But the snapshot is the starting point. By looking at development, we could monitor the number of new articles per month and easily conclude what the normal state of a community is and what isn't.
- It doesn't take much effort to create legitimate information on some topics using bots. If the articles are legitimate, that gives us a clue about a community's capacity to create articles and thus spread free knowledge.
- For example, if organized properly, it's not hard to create sane articles about actors and movies based on English (or Spanish, or whichever) Wikipedia templates. That means the English (or Spanish, or whichever) Wikipedia raises the capacity of other Wikipedia editions, which is legitimate and quite relevant. It's relevant in the sense that we should particularly care about languages with a large number of L2 speakers and languages used as an international or regional lingua franca. Conversely, we could work out which languages have the potential to create a lot of articles thanks to their speakers' fluency in one of the big languages. That's also quite relevant for the "gross capacity" to share knowledge in their own language.
- The number of possible articles will always grow, even for bot-generated articles. (Take as an example newly discovered planets outside our solar system: for monolinguals, it's relevant to have that kind of information in their native language.) Thus, the possibilities will grow, and it's important to monitor the communities' capacities. Having a programmer raises capacity, obviously. Having a dexterous community member capable of finding a programmer inside the movement willing to help create a bot also counts.
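The per-month monitoring mentioned in the first point could be sketched like this, with invented creation dates standing in for what would really come from the page table:

```python
from collections import Counter
from datetime import date

# Invented article-creation dates; the real ones would come from the page table.
creations = [date(2015, 4, 3), date(2015, 4, 20), date(2015, 5, 1),
             date(2015, 5, 2), date(2015, 5, 30), date(2015, 6, 7)]

per_month = Counter((d.year, d.month) for d in creations)
for month, count in sorted(per_month.items()):
    print(month, count)  # a sudden spike here would suggest a bot run
```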
I've seen projects with a lot of edits and a disproportionately small number of articles. From my perspective, it's better to have more articles than a lot of rollbacks and a lot of talk; although the community itself is our most important value, our main task is to create articles, not to argue. Besides, that pattern can be a sign of bad community health.
But there are many other possible indicators which would work in most cases. For example, edit count. From the first five projects by number of articles, we can easily conclude that the real ranks are (1) English, (2) German, (3) French, (4) Dutch, (5) Swedish -- not (1) English, (2) Swedish, (3) Dutch, (4) German, (5) French. (By looking at the other Wikipedias, we can see that even Chinese, in 15th place by articles, is stronger than the Swedish Wikipedia in 2nd.)
Not counting English, as the world's primary lingua franca, it's also interesting to see that edits per speaker are roughly 1.5 for German and French, but 0.6 for Russian. Danish is ~1.7, Polish ~1.05, Serbian ~1.2, but Japanese is ~0.4 and Swahili ~0.05. (I made these approximations without a calculator, so the error range is likely +-10% :) ) Thus, GDP (PPP) per capita doesn't have to be that important a factor (in the sense that once you reach a particular GDP (PPP) per capita, it's no longer an important factor), while other things could be.
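For the record, the edits-per-speaker figure is just a division of totals; the numbers below are invented to land near the ratios quoted above, not actual dump totals:

```python
# Invented totals chosen to land near the ratios quoted above -- not dump numbers.
total_edits = {"German": 145_000_000, "Russian": 82_000_000}
speakers = {"German": 95_000_000, "Russian": 150_000_000}

ratios = {lang: total_edits[lang] / speakers[lang] for lang in total_edits}
for lang, r in ratios.items():
    print(f"{lang}: ~{r:.2f} edits per speaker")
```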
It's also important to keep in mind that different data likely expose different issues, and every issue has to be analyzed from a socio-economic perspective (obviously, the Japanese Wikipedia is not relatively weak for the same reasons as the Russian or Swahili Wikipedias).
I will include as many parameters as possible in future analyses. As I now have the number of speakers of each language per country, it is possible to correlate economic development with a particular language.
Wonderful work, Miloš.
Some notes on edit count:

1. Some Wikipedias import all the revisions of a translated article because they believe it's required for attribution (AFAIK it isn't). This, of course, inflates the edit count in a completely artificial way, and sadly I don't know how to filter out this chaff.
2. Bot edits could probably be filtered out, but there are some very different types of bots, and this should be taken into account when measuring community success. Some bots just create articles (Waray, Swedish, Dutch). Some fix interlanguage links (not any longer, but it was huge everywhere before 2013). Some auto-fix spelling, which is a sign of a healthy community (Hebrew, Catalan, and some others). Some are smarter than AbuseFilter at reverting vandalism, and that's also a good sign.
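A sketch of such filtering, assuming each revision record carries a bot flag and a coarse, hand-assigned bot type (the records, user names, and type labels here are hypothetical; a real pipeline would need each wiki's bot user list to set these fields):

```python
# Hypothetical revision records with a bot flag and a coarse, hand-assigned type;
# a real pipeline would need each wiki's bot user list to populate these fields.
revisions = [
    {"user": "Alice", "bot": False, "bot_type": None},
    {"user": "LinkBot", "bot": True, "bot_type": "interwiki"},
    {"user": "SpellBot", "bot": True, "bot_type": "spelling"},
    {"user": "Bob", "bot": False, "bot_type": None},
]

# Human edits measure community activity; bot edits are tallied per type,
# since article-creation bots and spelling bots say different things about health.
human_edits = [r for r in revisions if not r["bot"]]

bot_edits_by_type = {}
for r in revisions:
    if r["bot"]:
        bot_edits_by_type[r["bot_type"]] = bot_edits_by_type.get(r["bot_type"], 0) + 1
```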
3. Some sysops delete revisions containing vandalism which could simply have been reverted. I don't know how prevalent that is. More generally, deleted revisions could probably be counted in a useful way as part of this project.
I started writing a longer email, but then realized that it's better to stick with the most important points, as everything is anyway enough complex. Thus, just metrics and its applications, not anything else.
While I was reloading a year ago my few years old idea to open Wikipedia in 3000 more languages, I realized that we have substantial problem. The most numerous communities have ~100 active (thus 5+ edits/month) editors per million of speakers. As my hypothesis was that we could have Wikipedias in languages spoken by more than 10,000 people, that would mean that at the best they would have 1 (one) active editor. Thus, something else has to be done... But before that, we have to gather data and have the idea what's that "something".
My first idea -- something of a kind between a desperate one and "we should try something" -- was to ask people from Wikimedia Estonia, Wikimedia Finland and Wikimedia UK to try to reach as many as possible new active users on particular projects. The point is that Scottish Gaelic, Estonian and Finish are among the top in active users per million of speakers.
A year later, Estonians are doing a very good job (others are good, as well). They are above 100 active users per million of speakers and in a couple of years they could reach even a couple of hundreds.
But, there is an obvious flaw in this kind of reasoning and I was aware of it from the beginning: It's about languages spoken i rich countries, while we'll be dealing with the communities on the opposite end of wealth. However, at least it's possible to increase relative number of active users in "ideal" situations, which means that ~100 active users per million of speakers is not a kind of realistic maximum.
Thanks to the project Wiktionary meets Matica srpska, I am getting now more precise insights into Ethnologue data (don't ask me what's the relation, it was a couple of paragraphs long explanation inside of the email I didn't send).
So, a month ago or so I got the first data and the news were very good: more than 5000 languages won't die during the next 100 years. More than 2500 languages are in very good shape. If we take for granted that Ethnologue's data are about languages.
In the meantime, Sylvian mentioned on Languages list that he is working on Kichwa Wikipedia. And he noted one important thing: if we are going to have Wikipedias in languages like Kichwa is -- and that's likely the prototype for the most of the languages which we will meet in the future -- we have to adapt to them, not to impose unrealistic expectations to them. That's connected to the data, as I want to know what we could expect from them. (A note to self: literacy rate is very important parameter, as well.)
It is also important to be able to numerically follow the development of a particular community and give them know-how based on previous successful experiences.
As we got more results from the Ethnologue data, my ambitions rose. Of course I wanted to get the number of articles per speaker. I got an approximate correlation between Wikipedia editions and Ethnologue data. Yes, of course, I knew that there are Wikipedia editions with a lot of bot-generated articles. So, I cut the data to languages with 5 or more on the Ethnologue language vitality scale, with the condition that the language has to have native speakers, and I got pretty sane results. Yes, the Dutch and Swedish Wikipedias include a lot of bot-generated articles, but the numbers of articles in those languages are quite reasonable in comparison with the rest of the projects.
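The cut described above can be sketched roughly like this. The language records are invented placeholders, not real Ethnologue or Wikipedia figures, and the direction of the vitality cut-off (lower EGIDS number = more vital) is my reading of the scale:

```python
# Illustrative sketch of the articles-per-speaker computation with a
# vitality cut-off. All numbers below are made up for illustration.

languages = [
    # (name, primary speakers, Wikipedia articles, vitality status)
    ("lang_a", 9_000_000, 1_800_000, 1),
    ("lang_b",   200_000,    40_000, 3),
    ("lang_c",    15_000,       120, 7),  # past the cut-off, excluded
]

VITALITY_CUTOFF = 5  # keep statuses 1..5 (lower = more vital), an assumption

ratios = {
    name: articles / speakers
    for name, speakers, articles, status in languages
    if status <= VITALITY_CUTOFF and speakers > 0
}
print(ratios)  # → {'lang_a': 0.2, 'lang_b': 0.2}
```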
There are a few arguments in favor of counting (even bot-generated) articles:
- First, the most important flaw in analyzing such data is taking a synchronic snapshot rather than looking at development. But synchrony is the starting point. By looking at development, we could monitor the number of new articles per month and easily conclude what is and isn't the normal state of the community.
- Then, it doesn't take a lot of effort to create legitimate information on some topics by using bots. If the articles are legitimate, that gives us a clue about the capacity of a particular community to create articles and thus spread free knowledge.
- For example, if organized properly, it's not hard to create sane articles about actors and movies based on English (or Spanish, or whichever) Wikipedia templates. That means that the English (or Spanish, or whichever) Wikipedia raises the capacity of other Wikipedia editions, which is legitimate and quite relevant. It's relevant in the sense that we should particularly care about languages with a large number of L2 speakers and languages used as an international or regional lingua franca. Conversely, we could conclude which languages have the potential to create a lot of articles thanks to the fact that their speakers are fluent in one of the big languages. That's also quite relevant for the "gross capacity" to share knowledge in their own language.
- The number of possible articles will always rise, even for bot-generated articles. (Take as an example newly discovered planets outside our solar system: for monolinguals, it's relevant to have that kind of information in their native language.) Thus, the possibilities will grow, and it's important to monitor the capacities of the communities. Having a programmer raises capacity, obviously. Having a dexterous community member, capable of finding a programmer inside the movement willing to help create a bot, also counts.
I've seen projects with a lot of edits and a disproportionately small number of articles. From my perspective, it's better to have more articles than a lot of rollbacks and a lot of talk; besides, the latter could be a sign of bad community health. Although the community itself is our most important value, our main task is to create articles, not to argue.
But there are many other possible indicators, which would work in most cases. For example, edit count. Among the first five projects by the number of articles, the ranks by edit count are: (1) English, (2) German, (3) French, (4) Dutch, (5) Swedish -- not (1) English, (2) Swedish, (3) Dutch, (4) German, (5) French, as the article counts would suggest. (Taking a look at the other Wikipedias, we can see that even Chinese, in 15th place by articles, is stronger than the Swedish Wikipedia in 2nd.)
Not counting English as the world's primary lingua franca, it's also interesting to see that edits per speaker are roughly 1.5 for German and French, but 0.6 in the Russian case. Danish is ~1.7, Polish ~1.05, Serbian ~1.2, but Japanese ~0.4 and Swahili ~0.05. (I made the approximations without a calculator, so the error range is likely ±10% :) ) Thus, GDP/PPP per capita need not be that important a factor (in the sense that once you reach a particular GDP/PPP per capita, it stops being an important factor), while other things could be.
It's also important to keep in mind that different data likely expose different issues. And every issue has to be analyzed from a socio-economic perspective (obviously, the Japanese Wikipedia is not relatively weak for the same reason as the Russian or Swahili Wikipedias are).
I will include as many parameters as possible in future analyses. As I now have the number of speakers of a particular language per country, it is possible to correlate economic development with a particular language.
On Jun 13, 2015 09:38, "Federico Leva (Nemo)" nemowiki@gmail.com wrote:
Asaf Bartov, 13/06/2015 02:42:
The (already existing) metric of active-editors-per-million-speakers is, it seems to me, a far more robust metric. Erik Z.'s stats.wikimedia.org is offering that metric.
I personally agree with this in general, but Millosh is trying something different in his current quest, i.e. content ingestion and content coverage assessment, also for missing language subdomains. (By the way, I created the category, please add stuff: https://meta.wikimedia.org/wiki/Category:Content_coverage .)
Mere article count tells us very little, and he acknowledged it. Since you added Analytics: maybe when https://phabricator.wikimedia.org/T44259 is fixed we can also do fancy things like join various tables and count (countable) articles above a minimum threshold of hits, or something like that.
Oh, and the total number of internal links in a wiki is also an interesting metric in many cases: it's often a good indicator of how well curated a wiki is globally, while bot-created articles are often orphans. (Locally there might be overlinking, but that's rarely a wiki-wide issue.) I don't remember how reliable the WikiStats numbers are, but they often give a good clue already.
Nemo
Languages mailing list Languages@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/languages