Re: [Languages] [Analytics] [Wikimedia-l] Wikipedia article per speaker

Wikipedia article per speaker

Federico Leva (Nemo)

13 Jun 2015

4:38 p.m.

Asaf Bartov, 13/06/2015 02:42:

...

The (already existing) metric of active-editors-per-million-speakers is, it seems to me, a far more robust metric. Erik Z.'s stats.wikimedia.org (http://stats.wikimedia.org) is offering that metric.

I personally agree on this in general, but Millosh is trying something different in his current quest, i.e. content ingestion and content coverage assessment, also for missing language subdomains. (By the way, I created the category, please add stuff: https://meta.wikimedia.org/wiki/Category:Content_coverage .)

Mere article count tells us very little and he acknowledged it. As you added analytics: maybe when https://phabricator.wikimedia.org/T44259 is fixed we can also do fancy things like join various tables and count (countable) articles above a minimum threshold of hits, or something like that.
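A minimal sketch of the "join and count above a threshold" idea, assuming two hypothetical inputs: a list of (page_id, is_countable) pairs and a dict of hits per page_id. Neither shape is taken from the real analytics tables, and the rows are made up only to exercise the function.

```python
def count_articles_above_threshold(articles, hits_by_page_id, min_hits):
    """Count countable articles whose hit total reaches min_hits.

    articles: iterable of (page_id, is_countable) pairs (hypothetical shape).
    hits_by_page_id: dict mapping page_id -> number of hits (hypothetical shape).
    Pages with no recorded hits are treated as having 0 hits.
    """
    return sum(
        1
        for page_id, is_countable in articles
        if is_countable and hits_by_page_id.get(page_id, 0) >= min_hits
    )


# Made-up rows, not real wiki data.
articles = [(1, True), (2, True), (3, False), (4, True)]
hits = {1: 250, 2: 3, 3: 900}
print(count_articles_above_threshold(articles, hits, min_hits=10))
```

The lookup with a default of 0 plays the role of a left join: an article that never shows up in the hits data is kept in the article list but simply fails the threshold.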

Oh, and the total number of internal links in a wiki is also an interesting metric in many cases: it's often a good indicator of how curated a wiki is overall, while bot-created articles are often orphans. (Locally there might be overlinking but that's rarely a wiki-wide issue.) I don't remember how reliable the WikiStats numbers are, but they often give a good clue already.

Nemo

Milos Rancic

14 Jun 2015

8:13 p.m.

New subject: [Analytics] [Wikimedia-l] Wikipedia article per speaker

I started writing a longer email, but then realized that it's better to stick with the most important points, as everything is complex enough anyway. Thus, just metrics and their applications, nothing else.

A year ago, while I was revisiting my few-years-old idea of opening Wikipedia in 3000 more languages, I realized that we have a substantial problem. The most numerous communities have ~100 active (thus 5+ edits/month) editors per million speakers. As my hypothesis was that we could have Wikipedias in languages spoken by more than 10,000 people, that would mean that at best they would have 1 (one) active editor. Thus, something else has to be done... But before that, we have to gather data and get an idea of what that "something" is.
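The arithmetic behind that "1 (one) active editor" can be sketched as below. The function name and the default rate parameter are only illustrative; the ~100 per million rate and the 10,000 speakers are the figures from the paragraph above.

```python
def expected_active_editors(speakers: int, editors_per_million: float = 100.0) -> float:
    """Scale a per-million-speakers editor rate down to a given speaker population."""
    return speakers * editors_per_million / 1_000_000


# A language with 10,000 speakers at ~100 active editors per million speakers.
print(expected_active_editors(10_000))
```

With those inputs the result is a single active editor, which is why the per-million rate alone cannot carry a project for a small language.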

My first idea -- something between a desperate one and "we should try something" -- was to ask people from Wikimedia Estonia, Wikimedia Finland and Wikimedia UK to try to reach as many new active users as possible on particular projects. The point is that Scottish Gaelic, Estonian and Finnish are among the top in active users per million speakers.

A year later, Estonians are doing a very good job (others are good, as well). They are above 100 active users per million speakers and in a couple of years they could even reach a couple of hundred.

But there is an obvious flaw in this kind of reasoning and I was aware of it from the beginning: it's about languages spoken in rich countries, while we'll be dealing with communities on the opposite end of wealth. However, at least it's possible to increase the relative number of active users in "ideal" situations, which means that ~100 active users per million speakers is not a kind of realistic maximum.

Thanks to the project Wiktionary meets Matica srpska, I am now getting more precise insights into Ethnologue data (don't ask me what the relation is; it was a couple-of-paragraphs-long explanation in the email I didn't send).

So, a month ago or so I got the first data and the news was very good: more than 5000 languages won't die during the next 100 years. More than 2500 languages are in very good shape. That is, if we take for granted that Ethnologue's data are about languages.

In the meantime, Sylvian mentioned on the Languages list that he is working on the Kichwa Wikipedia. And he noted one important thing: if we are going to have Wikipedias in languages like Kichwa -- and that's likely the prototype for most of the languages which we will meet in the future -- we have to adapt to them, not impose unrealistic expectations on them. That's connected to the data, as I want to know what we could expect from them. (A note to self: literacy rate is a very important parameter, as well.)

It is also important to be able to follow numerically the development of a particular community and give it know-how based on previous successful experiences.

As we got more results from Ethnologue data, my ambitions rose. Of course I wanted to get the number of articles per speaker. I got an approximate correlation between Wikipedia editions and Ethnologue data. Yes, of course, I knew that there are Wikipedia editions with a lot of bot-generated articles. So, I've cut the data to languages with 5 or more on the Ethnologue language vitality scale, with the condition that the language has to have native speakers, and I've got pretty sane results. Yes, the Dutch and Swedish Wikipedias include a lot of bot-generated articles, but the number of articles in those languages is quite fine in comparison with the rest of the projects.

There are a few arguments in favor of counting (even bot-generated) articles:

* First, the most important flaw in analyzing such data is taking their synchrony (a snapshot at one point in time), not the development. But synchrony is the starting point. By looking into development, we could monitor the number of new articles per month and we could easily conclude what's the normal state of the community and what's not.

* Then, it doesn't take a lot of effort to create legitimate information on some topics by using bots. If the articles are legitimate, that gives us a clue about the capacity of a particular community to create articles and thus spread free knowledge.

* For example, if organized properly, it's not hard to create sane articles based on English (or Spanish or whichever) Wikipedia templates about actors and movies. That means that the English (or Spanish or whichever) Wikipedia raises the capacity of other Wikipedia editions, which is legitimate and quite relevant. It's relevant in the sense that we should particularly care about languages with a large number of L2 speakers and languages used as an international or regional lingua franca. Conversely, we could conclude which languages have the potential to create a lot of articles thanks to the fact that the speakers of that language are fluent in one of the big languages. That's also quite relevant for the "gross capacity" to share knowledge in their own language.

* The number of possible articles will always rise. Even for bot-generated articles. (Take as an example newly discovered planets outside of our solar system. For monolinguals, it's relevant to have that kind of information in their native language.) Thus, possibilities will rise and it's important to monitor the capacities of the communities. Having a programmer raises capacity, obviously. Having a dexterous community member, capable of finding a programmer inside the movement willing to help create a bot, also counts.

I've seen projects with a lot of edits and a disproportionately small number of articles. From my perspective, it's better to have more articles than to have a lot of rollbacks and a lot of talk. Although the community itself is our most important value, our main task is to create articles, not to argue. Besides, that imbalance could be a sign of bad community health.

But there are many other possible indicators which could work in most cases. For example, edit count. Taking the first five projects by number of articles and ranking them by edit count instead, we could easily conclude that the ranks are: (1) English, (2) German, (3) French, (4) Dutch, (5) Swedish, not (1) English, (2) Swedish, (3) Dutch, (4) German, (5) French. (By taking a look at the other Wikipedias, we could see that even Chinese, in 15th place, is stronger than the Swedish Wikipedia in 2nd.)

Not counting English, as the world's primary lingua franca, it's also interesting to see that the edits per German and French speaker are roughly 1.5, versus 0.6 in the Russian case. Danish is ~1.7, Polish is ~1.05, Serbian is ~1.2, but Japanese is ~0.4 and Swahili ~0.05. (I made the approximations without a calculator, thus the error range is likely ±10% :) ) Thus, GDP/PPP per capita doesn't need to be that important a factor (in the sense "once you reach a particular GDP/PPP per capita, it's no longer an important factor"), while other things could be.
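For anyone who wants to redo this with a calculator, a rough sketch of the ratio and a ranking of the approximate values quoted above. The function and variable names are illustrative, and the 150/100 call is a made-up input only to show the division; the per-language ratios are the approximations from this email, not recomputed from raw edit or speaker counts.

```python
def edits_per_speaker(total_edits: float, speakers: float) -> float:
    """Ratio of total edits in a Wikipedia edition to speakers of its language."""
    return total_edits / speakers


# Approximate ratios as quoted in the email (stated error range about 10%).
approx_ratio = {
    "German": 1.5,
    "French": 1.5,
    "Russian": 0.6,
    "Danish": 1.7,
    "Polish": 1.05,
    "Serbian": 1.2,
    "Japanese": 0.4,
    "Swahili": 0.05,
}

# Highest ratio first; ties keep their original order.
ranked = sorted(approx_ratio.items(), key=lambda item: item[1], reverse=True)
for language, ratio in ranked:
    print(f"{language}: ~{ratio}")

# Made-up numbers, only to show the division itself.
print(edits_per_speaker(150, 100))
```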

It's also important to keep in mind that different data are likely exposing different issues. And every issue has to be analyzed from a socio-economic perspective (obviously, the Japanese Wikipedia is not relatively weak for the same reason as the Russian or Swahili Wikipedias are).

I will include as many parameters as possible in the future analysis. As I now have the number of speakers of a particular language per country, it is now possible to correlate economic development with a particular language.

Languages mailing list Languages@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/languages

Amir E. Aharoni

15 Jun 2015

12:04 a.m.

New subject: [Analytics] [Wikimedia-l] Wikipedia article per speaker

Wonderful work, Miloš.

Some notes on edit count:

1. Some Wikipedias import all the versions of a translated article because they believe that it's required for attribution (AFAIK it isn't). This, of course, inflates the edit count in a completely artificial way, and sadly I don't know how to filter this chaff.

2. Bot edits could probably be filtered out, but there are some very different types of bots and it should be taken into account when measuring community success. Some bots just create articles (Waray, Swedish, Dutch). Some fix interlanguage links (not any longer, but it was huge everywhere before 2013). Some auto-fix spelling, and it's a sign of a healthy community (Hebrew, Catalan, and some others). Some are smarter than AbuseFilter at reverting vandalism, and that's also a good sign.

3. Some sysops delete revisions with vandalism, which could simply be reverted. I don't know how prevalent it is. More generally, deleted revisions could probably be counted in a useful way as part of this project.
