Re: [Analytics] [Languages] [Wikimedia-l] Wikipedia article per speaker

14 Jun 2015


I started writing a longer email, but then realized that it's better
to stick with the most important points, as everything is complex
enough anyway. Thus, just metrics and their applications, nothing
else.
A year ago, while I was revisiting my idea from a few years earlier to
open Wikipedia in 3000 more languages, I realized that we have a
substantial problem. The most numerous communities have ~100 active
(thus 5+ edits/month) editors per million speakers. As my hypothesis
was that we could have Wikipedias in languages spoken by more than
10,000 people, that would mean that at best they would have 1 (one)
active editor. Thus, something else has to be done... But before that,
we have to gather data and get an idea of what that "something" is.
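The arithmetic behind that "1 (one) active editor" can be sketched in a few lines of Python. This is only an illustration; the function name is made up here, and the default rate is the ~100 per million figure from above.

```python
def expected_active_editors(speakers: int, editors_per_million: float = 100.0) -> float:
    """Scale a per-million editor rate down to a language with `speakers` speakers."""
    return speakers * editors_per_million / 1_000_000


# A language with 10,000 speakers at ~100 active editors per million speakers.
small_language = expected_active_editors(10_000)
print(small_language)  # 1.0
```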
My first idea -- somewhere between a desperate one and "we should try
something" -- was to ask people from Wikimedia Estonia, Wikimedia
Finland and Wikimedia UK to try to reach as many new active users as
possible on particular projects. The point is that Scottish Gaelic,
Estonian and Finnish are among the top in active users per million
speakers.
A year later, Estonians are doing a very good job (others are good, as
well). They are above 100 active users per million speakers and in
a couple of years they could reach even a couple of hundred.
But there is an obvious flaw in this kind of reasoning and I was
aware of it from the beginning: it's about languages spoken in rich
countries, while we'll be dealing with communities on the opposite
end of the wealth scale. However, at least it's possible to increase
the relative number of active users in "ideal" situations, which means
that ~100 active users per million speakers is not the realistic
maximum.
Thanks to the project Wiktionary meets Matica srpska, I am now getting
more precise insights into Ethnologue data (don't ask me what the
relation is; it was a couple of paragraphs of explanation in the
email I didn't send).
So, a month or so ago I got the first data and the news was very
good: more than 5000 languages won't die during the next 100 years.
More than 2500 languages are in very good shape. That is, if we take
for granted that Ethnologue's data are about languages.
In the meantime, Sylvian mentioned on the Languages list that he is
working on Kichwa Wikipedia. And he noted one important thing: if we
are going to have Wikipedias in languages like Kichwa -- and that's
likely the prototype for most of the languages we will meet in the
future -- we have to adapt to them, not impose unrealistic
expectations on them. That's connected to the data, as I want to know
what we could expect from them. (A note to self: literacy rate is a
very important parameter, as well.)
It is also important to be able to follow the development of a
particular community numerically and give it know-how based on
previous successful experiences.
As we got more results from Ethnologue data, my ambitions rose. Of
course I wanted to get the number of articles per speaker. I got an
approximate correlation between Wikipedia editions and Ethnologue
data. Yes, of course, I knew that there are Wikipedia editions with a
lot of bot-generated articles. So, I've cut the data to languages with
5 or more on the Ethnologue language vitality scale, with the
condition that the language has to have native speakers, and I've got
pretty sane results. Yes, Dutch and Swedish Wikipedias include a lot
of bot-generated articles, but the number of articles in those
languages is quite fine in comparison with the rest of the projects.
There are a few arguments in favor of counting (even bot-generated) articles:
* First, the most important flaw in analyzing such data is looking at
a snapshot (synchrony), not at the development over time. But
synchrony is the starting point. By looking into development, we could
monitor the number of new articles per month and we could easily
conclude what's the normal state of the community and what's not.
* Then, it doesn't take a lot of effort to create legitimate
information on some topics by using bots. If the articles are
legitimate, that gives us a clue about the capacity of a particular
community to create articles and thus spread free knowledge.
* For example, if organized properly, it's not hard to create sane
articles about actors and movies based on English (or Spanish or
whichever) Wikipedia templates. That means that English (or Spanish
or whichever) Wikipedia raises the capacity of other Wikipedia
editions, which is legitimate and quite relevant. It's relevant in the
sense that we should particularly care about languages with a large
number of L2 speakers and languages used as an international or
regional lingua franca. Conversely, we could conclude which languages
have the potential to create a lot of articles thanks to the fact that
their speakers are fluent in one of the big languages. That's also
quite relevant for the "gross capacity" to share knowledge in their
own language.
* The number of possible articles will always rise, even for
bot-generated articles. (Take as an example newly discovered planets
outside of our solar system. For monolinguals, it's relevant to have
that kind of information in their native language.) Thus,
possibilities will rise and it's important to monitor the capacities
of the communities. Having a programmer raises capacity, obviously.
Having a dexterous community member, capable of finding a programmer
inside the movement willing to help create a bot, also counts.
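To make the first point a bit more concrete, here is a rough Python sketch of monitoring new articles per month and flagging months that are far from the community's usual pace. The totals are made up, and the "3 times the median" rule is an arbitrary assumption for illustration, not an established threshold.

```python
from statistics import median


def monthly_new_articles(cumulative_totals: list[int]) -> list[int]:
    """Turn end-of-month article totals into new articles per month."""
    return [later - earlier for earlier, later in zip(cumulative_totals, cumulative_totals[1:])]


def unusual_months(new_per_month: list[int], factor: float = 3.0) -> list[int]:
    """Return indexes of months whose count is more than `factor` times the
    median, or less than the median divided by `factor` (an arbitrary rule)."""
    typical = median(new_per_month)
    return [
        i for i, count in enumerate(new_per_month)
        if count > typical * factor or count < typical / factor
    ]


# Made-up totals for illustration only, not data from any real Wikipedia.
totals = [1000, 1050, 1100, 1160, 3160, 3210]
new_articles = monthly_new_articles(totals)
print(new_articles)                  # [50, 50, 60, 2000, 50]
print(unusual_months(new_articles))  # [3]
```

A month like the fourth one here would typically be a bot run; whether that counts as "normal" for the community is exactly the kind of question the numbers can only raise, not answer.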
I've seen projects with a lot of edits and a disproportionately small
number of articles. From my perspective, it's better to have more
articles than to have a lot of rollbacks and a lot of talk. Although
the community itself is our most important value, our main task is to
create articles, not to argue. Besides, a lot of edits with few
articles could be a sign of bad community health.
But there are many other possible indicators, which could work in
most cases. For example, edit count. Looking at the first five
projects by number of articles, edit count easily shows that the
ranks are: (1) English, (2) German, (3) French, (4) Dutch, (5)
Swedish, not (1) English, (2) Swedish, (3) Dutch, (4) German, (5)
French. (By taking a look at the other Wikipedias, we could see that
even Chinese in 15th place is stronger than the Swedish Wikipedia in
2nd.)
Not counting English as the world's primary lingua franca, it's also
interesting to see that the number of edits per German and French
speaker is roughly 1.5, while it is 0.6 in the Russian case. Danish is
~1.7, Polish is ~1.05, Serbian is ~1.2, but Japanese is ~0.4 and
Swahili ~0.05. (I made the approximations without a calculator, thus
the error range is likely +-10% :) ) Thus, GDP/PPP per capita doesn't
need to be that important a factor (in the sense "once you reach a
particular GDP/PPP per capita, it's no longer an important factor"),
while other things could be.
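For anyone who wants to redo the edits-per-speaker approximation with a calculator, a minimal Python sketch follows. The function name and the two imaginary languages are assumptions for illustration; the figures are placeholders, not real statistics.

```python
def edits_per_speaker(total_edits: int, speakers: int) -> float:
    """Total edits on a Wikipedia edition divided by the speakers of its language."""
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    return total_edits / speakers


# Placeholder figures for two imaginary languages, not real statistics.
editions = {
    "language A": {"edits": 150_000_000, "speakers": 100_000_000},
    "language B": {"edits": 5_000_000, "speakers": 100_000_000},
}

ratios = {
    name: edits_per_speaker(row["edits"], row["speakers"])
    for name, row in editions.items()
}
# List the editions from the highest ratio to the lowest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f} edits per speaker")
```

Whether "speakers" means native speakers only or also L2 speakers changes the ratio a lot, so that choice has to be stated next to any real numbers.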
It's also important to keep in mind that various data are likely
exposing various issues. And every issue has to be analyzed from a
socio-economic perspective (obviously, Japanese Wikipedia is not
relatively weak for the same reason as the Russian or Swahili
Wikipedias are).
I will include as many parameters as possible in the future analysis.
As I now have the number of speakers of each language per country, it
is now possible to correlate economic development with a particular
language.
On Jun 13, 2015 09:38, "Federico Leva (Nemo)" nemowiki@gmail.com wrote:
...
Asaf Bartov, 13/06/2015 02:42:
...
The (already existing) metric of active-editors-per-million-speakers is,
it seems to me, a far more robust metric.  Erik Z.'s stats.wikimedia.org
<http://stats.wikimedia.org> is offering that metric.
I personally agree on this in general, but Millosh is trying something different in his current quest, i.e. content ingestion and content coverage assessment, also for missing language subdomains. (By the way, I created the category, please add stuff: https://meta.wikimedia.org/wiki/Category:Content_coverage .)
Mere article count tells us very little and he acknowledged it. As you added analytics: maybe when https://phabricator.wikimedia.org/T44259 is fixed we can also do fancy things like join various tables and count (countable) articles above a minimum threshold of hits, or something like that.
Oh, and the total number of internal links in a wiki is also an interesting metric in many cases: they're often a good indicator of how curated a wiki globally is, while bot-created articles are often orphan. (Locally there might be overlinking but that's rarely a wiki-wide issue.) I don't remember how reliable the WikiStats numbers are, but they often give a good clue already.
Nemo

Languages mailing list
Languages@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/languages
