A few weeks back, I was playing around with some numbers. It is true that the number of effective speakers isn't a very good predictor, but it is a place to start.
Most of our mature editor communities have about 20 active editors per 1 million effective speakers, give or take a factor of 4. In other words, among communities with at least 100 active editors most range from 5 to 80 active editors per million effective speakers. Admittedly, that is not a very precise range. English, for example, is right at 20 on this metric. There are also some important outliers, such as the Chinese and Arabic communities (both less than 2 active editors per million speakers), which probably have yet to reach parity with the other active languages. There are also a few major languages (e.g. Hindi, Bengali, and Malay) that arguably haven't even begun. Those have fewer than 100 active editors and less than 0.5 editors per 1 million speakers, despite hundreds of millions of speakers.
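To make the arithmetic behind that range concrete, here is a minimal Python sketch of the rule of thumb above. The function names and any inputs you pass in are illustrative assumptions, not part of any existing tool or dataset.

```python
# Rule of thumb from the text: about 20 active editors per million
# effective speakers, give or take a factor of 4.
BASELINE_PER_MILLION = 20.0
SPREAD_FACTOR = 4.0


def typical_range(baseline=BASELINE_PER_MILLION, factor=SPREAD_FACTOR):
    """Return the (low, high) bounds of the typical range."""
    return baseline / factor, baseline * factor


def editors_per_million(active_editors, speakers_millions):
    """Active editors per one million effective speakers."""
    return active_editors / speakers_millions


def classify(active_editors, speakers_millions):
    """Say where a community falls relative to the typical range."""
    low, high = typical_range()
    rate = editors_per_million(active_editors, speakers_millions)
    if rate < low:
        return "below typical range"
    if rate > high:
        return "above typical range"
    return "within typical range"
```

With the defaults, typical_range() gives the 5 to 80 band quoted above, and a community at fewer than 2 editors per million would be classified as below it.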
I suspect that if one could start adjusting for other factors, e.g. speakers with internet access, one might be able to narrow that predicted range. Economic and cultural factors are also probably important, as well as the penetration of secondary languages like English.
Structurally, it seems like this kind of data analysis problem would be fairly amenable to various kinds of regression analysis. The main difficulty would be gathering the right data, e.g. number of effective speakers (which probably needs to be subdivided by country in order to compare to other data sets), internet penetration, economic indicators, access to education, etc. Does anyone happen to know where there is comprehensive language data broken down by country?
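As a rough illustration of what such a regression could look like, here is a sketch of an ordinary least-squares fit on log-transformed values with a single predictor, written in plain Python. The choice of a log-log model, the single predictor (for instance, speakers with internet access), and the function names are all my assumptions; a real analysis would likely use several predictors and a statistics library.

```python
import math


def fit_log_log(xs, ys):
    """Fit log(y) = intercept + slope * log(x) by ordinary least squares.

    xs: predictor values (hypothetical, e.g. millions of speakers online)
    ys: response values (hypothetical, e.g. active editors)
    All values must be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted response for predictor value x under the fitted model."""
    return math.exp(intercept + slope * math.log(x))


# Synthetic check data only (not real statistics): y is exactly 20 * x,
# so the fit should recover a slope of 1 and an intercept of log(20).
synthetic_x = [1.0, 10.0, 100.0]
synthetic_y = [20.0 * x for x in synthetic_x]
intercept, slope = fit_log_log(synthetic_x, synthetic_y)
```

A slope near 1 would mean editors scale roughly in proportion to the predictor; the residuals would then show which communities sit above or below the trend.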
As others have suggested, I would emphasize community participation or readership metrics rather than article metrics due to bot biasing, etc.
Anyway, if one uses 20 active editors per 1 million speakers as a rough guide, one can estimate which languages have the most natural potential for growth. The top 15 on that list would be, in order: Chinese, Hindi, Arabic, Malay, Spanish, Indonesian, Bengali, Portuguese, Russian, Punjabi, Marathi, Tagalog, Javanese, Wu, and Telugu. Those would collectively account for 70% of "missing" editors if we assume that we roughly expect 20 editors / 1 million speakers. In terms of feature development for under-utilized languages, those are probably a reasonable set to be thinking about.
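The "missing" editor estimate can be sketched as follows. This is only my reading of the calculation: expected editors at 20 per million speakers, minus current active editors, floored at zero. The community names and numbers in the example are placeholders, not real figures.

```python
def missing_editors(speakers_millions, active_editors, target_per_million=20.0):
    """Shortfall relative to the target rate; never negative."""
    expected = target_per_million * speakers_millions
    return max(0.0, expected - active_editors)


def rank_by_gap(communities, target_per_million=20.0):
    """Rank communities by shortfall, largest first.

    communities: dict mapping name -> (speakers_millions, active_editors)
    Returns a list of (name, gap, share_of_total_gap) tuples.
    """
    gaps = {
        name: missing_editors(speakers, editors, target_per_million)
        for name, (speakers, editors) in communities.items()
    }
    total = sum(gaps.values())
    ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
    return [(name, gap, gap / total if total else 0.0) for name, gap in ranked]


# Placeholder inputs for illustration only (not real statistics).
example = {
    "lang_a": (500.0, 1000.0),
    "lang_b": (100.0, 2500.0),
    "lang_c": (50.0, 200.0),
}
ranking = rank_by_gap(example)
```

Summing the share column over the top entries gives the kind of "collectively account for" figure mentioned above.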
Most of the list is from Asian countries, and with the exception of Spanish and Portuguese, they are all languages that use non-Latin character sets. So support for other scripts is obviously important. On the other hand, it is also possible that many of these languages are "missing" in part because the computer literate among the populations who speak these languages actually prefer to edit in some other language (e.g. English).
Anyway, just a few thoughts.
-Robert Rohde
On Sun, Jan 25, 2015 at 5:57 PM, Amir E. Aharoni <amir.aharoni@mail.huji.ac.il> wrote:
Hi,
It is well-known that the size of a Wikipedia in a given language is not proportional to the number of people who speak that language. By "size" I mean the article count and the active editor count.
This raises the question: Is it proportional to anything else?
I can think of a bunch of possible things (to most items you can add "... in the countries where this language is spoken"):
- Penetration of Internet access
- Quality of education
- Number of people who know other major languages, such as English, French, Russian, Spanish, etc.
- Number of people who *don't* know other major languages
- Gross domestic product
- Human Development Index
- The level of usage of this language in the education system (in some countries schools function in foreign languages)
- Amount of published literature in that language
- Level of censorship and press freedom
- [[Language planning]] policies (think Catalonia, Ukraine, Quebec, Israel)
It is quite possible that the size of a Wikipedia is proportional not to one of these things, but to a combination of them. It is also possible that it is not proportional to any of the above, or to anything at all.
Did anybody ever try to research this?
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore