Wikipedia article per speaker

Milos Rancic

8 Jun 2015

1:23 a.m.

Once you have data, at some point you start thinking about fairly fringe comparisons. But those can actually yield useful conclusions, as they did this time [1].

Here is what we did:

* Used the number of primary speakers from Ethnologue. (Erik Zachte uses the approximate number of primary + secondary speakers; that could be useful for correcting this data.)

* Categorized languages by the order of magnitude (logarithm) of their number of speakers: >=1k, >=10k, >=100k, >=1M, >=10M, >=100M.

* Took the number of articles in each language's Wikipedia and computed the ratio (number of articles / number of speakers).

* Restricted the list to languages with Ethnologue status 1 (national), 2 (provincial) or 3 (wider communication). In fact, we have a lot of projects (more than 100) in languages with a worse status; a number of them are actually threatened or even on the edge of extinction.
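The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spreadsheet's actual formulas: the function names, the cap at 100M and the sample figures in the usage note are all hypothetical.

```python
# Ethnologue statuses kept in the list: 1 (national), 2 (provincial),
# 3 (wider communication).
ALLOWED_STATUSES = {1, 2, 3}


def speaker_category(speakers: int) -> int:
    """Return the lower bound of the power-of-ten category for a speaker count.

    Everything from 100M upwards falls into the top ">=100M" category
    (assumption: the cap mirrors the categories named in the message).
    """
    if speakers < 1:
        raise ValueError("speaker count must be positive")
    # The digit count gives the order of magnitude without float rounding.
    floor = 10 ** (len(str(speakers)) - 1)
    return min(floor, 10**8)


def articles_per_speaker(articles: int, speakers: int) -> float:
    """Ratio used in the comparison: number of articles / number of speakers."""
    return articles / speakers


def is_included(status: int) -> bool:
    """True if a language with this Ethnologue status stays in the list."""
    return status in ALLOWED_STATUSES
```

For a made-up language with 1,500,000 primary speakers, status 2 and 30,000 articles, `speaker_category(1_500_000)` gives `1000000` (the >=1M category) and `articles_per_speaker(30_000, 1_500_000)` gives `0.02`.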

These are preliminary results, and I will definitely have to go through all the numbers. I manually fixed some serious errors, such as the English Wikipedia itself missing from the data :D

Putting the languages into logarithmic categories proved useful, as we can now compare the Wikipedias according to their gross capacity (number of speakers). I suppose somebody well versed in statistics could even create a function for checking how well a project is doing, independent of those strict categories.

It's obvious that the more speakers a language has, the harder it is for the community to keep up the ratio.

So, the winners per category are:

1) >= 1k: Hawaiian, ratio 0.96900
2) >= 10k: Mirandese, ratio 0.18073
3) >= 100k: Basque, ratio 0.38061
4) >= 1M: Swedish, ratio 0.21381
5) >= 10M: Dutch, ratio 0.08305
6) >= 100M: English, ratio 0.01447

However, keep in mind that we removed languages outside statuses 1, 2 and 3. That affected the >=10k languages: for example, Upper Sorbian does much better than Mirandese (0.67). (I will fix this while creating the full report. Obviously, in this case the logarithmic categories of speaker numbers matter much more than the status of the language.)

It's obvious that we could draw a line from 1:1 for 1-10k speakers to 10:1 for >=100M speakers. But, again, I would like input from somebody more competent.
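One way to get a category-free function of the kind asked for above is a straight-line fit in log-log space. The sketch below is an assumption-laden illustration, not an agreed method: the power-law form is assumed, the helper names are hypothetical, and each winner's real speaker count is replaced by the lower bound of its category because only the ratios are given here.

```python
import math

# (category floor, winning ratio) as reported in this message.
# The floor stands in for the winner's real speaker count (assumption).
REPORTED_WINNERS = [
    (10**3, 0.96900),  # Hawaiian
    (10**4, 0.18073),  # Mirandese
    (10**5, 0.38061),  # Basque
    (10**6, 0.21381),  # Swedish
    (10**7, 0.08305),  # Dutch
    (10**8, 0.01447),  # English
]


def fit_power_law(points):
    """Least-squares fit of log10(ratio) = intercept + slope * log10(speakers)."""
    xs = [math.log10(speakers) for speakers, _ in points]
    ys = [math.log10(ratio) for _, ratio in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def expected_ratio(speakers, intercept, slope):
    """Articles per speaker predicted by the fitted line for a speaker count."""
    return 10 ** (intercept + slope * math.log10(speakers))


def relative_performance(articles, speakers, intercept, slope):
    """Actual ratio divided by the fitted ratio; 1.0 means 'on the line'."""
    return (articles / speakers) / expected_ratio(speakers, intercept, slope)


INTERCEPT, SLOPE = fit_power_law(REPORTED_WINNERS)
```

A project with `relative_performance(...)` above 1.0 would be doing better than the fitted line for its size, whatever category it falls into. With only six points, and with category floors instead of real speaker counts, the fit is indicative at best.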

One very important category is missing here: the level of development of the speakers. That could be added: GDP (PPP) per capita for the country or countries where the language is spoken would be a useful measure. And I suppose somebody with statistical knowledge would be able to give us a number that means "ability to create a Wikipedia article".

Completed in this way, it would let us measure the success of particular Wikimedia groups and organizations. OK, articles per speaker are not the only way to do so; we could use other parameters as well: the number of new/active/very active editors, etc. And we could put it on a time scale.

I'll produce some other results. And a reminder: I'd like to have a formula to compute "ability to create a Wikipedia article" and then to produce the "level of a particular community's success in creating Wikipedia articles". And, of course, to apply it to editors as well.

[1] https://docs.google.com/spreadsheets/d/1TYyhETevEJ5MhfRheRn-aGc4cs_6k45Gwk_i...


Federico Leva (Nemo)

12 Jun

9:51 p.m.

Milos Rancic, 08/06/2015 00:23:

...

And I suppose somebody with statistical knowledge would be able to give us the number which would have meaning "ability to create Wikipedia article".

Why not use the human development index (HDI) as a factor? Also, instead of the number of articles I'd rather use database size or number of words.

Nemo

Milos Rancic

11:07 p.m.

Illario, Latin doesn't have L1 speakers. And data about languages is such a mess that I would stick with Ethnologue's data for L1 speakers, although it is not reliable. Ethnologue counts "there are 100,000 speakers of language X in country A and 34 in country B, thus there are 100,034 speakers in total" (although the likely error margin for the first number is 150 times larger than the second number), and it has numerous other flaws, such as the fringe "macrolanguage" category. However, besides counting the same way, English Wikipedia has much worse failures once we leave the safety of the ~50 major languages, wherever it is not based on Ethnologue's data. (It's mostly about wishful thinking of ethnic nationalists and a chronic lack of manpower to fix that bullshit promptly.)
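The arithmetic behind the 100,034 example: if the error margin on the first figure is 150 times the second figure, it is 150 × 34 = 5,100 speakers, so the last four digits of the total carry no information. A minimal sketch of reporting such a sum with a sensible number of significant digits follows; the helper name and the choice of two digits are assumptions for illustration only.

```python
def round_to_significant(value: int, digits: int) -> int:
    """Round an integer count to the given number of significant digits."""
    if value == 0:
        return 0
    magnitude = len(str(abs(value)))          # number of digits in the count
    factor = 10 ** max(magnitude - digits, 0)  # size of the dropped tail
    return round(value / factor) * factor


country_a = 100_000   # speakers in country A (figure from the example)
country_b = 34        # speakers in country B (figure from the example)
error_margin_a = 150 * country_b  # margin implied by the example

total = country_a + country_b
reported = round_to_significant(total, 2)
```

Here `error_margin_a` is `5100`, `total` is `100034` and `reported` is `100000`, which is all the precision the inputs support.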

Nemo, yes, I was thinking about various data besides article count and GDP/PPP per capita, so here are my thoughts, including those two parameters:

* Article count per speaker gives a nice pseudo-hyperbolic curve. Basically, you can see a hyperbolic curve by drawing a line over the highest points: Hawaiian-Upper Sorbian-Basque-Swedish-Dutch-English. By normalizing the numbers, we could get targets per language.

* However, edit count seems like a better idea. I think, though it has to be proved, that such numbers won't have to be adjusted for the number of speakers themselves.

* We could count various numbers related to users. For example, it seems that the smaller the ratio between the number of active and very active users, the healthier the community. Also, the number of editors per million speakers per GDP or HDI could be a useful parameter.

* I was thinking about HDI yesterday. But then I realized that it would be good to create all the possibly relevant charts and see what information they bring. I am interested in comparing Wikipedia stats with the Gini coefficient, for example.
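The user-related measures in the list above could be computed along these lines. This is a rough sketch only: the function names are hypothetical, and dividing by the index is just one possible reading of "per GDP or HDI", not a settled formula.

```python
def active_to_very_active(active: int, very_active: int) -> float:
    """Ratio of active to very active users; smaller is suggested as healthier."""
    return active / very_active


def editors_per_million_speakers(editors: int, speakers: int) -> float:
    """Number of editors per one million speakers of the language."""
    return editors * 1_000_000 / speakers


def development_adjusted_editors(editors: int, speakers: int, index: float) -> float:
    """Editors per million speakers divided by a development index.

    `index` can be HDI, GDP (PPP) per capita, or any other positive measure;
    plain division is an assumption made for this sketch.
    """
    return editors_per_million_speakers(editors, speakers) / index
```

For a made-up community with 50 editors, 2,000,000 speakers and an index of 0.5, `editors_per_million_speakers(50, 2_000_000)` is `25.0` and `development_adjusted_editors(50, 2_000_000, 0.5)` is `50.0`. Swapping in the Gini coefficient or another index only changes the `index` argument, which makes it easy to produce all the candidate charts and compare them.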

And I will do that, after I finish the most frustrating part of the job: drawing the line between Wikipedia editions, Ethnologue data and actual languages. The good news is that I am on the ~150th of ~280 Wikipedia editions and it's likely I will finish during the next week. (After almost eight years of dealing with this matter, whenever someone says that there are two hundred eighty something Wikipedia languages or that there are 7000 languages in the world, I reach for my revolver.)

Languages mailing list Languages@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/languages

Amir E. Aharoni

14 Jun

6:17 p.m.

Dry article creation with little actual community interaction, like discussions, arguments and reverts, is problematic, but it does have one overlooked advantage, which I myself didn't quite realize until just a few months ago: a lot of texts that are known to correspond to each other (a.k.a. parallel texts) can be used by machine translation developers to create statistical MT engines. When an engine exists, it may make translation of more articles easier and faster.

Creating "enough articles to bootstrap MT" can be a goal for a content creation project. I'm not sure how many is enough - 10,000?..

And either I missed it, or nobody mentioned it yet, but ahem ahem ahem ContentTranslation. It is already helping Wikipedias in minorized languages to create a lot of meaningful articles more easily, and with future features like task lists and suggestions, it will be possible to use it for tracking success conveniently.

Milos Rancic

6:35 p.m.

On Sun, Jun 14, 2015 at 5:17 PM, Amir E. Aharoni amir.aharoni@mail.huji.ac.il wrote:

...

And either I missed it, or nobody mentioned it yet, but ahem ahem ahem ContentTranslation. It is already helping Wikipedias in minorized languages to create a lot of meaningful articles more easily, and with future features like task lists and suggestions, it will be possible to use it for tracking success conveniently.

Just a short note here... The complexity of the task, which I think I comprehend, is so significant, that I made the lamest mistake from my own perspective. Please note that the page Names of Wikimedia languages [1] assumes that there is only one variant of Serbian (although some languages have full four written varieties in Serbian: Немачка / Nemačka / Њемачка / Njemačka).

So, yes, ContentTranslation. (To be honest, one of my priorities should be to actually see how it works...) Besides the tools (and I think there are some other tools as well), there is a lot of documentation, which should be gathered into one user-friendly how-to.

Correlating Wikimedia project data with data about languages is not a simple task. For languages, we know which information we need, but we often don't have enough data; for Wikimedia, we have data, but we often don't know what to do with it. And the biggest danger in dealing with such sets is to both lack data and not know what to do with it.

While the lack of reliable data about languages can be worked around with the necessary approximations while we search for more relevant data, the question of what we should do with the data can be easily addressed by sharing ideas here. That's the main reason I am sharing work in progress here.

(Now back to linking languages: 208th Wikipedia edition by size, Karachay-Balkar...)

[1] https://meta.wikimedia.org/wiki/Names_of_Wikimedia_languages

Milos Rancic

6:38 p.m.

On Sun, Jun 14, 2015 at 5:35 PM, Milos Rancic millosh@gmail.com wrote:

...

Just a short note here... The complexity of the task, which I think I comprehend, is so significant, that I made the lamest mistake from my own perspective. Please note that the page Names of Wikimedia languages [1] assumes that there is only one variant of Serbian (although some languages have full four written varieties in Serbian: Немачка / Nemačka / Њемачка / Njemačka).

One more lame mistake: It's not about countries, but about languages. Thus: немачки, njemački, њемачки, njemački,

Milos Rancic

6:40 p.m.

On Sun, Jun 14, 2015 at 5:38 PM, Milos Rancic millosh@gmail.com wrote:

...

One more lame mistake: It's not about countries, but about languages. Thus: немачки, njemački, њемачки, njemački,

Khm... немачки, nemački, њемачки, njemački,

Gerard Meijssen

7:25 p.m.

Hoi, The objective of articles is that they are read. So when bot-created articles lead to more readers, I am perfectly happy, and so should we all be. Certainly, more can be achieved with the creators of bot articles, but as far as I can observe, it has never been a priority because it does not fit in with the fixed agendas and preconceived ideas.

The bottom line is in the number of readers and editors. We can learn from and optimise with the Swedes. <grin> why don't we? </grin> Thanks, GerardM
