Dry article creation with little actual community interaction (discussions, arguments, reverts) is problematic, but it does have one overlooked advantage, which I myself didn't quite realize until a few months ago: a large body of texts that are known to correspond to each other (a.k.a. parallel texts) can be used by machine translation developers to build statistical MT engines. Once an engine exists, it may make translating more articles easier and faster.
Creating "enough articles to bootstrap MT" can be a goal for a content creation project. I'm not sure how many is enough - 10,000?..
And either I missed it, or nobody has mentioned it yet, but ahem ahem ahem ContentTranslation. It is already helping Wikipedias in minorized languages to create a lot of meaningful articles more easily, and with future features like task lists and suggestions, it will be possible to use it for tracking success conveniently.

On 8 June 2015 at 01:23, "Milos Rancic" millosh@gmail.com wrote:
Once you have data, at some point you start thinking about quite fringe comparisons. But those can actually yield useful conclusions, as they did this time [1].
We did the following:
- Used the number of primary speakers from Ethnologue. (Erik Zachte is using an approximate number of primary + secondary speakers; that could be useful for correcting this data.)
- Categorized languages by the order of magnitude of their number of speakers: >=10k, >=100k, >=1M, >=10M, >=100M.
- Took the number of articles in each language's Wikipedia and computed the ratio (number of articles / number of speakers).
- The list consists only of languages with Ethnologue status 1 (national), 2 (provincial) or 3 (wider communication). In fact, we have a lot of projects (more than 100) with a worse language status; a number of them are actually threatened or even on the edge of extinction.
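The steps above could be sketched roughly as follows. This is only an illustration, not the spreadsheet's actual logic: the row layout (`name`, `speakers`, `articles`, `status`) and the function names are assumptions made for the example, and no real figures are included.

```python
# A minimal sketch of the procedure described above.
# The row fields and function names are hypothetical, chosen for illustration.

ELIGIBLE_STATUS = {1, 2, 3}  # Ethnologue: national, provincial, wider communication


def speaker_category(speakers):
    """Return the power-of-ten lower bound of the bucket, e.g. 250_000 -> 100_000."""
    if speakers < 1:
        raise ValueError("speakers must be at least 1")
    # Count digits instead of using log10 to avoid floating-point edge cases.
    return 10 ** (len(str(int(speakers))) - 1)


def articles_per_speaker(articles, speakers):
    """Ratio used for the comparison: number of articles / number of speakers."""
    return articles / speakers


def winners_per_category(rows):
    """Map each speaker category to the (name, ratio) with the highest ratio."""
    best = {}
    for row in rows:
        if row["status"] not in ELIGIBLE_STATUS:
            continue  # languages with a worse status are left out, as in the list
        category = speaker_category(row["speakers"])
        ratio = articles_per_speaker(row["articles"], row["speakers"])
        if category not in best or ratio > best[category][1]:
            best[category] = (row["name"], ratio)
    return best
```

Feeding it one row per Wikipedia would produce a per-category table of the same shape as the winners list; dropping the status filter would show how much the results change once threatened languages are included.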
These are preliminary results and I will definitely have to go through all the numbers. I manually fixed some serious errors, like English Wikipedia itself being missing from the data :D
Putting the languages into logarithmic categories proved useful, as we are now able to compare the Wikipedias according to their gross capacity (number of speakers). I suppose somebody well versed in statistics could even create a function to check how well a project is doing, regardless of those strict categories.
It's obvious that the more speakers a language has, the harder it is for the community to keep up the ratio.
So, the winners per category are:
- >=1k: Hawaiian, ratio 0.96900
- >=10k: Mirandese, ratio 0.18073
- >=100k: Basque, ratio 0.38061
- >=1M: Swedish, ratio 0.21381
- >=10M: Dutch, ratio 0.08305
- >=100M: English, ratio 0.01447

However, keep in mind that we removed languages outside status categories 1, 2 and 3. That affected the >=10k languages: for example, Upper Sorbian does much better than Mirandese (0.67). (I will fix this while creating the full report. Obviously, in this case the logarithmic categories of numbers of speakers matter much more than the status of the language.)
It's obvious that we could draw a line from 1:1 for 1-10k speakers to 10:1 for >=100M speakers. But, again, I would like to get input from somebody more competent.
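One way such a line could be drawn, purely as an illustration, is an ordinary least-squares fit in log-log space through the per-category winners listed above. This is an assumed technique, not something proposed in the message; using each category's lower bound as the x value is a crude simplification, and the function names are made up for the example.

```python
import math

# Category lower bound -> winning ratio, copied from the winners list above.
WINNERS = {
    1_000: 0.96900,
    10_000: 0.18073,
    100_000: 0.38061,
    1_000_000: 0.21381,
    10_000_000: 0.08305,
    100_000_000: 0.01447,
}


def fit_loglog(points):
    """Least-squares line log10(ratio) = intercept + slope * log10(speakers)."""
    xs = [math.log10(speakers) for speakers, _ in points]
    ys = [math.log10(ratio) for _, ratio in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_ratio(speakers, slope, intercept):
    """Ratio the fitted line predicts for a language with this many speakers."""
    return 10 ** (intercept + slope * math.log10(speakers))


def relative_score(speakers, articles, slope, intercept):
    """Actual ratio divided by the predicted one; above 1 means ahead of the line."""
    return (articles / speakers) / expected_ratio(speakers, slope, intercept)


slope, intercept = fit_loglog(list(WINNERS.items()))
```

With a fit like this, a project could be compared against the line directly from its own speaker and article counts, without reference to the strict categories. Six points are far too few for a serious model, so this only shows the shape of the idea.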
One very important category is missing here: the level of development of the speakers. That could be added: GDP (PPP) per capita for the country or countries where the language is spoken would be a useful measure. And I suppose somebody with statistical knowledge would be able to give us a number meaning "ability to create a Wikipedia article".
Completed in this way, the data would let us measure the success of particular Wikimedia groups and organizations. OK, articles per speaker is not the only way to do so; we could use other parameters as well: the number of new/active/very active editors etc. And we could track it over time.
I'll produce some more results. And as a reminder: I'd like to have a formula for "ability to create a Wikipedia article" and then use it to produce the "level of a particular community's success in creating Wikipedia articles". And, of course, to apply it to editors as well.
[1] https://docs.google.com/spreadsheets/d/1TYyhETevEJ5MhfRheRn-aGc4cs_6k45Gwk_i...
Languages mailing list Languages@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/languages