Hi Professor Kai,
Erik Zachte has done some interesting work on this: https://stats.wikimedia.org/wikimedia/animations/wivivi/wivivi.html maps Wikipedia readership to particular language versions of Wikipedia per country.
Some important caveats to this:
- Blocks of one language version may not apply to other language versions. Several countries, such as China, have blocked at least the language version of Wikipedia that they think their population might read.
- Language use and internet access can vary sharply by generation in particular countries. So the ratio of Russian to English readers of Wikipedia in Georgia may be heavily skewed by the fact that most over-50s have Russian as a second language, while younger people are more likely to understand English and more likely to have access to the internet.
- Language versions of Wikipedia vary enormously in size, and those figures don't show how many Georgians only go to the English Wikipedia when there is a gap in the Georgian one.
- Anecdotally, a lot of users of the English-language Wikipedia read it via Google Translate.
Hope that helps
WereSpielChequers
On Fri, 7 Jun 2024 at 21:21, Caroline Myrick cmyrick@wikimedia.org wrote:
Hi! +1 to the above, and your project sounds very interesting.
I don’t have much to add to the very helpful suggestions you’ve already received, but I did want to use this opportunity to mention a project I’m currently working on related to the state of languages across Wikimedia projects:
https://meta.wikimedia.org/wiki/Research:Incubator_and_language_representati...
Like you, we too want to incorporate external data that will allow us to look at regional and country-level language metrics for our projects, related to coverage and representation. Due to the similarities in our research, I would love to hear if you (or other interested folks in this thread) have any feedback about our project. Please feel free to post any questions, comments, or ideas on the project’s talk page.
Best,
Caroline Myrick
Sr Analyst, Research
Wikimedia Foundation
On Fri, Jun 7, 2024 at 6:10 AM Morten Wang nettrom@gmail.com wrote:
+1 to using Ethnologue as a source. That appears to also be what Miquel-Ribé and Laniado did for the Wikipedia Cultural Diversity Dataset: https://ojs.aaai.org/index.php/ICWSM/article/view/3260 (2019 ICWSM paper)
Another approach is to look at the geolocation of pageviews or edits. In our 2012 Ur-Wikipedia paper (https://dl.acm.org/doi/abs/10.1145/2462932.2462959), we used the proportion of edits from a given country to locate a Wikipedia to that country, provided there was a clear majority of them. This meant we couldn't decide where to locate English or Spanish because their distribution is spread across multiple countries, whereas for others it was much clearer.
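The majority rule described above could be sketched roughly as follows. This is only an illustration: the function name `locate_wikipedia`, the 0.5 threshold standing in for "a clear majority", and the edit counts are all made-up assumptions, not the exact procedure or data from the paper.

```python
from typing import Dict, Optional


def locate_wikipedia(edits_by_country: Dict[str, int],
                     threshold: float = 0.5) -> Optional[str]:
    """Return the country holding a clear majority of edits, else None.

    `threshold` is an assumed cut-off for "clear majority"; the paper's
    actual criterion may differ.
    """
    total = sum(edits_by_country.values())
    if total == 0:
        return None  # no edits, nothing to locate
    # Pick the country with the largest number of edits.
    country, count = max(edits_by_country.items(), key=lambda kv: kv[1])
    # Only assign it if its share of all edits exceeds the threshold.
    return country if count / total > threshold else None


# Hypothetical edit counts per country for two language editions.
edits = {
    "da": {"DK": 900, "SE": 50, "DE": 50},
    "es": {"ES": 350, "MX": 300, "AR": 200, "CO": 150},
}

located = {lang: locate_wikipedia(counts) for lang, counts in edits.items()}
print(located)
```

With these invented numbers the Danish edition is assigned to one country, while the Spanish edition is left unassigned because no single country exceeds the threshold, which mirrors the English and Spanish case mentioned above.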
Cheers, Morten
On Fri, 7 Jun 2024 at 10:11, Biyanto biyanto.rebin@gmail.com wrote:
Hi Kai,
You should start with Ethnologue's country data; this website provides the most comprehensive data. But be aware that the data may not be up to date, so compare it with Endangered Languages Project data https://endangeredlanguages.com/ and UNESCO's World Atlas of Languages https://en.wal.unesco.org/. In the case of my country, Indonesia, the power dynamics around the national language, Indonesian, and Indigenous (local) languages lead to language shift toward Indonesian or toward the major lingua franca of each region, such as Makassar Malay in greater South Sulawesi. It is also hard to calculate the current number exactly, since the latest official population census lacks awareness of language diversity as well.
Hope this helps.
Best, Biyanto
On Fri, Jun 7, 2024 at 7:30 AM Kai Zhu kaizhublcu@gmail.com wrote:
Dear all,
I am currently undertaking a research project that explores the choice of language when reading Wikipedia across different countries. One of the tasks of my study involves mapping Wikipedia languages to the countries where these languages are predominantly spoken. I recognize the complexity of this task and understand that a perfect mapping might not be possible. However, I would appreciate any recommendations on the best methodologies, practices, or data sources for accomplishing this.
Additionally, I have a related question: What are good data sources for information regarding the proportion of a country's population that speaks various languages?
Thank you for your help and insights.
Best regards,
Kai Zhu
Assistant Professor
Bocconi University
_______________________________________________
Wiki-research-l mailing list -- wiki-research-l@lists.wikimedia.org
To unsubscribe send an email to wiki-research-l-leave@lists.wikimedia.org