Hi! +1 to the above, and your project sounds very interesting.
I don’t have much to add to the very helpful suggestions you’ve already received, but I did want to use this opportunity to mention a project I’m currently working on related to the state of languages across Wikimedia projects: https://meta.wikimedia.org/wiki/Research:Incubator_and_language_representati...
Like you, we too want to incorporate external data that will allow us to look at regional and country-level language metrics for our projects, related to coverage and representation. Due to the similarities in our research, I would love to hear if you (or other interested folks in this thread) have any feedback about our project. Please feel free to post any questions, comments, or ideas on the project’s talk page.
Best,
Caroline Myrick
Sr Analyst, Research Wikimedia Foundation
On Fri, Jun 7, 2024 at 6:10 AM Morten Wang nettrom@gmail.com wrote:
+1 to using Ethnologue as a source. That appears to also be what Miquel-Ribé and Laniado did for the Wikipedia Cultural Diversity Dataset: https://ojs.aaai.org/index.php/ICWSM/article/view/3260 (2019 ICWSM paper)
Another approach is to look at the geolocation of pageviews or edits. In our 2012 Ur-Wikipedia paper (https://dl.acm.org/doi/abs/10.1145/2462932.2462959), we used the proportion of edits from a given country to assign a Wikipedia to that country, provided one country had a clear majority of them. This meant we couldn't decide where to locate English or Spanish because their edits are spread across multiple countries, whereas for others it was much clearer.
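As a rough illustration of the majority rule described above, here is a minimal Python sketch. It is not the code from the paper: the edit counts are invented, the 0.5 cutoff for a "clear majority" is an assumption, and the names EDIT_COUNTS and assign_country are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical edit counts: Wikipedia language code -> {country code: edits}.
# These numbers are invented for illustration, not real measurements.
EDIT_COUNTS: Dict[str, Dict[str, int]] = {
    "da": {"DK": 900, "SE": 40, "DE": 60},
    "en": {"US": 400, "GB": 200, "IN": 150, "CA": 100, "AU": 150},
}


def assign_country(counts: Dict[str, int], min_share: float = 0.5) -> Optional[str]:
    """Return the country with more than `min_share` of all edits, else None.

    The 0.5 default is an assumed cutoff for a "clear majority".
    """
    total = sum(counts.values())
    if total == 0:
        return None  # no edits, so nothing to assign
    country, top = max(counts.items(), key=lambda kv: kv[1])
    return country if top / total > min_share else None


assignments = {wiki: assign_country(c) for wiki, c in EDIT_COUNTS.items()}
print(assignments)  # {'da': 'DK', 'en': None}
```

With these made-up counts, the "da" edition is assigned to DK (90% of edits), while "en" stays unassigned because no single country exceeds the cutoff.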
Cheers, Morten
On Fri, 7 Jun 2024 at 10:11, Biyanto biyanto.rebin@gmail.com wrote:
Hi Kai,
You should start with Ethnologue's country data; this website provides the most comprehensive data. But be aware that the data may not be up to date, so compare it with Endangered Languages Project data https://endangeredlanguages.com/ and UNESCO's World Atlas of Languages https://en.wal.unesco.org/. In the case of my country, Indonesia, the power dynamics between the national language, Indonesian, and Indigenous (local) languages lead to language shift toward Indonesian or toward the major lingua franca in each region, such as Makassar Malay in greater South Sulawesi. It is also hard to calculate the current numbers exactly, since the latest official population census lacks awareness of language diversity as well.
Hope this helps.
Best, Biyanto
On Fri, Jun 7, 2024 at 7:30 AM Kai Zhu kaizhublcu@gmail.com wrote:
Dear all,
I am currently undertaking a research project that explores the choice of language when reading Wikipedia across different countries. One of the tasks of my study involves mapping Wikipedia languages to the countries where these languages are predominantly spoken. I recognize the complexity of this task and understand that a perfect mapping might not be possible. However, I would appreciate any recommendations on the best methodologies, practices, or data sources for accomplishing this.
Additionally, I have a related question: What are good data sources for information regarding the proportion of a country's population that speaks various languages?
Thank you for your help and insights.
Best regards,
Kai Zhu
Assistant Professor
Bocconi University
_______________________________________________
Wiki-research-l mailing list -- wiki-research-l@lists.wikimedia.org
To unsubscribe send an email to wiki-research-l-leave@lists.wikimedia.org