Dear all,
I am currently undertaking a research project that explores the choice of language when reading Wikipedia across different countries. One of the tasks of my study involves mapping Wikipedia languages to the countries where these languages are predominantly spoken. I recognize the complexity of this task and understand that a perfect mapping might not be possible. However, I would appreciate any recommendations on the best methodologies, practices, or data sources for accomplishing this.
Additionally, I have a related question: What are good data sources for information regarding the proportion of a country's population that speaks various languages?
Thank you for your help and insights.
Best regards, Kai Zhu Assistant Professor Bocconi University
Dear Kai,
Thanks for raising this.
I would also like to find out whether there are at least rough estimates of the distribution of language speakers across countries, as they would help me calibrate my model of the invisible tax of free knowledge.
I am aware that it must be very difficult to find such data or develop a proper approach, but I am curious to learn from people who are more knowledgeable in this area.
Best regards, Kiril Simeonovski
_______________________________________________
Wiki-research-l mailing list -- wiki-research-l@lists.wikimedia.org
To unsubscribe send an email to wiki-research-l-leave@lists.wikimedia.org
Hi Kai,
You should start with Ethnologue's country data; the website provides the most comprehensive data. Be aware, though, that the data may not be up to date, so compare it with the Endangered Languages Project data (https://endangeredlanguages.com/) and UNESCO's World Atlas of Languages (https://en.wal.unesco.org/). In the case of my country, Indonesia, the power dynamics between the national language, Indonesian, and Indigenous (local) languages lead to language shift toward Indonesian or toward the major lingua franca of each region, such as Makassar Malay in greater South Sulawesi. It is also hard to calculate the current numbers exactly, since the latest official population census lacks awareness of language diversity.
Hope this helps.
Best, Biyanto
+1 to using Ethnologue as a source. That appears to also be what Miquel-Ribé and Laniado did for the Wikipedia Cultural Diversity Dataset: https://ojs.aaai.org/index.php/ICWSM/article/view/3260 (2019 ICWSM paper)
Another approach is to look at the geolocation of pageviews or edits. In our 2012 Ur-Wikipedia paper (https://dl.acm.org/doi/abs/10.1145/2462932.2462959), we used the proportion of edits from a given country to assign a Wikipedia to that country, provided that country accounted for a clear majority of the edits. This meant we couldn't decide where to locate English or Spanish, because their edits are spread across multiple countries, whereas for other languages it was much clearer.
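As a rough illustration of that majority rule, here is a minimal sketch. It is not the code used in the paper: the edit counts are invented toy numbers, and the function name `assign_wikis_to_countries` and the 50% cutoff `min_share` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical toy counts of edits per (Wikipedia language code, country code).
# These numbers are invented for illustration only; they are not real data.
EDIT_COUNTS = {
    ("no", "NO"): 900, ("no", "SE"): 60, ("no", "US"): 40,
    ("es", "ES"): 350, ("es", "MX"): 300, ("es", "AR"): 200, ("es", "CO"): 150,
}


def assign_wikis_to_countries(edit_counts, min_share=0.5):
    """Map each wiki to the country with the largest share of its edits.

    A wiki is assigned only if that share is strictly greater than
    `min_share` (an assumed cutoff); otherwise it maps to None.
    """
    # Aggregate edit counts per wiki, then per country.
    per_wiki = defaultdict(dict)
    for (wiki, country), count in edit_counts.items():
        per_wiki[wiki][country] = per_wiki[wiki].get(country, 0) + count

    assignment = {}
    for wiki, by_country in per_wiki.items():
        total = sum(by_country.values())
        top_country, top_count = max(by_country.items(), key=lambda kv: kv[1])
        share = top_count / total if total else 0.0
        # Leave the wiki unassigned when no country has a clear majority.
        assignment[wiki] = top_country if share > min_share else None
    return assignment


if __name__ == "__main__":
    print(assign_wikis_to_countries(EDIT_COUNTS))
```

With these toy numbers, the wiki whose edits come mostly from one country is assigned to it, while the wiki whose edits are spread over several countries stays unassigned, which mirrors the English and Spanish cases.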
Cheers, Morten
Hi! +1 to the above, and your project sounds very interesting.
I don’t have much to add to the very helpful suggestions you’ve already received, but I did want to use this opportunity to mention a project I’m currently working on related to the state of languages across Wikimedia projects: https://meta.wikimedia.org/wiki/Research:Incubator_and_language_representati...
Like you, we too want to incorporate external data that will allow us to look at regional and country-level language metrics for our projects, related to coverage and representation. Due to the similarities in our research, I would love to hear if you (or other interested folks in this thread) have any feedback about our project. Please feel free to post any questions, comments, or ideas on the project’s talk page.
Best,
Caroline Myrick
Sr Analyst, Research Wikimedia Foundation
Hi Professor Kai,
Erik Zachte has done some interesting work on this: https://stats.wikimedia.org/wikimedia/animations/wivivi/wivivi.html maps Wikipedia readership to particular language versions of Wikipedia per country.
Some important caveats to this:
- Blocks of one language version may not apply to other language versions: several countries, such as China, have blocked at least the language version of Wikipedia that they think their population might read.
- Language use and internet access can vary sharply by generation in particular countries. So the ratio of Russian to English readers of Wikipedia in Georgia may be heavily skewed by the fact that most over-50s have Russian as a second language, while younger people are more likely to understand English and more likely to have access to the internet.
- Language versions of Wikipedia vary enormously in size, and those figures don't show how many Georgians only go to the English Wikipedia if there is a gap in the Georgian one.
- Anecdotally, a lot of users of the English-language Wikipedia read it via Google Translate.
Hope that helps
WereSpielChequers
Hi, I am Nkem Osuigwe, a Nigerian.
You may want to speak with Olusola Olaniyan, the President of the Wikimedia User Group. He is working on a project on the preservation of indigenous languages, which is important in Nigeria, where we have close to 500 spoken languages, quite a number of which are on Wikipedia. I have copied Olusola on this email.
Thank you.
Nkem E. Osuigwe PhD FNLA CLN Human Capacity Development & Training Director, African Library and Information Associations & Institutions(AfLIA) P.O.Box BC 38, Burma Camp, Accra, Ghana. neosuigwe@aflia.net drnkemosuigwe@gmail.com nkemekene@ymail.com Website: www.aflia.net Facebook /Twitter /Instagram
"If I had asked people what they wanted, they would have said faster horses." - Henry Ford. "Knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand" - Albert Einstein