Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook, on behalf of Internet.org, to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content. We define this as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historic data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to the ISO 639-3 language codes used by Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
Many Thanks, Rawia
Formerly Booz & Company
Rawia Abdel Samad Direct: +9611985655 | Mobile: +97455153807 Email: Rawia.AbdelSamad@strategyand.pwc.com www.strategyand.com
Hi Rawia,
In response to your first two questions, these links might help you:
* https://stats.wikimedia.org/EN/Sitemap.htm
* https://stats.wikimedia.org/EN/TablesArticlesTotal.htm
* https://stats.wikimedia.org/EN/PlotsPngArticlesTotal.htm
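If you want to pull the historic table behind the second link into something you can analyse, a rough sketch in Python might look like the following (purely illustrative, not official tooling; it assumes the page still serves plain HTML tables and that the first table is the per-language one):

    # Rough sketch: load the historic per-language article-count table from the
    # old Wikistats page linked above. Requires pandas plus lxml or html5lib.
    import pandas as pd

    URL = "https://stats.wikimedia.org/EN/TablesArticlesTotal.htm"

    tables = pd.read_html(URL)   # one DataFrame per HTML table on the page
    counts = tables[0]           # assumption: the first table is the one you want
    print(counts.head())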
Also, yes, the article count includes stubs. It probably does not include pages in the Draft or Articles for Creation processes, or deleted articles; it would be good if someone could confirm this.
I'm not sure there have been major step changes in the number of articles on the major-language Wikipedias. Again, see https://stats.wikimedia.org/EN/PlotsPngArticlesTotal.htm. (There has been a great deal of analysis and theorizing about the changes in *active editor* statistics over time.) However, there may be step changes in article count on smaller Wikipedias; someone from Analytics, the Small Wiki Monitoring Team, or the Incubator project might be able to help with that question.
Thank you for your interest,
Pine (writing in an unofficial personal capacity only)
*This is an Encyclopedia* https://www.wikipedia.org/
*One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*
*—Catherine Munro*
Hi Rawia,
The metawiki page you link to counts everything defined as a "content page" by that wiki. I believe the definition is that it has to be in the main namespace (so an article, not a discussion page, image file, etc.) and that it has to have at least one valid internal link (so it can't just be a wall of unformatted text). This will also exclude drafts, as Pine notes. Stubs *are* included in all definitions (which is a good thing, because our stub tracking is abysmal).
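As an aside (illustrative only, not official guidance), the same per-wiki figure is exposed by the MediaWiki API, so you can fetch it directly instead of scraping the Meta page. A minimal Python sketch, assuming the requests package:

    # Ask a wiki for its own article count via the MediaWiki API; the
    # "articles" field is the content-page count shown on Special:Statistics.
    import requests

    def article_count(lang):
        r = requests.get(
            "https://%s.wikipedia.org/w/api.php" % lang,
            params={"action": "query", "meta": "siteinfo",
                    "siprop": "statistics", "format": "json"},
            headers={"User-Agent": "connectivity-study-example/0.1"},  # placeholder UA
            timeout=30,
        )
        return r.json()["query"]["statistics"]["articles"]

    print(article_count("en"))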
If you use http://stats.wikimedia.org/EN/Sitemap.htm you can get slightly outdated article counts (but with a long historic tail). This uses a definition of "article count" which is a little more generous, and counts all pages in the main namespace. It is probably a better one for your purposes as it's less liable to change.
There are currently 51 projects above the 100k threshold according to wikistats; this includes Simple English, Latin, Volapük and Esperanto, which you may not want to count! Some very small languages with large article counts may have a very high proportion of auto-generated articles - there's been some research done on this but I can't immediately put my finger on it. See, e.g., this discussion: https://lists.wikimedia.org/pipermail/analytics/2015-January/003214.html
As for language codes, I believe any two-letter code is a valid ISO 639-1 code, and (almost?) all three-letter codes are valid ISO 639-2. There are about a dozen others which will need to be mapped by hand. Note that Norwegian appears twice (nn, no).
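For what it's worth, a hand-rolled sketch of that mapping might look like this in Python (illustrative only; the override entries are examples to verify against Meta's documentation, and pycountry is just one convenient source of the ISO tables):

    # Map a Wikipedia language code to ISO 639-3. Two-letter codes are treated
    # as ISO 639-1 and looked up via pycountry; Wikipedia-specific codes go in
    # a hand-maintained override table (example entries only, not complete).
    import pycountry

    OVERRIDES = {
        "simple": "eng",    # Simple English Wikipedia
        "bat-smg": "sgs",   # Samogitian - verify against Meta's list
    }

    def wiki_to_iso639_3(code):
        if code in OVERRIDES:
            return OVERRIDES[code]
        if len(code) == 2:
            lang = pycountry.languages.get(alpha_2=code)   # ISO 639-1 lookup
        else:
            lang = pycountry.languages.get(alpha_3=code)   # already three letters
        return lang.alpha_3 if lang else None

Remember that 'no' and 'nn' will both resolve to Norwegian variants, so you'll want to decide explicitly how to treat those two.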
Andrew.
If you just need ballpark numbers, the proposed approach might work. If you want to produce something concretely usable, it's going to be much more complex: https://meta.wikimedia.org/wiki/Research:Measuring_mission_success
In particular, 100k is a ridiculous number, and restricting yourself to Wikipedia means that for many languages you'll lose the most important content people are looking for, e.g. on Wiktionary and Wikisource (dictionaries, original literature and official documents in that language).
Abdel Samad, Rawia, 21/01/2015 09:47:
·We are currently using the article count by language based on Wikimedia’s foundation public link: Source: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article count – does it include stubs?
0) You'd better use its source, http://wikistats.wmflabs.org/,
1) which is as reliable as Special:Statistics is, i.e. not so much;
2) and uses the official https://www.mediawiki.org/wiki/Manual:Article_count definition, as stats.wikimedia.org (now) does,
3) calling "good" and "stub" what is now called "countable" and "non-countable".
·Is it possible to get historic data for article count. It would be great to monitor the evolution of the metric we have defined over time?
stats.wikimedia.org has such data.
·What are the biggest drivers you’ve seen for step change in the number of articles (e.g., number of active admins, machine translation, etc.)
Bot imports, clearly. The number of articles is an extremely poor metric for measuring "coverage".
·We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (source we are using for primary language data). The 2 language code for a wikipedia language in the “List of Wikipedias” sometimes matches but not always the ISO 639-1 code. Is there an easy way to do the mapping?
We try to document them at https://meta.wikimedia.org/wiki/Special_language_codes
Andrew Gray, 21/01/2015 10:18:
This uses a definition of "article count" which is a little more generous, and counts all pages in the main namespace.
It doesn't. https://www.mediawiki.org/wiki/Special:Search/Analytics/Metrics_definitions
Nemo
On 21 January 2015 at 10:20, Federico Leva (Nemo) nemowiki@gmail.com wrote:
If you just need ballpark numbers, the proposed approach might work. If you want to produce something concretely usable, it's going to be much more complex: https://meta.wikimedia.org/wiki/Research:Measuring_mission_success
In particular, 100k is a ridiculous number and restricting yourself to Wikipedia means for many languages you'll lose the most important content people are looking for, e.g. on Wiktionary and Wikisource (dictionaries, original literature and official documents in that language).
I'm not sure I'd go for "most important", but yes, I agree - an aggregated total of xx.wikisource, xx.wikipedia, xx.wiktionary, xx.wikibooks might well be useful.
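If you do go for an aggregate like that, a rough sketch (purely illustrative, using the public MediaWiki siteinfo API; not every language has every project, so missing wikis are simply skipped) might be:

    # Illustrative only: sum the content-page counts for one language code
    # across several sister projects via the MediaWiki API.
    import requests

    PROJECTS = ("wikipedia", "wiktionary", "wikisource", "wikibooks")

    def combined_count(lang):
        total = 0
        for project in PROJECTS:
            try:
                r = requests.get(
                    "https://%s.%s.org/w/api.php" % (lang, project),
                    params={"action": "query", "meta": "siteinfo",
                            "siprop": "statistics", "format": "json"},
                    timeout=30,
                )
                total += r.json()["query"]["statistics"]["articles"]
            except Exception:
                pass   # wiki doesn't exist for this language, or the request failed
        return total

    print(combined_count("sw"))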
Abdel Samad, Rawia, 21/01/2015 09:47:
·We are currently using the article count by language based on Wikimedia’s foundation public link: Source: http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article count – does it include stubs?
- You'd better use its source, http://wikistats.wmflabs.org/,
- which is as reliable as Special:Statistics is, i.e. not so much;
- and uses the official https://www.mediawiki.org/wiki/Manual:Article_count definition, as stats.wikimedia.org (now) does,
- calling "good" and "stub" what is now called "countable" and "non-countable".
...agh, so wikistats.wmflabs.org uses a completely different definition of 'stub' to the one we use on the wikis? One more source of confusion :-)
·What are the biggest drivers you’ve seen for step change in the number of articles (e.g., number of active admins, machine translation, etc.)
Bot imports, clearly. The number of articles is an extremely poor metric for measuring "coverage".
Agreed.
Andrew Gray, 21/01/2015 10:18:
This uses a definition of "article count" which is a little more generous, and counts all pages in the main namespace.
It doesn't. https://www.mediawiki.org/wiki/Special:Search/Analytics/Metrics_definitions
You're quite right - I'd misremembered and then been lulled into a false sense of confirmation by Wikistats being larger :-)
Is it fair to say that:
a) both Wikistats and the on-wiki Special:Statistics use the same article-count measure (ns0, one outbound link/category, not a redirect);
b) for various reasons these two sources for the data don't always line up;
c) but all told, it's as good a measure as we have?
Andrew.
Abdel Samad, Rawia, 21/01/2015 09:47:
I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study on assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using a the availability of encyclopedic knowledge in one’s primary language as a proxy for relevant content. We define this as 100K+ Wikipedia articles in one’s primary language.
Hello Rawia, is there any update on this project? Have you contacted Google about similar "content availability" and "content ingestion" activities they conducted in the past, also related to machine translation (https://meta.wikimedia.org/wiki/Machine_translation )?
We are very interested in this sort of initiative (see also https://lists.wikimedia.org/pipermail/wiki-research-l/2015-March/004297.html ), but experience has taught us that looking at the wrong things can have terrible consequences.
Nemo
On Tue, Mar 17, 2015 at 2:09 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Hello Rawia, is there any update on this project? Have you contacted Google about similar "content availability" and "content ingestion" activities they conducted in the past, also related to machine translation ( https://meta.wikimedia.org/wiki/Machine_translation )?
I believe this resulted in:
https://fbnewsroomus.files.wordpress.com/2015/02/state-of-connectivity_3.pdf
Some interesting analysis; see pp. 32-34 for the bits on us.
Luis