Re: [Wiki-research-l] Wikipedia traffic: selected language versions

19 May 2014

Hello Oliver,
   Let me use Cantonese (yue) and Hakka (hak) as examples to illustrate
some possibilities, looking only at the population data points.
   Have a look at the Ethnologue data at
http://www.ethnologue.com/language/yue and
http://www.ethnologue.com/language/hak
   Note that you should see the population number in China and also in
other places in the world (under the "Also Spoken In" section). There are
also other data points such as "status" and "writing".
   Then one can look up the CLDR's Language-Territory or Territory-Language
information; the entries for Cantonese and Hakka do not exist yet.
   Note also that both Cantonese and Hakka have their own language versions
of Wikipedia (zh-yue and hak). Because the codes and names differ between
sources, a mapping table is needed here for data integration.
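
   A minimal sketch of such a mapping table, in Python. It covers only the
two examples above; the names WIKI_TO_ISO639_3 and ethnologue_url are
hypothetical, and the URL pattern is simply the one used in the two
Ethnologue links above.

```python
# Wikipedia language-version code -> ISO 639-3 code used by Ethnologue.
# Only the two examples discussed in this message are listed.
WIKI_TO_ISO639_3 = {
    "zh-yue": "yue",  # Cantonese
    "hak": "hak",     # Hakka
}


def ethnologue_url(wiki_code):
    """Build the Ethnologue page URL for a Wikipedia language version."""
    iso_code = WIKI_TO_ISO639_3[wiki_code]
    return "http://www.ethnologue.com/language/" + iso_code


if __name__ == "__main__":
    for code in WIKI_TO_ISO639_3:
        print(code, "->", ethnologue_url(code))
```

   A real table would need more columns (for example the CLDR locale code
and the language names), but the lookup idea stays the same.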
   Now, as tertiary sources that integrate other data points,
Wikipedia/Wikidata can take the data points from Ethnologue to enrich their
content.
   These data points would be an important baseline for almost any
human-language Wikipedia project to identify its potential editors.
   The current active editors of small and medium-sized language Wikipedia
projects should be interested in getting hold of such data. Also, they may
know of more up-to-date and reliable data ahead of Ethnologue.
   For traffic data reports, a Cantonese Wikipedian can then normalize the
viewing and editing traffic data against the population data, thereby
identifying the "per speaker capita" number for the viewing/editing
traffic.
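
   As a rough illustration of that normalization, here is a minimal Python
sketch. The function name per_speaker_rate is hypothetical, and the figures
are made-up placeholders, not Ethnologue or WikiStats numbers.

```python
def per_speaker_rate(traffic_count, speaker_population, per=1000):
    """Return traffic per `per` speakers, or None if population is unknown."""
    if not speaker_population:
        # Missing or zero population: the rate is undefined (a data gap).
        return None
    return traffic_count * per / speaker_population


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    monthly_views = 2_000_000
    speakers = 50_000_000
    print(per_speaker_rate(monthly_views, speakers))  # views per 1,000 speakers
```

   The same function works for editing traffic; only the numerator changes.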
   I have done some normalization work (or geolinguistic normalization) for
languages such as Spanish and Arabic, for which the CLDR's
Language-Territory or Territory-Language information data are available.
The surprising results are that for Spanish, per capita editing traffic is
highest in Germany, Paraguay, Uruguay and Spain; per capita viewing traffic
is highest in Paraguay, Spain, Chile, etc. For Arabic, per capita editing
traffic is highest in Kuwait, Bahrain, Saudi Arabia, Qatar, Israel, UAE,
etc.; per capita viewing traffic is highest in Israel, Kuwait, Saudi
Arabia, etc.
   I personally believe such data curation, when supported by the better
(and expected-to-improve) geolinguistic population data points now
available in Ethnologue and in other sources that Wikipedians of different
languages may know, would be useful to Wikipedians first.
   In short, I did not intend to ask Wikipedians or the Wikimedia research
staff to do extra "original research". My suggestions aim to parse the
traffic data one level down from either language or territory to the more
specific language-territory aggregate so as to better inform development
strategies and academic research on Wikipedia.
   Overall, I think it is viable to construct a data process to show what
needs to be done and what can be achieved. The showing-by-doing approach can
show some results first, with infographics, for language versions that are
more data-ready (e.g. Arabic and Spanish). Then other language versions can
strive to fill the now *identified* data gaps by contributing data points
through Wikipedia and Wikidata projects. What is needed then is a database
and expert pool of territory-language and language-territory information
across Wikipedia projects. It can be as simple and straightforward as
having a Wikidata object of geo-linguistic population for any
territory-language combination, potentially with existing translations
made possible by Wikidata. Then the traffic/viewing data reports can be (1)
localized/translated into different languages automatically and (2)
geo-linguistically normalized to show the current outreach of a language
Wikipedia per language speaker.
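
   To make the idea concrete, here is a minimal Python sketch of such a
record and of a pass that both normalizes traffic and lists the data gaps.
Everything in it is an assumption for illustration: the LangTerritory
class, its field names, the normalize function, and the placeholder codes
and numbers. It is not an existing Wikidata or WikiStats structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangTerritory:
    language: str            # language code (placeholder values below)
    territory: str           # territory code (placeholder values below)
    speakers: Optional[int]  # geo-linguistic population; None = data gap
    edits: int               # editing traffic for this language-territory


def normalize(records, per=1_000_000):
    """Return (rates, gaps): edits per `per` speakers, and pairs lacking data."""
    rates, gaps = {}, []
    for rec in records:
        key = (rec.language, rec.territory)
        if not rec.speakers:
            gaps.append(key)  # population unknown: flag it for contributors
        else:
            rates[key] = rec.edits * per / rec.speakers
    return rates, gaps


if __name__ == "__main__":
    # Made-up placeholder records, not real measurements.
    records = [
        LangTerritory("xx", "AA", 2_000_000, 500),
        LangTerritory("xx", "BB", None, 120),
    ]
    rates, gaps = normalize(records)
    print("edits per million speakers:", rates)
    print("missing population data:", gaps)
```

   The list of gaps is the part other language communities could act on,
by contributing the missing population figures.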
   The above are only my current rough and initial thoughts. Please let me
know if the ideas or expressions are not clear enough.
Best,
han-teng liao

2014-05-19 7:35 GMT+08:00 Oliver Keyes <okeyes(a)wikimedia.org>:

...
  Could you give an example of what we could do better than CLDR or the
 relevant ISO standards?

 On 18 May 2014 10:06, h <hanteng(a)gmail.com> wrote:

  Dear Nemo,

     As I am waiting for a more complete response, I am not sure that I
 understand what your last "No", as in "No, we definitely can't", means. To
 clarify, take the CLDR supplemental Language-Territory information, for
 example:

http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_i…

     One can suggest the addition of a data point by submitting sourced
 numbers for a geo-linguistic population like this:

http://unicode.org/cldr/trac/newticket?&description=%3Cterritory%2c%20s…

     In Wikipedia articles and Wikidata pages, there are many attempts to
 provide more up-to-date and better-sourced data points. I see the potential
 in exchanging such data and curating them better in Wikidata projects as a
 more detailed and dynamic source than the CLDR.

     These data points will have extra benefits in curating traffic data.
 For one, these geo-linguistic population data points would be useful to
 normalize traffic data for further analysis, such as geographic
 normalization.  For another, they provide important reference data for the
 development strategies and policies of the Wikipedia projects.

 Best,
 han-teng liao

 2014-05-18 16:23 GMT+08:00 Federico Leva (Nemo) <nemowiki(a)gmail.com>:

 Thanks for your suggestions. Just some quick pointers below.

 h, 18/05/2014 08:26:

  (I-A). Tabulate the data points in absolute numbers first, not
 percentage numbers [...]

 (I-B). Include all language versions for the *editing traffic* report as
 well. [...]

 (I-C). Provide static data objects in more accessible format (i.e. csv
 and/or json). [...]

 (II-A).  Putting viewing traffic and editing traffic report on the same
 page. [...]

 (II-B).  Organizing and archiving the traffic reports for historical
 comparison. [...]

 (I-C). Provide dynamic data objects in more accessible format (i.e. csv
 and/or json).

 At least the first four are "just" changes in the WikiStats reports
 formatting; personally I encourage you to submit patches:
 <https://git.wikimedia.org/summary/analytics%2Fwikistats.git> (should be
 the "squids" directory, but there is some ongoing refactoring of the repos).

 On archives and "history rewriting"/reports regeneration, see also
 https://bugzilla.wikimedia.org/show_bug.cgi?id=46198

  [...] (III-B).  Smaller (i.e more specific) geographic aggregate units.

 The country (geographic) information is often based on geo-IP databases,
 and sometimes provincial and city-level data would be available.

 http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html

  [...]

 ( I know that the Unicode Common Locale Data Repository (CLDR Version 25
 <http://cldr.unicode.org/index/downloads/cldr-25>) provides
 “language-territory”
 <http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html>
 or “territory-language”
 <http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html>
 unit-based charts, but I believe that the Wikimedia projects can use and
 build one better.)  [...]

 No, we definitely can't, not alone. I've asked for help, please
 contribute:
 <https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand>.

 Nemo

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

 --
 Oliver Keyes
 Research Analyst
 Wikimedia Foundation
