Hello Oliver,

Let me use Cantonese (yue) and Hakka (hak) as examples to illustrate some possibilities, looking only at the population data points.

Have a look at the Ethnologue data http://www.ethnologue.com/language/yue and http://www.ethnologue.com/language/hak

Note that you should see the population figures for China and for other places in the world (under the "Also Spoken In" section). There are also other data points, such as "status" and "writing".

One can then look up the CLDR's Language-Territory or Territory-Language information; the entries for Cantonese and Hakka do not exist yet.

Note also that both Cantonese and Hakka have their own language versions of Wikipedia (zh-yue and hak). Because the codes and names differ between sources, a mapping table is needed here for data integration.
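
A minimal sketch of such a table, in Python, might look like the following. The names `WIKI_TO_ISO639_3` and `ethnologue_url` are hypothetical, only the two languages mentioned above are listed, and the URL pattern is simply the one used in the Ethnologue links above.

```python
# Hypothetical lookup table: Wikipedia language-edition code -> ISO 639-3 code.
# Only the two languages discussed in this message are included.
WIKI_TO_ISO639_3 = {
    "zh-yue": "yue",  # Cantonese Wikipedia
    "hak": "hak",     # Hakka Wikipedia
}


def ethnologue_url(wiki_code):
    """Return the Ethnologue page for a Wikipedia edition, or None if unmapped."""
    iso_code = WIKI_TO_ISO639_3.get(wiki_code)
    if iso_code is None:
        return None
    # URL pattern assumed from the two Ethnologue links cited above.
    return "http://www.ethnologue.com/language/" + iso_code
```

Returning `None` for an unmapped code keeps missing mappings visible instead of silently guessing that the two code systems agree.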

Now, as tertiary sources that integrate other data points, Wikipedia and Wikidata can draw on the data points from Ethnologue to enrich their content.

These data points would be an important baseline for almost any human-language Wikipedia project to identify its potential editors.

The currently active editors of small and medium-sized language Wikipedia projects should be interested in getting hold of such data. They may also know of more up-to-date and reliable data before Ethnologue does.

For traffic data reports, a Cantonese Wikipedian can then normalize the viewing and editing traffic data against the population data, thereby identifying the "per speaker capita" number for the viewing/editing traffic.
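
To make the normalization step concrete, here is a rough sketch in Python. The function names, the territory codes "AA" and "BB", and all of the figures are placeholders invented for illustration; they are not real traffic or population data.

```python
def per_speaker_rate(traffic_count, speaker_population, per=1000):
    """Traffic events per `per` speakers; None when the population is unknown."""
    if not speaker_population:
        return None
    return traffic_count * per / speaker_population


def normalize(traffic_by_territory, speakers_by_territory, per=1000):
    """Normalize each territory's traffic by its speaker population."""
    return {
        territory: per_speaker_rate(count, speakers_by_territory.get(territory), per)
        for territory, count in traffic_by_territory.items()
    }


# Placeholder figures for one language edition (not real data).
example_views = {"AA": 5000, "BB": 2000}
example_speakers = {"AA": 1_000_000, "BB": 100_000}
example_rates = normalize(example_views, example_speakers)
```

With these placeholder figures, territory "BB" has fewer views in absolute terms but a higher rate per 1,000 speakers than "AA", which is the kind of reversal that absolute counts alone would hide.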

I have done some normalization work (or geolinguistic normalization) for languages such as Spanish and Arabic, for which the CLDR's Language-Territory or Territory-Language information is available. The surprising results are that, for Spanish, per capita editing traffic is highest in Germany, Paraguay, Uruguay and Spain, and per capita viewing traffic is highest in Paraguay, Spain, Chile, etc. For Arabic, per capita editing traffic is highest in Kuwait, Bahrain, Saudi Arabia, Qatar, Israel, UAE, etc., and per capita viewing traffic is highest in Israel, Kuwait, Saudi Arabia, etc.

I personally believe that such data curation would be useful to Wikipedians first, especially when supported by the better (and expected-to-improve) geolinguistic population data points now available in Ethnologue and in other sources that Wikipedians of different languages may know.

In short, I did not intend to ask Wikipedians or the Wikimedia research staff to do extra "original research". My suggestions aim to parse the traffic data one level down, from either language or territory to the more specific language-territory aggregate, so as to better inform development strategies and academic research on Wikipedia.

Overall, I think it is viable to construct a data process to show what needs to be done and what can be achieved. The showing-by-doing approach can show some results first, with infographics for language versions that are more data-ready (e.g. Arabic and Spanish). Other language versions can then strive to fill the now *identified* data gaps by contributing data points through Wikipedia and Wikidata projects. What is needed then is a database and an expert pool of territory-language and language-territory information across Wikipedia projects. It can be as simple and straightforward as having a Wikidata object of geo-linguistic population for each territory-language combination, potentially with existing translations made possible by Wikidata. The traffic/viewing data reports can then be (1) localized/translated into different languages automatically and (2) geo-linguistically normalized to show the current outreach of a language Wikipedia per language speaker.
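
As a rough illustration of such a data object and of how data gaps could be identified, consider the Python sketch below. The class `GeoLinguisticPopulation`, its fields, and the function `find_data_gaps` are hypothetical; this is not an existing Wikidata schema, and the codes in the example are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GeoLinguisticPopulation:
    """One hypothetical record per language-territory combination."""
    language: str            # language code (placeholder values below)
    territory: str           # territory code (placeholder values below)
    speakers: Optional[int]  # None when no sourced figure exists yet
    source: str              # where the figure came from


def find_data_gaps(traffic_pairs, records):
    """Return (language, territory) pairs that have traffic but no speaker figure."""
    known = {
        (r.language, r.territory) for r in records if r.speakers is not None
    }
    return sorted(set(traffic_pairs) - known)


# Placeholder records and traffic pairs (not real data).
example_records = [
    GeoLinguisticPopulation("xx", "AA", 1000, "placeholder source"),
    GeoLinguisticPopulation("xx", "BB", None, "placeholder source"),
]
example_gaps = find_data_gaps(
    [("xx", "AA"), ("xx", "BB"), ("yy", "AA")], example_records
)
```

Here the gaps are ("xx", "BB"), which has a record but no figure, and ("yy", "AA"), which has no record at all; these are the combinations that contributors would be asked to fill in.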

The above are only my current rough and initial thoughts. Please let me know if the ideas or expressions are not clear enough.

Best,

han-teng liao

2014-05-19 7:35 GMT+08:00 Oliver Keyes <okeyes@wikimedia.org>:

Could you give an example of what we could do better than CLDR or the relevant ISO standards?

On 18 May 2014 10:06, h <hanteng@gmail.com> wrote:

Dear Nemo,

While I wait for a more complete response, I am not sure I understand what your last "No" (as in "No, we definitely can't") means. To clarify, take the CLDR supplemental Language-Territory information as an example:

http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

One can suggest adding a data point by submitting sourced numbers for a geo-linguistic population, like this: http://unicode.org/cldr/trac/newticket?&description=%3Cterritory%2c%20speaker%20population%20in%20territory%2c%20and%20references%3E&summary=Add%20territory%20to%20Traditional%20Chinese%20(zh_Hant)

In Wikipedia articles and Wikidata pages, there are many attempts to provide more up-to-date and better-sourced data points. I see potential in exchanging such data and curating them better in Wikidata projects, as a more detailed and dynamic source than the CLDR.

These data points will have extra benefits in curating traffic data. For one, these geo-linguistic population data points would be useful to normalize traffic data for further analysis, such as geographic normalization. For another, they provide important reference data for the development strategies and policies of the Wikipedia projects.

Best,
han-teng liao

2014-05-18 16:23 GMT+08:00 Federico Leva (Nemo) <nemowiki@gmail.com>:

Thanks for your suggestions. Just some quick pointers below.

h, 18/05/2014 08:26:

(I-A). Tabulate the data points in absolute numbers first, not
percentage numbers [...]

(I-B). Include all language versions for the *editing traffic* report as
well. [...]

(I-C). Provide static data objects in more accessible format (i.e. csv
and/or json). [...]

(II-A). Putting viewing traffic and editing traffic report on the same
page. [...]

(II-B). Organizing and archiving the traffic reports for historical
comparison. [...]

(II-C). Provide dynamic data objects in more accessible format (i.e. csv
and/or json).

At least the first four are "just" changes in the WikiStats reports formatting; personally, I encourage you to submit patches: <https://git.wikimedia.org/summary/analytics%2Fwikistats.git> (should be the "squids" directory, but there is some ongoing refactoring of the repos).

On archives and "history rewriting"/reports regeneration, see also https://bugzilla.wikimedia.org/show_bug.cgi?id=46198

[...] (III-B). Smaller (i.e more specific) geographic aggregate units.

The country (geographic) information is often based on geo-IP databases,
and sometimes provincial and city-level data would be available.

http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html

[...]

(I know that the Unicode Common Locale Data Repository (CLDR Version 25
<http://cldr.unicode.org/index/downloads/cldr-25>) provides
“language-territory”
<http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html>
or “territory-language”
<http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html>
unit-based charts, but I believe that the Wikimedia projects can use and
build one better.) [...]

No, we definitely can't, not alone. I've asked for help, please contribute: <https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand>.

Nemo

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


--
Oliver Keyes
Research Analyst
Wikimedia Foundation
