Hello Oliver,
Let me use Cantonese (yue) and Hakka (hak) as examples to illustrate
some possibilities, looking only at the population data points.
Have a look at the Ethnologue data:
http://www.ethnologue.com/language/yue and
http://www.ethnologue.com/language/hak
Note that you should see the population number for China and also for other
places in the world (under the "Also Spoken In" section). There are also
other data points such as "status" and "writing".
Then one can look up the CLDR's Language-Territory or Territory-Language
information; the entries for Cantonese and Hakka do not exist yet.
Note also that both Cantonese and Hakka have their own language versions
of Wikipedia (zh-yue and hak). The coding and naming need a mapping table
here for data integration.
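A minimal sketch of such a mapping table, in Python. It uses only the two
codes mentioned above; the field names and the helper wiki_code_for are
hypothetical and chosen purely for illustration.

```python
# Hypothetical mapping table: ISO 639-3 code -> Wikipedia language edition code.
# Only the two languages discussed above are listed; field names are illustrative.
CODE_TABLE = {
    "yue": {"name": "Cantonese", "wikipedia": "zh-yue"},
    "hak": {"name": "Hakka", "wikipedia": "hak"},
}


def wiki_code_for(iso_code):
    """Return the Wikipedia edition code for an ISO 639-3 code, or None if unmapped."""
    entry = CODE_TABLE.get(iso_code)
    return entry["wikipedia"] if entry else None
```

An unmapped code returns None rather than raising, so a caller can list
which languages still need a row in the table.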
Now, as tertiary sources that integrate other data points,
Wikipedia/Wikidata can get the data points from Ethnologue to enrich their
content.
These data points would be an important baseline for almost any human
language-based Wikipedia project to identify its potential editors.
The current active editors of small and medium-sized language Wikipedia
projects should be interested in getting hold of such data. Also, they may
know of more up-to-date and reliable data ahead of Ethnologue.
For traffic data reports, a Cantonese Wikipedian can then normalize the
viewing and editing traffic data against the population data, thereby
identifying the "per speaker capita" number for the viewing/editing
traffic.
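The normalization described above can be sketched as follows. This is only
an illustration under stated assumptions: the function name per_speaker, the
(language, territory) keys, and all figures are hypothetical placeholders,
not real Ethnologue or traffic data.

```python
def per_speaker(traffic, speakers):
    """Return traffic per speaker, or None when the population is missing or zero."""
    if not speakers:
        return None
    return traffic / speakers


# Placeholder figures for illustration only (not real Ethnologue or traffic data).
views = {("yue", "CN"): 1000, ("yue", "MY"): 200}
speakers = {("yue", "CN"): 50000, ("yue", "MY"): 4000}

# Per-speaker viewing rate for each (language, territory) pair.
rates = {key: per_speaker(count, speakers.get(key)) for key, count in views.items()}
```

With these placeholder numbers the territory with less absolute traffic has
the higher per-speaker rate, which is the kind of pattern the normalization
is meant to reveal.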
I have done some normalization work (or geolinguistic normalization) for
languages such as Spanish and Arabic, where the CLDR's Language-Territory or
Territory-Language information is available. The surprising results are
that, for Spanish, per capita editing traffic is highest in Germany,
Paraguay, Uruguay and Spain, while per capita viewing traffic is highest in
Paraguay, Spain, Chile, etc. For Arabic, per capita editing traffic is
highest in Kuwait, Bahrain, Saudi Arabia, Qatar, Israel, UAE, etc., while
per capita viewing traffic is highest in Israel, Kuwait, Saudi Arabia, etc.
I personally believe such data curation, when supported by the better (and
expected to improve) geolinguistic population data points now available in
Ethnologue and in other sources that Wikipedians of different languages
may know, would be useful to Wikipedians first.
In short, I did not intend to ask Wikipedians or the Wikimedia research
staff to do extra "original research". My suggestions aim to parse the
traffic data one level down, from either language or territory to the more
specific language-territory aggregate, so as to better inform development
strategies and academic research on Wikipedia.
Overall, I think it is viable to construct a data process to show what
needs to be done and what can be achieved. The showing-by-doing approach can
show some results first, with infographics, for language versions that are
more data-ready (e.g. Arabic and Spanish). Then other language versions can
strive to fill the now *identified* data gaps by contributing data points
through Wikipedia and Wikidata projects. What is needed then is a database
and an expert pool of territory-language and language-territory information
across Wikipedia projects. It can be as simple and straightforward as
having a Wikidata object of geo-linguistic population for any
territory-language combination, potentially with existing translations
made possible by Wikidata. Then the traffic/viewing data reports can be (1)
localized/translated into different languages automatically and (2)
geo-linguistically normalized to show the current outreach of a language
Wikipedia per language-speaker.
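As a rough sketch of how the data gaps could be identified, the snippet
below compares the language-territory pairs that appear in traffic reports
against the pairs that have a population record. The record type
GeoLinguisticPopulation, its fields, and the function find_data_gaps are
hypothetical; this is not an existing Wikidata schema or API, and all values
are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoLinguisticPopulation:
    """One hypothetical record: speakers of a language in a territory, with its source."""
    language: str   # e.g. an ISO 639-3 code
    territory: str  # e.g. an ISO 3166-1 alpha-2 code
    speakers: int
    source: str


def find_data_gaps(traffic_pairs, records):
    """Return the (language, territory) pairs that have traffic but no population record."""
    known = {(r.language, r.territory) for r in records}
    return sorted(set(traffic_pairs) - known)


# Placeholder values for illustration only.
records = [GeoLinguisticPopulation("spa", "ES", 1000, "placeholder source")]
gaps = find_data_gaps([("spa", "ES"), ("spa", "DE"), ("ara", "KW")], records)
```

Keeping a source field on each record reflects the point above that
contributed numbers should be sourced.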
The above are only my current rough and initial thoughts. Please let me
know if the ideas or expressions are not clear enough.
Best,
han-teng liao
2014-05-19 7:35 GMT+08:00 Oliver Keyes <okeyes(a)wikimedia.org>:
Could you give an example of what we could do better than CLDR or the
relevant ISO standards?
On 18 May 2014 10:06, h <hanteng(a)gmail.com> wrote:
Dear Nemo,
As I am waiting for a more complete response, I am not sure that I
understand what your last "No", as in "No, we definitely can't", means.
To clarify, take the CLDR supplemental Language-Territory information, for
example:
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_i…
One can suggest the addition of a data point by submitting sourced
numbers for a geo-linguistic population like this:
http://unicode.org/cldr/trac/newticket?&description=%3Cterritory%2c%20s…
In Wikipedia articles and Wikidata pages, there are many attempts to
provide more up-to-date and better-sourced data points. I see potential
in exchanging such data and curating it better in Wikidata projects, as a
more detailed and dynamic source than the CLDR.
These data points will have extra benefits in curating traffic data.
For one, these geo-linguistic population data points would be useful to
normalize traffic data for further analysis, such as geographic
normalization. For another, they provide important reference data for the
development strategies and policies of the Wikipedia projects.
Best,
han-teng liao
2014-05-18 16:23 GMT+08:00 Federico Leva (Nemo) <nemowiki(a)gmail.com>:
Thanks for your suggestions. Just some quick pointers below.
h, 18/05/2014 08:26:
(I-A). Tabulate the data points in absolute numbers first, not
percentage numbers [...]
(I-B). Include all language versions for the *editing traffic* report as
well. [...]
(I-C). Provide static data objects in more accessible format (i.e. csv
and/or json). [...]
(II-A). Putting viewing traffic and editing traffic report on the same
page. [...]
(II-B). Organizing and archiving the traffic reports for historical
comparison. [...]
(I-C). Provide dynamic data objects in more accessible format (i.e. csv
and/or json).
At least the first four are "just" changes in the WikiStats report
formatting; personally, I encourage you to submit patches:
<https://git.wikimedia.org/summary/analytics%2Fwikistats.git> (should be
the "squids" directory, but there is some ongoing refactoring of the repos).
On archives and "history rewriting"/reports regeneration, see also
https://bugzilla.wikimedia.org/show_bug.cgi?id=46198
[...] (III-B). Smaller (i.e more specific) geographic aggregate units.
The country (geographic) information is often based on geo-IP databases,
and sometimes provincial and city-level data would be available.
http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html
[...]
(I know that the Unicode Common Locale Data Repository (CLDR Version 25
<http://cldr.unicode.org/index/downloads/cldr-25>) provides
"language-territory"
<http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html>
or "territory-language"
<http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html>
unit-based charts, but I believe that the Wikimedia projects can use and
build one better.) [...]
No, we definitely can't, not alone. I've asked for help, please
contribute:
<https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand>.
Nemo
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Oliver Keyes
Research Analyst
Wikimedia Foundation