Hello Oliver,
Let me use Cantonese (yue) and Hakka (hak) as examples to illustrate some possibilities, looking only at the population data points.
Note that you should see the population numbers for China and also for other places in the world (under the "Also Spoken In" section). There are also other data points such as "status" and "writing".
Then one can look up CLDR's Language-Territory or Territory-Language information; the entries for Cantonese and Hakka do not exist yet.
Note also that both Cantonese and Hakka have their own language versions of Wikipedia (zh-yue and hak). Because the codes and names differ between sources, a mapping table is needed here for data integration.
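To make the coding issue concrete, here is a minimal sketch of such a mapping table in Python. The names `WIKI_TO_LANGUAGE_CODE` and `to_language_code` are hypothetical, and the table only holds the two examples mentioned above; a real table would need an entry for every Wikipedia language version.

```python
# Hypothetical mapping table: Wikipedia language-version code -> the
# three-letter language code used in the examples above.
# Only the two languages discussed in this message are listed.
WIKI_TO_LANGUAGE_CODE = {
    "zh-yue": "yue",  # Cantonese Wikipedia
    "hak": "hak",     # Hakka Wikipedia
}


def to_language_code(wiki_code):
    """Return the language code for a Wikipedia code, or None if unmapped."""
    return WIKI_TO_LANGUAGE_CODE.get(wiki_code)


print(to_language_code("zh-yue"))
print(to_language_code("hak"))
```

Returning None for an unmapped code, rather than raising an error, makes it easy to list which Wikipedia versions still lack a mapping.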
Now, as tertiary sources that integrate other data points, Wikipedia and Wikidata can take the data points from Ethnologue to enrich their content.
These data points would be an important baseline for almost any language-based Wikipedia project to identify its potential editors.
The current active editors of small- and medium-sized language Wikipedia projects should be interested in getting hold of such data. Also, they may know of more up-to-date and reliable data before Ethnologue does.
For traffic data reports, a Cantonese Wikipedian can then normalize the viewing and editing traffic data against the population data, thereby identifying the "per speaker capita" number for the viewing/editing traffic.
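As a rough sketch of the normalization step, the calculation is just a division of a traffic count by a speaker population. The function name `per_speaker` is hypothetical, and the numbers below are placeholders for illustration only, not real statistics for any language.

```python
def per_speaker(traffic_count, speaker_population):
    """Return traffic per speaker, or None if the population is missing or zero."""
    if not speaker_population:
        return None
    return traffic_count / speaker_population


# Placeholder numbers, for illustration only.
monthly_views = 1_000_000
speakers = 50_000_000

print(per_speaker(monthly_views, speakers))
print(per_speaker(monthly_views, None))  # no population data yet
```

The same function works for editing traffic; only the count passed in changes.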
I have done some normalization work (or geolinguistic normalization) for languages such as Spanish and Arabic, where CLDR's Language-Territory or Territory-Language information is available. The surprising results are that, for Spanish, per capita editing traffic is highest in Germany, Paraguay, Uruguay and Spain, and per capita viewing traffic is highest in Paraguay, Spain, Chile, etc. For Arabic, per capita editing traffic is highest in Kuwait, Bahrain, Saudi Arabia, Qatar, Israel, UAE, etc., and per capita viewing traffic is highest in Israel, Kuwait, Saudi Arabia, etc.
I personally believe such data curation, supported by the better (and expected to improve) geolinguistic population data points now available in Ethnologue and in other sources that Wikipedians of different languages may know, would be useful to Wikipedians first.
In short, I did not intend to ask Wikipedians or the Wikimedia research staff to do extra "original research". My suggestions aim to parse the traffic data one level down, from either language or territory to the more specific language-territory aggregate, so as to better inform development strategies and academic research on Wikipedia.
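To illustrate what "one level down" could look like, here is a minimal sketch that sums traffic per language-territory pair and then ranks the pairs by traffic per speaker. The record layout, the function names, and the territory codes "AA" and "BB" are all assumptions made for this example, and the numbers are placeholders rather than real traffic or population figures.

```python
from collections import defaultdict


def aggregate_by_language_territory(records):
    """Sum traffic counts per (language, territory) pair."""
    totals = defaultdict(int)
    for language, territory, count in records:
        totals[(language, territory)] += count
    return dict(totals)


def rank_per_speaker(totals, populations):
    """Rank pairs by traffic per speaker, skipping pairs without population data."""
    rates = {
        pair: count / populations[pair]
        for pair, count in totals.items()
        if populations.get(pair)
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Placeholder records: (language, territory, traffic count).
records = [("es", "AA", 300), ("es", "BB", 500), ("es", "AA", 200)]
# Placeholder speaker populations per (language, territory) pair.
populations = {("es", "AA"): 1000, ("es", "BB"): 5000}

totals = aggregate_by_language_territory(records)
for pair, rate in rank_per_speaker(totals, populations):
    print(pair, rate)
```

Pairs with no population entry are left out of the ranking rather than given a guessed value, so missing data stays visible as a gap.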
Overall, I think it is viable to construct a data process to show what needs to be done and what can be achieved. The showing-by-doing approach can show some results first, with infographics, for language versions that are more data-ready (e.g. Arabic and Spanish). Then other language versions can strive to fill the now *identified* data gaps by contributing data points through Wikipedia and Wikidata projects. What is needed then is a database and an expert pool of territory-language and language-territory information across Wikipedia projects. It can be as simple and straightforward as having a Wikidata object of geolinguistic population for any territory-language combination, potentially with existing translations made possible by Wikidata. Then the traffic/viewing data reports can be (1) localized/translated into different languages automatically and (2) geolinguistically normalized to show the current outreach of a language Wikipedia per language speaker.
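One possible shape for such a geolinguistic population record, and for listing the identified data gaps, is sketched below. This is not an existing Wikidata schema; the class `GeoLinguisticPopulation`, its fields, and the function `find_data_gaps` are hypothetical, and the entries use placeholder codes and values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeoLinguisticPopulation:
    """One speaker-population data point for a territory-language combination."""
    language: str                   # language code
    territory: str                  # territory code
    speakers: Optional[int] = None  # None marks a data gap
    source: Optional[str] = None    # where the number came from


def find_data_gaps(entries):
    """Return the (language, territory) pairs that still lack a speaker count."""
    return [(e.language, e.territory) for e in entries if e.speakers is None]


# Placeholder entries, for illustration only.
entries = [
    GeoLinguisticPopulation("xx", "AA", speakers=1000, source="placeholder"),
    GeoLinguisticPopulation("xx", "BB"),
]

print(find_data_gaps(entries))
```

Keeping a source field on each record would let editors see where a number came from and replace it when they know of better data.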
The above are only my current rough and initial thoughts. Please let me know if the ideas or expressions are not clear enough.
Best,
han-teng liao