On Sat, 27 Mar 2021 at 01:52, Grounder UK <grounderuk@gmail.com> wrote:

I would look at scripts, aiming to have every supported script represented in the first block of reserved ZIDs.

Yes, we are definitely going to pre-create all ISO 639-3 languages (and all the MediaWiki content and interface languages, where they are disjoint) as part of the initial content injection before the wiki launches.

However, new languages will come along over time, either as they are recognised by ISO or as the Wikifunctions community expands its ideas of, and capacity for, the audiences for whom we will want to create content.

Additionally, I expect that we will end up wanting to produce outputs tuned to different audiences within the larger macro-group of a "language". For instance, "encyclopædic educational American English" might be a form that the community decides to select as a 'house style', to better reflect a discursive, plain-spoken, authoritative manner. That form could be the target of the natural language generation system(s) we create, and would contain some explicit or implicit decisions around tone, mood, and so on (in this case, I imagine something along the lines of third person impersonal, no passive voice, indicative mood).

Indeed, different conceptual areas might be better served by different target languages; the language used to talk about quantum mechanics might be inappropriately stilted and feel odd when talking about pop cultural items (and in reverse, the tone of pop culture coverage might well read poorly for quantum mechanics). There's also the often-considered idea around the Wikimedia world of the same content being available in different forms for readers of differing general abilities, specific aptitudes, and subject familiarities. Perhaps we might have "encyclopædic educational American English for ~10 year olds", "… for ~15 year olds", and "… for ~20 year olds", which could differ from each other a fair bit in linguistic construction, varying in terms of vocabulary, metre, and depth, for example, whilst still covering the same factual content.

The language addressing framework that we create now should not preclude our community from doing as it sees fit in future, nor decide so much that it binds our hands later (but it also should not leave so much undecided that we cannot make headway now).

We should also reserve a few ZIDs for international and interlingual labels, as a “language”. This would include “en-GB” as a label for the object labelled “British English” in English, for example,

The language code for our internal use (which will indeed likely be "en"/"en-GB"/"en-GB-…" etc. in BCP47 style) will be stored as the key of the instance (Z60K1); "British English" would be stored as a label (Z2K3 -> "en-GB"). If you mean the formal mapping of that to a given ISO code, we could add that as a key, but I'm not sure that's something we'd generally need? We'd probably want to do as much as possible of the mapping from Wikifunctions' languages to wider concepts on Wikidata, rather than mastering the content within the wiki.
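A minimal sketch of the shape described above, not a final schema. Only Z60K1 and Z2K3 come from this message; the other keys (Z1K1, Z2K1, Z2K2, Z12K1, Z11K1, Z11K2) are assumptions based on one reading of the draft function model, and the ZIDs Z9001 and Z9002 are made-up placeholders.

```python
# Hypothetical placeholder ZIDs; real ones would be assigned when the
# initial content is injected.
ENGLISH_ZID = "Z9001"
BRITISH_ENGLISH_ZID = "Z9002"


def make_language_object(zid, code, labels):
    """Build a persistent object wrapping a natural-language instance.

    `labels` maps the ZID of the labelling language to the label text.
    """
    return {
        "Z1K1": "Z2",      # assumed: type marker for a persistent object
        "Z2K1": zid,       # assumed: the object's own ZID
        "Z2K2": {          # assumed: the wrapped value, here a language
            "Z1K1": "Z60",
            "Z60K1": code,  # the BCP47-style code, as described above
        },
        "Z2K3": {          # the multilingual label, as described above
            "Z1K1": "Z12",
            "Z12K1": [
                {"Z1K1": "Z11", "Z11K1": lang, "Z11K2": text}
                for lang, text in labels.items()
            ],
        },
    }


british_english = make_language_object(
    BRITISH_ENGLISH_ZID, "en-GB", {ENGLISH_ZID: "British English"}
)
```

Any formal mapping to an ISO code is deliberately left out of the object here, in line with keeping that mapping on Wikidata.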

or “m” for the SI unit labelled “mètre” in French.

That's an alias for a unit label, which is currently handled in Wikidata on a per-language basis. I think we'd want to avoid replicating that data locally on Wikifunctions. Our knowledge about natural languages exists so that we can label, describe, and discuss functions at launch and, in future, so that those languages can serve as output targets for forthcoming systems.

More importantly, that kind of central decision is not something we should generally build for, as I believe it's very rare that such decisions would be appropriate for all languages. Where many languages have a shared abbreviation this can feel a bit duplicative, but I'd rather we avoid baking in general decisions about how languages will wish things to work.

If said label/alias is appropriate in, for example, Pashto, I'd expect it to be added to the Pashto entry on Wikidata for metre; note that it's not there currently. If it were not appropriate, a different abbreviation could be added, assuming that "متر" is not generally considered short enough. But that would be for the Pashto speaking/contributing community to decide on a per-language basis, rather than us imposing a centrally-controlled list of "international" labels that I/we/someone decides is appropriate for all target output formats for now and forever. Concepts which have a shared monolingual string at the Wikidata level (like a species' taxonomic name) would be better handled by special functions that choose how to present the label for the given language, rather than having it hard-coded as a source or target language object.

Indeed, there also might be different output formats rather than just labels, so that American English (or perhaps "American English / Imperial Units"?) output would describe things in foot-pounds or other odd (but locally-expected) units, instead of imposing a single view for all consumers. This way text in basic English might use metres, in "American English" might use feet, in "Canadian English" would use both metres and feet, in Indian English might use metres with the crore/lakh notation, etc. Particular areas of content (like Physics) might be flagged to use particular language outputs (like metric units).
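As a rough illustration of output variants choosing their own units, here is a small sketch. The variant codes, the preference table, and the `format_length` helper are all hypothetical; in practice each community would decide these for itself.

```python
FEET_PER_METRE = 1 / 0.3048  # the international foot is defined as 0.3048 m

# Hypothetical variant codes and unit preferences, for illustration only.
UNIT_PREFERENCES = {
    "en-simple": ("metres",),
    "en-US": ("feet",),
    "en-CA": ("metres", "feet"),
}


def format_length(metres, variant):
    """Render a length using the units the given output variant prefers."""
    # Unknown variants fall back to metres (an arbitrary choice for this sketch).
    units = UNIT_PREFERENCES.get(variant, ("metres",))
    parts = []
    for unit in units:
        value = metres * FEET_PER_METRE if unit == "feet" else metres
        parts.append(f"{value:.0f} {unit}")
    if len(parts) == 1:
        return parts[0]
    # Variants with two units show the second in parentheses.
    return f"{parts[0]} ({parts[1]})"
```

With this table, `format_length(100, "en-US")` gives "328 feet" and `format_length(100, "en-CA")` gives "100 metres (328 feet)". A subject-area override (for example, always metric for Physics) could be layered on top of the same lookup.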

After that, we might choose to ensure that ISO 639-1 languages (184 with two-character codes, like “de”) have a lower ZID than other interface languages. This is because ISO 639-1 was intended to include the most common languages.

We could. Using the UN Six as the first few was intended to privilege the very most common languages, though of course through the lens of the 1945 geo-political settlement. We could also do both.
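Doing both could look something like the sketch below: the UN Six first, then the remaining ISO 639-1 codes, then everything else. The starting number, the ordering within each tier, and the `allocate_zids` helper are assumptions made for illustration, not a decided scheme.

```python
# The six official UN languages; their order within this list is arbitrary.
UN_SIX = ["ar", "zh", "en", "fr", "ru", "es"]


def allocate_zids(iso_639_1, other_codes, first_zid=1001):
    """Assign sequential ZIDs in three tiers, skipping duplicate codes.

    `first_zid` is a hypothetical start of the reserved block.
    """
    ordered = []
    seen = set()
    # Tier 1: UN Six; tier 2: ISO 639-1; tier 3: all remaining codes.
    for code in [*UN_SIX, *sorted(iso_639_1), *sorted(other_codes)]:
        if code not in seen:
            seen.add(code)
            ordered.append(code)
    return {code: f"Z{first_zid + i}" for i, code in enumerate(ordered)}


zids = allocate_zids({"de", "en", "ja"}, {"sco", "de"})
```

Here "en" keeps its UN Six slot, "de" lands in the ISO 639-1 tier, and "sco" comes after both.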

James D. Forrester (he/him or they/themself)

Wikimedia Foundation