Thanks, James. As I said, “no worries, in any event.” I am happy for ZIDs
to be allocated automatically as the need arises. Reserving a range for
future needs is problematic but providing an initial set is not. The
question is whether some languages should be identified as more important
than others. My view is that this is unnecessary and may cause offence or
fruitless debate. The same is likely to be true of any ordering. So I would
go for wholly random.
How we use or extend the initial set is a question for another day.
However, I have already expressed an opinion, on-wiki, that the results of
rendering functions should have different types depending, only in part, on
the target language (and script, orthography, register etc). We should not
assume that rendered content would have a single “language” tag, like
Z11/monolingual text does.
I am wary of aliases. Of course, Z12/multilingual text is one
implementation of aliasing, and purely functional aliases are an inevitable
second. A third system on top of those may be tricky.

I think this goes back to the question of multi-lingually labelized
objects. We need to interpret labels unambiguously but they are not
globally unique, so we need label+context, where “language” is part (or
all) of the context. In context, “en-GB” (for example) is unambiguous not
because of the “language” in the context but because it refers to the same
thing irrespective of that “language”. Making “en-GB” an alias in many or
all languages is not simply a question of duplication; it is missing the
point... But I do not propose we should duplicate Wikidata’s flaws. We just
need a way for standardized nomenclatures to be handled independently,
however incomplete their adoption in the real world may be.

SI units are a good example, because we should expect even quite simple
functions to accept and return explicit units, not (just) the
natural-language words for those units. (And, in passing, we should prefer
to fall back to SI units rather than “en”, more often than not.) To be
clear, this is not proposing an imposition; it is descriptive of how real
people think and work, partially overcoming natural-language and cultural
barriers. Conversion between units is another problem better left for
later, but I would be wary of inferring units or formats from the
contextual “language” (because we know that can’t work).
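To illustrate what I mean (a minimal, hypothetical sketch; the names here are invented for illustration and are not any agreed Wikifunctions design): a function that takes and returns explicit units can refuse to guess, rather than inferring units from the contextual “language”.

```python
# Hypothetical sketch: a value carries an explicit SI unit symbol, so
# functions never have to infer units from the contextual "language".
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str  # an explicit SI unit symbol, e.g. "m", not a natural-language word

def add_lengths(a: Quantity, b: Quantity) -> Quantity:
    # Refuse to guess: mixed units are an explicit error here,
    # not something to infer from context.
    if a.unit != b.unit:
        raise ValueError(f"cannot add {a.unit} and {b.unit} without explicit conversion")
    return Quantity(a.value + b.value, a.unit)

print(add_lengths(Quantity(1.5, "m"), Quantity(0.5, "m")))
# Quantity(value=2.0, unit='m')
```

The point is only that the unit travels with the value; rendering it as “mètre”, “metre” or “متر” is then a separate, per-language labelling question.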
Briefly, all we need to decide now is how to provide ZIDs for an agreed set
of language–script–style combinations that already have unambiguous
identifiers. If we agree not to use the existing identifiers directly, an
easily explained and culturally neutral one-to-one mapping is all I hope
for. We should not allow any one-to-many mapping without further
discussion. For now, only Z11/monolingual text will use the new Z60 ZIDs
and users will not be able to add Z60s.
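As a minimal, hypothetical sketch of what I mean by an easily explained, culturally neutral one-to-one mapping (the seed, the starting ZID and the function are all invented for illustration, not a proposal for the actual scheme): shuffle the agreed identifiers reproducibly, then assign sequential ZIDs, so that no language is privileged by the ordering.

```python
# Hypothetical sketch of a "wholly random" but easily explained one-to-one
# mapping from existing identifiers to ZIDs. The seed and starting ZID are
# illustrative assumptions, not agreed values.
import random

def assign_zids(codes: list[str], first_zid: int, seed: int = 2021) -> dict[str, str]:
    ordered = sorted(codes)               # start from a canonical order
    random.Random(seed).shuffle(ordered)  # then randomize reproducibly
    return {code: f"Z{first_zid + i}" for i, code in enumerate(ordered)}

mapping = assign_zids(["en-GB", "fr", "ps", "de"], first_zid=1001)
# Every code gets exactly one ZID, and the assignment is reproducible
# from the published seed alone.
assert len(set(mapping.values())) == len(mapping)
```

Anyone can re-derive the mapping from the published seed, yet the resulting order says nothing about relative importance.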
(I still believe that standardized codes should be accepted as labels
without needing to be declared within each language, but I withdraw my
proposal that there should be specific Z60s for the implied international
“languages”. This implies some form of cross-language aliasing, but I hope
that might be purely functional, and we can defer discussion of the details
or continue elsewhere.)
Regards,
Al.
On Tue, 30 Mar 2021 at 20:50, James Forrester <jforrester(a)wikimedia.org>
wrote:
On Sat, 27 Mar 2021 at 01:52, Grounder UK
<grounderuk(a)gmail.com> wrote:
> I would look at scripts, aiming to have every supported script
> represented in the first block of reserved ZIDs.
Yes, we are definitely going to pre-create all ISO 639-3 languages (and
all the MediaWiki content and interface languages, where they are disjoint)
as part of the initial content injection before the wiki launches.
However, new languages will come along over time, either as they get
recognised by ISO or as the Wikifunctions community expands its ideas of,
and capacity for, the audiences for whom we will want to create content.
Additionally, I expect that we will end up wanting to produce outputs
tuned to different audiences within larger macro-groups of a "language";
for instance, "encyclopædic educational American English" might be a form
that the community decides to select as a 'house style', better reflecting
a discursive, plain-spoken, authoritative manner. It could be the target of
the natural language generation system(s) we create, and would carry some
explicit or implicit decisions around tone, mood, and so on (in this case,
I imagine something along the lines of third person impersonal, no passive
voice, indicative mood).
Indeed, different conceptual areas might be better served by different
target languages; the language used to talk about quantum mechanics might
be inappropriately stilted and feel odd when talking about pop
cultural items (and in reverse, the tone of pop culture coverage might well
read poorly for quantum mechanics). There's also the often-considered
idea around the Wikimedia world of the same content being available in
different forms for readers of differing general abilities, specific
aptitudes, and subject familiarities. Perhaps we might have "encyclopædic
educational American English for ~10 year olds", "… for ~15 year olds",
and
"… for ~20 year olds", which could differ from each other a fair bit in
linguistic construction, varying in terms of vocabulary, metre, and depth,
for example, whilst still covering the same factual content.
The language addressing framework that we create now should not preclude
our community from doing as it sees fit in future, and not decide too much
that binds our hands later (but also doesn't leave so much undecided that
we cannot make headway now).
> We should also reserve a few ZIDs for international and interlingual
> labels, as a “language”. This would include “en-GB” as a label for the
> object labelled “British English” in English, for example,
The language code for our internal use (which will indeed likely be
"en"/"en-GB"/"en-GB-…" *etc.* in BCP47 style) will be
stored as the key
of the instance (Z60K1
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Pre-generic_function_model#Z60/Language_(Z4/Type)>);
the label "British English" would be stored as a label (Z2K3
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Pre-generic_function_model#Z2/Persistent_object_(Z4/Type)>
-> "en-GB"). If you mean the formal mapping of that to a given ISO code,
formal mapping of that to a given ISO code,
we could add that as a key but I'm not sure that's something we'd generally
need? We'd probably want to do as much of the mapping from Wikifunctions's
languages to wider concepts on Wikidata, rather than mastering the content
within the wiki.
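For concreteness, my mental model of that layout, shown as plain data (a rough sketch from my reading of the pre-generic function model linked above; the ZID Z1002 and the exact nesting are my assumptions, not canonical wiki content):

```python
# Rough, hypothetical sketch of a Z60/Language persistent object. The ZID
# Z1002 and the self-referencing label language are illustrative
# assumptions, not the canonical data model.
z60_en_gb = {
    "Z1K1": "Z2",          # Z2/Persistent object
    "Z2K1": "Z1002",       # hypothetical ZID for this language object
    "Z2K2": {              # the value: a Z60/Language instance
        "Z1K1": "Z60",
        "Z60K1": "en-GB",  # the BCP47-style code, stored as the key
    },
    "Z2K3": {              # the multilingual label: a Z12 of Z11s
        "Z1K1": "Z12",
        "Z12K1": [
            # one Z11/monolingual text per labelling language
            {"Z1K1": "Z11", "Z11K1": "Z1002", "Z11K2": "British English"},
        ],
    },
}
```

That is, the code lives once in Z60K1, while "British English" is just one monolingual label among many in Z2K3.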
> or “m” for the SI unit labelled “mètre” in French.
That's an alias for a unit label, which is currently handled in Wikidata
on a per-language basis. I think we'd want to avoid replicating that data
locally on Wikifunctions. Our knowledge about natural languages exists so
that we can label/describe/discuss functions at launch and, in future, so
that it can inform forthcoming systems as output targets.
More importantly, that kind of central decision is not something we should
generally build for, as I believe it's very rare that such decisions would
be appropriate for all languages. Where many languages have a shared
abbreviation this can feel a bit duplicative, but I'd rather we avoid
baking in general decisions about how languages will wish things to work.
If said label/alias is appropriate in, for example, Pashto, I'd expect it
to be added to the Pashto entry on Wikidata for metre; note that it's not
there currently. If it were not appropriate, a different abbreviation could
be added, assuming that "متر" is not generally considered short enough.
But that would be for the Pashto speaking/contributing community to decide
on a per-language basis, rather than imposing a centrally-controlled list
of "international" labels that I/we/someone decides is appropriate for all
target output formats, now and forever. For concepts which have a shared
monolingual string at the Wikidata level (like species taxonomic names),
that would be better handled by special functions for such concepts, which
would choose how to handle the label for the given language, rather than
having it hard-coded as a source or target language object.
Indeed, there also might be different output *formats* rather than just
labels, so that American English (or perhaps "American English / Imperial
Units"?) output would describe things in foot-pounds or other odd (but
locally-expected) units, instead of imposing a single view for all
consumers. This way text in basic English might use metres, in "American
English" might use feet, in "Canadian English" would use both metres and
feet, in Indian English might use metres with the crore/lakh notation,
*etc.* Particular areas of content (like Physics) might be flagged to use
particular language outputs (like metric units).
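Sketched as a dispatch on the output variant (everything here, including the variant codes and the metric fallback, is invented for illustration rather than a proposed design):

```python
# Hypothetical sketch of per-variant output formats: the variant codes,
# function name, and the metric fallback are illustrative assumptions.
METRES_PER_FOOT = 0.3048

def format_length(metres: float, variant: str) -> str:
    feet = metres / METRES_PER_FOOT
    if variant == "en-US":
        return f"{feet:.0f} feet"            # locally-expected units only
    if variant == "en-CA":
        return f"{metres:.0f} metres ({feet:.0f} feet)"  # both
    return f"{metres:.0f} metres"            # metric for everyone else

print(format_length(100, "en-US"))  # 328 feet
print(format_length(100, "en-CA"))  # 100 metres (328 feet)
```

The same underlying value is rendered differently per variant, so no single view is imposed on all consumers.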
> After that, we might choose to ensure that ISO 639-1 languages (184 with
> two-character codes, like “de”) have a lower ZID than other interface
> languages. This is because ISO 639-1 was intended to include the most
> common languages.
We could. Using the UN Six as the first full set was intended to privilege
the very most common languages, though of course through the lens of the
1945 geo-political settlement. Of course, we could also do both.
J.
--
*James D. Forrester* (he/him <http://pronoun.is/he> or they/themself
<http://pronoun.is/they/.../themself>)
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Abstract-Wikipedia mailing list
Abstract-Wikipedia(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia