Assigning languages to ZIDs - Abstract-Wikipedia

Thad Guidry

11 Mar 11 Mar

1:53 a.m.

Hi Denny, I think that using an open spec like Microsoft's LCID would probably come in useful here? There are mappings to ISO codes, etc. https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba… I'm not an expert in this particular area, but I do know that in present day, from some scientists, that this is how some of their systems interact with Digit's rather than Letters when there is a need. There is some history with LCID's here also with W3C, IANA and other orgs, but I was never a part of the standards or open spec drafts at the time in the 90's, so I don't know the history. https://www.w3.org/International/ms-lang.html So you might have to poke some other folks more in the know than myself. Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/ On Wed, Mar 10, 2021 at 7:17 PM Denny Vrandečić <dvrandecic(a)wikimedia.org> wrote:

...

How would you assign languages to ZIDs? Currently the idea is to reserve the four digit ZIDs for languages. I would like to start with the languages that are user interface languages for MediaWiki (but there's enough space to extend this considerably later). But how to turn them into numbers? The best way I can come up with is to take the language codes, sort them alphabetically, and take that for the first set, and then, when more come in, add them chronologically. Let's have that as the strawman proposal, but maybe someone can come up with a smarter idea? _______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

Reply

Lucas Werkmeister

8:50 a.m.

I was wondering about this earlier (though in the context of programming languages for code implementations, rather than natural languages): why do we need to assign them to ZIDs at all? Wouldn’t it be better to refer to them using Wikidata item IDs? Cheers, Lucas On 11.03.21 02:17, Denny Vrandečić wrote:

...

How would you assign languages to ZIDs? Currently the idea is to reserve the four digit ZIDs for languages. I would like to start with the languages that are user interface languages for MediaWiki (but there's enough space to extend this considerably later). But how to turn them into numbers? The best way I can come up with is to take the language codes, sort them alphabetically, and take that for the first set, and then, when more come in, add them chronologically. Let's have that as the strawman proposal, but maybe someone can come up with a smarter idea? _______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

Reply

Charles Matthews

1:43 p.m.

...

On 11 March 2021 at 12:24 David Abián <davidabian(a)wikimedia.es> wrote: Hi, Although we usually claim that Wikidata entities and their URIs are stable, I believe that, unfortunately, their stability isn't as high as desirable in practice, or at least the risk of deletion, merging, redirection, redefinition of identity due to confusion or ambiguity, etc. exists and materializes too often (I hope I'm not being read by Wikidata haters :D). This happens with all kinds of Items, and it can happen with minority languages as well as with something as widespread as, for instance, Chinese.

...

On 11/03/2021 09:50, Lucas Werkmeister wrote: > Wouldn’t it be better to refer to them using Wikidata item IDs?

Not an area where I'm an expert. But there seem to be quite a number of standard IDs for languages, include some which are ISO, and P424 on Wikidata which is "Wikimedia language code". I would say the default solution would be Z-numbers, and a Wikidata property for WA ID so that cross-walking to any other standard ID is just a Wikidata query away. If what is intended is an exact fit with P424 then that would be duplication; but I suppose in the end that won't be the decision. Charles

Reply

Denny Vrandečić

27 Mar 27 Mar

12:34 a.m.

To close this thread: The current proposal says to give Z1001-Z1006 to the big six UN languages, and then to assign Z1011ff. to all the other language codes based on the alphabetical order of their language codes. I was wondering if, for the languages not in the big six, instead of using the alphabetical order, if there was an appetite to using a randomized order Advantages of alphabetical order: it enables a bit of guessing and remembering (if you know what "uz" is, you'll be able to guess "uz-cyrl" and "uz-latn" with quite some confidence). Advantages of random order: no special weight to latin script and latinized names of languages. On Thu, Mar 11, 2021 at 3:51 PM Denny Vrandečić <dvrandecic(a)wikimedia.org> wrote:

...

Thanks for the thoughts, these are super useful! And yes, I sent that mail yesterday out with not enough context. Thad, thanks, LCID looks indeed very interesting! That could provide a source of numbers to draw from. But looking into it in detail, it also looks like "we took the English language names, and in that order assigned numbers, until we finished the first batch, and then added more batches chronologically". So at least there's precedent for doing that. Lucas, I was also thinking about QIDs. But besides the point that David raised, or rather in addition to them, there's also the point that Wikidata should be the 'pure representation' of what a language is - whereas the objects in Wikifunctions representing languages should malleable to exactly what we need and want them to be, irrelevant of their independent ontological status in the world (I hope that makes sense?). In Wikifunctions we decide what a language is, and their fallbacks, etc., based on product needs. In Wikidata on the other side we decide what is a language and what is not based on what relevant sources are stating. These hopefully overlap, and ideally are equivalent, but in reality I don't expect them to be and I don't want to introduce a push for edits in Wikidata to make certain features in Wikifunctions work. Yes, we should have mappings from the relevant ZIDs to QIDs, and/or the other way around, but that's why I think they shouldn't be the same. Charles, regarding your point, yes, we should be compatible and map to external standards as much as possible, but for the same reason as to regarding Wikidata, we shouldn't simply import them wholesale. I wrote down a few more thoughts on-wiki here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Representation_of_langua… On Thu, Mar 11, 2021 at 5:44 AM Charles Matthews via Abstract-Wikipedia < abstract-wikipedia(a)lists.wikimedia.org> wrote:

On 11 March 2021 at 12:24 David Abián <davidabian(a)wikimedia.es> wrote: Hi, Although we usually claim that Wikidata entities and their URIs are stable, I believe that, unfortunately, their stability isn't as high as desirable in practice, or at least the risk of deletion, merging, redirection, redefinition of identity due to confusion or ambiguity, etc. exists and materializes too often (I hope I'm not being read by Wikidata haters :D). This happens with all kinds of Items, and it can happen with minority languages as well as with something as widespread as, for instance, Chinese.

On 11/03/2021 09:50, Lucas Werkmeister wrote: > Wouldn’t it be better to refer to them using Wikidata item IDs?

Not an area where I'm an expert. But there seem to be quite a number of standard IDs for languages, include some which are ISO, and P424 on Wikidata which is "Wikimedia language code". I would say the default solution would be Z-numbers, and a Wikidata property for WA ID so that cross-walking to any other standard ID is just a Wikidata query away. If what is intended is an exact fit with P424 then that would be duplication; but I suppose in the end that won't be the decision. Charles _______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

Reply

Samuel Klein

4:23 a.m.

Perhaps something like ordering roughly by prevalence of use, or by randomization within tiers of use. ? Purely random order feels a bit chaotic. On Fri, Mar 26, 2021 at 9:07 PM Thad Guidry <thadguidry(a)gmail.com> wrote:

...

Hmm, I am definitely more attracted to: * Advantages of random order: no special weight to latin script and latinized names of languages. because of it's closer alignment to the ethos of Abstract Wikipedia. Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/ On Fri, Mar 26, 2021 at 7:35 PM Denny Vrandečić <dvrandecic(a)wikimedia.org> wrote:

To close this thread: The current proposal says to give Z1001-Z1006 to the big six UN languages, and then to assign Z1011ff. to all the other language codes based on the alphabetical order of their language codes. I was wondering if, for the languages not in the big six, instead of using the alphabetical order, if there was an appetite to using a randomized order Advantages of alphabetical order: it enables a bit of guessing and remembering (if you know what "uz" is, you'll be able to guess "uz-cyrl" and "uz-latn" with quite some confidence). Advantages of random order: no special weight to latin script and latinized names of languages. On Thu, Mar 11, 2021 at 3:51 PM Denny Vrandečić <dvrandecic(a)wikimedia.org> wrote:

Thanks for the thoughts, these are super useful! And yes, I sent that mail yesterday out with not enough context. Thad, thanks, LCID looks indeed very interesting! That could provide a source of numbers to draw from. But looking into it in detail, it also looks like "we took the English language names, and in that order assigned numbers, until we finished the first batch, and then added more batches chronologically". So at least there's precedent for doing that. Lucas, I was also thinking about QIDs. But besides the point that David raised, or rather in addition to them, there's also the point that Wikidata should be the 'pure representation' of what a language is - whereas the objects in Wikifunctions representing languages should malleable to exactly what we need and want them to be, irrelevant of their independent ontological status in the world (I hope that makes sense?). In Wikifunctions we decide what a language is, and their fallbacks, etc., based on product needs. In Wikidata on the other side we decide what is a language and what is not based on what relevant sources are stating. These hopefully overlap, and ideally are equivalent, but in reality I don't expect them to be and I don't want to introduce a push for edits in Wikidata to make certain features in Wikifunctions work. Yes, we should have mappings from the relevant ZIDs to QIDs, and/or the other way around, but that's why I think they shouldn't be the same. Charles, regarding your point, yes, we should be compatible and map to external standards as much as possible, but for the same reason as to regarding Wikidata, we shouldn't simply import them wholesale. I wrote down a few more thoughts on-wiki here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Representation_of_langua… On Thu, Mar 11, 2021 at 5:44 AM Charles Matthews via Abstract-Wikipedia < abstract-wikipedia(a)lists.wikimedia.org> wrote:

> On 11 March 2021 at 12:24 David Abián <davidabian(a)wikimedia.es> wrote: > > > Hi, > > Although we usually claim that Wikidata entities and their URIs are > stable, I believe that, unfortunately, their stability isn't as high as > desirable in practice, or at least the risk of deletion, merging, > redirection, redefinition of identity due to confusion or ambiguity, > etc. exists and materializes too often (I hope I'm not being read by > Wikidata haters :D). This happens with all kinds of Items, and it can > happen with minority languages as well as with something as widespread > as, for instance, Chinese. > On 11/03/2021 09:50, Lucas Werkmeister wrote: > > Wouldn’t it be better to refer to them using Wikidata item IDs? Not an area where I'm an expert. But there seem to be quite a number of standard IDs for languages, include some which are ISO, and P424 on Wikidata which is "Wikimedia language code". I would say the default solution would be Z-numbers, and a Wikidata property for WA ID so that cross-walking to any other standard ID is just a Wikidata query away. If what is intended is an exact fit with P424 then that would be duplication; but I suppose in the end that won't be the decision. Charles _______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

_______________________________________________

Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

_______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

-- Samuel Klein @metasj w:user:sj +1 617 529 4266

Reply

Philippe Verdy

28 Mar 28 Mar

12:14 a.m.

And I still wonder why ZID's have to be entirely numeric with just the "Z" prefix. When in fact they are actually pagenames (on the wiki) that are also mapped to an internal pageid. ZID's are just unique identifiers (supposed to be short and easy to process, and probably only in ASCII). But I don't wee why ZID's could not be "scoped". For me the numeric form "Znnnn" is just the default form for "unscoped" ZID's. But ZID's that fall within the same registry, could very well have the form "Znnn/" (where "Znnn" is the "unscoped" parent ZID of a type) followed by the appropriate code from that type. In that case a single unscoped ZID ("Z60" ) is sufficient as the parent ZID for the type "natural language" encoded according to BCP47. And then "Z60/en" is the ZID of English. This way we don't "reinvent the wheel", and we can reuse existing standard registries of codes, by scoping them in a parent ZID associated to the type representing this registry standard (here BCP47 is the registry standard). To be able to use a "scoped" ZID (i.e. a ZID using subpages separated by "/"), the parent ZID (its base pagename in Mediawiki) just has to be a valid type in Wikifunctions (the parent type, here "Z60", may implement its own validator for the sub-ZID's it can process, i.e. check BCP47 conformance, and implement normalization, here the capitalization of codes as they lettercase is not significant, and the unification of "-" or "_" as per BCP47 rules, so that one cannot create non-conforming codes and so that the normalizer will also ensure that if one user enters the ZID "Z60/EN" in the JSON data to edit or import, it gets implicitly normalized to "Z60/en" by the validator implemented in the "Z60" type, and so that the presence in the same edit or import of both "Z60/en" and "Z60/EN" in the same "Multilingual text" intance will be treated as an error, by the "Multilingual text" type validator). And this way, we no longer need to reserve large ranges: we can use as well coped ZIDs for Wikifunctions itself (e.g. builtin functions, builtin types...) and we no longer risk any collision. These scoped ZIDs then work like namespaces in programming languages. And we are still compatible with MediaWiki and no longer need to reinvent our own encoding (pseudo-)standard. This also allows different encoding standards to coexist, and allows implementing code conversions from one scope to another (when there's no ambiguity and codes are actually exact aliases), so that normalization is then possible as well (e.g. the "Z60" validator could recognize "Z60/ENG" as "valid", but implicitly converted to "Z60/en" by the normalization, by also accepting inputs using ISO 639-3 instead of just ISO 639-1 which is recommended in BCP47. Optionally, the "Z60" validator could recognize other locale identifiers such as Android's resource qualifiers like "zh-rCN" as being an alias of "zh-CN" in BCP47 and so normalizable to the "Z60/zh-CN" ZID). And then we get the best of all worlds! No more need to remember many numeric ZID codes specific to Wikifunctions. And a simple and efficient way to process ZIDs, notably with natural languages that will be used a lot with standard i18n libraries. Le sam. 27 mars 2021 à 09:52, Grounder UK <grounderuk(a)gmail.com> a écrit :

...

Some randomization is a better start than Latin-alphabetization. I don’t think anyone should be guessing what a ZID might be! I would look at scripts, aiming to have every supported script represented in the first block of reserved ZIDs. We should also reserve a few ZIDs for international and interlingual labels, as a “language”. This would include “en-GB” as a label for the object labelled “British English” in English, for example, or “m” for the SI unit labelled “mètre” in French. After that, we might choose to ensure that ISO 639-1 languages (184 with two-character codes, like “de”) have a lower ZID than other interface languages. This is because ISO 639-1 was intended to include the most common languages. No worries, in any event. Al. On Sat, 27 Mar 2021 at 00:35, Denny Vrandečić <dvrandecic(a)wikimedia.org> wrote:

To close this thread: The current proposal says to give Z1001-Z1006 to the big six UN languages, and then to assign Z1011ff. to all the other language codes based on the alphabetical order of their language codes. I was wondering if, for the languages not in the big six, instead of using the alphabetical order, if there was an appetite to using a randomized order Advantages of alphabetical order: it enables a bit of guessing and remembering (if you know what "uz" is, you'll be able to guess "uz-cyrl" and "uz-latn" with quite some confidence). Advantages of random order: no special weight to latin script and latinized names of languages. On Thu, Mar 11, 2021 at 3:51 PM Denny Vrandečić <dvrandecic(a)wikimedia.org> wrote:

Thanks for the thoughts, these are super useful! And yes, I sent that mail yesterday out with not enough context. Thad, thanks, LCID looks indeed very interesting! That could provide a source of numbers to draw from. But looking into it in detail, it also looks like "we took the English language names, and in that order assigned numbers, until we finished the first batch, and then added more batches chronologically". So at least there's precedent for doing that. Lucas, I was also thinking about QIDs. But besides the point that David raised, or rather in addition to them, there's also the point that Wikidata should be the 'pure representation' of what a language is - whereas the objects in Wikifunctions representing languages should malleable to exactly what we need and want them to be, irrelevant of their independent ontological status in the world (I hope that makes sense?). In Wikifunctions we decide what a language is, and their fallbacks, etc., based on product needs. In Wikidata on the other side we decide what is a language and what is not based on what relevant sources are stating. These hopefully overlap, and ideally are equivalent, but in reality I don't expect them to be and I don't want to introduce a push for edits in Wikidata to make certain features in Wikifunctions work. Yes, we should have mappings from the relevant ZIDs to QIDs, and/or the other way around, but that's why I think they shouldn't be the same. Charles, regarding your point, yes, we should be compatible and map to external standards as much as possible, but for the same reason as to regarding Wikidata, we shouldn't simply import them wholesale. I wrote down a few more thoughts on-wiki here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Representation_of_langua… On Thu, Mar 11, 2021 at 5:44 AM Charles Matthews via Abstract-Wikipedia < abstract-wikipedia(a)lists.wikimedia.org> wrote:

> On 11 March 2021 at 12:24 David Abián <davidabian(a)wikimedia.es> wrote: > > > Hi, > > Although we usually claim that Wikidata entities and their URIs are > stable, I believe that, unfortunately, their stability isn't as high as > desirable in practice, or at least the risk of deletion, merging, > redirection, redefinition of identity due to confusion or ambiguity, > etc. exists and materializes too often (I hope I'm not being read by > Wikidata haters :D). This happens with all kinds of Items, and it can > happen with minority languages as well as with something as widespread > as, for instance, Chinese. > On 11/03/2021 09:50, Lucas Werkmeister wrote: > > Wouldn’t it be better to refer to them using Wikidata item IDs? Not an area where I'm an expert. But there seem to be quite a number of standard IDs for languages, include some which are ISO, and P424 on Wikidata which is "Wikimedia language code". I would say the default solution would be Z-numbers, and a Wikidata property for WA ID so that cross-walking to any other standard ID is just a Wikidata query away. If what is intended is an exact fit with P424 then that would be duplication; but I suppose in the end that won't be the decision. Charles _______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

_______________________________________________

Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

_______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

Reply

Samuel Klein

29 Mar 29 Mar

6:35 p.m.

Aha, this feels much better to me, if plausible. And this is the sort of situation that will come up a lot. This sort of [subnamespace] mechanism could make ZIDs easier to remember, and the namespaces easier to parse, and limit the extent to which a high-cardinality subspace feels like an 'imposition' or skewing of the namespace as a whole. [since we may harbor fuzzy meta-concepts like wanting to see a random ID] On Sat, Mar 27, 2021 at 8:14 PM Philippe Verdy <verdyp(a)gmail.com> wrote:

...

And I still wonder why ZID's have to be entirely numeric with just the "Z" prefix. When in fact they are actually pagenames (on the wiki) that are also mapped to an internal pageid. ZID's are just unique identifiers (supposed to be short and easy to process, and probably only in ASCII). But I don't wee why ZID's could not be "scoped". For me the numeric form "Znnnn" is just the default form for "unscoped" ZID's. But ZID's that fall within the same registry, could very well have the form "Znnn/" (where "Znnn" is the "unscoped" parent ZID of a type) followed by the appropriate code from that type. In that case a single unscoped ZID ("Z60" ) is sufficient as the parent ZID for the type "natural language" encoded according to BCP47. And then "Z60/en" is the ZID of English. This way we don't "reinvent the wheel", and we can reuse existing standard registries of codes, by scoping them in a parent ZID associated to the type representing this registry standard (here BCP47 is the registry standard). To be able to use a "scoped" ZID (i.e. a ZID using subpages separated by "/"), the parent ZID (its base pagename in Mediawiki) just has to be a valid type in Wikifunctions (the parent type, here "Z60", may implement its own validator for the sub-ZID's it can process, i.e. check BCP47 conformance, and implement normalization, here the capitalization of codes as they lettercase is not significant, and the unification of "-" or "_" as per BCP47 rules, so that one cannot create non-conforming codes and so that the normalizer will also ensure that if one user enters the ZID "Z60/EN" in the JSON data to edit or import, it gets implicitly normalized to "Z60/en" by the validator implemented in the "Z60" type, and so that the presence in the same edit or import of both "Z60/en" and "Z60/EN" in the same "Multilingual text" intance will be treated as an error, by the "Multilingual text" type validator). And this way, we no longer need to reserve large ranges: we can use as well coped ZIDs for Wikifunctions itself (e.g. builtin functions, builtin types...) and we no longer risk any collision. These scoped ZIDs then work like namespaces in programming languages. And we are still compatible with MediaWiki and no longer need to reinvent our own encoding (pseudo-)standard. This also allows different encoding standards to coexist, and allows implementing code conversions from one scope to another (when there's no ambiguity and codes are actually exact aliases), so that normalization is then possible as well (e.g. the "Z60" validator could recognize "Z60/ENG" as "valid", but implicitly converted to "Z60/en" by the normalization, by also accepting inputs using ISO 639-3 instead of just ISO 639-1 which is recommended in BCP47. Optionally, the "Z60" validator could recognize other locale identifiers such as Android's resource qualifiers like "zh-rCN" as being an alias of "zh-CN" in BCP47 and so normalizable to the "Z60/zh-CN" ZID). And then we get the best of all worlds! No more need to remember many numeric ZID codes specific to Wikifunctions. And a simple and efficient way to process ZIDs, notably with natural languages that will be used a lot with standard i18n libraries. Le sam. 27 mars 2021 à 09:52, Grounder UK <grounderuk(a)gmail.com> a écrit :

Some randomization is a better start than Latin-alphabetization. I don’t think anyone should be guessing what a ZID might be! I would look at scripts, aiming to have every supported script represented in the first block of reserved ZIDs. We should also reserve a few ZIDs for international and interlingual labels, as a “language”. This would include “en-GB” as a label for the object labelled “British English” in English, for example, or “m” for the SI unit labelled “mètre” in French. After that, we might choose to ensure that ISO 639-1 languages (184 with two-character codes, like “de”) have a lower ZID than other interface languages. This is because ISO 639-1 was intended to include the most common languages. No worries, in any event. Al. On Sat, 27 Mar 2021 at 00:35, Denny Vrandečić <dvrandecic(a)wikimedia.org> wrote:

To close this thread: The current proposal says to give Z1001-Z1006 to the big six UN languages, and then to assign Z1011ff. to all the other language codes based on the alphabetical order of their language codes. I was wondering if, for the languages not in the big six, instead of using the alphabetical order, if there was an appetite to using a randomized order Advantages of alphabetical order: it enables a bit of guessing and remembering (if you know what "uz" is, you'll be able to guess "uz-cyrl" and "uz-latn" with quite some confidence). Advantages of random order: no special weight to latin script and latinized names of languages. On Thu, Mar 11, 2021 at 3:51 PM Denny Vrandečić < dvrandecic(a)wikimedia.org> wrote:

Thanks for the thoughts, these are super useful! And yes, I sent that mail yesterday out with not enough context. Thad, thanks, LCID looks indeed very interesting! That could provide a source of numbers to draw from. But looking into it in detail, it also looks like "we took the English language names, and in that order assigned numbers, until we finished the first batch, and then added more batches chronologically". So at least there's precedent for doing that. Lucas, I was also thinking about QIDs. But besides the point that David raised, or rather in addition to them, there's also the point that Wikidata should be the 'pure representation' of what a language is - whereas the objects in Wikifunctions representing languages should malleable to exactly what we need and want them to be, irrelevant of their independent ontological status in the world (I hope that makes sense?). In Wikifunctions we decide what a language is, and their fallbacks, etc., based on product needs. In Wikidata on the other side we decide what is a language and what is not based on what relevant sources are stating. These hopefully overlap, and ideally are equivalent, but in reality I don't expect them to be and I don't want to introduce a push for edits in Wikidata to make certain features in Wikifunctions work. Yes, we should have mappings from the relevant ZIDs to QIDs, and/or the other way around, but that's why I think they shouldn't be the same. Charles, regarding your point, yes, we should be compatible and map to external standards as much as possible, but for the same reason as to regarding Wikidata, we shouldn't simply import them wholesale. I wrote down a few more thoughts on-wiki here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Representation_of_langua… On Thu, Mar 11, 2021 at 5:44 AM Charles Matthews via Abstract-Wikipedia <abstract-wikipedia(a)lists.wikimedia.org> wrote: > > > On 11 March 2021 at 12:24 David Abián <davidabian(a)wikimedia.es> > wrote: > > > > > > Hi, > > > > Although we usually claim that Wikidata entities and their URIs are > > stable, I believe that, unfortunately, their stability isn't as high > as > > desirable in practice, or at least the risk of deletion, merging, > > redirection, redefinition of identity due to confusion or ambiguity, > > etc. exists and materializes too often (I hope I'm not being read by > > Wikidata haters :D). This happens with all kinds of Items, and it can > > happen with minority languages as well as with something as > widespread > > as, for instance, Chinese. > > > On 11/03/2021 09:50, Lucas Werkmeister wrote: > > > Wouldn’t it be better to refer to them using Wikidata item IDs? > > Not an area where I'm an expert. But there seem to be quite a number > of standard IDs for languages, include some which are ISO, and P424 on > Wikidata which is "Wikimedia language code". > > I would say the default solution would be Z-numbers, and a Wikidata > property for WA ID so that cross-walking to any other standard ID is just a > Wikidata query away. If what is intended is an exact fit with P424 then > that would be duplication; but I suppose in the end that won't be the > decision. > > Charles > > _______________________________________________ > Abstract-Wikipedia mailing list > Abstract-Wikipedia(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia > _______________________________________________

Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

_______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

-- Samuel Klein @metasj w:user:sj +1 617 529 4266

Reply

Grounder UK

31 Mar 31 Mar

10:56 a.m.

Thanks, James. As I said, “no worries, in any event.” I am happy for ZIDs to be allocated automatically as the need arises. Reserving a range for future needs is problematic but providing an initial set is not. The question is whether some languages should be identified as more important than others. My view is that this is unnecessary and may cause offence or fruitless debate. The same is likely to be true of any ordering. So I would go for wholly random. How we use or extend the initial set are questions for another day. However, I have already expressed an opinion, on-wiki, that the results of rendering functions should have different types depending, only in part, on the target language (and script, orthography, register etc). We should not assume that rendered content would have a single “language” tag, like Z11/monolingual text does. I am wary of aliases. Of course, Z12/multilingual text is one implementation of aliasing, and purely functional aliases are an inevitable second. A third system on top of those may be tricky. I think this goes back to the question of multi-lingually labelized objects. We need to interpret labels unambiguously but they are not globally unique, so we need label+context, where “language” is part (or all) of context. In context, “en-GB” (for example) is unambiguous not because of the “language” in context but because it refers to the same thing irrespective of that “language”. Making “en-GB” an alias in many or all languages is not simply a question of duplication, it is simply missing the point... But I do not propose we should duplicate Wikidata’s flaws. We just need a way for standardized nomenclatures to be handled independently, however incomplete their adoption in the real world may be. SI units are a good example because we should expect even quite simple functions to accept and return explicit units, not (just) the natural-language words for those units. (And, in passing, we should prefer to fall back to SI units rather than “en”, more often than not.) To be clear, this is not proposing an imposition; it is descriptive of how real people think and work, partially overcoming natural-language and cultural barriers. Conversions between units is another problem better left for later, but I would be wary of inferring units or formats from the contextual “language” (because we know that can’t work). Briefly, all we need to decide now is how to provide ZIDs for an agreed set of language–script–style combinations that already have unambiguous identifiers. If we agree not to use the existing identifiers directly, an easily explained and culturally neutral one-to-one mapping is all I hope for. We should not allow any one-to-many mapping without further discussion. For now, only Z11/monolingual text will use the new Z60 ZIDs and users will not be able to add Z60s. (I still believe that standardized codes should be accepted as labels without needing to be declared within each language, but I withdraw my proposal that there should be specific Z60s for the implied international “languages”. This implies some form of cross-language aliasing, but I hope that might be purely functional, and we can defer discussion of the details or continue elsewhere.) Regards, Al. On Tue, 30 Mar 2021 at 20:50, James Forrester <jforrester(a)wikimedia.org> wrote:

...

On Sat, 27 Mar 2021 at 01:52, Grounder UK <grounderuk(a)gmail.com> wrote:

I would look at scripts, aiming to have every supported script represented in the first block of reserved ZIDs.

Yes, we are definitely going to pre-create all ISO 639-3 languages (and all the MediaWiki content and interface languages, where they are disjoint) as part of the initial content injection before the wiki launches. However, new languages will come along over time, either as they get recognised by ISO or as the Wikifunctions community expands its ideas of and capacity towards for whom we will want to create content. Additionally, I expect that we will end up wanting to produce outputs tuned to different audiences within larger macro-groups of a "language"; for instance, "encyclopædic educational American English" might be a form that the community decides it wants to select as a 'house style' to better reflect the discursive, plain-spoken, authoritative manner, which could be the target of the natural language generation system(s) we create, and contains some explicit or implicit decisions around tone, mood, and so on (in this case, I imagine something along the lines of third person impersonal, no passive voice, indicative mood). Indeed, different conceptual areas might be better served by different target languages; the language used to talk about quantum mechanics might be inappropriately stilted and feel odd when talking about pop cultural items (and in reverse, the tone of pop culture coverage might well read poorly for quantum mechanics). There's also the often-considered idea around the Wikimedia world of the same content being available in different forms for readers of differing general abilities, specific aptitudes, and subject familiarities. Perhaps we might have "encyclopædic educational American English for ~10 year olds", "… for ~15 year olds", and "… for ~20 year olds", which could differ from each other a fair bit in linguistic construction, varying in terms of vocabulary, metre, and depth, for example, whilst still covering the same factual content. The language addressing framework that we create now should not preclude our community from doing as it sees fit in future, and not decide too much that binds our hands later (but also doesn't leave so much undecided that we cannot make headway now).

We should also reserve a few ZIDs for international and interlingual labels, as a “language”. This would include “en-GB” as a label for the object labelled “British English” in English, for example,

The language code for our internal use (which will indeed likely be "en"/"en-GB"/"en-GB-…" *etc.* in BCP47 style) will be stored as the key of the instance (Z60K1 <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Pre-generic_function_model#Z60/Language_(Z4/Type)>); the label "British English" would be stored as a label (Z2K3 <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Pre-generic_function_model#Z2/Persistent_object_(Z4/Type)

-> "en-GB"). If you mean the formal mapping of that to a given ISO code, we could add that as a key but I'm not sure that's something we'd generally need? We'd probably want to do as much of the mapping from Wikifunctions's languages to wider concepts on Wikidata, rather than mastering the content within the wiki. or “m” for the SI unit labelled “mètre” in French.

That's an alias for a unit label, which is currently handled in Wikidata on a per-language basis. I think we'd want to avoid replicating that data locally on Wikifunctions. Our knowledge about natural languages is so that we can label/describe/discuss functions at launch, and for the future, to inform forthcoming systems as output targets. More importantly, that kind of central decision is not something we should generally build for, as I believe it's very rare that such decisions would be appropriate for all languages. Where many languages have a shared abbreviation this can feel a bit duplicative, but I'd rather we avoid baking in general decisions about how languages will wish things to work. If said label/alias is appropriate in, for example, Pashto, I'd expect it to be added to the Pashto entry on Wikidata for metre; note that it's not there currently. If it were not appropriate, a different abbreviation could be added, assuming that "متر" is not generally considered short enough. But that would be for the Pashto speaking/contributing community to decide on a language basis, rather than imposing a centrally-controlled list of "international" labels that I/we/someone decides is appropriate for all target output formats for now and forever. For concepts which have a shared monolingual string at the Wikidata level (like species taxonomic name), that would be better handled by special functions for such concepts that would choose how to handle the label for the given language, rather than having it hard-coded as a source or target language object. Indeed, there also might be different output *formats* rather than just labels, so that American English (or perhaps "American English / Imperial Units"?) output would describe things in foot-pounds or other odd (but locally-expected) units, instead of imposing a single view for all consumers. This way text in basic English might use metres, in "American English" might use feet, in "Canadian English" would use both metres and feet, in Indian English might use metres with the crore/lakh notation, *etc.* Particular areas of content (like Physics) might be flagged to use particular language outputs (like metric units). After that, we might choose to ensure that ISO 639-1 languages (184 with > two-character codes, like “de”) have a lower ZID than other interface > languages. This is because ISO 639-1 was intended to include the most > common languages.

We could. Using the UN Six as the first full was intended to privilege the very most common languages, though of course through the lens of the 1945 geo-political settlement. Of course we could also do both. J. -- *James D. Forrester* (he/him <http://pronoun.is/he> or they/themself <http://pronoun.is/they/.../themself>) Wikimedia Foundation <https://wikimediafoundation.org/

_______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia

Reply