If we want to avoid this complexity, we could just go by prefix. So if the language is "de", variants like "de-CH" or "de-DE_old" would be considered ok. Ordering these alphabetically would put the "main" code (with no suffix) first. May be ok for a start.
I find this issue potentially controversial, and I think that the community at large should be involved in this matter to avoid future dissatisfaction and to promote involvement in the decision-making.
For languages there are regulatory bodies that assign codes, but for varieties that is not the case, or at least not entirely. Even under en-GB there are many varieties and dialects: https://en.wikipedia.org/wiki/List_of_dialects_of_the_English_language#Unite...
In my opinion it would be more appropriate to use standardized language codes and then specify the dialect with an Item, as that provides greater flexibility. However, as mentioned before, I would prefer that this topic in particular be discussed with Wiktionarians.
Thanks for moving this forward!
David
On Fri, Nov 25, 2016 at 11:45 AM, Daniel Kinzler < daniel.kinzler@wikimedia.de> wrote:
Thank you Denny for having an open mind! And sorry for being a nuisance ;)
I think it's very important to have controversial but constructive discussions about these things. Data models are very hard to change even slightly once people have started to create and use the data. We need to try hard to get it as right as possible off the bat.
Some remarks inline below.
On 25.11.2016 at 03:32, Denny Vrandečić wrote:
There is one thing that worries me about the multi-lemma approach, and that is the mention of a discussion about ordering. If possible, I would suggest not to have ordering in every single Lexeme or even Form, but rather to use the following solution:
If I understand it correctly, we won't let every Lexeme have every arbitrary language anyway, right? Instead we will, for each language that has variants, have somewhere in the configurations an explicit list of these variants, i.e. say, for English it will be US, British, etc., for Portuguese Brazilian and Portuguese, etc.
That approach is similar to what we are now doing for sorting Statement groups on Items. There is a global ordering of properties defined on a wiki page. So the community can still fight over it, but only in one place :) We can re-order based on user preference using a Gadget.
For the multi-variant lemmas, we need to declare the Lexeme's language separately, in addition to the language code associated with each lemma variant. It seems like the language will probably be represented as a reference to a Wikidata Item (that is, a Q-Id). That Item can be associated with an (ordered) list of matching language codes, via Statements on the Item, or via configuration (or, like we do for unit conversion, configuration generated from Statements on Items).
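To make the configuration option concrete, here is a minimal sketch in Python. It assumes a hand-written mapping from the Lexeme's language Item to an ordered list of codes; the name VARIANTS_BY_LANGUAGE_ITEM, the function order_lemmas and the code lists are made up for illustration and are not part of Wikibase.

```python
# Hypothetical configuration: Lexeme language Item -> ordered list of lemma
# language codes. The code lists are invented for illustration only.
VARIANTS_BY_LANGUAGE_ITEM = {
    "Q1860": ["en", "en-GB", "en-US"],  # English
    "Q188": ["de", "de-CH"],            # German
}


def order_lemmas(language_item, lemmas):
    """Return (code, text) pairs in the configured order.

    Codes missing from the configuration are placed last, alphabetically.
    """
    configured = VARIANTS_BY_LANGUAGE_ITEM.get(language_item, [])
    rank = {code: position for position, code in enumerate(configured)}
    return sorted(
        lemmas.items(),
        key=lambda pair: (rank.get(pair[0], len(configured)), pair[0]),
    )


if __name__ == "__main__":
    print(order_lemmas("Q1860", {"en-US": "color", "en-GB": "colour"}))
```

The ordering lives in one place, so individual Lexemes do not need to carry any ordering of their own.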
If we want to avoid this complexity, we could just go by prefix. So if the language is "de", variants like "de-CH" or "de-DE_old" would be considered ok. Ordering these alphabetically would put the "main" code (with no suffix) first. May be ok for a start.
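A rough sketch of the prefix rule in Python, assuming that a variant is the base code followed by "-" and a suffix (this separator check is an assumption of the sketch; a plain string-prefix test would also accept unrelated codes that merely start with the same letters). The function names are hypothetical.

```python
def is_allowed_variant(lexeme_language, code):
    """Accept the base code itself, or the base code plus a '-' suffix."""
    return code == lexeme_language or code.startswith(lexeme_language + "-")


def order_variants(codes):
    """Plain alphabetical order; a bare base code sorts before its suffixed variants."""
    return sorted(codes)


if __name__ == "__main__":
    codes = ["de-DE_old", "de", "de-CH"]
    print([c for c in codes if is_allowed_variant("de", c)])
    print(order_variants(codes))
```

Because the base code is a strict prefix of every variant, ordinary string sorting already puts it first, so no extra rule is needed for the "main" code.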
I'm not sure yet on what level we want to enforce the restriction on language codes. We can do it just before saving new data (the "validation" step), or we could treat it as a community enforced soft constraint. I'm tending towards the former, though.
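The two enforcement levels could look roughly like this in Python. This is only a sketch: check_lemma_codes and LanguageCodeError are hypothetical names, and the set of allowed codes is assumed to come from wherever the restriction is defined.

```python
class LanguageCodeError(ValueError):
    """Raised when a lemma uses a language code that is not allowed."""


def check_lemma_codes(allowed, lemmas, strict=True):
    """Return the offending codes, sorted.

    strict=True models the validation step: saving is refused.
    strict=False models a soft constraint: the caller only gets a report.
    """
    offending = sorted(code for code in lemmas if code not in allowed)
    if offending and strict:
        raise LanguageCodeError("language codes not allowed: " + ", ".join(offending))
    return offending


if __name__ == "__main__":
    lemmas = {"de": "Straße", "fr": "rue"}
    print(check_lemma_codes({"de", "de-CH"}, lemmas, strict=False))
```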
Given that, we can in that very same place also define their ordering and their fallbacks.
Well, all lemmas would fall back on each other; the question is just which ones should be preferred. Simple heuristic: prefer the shortest language code. Or go by what MediaWiki does for the UI (which is what we do for Item labels).
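The "shortest code" heuristic can be sketched in a few lines of Python. The function name pick_lemma is hypothetical, the alphabetical tie-break is an assumption of this sketch, and it does not model MediaWiki's own fallback chains.

```python
def pick_lemma(lemmas, preferred=None):
    """Pick one lemma variant from a non-empty {language code: text} mapping.

    Use the preferred code if present; otherwise fall back to the shortest
    code, breaking ties alphabetically.
    """
    if preferred in lemmas:
        return preferred, lemmas[preferred]
    code = min(lemmas, key=lambda c: (len(c), c))
    return code, lemmas[code]


if __name__ == "__main__":
    lemmas = {"de-CH": "Strasse", "de": "Straße"}
    print(pick_lemma(lemmas, preferred="de-CH"))
    print(pick_lemma(lemmas, preferred="en"))
```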
The upside is that it seems that this very same solution could also be used for languages with different scripts, like Serbian, Kazakh, and Uzbek (although it would not cover the problems with Chinese, but that wasn't solved previously either - so the situation is strictly better). (It doesn't really solve all problems - there is a reason why ISO treats language variants and scripts independently - but it improves on the vast majority of the problematic cases).
Yes, it's not the only decision we have to make in this regard, but the most fundamental one, I think.
One consequence of this is that Forms should probably also allow multiple representations/spellings. This is for consistency with the lemma, for code re-use, and for compatibility with Lemon.
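As a sketch of what that consistency could mean, here both the lemma and the Form representations use the same term-list shape (language code to string), so the same helper works for either. The class and field names are hypothetical and are not the actual Wikibase data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A term list maps a language code to one spelling.
TermList = Dict[str, str]


@dataclass
class Form:
    representations: TermList


@dataclass
class Lexeme:
    language_item: str  # Q-Id of the Lexeme's language
    lemmas: TermList
    forms: List[Form] = field(default_factory=list)


def spellings(terms: TermList) -> List[str]:
    """Return the spellings ordered by language code; usable for lemmas and Forms alike."""
    return [terms[code] for code in sorted(terms)]


if __name__ == "__main__":
    lexeme = Lexeme(
        language_item="Q1860",
        lemmas={"en-GB": "colour", "en-US": "color"},
        forms=[Form({"en-GB": "colours", "en-US": "colors"})],
    )
    print(spellings(lexeme.lemmas))
    print(spellings(lexeme.forms[0].representations))
```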
So, given that we drop any local ordering in the UI and API, I think that staying close to Lemon and choosing a TermList seems currently like the most promising approach to me, and I changed my mind.
Knowing that you won't do that without a good reason, I thank you for the compliment :)
My previous reservations still hold, and it will lead to some more complexity in the implementation not only of Wikidata but also of tools built on top of it,
The complexity of handling a multi-variant lemma is higher than a single string, but any wikibase client already needs to have the relevant code anyway, to handle item labels. So I expect little overhead. We'll want the lemma to be represented in a more compact way in the UI than we currently use for labels, though.
Thank you all for your help!
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech