I find this issue potentially controversial, and I think that the community
at large should be involved in this matter to avoid future dissatisfaction
and to promote involvement in the decision-making.
For languages there are regulatory bodies that assign codes, but for
varieties that is not the case, or at least not entirely. Even under en-GB
there are many varieties and dialects:
https://en.wikipedia.org/wiki/List_of_dialects_of_the_English_language#Unit…
In my opinion it would be more appropriate to use standardized language
codes, and then specify the dialect with an item, as that provides greater
flexibility. However, as mentioned above, I would prefer that this topic in
particular be discussed with Wiktionarians.
Thanks for moving this forward!
David
On Fri, Nov 25, 2016 at 11:45 AM, Daniel Kinzler <
daniel.kinzler(a)wikimedia.de> wrote:
> Thank you Denny for having an open mind! And sorry for being a nuisance ;)
>
> I think it's very important to have controversial but constructive
> discussions about these things. Data models are very hard to change even
> slightly once people have started to create and use the data. We need to
> try hard to get it as right as possible off the bat.
>
> Some remarks inline below.
>
> On 25.11.2016 at 03:32, Denny Vrandečić wrote:
> > There is one thing that worries me about the multi-lemma approach, and
> > that is the mentions of a discussion about ordering. If possible, I would
> > suggest not to have ordering in every single Lexeme or even Form, but
> > rather to use the following solution:
> >
> > If I understand it correctly, we won't let every Lexeme have every
> > arbitrary language anyway, right? Instead we will, for each language that
> > has variants, have somewhere in the configuration an explicit list of
> > these variants, that is, for English it will be US, British, etc., and for
> > Portuguese it will be Brazilian and Portuguese, etc.
>
> That approach is similar to what we are now doing for sorting Statement
> groups on Items. There is a global ordering of properties defined on a wiki
> page. So the community can still fight over it, but only in one place :) We
> can re-order based on user preference using a Gadget.
>
> For the multi-variant lemmas, we need to declare the Lexeme's language
> separately, in addition to the language code associated with each lemma
> variant. It seems like the language will probably be represented as a
> reference to a Wikidata Item (that is, a Q-Id). That Item can be associated
> with an (ordered) list of matching language codes, via Statements on the
> Item, or via configuration (or, like we do for unit conversion,
> configuration generated from Statements on Items).
>
> If we want to avoid this complexity, we could just go by prefix. So if the
> language is "de", variants like "de-CH" or "de-DE_old" would be considered
> ok. Ordering these alphabetically would put the "main" code (with no
> suffix) first. May be ok for a start.
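
As a rough illustration of the prefix rule quoted above, here is a minimal
sketch in Python. It is not Wikibase code: the function names
is_allowed_variant and order_variants are made up for this example, and it
assumes that variant codes extend the base code with a "-" separator, as in
the "de-CH" and "de-DE_old" examples.

```python
def is_allowed_variant(lexeme_language, code):
    """Accept the bare language code, or any code extending it after a '-'.

    Hypothetical helper: the '-' separator is an assumption taken from the
    "de-CH" / "de-DE_old" examples in the thread.
    """
    base = lexeme_language.lower()
    candidate = code.lower()
    return candidate == base or candidate.startswith(base + "-")


def order_variants(codes):
    """Sort codes alphabetically (case-insensitive).

    The bare code is a prefix of every variant, so it sorts first.
    """
    return sorted(codes, key=str.lower)


variants = ["de-CH", "de-DE_old", "de"]
accepted = [c for c in variants if is_allowed_variant("de", c)]
ordered = order_variants(accepted)
```

Requiring the separator keeps an unrelated code that merely starts with the
same letters (for example "den") from being accepted for "de".
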
>
> I'm not sure yet on what level we want to enforce the restriction on
> language codes. We can do it just before saving new data (the "validation"
> step), or we could treat it as a community-enforced soft constraint. I'm
> tending towards the former, though.
>
> > Given that, we can in that very same place also define their ordering and
> > their fallbacks.
>
> Well, all lemmas would fall back on each other, the question is just which
> ones should be preferred. Simple heuristic: prefer the shortest language
> code. Or go by what MediaWiki does for the UI (which is what we do for Item
> labels).
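
The "shortest language code" heuristic quoted above could look roughly like
this. Again a sketch under assumptions: preferred_lemma is a made-up name,
the lemma variants are assumed to be available as a mapping from language
code to spelling, and breaking ties alphabetically is my own addition, since
the heuristic as stated does not say what to do with codes of equal length.

```python
def preferred_lemma(lemmas):
    """Pick the lemma variant whose language code is shortest.

    `lemmas` maps language code -> spelling (hypothetical structure).
    Ties on length are broken alphabetically, which is an assumption
    made here only to keep the result deterministic.
    """
    code = min(lemmas, key=lambda c: (len(c), c))
    return code, lemmas[code]


example = {"en-GB": "colour", "en": "color", "en-US": "color"}
choice = preferred_lemma(example)
```
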
>
> > The upside is that it seems that this very same solution could also be
> > used for languages with different scripts, like Serbian, Kazakh, and Uzbek
> > (although it would not cover the problems with Chinese, but that wasn't
> > solved previously either - so the situation is strictly better). (It
> > doesn't really solve all problems - there is a reason why ISO treats
> > language variants and scripts independently - but it improves on the vast
> > majority of the problematic cases).
>
> Yes, it's not the only decision we have to make in this regard, but the
> most fundamental one, I think.
>
> One consequence of this is that Forms should probably also allow multiple
> representations/spellings. This is for consistency with the lemma, for code
> re-use, and for compatibility with Lemon.
>
> > So, given that we drop any local ordering in the UI and API, I think that
> > staying close to Lemon and choosing a TermList seems currently like the
> > most promising approach to me, and I changed my mind.
>
> Knowing that you won't do that without a good reason, I thank you for the
> compliment :)
>
> > My previous reservations still hold, and it will lead to some more
> > complexity in the implementation not only of Wikidata but also of tools
> > built on top of it,
>
> The complexity of handling a multi-variant lemma is higher than a single
> string, but any Wikibase client already needs to have the relevant code
> anyway, to handle item labels. So I expect little overhead. We'll want the
> lemma to be represented in a more compact way in the UI than we currently
> use for labels, though.
>
>
> Thank you all for your help!
>
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> _______________________________________________
> Wikidata-tech mailing list
> Wikidata-tech(a)lists.wikimedia.org
>
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
>
--
Etiamsi omnes, ego non