Hi all,
thanks for the two matrices and the input here. I am again inclined to let Daniel convince me to use multiple representations for the lemma and the forms, mostly because that is closest to Lemon, and I trust the research and expertise within Lemon. Thank you Philipp for chiming in!
There is one thing that worries me about the multi-lemma approach: the mentions of a discussion about ordering. If possible, I would suggest not having an ordering in every single Lexeme or even Form, but rather using the following solution:
If I understand it correctly, we won't let every Lexeme have every arbitrary language anyway, right? Instead, for each language that has variants, we will have an explicit list of these variants somewhere in the configuration, e.g. for English it will be US, British, etc., for Portuguese it will be Brazilian, European, etc.
Given that, we can define their ordering and their fallbacks in that very same place. There is no need to have that fought out on every single Lexeme. This will also reduce the complexity of the TermList solution, and thus bring Thiemo's decision matrix and Daniel's into alignment regarding their recommendations.
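To make the idea concrete, here is a minimal Python sketch of what such a central configuration could look like. Everything in it is an assumption made for illustration: the names `VARIANT_CONFIG`, `ordered_representations` and `pick_representation`, the choice of language tags as keys, and the fallback chains are all hypothetical and not part of Wikibase or Lemon.

```python
# Hypothetical central configuration: one entry per language that has variants.
# The order of "variants" is the display order; "fallbacks" lists which other
# variants to try when the requested one has no representation.
VARIANT_CONFIG = {
    "en": {
        "variants": ["en-US", "en-GB"],
        "fallbacks": {"en-US": ["en-GB"], "en-GB": ["en-US"]},
    },
    "pt": {
        "variants": ["pt-BR", "pt-PT"],
        "fallbacks": {"pt-BR": ["pt-PT"], "pt-PT": ["pt-BR"]},
    },
}


def ordered_representations(language, term_list):
    """Return (variant, text) pairs in the centrally configured order.

    `term_list` maps a variant tag to its spelling; it carries no order itself.
    """
    order = VARIANT_CONFIG[language]["variants"]
    return [(variant, term_list[variant]) for variant in order if variant in term_list]


def pick_representation(language, term_list, wanted):
    """Return the spelling for `wanted`, or for its first available fallback."""
    fallbacks = VARIANT_CONFIG[language]["fallbacks"].get(wanted, [])
    for variant in [wanted, *fallbacks]:
        if variant in term_list:
            return term_list[variant]
    return None  # no representation in the wanted variant or its fallbacks


# Usage: the Lexeme stores only an unordered TermList.
colour = {"en-GB": "colour", "en-US": "color"}
print(ordered_representations("en", colour))
print(pick_representation("en", {"en-GB": "colour"}, "en-US"))
```

The point of the sketch is that ordering and fallback live in one place, so an individual Lexeme or Form never needs to carry (or argue about) its own order.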
The upside is that it seems that this very same solution could also be used for languages with different scripts, like Serbian, Kazakh, and Uzbek (although it would not cover the problems with Chinese, but that wasn't solved previously either - so the situation is strictly better). (It doesn't really solve all problems - there is a reason why ISO treats language variants and scripts independently - but it improves on the vast majority of the problematic cases).
So, given that we drop any local ordering in the UI and API, I think that staying close to Lemon and choosing a TermList currently looks like the most promising approach, so I have changed my mind. My previous reservations still hold, and it will lead to some more complexity in the implementation, not only of Wikidata but also of tools built on top of it, but it seems that the advantages for the Wikidata contributors and a data model with better scientific support outweigh this.
I hope that makes sense.
Giving thanks,
Denny
On Tue, Nov 22, 2016 at 3:28 AM Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
On 22.11.2016 at 10:19, David Cuenca Tudela wrote:
There are many many words with multiple spellings, but not many words with more than two, and few with more than three [citation needed].
That is not true in languages with a large number of dialects. For instance, in Catalan there are 5 standard spellings for "carrot" depending on which dialect you choose, plus some more if you consider local variations: https://ca.wikipedia.org/wiki/Pastanaga
How does Lemon handle this? Does it provide some guidance on how to display a Form with many representations? Or is that simply left to the application?
You are right that dialects pose a problem here, since they often have multiple competing spellings (e.g. there's German Low-German and Dutch Low-German - mostly same vocabulary, different orthography).
Additionally, the same form can have different meanings depending on which dialect you choose. For instance "pastenaga" means "orange carrot" in Catalan from Catalonia, and "purple carrot" in Catalan from Valencia.
Which makes me wonder: how will dialects be handled? As statements?
This is up to the community. I suppose it will depend on the individual case. Sometimes, it will be more useful to have a separate lexeme. Sometimes, you'd have multiple representations (lemmas), plus representation statements with qualifiers.
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech