Ugh, tough ones. I hope someone with a computational linguistics background will chime in, or check the Lemon model for answers.

I put my answers in-line.

On Fri, Nov 11, 2016 at 9:03 AM Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
1) The question that came up for me recently is how we model the grammatical
context for senses. For instance, "to ask" can mean requesting information, or
requesting action, depending on whether we use "ask somebody about" or "ask
somebody to". Similarly, "to shit" has entirely different meanings when used
reflexively ("I shit myself").


Not only that. "I shit myself" is very different from "Don't shit yourself". It is not just the reflexivity; it might be the whole phrase.

Looking at https://en.wiktionary.org/wiki/ask, we currently do not have the word "about" on this page. We have a list of different senses, each with usage examples, and that would work well in the current model. Indeed, the question is whether "ask somebody about" belongs here at all - "ask somebody their age" or "ask somebody for the way" work equally well.

Looking at https://en.wiktionary.org/wiki/shit#Verb, the reflexive form is indeed mentioned on its own page: https://en.wiktionary.org/wiki/shit_oneself#English - I guess that would indicate its own Lexeme?
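
If it did become its own Lexeme, the connection between the two entries could still be expressed as an ordinary statement. A sketch, with invented IDs and an invented "derived from" property:

    # Two hypothetical Lexemes: the plain verb and the reflexive
    # multi-word form, linked by an invented "P-DERIVED-FROM" property.
    lexeme_shit = {"id": "L200", "lemma": "shit", "category": "verb"}
    lexeme_shit_oneself = {
        "id": "L201",
        "lemma": "shit oneself",
        "category": "verb",
        "statements": {"P-DERIVED-FROM": ["L200"]},
    }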


We could also add this information as a special field in the Sense entity, but I
don't even know what that field should contain, exactly.

Just a usage example on the sense? That would often be enough to express the proposition.
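
To make that concrete, a usage example could simply hang off the Sense as an ordinary statement. A rough sketch; the IDs and the P-EXAMPLE property are invented for illustration, not the actual data model:

    # A hypothetical Sense of "ask", with usage examples attached as an
    # ordinary statement. "P-EXAMPLE" is an invented property ID.
    sense_ask_for_action = {
        "id": "L100-S2",  # hypothetical Sense ID
        "glosses": {"en": "to request somebody to do something"},
        "statements": {
            "P-EXAMPLE": [
                "I asked her to open the window.",
                "He asked me to stay.",
            ],
        },
    }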


2) The older question is how we handle different renderings (spellings, scripts)
of the same lexeme. In English we have "color" vs "colour", in German we have
"stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
Cyrillic rendering for every word. We can treat these as separate Lexemes, but
that would mean duplicating all information about them. We could have a single
Lemma, and represent the others as alternative Forms, or using statements on the
Lexeme. But that raises the question of which spelling or script should be the
"main" one and be used in the lemma.

I would prefer to have multi-variant lemmas. They would work like the
multi-lingual labels we have now on items, but restricted to the variants of a
single language. For display, we would apply a language fallback mechanism
similar to the one we now apply when showing labels.


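As I understand the proposal, such a lemma would look roughly like the multilingual label maps we have on Items today, keyed by language variant, with a fixed fallback chain for display. Here is a sketch; the variant codes and the fallback order are invented for illustration:

    # Sketch of a multi-variant lemma and of its display logic. The
    # variant codes and the fallback order are made up for illustration.
    lemma_color = {
        "en-us": "color",
        "en-gb": "colour",
    }

    def display_lemma(lemma, variant, fallback_chain=("en-us", "en-gb")):
        """Return the requested variant, or fall back along a fixed chain."""
        if variant in lemma:
            return lemma[variant]
        for code in fallback_chain:
            if code in lemma:
                return lemma[code]
        raise KeyError("no displayable variant")
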
I am not a fan of multi-variant lemmas. I would prefer to have either separate Lexemes or alternative Forms. Yes, there will be duplication in the data, but duplication is expected already, and since the data is machine-readable, it can easily be checked and bot-ified.
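
For comparison, here is a sketch of the alternative-Forms approach; the property ID for the spelling variant is again invented:

    # "color"/"colour" as two Forms of one Lexeme. Each Form carries a
    # hypothetical "P-VARIANT" statement naming its spelling variant.
    lexeme_color = {
        "id": "L300",
        "lemma": "color",  # one single lemma per Lexeme, as argued below
        "forms": [
            {"id": "L300-F1", "representation": "color",
             "statements": {"P-VARIANT": ["American English"]}},
            {"id": "L300-F2", "representation": "colour",
             "statements": {"P-VARIANT": ["British English"]}},
        ],
    }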

Also, this is how Wiktionary works today:
https://en.wiktionary.org/wiki/colour 
https://en.wiktionary.org/wiki/color 

Notice that there is no primacy of either.

Having multi-variant lemmas seems to complicate the situation a lot. I think it is important to have only one single Lemma for each Lexeme, in order to keep the display logic simple - display logic that will matter not only in Wikidata but also in tools like the query service and in every other place that shows the data. Multi-variant lemmas are a good idea for entities that you look at in a specific language - like Wikidata's display of Items - but they are a bad idea for lexical data.

Examples of why this is bad: how would you say that the British English version is the same as the American English one? With fallback, you simply would not duplicate it. But then what is the difference between an entry that lacks a BE variant in order to reduce redundancy and an entry that lacks a BE variant because it has not been entered yet? Statements and Forms, or a separate Lexeme, would both solve that issue. Lemmas do not have the capability and flexibility of statements.
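
To make that concrete: with statements, absence and explicit identity are distinguishable. A sketch, with another invented property:

    # An explicit (invented) "P-SAME-SPELLING-IN" statement asserts that
    # the BE spelling is identical; its absence means "not entered yet".
    form_color = {
        "id": "L300-F1",
        "representation": "color",
        "statements": {
            "P-VARIANT": ["American English"],
            "P-SAME-SPELLING-IN": ["British English"],
        },
    }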

How do you determine the primacy of the American or the British English version? The fallback order would be written into the code base; it would not be amenable to community editing through the wiki.
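
(In the display_lemma sketch above, that primacy decision is exactly the hard-coded fallback_chain default: it lives in the code base, not in the wiki, and a community edit cannot change it.)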

Whether separate Lexemes or alternative Forms are better may differ from language to language and from case to case. By hard-coding multi-variant lemmas, you not only pre-decide each case, but also make the code and the data model much more complicated - and not only for the initial development, but in perpetuity, whenever the data is used.

What do you think?


We shouldn't push for perfection and try to cover everything from the beginning. I expect that, with the lexical information in the data, Wikidata will continue to evolve. If not every case can be modeled ideally, but we can capture 99.9% of them - well, that is enough to get started, and we can see later how the exceptions will be handled. Also, there is always Wiktionary as the layer on top of Wikidata, which can actually resolve these issues easily anyway.

Once we have the simple pieces working, we can actually try to understand where the machinery is creaking and not working well, and then think about these issues. But until then I would prefer to keep the system as dumb and simple as possible.

Hope that makes sense,
Denny