Hi there!

1) A possible solution would be to introduce another category of items ("Gxxx", for grammatical rule?) to store grammatical structures such as "Noun + verb + object" or "Noun + reflexive verb", and then to link to that structure with a qualifier stating the position the word occupies in it. Example:

"to shit" <grammatical structure required> "Subject + reflexive verb + reflexive pronoun"

<role in grammatical structure> "reflexive verb"
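
To make the idea a bit more concrete, here is a rough sketch in Python of how such a statement with its qualifier might look. Everything in it is an assumption for illustration only: the "G1" identifier, the class names and the helper `require_structure` are made up, and the two property names are simply the ones from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class GrammaticalStructure:
    """A hypothetical "Gxxx" item: an ordered list of roles."""
    id: str
    slots: Tuple[str, ...]

    def render(self) -> str:
        # Concise, dictionary-like rendering of the structure.
        return " + ".join(self.slots)


@dataclass
class Statement:
    prop: str
    value: GrammaticalStructure
    qualifiers: Dict[str, str] = field(default_factory=dict)


def require_structure(structure: GrammaticalStructure, role: str) -> Statement:
    """Build the statement; the qualifier must name a slot of the structure."""
    if role not in structure.slots:
        raise ValueError(f"{role!r} is not a slot of {structure.id}")
    return Statement(
        prop="grammatical structure required",  # assumed property name
        value=structure,
        qualifiers={"role in grammatical structure": role},
    )


# "G1" is an invented identifier, used only for this example.
reflexive = GrammaticalStructure(
    "G1", ("Subject", "reflexive verb", "reflexive pronoun")
)
stmt = require_structure(reflexive, "reflexive verb")
print(stmt.value.render())  # Subject + reflexive verb + reflexive pronoun
```

Checking the qualifier against the slots of the structure is a design choice of this sketch; it keeps a role from pointing at a position the structure does not have.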

2) I would prefer statements, since they can be complemented with qualifiers explaining why a word has a certain spelling (geographical variant, old usage, corruption...). It would be nice, however, to have some mechanism for a special kind of property whose value is used as an item alias. This could benefit normal items in Wikidata too: the values of most name properties, such as P1448 and P1477 (official name, birth name, etc.), should automatically show up as aliases of the item in all languages, if that is technically feasible.
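
As a rough illustration of the alias mechanism, here is a small Python sketch. It is not how Wikibase works today; the function `effective_aliases`, the set `ALIAS_PROPERTIES` and the sample names are assumptions invented for the example. Only the property IDs P1448 and P1477 come from the text above.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: properties flagged as "alias-yielding".
ALIAS_PROPERTIES: Set[str] = {"P1448", "P1477"}


def effective_aliases(
    stored_aliases: Iterable[str],
    statements: Iterable[Tuple[str, str]],
    alias_properties: Set[str] = ALIAS_PROPERTIES,
) -> List[str]:
    """Stored aliases first, then values of alias-yielding properties.

    Duplicates are dropped and the original order is kept.
    """
    result: List[str] = []
    seen: Set[str] = set()
    for alias in stored_aliases:
        if alias not in seen:
            seen.add(alias)
            result.append(alias)
    for prop, value in statements:
        if prop in alias_properties and value not in seen:
            seen.add(value)
            result.append(value)
    return result


# Invented sample data; "P9999" stands for any property not flagged as alias-yielding.
statements = [
    ("P1477", "Jane Example"),
    ("P9999", "not a name"),
    ("P1448", "J. Example"),
    ("P1448", "Jane Example"),  # duplicate value, ignored
]
print(effective_aliases(["Jane E."], statements))
# ['Jane E.', 'Jane Example', 'J. Example']
```

Because the derived aliases are computed from statements rather than stored per language, the same list would be available in every language, which is the point of the suggestion.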

Cheers,

Micru

On Fri, Nov 11, 2016 at 6:03 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:

Hi all!

There are two questions about modelling lexemes that are bothering me. One is an
old question, and one I only came across recently.

1) The question that came up for me recently is how we model the grammatical
context for senses. For instance, "to ask" can mean requesting information, or
requesting action, depending on whether we use "ask somebody about" or "ask
somebody to". Similarly, "to shit" has entirely different meanings when used
reflexively ("I shit myself").

There is no good place for this in our current model. The information could be
placed in a statement on the word Sense, but that would be kind of non-obvious,
and would not (at least not easily) allow for a concise rendering, in the way we
see it in most dictionaries ("to ask sbdy to do sthg"). The alternative would be
to treat each usage with a different grammatical context as a separate Lexeme (a
verb phrase Lexeme), so "to shit oneself" would be a separate lemma. That could
lead to a fragmentation of the content in a way that is quite unexpected to
people used to traditional dictionaries.

We could also add this information as a special field in the Sense entity, but I
don't even know what that field should contain, exactly.

Got a better idea?

2) The older question is how we handle different renderings (spellings, scripts)
of the same lexeme. In English we have "color" vs "colour", in German we have
"stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
Cyrillic rendering for every word. We can treat these as separate Lexemes, but
that would mean duplicating all information about them. We could have a single
Lemma, and represent the others as alternative Forms or as statements on the
Lexeme. But that raises the question of which spelling or script should be the
"main" one, and used in the lemma.

I would prefer to have multi-variant lemmas. They would work like the
multi-lingual labels we have now on items, but restricted to the variants of a
single language. For display, we would apply a language fallback mechanism
similar to the one we now apply when showing labels.
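
A minimal sketch of what such a fallback could look like, in Python. The variant codes, the contents of `FALLBACK_CHAINS` and the function `pick_lemma` are assumptions made up for illustration; this is not the existing label fallback code.

```python
from typing import Dict, List, Optional

# Assumed fallback chains: each variant lists the codes to try, in order.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "en-gb": ["en-gb", "en"],
    "en-us": ["en-us", "en"],
}


def pick_lemma(
    lemmas: Dict[str, str],
    variant: str,
    chains: Dict[str, List[str]] = FALLBACK_CHAINS,
) -> Optional[str]:
    """Return the lemma spelling to display for the requested variant."""
    for code in chains.get(variant, [variant]):
        if code in lemmas:
            return lemmas[code]
    # Last resort: any available variant, in insertion order.
    return next(iter(lemmas.values()), None)


# Which spelling is filed under which code is an assumption of this example.
lemmas = {"en": "color", "en-gb": "colour"}

print(pick_lemma(lemmas, "en-gb"))  # colour
print(pick_lemma(lemmas, "en-us"))  # color
```

No variant is the "main" one in this sketch; the lemma is just a map from variant code to spelling, and the display side decides which entry to show.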

2b) If we treat lemmas as multi-variant, should Forms also be multi-variant, or
should they be per-variant? Should the gloss of a Sense be multi-variant? I
currently tend towards "yes" for all of the above.

What do you think?

--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech

Etiamsi omnes, ego non