Hi there!
1) A possible solution would be to have another category of items ("Gxxx", for grammatical rules?) to store grammatical structures, like "Noun + verb + object" or "Noun + reflexive verb", and then link to that structure with a qualifier for the position the word occupies in it. Example: "to shit" <grammatical structure required> "Subject + reflexive verb + reflexive pronoun" <role in grammatical structure> "reflexive verb"
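To make the idea a bit more concrete, here is a rough Python sketch of how such a link could be modelled. Everything in it is hypothetical: the class names, the "G1" identifier and the render() helper are only for illustration and are not part of any existing Wikibase data model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GrammaticalStructure:
    """Hypothetical 'Gxxx' item: an ordered sequence of named slots."""
    id: str
    slots: Tuple[str, ...]


@dataclass(frozen=True)
class StructureStatement:
    """Statement <grammatical structure required> with a role qualifier."""
    structure: GrammaticalStructure
    role: str  # qualifier <role in grammatical structure>

    def __post_init__(self) -> None:
        # The qualifier must name one of the slots of the linked structure.
        if self.role not in self.structure.slots:
            raise ValueError(f"{self.role!r} is not a slot of {self.structure.id}")


def render(lemma: str, statement: StructureStatement) -> str:
    """Show the structure with the lemma placed in the slot it fills."""
    return " + ".join(
        f"[{lemma}]" if slot == statement.role else slot
        for slot in statement.structure.slots
    )


# "G1" is a made-up identifier for the reflexive structure from the example.
reflexive = GrammaticalStructure(
    "G1", ("Subject", "reflexive verb", "reflexive pronoun")
)
usage = StructureStatement(reflexive, "reflexive verb")
print(render("to shit", usage))
```

Keeping the structure as a separate item means that many verbs can point to the same structure, and the role qualifier says which slot the word fills.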
2) I would prefer statements, as they can be complemented with qualifiers explaining why a word has a certain spelling (geographical variant, old usage, corruption...). It would be nice, however, if there were some mechanism for a special kind of property that uses its value as an item alias. This is something that could benefit normal items in Wikidata too, as most name properties like P1448, P1477 (official name, birth name, etc.) should have their values automatically shown as aliases of the item in all languages, if that were technologically feasible.
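As a rough illustration of the alias idea, the sketch below computes the aliases of an item as the union of its stored aliases and the values of selected name properties. The Item class, the effective_aliases() helper and the example values are hypothetical, and statement values are simplified to plain strings with no language attached.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Properties whose values should double as aliases (an assumed selection).
ALIAS_PROPERTIES = frozenset({"P1448", "P1477"})  # official name, birth name


@dataclass
class Item:
    """Simplified stand-in for an item: stored aliases plus statements."""
    aliases: Set[str] = field(default_factory=set)
    statements: Dict[str, List[str]] = field(default_factory=dict)


def effective_aliases(item: Item) -> Set[str]:
    """Stored aliases plus every value of an alias-like property."""
    derived = {
        value
        for prop, values in item.statements.items()
        if prop in ALIAS_PROPERTIES
        for value in values
    }
    return item.aliases | derived


# Placeholder values; P569 (date of birth) is not alias-like and is ignored.
example = Item(
    aliases={"Example Alias"},
    statements={"P1477": ["Example Birth Name"], "P569": ["1900-01-01"]},
)
print(sorted(effective_aliases(example)))
```

Computing the aliases at read time, instead of copying values into the alias list, would keep the statement as the single place to edit.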
Cheers, Micru
On Fri, Nov 11, 2016 at 6:03 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Hi all!
There are two questions about modelling lexemes that are bothering me. One is an old question, and one I only came across recently.
- The question that came up for me recently is how we model the grammatical context for senses. For instance, "to ask" can mean requesting information, or requesting action, depending on whether we use "ask somebody about" or "ask somebody to". Similarly, "to shit" has entirely different meanings when used reflexively ("I shit myself").
There is no good place for this in our current model. The information could be placed in a statement on the word Sense, but that would be kind of non-obvious, and would not (at least not easily) allow for a concise rendering, in the way we see it in most dictionaries ("to ask sbdy to do sthg"). The alternative would be to treat each usage with a different grammatical context as a separate Lexeme (a verb phrase Lexeme), so "to shit oneself" would be a separate lemma. That could lead to a fragmentation of the content in a way that is quite unexpected to people used to traditional dictionaries.
We could also add this information as a special field in the Sense entity, but I don't even know what that field should contain, exactly.
Got a better idea?
- The older question is how we handle different renderings (spellings, scripts) of the same lexeme. In English we have "color" vs "colour", in German we have "stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and Cyrillic rendering for every word. We can treat these as separate Lexemes, but that would mean duplicating all information about them. We could have a single Lemma, and represent the others as alternative Forms, or using statements on the Lexeme. But that raises the question of which spelling or script should be the "main" one, and used in the lemma.
I would prefer to have multi-variant lemmas. They would work like the multi-lingual labels we have now on items, but restricted to the variants of a single language. For display, we would apply a language fallback mechanism similar to the one we now apply when showing labels.
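Roughly, a multi-variant lemma with fallback could look like the Python sketch below. The fallback chains and the pick_lemma() helper are made up for illustration; the actual chains and the set of variant codes would still have to be decided.

```python
from typing import Dict, List, Optional

# Illustrative fallback chains per requested variant (assumed, not real config).
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "en-gb": ["en-gb", "en"],
    "en-ca": ["en-ca", "en-gb", "en"],
    "sr-el": ["sr-el", "sr-ec"],
}


def pick_lemma(lemma: Dict[str, str], variant: str) -> Optional[str]:
    """Return the spelling to display for the requested variant."""
    # Walk the chain for this variant; an unknown variant only tries itself.
    for code in FALLBACK_CHAINS.get(variant, [variant]):
        if code in lemma:
            return lemma[code]
    # Last resort: any spelling, chosen deterministically by variant code.
    for code in sorted(lemma):
        return lemma[code]
    return None


# One lemma, keyed by variant code instead of having a single "main" spelling.
colour = {"en": "color", "en-gb": "colour"}
print(pick_lemma(colour, "en-ca"))
print(pick_lemma(colour, "de"))
```

With this shape no spelling has to be declared the "main" one; the choice is made per reader at display time.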
2b) If we treat lemmas as multi-variant, should Forms also be multi-variant, or should they be per-variant? Should the gloss of a Sense be multi-variant? I currently tend towards "yes" for all of the above.
What do you think?
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech