Re: [Wikidata-tech] Two questions about Lexeme Modeling

11 Nov 2016

Hi there!
1) A possible solution could be to have another category of items ("Gxxx",
grammatical rule?) to store grammatical structures, like "Noun + verb +
object" or "Noun + reflexive verb", and then link to that structure with
a qualifier giving the position that the lexeme occupies in it. Example:
"to shit" <grammatical structure required> "Subject + reflexive verb +
reflexive pronoun"
    <role in grammatical structure> "reflexive verb"
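A minimal sketch of how such a link could be represented and checked. All
names here (the "G1" identifier, the two property names, the classes and
role_is_valid) are hypothetical and only illustrate the proposal; nothing
like them exists in Wikidata today.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GrammaticalStructure:
    """Hypothetical "G" entity: an ordered list of slots (roles)."""
    id: str
    slots: List[str]

    def render(self) -> str:
        # Concise display form, e.g. "Subject + reflexive verb + ..."
        return " + ".join(self.slots)


@dataclass
class Statement:
    """A statement on a lexeme or sense, with optional qualifiers."""
    prop: str
    value: str
    qualifiers: Dict[str, str] = field(default_factory=dict)


def role_is_valid(statement: Statement,
                  structures: Dict[str, GrammaticalStructure]) -> bool:
    """Check that the role qualifier names a slot of the linked structure."""
    structure = structures[statement.value]
    role = statement.qualifiers.get("role in grammatical structure")
    return role in structure.slots


# The example from above, with a made-up identifier "G1".
structures = {
    "G1": GrammaticalStructure(
        "G1", ["Subject", "reflexive verb", "reflexive pronoun"]
    )
}
to_shit = Statement(
    prop="grammatical structure required",
    value="G1",
    qualifiers={"role in grammatical structure": "reflexive verb"},
)
```

Here role_is_valid(to_shit, structures) holds because "reflexive verb" is
one of the slots of G1. A qualifier naming a role that the structure does
not have would be rejected.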
2) I would prefer statements, as they can be complemented with qualifiers
explaining why a form has a certain spelling (geographical variant, old
usage, corruption...). It would be nice, however, if there were some
mechanism for a special kind of property whose value is used as an item
alias. This could benefit normal items in Wikidata too, as most name
properties like P1448, P1477 (official name, birth name, etc.) should have
their values automatically shown as aliases of the item in all languages,
if that were technologically feasible.
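A rough sketch of the alias idea, assuming statements are available as
simple (property, value) pairs. The function derived_aliases and the
placeholder values are made up for illustration; this is not an existing
Wikibase feature.

```python
from typing import Iterable, List, Optional, Tuple

# Properties whose values would double as aliases (from the text above):
# P1448 = official name, P1477 = birth name.
NAME_PROPERTIES = {"P1448", "P1477"}


def derived_aliases(statements: Iterable[Tuple[str, str]],
                    label: Optional[str] = None) -> List[str]:
    """Collect name-property values as aliases.

    Duplicates and values equal to the label are skipped; the order of
    first appearance is preserved.
    """
    seen = set()
    aliases = []
    for prop, value in statements:
        if prop in NAME_PROPERTIES and value != label and value not in seen:
            seen.add(value)
            aliases.append(value)
    return aliases


# Placeholder data, not taken from a real item.
example_statements = [
    ("P1448", "Official Name"),
    ("P31", "Q5"),
    ("P1477", "Birth Name"),
    ("P1448", "Official Name"),
]
example_aliases = derived_aliases(example_statements, label="Birth Name")
```

In this sketch the aliases are computed on read rather than stored, so they
would stay in sync with the statements they come from.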
Cheers,
Micru
On Fri, Nov 11, 2016 at 6:03 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Hi all!
There are two questions about modelling lexemes that are bothering me. One
is an old question, and one I only came across recently.

1) The question that came up for me recently is how we model the grammatical
context for senses. For instance, "to ask" can mean requesting information
or requesting action, depending on whether we use "ask somebody about" or
"ask somebody to". Similarly, "to shit" has an entirely different meaning
when used reflexively ("I shit myself").
There is no good place for this in our current model. The information could
be placed in a statement on the word Sense, but that would be kind of
non-obvious, and would not (at least not easily) allow for a concise
rendering, in the way we see it in most dictionaries ("to ask sbdy to do
sthg"). The alternative would be to treat each usage with a different
grammatical context as a separate Lexeme (a verb phrase Lexeme), so "to shit
oneself" would be a separate lemma. That could lead to a fragmentation of
the content in a way that is quite unexpected to people used to traditional
dictionaries.
We could also add this information as a special field in the Sense entity,
but I don't even know what that field should contain, exactly.
Got a better idea?

2) The older question is how we handle different renderings (spellings,
scripts) of the same lexeme. In English we have "color" vs "colour", in
German we have "stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have
a Roman and a Cyrillic rendering for every word. We can treat these as
separate Lexemes, but that would mean duplicating all information about
them. We could have a single Lemma, and represent the others as alternative
Forms, or using statements on the Lexeme. But that raises the question of
which spelling or script should be the "main" one, and used in the lemma.
I would prefer to have multi-variant lemmas. They would work like the
multi-lingual labels we have now on items, but restricted to the variants
of a single language. For display, we would apply a language fallback
mechanism similar to the one we now apply when showing labels.
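A minimal sketch of a multi-variant lemma with fallback, assuming the lemma
is stored as a mapping from variant code to spelling. The variant codes, the
fallback chains and the function pick_lemma are assumptions made for
illustration; they are not the actual Wikibase fallback configuration.

```python
from typing import Dict, List, Tuple

# Hypothetical fallback chains between variants of one language.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "en-ca": ["en-gb", "en-us"],
    "en-gb": ["en-us"],
    "en-us": ["en-gb"],
}


def pick_lemma(lemma: Dict[str, str], variant: str,
               chains: Dict[str, List[str]] = FALLBACK_CHAINS
               ) -> Tuple[str, str]:
    """Return (variant code, spelling) to display for the requested variant.

    Tries the requested variant first, then its fallback chain, and as a
    last resort the first spelling stored on the lemma.
    """
    if not lemma:
        raise ValueError("lemma has no spellings")
    for code in [variant] + chains.get(variant, []):
        if code in lemma:
            return code, lemma[code]
    first_code = next(iter(lemma))
    return first_code, lemma[first_code]


# One lexeme, two spellings; neither is the "main" one.
color = {"en-us": "color", "en-gb": "colour"}
```

With these assumed chains, pick_lemma(color, "en-ca") falls back to the
"en-gb" spelling, while a request for "en-us" gets "color" directly. No
spelling has to be marked as the main one; the choice is made at display
time.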
2b) If we treat lemmas as multi-variant, should Forms also be multi-variant,
or should they be per-variant? Should the gloss of a Sense be multi-variant?
I currently tend towards "yes" for all of the above.
What do you think?
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
-- 
Etiamsi omnes, ego non
