Synonyms and word senses

List overview All Threads
Download

newer

older

Newsletter #37: Welcome Max...

Newsletter #36: Phase ε completed

Andy

29 Jun 2021 29 Jun '21

2:32 p.m.

In WikiData is very large amount of Qnnnn entities and much smaller set of Lnnnn lexems. One lexem can have many senses and many different words can be synonyms (problem: sometimes very close meaning but not the same) For example in multilingual Wordnet : Polish->English word "kot" has:

02121808-n kot, kot domowy domestic cat, house cat, Felis domesticus, Felis catus any domesticated member of the genus Felis

10149241-n kot grunt an unskilled or low-ranking soldier or other worker

10508379-n kot raw recruit an inexperienced and untrained recruit

02121620-n (18) kot cat, true cat feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats

10641301-n kot sprog a new military recruit

# of them: 10149241-n 10508379-n and 10641301-n are English synonyms, very close meanings, differ lexem. In Polish is one lexem "kot", In Polish shouldn't distinguish this 3 senses, instead of should be one general definition. Problem: in Abstract Wikipedia source text should be language independent, sense-centered We need common sense for 3 differ English lexems? Senses are distinguishable in differ degree in differ language, for example "snow" in African languages vs Siberian languages. - how Abstract Wikipedia will do with senses? - how view senses in Wikidata?

Attachments:

attachment.htm (text/html — 2.0 KB)

Show replies by date

Thad Guidry

29 Jun 29 Jun

3:29 p.m.

I can answer the "view senses in Wikidata" question, and will let others answer your more broader questions of how Abstract Wikipedia might from a high-level to automatically generate content in other languages based on Lexicographical data in Wikidata.

So...what you just showed in your example is Text format of a general view of how you would like to visualize or view relations between Lexeme and edit or enter new data against those various relations. This is exactly what we need to show and improve in our Documentation much better visually, and using the Wikidata properties in those visualizations, so that folks know how to wire things up like that manually. BUT...instead of just in Documentation... we could improve the Lexeme page views themselves!

It's encouraging that there are tools coming online in the Wikidata ecosystem that are improving things so that users might have nice forms with placeholder property elements (not the Wikidata Lexeme pages which lack placeholders for your use case) to easily just fill out. For example, already having Sense, Item for Sense, Synonym, etc. etc. as placeholders on the form ready to fill out for a Lexeme.

In the SVG for Lexeme Data Model https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg we do a great job showing what is possible. We just need better placeholder forms in those tools https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data, and I'd say instead of a Tool, a new Gadget or two that can automatically expose those placeholder properties for easier manual entry based on a user's preference of what parts of the Lexeme data model they typically want to work on. Adding synonyms, senses, or sets of properties often used together, etc. as in your use case.

In other words, in the Gadget, actually allowing users to save their setup of the custom Lexeme page views for data entry as they need, which can show what is possible based on the Lexeme data model. That's what is missing. Otherwise, everyone has to remember what the Properties #'s are, and constantly retype them into little boxes over and over. (Carpal Tunnel Syndrome comes up rapidly) Much better to be able to enable a Gadget that has different "modes" of operation, with a few pre-saved templates of Lexeme page views.

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Tue, Jun 29, 2021 at 9:33 AM Andy borucki.andrzej@gmail.com wrote:

...

In WikiData is very large amount of Qnnnn entities and much smaller set of Lnnnn lexems. One lexem can have many senses and many different words can be synonyms (problem: sometimes very close meaning but not the same) For example in multilingual Wordnet : Polish->English word "kot" has:

02121808-n kot, kot domowy domestic cat, house cat, Felis domesticus, Felis catus any domesticated member of the genus Felis

10149241-n kot grunt an unskilled or low-ranking soldier or other worker

10508379-n kot raw recruit an inexperienced and untrained recruit

02121620-n (18) kot cat, true cat feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats

10641301-n kot sprog a new military recruit

# of them: 10149241-n 10508379-n and 10641301-n are English synonyms, very close meanings, differ lexem. In Polish is one lexem "kot", In Polish shouldn't distinguish this 3 senses, instead of should be one general definition. Problem: in Abstract Wikipedia source text should be language independent, sense-centered We need common sense for 3 differ English lexems? Senses are distinguishable in differ degree in differ language, for example "snow" in African languages vs Siberian languages.

how Abstract Wikipedia will do with senses?

how view senses in Wikidata?

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Andy

4:30 p.m.

All these tools are planning to do or some tools are written? How to help? how to write sample tool? (Python?,JavaScript? Php?) Is possible bot to mass import lexems? for example from Wiktionary? Now user can freely add sense for example to https://www.wikidata.org/wiki/Lexeme:L7006 from Wiktionary? How download only lexicographic data from WikiData? A year ago I download huge amount data - 100 GB zipped,

wt., 29 cze 2021 o 17:30 Thad Guidry thadguidry@gmail.com napisał(a):

...

In the SVG for Lexeme Data Model https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg we do a great job showing what is possible. We just need better placeholder forms in those tools https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data, and I'd say instead of a Tool, a new Gadget or two that can automatically expose those placeholder properties for easier manual entry based on a user's preference of what parts of the Lexeme data model they typically want to work on. Adding synonyms, senses, or sets of properties often used together, etc. as in your use case.

Philippe Verdy

10:17 p.m.

I'm a bit confused by the difference taken between the base "Lemma" and the MANY "Senses" it could have. Normally, each lemma has a SINGLE sense and a standard/base form. Some lemmas could have several forms (orthographic), while keeping its single sense. Each sense however may have restrictions on its forms depending on context/sentence (e.g. capitalization) but this also applies to the lemma. May be the term "Lemma" just refers to a dictionary entry which may exhibit several related senses (with minor semantic difference, meaning that the lemma also has the same set of possible translations to the same target language: note however that some lemmas may not have any suitable translation into a single lemma in the target language, where an expression could be needed, and each translation will depend on the forms and context of use or usage where one or several lemmas from the source language may map to one or several target lemmas in the target). But I don't see how separating senses from lemmas will offer any help, it is in my opinion an extra and unneeded layer of complexification. Are there good counter examples ? I can't imagine anyone (unless lemmas are ill-defined: if you refer to a dictionnary entry, it is just an editorial choice from a specific dictionnary). May be this is just an informal group of related senses but it is highly debatable and depend on each author: some authors may create entries by level of language, or jargons/terminologies/context of use (e.g. legal, commercial, vulgar/vernacular, scientific in specific domains), and such grouping is generally evolutive (even from the same authors) and subject to lot of personal perceptions and interpretations...

So we should make things more simple: merge Lemas and Senses into the same entity type (1-to-1). The only difference I see is in the set of forms for the same lemma, which may be euivalent (with just one prefered in some contexts, such as abbreviated forms, slangs/alterations/simplifications or just forms that are always considered as equivalent (e.g. indetermination of accents, proposed orthographic reforms, historic forms that fell out of use...)

Now there's the special case of contextual mutations (generally for phonetics or harmony, including some unwritten parts, such as rules for contractions, elisions and liaisons in French that change how surrounding terms are written or modified outside the written (or spoken, or gestured) form of the lemma itself., or the insertion of non-semantic phonetic particles (like [-t-] or [z'] in French)

Le mar. 29 juin 2021 à 18:31, Andy borucki.andrzej@gmail.com a écrit :

...

All these tools are planning to do or some tools are written? How to help? how to write sample tool? (Python?,JavaScript? Php?) Is possible bot to mass import lexems? for example from Wiktionary? Now user can freely add sense for example to https://www.wikidata.org/wiki/Lexeme:L7006 from Wiktionary? How download only lexicographic data from WikiData? A year ago I download huge amount data - 100 GB zipped,

wt., 29 cze 2021 o 17:30 Thad Guidry thadguidry@gmail.com napisał(a):

...
In the SVG for Lexeme Data Model https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg we do a great job showing what is possible. We just need better placeholder forms in those tools https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data, and I'd say instead of a Tool, a new Gadget or two that can automatically expose those placeholder properties for easier manual entry based on a user's preference of what parts of the Lexeme data model they typically want to work on. Adding synonyms, senses, or sets of properties often used together, etc. as in your use case.

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Gerard Meijssen

30 Jun 30 Jun

8:04 a.m.

Hoi, from a database point of view this is gibberish, you cannot model this, Thanks, GerardM

On Wed, 30 Jun 2021 at 00:18, Philippe Verdy verdyp@gmail.com wrote:

...

I'm a bit confused by the difference taken between the base "Lemma" and the MANY "Senses" it could have. Normally, each lemma has a SINGLE sense and a standard/base form. Some lemmas could have several forms (orthographic), while keeping its single sense. Each sense however may have restrictions on its forms depending on context/sentence (e.g. capitalization) but this also applies to the lemma. May be the term "Lemma" just refers to a dictionary entry which may exhibit several related senses (with minor semantic difference, meaning that the lemma also has the same set of possible translations to the same target language: note however that some lemmas may not have any suitable translation into a single lemma in the target language, where an expression could be needed, and each translation will depend on the forms and context of use or usage where one or several lemmas from the source language may map to one or several target lemmas in the target). But I don't see how separating senses from lemmas will offer any help, it is in my opinion an extra and unneeded layer of complexification. Are there good counter examples ? I can't imagine anyone (unless lemmas are ill-defined: if you refer to a dictionnary entry, it is just an editorial choice from a specific dictionnary). May be this is just an informal group of related senses but it is highly debatable and depend on each author: some authors may create entries by level of language, or jargons/terminologies/context of use (e.g. legal, commercial, vulgar/vernacular, scientific in specific domains), and such grouping is generally evolutive (even from the same authors) and subject to lot of personal perceptions and interpretations...

So we should make things more simple: merge Lemas and Senses into the same entity type (1-to-1). The only difference I see is in the set of forms for the same lemma, which may be euivalent (with just one prefered in some contexts, such as abbreviated forms, slangs/alterations/simplifications or just forms that are always considered as equivalent (e.g. indetermination of accents, proposed orthographic reforms, historic forms that fell out of use...)

Now there's the special case of contextual mutations (generally for phonetics or harmony, including some unwritten parts, such as rules for contractions, elisions and liaisons in French that change how surrounding terms are written or modified outside the written (or spoken, or gestured) form of the lemma itself., or the insertion of non-semantic phonetic particles (like [-t-] or [z'] in French)

Le mar. 29 juin 2021 à 18:31, Andy borucki.andrzej@gmail.com a écrit :

...
All these tools are planning to do or some tools are written? How to help? how to write sample tool? (Python?,JavaScript? Php?) Is possible bot to mass import lexems? for example from Wiktionary? Now user can freely add sense for example to https://www.wikidata.org/wiki/Lexeme:L7006 from Wiktionary? How download only lexicographic data from WikiData? A year ago I download huge amount data - 100 GB zipped,

wt., 29 cze 2021 o 17:30 Thad Guidry thadguidry@gmail.com napisał(a):

...
In the SVG for Lexeme Data Model https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg we do a great job showing what is possible. We just need better placeholder forms in those tools https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data, and I'd say instead of a Tool, a new Gadget or two that can automatically expose those placeholder properties for easier manual entry based on a user's preference of what parts of the Lexeme data model they typically want to work on. Adding synonyms, senses, or sets of properties often used together, etc. as in your use case.

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Philippe Verdy

10:19 a.m.

Thjis is an extreme response. In my view, from a database point of view, the separation of lemmas and senses IS the real gibberish. There's nothing to explain to 1-ot-many relation between the two concepts, which should be 1 to 1 and thus mergeable

Le mer. 30 juin 2021 à 10:17, Gerard Meijssen gerard.meijssen@gmail.com a écrit :

...

Hoi, from a database point of view this is gibberish, you cannot model this, Thanks, GerardM

On Wed, 30 Jun 2021 at 00:18, Philippe Verdy verdyp@gmail.com wrote:

...
I'm a bit confused by the difference taken between the base "Lemma" and the MANY "Senses" it could have. Normally, each lemma has a SINGLE sense and a standard/base form. Some lemmas could have several forms (orthographic), while keeping its single sense. Each sense however may have restrictions on its forms depending on context/sentence (e.g. capitalization) but this also applies to the lemma. May be the term "Lemma" just refers to a dictionary entry which may exhibit several related senses (with minor semantic difference, meaning that the lemma also has the same set of possible translations to the same target language: note however that some lemmas may not have any suitable translation into a single lemma in the target language, where an expression could be needed, and each translation will depend on the forms and context of use or usage where one or several lemmas from the source language may map to one or several target lemmas in the target). But I don't see how separating senses from lemmas will offer any help, it is in my opinion an extra and unneeded layer of complexification. Are there good counter examples ? I can't imagine anyone (unless lemmas are ill-defined: if you refer to a dictionnary entry, it is just an editorial choice from a specific dictionnary). May be this is just an informal group of related senses but it is highly debatable and depend on each author: some authors may create entries by level of language, or jargons/terminologies/context of use (e.g. legal, commercial, vulgar/vernacular, scientific in specific domains), and such grouping is generally evolutive (even from the same authors) and subject to lot of personal perceptions and interpretations...

So we should make things more simple: merge Lemas and Senses into the same entity type (1-to-1). The only difference I see is in the set of forms for the same lemma, which may be euivalent (with just one prefered in some contexts, such as abbreviated forms, slangs/alterations/simplifications or just forms that are always considered as equivalent (e.g. indetermination of accents, proposed orthographic reforms, historic forms that fell out of use...)

Now there's the special case of contextual mutations (generally for phonetics or harmony, including some unwritten parts, such as rules for contractions, elisions and liaisons in French that change how surrounding terms are written or modified outside the written (or spoken, or gestured) form of the lemma itself., or the insertion of non-semantic phonetic particles (like [-t-] or [z'] in French)

Le mar. 29 juin 2021 à 18:31, Andy borucki.andrzej@gmail.com a écrit :

...
All these tools are planning to do or some tools are written? How to help? how to write sample tool? (Python?,JavaScript? Php?) Is possible bot to mass import lexems? for example from Wiktionary? Now user can freely add sense for example to https://www.wikidata.org/wiki/Lexeme:L7006 from Wiktionary? How download only lexicographic data from WikiData? A year ago I download huge amount data - 100 GB zipped,

wt., 29 cze 2021 o 17:30 Thad Guidry thadguidry@gmail.com napisał(a):

...
In the SVG for Lexeme Data Model https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg we do a great job showing what is possible. We just need better placeholder forms in those tools https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data, and I'd say instead of a Tool, a new Gadget or two that can automatically expose those placeholder properties for easier manual entry based on a user's preference of what parts of the Lexeme data model they typically want to work on. Adding synonyms, senses, or sets of properties often used together, etc. as in your use case.

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Thomas Douillard

11:03 a.m.

This is just a modelling choice. Compared to what you propose, Wikibase lemmas make possible to share « Forms » when several of your lemmas has the same forms. No big deal. When the forms are different Wikibase have several lemmas, each with several possible senses. As each senses have their own ids on Wikidata you can refer to them just as you would refer to your kind of lemmas, and recover the forms just as easily. Hence this makes little practical difference.

Le mer. 30 juin 2021 à 12:20, Philippe Verdy verdyp@gmail.com a écrit :

...

Thjis is an extreme response. In my view, from a database point of view, the separation of lemmas and senses IS the real gibberish. There's nothing to explain to 1-ot-many relation between the two concepts, which should be 1 to 1 and thus mergeable

Le mer. 30 juin 2021 à 10:17, Gerard Meijssen gerard.meijssen@gmail.com a écrit :

...
Hoi, from a database point of view this is gibberish, you cannot model this, Thanks, GerardM

On Wed, 30 Jun 2021 at 00:18, Philippe Verdy verdyp@gmail.com wrote:

...
I'm a bit confused by the difference taken between the base "Lemma" and the MANY "Senses" it could have. Normally, each lemma has a SINGLE sense and a standard/base form. Some lemmas could have several forms (orthographic), while keeping its single sense. Each sense however may have restrictions on its forms depending on context/sentence (e.g. capitalization) but this also applies to the lemma. May be the term "Lemma" just refers to a dictionary entry which may exhibit several related senses (with minor semantic difference, meaning that the lemma also has the same set of possible translations to the same target language: note however that some lemmas may not have any suitable translation into a single lemma in the target language, where an expression could be needed, and each translation will depend on the forms and context of use or usage where one or several lemmas from the source language may map to one or several target lemmas in the target). But I don't see how separating senses from lemmas will offer any help, it is in my opinion an extra and unneeded layer of complexification. Are there good counter examples ? I can't imagine anyone (unless lemmas are ill-defined: if you refer to a dictionnary entry, it is just an editorial choice from a specific dictionnary). May be this is just an informal group of related senses but it is highly debatable and depend on each author: some authors may create entries by level of language, or jargons/terminologies/context of use (e.g. legal, commercial, vulgar/vernacular, scientific in specific domains), and such grouping is generally evolutive (even from the same authors) and subject to lot of personal perceptions and interpretations...

So we should make things more simple: merge Lemas and Senses into the same entity type (1-to-1). The only difference I see is in the set of forms for the same lemma, which may be euivalent (with just one prefered in some contexts, such as abbreviated forms, slangs/alterations/simplifications or just forms that are always considered as equivalent (e.g. indetermination of accents, proposed orthographic reforms, historic forms that fell out of use...)

Now there's the special case of contextual mutations (generally for phonetics or harmony, including some unwritten parts, such as rules for contractions, elisions and liaisons in French that change how surrounding terms are written or modified outside the written (or spoken, or gestured) form of the lemma itself., or the insertion of non-semantic phonetic particles (like [-t-] or [z'] in French)

Le mar. 29 juin 2021 à 18:31, Andy borucki.andrzej@gmail.com a écrit :

...
All these tools are planning to do or some tools are written? How to help? how to write sample tool? (Python?,JavaScript? Php?) Is possible bot to mass import lexems? for example from Wiktionary? Now user can freely add sense for example to https://www.wikidata.org/wiki/Lexeme:L7006 from Wiktionary? How download only lexicographic data from WikiData? A year ago I download huge amount data - 100 GB zipped,

wt., 29 cze 2021 o 17:30 Thad Guidry thadguidry@gmail.com napisał(a):

...
In the SVG for Lexeme Data Model https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg we do a great job showing what is possible. We just need better placeholder forms in those tools https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data, and I'd say instead of a Tool, a new Gadget or two that can automatically expose those placeholder properties for easier manual entry based on a user's preference of what parts of the Lexeme data model they typically want to work on. Adding synonyms, senses, or sets of properties often used together, etc. as in your use case.

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Andy

11:17 a.m.

Most of most frequent lexems has more than one sense, one sense usually have only rare lexems. While adding lexem and sense, one must fill not "definition" but "gloss" which should be very short. For example for "dog" is gloss "mammal" although cat and cow are also mammals. It will be good if were both gloss and definition?

Philippe Verdy

5:46 p.m.

You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

...

Most of most frequent lexems has more than one sense, one sense usually have only rare lexems. While adding lexem and sense, one must fill not "definition" but "gloss" which should be very short. For example for "dog" is gloss "mammal" although cat and cow are also mammals. It will be good if were both gloss and definition? _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Douglas Clark

6:32 p.m.

Agreed mostly. A lexeme is the head word that stands-in for all forms of the same meaning (forms of the same meaning equals lemma or sense). Let's not forget that a lexeme can be more than one word (fire engine, speak up, and even RTFM). From a word perspective, a lexeme is many to many, yet mostly one to many, AND the lexeme as a head word in one repository could also be a lemma of some lexeme in another repository. Author choice. Just wait until you get to the rules of how to select the correct lemma-sense from a lexeme's collection when the clue to the right sense is a sentence four sentences away. It's just going to get more complicated from here. Sadly, Abstract is probably the last large scale manual tagging effort, as there are a plethora of existing tagged corpora that can support Abstract if you would just use a bit of machine learning. Please don't say it's too hard to understand where or how the magic happens, as there is actually a machine learning for dummies book. It's just different.

On Wed, Jun 30, 2021 at 10:49 AM Philippe Verdy verdyp@gmail.com wrote:

...

You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

...
Most of most frequent lexems has more than one sense, one sense usually have only rare lexems. While adding lexem and sense, one must fill not "definition" but "gloss" which should be very short. For example for "dog" is gloss "mammal" although cat and cow are also mammals. It will be good if were both gloss and definition? _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Thad Guidry

6:36 p.m.

What do folks think of this for a proposed better view of our existing Lexeme page (so that it aligns better with our described Data Model in SVG https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg) to help visualize our data model better on the Lexeme pages themselves? Does this align with it? Better? Worse? Needs tweaks?

[image: Proposed_Lexeme_Page.png]

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:33 PM Douglas Clark clarkdd@gmail.com wrote:

...

Agreed mostly. A lexeme is the head word that stands-in for all forms of the same meaning (forms of the same meaning equals lemma or sense). Let's not forget that a lexeme can be more than one word (fire engine, speak up, and even RTFM). From a word perspective, a lexeme is many to many, yet mostly one to many, AND the lexeme as a head word in one repository could also be a lemma of some lexeme in another repository. Author choice. Just wait until you get to the rules of how to select the correct lemma-sense from a lexeme's collection when the clue to the right sense is a sentence four sentences away. It's just going to get more complicated from here. Sadly, Abstract is probably the last large scale manual tagging effort, as there are a plethora of existing tagged corpora that can support Abstract if you would just use a bit of machine learning. Please don't say it's too hard to understand where or how the magic happens, as there is actually a machine learning for dummies book. It's just different.

On Wed, Jun 30, 2021 at 10:49 AM Philippe Verdy verdyp@gmail.com wrote:

...
You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

...
Most of most frequent lexems has more than one sense, one sense usually have only rare lexems. While adding lexem and sense, one must fill not "definition" but "gloss" which should be very short. For example for "dog" is gloss "mammal" although cat and cow are also mammals. It will be good if were both gloss and definition? _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Andy

6:46 p.m.

For example https://www.wikidata.org/wiki/Lexeme:L8059 Here Lexem is whole structure, lemma is word "pies". For one lexem is several senses: dog, cop, dog only masculine, or jargon's masculine fox

Thad Guidry

7:03 p.m.

And furthermore... perhaps some small help iconography buttons added for newcomers that must be clicked to see the help info (not hoverable, as that would interfere) And where the help text would be translatable and the definitions taken from our Data Model and displayed in your Wikidata language preference. For Example:

[image: Proposed_Lexeme_Page_Tooltip_Bubble.png]

- Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:36 PM Thad Guidry thadguidry@gmail.com wrote:

...

What do folks think of this for a proposed better view of our existing Lexeme page (so that it aligns better with our described Data Model in SVG https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg) to help visualize our data model better on the Lexeme pages themselves? Does this align with it? Better? Worse? Needs tweaks?

[image: Proposed_Lexeme_Page.png]

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:33 PM Douglas Clark clarkdd@gmail.com wrote:

...
Agreed mostly. A lexeme is the head word that stands-in for all forms of the same meaning (forms of the same meaning equals lemma or sense). Let's not forget that a lexeme can be more than one word (fire engine, speak up, and even RTFM). From a word perspective, a lexeme is many to many, yet mostly one to many, AND the lexeme as a head word in one repository could also be a lemma of some lexeme in another repository. Author choice. Just wait until you get to the rules of how to select the correct lemma-sense from a lexeme's collection when the clue to the right sense is a sentence four sentences away. It's just going to get more complicated from here. Sadly, Abstract is probably the last large scale manual tagging effort, as there are a plethora of existing tagged corpora that can support Abstract if you would just use a bit of machine learning. Please don't say it's too hard to understand where or how the magic happens, as there is actually a machine learning for dummies book. It's just different.

On Wed, Jun 30, 2021 at 10:49 AM Philippe Verdy verdyp@gmail.com wrote:

...
You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

...
Most of most frequent lexems has more than one sense, one sense usually have only rare lexems. While adding lexem and sense, one must fill not "definition" but "gloss" which should be very short. For example for "dog" is gloss "mammal" although cat and cow are also mammals. It will be good if were both gloss and definition? _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Thad Guidry

10 p.m.

We've just updated the Data Model with 1. a quick textual hierarchy of the data model from a high level. 2. a few more sentences for "Lemma" bullet point to help explain things a bit better.

(Thanks to a few folks on our Telegram channel)

Take a look! https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#Da...

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 2:03 PM Thad Guidry thadguidry@gmail.com wrote:

...

And furthermore... perhaps some small help iconography buttons added for newcomers that must be clicked to see the help info (not hoverable, as that would interfere) And where the help text would be translatable and the definitions taken from our Data Model and displayed in your Wikidata language preference. For Example:

[image: Proposed_Lexeme_Page_Tooltip_Bubble.png]

Thad

https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:36 PM Thad Guidry thadguidry@gmail.com wrote:

...
What do folks think of this for a proposed better view of our existing Lexeme page (so that it aligns better with our described Data Model in SVG https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg) to help visualize our data model better on the Lexeme pages themselves? Does this align with it? Better? Worse? Needs tweaks?

[image: Proposed_Lexeme_Page.png]

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:33 PM Douglas Clark clarkdd@gmail.com wrote:

...
Agreed mostly. A lexeme is the head word that stands-in for all forms of the same meaning (forms of the same meaning equals lemma or sense). Let's not forget that a lexeme can be more than one word (fire engine, speak up, and even RTFM). From a word perspective, a lexeme is many to many, yet mostly one to many, AND the lexeme as a head word in one repository could also be a lemma of some lexeme in another repository. Author choice. Just wait until you get to the rules of how to select the correct lemma-sense from a lexeme's collection when the clue to the right sense is a sentence four sentences away. It's just going to get more complicated from here. Sadly, Abstract is probably the last large scale manual tagging effort, as there are a plethora of existing tagged corpora that can support Abstract if you would just use a bit of machine learning. Please don't say it's too hard to understand where or how the magic happens, as there is actually a machine learning for dummies book. It's just different.

On Wed, Jun 30, 2021 at 10:49 AM Philippe Verdy verdyp@gmail.com wrote:

...
You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

...
Most of most frequent lexems has more than one sense, one sense usually have only rare lexems. While adding lexem and sense, one must fill not "definition" but "gloss" which should be very short. For example for "dog" is gloss "mammal" although cat and cow are also mammals. It will be good if were both gloss and definition? _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Philippe Verdy

2 Jul 2 Jul

3:23 a.m.

I still DO NOT AGREE with these statements: ** "Lemma* (plural *Lemmas*) for use as a human readable representation of the lexeme, e.g. "run" or "when pigs fly". WRONG. By definition the human readable form is NOT the LEMMA, but the LEXEME by definition (even if it can have different written forms, e.g. for plurals, or mutations, or alternate orthographies with minor differences such as variable accents, or presence/absence of hyphens in compound forms)

* "A Lexeme can have several lemmas, even though it is rare" The last assertion is completely false: a lexeme VERY FREQUENTLY has multiple lemmas (each lemma however carry a SINGLE sense, by definition!)

* "A list of *Senses*, describing the different meanings of the lexeme" The different meanings of the lexeme are given by the list of its lemmas (so I maintain that LEMMA=SENSE)

We have an unnecessary abstraction level (and this is already visible in the interface): this just complicates the model by adding an extra object (meaning it is inefficient to process, requires additional queries, and maintenance).

Lexemes must be at the root of the tree ! Note that a lexeme can contain multiple words for a complete expression or necessary particles, prepositions, postpositions or other terms (like reflective pronouns), e.g. for verbs (notably in English for verbal particles in prefix or suffix, as in "to fill in" which as both, or "se"/"s'" in French, these particles may be agglutinated and muted, detached or reordered elsewhere in the sentence depending on forms or combined with other particles (e.g. in German "hereinkommen"-> "Er kommt herein.", "Für hereinzukommen, ..."), but they keep their semantic meaning.

Forms of lexemes lowever have restrictions of usage: NOT ALL forms are usable for EVERY lemma=sense of the same lexeme (e.g. a lemma=sense would be only valid for some forms and not for others, e.g. plural forms): this is not frequent but not exceptional.

Note as well that some forms (e.g. the plural) may change the grammatical gender or case: a singular noun could be masculine, but its plural feminine (typical example un French: for the same lemma=sense of the lexeme "amour", the form "amour" found in "mon amour" is masculine as a singular noun, but the form "amours" found in "mes amours" i is feminine as a plural noun).

So the real schema is:

LEXEME * 0. is language specific. * 1. possibly made of several LEXEMES * 2. has one or more FORMS - 2a. the first form is the most generic one (e.g. the singular form if it exists for a noun or adjective ; the present infinitive form if it exists for a verb, because verbs can be defective some some tense, modes). The first form generally carries all grammatical characteristic used by default for all other forms. - 2b. each additional form can have modifications of the base grammatical classification but generally they inherit them unless they are overridden. - 2c. - 2c. some forms may be equivalent to other forms of the same lexeme, so we need one or more REPRESENTATIONS to exhibit them (including in other script systems, such as Latin, Cyrillic, or under different orthographic systems and reforms) * 3. has one or several LEMMAS=SENSES 3a. each lemma=sense could contain some restrictions on the applicable forms 3b. the lemma has a definition of its sense 3d. the lemma may be valid only in some context (e.g. specialized terminology for a domain) or forbidden/depreciated in other contexts (e.g. slang words, popular/vulgar speech, formal declarations) 3c. the lemma is translatable to one or more lemmas in the target language defined separately within a different lexeme specific to that target language

In all cases, each lemma belongs to a single lexeme.

Le jeu. 1 juil. 2021 à 00:01, Thad Guidry thadguidry@gmail.com a écrit :

...

We've just updated the Data Model with

a quick textual hierarchy of the data model from a high level.

a few more sentences for "Lemma" bullet point to help explain things a

bit better.

(Thanks to a few folks on our Telegram channel)

Take a look!

https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#Da...

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 2:03 PM Thad Guidry thadguidry@gmail.com wrote:

...
And furthermore... perhaps some small help iconography buttons added for newcomers that must be clicked to see the help info (not hoverable, as that would interfere) And where the help text would be translatable and the definitions taken from our Data Model and displayed in your Wikidata language preference. For Example:

[image: Proposed_Lexeme_Page_Tooltip_Bubble.png]

Thad

https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:36 PM Thad Guidry thadguidry@gmail.com wrote:

...
What do folks think of this for a proposed better view of our existing Lexeme page (so that it aligns better with our described Data Model in SVG https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg) to help visualize our data model better on the Lexeme pages themselves? Does this align with it? Better? Worse? Needs tweaks?

[image: Proposed_Lexeme_Page.png]

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:33 PM Douglas Clark clarkdd@gmail.com wrote:

...
Agreed mostly. A lexeme is the head word that stands-in for all forms of the same meaning (forms of the same meaning equals lemma or sense). Let's not forget that a lexeme can be more than one word (fire engine, speak up, and even RTFM). From a word perspective, a lexeme is many to many, yet mostly one to many, AND the lexeme as a head word in one repository could also be a lemma of some lexeme in another repository. Author choice. Just wait until you get to the rules of how to select the correct lemma-sense from a lexeme's collection when the clue to the right sense is a sentence four sentences away. It's just going to get more complicated from here. Sadly, Abstract is probably the last large scale manual tagging effort, as there are a plethora of existing tagged corpora that can support Abstract if you would just use a bit of machine learning. Please don't say it's too hard to understand where or how the magic happens, as there is actually a machine learning for dummies book. It's just different.

On Wed, Jun 30, 2021 at 10:49 AM Philippe Verdy verdyp@gmail.com wrote:

...
You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

...
Most of most frequent lexems has more than one sense, one sense usually have only rare lexems. While adding lexem and sense, one must fill not "definition" but "gloss" which should be very short. For example for "dog" is gloss "mammal" although cat and cow are also mammals. It will be good if were both gloss and definition? _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Thad Guidry

3:45 a.m.

Hi Phillipe!

We are actively discussing some of this over in the Telegram channel for Wikidata Lexicographical data, which I think you are already aware of? Lucas Werkmeister noted that he has drafted an improvement over and above my changes from yesterday. You might go to Telegram and join the discussion there, or stay here and discuss, that's fine.

Here is the draft that Lucas put together thus far... (green color is his proposed changes for that Doc page). https://www.wikidata.org/w/index.php?title=User:Lucas_Werkmeister/Lexeme_Dat...

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Thu, Jul 1, 2021 at 10:24 PM Philippe Verdy verdyp@gmail.com wrote:

...

I still DO NOT AGREE with these statements: ** "Lemma* (plural *Lemmas*) for use as a human readable representation of the lexeme, e.g. "run" or "when pigs fly". WRONG. By definition the human readable form is NOT the LEMMA, but the LEXEME by definition (even if it can have different written forms, e.g. for plurals, or mutations, or alternate orthographies with minor differences such as variable accents, or presence/absence of hyphens in compound forms)

"A Lexeme can have several lemmas, even though it is rare" The last assertion is completely false: a lexeme VERY FREQUENTLY has

multiple lemmas (each lemma however carry a SINGLE sense, by definition!)

"A list of *Senses*, describing the different meanings of the lexeme" The different meanings of the lexeme are given by the list of its lemmas

(so I maintain that LEMMA=SENSE)

We have an unnecessary abstraction level (and this is already visible in the interface): this just complicates the model by adding an extra object (meaning it is inefficient to process, requires additional queries, and maintenance).

Lexemes must be at the root of the tree ! Note that a lexeme can contain multiple words for a complete expression or necessary particles, prepositions, postpositions or other terms (like reflective pronouns), e.g. for verbs (notably in English for verbal particles in prefix or suffix, as in "to fill in" which as both, or "se"/"s'" in French, these particles may be agglutinated and muted, detached or reordered elsewhere in the sentence depending on forms or combined with other particles (e.g. in German "hereinkommen"-> "Er kommt herein.", "Für hereinzukommen, ..."), but they keep their semantic meaning.

Forms of lexemes lowever have restrictions of usage: NOT ALL forms are usable for EVERY lemma=sense of the same lexeme (e.g. a lemma=sense would be only valid for some forms and not for others, e.g. plural forms): this is not frequent but not exceptional.

Note as well that some forms (e.g. the plural) may change the grammatical gender or case: a singular noun could be masculine, but its plural feminine (typical example un French: for the same lemma=sense of the lexeme "amour", the form "amour" found in "mon amour" is masculine as a singular noun, but the form "amours" found in "mes amours" i is feminine as a plural noun).

So the real schema is:

LEXEME

is language specific.

possibly made of several LEXEMES

has one or more FORMS

2a. the first form is the most generic one (e.g. the singular form if

it exists for a noun or adjective ; the present infinitive form if it exists for a verb, because verbs can be defective some some tense, modes). The first form generally carries all grammatical characteristic used by default for all other forms.

2b. each additional form can have modifications of the base

grammatical classification but generally they inherit them unless they are overridden.

2c.

2c. some forms may be equivalent to other forms of the same lexeme,

so we need one or more REPRESENTATIONS to exhibit them (including in other script systems, such as Latin, Cyrillic, or under different orthographic systems and reforms)

has one or several LEMMAS=SENSES

3a. each lemma=sense could contain some restrictions on the applicable

forms 3b. the lemma has a definition of its sense 3d. the lemma may be valid only in some context (e.g. specialized terminology for a domain) or forbidden/depreciated in other contexts (e.g. slang words, popular/vulgar speech, formal declarations) 3c. the lemma is translatable to one or more lemmas in the target language defined separately within a different lexeme specific to that target language

In all cases, each lemma belongs to a single lexeme.

Le jeu. 1 juil. 2021 à 00:01, Thad Guidry thadguidry@gmail.com a écrit :

...
We've just updated the Data Model with

a quick textual hierarchy of the data model from a high level.

a few more sentences for "Lemma" bullet point to help explain things a

bit better.

(Thanks to a few folks on our Telegram channel)

Take a look!

https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#Da...

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 2:03 PM Thad Guidry thadguidry@gmail.com wrote:

...
And furthermore... perhaps some small help iconography buttons added for newcomers that must be clicked to see the help info (not hoverable, as that would interfere) And where the help text would be translatable and the definitions taken from our Data Model and displayed in your Wikidata language preference. For Example:

[image: Proposed_Lexeme_Page_Tooltip_Bubble.png]

Thad

https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:36 PM Thad Guidry thadguidry@gmail.com wrote:

...
What do folks think of this for a proposed better view of our existing Lexeme page (so that it aligns better with our described Data Model in SVG https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg) to help visualize our data model better on the Lexeme pages themselves? Does this align with it? Better? Worse? Needs tweaks?

[image: Proposed_Lexeme_Page.png]

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:33 PM Douglas Clark clarkdd@gmail.com wrote:

...
Agreed mostly. A lexeme is the head word that stands-in for all forms of the same meaning (forms of the same meaning equals lemma or sense). Let's not forget that a lexeme can be more than one word (fire engine, speak up, and even RTFM). From a word perspective, a lexeme is many to many, yet mostly one to many, AND the lexeme as a head word in one repository could also be a lemma of some lexeme in another repository. Author choice. Just wait until you get to the rules of how to select the correct lemma-sense from a lexeme's collection when the clue to the right sense is a sentence four sentences away. It's just going to get more complicated from here. Sadly, Abstract is probably the last large scale manual tagging effort, as there are a plethora of existing tagged corpora that can support Abstract if you would just use a bit of machine learning. Please don't say it's too hard to understand where or how the magic happens, as there is actually a machine learning for dummies book. It's just different.

On Wed, Jun 30, 2021 at 10:49 AM Philippe Verdy verdyp@gmail.com wrote:

...
You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

> Most of most frequent lexems has more than one sense, one sense > usually have only rare lexems. > While adding lexem and sense, one must fill not "definition" but > "gloss" which should be very short. For example for "dog" is gloss "mammal" > although cat and cow are also mammals. It will be good if were both gloss > and definition? > _______________________________________________ > Abstract-Wikipedia mailing list -- > abstract-wikipedia@lists.wikimedia.org > List information: > https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed... > _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Thomas Douillard

8:27 a.m.

Philip, the definitions taken here are the definitions taken for Wikidata. This does not make sense to say « by definition these definitions are not what they are ». They may differ from other definitions you are aware of but this often happens.

This situation is the same for the non lexical part, « Property » for example is highly polysemic and someone may not agree with the definition taken in Wikidata, same for « qualifier ». But it’s the definition taken.

Le ven. 2 juil. 2021 à 05:24, Philippe Verdy verdyp@gmail.com a écrit :

...

I still DO NOT AGREE with these statements: ** "Lemma* (plural *Lemmas*) for use as a human readable representation of the lexeme, e.g. "run" or "when pigs fly". WRONG. By definition the human readable form is NOT the LEMMA, but the LEXEME by definition (even if it can have different written forms, e.g. for plurals, or mutations, or alternate orthographies with minor differences such as variable accents, or presence/absence of hyphens in compound forms)

"A Lexeme can have several lemmas, even though it is rare" The last assertion is completely false: a lexeme VERY FREQUENTLY has

multiple lemmas (each lemma however carry a SINGLE sense, by definition!)

"A list of *Senses*, describing the different meanings of the lexeme" The different meanings of the lexeme are given by the list of its lemmas

(so I maintain that LEMMA=SENSE)

We have an unnecessary abstraction level (and this is already visible in the interface): this just complicates the model by adding an extra object (meaning it is inefficient to process, requires additional queries, and maintenance).

Lexemes must be at the root of the tree ! Note that a lexeme can contain multiple words for a complete expression or necessary particles, prepositions, postpositions or other terms (like reflective pronouns), e.g. for verbs (notably in English for verbal particles in prefix or suffix, as in "to fill in" which as both, or "se"/"s'" in French, these particles may be agglutinated and muted, detached or reordered elsewhere in the sentence depending on forms or combined with other particles (e.g. in German "hereinkommen"-> "Er kommt herein.", "Für hereinzukommen, ..."), but they keep their semantic meaning.

Forms of lexemes lowever have restrictions of usage: NOT ALL forms are usable for EVERY lemma=sense of the same lexeme (e.g. a lemma=sense would be only valid for some forms and not for others, e.g. plural forms): this is not frequent but not exceptional.

Note as well that some forms (e.g. the plural) may change the grammatical gender or case: a singular noun could be masculine, but its plural feminine (typical example un French: for the same lemma=sense of the lexeme "amour", the form "amour" found in "mon amour" is masculine as a singular noun, but the form "amours" found in "mes amours" i is feminine as a plural noun).

So the real schema is:

LEXEME

is language specific.

possibly made of several LEXEMES

has one or more FORMS

2a. the first form is the most generic one (e.g. the singular form if

it exists for a noun or adjective ; the present infinitive form if it exists for a verb, because verbs can be defective some some tense, modes). The first form generally carries all grammatical characteristic used by default for all other forms.

2b. each additional form can have modifications of the base

grammatical classification but generally they inherit them unless they are overridden.

2c.

2c. some forms may be equivalent to other forms of the same lexeme,

so we need one or more REPRESENTATIONS to exhibit them (including in other script systems, such as Latin, Cyrillic, or under different orthographic systems and reforms)

has one or several LEMMAS=SENSES

3a. each lemma=sense could contain some restrictions on the applicable

forms 3b. the lemma has a definition of its sense 3d. the lemma may be valid only in some context (e.g. specialized terminology for a domain) or forbidden/depreciated in other contexts (e.g. slang words, popular/vulgar speech, formal declarations) 3c. the lemma is translatable to one or more lemmas in the target language defined separately within a different lexeme specific to that target language

In all cases, each lemma belongs to a single lexeme.

Le jeu. 1 juil. 2021 à 00:01, Thad Guidry thadguidry@gmail.com a écrit :

...
We've just updated the Data Model with

a quick textual hierarchy of the data model from a high level.

a few more sentences for "Lemma" bullet point to help explain things a

bit better.

(Thanks to a few folks on our Telegram channel)

Take a look!

https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#Da...

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 2:03 PM Thad Guidry thadguidry@gmail.com wrote:

...
And furthermore... perhaps some small help iconography buttons added for newcomers that must be clicked to see the help info (not hoverable, as that would interfere) And where the help text would be translatable and the definitions taken from our Data Model and displayed in your Wikidata language preference. For Example:

[image: Proposed_Lexeme_Page_Tooltip_Bubble.png]

Thad

https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:36 PM Thad Guidry thadguidry@gmail.com wrote:

...
What do folks think of this for a proposed better view of our existing Lexeme page (so that it aligns better with our described Data Model in SVG https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation#/media/File:Lexeme_data_model.svg) to help visualize our data model better on the Lexeme pages themselves? Does this align with it? Better? Worse? Needs tweaks?

[image: Proposed_Lexeme_Page.png]

Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/

On Wed, Jun 30, 2021 at 1:33 PM Douglas Clark clarkdd@gmail.com wrote:

...
Agreed mostly. A lexeme is the head word that stands-in for all forms of the same meaning (forms of the same meaning equals lemma or sense). Let's not forget that a lexeme can be more than one word (fire engine, speak up, and even RTFM). From a word perspective, a lexeme is many to many, yet mostly one to many, AND the lexeme as a head word in one repository could also be a lemma of some lexeme in another repository. Author choice. Just wait until you get to the rules of how to select the correct lemma-sense from a lexeme's collection when the clue to the right sense is a sentence four sentences away. It's just going to get more complicated from here. Sadly, Abstract is probably the last large scale manual tagging effort, as there are a plethora of existing tagged corpora that can support Abstract if you would just use a bit of machine learning. Please don't say it's too hard to understand where or how the magic happens, as there is actually a machine learning for dummies book. It's just different.

On Wed, Jun 30, 2021 at 10:49 AM Philippe Verdy verdyp@gmail.com wrote:

...
You are again making a sever confusion between "lexemes" (your comment is true about them: it is a form in some orthographic system) and "lemmas" (strictly identical to "senses").

I just said that your schema makes 1-to-many relations between LEMMAS and SENSES where this should be 1-to-1.

there are 1-to-many relations from LEXEMES to LEMMAS=SENSES, I've not contested that. but we cannot use LEXEMES as the base of text abstraction (in an abstract language), we'll use LEMMAS.

We don't need any complex relation like LEXEME --(1-to-N)--> LEMMA --(1-to-N)--> SENSE (the second pair is non-sense it should be 1-to-1, and thus merged).

The abstract text will contain LEMMAS (semantic), from which some rules will decide which lexeme (lexical and very specific to each language) to use according to the target language and other constraints, and then which form of the lexeme (grammatical derivations/inflections/conjugation/contextual mutations or particles, plus capitalizing rules for some syntaxic or presentation forms)

Le mer. 30 juin 2021 à 13:18, Andy borucki.andrzej@gmail.com a écrit :

> Most of most frequent lexems has more than one sense, one sense > usually have only rare lexems. > While adding lexem and sense, one must fill not "definition" but > "gloss" which should be very short. For example for "dog" is gloss "mammal" > although cat and cow are also mammals. It will be good if were both gloss > and definition? > _______________________________________________ > Abstract-Wikipedia mailing list -- > abstract-wikipedia@lists.wikimedia.org > List information: > https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed... > _______________________________________________ Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...

1116

Age (days ago)

1119

Last active (days ago)

abstract-wikipedia@lists.wikimedia.org

16 comments

6 participants

tags (0)

participants (6)

Andy
Douglas Clark
Gerard Meijssen
Philippe Verdy
Thad Guidry
Thomas Douillard