Hi, Andrzej
The assumption at the moment is, I think, that we will be using the Wikidata lexicographical data [1]. This is not yet as extensive as Wiktionary data [2], but it addresses many of the integrity issues. As far as I understand it, the modelling of Sense still suffers from the flaw that a Sense is presented as a "child" of a Lexeme. So, for example, L1883-S1 is a Sense of Lexeme L1883, representing the English verb to "be" with a gloss of "exist" and a "synonym" relationship to L2148-S1, a Sense of Lexeme L2148, representing the English verb to "exist". I could be wrong, but the simple idea of a word-free Sense to which all languages can link is implemented only through a possible link to a concrete Wikidata Item, so both L1883-S1 and L2148-S1 are linked to Q468777 (existence) and Q203872 (being). Apart from that, a separate translation of each Sense into each corresponding Sense in each language seems to be the intent, at present.
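To make the structure I describe above concrete, here is a minimal Python sketch. It is only my reading of the model, not the actual Wikibase schema: the class and field names (`Sense`, `Lexeme`, `items`, `synonyms`) and the helper `senses_by_item` are my own inventions. The identifiers are the ones quoted above; the gloss of L2148-S1 is left empty because I have not quoted it.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sense:
    """A Sense, modelled (as in Wikidata) as a child of one Lexeme."""
    sense_id: str
    gloss: str | None = None
    items: set[str] = field(default_factory=set)      # linked Wikidata Items
    synonyms: set[str] = field(default_factory=set)   # IDs of other Senses


@dataclass
class Lexeme:
    lexeme_id: str
    language: str
    lemma: str
    senses: list[Sense] = field(default_factory=list)


def senses_by_item(lexemes):
    """Group Sense IDs by the Item they link to (the only word-free pivot)."""
    index = defaultdict(list)
    for lexeme in lexemes:
        for sense in lexeme.senses:
            for item in sense.items:
                index[item].append(sense.sense_id)
    return {item: sorted(ids) for item, ids in index.items()}


be = Lexeme("L1883", "en", "be", [
    Sense("L1883-S1", "exist", {"Q468777", "Q203872"}, {"L2148-S1"}),
])
exist = Lexeme("L2148", "en", "exist", [
    Sense("L2148-S1", None, {"Q468777", "Q203872"}),  # gloss not quoted here
])

shared = senses_by_item([be, exist])
print(shared["Q468777"])
```

The point of the sketch is that the two Senses only "meet" through the Items they both link to; there is no Sense object that exists independently of a Lexeme.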
Wikidata also has Forms of Lexemes (but I didn't find "widziałem"). The Lexeme L185 ("see") has a Form L185-F3 ("saw") but this has no link to Form L18498-F1, the uninflected form of the verb to "saw" (unlike Wiktionary, which supports homographs implicitly). Each form has "grammatical features", showing that L185-F3 is the "simple past" of L185 but the same string, "saw", is the "simple present" of L18498. It does not explicitly say that this is not the case in the third person singular, but there is a different form, L18498-F2, which is both "simple present" and "third-person singular", so there may be a presumption that the more particular overrides the more general.
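The presumption that the more particular Form overrides the more general one could be sketched as follows. This is my assumption about how a consumer might choose a Form, not documented Wikidata behaviour. The names `Form` and `select_form` are hypothetical, and the representation "saws" for L18498-F2 is assumed from English grammar rather than checked against Wikidata.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    form_id: str
    representation: str
    features: frozenset[str]


def select_form(forms, context):
    """Pick the Form whose features all hold in the context.

    If several Forms qualify, the one with the most features (the most
    particular one) wins. Returns None when no Form is applicable.
    """
    applicable = [f for f in forms if f.features <= context]
    if not applicable:
        return None
    return max(applicable, key=lambda f: len(f.features))


# Forms of L18498, the verb "to saw".
saw_forms = [
    Form("L18498-F1", "saw", frozenset({"simple present"})),
    # Representation "saws" is assumed, see the note above.
    Form("L18498-F2", "saws",
         frozenset({"simple present", "third-person singular"})),
]

third = select_form(
    saw_forms, frozenset({"simple present", "third-person singular"}))
first = select_form(
    saw_forms, frozenset({"simple present", "first-person singular"}))
print(third.form_id, first.form_id)
```

With this rule, L18498-F1 does not need to state "not third-person singular": it simply loses to L18498-F2 whenever both apply.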
For "abstract" Senses, we could think of "abstract" as a new language, and then have translations between the "abstract" "language" and Senses in all natural (and synthetic) languages. This would give you your "senses dictionary" (and allow implied translations between any Senses linked to the same "abstract" Sense). When we need to generate a word in a particular language, we would need to translate the "abstract" Sense to the target language Lexeme and then consult the Forms of that Lexeme to identify which ones are applicable, given the "grammatical features" of the context.
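As a rough illustration of that two-step generation, here is a small Python sketch. Everything in it apart from L185 and its "simple past" Form "saw" is hypothetical: the "abstract" Sense ID `A-SEE`, the Polish Lexeme ID `PL-WIDZIEC`, the feature labels on the Polish Forms, and the function names are placeholders of mine, since I did not find the Polish Forms in Wikidata.

```python
from __future__ import annotations

# Hypothetical "abstract" Sense ID -> {language: Lexeme ID}.
ABSTRACT_TO_LEXEME = {
    "A-SEE": {"en": "L185", "pl": "PL-WIDZIEC"},  # PL-WIDZIEC is a placeholder
}

# Lexeme ID -> list of (representation, grammatical features).
# The feature labels on the Polish Forms are illustrative only.
FORMS = {
    "L185": [("saw", frozenset({"simple past"}))],
    "PL-WIDZIEC": [
        ("widziałem",
         frozenset({"past", "first-person singular", "masculine"})),
        ("widziałam",
         frozenset({"past", "first-person singular", "feminine"})),
    ],
}


def implied_translation(lexeme_id, target_language):
    """Find the target-language Lexeme sharing an "abstract" Sense."""
    for lexemes in ABSTRACT_TO_LEXEME.values():
        if lexeme_id in lexemes.values():
            return lexemes.get(target_language)
    return None


def realise(abstract_sense, language, context):
    """Abstract Sense -> target Lexeme -> most particular applicable Form."""
    lexeme_id = ABSTRACT_TO_LEXEME[abstract_sense][language]
    applicable = [(text, feats) for text, feats in FORMS[lexeme_id]
                  if feats <= context]
    if not applicable:
        return None
    return max(applicable, key=lambda pair: len(pair[1]))[0]


# The abstract context carries gender even though English ignores it.
context = frozenset(
    {"simple past", "past", "first-person singular", "feminine"})
print(realise("A-SEE", "en", context))
print(realise("A-SEE", "pl", context))
```

The context here deliberately carries more "grammatical features" than English needs, so that a language such as Polish can pick the right Form.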
Plenty more work to be done!
Best regards, Al.
[1] https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation
[2] https://www.aclweb.org/anthology/2020.idl-1.12.pdf
On Monday, 3 August 2020, abstract-wikipedia-request@lists.wikimedia.org wrote:
Message: 4
Date: Mon, 3 Aug 2020 12:29:03 +0200
From: Andy borucki.andrzej@gmail.com
To: abstract-wikipedia@lists.wikimedia.org
Subject: [Abstract-wikipedia] Loose notes
Message-ID: <CAE2KeAK00kSL=jJp8gNGPNp_N8KGH0yXXUXKSa6XLM9R-ParvA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
Abstract Wikipedia offers two benefits:
- First, it creates a multi-language corpus for training machine translation. The big disadvantage of the existing multi-language corpora is that most of the data comes from movie subtitles, which are very inaccurate.
- Second, it will provide data for training Word Sense Disambiguation (WSD), and in many languages(!).
The abstract form should contain a graph of senses. Will the senses be chosen from English WordNet/UNL or from English Wiktionary? UNL is a good piece of work, but it has been inactive for years and is not evolving. Wiktionary senses have the advantage that they are grouped by etymology: quite different senses fall into different etymology groups. Will Abstract Wikipedia be linked with Wiktionary? Wiktionary sense numbers should now be persistent or, better, have unique identifiers. Wiktionary has the advantage that senses are translated into other languages, with the disadvantage that the translations point to words, not senses, in the other language. Alternatively, Abstract Wikipedia could have its own sense list with identifiers, but how would it link with Wiktionary?
Graph: it should be possible to create text in many or all languages. For example, English has “I saw”, where Polish has “widziałem”/“widziałam”: Polish needs the gender, so the abstract form should carry the gender for the verb, even though some languages do not use it.
The senses dictionary can grow gradually with the abstract text. If I edit abstract text, the editor should require me to add a word with its senses to the dictionary if it does not exist, and enable me to add a new sense if it does not exist.
What is needed:
- abstract text = corpus
- a growing dictionary of senses
- a growing dictionary mapping senses to national-language senses
- possibly a link with the Wiktionaries
Best regards,
Andrzej