Hi, Andrzej
The assumption at the moment is, I think, that we will be using the Wikidata lexicographical data [1]. This is not yet as extensive as Wiktionary data [2], but it addresses many of the integrity issues. As far as I understand it, the modelling of Sense still suffers from the flaw that a Sense is presented as a "child" of a Lexeme. So, for example, L1883-S1 is a Sense of Lexeme L1883, representing the English verb to "be" with a gloss of "exist" and a "synonym" relationship to L2148-S1, a Sense of Lexeme L2148, representing the English verb to "exist". I could be wrong, but the simple idea of a word-free Sense to which all languages can link is implemented only through a possible link to a concrete Wikidata Item, so both L1883-S1 and L2148-S1 are linked to Q468777 (existence) and Q203872 (being). Apart from that, a separate translation of each Sense into each corresponding Sense in each language seems to be the intent, at present.
Wikidata also has Forms of Lexemes (but I didn't find "widziałem"). The Lexeme L185 ("see") has a Form L185-F3 ("saw") but this has no link to Form L18498-F1, the uninflected form of the verb to "saw" (unlike Wiktionary, which supports homographs implicitly). Each form has "grammatical features", showing that L185-F3 is the "simple past" of L185 but the same string, "saw", is the "simple present" of L18498. It does not explicitly say that this is not the case in the third person singular, but there is a different form, L18498-F2, which is both "simple present" and "third-person singular", so there may be a presumption that the more particular overrides the more general.
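The presumption that the more particular Form overrides the more general one can be sketched as follows. This is only an illustration of that reading, not how Wikidata itself resolves Forms; the `Form` class and `select_form` function are hypothetical names, and the spelling "saws" for L18498-F2 is an assumption, since only its grammatical features are described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    form_id: str
    text: str
    features: frozenset


# Forms of L18498 (the verb "to saw"). The text "saws" for F2 is an
# assumption; only the features of that form are given in the message.
FORMS = [
    Form("L18498-F1", "saw", frozenset({"simple present"})),
    Form("L18498-F2", "saws",
         frozenset({"simple present", "third-person singular"})),
]


def select_form(forms, context):
    """Pick the form whose features all hold in the context.

    When several forms qualify, the one with the most features wins,
    i.e. the more particular form overrides the more general one.
    """
    context = frozenset(context)
    candidates = [f for f in forms if f.features <= context]
    if not candidates:
        return None
    return max(candidates, key=lambda f: len(f.features))
```

With this rule, a context of "simple present" plus "third-person singular" selects L18498-F2, while any other "simple present" context falls back to L18498-F1.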
For "abstract" Senses, we could think of "abstract" as a new language, and then have translations between the "abstract" "language" and Senses in all natural (and synthetic) languages. This would give you your "senses dictionary" (and allow implied translations between any Senses linked to the same "abstract" Sense). When we need to generate a word in a particular language, we would need to translate the "abstract" Sense to the target language Lexeme and then consult the Forms of that Lexeme to identify which ones are applicable, given the "grammatical features" of the context.
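A minimal sketch of the implied-translation idea, assuming the only data available is a mapping from each Sense to the language-independent items it is linked to. The two Senses and two items are the ones mentioned above; the function name `implied_links` is hypothetical. Senses from other languages linked to the same items would be picked up in the same way.

```python
from collections import defaultdict

# Sense -> linked Wikidata items, as described above for "be" and "exist".
SENSE_TO_ITEMS = {
    "L1883-S1": {"Q468777", "Q203872"},
    "L2148-S1": {"Q468777", "Q203872"},
}


def implied_links(sense_to_items):
    """Return, for every sense, the other senses that share an item with it."""
    # Invert the mapping: item -> all senses linked to it.
    by_item = defaultdict(set)
    for sense, items in sense_to_items.items():
        for item in items:
            by_item[item].add(sense)

    # Any two senses attached to the same item are implied translations.
    links = defaultdict(set)
    for senses in by_item.values():
        for sense in senses:
            links[sense] |= senses - {sense}
    return dict(links)
```

The point of the design is that no pairwise sense-to-sense link has to be stored: each Sense needs only its link to the pivot.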
Plenty more work to be done!
Best regards, Al.
[1] https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation
[2] https://www.aclweb.org/anthology/2020.idl-1.12.pdf
On Monday, 3 August 2020, abstract-wikipedia-request@lists.wikimedia.org wrote:
Send Abstract-Wikipedia mailing list submissions to abstract-wikipedia@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia or, via email, send a message with subject or body 'help' to abstract-wikipedia-request@lists.wikimedia.org
You can reach the person managing the list at abstract-wikipedia-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Abstract-Wikipedia digest..."
Today's Topics:
- Re: Comprehension questions (Charles Matthews)
- Natural Language and Mathematics Generation (Adam Sobieski)
- Re: Natural Language and Mathematics Generation (Charles Matthews)
- Loose notes (Andy)
Message: 4
Date: Mon, 3 Aug 2020 12:29:03 +0200
From: Andy borucki.andrzej@gmail.com
To: abstract-wikipedia@lists.wikimedia.org
Subject: [Abstract-wikipedia] Loose notes
Message-ID: <CAE2KeAK00kSL=jJp8gNGPNp_N8KGH0yXXUXKSa6XLM9R-ParvA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
Abstract Wikipedia offers two benefits:
- First, it creates a multi-language corpus for training machine translation. The big disadvantage of the existing multi-language corpora is that most of the data comes from movie subtitles, which are very inaccurate.
- Second, it will provide data for training Word Sense Disambiguation (WSD), and for WSD in many languages(!).
The abstract form should be a graph of senses. Will the senses be chosen from English WordNet/UNL or from English Wiktionary? UNL is a good piece of work, but it has been inactive for years and is not evolving. Wiktionary senses have the advantage of being grouped by etymology: quite different senses fall into different etymology groups. Will Abstract Wikipedia be linked with Wiktionary? Wiktionary sense numbers would then have to be persistent or, better, senses should have unique identifiers. Wiktionary has the advantage that senses are translated into other languages, with the disadvantage that the translations point to words, not senses, in the other language. Alternatively, Abstract Wikipedia could have its own sense list with identifiers, but then how would it link to Wiktionary?
Graph: it should be possible to generate text in many or all languages. For example, English has “I saw” where Polish has “widziałem” / “widziałam”: Polish needs the gender. So the abstract form should carry the gender for the verb, even though some languages do not use it.
The senses dictionary can grow gradually along with the abstract text. When I edit abstract text, the editor should require me to add a word with its senses to the dictionary if the word does not exist, and let me add a new sense if the sense does not exist.
What is needed:
- abstract text = corpus
- a growing dictionary of senses
- a growing dictionary mapping senses to national-language senses
- possibly links with the Wiktionaries
Best regards,
Andrzej
I see that Wikidata also has lexicographical data. I think Wikidata lexemes are more machine-readable than Wiktionary lexemes. But shouldn't the definitions of lexemes also be Abstract graphs? At the moment there are only about 10 thousand lexemes. I don't see translations of lexemes into other languages. A sense could be translated to a lexeme in another language, or to a sense of a lexeme in another language. For example, I want to add Polish “zamek” and give a translation link from “lock” to “zamek”, but not to “zamek” as “castle” or “zip” (Polish “zamek” = English: castle, lock, zip). Is the lexicographical data also in the Wikidata dump? (It would be good to be able to download a dump of only the lexicographical data plus properties, because the dump of all of Wikidata is huge.)
Because the number of Wikidata lexemes is relatively small, a new set of lexemes might be better: all definitions would be graph-structured like other articles in Abstract Wikipedia, and definitions could even carry additional information, namely rules for automatically recognizing a sense from the context of unstructured text in many languages (but such rules are a difficult problem). If we define the noun lexeme "band", it can be a music group or a belt of material. WSD needs special rules for analysing context, because the Lesk algorithm and its modifications practically do not work. For example, consider the sentence "Each band member wore a band." We must know that:
1. a group of people has members
2. a belt of material can be worn, a music group cannot
and/or
1. the first is a group of persons, active
2. the second is passive
This is obvious to humans, but it is not at all clear from the definitions. It is a difficult problem, because even if we write rules like those above, a computer can't apply them to the sentence. I don't know whether such rules are possible; in any case, it would be good if the definitions were also in a structured graph form, which can be automatically translated into other languages.
Best regards, Andrzej
I'm not sure where you're getting the numbers from; there are over 200,000 lexemes in Wikidata, with roughly a dozen languages having at least thousands of entries. Obviously it's incomplete, but quite a lot of effort has gone into it already. For most nouns, a sense can be linked to a regular Wikidata item that is about a particular concept (and this has been done in at least several languages for tens of thousands of cases now, but again much more work is needed). One helper tool available to link lexeme senses and regular conceptual (language-independent) items is MachtSinn: https://machtsinn.toolforge.org/ - pick a language you know and help out!
Arthur
I was looking at https://www.wikidata.org/wiki/Wikidata:Lists/lexemes, where the last index is 9001-10000. Wikidata lexemes are in many languages, whereas the Abstract graph needs senses of English lexemes only (or even just senses independent of lexemes). Access to lexemes from another wiki project could be slower (?) and could be inconsistent.
Abstract Wikipedia can be a great basis for machine translation (from my Python post): a language-to-language translator without an intermediate form needs n*(n-1) language models, whereas a translator with an intermediate form needs only 2*n. For 100 languages that is 9900 compared to 200; for 150 languages it is 22350 compared to 300. The intermediate form must not be an ordinary national language but a language without ambiguity, without homonyms.
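The figures quoted above (9900 and 22350) correspond to counting ordered language pairs, n*(n-1), against one model into and one model out of the intermediate form, 2*n. A quick check, with hypothetical function names:

```python
def direct_models(n):
    """Models needed when every ordered pair of languages is translated directly."""
    return n * (n - 1)


def pivot_models(n):
    """Models needed with an intermediate form: one into it and one out of it per language."""
    return 2 * n


for n in (100, 150):
    print(n, direct_models(n), pivot_models(n))
```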
Let's imagine that we download databases for 5-10 languages to a local computer, compile a C++ binary or use a Python script, and have a local, open translation system without any restrictions. Moreover, the intermediate form is intelligent: it represents an understanding of the text and can be analysed.
The main problem is WSD (word sense disambiguation):
https://github.com/alvations/pywsd provides the Lesk algorithm in several variants:
- original_lesk
- simple_lesk
- adapted_lesk
- cosine_lesk
Test of simple_lesk with sentences from Senseval (https://www.d.umn.edu/~):
"The Glenn Miller Orchestra turned back the clock for an evening of solid gold nostalgia at the Apollo theatre, Oxford, on Saturday." "From the tune Anchors Away right through to In The Mood the young big band under the expert direction of band leader Ray McVay perfectly recreated the Miller sound." "The voices of Sarah Gilbertson and Tony Mansell were joined by three more from the band as the Moonlight Serenaders to give us the cool harmonies of Miller's original Modernairs."
--> simple_lesk: an unofficial association of people or groups
This sounds pretty sensible, but no overlap was found at all, and this is simply the first sense in WordNet!
"The victims are Hungarian Jews who arrived at Auschwitz in 1944 to be murdered - all 400,000 of them in just two months, consumed by the gas chambers at the rate of 21,000 a day even though the crematoria could not cope. They are dragging to the truck an old man wearing tails and a band around his arm."
--> adapted_lesk: a cord-like tissue connecting two larger parts of an anatomical structure
This sounds pretty sensible, but the overlap found was band+two, and that was pure accident: the text has "two months", which has nothing to do with binding something.
Comparing the words in a definition with the words in the context is not enough. A sense must be defined in some way by a set of contexts.
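To make the weakness concrete, here is a toy gloss-overlap scorer in the spirit of Lesk. It is a simplified sketch, not pywsd's implementation, and will not reproduce pywsd's output: the stopword list, the tokenizer and the function names are assumptions, and only three WordNet glosses for "band" are included.

```python
import re

# A deliberately tiny stopword list (an assumption; real systems use larger ones).
STOPWORDS = {"a", "an", "the", "of", "or", "to", "in", "and", "that", "is",
             "are", "his", "them", "all", "just", "they", "one"}

# Three of the WordNet glosses for the noun "band".
GLOSSES = {
    "association": "an unofficial association of people or groups",
    "worn strip": "a thin flat strip of flexible material that is worn around "
                  "the body or one of the limbs (especially to decorate the body)",
    "tissue": "a cord-like tissue connecting two larger parts of an "
              "anatomical structure",
}

CONTEXT = ("... in just two months ... They are dragging to the truck an old "
           "man wearing tails and a band around his arm.")


def content_words(text, exclude=()):
    """Lowercase, split on non-letters, drop stopwords and excluded words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and w not in exclude}


def gloss_overlaps(context, glosses, target):
    """Return, per sense, the content words shared by the context and the gloss."""
    ctx = content_words(context, exclude={target})
    return {sense: ctx & content_words(gloss) for sense, gloss in glosses.items()}


overlaps = gloss_overlaps(CONTEXT, GLOSSES, "band")
for sense, shared in overlaps.items():
    print(sense, sorted(shared))
```

In this toy setting the anatomical sense scores one point purely through "two" (from "two months"), the same score the intended sense gets through "around", so the ranking is decided by a word that has nothing to do with the meaning.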
Example: list of 13 senses from WordNet for the noun lexeme “band”, arranged into a hierarchy:
1. an unofficial association of people or groups
1.1.1 a group of musicians playing popular music for dancing
1.1.2 instrumentalists not including string players
1.1.3 brass ensemble
1.2 group of criminals or evil people (added to the WordNet definitions)
2. a restraint put around something to hold it together (point 2.1 is too detailed, I think)
2.1 a thin flat strip or loop of flexible material that goes around or over something else, typically to hold it together or as a decoration
2.1.1 a thin flat strip of flexible material that is worn around the body or one of the limbs (especially to decorate the body)
2.1.2 a cord-like tissue connecting two larger parts of an anatomical structure
2.2 a stripe or stripes of contrasting color
2.2.1 an adornment consisting of a strip of a contrasting color or material
2.2 a driving belt in machinery
2.3 a strip of material attached to the leg of a bird to identify it (as in studies of bird migration)
3. jewelry consisting of a circlet of precious metal (often set with jewels) worn on the finger
4. range (added to the WordNet definitions)
4.1 a range of frequencies between two limits
senses for entity “brass band” 1 = 1.1.3 of “band”
senses for entity “rubber band” 1 = 2.1 of “band”
Examples (from https://www.d.umn.edu/~ and Opus100):
1. Since birth they have had to fit in two hours of physiotherapy daily in order to survive and their one hope is "a cure" which embryonic research has brought in reach. How can a band of vociferous and woolly-minded objectors deny the hope of life to thousands of young sufferers by stopping this vital research for the sake of a few unformed embryonic cells?
1. You just met the band when you were delivering pizza to the studio?
1.1.1 Meanwhile, the show is nearly over and the band strike up "I Love You Love".
1.1.1 The NME succumbed to The Smiths success by parading a lengthy Smiths interview by Biba Kopf, a writer not known for his enthusiasm for the Smiths. Although hardly succeeding in exposing the conflict between the band's artistic stance and their situation (which was the probable intention), the dual interview of Morrissey and Marr did, for once, produce worthy quotes.
1.1.2 The Glenn Miller Orchestra turned back the clock for an evening of solid gold nostalgia at the Apollo theatre, Oxford, on Saturday. From the tune Anchors Away right through to In The Mood the young big band under the expert direction of band leader Ray McVay perfectly recreated the Miller sound. The voices of Sarah Gilbertson and Tony Mansell were joined by three more from the band as the Moonlight Serenaders to give us the cool harmonies of Miller's original Modernairs.
1.1.3 I welcome this project and wish it every success. Meanwhile, men in impeccable dinner suits swapped business cards at a rapid rate while women sporting glamorous cocktail dresses talked among themselves. A regimental brass band played very British tunes as guests enjoyed a sumptuous five-course banquet for which they paid at least #50 a head.
1.1.3 Oh, what song was the marching band playing, by the way?
1.2 Harukoma's band controls the passage of the village.
1.2 Today, freedom shall ring loud and clear as Olivia Pope's band of lawbreakers must explain their involvement in the Jeannine Locke cover-up.
1.2 He succumbed to a band of ruffians led by a scoundrel called Lagardère.
1.2 Brylov's band has been reported in the area.
2. It had a Smallville Savings and Loan band on it.
2.1 Eee-aww! Now, the Alice band has already perished, so I want you to treat those as though they're made of porcelain.
2.1 And the rubber band on the ball. (This may be the lexeme “rubber band” rather than “band”: one sense, two lexemes.)
2.1.1 The victims are Hungarian Jews who arrived at Auschwitz in 1944 to be murdered - all 400,000 of them in just two months, consumed by the gas chambers at the rate of 21,000 a day even though the crematoria could not cope. They are dragging to the truck an old man wearing tails and a band around his arm.
2.2 She stood for a moment before her tiny, packed closet. The lighter clothes were narrow bands of color randomly distributed in the woollier press of darker skirts and coats, forming a pattern like a spectrograph.
2.2.1 Nobody is going to accept a beard and a green band round the turban and a few pious phrases (or even a lot of them!) as proof.
4 But no judge was unsympathetic to the dilemma in which a natural mother found herself. Mr McCormick submitted that if any one of the mother's reasons was possibly valid then it could not be said that the mother's refusal to consent was outside the reasonable band.
4. (confusion with 1) Nothing that emerged from either of these companies ever seemed particularly daring, imaginative or stirring. Similarly, after the first wave of enthusiasm for the films from Channel 4 had died down, it was clear that its drama director, David Rose, was more interested in television drama than cinema. British filmmakers were working in too narrow an aesthetic band, defined largely by television and at no point making any connection to the cinema culture that had last flourished in the late 1940s.
4.1 The example of the police radios shows the relative permanence of being allocated a piece of spectrum - radios and other broadcasting equipment, whether for entertainment or communication, are designed to sort out what it wants to pick up from the rest of the signal. To perform well it has to be tightly targeted to cope with quite a narrow band of frequencies.
4.1 As an initial step the levels of emissions in the FM frequency band (76 to 108 MHz) shall be measured at the vehicle broadcast radio antenna with an average detector.
4.1 Frequency plan for the 169,4 - 169,8125 MHz radio spectrum band
count =1 of “brass band”
Rules (needed, but difficult to define!):
1 band of; can be met (“met the band”)
1.1.1 near “strike up”; band's artistic stance
1.1.2 names: big band
1.1.3 names: brass, marching
1.2 the band's name is a person's name; band of [ruffians, criminals, thieves, ...]
2 band on it
2.1 rubber band; can be perished
2.1.1 around arm
2.2 bands of color
2.2.1 band round
4 reasonable band, narrow band
4.1 band of frequencies; frequency, radio, spectrum
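A few of these cues can be written down as patterns to see how far such rules reach. This is a minimal sketch under heavy assumptions: the regular expressions are my own reading of the cues, only some senses are covered, and `match_senses` is a hypothetical name.

```python
import re

# Cue patterns per sense number; a hand-written, incomplete subset of the rules.
RULES = {
    "1.1.1": [r"\bstrikes? up\b"],
    "1.1.3": [r"\b(?:brass|marching) band\b"],
    "1.2": [r"\bband of (?:ruffians|criminals|thieves)\b"],
    "2.1": [r"\brubber band\b"],
    "2.1.1": [r"\bband around (?:\w+ )?arm\b"],
    "2.2": [r"\bbands of colou?r\b"],
    "4": [r"\b(?:reasonable|narrow) band\b"],
    "4.1": [r"\bband of frequencies\b", r"\bfrequency band\b",
            r"\bspectrum band\b"],
}


def match_senses(sentence, rules=RULES):
    """Return the sense numbers whose cue patterns occur in the sentence."""
    text = sentence.lower()
    return [sense for sense, patterns in rules.items()
            if any(re.search(p, text) for p in patterns)]


print(match_senses("A regimental brass band played very British tunes."))
print(match_senses("He succumbed to a band of ruffians led by a scoundrel."))
print(match_senses("It has to cope with quite a narrow band of frequencies."))
print(match_senses("Each band member wore a band."))
```

The last sentence matches nothing, which illustrates the difficulty: surface cues cover the collected examples, but "Each band member wore a band." needs knowledge that groups have members and belts can be worn.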
Translation into Polish (the link must point to a sense, not only to a lexeme):
1. grupa
1.1.1 zespół
1.1.2 band (jazz, swing)
1.1.3 orkiestra
1.2 banda
2 opaska
2.1 opaska; “rubber band”: gumka recepturka
2.2 pasek
2.3 opaska
4. zakres
4.1 pasmo, zakres
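The table above can be read as a lookup keyed by sense number rather than by the word "band". In the sketch below, falling back to the nearest listed broader sense when a sense has no entry of its own is an assumption of mine, not something the table states; `translate` is a hypothetical name.

```python
# Polish translations keyed by sense number of the English lexeme "band".
POLISH = {
    "1": ["grupa"],
    "1.1.1": ["zespół"],
    "1.1.2": ["band"],
    "1.1.3": ["orkiestra"],
    "1.2": ["banda"],
    "2": ["opaska"],
    "2.1": ["opaska"],
    "2.2": ["pasek"],
    "2.3": ["opaska"],
    "4": ["zakres"],
    "4.1": ["pasmo", "zakres"],
}


def translate(sense_id, table=POLISH):
    """Look up a sense; if it has no entry, try its broader senses in turn."""
    parts = sense_id.split(".")
    while parts:
        key = ".".join(parts)
        if key in table:
            return table[key]
        parts.pop()  # drop the last component: 2.1.1 -> 2.1 -> 2
    return None


print(translate("1.2"))
print(translate("2.1.1"))
print(translate("3"))
```

A lookup by the bare word would have to return every entry at once; keyed by sense, "band" in sense 1.2 gives only "banda" and in sense 4.1 only "pasmo" or "zakres".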
Note: rules work better not for senses independent of lexemes but for distinguishing between the definitions of one lexeme across many languages (Polish zamek = castle, lock, zip). Thus, given such rules, a main dictionary of senses without lexemes is not needed; what is needed are lexemes divided into senses, each linked to the corresponding senses within lexemes in other languages.
Best regards, Andrzej
Is there a road map on https://meta.wikimedia.org/ with estimated dates for Abstract Wikipedia?