I will just add a few loose notes:
My/our Ordia website attempts to show you the words and translations.
Zamek is here: https://ordia.toolforge.org/L270298
Link to castle: https://ordia.toolforge.org/Q23413 where there are 5 languages.
lock: https://ordia.toolforge.org/Q228039 (only lock and zamek)
and other sense of lock: "system used to ignite propellant of firearm" https://ordia.toolforge.org/Q1134386
The number of lexemes that I find is 303,845 (not "over 325,000"): https://ordia.toolforge.org/statistics/
And yes, there are over 40,000 English lexemes https://ordia.toolforge.org/language/
I have recently written a bit on the statistics and the linkage between languages in the "Lexemes in Wikidata: 2020 Status" article from the 7th Workshop on Linked Data in Linguistics (LDL-2020): https://people.compute.dtu.dk/faan/ps/Nielsen2020Lexemes.pdf
Slides are available as https://people.compute.dtu.dk/faan/ps/Nielsen2020Lexemes_slides.pdf
Generally, I would say that the interlinking between languages is still sparse. The sense-Q-item matrix on page 10 of the slides, which was computed in February, shows that only the English-Hebrew combination had more than 1,000 sense-combinations. But I think the interlinking is slowly growing. And perhaps Wikilambda/abstract can help spread the word about Wikidata lexemes.
You can get a dump of the lexemes, e.g., here: https://dumps.wikimedia.org/wikidatawiki/entities/20200725/ The smallest compressed file is 200 MB.
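As a rough sketch of how such a dump could be read: the code below assumes the usual Wikidata JSON dump layout (one entity per line inside a JSON array, each line ending with a comma) and uses the Wikidata property P5137 ("item for this sense") to find sense-to-Q-item links. The helper names `iter_entities` and `sense_items` are made up for this note, and the sample line is hand-written and heavily trimmed; in particular, which sense of L270298 links to which Q-item is an assumption for illustration only.

```python
import json

ITEM_FOR_THIS_SENSE = "P5137"  # Wikidata property "item for this sense"


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata-style JSON dump.

    Assumes one entity per line, wrapped in '[' ... ']' with trailing commas.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def sense_items(lexeme):
    """Yield (sense_id, q_item_id) pairs for senses linked to a Q-item."""
    for sense in lexeme.get("senses", []):
        claims = sense.get("claims") or {}
        if not isinstance(claims, dict):
            # Empty claim sets may be serialised as a list; nothing to read.
            continue
        for claim in claims.get(ITEM_FOR_THIS_SENSE, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                yield sense["id"], value["id"]


# Hand-written, trimmed sample in the assumed layout (not copied from a dump).
SAMPLE = [
    "[",
    '{"type":"lexeme","id":"L270298",'
    '"lemmas":{"pl":{"language":"pl","value":"zamek"}},'
    '"senses":[{"id":"L270298-S1","claims":{"P5137":['
    '{"mainsnak":{"datavalue":{"value":{"id":"Q23413"}}}}]}}]},',
    "]",
]

pairs = [pair for entity in iter_entities(SAMPLE) for pair in sense_items(entity)]
print(pairs)
```

For a real file, the same functions should work on a compressed dump opened in text mode, e.g. `bz2.open(path, "rt", encoding="utf-8")` from the Python standard library, where `path` points to whichever lexeme file you downloaded.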
Best regards,
Finn Årup Nielsen
https://people.compute.dtu.dk/faan/
On 04/08/2020 11:33, Grounder UK wrote:
Andrzej, Yes, there are over 325,000 lexemes in Wikidata now, over 40,000 for English.
"Abstract" definitions are a little tricky, but it is not Lexemes themselves that are defined, it is their Senses, and Senses can be linked to Wikidata Items, which connects Lexemes into the abstract graph of "knowledge".
Translations are still very incomplete but, as with definitions, it is the Sense that should have the translation. The difficulty is that translation cannot imply identity, which means that you cannot assume that a Sense to Sense translation allows you to acquire translations from the Sense you translate into. If you think of each Sense as a set, you cannot tell whether the translated Sense is a subset or a superset. What we need for that is the concept of the intersection between the two sets, which would be part of each Sense but not necessarily the whole of either Sense.
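The set picture above can be made concrete with a small toy sketch. Each Sense is modelled as a set of the things it can denote; the three Senses and their members are invented purely for illustration and are not taken from Wikidata, and the `relation` helper is a hypothetical name. The sketch shows that two overlapping "translations" in a row can lead to a Sense that shares nothing with the starting one.

```python
# Toy model: each Sense is the set of things it can denote.
# All members below are invented for illustration only.
sense_a = frozenset({"door lock", "padlock", "bicycle lock"})
sense_b = frozenset({"door lock", "padlock", "zip fastener"})
sense_c = frozenset({"zip fastener", "press stud"})


def relation(x, y):
    """Describe how Sense x relates to Sense y when both are viewed as sets."""
    if x == y:
        return "identical"
    if x < y:
        return "subset"
    if x > y:
        return "superset"
    if x & y:
        return "overlapping"
    return "disjoint"


# A translation link only tells us that the intersection is non-empty.
print(relation(sense_a, sense_b), sorted(sense_a & sense_b))
print(relation(sense_b, sense_c), sorted(sense_b & sense_c))

# Following two translation links in a row can leave the original meaning:
print(relation(sense_a, sense_c))
```

Recording the intersection itself, rather than a bare "translates to" link, is what would let a consumer decide whether a chained translation is still safe.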
So, broadly, your example of "zamek" is not a problem; you can connect the "lock" Sense to the Sense of the English word "lock" (L1132-S1) as well as to the identifier for the encyclopedic concept Q228039 and/or Q24644118 (claimed to be a subclass of Q228039). But you should not connect it to L1132-S2 (which connects to Q105731 pl:"Śluza wodna") or to L1132-S3 (Q1134386 pl:"Zamek (broń)", assuming that's a different Sense of "zamek" too). (I say this without knowing enough Polish to know if it makes sense; I'm living in Searle's Chiński pokój!)[1]
I don't know whether the lexical data is in the dumps now, but it will be pretty huge just by itself. It is also quite dependent on the main Wikidata pages. For our natural-language generation, that's a great strength, because we can move naturally from the concept to the word and related vocabulary in any language without doing any translation. The extra context we need to be able to choose the right Form of the Lexeme for the Sense... that will need more work on the data, as will characterising thesaurus relations (hypernymy, synonymy, hyponymy, antonymy etc) so that good alternative Lexemes can be found. In an "abstract" context, these can be thought of as "translations" into overlapping Senses, but the extent to which we represent and consult (or navigate within) the broader compound Sense domain (the set union of the Senses) is... an interesting challenge.
As for a fully "abstract" dictionary that can be read in any language... We'll be better able to think about that once we have built a few renderers for our "abstract" encyclopedic content, in my view. Machine translation and natural-language understanding are not our primary goal. I think we will make progress on both, if we remember to pay attention to inverse functions as we evolve our NLG renderers, but we have a very long way to go in all directions (and all languages).
Best regards, Al.
[1] https://pl.wikipedia.org/wiki/Chi%C5%84ski_pok%C3%B3j

On Monday, 3 August 2020, <abstract-wikipedia-request@lists.wikimedia.org> wrote:
Send Abstract-Wikipedia mailing list submissions to
abstract-wikipedia@lists.wikimedia.org

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
or, via email, send a message with subject or body 'help' to
abstract-wikipedia-request@lists.wikimedia.org

You can reach the person managing the list at
abstract-wikipedia-owner@lists.wikimedia.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of Abstract-Wikipedia digest..."

Today's Topics:

1. Re: Natural Language and Mathematics Generation (Adam Sobieski)
2. Re: Loose notes (Andy)
3. Re: Loose notes (Arthur Smith)

----------------------------------------------------------------------

Message: 1
Date: Mon, 3 Aug 2020 18:23:03 +0000
From: Adam Sobieski <adamsobieski@hotmail.com>
To: Charles Matthews <charles.r.matthews@ntlworld.com>, "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics Generation
Message-ID: <CH2PR12MB4184F2C81E4CD533ACFE9547C54D0@CH2PR12MB4184.namprd12.prod.outlook.com>
Content-Type: text/plain; charset="utf-8"

Charles,

There is also MathML to consider. Work is underway at the W3C with respect to a new version of MathML, MathML4 [1][2]. Work is underway with respect to adding MathML support to Chromium [3][4]. Instead of LaTeX, MathML could be the way to go.
Best regards,
Adam

[1] https://www.w3.org/community/mathml4/
[2] https://mathml-refresh.github.io/mathml/
[3] https://www.chromestatus.com/feature/5240822173794304
[4] https://mathml.igalia.com/

From: Charles Matthews via Abstract-Wikipedia <abstract-wikipedia@lists.wikimedia.org>
Sent: Monday, August 3, 2020 1:53 PM
To: General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics Generation

On 03 August 2020 at 16:50 Adam Sobieski <adamsobieski@hotmail.com> wrote:

By utilizing <math>LaTeX</math> elements in an XML-based intermediate output format, one could simply copy that mathematical content to the resultant output wikitext [3]. Wikitext utilizes this same convention for mathematical expressions [3].

Whether or not to include mathematics in Abstract Wikipedia is an important decision to make at a future point. Choosing to include mathematics would entail discussions about representing mathematical knowledge on Wikidata. It would entail discussions about how specific senses of certain words have mathematical meaning. It would entail discussions about how algorithms should determine when to use mathematical and scientific notations and when they should, instead, use paraphrases with the semantic content expressed using natural language. These are just some of the discussion topics which would arise should we desire to include mathematical and scientific notations in Abstract Wikipedia articles.

I'm disagreeing with much of this.
On LaTeX: while it is "industry standard", I'd like to draw attention to a point made in https://en.wikipedia.org/wiki/Help:Displaying_a_formula#Rendering: "Latex does not have full support for Unicode characters, and not all characters render." It goes on to suggest that Vietnamese, for example, would not be well catered for, in terms of its diacritics.

I appreciate that we are currently only talking about scoping and high-level initial planning. But given AW's objectives, this is not a good sign, and I don't think we should just assume that LaTeX as an incumbent gets waved through. It is pre-Web, and something closer to HTML would be preferable, in my view.

My background is in mathematics, and I began my Wikipedia career writing mathematics articles. There are certainly issues, such as prose/notation balance. Mathematical language is heavily overloaded, from the disambiguation aspect. But I'm not really recognising the landscape of issues set out there.

Charles