I will just add a few loose notes:
My/our Ordia website attempts to show you the words and translations.
Zamek is here:
https://ordia.toolforge.org/L270298
Link to castle:
https://ordia.toolforge.org/Q23413 where there are 5
languages.
lock:
https://ordia.toolforge.org/Q228039 (only lock and zamek)
and another sense of lock: "system used to ignite propellant of firearm"
https://ordia.toolforge.org/Q1134386
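The same kind of lookup can be sketched against the Wikidata Query Service. This is only a rough illustration, not how Ordia itself is implemented: it assumes senses are linked to Q-items through the "item for this sense" property (P5137) and relies on the prefixes that the query service predefines. The function names are made up for the example.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def lexemes_for_item_query(qid: str) -> str:
    """Build a SPARQL query listing lexemes that have a sense linked to `qid`.

    Assumption: senses point to Q-items via P5137 ("item for this sense").
    """
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    return f"""
SELECT ?lexeme ?lemma ?languageLabel WHERE {{
  ?lexeme ontolex:sense ?sense ;
          wikibase:lemma ?lemma ;
          dct:language ?language .
  ?sense wdt:P5137 wd:{qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


def query_url(qid: str) -> str:
    """Return a GET URL that asks the query service for JSON results."""
    params = {"query": lexemes_for_item_query(qid), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)
```

Calling query_url("Q228039") gives a URL that can be opened in a browser or fetched with any HTTP client; for that item it should list the lexemes whose senses are linked to it.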
The number of lexemes that I find is 303,845 (not "over 325,000"):
https://ordia.toolforge.org/statistics/
And yes, there are over 40,000 English lexemes
https://ordia.toolforge.org/language/
I have recently written a bit on the statistics and the linkage between
languages in the "Lexemes in Wikidata: 2020 Status" article from the 7th
Workshop on Linked Data in Linguistics (LDL-2020)
https://people.compute.dtu.dk/faan/ps/Nielsen2020Lexemes.pdf
Slides are available as
https://people.compute.dtu.dk/faan/ps/Nielsen2020Lexemes_slides.pdf
Generally, I would say that the interlinking between languages is still
sparse. The sense-Q-item matrix on page 10 of the slides, which was
computed in February, shows that only the English-Hebrew combination had
more than 1,000 sense-combinations. But I think it is slowly growing. And
perhaps Wikilambda/Abstract Wikipedia can help spread the word about
Wikidata lexemes.
You can get a dump of the lexemes, e.g., here:
https://dumps.wikimedia.org/wikidatawiki/entities/20200725/ The smallest
compressed file is 200 MB.
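A minimal sketch of reading such a dump, assuming it follows the usual Wikidata JSON dump layout (one big JSON array with one entity per line, each line ending in a comma) and that each lexeme carries "type" and "language" fields. The function names and the file path passed to count_dump are placeholders; check the directory listing for the actual file name.

```python
import bz2
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity per line, skipping the enclosing '[' and ']' lines."""
    for line in lines:
        line = line.strip().rstrip(",")  # entity lines end with a comma
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def count_by_language(entities: Iterable[dict]) -> Counter:
    """Count lexemes per language Q-item (assumes a 'language' field)."""
    return Counter(
        entity.get("language", "unknown")
        for entity in entities
        if entity.get("type") == "lexeme"
    )


def count_dump(path: str) -> Counter:
    """Stream a bz2-compressed dump without loading it all into memory."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_by_language(iter_entities(handle))
```

Streaming line by line keeps memory use small even though the uncompressed file is much larger than the download.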
best regards
Finn Årup Nielsen
https://people.compute.dtu.dk/faan/
On 04/08/2020 11:33, Grounder UK wrote:
> Andrzej,
> Yes, there are over 325,000 lexemes in Wikidata now, over 40,000 for
> English.
>
> "Abstract" definitions are a little tricky, but it is not Lexemes
> themselves that are defined, it is their Senses, and Senses can be
> linked to Wikidata Items, which connects Lexemes into the abstract graph
> of "knowledge".
>
> Translations are still very incomplete but, as with definitions, it is
> the Sense that should have the translation. The difficulty is that
> translation cannot imply identity, which means that you cannot assume
> that a Sense to Sense translation allows you to acquire translations
> from the Sense you translate into. If you think of each Sense as a set,
> you cannot tell whether the translated Sense is a subset or a superset.
> What we need for that is the concept of the intersection between the two
> sets, which would be part of each Sense but not necessarily the whole of
> either Sense.
>
> So, broadly, your example of "zamek" is not a problem; you can connect
> the "lock" Sense to the Sense of the English word "lock" (L1132-S1) as
> well as to the identifier for the encyclopedic concept Q228039 and/or
> Q24644118 (claimed to be a subclass of Q228039). But you should not
> connect it to L1132-S2 (which connects to Q105731 pl:"Śluza wodna") or
> to L1132-S3 (Q1134386 pl:"Zamek (broń)", assuming that's a different
> Sense of "zamek" too). (I say this without knowing enough Polish to know
> if it makes sense; I'm living in Searle's Chiński pokój!)[1]
>
> I don't know whether the lexical data is in the dumps now, but it will
> be pretty huge just by itself. It is also quite dependent on the main
> Wikidata pages. For our natural-language generation, that's a great
> strength, because we can move naturally from the concept to the word and
> related vocabulary in any language without doing any translation. The
> extra context we need to be able to choose the right Form of the Lexeme
> for the Sense... that will need more work on the data, as will
> characterising thesaurus relations (hypernymy, synonymy, hyponymy,
> antonymy etc) so that good alternative Lexemes can be found. In an
> "abstract" context, these can be thought of as "translations" into
> overlapping Senses, but the extent to which we represent and consult (or
> navigate within) the broader compound Sense domain (the set union of the
> Senses) is... an interesting challenge.
>
> As for a fully "abstract" dictionary that can be read in any language...
> We'll be better able to think about that once we have built a few
> renderers for our "abstract" encyclopedic content, in my view. Machine
> translation and natural-language understanding are not our primary goal.
> I think we will make progress on both, if we remember to pay attention
> to inverse functions as we evolve our NLG renderers, but we have a very
> long way to go in all directions (and all languages).
>
> Best regards,
> Al.
>
> [1] https://pl.wikipedia.org/wiki/Chi%C5%84ski_pok%C3%B3j
> On Monday, 3 August 2020,
> <abstract-wikipedia-request(a)lists.wikimedia.org> wrote:
>
> Today's Topics:
>
> 1. Re: Natural Language and Mathematics Generation (Adam Sobieski)
> 2. Re: Loose notes (Andy)
> 3. Re: Loose notes (Arthur Smith)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 3 Aug 2020 18:23:03 +0000
> From: Adam Sobieski <adamsobieski(a)hotmail.com>
> To: Charles Matthews <charles.r.matthews(a)ntlworld.com>, "General
> public mailing list for the discussion of Abstract Wikipedia (aka
> Wikilambda)" <abstract-wikipedia(a)lists.wikimedia.org>
> Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics
> Generation
> Message-ID:
> <CH2PR12MB4184F2C81E4CD533ACFE9547C54D0(a)CH2PR12MB4184.namprd12.prod.outlook.com>
>
> Content-Type: text/plain; charset="utf-8"
>
> Charles,
>
> There is also MathML to consider. Work is underway at the W3C with
> respect to a new version of MathML, MathML4 [1][2]. Work is underway
> with respect to adding MathML support to Chromium [3][4].
>
> Instead of LaTeX, MathML could be the way to go.
>
>
> Best regards,
> Adam
>
> [1] https://www.w3.org/community/mathml4/
> [2] https://mathml-refresh.github.io/mathml/
> [3] https://www.chromestatus.com/feature/5240822173794304
> [4] https://mathml.igalia.com/
>
> From: Charles Matthews via Abstract-Wikipedia
> <abstract-wikipedia@lists.wikimedia.org>
> Sent: Monday, August 3, 2020 1:53 PM
> To: General public mailing list for the discussion of Abstract
> Wikipedia (aka Wikilambda) <abstract-wikipedia@lists.wikimedia.org>
> Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics
> Generation
>
>
>
> On 03 August 2020 at 16:50 Adam Sobieski
> <adamsobieski(a)hotmail.com> wrote:
>
>
>
> By utilizing <math>LaTeX</math> elements in an XML-based
> intermediate output format, one could simply copy that mathematical
> content to the resultant output wikitext [3]. Wikitext utilizes this
> same convention for mathematical expressions [3].
>
>
>
> Whether or not to include mathematics in Abstract Wikipedia is an
> important decision to make at a future point. Choosing to include
> mathematics would entail discussions about representing mathematical
> knowledge on Wikidata. It would entail discussions about how
> specific senses of certain words have mathematical meaning. It would
> entail discussions about how algorithms should determine when to use
> mathematical and scientific notations and when they should, instead,
> use paraphrases with the semantic content expressed using natural
> language. These are just some of the discussion topics which would
> arise should we desire to include mathematical and scientific
> notations in Abstract Wikipedia articles.
>
>
>
>
>
> I'm disagreeing with much of this.
>
> On LaTeX: while it is "industry standard", I'd like to draw
> attention to a point made in
> https://en.wikipedia.org/wiki/Help:Displaying_a_formula#Rendering:
> "Latex does not have full support for Unicode characters, and not
> all characters render."
>
> It goes on to suggest that Vietnamese, for example, would not be
> well catered for, in terms of its diacritics.
>
> I appreciate that we are only talking currently about scoping, and
> high-level initial planning. But given AW's objectives, this is not
> a good sign, and I don't think we should just assume that LaTeX as
> an incumbent gets waved through. It is pre-Web, and something closer
> to HTML would be preferable, in my view.
>
> My background is in mathematics, and I began my Wikipedia career
> writing mathematics articles. There are certainly issues, such as
> prose/notation balance. Mathematical language is heavily overloaded,
> from the disambiguation aspect. But I'm not really recognising the
> landscape of issues set out there.
>
> Charles
>
>