New subject: Loose notes

4 Aug 2020

Andrzej,
Yes, there are over 325,000 lexemes in Wikidata now, over 40,000 for
English.

"Abstract" definitions are a little tricky, but it is not Lexemes
themselves that are defined, it is their Senses, and Senses can be linked
to Wikidata Items, which connects Lexemes into the abstract graph of
"knowledge".

Translations are still very incomplete but, as with definitions, it is the
Sense that should have the translation. The difficulty is that translation
cannot imply identity, which means that you cannot assume that a Sense to
Sense translation allows you to acquire translations from the Sense you
translate into. If you think of each Sense as a set, you cannot tell
whether the translated Sense is a subset or a superset. What we need for
that is the concept of the intersection between the two sets, which would
be part of each Sense but not necessarily the whole of either Sense.
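
As a toy illustration of the set view above (a minimal sketch only: the
Sense names and the member labels r1..r5 are invented for the example and
are not Wikidata data), the overlap is the part two Senses share, and
chaining translations can lose it entirely:

```python
# Toy model: each Sense is the set of things it can refer to.
# All names and members below are hypothetical placeholders.

def shared_meaning(sense_a: set, sense_b: set) -> set:
    """The intersection: part of each Sense, not necessarily all of either."""
    return sense_a & sense_b

def relation(sense_a: set, sense_b: set) -> str:
    """Classify how two Senses relate when viewed as sets."""
    if sense_a == sense_b:
        return "identical"
    if sense_a < sense_b:
        return "subset"
    if sense_a > sense_b:
        return "superset"
    if sense_a & sense_b:
        return "overlap"
    return "disjoint"

source = {"r1", "r2", "r3"}   # the Sense being translated
target = {"r2", "r3", "r4"}   # the Sense it is translated into
onward = {"r4", "r5"}         # a further translation of the target Sense

print(relation(source, target))        # a translation link only tells us they overlap
print(shared_meaning(source, target))  # the part that is actually shared
print(relation(source, onward))        # nothing shared: chaining was not safe
```

Here the link from source to target is a fair translation, and so is the
link from target to onward, yet source and onward have nothing in common.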

So, broadly, your example of "zamek" is not a problem; you can connect the
"lock" Sense to the Sense of the English word "lock" (L1132-S1) as well as
to the identifier for the encyclopedic concept Q228039 and/or Q24644118
(claimed to be a subclass of Q228039). But you should not connect it to
L1132-S2 (which connects to Q105731 pl:"Śluza wodna") or to L1132-S3
(Q1134386 pl:"Zamek (broń)", assuming that's a different Sense of "zamek"
too). (I say this without knowing enough Polish to know if it makes sense;
I'm living in Searle's Chiński pokój!)[1]
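
A rough sketch of how such links could be followed, assuming a simple
in-memory table rather than any real Wikidata API. The Polish Sense ID
below is a placeholder (I have not looked up the real one), and the link
from L1132-S1 to Q228039 is my assumption; the other links are the ones
mentioned above.

```python
# Sense -> Item links, held in a plain dict for illustration only.
SENSE_ITEMS = {
    "L1132-S1": {"Q228039"},    # English "lock" (assumed link to the Item)
    "L1132-S2": {"Q105731"},    # pl:"Śluza wodna"
    "L1132-S3": {"Q1134386"},   # pl:"Zamek (broń)"
    # Placeholder ID for the "lock" Sense of Polish "zamek" (hypothetical).
    "zamek-lock-sense": {"Q228039", "Q24644118"},
}

def senses_for_item(item_id: str) -> list:
    """All Senses linked to a given Item: from the concept to the words."""
    return sorted(s for s, items in SENSE_ITEMS.items() if item_id in items)

def senses_sharing_item(sense_id: str) -> list:
    """Other Senses that share at least one Item with the given Sense."""
    items = SENSE_ITEMS[sense_id]
    return sorted(
        other
        for other, other_items in SENSE_ITEMS.items()
        if other != sense_id and items & other_items
    )

print(senses_sharing_item("zamek-lock-sense"))
print(senses_for_item("Q105731"))
```

Sharing an Item suggests that two Senses overlap; it does not show that
they are identical.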

I don't know whether the lexical data is in the dumps now, but it will be
pretty huge just by itself. It is also quite dependent on the main Wikidata
pages. For our natural-language generation, that's a great strength,
because we can move naturally from the concept to the word and related
vocabulary in any language without doing any translation. The extra context
we need to be able to choose the right Form of the Lexeme for the Sense...
that will need more work on the data, as will characterising thesaurus
relations (hypernymy, synonymy, hyponymy, antonymy etc) so that good
alternative Lexemes can be found. In an "abstract" context, these can be
thought of as "translations" into overlapping Senses, but the extent to
which we represent and consult (or navigate within) the broader compound
Sense domain (the set union of the Senses) is... an interesting challenge.

As for a fully "abstract" dictionary that can be read in any language...
We'll be better able to think about that once we have built a few renderers
for our "abstract" encyclopedic content, in my view. Machine translation
and natural-language understanding are not our primary goal. I think we
will make progress on both, if we remember to pay attention to inverse
functions as we evolve our NLG renderers, but we have a very long way to go
in all directions (and all languages).

Best regards,
Al.

[1] https://pl.wikipedia.org/wiki/Chi%C5%84ski_pok%C3%B3j

On Monday, 3 August 2020, <abstract-wikipedia-request(a)lists.wikimedia.org>
wrote:

...
  Send Abstract-Wikipedia mailing list submissions to
         abstract-wikipedia(a)lists.wikimedia.org

 To subscribe or unsubscribe via the World Wide Web, visit
         https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
 or, via email, send a message with subject or body 'help' to
         abstract-wikipedia-request(a)lists.wikimedia.org

 You can reach the person managing the list at
         abstract-wikipedia-owner(a)lists.wikimedia.org

 When replying, please edit your Subject line so it is more specific
 than "Re: Contents of Abstract-Wikipedia digest..."

 Today's Topics:

    1. Re: Natural Language and Mathematics Generation (Adam Sobieski)
    2. Re: Loose notes (Andy)
    3. Re: Loose notes (Arthur Smith)

 ----------------------------------------------------------------------

 Message: 1
 Date: Mon, 3 Aug 2020 18:23:03 +0000
 From: Adam Sobieski <adamsobieski(a)hotmail.com>
 To: Charles Matthews <charles.r.matthews(a)ntlworld.com>, "General
         public mailing list for the discussion of Abstract Wikipedia (aka
         Wikilambda)" <abstract-wikipedia(a)lists.wikimedia.org>
 Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics
         Generation
 Message-ID:
         <CH2PR12MB4184F2C81E4CD533ACFE9547C54D0(a)CH2PR12MB4184.namprd12.prod.outlook.com>

 Content-Type: text/plain; charset="utf-8"

 Charles,

 There is also MathML to consider. Work is underway at the W3C with respect
 to a new version of MathML, MathML4 [1][2]. Work is underway with respect
 to adding MathML support to Chromium [3][4].

 Instead of LaTeX, MathML could be the way to go.

 Best regards,
 Adam

 [1] https://www.w3.org/community/mathml4/
 [2] https://mathml-refresh.github.io/mathml/
 [3] https://www.chromestatus.com/feature/5240822173794304
 [4] https://mathml.igalia.com/

 From: Charles Matthews via Abstract-Wikipedia
 <mailto:abstract-wikipedia(a)lists.wikimedia.org>
 Sent: Monday, August 3, 2020 1:53 PM
 To: General public mailing list for the discussion of Abstract Wikipedia
 (aka Wikilambda)<mailto:abstract-wikipedia@lists.wikimedia.org>
 Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics
 Generation

 On 03 August 2020 at 16:50 Adam Sobieski <adamsobieski(a)hotmail.com> wrote:

 By utilizing <math>LaTeX</math> elements in an XML-based intermediate
 output format, one could simply copy that mathematical content to the
 resultant output wikitext [3]. Wikitext utilizes this same convention for
 mathematical expressions [3].

 Whether or not to include mathematics in Abstract Wikipedia is an
 important decision to make at a future point. Choosing to include
 mathematics would entail discussions about representing mathematical
 knowledge on Wikidata. It would entail discussions about how specific
 senses of certain words have mathematical meaning. It would entail
 discussions about how algorithms should determine when to use mathematical
 and scientific notations and when they should, instead, use paraphrases
 with the semantic content expressed using natural language. These are just
 some of the discussion topics which would arise should we desire to include
 mathematical and scientific notations in Abstract Wikipedia articles.

 I'm disagreeing with much of this.

 On LaTeX: while it is "industry standard", I'd like to draw attention to a
 point made in
 https://en.wikipedia.org/wiki/Help:Displaying_a_formula#Rendering:
 "Latex does not have full support for Unicode characters, and not all
 characters render."

 It goes on to suggest that Vietnamese, for example, would not be well
 catered for, in terms of its diacritics.

 I appreciate that we are only talking currently about scoping, and
 high-level initial planning. But given AW's objectives, this is not a good
 sign, and I don't think we should just assume that LaTeX as an incumbent
 gets waved through. It is pre-Web, and something closer to HTML would be
 preferable, in my view.

 My background is in mathematics, and I began my Wikipedia career writing
 mathematics articles. There are certainly issues, such as prose/notation
 balance. Mathematical language is heavily overloaded, from the
 disambiguation aspect. But I'm not really recognising the landscape of
 issues set out there.

 Charles
