Just one huge Thank You for Ordia, Finn Årup Nielsen!

It's really coming along nicely now that we have so many more Lexemes.

You are quite right, of course; we're not quite up to 325,000. I overlooked the possibility of a Lexeme having multiple lemmas. A few have as many as six, it seems! Sorry for that slight overstatement. I hope you didn't think you had lost some.

While I'm apologizing, it seems that I got the link to your ACL Anthology paper wrong when I included it earlier! (It should be "2020.ldl" not "2020.idl", of course.) Sorry for that, too. I assume that
https://www.aclweb.org/anthology/2020.ldl-1.12.pdf [corrected link] is identical to https://people.compute.dtu.dk/faan/ps/Nielsen2020Lexemes.pdf.

Thank you again for your great work. I hope my mistakes did not inconvenience you too much.

Best regards,
Al.

On Tuesday, 4 August 2020, <abstract-wikipedia-request@lists.wikimedia.org> wrote:
Send Abstract-Wikipedia mailing list submissions to
        abstract-wikipedia@lists.wikimedia.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
or, via email, send a message with subject or body 'help' to
        abstract-wikipedia-request@lists.wikimedia.org

You can reach the person managing the list at
        abstract-wikipedia-owner@lists.wikimedia.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Abstract-Wikipedia digest..."


Today's Topics:

   1. Re: Loose notes (Andy)
   2. Re: Loose notes (fn@imm.dtu.dk)


----------------------------------------------------------------------

Message: 1
Date: Tue, 4 Aug 2020 17:49:03 +0200
From: Andy <borucki.andrzej@gmail.com>
To: "General public mailing list for the discussion of Abstract
        Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] Loose notes
Message-ID:
        <CAE2KeALchD9EAY0HPZgmR9y760eVPO=O+mWiEd5+o0Ns==zbYA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Is there any road map on https://meta.wikimedia.org/ with estimated dates
for Abstract Wikipedia?

On Mon, 3 Aug 2020 at 18:43, Grounder UK <grounderuk@gmail.com> wrote:

> Plenty more work to be done!
>

------------------------------

Message: 2
Date: Tue, 4 Aug 2020 18:26:58 +0200
From: <fn@imm.dtu.dk>
To: <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] Loose notes
Message-ID: <be9f66b3-6b5e-f5d5-cf9c-9c6200fe3059@dtu.dk>
Content-Type: text/plain; charset="utf-8"; format=flowed


I will just add a few loose notes:

My/our Ordia website attempts to show you the words and translations.

Zamek is here: https://ordia.toolforge.org/L270298

Link to castle: https://ordia.toolforge.org/Q23413 where there are 5
languages.

lock: https://ordia.toolforge.org/Q228039 (only lock and zamek)

and another sense of lock: "system used to ignite propellant of firearm"
https://ordia.toolforge.org/Q1134386

The number of lexemes that I find is 303,845 (not "over 325,000"):
https://ordia.toolforge.org/statistics/

And yes, there are over 40,000 English lexemes
https://ordia.toolforge.org/language/


I have recently written a bit on the statistics and the linkage between
languages in the "Lexemes in Wikidata: 2020 Status" article from the 7th
Workshop on Linked Data in Linguistics (LDL-2020)
https://people.compute.dtu.dk/faan/ps/Nielsen2020Lexemes.pdf

Slides are available as
https://people.compute.dtu.dk/faan/ps/Nielsen2020Lexemes_slides.pdf

Generally, I would say that the interlinking between languages is still
sparse. The sense-Q-item matrix on page 10 of the slides, which was
made in February, shows that only the English-Hebrew combination had more
than 1,000 sense-combinations. But I think it is slowly growing. And
perhaps Wikilambda/abstract can help spread the word about Wikidata lexemes.


You can get a dump of the lexemes, e.g., here:
https://dumps.wikimedia.org/wikidatawiki/entities/20200725/ The smallest
compressed file is 200 MB.
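
As a minimal sketch of how such a dump could be read, and of why a count of
lemmas comes out higher than a count of lexemes: the Python below assumes the
usual Wikidata JSON dump layout (a JSON array with one entity per line, each
line ending in a comma) and that each lexeme entity has a "type" of "lexeme"
and a "lemmas" object keyed by language code. The function names, the sample
records, and the file path argument are hypothetical and for illustration
only.

```python
import bz2
import json
from typing import Iterable, Tuple


def count_lexemes_and_lemmas(lines: Iterable[str]) -> Tuple[int, int]:
    """Count lexeme entities and their lemmas in dump-style JSON lines."""
    lexemes = 0
    lemmas = 0
    for raw in lines:
        line = raw.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip blank lines and the brackets wrapping the array
        entity = json.loads(line)
        if entity.get("type") != "lexeme":
            continue  # ignore anything that is not a lexeme
        lexemes += 1
        # One lexeme may carry several lemmas (e.g. spelling variants),
        # so the lemma total can exceed the lexeme total.
        lemmas += len(entity.get("lemmas", {}))
    return lexemes, lemmas


def count_in_dump(path: str) -> Tuple[int, int]:
    """Stream a bz2-compressed dump file (path is a hypothetical argument)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_lexemes_and_lemmas(handle)


# Made-up sample records, not real Wikidata entries.
sample = [
    "[",
    '{"type": "lexeme", "id": "L-a", "lemmas": '
    '{"en": {"language": "en", "value": "lock"}}},',
    '{"type": "lexeme", "id": "L-b", "lemmas": '
    '{"en-gb": {"language": "en-gb", "value": "colour"}, '
    '"en-us": {"language": "en-us", "value": "color"}}}',
    "]",
]

print(count_lexemes_and_lemmas(sample))  # (2, 3): two lexemes, three lemmas
```

Reading line by line keeps memory use low, which matters for a file of this
size; loading the whole array with a single json.load call would also work
but needs far more memory.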



best regards
Finn Årup Nielsen
https://people.compute.dtu.dk/faan/


On 04/08/2020 11:33, Grounder UK wrote:
> Andrzej,
> Yes, there are over 325,000 lexemes in Wikidata now, over 40,000 for
> English.
>
> "Abstract" definitions are a little tricky, but it is not Lexemes
> themselves that are defined, it is their Senses, and Senses can be
> linked to Wikidata Items, which connects Lexemes into the abstract graph
> of "knowledge".
>
> Translations are still very incomplete but, as with definitions, it is
> the Sense that should have the translation. The difficulty is that
> translation cannot imply identity, which means that you cannot assume
> that a Sense to Sense translation allows you to acquire translations
> from the Sense you translate into. If you think of each Sense as a set,
> you cannot tell whether the translated Sense is a subset or a superset.
> What we need for that is the concept of the intersection between the two
> sets, which would be part of each Sense but not necessarily the whole of
> either Sense.
>
> So, broadly, your example of "zamek" is not a problem; you can connect
> the "lock" Sense to the Sense of the English word "lock" (L1132-S1) as
> well as to the identifier for the encyclopedic concept Q228039 and/or
> Q24644118 (claimed to be a subclass of Q228039). But you should not
> connect it to L1132-S2 (which connects to Q105731 pl:"Śluza wodna") or
> to L1132-S3 (Q1134386 pl:"Zamek (broń)", assuming that's a different
> Sense of "zamek" too). (I say this without knowing enough Polish to know
> if it makes sense; I'm living in Searle's Chiński pokój!)[1]
>
> I don't know whether the lexical data is in the dumps now, but it will
> be pretty huge just by itself. It is also quite dependent on the main
> Wikidata pages. For our natural-language generation, that's a great
> strength, because we can move naturally from the concept to the word and
> related vocabulary in any language without doing any translation. The
> extra context we need to be able to choose the right Form of the Lexeme
> for the Sense... that will need more work on the data, as will
> characterising thesaurus relations (hypernymy, synonymy, hyponymy,
> antonymy etc) so that good alternative Lexemes can be found. In an
> "abstract" context, these can be thought of as "translations" into
> overlapping Senses, but the extent to which we represent and consult (or
> navigate within) the broader compound Sense domain (the set union of the
> Senses) is... an interesting challenge.
>
> As for a fully "abstract" dictionary that can be read in any language...
> We'll be better able to think about that once we have built a few
> renderers for our "abstract" encyclopedic content, in my view. Machine
> translation and natural-language understanding are not our primary goal.
> I think we will make progress on both, if we remember to pay attention
> to inverse functions as we evolve our NLG renderers, but we have a very
> long way to go in all directions (and all languages).
>
> Best regards,
> Al.
>
> [1] https://pl.wikipedia.org/wiki/Chi%C5%84ski_pok%C3%B3j
> On Monday, 3 August 2020,
> <abstract-wikipedia-request@lists.wikimedia.org> wrote:
>
>
>     Today's Topics:
>
>         1. Re: Natural Language and Mathematics Generation (Adam Sobieski)
>         2. Re: Loose notes (Andy)
>         3. Re: Loose notes (Arthur Smith)
>
>
>     ----------------------------------------------------------------------
>
>     Message: 1
>     Date: Mon, 3 Aug 2020 18:23:03 +0000
>     From: Adam Sobieski <adamsobieski@hotmail.com>
>     To: Charles Matthews <charles.r.matthews@ntlworld.com>, "General
>              public mailing list for the discussion of Abstract
>              Wikipedia (aka Wikilambda)"
>              <abstract-wikipedia@lists.wikimedia.org>
>     Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics
>              Generation
>     Message-ID:
>             <CH2PR12MB4184F2C81E4CD533ACFE9547C54D0@CH2PR12MB4184.namprd12.prod.outlook.com>
>
>     Content-Type: text/plain; charset="utf-8"
>
>     Charles,
>
>     There is also MathML to consider. Work is underway at the W3C with
>     respect to a new version of MathML, MathML4 [1][2]. Work is underway
>     with respect to adding MathML support to Chromium [3][4].
>
>     Instead of LaTeX, MathML could be the way to go.
>
>
>     Best regards,
>     Adam
>
>     [1] https://www.w3.org/community/mathml4/
>     [2] https://mathml-refresh.github.io/mathml/
>     [3] https://www.chromestatus.com/feature/5240822173794304
>     [4] https://mathml.igalia.com/
>
>     From: Charles Matthews via Abstract-Wikipedia
>     <abstract-wikipedia@lists.wikimedia.org>
>     Sent: Monday, August 3, 2020 1:53 PM
>     To: General public mailing list for the discussion of Abstract
>     Wikipedia (aka Wikilambda) <abstract-wikipedia@lists.wikimedia.org>
>     Subject: Re: [Abstract-wikipedia] Natural Language and Mathematics
>     Generation
>
>
>
>     On 03 August 2020 at 16:50 Adam Sobieski <adamsobieski@hotmail.com> wrote:
>
>
>
>     By utilizing <math>LaTeX</math> elements in an XML-based
>     intermediate output format, one could simply copy that mathematical
>     content to the resultant output wikitext [3]. Wikitext utilizes this
>     same convention for mathematical expressions [3].
>
>
>
>     Whether or not to include mathematics in Abstract Wikipedia is an
>     important decision to make at a future point. Choosing to include
>     mathematics would entail discussions about representing mathematical
>     knowledge on Wikidata. It would entail discussions about how
>     specific senses of certain words have mathematical meaning. It would
>     entail discussions about how algorithms should determine when to use
>     mathematical and scientific notations and when they should, instead,
>     use paraphrases with the semantic content expressed using natural
>     language. These are just some of the discussion topics which would
>     arise should we desire to include mathematical and scientific
>     notations in Abstract Wikipedia articles.
>
>
>
>
>
>     I'm disagreeing with much of this.
>
>     On LaTeX: while it is "industry standard", I'd like to draw
>     attention to a point made in
>     <https://en.wikipedia.org/wiki/Help:Displaying_a_formula#Rendering>:
>     "Latex does not have full support for Unicode characters, and not
>     all characters render."
>
>     It goes on to suggest that Vietnamese, for example, would not be
>     well catered for, in terms of its diacritics.
>
>     I appreciate that we are only talking currently about scoping, and
>     high-level initial planning. But given AW's objectives, this is not
>     a good sign, and I don't think we should just assume that LaTeX as
>     an incumbent gets waved through. It is pre-Web, and something closer
>     to HTML would be preferable, in my view.
>
>     My background is in mathematics, and I began my Wikipedia career
>     writing mathematics articles. There are certainly issues, such as
>     prose/notation balance. Mathematical language is heavily overloaded,
>     from the disambiguation aspect. But I'm not really recognising the
>     landscape of issues set out there.
>
>     Charles
>
>
>     ------------------------------
>
>     Message: 2
>     Date: Mon, 3 Aug 2020 20:50:46 +0200
>     From: Andy <borucki.andrzej@gmail.com>
>     To: "General public mailing list for the discussion of Abstract
>              Wikipedia (aka Wikilambda)"
>              <abstract-wikipedia@lists.wikimedia.org>
>     Subject: Re: [Abstract-wikipedia] Loose notes
>     Message-ID:
>             <CAE2KeAJ=yrTE8b_4OcfdL9FODtj=G-cUe8Y39tD_b-s5W+OpKA@mail.gmail.com>
>     Content-Type: text/plain; charset="utf-8"
>
>     I see that Wikidata also has lexicographical data.
>     I think Wikidata lexemes are more machine-readable than Wiktionary
>     lexemes. But should the definitions of lexemes also be abstract graphs?
>     At the moment there are only about 10 thousand lexemes. I don't see
>     translations of lexemes into other languages. A sense can be translated
>     to a lexeme in another language, or to a sense of a lexeme in another
>     language. For example, I want to add Polish "zamek" and give a
>     translation link from "lock" to "zamek", but not to "zamek" as
>     "castle" or "zip". (Polish "zamek" = English: castle, lock, zip)
>     Are the lexicographical data also in the Wikidata dump? (It would be
>     good to be able to download a dump of only the lexicographical data
>     plus properties, because the dump of all of Wikidata is huge.)
>     Because the number of Wikidata lexemes is relatively small, a new set
>     of lexemes might be better: all definitions would be graph-structured
>     like other articles in Abstract Wikipedia, and definitions could even
>     carry additional information, such as rules for automatically
>     recognizing the sense from the context of unstructured text in many
>     languages (but such rules are a difficult problem). If we define the
>     noun lexeme "band", it can be a music group or a material belt.
>     Word-sense disambiguation (WSD) needs special rules for analysing
>     context, because the Lesk algorithm and its modifications practically
>     do not work.
>     For example, consider the sentence: "Each band member wore a band."
>     We must know that:
>     1. a group of people has members
>     2. a material belt can be worn, but a music group cannot
>     and/or
>     1. a group of persons is active
>     2. a material belt is passive
>     This is obvious to humans, but it is not at all clear from the
>     definitions.
>     It is a difficult problem because, even if we write rules like those
>     above, a computer cannot apply them to the sentence.
>     I don't know whether such rules are possible. In any case, it would be
>     good if definitions were also in a structured graph form, which could
>     be automatically translated into other languages.
>
>     Best regards,
>     Andrzej
>
>     On Mon, 3 Aug 2020 at 18:43, Grounder UK <grounderuk@gmail.com> wrote:
>
>      > [1]
>      > https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation
>      > [2] https://www.aclweb.org/anthology/2020.idl-1.12.pdf
>      >
>      >
>
>     ------------------------------
>
>     Message: 3
>     Date: Mon, 3 Aug 2020 18:45:55 -0400
>     From: Arthur Smith <arthurpsmith@gmail.com>
>     To: "General public mailing list for the discussion of Abstract
>              Wikipedia (aka Wikilambda)"
>              <abstract-wikipedia@lists.wikimedia.org>
>     Subject: Re: [Abstract-wikipedia] Loose notes
>     Message-ID:
>             <CAEJFamx+BOfDhBnGGcBV_Bg_KPoV3Pbord9K4tyZUhZQ8RbieQ@mail.gmail.com>
>     Content-Type: text/plain; charset="utf-8"
>
>     I'm not sure where you're getting the numbers from; there are over
>     200,000 lexemes in Wikidata, with roughly a dozen languages having at
>     least thousands of entries. Obviously it's incomplete, but quite a lot
>     of effort has gone into it already. For most nouns, a sense can be
>     linked to a regular Wikidata item that is about a particular concept
>     (and this has been done in at least several languages for tens of
>     thousands of cases now, but again much more work is needed). One helper
>     tool available to link lexeme senses and regular conceptual
>     (language-independent) items is MachtSinn:
>     https://machtsinn.toolforge.org/
>     - pick a language you know and help out!
>
>         Arthur
>
>
>     ------------------------------
>
>     Subject: Digest Footer
>
>     _______________________________________________
>     Abstract-Wikipedia mailing list
>     Abstract-Wikipedia@lists.wikimedia.org
>     https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
>
>
>     ------------------------------
>
>     End of Abstract-Wikipedia Digest, Vol 2, Issue 8
>     ************************************************
>
>



------------------------------

End of Abstract-Wikipedia Digest, Vol 2, Issue 11
*************************************************