Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)

14 Jul 2020

Quick side question: is there a role for formal ontology (FOL, DL or CL 
type of thing) in computational linguistics?

Mike

On 7/9/2020 8:22 AM, Louis Lecailliez wrote:
...
  Hi Denny,

 yes, the main problem of most of the systems presented in research 
 papers (UNL or not) is that they are locked in the institutions that 
 made them. A lot of UNL webpages went down since last time I checked 
 (recently), and the system was in fact designed in a way it could work 
 over the web while not letting third-parties access code and data. 
 This is of course the exact reverse of the technical and philosophical 
 approach taken here, and very sad as decade of accumulated knowledge 
 is lost; the papers are far from sufficient to re-create even of 
 fraction of the said systems

 There is also, I guess, a lot of interesting work that is not 
 translated in English at all (notably in linguistics), as making an 
 academic career in the national language was an option in a lot of 
 places until very recently.

  So, would you be willing to work on that? 
 Yes, of course, I wouldn't have posted in the mailing list otherwise. 
 I like the dual, concurrent approach of linguistic/theory you are 
 proposing. Note though that I'm not an expert be any mean in natural 
 language generation, it just happens I stumbled upon UNL recently and 
 it has too much in common on the abstract representation/NLG with this 
 project not to mention it. I also had some researchers name in mind as 
 I met some who worked on the referenced works.

 Concerning the paper authorship, I understand your stance, and yes I'm 
 willing to work more and write about previous works with those 
 interested. Just to have an idea, what it is expected timeframe for a 
 revision?

 Lexicographic data in Wikidata totally flew under my radar. This is 
 indeed something that will be needed in the future, and where I can 
 directly contribute too! As mentioned by [1] the license seems to be 
 an issue notably for importing existing resources, is there any “fix” 
 planned for that?

 All in all, I'm very pleased to see lot of aspects are more planned 
 than it I assumed to be from reading the paper alone, and I’m more 
 confident in the success now.

 Best regards,
 Louis Lecailliez

 [1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf

 ------------------------------------------------------------------------
 *De :* Abstract-Wikipedia 
 &lt;abstract-wikipedia-bounces(a)lists.wikimedia.org&gt; de la part de Denny 
 Vrandečić &lt;dvrandecic(a)wikimedia.org&gt;
 *Envoyé :* mercredi 8 juillet 2020 22:37
 *À :* General public mailing list for the discussion of Abstract 
 Wikipedia (aka Wikilambda) &lt;abstract-wikipedia(a)lists.wikimedia.org&gt;
 *Objet :* Re: [Abstract-wikipedia] NLP issues severely overlooked 
 (Amir E. Aharoni)
 Hi Louis, all,

 Louis, thanks for raising that important issue!

 I have been looking in a number of related NLG systems, and one thing 
 I noticed is a pattern of many of these projects being developed very 
 much in isolation from each other, and also often without much concern 
 for ongoing linguistic research. That is what I tried to capture in 
 the research paper by stating that there is no consensus on this, and 
 that it seems too early to commit to a specific solution.

 I had given a quick look to UNL, but the project looked pretty stale 
 to me - I could not see any activity in the last decade. Furthermore, 
 the page didn't provide access to the source code and instead 
 mentioned that part of the technology is under patents, which is quite 
 a red flag for me, and I usually don't look into something like that 
 any further, in order to honestly be able to say that I didn't get any 
 ideas from those patents. If I am mistaken, and there is a freely 
 usable write-up or implementation, I'd be happy to come back and read 
 up more.

 Thank you for the annotated bibliography! That is super useful.

 But I did look into detail into a (small) number of other, similar 
 systems, such as Grammatical Framework or KPML. Tiago mentioned 
 FrameNet, and I learned a lot about that too. To get an overview of 
 the whole field has been a rather frustrating experience, especially 
 since the major textbook in that area - Dale & Reiter - doesn't cover 
 these systems, nor the 2018 update to that book by Gatt & Krahmer, and 
 it seems that research work in that area often omits these practical 
 systems. Accordingly, when I talk with the professors and researchers 
 in this area, also about the proposal here, they are more focussed on 
 specific issues, and don't know that much about the concrete systems 
 (which is understandable - the flow from research to practical systems 
 is a more established flow in many areas). Never mind that when you 
 get to the linguistic side of it, instead of the computer science 
 part, there are even more competing theories, many of which are aimed 
 toward much more encompassing goals and are about covering the whole 
 of language and natural language understanding, which we want to be 
 shying away from.

 The goal of the paper was never meant to be a comprehensive account of 
 the state of the art in natural language generation. That's what Dale 
 & Reiter and Gatt & Krahmer have aimed for, and their works are 
 hundreds of pages. I had the feeling my paper was already too long, 
 and putting in an overview of the state of the art would have made it 
 at least double the length.

 So, given that (and other reasons, as lined out in the paper), it 
 seems that a system which could support any of these approaches seemed 
 a more promising way. So far, for my own prototype, I have been mostly 
 following Grammatical Framework (because it has a very accessible 
 book, the software is free, the community was friendly, etc.), and it 
 worked good enough to leave me convinced that the whole thing is worth 
 trying out. But I don't know whether that's the best approach.

 As mentioned by Chris Cooley, the goal will be to create a new wiki, a 
 library of functions, that can support any of these approaches. My 
 dream would be - and I see that Chris had already suggested that - 
 that experts like you and your colleagues create an overview of the 
 state of the art that will be accessible to the community and that 
 will allow us to make a well-informed decision when the time comes as 
 to which path to explore first. In a parallel track, we will be 
 creating the function wiki, and then, when the time is ripe we can 
 bring these two strands of work together. So, would you be willing to 
 work on that?

 How does this sound for a plan?

 Some further points:

 > This is way easier to implement, test and
deliver than to implement   10 different backends with various progress in
implementation, 
 incompatibilities and runtime characteristics.

 Regarding your point about evaluation environments: I agree, it would 
 be a huge task if the WMF core team were to develop all these 
 different environments. But that's not the plan. The goal is really 
 that *others* will hopefully build these :) All we need to do is to 
 make sure that's possible and encouraged and simple enough. But yeah, 
 not the core team.

 > The paper presents AW as sitting on top on
WL. Both are big   projects. Sitting a big project on top of another one is really
risky, 
 as it means a significant milestone must first be reached in the 
 dependency (here WL), which would likely took some years, before even 
 starting the work on the other project.
 Yes, that's correct. That is exactly the time that allows us do the 
 appropriate state of the art analysis. I hope it won't take us years, 
 but that we will be faster.

 > AW can be realised with current tools and
engineering practices. 
 Only if you commit to a specific implementation, which I am hesitant 
 to do.

 > [English is an obstacle to programming] This
strong affirmation   needs to be sourced.

 https://dl.acm.org/doi/10.1145/3051457.3051464

 > As I spend a significant time (~10 hours)
gathering references and   writing this email (which is 5 pages long in Word), I
would like to be 
 mentioned as co-author in the final paper if any idea or references 
 presented here is used in it.
 Thank you for your detailed comments, which will certainly improve the 
 second version of the paper. I am happy to mention you in the 
 acknowledgments. For co-authorship, I usually go for a more 
 substantial engagement ;) If you're willing to write up the "Previous 
 work" section along the lines you mentioned above (maybe with Tiago? 
 Maybe with others to join?), but for a comprehensive overview of 
 existing systems, then I am open to talk about co-authorship :)

 > For French, the gender of every noun entity
*must* be present ...   For Chinese and Japanese, classifier information must be
present for 
 all noun, in case one must be enumerated.
 That's exactly the goal of the lexicographic project on Wikidata, as 
 was pointed out:
 https://www.wikidata.org/wiki/Lexeme:L12449

 You'll find plenty of Lexemes with their classifiers, forms, etc. The 
 lexicographic project was started with the Abstract Wikipedia in mind, 
 knowing that exactly that will be needed.

 > Yet, the use of any existing formalism is
dismissed, which mean all   the situations I illustrated in this email will need to
be dealt with 
 in an ad hoc fashion.
 No, not at all it doesn't have to be ad-hoc, that's exactly what we 
 can start working on now, long before we get to the point that we need 
 to make that ad-hoc decision. I hope you'll join us to figure out the 
 best way!

 Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your 
 beautiful answers, who raised a number of great replies much better 
 than I ever could. And thanks to Louis for starting this more than 
 interesting thread! Let's continue in this vein!

 Cheers,
 Denny

 On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski &lt;adamsobieski(a)hotmail.com 
 <mailto:adamsobieski@hotmail.com>> wrote:

     Brainstorming: resembling what the document object model (DOM) [1]
     is for XML and attributed trees, perhaps we could create and
     specify an object model for sets of attributed predicate calculus
     expressions.

     With an attributed predicate calculus object model (e.g. “APCOM”)
     for sets of attributed predicate calculus expressions:

     {

     r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4

     r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8

     r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11,
     o7(icl>domain7).@a12).@a13

     }.@a14

     developers could more conveniently utilize sets of attributed
     predicate calculus expressions from JavaScript and Lua.

     Drawing from XML, we can consider that objects, relations,
     attributes could be, instead of plain text strings, uniform
     resource identifiers (URI’s). “r1” could be a URI, “a1” could be a
     URI, “o1” could be a URI, and so forth.

     We can also consider that the attributes in a model could have values:

     {

     r1.[@a1=v1](o1(icl>domain1).[@a2=v2],
     o2(icl>domain2).[@a3=v3]).[@a4=v4]

     r2.[@a5=v5](o3(icl>domain3).[@a6=v6],
     o4(icl>domain4).[@a7=v7]).[@a8=v8]

     r3.[@a9=v9](o5(icl>domain5).[@a10=v10],
     o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]

     }.[@a14=v14]

     We can consider creating a scripting API (e.g. “APCOM”) for a
     semantic model to convenience developers. We can also consider
     adding attribute-value pairs to a semantic model.

     Best regards,

     Adam

     [1] https://en.wikipedia.org/wiki/Document_Object_Model

     *From: *Tiago Timponi Torrent <mailto:tiago.torrent@ufjf.edu.br>
     *Sent: *Sunday, July 5, 2020 9:06 PM
     *To: *General public mailing list for the discussion of Abstract
     Wikipedia (aka Wikilambda)
     <mailto:abstract-wikipedia@lists.wikimedia.org>
     *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked
     (Amir E. Aharoni)

     That’s a good idea, but I think you would need more than that.
     Take FrameNet, for example, but now departing from verbs instead
     of nouns. FrameNet has a very detailed model for dealing with
     verbs, their semantic arguments and the way they surface in
     morphosyntax. Nonetheless, to apply such a model in a text
     comprehension and/or generation task, you need more than that. You
     need to know prototypical fillers for the positions, which, in
     turn, are associated to other frames and, therefore, participate
     in other clusters of the network of frames. Moreover, you’d want
     those prototypical fillers to function as departing points for
     analogical extensions in the model, since not every sentence is a
     prototypical combination of words. In other words, the collection
     of attributes and relations you refer to should be defined in a
     way that they can be analogically extended to other collections
     not originally assigned to the item you’re looking at.

     Cheers

     Tiago

     Em dom, 5 de jul de 2020 às 20:03, Arthur Smith
     &lt;arthurpsmith(a)gmail.com <mailto:arthurpsmith@gmail.com>> escreveu:

         Yes, thank you for the UNL background, that is extremely
         helpful. I've been reading some of the articles Louis provided
         as references, and it seems to me from just this perhaps naive
         point of view, that a lot of the complexity is associated with
         disambiguation of meaning - for nouns I think Wikidata items
         (and their relations to lexeme senses) solve that problem, but
         we are still missing I think a lot of the detail needed to do
         the same with adjectives and verbs (at least). So there is
         definitely some room for finding better ways to model - but
         maybe Wikidata could be expanded to handle the adjective/verb
         cases too. In general the concept of a single meaning
         associated with a Wikidata item as its identifier and a
         collection of attributes and relationships attached to that
         item is a powerful one that could resolve many such issues.

            Arthur

         On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski
         &lt;adamsobieski(a)hotmail.com <mailto:adamsobieski@hotmail.com>>
         wrote:

             Louis,

             Thank you for the information about the Universal
             Networking Language [1] and the World Atlas of Language
             Structures [2].

               Semantic Modeling

             Do you opine that adding attributes to objects, relations
             and expressions enhances expressiveness for various
             features of natural language?

             r.@a1.@a2(o1(icl>domain1).@a3.@a4,
             o2(icl>domain2).@a5.@a6).@a7.@a8

             I wonder whether there exist mappings or workarounds with
             which to obtain such expressiveness for models such as
             Wikidata’s.

               Scripting Environments for Natural Language Generation

             Supposing that Wikilambda could be JavaScript /
             WebAssembly based, and observing that Lua / WebAssembly
             solutions exist, we can note that scripting engines such
             as V8 are easy to use and to add global objects and API
             to. Resembling how Web browsers provide scripting
             environments and API for functions, we can envision
             providing scripting environments and API for natural
             language generation functions.

             I wonder what you might think about scripting environments
             and API for natural language generation scenarios?

             Best regards,

             Adam

             [1]
             https://en.wikipedia.org/wiki/Universal_Networking_Language

             [2] https://wals.info/ <https://wals.info/>

             *From: *Louis Lecailliez <mailto:louis.lecailliez@outlook.fr>
             *Sent: *Saturday, July 4, 2020 2:10 PM
             *To: *abstract-wikipedia(a)lists.wikimedia.org
             <mailto:abstract-wikipedia@lists.wikimedia.org>
             *Subject: *Re: [Abstract-wikipedia] NLP issues severely
             overlooked (Amir E. Aharoni)

             Hi Amir,

             I understand the process is different that usual research.
             In fact I've seen Wikipedia grown from an unknown website
             to the biggest encyclopedia it is now. I use it daily in
             multiple languages and love it. I know what crowd sourcing
             could achieve.

  It's also possible that the mere *finding* of
these stumbling              blocks by such a big, diverse, open, and active
community,
             will itself be a contribution to the scientific knowledge
             around this subject.

             I disagree here. It would be contribution to scientic
             knowledge if and only if it wasn't discovered before. My
             email was precisely about that: capitalizing on the
             knowledge that has already been discovered, to avoid
             making the same mistake them again. It would not matter
             for a small project, but this one is really ambitious. We
             are speaking of 40 years of work by a horde of talented
             and very knowledgeable people, so this isn't to be
             dismissed easily.

             This thing is, my previous email was a bit abstract,
             because it were a review of the paper, not of the project
             itself. I should have made more examples to illustrate
             where the problem lies.

             Let's start with a simple example, in English, with
             corresponding Wikidata entities in-between parenthesis.
             I'm also using pseudo-turtle notation with made up
             relationships.

             France (Q142) is a country (Q6256).

             <Q142> <rel_is> <Q6256> .

             Creating the English sentence is straightforward with the
             naive approach presented in the paper.

             What is the French equivalent?

             La France est un pays.

             More information is required in the abstract
             representation: the text generator needs to know about the
             gender of both nouns (France and pays). So we need to
             extend the model as such:

             <Q142> <rel_gender> <Q1775415> .

             <Q6256> <rel_gender> <Q499327> .

             Fine! Now what about Chinese?

             法國是一個國家。

             What we have in the middle of the sentence is a classifier
             (個). The model needs the following update:

             <Q499327> <rel_use_classifier> <Q63153> .

             To handle these 3 languages, the model has already 3
             additional triples just for accounting for linguistic
             facts occuring in these languages. Wikipedia exists in
             more than 300 languages, and the world has about 6000 of
             them, each of them having particularities that must be
             taken into account. Fortunately they recoup themselves
             in-between languages. Nonetheless the World Atlas Language
             Structures (https://wals.info/chapter/s1) count 144
             distinct language features. Some are related to speech,
             but this means there is probably something like a hundred
             features that must be taken into account in the data model
             to produce valid natural language sentence.

             Note that in the Chinese example, there is also a number
             (一, one) showing up. This is a phenomenon that must be
             taken into account; and it's not always appearing when
             using 是(to be). How complex the "lambda" system will
             be just to deal with this issue? Hint: very much. It also
             needs to be compatible with dozen of other phenomena.

             Then each of those features require extensive and complete
             data. For French, the gender of every noun entity *must*
             be present, otherwise there is half a chance of producing
             a wrong sentence each time a noun entity is encountered.
             For Chinese and Japanese, classifier information must be
             present for all noun, in case one must be enumerated.
             Where does the project will get the data from? (we are
             speaking of millions of item, most not referenced in
             existing dictionaries) How will this be encoded? Those are
             real questions that must be answered.

             Suppose now we have done the work for "renderers" in these
             three languages. They both use the more or less similar A
             X B structure where X is a verb meaning "to be".

             What would be the Japanese equivalent?

             The more natural structure would be like:

             フランスは国(だ)。

             What is a play here is a topicalization (Q63105) of
             France, followed by a predicate (it's a country). This is
             very different from the previous structure, which, not
             surprisingly enough, needs it's own representation. To
             make situation more difficult, the previous (A be B)
             structure can also exists in Japanese, but would lead to a
             totally different sentence if used.

             The paper states that Figure 1 and 2 are examples that
             will be more complex in real life. Yet, the use of any
             existing formalism is dismissed, which mean all the
             situations I illustrated in this email will need to be
             dealt with in an ad hoc fashion. Moreover, changing
             formalism (be it ad hoc or not) will require to change
             every piece of code/data using it. This will happen
             everytime a language with unsupported feature(s) is added
             to the project. It's not hard to see how this will waste a
             huge amount of time and goodwill from involved people. The
             very code focussed tone of the paper, the english-centric
             approach used in the examples and the lack of references
             shows that the complexity of the task on the NLP front is
             not sufficiently conceptualized.

             Best Regards,

             Louis Lecailliez

             *De :*Abstract-Wikipedia
             &lt;abstract-wikipedia-bounces(a)lists.wikimedia.org
             <mailto:abstract-wikipedia-bounces@lists.wikimedia.org>>
             de la part de
             abstract-wikipedia-request(a)lists.wikimedia.org

<mailto:abstract-wikipedia-request@lists.wikimedia.org><abstract-wikipedia-request@lists.wikimedia.org
             <mailto:abstract-wikipedia-request@lists.wikimedia.org>>
             *Envoyé :* samedi 4 juillet 2020 15:06
             *À :* abstract-wikipedia(a)lists.wikimedia.org

<mailto:abstract-wikipedia@lists.wikimedia.org><abstract-wikipedia@lists.wikimedia.org
             <mailto:abstract-wikipedia@lists.wikimedia.org>>
             *Objet :* Abstract-Wikipedia Digest, Vol 1, Issue 6

             Send Abstract-Wikipedia mailing list submissions to
             abstract-wikipedia(a)lists.wikimedia.org
             <mailto:abstract-wikipedia@lists.wikimedia.org>

             To subscribe or unsubscribe via the World Wide Web, visit
             https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
             or, via email, send a message with subject or body 'help' to
             abstract-wikipedia-request(a)lists.wikimedia.org
             <mailto:abstract-wikipedia-request@lists.wikimedia.org>

             You can reach the person managing the list at
             abstract-wikipedia-owner(a)lists.wikimedia.org
             <mailto:abstract-wikipedia-owner@lists.wikimedia.org>

             When replying, please edit your Subject line so it is more
             specific
             than "Re: Contents of Abstract-Wikipedia digest..."

             Today's Topics:

                1. Re: NLP issues severely overlooked (Charles Matthews)
                2. Use case: generation of short description (Jakob Voß)
                3. Re: NLP issues severely overlooked (Amir E. Aharoni)

             ----------------------------------------------------------------------

             Message: 1
             Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST)
             From: Charles Matthews &lt;charles.r.matthews(a)ntlworld.com
             <mailto:charles.r.matthews@ntlworld.com>>
             To: "General public mailing list for the discussion of
             Abstract
                     Wikipedia (aka Wikilambda)"
             &lt;abstract-wikipedia(a)lists.wikimedia.org
             <mailto:abstract-wikipedia@lists.wikimedia.org>>
             Subject: Re: [Abstract-wikipedia] NLP issues severely
             overlooked
             Message-ID:
             &lt;2126327926.39940.1593867909152(a)mail2.virginmedia.com
             <mailto:2126327926.39940.1593867909152@mail2.virginmedia.com>>
             Content-Type: text/plain; charset="utf-8"

             It is interesting to be on a list where one can hear about
             software issues, and then computational linguistic
             problems. I'm not an expert in either area.

             I do have 17 years of varied Wikimedia experience (and I
             use my real name there).

  On 04 July 2020 at 12:25 Louis Lecailliez  
           &lt;louis.lecailliez(a)outlook.fr
             <mailto:louis.lecailliez@outlook.fr>> wrote:

             <snip>

       Nothing precise is said about linguistic
resources              in the AW paper except for "These function finally can
             call the lexicographic knowlegde stored in Wikidata.",
             which is not very convincing: first because Wiktionary
             projects themselves severely lacks content and structure
             for those who has some content at all, secondly since
             specialized NLP ressources are missing there too (note:
             I'm not familiar with Wikidata so I could be wrong,
             however I never saw it cited for the kind of NLP resources
             I'm talking about).

             I can comment about this. Besides Wiktionary, there is the
             "lexeme" namespace of Wikidata. It is a relatively new
             part of Wikidata, dealing with verbal forms.

 To finish on a positive note, I would like to
highlight              the points I really like in the paper. First, its
             collaborative and open nature, like all Wikimedia
             projects, gives him a much higher chance of success than
             its predecessors.

             It is worth saying, for context, that there is a certain
             style or philosophy coming from the wiki side: more
             precisely, from the wikis before Wikipedia. There is the
             slogan "what is the simplest thing that would actually
             work?" You might argue, plausibly, that Wikipedia at
             nearly 20 years old, shows that there is a bit more to
             engineering than that.

             On the other hand, looking at Wikidata at seven years old,
             there is some point to the comment. It has a rather simple
             approach to linked structured data, compared to the
             Semantic Web environment. (Really, just write a very large
             piece of JSON and try to cope with it!) But the number of
             binary relations used (8K, if you count the "external
             links" handling) is now quite large, and has grown
             organically. The data modelling is in a sense primitive,
             sometimes non-existent. But the range of content handled
             really is encyclopedic. And in an area like scientific
             bibliography, at a scale of tens of millions of entities,
             the advantages of not much ontological fussiness begin to
             be seen.

             Wikidata started as an index of all Wikipedia articles,
             and is now five times the size needed for that: a very
             enriched "index".

             I suppose the NLP required to code up, for example, 50K
             chemistry articles about molecules, might be a problem
             that could be solved, leaving aside the general problems
             for the moment.

             In any case, there is a certain approach, neither academic
             nor commercial, that comes with Wikimedia and its
             communities, and it will be interesting to see how the
             issues are addressed.

             Charles Matthews (in Cambridge UK)

2024

2023

2022

2021

2020

Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)