Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)

15 Jul 2020

Yep. Some applications use them. Back in the early 2000s, there was a big
trend in investigating the interface between ontologies and the lexicon
(ontolex). I’d say that most recent NLG systems focus on common-sense
knowledge (KGs and the like); nonetheless, the key issue of the
ontolex problem still remains: language is not only about expressing facts,
it’s about how you construe them.

Cheers

Tiago

On Tue, 14 Jul 2020 at 16:36, Mike Bennett <mbennett(a)hypercube.co.uk>
wrote:

...
 Quick side question: is there a role for formal ontology (FOL, DL or CL
 type of thing) in computational linguistics?

 Mike

 On 7/9/2020 8:22 AM, Louis Lecailliez wrote:

 Hi Denny,

 yes, the main problem of most of the systems presented in research papers
 (UNL or not) is that they are locked in the institutions that made them. A
 lot of UNL webpages went down since last time I checked (recently), and the
 system was in fact designed in a way it could work over the web while not
 letting third-parties access code and data. This is of course the exact
 reverse of the technical and philosophical approach taken here, and very
 sad, as decades of accumulated knowledge are lost; the papers are far from
 sufficient to re-create even a fraction of said systems.

 There is also, I guess, a lot of interesting work that has not been translated
 into English at all (notably in linguistics), as making an academic career in
 the national language was an option in a lot of places until very recently.

 > So, would you be willing to work on that?
 Yes, of course, I wouldn't have posted in the mailing list otherwise. I
 like the dual, concurrent approach of linguistic/theory you are proposing.
 Note though that I'm not an expert by any means in natural language
 generation; it just happens that I stumbled upon UNL recently, and it has too
 much in common with this project on the abstract representation/NLG side not
 to mention it. I also had some researchers' names in mind, as I met some who
 worked on the referenced works.

 Concerning the paper authorship, I understand your stance, and yes I'm
 willing to work more and write about previous works with those interested.
 Just to have an idea, what is the expected timeframe for a revision?

 Lexicographic data in Wikidata totally flew under my radar. This is indeed
 something that will be needed in the future, and where I can directly
 contribute too! As mentioned by [1] the license seems to be an issue
 notably for importing existing resources, is there any “fix” planned for
 that?

 All in all, I'm very pleased to see that a lot of aspects are more planned
 out than I had assumed from reading the paper alone, and I’m more confident in
 the success now.

 Best regards,
 Louis Lecailliez

 [1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf

 ------------------------------
 *From:* Abstract-Wikipedia <abstract-wikipedia-bounces(a)lists.wikimedia.org>
 on behalf of Denny Vrandečić <dvrandecic(a)wikimedia.org>
 *Sent:* Wednesday, 8 July 2020 22:37
 *To:* General public mailing list for the discussion of Abstract
 Wikipedia (aka Wikilambda) <abstract-wikipedia(a)lists.wikimedia.org>
 *Subject:* Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir
 E. Aharoni)

 Hi Louis, all,

 Louis, thanks for raising that important issue!

 I have been looking into a number of related NLG systems, and one thing I
 noticed is a pattern of many of these projects being developed very much in
 isolation from each other, and also often without much concern for ongoing
 linguistic research. That is what I tried to capture in the research paper
 by stating that there is no consensus on this, and that it seems too early
 to commit to a specific solution.

 I had taken a quick look at UNL, but the project looked pretty stale to me
 - I could not see any activity in the last decade. Furthermore, the page
 didn't provide access to the source code and instead mentioned that part of
 the technology is under patents, which is quite a red flag for me, and I
 usually don't look into something like that any further, in order to
 honestly be able to say that I didn't get any ideas from those patents. If
 I am mistaken, and there is a freely usable write-up or implementation, I'd
 be happy to come back and read up more.

 Thank you for the annotated bibliography! That is super useful.

 But I did look in detail into a (small) number of other, similar
 systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet,
 and I learned a lot about that too. Getting an overview of the whole field
 has been a rather frustrating experience, especially since the major
 textbook in that area - Dale & Reiter - doesn't cover these systems, nor
 does the 2018 update to that book by Gatt & Krahmer, and it seems that
 research work in that area often omits these practical systems. Accordingly,
 when I talk with the professors and researchers in this area, also about the
 proposal here, they are more focussed on specific issues, and don't know
 that much about the concrete systems (which is understandable - the flow
 from research to practical systems is a more established flow in many
 areas). Never mind that when you get to the linguistic side of it, instead
 of the computer science part, there are even more competing theories, many
 of which are aimed toward much more encompassing goals and are about
 covering the whole of language and natural language understanding, which we
 want to shy away from.

 The paper was never meant to be a comprehensive account of the
 state of the art in natural language generation. That's what Dale & Reiter
 and Gatt & Krahmer have aimed for, and their works are hundreds of pages. I
 had the feeling my paper was already too long, and putting in an overview
 of the state of the art would have made it at least double the length.

 So, given that (and other reasons, as outlined in the paper), it seemed
 that a system which could support any of these approaches was a more
 promising way. So far, for my own prototype, I have been mostly following
 Grammatical Framework (because it has a very accessible book, the software
 is free, the community was friendly, etc.), and it worked well enough to
 leave me convinced that the whole thing is worth trying out. But I don't
 know whether that's the best approach.

 As mentioned by Chris Cooley, the goal will be to create a new wiki, a
 library of functions, that can support any of these approaches. My dream
 would be - and I see that Chris had already suggested that - that experts
 like you and your colleagues create an overview of the state of the art
 that will be accessible to the community and that will allow us to make a
 well-informed decision when the time comes as to which path to explore
 first. In a parallel track, we will be creating the function wiki, and
 then, when the time is ripe we can bring these two strands of work
 together. So, would you be willing to work on that?

 How does this sound for a plan?

 Some further points:

 > This is way easier to implement, test and deliver than to implement 10
 > different backends with various progress in implementation,
 > incompatibilities and runtime characteristics.

 Regarding your point about evaluation environments: I agree, it would be a
 huge task if the WMF core team were to develop all these different
 environments. But that's not the plan. The goal is really that *others*
 will hopefully build these :) All we need to do is to make sure that's
 possible and encouraged and simple enough. But yeah, not the core team.

 > The paper presents AW as sitting on top of WL. Both are big projects.
 > Sitting a big project on top of another one is really risky, as it means a
 > significant milestone must first be reached in the dependency (here WL),
 > which would likely take some years, before even starting the work on the
 > other project.
 Yes, that's correct. That is exactly the time that allows us to do the
 appropriate state of the art analysis. I hope it won't take us years, but
 that we will be faster.

 > AW can be realised with current tools and engineering practices.
 Only if you commit to a specific implementation, which I am hesitant to do.

 > [English is an obstacle to programming] This strong affirmation needs
 > to be sourced.

 https://dl.acm.org/doi/10.1145/3051457.3051464

 > As I spent a significant time (~10 hours) gathering references and
 > writing this email (which is 5 pages long in Word), I would like to be
 > mentioned as co-author in the final paper if any idea or reference
 > presented here is used in it.
 Thank you for your detailed comments, which will certainly improve the
 second version of the paper. I am happy to mention you in the
 acknowledgments. For co-authorship, I usually go for a more substantial
 engagement ;) If you're willing to write up the "Previous work" section
 along the lines you mentioned above (maybe with Tiago? Maybe with others to
 join?), but for a comprehensive overview of existing systems, then I am
 open to talk about co-authorship :)

 > For French, the gender of every noun entity *must* be present ... For
 > Chinese and Japanese, classifier information must be present for all nouns,
 > in case one must be enumerated.
 That's exactly the goal of the lexicographic project on Wikidata, as was
 pointed out:
 https://www.wikidata.org/wiki/Lexeme:L12449

 You'll find plenty of Lexemes with their classifiers, forms, etc. The
 lexicographic project was started with the Abstract Wikipedia in mind,
 knowing that exactly that will be needed.

 > Yet, the use of any existing formalism is dismissed, which means all
 > the situations I illustrated in this email will need to be dealt with in an
 > ad hoc fashion.
 No, not at all. It doesn't have to be ad hoc; that's exactly what we can
 start working on now, long before we get to the point that we need to make
 that ad-hoc decision. I hope you'll join us to figure out the best way!

 Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your
 beautiful answers; you raised a number of great points, much better than I
 ever could. And thanks to Louis for starting this more than interesting
 thread! Let's continue in this vein!

 Cheers,
 Denny

 On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski <adamsobieski(a)hotmail.com>
 wrote:

 Brainstorming: resembling what the document object model (DOM) [1] is for
 XML and attributed trees, perhaps we could create and specify an object
 model for sets of attributed predicate calculus expressions.

 With an attributed predicate calculus object model (e.g. “APCOM”) for sets
 of attributed predicate calculus expressions:

 {

   r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4

   r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8

   r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13

 }.@a14

 developers could more conveniently utilize sets of attributed predicate
 calculus expressions from JavaScript and Lua.

 Drawing from XML, we can consider that objects, relations and attributes
 could be, instead of plain text strings, uniform resource identifiers
 (URIs). “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and
 so forth.

 We can also consider that the attributes in a model could have values:

 {

   r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]

   r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]

   r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]

 }.[@a14=v14]

 We can consider creating a scripting API (e.g. “APCOM”) for a semantic
 model for the convenience of developers. We can also consider adding
 attribute-value pairs to a semantic model.
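
 To make the brainstorm above a little more concrete, here is a minimal
 Python sketch of what such an object model might look like. Everything in
 it is hypothetical: the class names (Obj, Relation, ExpressionSet), the
 find() helper and the serialisation rules only mirror the notation used in
 this email and are not an existing API.

```python
from dataclasses import dataclass, field


def _attrs(attrs: dict) -> str:
    """Serialise attributes as .[@name=value], or .@name when value is None."""
    parts = []
    for name, value in attrs.items():
        parts.append(f".@{name}" if value is None else f".[@{name}={value}]")
    return "".join(parts)


@dataclass
class Obj:
    name: str                                   # could be a URI instead of a plain string
    domain: str                                 # the icl> restriction
    attrs: dict = field(default_factory=dict)

    def __str__(self) -> str:
        return f"{self.name}(icl>{self.domain}){_attrs(self.attrs)}"


@dataclass
class Relation:
    name: str
    args: list
    rel_attrs: dict = field(default_factory=dict)   # attributes on the relation itself
    expr_attrs: dict = field(default_factory=dict)  # attributes on the whole expression

    def __str__(self) -> str:
        args = ", ".join(str(a) for a in self.args)
        return f"{self.name}{_attrs(self.rel_attrs)}({args}){_attrs(self.expr_attrs)}"


@dataclass
class ExpressionSet:
    expressions: list
    attrs: dict = field(default_factory=dict)

    def find(self, relation_name: str) -> list:
        """Return every expression whose relation has the given name."""
        return [e for e in self.expressions if e.name == relation_name]

    def __str__(self) -> str:
        body = "\n".join(f"  {e}" for e in self.expressions)
        return "{\n" + body + "\n}" + _attrs(self.attrs)


# Rebuild the first expression of the attribute-value example above.
r1 = Relation(
    "r1",
    [Obj("o1", "domain1", {"a2": "v2"}), Obj("o2", "domain2", {"a3": "v3"})],
    rel_attrs={"a1": "v1"},
    expr_attrs={"a4": "v4"},
)
model = ExpressionSet([r1], {"a14": "v14"})
print(model)
```

 Passing None as an attribute value yields the value-less form (.@a1), so
 the same classes cover both notations shown above.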

 Best regards,

 Adam

 [1] https://en.wikipedia.org/wiki/Document_Object_Model

 *From: *Tiago Timponi Torrent <tiago.torrent(a)ufjf.edu.br>
 *Sent: *Sunday, July 5, 2020 9:06 PM
 *To: *General public mailing list for the discussion of Abstract
 Wikipedia (aka Wikilambda) <abstract-wikipedia(a)lists.wikimedia.org>
 *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir
 E. Aharoni)

 That’s a good idea, but I think you would need more than that. Take
 FrameNet, for example, but now starting from verbs instead of nouns.
 FrameNet has a very detailed model for dealing with verbs, their semantic
 arguments and the way they surface in morphosyntax. Nonetheless, to apply
 such a model in a text comprehension and/or generation task, you need more
 than that. You need to know prototypical fillers for the positions, which,
 in turn, are associated with other frames and, therefore, participate in
 other clusters of the network of frames. Moreover, you’d want those
 prototypical fillers to function as starting points for analogical
 extensions in the model, since not every sentence is a prototypical
 combination of words. In other words, the collection of attributes and
 relations you refer to should be defined in a way that they can be
 analogically extended to other collections not originally assigned to the
 item you’re looking at.

 Cheers

 Tiago

 On Sun, 5 Jul 2020 at 20:03, Arthur Smith <arthurpsmith(a)gmail.com>
 wrote:

 Yes, thank you for the UNL background, that is extremely helpful. I've
 been reading some of the articles Louis provided as references, and it
 seems to me from just this perhaps naive point of view, that a lot of the
 complexity is associated with disambiguation of meaning - for nouns I think
 Wikidata items (and their relations to lexeme senses) solve that problem,
 but we are still missing I think a lot of the detail needed to do the same
 with adjectives and verbs (at least). So there is definitely some room for
 finding better ways to model - but maybe Wikidata could be expanded to
 handle the adjective/verb cases too. In general the concept of a single
 meaning associated with a Wikidata item as its identifier and a collection
 of attributes and relationships attached to that item is a powerful one
 that could resolve many such issues.

    Arthur

 On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski <adamsobieski(a)hotmail.com>
 wrote:

 Louis,

 Thank you for the information about the Universal Networking Language [1]
 and the World Atlas of Language Structures [2].

 Semantic Modeling

 Do you opine that adding attributes to objects, relations and expressions
 enhances expressiveness for various features of natural language?

 r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8

 I wonder whether there exist mappings or workarounds with which to obtain
 such expressiveness for models such as Wikidata’s.

 Scripting Environments for Natural Language Generation

 Supposing that Wikilambda could be JavaScript / WebAssembly based, and
 observing that Lua / WebAssembly solutions exist, we can note that
 scripting engines such as V8 are easy to use and to add global objects and
 API to. Resembling how Web browsers provide scripting environments and API
 for functions, we can envision providing scripting environments and API for
 natural language generation functions.

 I wonder what you might think about scripting environments and API for
 natural language generation scenarios?

 Best regards,

 Adam

 [1] https://en.wikipedia.org/wiki/Universal_Networking_Language

 [2] https://wals.info/

 *From: *Louis Lecailliez <louis.lecailliez(a)outlook.fr>
 *Sent: *Saturday, July 4, 2020 2:10 PM
 *To: *abstract-wikipedia(a)lists.wikimedia.org
 *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir
 E. Aharoni)

 Hi Amir,

 I understand the process is different from usual research. In fact I've
 seen Wikipedia grow from an unknown website to the biggest encyclopedia it
 is now. I use it daily in multiple languages and love it. I know what
 crowdsourcing can achieve.

 > It's also possible that the mere *finding* of these stumbling blocks by
 > such a big, diverse, open, and active community, will itself be a
 > contribution to the scientific knowledge around this subject.

 I disagree here. It would be a contribution to scientific knowledge if and
 only if it wasn't discovered before. My email was precisely about that:
 capitalizing on the knowledge that has already been discovered, to avoid
 making the same mistakes again. It would not matter for a small
 project, but this one is really ambitious. We are speaking of 40 years of
 work by a horde of talented and very knowledgeable people, so this isn't to
 be dismissed easily.

 The thing is, my previous email was a bit abstract, because it was a
 review of the paper, not of the project itself. I should have given more
 examples to illustrate where the problem lies.

 Let's start with a simple example, in English, with the corresponding
 Wikidata entities in parentheses. I'm also using pseudo-Turtle notation with
 made-up relationships.

 France (Q142) is a country (Q6256).

 <Q142> <rel_is> <Q6256> .

 Creating the English sentence is straightforward with the naive approach
 presented in the paper.

 What is the French equivalent?

 La France est un pays.

 More information is required in the abstract representation: the text
 generator needs to know about the gender of both nouns (France and pays).
 So we need to extend the model as such:

 <Q142> <rel_gender> <Q1775415> .

 <Q6256> <rel_gender> <Q499327> .

 Fine! Now what about Chinese?

 法國是一個國家。

 What we have in the middle of the sentence is a classifier (個). The model
 needs the following update:

 <Q6256> <rel_use_classifier> <Q63153> .

 To handle these 3 languages, the model already has 3 additional triples
 just to account for linguistic facts occurring in these languages.
 Wikipedia exists in more than 300 languages, and the world has about 6000
 of them, each of them having particularities that must be taken into
 account. Fortunately, these particularities overlap across languages.
 Nonetheless the World Atlas of Language Structures (
 https://wals.info/chapter/s1) counts 144 distinct language features. Some
 are related to speech, but this means there is probably something like a
 hundred features that must be taken into account in the data model to
 produce valid natural language sentences.
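
 As a rough illustration of the point above, here is a minimal Python
 sketch of how the extra triples feed three per-language renderers. The
 relation names are the made-up ones from this email; the label tables, the
 function names and the choice to attach the classifier to the noun item
 (Q6256) are assumptions made for this example only, not part of any real
 Wikidata or Wikilambda API.

```python
# Made-up triples in the pseudo-Turtle spirit of this email.
TRIPLES = [
    ("Q142", "rel_is", "Q6256"),
    ("Q142", "rel_gender", "Q1775415"),         # France: feminine
    ("Q6256", "rel_gender", "Q499327"),         # country / pays: masculine
    ("Q6256", "rel_use_classifier", "Q63153"),  # assumption: attached to the noun
]

# Hypothetical label tables; a real system would look these up in lexemes.
LABELS = {
    "en": {"Q142": "France", "Q6256": "country"},
    "fr": {"Q142": "France", "Q6256": "pays"},
    "zh": {"Q142": "法國", "Q6256": "國家", "Q63153": "個"},
}


def lookup(subject: str, relation: str) -> str:
    """Return the object of the first matching triple, or fail loudly."""
    for s, r, o in TRIPLES:
        if s == subject and r == relation:
            return o
    raise KeyError(f"missing data: {subject} {relation}")


def render_en(subject: str) -> str:
    # Naive on purpose: always uses "a", which already breaks for "an island".
    obj = lookup(subject, "rel_is")
    return f"{LABELS['en'][subject]} is a {LABELS['en'][obj]}."


def render_fr(subject: str) -> str:
    # Both articles depend on gender data that English never needed.
    definite = {"Q1775415": "La", "Q499327": "Le"}
    indefinite = {"Q1775415": "une", "Q499327": "un"}
    obj = lookup(subject, "rel_is")
    subject_article = definite[lookup(subject, "rel_gender")]
    object_article = indefinite[lookup(obj, "rel_gender")]
    return f"{subject_article} {LABELS['fr'][subject]} est {object_article} {LABELS['fr'][obj]}."


def render_zh(subject: str) -> str:
    # The numeral 一 and the classifier are inserted by the renderer.
    obj = lookup(subject, "rel_is")
    classifier = LABELS["zh"][lookup(obj, "rel_use_classifier")]
    return f"{LABELS['zh'][subject]}是一{classifier}{LABELS['zh'][obj]}。"


print(render_en("Q142"))  # France is a country.
print(render_fr("Q142"))  # La France est un pays.
print(render_zh("Q142"))  # 法國是一個國家。
```

 Removing any one of the linguistic triples makes the corresponding
 renderer raise a KeyError, which is the data-completeness problem in
 miniature.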

 Note that in the Chinese example, there is also a number (一, one) showing
 up. This is a phenomenon that must be taken into account, and it does not
 always appear when using 是 (to be). How complex will the "lambda" system
 be just to deal with this issue? Hint: very much. It also needs to be
 compatible with dozens of other phenomena.

 Then each of those features requires extensive and complete data. For
 French, the gender of every noun entity *must* be present, otherwise there
 is half a chance of producing a wrong sentence each time a noun entity is
 encountered. For Chinese and Japanese, classifier information must be
 present for all nouns, in case one must be enumerated. Where will the
 project get the data from? (We are speaking of millions of items, most
 not referenced in existing dictionaries.) How will this be encoded? Those
 are real questions that must be answered.

 Suppose now we have done the work for "renderers" in these three
 languages. They all use a more or less similar A X B structure where X
 is a verb meaning "to be".

 What would be the Japanese equivalent?

 The more natural structure would be like:

 フランスは国(だ)。

 What is at play here is a topicalization (Q63105) of France, followed by a
 predicate (it's a country). This is very different from the previous
 structure and, not surprisingly, needs its own representation.
 To make the situation more difficult, the previous (A be B) structure can
 also exist in Japanese, but would lead to a totally different sentence if used.
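
 A minimal Python sketch of why this needs its own representation, assuming
 two hypothetical node types (Copular and TopicComment) and a hypothetical
 Japanese label table; none of these names come from the paper.

```python
from dataclasses import dataclass

# Hypothetical Japanese labels for the items used in this email.
JA = {"Q142": "フランス", "Q6256": "国"}


@dataclass
class Copular:
    """The A X B structure used for English, French and Chinese."""
    subject: str
    complement: str


@dataclass
class TopicComment:
    """A topic marked with は, followed by a nominal predicate."""
    topic: str
    comment: str
    copula: bool = True   # the final だ is optional


def render_ja(node) -> str:
    if isinstance(node, TopicComment):
        ending = "だ" if node.copula else ""
        return f"{JA[node.topic]}は{JA[node.comment]}{ending}。"
    # The A-be-B structure exists in Japanese too, but yields a different
    # sentence; this sketch deliberately does not guess at it.
    raise NotImplementedError(f"no Japanese rendering for {type(node).__name__}")


print(render_ja(TopicComment("Q142", "Q6256")))                # フランスは国だ。
print(render_ja(TopicComment("Q142", "Q6256", copula=False)))  # フランスは国。
```

 The abstract content therefore has to say which structure is meant; the
 Japanese renderer cannot simply reuse the A X B node of the other three
 languages.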

 The paper states that Figures 1 and 2 are examples that will be more
 complex in real life. Yet, the use of any existing formalism is dismissed,
 which means all the situations I illustrated in this email will need to be
 dealt with in an ad hoc fashion. Moreover, changing the formalism (be it ad
 hoc or not) will require changing every piece of code/data using it. This will
 happen every time a language with unsupported feature(s) is added to the
 project. It's not hard to see how this will waste a huge amount of time and
 goodwill from the people involved. The very code-focussed tone of the paper,
 the English-centric approach used in the examples and the lack of
 references show that the complexity of the task on the NLP front is not
 sufficiently conceptualized.

 Best Regards,

 Louis Lecailliez

 *From:* Abstract-Wikipedia <abstract-wikipedia-bounces(a)lists.wikimedia.org>
 on behalf of abstract-wikipedia-request(a)lists.wikimedia.org
 <abstract-wikipedia-request(a)lists.wikimedia.org>
 *Sent:* Saturday, 4 July 2020 15:06
 *To:* abstract-wikipedia(a)lists.wikimedia.org
 <abstract-wikipedia(a)lists.wikimedia.org>
 *Subject:* Abstract-Wikipedia Digest, Vol 1, Issue 6

 Today's Topics:

    1. Re: NLP issues severely overlooked (Charles Matthews)
    2. Use case: generation of short description (Jakob Voß)
    3. Re: NLP issues severely overlooked (Amir E. Aharoni)

 ----------------------------------------------------------------------

 Message: 1
 Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST)
 From: Charles Matthews <charles.r.matthews(a)ntlworld.com>
 To: "General public mailing list for the discussion of Abstract
         Wikipedia (aka Wikilambda)" <abstract-wikipedia(a)lists.wikimedia.org>
 Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
 Message-ID: <2126327926.39940.1593867909152(a)mail2.virginmedia.com>
 Content-Type: text/plain; charset="utf-8"

 It is interesting to be on a list where one can hear about software
 issues, and then computational linguistic problems. I'm not an expert in
 either area.

 I do have 17 years of varied Wikimedia experience (and I use my real name
 there).

 On 04 July 2020 at 12:25 Louis Lecailliez <louis.lecailliez(a)outlook.fr>
 wrote:

 <snip>

 > Nothing precise is said about linguistic resources in the AW paper
 > except for "These function finally can call the lexicographic knowledge
 > stored in Wikidata.", which is not very convincing: first because
 > Wiktionary projects themselves severely lack content and structure, for
 > those that have some content at all; secondly since specialized NLP
 > resources are missing there too (note: I'm not familiar with Wikidata so I
 > could be wrong, however I never saw it cited for the kind of NLP resources
 > I'm talking about).

 I can comment about this. Besides Wiktionary, there is the "lexeme"
 namespace of Wikidata. It is a relatively new part of Wikidata, dealing
 with verbal forms.

 > To finish on a positive note, I would like to highlight the points I
 > really like in the paper. First, its collaborative and open nature, like
 > all Wikimedia projects, gives it a much higher chance of success than its
 > predecessors.

 It is worth saying, for context, that there is a certain style or
 philosophy coming from the wiki side: more precisely, from the wikis before
 Wikipedia. There is the slogan "what is the simplest thing that would
 actually work?" You might argue, plausibly, that Wikipedia at nearly 20
 years old, shows that there is a bit more to engineering than that.

 On the other hand, looking at Wikidata at seven years old, there is some
 point to the comment. It has a rather simple approach to linked structured
 data, compared to the Semantic Web environment. (Really, just write a very
 large piece of JSON and try to cope with it!) But the number of binary
 relations used (8K, if you count the "external links" handling) is now
 quite large, and has grown organically. The data modelling is in a sense
 primitive, sometimes non-existent. But the range of content handled really
 is encyclopedic. And in an area like scientific bibliography, at a scale of
 tens of millions of entities, the advantages of not much ontological
 fussiness begin to be seen.

 Wikidata started as an index of all Wikipedia articles, and is now five
 times the size needed for that: a very enriched "index".

 I suppose the NLP required to code up, for example, 50K chemistry articles
 about molecules, might be a problem that could be solved, leaving aside the
 general problems for the moment.

 In any case, there is a certain approach, neither academic nor commercial,
 that comes with Wikimedia and its communities, and it will be interesting
 to see how the issues are addressed.

 Charles Matthews (in Cambridge UK)
