Hi Amir,
I understand the process is different from usual research. In fact, I've seen Wikipedia grow from an unknown website into the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowdsourcing can achieve.
It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject.
I disagree here. It would be a contribution to scientific knowledge if and only if it wasn't discovered before. My email was precisely about that: capitalizing on the knowledge that has already been discovered, to avoid making the same mistakes again. It would not matter for a small project, but this one is really ambitious. We are speaking of 40 years of work by a horde of talented and very knowledgeable people, so this isn't to be dismissed easily.
The thing is, my previous email was a bit abstract, because it was a review of the paper, not of the project itself. I should have given more examples to illustrate where the problem lies.
Let's start with a simple example, in English, with the corresponding Wikidata entities in parentheses. I'm also using pseudo-Turtle notation with made-up relationships.
France (Q142) is a country (Q6256). <Q142> <rel_is> <Q6256> .
Creating the English sentence is straightforward with the naive approach presented in the paper.
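For instance, a naive renderer for this sentence could be as small as the following sketch (the relation and lexicon names are made up for illustration; this is not the paper's actual design):

    # Minimal sketch of a naive English renderer (invented names, illustration only).
    LABELS_EN = {"Q142": "France", "Q6256": "country"}

    def render_is_a_en(subject_qid, class_qid):
        # "X is a Y." with a naive indefinite-article choice and no other agreement logic.
        subj = LABELS_EN[subject_qid]
        obj = LABELS_EN[class_qid]
        article = "an" if obj[0].lower() in "aeiou" else "a"
        return f"{subj} is {article} {obj}."

    print(render_is_a_en("Q142", "Q6256"))  # France is a country.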
What is the French equivalent? La France est un pays.
More information is required in the abstract representation: the text generator needs to know the gender of both nouns (France and pays). So we need to extend the model as follows:
<Q142> <rel_gender> <Q1775415> . <Q6256> <rel_gender> <Q499327> .
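A French renderer then has to consult this gender information just to pick the articles (a sketch in the same spirit as above, with invented names, and ignoring complications such as elision):

    # Sketch of a French renderer that needs grammatical gender for agreement.
    LABELS_FR = {"Q142": "France", "Q6256": "pays"}
    GENDER = {"Q142": "Q1775415", "Q6256": "Q499327"}  # from the triples above

    def render_is_a_fr(subject_qid, class_qid):
        # Both the subject's definite article and the class's indefinite article
        # depend on grammatical gender.
        subj_article = "La" if GENDER[subject_qid] == "Q1775415" else "Le"
        obj_article = "une" if GENDER[class_qid] == "Q1775415" else "un"
        return f"{subj_article} {LABELS_FR[subject_qid]} est {obj_article} {LABELS_FR[class_qid]}."

    print(render_is_a_fr("Q142", "Q6256"))  # La France est un pays.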
Fine! Now what about Chinese? 法國是一個國家。
What we have in the middle of the sentence is a classifier (個). The model needs the following update:
<Q6256> <rel_use_classifier> <Q63153> .
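And the Chinese renderer has to look that classifier up (again an invented-names sketch; real classifier choice depends on the noun and the construction):

    # Sketch of a Chinese renderer: this construction needs a numeral and a classifier.
    LABELS_ZH = {"Q142": "法國", "Q6256": "國家"}
    CLASSIFIER = {"Q6256": "個"}  # classifier for the class noun, cf. rel_use_classifier above

    def render_is_a_zh(subject_qid, class_qid):
        # "X 是 一 + classifier + Y。" Note that 一 and the classifier appear in this
        # particular construction, not every time 是 (to be) is used.
        clf = CLASSIFIER[class_qid]
        return f"{LABELS_ZH[subject_qid]}是一{clf}{LABELS_ZH[class_qid]}。"

    print(render_is_a_zh("Q142", "Q6256"))  # 法國是一個國家。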
To handle these 3 languages, the model already has 3 additional triples just to account for linguistic facts occurring in these languages. Wikipedia exists in more than 300 languages, and the world has about 6000 of them, each with particularities that must be taken into account. Fortunately, many of them overlap across languages. Nonetheless, the World Atlas of Language Structures (https://wals.info/chapter/s1) counts 144 distinct language features. Some are related to speech, but this means there are probably something like a hundred features that must be taken into account in the data model to produce valid natural language sentences. Note that in the Chinese example, there is also a number (一, one) showing up. This is a phenomenon that must be taken into account, and it does not always appear when 是 (to be) is used. How complex will the "lambda" system need to be just to deal with this issue? Hint: very. It also needs to be compatible with dozens of other phenomena.
Then each of those features requires extensive and complete data. For French, the gender of every noun entity *must* be present, otherwise there is a fifty-fifty chance of producing a wrong sentence each time a noun entity is encountered. For Chinese and Japanese, classifier information must be present for every noun, in case one must be enumerated. Where will the project get this data from? (We are speaking of millions of items, most of them not referenced in existing dictionaries.) How will it be encoded? Those are real questions that must be answered.
Suppose now we have done the work for "renderers" in these three languages. They all use a more or less similar A X B structure where X is a verb meaning "to be".
What would be the Japanese equivalent? The most natural structure would be something like: フランスは国(だ)。
What is at play here is a topicalization (Q63105) of France, followed by a predicate (it's a country). This is very different from the previous structure and, not surprisingly, needs its own representation. To make the situation more difficult, the previous (A be B) structure also exists in Japanese, but would lead to a totally different sentence if used.
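A Japanese renderer therefore has to select a construction, not just fill a template (same kind of invented-names sketch as above):

    # Sketch of a Japanese renderer: the triple is realized as a topic-comment
    # construction rather than by reusing the "A is a B" template of the other languages.
    LABELS_JA = {"Q142": "フランス", "Q6256": "国"}

    def render_is_a_ja(subject_qid, class_qid):
        # Topic marker は + predicate noun (+ optional copula だ): フランスは国だ。
        # Mechanically reusing the English/French/Chinese template (for instance
        # inserting a numeral as in the Chinese sentence) would yield a different,
        # less natural sentence, so this construction needs its own rule or representation.
        return f"{LABELS_JA[subject_qid]}は{LABELS_JA[class_qid]}だ。"

    print(render_is_a_ja("Q142", "Q6256"))  # フランスは国だ。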
The paper states that Figures 1 and 2 are examples that will be more complex in real life. Yet the use of any existing formalism is dismissed, which means all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. Moreover, changing the formalism (be it ad hoc or not) will require changing every piece of code/data that uses it. This will happen every time a language with unsupported feature(s) is added to the project. It's not hard to see how this will waste a huge amount of time and goodwill from the people involved. The very code-focused tone of the paper, the English-centric approach used in the examples and the lack of references show that the complexity of the task on the NLP front is not sufficiently conceptualized.
Best Regards, Louis Lecailliez
________________________________
From: Abstract-Wikipedia <abstract-wikipedia-bounces@lists.wikimedia.org> on behalf of abstract-wikipedia-request@lists.wikimedia.org
Sent: Saturday, 4 July 2020 15:06
To: abstract-wikipedia@lists.wikimedia.org
Subject: Abstract-Wikipedia Digest, Vol 1, Issue 6
Today's Topics:
1. Re: NLP issues severely overlooked (Charles Matthews)
2. Use case: generation of short description (Jakob Voß)
3. Re: NLP issues severely overlooked (Amir E. Aharoni)
----------------------------------------------------------------------
Message: 1
Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST)
From: Charles Matthews <charles.r.matthews@ntlworld.com>
To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
It is interesting to be on a list where one can hear about software issues, and then computational linguistic problems. I'm not an expert in either area.
I do have 17 years of varied Wikimedia experience (and I use my real name there).
On 04 July 2020 at 12:25 Louis Lecailliez louis.lecailliez@outlook.fr wrote:
<snip>
Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because the Wiktionary projects themselves severely lack content and structure, for those that have any content at all; secondly because specialized NLP resources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I have never seen it cited for the kind of NLP resources I'm talking about).
I can comment about this. Besides Wiktionary, there is the "lexeme" namespace of Wikidata. It is a relatively new part of Wikidata, dealing with verbal forms.
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors.
It is worth saying, for context, that there is a certain style or philosophy coming from the wiki side: more precisely, from the wikis before Wikipedia. There is the slogan "what is the simplest thing that would actually work?" You might argue, plausibly, that Wikipedia, at nearly 20 years old, shows that there is a bit more to engineering than that.
On the other hand, looking at Wikidata at seven years old, there is some point to the comment. It has a rather simple approach to linked structured data, compared to the Semantic Web environment. (Really, just write a very large piece of JSON and try to cope with it!) But the number of binary relations used (8K, if you count the "external links" handling) is now quite large, and has grown organically. The data modelling is in a sense primitive, sometimes non-existent. But the range of content handled really is encyclopedic. And in an area like scientific bibliography, at a scale of tens of millions of entities, the advantages of not much ontological fussiness begin to be seen.
Wikidata started as an index of all Wikipedia articles, and is now five times the size needed for that: a very enriched "index".
I suppose the NLP required to code up, for example, 50K chemistry articles about molecules, might be a problem that could be solved, leaving aside the general problems for the moment.
In any case, there is a certain approach, neither academic nor commercial, that comes with Wikimedia and its communities, and it will be interesting to see how the issues are addressed.
Charles Matthews (in Cambridge UK)
------------------------------
Message: 2
Date: Sat, 4 Jul 2020 08:18:56 +0200
From: Jakob Voß <jakob.voss@gbv.de>
To: abstract-wikipedia@lists.wikimedia.org
Subject: [Abstract-wikipedia] Use case: generation of short description
Hi,
I want to auto-generate disambiguation descriptions for African politicians to be added to Wikidata, e.g. from the country Mozambique (Q1029) the following descriptions should be generated:
Mozambican politician (en)
Mosambikanischer Politiker (de)
politico mozambicano (it)
...
This could be extended to other professions. My questions:
- Can anyone point me to data sources where to best look up country adjectives such as "Mozambican"?
- Where/how to best store the lexical information for best reuse with other renderers
- If I create small renderers for these short descriptions, what architecture do you prefer for best reuse?
My just-get-it-done solution would be a set of CSV files and a few lines of Perl code, but maybe this use case can be aligned with Abstract Wikipedia to learn more about it.
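Purely as an illustration of the shape such a renderer could take (hypothetical data layout, not a recommendation for the final architecture):

    # Sketch: generate short descriptions from a tiny per-language lexicon.
    # Hypothetical data; real adjectives would come from CSV files or from lexemes.
    COUNTRY_ADJECTIVE = {
        "Q1029": {"en": "Mozambican", "de": "mosambikanischer", "it": "mozambicano"},
    }
    PROFESSION = {
        "politician": {"en": "politician", "de": "Politiker", "it": "politico"},
    }
    # Word order and capitalization differ per language.
    PATTERN = {"en": "{adj} {noun}", "de": "{adj_cap} {noun}", "it": "{noun} {adj}"}

    def short_description(country_qid, profession, lang):
        adj = COUNTRY_ADJECTIVE[country_qid][lang]
        noun = PROFESSION[profession][lang]
        return PATTERN[lang].format(adj=adj, adj_cap=adj.capitalize(), noun=noun)

    for lang in ("en", "de", "it"):
        print(short_description("Q1029", "politician", lang))
    # Mozambican politician / Mosambikanischer Politiker / politico mozambicano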
Looking forward to collaborating, Jakob
------------------------------
Message: 3
Date: Sat, 4 Jul 2020 18:03:24 +0300
From: "Amir E. Aharoni" <amir.aharoni@mail.huji.ac.il>
To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
Hi,
Thanks a lot for the sources. I am not one of the people implementing Wikilambda; I am just very curious about it as a member of the wider Wikimedia community. But there's a good chance that they will be useful to the people who do work on the implementation.
I will dare to add a little thought I have about it, however. It's possible that the challenge of building a well-functioning natural language generator is underestimated by the founders, and that they don't pay enough attention to existing work (although, knowing Denny, there is a good chance that he actually is aware of at least some of it). But there is something that the wide Wikimedia community has that I'm not sure that the past projects in this field did: The community itself. A big, worldwide, and diverse group of passionate volunteers, who love the idea of spreading free knowledge and who love their languages. Quite a lot of them also know some programming, and in the past they proved unbelievably creative and productive when writing code for Wikimedia projects as a community, in the form of templates, modules, gadgets, bots, extensions, and other tools. I'm quite sure that once the new tools become usable, this community will start doing creative things again, and it will also start reporting bugs and limitations.
So yes, while it's possible that along the way both the core developers and the volunteer community will find all kinds of stumbling blocks, I'm pretty sure that they will also have all kinds of surprising success stories. It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject. And don't underestimate the "open" part—that's where we really shine. This won't be a theoretical work in a lab, published in a paywalled and copyright-restricted academic journal, but fully optimized for accessibility to everyone.
Yes, this whole email from me is incredibly naïve, but it's the same attitude that got us to writing the biggest and most multilingual encyclopedia in history, so maybe we can do something cool again :)
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
On Sat, 4 Jul 2020 at 14:26, Louis Lecailliez <louis.lecailliez@outlook.fr> wrote:
Hello,
My name is Louis Lecailliez, a PhD student in educational technology at Kyoto University. I'm a Computer Science and NLP graduate. One thing I do is work on modelling language learners' knowledge as graphs.
The Abstract Wikipedia project is really interesting. There are however two very concerning issues I spotted when reading the associated paper draft (https://arxiv.org/abs/2004.04733). The following email could be read as negative, but please don't take it as such: my purpose is to avoid spending people's effort and money on things that can (and need to!) be fixed upfront.
- Issues with NLP
The main issue is that the difficulty of the NLP task of generating natural text from an abstract representation is severely overlooked. This stems from the other main problem: the paper is not based on the decades of previous work in that space.
As I understand it, the main value proposition of Abstract Wikipedia (AW) is a computer representation of encyclopedic knowledge that can be projected into different existing natural languages, with the goal of supporting a huge number of them. Plus, an editor to make this happen easily.
This is in fact surprisingly close to what the Universal Networking Language (UNL) project, which started 20 years ago, aims to do. UNL provides a language-agnostic representation of text that uses hypergraphs. A piece of software (called an EnConverter) produces UNL graphs from natural text in a given language. Another kind of software, called a DeConverter, does the reverse, that is, it produces natural text from the abstract representation. This is exactly the function of the "renderers" in the AW paper. The way of doing it is also similar: by applying successive transformations until the final text string is produced. In general, that kind of abstract meaning representation is called an interlingua, and is widely used in Machine Translation (MT) systems.
Disregarding two decades of work, in the UNL case, on the same problem space (rule-based machine translation, here with an abstract language as the fixed source language), which was itself based on a few more decades of work, doesn't seem to be a wise way to start a new project. For a start, the graph representation used in AW will likely not be expressive enough to encode linguistic knowledge; this is why UNL uses hypergraphs instead of graphs.
The problem is glaring when looking at the reference list: it is bloated with irrelevant references (such as those to programming languages [27, 37, 41, 77], Turing completeness being the worst offender [11, 17, 23, ...]) while containing only two references [7, 85] to the really hard part of the project: generating natural language from the abstract representation. There are a few more relevant references about natural language generation, but this isn't enough.
Interestingly, [85] is a UNL paper, but not the main one. Moreover, it is cited in Section 9, "Opening future research". It should instead be placed in a "Previous work" section, which is missing from the paper.
To fill part of this section yet to be written, I propose the following references:
[1*] Uchida, H., Zhu, M., & Della Senta, T. (1999). A gift for a millennium. IAS/UNU, Tokyo. https://www.researchgate.net/profile/Hiroshi_Uchida2/publication/239328725_A...
[2*] Wang-Ju Tsai (2004). La coédition langue-UNL pour partager la révision entre langues d'un document multilingue. [Language-UNL coedition to share revisions in a multilingual document] Thèse de doctorat. Grenoble. https://pdfs.semanticscholar.org/b030/ea4662e393657b9a134c006ca5b08e8a23b3.p...
[3*] Boitet, C., & Tsai, W. J. (2002). La coédition langue<—>UNL pour partager la révision entre les langues d'un document multilingue: un concept unificateur. Proc. TALN-02, Nancy, 22-26. http://www.afcp-parole.org/doc/Archives_JEP/2002_XXIVe_JEP_Nancy/talnrecital...
[4*] Tomokiyo, M., Mangeot, M., & Boitet, C. (2019). Development of a classifiers/quantifiers dictionary towards French-Japanese MT. arXiv preprint arXiv:1902.08061. https://arxiv.org/pdf/1902.08061.pdf
[5*] Boguslavsky, I. (2005). Some controversial issues of UNL: Linguistic aspects. Research on Computer Science, 12, 77-100. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.2058&rep=re...
[6*] Boitet, C. (2002). A rationale for using UNL as an interlingua and more in various domains. In Proc. LREC-02 First International Workshop on UNL, other Interlinguas, and their Applications, Las Palmas (pp. 26-31). https://www.cicling.org/2005/unl-book/Papers/003.pdf
[7*] Dhanabalan, T., & Geetha, T. V. (2003, December). UNL deconverter for Tamil. In International Conference on the Convergences of Knowledge, Culture, Language and Information Technologies. http://www.cfilt.iitb.ac.in/convergence03/all%20data/paper%20032-372.pdf
[8*] Singh, S., Dalal, M., Vachhani, V., Bhattacharyya, P., & Damani, O. P. (2007). Hindi generation from Interlingua (UNL). Machine Translation Summit XI. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.979&rep=rep1...
[9*] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... & Schneider, N. (2013, August). Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (pp. 178-186). https://www.aclweb.org/anthology/W13-2322.pdf
[10*] Berment, V., & Boitet, C. (2012). Heloise—An Ariane-G5 Compatible Environment for Developing Expert MT Systems Online. In Proceedings of COLING 2012: Demonstration Papers (pp. 9-16). https://www.aclweb.org/anthology/C12-3002.pdf
[11*] Berment, V. (2005). Online Translation Services for the Lao Language. In Proceedings of the First International Conference on Lao Studies, De Kalb, Illinois, USA (pp. 1-11). https://www.researchgate.net/profile/Vincent_Berment/publication/242140227_O...
[1*] is the paper that describes UNL. [2*] is a doctoral thesis discussing a core problem AW is trying to address too. [3*] is a short paper done in the scope of [2*]; even if you don't understand French you can have a look at the figures: two of them are about an editor similar in principle to what AW wants to incorporate. [5*] gives insights about UNL expressivity issues, 10 years after the project's start. [6*] is more on UNL, with a short history and the context in which it is used.
[4*] shows how deep natural language conversion goes: this paper addresses the issue of classifiers in French and Japanese. This is just one linguistic issue, and there are dozens if not hundreds of such issues. An important point is that both of the languages involved need to be taken into account when modelling the abstract encoding, otherwise too much information is lost to produce a correct output.
[7*] and [8*] are very valuable examples of real-world deconverter systems for UNL. As is visible in [7*]'s Figure 1 and [8*]'s Figure 2, the process is *way* more complicated than a single "renderers" box. Moreover, there are very distinct identifiable steps, informed by linguistics. The AW paper does not describe any such structuring of the natural text generation processing steps; everything is supposed to happen in some unstructured "lambda" system. Also missing are the specialized resources (UNL-Hindi dictionary, Tamil word dictionary, co-occurrence dictionary, etc.) required for the task. Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because the Wiktionary projects themselves severely lack content and structure, for those that have any content at all; secondly because specialized NLP resources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I have never seen it cited for the kind of NLP resources I'm talking about).
[10*] is a translation system built with "specialised languages for linguistic programming (SLLPs)", which is the kind of service Wikilambda is supposed to provide for Abstract Wikipedia. [11*] gives an estimate of 2500 hours for the development (by a specialist) of three linguistic modules for Lao processing.
So, with regard to the difficulty of the task and the previous work in the literature, the AW paper does not provide any convincing evidence that the technology on which it is supposed to be built can even reach the state of the art. Dismissing every existing formalism and software system on the grounds of "no consensus commiting to any specific linguistic theory" is not going to work: this will result in an ad hoc, implementation-driven formalism that will have a hard time fulfilling its goal. The NLP part (generating sentences from the abstract representation) is the hardest part of the project, yet it's by far the least convincing one. "Abstract Wikipedia is indeed firmly within this tradition, and in preparation for this project we studied numerous predecessors." I would like to believe so, but the lack of corresponding references as well as the lack of a previous work section tends to prove the contrary.
While I can't advocate for a switch to UNL, as I'm not a specialist in it, it would be smart to capitalize on the work done on it by highly skilled (PhD-level) individuals. As the UNL system is built on hypergraphs, it could probably be made interoperable with RDF knowledge graphs fairly easily if named graphs are used. By having a UNL/RDF specification (yet to be written), the vision exposed in the AW paper might be reached sooner by reusing existing software (we are speaking of thousands of man-years of work as per [11*]), and almost as importantly, an existing formalism that has been "debugged" for decades. There are probably other systems I'm unaware of that are worth investigating too, some, like [9*], having more specialized usage. In any case, there is a strong need to ground the paper and the project in the existing (huge) literature.
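To give a rough idea of what such an encoding could look like (purely illustrative pseudo-TriG with made-up names, not a draft of that specification), a UNL-style relation carrying attributes could be wrapped in a named graph, with the attributes attached to the graph's identifier:

    <g1> { <Q142> <rel_is> <Q6256> . }
    <g1> <unl_attribute> <attr_topic> . <g1> <unl_attribute> <attr_present> .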
- Other issues
"In order to evaluate a function call, an evaluator can choose from a multitude of backends: it may evaluate the function call in the browser, in the cloud, on the servers of the Wikimedia Foundation, on a distributed peer-to-peer evaluation platform, or natively on the user’s machine in a dedicated hosting runtime, which could be a mobile app or a server on the user’s computer."
This part is big technical creep. There is no reason to turn the project into a distributed heterogeneous computing platform with a dedicated runtime, which could be a research project on its own, when the stated goal is to provide abstract multilingual encyclopedic content. All the computation can be done on servers (the cloud is servers too) and cached. This is way easier to implement, test and deliver than ten different backends with varying degrees of implementation progress, incompatibilities and runtime characteristics.
The paper presents AW as sitting on top of WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before work on the other project can even start. AW can be realised with current tools and engineering practices.
"One obstacle in the democratization of programming has been that almost every programming language requires first to learn some basic English."
This strong claim needs to be sourced. Programming languages, save for a few keywords, don't rely much on English. The general failure of localized versions of programming languages (such as French BASIC), as well as the heavy use of existing programming languages in countries that don't even use the Latin alphabet (China, Russia), tends to prove that English is not at all a bottleneck for the democratization of programming. [53] is cited later in the paper but is a pop-linguistics article from an online newspaper, not an academic article.
- Final words
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors. If UNL is not too well known, it's not because it didn't yield research achievements, but because one selected institution per language works on it and keeps the resources and software within the lab walls. Secondly, there are some very welcome out-of-scope features: conversion from natural language, and the restriction to encyclopedic-style text. This will allow for a more focused effort towards the end goal, making it more achievable. And finally, the choice to go with a symbolic/rule-based system with a touch of other ML where useful. This is, as said in the paper, a big win for explainability and for using human contributions to build the system. It will also keep the computing cost to a saner baseline than what current deep learning models require.
I think the project can succeed thanks to its openness, yet there are real dangers visible in the paper: on the NLP side, reinventing a wheel that took 40 years to build, and on the technical side, losing time and effort on a project not required per se for AW to be built.
As I spent a significant amount of time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as a co-author in the final paper if any ideas or references presented here are used in it.
Best regards, Louis Lecailliez
PS: Typos
- "These two projects will considerably expand the capabilities of the
Wikimedia platform to enable every single human being to freely share share in the sum of all knowledge." => duplicate share
- "The content is than turned into" => The content is then turned into
- "[26] Charles J Fillmore, Russell Lee-Goldman, and Russell Rhodes. The
framenet constructicon. Sign-based construction grammar, pages 309–372, 2012." => The framenet construction
- "These function finally can call the lexicographic knowlegde stored in
Wikidata." => These function finally can call the lexicographic knowledge stored in Wikidata
- "[102] George Kinsley Zipf. Human Behavior and the Pirnciple of Least
Effort. Addison-Wesley, 1949." => [102] George Kinsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
- "Allowing the individual language Wikipedias to call Wikilambda has an
addtional benefit." => Allowing the individual language Wikipedias to call Wikilambda has an additional benefit. _______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
------------------------------
On 04 July 2020 at 17:12 Louis Lecailliez louis.lecailliez@outlook.fr wrote:
Hi Amir, I understand the process is different from usual research. In fact, I've seen Wikipedia grow from an unknown website into the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowdsourcing can achieve.
Um - I really do prefer it if Wikipedia is described as "collaborative and crowdfunded", which is more accurate. What the Abstract Wikipedia can be is collaborative; and funded by some mechanism. I think the distinction is more than a pedantic one.
Charles
On Sat, 4 Jul 2020 at 21:10, Louis Lecailliez <louis.lecailliez@outlook.fr> wrote:
Hi Amir,
I understand the process is different from usual research. In fact, I've seen Wikipedia grow from an unknown website into the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowdsourcing can achieve.
It's also possible that the mere *finding* of these stumbling blocks by
such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject.
I disagree here. It would be a contribution to scientific knowledge if and only if it wasn't discovered before.
Of course. We don't actually disagree.
The content of Wikipedia is by definition secondary (or even tertiary), but its process is innovative, and it was the subject of thousands of academic papers.
The process of developing renderers for multiple languages by a large and open community of volunteers, rather than a small group of paid academics, is probably going to be of interest to researchers, too.
And the output of the process can be useful and innovative, too. Of course, there are things that a team of trained and well-read academics can do and a community of untrained amateurs cannot, no matter how large and motivated it is. But it works the other way, too: there are things that very well-trained academics cannot (or would not) do, but a large motivated community can (and would). Of course, it will work best if there is collaboration between professionals and amateurs.
On 05 July 2020 at 10:25 "Amir E. Aharoni" amir.aharoni@mail.huji.ac.il wrote:
Of course, there are things that a team of trained and well-read academics can do and a community of untrained amateurs cannot, no matter how large and motivated it is. But it works the other way, too: there are things that very well-trained academics cannot (or would not) do, but a large motivated community can (and would). Of course, it will work best if there is collaboration between professionals and amateurs.
It is worth giving some background here. A precursor of AW is the Wikimedia "article placeholder" project. User:Frimelle wrote a dissertation on it, and I see it is online:
https://commons.wikimedia.org/wiki/File:Generating_Article_Placeholders_from...
That was in 2016. In any case, the Wikimedians and academics are not really disjoint groups of people.
Charles
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
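One workaround that comes to mind (a rough sketch only; the relation names are invented and this is not how Wikidata actually models statements) is to reify each attributed relation as a statement node and attach the UNL-style attributes to it, much like qualifiers, reusing the pseudo-Turtle notation from earlier in the thread:

    <stmt1> <rel_subject> <o1> . <stmt1> <rel_relation> <r> . <stmt1> <rel_object> <o2> .
    <stmt1> <rel_attribute> <a1> . <stmt1> <rel_attribute> <a2> .
    <o1> <rel_restriction> <domain1> . <o2> <rel_restriction> <domain2> .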
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to embed and easy to add global objects and APIs to. Just as Web browsers provide scripting environments and APIs for functions, we can envision providing scripting environments and APIs for natural language generation functions.
I wonder what you might think about scripting environments and APIs for natural language generation scenarios?
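To make the question concrete, here is a hypothetical sketch (invented names, not a Wikilambda design; written in Python only for concision, the same shape applies to a JavaScript or Lua environment) of the kind of global object such an environment could hand to community-written renderer scripts:

    # Hypothetical sketch of an NLG scripting API (invented names, illustration only).
    class NLGContext:
        """Global object exposed to renderer scripts, wrapping lexical lookups."""
        def __init__(self, lexicon):
            self._lexicon = lexicon  # {(qid, lang): {"lemma": ..., "features": {...}}}
        def lexeme(self, qid, lang):
            return self._lexicon[(qid, lang)]

    # A community-written renderer would only see the context object:
    def render_is_a_en(ctx, subject_qid, class_qid):
        subj = ctx.lexeme(subject_qid, "en")["lemma"]
        obj = ctx.lexeme(class_qid, "en")["lemma"]
        article = "an" if obj[0].lower() in "aeiou" else "a"
        return f"{subj} is {article} {obj}."

    ctx = NLGContext({("Q142", "en"): {"lemma": "France", "features": {}},
                      ("Q6256", "en"): {"lemma": "country", "features": {}}})
    print(render_is_a_en(ctx, "Q142", "Q6256"))  # France is a country.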
Best regards, Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language [2] https://wals.info/
From: Louis Lecailliezmailto:louis.lecailliez@outlook.fr Sent: Saturday, July 4, 2020 2:10 PM To: abstract-wikipedia@lists.wikimedia.orgmailto:abstract-wikipedia@lists.wikimedia.org Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Amir,
I understand the process is different that usual research. In fact I've seen Wikipedia grown from an unknown website to the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowd sourcing could achieve.
It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject.
I disagree here. It would be contribution to scientic knowledge if and only if it wasn't discovered before. My email was precisely about that: capitalizing on the knowledge that has already been discovered, to avoid making the same mistake them again. It would not matter for a small project, but this one is really ambitious. We are speaking of 40 years of work by a horde of talented and very knowledgeable people, so this isn't to be dismissed easily.
This thing is, my previous email was a bit abstract, because it were a review of the paper, not of the project itself. I should have made more examples to illustrate where the problem lies.
Let's start with a simple example, in English, with corresponding Wikidata entities in-between parenthesis. I'm also using pseudo-turtle notation with made up relationships.
France (Q142) is a country (Q6256). <Q142> <rel_is> <Q6256> .
Creating the English sentence is straightforward with the naive approach presented in the paper.
What is the French equivalent? La France est un pays.
More information is required in the abstract representation: the text generator needs to know about the gender of both nouns (France and pays). So we need to extend the model as such:
<Q142> <rel_gender> <Q1775415> . <Q6256> <rel_gender> <Q499327> .
Fine! Now what about Chinese? 法國是一個國家。
What we have in the middle of the sentence is a classifier (個). The model needs the following update:
<Q499327> <rel_use_classifier> <Q63153> .
To handle these 3 languages, the model has already 3 additional triples just for accounting for linguistic facts occuring in these languages. Wikipedia exists in more than 300 languages, and the world has about 6000 of them, each of them having particularities that must be taken into account. Fortunately they recoup themselves in-between languages. Nonetheless the World Atlas Language Structures (https://wals.info/chapter/s1) count 144 distinct language features. Some are related to speech, but this means there is probably something like a hundred features that must be taken into account in the data model to produce valid natural language sentence. Note that in the Chinese example, there is also a number (一, one) showing up. This is a phenomenon that must be taken into account; and it's not always appearing when using 是 (to be). How complex the "lambda" system will be just to deal with this issue? Hint: very much. It also needs to be compatible with dozen of other phenomena.
Then each of those features require extensive and complete data. For French, the gender of every noun entity *must* be present, otherwise there is half a chance of producing a wrong sentence each time a noun entity is encountered. For Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. Where does the project will get the data from? (we are speaking of millions of item, most not referenced in existing dictionaries) How will this be encoded? Those are real questions that must be answered.
Suppose now we have done the work for "renderers" in these three languages. They both use the more or less similar A X B structure where X is a verb meaning "to be".
What would be the Japanese equivalent? The more natural structure would be like: フランスは国(だ)。
What is a play here is a topicalization (Q63105) of France, followed by a predicate (it's a country). This is very different from the previous structure, which, not surprisingly enough, needs it's own representation. To make situation more difficult, the previous (A be B) structure can also exists in Japanese, but would lead to a totally different sentence if used.
The paper states that Figure 1 and 2 are examples that will be more complex in real life. Yet, the use of any existing formalism is dismissed, which mean all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. Moreover, changing formalism (be it ad hoc or not) will require to change every piece of code/data using it. This will happen everytime a language with unsupported feature(s) is added to the project. It's not hard to see how this will waste a huge amount of time and goodwill from involved people. The very code focussed tone of the paper, the english-centric approach used in the examples and the lack of references shows that the complexity of the task on the NLP front is not sufficiently conceptualized.
Best Regards, Louis Lecailliez
De : Abstract-Wikipedia abstract-wikipedia-bounces@lists.wikimedia.org de la part de abstract-wikipedia-request@lists.wikimedia.org abstract-wikipedia-request@lists.wikimedia.org Envoyé : samedi 4 juillet 2020 15:06 À : abstract-wikipedia@lists.wikimedia.org abstract-wikipedia@lists.wikimedia.org Objet : Abstract-Wikipedia Digest, Vol 1, Issue 6
Send Abstract-Wikipedia mailing list submissions to abstract-wikipedia@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia or, via email, send a message with subject or body 'help' to abstract-wikipedia-request@lists.wikimedia.org
You can reach the person managing the list at abstract-wikipedia-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Abstract-Wikipedia digest..."
Today's Topics:
1. Re: NLP issues severely overlooked (Charles Matthews) 2. Use case: generation of short description (Jakob Voß) 3. Re: NLP issues severely overlooked (Amir E. Aharoni)
----------------------------------------------------------------------
Message: 1 Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST) From: Charles Matthews charles.r.matthews@ntlworld.com To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" abstract-wikipedia@lists.wikimedia.org Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked Message-ID: 2126327926.39940.1593867909152@mail2.virginmedia.com Content-Type: text/plain; charset="utf-8"
It is interesting to be on a list where one can hear about software issues, and then computational linguistic problems. I'm not an expert in either area.
I do have 17 years of varied Wikimedia experience (and I use my real name there).
On 04 July 2020 at 12:25 Louis Lecailliez louis.lecailliez@outlook.fr wrote:
<snip>
Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because Wiktionary projects themselves severely lacks content and structure for those who has some content at all, secondly since specialized NLP ressources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I never saw it cited for the kind of NLP resources I'm talking about).
I can comment about this. Besides Wiktionary, there is the "lexeme" namespace of Wikidata. It is a relatively new part of Wikidata, dealing with verbal forms.
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives him a much higher chance of success than its predecessors.
It is worth saying, for context, that there is a certain style or philosophy coming from the wiki side: more precisely, from the wikis before Wikipedia. There is the slogan "what is the simplest thing that would actually work?" You might argue, plausibly, that Wikipedia at nearly 20 years old, shows that there is a bit more to engineering than that.
On the other hand, looking at Wikidata at seven years old, there is some point to the comment. It has a rather simple approach to linked structured data, compared to the Semantic Web environment. (Really, just write a very large piece of JSON and try to cope with it!) But the number of binary relations used (8K, if you count the "external links" handling) is now quite large, and has grown organically. The data modelling is in a sense primitive, sometimes non-existent. But the range of content handled really is encyclopedic. And in an area like scientific bibliography, at a scale of tens of millions of entities, the advantages of not much ontological fussiness begin to be seen.
Wikidata started as an index of all Wikipedia articles, and is now five times the size needed for that: a very enriched "index".
I suppose the NLP required to code up, for example, 50K chemistry articles about molecules, might be a problem that could be solved, leaving aside the general problems for the moment.
In any case, there is a certain approach, neither academic nor commercial, that comes with Wikimedia and its communities, and it will be interesting to see how the issues are addressed.
Charles Matthews (in Cambridge UK) -------------- next part -------------- An HTML attachment was scrubbed... URL: https://lists.wikimedia.org/pipermail/abstract-wikipedia/attachments/20200704/1113bab0/attachment-0001.html
------------------------------
Message: 2 Date: Sat, 4 Jul 2020 08:18:56 +0200 From: Jakob Voß jakob.voss@gbv.de To: abstract-wikipedia@lists.wikimedia.org Subject: [Abstract-wikipedia] Use case: generation of short description Message-ID: 4403bbda-040b-6c89-9cb6-6540139250dc@gbv.de Content-Type: text/plain; charset="utf-8"
Hi,
I want to auto-generate disambiguation description for African politicians to be added to Wikidata, e.g. from the country Mozambique (Q1029) the following descriptions should be generated:
Mozambican politician (en) Mosambikanischer Politiker (de) politico mozambicano (it) ...
This could be extended to other professions. My questions:
- Can anyone point me to data sources where to best look up country adjectives such as "Mozambican"?
- Where/how to best store the lexical information for best reuse with other renderers
- If a create small renderers for this short descriptions, what architecture do you prefer for best reuse?
My just-get-it-done solution would be a set of CSV files and a few lines of Perl code, but maybe this use case can be aligned with Abstract Wikidata to better learn about it.
Looking forward to collaborate, Jakob
------------------------------
Message: 3 Date: Sat, 4 Jul 2020 18:03:24 +0300 From: "Amir E. Aharoni" amir.aharoni@mail.huji.ac.il To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" abstract-wikipedia@lists.wikimedia.org Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked Message-ID: CACtNa8t6kbWe21C980h1MxiWNfUp+0eDE82vPMjDUX2UCgb2gw@mail.gmail.com Content-Type: text/plain; charset="utf-8"
Hi,
Thanks a lot for the sources. I am not one of the people implementing Wikilambda, but I am just very curious about it as a member of the wider Wikimedia community. But there's a good chance that they will be useful to people who do work on the implementation.
I will dare to add a little thought I have about it, however. It's possible that the challenge of building a well-functioning natural language generator is underestimated by the founders, and that they don't pay enough attention to existing work (although, knowing Denny, there is a good chance that he actually is aware of at least some of it). But there is something that the wide Wikimedia community has that I'm not sure that the past projects in this field did: The community itself. A big, worldwide, and diverse group of passionate volunteers, who love the idea of spreading free knowledge and who love their languages. Quite a lot of them also know some programming, and in the past they proved unbelievably creative and productive when writing code for Wikimedia projects as a community, in the form of templates, modules, gadgets, bots, extensions, and other tools. I'm quite sure that once the new tools become usable, this community will start doing creative things again, and it will also start reporting bugs and limitations.
So yes, while it's possible that along the way both the core developers and the volunteer community will find all kinds of stumbling blocks, I'm pretty sure that they will also have all kinds of surprising success stories. It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject. And don't underestimate the "open" part—that's where we really shine. This won't be a theoretical work in a lab, published in a paywalled and copyright-restricted academic journal, but fully optimized for accessibility to everyone.
Yes, this whole email from me is incredibly naïve, but it's the same attitude that got us to writing the biggest and most multilingual encyclopedia in history, so maybe we can do something cool again :)
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
בתאריך שבת, 4 ביולי 2020 ב-14:26 מאת Louis Lecailliez < louis.lecailliez@outlook.fr>:
Hello,
my name is Louis Lecailliez, PhD student at Kyoto University in education technology. I'm a Computer Science and NLP graduate. One thing I do is working on language learner's knowledge modelling as graphs.
The Abstract Wikipedia project is really interesting. There is however two very concerning issues I spotted when reading the associated paper draft ( https://arxiv.org/abs/2004.04733). The following email could be read as negative, but please don't take it as such: my purpose is to avoid spending people efforts and money for things that can (need to!) be fixed upfront.
- Issues with NLP
The main issue is that the difficulty of the NLP task of generating natural text from an abstract representation is severely overlooked. This stems from the other main problem: the paper is not based on the decades of previous work in that space.
As I understand it, the main value proposition of Abstract Wikipedia (AW) is a computer representation of encyclopedic knowledge that can be projected into different existing natural languages, with the goal of supporting a huge number of them. Plus, an editor to make this happen easily.
This is in fact surprisingly extremely close to what the Universal Networking Language (UNL) project, which started 20 years ago, aims to do. UNL provides a language agnostic representation of text that uses hypergraph. Software (called EnConverter) produce UNL graphs from natural text in a given language. Another kind of software called DeConverter do the reverse, that is producing natural text from the abstract representation. This is exactly the same function of the "renderers" in the AW paper. The way of doing it is also similar: by applying successive transformations until the final text string is produced. In general, that kind of abstract meaning representation is called an Interlingua, and is widely used in Machine Translation (MT) systems.
Disregarding two decades of work, in the UNL case, on the same problem space (rule-based machine translation, here from an abstract language as fixed source language), which was itself based on few other decades of work, doesn't seem to be a wise move to start a new project. For a start, the graph representation used in the AW will likely not be expressive enough to encode linguistic knowledge; this is why UNL uses hypergraphs instead of graphs.
The problem is glaring when looking at the references list: the list is bloated with irrelevant references (such as those to programming languages [27, 37, 41, 77], Turing completeness being the worst offender [11, 17, 23, ...]) while containing only two references [7, 85] to the really hard part of the project: generating natural language from the abstract representation. There are few more relevant references about natural language generation, but this isn't enough.
Interestingly, [85] is an UNL paper, but not the main one. Moreover, it is cited in Section 9 "Opening future research". This should be instead placed in a "Previous work" section which is missing from the paper.
To fill a part of this section yet to be written, I propose the following references: [*1] Uchida, H., Zhu, M., & Della Senta, T. (1999). A gift for a millennium. IAS/UNU, Tokyo.
https://www.researchgate.net/profile/Hiroshi_Uchida2/publication/239328725_A... [*2] Wang-Ju Tsai (2004) La coédition langue-UNL pour partager la révision entre langues d'un document multilingue. [Language-UNL coedition to share revisions in a multilingual document] Thèse de doctorat. Grenoble.
https://pdfs.semanticscholar.org/b030/ea4662e393657b9a134c006ca5b08e8a23b3.p... [3*] Boitet, C., & Tsai, W. J. (2002). La coédition langue<—> UNL pour partager la révision entre les langues d'un document multilingue: un concept unificateur. Proc. TALN-02, Nancy, 22-26.
http://www.afcp-parole.org/doc/Archives_JEP/2002_XXIVe_JEP_Nancy/talnrecital... [4*] Tomokiyo, M., Mangeot, M., & Boitet, C. (2019). Development of a classifiers/quantifiers dictionary towards French-Japanese MT. arXiv preprint arXiv:1902.08061. https://arxiv.org/pdf/1902.08061.pdf [5*] Boguslavsky, I. (2005). Some controversial issues of UNL: Linguistic aspects. Research on Computer Science, 12, 77-100.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.2058&rep=re... [6*] Boitet, C. (2002). A rationale for using UNL as an interlingua and more in various domains. In Proc. LREC-02 First International Workshop on UNL, other Interlinguas, and their Applications, Las Palmas (pp. 26-31). https://www.cicling.org/2005/unl-book/Papers/003.pdf [7*] Dhanabalan, T., & Geetha, T. V. (2003, December). UNL deconverter for Tamil. In International Conference on the Convergences of Knowledge, Culture, Language and Information Technologies. http://www.cfilt.iitb.ac.in/convergence03/all%20data/paper%20032-372.pdf [8*] Singh, S., Dalal, M., Vachhani, V., Bhattacharyya, P., & Damani, O. P. (2007). Hindi generation from Interlingua (UNL). Machine Translation Summit XI.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.979&rep=rep1... [9*] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... & Schneider, N. (2013, August). Abstract meaning representation for sembanking. In Proceedings of the 7th linguistic annotation workshop and interoperability with discourse (pp. 178-186). https://www.aclweb.org/anthology/W13-2322.pdf [10*] Berment, V., & Boitet, C. (2012). Heloise—An Ariane-G5 Compatible Rnvironment for Developing Expert MT Systems Online. In Proceedings of COLING 2012: Demonstration Papers (pp. 9-16). https://www.aclweb.org/anthology/C12-3002.pdf [11*] Berment, V. (2005). Online Translation Services for the Lao Language. In Proceedings of the First International Conference on Lao Studies. De Kalb, Illinois, USA (pp. 1-11).
https://www.researchgate.net/profile/Vincent_Berment/publication/242140227_O...
[*1] is the paper that describes UNL. [2*] is a doctoral thesis discussing a core problem AW is trying to address too. [3*] is a short paper done in the scope of [2*], even if you don't understand French you can have a look at the figures: two of them are about an editor similar in principe to what AW wants to incorporate. [5*] Insights about UNL expressivity issues, 10 years after the project's start. [6*] More UNL, with short history and context in which it is used.
[4*] shows how deep natural language conversion goes: this paper addresses the issue of classifiers in French and Japanese. This is just one linguistic issue and there are dozens if not hundreds of such. An important point is that both of the languages involved need to be taken into account when modelling the abstract encoding, otherwise too much information is lost for producing a correct output.
[7*] [8*] are very valuable examples of real world deconverter systems for UNL. As it's visible on [7*]'s Figure 1 and [8*]'s Figure 2, the process is *way* more complicated than a single "renderers" box. Moreover, there are very distinct identifiable steps, informed by linguistics. The AW does not describe any such structuration of natural text generation processing steps, everything is supposed to be happening in some unstructured "lambda" system. Also missing are the specialized resources (UNL-Hindi dictionary, Tamil Word dictionary, co-occurrence dictionary, etc.) required for the task. Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because Wiktionary projects themselves severely lacks content and structure for those who has some content at all, secondly since specialized NLP ressources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I never saw it cited for the kind of NLP resources I'm talking about).
[10*] is a translation system built with "specialised languages for linguistic programming (SLLPs)" which is the service Wikilambda is supposed to provide for Abstract Wikipedia. [11*] gives the estimation of 2500 hours for the development (by a specialist) of three linguistic modules for Lao processing.
So, in regard to the difficulty of the task, and previous work in the literature, the AW paper does not provide any convincing evidence that the technology on which it is supposed to be built can even reach the state-of-art. Dismissing every existing formal and software systems on the ground of "no consensus commiting to any specific linguistic theory" is not gonna work: this will result in ad hoc implementation-driven formalism that will have hard time fullfilling its goal. The NLP part (generating sentences from abstract representation) is the hardest of the project, yet it’s by far the least convincing one. "Abstract Wikipedia is indeed firmly within this tradition, and in preparation for this project we studied numerous predecessors." I would like to believe so, but the lack of corresponding reference as well as lack of previous work section tends to prove the contrary.
While I can't advice for a switch to UNL, as I'm not specialist of it, it would be smart to capitalize on the work done on it by highly skilled (PhD level) individuals. As the UNL system is built on hypergraphs, it probably could be made interoperable easily with RDF knowledge graphs if named graphs are used. By having a UNL/RDF specification (yet to be written), the vision exposed in the AW paper may be reached sooner by reusing existing software (we are speaking of thousands man-year of work as per [11*]), and almost as importantly, an existing formalism that has been "debugged" for decades. There are probably other systems I'm unaware of that are worth investigating too, some like [9*] having more specialized usage. In any case, there is a strong need to back the paper and the project on the existing (huge) literature.
- Other issues
"In order to evaluate a function call, an evaluator can choose from a multitude of backends: it may evaluate the function call in the browser, in the cloud, on the servers of the Wikimedia Foundation, on a distributed peer-to-peer evaluation platform, or natively on the user’s machine in a dedicated hosting runtime, which could be a mobile app or a server on the user’s computer."
This part is big technical creep. There is no reason to turn the project into a distributed heterogenous computing platform with a dedicated runtime, which could be a research project on its own, when the stated goal is to provide abstract multilingual encyclopedic content. All the computation can be done on servers (cloud is servers too) and cached. This is way easier to implement, test and deliver than to implement 10 different backends with various progress in implementation, incompatibilities and runtime characteristics.
The paper presents AW as sitting on top of WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before work on the other project can even start. AW can be realised with current tools and engineering practices.
"One obstacle in the democratization of programming has been that almost every programming language requires first to learn some basic English."
This strong claim needs to be sourced. Programming languages, save for a few keywords, do not rely much on English. The broad failure of localized versions of programming languages (such as French BASIC), as well as the heavy use of existing programming languages in countries that do not even use the Latin alphabet (China, Russia), tends to show that English is not at all a bottleneck for the democratization of programming. [53] is cited later in the paper, but it is a pop-linguistics article from an online newspaper, not an academic one.
- Final words
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors. If UNL is not well known, it is not because it didn't yield research achievements, but because a single selected institution per language works on it and keeps the resources and software within the lab walls. Secondly, some features are explicitly declared out of scope, which is very welcome: conversion from natural language, and the restriction to encyclopedic-style text. This allows a more focused effort towards the end goal, making it more achievable. And finally, the choice to go with a symbolic/rule-based system, with a touch of other ML where useful. This is, as the paper says, a big win for explainability and for using human contributions to build the system. It will also keep the computing cost at a saner baseline than what current deep-learning models require.
I think the project can succeed thanks to its openness, yet there are real dangers visible in the paper: on the NLP side, reinventing a wheel that took 40 years to build, and on the technical side, losing time and effort on a sub-project that is not required per se for AW to be built.
As I spent a significant amount of time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as a co-author of the final paper if any of the ideas or references presented here are used in it.
Best regards, Louis Lecailliez
PS: 4. Typos
- "These two projects will considerably expand the capabilities of the
Wikimedia platform to enable every single human being to freely share share in the sum of all knowledge." => duplicate share
- "The content is than turned into" => The content is then turned into
- "[26] Charles J Fillmore, Russell Lee-Goldman, and Russell Rhodes. The
framenet constructicon. Sign-based construction grammar, pages 309–372, 2012." => The framenet construction
- "These function finally can call the lexicographic knowlegde stored in
Wikidata." => These function finally can call the lexicographic knowledge stored in Wikidata
- "[102] George Kinsley Zipf. Human Behavior and the Pirnciple of Least
Effort. Addison-Wesley, 1949." => [102] George Kinsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
- "Allowing the individual language Wikipedias to call Wikilambda has an
addtional benefit." => Allowing the individual language Wikipedias to call Wikilambda has an additional benefit.
Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me, from this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning. For nouns, I think Wikidata items (and their relations to lexeme senses) solve that problem, but we are still missing, I think, a lot of the detail needed to do the same for adjectives and verbs (at least). So there is definitely some room for finding better ways to model this, but maybe Wikidata could be expanded to handle the adjective/verb cases too. In general, the concept of a single meaning associated with a Wikidata item as its identifier, with a collection of attributes and relationships attached to that item, is a powerful one that could resolve many such issues.
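As a rough illustration of the noun case (my own sketch, not an existing Wikidata API; all identifiers are only indicative), one can think of a lexeme whose senses point at the Wikidata item that carries the meaning, so that disambiguation amounts to following that link:

// Hypothetical data model: a lexeme's senses link to Wikidata items, so the
// item carries the meaning and the lexeme carries the language-specific forms.
// All identifiers below are illustrative.

interface Sense {
  senseId: string;               // e.g. "L1234-S1"
  linkedItem: string;            // the Q-item carrying the meaning
}

interface Form {
  representation: string;        // surface string
  grammaticalFeatures: string[]; // e.g. ["singular"]
}

interface Lexeme {
  lexemeId: string;              // e.g. "L1234"
  language: string;              // e.g. "fr"
  lemma: string;
  senses: Sense[];
  forms: Form[];
}

// Disambiguation for nouns: given an item, pick the lexeme in the target
// language that has a sense linked to that item.
function lexemeForItem(lexemes: Lexeme[], item: string, language: string): Lexeme | undefined {
  return lexemes.find(
    (lex) => lex.language === language && lex.senses.some((s) => s.linkedItem === item)
  );
}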
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to embed and to add global objects and APIs to. In the same way that Web browsers provide scripting environments and APIs for functions, we can envision providing scripting environments and APIs for natural language generation functions.
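As a purely hypothetical sketch of what such an environment might expose to contributor scripts (none of these names exist, and this is not a proposed Wikilambda API):

// Hypothetical global API injected into user scripts by a natural-language
// generation scripting environment, in the way browsers inject `document`
// and `fetch`. Every name here is made up for illustration.

interface LexiconEntry {
  lemma: string;
  features: Record<string, string>;
}

interface NlgApi {
  lookupLexeme(item: string, language: string): LexiconEntry | undefined;
  inflect(entry: LexiconEntry, features: Record<string, string>): string;
  emit(text: string): void;  // append text to the rendered output
}

// A contributor-written renderer would then only use the injected API:
function renderIsA(api: NlgApi, subject: string, object: string, language: string): void {
  const subj = api.lookupLexeme(subject, language);
  const obj = api.lookupLexeme(object, language);
  if (subj === undefined || obj === undefined) {
    return;  // missing lexical data; a real renderer would report this
  }
  api.emit(api.inflect(subj, {}) + " is a " + api.inflect(obj, { number: "singular" }) + ".");
}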
I wonder what you might think about scripting environments and API for natural language generation scenarios?
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language
[2] https://wals.info/
That's a good idea, but I think you would need more than that. Take FrameNet, for example, but now starting from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated with other frames and, therefore, participate in other clusters of the network of frames. Moreover, you'd want those prototypical fillers to serve as starting points for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that allows it to be analogically extended to other collections not originally assigned to the item you're looking at.
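To make that concrete, here is a rough, illustrative data-structure sketch of a frame whose elements carry prototypical fillers that themselves point to other frames. This is my own sketch, not FrameNet's actual schema, and the frame and element names are only examples.

// Illustrative only: a frame whose elements list prototypical fillers, and
// fillers that link back to other frames so the network can be traversed
// for analogical extension.

interface Filler {
  lemma: string;          // a prototypical word filling the element
  evokesFrame?: string;   // another frame this filler participates in
}

interface FrameElement {
  name: string;                   // e.g. "Ingestor"
  prototypicalFillers: Filler[];
}

interface Frame {
  name: string;           // e.g. "Ingestion"
  lexicalUnits: string[]; // words that evoke the frame
  elements: FrameElement[];
}

// Analogical extension, very crudely: a candidate filler is plausible for an
// element if it shares a frame with one of the element's prototypical fillers.
function isPlausibleFiller(element: FrameElement, candidate: Filler): boolean {
  return element.prototypicalFillers.some(
    (p) => p.evokesFrame !== undefined && p.evokesFrame === candidate.evokesFrame
  );
}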
Cheers
Tiago
Brainstorming: by analogy with what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions.
With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions:
{
  r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
  r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
  r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations and attributes could be, instead of plain text strings, uniform resource identifiers (URIs). "r1" could be a URI, "a1" could be a URI, "o1" could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
  r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
  r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]
  r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. "APCOM") for such a semantic model, for developers' convenience. We can also consider adding attribute-value pairs to a semantic model.
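For example, a minimal object-model sketch along these lines (my own sketch; all names are hypothetical and offered only to make the shape of such an "APCOM" concrete):

// Minimal sketch of an object model for sets of attributed predicate
// calculus expressions. Identifiers may be plain strings or URIs; attributes
// may optionally carry values. All names are made up for illustration.

interface Attribute {
  name: string;            // e.g. "a1", or a URI
  value?: string;          // optional value, for the attribute-value variant
}

interface Term {
  id: string;              // e.g. "o1", or a URI
  restriction?: string;    // e.g. "icl>domain1"
  attributes: Attribute[];
}

interface Expression {
  relation: Term;                     // e.g. "r1" with its attributes
  arguments: (Term | Expression)[];   // arguments may themselves be expressions
  attributes: Attribute[];            // attributes on the whole expression
}

interface ExpressionSet {
  expressions: Expression[];
  attributes: Attribute[];            // attributes on the set, like ".@a14" above
}

// Example corresponding to the first expression in the attribute-value form above:
const example: Expression = {
  relation: { id: "r1", attributes: [{ name: "a1", value: "v1" }] },
  arguments: [
    { id: "o1", restriction: "icl>domain1", attributes: [{ name: "a2", value: "v2" }] },
    { id: "o2", restriction: "icl>domain2", attributes: [{ name: "a3", value: "v3" }] },
  ],
  attributes: [{ name: "a4", value: "v4" }],
};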
Best regards, Adam
[1] https://en.wikipedia.org/wiki/Document_Object_Model
בתאריך שבת, 4 ביולי 2020 ב-14:26 מאת Louis Lecailliez < louis.lecailliez@outlook.frmailto:louis.lecailliez@outlook.fr>:
Hello,
my name is Louis Lecailliez, PhD student at Kyoto University in education technology. I'm a Computer Science and NLP graduate. One thing I do is working on language learner's knowledge modelling as graphs.
The Abstract Wikipedia project is really interesting. There is however two very concerning issues I spotted when reading the associated paper draft ( https://arxiv.org/abs/2004.04733). The following email could be read as negative, but please don't take it as such: my purpose is to avoid spending people efforts and money for things that can (need to!) be fixed upfront.
- Issues with NLP
The main issue is that the difficulty of the NLP task of generating natural text from an abstract representation is severely overlooked. This stems from the other main problem: the paper is not based on the decades of previous work in that space.
As I understand it, the main value proposition of Abstract Wikipedia (AW) is a computer representation of encyclopedic knowledge that can be projected into different existing natural languages, with the goal of supporting a huge number of them. Plus, an editor to make this happen easily.
This is in fact surprisingly extremely close to what the Universal Networking Language (UNL) project, which started 20 years ago, aims to do. UNL provides a language agnostic representation of text that uses hypergraph. Software (called EnConverter) produce UNL graphs from natural text in a given language. Another kind of software called DeConverter do the reverse, that is producing natural text from the abstract representation. This is exactly the same function of the "renderers" in the AW paper. The way of doing it is also similar: by applying successive transformations until the final text string is produced. In general, that kind of abstract meaning representation is called an Interlingua, and is widely used in Machine Translation (MT) systems.
Disregarding two decades of work, in the UNL case, on the same problem space (rule-based machine translation, here from an abstract language as fixed source language), which was itself based on few other decades of work, doesn't seem to be a wise move to start a new project. For a start, the graph representation used in the AW will likely not be expressive enough to encode linguistic knowledge; this is why UNL uses hypergraphs instead of graphs.
The problem is glaring when looking at the references list: the list is bloated with irrelevant references (such as those to programming languages [27, 37, 41, 77], Turing completeness being the worst offender [11, 17, 23, ...]) while containing only two references [7, 85] to the really hard part of the project: generating natural language from the abstract representation. There are few more relevant references about natural language generation, but this isn't enough.
Interestingly, [85] is an UNL paper, but not the main one. Moreover, it is cited in Section 9 "Opening future research". This should be instead placed in a "Previous work" section which is missing from the paper.
To fill a part of this section yet to be written, I propose the following references: [*1] Uchida, H., Zhu, M., & Della Senta, T. (1999). A gift for a millennium. IAS/UNU, Tokyo.
https://www.researchgate.net/profile/Hiroshi_Uchida2/publication/239328725_A... [*2] Wang-Ju Tsai (2004) La coédition langue-UNL pour partager la révision entre langues d'un document multilingue. [Language-UNL coedition to share revisions in a multilingual document] Thèse de doctorat. Grenoble.
https://pdfs.semanticscholar.org/b030/ea4662e393657b9a134c006ca5b08e8a23b3.p... [3*] Boitet, C., & Tsai, W. J. (2002). La coédition langue<—> UNL pour partager la révision entre les langues d'un document multilingue: un concept unificateur. Proc. TALN-02, Nancy, 22-26.
http://www.afcp-parole.org/doc/Archives_JEP/2002_XXIVe_JEP_Nancy/talnrecital... [4*] Tomokiyo, M., Mangeot, M., & Boitet, C. (2019). Development of a classifiers/quantifiers dictionary towards French-Japanese MT. arXiv preprint arXiv:1902.08061. https://arxiv.org/pdf/1902.08061.pdf [5*] Boguslavsky, I. (2005). Some controversial issues of UNL: Linguistic aspects. Research on Computer Science, 12, 77-100.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.2058&rep=re... [6*] Boitet, C. (2002). A rationale for using UNL as an interlingua and more in various domains. In Proc. LREC-02 First International Workshop on UNL, other Interlinguas, and their Applications, Las Palmas (pp. 26-31). https://www.cicling.org/2005/unl-book/Papers/003.pdf [7*] Dhanabalan, T., & Geetha, T. V. (2003, December). UNL deconverter for Tamil. In International Conference on the Convergences of Knowledge, Culture, Language and Information Technologies. http://www.cfilt.iitb.ac.in/convergence03/all%20data/paper%20032-372.pdf [8*] Singh, S., Dalal, M., Vachhani, V., Bhattacharyya, P., & Damani, O. P. (2007). Hindi generation from Interlingua (UNL). Machine Translation Summit XI.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.979&rep=rep1... [9*] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... & Schneider, N. (2013, August). Abstract meaning representation for sembanking. In Proceedings of the 7th linguistic annotation workshop and interoperability with discourse (pp. 178-186). https://www.aclweb.org/anthology/W13-2322.pdf [10*] Berment, V., & Boitet, C. (2012). Heloise—An Ariane-G5 Compatible Rnvironment for Developing Expert MT Systems Online. In Proceedings of COLING 2012: Demonstration Papers (pp. 9-16). https://www.aclweb.org/anthology/C12-3002.pdf [11*] Berment, V. (2005). Online Translation Services for the Lao Language. In Proceedings of the First International Conference on Lao Studies. De Kalb, Illinois, USA (pp. 1-11).
https://www.researchgate.net/profile/Vincent_Berment/publication/242140227_O...
[*1] is the paper that describes UNL. [2*] is a doctoral thesis discussing a core problem AW is trying to address too. [3*] is a short paper done in the scope of [2*], even if you don't understand French you can have a look at the figures: two of them are about an editor similar in principe to what AW wants to incorporate. [5*] Insights about UNL expressivity issues, 10 years after the project's start. [6*] More UNL, with short history and context in which it is used.
[4*] shows how deep natural language conversion goes: this paper addresses the issue of classifiers in French and Japanese. This is just one linguistic issue and there are dozens if not hundreds of such. An important point is that both of the languages involved need to be taken into account when modelling the abstract encoding, otherwise too much information is lost for producing a correct output.
[7*] [8*] are very valuable examples of real world deconverter systems for UNL. As it's visible on [7*]'s Figure 1 and [8*]'s Figure 2, the process is *way* more complicated than a single "renderers" box. Moreover, there are very distinct identifiable steps, informed by linguistics. The AW does not describe any such structuration of natural text generation processing steps, everything is supposed to be happening in some unstructured "lambda" system. Also missing are the specialized resources (UNL-Hindi dictionary, Tamil Word dictionary, co-occurrence dictionary, etc.) required for the task. Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because Wiktionary projects themselves severely lacks content and structure for those who has some content at all, secondly since specialized NLP ressources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I never saw it cited for the kind of NLP resources I'm talking about).
[10*] is a translation system built with "specialised languages for linguistic programming (SLLPs)" which is the service Wikilambda is supposed to provide for Abstract Wikipedia. [11*] gives the estimation of 2500 hours for the development (by a specialist) of three linguistic modules for Lao processing.
So, in regard to the difficulty of the task, and previous work in the literature, the AW paper does not provide any convincing evidence that the technology on which it is supposed to be built can even reach the state-of-art. Dismissing every existing formal and software systems on the ground of "no consensus commiting to any specific linguistic theory" is not gonna work: this will result in ad hoc implementation-driven formalism that will have hard time fullfilling its goal. The NLP part (generating sentences from abstract representation) is the hardest of the project, yet it’s by far the least convincing one. "Abstract Wikipedia is indeed firmly within this tradition, and in preparation for this project we studied numerous predecessors." I would like to believe so, but the lack of corresponding reference as well as lack of previous work section tends to prove the contrary.
While I can't advice for a switch to UNL, as I'm not specialist of it, it would be smart to capitalize on the work done on it by highly skilled (PhD level) individuals. As the UNL system is built on hypergraphs, it probably could be made interoperable easily with RDF knowledge graphs if named graphs are used. By having a UNL/RDF specification (yet to be written), the vision exposed in the AW paper may be reached sooner by reusing existing software (we are speaking of thousands man-year of work as per [11*]), and almost as importantly, an existing formalism that has been "debugged" for decades. There are probably other systems I'm unaware of that are worth investigating too, some like [9*] having more specialized usage. In any case, there is a strong need to back the paper and the project on the existing (huge) literature.
- Other issues
"In order to evaluate a function call, an evaluator can choose from a multitude of backends: it may evaluate the function call in the browser, in the cloud, on the servers of the Wikimedia Foundation, on a distributed peer-to-peer evaluation platform, or natively on the user’s machine in a dedicated hosting runtime, which could be a mobile app or a server on the user’s computer."
This part is a big case of technical scope creep. There is no reason to turn the project into a distributed, heterogeneous computing platform with a dedicated runtime, which could be a research project on its own, when the stated goal is to provide abstract multilingual encyclopedic content. All the computation can be done on servers (the cloud is servers too) and cached. This is way easier to implement, test and deliver than implementing 10 different backends with varying progress in implementation, incompatibilities and runtime characteristics.
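For illustration, a minimal sketch of that simpler server-side approach (the helper functions are hypothetical placeholders, not anything from the AW paper):

import functools

def fetch_abstract_content(content_id: str) -> str:
    # Placeholder: look up the abstract content (e.g. in a database).
    return f"abstract-content-for-{content_id}"

def render(abstract_content: str, language: str) -> str:
    # Placeholder: the actual renderer, whatever formalism it ends up using.
    return f"[{language}] rendering of {abstract_content}"

@functools.lru_cache(maxsize=100_000)
def rendered_article(content_id: str, language: str) -> str:
    # Computed once on the server; identical requests are served from the cache
    # until the entry is invalidated after an edit.
    return render(fetch_abstract_content(content_id), language)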
The paper presents AW as sitting on top of WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before work on the other project can even start. AW can be realised with current tools and engineering practices.
"One obstacle in the democratization of programming has been that almost every programming language requires first to learn some basic English."
This strong affirmation needs to be sourced. Programming languages, save for a few keywords, don't rely much on English. The near-total failure of localized versions of programming languages (such as French BASIC), as well as the heavy use of existing programming languages in countries that don't even use the Latin alphabet (China, Russia), tends to show that English is not at all a bottleneck for the democratization of programming. [53] is cited later in the paper, but it is a pop-linguistics article from an online newspaper, not an academic article.
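As a small illustration of how little English is actually involved: Python 3, like most modern languages, accepts identifiers in nearly any script, so only the handful of keywords and built-in names remains English.

# Identifiers in Chinese and Russian; only the keywords and built-ins (for, in, print) are English.
国家数 = 0                      # "number of countries", in Chinese
страны = ["France", "Japon"]    # "countries", in Russian
for страна in страны:
    国家数 += 1
print(国家数)                   # prints 2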
- Final words
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors. If UNL is not too well known, it's not because it didn't yield research achievements, but because one selected institution per language works on it and keeps the resources and software within the lab walls. Secondly, there are some very welcome scope restrictions: conversion from natural language is excluded, and the text is restricted to encyclopedic style. This will allow a more focused effort towards the end goal, making it more achievable. And finally, there is the choice to go with a symbolic/rule-based system, with a touch of ML where useful. This is, as said in the paper, a big win for explainability and for using human contributions to build the system. It will also keep the computing cost at a saner baseline than what current deep learning models require.
I think the project can succeed thanks to its openness, yet there are real dangers visible in the paper: on the NLP side, reinventing a wheel that took 40 years to build, and on the technical side, losing time and effort on a sub-project that is not required per se for AW to be built.
As I spent a significant amount of time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as a co-author in the final paper if any ideas or references presented here are used in it.
Best regards, Louis Lecailliez
PS: 4. Typos
- "These two projects will considerably expand the capabilities of the
Wikimedia platform to enable every single human being to freely share share in the sum of all knowledge." => duplicate share
- "The content is than turned into" => The content is then turned into
- "[26] Charles J Fillmore, Russell Lee-Goldman, and Russell Rhodes. The
framenet constructicon. Sign-based construction grammar, pages 309–372, 2012." => The framenet construction
- "These function finally can call the lexicographic knowlegde stored in
Wikidata." => These function finally can call the lexicographic knowledge stored in Wikidata
- "[102] George Kinsley Zipf. Human Behavior and the Pirnciple of Least
Effort. Addison-Wesley, 1949." => [102] George Kinsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
- "Allowing the individual language Wikipedias to call Wikilambda has an
addtional benefit." => Allowing the individual language Wikipedias to call Wikilambda has an additional benefit.
Hi Louis, all,
Louis, thanks for raising that important issue!
I have been looking into a number of related NLG systems, and one thing I noticed is that many of these projects were developed very much in isolation from each other, and often without much concern for ongoing linguistic research. That is what I tried to capture in the research paper by stating that there is no consensus on this, and that it seems too early to commit to a specific solution.
I had a quick look at UNL, but the project looked pretty stale to me - I could not see any activity in the last decade. Furthermore, the page didn't provide access to the source code and instead mentioned that part of the technology is under patents, which is quite a red flag for me; I usually don't look into something like that any further, in order to be able to honestly say that I didn't get any ideas from those patents. If I am mistaken, and there is a freely usable write-up or implementation, I'd be happy to come back and read up more.
Thank you for the annotated bibliography! That is super useful.
But I did look in detail into a (small) number of other, similar systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet, and I learned a lot about that too. Getting an overview of the whole field has been a rather frustrating experience, especially since neither the major textbook in that area - Dale & Reiter - nor the 2018 update to that book by Gatt & Krahmer covers these systems, and research work in that area often omits these practical systems. Accordingly, when I talk with professors and researchers in this area, also about the proposal here, they are more focussed on specific issues and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is more established in many other areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals, covering the whole of language and natural language understanding, which we want to shy away from.
The paper was never meant to be a comprehensive account of the state of the art in natural language generation. That's what Dale & Reiter and Gatt & Krahmer have aimed for, and their works are hundreds of pages long. I had the feeling my paper was already too long, and putting in an overview of the state of the art would have made it at least double the length.
So, given that (and other reasons, as laid out in the paper), a system which could support any of these approaches seemed a more promising way forward. So far, for my own prototype, I have been mostly following Grammatical Framework (because it has a very accessible book, the software is free, the community was friendly, etc.), and it worked well enough to leave me convinced that the whole thing is worth trying out. But I don't know whether that's the best approach.
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first. In a parallel track, we will be creating the function wiki, and then, when the time is ripe we can bring these two strands of work together. So, would you be willing to work on that?
How does this sound for a plan?
Some further points:
This is way easier to implement, test and deliver than to implement 10
different backends with various progress in implementation, incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would be a huge task if the WMF core team were to develop all these different environments. But that's not the plan. The goal is really that *others* will hopefully build these :) All we need to do is to make sure that's possible and encouraged and simple enough. But yeah, not the core team.
The paper presents AW as sitting on top on WL. Both are big projects.
Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely took some years, before even starting the work on the other project.
Yes, that's correct. That is exactly the time that allows us to do the appropriate state-of-the-art analysis. I hope it won't take us years, but that we will be faster.
AW can be realised with current tools and engineering practices.
Only if you commit to a specific implementation, which I am hesitant to do.
[English is an obstacle to programming] This strong affirmation needs to
be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
As I spend a significant time (~10 hours) gathering references and
writing this email (which is 5 pages long in Word), I would like to be mentioned as co-author in the final paper if any idea or references presented here is used in it.
Thank you for your detailed comments, which will certainly improve the second version of the paper. I am happy to mention you in the acknowledgments. For co-authorship, I usually go for a more substantial engagement ;) If you're willing to write up the "Previous work" section along the lines you mentioned above (maybe with Tiago? maybe with others joining?), as a comprehensive overview of existing systems, then I am open to talking about co-authorship :)
For French, the gender of every noun entity *must* be present ... For
Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated.
That's exactly the goal of the lexicographic project on Wikidata, as was pointed out:
https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of lexemes with their classifiers, forms, etc. The lexicographic project was started with Abstract Wikipedia in mind, knowing that exactly this kind of data would be needed.
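For instance, this data can already be pulled programmatically; here is a minimal sketch using the standard wbgetentities API (the exact JSON field names reflect my understanding of the current lexeme format and should be double-checked):

import requests  # third-party: pip install requests

def lexeme_forms(lexeme_id="L12449"):
    # Fetch a lexeme from Wikidata and list its forms with their grammatical features.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": lexeme_id, "format": "json"},
        headers={"User-Agent": "abstract-wikipedia-example/0.1"},
    )
    entity = resp.json()["entities"][lexeme_id]
    for form in entity.get("forms", []):
        spellings = ", ".join(r["value"] for r in form["representations"].values())
        print(spellings, form.get("grammaticalFeatures", []))

lexeme_forms()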
Yet, the use of any existing formalism is dismissed, which mean all the
situations I illustrated in this email will need to be dealt with in an ad hoc fashion.
No, it doesn't have to be ad hoc at all; that's exactly what we can start working on now, long before we get to the point where we would need to make such an ad hoc decision. I hope you'll join us to figure out the best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your beautiful answers, which raised a number of great points much better than I ever could have. And thanks to Louis for starting this more than interesting thread! Let's continue in this vein!
Cheers, Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming: resembling what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions.
With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions:
{
r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations and attributes could be, instead of plain text strings, uniform resource identifiers (URIs). “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]
r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. “APCOM”) for a semantic model, for the convenience of developers. We can also consider adding attribute-value pairs to a semantic model.
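A minimal sketch of what such an object model could look like (in Python rather than JavaScript/Lua; all class and field names here are invented purely for illustration):

from dataclasses import dataclass, field

@dataclass
class Node:
    # An object or relation; the identifier may be a plain string or a URI.
    id: str
    attributes: dict = field(default_factory=dict)   # e.g. {"a1": None} or {"a1": "v1"}

@dataclass
class Expression:
    relation: Node
    arguments: list                                   # list of Node
    attributes: dict = field(default_factory=dict)

@dataclass
class ExpressionSet:
    expressions: list                                 # list of Expression
    attributes: dict = field(default_factory=dict)

# The first expression of the example above, r1.@a1(o1.@a2, o2.@a3).@a4:
e1 = Expression(relation=Node("r1", {"a1": None}),
                arguments=[Node("o1", {"a2": None}), Node("o2", {"a3": None})],
                attributes={"a4": None})
model = ExpressionSet([e1], attributes={"a14": None})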
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Document_Object_Model
*From: *Tiago Timponi Torrent tiago.torrent@ufjf.edu.br *Sent: *Sunday, July 5, 2020 9:06 PM *To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
That’s a good idea, but I think you would need more than that. Take FrameNet, for example, but now departing from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated to other frames and, therefore, participate in other clusters of the network of frames. Moreover, you’d want those prototypical fillers to function as departing points for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that they can be analogically extended to other collections not originally assigned to the item you’re looking at.
Cheers
Tiago
Em dom, 5 de jul de 2020 às 20:03, Arthur Smith arthurpsmith@gmail.com escreveu:
Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me, from this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning - for nouns I think Wikidata items (and their relations to lexeme senses) solve that problem, but we are still missing a lot of the detail needed to do the same with adjectives and verbs (at least). So there is definitely some room for finding better ways to model this - but maybe Wikidata could be expanded to handle the adjective/verb cases too. In general, the concept of a single meaning associated with a Wikidata item as its identifier, and a collection of attributes and relationships attached to that item, is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to use and to add global objects and API to. Resembling how Web browsers provide scripting environments and API for functions, we can envision providing scripting environments and API for natural language generation functions.
I wonder what you might think about scripting environments and API for natural language generation scenarios?
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language
*From: *Louis Lecailliez louis.lecailliez@outlook.fr *Sent: *Saturday, July 4, 2020 2:10 PM *To: *abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Amir,
I understand the process is different that usual research. In fact I've seen Wikipedia grown from an unknown website to the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowd sourcing could achieve.
It's also possible that the mere *finding* of these stumbling blocks by
such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject.
I disagree here. It would be contribution to scientic knowledge if and only if it wasn't discovered before. My email was precisely about that: capitalizing on the knowledge that has already been discovered, to avoid making the same mistake them again. It would not matter for a small project, but this one is really ambitious. We are speaking of 40 years of work by a horde of talented and very knowledgeable people, so this isn't to be dismissed easily.
This thing is, my previous email was a bit abstract, because it were a review of the paper, not of the project itself. I should have made more examples to illustrate where the problem lies.
Let's start with a simple example, in English, with corresponding Wikidata entities in-between parenthesis. I'm also using pseudo-turtle notation with made up relationships.
France (Q142) is a country (Q6256).
<Q142> <rel_is> <Q6256> .
Creating the English sentence is straightforward with the naive approach presented in the paper.
What is the French equivalent?
La France est un pays.
More information is required in the abstract representation: the text generator needs to know about the gender of both nouns (France and pays). So we need to extend the model as such:
<Q142> <rel_gender> <Q1775415> .
<Q6256> <rel_gender> <Q499327> .
Fine! Now what about Chinese?
法國是一個國家。
What we have in the middle of the sentence is a classifier (個). The model needs the following update:
<Q499327> <rel_use_classifier> <Q63153> .
To handle these 3 languages, the model has already 3 additional triples just for accounting for linguistic facts occuring in these languages. Wikipedia exists in more than 300 languages, and the world has about 6000 of them, each of them having particularities that must be taken into account. Fortunately they recoup themselves in-between languages. Nonetheless the World Atlas Language Structures ( https://wals.info/chapter/s1) count 144 distinct language features. Some are related to speech, but this means there is probably something like a hundred features that must be taken into account in the data model to produce valid natural language sentence.
Note that in the Chinese example, there is also a number (一, one) showing up. This is a phenomenon that must be taken into account; and it's not always appearing when using 是 (to be). How complex the "lambda" system will be just to deal with this issue? Hint: very much. It also needs to be compatible with dozen of other phenomena.
Then each of those features require extensive and complete data. For French, the gender of every noun entity *must* be present, otherwise there is half a chance of producing a wrong sentence each time a noun entity is encountered. For Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. Where does the project will get the data from? (we are speaking of millions of item, most not referenced in existing dictionaries) How will this be encoded? Those are real questions that must be answered.
Suppose now we have done the work for "renderers" in these three languages. They both use the more or less similar A X B structure where X is a verb meaning "to be".
What would be the Japanese equivalent?
The more natural structure would be like:
フランスは国(だ)。
What is a play here is a topicalization (Q63105) of France, followed by a predicate (it's a country). This is very different from the previous structure, which, not surprisingly enough, needs it's own representation. To make situation more difficult, the previous (A be B) structure can also exists in Japanese, but would lead to a totally different sentence if used.
The paper states that Figure 1 and 2 are examples that will be more complex in real life. Yet, the use of any existing formalism is dismissed, which mean all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. Moreover, changing formalism (be it ad hoc or not) will require to change every piece of code/data using it. This will happen everytime a language with unsupported feature(s) is added to the project. It's not hard to see how this will waste a huge amount of time and goodwill from involved people. The very code focussed tone of the paper, the english-centric approach used in the examples and the lack of references shows that the complexity of the task on the NLP front is not sufficiently conceptualized.
Best Regards,
Louis Lecailliez
Message: 1 Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST) From: Charles Matthews charles.r.matthews@ntlworld.com To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org> Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
It is interesting to be on a list where one can hear about software issues, and then computational linguistic problems. I'm not an expert in either area.
I do have 17 years of varied Wikimedia experience (and I use my real name there).
On 04 July 2020 at 12:25 Louis Lecailliez louis.lecailliez@outlook.fr
wrote:
<snip>
Nothing precise is said about linguistic resources in the AW paper
except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because Wiktionary projects themselves severely lacks content and structure for those who has some content at all, secondly since specialized NLP ressources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I never saw it cited for the kind of NLP resources I'm talking about).
I can comment about this. Besides Wiktionary, there is the "lexeme" namespace of Wikidata. It is a relatively new part of Wikidata, dealing with verbal forms.
To finish on a positive note, I would like to highlight the points I
really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives him a much higher chance of success than its predecessors.
It is worth saying, for context, that there is a certain style or philosophy coming from the wiki side: more precisely, from the wikis before Wikipedia. There is the slogan "what is the simplest thing that would actually work?" You might argue, plausibly, that Wikipedia at nearly 20 years old, shows that there is a bit more to engineering than that.
On the other hand, looking at Wikidata at seven years old, there is some point to the comment. It has a rather simple approach to linked structured data, compared to the Semantic Web environment. (Really, just write a very large piece of JSON and try to cope with it!) But the number of binary relations used (8K, if you count the "external links" handling) is now quite large, and has grown organically. The data modelling is in a sense primitive, sometimes non-existent. But the range of content handled really is encyclopedic. And in an area like scientific bibliography, at a scale of tens of millions of entities, the advantages of not much ontological fussiness begin to be seen.
Wikidata started as an index of all Wikipedia articles, and is now five times the size needed for that: a very enriched "index".
I suppose the NLP required to code up, for example, 50K chemistry articles about molecules, might be a problem that could be solved, leaving aside the general problems for the moment.
In any case, there is a certain approach, neither academic nor commercial, that comes with Wikimedia and its communities, and it will be interesting to see how the issues are addressed.
Charles Matthews (in Cambridge UK)
Denny Vrandečić wrote:
Accordingly, when I talk with the professors and researchers in this area, also about the proposal here, they are more focussed on specific issues, and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is a more established flow in many areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals and are about covering the whole of language and natural language understanding, which we want to be shying away from.
Well, there is a whole research community at the crossroads of computer science and linguistics: computational linguistics. The annual ACL conference is taking place just this week: https://acl2020.org/ The CL community may have its own quirks, but at least an understanding of both the linguistic problems and the issues of technical implementation should be there.
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first.
I cannot tell anyone how to organize references to scholarly publications and software artifacts, but I would at least recommend using Wikidata to do so. We can get nice overviews with Scholia, once the references are collected and organized in Wikidata. The current coverage of natural language generation, however, is rather shallow:
https://scholia.toolforge.org/topic/Q1513879
Even if Wikidata is not the best tool to collect references, it will surely play some kind of role in Abstract Wikipedia, so it makes sense to get used to it.
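For illustration, such an overview can be queried directly from the Wikidata Query Service (a minimal sketch; P921 is the "main subject" property that, as far as I know, topic pages like the one above are built on):

import requests  # third-party: pip install requests

# Q1513879 = "natural language generation", the topic behind the Scholia page above.
QUERY = """
SELECT ?work ?workLabel WHERE {
  ?work wdt:P921 wd:Q1513879 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "abstract-wikipedia-example/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["workLabel"]["value"])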
Jakob
On 09 July 2020 at 09:27 Jakob Voß jakob.voss@gbv.de wrote:
I cannot force anyone how to organize references to scholarly publications and software artifacts but I would at least recommend to use Wikidata to do so. We can get nice overviews with Scholia, once the references are collected and organized in Wikidata. The current coverage of natural language generation however is rather shallow:
https://scholia.toolforge.org/topic/Q1513879
Even if Wikidata is not the best tool to collect references, it will surely play some kind of role in Abstract Wikipedia, so it makes sense to get used to it.
As a side issue, maybe (or maybe not), I will mention here a proposal I recently heard: to import the DBLP computer science bibliography into Wikidata. For those familiar with WikiCite, the idea would simply be to do with DBLP what has been done with PubMed. This would be a one-year project, basically, using known techniques on a dataset of 4M items.
Rather than explaining more about the bot work that would be involved there, let me just make a basic remark: WikiCite, Scholia and the big effort so far have all concentrated on the biomedical area. In the middle of a global pandemic, it is not likely that I shall be thought to be criticising that emphasis.
But if Abstract Wikipedia means that more attention will be given to DBLP and Wikidata, and any other repositories in the area that will be relevant, that would hardly be a bad thing.
Charles
To preface: an opportunity I see with this project, given its particular nature (open, with the potential for large-scale collaboration), is the ability to engage with language in a manner more grounded in a linguistic tradition (like a FrameNet) than many other NL*X* projects.
Well, there is a whole research community at the crossroad between
computer science and lingustics with Computational Linguistics. The annual ACL conference is taking place just this week: https://acl2020.org/ The CL community may have its own quirks but at least an understanding of both linguistic problems and issues of technical implementations should be there.
I think it is important to involve the computational linguistics community because they have much of the experience in natural language generation. However, I think it is also important to involve red-blooded linguists and *linguist* computational linguists when discussions of theory and linguistic problems come into play. Much of computational linguistics is quite divorced from linguistics and the linguistic tradition in general.
Thanks,
Chris Cooley
Hi, Chris, hi all
As a linguist computational linguist, I couldn't agree more! The good news is that this year's ACL had a great collection of papers discussing precisely how to bring linguistics back into computational linguistics, especially with regard to meaning. By the end of this week they should make all the video-recorded presentations accessible to those who did not register for the conference. Then I'll send links to the papers I think may be of interest to this group.
Cheers
Tiago
D'oh, the idea to use Wikidata for collecting the bibliography is so obvious and good that I am wondering why it wasn't my default assumption.
Oh, wait I am supposed to say: "Yes, obviously, we should collect the literature with Wikidata, when we write up the State of the Art."
Thanks Jakob!
Tiago, by the way, can you compile a list of interesting talks from ACL for us?
Thank you!
And yes, working with the ACL community, particularly with SIGGEN, would be great! I should probably write an email to their mailing list. Is anyone who has experience with SIGGEN willing to check the mail for tone? I'd like to make a good impression.
Denny
Hey!
Here's a list of what got my attention during the conference. Please be reminded that this is ACL, so "Language Model" tends to mean BERT and the like.
Logical Natural Language Generation from Open-Domain Tables https://www.aclweb.org/anthology/2020.acl-main.708.pdf Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, William Yang Wang
AMR-To-Text Generation with Graph Transformer https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00297 Tianming Wang, Xiaojun Wan, Hanqi Jin
Structural Information Preserving for Graph-to-Text Generation https://www.aclweb.org/anthology/2020.acl-main.712.pdf (Linfeng Song, Ante Wang, Jinsong Su, Yue Zhang, Kun Xu, Yubin Ge, Dong Yu)
The other things I found interesting are either not publicly available yet, namely Kathy McKeown's plenary "Rewriting the Past: Assessing the Field through the Lens of Language Generation", or only partially available (the livestream of the talks is not), namely this tutorial on Commonsense Reasoning: https://tinyurl.com/acl2020-commonsense
Hope that helps!
Cheers!
Hi Denny,
yes, the main problem of most of the systems presented in research papers (UNL or not) is that they are locked inside the institutions that made them. A lot of UNL webpages have gone down since the last time I checked (recently), and the system was in fact designed so that it could work over the web while not letting third parties access the code and data. This is of course the exact reverse of the technical and philosophical approach taken here, and very sad, as decades of accumulated knowledge are lost; the papers are far from sufficient to re-create even a fraction of the said systems.
There is also, I guess, a lot of interesting work that has not been translated into English at all (notably in linguistics), as making an academic career in the national language was an option in a lot of places until very recently.
So, would you be willing to work on that?
Yes, of course, I wouldn't have posted to the mailing list otherwise. I like the dual, concurrent approach you are proposing (the linguistic/theoretical track and the function wiki track in parallel). Note though that I'm not an expert by any means in natural language generation; it just happens that I stumbled upon UNL recently, and it has too much in common with this project on the abstract representation/NLG side not to mention it. I also have some researchers' names in mind, as I have met some who worked on the referenced works.
Concerning the paper authorship, I understand your stance, and yes, I'm willing to do more work and write about previous works with those interested. Just to have an idea, what is the expected timeframe for a revision?
Lexicographic data in Wikidata totally flew under my radar. This is indeed something that will be needed in the future, and something I can directly contribute to! As mentioned in [1], the license seems to be an issue, notably for importing existing resources; is there any “fix” planned for that?
All in all, I'm very pleased to see that a lot of aspects are more thought through than I assumed from reading the paper alone, and I'm more confident in the project's success now.
Best regards, Louis Lecailliez
[1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf
________________________________ De : Abstract-Wikipedia abstract-wikipedia-bounces@lists.wikimedia.org de la part de Denny Vrandečić dvrandecic@wikimedia.org Envoyé : mercredi 8 juillet 2020 22:37 À : General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org Objet : Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Louis, all,
Louis, thanks for raising that important issue!
I have been looking in a number of related NLG systems, and one thing I noticed is a pattern of many of these projects being developed very much in isolation from each other, and also often without much concern for ongoing linguistic research. That is what I tried to capture in the research paper by stating that there is no consensus on this, and that it seems too early to commit to a specific solution.
I had given a quick look to UNL, but the project looked pretty stale to me - I could not see any activity in the last decade. Furthermore, the page didn't provide access to the source code and instead mentioned that part of the technology is under patents, which is quite a red flag for me, and I usually don't look into something like that any further, in order to honestly be able to say that I didn't get any ideas from those patents. If I am mistaken, and there is a freely usable write-up or implementation, I'd be happy to come back and read up more.
Thank you for the annotated bibliography! That is super useful.
But I did look into detail into a (small) number of other, similar systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet, and I learned a lot about that too. To get an overview of the whole field has been a rather frustrating experience, especially since the major textbook in that area - Dale & Reiter - doesn't cover these systems, nor the 2018 update to that book by Gatt & Krahmer, and it seems that research work in that area often omits these practical systems. Accordingly, when I talk with the professors and researchers in this area, also about the proposal here, they are more focussed on specific issues, and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is a more established flow in many areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals and are about covering the whole of language and natural language understanding, which we want to be shying away from.
The goal of the paper was never meant to be a comprehensive account of the state of the art in natural language generation. That's what Dale & Reiter and Gatt & Krahmer have aimed for, and their works are hundreds of pages. I had the feeling my paper was already too long, and putting in an overview of the state of the art would have made it at least double the length.
So, given that (and other reasons, as lined out in the paper), it seems that a system which could support any of these approaches seemed a more promising way. So far, for my own prototype, I have been mostly following Grammatical Framework (because it has a very accessible book, the software is free, the community was friendly, etc.), and it worked good enough to leave me convinced that the whole thing is worth trying out. But I don't know whether that's the best approach.
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first. In a parallel track, we will be creating the function wiki, and then, when the time is ripe we can bring these two strands of work together. So, would you be willing to work on that?
How does this sound for a plan?
Some further points:
This is way easier to implement, test and deliver than to implement 10 different backends with various progress in implementation, incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would be a huge task if the WMF core team were to develop all these different environments. But that's not the plan. The goal is really that *others* will hopefully build these :) All we need to do is to make sure that's possible and encouraged and simple enough. But yeah, not the core team.
The paper presents AW as sitting on top on WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely took some years, before even starting the work on the other project.
Yes, that's correct. That is exactly the time that allows us do the appropriate state of the art analysis. I hope it won't take us years, but that we will be faster.
AW can be realised with current tools and engineering practices.
Only if you commit to a specific implementation, which I am hesitant to do.
[English is an obstacle to programming] This strong affirmation needs to be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
As I spend a significant time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as co-author in the final paper if any idea or references presented here is used in it.
Thank you for your detailed comments, which will certainly improve the second version of the paper. I am happy to mention you in the acknowledgments. For co-authorship, I usually go for a more substantial engagement ;) If you're willing to write up the "Previous work" section along the lines you mentioned above (maybe with Tiago? maybe with others who want to join?), aiming for a comprehensive overview of existing systems, then I am open to talking about co-authorship :)
For French, the gender of every noun entity *must* be present ... For Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated.
That's exactly the goal of the lexicographic project on Wikidata, as was pointed out:
https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of Lexemes with their classifiers, forms, etc. The lexicographic project was started with Abstract Wikipedia in mind, knowing that exactly this kind of data will be needed.
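To make that concrete, here is a minimal sketch of how a renderer could already pull such data through the standard wbgetentities API (the classifier property ID "P0000" and the helper names below are placeholders, not the real ones):

// Minimal sketch: fetch a lexeme (e.g. L12449) and read a classifier statement from it.
// "P0000" is a placeholder; the actual property ID for classifiers would have to be looked up.
async function fetchLexeme(id: string): Promise<any> {
  const url = `https://www.wikidata.org/w/api.php?action=wbgetentities&ids=${id}&format=json&origin=*`;
  const response = await fetch(url);
  const json = await response.json();
  return json.entities[id];
}

async function classifierOf(lexemeId: string): Promise<string | undefined> {
  const lexeme = await fetchLexeme(lexemeId);
  const statements = lexeme.claims?.["P0000"] ?? [];     // placeholder property ID
  return statements[0]?.mainsnak?.datavalue?.value?.id;  // points at another lexeme or item
}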
> Yet, the use of any existing formalism is dismissed, which means all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion.
No, not at all - it doesn't have to be ad hoc. That's exactly what we can start working on now, long before we get to the point where we would otherwise have to make that ad hoc decision. I hope you'll join us to figure out the best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your beautiful answers, which made a number of great points much better than I ever could. And thanks to Louis for starting this more than interesting thread! Let's continue in this vein!
Cheers, Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski <adamsobieski@hotmail.com> wrote:
Brainstorming: resembling what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions.
With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions:
{
r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations, and attributes could be, instead of plain text strings, uniform resource identifiers (URIs). “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]
r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. “APCOM”) for a semantic model for the convenience of developers. We can also consider adding attribute-value pairs to a semantic model.
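As a first sketch, such an object model could be as simple as the following (TypeScript, every name purely illustrative):

// Purely illustrative sketch of an "APCOM"-style object model.
type Uri = string;

interface Attributed {
  attributes: Map<Uri, string | null>;   // attribute URI -> optional value (the .[@a=v] part)
}

interface Term extends Attributed {
  object: Uri;                           // e.g. "o1"
  inclusion?: Uri;                       // e.g. the "icl>domain1" restriction
}

interface Expression extends Attributed {
  relation: Uri;                         // e.g. "r1"
  arguments: Term[];                     // the terms o1, o2, ...
}

interface ExpressionSet extends Attributed {
  expressions: Expression[];             // the whole { ... } set, itself attributable
}

A DOM-like API would then add traversal and mutation methods on top of these structures.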
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Document_Object_Model
From: Tiago Timponi Torrent <tiago.torrent@ufjf.edu.br>
Sent: Sunday, July 5, 2020 9:06 PM
To: General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
That’s a good idea, but I think you would need more than that. Take FrameNet, for example, but now departing from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated with other frames and, therefore, participate in other clusters of the network of frames. Moreover, you’d want those prototypical fillers to function as points of departure for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that they can be analogically extended to other collections not originally assigned to the item you’re looking at.
Cheers
Tiago
On Sun, Jul 5, 2020 at 8:03 PM, Arthur Smith <arthurpsmith@gmail.com> wrote:
Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me, from this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning - for nouns I think Wikidata items (and their relations to lexeme senses) solve that problem, but I think we are still missing a lot of the detail needed to do the same with adjectives and verbs (at least). So there is definitely some room for finding better ways to model - but maybe Wikidata could be expanded to handle the adjective/verb cases too. In general the concept of a single meaning associated with a Wikidata item as its identifier, and a collection of attributes and relationships attached to that item, is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski <adamsobieski@hotmail.com> wrote:
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to use and to add global objects and API to. Resembling how Web browsers provide scripting environments and API for functions, we can envision providing scripting environments and API for natural language generation functions.
I wonder what you might think about scripting environments and API for natural language generation scenarios?
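For instance, from a contributor's point of view, the host environment could expose a small global API along these lines (purely illustrative, not an existing interface):

// Purely illustrative: a global object a host (e.g. an embedded V8) might inject for NLG scripts.
declare const nlg: {
  lexeme(id: string): Promise<{ lemma: string; forms: Record<string, string> }>;
  render(pattern: string, slots: Record<string, string>, language: string): string;
};

async function describeCountry(countryLexeme: string, language: string): Promise<string> {
  const country = await nlg.lexeme(countryLexeme);                      // hypothetical lookup
  return nlg.render("is-a-country", { subject: country.lemma }, language);
}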
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language
[2] https://wals.info/
From: Louis Lecailliez <louis.lecailliez@outlook.fr>
Sent: Saturday, July 4, 2020 2:10 PM
To: abstract-wikipedia@lists.wikimedia.org
Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Amir,
I understand the process is different that usual research. In fact I've seen Wikipedia grown from an unknown website to the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowd sourcing could achieve.
It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject.
I disagree here. It would be contribution to scientic knowledge if and only if it wasn't discovered before. My email was precisely about that: capitalizing on the knowledge that has already been discovered, to avoid making the same mistake them again. It would not matter for a small project, but this one is really ambitious. We are speaking of 40 years of work by a horde of talented and very knowledgeable people, so this isn't to be dismissed easily.
This thing is, my previous email was a bit abstract, because it were a review of the paper, not of the project itself. I should have made more examples to illustrate where the problem lies.
Let's start with a simple example, in English, with corresponding Wikidata entities in-between parenthesis. I'm also using pseudo-turtle notation with made up relationships.
France (Q142) is a country (Q6256).
<Q142> <rel_is> <Q6256> .
Creating the English sentence is straightforward with the naive approach presented in the paper.
What is the French equivalent?
La France est un pays.
More information is required in the abstract representation: the text generator needs to know about the gender of both nouns (France and pays). So we need to extend the model as such:
<Q142> <rel_gender> <Q1775415> .
<Q6256> <rel_gender> <Q499327> .
Fine! Now what about Chinese?
法國是一個國家。
What we have in the middle of the sentence is a classifier (個). The model needs the following update:
<Q499327> <rel_use_classifier> <Q63153> .
To handle these 3 languages, the model has already 3 additional triples just for accounting for linguistic facts occuring in these languages. Wikipedia exists in more than 300 languages, and the world has about 6000 of them, each of them having particularities that must be taken into account. Fortunately they recoup themselves in-between languages. Nonetheless the World Atlas Language Structures (https://wals.info/chapter/s1) count 144 distinct language features. Some are related to speech, but this means there is probably something like a hundred features that must be taken into account in the data model to produce valid natural language sentence.
Note that in the Chinese example, there is also a number (一, one) showing up. This is a phenomenon that must be taken into account; and it's not always appearing when using 是 (to be). How complex the "lambda" system will be just to deal with this issue? Hint: very much. It also needs to be compatible with dozen of other phenomena.
Then each of those features require extensive and complete data. For French, the gender of every noun entity *must* be present, otherwise there is half a chance of producing a wrong sentence each time a noun entity is encountered. For Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. Where does the project will get the data from? (we are speaking of millions of item, most not referenced in existing dictionaries) How will this be encoded? Those are real questions that must be answered.
Suppose now we have done the work for "renderers" in these three languages. They both use the more or less similar A X B structure where X is a verb meaning "to be".
What would be the Japanese equivalent?
The more natural structure would be like:
フランスは国(だ)。
What is at play here is a topicalization (Q63105) of France, followed by a predicate (it's a country). This is very different from the previous structure and, not surprisingly, needs its own representation. To make the situation more difficult, the previous (A be B) structure also exists in Japanese, but would lead to a totally different sentence if used.
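Just to make the combinatorial problem concrete, even a naive renderer for this single triple already needs per-language data and a per-language sentence pattern; an illustrative sketch only, not a proposal:

// Illustrative only: one "A is a B" fact, but every language needs its own data and pattern.
type Entity = { label: string; gender?: "masc" | "fem"; classifier?: string };

function renderIsA(a: Entity, b: Entity, language: string): string {
  switch (language) {
    case "en": return `${a.label} is a ${b.label}.`;
    case "fr": {                                        // needs grammatical gender of both nouns
      const subjArt = a.gender === "fem" ? "La" : "Le";
      const objArt = b.gender === "fem" ? "une" : "un";
      return `${subjArt} ${a.label} est ${objArt} ${b.label}.`;
    }
    case "zh": return `${a.label}是一${b.classifier ?? "個"}${b.label}。`;  // needs a classifier
    case "ja": return `${a.label}は${b.label}だ。`;      // topicalized structure, no shared pattern
    default: throw new Error(`no renderer for ${language}`);
  }
}

And this is before numbers, plurals, definiteness, honorifics or word order variation even enter the picture.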
The paper states that Figures 1 and 2 are examples that will be more complex in real life. Yet, the use of any existing formalism is dismissed, which means all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. Moreover, changing the formalism (be it ad hoc or not) will require changing every piece of code and data using it. This will happen every time a language with unsupported feature(s) is added to the project. It's not hard to see how this will waste a huge amount of time and goodwill from the people involved. The very code-focussed tone of the paper, the English-centric approach used in the examples and the lack of references show that the complexity of the task on the NLP front is not sufficiently conceptualized.
Best Regards,
Louis Lecailliez
From: Abstract-Wikipedia <abstract-wikipedia-bounces@lists.wikimedia.org> on behalf of abstract-wikipedia-request@lists.wikimedia.org
Sent: Saturday, July 4, 2020 15:06
To: abstract-wikipedia@lists.wikimedia.org
Subject: Abstract-Wikipedia Digest, Vol 1, Issue 6
Today's Topics:
1. Re: NLP issues severely overlooked (Charles Matthews)
2. Use case: generation of short description (Jakob Voß)
3. Re: NLP issues severely overlooked (Amir E. Aharoni)
----------------------------------------------------------------------
Message: 1
Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST)
From: Charles Matthews <charles.r.matthews@ntlworld.com>
To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
It is interesting to be on a list where one can hear about software issues, and then computational linguistic problems. I'm not an expert in either area.
I do have 17 years of varied Wikimedia experience (and I use my real name there).
> On 04 July 2020 at 12:25 Louis Lecailliez <louis.lecailliez@outlook.fr> wrote:
> <snip>
> Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because Wiktionary projects themselves severely lacks content and structure for those who has some content at all, secondly since specialized NLP ressources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I never saw it cited for the kind of NLP resources I'm talking about).
I can comment about this. Besides Wiktionary, there is the "lexeme" namespace of Wikidata. It is a relatively new part of Wikidata, dealing with verbal forms.
> To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives him a much higher chance of success than its predecessors.
It is worth saying, for context, that there is a certain style or philosophy coming from the wiki side: more precisely, from the wikis before Wikipedia. There is the slogan "what is the simplest thing that would actually work?" You might argue, plausibly, that Wikipedia, at nearly 20 years old, shows that there is a bit more to engineering than that.
On the other hand, looking at Wikidata at seven years old, there is some point to the comment. It has a rather simple approach to linked structured data, compared to the Semantic Web environment. (Really, just write a very large piece of JSON and try to cope with it!) But the number of binary relations used (8K, if you count the "external links" handling) is now quite large, and has grown organically. The data modelling is in a sense primitive, sometimes non-existent. But the range of content handled really is encyclopedic. And in an area like scientific bibliography, at a scale of tens of millions of entities, the advantages of not much ontological fussiness begin to be seen.
Wikidata started as an index of all Wikipedia articles, and is now five times the size needed for that: a very enriched "index".
I suppose the NLP required to code up, for example, 50K chemistry articles about molecules, might be a problem that could be solved, leaving aside the general problems for the moment.
In any case, there is a certain approach, neither academic nor commercial, that comes with Wikimedia and its communities, and it will be interesting to see how the issues are addressed.
Charles Matthews (in Cambridge UK)
------------------------------
Message: 2
Date: Sat, 4 Jul 2020 08:18:56 +0200
From: Jakob Voß <jakob.voss@gbv.de>
To: abstract-wikipedia@lists.wikimedia.org
Subject: [Abstract-wikipedia] Use case: generation of short description
Hi,
I want to auto-generate disambiguation descriptions for African politicians to be added to Wikidata, e.g. from the country Mozambique (Q1029) the following descriptions should be generated:
Mozambican politician (en)
Mosambikanischer Politiker (de)
politico mozambicano (it)
...
This could be extended to other professions. My questions:
- Can anyone point me to data sources where best to look up country adjectives such as "Mozambican"?
- Where/how to best store the lexical information for best reuse with other renderers?
- If I create small renderers for these short descriptions, what architecture do you prefer for best reuse?
My just-get-it-done solution would be a set of CSV files and a few lines of Perl code, but maybe this use case can be aligned with Abstract Wikidata to better learn about it.
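For example (all data and names made up, just to illustrate the shape of it), the renderer could be little more than a per-language adjective table plus a per-language composition rule:

// Illustrative sketch: country adjectives per language plus a composition rule per language.
const countryAdjective: Record<string, Record<string, string>> = {
  // Mozambique (Q1029); forms as needed before a masculine noun
  Q1029: { en: "Mozambican", de: "mosambikanischer", it: "mozambicano" },
};

const compose: Record<string, (adj: string, noun: string) => string> = {
  en: (adj, noun) => `${adj} ${noun}`,      // Mozambican politician
  de: (adj, noun) => `${adj} ${noun}`,      // mosambikanischer Politiker
  it: (adj, noun) => `${noun} ${adj}`,      // politico mozambicano
};

function shortDescription(countryId: string, noun: Record<string, string>, lang: string): string {
  return compose[lang](countryAdjective[countryId][lang], noun[lang]);
}

// shortDescription("Q1029", { en: "politician", de: "Politiker", it: "politico" }, "it")
//   -> "politico mozambicano"

It breaks down as soon as gender agreement matters (e.g. "mosambikanische Politikerin"), which is exactly where shared lexeme data would help.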
Looking forward to collaborate, Jakob
------------------------------
Message: 3
Date: Sat, 4 Jul 2020 18:03:24 +0300
From: "Amir E. Aharoni" <amir.aharoni@mail.huji.ac.il>
To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org>
Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
Hi,
Thanks a lot for the sources. I am not one of the people implementing Wikilambda, but I am just very curious about it as a member of the wider Wikimedia community. But there's a good chance that they will be useful to people who do work on the implementation.
I will dare to add a little thought I have about it, however. It's possible that the challenge of building a well-functioning natural language generator is underestimated by the founders, and that they don't pay enough attention to existing work (although, knowing Denny, there is a good chance that he actually is aware of at least some of it). But there is something that the wide Wikimedia community has that I'm not sure that the past projects in this field did: The community itself. A big, worldwide, and diverse group of passionate volunteers, who love the idea of spreading free knowledge and who love their languages. Quite a lot of them also know some programming, and in the past they proved unbelievably creative and productive when writing code for Wikimedia projects as a community, in the form of templates, modules, gadgets, bots, extensions, and other tools. I'm quite sure that once the new tools become usable, this community will start doing creative things again, and it will also start reporting bugs and limitations.
So yes, while it's possible that along the way both the core developers and the volunteer community will find all kinds of stumbling blocks, I'm pretty sure that they will also have all kinds of surprising success stories. It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject. And don't underestimate the "open" part—that's where we really shine. This won't be a theoretical work in a lab, published in a paywalled and copyright-restricted academic journal, but fully optimized for accessibility to everyone.
Yes, this whole email from me is incredibly naïve, but it's the same attitude that got us to writing the biggest and most multilingual encyclopedia in history, so maybe we can do something cool again :)
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
On Sat, Jul 4, 2020 at 14:26, Louis Lecailliez <louis.lecailliez@outlook.fr> wrote:
Hello,
my name is Louis Lecailliez, a PhD student in educational technology at Kyoto University. I'm a Computer Science and NLP graduate. One thing I do is work on modelling language learners' knowledge as graphs.
The Abstract Wikipedia project is really interesting. There are, however, two very concerning issues I spotted when reading the associated paper draft (https://arxiv.org/abs/2004.04733). The following email could be read as negative, but please don't take it as such: my purpose is to avoid wasting people's effort and money on things that can (need to!) be fixed upfront.
- Issues with NLP
The main issue is that the difficulty of the NLP task of generating natural text from an abstract representation is severely overlooked. This stems from the other main problem: the paper is not based on the decades of previous work in that space.
As I understand it, the main value proposition of Abstract Wikipedia (AW) is a computer representation of encyclopedic knowledge that can be projected into different existing natural languages, with the goal of supporting a huge number of them. Plus, an editor to make this happen easily.
This is in fact surprisingly close to what the Universal Networking Language (UNL) project, which started 20 years ago, aims to do. UNL provides a language-agnostic representation of text that uses hypergraphs. Software (called an EnConverter) produces UNL graphs from natural text in a given language. Another kind of software, called a DeConverter, does the reverse, that is, produces natural text from the abstract representation. This is exactly the same function as the "renderers" in the AW paper. The way of doing it is also similar: by applying successive transformations until the final text string is produced. In general, that kind of abstract meaning representation is called an interlingua, and is widely used in Machine Translation (MT) systems.
Disregarding two decades of work, in the UNL case, on the same problem space (rule-based machine translation, here from an abstract language as the fixed source language), which was itself based on a few more decades of work, doesn't seem to be a wise way to start a new project. For a start, the graph representation used in AW will likely not be expressive enough to encode linguistic knowledge; this is why UNL uses hypergraphs instead of graphs.
The problem is glaring when looking at the reference list: it is bloated with irrelevant references (such as those to programming languages [27, 37, 41, 77], Turing completeness being the worst offender [11, 17, 23, ...]) while containing only two references [7, 85] to the really hard part of the project: generating natural language from the abstract representation. There are a few more relevant references about natural language generation, but this isn't enough.
Interestingly, [85] is a UNL paper, but not the main one. Moreover, it is cited in Section 9 "Opening future research". It should instead be discussed in a "Previous work" section, which is missing from the paper.
To fill a part of this section yet to be written, I propose the following references:
[1*] Uchida, H., Zhu, M., & Della Senta, T. (1999). A gift for a millennium. IAS/UNU, Tokyo. https://www.researchgate.net/profile/Hiroshi_Uchida2/publication/239328725_A...
[2*] Wang-Ju Tsai (2004). La coédition langue-UNL pour partager la révision entre langues d'un document multilingue. [Language-UNL coedition to share revisions in a multilingual document] Thèse de doctorat. Grenoble. https://pdfs.semanticscholar.org/b030/ea4662e393657b9a134c006ca5b08e8a23b3.p...
[3*] Boitet, C., & Tsai, W. J. (2002). La coédition langue <-> UNL pour partager la révision entre les langues d'un document multilingue: un concept unificateur. Proc. TALN-02, Nancy, 22-26. http://www.afcp-parole.org/doc/Archives_JEP/2002_XXIVe_JEP_Nancy/talnrecital...
[4*] Tomokiyo, M., Mangeot, M., & Boitet, C. (2019). Development of a classifiers/quantifiers dictionary towards French-Japanese MT. arXiv preprint arXiv:1902.08061. https://arxiv.org/pdf/1902.08061.pdf
[5*] Boguslavsky, I. (2005). Some controversial issues of UNL: Linguistic aspects. Research on Computer Science, 12, 77-100. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.2058&rep=re...
[6*] Boitet, C. (2002). A rationale for using UNL as an interlingua and more in various domains. In Proc. LREC-02 First International Workshop on UNL, other Interlinguas, and their Applications, Las Palmas (pp. 26-31). https://www.cicling.org/2005/unl-book/Papers/003.pdf
[7*] Dhanabalan, T., & Geetha, T. V. (2003, December). UNL deconverter for Tamil. In International Conference on the Convergences of Knowledge, Culture, Language and Information Technologies. http://www.cfilt.iitb.ac.in/convergence03/all%20data/paper%20032-372.pdf
[8*] Singh, S., Dalal, M., Vachhani, V., Bhattacharyya, P., & Damani, O. P. (2007). Hindi generation from Interlingua (UNL). Machine Translation Summit XI. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.979&rep=rep1...
[9*] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... & Schneider, N. (2013, August). Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (pp. 178-186). https://www.aclweb.org/anthology/W13-2322.pdf
[10*] Berment, V., & Boitet, C. (2012). Heloise - An Ariane-G5 Compatible Environment for Developing Expert MT Systems Online. In Proceedings of COLING 2012: Demonstration Papers (pp. 9-16). https://www.aclweb.org/anthology/C12-3002.pdf
[11*] Berment, V. (2005). Online Translation Services for the Lao Language. In Proceedings of the First International Conference on Lao Studies. De Kalb, Illinois, USA (pp. 1-11). https://www.researchgate.net/profile/Vincent_Berment/publication/242140227_O...
[1*] is the paper that describes UNL. [2*] is a doctoral thesis discussing a core problem AW is trying to address too. [3*] is a short paper done in the scope of [2*]; even if you don't understand French you can have a look at the figures: two of them are about an editor similar in principle to what AW wants to incorporate. [5*] gives insights about UNL expressivity issues, 10 years after the project's start. [6*] is more on UNL, with a short history and the context in which it is used.
[4*] shows how deep natural language conversion goes: this paper addresses the issue of classifiers in French and Japanese. This is just one linguistic issue, and there are dozens if not hundreds of such issues. An important point is that both of the languages involved need to be taken into account when modelling the abstract encoding, otherwise too much information is lost to produce a correct output.
[7*] and [8*] are very valuable examples of real-world deconverter systems for UNL. As is visible in [7*]'s Figure 1 and [8*]'s Figure 2, the process is *way* more complicated than a single "renderers" box. Moreover, there are very distinct identifiable steps, informed by linguistics. The AW paper does not describe any such structuring of the natural text generation processing steps; everything is supposed to happen in some unstructured "lambda" system. Also missing are the specialized resources (UNL-Hindi dictionary, Tamil word dictionary, co-occurrence dictionary, etc.) required for the task. Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because the Wiktionary projects themselves severely lack content and structure, for those that have any content at all; secondly because specialized NLP resources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I never saw it cited for the kind of NLP resources I'm talking about).
[10*] is a translation system built with "specialised languages for linguistic programming (SLLPs)", which is the service Wikilambda is supposed to provide for Abstract Wikipedia. [11*] gives an estimate of 2,500 hours for the development (by a specialist) of three linguistic modules for Lao processing.
So, with regard to the difficulty of the task and previous work in the literature, the AW paper does not provide any convincing evidence that the technology on which it is supposed to be built can even reach the state of the art. Dismissing every existing formalism and software system on the grounds of "no consensus commiting to any specific linguistic theory" is not going to work: this will result in an ad hoc, implementation-driven formalism that will have a hard time fulfilling its goal. The NLP part (generating sentences from the abstract representation) is the hardest part of the project, yet it's by far the least convincing one. "Abstract Wikipedia is indeed firmly within this tradition, and in preparation for this project we studied numerous predecessors." I would like to believe so, but the lack of corresponding references, as well as the lack of a previous work section, tends to suggest the contrary.
While I can't advise switching to UNL, as I'm not a specialist in it, it would be smart to capitalize on the work done on it by highly skilled (PhD-level) individuals. As the UNL system is built on hypergraphs, it could probably be made interoperable with RDF knowledge graphs fairly easily if named graphs are used. By having a UNL/RDF specification (yet to be written), the vision exposed in the AW paper may be reached sooner by reusing existing software (we are speaking of thousands of person-years of work, as per [11*]) and, almost as importantly, an existing formalism that has been "debugged" for decades. There are probably other systems I'm unaware of that are worth investigating too, some, like [9*], having more specialized usage. In any case, there is a strong need to ground the paper and the project in the existing (huge) literature.
- Other issues
"In order to evaluate a function call, an evaluator can choose from a multitude of backends: it may evaluate the function call in the browser, in the cloud, on the servers of the Wikimedia Foundation, on a distributed peer-to-peer evaluation platform, or natively on the user’s machine in a dedicated hosting runtime, which could be a mobile app or a server on the user’s computer."
This part is major technical scope creep. There is no reason to turn the project into a distributed heterogeneous computing platform with a dedicated runtime, which could be a research project on its own, when the stated goal is to provide abstract multilingual encyclopedic content. All the computation can be done on servers (the cloud is servers too) and cached. This is way easier to implement, test and deliver than to implement 10 different backends with various progress in implementation, incompatibilities and runtime characteristics.
The paper presents AW as sitting on top of WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before work on the other project can even start. AW can be realised with current tools and engineering practices.
"One obstacle in the democratization of programming has been that almost every programming language requires first to learn some basic English."
This strong affirmation needs to be sourced. Programming languages, save for a few keywords, don't rely much on English. The near-total failure of localized versions of programming languages (such as French Basic), as well as the heavy use of existing programming languages in countries that don't even use the Latin alphabet (China, Russia), tends to show that English is not at all a bottleneck for the democratization of programming. [53] is cited later in the paper, but it is a pop-linguistics article from an online newspaper, not an academic article.
- Final words
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors. If UNL is not too well known, it's not because it didn't yield research achievements, but because one selected institution per language is working on it and keeps the resources and software within the lab walls. Secondly, there are some very welcome out-of-scope decisions: no conversion from natural language, and a restriction to encyclopedic-style text. This will allow for more focused effort towards the end goal, making it more achievable. And finally, the choice to go with a symbolic/rule-based system, with a touch of other ML where useful. This is, as said in the paper, a big win for explainability and for using human contributions to build the system. This will also keep the computing cost at a more sane baseline than what current deep learning models require.
I think the project can succeed thanks to its openness, yet there are real dangers visible in the paper: on the NLP side, reinventing a wheel that took 40 years to build; and on the technical side, losing time and effort on a project that is not required per se for AW to be built.
As I spent a significant amount of time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as a co-author in the final paper if any ideas or references presented here are used in it.
Best regards, Louis Lecailliez
PS: Typos
- "These two projects will considerably expand the capabilities of the Wikimedia platform to enable every single human being to freely share share in the sum of all knowledge." => duplicate "share"
- "The content is than turned into" => The content is then turned into
- "[26] Charles J Fillmore, Russell Lee-Goldman, and Russell Rhodes. The framenet constructicon. Sign-based construction grammar, pages 309–372, 2012." => The framenet construction
- "These function finally can call the lexicographic knowlegde stored in Wikidata." => These function finally can call the lexicographic knowledge stored in Wikidata
- "[102] George Kinsley Zipf. Human Behavior and the Pirnciple of Least Effort. Addison-Wesley, 1949." => [102] George Kinsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
- "Allowing the individual language Wikipedias to call Wikilambda has an addtional benefit." => Allowing the individual language Wikipedias to call Wikilambda has an additional benefit.
End of Abstract-Wikipedia Digest, Vol 1, Issue 6 ************************************************
--
Tiago Timponi Torrent
PPG-Linguística - FrameNet Brasil
Universidade Federal de Juiz de Fora
Quick side question: is there a role for formal ontology (FOL, DL or CL type of thing) in computational linguistics?
Mike
On 7/9/2020 8:22 AM, Louis Lecailliez wrote:
Hi Denny,
yes, the main problem with most of the systems presented in research papers (UNL or not) is that they are locked inside the institutions that made them. A lot of UNL webpages have gone down since the last time I checked (recently), and the system was in fact designed in a way that it could work over the web without letting third parties access the code and data. This is of course the exact reverse of the technical and philosophical approach taken here, and very sad, as decades of accumulated knowledge are lost; the papers are far from sufficient to re-create even a fraction of the said systems.
There is also, I guess, a lot of interesting work that is not translated into English at all (notably in linguistics), as making an academic career in the national language was an option in a lot of places until very recently.
> So, would you be willing to work on that?
Yes, of course, I wouldn't have posted on the mailing list otherwise. I like the dual, concurrent approach to linguistics/theory you are proposing. Note though that I'm not an expert by any means in natural language generation; it just happens that I stumbled upon UNL recently, and it has too much in common with this project on the abstract representation/NLG side not to mention it. I also had some researchers' names in mind, as I have met some who worked on the referenced works.
Concerning the paper authorship, I understand your stance, and yes, I'm willing to work more and write about previous work with those interested. Just to have an idea, what is the expected timeframe for a revision?
Lexicographic data in Wikidata totally flew under my radar. This is indeed something that will be needed in the future, and something I can directly contribute to as well! As mentioned by [1], the license seems to be an issue, notably for importing existing resources; is there any “fix” planned for that?
All in all, I'm very pleased to see that a lot of aspects are more planned out than I assumed from reading the paper alone, and I'm more confident in the project's success now.
Best regards, Louis Lecailliez
[1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf
But I did look into detail into a (small) number of other, similar systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet, and I learned a lot about that too. To get an overview of the whole field has been a rather frustrating experience, especially since the major textbook in that area - Dale & Reiter - doesn't cover these systems, nor the 2018 update to that book by Gatt & Krahmer, and it seems that research work in that area often omits these practical systems. Accordingly, when I talk with the professors and researchers in this area, also about the proposal here, they are more focussed on specific issues, and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is a more established flow in many areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals and are about covering the whole of language and natural language understanding, which we want to be shying away from.
The goal of the paper was never meant to be a comprehensive account of the state of the art in natural language generation. That's what Dale & Reiter and Gatt & Krahmer have aimed for, and their works are hundreds of pages. I had the feeling my paper was already too long, and putting in an overview of the state of the art would have made it at least double the length.
So, given that (and other reasons, as lined out in the paper), it seems that a system which could support any of these approaches seemed a more promising way. So far, for my own prototype, I have been mostly following Grammatical Framework (because it has a very accessible book, the software is free, the community was friendly, etc.), and it worked good enough to leave me convinced that the whole thing is worth trying out. But I don't know whether that's the best approach.
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first. In a parallel track, we will be creating the function wiki, and then, when the time is ripe we can bring these two strands of work together. So, would you be willing to work on that?
How does this sound for a plan?
Some further points:
This is way easier to implement, test and deliver than to implement
10 different backends with various progress in implementation, incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would be a huge task if the WMF core team were to develop all these different environments. But that's not the plan. The goal is really that *others* will hopefully build these :) All we need to do is to make sure that's possible and encouraged and simple enough. But yeah, not the core team.
The paper presents AW as sitting on top on WL. Both are big
projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely took some years, before even starting the work on the other project. Yes, that's correct. That is exactly the time that allows us do the appropriate state of the art analysis. I hope it won't take us years, but that we will be faster.
AW can be realised with current tools and engineering practices.
Only if you commit to a specific implementation, which I am hesitant to do.
[English is an obstacle to programming] This strong affirmation
needs to be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
As I spend a significant time (~10 hours) gathering references and
writing this email (which is 5 pages long in Word), I would like to be mentioned as co-author in the final paper if any idea or references presented here is used in it. Thank you for your detailed comments, which will certainly improve the second version of the paper. I am happy to mention you in the acknowledgments. For co-authorship, I usually go for a more substantial engagement ;) If you're willing to write up the "Previous work" section along the lines you mentioned above (maybe with Tiago? Maybe with others to join?), but for a comprehensive overview of existing systems, then I am open to talk about co-authorship :)
For French, the gender of every noun entity *must* be present ...
For Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. That's exactly the goal of the lexicographic project on Wikidata, as was pointed out: https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of Lexemes with their classifiers, forms, etc. The lexicographic project was started with the Abstract Wikipedia in mind, knowing that exactly that will be needed.
Yet, the use of any existing formalism is dismissed, which mean all
the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. No, not at all it doesn't have to be ad-hoc, that's exactly what we can start working on now, long before we get to the point that we need to make that ad-hoc decision. I hope you'll join us to figure out the best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your beautiful answers, who raised a number of great replies much better than I ever could. And thanks to Louis for starting this more than interesting thread! Let's continue in this vein!
Cheers, Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski <adamsobieski@hotmail.com mailto:adamsobieski@hotmail.com> wrote:
Brainstorming: resembling what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions. With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions: { r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4 r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8 r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13 }.@a14 developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua. Drawing from XML, we can consider that objects, relations, attributes could be, instead of plain text strings, uniform resource identifiers (URI’s). “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and so forth. We can also consider that the attributes in a model could have values: { r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4] r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8] r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13] }.[@a14=v14] We can consider creating a scripting API (e.g. “APCOM”) for a semantic model to convenience developers. We can also consider adding attribute-value pairs to a semantic model. Best regards, Adam [1] https://en.wikipedia.org/wiki/Document_Object_Model *From: *Tiago Timponi Torrent <mailto:tiago.torrent@ufjf.edu.br> *Sent: *Sunday, July 5, 2020 9:06 PM *To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) <mailto:abstract-wikipedia@lists.wikimedia.org> *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni) That’s a good idea, but I think you would need more than that. Take FrameNet, for example, but now departing from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated to other frames and, therefore, participate in other clusters of the network of frames. Moreover, you’d want those prototypical fillers to function as departing points for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that they can be analogically extended to other collections not originally assigned to the item you’re looking at. Cheers Tiago Em dom, 5 de jul de 2020 às 20:03, Arthur Smith <arthurpsmith@gmail.com <mailto:arthurpsmith@gmail.com>> escreveu: Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me from just this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning - for nouns I think Wikidata items (and their relations to lexeme senses) solve that problem, but we are still missing I think a lot of the detail needed to do the same with adjectives and verbs (at least). So there is definitely some room for finding better ways to model - but maybe Wikidata could be expanded to handle the adjective/verb cases too. 
In general the concept of a single meaning associated with a Wikidata item as its identifier and a collection of attributes and relationships attached to that item is a powerful one that could resolve many such issues. Arthur On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski <adamsobieski@hotmail.com <mailto:adamsobieski@hotmail.com>> wrote: Louis, Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2]. Semantic Modeling Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language? r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8 I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s. Scripting Environments for Natural Language Generation Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to use and to add global objects and API to. Resembling how Web browsers provide scripting environments and API for functions, we can envision providing scripting environments and API for natural language generation functions. I wonder what you might think about scripting environments and API for natural language generation scenarios? Best regards, Adam [1] https://en.wikipedia.org/wiki/Universal_Networking_Language [2] https://wals.info/ <https://wals.info/> *From: *Louis Lecailliez <mailto:louis.lecailliez@outlook.fr> *Sent: *Saturday, July 4, 2020 2:10 PM *To: *abstract-wikipedia@lists.wikimedia.org <mailto:abstract-wikipedia@lists.wikimedia.org> *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni) Hi Amir, I understand the process is different that usual research. In fact I've seen Wikipedia grown from an unknown website to the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowd sourcing could achieve. > It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject. I disagree here. It would be contribution to scientic knowledge if and only if it wasn't discovered before. My email was precisely about that: capitalizing on the knowledge that has already been discovered, to avoid making the same mistake them again. It would not matter for a small project, but this one is really ambitious. We are speaking of 40 years of work by a horde of talented and very knowledgeable people, so this isn't to be dismissed easily. This thing is, my previous email was a bit abstract, because it were a review of the paper, not of the project itself. I should have made more examples to illustrate where the problem lies. Let's start with a simple example, in English, with corresponding Wikidata entities in-between parenthesis. I'm also using pseudo-turtle notation with made up relationships. France (Q142) is a country (Q6256). <Q142> <rel_is> <Q6256> . Creating the English sentence is straightforward with the naive approach presented in the paper. What is the French equivalent? La France est un pays. More information is required in the abstract representation: the text generator needs to know about the gender of both nouns (France and pays). So we need to extend the model as such: <Q142> <rel_gender> <Q1775415> . <Q6256> <rel_gender> <Q499327> . Fine! 
Yep. Some applications use them. Back in the early 2000s, there was a big trend of investigating the interface between ontologies and the lexicon (ontolex). I'd say that most recent NLG systems focus instead on common-sense knowledge (KGs and the like), but the key issue of the ontolex problem still remains: language is not only about expressing facts, it's about how you construe them.
Cheers
Tiago
On Tue, Jul 14, 2020 at 16:36, Mike Bennett <mbennett@hypercube.co.uk> wrote:
Quick side question: is there a role for formal ontology (FOL, DL or CL type of thing) in computational linguistics?
Mike
On 7/9/2020 8:22 AM, Louis Lecailliez wrote:
Hi Denny,
yes, the main problem with most of the systems presented in research papers (UNL or not) is that they are locked inside the institutions that made them. A lot of UNL webpages have gone down since the last time I checked (recently), and the system was in fact designed so that it could work over the web while not letting third parties access the code and data. This is of course the exact reverse of the technical and philosophical approach taken here, and it is very sad, as decades of accumulated knowledge are lost; the papers are far from sufficient to re-create even a fraction of the said systems.
There is also, I guess, a lot of interesting work that has not been translated into English at all (notably in linguistics), as making an academic career in the national language was an option in a lot of places until very recently.
> So, would you be willing to work on that?
Yes, of course; I wouldn't have posted on the mailing list otherwise. I like the dual, concurrent linguistics/theory approach you are proposing. Note though that I'm not an expert by any means in natural language generation; it just happens that I stumbled upon UNL recently, and its abstract representation and NLG side have too much in common with this project not to mention it. I also had some researchers' names in mind, as I have met some who worked on the referenced works.
Concerning the paper authorship, I understand your stance, and yes, I'm willing to do more work and write about previous works with those interested. Just to have an idea, what is the expected timeframe for a revision?
Lexicographic data in Wikidata totally flew under my radar. This is indeed something that will be needed in the future, and something I can directly contribute to as well! As mentioned in [1], the license seems to be an issue, notably for importing existing resources; is there any “fix” planned for that?
All in all, I'm very pleased to see that many aspects are planned out more than I had assumed from reading the paper alone, and I'm more confident in the project's success now.
Best regards, Louis Lecailliez
[1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf
*From:* Abstract-Wikipedia <abstract-wikipedia-bounces@lists.wikimedia.org> on behalf of Denny Vrandečić <dvrandecic@wikimedia.org> *Sent:* Wednesday, July 8, 2020 22:37 *To:* General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) <abstract-wikipedia@lists.wikimedia.org> *Subject:* Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Louis, all,
Louis, thanks for raising that important issue!
I have been looking into a number of related NLG systems, and one thing I noticed is a pattern of many of these projects being developed very much in isolation from each other, and also often without much concern for ongoing linguistic research. That is what I tried to capture in the research paper by stating that there is no consensus on this, and that it seems too early to commit to a specific solution.
I had given a quick look to UNL, but the project looked pretty stale to me - I could not see any activity in the last decade. Furthermore, the page didn't provide access to the source code and instead mentioned that part of the technology is under patents, which is quite a red flag for me, and I usually don't look into something like that any further, in order to honestly be able to say that I didn't get any ideas from those patents. If I am mistaken, and there is a freely usable write-up or implementation, I'd be happy to come back and read up more.
Thank you for the annotated bibliography! That is super useful.
But I did look in detail into a (small) number of other, similar systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet, and I learned a lot about that too. Getting an overview of the whole field has been a rather frustrating experience, especially since the major textbook in that area - Dale & Reiter - doesn't cover these systems, nor does the 2018 update to that book by Gatt & Krahmer, and it seems that research work in that area often omits these practical systems. Accordingly, when I talk with the professors and researchers in this area, also about the proposal here, they are more focussed on specific issues, and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is more established in many areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals and are about covering the whole of language and natural language understanding, which we want to shy away from.
The goal of the paper was never meant to be a comprehensive account of the state of the art in natural language generation. That's what Dale & Reiter and Gatt & Krahmer have aimed for, and their works are hundreds of pages. I had the feeling my paper was already too long, and putting in an overview of the state of the art would have made it at least double the length.
So, given that (and other reasons, as laid out in the paper), a system which could support any of these approaches seemed a more promising way forward. So far, for my own prototype, I have been mostly following Grammatical Framework (because it has a very accessible book, the software is free, the community was friendly, etc.), and it worked well enough to leave me convinced that the whole thing is worth trying out. But I don't know whether that's the best approach.
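To make the renderer idea a little more concrete, here is a purely illustrative sketch in Python (this is not how my prototype or the eventual function wiki works; the tiny lexicon and all the function names are made up):

# Purely illustrative sketch: one abstract statement plus per-language
# renderer functions. The lexicon and all names here are made up.
LEXICON = {
    ("Q142", "en"): {"label": "France"},
    ("Q6256", "en"): {"label": "country", "article": "a"},
    ("Q142", "fr"): {"label": "France", "gender": "f"},
    ("Q6256", "fr"): {"label": "pays", "gender": "m"},
}

def render_is_a_en(subject, cls):
    s, c = LEXICON[(subject, "en")], LEXICON[(cls, "en")]
    return f"{s['label']} is {c['article']} {c['label']}."

def render_is_a_fr(subject, cls):
    s, c = LEXICON[(subject, "fr")], LEXICON[(cls, "fr")]
    subject_article = "La" if s["gender"] == "f" else "Le"
    class_article = "une" if c["gender"] == "f" else "un"
    return f"{subject_article} {s['label']} est {class_article} {c['label']}."

RENDERERS = {"en": render_is_a_en, "fr": render_is_a_fr}

def render(statement, language):
    subject, relation, cls = statement
    assert relation == "is_a"   # this toy only knows one constructor
    return RENDERERS[language](subject, cls)

print(render(("Q142", "is_a", "Q6256"), "en"))  # France is a country.
print(render(("Q142", "is_a", "Q6256"), "fr"))  # La France est un pays.

Of course, everything this sketch hides is the hard part: where the lexical data comes from, and how the per-language functions get contributed and maintained for hundreds of languages.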
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first. In a parallel track, we will be creating the function wiki, and then, when the time is ripe we can bring these two strands of work together. So, would you be willing to work on that?
How does this sound for a plan?
Some further points:
> This is way easier to implement, test and deliver than to implement 10 different backends with various progress in implementation, incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would be a huge task if the WMF core team were to develop all these different environments. But that's not the plan. The goal is really that *others* will hopefully build these :) All we need to do is to make sure that's possible and encouraged and simple enough. But yeah, not the core team.
> The paper presents AW as sitting on top of WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before even starting the work on the other project.

Yes, that's correct. That is exactly the time that allows us to do the appropriate state-of-the-art analysis. I hope it won't take us years, but that we will be faster.
> AW can be realised with current tools and engineering practices.
Only if you commit to a specific implementation, which I am hesitant to do.
> [English is an obstacle to programming] This strong affirmation needs to be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
> As I spent a significant amount of time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as co-author in the final paper if any idea or reference presented here is used in it.

Thank you for your detailed comments, which will certainly improve the second version of the paper. I am happy to mention you in the acknowledgments. For co-authorship, I usually go for a more substantial engagement ;) If you're willing to write up the "Previous work" section along the lines you mentioned above (maybe with Tiago? Maybe with others to join?), but as a comprehensive overview of existing systems, then I am open to talking about co-authorship :)
> For French, the gender of every noun entity *must* be present ... For Chinese and Japanese, classifier information must be present for all nouns, in case one must be enumerated.

That's exactly the goal of the lexicographic project on Wikidata, as was pointed out: https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of Lexemes with their classifiers, forms, etc. The lexicographic project was started with the Abstract Wikipedia in mind, knowing that exactly that will be needed.
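For example, a quick Python sketch against the public API can already pull that lexeme and print its forms and their grammatical features (the JSON field names here are from memory, so please double-check them before relying on this):

# Rough sketch: fetch a Wikidata lexeme and print its lemmas and forms.
# L12449 is the lexeme linked above; field names are from memory and may
# need to be checked against the live API.
import requests

API = "https://www.wikidata.org/w/api.php"
resp = requests.get(API, params={
    "action": "wbgetentities",
    "ids": "L12449",
    "format": "json",
})
entity = resp.json()["entities"]["L12449"]

for lang, lemma in entity["lemmas"].items():
    print("lemma:", lang, lemma["value"])

for form in entity.get("forms", []):
    reps = ", ".join(r["value"] for r in form["representations"].values())
    print("form:", reps, "features:", form.get("grammaticalFeatures", []))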
> Yet, the use of any existing formalism is dismissed, which means all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion.

No, not at all; it doesn't have to be ad hoc. That's exactly what we can start working on now, long before we get to the point where we would need to make that ad hoc decision. I hope you'll join us to figure out the best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your beautiful answers, which raised a number of great points much better than I ever could have. And thanks to Louis for starting this more than interesting thread! Let's continue in this vein!
Cheers, Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming: resembling what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions.
With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions:
{
r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations, and attributes could be, instead of plain text strings, uniform resource identifiers (URIs): “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]
r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. “APCOM”) for a semantic model to make things more convenient for developers. We can also consider adding attribute-value pairs to a semantic model.
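As a very rough sketch of the shape such an object model could take (Python is used here purely for illustration; the class and field names are placeholders, not a proposed API):

# Rough illustrative sketch of an object model for sets of attributed
# predicate calculus expressions. All class and field names are placeholders.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

Value = Union[str, int, float, bool]

@dataclass
class Term:
    uri: str                                  # an object or relation URI
    restriction: Optional[str] = None         # e.g. "icl>domain1"
    attributes: Dict[str, Optional[Value]] = field(default_factory=dict)

@dataclass
class Expression:
    relation: Term
    arguments: List[Term]
    attributes: Dict[str, Optional[Value]] = field(default_factory=dict)

@dataclass
class ExpressionSet:
    expressions: List[Expression]
    attributes: Dict[str, Optional[Value]] = field(default_factory=dict)

# r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
example = ExpressionSet(
    expressions=[
        Expression(
            relation=Term("r1", attributes={"a1": "v1"}),
            arguments=[
                Term("o1", "icl>domain1", {"a2": "v2"}),
                Term("o2", "icl>domain2", {"a3": "v3"}),
            ],
            attributes={"a4": "v4"},
        )
    ],
    attributes={"a14": "v14"},
)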
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Document_Object_Model
*From: *Tiago Timponi Torrent tiago.torrent@ufjf.edu.br *Sent: *Sunday, July 5, 2020 9:06 PM *To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
That's a good idea, but I think you would need more than that. Take FrameNet, for example, but now starting from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated with other frames and, therefore, participate in other clusters of the network of frames. Moreover, you'd want those prototypical fillers to function as starting points for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that they can be analogically extended to other collections not originally assigned to the item you're looking at.
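Just to make that point more concrete, here is a toy sketch (Python, with invented names; this is not FrameNet's actual data model) of frame elements carrying prototypical fillers:

# Toy sketch only: a frame whose frame elements carry prototypical fillers.
# Names are invented; this is not FrameNet's actual data model, and the
# Wikidata ids are just examples.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FrameElement:
    name: str
    prototypical_fillers: List[str] = field(default_factory=list)  # e.g. QIDs

@dataclass
class Frame:
    name: str
    elements: Dict[str, FrameElement] = field(default_factory=dict)
    related_frames: List[str] = field(default_factory=list)

commerce_buy = Frame(
    name="Commerce_buy",
    elements={
        "Buyer": FrameElement("Buyer", prototypical_fillers=["Q5"]),         # human
        "Goods": FrameElement("Goods", prototypical_fillers=["Q2424752"]),   # product (example id)
    },
    related_frames=["Commerce_sell", "Getting"],
)

# A comprehension or generation component could use the prototypical fillers
# as anchors for analogical extension to less prototypical sentences.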
Cheers
Tiago
On Sun, Jul 5, 2020 at 20:03, Arthur Smith <arthurpsmith@gmail.com> wrote:
Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me, from this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning - for nouns I think Wikidata items (and their relations to lexeme senses) solve that problem, but I think we are still missing a lot of the detail needed to do the same with adjectives and verbs (at least). So there is definitely some room for finding better ways to model - but maybe Wikidata could be expanded to handle the adjective/verb cases too. In general the concept of a single meaning associated with a Wikidata item as its identifier and a collection of attributes and relationships attached to that item is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski <adamsobieski@hotmail.com> wrote:
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to use and easy to add global objects and APIs to. Resembling how Web browsers provide scripting environments and APIs for functions, we can envision providing scripting environments and APIs for natural language generation functions.
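As a rough illustration of that idea (with Python standing in for whatever engine would actually be used, and an invented API; a real deployment would of course need proper sandboxing):

# Rough illustration only: inject a natural language generation API object
# into a contributor's scripting environment. Python stands in for
# JavaScript/Lua here, and the API names are invented.
class NLGApi:
    """Host-provided API object exposed to contributor scripts."""
    def __init__(self, lexicon):
        self.lexicon = lexicon

    def label(self, item, language):
        return self.lexicon[(item, language)]

    def emit(self, text):
        print(text)

USER_SCRIPT = """
name = nlg.label("Q142", "en")
nlg.emit(name + " is a country.")
"""

lexicon = {("Q142", "en"): "France"}
# The host decides exactly which globals the script can see; note that an
# empty __builtins__ is not a real sandbox, it only illustrates the idea.
exec(USER_SCRIPT, {"nlg": NLGApi(lexicon), "__builtins__": {}})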
I wonder what you might think about scripting environments and API for natural language generation scenarios?
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language
[2] https://wals.info/
Today's Topics:
- Re: NLP issues severely overlooked (Charles Matthews)
- Use case: generation of short description (Jakob Voß)
- Re: NLP issues severely overlooked (Amir E. Aharoni)
Message: 1 Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST) From: Charles Matthews <charles.r.matthews@ntlworld.com> To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org> Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
It is interesting to be on a list where one can hear about software issues, and then computational linguistic problems. I'm not an expert in either area.
I do have 17 years of varied Wikimedia experience (and I use my real name there).
> On 04 July 2020 at 12:25 Louis Lecailliez <louis.lecailliez@outlook.fr> wrote:
> <snip>
> Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because the Wiktionary projects themselves severely lack content and structure, for those which have some content at all; secondly because specialized NLP resources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I never saw it cited for the kind of NLP resources I'm talking about).
I can comment about this. Besides Wiktionary, there is the "lexeme" namespace of Wikidata. It is a relatively new part of Wikidata, dealing with verbal forms.
> To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors.
It is worth saying, for context, that there is a certain style or philosophy coming from the wiki side: more precisely, from the wikis before Wikipedia. There is the slogan "what is the simplest thing that would actually work?" You might argue, plausibly, that Wikipedia, at nearly 20 years old, shows that there is a bit more to engineering than that.
On the other hand, looking at Wikidata at seven years old, there is some point to the comment. It has a rather simple approach to linked structured data, compared to the Semantic Web environment. (Really, just write a very large piece of JSON and try to cope with it!) But the number of binary relations used (8K, if you count the "external links" handling) is now quite large, and has grown organically. The data modelling is in a sense primitive, sometimes non-existent. But the range of content handled really is encyclopedic. And in an area like scientific bibliography, at a scale of tens of millions of entities, the advantages of not much ontological fussiness begin to be seen.
Wikidata started as an index of all Wikipedia articles, and is now five times the size needed for that: a very enriched "index".
I suppose the NLP required to code up, for example, 50K chemistry articles about molecules, might be a problem that could be solved, leaving aside the general problems for the moment.
In any case, there is a certain approach, neither academic nor commercial, that comes with Wikimedia and its communities, and it will be interesting to see how the issues are addressed.
Charles Matthews (in Cambridge UK)
Actually, Ontolex keeps being developed. I stumbled upon the latest development, Lexicog (https://www.w3.org/2019/09/lexicog/), this week, which tries to make the model more applicable to human-targeted dictionaries. This iteration included Kernerman as a co-author, so the spec likely takes industry needs into account more than previous ones did. Representing dictionaries as graphs and using them is something I have been working on for a few years, so I keep an eye on what's going on there.
That being said, I don't think Ontolex fits the bill for the kind of details we need in this project. Moreover, it doesn't take into account logographic writing systems, where the reading of a sequence of logograms has an impact on the meaning that is expressed.
Best regards, Louis Lecailliez
------------------------------
Message: 2 Date: Sat, 4 Jul 2020 08:18:56 +0200 From: Jakob Voß <jakob.voss@gbv.de> To: <abstract-wikipedia@lists.wikimedia.org> Subject: [Abstract-wikipedia] Use case: generation of short description
Hi,
I want to auto-generate disambiguation descriptions for African politicians to be added to Wikidata; e.g. from the country Mozambique (Q1029), the following descriptions should be generated:
- Mozambican politician (en)
- Mosambikanischer Politiker (de)
- politico mozambicano (it)
- ...
This could be extended to other professions. My questions:
- Can anyone point me to data sources where to best look up country adjectives such as "Mozambican"?
- Where/how to best store the lexical information for reuse with other renderers?
- If I create small renderers for these short descriptions, what architecture do you prefer for best reuse?
My just-get-it-done solution would be a set of CSV files and a few lines of Perl code, but maybe this use case can be aligned with Abstract Wikipedia to learn more about it.
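For illustration, here is a rough sketch of that just-get-it-done approach (in Python rather than Perl; the CSV file name, its column layout, and the adjective and label data are made-up placeholders, not an existing dataset):

import csv

# Hypothetical input file: one row per (country QID, language code, country adjective),
# e.g. "Q1029,en,Mozambican", "Q1029,de,Mosambikanischer", "Q1029,it,mozambicano".
ADJECTIVES_CSV = "country_adjectives.csv"

# Hypothetical per-language profession labels; a real run would look these up in Wikidata.
# Q82955 is used here as the item for "politician".
PROFESSION_LABELS = {
    "Q82955": {"en": "politician", "de": "Politiker", "it": "politico"},
}

# Per-language word order for the "country adjective + profession" pattern.
TEMPLATES = {
    "en": "{adj} {profession}",
    "de": "{adj} {profession}",
    "it": "{profession} {adj}",  # Italian puts the adjective after the noun here
}

def load_adjectives(path):
    # Returns {country_qid: {lang: adjective}} read from the CSV.
    table = {}
    with open(path, newline="", encoding="utf-8") as f:
        for country, lang, adjective in csv.reader(f):
            table.setdefault(country, {})[lang] = adjective
    return table

def describe(country_qid, profession_qid, lang, adjectives):
    adj = adjectives[country_qid][lang]
    profession = PROFESSION_LABELS[profession_qid][lang]
    return TEMPLATES[lang].format(adj=adj, profession=profession)

adjectives = load_adjectives(ADJECTIVES_CSV)
for lang in ("en", "de", "it"):
    print(lang, describe("Q1029", "Q82955", lang, adjectives))

Even this toy version immediately runs into the agreement questions discussed elsewhere in this thread (German case endings and capitalization, gender in Italian and German), which is exactly why the question of where to store the lexical information matters.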
Looking forward to collaborating, Jakob
------------------------------
Message: 3 Date: Sat, 4 Jul 2020 18:03:24 +0300 From: "Amir E. Aharoni" <amir.aharoni@mail.huji.ac.il> To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" <abstract-wikipedia@lists.wikimedia.org> Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked Message-ID: <CACtNa8t6kbWe21C980h1MxiWNfUp+0eDE82vPMjDUX2UCgb2gw@mail.gmail.com> Content-Type: text/plain; charset="utf-8"
Hi,
Thanks a lot for the sources. I am not one of the people implementing Wikilambda, but I am just very curious about it as a member of the wider Wikimedia community. But there's a good chance that they will be useful to people who do work on the implementation.
I will dare to add a little thought I have about it, however. It's possible that the challenge of building a well-functioning natural language generator is underestimated by the founders, and that they don't pay enough attention to existing work (although, knowing Denny, there is a good chance that he actually is aware of at least some of it). But there is something that the wide Wikimedia community has that I'm not sure that the past projects in this field did: The community itself. A big, worldwide, and diverse group of passionate volunteers, who love the idea of spreading free knowledge and who love their languages. Quite a lot of them also know some programming, and in the past they proved unbelievably creative and productive when writing code for Wikimedia projects as a community, in the form of templates, modules, gadgets, bots, extensions, and other tools. I'm quite sure that once the new tools become usable, this community will start doing creative things again, and it will also start reporting bugs and limitations.
So yes, while it's possible that along the way both the core developers and the volunteer community will find all kinds of stumbling blocks, I'm pretty sure that they will also have all kinds of surprising success stories. It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject. And don't underestimate the "open" part—that's where we really shine. This won't be a theoretical work in a lab, published in a paywalled and copyright-restricted academic journal, but fully optimized for accessibility to everyone.
Yes, this whole email from me is incredibly naïve, but it's the same attitude that got us to writing the biggest and most multilingual encyclopedia in history, so maybe we can do something cool again :)
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
On Saturday, 4 July 2020 at 14:26, Louis Lecailliez <louis.lecailliez@outlook.fr> wrote:
Hello,
my name is Louis Lecailliez, a PhD student in educational technology at Kyoto University. I'm a Computer Science and NLP graduate. One thing I do is work on modelling language learners' knowledge as graphs.
The Abstract Wikipedia project is really interesting. There are, however, two very concerning issues I spotted when reading the associated paper draft (https://arxiv.org/abs/2004.04733). The following email could be read as negative, but please don't take it as such: my purpose is to avoid wasting people's effort and money on things that can (and need to!) be fixed upfront.
- Issues with NLP
The main issue is that the difficulty of the NLP task of generating natural text from an abstract representation is severely overlooked. This stems from the other main problem: the paper is not based on the decades of previous work in that space.
As I understand it, the main value proposition of Abstract Wikipedia (AW) is a computer representation of encyclopedic knowledge that can be projected into different existing natural languages, with the goal of supporting a huge number of them. Plus, an editor to make this happen easily.
This is in fact surprisingly close to what the Universal Networking Language (UNL) project, which started 20 years ago, aims to do. UNL provides a language-agnostic representation of text based on hypergraphs. Software (called an EnConverter) produces UNL graphs from natural text in a given language. Another kind of software, called a DeConverter, does the reverse, that is, producing natural text from the abstract representation. This is exactly the function of the "renderers" in the AW paper. The way of doing it is also similar: by applying successive transformations until the final text string is produced. In general, that kind of abstract meaning representation is called an interlingua, and is widely used in Machine Translation (MT) systems.
Disregarding two decades of work on the same problem space in the UNL case (rule-based machine translation, here from an abstract language as the fixed source language), which was itself based on a few more decades of work, does not seem a wise way to start a new project. For a start, the graph representation used in AW will likely not be expressive enough to encode linguistic knowledge; this is why UNL uses hypergraphs instead of graphs.
The problem is glaring when looking at the reference list: it is bloated with irrelevant references (such as those to programming languages [27, 37, 41, 77], Turing completeness being the worst offender [11, 17, 23, ...]) while containing only two references [7, 85] to the really hard part of the project: generating natural language from the abstract representation. There are a few more relevant references about natural language generation, but this isn't enough.
Interestingly, [85] is a UNL paper, but not the main one. Moreover, it is cited in Section 9, "Opening future research". It should instead be placed in a "Previous work" section.
-- Tiago Timponi Torrent PPG-Linguística - FrameNet Brasil Universidade Federal de Juiz de Fora http://tiagotorrent.com
It does, indeed, but it is not getting that much attention at LREC, for instance... and sure, it does not seem to be of great help in this project. Totally agree.
On Wed, 15 Jul 2020 at 12:17, Louis Lecailliez <louis.lecailliez@outlook.fr> wrote:
Actually, Ontolex keeps being developed. I stumbled upon the latest development, Lexicog (https://www.w3.org/2019/09/lexicog/), this week, which tries to make the model more applicable to human-targeted dictionaries. This iteration includes Kernerman as a co-author, so the spec likely takes industry needs into account more than previous ones did. Representing dictionaries as graphs and using them is something I have been working on for a few years, so I keep an eye on what's going on there.
That being said, I don't think Ontolex fits the bill for the kind of detail we need in this project. Moreover, it doesn't take into account logographic writing systems, where the reading of a sequence of logograms has an impact on the meaning that is expressed.
Best regards, Louis Lecailliez
*From: *Abstract-Wikipedia <abstract-wikipedia-bounces@lists.wikimedia.org> on behalf of Tiago Timponi Torrent <tiago.torrent@ufjf.edu.br> *Sent: *Tuesday 14 July 2020 23:35
*To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) <abstract-wikipedia@lists.wikimedia.org> *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Yep. Some applications use them. Back in the early 2000s, there was a big trend of investigating the interface between ontologies and the lexicon (ontolex). I'd say that most recent NLG systems focus on common-sense knowledge (KGs and the like); nonetheless, the key issue of the ontolex problem still remains: language is not only about expressing facts, it's about how you construe them.
Cheers
Tiago
On Tue, 14 Jul 2020 at 16:36, Mike Bennett <mbennett@hypercube.co.uk> wrote:
Quick side question: is there a role for formal ontology (FOL, DL or CL type of thing) in computational linguistics?
Mike
On 7/9/2020 8:22 AM, Louis Lecailliez wrote:
Hi Denny,
yes, the main problem of most of the systems presented in research papers (UNL or not) is that they are locked inside the institutions that made them. A lot of UNL webpages have gone down since the last time I checked (recently), and the system was in fact designed so it could work over the web without letting third parties access the code and data. This is of course the exact reverse of the technical and philosophical approach taken here, and very sad, as decades of accumulated knowledge are lost; the papers are far from sufficient to re-create even a fraction of the said systems.
There is also, I guess, a lot of interesting work that has not been translated into English at all (notably in linguistics), as making an academic career in the national language was an option in a lot of places until very recently.
So, would you be willing to work on that?
Yes, of course, I wouldn't have posted to the mailing list otherwise. I like the dual, concurrent linguistics/theory approach you are proposing. Note though that I'm not an expert by any means in natural language generation; it just happens that I stumbled upon UNL recently and it has too much in common with this project on the abstract representation/NLG side not to mention it. I also had some researchers' names in mind, as I have met some who worked on the referenced works.
Concerning the paper authorship, I understand your stance, and yes, I'm willing to work more and write about previous work with those interested. Just to have an idea, what is the expected timeframe for a revision?
Lexicographic data in Wikidata totally flew under my radar. This is indeed something that will be needed in the future, and an area where I can directly contribute too! As mentioned in [1], the license seems to be an issue, notably for importing existing resources; is there any “fix” planned for that?
All in all, I'm very pleased to see that a lot of aspects are better planned out than I assumed from reading the paper alone, and I'm more confident in its success now.
Best regards, Louis Lecailliez
[1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf
*From: *Abstract-Wikipedia <abstract-wikipedia-bounces@lists.wikimedia.org> on behalf of Denny Vrandečić <dvrandecic@wikimedia.org> *Sent: *Wednesday 8 July 2020 22:37 *To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) <abstract-wikipedia@lists.wikimedia.org> *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Louis, all,
Louis, thanks for raising that important issue!
I have been looking into a number of related NLG systems, and one thing I noticed is a pattern of many of these projects being developed very much in isolation from each other, and also often without much concern for ongoing linguistic research. That is what I tried to capture in the research paper by stating that there is no consensus on this, and that it seems too early to commit to a specific solution.
I had given UNL a quick look, but the project looked pretty stale to me - I could not see any activity in the last decade. Furthermore, the page didn't provide access to the source code and instead mentioned that part of the technology is under patents, which is quite a red flag for me, and I usually don't look into something like that any further, in order to honestly be able to say that I didn't get any ideas from those patents. If I am mistaken, and there is a freely usable write-up or implementation, I'd be happy to come back and read up more.
Thank you for the annotated bibliography! That is super useful.
But I did look in detail into a (small) number of other, similar systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet, and I learned a lot about that too. To get an overview of the whole field has been a rather frustrating experience, especially since the major textbook in that area - Dale & Reiter - doesn't cover these systems, nor does the 2018 update to that book by Gatt & Krahmer, and it seems that research work in that area often omits these practical systems. Accordingly, when I talk with the professors and researchers in this area, also about the proposal here, they are more focussed on specific issues, and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is a more established flow in many areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals and are about covering the whole of language and natural language understanding, which we want to shy away from.
The goal of the paper was never meant to be a comprehensive account of the state of the art in natural language generation. That's what Dale & Reiter and Gatt & Krahmer have aimed for, and their works are hundreds of pages. I had the feeling my paper was already too long, and putting in an overview of the state of the art would have made it at least double the length.
So, given that (and other reasons, as laid out in the paper), a system which could support any of these approaches seemed a more promising way forward. So far, for my own prototype, I have been mostly following Grammatical Framework (because it has a very accessible book, the software is free, the community was friendly, etc.), and it worked well enough to leave me convinced that the whole thing is worth trying out. But I don't know whether that's the best approach.
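For readers who have not looked at Grammatical Framework, a very small sketch of the general idea it is built around, the split between one shared abstract construction and per-language linearization rules, might help. This is plain Python pretending to be GF, not Denny's prototype and not real GF code; every lexicon entry and function name below is invented for illustration:

from dataclasses import dataclass

# One shared, language-independent abstract construction: "X is a Y".
@dataclass
class IsA:
    subject: str
    category: str

# Per-language lexicons carrying the language-specific facts this thread keeps
# coming back to: French needs grammatical gender, Chinese needs a classifier.
LEXICON = {
    "en": {"france": "France", "country": "country"},
    "fr": {"france": ("France", "f"), "country": ("pays", "m")},
    "zh": {"france": "法國", "country": ("國家", "個")},
}

def linearize_en(expr):
    lex = LEXICON["en"]
    return f"{lex[expr.subject]} is a {lex[expr.category]}."

def linearize_fr(expr):
    lex = LEXICON["fr"]
    subject, subject_gender = lex[expr.subject]
    noun, noun_gender = lex[expr.category]
    subject_article = "Le" if subject_gender == "m" else "La"
    noun_article = "un" if noun_gender == "m" else "une"
    # Both article choices depend on gender facts absent from the abstract construction.
    return f"{subject_article} {subject} est {noun_article} {noun}."

def linearize_zh(expr):
    lex = LEXICON["zh"]
    noun, classifier = lex[expr.category]
    # "One" plus the noun's classifier are inserted before the noun.
    return f"{lex[expr.subject]}是一{classifier}{noun}。"

fact = IsA("france", "country")
for linearize in (linearize_en, linearize_fr, linearize_zh):
    print(linearize(fact))

The point is only that the abstract layer stays constant while everything language-specific lives in the concrete layer; GF does this with typed abstract and concrete grammars rather than ad hoc functions like these.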
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first. In a parallel track, we will be creating the function wiki, and then, when the time is ripe we can bring these two strands of work together. So, would you be willing to work on that?
How does this sound for a plan?
Some further points:
This is way easier to implement, test and deliver than to implement 10
different backends with various progress in implementation, incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would be a huge task if the WMF core team were to develop all these different environments. But that's not the plan. The goal is really that *others* will hopefully build these :) All we need to do is to make sure that's possible and encouraged and simple enough. But yeah, not the core team.
The paper presents AW as sitting on top of WL. Both are big projects.
Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before even starting the work on the other project. Yes, that's correct. That is exactly the time that allows us to do the appropriate state-of-the-art analysis. I hope it won't take us years, but that we will be faster.
AW can be realised with current tools and engineering practices.
Only if you commit to a specific implementation, which I am hesitant to do.
[English is an obstacle to programming] This strong affirmation needs
to be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
As I spent a significant time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as co-author in the final paper if any ideas or references presented here are used in it. Thank you for your detailed comments, which will certainly improve the second version of the paper. I am happy to mention you in the acknowledgments. For co-authorship, I usually go for a more substantial engagement ;) If you're willing to write up the "Previous work" section along the lines you mentioned above (maybe with Tiago? Maybe with others to join?), but for a comprehensive overview of existing systems, then I am open to talk about co-authorship :)
For French, the gender of every noun entity *must* be present ... For
Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. That's exactly the goal of the lexicographic project on Wikidata, as was pointed out: https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of Lexemes with their classifiers, forms, etc. The lexicographic project was started with the Abstract Wikipedia in mind, knowing that exactly that will be needed.
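For anyone who wants to look at that lexeme data programmatically, a small sketch follows (Python; it assumes the usual Special:EntityData JSON layout for lexemes, with lemmas, forms and grammaticalFeatures, so adjust it if that layout differs):

import json
import urllib.request

LEXEME_ID = "L12449"  # the lexeme cited above
URL = f"https://www.wikidata.org/wiki/Special:EntityData/{LEXEME_ID}.json"

request = urllib.request.Request(URL, headers={"User-Agent": "lexeme-sketch/0.1 (example)"})
with urllib.request.urlopen(request) as response:
    entity = json.load(response)["entities"][LEXEME_ID]

lemmas = ", ".join(lemma["value"] for lemma in entity["lemmas"].values())
print(f"{LEXEME_ID}: {lemmas} (lexical category {entity['lexicalCategory']})")

# Each form lists its representations plus grammatical features as Q-ids
# (gender, number, case, ...); this is the kind of per-word data a renderer would consult.
for form in entity.get("forms", []):
    representations = ", ".join(rep["value"] for rep in form["representations"].values())
    features = ", ".join(form.get("grammaticalFeatures", []))
    print(f"  {form['id']}: {representations} [{features}]")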
Yet, the use of any existing formalism is dismissed, which means all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. No, not at all; it doesn't have to be ad hoc. That's exactly what we can start working on now, long before we get to the point where we need to make that ad hoc decision. I hope you'll join us to figure out the best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your beautiful answers, which raised a number of great points much better than I ever could have. And thanks to Louis for starting this more than interesting thread! Let's continue in this vein!
Cheers, Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming: resembling what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions.
With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions:
{
r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations, and attributes could be uniform resource identifiers (URIs) instead of plain text strings. “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]
r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. “APCOM”) for a semantic model to convenience developers. We can also consider adding attribute-value pairs to a semantic model.
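Purely as a thought experiment, the shape of such an object model could look roughly like the plain data structures below (Python stand-ins for the JavaScript/Lua API being imagined here; every class and field name is invented, and relation-level and expression-level attributes are kept separate to mirror the notation above):

from dataclasses import dataclass, field
from typing import Dict, List, Union

Attrs = Dict[str, str]  # attribute name -> value; names could equally be URIs

@dataclass
class Obj:
    name: str                 # e.g. "o1", or a URI
    restriction: str = ""     # e.g. "icl>domain1"
    attrs: Attrs = field(default_factory=dict)

@dataclass
class Expression:
    relation: str             # e.g. "r1", or a URI
    arguments: List[Union["Expression", Obj]]
    relation_attrs: Attrs = field(default_factory=dict)  # attributes on the relation itself
    attrs: Attrs = field(default_factory=dict)            # attributes on the whole expression

@dataclass
class ExpressionSet:
    expressions: List[Expression]
    attrs: Attrs = field(default_factory=dict)

# r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4], in a set with @a14=v14
example = ExpressionSet(
    expressions=[
        Expression(
            relation="r1",
            arguments=[
                Obj("o1", "icl>domain1", {"a2": "v2"}),
                Obj("o2", "icl>domain2", {"a3": "v3"}),
            ],
            relation_attrs={"a1": "v1"},
            attrs={"a4": "v4"},
        )
    ],
    attrs={"a14": "v14"},
)
print(example)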
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Document_Object_Model
*From: *Tiago Timponi Torrent tiago.torrent@ufjf.edu.br *Sent: *Sunday, July 5, 2020 9:06 PM *To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
That’s a good idea, but I think you would need more than that. Take FrameNet, for example, but now starting from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated with other frames and, therefore, participate in other clusters of the network of frames. Moreover, you’d want those prototypical fillers to function as starting points for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that they can be analogically extended to other collections not originally assigned to the item you’re looking at.
Cheers
Tiago
On Sun, 5 Jul 2020 at 20:03, Arthur Smith <arthurpsmith@gmail.com> wrote:
Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me, from this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning - for nouns I think Wikidata items (and their relations to lexeme senses) solve that problem, but I think we are still missing a lot of the detail needed to do the same with adjectives and verbs (at least). So there is definitely some room for finding better ways to model - but maybe Wikidata could be expanded to handle the adjective/verb cases too. In general, the concept of a single meaning associated with a Wikidata item as its identifier, with a collection of attributes and relationships attached to that item, is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
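One hedged guess at what such a mapping could look like: Wikidata statements can carry qualifiers, so an attributed relation might conceivably be flattened into a statement-plus-qualifiers record along the lines sketched below (the property names are placeholders, not real Wikidata properties, and this is only an illustration of the idea, not a claim that it is adequate):

# Flattening a simplified attributed relation r.@a1(o1.@a2, o2.@a3).@a4
# into one Wikidata-style "statement with qualifiers" record.
# "P_rel", "P_a1", ... are placeholder property ids, not real Wikidata properties.
statement = {
    "subject": "o1",
    "property": "P_rel",   # stands in for the relation r
    "value": "o2",
    "qualifiers": {
        "P_a1": "v1",      # was attached to the relation
        "P_a2": "v2",      # was attached to the first argument
        "P_a3": "v3",      # was attached to the second argument
        "P_a4": "v4",      # was attached to the whole expression
    },
}

# The obvious loss in this flattening is scope: qualifiers hang off the statement as
# a whole, so which element an attribute originally belonged to has to be re-encoded
# somehow (here, only in the property naming), which is why it may not be expressive enough.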
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to use and to add global objects and API to. Resembling how Web browsers provide scripting environments and API for functions, we can envision providing scripting environments and API for natural language generation functions.
I wonder what you might think about scripting environments and API for natural language generation scenarios?
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language
*From: *Louis Lecailliez louis.lecailliez@outlook.fr *Sent: *Saturday, July 4, 2020 2:10 PM *To: *abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Amir,
I understand the process is different that usual research. In fact I've seen Wikipedia grown from an unknown website to the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowd sourcing could achieve.
It's also possible that the mere *finding* of these stumbling blocks by
such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject.
I disagree here. It would be contribution to scientic knowledge if and only if it wasn't discovered before. My email was precisely about that: capitalizing on the knowledge that has already been discovered, to avoid making the same mistake them again. It would not matter for a small project, but this one is really ambitious. We are speaking of 40 years of work by a horde of talented and very knowledgeable people, so this isn't to be dismissed easily.
This thing is, my previous email was a bit abstract, because it were a review of the paper, not of the project itself. I should have made more examples to illustrate where the problem lies.
Let's start with a simple example, in English, with corresponding Wikidata entities in-between parenthesis. I'm also using pseudo-turtle notation with made up relationships.
France (Q142) is a country (Q6256).
<Q142> <rel_is> <Q6256> .
Creating the English sentence is straightforward with the naive approach presented in the paper.
What is the French equivalent?
La France est un pays.
More information is required in the abstract representation: the text generator needs to know about the gender of both nouns (France and pays). So we need to extend the model as such:
<Q142> <rel_gender> <Q1775415> .
<Q6256> <rel_gender> <Q499327> .
Fine! Now what about Chinese?
法國是一個國家。
What we have in the middle of the sentence is a classifier (個). The model needs the following update:
<Q499327> <rel_use_classifier> <Q63153> .
To handle these 3 languages, the model has already 3 additional triples just for accounting for linguistic facts occuring in these languages. Wikipedia exists in more than 300 languages, and the world has about 6000 of them, each of them having particularities that must be taken into account. Fortunately they recoup themselves in-between languages. Nonetheless the World Atlas Language Structures ( https://wals.info/chapter/s1) count 144 distinct language features. Some are related to speech, but this means there is probably something like a hundred features that must be taken into account in the data model to produce valid natural language sentence.
Note that in the Chinese example, there is also a number (一, one) showing up. This is a phenomenon that must be taken into account; and it's not always appearing when using 是 (to be). How complex the "lambda" system will be just to deal with this issue? Hint: very much. It also needs to be compatible with dozen of other phenomena.
Then each of those features require extensive and complete data. For French, the gender of every noun entity *must* be present, otherwise there is half a chance of producing a wrong sentence each time a noun entity is encountered. For Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. Where does the project will get the data from? (we are speaking of millions of item, most not referenced in existing dictionaries) How will this be encoded? Those are real questions that must be answered.
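To make the bookkeeping concrete, here is a tiny sketch of how the per-entity data burden grows as languages are added. The relation names are the made-up ones from the pseudo-Turtle above, not real Wikidata properties; the classifier fact is attached directly to the noun item for simplicity, and the per-language requirements are deliberately simplified:

# Facts accumulated so far for the France example, loosely mirroring the triples above.
FACTS = {
    ("Q142", "rel_is"): "Q6256",               # the encyclopedic fact itself
    ("Q142", "rel_gender"): "Q1775415",        # added only so French can be rendered
    ("Q6256", "rel_gender"): "Q499327",        # added only so French can be rendered
    ("Q6256", "rel_use_classifier"): "Q63153", # added only so Chinese can be rendered
}

# Extra lexical facts each language's renderer needs about any noun entity it touches.
REQUIRED_PER_LANGUAGE = {
    "fr": ["rel_gender"],
    "zh": ["rel_use_classifier"],
    "ja": ["rel_use_classifier"],
}

def missing_facts(entity, languages, facts):
    # Lists the lexical facts still needed before `entity` can be rendered safely.
    return [(lang, relation)
            for lang in languages
            for relation in REQUIRED_PER_LANGUAGE[lang]
            if (entity, relation) not in facts]

print(missing_facts("Q6256", ["fr", "zh", "ja"], FACTS))        # [] - fully covered
print(missing_facts("Q_new_noun", ["fr", "zh", "ja"], FACTS))   # everything still missing

Multiply that second line by millions of items and a few hundred languages and you get the scale of the data acquisition problem being described here.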
Suppose now we have done the work for "renderers" in these three languages. They both use the more or less similar A X B structure where X is a verb meaning "to be".
What would be the Japanese equivalent?
The more natural structure would be something like:
フランスは国(だ)。
What is at play here is a topicalization (Q63105) of France, followed by a predicate (it's a country). This is very different from the previous structure and, not surprisingly, needs its own representation. To make the situation more difficult, the previous (A be B) structure also exists in Japanese, but would lead to a totally different sentence if used.
The paper states that Figures 1 and 2 are examples that will be more complex in real life. Yet, the use of any existing formalism is dismissed, which means all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. Moreover, changing formalism (be it ad hoc or not) will require changing every piece of code/data using it. This will happen every time a language with unsupported feature(s) is added to the project. It's not hard to see how this will waste a huge amount of time and goodwill from the people involved. The very code-focussed tone of the paper, the English-centric approach used in the examples and the lack of references show that the complexity of the task on the NLP front is not sufficiently conceptualized.
Best Regards,
Louis Lecailliez
I'm a bit late to the party, but as a Chinese and Japanese user I should point out that, where the meaning of the same logogram varies with pronunciation, the vast majority of cases are either:
(1) the different readings are in complementary distribution and can thus be inferred accurately from context, e.g. Mandarin 創意 chuàng yì (creativity) vs 創傷 chuāng shāng (trauma) e.g. Cantonese 畫畫 waahk wáa (draw[v.] picture[n.])
or: (2) the different readings reflect preferences that have minimal influence on meaning. e.g. Japanese 家 ie vs uchi (home, house). "ie" leans more towards "house" and "uchi" leans more towards "home", but in most contexts they are interchangeable.
Where both of the above were false, it is likely that native speakers would have innovated with the written form to disambiguate corresponding divergences in the spoken language. e.g. Middle Chinese 無 mju (doesn't exist [v.]) > Modern Cantonese 無 mòuh (negation prefix in compound words), 冇 móuh (doesn't have [v.])
More abstractly, we have a one-to-many mapping between the written word and the spoken word. Fundamentally, the question we need to ask ourselves is, which written or spoken dialect of a language are we aiming for? Once we make that decision, the answer to ideograms with multiple readings will fall into place.
In general, the opposite issue is usually more pertinent: words and phrases that are distinguished in writing but not when spoken out loud. For languages that have a millennia-long written tradition (e.g. Greek; any kind of Chinese) homophones are a much bigger issue, encouraging more compact use of the written language than the spoken language and creating challenges for algorithmic parsing.
--Deryck
On Wed, 15 Jul 2020 at 16:17, Louis Lecailliez louis.lecailliez@outlook.fr wrote:
Actually Ontolex keep being developped. I stumbled upon the lastest development Lexicog (https://www.w3.org/2019/09/lexicog/) this week, with tries to make the model more applicable to human-targeted dictionaries. This iteration included Kernerman as co-author, so the spec is likely taking the industry needs into account more than previous ones. Representing dictionaries as graphs and using them is sometimes I work on since a few years so I keep an eye on what's going on there.
That being said I don't think Ontolex fits the bills for the kind of details we need in this project. Morevoer it doesn't take in account logographic writing systems where the reading of a sequence of logograms has an impact on the meaning that is expressed.
Best regards, Louis Lecailliez
*De :* Abstract-Wikipedia abstract-wikipedia-bounces@lists.wikimedia.org de la part de Tiago Timponi Torrent tiago.torrent@ufjf.edu.br *Envoyé :* mardi 14 juillet 2020 23:35 *À :* General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org *Objet :* Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Yep. Some applications use them. Back in the early 2000, there was a big trend in investigating the interface between ontologies and the lexicon (ontolex). Nonetheless, I’d say that most recent NLG systems focus on common sense knowledge (KGs and the like), nonetheless the key issue of the ontolex problem still remains: Language is not only about expressing facts, it’s about how you construe them.
Cheers
Tiago
Em ter, 14 de jul de 2020 às 16:36, Mike Bennett mbennett@hypercube.co.uk escreveu:
Quick side question: is there a role for formal ontology (FOL, DL or CL type of thing) in computational linguistics?
Mike
On 7/9/2020 8:22 AM, Louis Lecailliez wrote:
Hi Denny,
yes, the main problem of most of the systems presented in research papers (UNL or not) is that they are locked in the institutions that made them. A lot of UNL webpages went down since last time I checked (recently), and the system was in fact designed in a way it could work over the web while not letting third-parties access code and data. This is of course the exact reverse of the technical and philosophical approach taken here, and very sad as decade of accumulated knowledge is lost; the papers are far from sufficient to re-create even of fraction of the said systems
There is also, I guess, a lot of interesting work that is not translated in English at all (notably in linguistics), as making an academic career in the national language was an option in a lot of places until very recently.
So, would you be willing to work on that?
Yes, of course, I wouldn't have posted in the mailing list otherwise. I like the dual, concurrent approach of linguistic/theory you are proposing. Note though that I'm not an expert be any mean in natural language generation, it just happens I stumbled upon UNL recently and it has too much in common on the abstract representation/NLG with this project not to mention it. I also had some researchers name in mind as I met some who worked on the referenced works.
Concerning the paper authorship, I understand your stance, and yes I'm willing to work more and write about previous works with those interested. Just to have an idea, what it is expected timeframe for a revision?
Lexicographic data in Wikidata totally flew under my radar. This is indeed something that will be needed in the future, and where I can directly contribute too! As mentioned by [1] the license seems to be an issue notably for importing existing resources, is there any “fix” planned for that?
All in all, I'm very pleased to see lot of aspects are more planned than it I assumed to be from reading the paper alone, and I’m more confident in the success now.
Best regards, Louis Lecailliez
[1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf
*De :* Abstract-Wikipedia abstract-wikipedia-bounces@lists.wikimedia.org abstract-wikipedia-bounces@lists.wikimedia.org de la part de Denny Vrandečić dvrandecic@wikimedia.org dvrandecic@wikimedia.org *Envoyé :* mercredi 8 juillet 2020 22:37 *À :* General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org abstract-wikipedia@lists.wikimedia.org *Objet :* Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Louis, all,
Louis, thanks for raising that important issue!
I have been looking in a number of related NLG systems, and one thing I noticed is a pattern of many of these projects being developed very much in isolation from each other, and also often without much concern for ongoing linguistic research. That is what I tried to capture in the research paper by stating that there is no consensus on this, and that it seems too early to commit to a specific solution.
I had given a quick look to UNL, but the project looked pretty stale to me
- I could not see any activity in the last decade. Furthermore, the page
didn't provide access to the source code and instead mentioned that part of the technology is under patents, which is quite a red flag for me, and I usually don't look into something like that any further, in order to honestly be able to say that I didn't get any ideas from those patents. If I am mistaken, and there is a freely usable write-up or implementation, I'd be happy to come back and read up more.
Thank you for the annotated bibliography! That is super useful.
But I did look into detail into a (small) number of other, similar systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet, and I learned a lot about that too. To get an overview of the whole field has been a rather frustrating experience, especially since the major textbook in that area - Dale & Reiter - doesn't cover these systems, nor the 2018 update to that book by Gatt & Krahmer, and it seems that research work in that area often omits these practical systems. Accordingly, when I talk with the professors and researchers in this area, also about the proposal here, they are more focussed on specific issues, and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is a more established flow in many areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals and are about covering the whole of language and natural language understanding, which we want to be shying away from.
The goal of the paper was never meant to be a comprehensive account of the state of the art in natural language generation. That's what Dale & Reiter and Gatt & Krahmer have aimed for, and their works are hundreds of pages. I had the feeling my paper was already too long, and putting in an overview of the state of the art would have made it at least double the length.
So, given that (and other reasons, as lined out in the paper), it seems that a system which could support any of these approaches seemed a more promising way. So far, for my own prototype, I have been mostly following Grammatical Framework (because it has a very accessible book, the software is free, the community was friendly, etc.), and it worked good enough to leave me convinced that the whole thing is worth trying out. But I don't know whether that's the best approach.
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first. In a parallel track, we will be creating the function wiki, and then, when the time is ripe we can bring these two strands of work together. So, would you be willing to work on that?
How does this sound for a plan?
Some further points:
This is way easier to implement, test and deliver than to implement 10
different backends with various progress in implementation, incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would be a huge task if the WMF core team were to develop all these different environments. But that's not the plan. The goal is really that *others* will hopefully build these :) All we need to do is to make sure that's possible and encouraged and simple enough. But yeah, not the core team.
The paper presents AW as sitting on top on WL. Both are big projects.
Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely took some years, before even starting the work on the other project. Yes, that's correct. That is exactly the time that allows us do the appropriate state of the art analysis. I hope it won't take us years, but that we will be faster.
AW can be realised with current tools and engineering practices.
Only if you commit to a specific implementation, which I am hesitant to do.
[English is an obstacle to programming] This strong affirmation needs
to be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
As I spend a significant time (~10 hours) gathering references and
writing this email (which is 5 pages long in Word), I would like to be mentioned as co-author in the final paper if any idea or references presented here is used in it. Thank you for your detailed comments, which will certainly improve the second version of the paper. I am happy to mention you in the acknowledgments. For co-authorship, I usually go for a more substantial engagement ;) If you're willing to write up the "Previous work" section along the lines you mentioned above (maybe with Tiago? Maybe with others to join?), but for a comprehensive overview of existing systems, then I am open to talk about co-authorship :)
For French, the gender of every noun entity *must* be present ... For
Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. That's exactly the goal of the lexicographic project on Wikidata, as was pointed out: https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of Lexemes with their classifiers, forms, etc. The lexicographic project was started with the Abstract Wikipedia in mind, knowing that exactly that will be needed.
Yet, the use of any existing formalism is dismissed, which mean all
the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. No, not at all it doesn't have to be ad-hoc, that's exactly what we can start working on now, long before we get to the point that we need to make that ad-hoc decision. I hope you'll join us to figure out the best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your beautiful answers, who raised a number of great replies much better than I ever could. And thanks to Louis for starting this more than interesting thread! Let's continue in this vein!
Cheers, Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming: resembling what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions.
With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions:
{
r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations, attributes could be, instead of plain text strings, uniform resource identifiers (URI’s). “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]
r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. “APCOM”) for a semantic model to convenience developers. We can also consider adding attribute-value pairs to a semantic model.
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Document_Object_Model
*From: *Tiago Timponi Torrent tiago.torrent@ufjf.edu.br *Sent: *Sunday, July 5, 2020 9:06 PM *To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
That’s a good idea, but I think you would need more than that. Take FrameNet, for example, but now departing from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated to other frames and, therefore, participate in other clusters of the network of frames. Moreover, you’d want those prototypical fillers to function as departing points for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that they can be analogically extended to other collections not originally assigned to the item you’re looking at.
Cheers
Tiago
Em dom, 5 de jul de 2020 às 20:03, Arthur Smith arthurpsmith@gmail.com escreveu:
Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me from just this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning - for nouns I think Wikidata items (and their relations to lexeme senses) solve that problem, but we are still missing I think a lot of the detail needed to do the same with adjectives and verbs (at least). So there is definitely some room for finding better ways to model - but maybe Wikidata could be expanded to handle the adjective/verb cases too. In general the concept of a single meaning associated with a Wikidata item as its identifier and a collection of attributes and relationships attached to that item is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to use and to add global objects and API to. Resembling how Web browsers provide scripting environments and API for functions, we can envision providing scripting environments and API for natural language generation functions.
I wonder what you might think about scripting environments and API for natural language generation scenarios?
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language
*From: *Louis Lecailliez louis.lecailliez@outlook.fr *Sent: *Saturday, July 4, 2020 2:10 PM *To: *abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Amir,
I understand the process is different that usual research. In fact I've seen Wikipedia grown from an unknown website to the biggest encyclopedia it is now. I use it daily in multiple languages and love it. I know what crowd sourcing could achieve.
It's also possible that the mere *finding* of these stumbling blocks by
such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject.
I disagree here. It would be contribution to scientic knowledge if and only if it wasn't discovered before. My email was precisely about that: capitalizing on the knowledge that has already been discovered, to avoid making the same mistake them again. It would not matter for a small project, but this one is really ambitious. We are speaking of 40 years of work by a horde of talented and very knowledgeable people, so this isn't to be dismissed easily.
This thing is, my previous email was a bit abstract, because it were a review of the paper, not of the project itself. I should have made more examples to illustrate where the problem lies.
Let's start with a simple example, in English, with corresponding Wikidata entities in-between parenthesis. I'm also using pseudo-turtle notation with made up relationships.
France (Q142) is a country (Q6256).
<Q142> <rel_is> <Q6256> .
Creating the English sentence is straightforward with the naive approach presented in the paper.
What is the French equivalent?
La France est un pays.
More information is required in the abstract representation: the text generator needs to know about the gender of both nouns (France and pays). So we need to extend the model as such:
<Q142> <rel_gender> <Q1775415> .
<Q6256> <rel_gender> <Q499327> .
Fine! Now what about Chinese?
法國是一個國家。
What we have in the middle of the sentence is a classifier (個). The model needs the following update:
<Q499327> <rel_use_classifier> <Q63153> .
To handle these 3 languages, the model has already 3 additional triples just for accounting for linguistic facts occuring in these languages. Wikipedia exists in more than 300 languages, and the world has about 6000 of them, each of them having particularities that must be taken into account. Fortunately they recoup themselves in-between languages. Nonetheless the World Atlas Language Structures ( https://wals.info/chapter/s1) count 144 distinct language features. Some are related to speech, but this means there is probably something like a hundred features that must be taken into account in the data model to produce valid natural language sentence.
Note that in the Chinese example, there is also a number (一, one) showing up. This is a phenomenon that must be taken into account; and it's not always appearing when using 是 (to be). How complex the "lambda" system will be just to deal with this issue? Hint: very much. It also needs to be compatible with dozen of other phenomena.
Then each of those features require extensive and complete data. For French, the gender of every noun entity *must* be present, otherwise there is half a chance of producing a wrong sentence each time a noun entity is encountered. For Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated. Where does the project will get the data from? (we are speaking of millions of item, most not referenced in existing dictionaries) How will this be encoded? Those are real questions that must be answered.
Suppose now we have done the work for "renderers" in these three languages. They both use the more or less similar A X B structure where X is a verb meaning "to be".
What would be the Japanese equivalent?
The more natural structure would be like:
フランスは国(だ)。
What is a play here is a topicalization (Q63105) of France, followed by a predicate (it's a country). This is very different from the previous structure, which, not surprisingly enough, needs it's own representation. To make situation more difficult, the previous (A be B) structure can also exists in Japanese, but would lead to a totally different sentence if used.
The paper states that Figure 1 and 2 are examples that will be more complex in real life. Yet, the use of any existing formalism is dismissed, which mean all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion. Moreover, changing formalism (be it ad hoc or not) will require to change every piece of code/data using it. This will happen everytime a language with unsupported feature(s) is added to the project. It's not hard to see how this will waste a huge amount of time and goodwill from involved people. The very code focussed tone of the paper, the english-centric approach used in the examples and the lack of references shows that the complexity of the task on the NLP front is not sufficiently conceptualized.
Best Regards,
Louis Lecailliez
*De :* Abstract-Wikipedia abstract-wikipedia-bounces@lists.wikimedia.org de la part de abstract-wikipedia-request@lists.wikimedia.org < abstract-wikipedia-request@lists.wikimedia.org> *Envoyé :* samedi 4 juillet 2020 15:06 *À :* abstract-wikipedia@lists.wikimedia.org < abstract-wikipedia@lists.wikimedia.org> *Objet :* Abstract-Wikipedia Digest, Vol 1, Issue 6
Send Abstract-Wikipedia mailing list submissions to abstract-wikipedia@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia or, via email, send a message with subject or body 'help' to abstract-wikipedia-request@lists.wikimedia.org
You can reach the person managing the list at abstract-wikipedia-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Abstract-Wikipedia digest..."
Today's Topics:
- Re: NLP issues severely overlooked (Charles Matthews)
- Use case: generation of short description (Jakob Voß)
- Re: NLP issues severely overlooked (Amir E. Aharoni)
Message: 1 Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST) From: Charles Matthews charles.r.matthews@ntlworld.com To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" < abstract-wikipedia@lists.wikimedia.org> Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked Message-ID: 2126327926.39940.1593867909152@mail2.virginmedia.com Content-Type: text/plain; charset="utf-8"
It is interesting to be on a list where one can hear about software issues, and then computational linguistic problems. I'm not an expert in either area.
I do have 17 years of varied Wikimedia experience (and I use my real name there).
On 04 July 2020 at 12:25 Louis Lecailliez louis.lecailliez@outlook.fr wrote:
<snip>
Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because the Wiktionary projects themselves severely lack content and structure, for those that have any content at all; secondly because specialized NLP resources are missing there too (note: I'm not familiar with Wikidata so I could be wrong, however I have never seen it cited for the kind of NLP resources I'm talking about).
I can comment about this. Besides Wiktionary, there is the "lexeme" namespace of Wikidata. It is a relatively new part of Wikidata, dealing with verbal forms.
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors.
It is worth saying, for context, that there is a certain style or philosophy coming from the wiki side: more precisely, from the wikis before Wikipedia. There is the slogan "what is the simplest thing that would actually work?" You might argue, plausibly, that Wikipedia, at nearly 20 years old, shows that there is a bit more to engineering than that.
On the other hand, looking at Wikidata at seven years old, there is some point to the comment. It has a rather simple approach to linked structured data, compared to the Semantic Web environment. (Really, just write a very large piece of JSON and try to cope with it!) But the number of binary relations used (8K, if you count the "external links" handling) is now quite large, and has grown organically. The data modelling is in a sense primitive, sometimes non-existent. But the range of content handled really is encyclopedic. And in an area like scientific bibliography, at a scale of tens of millions of entities, the advantages of not much ontological fussiness begin to be seen.
Wikidata started as an index of all Wikipedia articles, and is now five times the size needed for that: a very enriched "index".
I suppose the NLP required to code up, for example, 50K chemistry articles about molecules, might be a problem that could be solved, leaving aside the general problems for the moment.
In any case, there is a certain approach, neither academic nor commercial, that comes with Wikimedia and its communities, and it will be interesting to see how the issues are addressed.
Charles Matthews (in Cambridge UK)
Hi Louis,
OK, here's a suggestion for the paper: I will finish the v2 - I hope this month or the next - and then make the sources available as a git repository, so that anyone who wants to work on it can do so.
Does this sound good?
Cheers, Denny
On Thu, Jul 9, 2020 at 5:22 AM Louis Lecailliez louis.lecailliez@outlook.fr wrote:
Hi Denny,
yes, the main problem of most of the systems presented in research papers (UNL or not) is that they are locked in the institutions that made them. A lot of UNL webpages have gone down since the last time I checked (recently), and the system was in fact designed in a way that it could work over the web while not letting third parties access the code and data. This is of course the exact reverse of the technical and philosophical approach taken here, and very sad, as decades of accumulated knowledge are lost; the papers are far from sufficient to re-create even a fraction of the said systems.
There is also, I guess, a lot of interesting work that is not translated into English at all (notably in linguistics), as making an academic career in the national language was an option in a lot of places until very recently.
So, would you be willing to work on that?
Yes, of course; I wouldn't have posted on the mailing list otherwise. I like the dual, concurrent linguistics/theory approach you are proposing. Note though that I'm not an expert by any means in natural language generation; it just happens that I stumbled upon UNL recently, and it has too much in common with this project on the abstract representation/NLG side not to mention it. I also have some researchers' names in mind, as I met some who worked on the referenced works.
Concerning the paper authorship, I understand your stance, and yes, I'm willing to work more and write about previous works with those interested. Just to have an idea, what is the expected timeframe for a revision?
Lexicographic data in Wikidata totally flew under my radar. This is indeed something that will be needed in the future, and where I can directly contribute too! As mentioned in [1], the license seems to be an issue, notably for importing existing resources; is there any “fix” planned for that?
All in all, I'm very pleased to see that a lot of aspects are more planned out than I assumed from reading the paper alone, and I’m more confident in the project's success now.
Best regards, Louis Lecailliez
[1] http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf
*From:* Abstract-Wikipedia abstract-wikipedia-bounces@lists.wikimedia.org on behalf of Denny Vrandečić dvrandecic@wikimedia.org *Sent:* Wednesday, 8 July 2020 22:37 *To:* General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org *Subject:* Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
Hi Louis, all,
Louis, thanks for raising that important issue!
I have been looking into a number of related NLG systems, and one thing I noticed is a pattern of many of these projects being developed very much in isolation from each other, and also often without much concern for ongoing linguistic research. That is what I tried to capture in the research paper by stating that there is no consensus on this, and that it seems too early to commit to a specific solution.
I had given a quick look to UNL, but the project looked pretty stale to me - I could not see any activity in the last decade. Furthermore, the page didn't provide access to the source code and instead mentioned that part of the technology is under patents, which is quite a red flag for me, and I usually don't look into something like that any further, in order to honestly be able to say that I didn't get any ideas from those patents. If I am mistaken, and there is a freely usable write-up or implementation, I'd be happy to come back and read up more.
Thank you for the annotated bibliography! That is super useful.
But I did look in detail into a (small) number of other, similar systems, such as Grammatical Framework or KPML. Tiago mentioned FrameNet, and I learned a lot about that too. Getting an overview of the whole field has been a rather frustrating experience, especially since the major textbook in that area - Dale & Reiter - doesn't cover these systems, nor does the 2018 update to that book by Gatt & Krahmer, and it seems that research work in that area often omits these practical systems. Accordingly, when I talk with the professors and researchers in this area, also about the proposal here, they are more focussed on specific issues, and don't know that much about the concrete systems (which is understandable - the flow from research to practical systems is more established in many areas). Never mind that when you get to the linguistic side of it, instead of the computer science part, there are even more competing theories, many of which are aimed toward much more encompassing goals and are about covering the whole of language and natural language understanding, which we want to shy away from.
The goal of the paper was never meant to be a comprehensive account of the state of the art in natural language generation. That's what Dale & Reiter and Gatt & Krahmer have aimed for, and their works are hundreds of pages. I had the feeling my paper was already too long, and putting in an overview of the state of the art would have made it at least double the length.
So, given that (and other reasons, as laid out in the paper), a system which could support any of these approaches seemed a more promising way. So far, for my own prototype, I have been mostly following Grammatical Framework (because it has a very accessible book, the software is free, the community was friendly, etc.), and it worked well enough to leave me convinced that the whole thing is worth trying out. But I don't know whether that's the best approach.
As mentioned by Chris Cooley, the goal will be to create a new wiki, a library of functions, that can support any of these approaches. My dream would be - and I see that Chris had already suggested that - that experts like you and your colleagues create an overview of the state of the art that will be accessible to the community and that will allow us to make a well-informed decision when the time comes as to which path to explore first. In a parallel track, we will be creating the function wiki, and then, when the time is ripe we can bring these two strands of work together. So, would you be willing to work on that?
How does this sound for a plan?
Some further points:
This is way easier to implement, test and deliver than to implement 10 different backends with various progress in implementation, incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would be a huge task if the WMF core team were to develop all these different environments. But that's not the plan. The goal is really that *others* will hopefully build these :) All we need to do is to make sure that's possible and encouraged and simple enough. But yeah, not the core team.
The paper presents AW as sitting on top of WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before even starting the work on the other project.
Yes, that's correct. That is exactly the time that allows us to do the appropriate state-of-the-art analysis. I hope it won't take us years, but that we will be faster.
AW can be realised with current tools and engineering practices.
Only if you commit to a specific implementation, which I am hesitant to do.
[English is an obstacle to programming] This strong affirmation needs to be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
As I spent a significant amount of time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as co-author in the final paper if any idea or reference presented here is used in it.
Thank you for your detailed comments, which will certainly improve the second version of the paper. I am happy to mention you in the acknowledgments. For co-authorship, I usually go for a more substantial engagement ;) If you're willing to write up the "Previous work" section along the lines you mentioned above (maybe with Tiago? maybe with others joining?), but as a comprehensive overview of existing systems, then I am open to talking about co-authorship :)
For French, the gender of every noun entity *must* be present ... For
Chinese and Japanese, classifier information must be present for all noun, in case one must be enumerated.
That's exactly the goal of the lexicographic project on Wikidata, as was pointed out:
https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of Lexemes with their classifiers, forms, etc. The lexicographic project was started with the Abstract Wikipedia in mind, knowing that exactly that will be needed.
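For what it's worth, lexemes are reachable through the ordinary wbgetentities API, so renderers can look up forms and their grammatical features programmatically. A rough Python sketch (the JSON field names reflect the lexeme data model as I understand it today and should be double-checked against the live data):

# Rough sketch: fetch a lexeme and list its lemmas and forms, with the
# grammatical features (Q-ids for categories such as number or case) per form.
import json
import urllib.request

def fetch_lexeme(lexeme_id: str) -> dict:
    url = ("https://www.wikidata.org/w/api.php"
           f"?action=wbgetentities&ids={lexeme_id}&format=json")
    req = urllib.request.Request(url, headers={"User-Agent": "abstract-wikipedia-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["entities"][lexeme_id]

lexeme = fetch_lexeme("L12449")
for lang, lemma in lexeme.get("lemmas", {}).items():
    print(f"lemma ({lang}): {lemma['value']}")
for form in lexeme.get("forms", []):
    spellings = ", ".join(r["value"] for r in form["representations"].values())
    print(f"{form['id']}: {spellings}  features: {form['grammaticalFeatures']}")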
Yet, the use of any existing formalism is dismissed, which means all the situations I illustrated in this email will need to be dealt with in an ad hoc fashion.
No, not at all - it doesn't have to be ad hoc; that's exactly what we can start working on now, long before we get to the point where we need to make that decision. I hope you'll join us to figure out the best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your beautiful answers, which made a number of great points much better than I ever could. And thanks to Louis for starting this more than interesting thread! Let's continue in this vein!
Cheers, Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Brainstorming: resembling what the document object model (DOM) [1] is for XML and attributed trees, perhaps we could create and specify an object model for sets of attributed predicate calculus expressions.
With an attributed predicate calculus object model (e.g. “APCOM”) for sets of attributed predicate calculus expressions:
{
r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11, o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations and attributes could be, instead of plain text strings, uniform resource identifiers (URIs). “r1” could be a URI, “a1” could be a URI, “o1” could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
r1.[@a1=v1](o1(icl>domain1).[@a2=v2], o2(icl>domain2).[@a3=v3]).[@a4=v4]
r2.[@a5=v5](o3(icl>domain3).[@a6=v6], o4(icl>domain4).[@a7=v7]).[@a8=v8]
r3.[@a9=v9](o5(icl>domain5).[@a10=v10], o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. “APCOM”) for a semantic model, for the convenience of developers. We can also consider adding attribute-value pairs to a semantic model.
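To make the shape of such an object model concrete, here is a toy Python data model for these attributed expressions; the class and field names are invented for illustration and are not a proposed API:

# Toy data model for sets of attributed predicate calculus expressions (illustration only).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Obj:
    uri: str                                   # e.g. "o1", ideally a full URI
    restriction: str = ""                      # e.g. "icl>domain1"
    attributes: Dict[str, str] = field(default_factory=dict)   # @a = v pairs

@dataclass
class Relation:
    uri: str                                   # e.g. "r1", ideally a full URI
    args: List[Obj] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class ExpressionSet:
    expressions: List[Relation] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)   # set-level @a = v

model = ExpressionSet(
    expressions=[
        Relation("r1",
                 args=[Obj("o1", "icl>domain1", {"a2": "v2"}),
                       Obj("o2", "icl>domain2", {"a3": "v3"})],
                 attributes={"a1": "v1", "a4": "v4"}),
    ],
    attributes={"a14": "v14"},
)
print(model.expressions[0].attributes["a1"])   # -> "v1"

A JavaScript or Lua counterpart would look much the same.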
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Document_Object_Model
*From: *Tiago Timponi Torrent tiago.torrent@ufjf.edu.br *Sent: *Sunday, July 5, 2020 9:06 PM *To: *General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda) abstract-wikipedia@lists.wikimedia.org *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir E. Aharoni)
That’s a good idea, but I think you would need more than that. Take FrameNet, for example, but now starting from verbs instead of nouns. FrameNet has a very detailed model for dealing with verbs, their semantic arguments and the way they surface in morphosyntax. Nonetheless, to apply such a model in a text comprehension and/or generation task, you need more than that. You need to know prototypical fillers for the positions, which, in turn, are associated with other frames and, therefore, participate in other clusters of the network of frames. Moreover, you’d want those prototypical fillers to function as starting points for analogical extensions in the model, since not every sentence is a prototypical combination of words. In other words, the collection of attributes and relations you refer to should be defined in a way that they can be analogically extended to other collections not originally assigned to the item you’re looking at.
Cheers
Tiago
On Sun, 5 Jul 2020 at 20:03, Arthur Smith arthurpsmith@gmail.com wrote:
Yes, thank you for the UNL background, that is extremely helpful. I've been reading some of the articles Louis provided as references, and it seems to me from just this perhaps naive point of view, that a lot of the complexity is associated with disambiguation of meaning - for nouns I think Wikidata items (and their relations to lexeme senses) solve that problem, but we are still missing I think a lot of the detail needed to do the same with adjectives and verbs (at least). So there is definitely some room for finding better ways to model - but maybe Wikidata could be expanded to handle the adjective/verb cases too. In general the concept of a single meaning associated with a Wikidata item as its identifier and a collection of attributes and relationships attached to that item is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski adamsobieski@hotmail.com wrote:
Louis,
Thank you for the information about the Universal Networking Language [1] and the World Atlas of Language Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations and expressions enhances expressiveness for various features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with which to obtain such expressiveness for models such as Wikidata’s.
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript / WebAssembly based, and observing that Lua / WebAssembly solutions exist, we can note that scripting engines such as V8 are easy to use and to add global objects and API to. Resembling how Web browsers provide scripting environments and API for functions, we can envision providing scripting environments and API for natural language generation functions.
I wonder what you might think about scripting environments and API for natural language generation scenarios?
Best regards,
Adam
[1] https://en.wikipedia.org/wiki/Universal_Networking_Language
[2] https://wals.info/
Hello Denny,
sounds very good to me!
I'll also put the references I collected on the wiki once there is a page for that.
Best regards, Louis Lecailliez
------------------------------
Message: 2 Date: Sat, 4 Jul 2020 08:18:56 +0200 From: Jakob Voß jakob.voss@gbv.de To: abstract-wikipedia@lists.wikimedia.org Subject: [Abstract-wikipedia] Use case: generation of short description Message-ID: 4403bbda-040b-6c89-9cb6-6540139250dc@gbv.de Content-Type: text/plain; charset="utf-8"
Hi,
I want to auto-generate disambiguation descriptions for African politicians to be added to Wikidata; e.g. from the country Mozambique (Q1029) the following descriptions should be generated:
- Mozambican politician (en)
- Mosambikanischer Politiker (de)
- politico mozambicano (it)
- ...
This could be extended to other professions. My questions:
- Can anyone point me to data sources where to best look up country adjectives such as "Mozambican"?
- Where/how to best store the lexical information for best reuse with other renderers?
- If I create small renderers for these short descriptions, what architecture do you prefer for best reuse?
My just-get-it-done solution would be a set of CSV files and a few lines of Perl code, but maybe this use case can be aligned with Abstract Wikipedia to learn more about it.
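For illustration, here is roughly what that just-get-it-done version could look like in Python; the adjective table is made up for this example, and the hard-coded word-order and capitalization rules obviously would not scale, which is rather the point of this thread:

# Sketch only: per-language short descriptions from (country, profession) pairs.
COUNTRY_ADJECTIVE = {                          # (country QID, language) -> adjective
    ("Q1029", "en"): "Mozambican",
    ("Q1029", "de"): "mosambikanischer",       # already inflected (masc. nom. sg.)
    ("Q1029", "it"): "mozambicano",
}
PROFESSION = {                                 # (profession QID, language) -> noun
    ("Q82955", "en"): "politician",            # Q82955 = politician
    ("Q82955", "de"): "Politiker",
    ("Q82955", "it"): "politico",
}

def short_description(country: str, profession: str, lang: str) -> str:
    adjective = COUNTRY_ADJECTIVE[(country, lang)]
    noun = PROFESSION[(profession, lang)]
    if lang == "it":
        description = f"{noun} {adjective}"    # Italian: adjective follows the noun
    else:
        description = f"{adjective} {noun}"    # English, German: adjective precedes
    if lang == "de":
        description = description[0].upper() + description[1:]
    return description

for lang in ("en", "de", "it"):
    print(short_description("Q1029", "Q82955", lang))   # matches the examples above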
Looking forward to collaborating, Jakob
------------------------------
Message: 3 Date: Sat, 4 Jul 2020 18:03:24 +0300 From: "Amir E. Aharoni" amir.aharoni@mail.huji.ac.il To: "General public mailing list for the discussion of Abstract Wikipedia (aka Wikilambda)" abstract-wikipedia@lists.wikimedia.org Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked Message-ID: CACtNa8t6kbWe21C980h1MxiWNfUp+0eDE82vPMjDUX2UCgb2gw@mail.gmail.com Content-Type: text/plain; charset="utf-8"
Hi,
Thanks a lot for the sources. I am not one of the people implementing Wikilambda, but I am just very curious about it as a member of the wider Wikimedia community. But there's a good chance that they will be useful to people who do work on the implementation.
I will dare to add a little thought I have about it, however. It's possible that the challenge of building a well-functioning natural language generator is underestimated by the founders, and that they don't pay enough attention to existing work (although, knowing Denny, there is a good chance that he actually is aware of at least some of it). But there is something that the wide Wikimedia community has that I'm not sure that the past projects in this field did: The community itself. A big, worldwide, and diverse group of passionate volunteers, who love the idea of spreading free knowledge and who love their languages. Quite a lot of them also know some programming, and in the past they proved unbelievably creative and productive when writing code for Wikimedia projects as a community, in the form of templates, modules, gadgets, bots, extensions, and other tools. I'm quite sure that once the new tools become usable, this community will start doing creative things again, and it will also start reporting bugs and limitations.
So yes, while it's possible that along the way both the core developers and the volunteer community will find all kinds of stumbling blocks, I'm pretty sure that they will also have all kinds of surprising success stories. It's also possible that the mere *finding* of these stumbling blocks by such a big, diverse, open, and active community, will itself be a contribution to the scientific knowledge around this subject. And don't underestimate the "open" part—that's where we really shine. This won't be a theoretical work in a lab, published in a paywalled and copyright-restricted academic journal, but fully optimized for accessibility to everyone.
Yes, this whole email from me is incredibly naïve, but it's the same attitude that got us to writing the biggest and most multilingual encyclopedia in history, so maybe we can do something cool again :)
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
On Sat, 4 Jul 2020 at 14:26, Louis Lecailliez louis.lecailliez@outlook.fr wrote:
Hello,
my name is Louis Lecailliez, PhD student at Kyoto University in education technology. I'm a Computer Science and NLP graduate. One thing I do is working on language learner's knowledge modelling as graphs.
The Abstract Wikipedia project is really interesting. There are, however, two very concerning issues I spotted when reading the associated paper draft (https://arxiv.org/abs/2004.04733). The following email could be read as negative, but please don't take it as such: my purpose is to avoid spending people's effort and money on things that can (and need to!) be fixed upfront.
- Issues with NLP
The main issue is that the difficulty of the NLP task of generating natural text from an abstract representation is severely overlooked. This stems from the other main problem: the paper is not based on the decades of previous work in that space.
As I understand it, the main value proposition of Abstract Wikipedia (AW) is a computer representation of encyclopedic knowledge that can be projected into different existing natural languages, with the goal of supporting a huge number of them. Plus, an editor to make this happen easily.
This is in fact surprisingly close to what the Universal Networking Language (UNL) project, which started 20 years ago, aims to do. UNL provides a language-agnostic representation of text based on hypergraphs. A piece of software (called an EnConverter) produces UNL graphs from natural text in a given language. Another kind of software, called a DeConverter, does the reverse, that is, producing natural text from the abstract representation. This is exactly the function of the "renderers" in the AW paper. The way of doing it is also similar: by applying successive transformations until the final text string is produced. In general, that kind of abstract meaning representation is called an interlingua, and it is widely used in machine translation (MT) systems.
Disregarding two decades of work, in the UNL case, on the same problem space (rule-based machine translation, here with an abstract language as the fixed source language), which was itself based on a few more decades of work, does not seem a wise way to start a new project. For a start, the graph representation used in AW will likely not be expressive enough to encode linguistic knowledge; this is why UNL uses hypergraphs instead of graphs.
The problem is glaring when looking at the reference list: it is bloated with irrelevant references (such as those to programming languages [27, 37, 41, 77], Turing completeness being the worst offender [11, 17, 23, ...]) while containing only two references [7, 85] to the really hard part of the project: generating natural language from the abstract representation. There are a few more relevant references about natural language generation, but this isn't enough.
Interestingly, [85] is a UNL paper, but not the main one. Moreover, it is cited in Section 9, "Opening future research". It should instead be discussed in a "Previous work" section, which is missing from the paper.
To fill a part of this section yet to be written, I propose the following references:
[1*] Uchida, H., Zhu, M., & Della Senta, T. (1999). A gift for a millennium. IAS/UNU, Tokyo. https://www.researchgate.net/profile/Hiroshi_Uchida2/publication/239328725_A...
[2*] Wang-Ju Tsai (2004). La coédition langue-UNL pour partager la révision entre langues d'un document multilingue [Language-UNL coedition to share revisions in a multilingual document]. Doctoral thesis, Grenoble. https://pdfs.semanticscholar.org/b030/ea4662e393657b9a134c006ca5b08e8a23b3.p...
[3*] Boitet, C., & Tsai, W. J. (2002). La coédition langue<->UNL pour partager la révision entre les langues d'un document multilingue: un concept unificateur [Language<->UNL coedition to share revisions between the languages of a multilingual document: a unifying concept]. Proc. TALN-02, Nancy, 22-26. http://www.afcp-parole.org/doc/Archives_JEP/2002_XXIVe_JEP_Nancy/talnrecital...
[4*] Tomokiyo, M., Mangeot, M., & Boitet, C. (2019). Development of a classifiers/quantifiers dictionary towards French-Japanese MT. arXiv preprint arXiv:1902.08061. https://arxiv.org/pdf/1902.08061.pdf
[5*] Boguslavsky, I. (2005). Some controversial issues of UNL: Linguistic aspects. Research on Computer Science, 12, 77-100. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.2058&rep=re...
[6*] Boitet, C. (2002). A rationale for using UNL as an interlingua and more in various domains. In Proc. LREC-02 First International Workshop on UNL, other Interlinguas, and their Applications, Las Palmas (pp. 26-31). https://www.cicling.org/2005/unl-book/Papers/003.pdf
[7*] Dhanabalan, T., & Geetha, T. V. (2003, December). UNL deconverter for Tamil. In International Conference on the Convergences of Knowledge, Culture, Language and Information Technologies. http://www.cfilt.iitb.ac.in/convergence03/all%20data/paper%20032-372.pdf
[8*] Singh, S., Dalal, M., Vachhani, V., Bhattacharyya, P., & Damani, O. P. (2007). Hindi generation from Interlingua (UNL). Machine Translation Summit XI. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.979&rep=rep1...
[9*] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... & Schneider, N. (2013, August). Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (pp. 178-186). https://www.aclweb.org/anthology/W13-2322.pdf
[10*] Berment, V., & Boitet, C. (2012). Heloise - An Ariane-G5 compatible environment for developing expert MT systems online. In Proceedings of COLING 2012: Demonstration Papers (pp. 9-16). https://www.aclweb.org/anthology/C12-3002.pdf
[11*] Berment, V. (2005). Online translation services for the Lao language. In Proceedings of the First International Conference on Lao Studies, De Kalb, Illinois, USA (pp. 1-11). https://www.researchgate.net/profile/Vincent_Berment/publication/242140227_O...
[1*] is the paper that describes UNL. [2*] is a doctoral thesis discussing a core problem AW is trying to address too. [3*] is a short paper done in the scope of [2*]; even if you don't understand French, you can have a look at the figures: two of them are about an editor similar in principle to what AW wants to incorporate. [5*] gives insights about UNL expressivity issues, ten years after the project's start. [6*] offers more on UNL, with a short history and the context in which it is used.
[4*] shows how deep natural language conversion goes: this paper addresses the issue of classifiers in French and Japanese. This is just one linguistic issue, and there are dozens if not hundreds of them. An important point is that both of the languages involved need to be taken into account when modelling the abstract encoding, otherwise too much information is lost to produce a correct output.
[7*] and [8*] are very valuable examples of real-world deconverter systems for UNL. As is visible in [7*]'s Figure 1 and [8*]'s Figure 2, the process is *way* more complicated than a single "renderers" box. Moreover, there are very distinct identifiable steps, informed by linguistics. The AW paper does not describe any such structuring of the natural-text generation process; everything is supposed to happen in some unstructured "lambda" system. Also missing are the specialized resources (UNL-Hindi dictionary, Tamil word dictionary, co-occurrence dictionary, etc.) required for the task. Nothing precise is said about linguistic resources in the AW paper except for "These function finally can call the lexicographic knowlegde stored in Wikidata.", which is not very convincing: first because the Wiktionary projects themselves severely lack content and structure (for those that have any content at all), and secondly because specialized NLP resources are missing there too (note: I'm not familiar with Wikidata, so I could be wrong; however, I have never seen it cited for the kind of NLP resources I'm talking about).
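To make the contrast concrete, here is a deliberately toy sketch, in Python, of what "distinct identifiable steps, informed by linguistics" means in code. The stage names, the tiny dictionaries, and the example sentence are invented for illustration; real deconverters such as [7*] and [8*] have many more stages, each backed by much larger resources.

    # Toy lexical resources: each one is a separate, inspectable artefact.
    LEMMAS = {"Q6256": {"lemma": "pays", "gender": "m"}}           # concept -> French lemma + features
    DETERMINERS = {("indef", "m"): "un", ("indef", "f"): "une"}    # agreement rules
    COPULA = {"present": "est"}                                    # (tiny) inflection table

    def lexical_selection(concept):
        # Needs a concept->lemma dictionary for the target language.
        return LEMMAS[concept]

    def syntactic_linearization(subject_np, noun):
        # Needs gender information for every noun to pick the determiner.
        det = DETERMINERS[("indef", noun["gender"])]
        return [subject_np, COPULA["present"], det, noun["lemma"]]

    def orthographic_postprocessing(tokens):
        # Spacing, elision, capitalization, and punctuation would live here.
        return " ".join(tokens) + "."

    def deconvert(subject_np, concept):
        # The subject NP is pre-rendered here only to keep the sketch short.
        noun = lexical_selection(concept)
        return orthographic_postprocessing(syntactic_linearization(subject_np, noun))

    print(deconvert("Le Japon", "Q6256"))   # -> "Le Japon est un pays."

With this kind of structure, every missing dictionary entry or rule shows up as a specific, reportable gap in a specific stage, instead of a silent failure inside one monolithic renderer.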
[10*] is a translation system built with "specialised languages for linguistic programming (SLLPs)", which is the kind of service Wikilambda is supposed to provide for Abstract Wikipedia. [11*] gives an estimate of 2,500 hours for the development (by a specialist) of three linguistic modules for Lao processing.
So, with regard to the difficulty of the task and the previous work in the literature, the AW paper does not provide any convincing evidence that the technology on which it is supposed to be built can even reach the state of the art. Dismissing every existing formalism and software system on the grounds of "no consensus commiting to any specific linguistic theory" is not going to work: it will result in an ad hoc, implementation-driven formalism that will have a hard time fulfilling its goal. The NLP part (generating sentences from the abstract representation) is the hardest part of the project, yet it is by far the least convincing one. "Abstract Wikipedia is indeed firmly within this tradition, and in preparation for this project we studied numerous predecessors." I would like to believe so, but the lack of corresponding references, as well as the lack of a previous-work section, tends to prove the contrary.
While I can't advise a switch to UNL, as I'm not a specialist in it, it would be smart to capitalize on the work done on it by highly skilled (PhD-level) individuals. As the UNL system is built on hypergraphs, it could probably be made easily interoperable with RDF knowledge graphs if named graphs are used. With a UNL/RDF specification (yet to be written), the vision exposed in the AW paper might be reached sooner by reusing existing software (we are speaking of thousands of person-years of work, as per [11*]) and, almost as importantly, an existing formalism that has been "debugged" for decades. There are probably other systems I'm unaware of that are worth investigating too, some, like [9*], having more specialized usage. In any case, there is a strong need to ground the paper and the project in the existing (huge) literature.
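Purely as a hypothetical illustration of the named-graph idea (the prefixes, relation names, and encoding below are invented; defining them properly is exactly what a UNL/RDF specification would have to do):

    # Quads as plain tuples: (subject, predicate, object, named_graph).
    # The relations inside a UNL-style scope become the triples of one named
    # graph, and the graph name itself can then appear in further triples,
    # which is one way a hyperedge over a whole sub-graph could live in RDF.
    quads = [
        ("ex:border", "unl:agt", "wd:Q142", "ex:scope01"),   # agent: France
        ("ex:border", "unl:obj", "wd:Q183", "ex:scope01"),   # object: Germany
        # An attribute of the whole proposition (tense, modality, ...) attaches
        # to the scope, not to any single node; here it sits in the default graph.
        ("ex:scope01", "unl:attr", "unl:past", None),
    ]

    def triples_in(graph_name, data):
        # All triples asserted inside one named graph (i.e. one scope).
        return [(s, p, o) for (s, p, o, g) in data if g == graph_name]

    print(triples_in("ex:scope01", quads))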
- Other issues
"In order to evaluate a function call, an evaluator can choose from a multitude of backends: it may evaluate the function call in the browser, in the cloud, on the servers of the Wikimedia Foundation, on a distributed peer-to-peer evaluation platform, or natively on the user’s machine in a dedicated hosting runtime, which could be a mobile app or a server on the user’s computer."
This part is major technical creep. There is no reason to turn the project into a distributed heterogeneous computing platform with a dedicated runtime, which would be a research project on its own, when the stated goal is to provide abstract multilingual encyclopedic content. All the computation can be done on servers (the cloud is servers too) and cached. This is far easier to implement, test, and deliver than ten different backends with varying degrees of completeness, incompatibilities, and different runtime characteristics.
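A minimal sketch of the simpler server-side alternative (the function names are hypothetical; a real deployment would use a persistent server-side cache rather than an in-process one):

    from functools import lru_cache

    def render_from_abstract_content(item_id: str, language: str) -> str:
        # Stand-in for the real renderer pipeline (hypothetical).
        return f"[{language} article generated for {item_id}]"

    @lru_cache(maxsize=100_000)
    def rendered_article(item_id: str, language: str) -> str:
        # Render once on the server, serve later requests from the cache;
        # evict the entry when the underlying abstract content changes.
        return render_from_abstract_content(item_id, language)

    print(rendered_article("Q142", "fr"))   # computed
    print(rendered_article("Q142", "fr"))   # served from cache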
The paper presents AW as sitting on top of WL. Both are big projects. Sitting a big project on top of another one is really risky, as it means a significant milestone must first be reached in the dependency (here WL), which would likely take some years, before work on the other project can even start. AW can be realised with current tools and engineering practices.
"One obstacle in the democratization of programming has been that almost every programming language requires first to learn some basic English."
This strong affirmation needs to be sourced. Programming languages, save for a few keywords, don't rely much on English. The general failure of localized versions of programming languages (such as French BASIC), as well as the heavy use of existing programming languages in countries that don't even use the Latin alphabet (China, Russia), tends to prove that English is not at all a bottleneck for the democratization of programming. [53] is cited later in the paper, but it is a pop-linguistics article from an online newspaper, not an academic article.
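As a small illustration of how little English a programming language actually imposes: Python 3, for instance, accepts identifiers in any script (PEP 3131), so apart from a few dozen English keywords and standard-library names, a program can be written in one's own language:

    # Valid Python 3 (PEP 3131): identifiers written entirely in Chinese.
    def 面积(长, 宽):
        # Compute the area of a rectangle from its length (长) and width (宽).
        return 长 * 宽

    print(面积(3, 4))   # -> 12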
- Final words
To finish on a positive note, I would like to highlight the points I really like in the paper. First, its collaborative and open nature, like all Wikimedia projects, gives it a much higher chance of success than its predecessors had. If UNL is not too well known, it's not because it didn't yield research achievements, but because one selected institution per language works on it and keeps the resources and software within the lab's walls. Secondly, some very welcome things are declared out of scope: conversion from natural language, and anything beyond encyclopedic-style text. This will allow a more focused effort towards the end goal, making it more achievable. And finally, the choice to go with a symbolic, rule-based system, with a touch of ML where useful. This is, as said in the paper, a big win for explainability and for using human contributions to build the system. It will also keep the computing cost at a saner baseline than what current deep-learning models require.
I think the project can succeed thanks to its openness, yet there are real dangers visible in the paper: on the NLP side, reinventing a wheel that took 40 years to build; and on the technical side, losing time and effort on a sub-project that is not required per se for AW to be built.
As I spent a significant amount of time (~10 hours) gathering references and writing this email (which is 5 pages long in Word), I would like to be mentioned as a co-author of the final paper if any idea or reference presented here is used in it.
Best regards, Louis Lecailliez
PS: Typos
- "These two projects will considerably expand the capabilities of the
Wikimedia platform to enable every single human being to freely share share in the sum of all knowledge." => duplicate share
- "The content is than turned into" => The content is then turned into
- "[26] Charles J Fillmore, Russell Lee-Goldman, and Russell Rhodes. The
framenet constructicon. Sign-based construction grammar, pages 309–372, 2012." => The framenet construction
- "These function finally can call the lexicographic knowlegde stored in
Wikidata." => These function finally can call the lexicographic knowledge stored in Wikidata
- "[102] George Kinsley Zipf. Human Behavior and the Pirnciple of Least
Effort. Addison-Wesley, 1949." => [102] George Kinsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
- "Allowing the individual language Wikipedias to call Wikilambda has an
addtional benefit." => Allowing the individual language Wikipedias to call Wikilambda has an additional benefit. _______________________________________________ Abstract-Wikipedia mailing list Abstract-Wikipedia@lists.wikimedia.orgmailto:Abstract-Wikipedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
End of Abstract-Wikipedia Digest, Vol 1, Issue 6 ************************************************
--
Tiago Timponi Torrent
PPG-Linguística - FrameNet Brasil
Universidade Federal de Juiz de Fora