Quick side question: is there a role for formal ontology (FOL, DL or CL
type of thing) in computational linguistics?
Mike
On 7/9/2020 8:22 AM, Louis Lecailliez wrote:
Hi Denny,
yes, the main problem of most of the systems presented in research
papers (UNL or not) is that they are locked in the institutions that
made them. A lot of UNL webpages went down since last time I checked
(recently), and the system was in fact designed in a way it could work
over the web while not letting third-parties access code and data.
This is of course the exact reverse of the technical and philosophical
approach taken here, and very sad as decade of accumulated knowledge
is lost; the papers are far from sufficient to re-create even of
fraction of the said systems
There is also, I guess, a lot of interesting work that is not
translated in English at all (notably in linguistics), as making an
academic career in the national language was an option in a lot of
places until very recently.
So, would you be willing to work on that?
Yes, of course, I wouldn't have posted in the mailing list otherwise.
I like the dual, concurrent approach of linguistic/theory you are
proposing. Note though that I'm not an expert be any mean in natural
language generation, it just happens I stumbled upon UNL recently and
it has too much in common on the abstract representation/NLG with this
project not to mention it. I also had some researchers name in mind as
I met some who worked on the referenced works.
Concerning the paper authorship, I understand your stance, and yes I'm
willing to work more and write about previous works with those
interested. Just to have an idea, what it is expected timeframe for a
revision?
Lexicographic data in Wikidata totally flew under my radar. This is
indeed something that will be needed in the future, and where I can
directly contribute too! As mentioned by [1] the license seems to be
an issue notably for importing existing resources, is there any “fix”
planned for that?
All in all, I'm very pleased to see lot of aspects are more planned
than it I assumed to be from reading the paper alone, and I’m more
confident in the success now.
Best regards,
Louis Lecailliez
[1]
http://www2.imm.dtu.dk/pubdb/edoc/imm7154.pdf
------------------------------------------------------------------------
*De :* Abstract-Wikipedia
<abstract-wikipedia-bounces(a)lists.wikimedia.org> de la part de Denny
Vrandečić <dvrandecic(a)wikimedia.org>
*Envoyé :* mercredi 8 juillet 2020 22:37
*À :* General public mailing list for the discussion of Abstract
Wikipedia (aka Wikilambda) <abstract-wikipedia(a)lists.wikimedia.org>
*Objet :* Re: [Abstract-wikipedia] NLP issues severely overlooked
(Amir E. Aharoni)
Hi Louis, all,
Louis, thanks for raising that important issue!
I have been looking in a number of related NLG systems, and one thing
I noticed is a pattern of many of these projects being developed very
much in isolation from each other, and also often without much concern
for ongoing linguistic research. That is what I tried to capture in
the research paper by stating that there is no consensus on this, and
that it seems too early to commit to a specific solution.
I had given a quick look to UNL, but the project looked pretty stale
to me - I could not see any activity in the last decade. Furthermore,
the page didn't provide access to the source code and instead
mentioned that part of the technology is under patents, which is quite
a red flag for me, and I usually don't look into something like that
any further, in order to honestly be able to say that I didn't get any
ideas from those patents. If I am mistaken, and there is a freely
usable write-up or implementation, I'd be happy to come back and read
up more.
Thank you for the annotated bibliography! That is super useful.
But I did look into detail into a (small) number of other, similar
systems, such as Grammatical Framework or KPML. Tiago mentioned
FrameNet, and I learned a lot about that too. To get an overview of
the whole field has been a rather frustrating experience, especially
since the major textbook in that area - Dale & Reiter - doesn't cover
these systems, nor the 2018 update to that book by Gatt & Krahmer, and
it seems that research work in that area often omits these practical
systems. Accordingly, when I talk with the professors and researchers
in this area, also about the proposal here, they are more focussed on
specific issues, and don't know that much about the concrete systems
(which is understandable - the flow from research to practical systems
is a more established flow in many areas). Never mind that when you
get to the linguistic side of it, instead of the computer science
part, there are even more competing theories, many of which are aimed
toward much more encompassing goals and are about covering the whole
of language and natural language understanding, which we want to be
shying away from.
The goal of the paper was never meant to be a comprehensive account of
the state of the art in natural language generation. That's what Dale
& Reiter and Gatt & Krahmer have aimed for, and their works are
hundreds of pages. I had the feeling my paper was already too long,
and putting in an overview of the state of the art would have made it
at least double the length.
So, given that (and other reasons, as lined out in the paper), it
seems that a system which could support any of these approaches seemed
a more promising way. So far, for my own prototype, I have been mostly
following Grammatical Framework (because it has a very accessible
book, the software is free, the community was friendly, etc.), and it
worked good enough to leave me convinced that the whole thing is worth
trying out. But I don't know whether that's the best approach.
As mentioned by Chris Cooley, the goal will be to create a new wiki, a
library of functions, that can support any of these approaches. My
dream would be - and I see that Chris had already suggested that -
that experts like you and your colleagues create an overview of the
state of the art that will be accessible to the community and that
will allow us to make a well-informed decision when the time comes as
to which path to explore first. In a parallel track, we will be
creating the function wiki, and then, when the time is ripe we can
bring these two strands of work together. So, would you be willing to
work on that?
How does this sound for a plan?
Some further points:
> This is way easier to implement, test and
deliver than to implement
10 different backends with various progress in
implementation,
incompatibilities and runtime characteristics.
Regarding your point about evaluation environments: I agree, it would
be a huge task if the WMF core team were to develop all these
different environments. But that's not the plan. The goal is really
that *others* will hopefully build these :) All we need to do is to
make sure that's possible and encouraged and simple enough. But yeah,
not the core team.
> The paper presents AW as sitting on top on
WL. Both are big
projects. Sitting a big project on top of another one is really
risky,
as it means a significant milestone must first be reached in the
dependency (here WL), which would likely took some years, before even
starting the work on the other project.
Yes, that's correct. That is exactly the time that allows us do the
appropriate state of the art analysis. I hope it won't take us years,
but that we will be faster.
> AW can be realised with current tools and
engineering practices.
Only if you commit to a specific implementation, which I am hesitant
to do.
> [English is an obstacle to programming] This
strong affirmation
needs to be sourced.
https://dl.acm.org/doi/10.1145/3051457.3051464
> As I spend a significant time (~10 hours)
gathering references and
writing this email (which is 5 pages long in Word), I
would like to be
mentioned as co-author in the final paper if any idea or references
presented here is used in it.
Thank you for your detailed comments, which will certainly improve the
second version of the paper. I am happy to mention you in the
acknowledgments. For co-authorship, I usually go for a more
substantial engagement ;) If you're willing to write up the "Previous
work" section along the lines you mentioned above (maybe with Tiago?
Maybe with others to join?), but for a comprehensive overview of
existing systems, then I am open to talk about co-authorship :)
> For French, the gender of every noun entity
*must* be present ...
For Chinese and Japanese, classifier information must be
present for
all noun, in case one must be enumerated.
That's exactly the goal of the lexicographic project on Wikidata, as
was pointed out:
https://www.wikidata.org/wiki/Lexeme:L12449
You'll find plenty of Lexemes with their classifiers, forms, etc. The
lexicographic project was started with the Abstract Wikipedia in mind,
knowing that exactly that will be needed.
> Yet, the use of any existing formalism is
dismissed, which mean all
the situations I illustrated in this email will need to
be dealt with
in an ad hoc fashion.
No, not at all it doesn't have to be ad-hoc, that's exactly what we
can start working on now, long before we get to the point that we need
to make that ad-hoc decision. I hope you'll join us to figure out the
best way!
Thanks to Charles, Amir, Tiago, Christopher, Arthur, and Adam for your
beautiful answers, who raised a number of great replies much better
than I ever could. And thanks to Louis for starting this more than
interesting thread! Let's continue in this vein!
Cheers,
Denny
On Sun, Jul 5, 2020 at 9:49 PM Adam Sobieski <adamsobieski(a)hotmail.com
<mailto:adamsobieski@hotmail.com>> wrote:
Brainstorming: resembling what the document object model (DOM) [1]
is for XML and attributed trees, perhaps we could create and
specify an object model for sets of attributed predicate calculus
expressions.
With an attributed predicate calculus object model (e.g. “APCOM”)
for sets of attributed predicate calculus expressions:
{
r1.@a1(o1(icl>domain1).@a2, o2(icl>domain2).@a3).@a4
r2.@a5(o3(icl>domain3).@a6, o4(icl>domain4).@a7).@a8
r3.@a9(o5(icl>domain5).@a10, o6(icl>domain6).@a11,
o7(icl>domain7).@a12).@a13
}.@a14
developers could more conveniently utilize sets of attributed
predicate calculus expressions from JavaScript and Lua.
Drawing from XML, we can consider that objects, relations,
attributes could be, instead of plain text strings, uniform
resource identifiers (URI’s). “r1” could be a URI, “a1” could be a
URI, “o1” could be a URI, and so forth.
We can also consider that the attributes in a model could have values:
{
r1.[@a1=v1](o1(icl>domain1).[@a2=v2],
o2(icl>domain2).[@a3=v3]).[@a4=v4]
r2.[@a5=v5](o3(icl>domain3).[@a6=v6],
o4(icl>domain4).[@a7=v7]).[@a8=v8]
r3.[@a9=v9](o5(icl>domain5).[@a10=v10],
o6(icl>domain6).[@a11=v11], o7(icl>domain7).[@a12=v12]).[@a13=v13]
}.[@a14=v14]
We can consider creating a scripting API (e.g. “APCOM”) for a
semantic model to convenience developers. We can also consider
adding attribute-value pairs to a semantic model.
Best regards,
Adam
[1]
https://en.wikipedia.org/wiki/Document_Object_Model
*From: *Tiago Timponi Torrent <mailto:tiago.torrent@ufjf.edu.br>
*Sent: *Sunday, July 5, 2020 9:06 PM
*To: *General public mailing list for the discussion of Abstract
Wikipedia (aka Wikilambda)
<mailto:abstract-wikipedia@lists.wikimedia.org>
*Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked
(Amir E. Aharoni)
That’s a good idea, but I think you would need more than that.
Take FrameNet, for example, but now departing from verbs instead
of nouns. FrameNet has a very detailed model for dealing with
verbs, their semantic arguments and the way they surface in
morphosyntax. Nonetheless, to apply such a model in a text
comprehension and/or generation task, you need more than that. You
need to know prototypical fillers for the positions, which, in
turn, are associated to other frames and, therefore, participate
in other clusters of the network of frames. Moreover, you’d want
those prototypical fillers to function as departing points for
analogical extensions in the model, since not every sentence is a
prototypical combination of words. In other words, the collection
of attributes and relations you refer to should be defined in a
way that they can be analogically extended to other collections
not originally assigned to the item you’re looking at.
Cheers
Tiago
Em dom, 5 de jul de 2020 às 20:03, Arthur Smith
<arthurpsmith(a)gmail.com <mailto:arthurpsmith@gmail.com>> escreveu:
Yes, thank you for the UNL background, that is extremely
helpful. I've been reading some of the articles Louis provided
as references, and it seems to me from just this perhaps naive
point of view, that a lot of the complexity is associated with
disambiguation of meaning - for nouns I think Wikidata items
(and their relations to lexeme senses) solve that problem, but
we are still missing I think a lot of the detail needed to do
the same with adjectives and verbs (at least). So there is
definitely some room for finding better ways to model - but
maybe Wikidata could be expanded to handle the adjective/verb
cases too. In general the concept of a single meaning
associated with a Wikidata item as its identifier and a
collection of attributes and relationships attached to that
item is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski
<adamsobieski(a)hotmail.com <mailto:adamsobieski@hotmail.com>>
wrote:
Louis,
Thank you for the information about the Universal
Networking Language [1] and the World Atlas of Language
Structures [2].
Semantic Modeling
Do you opine that adding attributes to objects, relations
and expressions enhances expressiveness for various
features of natural language?
r.@a1.@a2(o1(icl>domain1).@a3.@a4,
o2(icl>domain2).@a5.@a6).@a7.@a8
I wonder whether there exist mappings or workarounds with
which to obtain such expressiveness for models such as
Wikidata’s.
Scripting Environments for Natural Language Generation
Supposing that Wikilambda could be JavaScript /
WebAssembly based, and observing that Lua / WebAssembly
solutions exist, we can note that scripting engines such
as V8 are easy to use and to add global objects and API
to. Resembling how Web browsers provide scripting
environments and API for functions, we can envision
providing scripting environments and API for natural
language generation functions.
I wonder what you might think about scripting environments
and API for natural language generation scenarios?
Best regards,
Adam
[1]
https://en.wikipedia.org/wiki/Universal_Networking_Language
[2]
https://wals.info/ <https://wals.info/>
*From: *Louis Lecailliez <mailto:louis.lecailliez@outlook.fr>
*Sent: *Saturday, July 4, 2020 2:10 PM
*To: *abstract-wikipedia(a)lists.wikimedia.org
<mailto:abstract-wikipedia@lists.wikimedia.org>
*Subject: *Re: [Abstract-wikipedia] NLP issues severely
overlooked (Amir E. Aharoni)
Hi Amir,
I understand the process is different that usual research.
In fact I've seen Wikipedia grown from an unknown website
to the biggest encyclopedia it is now. I use it daily in
multiple languages and love it. I know what crowd sourcing
could achieve.
It's also possible that the mere *finding* of
these stumbling
blocks by such a big, diverse, open, and active
community,
will itself be a contribution to the scientific knowledge
around this subject.
I disagree here. It would be contribution to scientic
knowledge if and only if it wasn't discovered before. My
email was precisely about that: capitalizing on the
knowledge that has already been discovered, to avoid
making the same mistake them again. It would not matter
for a small project, but this one is really ambitious. We
are speaking of 40 years of work by a horde of talented
and very knowledgeable people, so this isn't to be
dismissed easily.
This thing is, my previous email was a bit abstract,
because it were a review of the paper, not of the project
itself. I should have made more examples to illustrate
where the problem lies.
Let's start with a simple example, in English, with
corresponding Wikidata entities in-between parenthesis.
I'm also using pseudo-turtle notation with made up
relationships.
France (Q142) is a country (Q6256).
<Q142> <rel_is> <Q6256> .
Creating the English sentence is straightforward with the
naive approach presented in the paper.
What is the French equivalent?
La France est un pays.
More information is required in the abstract
representation: the text generator needs to know about the
gender of both nouns (France and pays). So we need to
extend the model as such:
<Q142> <rel_gender> <Q1775415> .
<Q6256> <rel_gender> <Q499327> .
Fine! Now what about Chinese?
法國是一個國家。
What we have in the middle of the sentence is a classifier
(個). The model needs the following update:
<Q499327> <rel_use_classifier> <Q63153> .
To handle these 3 languages, the model has already 3
additional triples just for accounting for linguistic
facts occuring in these languages. Wikipedia exists in
more than 300 languages, and the world has about 6000 of
them, each of them having particularities that must be
taken into account. Fortunately they recoup themselves
in-between languages. Nonetheless the World Atlas Language
Structures (
https://wals.info/chapter/s1) count 144
distinct language features. Some are related to speech,
but this means there is probably something like a hundred
features that must be taken into account in the data model
to produce valid natural language sentence.
Note that in the Chinese example, there is also a number
(一, one) showing up. This is a phenomenon that must be
taken into account; and it's not always appearing when
using 是(to be). How complex the "lambda" system will
be just to deal with this issue? Hint: very much. It also
needs to be compatible with dozen of other phenomena.
Then each of those features require extensive and complete
data. For French, the gender of every noun entity *must*
be present, otherwise there is half a chance of producing
a wrong sentence each time a noun entity is encountered.
For Chinese and Japanese, classifier information must be
present for all noun, in case one must be enumerated.
Where does the project will get the data from? (we are
speaking of millions of item, most not referenced in
existing dictionaries) How will this be encoded? Those are
real questions that must be answered.
Suppose now we have done the work for "renderers" in these
three languages. They both use the more or less similar A
X B structure where X is a verb meaning "to be".
What would be the Japanese equivalent?
The more natural structure would be like:
フランスは国(だ)。
What is a play here is a topicalization (Q63105) of
France, followed by a predicate (it's a country). This is
very different from the previous structure, which, not
surprisingly enough, needs it's own representation. To
make situation more difficult, the previous (A be B)
structure can also exists in Japanese, but would lead to a
totally different sentence if used.
The paper states that Figure 1 and 2 are examples that
will be more complex in real life. Yet, the use of any
existing formalism is dismissed, which mean all the
situations I illustrated in this email will need to be
dealt with in an ad hoc fashion. Moreover, changing
formalism (be it ad hoc or not) will require to change
every piece of code/data using it. This will happen
everytime a language with unsupported feature(s) is added
to the project. It's not hard to see how this will waste a
huge amount of time and goodwill from involved people. The
very code focussed tone of the paper, the english-centric
approach used in the examples and the lack of references
shows that the complexity of the task on the NLP front is
not sufficiently conceptualized.
Best Regards,
Louis Lecailliez
*De :*Abstract-Wikipedia
<abstract-wikipedia-bounces(a)lists.wikimedia.org
<mailto:abstract-wikipedia-bounces@lists.wikimedia.org>>
de la part de
abstract-wikipedia-request(a)lists.wikimedia.org
<mailto:abstract-wikipedia-request@lists.wikimedia.org><abstract-wikipedia-request@lists.wikimedia.org
<mailto:abstract-wikipedia-request@lists.wikimedia.org>>
*Envoyé :* samedi 4 juillet 2020 15:06
*À :* abstract-wikipedia(a)lists.wikimedia.org
<mailto:abstract-wikipedia@lists.wikimedia.org><abstract-wikipedia@lists.wikimedia.org
<mailto:abstract-wikipedia@lists.wikimedia.org>>
*Objet :* Abstract-Wikipedia Digest, Vol 1, Issue 6
Send Abstract-Wikipedia mailing list submissions to
abstract-wikipedia(a)lists.wikimedia.org
<mailto:abstract-wikipedia@lists.wikimedia.org>
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia
or, via email, send a message with subject or body 'help' to
abstract-wikipedia-request(a)lists.wikimedia.org
<mailto:abstract-wikipedia-request@lists.wikimedia.org>
You can reach the person managing the list at
abstract-wikipedia-owner(a)lists.wikimedia.org
<mailto:abstract-wikipedia-owner@lists.wikimedia.org>
When replying, please edit your Subject line so it is more
specific
than "Re: Contents of Abstract-Wikipedia digest..."
Today's Topics:
1. Re: NLP issues severely overlooked (Charles Matthews)
2. Use case: generation of short description (Jakob Voß)
3. Re: NLP issues severely overlooked (Amir E. Aharoni)
----------------------------------------------------------------------
Message: 1
Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST)
From: Charles Matthews <charles.r.matthews(a)ntlworld.com
<mailto:charles.r.matthews@ntlworld.com>>
To: "General public mailing list for the discussion of
Abstract
Wikipedia (aka Wikilambda)"
<abstract-wikipedia(a)lists.wikimedia.org
<mailto:abstract-wikipedia@lists.wikimedia.org>>
Subject: Re: [Abstract-wikipedia] NLP issues severely
overlooked
Message-ID:
<2126327926.39940.1593867909152(a)mail2.virginmedia.com
<mailto:2126327926.39940.1593867909152@mail2.virginmedia.com>>
Content-Type: text/plain; charset="utf-8"
It is interesting to be on a list where one can hear about
software issues, and then computational linguistic
problems. I'm not an expert in either area.
I do have 17 years of varied Wikimedia experience (and I
use my real name there).
On 04 July 2020 at 12:25 Louis Lecailliez
<louis.lecailliez(a)outlook.fr
<mailto:louis.lecailliez@outlook.fr>> wrote:
<snip>
Nothing precise is said about linguistic
resources
in the AW paper except for "These function finally can
call the lexicographic knowlegde stored in Wikidata.",
which is not very convincing: first because Wiktionary
projects themselves severely lacks content and structure
for those who has some content at all, secondly since
specialized NLP ressources are missing there too (note:
I'm not familiar with Wikidata so I could be wrong,
however I never saw it cited for the kind of NLP resources
I'm talking about).
I can comment about this. Besides Wiktionary, there is the
"lexeme" namespace of Wikidata. It is a relatively new
part of Wikidata, dealing with verbal forms.
To finish on a positive note, I would like to
highlight
the points I really like in the paper. First, its
collaborative and open nature, like all Wikimedia
projects, gives him a much higher chance of success than
its predecessors.
It is worth saying, for context, that there is a certain
style or philosophy coming from the wiki side: more
precisely, from the wikis before Wikipedia. There is the
slogan "what is the simplest thing that would actually
work?" You might argue, plausibly, that Wikipedia at
nearly 20 years old, shows that there is a bit more to
engineering than that.
On the other hand, looking at Wikidata at seven years old,
there is some point to the comment. It has a rather simple
approach to linked structured data, compared to the
Semantic Web environment. (Really, just write a very large
piece of JSON and try to cope with it!) But the number of
binary relations used (8K, if you count the "external
links" handling) is now quite large, and has grown
organically. The data modelling is in a sense primitive,
sometimes non-existent. But the range of content handled
really is encyclopedic. And in an area like scientific
bibliography, at a scale of tens of millions of entities,
the advantages of not much ontological fussiness begin to
be seen.
Wikidata started as an index of all Wikipedia articles,
and is now five times the size needed for that: a very
enriched "index".
I suppose the NLP required to code up, for example, 50K
chemistry articles about molecules, might be a problem
that could be solved, leaving aside the general problems
for the moment.
In any case, there is a certain approach, neither academic
nor commercial, that comes with Wikimedia and its
communities, and it will be interesting to see how the
issues are addressed.
Charles Matthews (in Cambridge UK)