That’s a good idea, but I think you would need more than that. Take
FrameNet, for example, but now starting from verbs instead of nouns.
FrameNet has a very detailed model for dealing with verbs, their semantic
arguments and the way they surface in morphosyntax. Nonetheless, to apply
such a model in a text comprehension and/or generation task, you need more
than that. You need to know prototypical fillers for the positions, which,
in turn, are associated with other frames and, therefore, participate in
other clusters of the network of frames. Moreover, you’d want those
prototypical fillers to function as starting points for analogical
extensions in the model, since not every sentence is a prototypical
combination of words. In other words, the collection of attributes and
relations you refer to should be defined in such a way that it can be
analogically extended to other collections not originally assigned to the
item you’re looking at.
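For instance, here is a very rough sketch (plain JavaScript, loosely
inspired by a FrameNet-style frame; the structure and names are invented,
not an existing API) of the kind of entry I have in mind:

// Hypothetical frame entry: each role lists prototypical fillers that are
// themselves frames elsewhere in the network, plus a hook for extending
// the role to non-prototypical fillers by analogy/similarity.
const commerceBuy = {
  frame: "Commerce_buy",
  roles: {
    Buyer:  { prototypes: ["People"],               extendBy: "similarity" },
    Goods:  { prototypes: ["Artifact", "Food"],     extendBy: "similarity" },
    Seller: { prototypes: ["People", "Businesses"], extendBy: "similarity" }
  }
};
// A filler not listed above (say, "Software" as Goods) would be accepted
// only if it is close enough, in the frame network, to one of the prototypes.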
Cheers
Tiago
On Sun, Jul 5, 2020 at 8:03 PM Arthur Smith <arthurpsmith(a)gmail.com>
wrote:
Yes, thank you for the UNL background, that is extremely helpful. I've
been reading some of the articles Louis provided as references, and it
seems to me, from this perhaps naive point of view, that a lot of the
complexity is associated with disambiguation of meaning. For nouns, I
think Wikidata items (and their relations to lexeme senses) solve that
problem, but I think we are still missing a lot of the detail needed to do
the same for adjectives and verbs (at least). So there is definitely some
room for finding better ways to model - but maybe Wikidata could be
expanded to handle the adjective/verb cases too. In general, the concept
of a single meaning associated with a Wikidata item as its identifier,
with a collection of attributes and relationships attached to that item,
is a powerful one that could resolve many such issues.
Arthur
On Sun, Jul 5, 2020 at 6:55 PM Adam Sobieski <adamsobieski(a)hotmail.com>
wrote:
> Louis,
>
>
>
> Thank you for the information about the Universal Networking Language [1]
> and the World Atlas of Language Structures [2].
>
>
> Semantic Modeling
>
>
>
> In your opinion, does adding attributes to objects, relations, and
> expressions enhance expressiveness for various features of natural
> language?
>
>
>
> r.@a1.@a2(o1(icl>domain1).@a3.@a4, o2(icl>domain2).@a5.@a6).@a7.@a8
>
>
>
> I wonder whether there exist mappings or workarounds with which to obtain
> such expressiveness for models such as Wikidata’s.
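>
> Purely as an illustration (the property names below are invented, not
> Wikidata's), such an attributed relation could be written down as a
> nested structure, e.g. in JavaScript:
>
> // Attributes attached to the relation, to each argument, and to the whole
> // expression, mirroring r.@a1.@a2(o1(icl>domain1).@a3.@a4, ...).@a7.@a8
> const expression = {
>   relation: "r",
>   relationAttributes: ["a1", "a2"],
>   args: [
>     { object: "o1", restriction: "icl>domain1", attributes: ["a3", "a4"] },
>     { object: "o2", restriction: "icl>domain2", attributes: ["a5", "a6"] }
>   ],
>   expressionAttributes: ["a7", "a8"]
> };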
>
>
> Scripting Environments for Natural Language Generation
>
>
>
> Supposing that Wikilambda could be JavaScript / WebAssembly based, and
> observing that Lua / WebAssembly solutions exist, we can note that
> scripting engines such as V8 are easy to embed and easy to extend with
> global objects and APIs. Just as Web browsers provide scripting
> environments and APIs for functions, we can envision providing scripting
> environments and APIs for natural language generation functions.
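>
> As a purely hypothetical sketch (the nlg object and its methods are
> invented here, not an existing API), an embedded script might then look
> like this:
>
> // The host engine injects a global `nlg` object, much as browsers inject
> // `document`; the user script only composes calls against that API.
> function describeCountry(item) {
>   const name = nlg.lexeme(item);                       // lexical data for the item
>   return nlg.render("{X} is a country.", { X: name }); // realized in the target language
> }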
>
>
>
> I wonder what you might think about scripting environments and APIs for
> natural language generation scenarios.
>
>
>
>
>
> Best regards,
>
> Adam
>
>
>
> [1] https://en.wikipedia.org/wiki/Universal_Networking_Language
>
> [2] https://wals.info/
>
>
>
> *From: *Louis Lecailliez <louis.lecailliez(a)outlook.fr>
> *Sent: *Saturday, July 4, 2020 2:10 PM
> *To: *abstract-wikipedia(a)lists.wikimedia.org
> *Subject: *Re: [Abstract-wikipedia] NLP issues severely overlooked (Amir
> E. Aharoni)
>
>
>
> Hi Amir,
>
>
>
> I understand the process is different from usual research. In fact, I've
> seen Wikipedia grow from an unknown website into the biggest encyclopedia
> it is now. I use it daily in multiple languages and love it. I know what
> crowdsourcing can achieve.
>
>
>
> > It's also possible that the mere *finding* of these stumbling blocks by
> such a big, diverse, open, and active community, will itself be a
> contribution to the scientific knowledge around this subject.
>
>
>
> I disagree here. It would be a contribution to scientific knowledge if
> and only if it wasn't discovered before. My email was precisely about
> that: capitalizing on the knowledge that has already been discovered, to
> avoid making the same mistakes again. It would not matter for a small
> project, but this one is really ambitious. We are speaking of 40 years of
> work by a horde of talented and very knowledgeable people, so this isn't
> to be dismissed easily.
>
>
>
> The thing is, my previous email was a bit abstract, because it was a
> review of the paper, not of the project itself. I should have given more
> examples to illustrate where the problem lies.
>
>
>
> Let's start with a simple example, in English, with corresponding
> Wikidata entities in parentheses. I'm also using pseudo-Turtle notation
> with made-up relationships.
>
>
>
> France (Q142) is a country (Q6256).
>
> <Q142> <rel_is> <Q6256> .
>
>
>
> Creating the English sentence is straightforward with the naive approach
> presented in the paper.
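>
> (As a sketch in JavaScript, with an assumed labelOf() helper returning
> the English label of an item, it is little more than concatenation:)
>
> function renderIsA(subject, object) {
>   return labelOf(subject) + " is a " + labelOf(object) + ".";
> }
> // renderIsA("Q142", "Q6256") -> "France is a country."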
>
>
>
> What is the French equivalent?
>
> La France est un pays.
>
>
>
> More information is required in the abstract representation: the text
> generator needs to know about the gender of both nouns (France and pays).
> So we need to extend the model as follows:
>
>
>
> <Q142> <rel_gender> <Q1775415> .
>
> <Q6256> <rel_gender> <Q499327> .
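>
> (Again as a sketch, a French renderer would have to consult those gender
> triples to choose the articles; labelFr() and genderOf() are assumed
> helpers, and elision, plural, etc. are ignored:)
>
> function renderIsAFr(subject, object) {
>   const art = (g) => (g === "Q1775415" ? ["La ", "une "] : ["Le ", "un "]); // feminine / masculine
>   return art(genderOf(subject))[0] + labelFr(subject) + " est " +
>          art(genderOf(object))[1] + labelFr(object) + ".";
> }
> // renderIsAFr("Q142", "Q6256") -> "La France est un pays."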
>
>
>
> Fine! Now what about Chinese?
>
> 法國是一個國家。
>
>
>
> What we have in the middle of the sentence is a classifier (個). The
> model needs the following update:
>
>
>
> <Q499327> <rel_use_classifier> <Q63153> .
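>
> (A corresponding sketch for Chinese, where classifierOf() is assumed to
> follow the new classifier triple and return 個:)
>
> function renderIsAZh(subject, object) {
>   return labelZh(subject) + "是一" + classifierOf(object) + labelZh(object) + "。";
> }
> // renderIsAZh("Q142", "Q6256") -> "法國是一個國家。"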
>
>
>
> To handle these 3 languages, the model already needs 3 additional triples
> just to account for linguistic facts occurring in these languages.
> Wikipedia exists in more than 300 languages, and the world has about 6000
> of them, each with particularities that must be taken into account.
> Fortunately, many of these features overlap across languages. Nonetheless,
> the World Atlas of Language Structures (https://wals.info/chapter/s1)
> counts 144 distinct language features. Some are related to speech, but
> this means there are probably something like a hundred features that must
> be taken into account in the data model to produce valid natural language
> sentences.
>
> Note that in the Chinese example, a number (一, one) also shows up. This
> is a phenomenon that must be taken into account, and it does not always
> appear when using 是 (to be). How complex will the "lambda" system have
> to be just to deal with this issue? Hint: very. It also needs to be
> compatible with dozens of other phenomena.
>
>
>
> Then each of those features requires extensive and complete data. For
> French, the gender of every noun entity *must* be present, otherwise
> there is a one-in-two chance of producing a wrong sentence each time a
> noun entity is encountered. For Chinese and Japanese, classifier
> information must be present for every noun, in case one must be
> enumerated. Where will the project get the data from? (We are speaking of
> millions of items, most not referenced in existing dictionaries.) How
> will this be encoded? Those are real questions that must be answered.
>
>
>
> Suppose now we have done the work for "renderers" in these three
> languages. They all use a more or less similar A X B structure, where X
> is a verb meaning "to be".
>
>
>
> What would be the Japanese equivalent?
>
> The more natural structure would be something like:
>
> フランスは国(だ)。
>
>
>
> What is at play here is a topicalization (Q63105) of France, followed by
> a predicate (it's a country). This is very different from the previous
> structure and, not surprisingly, needs its own representation. To make
> the situation more difficult, the previous (A be B) structure also exists
> in Japanese, but would lead to a totally different sentence if used.
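>
> (A sketch of a Japanese renderer makes the structural difference visible;
> it is topic + predicate rather than A-be-B, with labelJa() assumed:)
>
> function renderIsAJa(subject, object) {
>   return labelJa(subject) + "は" + labelJa(object) + "だ。";  // topicalization with は
> }
> // renderIsAJa("Q142", "Q6256") -> "フランスは国だ。"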
>
>
>
> The paper states that Figures 1 and 2 are examples that will be more
> complex in real life. Yet the use of any existing formalism is dismissed,
> which means all the situations I illustrated in this email will need to
> be dealt with in an ad hoc fashion. Moreover, changing formalism (be it
> ad hoc or not) will require changing every piece of code/data using it.
> This will happen every time a language with unsupported feature(s) is
> added to the project. It's not hard to see how this will waste a huge
> amount of time and goodwill from the people involved. The very
> code-focused tone of the paper, the English-centric approach used in the
> examples, and the lack of references show that the complexity of the task
> on the NLP front is not sufficiently conceptualized.
>
>
>
> Best Regards,
>
> Louis Lecailliez
>
>
>
> *From: *Abstract-Wikipedia <abstract-wikipedia-bounces(a)lists.wikimedia.org>
> on behalf of abstract-wikipedia-request(a)lists.wikimedia.org <
> abstract-wikipedia-request(a)lists.wikimedia.org>
> *Sent: *Saturday, July 4, 2020 3:06 PM
> *To: *abstract-wikipedia(a)lists.wikimedia.org <
> abstract-wikipedia(a)lists.wikimedia.org>
> *Subject: *Abstract-Wikipedia Digest, Vol 1, Issue 6
>
>
>
>
> Today's Topics:
>
> 1. Re: NLP issues severely overlooked (Charles Matthews)
> 2. Use case: generation of short description (Jakob Voß)
> 3. Re: NLP issues severely overlooked (Amir E. Aharoni)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 4 Jul 2020 14:05:09 +0100 (BST)
> From: Charles Matthews <charles.r.matthews(a)ntlworld.com>
> To: "General public mailing list for the discussion of Abstract
> Wikipedia (aka Wikilambda)" <
> abstract-wikipedia(a)lists.wikimedia.org>
> Subject: Re: [Abstract-wikipedia] NLP issues severely overlooked
>
> It is interesting to be on a list where one can hear about software
> issues, and then computational linguistic problems. I'm not an expert in
> either area.
>
> I do have 17 years of varied Wikimedia experience (and I use my real name
> there).
>
> > On 04 July 2020 at 12:25 Louis Lecailliez <louis.lecailliez(a)outlook.fr>
> wrote:
> >
>
> <snip>
>
> > Nothing precise is said about linguistic resources in the AW paper
> except for "These function finally can call the lexicographic knowledge
> stored in Wikidata.", which is not very convincing: first because
> Wiktionary projects themselves severely lack content and structure, for
> those that have any content at all; secondly because specialized NLP
> resources are missing there too (note: I'm not familiar with Wikidata so
> I could be wrong; however, I have never seen it cited for the kind of NLP
> resources I'm talking about).
> >
>
> I can comment about this. Besides Wiktionary, there is the "lexeme"
> namespace of Wikidata. It is a relatively new part of Wikidata, dealing
> with verbal forms.
>
>
> >To finish on a positive note, I would like to highlight the points I
> really like in the paper. First, its collaborative and open nature, like
> all Wikimedia projects, gives it a much higher chance of success than its
> predecessors.
>
> It is worth saying, for context, that there is a certain style or
> philosophy coming from the wiki side: more precisely, from the wikis before
> Wikipedia. There is the slogan "what is the simplest thing that would
> actually work?" You might argue, plausibly, that Wikipedia, at nearly 20
> years old, shows that there is a bit more to engineering than that.
>
> On the other hand, looking at Wikidata at seven years old, there is some
> point to the comment. It has a rather simple approach to linked structured
> data, compared to the Semantic Web environment. (Really, just write a very
> large piece of JSON and try to cope with it!) But the number of binary
> relations used (8K, if you count the "external links" handling) is now
> quite large, and has grown organically. The data modelling is in a sense
> primitive, sometimes non-existent. But the range of content handled really
> is encyclopedic. And in an area like scientific bibliography, at a scale of
> tens of millions of entities, the advantages of not much ontological
> fussiness begin to be seen.
>
> Wikidata started as an index of all Wikipedia articles, and is now five
> times the size needed for that: a very enriched "index".
>
> I suppose the NLP required to code up, for example, 50K chemistry
> articles about molecules, might be a problem that could be solved, leaving
> aside the general problems for the moment.
>
> In any case, there is a certain approach, neither academic nor
> commercial, that comes with Wikimedia and its communities, and it will be
> interesting to see how the issues are addressed.
>
> Charles Matthews (in Cambridge UK)
>