The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-03 

Due to the embedded table it might be easier to read on-wiki.

--

This week's update was written by Mahir Morshed. Mahir is a long-time contributor to Wikidata, and in particular to its lexicographical data. He has developed a prototype that generates natural language in Bengali and Swedish from an abstract content representation, with the goal that it could be implemented within Wikifunctions. In this newsletter, Mahir describes the prototype.


Discussion around Abstract Wikipedia's natural language generation capabilities has revolved around the presence of abstract constructors and concrete renderers per language, while also noting the use of Wikidata items and lexemes as a basis for mapping concepts to language. In the interest of making this connection a bit clearer to imagine, I have started to build a text generation system. This uses items, lexemes, and wrappers for them as building blocks, and these blocks are then assembled into syntactic trees, based in part on the Universal Dependencies syntactic annotation scheme.
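To make the idea of a syntactic tree a little more concrete, here is a minimal sketch in Python. It is not the prototype's code: the `Node` class, its fields, and the `linearize` function are hypothetical names chosen for illustration, and the word forms are typed in by hand rather than fetched from Wikidata lexemes. The relation labels (`nsubj`, `cop`, `det`) are the Universal Dependencies relations for a subject, a copula, and a determiner.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One word in a dependency tree (hypothetical structure for illustration)."""
    form: str    # surface form; a real system would take this from a lexeme
    deprel: str  # Universal Dependencies relation to the head word
    left: List["Node"] = field(default_factory=list)   # dependents placed before the head
    right: List["Node"] = field(default_factory=list)  # dependents placed after the head


def linearize(node: Node) -> List[str]:
    """Flatten a tree into words: left dependents, then the head, then right dependents."""
    words: List[str] = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.form)
    for child in node.right:
        words.extend(linearize(child))
    return words


# In Universal Dependencies, the predicate noun heads a copular clause,
# so "planet" is the root and "is" attaches to it as a copula.
tree = Node(
    "planet",
    "root",
    left=[Node("Jupiter", "nsubj"), Node("is", "cop"), Node("a", "det")],
)

sentence = " ".join(linearize(tree)) + "."
print(sentence)
```

Keeping word order out of the node labels and in the left/right lists is only one possible design; it is used here because it keeps the traversal short.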

(If this seems like a different approach from what was discussed in a newsletter two months prior, that's because it is. Feel free to drop me a message if you'd like to discuss it.)

The system is composed of three parts, where the last is likely to be something we could skip in a port to Wikifunctions:

Some design choices in this system worth noting are as follows:

At the moment, there are just enough constructors to represent Sentence 1.1 from the Jupiter examples, as well as renderers in Bengali and Swedish for those constructors (thanks to Bodhisattwa, Jan, and Dennis for feedback on those). Building up to the Jupiter sentence should demonstrate how these work:

Building up to the Jupiter sentence step by step

Each entry gives the constructor text, the Bengali output, the Swedish output, a gloss (not renderer output!), and notes.

Constructor text:
Identification(
  Concept(Q(319)),
  Concept(Q(634)))
Bengali output: বৃহস্পতি গ্রহ।
Swedish output: Jupiter är klot.
Gloss: Jupiter is planet.
Notes: We start by simply identifying the two concepts of Jupiter (Q319) and planet (Q634) as being equal.

Constructor text:
Identification(
  Concept(Q(319)),
  Instance(
    Concept(Q(634))))
Bengali output: বৃহস্পতি একটা গ্রহ।
Swedish output: Jupiter är ett klot.
Gloss: Jupiter is a planet.
Notes: Instead of equating the concepts alone, we might instead equate "Jupiter" with an instance of "planet".

Constructor text:
Identification(
  Concept(Q(319)),
  Instance(
    Concept(Q(634)),
    Definite()))
Bengali output: বৃহস্পতি গ্রহটি।
Swedish output: Jupiter är klotet.
Gloss: Jupiter is the planet.
Notes: We may further refine that by making clear that "Jupiter" is a definite instance of "planet".

Constructor text:
Identification(
  Concept(Q(319)),
  Instance(
    Attribution(
      Concept(Q(634)),
      Concept(Q(59863338))),
    Definite()))
Bengali output: বৃহস্পতি বড় গ্রহটা।
Swedish output: Jupiter är det stora klotet.
Gloss: Jupiter is the large planet.
Notes: Now we might ascribe an attribute to the definite planet instance in question, this attribute being large (Q59863338).

Constructor text:
Identification(
  Concept(Q(319)),
  Instance(
    Attribution(
      Concept(Q(634)),
      Superlative(
        Concept(Q(59863338)))),
    Definite()))
Bengali output: বৃহস্পতি সবচেয়ে বড় গ্রহটি।
Swedish output: Jupiter är det största klotet.
Gloss: Jupiter is the largest planet.
Notes: This attribute being superlative for Jupiter can be marked by modifying the attribute.

Constructor text:
Identification(
  Concept(Q(319)),
  Instance(
    Attribution(
      Concept(Q(634)),
      Superlative(
        Concept(Q(59863338)),
        Locative(
          Concept(Q(544))))),
    Definite()))
Bengali output: বৃহস্পতি সৌরমণ্ডলে সবচেয়ে বড় গ্রহ।
Swedish output: Jupiter är den största planeten i solsystemet.
Gloss: Jupiter is the largest planet in the solar system.
Notes: Once we specify the location where Jupiter being the largest applies (that is, in the Solar System (Q544)), we're done!
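
The constructor notation above happens to be valid Python call syntax, so the nesting can be sketched with a handful of small classes. This is an illustrative sketch only: the class names mirror the constructor names in the examples, but the shared `Constructor` base class and the `items` helper are assumptions made here, not the prototype's actual implementation, and no rendering is attempted.

```python
class Constructor:
    """Generic node that stores its arguments (hypothetical base class)."""

    def __init__(self, *args):
        self.args = args

    def __repr__(self):
        inner = ", ".join(repr(arg) for arg in self.args)
        return f"{type(self).__name__}({inner})"


# One subclass per constructor name used in the examples above.
class Q(Constructor): pass
class Concept(Constructor): pass
class Instance(Constructor): pass
class Definite(Constructor): pass
class Attribution(Constructor): pass
class Superlative(Constructor): pass
class Locative(Constructor): pass
class Identification(Constructor): pass


def items(node):
    """Collect the Wikidata item numbers in a constructor tree, in order of appearance."""
    if isinstance(node, Q):
        return [node.args[0]]
    found = []
    for arg in node.args:
        if isinstance(arg, Constructor):
            found.extend(items(arg))
    return found


# The final constructor from the examples: Jupiter as the largest planet in the Solar System.
sentence = Identification(
    Concept(Q(319)),
    Instance(
        Attribution(
            Concept(Q(634)),
            Superlative(
                Concept(Q(59863338)),
                Locative(
                    Concept(Q(544))))),
        Definite()))

print(sentence)
print(items(sentence))
```

Walking the tree this way shows that the abstract content refers only to language-independent items; everything language-specific is left to the renderers.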

Note that the sense resolution system does not have enough information to choose between '-টা' and '-টি' (for Bengali) or between 'klot' and 'planet' (for Swedish) in some of these examples, so the prototype currently picks one at random. Re-rendering any example that involves these choices may therefore produce different output.
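
The effect of that random choice can be sketched as follows. This is a hypothetical illustration, not the prototype's sense resolution code: the `CANDIDATES` table, its keys, and the `choose` function are made up for this example, and only the candidate forms themselves come from the note above.

```python
import random

# Candidate forms named in the note above; the keys and grouping are assumptions.
CANDIDATES = {
    ("planet", "sv"): ["klot", "planet"],
    ("definite marker", "bn"): ["-টা", "-টি"],
}


def choose(key, rng):
    """Pick one candidate at random, as a stand-in for missing sense information."""
    return rng.choice(CANDIDATES[key])


# An unseeded generator may give a different form on each rendering.
unseeded = random.Random()
print(choose(("planet", "sv"), unseeded))

# Two generators with the same seed always agree, which makes output repeatable.
first = choose(("planet", "sv"), random.Random(2021))
second = choose(("planet", "sv"), random.Random(2021))
print(first == second)
```

Seeding is shown only to make clear where the variation comes from; a better fix would be to give the system enough information to choose deliberately.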

Besides this, there is clearly a lot more functionality to be added, and because Bengali and Swedish are both Indo-European languages (however distant), there are likely linguistic phenomena that won't be considered simply by developing renderers for those two languages alone. If there's something particular in your language that isn't present in those two languages, this may then raise the question: what can you do for your language?

I can think of at least four things, not in any particular order:

We'd love to hear your thoughts on this prototype, and what it might mean for realizing Abstract Wikipedia through Wikidata's lexicographic data and Wikifunctions's platform.


Thank you Mahir for the great update! If you too want to contribute to the weekly, get in touch. This is a project we all build together.

In addition, this week Slate published a great explanatory article on the goals of Abstract Wikipedia and Wikifunctions: "Wikipedia Is Trying to Transcend the Limits of Human Language".