The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-03
Due to the embedded table it might be easier to read on-wiki.
--
The update this week has been written by Mahir Morshed
<https://meta.wikimedia.org/wiki/User:Mahir256>. Mahir is a long-time
contributor to Wikidata, particularly to its lexicographical data. He has
developed a prototype that generates natural language in Bengali and
Swedish from an abstract content representation, with the goal that the
approach could be implemented within Wikifunctions. In this newsletter,
Mahir describes the prototype.
------------------------------
Discussion around Abstract Wikipedia's natural language generation
capabilities has revolved around abstract constructors and per-language
concrete renderers, while also noting the use of Wikidata items and
lexemes as a basis for mapping concepts to language. To make this
connection a bit easier to imagine, I have started to build a text
generation system. It uses items, lexemes, and wrappers for them as
building blocks, which are then assembled into syntactic trees based in
part on the Universal Dependencies
<https://universaldependencies.org/> syntactic annotation scheme.
(If this seems like a different approach from what was discussed in a
newsletter two months prior
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-06-24>,
that's because it is. Feel free to drop me a message if you'd like to
discuss it.)
The system is composed of three parts, where the last is likely to be
something we could skip in a port to Wikifunctions:
- Ninai <https://bitbucket.org/mmorshe2/ninai/> (from the Classical
Tamil for "to think") holds all constructors, logic at a sufficiently high
level for renderers, and a resolution system from items (each wrapped in a
"Concept" object) to sense IDs for a given language. Decisions and actions
in Ninai are meant to be agnostic to the methods for text formation
underneath, which are supplied by...
- Udiron <https://bitbucket.org/mmorshe2/udiron/> (from the Bengali
pronunciation of the Sanskrit for "communicating, saying, speaking"), which
holds lower-level text manipulation functions for specific languages. These
functions operate on syntactic trees of lexemes (each lexeme wrapped in a
"Clause" object). These lexemes are imported via...
- tfsl <https://phabricator.wikimedia.org/source/tool-twofivesixlex/> (from
"twofivesixlex"), a lexeme manipulation tool intended to be akin to
pywikibot but with a specific focus on handling Wikibase objects. Both of
the components above depend on this one, although if 'native' item and
lexeme access and manipulation become possible with Wikifunctions
built-ins, then tfsl could possibly be omitted.
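To make the division of labor a bit more concrete, here is a minimal sketch of the item-to-sense resolution idea described for Ninai. The table contents, class layout, and method names below are invented for illustration only and do not reflect Ninai's actual API; in the real system, the lexeme data behind such a mapping would be fetched through tfsl.

```python
# Hypothetical sketch of Ninai's item-to-sense resolution; the mapping
# below and the method names are invented for illustration only.

# Wikidata item ID -> {language code: a lexeme sense for that item}
SENSE_TABLE = {
    "Q634": {"bn": "(a Bengali sense of 'planet')",
             "sv": "(a Swedish sense of 'planet')"},
}

class Concept:
    """Wraps a Wikidata item so it can be resolved per language."""
    def __init__(self, qid):
        self.qid = qid

    def resolve(self, language):
        # Return a sense for this item in the requested language,
        # or None if no mapping is known yet.
        return SENSE_TABLE.get(self.qid, {}).get(language)

print(Concept("Q634").resolve("sv"))  # a Swedish sense, if mapped
print(Concept("Q634").resolve("de"))  # None: no mapping yet
```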
Some design choices in this system worth noting are as follows:
- Constructors, while language-agnostic and organized into a class
hierarchy, are purely containers for their arguments, carrying no other
logic. This means, for example, that an instance of a constructor
Existence(subject), indicating that the subject in question exists, only
holds that subject within that instance, and does nothing else until a
renderer encounters that constructor.
- Every constructor allows, in addition to any required inputs, a list
of extra modifiers in any order (the 'scope' of the idea represented by
that constructor). This means, for example, that a constructor
Benefaction(benefactor, beneficiary) might be invoked with extra
arguments for the time, place, mode, and other specifiers after the
beneficiary.
- When one 'renders' a composition of constructors, a Clause object
(representing the root of a syntactic tree) is returned; turning it into a
string of text is done with Python's str() built-in applied to that
object.
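These three design choices can be pictured with a small sketch. The class names Concept and Clause, the constructor Existence, and the renderer function here are stand-ins invented for illustration, not the actual Ninai/Udiron API:

```python
class Constructor:
    """Base class: a pure container for arguments, with no rendering logic."""
    def __init__(self, *modifiers):
        # Open-ended extra modifiers: the 'scope' of the idea represented.
        self.modifiers = list(modifiers)

class Concept:
    def __init__(self, qid):
        self.qid = qid  # a Wikidata item ID such as "Q319"

class Existence(Constructor):
    def __init__(self, subject, *modifiers):
        super().__init__(*modifiers)
        self.subject = subject  # holds its subject and nothing else

class Clause:
    """Root of a syntactic tree, as returned by a renderer."""
    def __init__(self, tokens):
        self.tokens = tokens
    def __str__(self):
        # Applying str() to the Clause yields the final text.
        return " ".join(self.tokens)

def render_existence_en(cons):
    # Toy English renderer: all rendering logic lives here,
    # outside the constructor itself.
    return Clause([cons.subject.qid, "exists."])

clause = render_existence_en(Existence(Concept("Q319")))
print(str(clause))  # Q319 exists.
```

Keeping constructors logic-free in this way means a new language needs only new renderer functions; the abstract content itself never changes.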
At the moment, there are just enough constructors to represent Sentence 1.1
from the Jupiter examples
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Examples/Jupiter>, as
well as renderers in Bengali and Swedish for those constructors (thanks to
Bodhisattwa <https://meta.wikimedia.org/wiki/User:Bodhisattwa>, Jan
<https://meta.wikimedia.org/wiki/User:Ainali>, and Dennis
<https://meta.wikimedia.org/wiki/User:So9q> for feedback on those).
Building up to the Jupiter sentence should demonstrate how these work:
Building up to the Jupiter sentence step by step
(Glosses are literal English renderings for the reader, *not* renderer
output.)

Step 1:
Identification(
    Concept(Q(319)),
    Concept(Q(634)))
Bengali: বৃহস্পতি গ্রহ।
Swedish: Jupiter är klot.
Gloss: Jupiter is planet.
We start by simply *identifying* the two *concepts* of Jupiter (Q319)
<https://www.wikidata.org/wiki/Q319> and planet (Q634)
<https://www.wikidata.org/wiki/Q634> as being equal.

Step 2:
Identification(
    Concept(Q(319)),
    Instance(
        Concept(Q(634))))
Bengali: বৃহস্পতি একটা গ্রহ।
Swedish: Jupiter är ett klot.
Gloss: Jupiter is a planet.
Instead of equating the concepts alone, we might instead equate "Jupiter"
with an *instance* of "planet".

Step 3:
Identification(
    Concept(Q(319)),
    Instance(
        Concept(Q(634)),
        Definite()))
Bengali: বৃহস্পতি গ্রহটি।
Swedish: Jupiter är klotet.
Gloss: Jupiter is the planet.
We may further refine that by making clear that "Jupiter" is a *definite*
instance of "planet".

Step 4:
Identification(
    Concept(Q(319)),
    Instance(
        Attribution(
            Concept(Q(634)),
            Concept(Q(59863338))),
        Definite()))
Bengali: বৃহস্পতি বড় গ্রহটা।
Swedish: Jupiter är det stora klotet.
Gloss: Jupiter is the large planet.
Now we might ascribe an *attribute* to the definite planet instance in
question, this attribute being large (Q59863338)
<https://www.wikidata.org/wiki/Q59863338>.

Step 5:
Identification(
    Concept(Q(319)),
    Instance(
        Attribution(
            Concept(Q(634)),
            Superlative(
                Concept(Q(59863338)))),
        Definite()))
Bengali: বৃহস্পতি সবচেয়ে বড় গ্রহটি।
Swedish: Jupiter är det största klotet.
Gloss: Jupiter is the largest planet.
This attribute being *superlative* for Jupiter can be marked by modifying
the attribute.

Step 6:
Identification(
    Concept(Q(319)),
    Instance(
        Attribution(
            Concept(Q(634)),
            Superlative(
                Concept(Q(59863338)),
                Locative(
                    Concept(Q(544))))),
        Definite()))
Bengali: বৃহস্পতি সৌরমণ্ডলে সবচেয়ে বড় গ্রহ।
Swedish: Jupiter är den största planeten i solsystemet.
Gloss: Jupiter is the largest planet in the solar system.
Once we specify the *location* where Jupiter being the largest applies
(that is, in the Solar System (Q544)
<https://www.wikidata.org/wiki/Q544>), we're done!
Note that the sense resolution system does not have enough information to
choose which of '-টা' or '-টি' (for Bengali) or of 'klot' or 'planet' (for
Swedish) to use in some of these examples, so currently the prototype
chooses one at random. Re-rendering any example which pulls those in may
therefore produce a different result.
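That random fallback can be pictured as follows; the function name is invented for illustration, and the candidate list comes from the Bengali forms mentioned above:

```python
import random

# Sketch of the fallback described above: when sense resolution cannot
# decide between candidate forms, the prototype picks one at random.
def pick_form(candidates):
    return random.choice(candidates)

bengali_classifiers = ["-টা", "-টি"]  # both can mark definiteness
chosen = pick_form(bengali_classifiers)
# 'chosen' may differ between renderings of the same composition
```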
Besides this, there is clearly a lot more functionality to be added, and
because Bengali and Swedish are both Indo-European languages (however
distant), there are likely linguistic phenomena that won't be considered
simply by developing renderers for those two languages alone. If there's
something particular in your language that isn't present in those two
languages, this may then raise the question: what can you do for your
language?
I can think of at least four things, not in any particular order:
- Create lexemes and add senses to them! What matters most to the system
is that words have meanings (possibly in some context, and possibly with
equivalents in other languages or to Wikidata items) so that those words
can be properly retrieved based on those equivalences; that these words
might have a second-person plural negative past conditional form is largely
secondary!
- Think about how you might perform some basic grammatical tasks in your
language: how do you inflect adjectives? add objects to verbs? indicate in
a sentence where something happened?
- Think about how you might perform higher-level tasks involving
meaning: what do you do to indicate that something exists? to indicate that
something happened in the past but is no longer the case? to change a
simple declarative sentence into a question?
- If you have some ideas on how to render the Jupiter sentence in your
language, and the lexemes you would need to build that sentence exist on
Wikidata, and those lexemes have senses for the meanings those lexemes take
in that sentence, let me know!
We'd love to hear your thoughts on this prototype, and what it might mean
for realizing Abstract Wikipedia through Wikidata's lexicographic data and
the Wikifunctions platform.
------------------------------
Thank you Mahir for the great update! If you too want to contribute to the
weekly, get in touch. This is a project we all build together.
In addition, this week Slate published a great explanatory article on the
goals of Abstract Wikipedia and Wikifunctions: Wikipedia Is Trying to
Transcend the Limits of Human Language
<https://slate.com/technology/2021/09/wikipedia-human-language-wikifunctions.html>