The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-11-04
--
One Ring, or a thousand flowers?
<https://meta.wikimedia.org/wiki/File:One_Ring_or_a_thousand_flowers.jpg>
One of the Abstract Wikipedia workstreams is focused on the natural
language generation tasks that will be necessary for creating and
maintaining Wikipedia articles in hundreds of languages. Unlike the other
workstreams, this work is not focused on the immediate future and launch of
Wikifunctions, but explores the next steps necessary once Wikifunctions is
available and connected to the other Wikimedia projects, particularly
Wikidata and Wikipedia.
In previous newsletters we have talked about some of the approaches and
work around natural language generation for Abstract Wikipedia: Mahir
Morshed talked about Ninai and Udiron
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-03>, we
talked about Grammatical Framework
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-06-24>,
which has been a major influence for the development and design of the
project, Ariel Gutman and Maria Keet have presented a template language
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-08-19>
specification
(accompanied now by a Scribunto implementation
<https://meta.wikimedia.org/wiki/Module:Sandbox/AbstractWikipedia>), and
shared a recent update on Diff
<https://diff.wikimedia.org/2022/09/21/the-state-of-abstract-wikipedia-natural-language-generation/>
.
With these three solutions, are we done? Do we now know what the solution
to natural language generation will look like?
I don’t know. Maybe. These solutions have been developed by very
smart folks, with cumulative decades of experience under their belts. The
most important thing these implementations achieve right now is to provide
an existence proof: they demonstrate that a solution is possible. They show
us that the goals of Abstract Wikipedia are not too lofty. Grammatical
Framework did that for Abstract Wikipedia as a whole. I can genuinely say
that without Grammatical Framework, the Abstract Wikipedia project as it is
wouldn’t exist.
But are any of these solutions the approach that Abstract Wikipedia will
ultimately take?
I don’t know. One major novelty is that Abstract Wikipedia would benefit
from being able to scale to a large number of contributors with very
diverse skill levels. Some might be experienced programmers, some might be
trained linguists, others might bring native language level skills. Which
solution really scales well for a community of volunteer Wikimedians? This
is very difficult to predict in advance. And this is why I don’t want us
yet to commit to a specific solution.
I would like to see a Cambrian explosion
<https://en.wikipedia.org/wiki/Cambrian_explosion> of possible solutions.
This is one of the reasons why Wikifunctions allows for all kinds of
functions, why it is explicitly Turing complete
<https://en.wikipedia.org/wiki/Turing_completeness>: so we don’t lock
ourselves prematurely into a single architecture, into a single solution. I
am looking forward to a large number of different approaches being tried
out, and then to the communities building around these approaches discussing
the advantages and disadvantages, and simply voting with their feet,
through their activity.
Yes, in the end we should make sure that we unify on a single solution. It
would obviously be a tragic mistake if natural language generation for
Bengali worked entirely differently from that for Hausa, using
different abstract contents. But sometimes it might be necessary to develop
some morphological or grammatical functions which are unique to a specific
language, and then integrate them into the overall architecture for
generating whole texts. Examples of this are the noun classes in Niger-Congo
languages
<https://en.wikipedia.org/wiki/Noun_class#Nominal_classes_in_Swahili>, or
the morphology of Arabic and Hebrew interleaving vowels and consonants
<https://en.wikipedia.org/wiki/Nonconcatenative_morphology>.
I see the community entirely taking the lead on which solution to choose,
implement, and pursue. But I see that happening mostly implicitly, through
the community’s actions, and not so much by the explicit means of debating
and voting on a single solution. I don’t want us to prematurely decide on a
single way, but rather to stay open and invite experimentation and new
ideas. The space of possible ideas is so vast, and the benefit of choosing
a solution better fitted to our community is so big, that it makes sense
for us to be creative.
I expect the development to go through four different levels of evolution.
First level: we might start with simple lookup tables. We might have a
function that returns a short description for a city, selecting the
relevant word or phrase in the right language, just based on the language.
There are 550 items in Wikidata that have the English short description
“city”, more than 1,700 items with “scientific paper”, and more than 400
with the French short description *“article scientifique”*. Even with such
a simple look-up table we could already improve the descriptions of
thousands of items on Wikidata.
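A first-level function could be little more than a dictionary lookup. The sketch below is purely illustrative: the language codes and the non-English labels are my own assumptions, not actual Wikifunctions code or Wikidata data.

```python
# Hypothetical sketch of a first-level function: a plain lookup table that
# returns the short description "city" in a requested language. The entries
# are illustrative translations, not data taken from Wikidata.

CITY_DESCRIPTIONS = {
    "en": "city",
    "fr": "ville",
    "de": "Stadt",
    "bn": "শহর",
}

def city_description(language_code: str) -> str:
    """Return the short description for a city in the given language."""
    return CITY_DESCRIPTIONS[language_code]

print(city_description("fr"))  # prints "ville"
```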
Second level: we can extend the possibilities widely by allowing for
arguments in a templated structure, say in order to make a short
description such as “city in Azerbaijan” or “city in Israel” (each fifty
occurrences), or “French author” or “Argentinian chemist”. Such simple
patterns will be useful for a large number of items, and will also already
uncover a surprising number of edge cases. This will be useful for Wikidata
descriptions, but can also already be useful for Wikipedia articles: many
bots such as Rambot on English Wikipedia or LSJbot on Swedish have been
working exactly like this.
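A second-level template can be pictured as a pattern with slots. The per-language patterns here are assumptions, and deliberately naive: a real template would have to handle exactly the kinds of edge cases mentioned above, such as prepositions and agreement.

```python
# Hypothetical sketch of a second-level templated description such as
# "city in {country}". The patterns are illustrative; a real system would
# need to handle grammatical edge cases (prepositions, gender, agreement).

PATTERNS = {
    "en": "city in {country}",
    "de": "Stadt in {country}",
}

def city_in_country(language_code: str, country: str) -> str:
    return PATTERNS[language_code].format(country=country)

print(city_in_country("en", "Azerbaijan"))  # prints "city in Azerbaijan"
```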
Third level: we add constructors to the templates. The constructors allow
us to build whole articles from individual sentences. Instead of having model
articles
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-07> for
a whole category, we now allow for manually written articles
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-21>.
This makes the task at the same time easier and harder: because the
constructors now need to be reusable, they need to be more like modular
sentences than whole articles.
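One way to picture the third level: constructors that each render one reusable, modular sentence, which are then composed into an article. The constructor names and sentence patterns below are invented for illustration, and English-only for brevity; in a real system each constructor would need renderers per language.

```python
# Hypothetical sketch of third-level constructors: each constructor renders
# one modular sentence, and an article is composed from several of them.

def occupation(person: str, job: str) -> str:
    return f"{person} was a {job}."

def birth(pronoun: str, year: int, place: str) -> str:
    return f"{pronoun} was born in {year} in {place}."

def compose_article(sentences: list[str]) -> str:
    return " ".join(sentences)

text = compose_article([
    occupation("Ada Lovelace", "mathematician"),
    birth("She", 1815, "London"),
])
print(text)
# prints "Ada Lovelace was a mathematician. She was born in 1815 in London."
```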
Fourth level: since with the third level the number of constructors will
grow, we should aim to rein in the amount of work that needs to be done by
the smaller language communities. This can be achieved by having abstract
renderers for constructors: winning an award can then, instead of having a
direct template in English such as (the example is simplified)
“{person} received the {award} on {date}”
have an abstract (i.e. language-independent) renderer such as (again
simplified)
“Clause(subject=person, predicate=Q76664785
<https://www.wikidata.org/wiki/Q76664785>, object=award, time=date)”
which in turn has language-dependent renderers, but fewer of those. This
would often lead to less idiomatic results.
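The fourth-level idea can be sketched as a language-independent data structure with per-language renderers. Everything below is a simplified assumption following the Clause example: the field names, the fixed word orders, and the renderers are illustrative, and the arguments are left untranslated.

```python
# Hypothetical sketch of a fourth-level abstract renderer: one
# language-independent Clause, rendered by per-language functions. The
# predicate reuses the Q-id from the example in the text; real renderers
# would use grammar functions instead of hard-coded word orders.

from dataclasses import dataclass

@dataclass
class Clause:
    subject: str
    predicate: str   # e.g. Q76664785, standing in for "receive"
    object: str
    time: str

def render_english(c: Clause) -> str:
    return f"{c.subject} received the {c.object} on {c.time}."

def render_german(c: Clause) -> str:
    # German prefers the time adverbial before the object here
    # (arguments left untranslated in this sketch)
    return f"{c.subject} erhielt am {c.time} den {c.object}."

clause = Clause(subject="Marie Curie", predicate="Q76664785",
                object="Nobel Prize", time="10 December 1903")
print(render_english(clause))
# prints "Marie Curie received the Nobel Prize on 10 December 1903."
```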
Some of the solutions so far fare really well at the second or third
level, and others also seem capable of dealing with the fourth level.
The third level lends itself well to a certain kind of user experience,
which the fourth level does not. There will be advantages and disadvantages
to balance.
The goal of this newsletter is not to prescribe such a development. Unlike
with the development plan
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Plan>, we are not going
to actively work on making these levels happen. We do not have to go
through these particular levels, and we don’t have to be uniform about the
levels in the different domains. In some areas, the first level might be
entirely sufficient, in others we might flourish by using ideas described
for the fourth level, and again others might simply not fit into the
described levels at all. And that’s OK.
The goal of this newsletter is to explain my thinking and decisions, show
how the different systems and approaches we have previously mentioned fit
together, and to allow for rational predictions of where we are going and
what kind of contributions we are looking for. This is also an invitation
to all of you: the NLG system will be developed by all of us together.
Volunteer corner
Next week, on Monday, November 7, 18:30 UTC, we are going to host our next
volunteer corner. You can join us here:
https://meet.google.com/evj-ktbq-hbn
Staff editing discussion is closing
Given that activity has calmed down, we are planning to close the staff editing
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Staff_editing> discussion
soon.
New developer channel
We will use the #wikipedia-abstract-techconnect
<https://web.libera.chat/?channel=#wikipedia-abstract-tech> channel on IRC
(also bridged to Telegram <https://t.me/abstract_wikipedia_tech>) as a
space more focused on developers and technology around Wikifunctions and
Abstract Wikipedia. Our channels are documented here: Abstract
Wikipedia#Participate
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia#Participate>
Development update for the week of October 28, 2022
Experience & Performance:
- Avoidance of type expansion in orchestrator (T297742
<https://phabricator.wikimedia.org/T297742>)
- Removed Work Summary component
- Aligned on what fields are mandatory and what fields are optional
during ZFunction and ZObject creation
- Finalized designs for Publish component
- Fixed more FE bugs
- Completed another round of testing
Meta-data:
- Initial implementation of readable summaries of all error types (
T312611 <https://phabricator.wikimedia.org/T312611>)
Natural Language Generation:
- Shared document on UI and grammaticality judgments
- Made progress on the template language
- Demo of possible template creation GUI made (here
<https://github.com/mkeet/ToCTeditor>)