The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-07
--
Communities will create (at least) two different types of articles using
Abstract Wikipedia: on the one hand, we will have highly-standardised
articles based entirely on Wikidata; and on the other hand, we will have
bespoke, hand-crafted content, assembled sentence by sentence. Today we
will discuss the first type, and we will discuss the second type in an
upcoming newsletter.
Articles of the first type can be created very quickly and will likely
constitute the vast majority of articles for a long time to come. For that
we can use models, *i.e.* texts with variables: put differently, texts with
gaps that get filled in from another source, such as a list, along the
lines of the Mad Libs <https://en.wikipedia.org/wiki/Mad_Libs> game. A
model can be created once for a specific type of item and then used for
every single item of this type that has enough data in Wikidata. The
resulting articles are similar to many bot-created articles that already
exist in various Wikipedias.
For example, in many languages, bots were used to create or maintain the
articles about years (such as those for 1313
<https://www.wikidata.org/wiki/Q5735>, 1428
<https://www.wikidata.org/wiki/Q6315>, or 1697
<https://www.wikidata.org/wiki/Q7702>, each of which is available in more
than a hundred languages). In English Wikipedia, many articles for US
cities were created by a bot
<https://en.wikipedia.org/wiki/List_of_Wikipedia_controversies#2002> based
on the US census, and later updated after the 2010 census. Lsjbot
<https://en.wikipedia.org/wiki/Lsjbot> by Sverker Johansson is a well known
example of a bot that has created millions of articles about locations or
species across a few languages such as Swedish, Waray Waray, or Cebuano.
Comparable activities, although not as prolific, have been going on in
quite a few other languages.
How do these approaches work? Assume you have a dataset such as the
following list of countries:
Country      Continent        Capital    Population
Jordan       Asia             Amman      10428241
Nicaragua    Central America  Managua    5142098
Kyrgyzstan   Asia             Bishkek    6201500
Laos         Asia             Vientiane  6858160
Lebanon      Asia             Beirut     6100075
Now we can create a model that can generate a complete text from this data,
such as
“*<Country>* is a country in *<Continent>* with a population of
*<Population>*. The capital of *<Country>* is *<Capital>*.”
With this text and the above dataset, we would have created the following
five proto-articles (references not shown for simplicity):
*Jordan* is a country in Asia with a population of 10,428,241. The capital
of Jordan is Amman.
*Nicaragua* is a country in Central America with a population of 5,142,098.
The capital of Nicaragua is Managua.
*Kyrgyzstan* is a country in Asia with a population of 6,201,500. The
capital of Kyrgyzstan is Bishkek.
*Laos* is a country in Asia with a population of 6,858,160. The capital of
Laos is Vientiane.
*Lebanon* is a country in Asia with a population of 6,100,075. The capital
of Lebanon is Beirut.
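To make the mechanics concrete, here is a minimal sketch in Python of how
such a model can be filled from the data; the code and every name in it are
purely illustrative and not part of any Abstract Wikipedia tooling:

    # Fill a fixed model text from tabular data (a minimal "mail merge").
    countries = [
        {"country": "Jordan", "continent": "Asia", "capital": "Amman", "population": 10428241},
        {"country": "Nicaragua", "continent": "Central America", "capital": "Managua", "population": 5142098},
        {"country": "Kyrgyzstan", "continent": "Asia", "capital": "Bishkek", "population": 6201500},
        {"country": "Laos", "continent": "Asia", "capital": "Vientiane", "population": 6858160},
        {"country": "Lebanon", "continent": "Asia", "capital": "Beirut", "population": 6100075},
    ]

    MODEL = (
        "{country} is a country in {continent} with a population of "
        "{population:,}. The capital of {country} is {capital}."
    )

    for row in countries:
        print(MODEL.format(**row))

Running this prints the five proto-articles above (without the bold
markup), with the population figures formatted with thousands separators.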
Classical textbooks on the topic, such as *“Building Natural Language
Generation Systems”
<https://en.wikipedia.org/wiki/Special:BookSources/978-0-521-02451-8>*, call
this method *“mail merge”* (even though it is used for more than mail). A
model is combined with a dataset, often from a spreadsheet or a database.
This has been used for decades to create bulk mailings
<https://en.wikipedia.org/wiki/Mail_merge> and other bulk content, and is a
form of mass customisation
<https://en.wikipedia.org/wiki/Mass_customization>. The methods have become
increasingly sophisticated over time and are able to answer more questions: how
to deal with missing or optional information? How to adapt parts of the text
to the data, *e.g.* using plurals, grammatical gender, or noun classes where
appropriate? The bots that were mentioned above, which created
millions of articles in various languages on Wikipedia, have mostly worked
along these lines.
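As a rough, hypothetical sketch of how two of these refinements can be
layered onto a plain template (skipping a sentence whose data is missing,
and adapting a word form to the data), with all helper names invented
purely for illustration:

    # Sketch: optional information and simple word-form agreement.
    def pluralise(noun, count):
        # Naive English pluralisation; real systems need per-language grammar rules.
        return noun if count == 1 else noun + "s"

    def describe(item):
        sentences = [f"{item['name']} is a country in {item['continent']}."]
        # Optional information: only mention the population if we have it.
        if item.get("population") is not None:
            sentences.append(f"It has a population of {item['population']:,}.")
        # Adapt the noun and verb to the number of values in the data.
        languages = item.get("official_languages") or []
        if languages:
            n = len(languages)
            sentences.append(
                f"Its official {pluralise('language', n)} "
                f"{'is' if n == 1 else 'are'} {', '.join(languages)}."
            )
        return " ".join(sentences)

    print(describe({"name": "Laos", "continent": "Asia",
                    "population": 6858160, "official_languages": ["Lao"]}))
    print(describe({"name": "Nicaragua", "continent": "Central America",
                    "official_languages": ["Spanish"]}))  # no population given

Doing this properly across hundreds of languages with grammatical gender
and noun classes is, of course, much harder than this English-only toy.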
For a great example of how far the model approach can be pushed, consider
Magnus Manske’s Reasonator <https://meta.wikimedia.org/wiki/Reasonator>,
which, based on the data in Wikidata, creates the following automatic
description for Douglas Adams <https://reasonator.toolforge.org/?q=Q42>:
*Douglas Adams* was a British playwright, screenwriter, novelist,
children's writer, science fiction writer, comedian, and writer. He was
born on March 11, 1952 in Cambridge to Christopher Douglas Adams and Janet
Adams. He studied at St John's College from 1971 until 1974 and Brentwood
School from 1959 until 1970. His field of work included science fiction,
comedy, satire, and science fiction. He was a member of Groucho Club and
Footlights. He worked for The Digital Village from 1996 and for BBC. He
married Jane Belson on November 25, 1991 (married until on May 11, 2001 ),
Jane Belson on November 25, 1991 (married until on May 11, 2001 ), and Jane
Belson on November 25, 1991 (married until on May 11, 2001 ). His children
include Polly Adams, Polly Adams, and Polly Adams. He died of myocardial
infarction on May 11, 2001 in Santa Barbara. He was buried at Highgate
Cemetery.
If we were to say that this is merely better than nothing, I think we would
undersell the achievement of Reasonator. The above text, together with the
appealing display of the structured data in Reasonator, leads to a more
comprehensive access to knowledge than many of the individual language
Wikipedias provide for Douglas Adams. For comparison, check out the
articles in Azerbaijani <https://az.wikipedia.org/wiki/Duqlas_Adams>, Urdu
<https://ur.wikipedia.org/wiki/%DA%88%DA%AF%D9%84%D8%B3_%D8%A7%DB%8C%DA%88%D9%85%D8%B3>
, Malayalam
<https://ml.wikipedia.org/wiki/%E0%B4%A1%E0%B4%97%E0%B5%8D%E0%B4%B2%E0%B4%B8%E0%B5%8D%E0%B4%86_%E0%B4%A1%E0%B4%82%E0%B4%B8%E0%B5%8D>
, Korean
<https://ko.wikipedia.org/wiki/%EB%8D%94%EA%B8%80%EB%9F%AC%EC%8A%A4_%EC%95%A0%EB%8D%A4%EC%8A%A4>,
or Danish <https://da.wikipedia.org/wiki/Douglas_Adams>. At the same time,
it shows errors that most contributors wouldn’t know how to fix (such as
the repetition of the names of the children, or the spaces inside the
brackets, *etc.*).
The ArticlePlaceholder
<https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder> project has
partially fulfilled the role of filling content gaps, but the developers
have intentionally shied away from making the results look too much like an
article. The extension displays structured data from Wikidata within the
context of a language Wikipedia. For example, here is the generated page about
*triceratops* in Haitian Creole
<https://ht.wikipedia.org/wiki/Espesyal:AboutTopic/Q14384>.
One major disadvantage of using bots to create articles on Wikipedia has
been that the content was mostly controlled by a very small subset of the
community — often a single person. Many of the bots and datasets have not
been open sourced in a way that someone else could easily come in, make a
change, and re-run the bot. (Reasonator avoids this issue, because the text
is generated dynamically and is not incorporated into the actual Wikipedia
article.)
With Wikifunctions and Wikidata, we will be able to give control over all
these steps to the wider community. Both the models and the data will be
edited on wiki, with all the usual advantages of having a wiki: there is a
clear history, everyone can edit through the Web, people can discuss, *etc.*
The data used to populate the models will be maintained in Wikidata, and
the models themselves in Wikifunctions. This will allow us to collaborate
on the texts, unleash the creativity of the community, spot and correct
errors and edge cases together, and slowly extend the types of items and
the coverage per type.
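As a rough sketch of how the data side could feed such a model today, the
example below pulls Jordan’s continent, capital, and population from
Wikidata through the public SPARQL query service and fills the country
model from earlier. The item and property IDs are real Wikidata IDs, but
the glue code is only an illustration; in Abstract Wikipedia the models
themselves will be maintained in Wikifunctions, not in a Python script:

    # Illustrative only: fill the earlier country model with live data from
    # Wikidata, queried through the public SPARQL endpoint.
    import requests

    QUERY = """
    SELECT ?countryLabel ?continentLabel ?capitalLabel ?population WHERE {
      VALUES ?country { wd:Q810 }          # Q810 = Jordan
      ?country wdt:P30 ?continent ;        # P30 = continent
               wdt:P36 ?capital ;          # P36 = capital
               wdt:P1082 ?population .     # P1082 = population
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "abstract-wikipedia-newsletter-example/0.1"},
    )
    row = response.json()["results"]["bindings"][0]

    MODEL = ("{country} is a country in {continent} with a population of "
             "{population:,}. The capital of {country} is {capital}.")
    print(MODEL.format(
        country=row["countryLabel"]["value"],
        continent=row["continentLabel"]["value"],
        capital=row["capitalLabel"]["value"],
        population=int(float(row["population"]["value"])),
    ))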
In a follow-up essay, we will discuss a different approach to creating
abstract content, where the content is not the result of a model based on
the type of the described item, but rather a manually constructed article,
built up sentence by sentence.
*Development update from the week of May 27:*
- The team had a session at Hackathon, which was well attended (about 30
people). Thanks to everyone for being there and your questions and comments!
- We also had follow-up meetings with User:Mahir256 to improve
alignment on the NLG workstream.
- Below is the brief weekly summary highlighting the status of each
workstream:
   - Performance:
      - Observability document drafted.
      - Updated Helm charts for getting function-* services in staging.
      - Completed the performance metrics design and shared it for review.
   - NLG:
      - Scoped out necessary changes to Wikifunctions post-launch.
   - Metadata:
      - Started recording and passing up some function-evaluator timing
        metrics to the orchestrator.
   - Experience:
      - The WikiLambda (PHP) layer has been migrated to the new format of
        typed lists.
      - Improved the mobile experience of the function view page.
      - Transitioned the Tabs component to use Codex's, thanks to the
        Design Systems Team.
   - Design: Carried out end-to-end user flow testing in Bangla.
*(Apologies for this update being late. We plan to send out another update
this week.)*