Hi Denny,
It quickly became apparent to me that much more metadata will likely need to be
added to our existing Wikidata properties in order to surface their full
contextual meaning when curating and generating appropriate models
for specific types of items. (Predicates are king, sort of thing.)
For instance, "in" found in a sentence could be contextualized as
containment or, more generally, as "grouping" into a set... or not.
Example: Is it "in" something (a container, an ocean, a pot, etc.), is
it "in" a location/place, or is it "in" a set of things (one element in a
set or group, a specific chemical bond in a chemical compound, etc.)?
It will be interesting to see what additional metadata is going to be
needed for Wikidata properties beyond our current rudimentary "instance of"
(P31). I can imagine adding much more knowledge (metadata) about Wikidata
properties in general; "Wikidata property to indicate a location"
<https://www.wikidata.org/wiki/Q18615777> is one example.
Keep up the great work and knowledge sharing, team!
Thad
On Tue, Jun 7, 2022 at 3:38 PM Denny Vrandečić <dvrandecic(a)wikimedia.org>
wrote:
The on-wiki version of this newsletter can be found
here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-07
--
Communities will create (at least) two different types of articles using
Abstract Wikipedia: on the one hand, we will have highly-standardised
articles based entirely on Wikidata; and on the other hand, we will have
bespoke, hand-crafted content, assembled sentence by sentence. Today we
will discuss the first type, and we will discuss the second type in an
upcoming newsletter.
Articles of the first type can be created very quickly and will likely
constitute the vast majority of articles for a long time to come. For that
we can use models, *i.e.* texts with variables. Put differently, a text
with gaps which get filled from a different source such as a list, along
the lines of the Mad Libs <https://en.wikipedia.org/wiki/Mad_Libs> game.
A model can be created once for a specific type of item and then used for
every single item of this type that has enough data in Wikidata. The
resulting articles are similar to many bot-created articles that already
exist in various Wikipedias.
For example, in many languages, bots were used to create or maintain the
articles about years (such as the articles about 1313
<https://www.wikidata.org/wiki/Q5735>, 1428
<https://www.wikidata.org/wiki/Q6315>, or 1697
<https://www.wikidata.org/wiki/Q7702>, each of which is available in more
than a hundred languages). In English Wikipedia, many articles for US
cities were created by a bot
<https://en.wikipedia.org/wiki/List_of_Wikipedia_controversies#2002> based
on the US census, and later updated after the 2010 census. Lsjbot
<https://en.wikipedia.org/wiki/Lsjbot> by Sverker Johansson is a well
known example of a bot that has created millions of articles about
locations or species across a few languages such as Swedish, Waray Waray,
or Cebuano. Comparable activities, although not as prolific, have been
going on in quite a few other languages.
How do these approaches work? Assume you have a dataset such as the
following list of countries:
Country      Continent         Capital     Population
Jordan       Asia              Amman       10428241
Nicaragua    Central America   Managua     5142098
Kyrgyzstan   Asia              Bishkek     6201500
Laos         Asia              Vientiane   6858160
Lebanon      Asia              Beirut      6100075
Now we can create a model that can generate a complete text from this
data, such as
“*<Country>* is a country in *<Continent>* with a population of
*<Population>*. The capital of *<Country>* is *<Capital>*.”
With this text and the above dataset, we would have created the following
five proto-articles (references not shown for simplicity):
*Jordan* is a country in Asia with a population of 10,428,241. The
capital of Jordan is Amman.
*Nicaragua* is a country in Central America with a population of
5,142,098. The capital of Nicaragua is Managua.
*Kyrgyzstan* is a country in Asia with a population of 6,201,500. The
capital of Kyrgyzstan is Bishkek.
*Laos* is a country in Asia with a population of 6,858,160. The capital
of Laos is Vientiane.
*Lebanon* is a country in Asia with a population of 6,100,075. The
capital of Lebanon is Beirut.
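As a rough sketch (not the actual Wikifunctions implementation), the
model-filling step above could look like the following, where the model is
a text with named gaps and each row of the dataset fills one article:

```python
# A minimal "mail merge" sketch: a model (text with gaps) combined with a
# dataset to produce one proto-article per row. Illustration only.
countries = [
    ("Jordan", "Asia", "Amman", 10428241),
    ("Nicaragua", "Central America", "Managua", 5142098),
    ("Kyrgyzstan", "Asia", "Bishkek", 6201500),
    ("Laos", "Asia", "Vientiane", 6858160),
    ("Lebanon", "Asia", "Beirut", 6100075),
]

# The model: gaps in braces are filled from the data; ":," formats the
# population with thousands separators, as in the proto-articles above.
MODEL = ("{country} is a country in {continent} with a population of "
         "{population:,}. The capital of {country} is {capital}.")

def render(country, continent, capital, population):
    return MODEL.format(country=country, continent=continent,
                        capital=capital, population=population)

articles = [render(*row) for row in countries]
print(articles[0])
# Jordan is a country in Asia with a population of 10,428,241. The capital of Jordan is Amman.
```

The same model then works for every row, which is what makes the approach
scale to every item of a type that has enough data in Wikidata.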
Classical textbooks on that topic such as *“Building natural language
generation systems”
<https://en.wikipedia.org/wiki/Special:BookSources/978-0-521-02451-8>* call
this method *“mail merge”* (even though it is used for more than mail). A
model is combined with a dataset, often from a spreadsheet or a database.
This has been used for decades to create bulk mailings
<https://en.wikipedia.org/wiki/Mail_merge> and other bulk content, and is
a form of mass customisation
<https://en.wikipedia.org/wiki/Mass_customization>. The methods have
become increasingly complex over time and are able to answer more
questions: How to deal with missing or optional information? How to adapt
part of the text to the data, *e.g.* use plurals or grammatical gender or
noun classes where appropriate, *etc.*? The bots that were mentioned
above, which created millions of articles in various languages on
Wikipedia, have mostly worked along these lines.
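One of the questions above, dealing with missing or optional information,
can be sketched as follows: each sentence of the model is only emitted if
the data it needs is present. This is a hypothetical illustration of the
idea, not how Wikifunctions will actually represent models:

```python
# Sketch: a model as a list of optional sentence parts, each emitted only
# when the data it depends on exists. Helper and field names are invented.
def render_country(data):
    parts = []
    name = data["country"]
    if "continent" in data:
        parts.append(f"{name} is a country in {data['continent']}.")
    else:
        parts.append(f"{name} is a country.")
    if "population" in data:
        parts.append(f"It has a population of {data['population']:,}.")
    if "capital" in data:
        parts.append(f"The capital of {name} is {data['capital']}.")
    return " ".join(parts)

# A row with no known continent or population still yields a valid article:
print(render_country({"country": "Jordan", "capital": "Amman"}))
# Jordan is a country. The capital of Jordan is Amman.
```

Grammatical adaptation (plurals, gender, noun classes) works along similar
lines, with the model choosing among word forms based on the data, which is
exactly where per-language functions in Wikifunctions would come in.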
For a great example of how far the model approach can be pushed, consider
Magnus Manske’s Reasonator <https://meta.wikimedia.org/wiki/Reasonator>,
which, based on the data in Wikidata, creates the following automatic
description for Douglas Adams <https://reasonator.toolforge.org/?q=Q42>:
*Douglas Adams* was a British playwright, screenwriter, novelist,
children's writer, science fiction writer, comedian, and writer. He was
born on March 11, 1952 in Cambridge to Christopher Douglas Adams and Janet
Adams. He studied at St John's College from 1971 until 1974 and Brentwood
School from 1959 until 1970. His field of work included science fiction,
comedy, satire, and science fiction. He was a member of Groucho Club and
Footlights. He worked for The Digital Village from 1996 and for BBC. He
married Jane Belson on November 25, 1991 (married until on May 11, 2001 ),
Jane Belson on November 25, 1991 (married until on May 11, 2001 ), and Jane
Belson on November 25, 1991 (married until on May 11, 2001 ). His children
include Polly Adams, Polly Adams, and Polly Adams. He died of myocardial
infarction on May 11, 2001 in Santa Barbara. He was buried at Highgate
Cemetery.
If we were to say that this is merely better than nothing, I think we
would undersell the achievement of Reasonator. The above text, together
with the appealing display of the structured data in Reasonator, leads to a
more comprehensive access to knowledge than many of the individual language
Wikipedias provide for Douglas Adams. For comparison, check out the
articles in Azerbaijani <https://az.wikipedia.org/wiki/Duqlas_Adams>, Urdu
<https://ur.wikipedia.org/wiki/%DA%88%DA%AF%D9%84%D8%B3_%D8%A7%DB%8C%DA%88%D9%85%D8%B3>
, Malayalam
<https://ml.wikipedia.org/wiki/%E0%B4%A1%E0%B4%97%E0%B5%8D%E0%B4%B2%E0%B4%B8%E0%B5%8D%E0%B4%86_%E0%B4%A1%E0%B4%82%E0%B4%B8%E0%B5%8D>
, Korean
<https://ko.wikipedia.org/wiki/%EB%8D%94%EA%B8%80%EB%9F%AC%EC%8A%A4_%EC%95%A0%EB%8D%A4%EC%8A%A4>,
or Danish <https://da.wikipedia.org/wiki/Douglas_Adams>. At the same
time, it shows errors that most contributors wouldn’t know how to fix (such
as the repetition of the names of the children, or the spaces inside the
brackets, *etc.*).
The ArticlePlaceholder
<https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder> project has
partially fulfilled the role of filling content gaps, but the developers
have intentionally shied away from the results looking too much like an
article. They display structured data from Wikidata within the context of a
language Wikipedia. For example, here is the generated page about
*triceratops* in Haitian Creole
<https://ht.wikipedia.org/wiki/Espesyal:AboutTopic/Q14384>.
One large disadvantage of using bots to create articles in Wikipedia has
been that this content was mostly controlled by a very small subset of the
community — often a single person. Many of the bots and datasets have not
been open sourced in a way that someone else could easily come in, make a
change, and re-run the bot. (Reasonator avoids this issue, because the text
is generated dynamically and is not incorporated into the actual Wikipedia
article.)
With Wikifunctions and Wikidata, we will be able to give control over all
these steps to the wider community. Both the models and the data will be
edited on wiki, with all the usual advantages of having a wiki: there is a
clear history, everyone can edit through the Web, people can discuss,
*etc.* The data used to populate the models will be maintained in
Wikidata, and the models themselves in Wikifunctions. This will allow us to
collaborate on the texts, unleash the creativity of the community, spot and
correct errors and edge cases together, and slowly extend the types of
items and the coverage per type.
In a follow-up essay, we will discuss a different approach to creating
abstract content, where the content is not the result of a model based on
the type of the described item, but rather a manually constructed article,
built up sentence by sentence.
*Development update from the week of May 27:*
- The team had a session at Hackathon, which was well attended (about
30 people). Thanks to everyone for being there and your questions and
comments!
   - We also had follow-up meetings with User:Mahir256 to improve
   alignment on the NLG stream.
- Below is the brief weekly summary highlighting the status of each
workstream
- Performance:
- Observability document drafted.
- Updated Helm charts for getting function-* services in staging.
- Completed performance metrics design and shared for review
- NLG:
- Scoped out necessary changes to Wikifunctions post-launch
- Metadata:
- Started recording and passing up some function-evaluator
timing metrics to the orchestrator
- Experience:
- WikiLambda (PHP) layer has been migrated to the new format of
typed lists
- Improved the mobile experience of the function view page
- Transitioned the Tabs component to use Codex's, thanks to the
Design Systems Team.
- Design: Carried out end-to-end user flow testing in Bangla.
*(Apologies for this update being late. We plan to send out another update
this week)*
_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia(a)lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikime…