Hi Denny,
It quickly became apparent to me that much more metadata will likely need to be
added to our existing Wikidata properties in order to surface their full
contextual meaning when curating and generating appropriate models
for specific types of items. (Predicates are king, sort of thing.)
For instance, "in" found in a sentence could be contextualized as
containment or, more generally, as "grouping" into a set... or not.
Example: Is it "in" something (a container, an ocean, a pot, etc.), is
it "in" a location/place, or is it "in" a set of things (one element in a
set or group, a specific chemical bond in a chemical compound, etc.)?
It will be interesting to see what additional metadata is going to be
needed for Wikidata properties beyond our current rudimentary "instance of"
(P31). I can imagine adding much more knowledge (metadata) about Wikidata
properties in general; "Wikidata property to indicate a location"
<https://www.wikidata.org/wiki/Q18615777> is one example.
Keep up the great work and knowledge sharing, team!
Thad
On Tue, Jun 7, 2022 at 3:38 PM Denny Vrandečić <dvrandecic(a)wikimedia.org>
wrote:
The on-wiki version of this newsletter can be found
here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2022-06-07
--
Communities will create (at least) two different types of articles using
Abstract Wikipedia: on the one hand, we will have highly-standardised
articles based entirely on Wikidata; and on the other hand, we will have
bespoke, hand-crafted content, assembled sentence by sentence. Today we
will discuss the first type, and we will discuss the second type in an
upcoming newsletter.
Articles of the first type can be created very quickly and will likely
constitute the vast majority of articles for a long time to come. For that
we can use models, *i.e.* texts with variables. Put differently, a text
with gaps which get filled from a different source such as a list, along
the lines of the Mad Libs <https://en.wikipedia.org/wiki/Mad_Libs> game.
A model can be created once for a specific type of item and then used for
every single item of this type that has enough data in Wikidata. The
resulting articles are similar to many bot-created articles that already
exist in various Wikipedias.
For example, in many languages, bots were used to create or maintain the
articles about years (such as the articles about 1313
<https://www.wikidata.org/wiki/Q5735>, 1428
<https://www.wikidata.org/wiki/Q6315>, or 1697
<https://www.wikidata.org/wiki/Q7702>, each of which is available in more
than a hundred languages). In English Wikipedia, many articles for US
cities were created by a bot
<https://en.wikipedia.org/wiki/List_of_Wikipedia_controversies#2002> based
on the US census, and later updated after the 2010 census. Lsjbot
<https://en.wikipedia.org/wiki/Lsjbot> by Sverker Johansson is a well
known example of a bot that has created millions of articles about
locations or species across a few languages such as Swedish, Waray Waray,
or Cebuano. Comparable activities, although not as prolific, have been
going on in quite a few other languages.
How do these approaches work? Assume you have a dataset such as the
following list of countries:
Country      Continent         Capital     Population
Jordan       Asia              Amman       10428241
Nicaragua    Central America   Managua     5142098
Kyrgyzstan   Asia              Bishkek     6201500
Laos         Asia              Vientiane   6858160
Lebanon      Asia              Beirut      6100075
Now we can create a model that can generate a complete text from this
data, such as
“*<Country>* is a country in *<Continent>* with a population of
*<Population>*. The capital of *<Country>* is *<Capital>*.”
With this text and the above dataset, we would have created the following
five proto-articles (references not shown for simplicity):
*Jordan* is a country in Asia with a population of 10,428,241. The
capital of Jordan is Amman.
*Nicaragua* is a country in Central America with a population of
5,142,098. The capital of Nicaragua is Managua.
*Kyrgyzstan* is a country in Asia with a population of 6,201,500. The
capital of Kyrgyzstan is Bishkek.
*Laos* is a country in Asia with a population of 6,858,160. The capital
of Laos is Vientiane.
*Lebanon* is a country in Asia with a population of 6,100,075. The
capital of Lebanon is Beirut.
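As a rough sketch (not the actual Wikifunctions implementation), the
model-filling step above could look like the following, where the model is
a text with named gaps and each row of the dataset fills one article:

```python
# A minimal "mail merge" sketch: a model (text with gaps) combined with a
# dataset to produce one proto-article per row. Illustration only.
countries = [
    ("Jordan", "Asia", "Amman", 10428241),
    ("Nicaragua", "Central America", "Managua", 5142098),
    ("Kyrgyzstan", "Asia", "Bishkek", 6201500),
    ("Laos", "Asia", "Vientiane", 6858160),
    ("Lebanon", "Asia", "Beirut", 6100075),
]

# The model: gaps in braces are filled from the data; ":," formats the
# population with thousands separators, as in the proto-articles above.
MODEL = ("{country} is a country in {continent} with a population of "
         "{population:,}. The capital of {country} is {capital}.")

def render(country, continent, capital, population):
    return MODEL.format(country=country, continent=continent,
                        capital=capital, population=population)

articles = [render(*row) for row in countries]
print(articles[0])
# Jordan is a country in Asia with a population of 10,428,241. The capital of Jordan is Amman.
```

The same model then works for every row, which is what makes the approach
scale to every item of a type that has enough data in Wikidata.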
Classical textbooks on that topic such as *“Building natural language
generation systems”
<https://en.wikipedia.org/wiki/Special:BookSources/978-0-521-02451-8>* call
this method *“mail merge”* (even though it is used for more than mail). A
model is combined with a dataset, often from a spreadsheet or a database.
This has been used for decades to create bulk mailings
<https://en.wikipedia.org/wiki/Mail_merge> and other bulk content, and is
a form of mass customisation
<https://en.wikipedia.org/wiki/Mass_customization>. The methods have
become increasingly complex over time and are able to answer more
questions: How to deal with missing or optional information? How to adapt
part of the text to the data, *e.g.* use plurals or grammatical gender or
noun classes where appropriate, *etc.*? The bots that were mentioned
above, which created millions of articles in various languages on
Wikipedia, have mostly worked along these lines.
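One of the questions above, dealing with missing or optional information,
can be sketched as follows: each sentence of the model is only emitted if
the data it needs is present. This is a hypothetical illustration of the
idea, not how Wikifunctions will actually represent models:

```python
# Sketch: a model as a list of optional sentence parts, each emitted only
# when the data it depends on exists. Helper and field names are invented.
def render_country(data):
    parts = []
    name = data["country"]
    if "continent" in data:
        parts.append(f"{name} is a country in {data['continent']}.")
    else:
        parts.append(f"{name} is a country.")
    if "population" in data:
        parts.append(f"It has a population of {data['population']:,}.")
    if "capital" in data:
        parts.append(f"The capital of {name} is {data['capital']}.")
    return " ".join(parts)

# A row with no known continent or population still yields a valid article:
print(render_country({"country": "Jordan", "capital": "Amman"}))
# Jordan is a country. The capital of Jordan is Amman.
```

Grammatical adaptation (plurals, gender, noun classes) works along similar
lines, with the model choosing among word forms based on the data, which is
exactly where per-language functions in Wikifunctions would come in.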
For a great example of how far the model approach can be pushed, consider
Magnus Manske’s Reasonator <https://meta.wikimedia.org/wiki/Reasonator>,
which, based on the data in Wikidata, creates the following automatic
description for Douglas Adams <https://reasonator.toolforge.org/?q=Q42>:
*Douglas Adams* was a British playwright, screenwriter, novelist,
children's writer, science fiction writer, comedian, and writer. He was
born on March 11, 1952 in Cambridge to Christopher Douglas Adams and Janet
Adams. He studied at St John's College from 1971 until 1974 and Brentwood
School from 1959 until 1970. His field of work included science fiction,
comedy, satire, and science fiction. He was a member of Groucho Club and
Footlights. He worked for The Digital Village from 1996 and for BBC. He
married Jane Belson on November 25, 1991 (married until on May 11, 2001 ),
Jane Belson on November 25, 1991 (married until on May 11, 2001 ), and Jane
Belson on November 25, 1991 (married until on May 11, 2001 ). His children
include Polly Adams, Polly Adams, and Polly Adams. He died of myocardial
infarction on May 11, 2001 in Santa Barbara. He was buried at Highgate
Cemetery.
If we were to say that this is merely better than nothing, I think we
would undersell the achievement of Reasonator. The above text, together
with the appealing display of the structured data in Reasonator, leads to a
more comprehensive access to knowledge than many of the individual language
Wikipedias provide for Douglas Adams. For comparison, check out the
articles in Azerbaijani <https://az.wikipedia.org/wiki/Duqlas_Adams>, Urdu
<https://ur.wikipedia.org/wiki/%DA%88%DA%AF%D9%84%D8%B3_%D8%A7%DB%8C%DA%88%D9%85%D8%B3>
, Malayalam
<https://ml.wikipedia.org/wiki/%E0%B4%A1%E0%B4%97%E0%B5%8D%E0%B4%B2%E0%B4%B8%E0%B5%8D%E0%B4%86_%E0%B4%A1%E0%B4%82%E0%B4%B8%E0%B5%8D>
, Korean
<https://ko.wikipedia.org/wiki/%EB%8D%94%EA%B8%80%EB%9F%AC%EC%8A%A4_%EC%95%A0%EB%8D%A4%EC%8A%A4>,
or Danish <https://da.wikipedia.org/wiki/Douglas_Adams>. At the same
time, it shows errors that most contributors wouldn’t know how to fix (such
as the repetition of the names of the children, or the spaces inside the
brackets, *etc.*).
The ArticlePlaceholder
<https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder> project has
partially fulfilled the role of filling content gaps, but the developers
have intentionally shied away from the results looking too much like an
article. They display structured data from Wikidata within the context of a
language Wikipedia. For example, here is the generated page about
*triceratops* in Haitian Creole
<https://ht.wikipedia.org/wiki/Espesyal:AboutTopic/Q14384>.
One large disadvantage of using bots to create articles in Wikipedia has
been that this content was mostly controlled by a very small subset of the
community — often a single person. Many of the bots and datasets have not
been open sourced in a way that someone else could easily come in, make a
change, and re-run the bot. (Reasonator avoids this issue, because the text
is generated dynamically and is not incorporated into the actual Wikipedia
article.)
With Wikifunctions and Wikidata, we will be able to give control over all
these steps to the wider community. Both the models and the data will be
edited on wiki, with all the usual advantages of having a wiki: there is a
clear history, everyone can edit through the Web, people can discuss,
*etc.* The data used to populate the models will be maintained in
Wikidata, and the models themselves in Wikifunctions. This will allow us to
collaborate on the texts, unleash the creativity of the community, spot and
correct errors and edge cases together, and slowly extend the types of
items and the coverage per type.
In a follow-up essay, we will discuss a different approach to creating
abstract content, where the content is not the result of a model based on
the type of the described item, but rather a manually constructed article,
built up sentence by sentence.
*Development update from the week of May 27:*
- The team had a session at Hackathon, which was well attended (about
30 people). Thanks to everyone for being there and your questions and
comments!
   - We also had follow-up meetings with User:Mahir256 to improve
   alignment on the NLG stream.
- Below is the brief weekly summary highlighting the status of each
workstream
- Performance:
- Observability document drafted.
- Updated Helm charts for getting function-* services in staging.
- Completed performance metrics design and shared for review
- NLG:
- Scoped out necessary changes to Wikifunctions post-launch
- Metadata:
- Started recording and passing up some function-evaluator
timing metrics to the orchestrator
- Experience:
- WikiLambda (PHP) layer has been migrated to the new format of
typed lists
- Improved the mobile experience of the function view page
- Transitioned the Tabs component to use Codex's, thanks to the
Design Systems Team.
- Design: Carried out end-to-end user flow testing in Bangla.
*(Apologies for this update being late. We plan to send out another update
this week)*
_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia(a)lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikime…