The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-10
--
Among the first kinds of functions we want to build in Wikifunctions are
functions that perform regular morphological transformations on words. That
is, functions that, given the base form of a word, can create its regular
inflected forms. To give an example: a function that can tell us that the
plural of *“book”* in English is *“books”*.
English is a comparatively simple example, which should make it easier to
sketch out the proposal in this newsletter. In many other cases, the
morphological functions and the grammar are likely to be more complicated.
The most regular way to create a plural from an English noun’s base form is
to add the letter *“s”* to it. Let’s now see how many of Wikidata’s entries
would be covered by this simple rule.
Wikidata currently has about 28,100 <https://w.wiki/43N6> English nouns.
Whereas Wikidata allows for a lot of flexibility when entering
lexicographical entries, Wikifunctions will require the data to have a more
predictable shape in order to use it effectively. One way to express these
shapes is through lexical masks <https://github.com/google/lexical-masks/>.
English nouns have two different lexical masks
<https://github.com/google/lexical-masks/blob/master/masks/en.json>: one
with only two forms (a singular and a plural, e.g. *“book”* and *“books”*)
and one with four forms (including two genitive forms, i.e. *“book’s”* and
*“books’”*). Both of these masks have been automatically translated
<https://github.com/google/lexical-masks/blob/master/shex/en.shex> into ShEx
<https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas>, the language
that is used by Wikidata for checking data completeness. But only the
two-form version has been turned into an Entity Schema in Wikidata
<https://www.wikidata.org/wiki/EntitySchema:E155>.
Now we can take the 28,000 English nouns in Wikidata and check how many of
them fulfill the requirements described above (let me know if there is
interest in the code). It turns out that more than 25,500, that is more
than 91% of the nouns, fulfill the requirement. And all of them fulfill the
two-form schema. Four nouns (*contract*
<https://www.wikidata.org/wiki/Lexeme:L5605>, *player*
<https://www.wikidata.org/wiki/Lexeme:L5607>, *swimmer*
<https://www.wikidata.org/wiki/Lexeme:L7384>, and *sport*
<https://www.wikidata.org/wiki/Lexeme:L301>) almost fulfill the four-form
schema, but on each of them the nominative forms are missing their
grammatical case.
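To make the idea of a "predictable shape" concrete, here is a minimal Python sketch of a two-form check. It is not the code used for the numbers above and not the Entity Schema itself. The simplified dictionary layout, the function name `matches_two_form_mask`, and the two item IDs used for "singular" and "plural" are assumptions made for illustration.

```python
# Assumed Wikidata item IDs for the grammatical features (illustrative).
SINGULAR = "Q110786"
PLURAL = "Q146786"


def matches_two_form_mask(forms):
    """Return True if `forms` is exactly one singular and one plural form.

    `forms` is a simplified, hypothetical stand-in for a lexeme's forms:
    a list of dicts with a "grammaticalFeatures" list of item IDs.
    """
    if len(forms) != 2:
        return False
    # Compare the feature sets of the two forms, ignoring their order.
    features = sorted(tuple(sorted(f["grammaticalFeatures"])) for f in forms)
    return features == sorted([(SINGULAR,), (PLURAL,)])


book = [
    {"representation": "book", "grammaticalFeatures": [SINGULAR]},
    {"representation": "books", "grammaticalFeatures": [PLURAL]},
]
incomplete = [
    {"representation": "sheep", "grammaticalFeatures": [SINGULAR]},
]

print(matches_two_form_mask(book))        # True
print(matches_two_form_mask(incomplete))  # False
```

A lexeme that carries extra forms, or forms with additional features, fails this strict check, which is the kind of filtering the structural requirement implies.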
<https://meta.wikimedia.org/wiki/File:Book_to_books_in_NotWikiLambda.png>
Evaluating "Add s" on "book" in NotWikiLambda
So let’s focus on the 25,500 nouns that pass the structural requirements.
We created a function in NotWikiLambda that adds the letter *“s”* at the
end of a word. When we count how many of the plurals are generated
correctly this way, we see that 21,000 English nouns get the right plural
by simply adding *“s”*, about 82% of those nouns. Adding *“s”* is one
paradigm, and, as we can see, the most common one for English nouns.
On the right-hand side of the Function's page you can see a heading
“Evaluate Function,” and there you can enter a value, say *“book”*. If you
click on “Call Function” below, the result *“books”* should come back.
(Note that WikiLambda <https://www.mediawiki.org/wiki/Extension:WikiLambda> is
in heavy development, and the test site
<https://notwikilambda.toolforge.org/> might have hiccups at any time. A
screenshot of the evaluation working correctly is shown here.)
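As a rough illustration of the paradigm and of how its coverage can be counted, here is a small Python sketch. It is not the NotWikiLambda implementation. The names `add_s` and `coverage` and the four-word sample are made up for this example, and real input would be the (base form, plural) pairs taken from Wikidata.

```python
def add_s(word):
    """The simplest English plural paradigm: append the letter "s"."""
    return word + "s"


def coverage(paradigm, pairs):
    """Fraction of (base form, plural) pairs that `paradigm` reproduces."""
    if not pairs:
        return 0.0
    hits = sum(1 for base, plural in pairs if paradigm(base) == plural)
    return hits / len(pairs)


# A tiny hand-made sample, not Wikidata data.
sample = [
    ("book", "books"),
    ("player", "players"),
    ("baby", "babies"),
    ("box", "boxes"),
]

print(add_s("book"))            # books
print(coverage(add_s, sample))  # 0.5
```

Passing the paradigm in as an argument means the same counting function can be reused for any other paradigm.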
Another paradigm works for many English nouns that end with the letter *“y”*.
There are many cases where we replace the final *“y”* with the letters
*“ies”*, e.g. when turning *“baby”* into *“babies”*, or *“fairy”* into
*“fairies”*. We created the function replacing *“y”* at the end with *“ies”*
<https://notwikilambda.toolforge.org/wiki/Z10129> in NotWikiLambda. When we
run this paradigm against the nouns in Wikidata, more than 2,000 nouns
(almost 8%) get covered by this function.
<https://meta.wikimedia.org/wiki/File:Baby_to_babies_in_NotWikiLambda.png>
Evaluating "Replace y with ies at end" in NotWikiLambda
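A Python sketch of this second paradigm could look as follows. The function name is hypothetical, and raising an error for words that do not end in *“y”* is a choice made for this sketch, since the behaviour of the NotWikiLambda function in that case is not described here.

```python
def replace_y_with_ies(word):
    """Replace a final "y" with "ies", e.g. "baby" -> "babies"."""
    if not word.endswith("y"):
        # Assumption of this sketch: reject words the paradigm does not fit.
        raise ValueError(f"{word!r} does not end with 'y'")
    return word[:-1] + "ies"


print(replace_y_with_ies("baby"))   # babies
print(replace_y_with_ies("fairy"))  # fairies
print(replace_y_with_ies("day"))    # daies (wrong: the plural is "days")
```

The last call shows why the paradigm only fits some nouns ending in *“y”*: when a vowel precedes the *“y”*, as in *“day”*, plain *“s”* is the right choice.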
We could create further paradigms (e.g. add *“es”*, which would cover more
than 1,800 nouns). We could even write a single function which tries to
discern which of these paradigms to apply: if the word ends with *“s”* or
*“sh”*, add *“es”*; if it ends with a *“y”* preceded by a consonant,
replace that *“y”* with *“ies”*; else simply add an *“s”*; and so on. This
would give us a more powerful function that can deal with many more words.
A bit of experimentation got me to a function
<https://notwikilambda.toolforge.org/wiki/Z10132> that covers 98.3% of all
cases.
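The three rules spelled out above can be sketched in Python as follows. This is only an illustration of the rule ordering, not the function behind the 98.3% figure, whose implementation is not shown here. The name `smart_plural` and the restriction to lowercase ASCII words are assumptions of this sketch.

```python
VOWELS = "aeiou"


def smart_plural(word):
    """Pick one of three plural paradigms based on the ending of `word`.

    Assumes a lowercase English noun; covers only the rules listed in the
    text, so words such as "box" or "child" are still handled incorrectly.
    """
    # Rule 1: words ending in "s" or "sh" take "es".
    if word.endswith(("s", "sh")):
        return word + "es"
    # Rule 2: a final "y" preceded by a consonant becomes "ies".
    if len(word) >= 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "ies"
    # Default: simply add "s".
    return word + "s"


for noun in ["book", "bus", "dish", "baby", "day"]:
    print(noun, "->", smart_plural(noun))
# book -> books
# bus -> buses
# dish -> dishes
# baby -> babies
# day -> days
```

The order of the checks matters: the more specific endings are tested first and plain *“s”* is the fallback, so each additional rule only takes words away from the default case.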
Grammatical Framework has introduced these functions as so-called smart
paradigms <https://aclanthology.org/E12-1066.pdf>. Their web-based
implementation of smart paradigms
<https://cloud.grammaticalframework.org/gfmorpho/> for English nouns covers
96% of the nouns in Wikidata. I would be very curious to see how either of
these numbers compares to modern, machine-learning-based solutions, and I
also want to invite people to create an even smarter paradigm with better
coverage without the code becoming too complex.
Smart paradigms are useful when data in Wikidata is incomplete. For example,
for loan words, technical terms, neologisms, names, or when verbing nouns
<https://www.gocomics.com/calvinandhobbes/1993/01/25> (so-called conversion
<https://en.wikipedia.org/wiki/Conversion_(word_formation)#Verb_conversion_in_English>),
we might need to create a form automatically that Wikidata doesn’t yet
explicitly know about.
As this week’s entry is already getting quite long, we will defer to next
time the discussion of how paradigms implemented in Wikifunctions might
interact with the lexicographic data in Wikidata. This will also shed more
light on the role that the morphological paradigms might play for Abstract
Wikipedia in the future.
----
In other news:
This week, Abstract Wikipedia was covered on the US NPR radio news
programme The World
<https://en.wikipedia.org/wiki/The_World_(radio_program)>. Host Marco
Werman interviewed Denny
<https://www.pri.org/file/2021-09-07/wikipedia-s-efforts-get-its-300-language-versions-same-page>
in a five-minute segment that was broadcast on numerous public radio
stations. The segment is now also available online.
The German public TV station 3sat <https://en.wikipedia.org/wiki/3sat>
broadcast a documentary about Wikipedia this week: “Wikipedia - Die
Schwarmoffensive”
<https://www.3sat.de/film/dokumentarfilm/wikipedia--die-schwarmoffensive-100.html>.
The German-language documentary can be viewed online from Germany,
Switzerland, and Austria. It also discusses Abstract Wikipedia for a few
minutes at the end of the documentary.