One of the early types of functions we want to start building in Wikifunctions performs regular morphological transformations on words: given the base form of a word, such a function creates its regular inflected forms. To give an example: it can tell me that the plural of “book” in English is “books”.
English is a comparatively simple example, but that should make it easier to sketch out the proposal in this newsletter. In many other languages, the morphological functions and the grammar are likely to be more complicated.
The most regular way to create a plural from an English noun’s base form is to add the letter “s” to it. Let’s now see how many of Wikidata’s entries would be covered by this simple rule.
Wikidata currently has about 28,100 English nouns.
Whereas Wikidata allows for a lot of flexibility when entering lexicographical entries, Wikifunctions will require the data to have a more predictable shape in order to use it effectively. One way to express these shapes is through lexical masks. English nouns have two different lexical masks: one with only two forms (a singular and a plural, e.g. “book” and “books”) and one with four forms (including two genitive forms, i.e. “book’s” and “books’”). Both of these masks have been automatically translated into ShEx, the schema language that Wikidata uses to check the shape and completeness of its data. But only the two-form version has been turned into an Entity Schema in Wikidata.
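To make the two-form mask concrete, here is a minimal Python sketch of such a check. The function name and the input shape are illustrative (a simplified view of a lexeme’s forms), not Wikidata’s actual API:

```python
# Wikidata items for the two grammatical features in the two-form mask:
SINGULAR = "Q110786"  # grammatical number: singular
PLURAL = "Q146786"    # grammatical number: plural

def matches_two_form_mask(forms):
    """Check that a lexeme has exactly two forms: one singular, one plural.

    `forms` is a list of dicts, each carrying a list of grammatical
    feature item IDs -- a simplified view of Wikidata's lexeme JSON.
    """
    feature_sets = [set(form["grammaticalFeatures"]) for form in forms]
    return (len(forms) == 2
            and {SINGULAR} in feature_sets
            and {PLURAL} in feature_sets)

book = [
    {"representation": "book", "grammaticalFeatures": [SINGULAR]},
    {"representation": "books", "grammaticalFeatures": [PLURAL]},
]
assert matches_two_form_mask(book)
```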
Now we can take the roughly 28,100 English nouns in Wikidata and check how many of them fulfill the requirements described above (let me know if there is interest in the code). It turns out that more than 25,500 of the nouns, i.e. more than 91%, do, and all of them fulfill the two-form schema. Four nouns (contract, player, swimmer, and sport) almost fulfill the four-form schema, but for each of them the case feature is missing on the nominative forms.
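For readers already curious about the code: a rough sketch of how such a count can be run against the Wikidata Query Service. The query below illustrates the idea, but it is not the exact code used for the numbers above, and it only checks that a singular and a plural form exist, not the full mask:

```python
import requests

# Count English (Q1860) noun (Q1084) lexemes that have at least one
# singular (Q110786) and one plural (Q146786) form, using the public
# Wikidata Query Service SPARQL endpoint.
QUERY = """
SELECT (COUNT(DISTINCT ?lexeme) AS ?count) WHERE {
  ?lexeme dct:language wd:Q1860 ;
          wikibase:lexicalCategory wd:Q1084 ;
          ontolex:lexicalForm ?singular, ?plural .
  ?singular wikibase:grammaticalFeature wd:Q110786 .
  ?plural wikibase:grammaticalFeature wd:Q146786 .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "morphology-newsletter-example/0.1"},
)
print(response.json()["results"]["bindings"][0]["count"]["value"])
```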
So let’s focus on the 25,500 nouns that pass the structural requirements. On NotWikiLambda, we created a function that adds the letter “s” to the end of the word. When we count how many of the plurals are generated this way, we see that the plurals of 21,000 English nouns, or 82% of all nouns, are created correctly by simply adding “s”. Adding “s” is one paradigm, and, as we can see, the most common one for English nouns.
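In Python, that function is a one-liner; the coverage count is sketched below it, assuming a list of (singular, plural) pairs extracted from Wikidata (the extraction itself is not shown):

```python
def add_s(word):
    """The simplest English plural paradigm: append the letter "s"."""
    return word + "s"

def coverage(paradigm, nouns):
    """Fraction of (singular, plural) pairs the paradigm predicts correctly."""
    hits = sum(1 for singular, plural in nouns if paradigm(singular) == plural)
    return hits / len(nouns)

# e.g. coverage(add_s, nouns) gives roughly 0.82 on the data described above.
```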
On the right-hand side of the Function's page you can see a heading “Evaluate Function,” and there you can enter a value, say “book”. If you click on “Call Function” below, the result “books” should come back. (Note that WikiLambda is in heavy development, and the test site might have hiccups at any time. A screenshot of the evaluation working correctly is shown here.)
Another paradigm covers many English nouns that end with the letter “y”: in many cases we replace the final “y” with “ies”, e.g. when turning “baby” into “babies”, or “fairy” into “fairies”. On NotWikiLambda, we created a function that replaces a final “y” with “ies”. When we run this paradigm against the nouns in Wikidata, more than 2,000 nouns (almost 8%) are covered by this function.
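A sketch of this second paradigm, in the same style as before:

```python
def replace_y_with_ies(word):
    """Replace a final "y" with "ies", e.g. "baby" -> "babies"."""
    if word.endswith("y"):
        return word[:-1] + "ies"
    return word  # leave words not ending in "y" unchanged

assert replace_y_with_ies("baby") == "babies"
assert replace_y_with_ies("fairy") == "fairies"
```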
We could create further paradigms (e.g. add “es”, which would cover more than 1,800 nouns), and we could even write a single function that tries to discern which of these rules to apply: if the word ends with “s” or “sh”, add “es”; if it ends with a “y” preceded by a consonant, replace that “y” with “ies”; otherwise simply add “s”; and so on. That would give us a more powerful function that can deal with many more words (a bit of experimentation got me to a function that covers 98.3% of all cases).
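A minimal sketch of such a combined function, following the rules just listed (the extra endings “ch”, “x”, and “z” are a common extension I added; this simple version is far from the 98.3% coverage mentioned above):

```python
VOWELS = "aeiou"

def smart_plural(word):
    """A combined paradigm that picks one of the rules above."""
    if word.endswith(("s", "sh", "ch", "x", "z")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in VOWELS:
        return word[:-1] + "ies"
    return word + "s"

assert smart_plural("book") == "books"
assert smart_plural("bush") == "bushes"
assert smart_plural("baby") == "babies"
assert smart_plural("day") == "days"  # vowel before "y": just add "s"
```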
Grammatical Framework has introduced such functions as so-called smart paradigms. Their web-based implementation of smart paradigms for English nouns covers 96% of the nouns in Wikidata. I would be very curious to see how either of these numbers compares to modern, machine-learning-based solutions, and I also want to invite people to create an even smarter paradigm with better coverage, without the code becoming too complex.
Smart paradigms are useful when the data in Wikidata is incomplete. For loanwords, technical terms, neologisms, names, or when verbing nouns (so-called conversion), we might need to automatically create a form that Wikidata doesn’t yet explicitly know about.
As this week’s entry is already getting quite long, we will defer until next time the discussion of how the paradigms implemented in Wikifunctions might interplay with the lexicographic data in Wikidata. This will also shed more light on the role that morphological paradigms might play for Abstract Wikipedia in the future.
----
In other news:
This week, Abstract Wikipedia was covered on the US public radio news programme The World. Host Marco Werman interviewed Denny in a five-minute segment that was broadcast on numerous public radio stations. The segment is now also available online.
The German public TV station 3sat broadcast a documentary about Wikipedia this week: “Wikipedia - Die Schwarmoffensive” (“Wikipedia: The Swarm Offensive”). The German-language documentary can be viewed online from Germany, Switzerland, and Austria, and it discusses Abstract Wikipedia for a few minutes toward the end.