Two weeks ago, we previewed the first form of access to knowledge on Wikidata, and last week we announced that it has gone live. This week we want to sketch out what we are aiming for by the end of the year.
As we have pointed out over the last two weeks, there are a number of issues we are working on to improve access to Lexemes and other entities on Wikidata, from problems with the selector for Lexemes (which have already been fixed) to a better selector and display for Wikidata items (which we are still working on). Thanks to everybody who has given us feedback, pointed out further issues, or helped us prioritize the tasks.
But what is the goal for this year? What are we building towards, and where do we want to be by the end of 2024?
The goal is to be able to build up phrases from Lexemes using linguistic agreement. What does that mean?
Many languages require agreement in order to be correct (some languages do not, such as Japanese, some need a little, such as English, and some need a lot, such as Swahili). Agreement, or concord, means that one word or phrase has to change in order to fit another word or phrase in a given sentence. Let’s take a look at an English example:
In the first sentence, the word “an” must be followed by the singular, whereas in the second sentence the word “two” must be followed by the plural. So the first sentence uses the word form “apple”, and the second sentence the word form “apples”.
Many languages such as Italian, Hindi, or Ukrainian have grammatical genders for nouns, such as for their respective words for cat: in Italian, gatto is masculine, in Hindi, बिल्ली is feminine, and so is the Ukrainian кішка. If a noun is being described by an adjective, the adjective in these languages has to agree with the gender of the noun. So, if we want to express little cat in Italian, we would say:
piccolo gatto
Turtle in Italian is tartaruga, which is a feminine noun. If we want to express little turtle in Italian, we would say:
piccola tartaruga
Note the different ending on the adjective: it is piccolo for masculine nouns, and piccola for feminine nouns.
Assume a function that takes two arguments, both Lexemes, one an Italian noun, the other an Italian adjective. In Italian, the adjective usually just precedes the noun. But in order to choose the right form, we need to know the grammatical gender of the noun. In Wikidata, there is a property for grammatical gender. Before the end of the year, we plan to enable you to run a function in Wikifunctions on an Italian noun, and get back the value for the grammatical gender of that noun, if it is given in Italian.
With the value for grammatical gender, you will then be able to filter the adjective in order to pick the right form. Once we have the right form of the adjective and the noun, we can concatenate the two with a space in between, and get a grammatically correct phrase with an adjective and a noun.
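The steps above can be sketched in Python. This is only an illustration under stated assumptions, not how Wikifunctions or Wikidata represent Lexemes: the `Form` and `Lexeme` classes, the helper names `form`, `pick_form` and `adjective_noun_phrase`, and the use of plain strings for gender and number are all hypothetical simplifications, and the sample data is written by hand rather than fetched from Wikidata.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


# Hypothetical, simplified stand-ins for Wikidata Lexemes. The real data model
# identifies genders and grammatical features by Wikidata items, not strings.
@dataclass(frozen=True)
class Form:
    text: str
    features: FrozenSet[str]


@dataclass(frozen=True)
class Lexeme:
    lemma: str
    forms: List[Form]
    gender: Optional[str] = None  # grammatical gender, if it is recorded


def form(text: str, *features: str) -> Form:
    """Small helper to keep the sample data readable."""
    return Form(text, frozenset(features))


def pick_form(lexeme: Lexeme, required: set) -> Optional[str]:
    """Return the first form whose features include all required ones."""
    for candidate in lexeme.forms:
        if required <= candidate.features:
            return candidate.text
    return None


def adjective_noun_phrase(adjective: Lexeme, noun: Lexeme) -> Optional[str]:
    """Build 'adjective noun' in the singular, agreeing in gender."""
    if noun.gender is None:
        return None  # without a recorded gender we cannot pick a form
    adjective_text = pick_form(adjective, {noun.gender, "singular"})
    noun_text = pick_form(noun, {"singular"})
    if adjective_text is None or noun_text is None:
        return None
    # Concatenate the two forms with a space in between.
    return f"{adjective_text} {noun_text}"


# Hand-written sample data (an assumption, not a copy of the Wikidata entries).
piccolo = Lexeme("piccolo", [
    form("piccolo", "masculine", "singular"),
    form("piccola", "feminine", "singular"),
    form("piccoli", "masculine", "plural"),
    form("piccole", "feminine", "plural"),
])
gatto = Lexeme("gatto", [form("gatto", "singular"), form("gatti", "plural")],
               gender="masculine")
tartaruga = Lexeme("tartaruga",
                   [form("tartaruga", "singular"), form("tartarughe", "plural")],
                   gender="feminine")

print(adjective_noun_phrase(piccolo, gatto))      # piccolo gatto
print(adjective_noun_phrase(piccolo, tartaruga))  # piccola tartaruga
```

Returning `None` when the gender or a matching form is missing mirrors the caveat above: the phrase can only be built if the required information has actually been entered.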
We are looking forward to offering you these capabilities and to seeing what you will build with them.
Since Lexemes are new to Wikifunctions, we will look this week at one of the brand new community-created functions for Lexemes: plural form of lexeme as monolingual text (Z19260). You can go to that function, select a Lexeme, and run the function, and it will return the first form on that Lexeme that is a plural.
For example, enter the English noun goose, and it returns geese in English; enter the Spanish noun compás, and it returns compases in Spanish. This function should work for every language and always return a correct form, as long as that form is in Wikidata (and if it is missing in Wikidata, feel free to enter it).
The function takes one argument of type Wikidata Lexeme and returns a monolingual text (that is, a text in a specific language).
There are two tests written for this function: the plural of dog being dogs, and the plural of amigo being amigos. We have the same issue with these tests as last week: they depend as much on Wikidata as they do on Wikifunctions. The second test illustrates that well: it so happens that on the Lexeme for the Spanish noun amigo the form amigos is listed before the form amigas, but both of them are correct plural forms, the former being masculine and the latter feminine. The forms could just as well have been listed the other way around.
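To make that dependence on form order concrete, here is a small Python sketch of the behavior described above: return the first form that carries the plural feature. It does not reproduce the actual composition behind Z19260. The `Form` and `MonolingualText` classes, the `plural_form` helper, and the hand-written form lists are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


# Hypothetical, simplified stand-ins; not the Wikifunctions types themselves.
@dataclass(frozen=True)
class Form:
    text: str
    features: FrozenSet[str]


@dataclass(frozen=True)
class MonolingualText:
    language: str
    text: str


def plural_form(language: str, forms: List[Form]) -> Optional[MonolingualText]:
    """Return the first form tagged as plural, in the order the forms are listed."""
    for candidate in forms:
        if "plural" in candidate.features:
            return MonolingualText(language, candidate.text)
    return None  # no plural form has been entered


# Illustrative form lists, written by hand for this sketch.
goose = [
    Form("goose", frozenset({"singular"})),
    Form("geese", frozenset({"plural"})),
]
amigo = [
    Form("amigo", frozenset({"masculine", "singular"})),
    Form("amigos", frozenset({"masculine", "plural"})),
    Form("amiga", frozenset({"feminine", "singular"})),
    Form("amigas", frozenset({"feminine", "plural"})),
]

print(plural_form("en", goose).text)                 # geese
print(plural_form("es", amigo).text)                 # amigos
# The same forms in a different order give a different, equally valid plural.
print(plural_form("es", list(reversed(amigo))).text)  # amigas
```

A test that expects amigos therefore passes or fails depending on how the forms happen to be ordered on the Lexeme, not on any change to the function.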
The function has one implementation, using a composition. We will read the composition from the inside to the outside.
Currently, the function fails frequently due to timeouts when resolving larger objects and evaluating more complex compositions (for example, it times out on a German noun such as Baum). Also, the call to echo shouldn’t be necessary. We can use this function as a benchmark for improving the capabilities and robustness of Wikifunctions. At the same time, when it works, it demonstrates a really interesting use case.