The on-wiki version of this newsletter can be found here: https://www.wikifunctions.org/wiki/Wikifunctions:Status_updates/2024-10-17
Because of the more complex formatting, the on-wiki version might be easier to read.

What could abstract content look like?
*This week’s newsletter is guest-written by Mahir Morshed https://www.wikifunctions.org/wiki/User:Mahir256.*
The notion of ‘abstract content’ for Abstract Wikipedia arises by analogy to regular content on regular Wikipedias. This regular content is written in a specific language’s writing system and, on the surface, is not clearly connected to the structured information on Wikidata. By contrast, then, abstract content should not be tied to a specific language’s writing system and should instead be derived from information on Wikidata. It would additionally be useful for the parts of this content to have a simplified syntax, both to reduce the logic needed to process and manipulate this content and to ensure additions to the content don’t inherently require changes to the representation format.
It remains then to speak of how this abstract content should appear such that these desiderata are achieved. Let’s try to arrive at such a representation through some changes to a Constructor for a simple sentence, starting with something similarly structured to Figure 1 in Denny’s CACM paper https://dl.acm.org/doi/10.1145/3425778:
Action( predicate: eating, eater: Robert J. Jones, eaten: ice cream, location: Decatur, Illinois, time: 1 July 2023, 11:30am )
The intended meaning of this sentence is “Robert J. Jones ate ice cream in Decatur, Illinois on July 1st, 2023 at 11:30am.” Right now everything in the Constructor is in English, and none of the arguments refer to Wikidata at all. Let’s (mostly) fix the latter of these problems:
Action( predicate: Q213449 https://www.wikidata.org/wiki/Q213449, eater: Q33103898 https://www.wikidata.org/wiki/Q33103898, eaten: Q13233 https://www.wikidata.org/wiki/Q13233, location: Q506325 https://www.wikidata.org/wiki/Q506325, time: “+2023-07-01T16:30:00Z” )
This is better, but the name of the Constructor and the names of the arguments are still in English. What if we used Wikidata items to represent these as well?
Q4026292 https://www.wikidata.org/wiki/Q4026292( Q179080 https://www.wikidata.org/wiki/Q179080: Q213449 https://www.wikidata.org/wiki/Q213449, Q20984678 https://www.wikidata.org/wiki/Q20984678: Q33103898 https://www.wikidata.org/wiki/Q33103898, Q2095 https://www.wikidata.org/wiki/Q2095: Q13233 https://www.wikidata.org/wiki/Q13233, Q115095765 https://www.wikidata.org/wiki/Q115095765: Q506325 https://www.wikidata.org/wiki/Q506325, Q7805404 https://www.wikidata.org/wiki/Q7805404: +2023-07-01T16:30:00Z )
Now that nearly everything in this Constructor is represented by a Wikidata QID, it can be displayed entirely in a particular language provided that each item referred to has a label in that language, such as Bengali:
কার্য( বিধেয়: খাওয়া, ভোক্তা: রবার্ট জে জোন্স, খাদ্য: আইসক্রিম, অবস্থান: ডেকেটার, ইলিনয়, ঘটনার সময়: +2023-07-01T16:30:00Z )
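The label substitution described above can be sketched in a few lines of Python. This is only an illustration: the label table is copied by hand from the Bengali rendering rather than fetched from Wikidata, the names `BENGALI_LABELS`, `label`, and `render` are hypothetical, and falling back to the raw value when no label is known is an assumption of this sketch.

```python
# Labels copied from the Bengali example in the text. A real system would
# look these up on Wikidata instead of hard-coding them.
BENGALI_LABELS = {
    "Q4026292": "কার্য",
    "Q179080": "বিধেয়",
    "Q213449": "খাওয়া",
    "Q20984678": "ভোক্তা",
    "Q33103898": "রবার্ট জে জোন্স",
    "Q2095": "খাদ্য",
    "Q13233": "আইসক্রিম",
    "Q115095765": "অবস্থান",
    "Q506325": "ডেকেটার, ইলিনয়",
    "Q7805404": "ঘটনার সময়",
}

# The Constructor from the text: a name plus (argument name, value) pairs.
CONSTRUCTOR = "Q4026292"
ARGUMENTS = [
    ("Q179080", "Q213449"),
    ("Q20984678", "Q33103898"),
    ("Q2095", "Q13233"),
    ("Q115095765", "Q506325"),
    ("Q7805404", "+2023-07-01T16:30:00Z"),
]


def label(value, labels):
    """Return the label for a QID, or the value itself if none is known."""
    return labels.get(value, value)


def render(constructor, arguments, labels):
    """Display a Constructor with every known QID replaced by its label."""
    parts = [f"{label(k, labels)}: {label(v, labels)}" for k, v in arguments]
    return f"{label(constructor, labels)}( {', '.join(parts)} )"


rendered = render(CONSTRUCTOR, ARGUMENTS, BENGALI_LABELS)
unlabelled = render(CONSTRUCTOR, ARGUMENTS, {})
```

With an empty label table, `render` simply reproduces the QIDs, which shows why every item needs a label in the target language for the display to be complete.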
We’re still not done, though: could we simplify this syntax a bit? (Can we get away from needing named arguments to functions?)
Q4026292 https://www.wikidata.org/wiki/Q4026292( Q179080 https://www.wikidata.org/wiki/Q179080(Q213449 https://www.wikidata.org/wiki/Q213449), Q20984678 https://www.wikidata.org/wiki/Q20984678(Q33103898 https://www.wikidata.org/wiki/Q33103898), Q2095 https://www.wikidata.org/wiki/Q2095(Q13233 https://www.wikidata.org/wiki/Q13233), Q115095765 https://www.wikidata.org/wiki/Q115095765(Q506325 https://www.wikidata.org/wiki/Q506325), Q7805404 https://www.wikidata.org/wiki/Q7805404(+2023-07-01T16:30:00Z) )
This change, from using named function arguments to using single-member functions as unnamed arguments, should hopefully remind one of the composition syntax https://www.wikifunctions.org/wiki/Wikifunctions:How_to_create_implementations#Composition that Wikifunctions functions can be implemented in.
Since different predicates require different participant roles (‘drinking’ requires ‘drinker’ and ‘drink’, ‘reading’ requires ‘reader’ and ‘thing being read’, and so on), the number of functions that need to be introduced at this point will likely skyrocket. We can reduce this number by generalizing them to use Q613930 https://www.wikidata.org/wiki/Q613930 to indicate participant roles, keeping the QIDs we introduced for those roles as arguments instead:
Q4026292 https://www.wikidata.org/wiki/Q4026292( Q179080 https://www.wikidata.org/wiki/Q179080(Q213449 https://www.wikidata.org/wiki/Q213449), Q613930 https://www.wikidata.org/wiki/Q613930(Q20984678 https://www.wikidata.org/wiki/Q20984678, Q33103898 https://www.wikidata.org/wiki/Q33103898), Q613930 https://www.wikidata.org/wiki/Q613930(Q2095 https://www.wikidata.org/wiki/Q2095, Q13233 https://www.wikidata.org/wiki/Q13233), Q115095765 https://www.wikidata.org/wiki/Q115095765(Q506325 https://www.wikidata.org/wiki/Q506325), Q7805404 https://www.wikidata.org/wiki/Q7805404(+2023-07-01T16:30:00Z) )
The connection to particular programming languages can be made even more explicit with a little rearrangement:
(“Q4026292” (“Q179080” “Q213449”) (“Q613930” “Q20984678” “Q33103898”) (“Q613930” “Q2095” “Q13233”) (“Q115095765” “Q506325”) (“Q7805404” “+2023-07-01T16:30:00Z”) )
This format, borrowing from the syntax of Lisp-like https://en.wikipedia.org/wiki/Lisp programming languages, is what I believe should be used to store abstract content for Abstract Wikipedia. As a purely optional last measure for completeness, let’s try to turn the timestamp into QIDs, using items for the date, time, and time zone:
(“Q4026292” (“Q179080” “Q213449”) (“Q613930” “Q20984678” “Q33103898”) (“Q613930” “Q2095” “Q13233”) (“Q115095765” “Q506325”) (“Q7805404” (“Q186885” “Q69306847” “Q95056915” “Q15406405”)) )
Since this final result is composed entirely of strings (or perhaps integers, if the “Q” is removed everywhere) and lists, both of which are primitive data structures in many environments, it can be read and modified the way other lists of strings are handled in those environments. (In fact, lists of strings can be used as the input to Wikifunctions functions https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2024-01-03, even though actual handling of Wikidata items is still to come.) As a reminder, since each string is a Wikidata QID, this final result can be displayed in a given language provided each item has a label in that language.
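To make the "lists of strings" point concrete, here is a minimal Python sketch that reads the parenthesized form into nested lists and writes it back out. The names `parse` and `to_text` are hypothetical, the sketch treats typographic and straight double quotes alike, and it silently skips anything that is neither a parenthesis nor a quoted string; none of this is taken from an existing implementation.

```python
import re

# Matches "(", ")", or a quoted string (straight or typographic quotes).
TOKEN = re.compile(r'\(|\)|["“]([^"”]*)["”]')


def parse(text):
    """Turn the parenthesized text form into nested Python lists of strings."""
    stack = [[]]
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            # A quoted string: keep only the part between the quotes.
            stack[-1].append(match.group(1))
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced expression")
    return stack[0][0]


def to_text(node):
    """Write nested lists back out, using straight double quotes."""
    if isinstance(node, str):
        return f'"{node}"'
    return "(" + " ".join(to_text(child) for child in node) + ")"


TEXT = """(“Q4026292” (“Q179080” “Q213449”) (“Q613930” “Q20984678” “Q33103898”)
(“Q613930” “Q2095” “Q13233”) (“Q115095765” “Q506325”)
(“Q7805404” (“Q186885” “Q69306847” “Q95056915” “Q15406405”)) )"""

tree = parse(TEXT)
predicate = tree[1][1]          # "Q213449"
numeric_id = int(tree[0][1:])   # 4026292, the QID with its "Q" removed
round_trip = parse(to_text(tree))
```

Once parsed, the content is an ordinary list, so indexing, slicing, and appending work exactly as they do for any other list of strings.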
The Constructor whose written form we have been modifying also represents what I believe to be a very useful building block for abstract content. In many languages this would correspond to a structurally simpler sentence (albeit one whose main verb isn’t something like ‘to be’ or ‘to have’), complete with a predicate (‘eating’), participant roles (such as ‘eater’ and ‘food’), and any number of modifiers (such as ‘location’ and ‘time’). There are already lots of Wikidata items for predicates, with Wikidata verb and verb phrase lexemes linking to them https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Statistics/indirect_translations_(predicates), and there is an emerging effort to introduce items to represent participant roles for predicates https://www.wikidata.org/wiki/Wikidata:WikiProject_Events_and_Role_Frames. In principle, the order of components within such a block would not be significant, so that the following would be functionally identical to what was shown above:
(“Q4026292” (“Q115095765” “Q506325”) (“Q179080” “Q213449”) (“Q7805404” (“Q186885” “Q69306847” “Q95056915” “Q15406405”)) (“Q613930” “Q2095” “Q13233”) (“Q613930” “Q20984678” “Q33103898”) )
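One way to check that two such blocks are "functionally identical" is to compare them after sorting the components of each sentence block. The Python sketch below does this under two assumptions of my own: only lists headed by Q4026292 are treated as order-insensitive, and the order inside every other list (for example, role before participant) is kept as significant. The names `canonical` and `same_content` are hypothetical.

```python
SENTENCE_BLOCK = "Q4026292"  # head of the simple-sentence block in the text


def canonical(node):
    """Return a comparable form in which sentence components are sorted."""
    if isinstance(node, str):
        return node
    head, *rest = node
    children = [canonical(child) for child in rest]
    if head == SENTENCE_BLOCK:
        # Component order inside a sentence block is not significant.
        children.sort(key=repr)
    return (head, *children)


def same_content(a, b):
    """True if two blocks differ at most in the order of sentence components."""
    return canonical(a) == canonical(b)


TIME = ["Q7805404", ["Q186885", "Q69306847", "Q95056915", "Q15406405"]]

ORIGINAL = [
    "Q4026292",
    ["Q179080", "Q213449"],
    ["Q613930", "Q20984678", "Q33103898"],
    ["Q613930", "Q2095", "Q13233"],
    ["Q115095765", "Q506325"],
    TIME,
]

REORDERED = [
    "Q4026292",
    ["Q115095765", "Q506325"],
    ["Q179080", "Q213449"],
    TIME,
    ["Q613930", "Q2095", "Q13233"],
    ["Q613930", "Q20984678", "Q33103898"],
]

identical = same_content(ORIGINAL, REORDERED)  # True
```

Swapping the two entries inside a role component, by contrast, would change which item is the role and which is the participant, so this sketch deliberately reports such blocks as different.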
Putting these blocks together requires introducing some machinery, but the representation we arrived at makes this machinery realizable. The following are but three possible examples:
- Two simple sentences can be coordinated (e.g. using ‘and’, ‘or’, ‘but’, and so on) by adding both as arguments to a new list. The item Q13381767 https://www.wikidata.org/wiki/Q13381767 below, for example, represents a simple ‘and’ relationship:
(“Q13381767” (“Q4026292” (“Q179080” “Q213449”) [...]) (“Q4026292” (“Q179080” “Q199657”) [...]) )
- A simple sentence may be subordinated to another (e.g. using ‘because’, ‘when’, ‘while’, and so on) by introducing a modifier wrapping that simple sentence and using that modifier in the other sentence. The item Q12774849 https://www.wikidata.org/wiki/Q12774849 below, for example, represents a simple ‘because’ relationship:
(“Q4026292” (“Q179080” “Q213449”) [...] (“Q12774849” (“Q4026292” (“Q179080” “Q199657”) [...]) ) )
- Arbitrary modifiers could be applied after a simple sentence has been formed by wrapping them around that sentence. The item Q1478451 https://www.wikidata.org/wiki/Q1478451 below, for example, represents simple negation:
(“Q1478451” (“Q4026292” (“Q179080” “Q199657”) [...]) )
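Because the blocks are plain lists, the three combinations above reduce to very small list operations. The following Python sketch shows one possible shape for them; the helper names `coordinate`, `subordinate`, and `wrap` are hypothetical, and the sample sentences contain only a predicate because the remaining components are elided in the text.

```python
AND = "Q13381767"      # simple 'and' relationship, from the text
BECAUSE = "Q12774849"  # simple 'because' relationship, from the text
NOT = "Q1478451"       # simple negation, from the text

# Predicate-only sentence blocks; other components are left out here.
FIRST = ["Q4026292", ["Q179080", "Q213449"]]
SECOND = ["Q4026292", ["Q179080", "Q199657"]]


def coordinate(relation, *sentences):
    """Coordinate sentences by making them arguments of a new list."""
    return [relation, *sentences]


def subordinate(main, relation, other):
    """Attach `other` to `main` as a modifier wrapped in `relation`."""
    return [*main, [relation, other]]


def wrap(modifier, sentence):
    """Apply a modifier to an already formed sentence."""
    return [modifier, sentence]


coordinated = coordinate(AND, FIRST, SECOND)
subordinated = subordinate(FIRST, BECAUSE, SECOND)
negated = wrap(NOT, SECOND)
```

None of the helpers mutates its inputs, so the same sentence block can be reused in several larger structures.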
Much, if not all, of what has been described above has been put into practice at elemwala.toolforge.org (powered by Ninai https://gitlab.com/mahir256/ninai/ and Udiron https://gitlab.com/mahir256/udiron/).
*If you want to propose a guest-written newsletter, please contact Luca https://www.wikifunctions.org/wiki/User_talk:Sannita_(WMF) or Denny https://www.wikifunctions.org/wiki/User_talk:DVrandecic_(WMF).*

Recent Changes in the software
This was a very light week for technical changes, as our focus was on the longer-term Quarterly work, which is still in flight.
On the front-end side, we made some follow-up fixes to the UX components for using Lexemes (T373589 https://phabricator.wikimedia.org/T373589), allowing you to search for single-glyph Lexemes (like '𒂼', which is L1 https://www.wikidata.org/wiki/Lexeme:L1) and tweaking the visual display.
We also improved the request traceability headers we generate when you run a function, consolidating on the OpenTelemetry standard ones as part of wider Wikimedia observability work (T375922 https://phabricator.wikimedia.org/T375922).

Function of the Week: select representation from lexeme
As we wrote last week https://www.wikifunctions.org/wiki/Wikifunctions:Status_updates/2024-10-11, we are introducing Wikidata lexemes and first versions of other Wikidata-based types. The new types are now available, and in order to demonstrate the new types and how they work, we have created a first set of functions:
1. count lexeme forms in lexeme https://www.wikifunctions.org/view/en/Z19232
2. count matching lexeme forms in lexeme https://www.wikifunctions.org/view/en/Z19234
3. select representation from lexeme https://www.wikifunctions.org/view/en/Z19241
4. select matching lexeme forms in lexeme https://www.wikifunctions.org/view/en/Z19243
All of these functions use the new Wikidata lexeme https://www.wikifunctions.org/view/en/Z6005 type for their first argument. When you go to one of these functions, our UI provides a lexeme selector that helps you to pick a lexeme from Wikidata that matches the word that you type. After hitting run, your selected lexeme is retrieved from Wikidata and transformed into our Wikidata lexeme type (by a preparatory call to the new builtin fetch Wikidata lexeme https://www.wikifunctions.org/view/en/Z6825 function) and then passed into the selected function above.
Let’s take a closer look at one of these new functions: select representation from lexeme https://www.wikifunctions.org/view/en/Z19241.
That function also has a second argument, grammatical features, which is a list https://www.wikifunctions.org/view/en/Z881 of Wikidata item references https://www.wikifunctions.org/view/en/Z6091. We don't yet have a UI component for selecting Wikidata items, but that is part of our upcoming work this quarter. In the meantime, you can copy and paste the QID of a grammatical feature from Wikidata. When you specify one or more grammatical features, those are used to select the lexeme form(s) from the lexeme which have those grammatical features.
Let’s take a look at a simple example: we want to obtain the (first) plural form of the English noun "goose" https://www.wikidata.org/wiki/Lexeme:L6424. We type "goose" in the Lexeme selector, and click on the "English, noun" choice. In the second argument, we click on the "+" button and type in Q146786, the QID for plural https://www.wikidata.org/wiki/Q146786. Then we click “Run function” and we should get back the plural form.
That is also the first test https://www.wikifunctions.org/view/en/Z19258 for the function. A second test https://www.wikifunctions.org/view/en/Z19259 checks that the plural https://www.wikidata.org/wiki/Q146786 nominative https://www.wikidata.org/wiki/Q131105 of the Malayalam word ആപ്പിൾ https://www.wikidata.org/wiki/Lexeme:L455955 (with one meaning being apple) is ആപ്പിളുകൾ. This test is to check a different script and a more complex lexeme.
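The selection logic behind the "goose" example can be sketched in plain Python. This is not the JavaScript implementation on Wikifunctions: the sample lexeme is written by hand in the general shape of Wikidata's lexeme JSON (with `forms`, `representations`, and `grammaticalFeatures`), only two forms are included, Q110786 is used as the item for singular, and returning `None` when nothing matches is an assumption of this sketch.

```python
# Hand-written sample data, not fetched from Wikidata.
GOOSE = {
    "id": "L6424",
    "forms": [
        {
            "representations": {"en": {"language": "en", "value": "goose"}},
            "grammaticalFeatures": ["Q110786"],  # singular
        },
        {
            "representations": {"en": {"language": "en", "value": "geese"}},
            "grammaticalFeatures": ["Q146786"],  # plural
        },
    ],
}


def select_representation(lexeme, features):
    """Return a representation of the first form having all given features."""
    wanted = set(features)
    for form in lexeme["forms"]:
        if wanted <= set(form["grammaticalFeatures"]):
            # Take the first representation listed for the matching form.
            for representation in form["representations"].values():
                return representation["value"]
    return None  # assumption: no matching form yields no result


plural = select_representation(GOOSE, ["Q146786"])   # "geese"
missing = select_representation(GOOSE, ["Q131105"])  # None: no nominative form here
```

A check such as `plural == "geese"` mirrors the kind of test described above, and it depends on the sample data in the same way the real tests depend on the state of Wikidata.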
In general, it can be difficult to write tests for some of these functions, as they rely on a certain stability of Wikidata, and when writing tests we should make a thoughtful decision about what exactly we are testing with a given test.
The function currently has one implementation https://www.wikifunctions.org/view/en/Z19242 written in JavaScript. The implementation can be inspected and used as a pattern for other implementations. But this function is implemented entirely in the contributor space (unlike the fetch Wikidata lexeme https://www.wikifunctions.org/view/en/Z6825 function, which has a magical builtin implementation https://www.wikifunctions.org/view/en/Z6925 and certainly does things that contributors cannot do).
Here is another example of how to use these new functions: if you want to examine the lexeme forms from a lexeme, use select matching lexeme forms in lexeme https://www.wikifunctions.org/view/en/Z19243. Type some word into the Lexeme selector and choose one of the options it offers. If you now leave the second argument as the empty list, you will get back all of the Lexeme forms from the selected Lexeme. Then you can browse them in Wikifunctions.
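One plausible reading of why an empty list returns every form is that a form matches when the requested features are a subset of its own, and the empty set is a subset of everything. The short Python sketch below illustrates that reading; it is an assumption about the semantics rather than the actual Wikifunctions implementation, and the function name and hand-written sample data are hypothetical.

```python
# Hand-written sample forms in the general shape of Wikidata's lexeme JSON.
SAMPLE_LEXEME = {
    "forms": [
        {
            "representations": {"en": {"language": "en", "value": "goose"}},
            "grammaticalFeatures": ["Q110786"],  # singular
        },
        {
            "representations": {"en": {"language": "en", "value": "geese"}},
            "grammaticalFeatures": ["Q146786"],  # plural
        },
    ],
}


def select_matching_forms(lexeme, features):
    """Return every form whose features include all of the requested ones."""
    wanted = set(features)
    return [
        form
        for form in lexeme["forms"]
        if wanted <= set(form["grammaticalFeatures"])
    ]


all_forms = select_matching_forms(SAMPLE_LEXEME, [])            # both forms
plural_forms = select_matching_forms(SAMPLE_LEXEME, ["Q146786"])  # one form
matching_count = len(plural_forms)                               # 1
```

Counting the matches, as the "count matching lexeme forms in lexeme" function does, is then just the length of the returned list.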
Note that we currently have a few bugs. If two or more choices are displayed with the exact same word form, the first of them will always be selected, no matter which one you click on. Also, larger Lexemes cause a gateway timeout on loading. And, just as with selecting QIDs, we don’t have a proper display for QIDs yet. If you encounter further issues, please let us know.