Dear all,
Thank you for your contributions to the Wikifunctions project. As an end user of Wikifunctions, I have been invited to speak at WikiArabia about Wikifunctions and Abstract Wikipedia in Arabic. For that purpose, I developed and implemented several linguistic functions for Arabic languages:
* Root and Pattern-Based Generator of Lexemes for Arabic Languages (Z10157)
* Pattern-Root Compatibility Verifier for Arabic Languages (Z10160)
* IPA Generator for Diacritized Arabic Script Texts in Tunisian Arabic (Z10163)
This involved writing Python code for the three functions, developing test functions, and describing the developed functions. While developing the functions, I found several issues that could be addressed in the next few months:
1. When a word carries two Arabic diacritics on the same letter, this can cause problems. For example, كَرَّر has two diacritics (a shaddah and a fatha) on its second letter. The shaddah should come before the fatha in the character sequence, as its effect applies first. Wikifunctions does not currently handle this ordering well, which can harm the processing of languages written in the Arabic script. This should be fixed (a small Unicode-normalization sketch follows this list).
2. The indentation of source code must be fixed by hand after pasting code into the field; there is no automatic indentation for pasted source code. This degrades the user experience.
3. The mobile edition of the website does not work. Lucas Werkmeister has raised a ticket about this (T291325).
4. All these linguistic functions are taken from reference grammar books. It would be useful to have a way to assign a Wikidata item as a reference for a Wikifunctions function.
5. The runtime performance of the website matters significantly. Further effort should go into making the project faster.
6. It would be interesting to align inputs with their corresponding Wikidata items to give the functions better semantics.
7. System messages are not entirely user-friendly. This could be improved.
8. The login token for NotWikiLambda does not allow a long session; it disconnects roughly every fifteen minutes.
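A minimal Python sketch of the diacritic-ordering issue in point 1, assuming the problem is the order of combining marks in the underlying string. The example letter and the use of Unicode normalization are illustrative only, not a description of how Wikifunctions currently handles Arabic text:

```python
import unicodedata

BASE = "\u0631"      # ر (reh)
SHADDA = "\u0651"    # combining shadda
FATHA = "\u064E"     # combining fatha

# The same visual result can be typed with the diacritics in either order:
shadda_first = BASE + SHADDA + FATHA
fatha_first = BASE + FATHA + SHADDA

# As raw code-point sequences the two strings differ ...
print(shadda_first == fatha_first)   # False

# ... but Unicode canonical normalization puts combining marks into a
# single canonical order, so normalized comparison treats them as equal.
nfc_a = unicodedata.normalize("NFC", shadda_first)
nfc_b = unicodedata.normalize("NFC", fatha_first)
print(nfc_a == nfc_b)                # True
```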
Yours Sincerely,
Houcemeddine Turki
Hello. I have recently been thinking about objectivity and subjectivity with respect to natural language generation, in particular in the context of story generation using historical data [1][2].
In the near future, digital humanities scholars – in particular historians – could modify collections of data and finetune generation-related parameters, watching as the resultant multimodal historical narratives emerge and vary. In this regard, we can envision both computer-aided and automated historical narrative generation tools and technologies.
Could AI be a long-sought objective narrator for historians? Is all narration, or all language use, inherently subjective? What might the nature of “generation-related parameters” and “finetuning” be for style and subjectivity [3][4][5][6][7][8] when generating natural language and multimodal historical narratives from historical data [1][2]?
Thank you. Hopefully, these topics are interesting.
Best regards,
Adam Sobieski
[1] Metilli, Daniele, Valentina Bartalesi, and Carlo Meghini. "A Wikidata-based tool for building and visualising narratives." International Journal on Digital Libraries 20, no. 4 (2019): 417-432.
[2] Metilli, Daniele, Valentina Bartalesi, Carlo Meghini, and Nicola Aloia. "Populating narratives using Wikidata events: An initial experiment." In Italian Research Conference on Digital Libraries, pp. 159-166. Springer, Cham, 2019.
[3] https://en.wikipedia.org/wiki/Subjectivity
[4] https://en.wikipedia.org/wiki/Objectivity_(philosophy)
[5] https://en.wikipedia.org/wiki/Political_subjectivity
[6] https://en.wikipedia.org/wiki/Framing_(social_sciences)
[7] https://en.wikipedia.org/wiki/Focalisation
[8] https://en.wikipedia.org/wiki/Point_of_view_(philosophy)
The on-wiki version of this newsletter is available here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-30
--
This week is the last week Lindsay Wardell will be working on Wikifunctions
as a contractor from ThisDot <https://www.thisdot.co/>. Her contributions
to the code, particularly to the front end, can be seen everywhere. The
discussions with her and insights we learned from her about Wikifunctions
and how composition should work, about errors, about functional
programming, and about the details of the functional model will have a
lasting impact on the project. She very quickly developed a deep
understanding and intuition of what the project could achieve, and was able
to channel that understanding into creative solutions. It was a pleasure
working with her. The whole team is sad to see Lindsay go.
We are very thankful to Lindsay for her contribution, and we congratulate
her on her new role. Here are her own words.
“When I started working on this project back in March, I was fascinated
with the goal and the ambition behind it. Providing a way for Wikipedia
articles to be provided in any number of languages is exciting in its own
right, but also providing a platform for people to interact with data and
create their own functions spoke to me personally. I have enjoyed working
so much on the Wikifunctions platform, and building the experience for
users to create and utilize their own functions.
I have loved working with this team (and the Wikimedia Foundation in
general). From day one, I was accepted as a member of the group, despite my
official role as a consultant. The feeling of being welcome was so
wonderful to feel. I have so much respect for each and every member of the
Foundation that I got to work with, and I am very grateful that I got to
interact with them on such an exciting project.
It was always a dream to get to work with the Wikimedia Foundation, and my
experience was truly amazing. Once the dust settles around me, I fully
intend on being a part of the community that is forming around Abstract
Wikipedia and Wikifunctions. I look forward to participating as a community
member and contributor to the project.”
You can follow Lindsay on Twitter <https://twitter.com/lindsaykwardell> or
listen to the Views on Vue <https://viewsonvue.com/> podcast she is a host
on. Again, congratulations on your new role; we know how excited you are
about it, and we all wish you the best!
The on-wiki version of this newsletter is available here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-24
--
<https://meta.wikimedia.org/wiki/File:Research_-_Wikifunctions_mental_model.…>
Wikifunctions mental model
In order to better understand the potential contributors and users of
Wikifunctions, we had Jeff Howard
<https://commons.wikimedia.org/wiki/User:JDH264> conduct user research on
the potential contributor base. The
results of this work are now available in the form of two reports on
Wikimedia Commons:
*Wikifunctions mental models
<https://commons.wikimedia.org/wiki/File:Research_-_Wikifunctions_mental_mod…>.*
18
participants were interviewed in order to understand the current mental
models about Wikifunctions, and to uncover potential problems. Below are a
few of the problems found during this work. Please find more details in the
full report.
- The goals of the project were confusing: what does Wikifunctions aim
to achieve?
- The mock-ups were confusing, and they didn’t explain how the project
would work, or what one would do on the site.
- It was unclear whether non-programmers could contribute to or benefit
from the project.
The latter exposes a particular challenge for the project. I’ll come to it
later, but one main goal of the project is to be accessible for people who
do not currently see themselves as programmers. In fact, we think that
people who are currently non-programmers may benefit from Wikifunctions
most!
<https://meta.wikimedia.org/wiki/File:Publish_-_Wikifunctions_feedback.pdf>
Wikifunctions developer feedback
*Wikifunctions feedback
<https://commons.wikimedia.org/wiki/File:Publish_-_Wikifunctions_feedback.pdf>.*
10
developers were interviewed to learn what they think of the Wikifunctions
idea. There are many interesting ideas and discussions in
this report:
- Discussions of GitHub vs MediaWiki versioning.
- How well will the UI support more complex implementations?
- How to curate many different implementations for a function?
The developers correctly identified that Wikifunctions is not about whole
programs, but about individual functions that can then be used like a
toolbox for many purposes. The discussions and questions also laid bare the
expectations developers might have for Wikifunctions, and where we need to
make our communication clearer in order to not disappoint potential
contributors.
Both reports indicate some of the challenges Wikifunctions will face. We
are taking the reports seriously and are using them as input for our UX
design. Even given these results, our goal remains to make writing
functions and implementations in Wikifunctions accessible to novice coders,
and Wikifunctions usable and understandable by people who are not already
coders.
<https://meta.wikimedia.org/wiki/File:Wikilambda_-_Early_Eta_-_Create_a_new_…>
Creating a new function in the Early Eta Wikilambda prototype
We are currently designing the function editor to be more approachable,
intuitive, and mobile-friendly. The video here gives you a first view of
how to define and edit a function
<https://commons.wikimedia.org/wiki/File:Wikilambda_-_Early_Eta_-_Create_a_n…>.
It involves a number of simple steps, while providing guidance throughout
the process. An automatically generated diagram of the function is shown on
the right.
Adding testers and implementations can also be done directly from within
the function editor.
The implementation of this interface should land in the prototype soon, so
you will be able to test it 'live'. We hope that this makes function
creation and editing in Wikifunctions considerably easier, more
understandable, and more enjoyable than the initial, placeholder experience.
Enjoy reading the reports and watching the video!
The on-wiki version of this newsletter is available here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-17
--
Last week we discussed how to implement paradigms
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-10> in
Wikifunctions. This week, let’s discuss a few ideas on how this could be
used.
One may ask why this is useful, given that we are collecting all the
different forms in the lexicographic data in Wikidata anyway. We don’t need
to generate the forms if we have a full set of forms in Wikidata, surely?
There are several possible use cases:
First, we will probably never achieve full coverage in Wikidata of all
forms in all languages. In some languages, the number of forms may be
prohibitively high, and we, like every other dictionary, might need to make
a selection of forms to store. Often the forms not stored are highly
regular.
Second, even if we have really good coverage, occasionally you will need to
introduce words that are not in the dictionary: when displaying neologisms,
when generating a new lexeme by conversion from another grammatical
category (for example: verbing nouns in English, or using place names to
make demonyms), or when using loanwords from other languages. Fortunately,
such words are often regular, and having smart paradigms as described last
time can take us pretty far.
Third, the paradigms can be used in Wikidata to connect to the actual
lexemes. For example, on a lexeme such as *"cat
<https://www.wikidata.org/wiki/Lexeme:L7>"* we could link to the paradigm
that we developed last week, either the add s
<https://notwikilambda.toolforge.org/wiki/Z10110> function or the English
regular plural <https://notwikilambda.toolforge.org/wiki/Z10132> function.
Linking the lexeme with the function allows individual forms to be
re-generated, which in turn means they can be checked for correctness, thus
ensuring data quality. The English regular plural function can tell us that
the plural for *"pasty <https://www.wikidata.org/wiki/Lexeme:L24858>"* should
be *"pasties"*, but that Wikidata lexeme previously defined it as *"pastiest
<https://www.wikidata.org/w/index.php?title=Lexeme:L24858&oldid=1033449948#F2>"*.
The plural of *"strawman"* should be *"strawmen"*, not *"strawmans
<https://www.wikidata.org/w/index.php?title=Lexeme:L227827&oldid=1069374578>"*;
the plural for *"Frenchwoman"* should be *"Frenchwomen"* not *"Frenchwoman
<https://www.wikidata.org/w/index.php?title=Lexeme:L34524&oldid=1392427375>"*
.
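As a sketch of how such a check could look: the `english_regular_plural` helper and the hard-coded example data below stand in for the Wikifunctions function and the Wikidata lexemes (this is not the actual Z10132 implementation, just an illustration of the idea):

```python
def english_regular_plural(lemma: str) -> str:
    """Very rough stand-in for the 'English regular plural' function."""
    if lemma.endswith(("s", "sh", "ch", "x", "z")):
        return lemma + "es"
    if lemma.endswith("y") and lemma[-2:-1] not in "aeiou":
        return lemma[:-1] + "ies"
    return lemma + "s"

# Hypothetical stored forms, in the spirit of the "pasty"/"pastiest" example.
stored_lexemes = {
    "pasty": {"singular": "pasty", "plural": "pastiest"},  # suspicious
    "cat": {"singular": "cat", "plural": "cats"},          # fine
}

# Flag stored plurals that disagree with the regular paradigm.
for lemma, forms in stored_lexemes.items():
    expected = english_regular_plural(lemma)
    if forms["plural"] != expected:
        print(f"{lemma}: stored plural {forms['plural']!r}, "
              f"paradigm suggests {expected!r} - please check")
```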
One question is: if we have a paradigm that can create the forms, why even
create and store the forms in Wikidata in the first place? That’s a great
question, and a decision that can indeed be revisited by the community.
Personally, I think we need both forms stored explicitly in Wikidata and
generative paradigms. Without the former, it's not clear how we would
handle irregular forms — would the onus lie on the paradigms? That seems
messy. Likewise, paradigms are crucial when, for example, a Lexeme has
thousands of possible forms. If these forms are always regular, the
community might decide not to materialize them all — especially if many
Lexemes cleave to the same regular morphological pattern.
This also seems to be the case for English nouns: almost all of the English
nouns in Wikidata have two forms, even though one could argue that English
nouns have four forms (including the possessive forms); however, the English
possessive <https://en.wikipedia.org/wiki/English_possessive> forms seem to
be generated so regularly that, so far, Wikidata contributors seem to
consider them unnecessary and usually omit them.
Fourth, the paradigms can also be used to propose a starting point when
entering the data. Imagine the Wikidata Lexeme Forms
<https://www.wikidata.org/wiki/Wikidata:Wikidata_Lexeme_Forms> allowing you
to select a function on Wikifunctions that, given the lemma, generates all
likely forms for an entry. The Lexeme Forms tool has already improved the
creation of Lexemes considerably, making the entries much more consistent
and expansive. If, in addition, we could also automatically generate most
of the forms, this would speed up data entry considerably, and at the same
time reduce the likelihood of data entry errors.
Besides all these immediate improvements, there might be many further
advantages. For example, storing an offline dictionary would require much
less storage space if we use paradigms. Developing paradigms for currently
under-resourced languages might create aids for working with those
languages. Having a knowledge base of paradigms across languages may be
interesting from the perspective of linguistic research.
Once Wikifunctions has launched, we hope that the community will develop a
library of morphological paradigms and their connection with the
lexicographical data in Wikidata. Besides this being a very helpful step on
our path to Abstract Wikipedia, we think that this will considerably expand
the content of the lexicographical data in Wikidata. That, together with
enabling access to the lexicographic data from within the Wiktionaries
<https://phabricator.wikimedia.org/T235901>, will significantly empower
contributors to Wiktionary, particularly the smaller Wiktionaries and the
languages with fewer contributors across all Wiktionaries.
Thanks to User:YULdigitalpreservation
<https://www.wikidata.org/wiki/User:YULdigitalpreservation>, who
created EntitySchema
E327 <https://www.wikidata.org/wiki/EntitySchema:E327> on Wikidata for
English Nouns with Genitives, and to User:VIGNERON
<https://meta.wikimedia.org/wiki/User:VIGNERON> for creating French plural
morphology on NotWikiLambda, and User:Strobilomyces
<https://en.wikipedia.org/wiki/User:Strobilomyces> for collaborating on
that.
The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-08-27
--
When we started the development effort towards the Wikifunctions site, we
sub-divided the work leading up to the launch of Wikifunctions into eleven
phases <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases>, named
after the first eleven letters of the Greek alphabet.
- With Phase α (alpha) completed, it became possible to create instances
of the system-provided Types in the wiki.
- With Phase β (beta), it became possible to create new Types on-wiki
and to create instances of these Types.
- With Phase γ (gamma), all the main Types of the pre-generic function
model were available.
- With Phase δ (delta), it became possible to evaluate built-in
implementations.
- With Phase ε (epsilon), it became possible to evaluate
contributor-written implementations in any of our supported programming
languages.
- This week, we completed Phase ζ (zeta).
The goal of Phase ζ has been to provide the capability to evaluate
implementations composed of other functions.
What does this mean? Every Function in Wikifunctions can have several
Implementations. There are three different ways to express an
Implementation:
1. As a built-in Function, written in the code of Wikilambda: this means
that the Implementation is handled by the evaluator natively using code
written by the team.
2. As code in a programming language, created by the contributors of
Wikifunctions: the Implementation of a Function can be given in any
programming language that Wikifunctions supports. Eventually we aim to
support a large number of programming languages; for now we support
JavaScript and Python.
3. As a composition of other Functions: this means that contributors can
use existing Functions as building blocks in order to implement new
capabilities.
With Phase ζ we close the trilogy of Phases dealing with the different ways
to create Implementations.
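To illustrate the difference between the second and third kinds of Implementation, here is a minimal Python sketch. The function names are invented for this illustration; in Wikifunctions a composition is expressed in the function model itself rather than in Python, so treat this only as an analogy:

```python
# Implementation as code: the logic is written directly in a programming language.
def uppercase_first_code(text: str) -> str:
    if not text:
        return text
    return text[0].upper() + text[1:]

# Hypothetical existing building blocks, standing in for on-wiki Functions.
def head(text: str) -> str:          # first character of a string
    return text[:1]

def tail(text: str) -> str:          # everything after the first character
    return text[1:]

def to_uppercase(text: str) -> str:
    return text.upper()

def concatenate(a: str, b: str) -> str:
    return a + b

# Implementation as a composition: no new logic, only existing functions
# plugged into each other.
def uppercase_first_composed(text: str) -> str:
    return concatenate(to_uppercase(head(text)), tail(text))

print(uppercase_first_code("wiki"))      # Wiki
print(uppercase_first_composed("wiki"))  # Wiki
```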
Besides making composition work, we also spent some time on other areas.
We worked to reduce the technical debt that we accumulated during the last
two phases, which we had rushed in order to be ready for the security and
performance reviews. We improved how the error system works,
re-worked the data model for Testers and Errors, refactored the common
library to be more extensible, moved the content of the wiki to the main
namespace, and changed Python function definitions to align with the style
we use for JavaScript ones.
We started with some work to make the current bare-bones user experience
better. This included displaying Testers' results and meta-data on their
own page as well as related Function and Implementation pages. Functions
and Implementations can be easily called right from their page. We made it
much easier to create and connect Implementations and Testers with their
functions, started on the designs for Function definition and
implementation, and implemented aliases that sit alongside labels, much
like in Wikidata. Plenty done!
We are now moving on to Phase η (eta). The three main goals of Phase η are
to finish the re-work of the Error system, to revisit user-defined types
and integrate them better with validators, and to allow for generic types.
What are generic types?
We have a type for a list of elements. But instead of saying “this is a
list of elements”, we can often be more specific, and for example say “this
is a list of strings”. Why is that useful? Because now, if, for example, we
have a function to get the first element of a list, we know that this
function will return a string when given this kind of list. This allows us
to then offer a better user experience by making more specific suggestions,
because now the system knows that it can suggest functions that work with
strings. We can also check whether an implementation makes sense by
ensuring that the types fit. We won’t be able to do that in all cases, but
having generics will allow us to increase the number of cases where we can
do that by a lot. For more background you can refer to the Wikipedia
article on generic programming
<https://en.wikipedia.org/wiki/Generic_programming>.
In this example case, instead of a special type representing a list of
strings, we will have a function that takes a type and returns a typed
list. If you then call this function with the string type as the argument,
the result of the function will be the concept of a list of strings. And
you can easily use that for any other type, including user-defined types.
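As a rough analogy in Python (not the actual Wikifunctions data model), here is a sketch of a function that takes a type and returns a "list of that type". The `typed_list` name and its validation behaviour are invented for illustration:

```python
def typed_list(item_type):
    """Return a list type whose elements must all be of item_type.

    Mimics the idea of a function that takes a type and returns a typed
    list; it is not how Wikifunctions represents generic types.
    """
    class TypedList(list):
        def append(self, item):
            if not isinstance(item, item_type):
                raise TypeError(
                    f"expected {item_type.__name__}, got {type(item).__name__}")
            super().append(item)

    TypedList.__name__ = f"ListOf{item_type.__name__}"
    return TypedList

# "List of strings" is the result of applying the function to the string type.
ListOfStr = typed_list(str)

words = ListOfStr()
words.append("book")
# words.append(42)  # would raise TypeError: expected str, got int

# Knowing the element type tells us what "first element" will return.
def first_element(lst):
    return lst[0]

print(first_element(words))  # "book" - a string, as the typed list promises
```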
My thanks to the team! My thanks to the volunteers! Some of us are starting
to have fun using the prototype, playing with implementations across
different programming languages interacting with each other in non-trivial
ways, and starting to build a small basic library of functions. This will
also be the phase where we move from the pre-generic data model
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Pre-generic_function_mod…>
to
the full function model
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Function_model>. To
give due warning: this probably means that almost everything will need to
be re-written by the end of this phase, in order to take advantage of the
generic system that we are introducing.
Thank you for accompanying us on our journey!
The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-10
--
Among the early functions we want to start building in Wikifunctions are
functions that perform regular morphological transformations on words: that
is, functions that, given the base form of a word, can create its regular
inflected forms. Or, to give an example: functions that can tell me that
the plural of *“book”* in English is
*“books”*.
English is a comparatively simple example, but that should make it easier to
sketch out the proposal in this newsletter. In many other cases, the
morphological functions and the grammar are likely to be more complicated.
The most regular way to create a plural from an English noun’s base form is
to add the letter *“s”* to it. Let’s now see how many of Wikidata’s entries
would be covered by this simple rule.
Wikidata currently has about 28,100 <https://w.wiki/43N6> English nouns.
Whereas Wikidata allows for a lot of flexibility when entering
lexicographical entries, Wikifunctions will require the data to have a more
predictable shape in order to use it effectively. One way to express these
shapes is through lexical masks <https://github.com/google/lexical-masks/>.
English nouns have two different lexical masks
<https://github.com/google/lexical-masks/blob/master/masks/en.json>: one
with only two forms (a singular and a plural, e.g. *“book”* and *“books”*)
and one with four forms (including two genitive forms, i.e. *“book’s”* and
*“books’”*). Both of these masks have been automatically translated
<https://github.com/google/lexical-masks/blob/master/shex/en.shex> into ShEx
<https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas>, the language
used by Wikidata for checking data completeness. But only the
two-form version has been turned into an Entity Schema in Wikidata
<https://www.wikidata.org/wiki/EntitySchema:E155>.
Now we can take the 28,000 English nouns in Wikidata and check how many of
them fulfill the requirements described above (let me know if there is
interest in the code). It turns out that more than 25,500, that is more
than 91% of the nouns, fulfill the requirement. And all of them fulfill the
two-form schema. Four nouns (*contract
<https://www.wikidata.org/wiki/Lexeme:L5605>*, *player*
<https://www.wikidata.org/wiki/Lexeme:L5607>, *swimmer*
<https://www.wikidata.org/wiki/Lexeme:L7384>, and *sport
<https://www.wikidata.org/wiki/Lexeme:L301>*) almost fulfill the four-form
schema, but on each of them the cases on the nominative forms are missing.
<https://meta.wikimedia.org/wiki/File:Book_to_books_in_NotWikiLambda.png>
Evaluating "Add s" on "book" in NotWikiLambda
So let’s focus on the 25,500 nouns that pass the structural requirements.
We created a function in NotWikiLambda that adds the letter *“s”* to the end
of the word. When we count how many of the plurals are generated this way,
we see that the plurals of 21,000 English nouns, 82% of all nouns, are
formed correctly by simply adding *“s”*. Adding *“s”* is one paradigm, and,
as we can see, the most common one for English nouns.
On the right-hand side of the Function's page you can see a heading
“Evaluate Function,” and there you can enter a value, say *“book”*. If you
click on “Call Function” below, the result *“books”* should come back.
(Note that WikiLambda <https://www.mediawiki.org/wiki/Extension:WikiLambda> is
in heavy development, and the test site
<https://notwikilambda.toolforge.org/> might have hiccups at any time. A
screenshot of the evaluation working correctly is shown here.)
Another paradigm works for many English nouns that end with the letter *“y”*.
There are many cases where we replace the final letter *“y”* with the ending
*“ies”*, e.g. when turning *“baby”* into *“babies”*, or *“fairy”* into
*“fairies”*. We created a function that replaces *“y”* at the end with *“ies”*
<https://notwikilambda.toolforge.org/wiki/Z10129> in NotWikiLambda. When we
run this paradigm against the nouns in Wikidata, more than 2,000 nouns
(almost 8%) get covered by this function.
<https://meta.wikimedia.org/wiki/File:Baby_to_babies_in_NotWikiLambda.png>
Evaluating "Replace y with ies at end" in NotWikiLambda
We could create further paradigms (e.g. add *“es”*, which would cover more
than 1,800 nouns), and we could even write a single function which tries to
discern which of these functions to apply (e.g. if it ends with *“s”* or
*“sh”*, add *“es”*; if it ends with a *“y”* preceded by a consonant,
replace that *“y”* with an *“ies”*; else simply add an *“s”*, etc.), which
would give us a more powerful function that can deal with many more words
(a bit of experimentation got me to a function
<https://notwikilambda.toolforge.org/wiki/Z10132> that covers 98.3% of all
cases).
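For readers who want to experiment, here is a minimal Python sketch of such a combined paradigm, following the rules sketched in the parentheses above. It is not the Z10132 implementation itself, and the rules and example words are illustrative only:

```python
def english_plural(noun: str) -> str:
    """Guess the regular English plural of a noun using a few simple rules."""
    vowels = "aeiou"
    # sibilant endings take "es": bus -> buses, dish -> dishes, box -> boxes
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    # consonant + "y" becomes "ies": baby -> babies (but day -> days)
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in vowels:
        return noun[:-1] + "ies"
    # default paradigm: just add "s"
    return noun + "s"

for word in ["book", "baby", "day", "dish", "box"]:
    print(word, "->", english_plural(word))
# book -> books, baby -> babies, day -> days, dish -> dishes, box -> boxes
```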
Grammatical Framework has introduced these functions as so-called smart
paradigms <https://aclanthology.org/E12-1066.pdf>. Their web-based
implementation of smart paradigms
<https://cloud.grammaticalframework.org/gfmorpho/> for English nouns covers
96% of the nouns in Wikidata. I would be very curious to see how either of
these numbers compares to modern, machine-learning based solutions, and I
also want to invite people to create an even smarter paradigm with better
coverage without the code becoming too complex.
Smart paradigms are useful when data in Wikidata is incomplete. For example
for loan words, technical terms, neologisms, names, or when verbing nouns
<https://www.gocomics.com/calvinandhobbes/1993/01/25> (so-called conversion
<https://en.wikipedia.org/wiki/Conversion_(word_formation)#Verb_conversion_i…>),
we might need to create a form automatically that Wikidata doesn’t yet
explicitly know about.
As this week’s entry is already getting quite long, we will defer to next
time the discussion of some of the possibilities of how those paradigms
implemented in Wikifunctions might interplay with the lexicographic data in
Wikidata. This will also shed more light on the role that the morphological
paradigms might play for Abstract Wikipedia in the future.
----
In other news:
This week, Abstract Wikipedia was covered on the US public radio news
programme The World
<https://en.wikipedia.org/wiki/The_World_(radio_program)>. Host Marco
Werman interviewed Denny
<https://www.pri.org/file/2021-09-07/wikipedia-s-efforts-get-its-300-languag…>
in a five-minute segment that was broadcast on numerous public radio
stations. The segment is now also available online.
The German public TV station 3sat <https://en.wikipedia.org/wiki/3sat>
broadcast
a documentary about Wikipedia this week: “Wikipedia - Die Schwarmoffensive”
<https://www.3sat.de/film/dokumentarfilm/wikipedia--die-schwarmoffensive-100…>.
The German-language documentary can be viewed online from Germany,
Switzerland, and Austria. It also discusses Abstract Wikipedia for a few
minutes towards the end.
It is very hard to build a large, or even medium-sized, corpus of sentences
in which each word is manually annotated with its sense.
Abstract Wikipedia not only allows generating text in many languages from
one source, but could also serve as a word-sense-disambiguation (WSD)
corpus, and moreover in many languages.
This would support understanding natural-language text and operations such as:
1) translation from any natural language into a disambiguated form
2) translation from this form into another natural language
and after step 1, this form would be very useful not only for translation.
I was interested in the Abstract Wikipedia project a year ago; I am not up
to date on the topic now.
At the Arctic Knot conference, will the project be looked at as a database
of disambiguated knowledge?
The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-03
Due to the embedded table it might be easier to read on-wiki.
--
The update this week has been written by Mahir Morshed
<https://meta.wikimedia.org/wiki/User:Mahir256>. Mahir is a long-time
contributor to Wikidata, particularly to its lexicographical data. He has
developed a prototype that generates natural language in Bengali and Swedish
from an abstract content representation, with the goal that it could
eventually be implemented within Wikifunctions. In this newsletter, Mahir
describes the prototype.
------------------------------
Discussion around Abstract Wikipedia's natural language generation
capabilities has revolved around the presence of abstract constructors and
concrete renderers per language, while also noting the use of Wikidata
items and lexemes as a basis for mapping concepts to language. In the
interest of making this connection a bit clearer to imagine, I have started
to build a text generation system. This uses items, lexemes, and wrappers
for them as building blocks, and these blocks are then assembled into
syntactic trees, based in part on the Universal Dependencies
<https://universaldependencies.org/> syntactic annotation scheme.
(If this seems like a different approach from what was discussed in a
newsletter two months prior
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-06-24>,
that's because it is. Feel free to drop me a message if you'd like to
discuss it.)
The system is composed of three parts, where the last is likely to be
something we could skip in a port to Wikifunctions:
- Ninai <https://bitbucket.org/mmorshe2/ninai/> (from the Classical
Tamil for "to think") holds all constructors, logic at a sufficiently high
level for renderers, and a resolution system from items (each wrapped in a
"Concept" object) to sense IDs for a given language. Decisions and actions
in Ninai are meant to be agnostic to the methods for text formation
underneath, which are supplied by...
- Udiron <https://bitbucket.org/mmorshe2/udiron/> (from the Bengali
pronunciation of the Sanskrit for "communicating, saying, speaking"), which
holds lower-level text manipulation functions for specific languages. These
functions operate on syntactic trees of lexemes (each lexeme wrapped in a
"Clause" object). These lexemes are imported via...
- tfsl <https://phabricator.wikimedia.org/source/tool-twofivesixlex/> (from
"twofivesixlex"), a lexeme manipulation tool, which is intended to be akin
to pywikibot but with a specific focus on the handling of Wikibase objects.
Both of the above components depend on this one, although if 'native' item
and lexeme access and manipulation becomes possible with Wikifunctions
built-ins then tfsl could possibly be omitted.
Some design choices in this system worth noting are as follows:
- Constructors, while being language-agnostic and falling within some
portion of a class hierarchy, are purely containers for their arguments,
carrying no other logic within. This means, for example, that an instance
of a constructor Existence(subject), to indicate that the subject in
question exists, only holds that subject within that instance, and does
nothing else until a renderer encounters that constructor.
- Every constructor allows, in addition to any required inputs, a list
of extra modifiers in any order (the 'scope' of the idea represented by
that constructor). This means, for example, that a constructor
Benefaction(benefactor,
beneficiary) might be invoked with extra arguments for the time, place,
mode, and other specifiers after the beneficiary.
- When one 'renders' a composition of constructors, a Clause object
(representing the root of a syntactic tree) is returned; turning it into a
string of text is done with Python's str() built-in applied to that
object.
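To make the first and third design choices concrete, here is a self-contained toy re-creation in Python. It is not the actual Ninai/Udiron code: the class and function bodies are invented, and the "renderer" glues strings together where the real system builds syntactic trees of lexemes. Only the general shape (constructors as pure containers, a Clause turned into text via str()) follows the description above:

```python
# Toy re-creation of the design, not the actual Ninai/Udiron code.
class Concept:
    def __init__(self, qid, *modifiers):
        self.qid = qid
        self.modifiers = list(modifiers)   # the constructor's 'scope'

class Identification:
    # Purely a container for its arguments: no rendering logic lives here.
    def __init__(self, subject, complement, *modifiers):
        self.subject = subject
        self.complement = complement
        self.modifiers = list(modifiers)

class Clause:
    # Stand-in for the syntactic-tree root returned by a renderer.
    def __init__(self, text):
        self.text = text
    def __str__(self):
        return self.text

def render_english(constructor):
    # Toy renderer: real renderers walk the constructor and assemble a tree
    # of lexemes instead of concatenating strings.
    labels = {319: "Jupiter", 634: "a planet"}
    if isinstance(constructor, Identification):
        return Clause(f"{labels[constructor.subject.qid]} is "
                      f"{labels[constructor.complement.qid]}.")
    raise NotImplementedError

content = Identification(Concept(319), Concept(634))
print(str(render_english(content)))   # Jupiter is a planet.
```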
At the moment, there are just enough constructors to represent Sentence 1.1
from the Jupiter examples
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Examples/Jupiter>, as
well as renderers in Bengali and Swedish for those constructors (thanks to
Bodhisattwa <https://meta.wikimedia.org/wiki/User:Bodhisattwa>, Jan
<https://meta.wikimedia.org/wiki/User:Ainali>, and Dennis
<https://meta.wikimedia.org/wiki/User:So9q> for feedback on those).
Building up to the Jupiter sentence should demonstrate how these work:
Building up to the Jupiter sentence step by step. Each step below gives the
constructor text, the Bengali output, the Swedish output, an English gloss
*(not renderer output!)*, and notes.

Step 1
Constructor:
  Identification(
      Concept(Q(319)),
      Concept(Q(634)))
Bengali: বৃহস্পতি গ্রহ।
Swedish: Jupiter är klot.
Gloss: Jupiter is planet.
Notes: We start by simply *identifying* the two *concepts* of Jupiter (Q319)
<https://www.wikidata.org/wiki/Q319> and planet (Q634)
<https://www.wikidata.org/wiki/Q634> as being equal.

Step 2
Constructor:
  Identification(
      Concept(Q(319)),
      Instance(
          Concept(Q(634))))
Bengali: বৃহস্পতি একটা গ্রহ।
Swedish: Jupiter är ett klot.
Gloss: Jupiter is a planet.
Notes: Instead of equating the concepts alone, we might instead equate
"Jupiter" with an *instance* of "planet".

Step 3
Constructor:
  Identification(
      Concept(Q(319)),
      Instance(
          Concept(Q(634)),
          Definite()))
Bengali: বৃহস্পতি গ্রহটি।
Swedish: Jupiter är klotet.
Gloss: Jupiter is the planet.
Notes: We may further refine that by making clear that "Jupiter" is a
*definite* instance of "planet".

Step 4
Constructor:
  Identification(
      Concept(Q(319)),
      Instance(
          Attribution(
              Concept(Q(634)),
              Concept(Q(59863338))),
          Definite()))
Bengali: বৃহস্পতি বড় গ্রহটা।
Swedish: Jupiter är det stora klotet.
Gloss: Jupiter is the large planet.
Notes: Now we might ascribe an *attribute* to the definite planet instance
in question, this attribute being large (Q59863338)
<https://www.wikidata.org/wiki/Q59863338>.

Step 5
Constructor:
  Identification(
      Concept(Q(319)),
      Instance(
          Attribution(
              Concept(Q(634)),
              Superlative(
                  Concept(Q(59863338)))),
          Definite()))
Bengali: বৃহস্পতি সবচেয়ে বড় গ্রহটি।
Swedish: Jupiter är det största klotet.
Gloss: Jupiter is the largest planet.
Notes: This attribute being *superlative* for Jupiter can be marked by
modifying the attribute.

Step 6
Constructor:
  Identification(
      Concept(Q(319)),
      Instance(
          Attribution(
              Concept(Q(634)),
              Superlative(
                  Concept(Q(59863338)),
                  Locative(
                      Concept(Q(544))))),
          Definite()))
Bengali: বৃহস্পতি সৌরমণ্ডলে সবচেয়ে বড় গ্রহ।
Swedish: Jupiter är den största planeten i solsystemet.
Gloss: Jupiter is the largest planet in the solar system.
Notes: Once we specify the *location* where Jupiter being the largest
applies (that is, in the Solar System (Q544)
<https://www.wikidata.org/wiki/Q544>), we're done!
Note that the sense resolution system does not have enough information to
choose which of '-টা' or '-টি' (for Bengali) or of 'klot' or 'planet' (for
Swedish) to use in some of these examples, so currently in the prototype
one is chosen at random. This therefore means that re-rendering any
examples which pull those in might use something different.
Besides this, there is clearly a lot more functionality to be added, and
because Bengali and Swedish are both Indo-European languages (however
distant), there are likely linguistic phenomena that won't be considered
simply by developing renderers for those two languages alone. If there's
something particular in your language that isn't present in those two
languages, this may then raise the question: what can you do for your
language?
I can think of at least four things, not in any particular order:
- Create lexemes and add senses to them! What matters most to the system
is that words have meanings (possibly in some context, and possibly with
equivalents in other languages or to Wikidata items) so that those words
can be properly retrieved based on those equivalences; that these words
might have a second-person plural negative past conditional form is largely
secondary!
- Think about how you might perform some basic grammatical tasks in your
language: how do you inflect adjectives? add objects to verbs? indicate in
a sentence where something happened?
- Think about how you might perform higher-level tasks involving
meaning: what do you do to indicate that something exists? to indicate that
something happened in the past but is no longer the case? to change a
simple declarative sentence into a question?
- If you have some ideas on how to render the Jupiter sentence in your
language, and the lexemes you would need to build that sentence exist on
Wikidata, and those lexemes have senses for the meanings those lexemes take
in that sentence, let me know!
We'd love to hear your thoughts on this prototype, and what it might mean
for realizing Abstract Wikipedia through Wikidata's lexicographic data and
Wikifunctions's platform.
------------------------------
Thank you Mahir for the great update! If you too want to contribute to the
weekly, get in touch. This is a project we all build together.
In addition, this week Slate published a great article explaining the goals
of Abstract Wikipedia and Wikifunctions: Wikipedia Is Trying to
Transcend the Limits of Human Language
<https://slate.com/technology/2021/09/wikipedia-human-language-wikifunctions…>