Abstract-Wikipedia February 2021

abstract-wikipedia@lists.wikimedia.org

7 participants
5 discussions

Newsletter #20: Logo concept voting will start on Monday

by Denny Vrandečić

The on-wiki version of this update can be found here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-25

The logo concept <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Wikifunctions_logo_conce…> submission phase has come to a close, and we received 46 submissions and variants <https://meta.wikimedia.org/wiki/Talk:Abstract_Wikipedia/Wikifunctions_logo_…>. I am deeply impressed with the submissions, and am looking forward to the vote.

Some of the submissions have a number of variants, and in order to avoid the decoy effect <https://en.wikipedia.org/wiki/Decoy_effect>, we would like to remove concepts that are too similar to each other. For Wikidata, we made the decision inside the team about which variants to choose, and we also dropped a number of the proposals. Here, I would like us to make the decision together, but in order to ensure that things move on, we’ll take a look at the status of the discussion on Monday and then the team will finalize the exact candidates.

We also plan to have short notes on each submission (3-4 sentences), so we are asking the submitters (but also everyone else who wants to join in) to write short notes for each of the candidates. A short explanation can do wonders for making a logo more interesting (my second favorite example is the seemingly simple FedEx logo <https://en.wikipedia.org/wiki/FedEx#Logo>).

This time can also be an opportunity to remove submissions from the pool, or to flag other concerns and make them explicit in the notes, so that voters are aware. This would include similarities to other, existing logos, and we should decide whether we still want to keep the candidate or drop it.

We will open the voting on Monday afternoon Pacific time, and keep it open for two weeks until 15 March. Everyone eligible can vote for as many different candidates as they like, and the logo with the most votes will then be submitted to the legal department in order to scrutinize it, and to the design department to refine it. To give an example of how far the refinement could go: the winner of the Wikidata logo voting was refined by using a different ratio on the bar and changing the wordmark considerably. We will take similar freedoms with the logo concept this time around too.

In summary, here are the next steps: until Monday, let’s agree on the set of logo concepts to vote on, and decide on the notes to accompany them. On Monday, we will then start sending out messages inviting people to vote.

I am very excited to see what will be in the corner of our new website, on t-shirts, stickers, and badges. And thank you all for joining us in this exciting process!

3 years, 1 month

URL-addressable statements and clusters of statements

by Adam Sobieski

Hello.

The following ideas about URL-addressable statements and clusters of statements (e.g. paraphrase sets or clusters) are relevant to a recent Wikifact project proposal <https://meta.wikimedia.org/wiki/Wikifact>, could be relevant to a recent Wikipragmatica project proposal <https://meta.wikimedia.org/wiki/Wikipragmatica>, and, hopefully, are relevant and interesting to Wikidata and Abstract Wikipedia.

Each statement, claim, or fact could have a URL, for instance https://www.wikifact.org/statements/33DCF305-3A4D-4024-9AD7-CCB1A29054E2 . Each cluster of paraphrases could have a URL, for instance https://www.wikifact.org/clusters/D006871E-24A6-428F-BD1F-D20C3C7B7685 .

The URL for an individual statement, claim, or fact could, while optionally providing data, redirect to the URL for the paraphrase cluster which contains it. This could facilitate semi-automated, collaborative paraphrasing. That is, in the event of an erroneous paraphrasing, editors or software tools could edit a redirect page to move the individual statement, claim, or fact to an updated cluster of paraphrases. At the URL for a paraphrase cluster could be a human-editable sequence of explained annotations about a statement, claim, or fact.

The emergent feature of URL-addressability could make Web-based communication about statements, claims, and facts more convenient. End-users would be able to share hyperlinks to fact-checking articles about individual statements, claims, or facts. This could facilitate a number of other, related technologies.

Also interestingly, statement patterns could be expressed and these patterns could be utilized via URL query strings. Nouns or noun phrases could be provided as arguments. That is, arguments for thematic relations could be provided utilizing Wikidata lexemes and entities. https://www.wikifact.org/patterns/293FCD5D-27A7-498A-81C3-C78EF0F9D9A2?agen… could represent a set of statements expressing that “Douglas Adams ate an apple.”

Best regards,
Adam Sobieski
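The redirect idea in this post, where a statement URL resolves to the URL of the paraphrase cluster that contains it and re-clustering means editing that redirect, might be sketched as follows. This is only an illustration under stated assumptions: `StatementStore` and its methods are hypothetical names, the store is an in-memory dictionary and not an existing Wikifact API, and only the host and the `/statements/` and `/clusters/` path shapes come from the example URLs above.

```python
import uuid
from dataclasses import dataclass, field

BASE = "https://www.wikifact.org"  # host taken from the example URLs in the post


def new_id():
    """Generate an upper-case UUID, matching the shape of the example URLs."""
    return str(uuid.uuid4()).upper()


@dataclass
class StatementStore:
    """Hypothetical in-memory registry; not an existing Wikifact API."""
    text: dict = field(default_factory=dict)      # statement id -> statement text
    redirect: dict = field(default_factory=dict)  # statement id -> cluster id

    def add_statement(self, text, cluster_id=None):
        statement_id = new_id()
        self.text[statement_id] = text
        # A statement with no known paraphrases starts in a cluster of its own.
        self.redirect[statement_id] = cluster_id or new_id()
        return statement_id

    def statement_url(self, statement_id):
        return f"{BASE}/statements/{statement_id}"

    def resolve(self, statement_id):
        """Follow the redirect from a statement to its cluster URL."""
        return f"{BASE}/clusters/{self.redirect[statement_id]}"

    def recluster(self, statement_id, new_cluster_id):
        """Edit the redirect, e.g. to undo an erroneous paraphrase grouping."""
        self.redirect[statement_id] = new_cluster_id

    def members(self, cluster_id):
        return sorted(s for s, c in self.redirect.items() if c == cluster_id)


store = StatementStore()
a = store.add_statement("Douglas Adams ate an apple.")
cluster = store.redirect[a]
b = store.add_statement("An apple was eaten by Douglas Adams.", cluster)
c = store.add_statement("Douglas Adams wrote a book.", cluster)  # wrongly grouped
store.recluster(c, new_id())  # an editor or tool corrects the grouping
```

After the correction, `a` and `b` still resolve to the same cluster URL while `c` resolves to a different one, which is the property that would let shared hyperlinks keep working as clusters are curated.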

3 years, 2 months

Newsletter #19: A test suite for evaluation engines (and other updates)

by Denny Vrandečić

The on-wiki version of this newsletter can be found here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-18

Development has been active. We are deep into Phase γ <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B3_(gam…>, working on supporting the core types for Wikifunctions, including functions, implementations, testers, errors, and so on. We are removing some major blockers for further development.

At the same time, we have already begun our work on the larger architecture of the system, in particular our evaluation engine with support for one native programming language. The evaluation engine is the part of Wikifunctions responsible for evaluating function calls. That is, it is the part that gets asked, “Hey, what’s the sum of 3 and 5?” and answers, “8”.

Our evaluation engine is separated into two main parts: the function orchestrator <https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-…>, which receives the calls and collates the functions and any data needed to process and evaluate the calls; and the function executor <https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-…>, which runs the contributor-written code, as instructed by the orchestrator. As the executor can run uncontrolled native code, it lives in a tightly controlled environment and has only minimal permissions beyond the limited use of compute and memory.

The orchestrator will also rely heavily on caching: if we have just calculated the sum of 3 and 5, and someone else asks for that too, we’ll just take it from the cache instead of re-running the computation. We'll also cache function definitions and inputs within the orchestrator, so that if someone asks for the sum of 3 and 6 we can answer more swiftly.

But this is just our production evaluation engine. We are hoping that several other evaluation engines will be built, like the GraalVM-based one <https://github.com/lucaswerkmeister/graaleneyj> on which Lucas Werkmeister is already working. In order to support the development of evaluation engines, we are working on a test suite that other evaluation engines can use for conformance testing. If you’re interested in joining that effort, drop a note on this task <https://phabricator.wikimedia.org/T275093>. The test suite, as well as the common code used by several parts of our system to handle ZObjects, will live in a new library repository, function-schemata <https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-…>.

This development has been a bit out of order from the original plan we conceived last August. In fact, we are thinking of changing the order of some of the developments, and we expect to do significant parts of it in parallel. Having the evaluation engine available earlier makes it possible to start the security and performance reviews in a timely manner, and to validate our architectural plans.

Originally, we had only planned for an evaluation engine that understands a programming language in Phase θ, and to support only a single programming language until after launch. We have now moved that much earlier, and we also plan to support at least two programming languages right at launch. This change will help us avoid the pitfall of possibly getting stuck with a design that only works for one programming language. Having two or more will better commit us to a multi-environment project, in terms of programming languages.

In other news

The deadline for submissions to the Wikifunctions logo concept <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Wikifunctions_logo_conce…> is coming closer: submissions are accepted until Tuesday, 23 February, followed by a two-day discussion before the voting on which concept to develop starts on Thursday, 25 February. Currently, we have 17 submissions (and some additional variants).

There have been a number of talks and external articles which may be of interest.

We gave a presentation at the Graph Technologies in the Humanities: 2021 Virtual Symposium <https://graphentechnologien.hypotheses.org/tagungen/graph-technologies-in-t…>. You can watch our pre-recorded presentation <https://thm.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=30dcef00-63a2-44b…> for the symposium. It was followed by ample time to discuss the project; unfortunately, the discussion itself will not be published.

We also presented at the NSF Convergence Accelerator Series <http://spatial.ucsb.edu/2021/Denny-Vrandecic>. The talk is very similar to the previous talk, but this recording includes the discussion following the talk.

The Tool Box Journal - A Computer Journal For Translation Professionals Issue 322 <https://internationalwriters.com/toolkit/current.html> reports on Abstract Wikipedia, Wikifunctions, and Wikidata. I found it very interesting to see how the projects are perceived by professional translators, and their comparison of Wikidata to a termbase.

The German magazine Der Spiegel published an interview with Denny <https://www.spiegel.de/netzwelt/web/wikipedia-wird-20-wenn-google-das-proje…> about Abstract Wikipedia. They also published a more comprehensive article in their 16 January print issue, which is available in their archive for subscribers <https://www.spiegel.de/netzwelt/web/wie-wikipedia-zu-einer-uebersetzungsmas…>. Both the interview and the article are in German.
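The orchestrator/executor split, the result cache, and the idea of a conformance suite shared between evaluation engines, as described in this newsletter, can be illustrated with a toy sketch. Everything named here (`Executor`, `Orchestrator`, `IMPLEMENTATIONS`, `CONFORMANCE_CASES`, `run_conformance`) is hypothetical and chosen for illustration; it is not the code or the interface of the actual function-orchestrator, function-executor, or function-schemata repositories, and no sandboxing is attempted.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry standing in for contributor-written implementations.
IMPLEMENTATIONS: Dict[str, Callable[..., int]] = {
    "sum": lambda a, b: a + b,
    "product": lambda a, b: a * b,
}


class Executor:
    """Runs an implementation; a real one would run in a restricted sandbox."""

    def __init__(self):
        self.runs = 0  # counts actual executions, to make cache hits visible

    def run(self, implementation, args):
        self.runs += 1
        return implementation(*args)


class Orchestrator:
    """Receives calls, looks up the function, and caches results."""

    def __init__(self, executor):
        self.executor = executor
        self.result_cache: Dict[Tuple[str, tuple], int] = {}

    def evaluate(self, function_name, *args):
        key = (function_name, args)
        if key not in self.result_cache:
            implementation = IMPLEMENTATIONS[function_name]
            self.result_cache[key] = self.executor.run(implementation, args)
        return self.result_cache[key]


# A conformance suite can be plain data: (function, arguments, expected result).
CONFORMANCE_CASES = [
    ("sum", (3, 5), 8),
    ("sum", (3, 6), 9),
    ("product", (3, 5), 15),
]


def run_conformance(evaluate):
    """Return the cases that an engine's `evaluate` callable gets wrong."""
    return [case for case in CONFORMANCE_CASES
            if evaluate(case[0], *case[1]) != case[2]]


executor = Executor()
engine = Orchestrator(executor)
failures = run_conformance(engine.evaluate)
repeated = engine.evaluate("sum", 3, 5)  # served from the cache, no new run
```

Because `run_conformance` only needs a callable, any alternative engine that exposes an equivalent entry point could be checked against the same cases, which is the property a shared test suite is meant to provide.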

3 years, 2 months

Newsletter #18: Two prototype tools to visualize lexicographic coverage in Wikidata

by Denny Vrandečić

The on-wiki version of this newsletter is here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-10

The goal of Abstract Wikipedia is to generate natural language texts from an abstract representation of the content to be represented. In order to do so, we will use lexicographic data from Wikidata. And although we are quite far from being able to generate texts, one thing that we want to encourage everyone’s help with is the coverage and completeness of the lexicographic data in Wikidata.

Today we want to present prototypes of two tools that could help people to visualize, exemplify, and better guide our understanding of the coverage of lexicographic data in Wikidata.

Annotation interface

The first prototype is an annotation interface that allows users to annotate sentences in any language, associating each word or expression with a Lexeme from Wikidata, including picking its Form and Sense.

You can see an example in the screenshot below. Each ‘word’ of the sentence here is annotated with a Lexeme (the Lexeme ID L31818 <https://www.wikidata.org/wiki/Lexeme:L31818> is given just under the word), followed by the lemma, the language, and the part of speech. Then comes, if selected, the specific Form that is being used in context. For example, on ‘dignity’ we see the Form ID L31818#F1, which is the singular Form of the Lexeme. Lastly comes the Sense, which is assigned Sense ID L31818#S1 and defined by a gloss.

At any time, you can remove any of the annotations, or add new annotations. Some of the options will take you directly to Wikidata. For example, if you want to add a Sense to a given Lexeme, because it has no Senses or is missing the one you need, it will take you to Wikidata and let you do that there in the normal fashion. Once added there, you can come back and select the newly added Sense.

The user interface of the prototype is a bit slow, so please give it a few seconds when you initiate an action. It should work out of the box in different languages. The Universal Language Selector is available (at the top of the page), which you can use to change the language. Note that glosses of Senses are frequently only available in the language of the Lexeme, and the UI doesn’t yet do language fallback, so if you look at English sentences with a German UI you might often find missing glosses.

Technologically, this is a prototype entirely implemented in JavaScript and CSS on top of a vanilla MediaWiki installation. This is likely not the best possible technical solution for such a system, but should help to determine if there is any user interest in the tool, for a potential reimplementation. Also, it would be a fascinating task to agree on an API which can be implemented by other groups to provide the selection of Lexemes, Senses, and Forms for input sentences. The current baseline here is extremely simple, and would not be good enough for an automated tagging system. Having this available for many sentences in many languages could provide a great corpus for training natural language understanding systems. There is a lot that could be built upon that.

The goal of this prototype is to make more tangible the Wikidata community's progress regarding the coverage of the lexicographical data. You can take a sentence in any written language, put it into this system, and find out how complete you can get with your annotations. It's a way to showcase and create anecdotal experience of the lexicographic data in Wikidata.

The prototype annotation interface is at: http://annotation.wmcloud.org/
You can discuss it here: https://annotation.wmcloud.org/wiki/Discussion
(You will need to create a new account. If you have time to set this up with SUL, drop me a line.)

Corpus coverage dashboard

The second prototype tool is a dashboard that shows the coverage of the data compared to a corpus in each of forty languages.

Last year, whilst in my previous position at Google Research, I co-authored a publication where we built and published language models out of the cleaned-up text of about forty Wikipedia language editions [1]. Besides the language models, we also published the raw data: this text has been cleaned up by the pre-processing system that Google uses on Wikipedia text in order to integrate the text in several of its features. So while this dataset consists of relatively clean natural language text (certainly, compared to the raw wiki text), it still contains plenty of artefacts. If you know of better large scale encyclopedic text corpora we can use, maybe better cleaned-up versions of Wikipedia, or ones covering more languages, please let us know <https://phabricator.wikimedia.org/T273221>.

We extracted these texts from the TensorFlow models <https://www.tensorflow.org/datasets/catalog/wiki40b>. We provide the extracted texts for download <https://drive.google.com/drive/folders/1HfL138UCqr69w0XfAhlAEUh6VVOnzwBE> (a task <https://phabricator.wikimedia.org/T274208> to move it to Wikimedia servers is underway). We split the text into tokens, counted the occurrences of words, and compared how many of these tokens appear in the Forms on Lexemes of the given language in Wikidata’s lexicographic data. If this proves useful, we could move the cleaned-up text to a more permanent home.

A screenshot of the current state for English is given here: we see how many Forms for this language are available in Wikidata, and we see how many different Forms are attested in Wikipedia (i.e., how many different words, or word types, are in the Wikipedia of the given language). The number of tokens is the total number of words in the given language corpus. Covered forms says how many of the forms in the corpus are also in Wikidata's Lexeme set, and covered tokens tells us how many of the occurrences that covers (so, if the word ‘time’ appears 100 times in English Wikipedia, it would be counted as one covered form, but 100 covered tokens). The two pie charts visualize the coverage of forms and tokens respectively.

Finally, there is a link to the thousand most frequent forms that are not yet in Wikidata. This can help communities prioritise ramping up coverage quickly. Note though, the progress report is manual and does not automatically update. I plan to run an update from time to time for now.

The prototype corpus coverage dashboard is at: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
You can discuss it here: https://www.wikidata.org/wiki/Wikidata_talk:Lexicographical_coverage

Help wanted

Both prototype tools are exactly that: prototypes, not real products. We have not committed to supporting and developing these prototypes further. At the same time, all of the code and data is of course open sourced. If anyone would like to pick up the development or maintenance of these prototypes, you would be more than welcome. Please let us know (on my talk page <https://meta.wikimedia.org/wiki/User_talk:DVrandecic_(WMF)>, or via e-mail, or on the Tool ideas page <https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Ideas_of_tools>).

Also, if someone likes the idea but thinks that a different implementation would be better, please move ahead with that. I am happy to support and talk with you. There is much to improve here, but we hope that these two prototypes will lead to more development of content and tools <https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data> in the space of lexicographic data.

[1] Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou: Wiki-40B: Multilingual Language Model Dataset, LREC 2020, https://www.aclweb.org/anthology/2020.lrec-1.297/
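The difference between covered forms and covered tokens described for the dashboard can be made concrete with a small sketch. It is an illustration under stated assumptions, not the dashboard's actual code: the `coverage` function is a hypothetical name, the tokenizer is a naive lowercase word split, and `known_forms` stands in for the set of Forms found on Wikidata Lexemes for the language.

```python
import re
from collections import Counter


def coverage(corpus_text, known_forms):
    """Compute form (word type) and token coverage of a corpus by known Forms."""
    # Naive tokenizer for illustration only: lowercase runs of word characters.
    tokens = re.findall(r"\w+", corpus_text.lower())
    counts = Counter(tokens)
    covered = {form: n for form, n in counts.items() if form in known_forms}
    missing = Counter({f: n for f, n in counts.items() if f not in known_forms})
    return {
        "forms_in_corpus": len(counts),         # distinct words (types)
        "tokens": sum(counts.values()),         # total word occurrences
        "covered_forms": len(covered),          # types also present as Forms
        "covered_tokens": sum(covered.values()),  # occurrences those types cover
        # The thousand most frequent forms that are not yet known.
        "most_frequent_missing": [f for f, _ in missing.most_common(1000)],
    }


corpus = "Time flies. Time and time again the flies return."
forms = {"time", "flies", "and", "the"}
stats = coverage(corpus, forms)
```

In this toy corpus ‘time’ occurs three times, so it contributes one covered form but three covered tokens, mirroring the ‘time’ example in the newsletter. Frequent words dominate token coverage, which is why the list of the most frequent missing forms is a useful priority list.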

3 years, 2 months

Newsletter #17: Phase β completed

by Denny Vrandečić

The on-wiki version of this edition of the newsletter is here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-04

When we started the development effort towards the Wikifunctions site, we subdivided the work leading up to the launch of Wikifunctions into eleven phases <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases>, named after the letters of the Greek alphabet. This week we completed the second phase, Phase β <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B2_(bet…>.

With Phase α <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B1_(alp…> completed, it became possible to create instances of the system-provided Types in the wiki. This meant that one could go to the wiki and create, for example, a string, such as this Hello World! <https://notwikilambda.toolforge.org/wiki/ZObject:Z101> String in Lucas <https://meta.wikimedia.org/wiki/User:Lucas_Werkmeister>’s notwikilambda demonstration <https://notwikilambda.toolforge.org/wiki/Main_Page> system.

The goal of Phase β <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B2_(bet…> was to allow the creation of Types on-wiki, and to allow the creation of instances of these Types. The assumption is that we will provide just a very small set of core Types, and that almost all of the Types will be defined by the community on wiki.

We have discussed Types before in Newsletters #7 <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2020-11-10> and #15 <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-01-21>. As we discussed there, a good Type system can be very helpful in maintaining and working with the catalogue of functions. It can help with choosing the right function, with navigating and exploring the catalogue, and with finding errors in the function implementations.

In order to demonstrate that we indeed completed Phase β, we created a Type for Positive Integers <https://notwikilambda.toolforge.org/wiki/ZObject:Z70> on the notwikilambda test site, and a literal instance of that Type for the number one <https://notwikilambda.toolforge.org/wiki/ZObject:Z701>. (To make it clear, we do not expect to have a page for every natural number; in fact it might well be that the community decides to restrict their creation as persistent objects. They will usually be created and passed through as literals that are created on the fly. If you are interested in a catalogue of natural numbers, I can refer you to the Linked Open Numbers <http://km.aifb.kit.edu/projects/numbers/> project.)

We have also considerably improved the user interface in Phase β, and it is now often using labels next to the bare identifiers. The labels are fully internationalized, and can be localized on-wiki (e.g. the labels for the Type Positive integer can be edited right on the Type page <https://notwikilambda.toolforge.org/wiki/ZObject:Z70>, as the key of the type). Note that in order to have editing privileges on that wiki, you need to be logged in. Many parts of the user interface used to be hard coded, but now are dynamically pulled from the wiki.

There are of course bugs, and we are tracking them on the Phabricator task board <https://phabricator.wikimedia.org/project/view/4876/>. If you encounter new bugs, please do raise them. Either let us know, or create a new bug report in the column “Needs triage” so that we know to review them.

We are starting now with Phase γ <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B3_(gam…>. The goal is to create all the main Types of the pre-generic function model <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Pre-generic_function_mod…>: Function, Implementation, Tester, Function call, Error, and so on. There are a number of tasks that will allow us to create these Types; Function call in particular will have some magic features.

A bit out of order, we also started developing the supporting services, the function orchestrator <https://gerrit.wikimedia.org/g/mediawiki/services/function-orchestrator/+/r…> and the function evaluator <https://gerrit.wikimedia.org/g/mediawiki/services/function-evaluator/+/refs…>. This is in order to get input on the architecture <https://www.mediawiki.org/wiki/Extension:WikiLambda> as soon as possible.

Once the function data model is in place, Phase δ <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B4_(del…> will allow evaluating the function calls that we are building in the current phase. This, and the composition of functions that will be enabled in Phase ε <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B5_(eps…>, will be the beating heart of the technical features that Wikifunctions will provide. After δ it will be possible for everyone to call functions from their wiki pages, and after ε it will be possible to create any kind of function. Sure, there will still be tons of things that need to be developed and improved, without question, but these will be the main steps towards providing a glimpse into what Wikifunctions will bring to the Wikimedia movement and beyond.

I want to end with a big shout out to the whole team, and to the volunteers who were contributing patches, in particular Arthur P. Smith and Gabriel Lee, and give a pointer to the logo concept contest <https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Wikifunctions_logo_conce…>. The submission deadline is on 23 February, and we already have nine great submissions! Take a look, feel free to add your own concept submissions, ideas, and comments on others', and let others know who might be interested.
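To make the Phase β idea of community-defined Types, literal instances, and localized labels more tangible, here is a minimal sketch. It is a toy model under stated assumptions: the `Type` and `Instance` classes, the `accepts` validator, and the label fallback are hypothetical and do not reflect the actual ZObject representation used by WikiLambda; only the identifier Z70 and the English label "Positive integer" are taken from the newsletter.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Type:
    """A toy Type: an identifier, localized labels, and a validity check."""
    zid: str
    labels: Dict[str, str]            # language code -> label
    validator: Callable[[Any], bool]  # hypothetical stand-in for type checking

    def label(self, language, fallback="en"):
        # Fall back to another language, then to the bare identifier.
        return self.labels.get(language, self.labels.get(fallback, self.zid))

    def accepts(self, value):
        return bool(self.validator(value))

    def instance(self, value):
        """Create a literal instance on the fly, without a persistent page."""
        if not self.accepts(value):
            raise ValueError(f"{value!r} is not a valid {self.label('en')}")
        return Instance(self, value)


@dataclass
class Instance:
    type: Type
    value: Any


def is_positive_integer(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


positive_integer = Type("Z70", {"en": "Positive integer"}, is_positive_integer)
one = positive_integer.instance(1)
```

Here `one` is a literal created on the fly, in the spirit of the note that natural numbers will mostly be passed through as literals. Asking for a label in a language that has none, for example `positive_integer.label("de")`, falls back to the English label in this sketch.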

3 years, 2 months
