The on-wiki version of this update can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-25
--
The logo concept
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Wikifunctions_logo_conce…>
submission phase has come to a close, and we received 46 submissions and
variants
<https://meta.wikimedia.org/wiki/Talk:Abstract_Wikipedia/Wikifunctions_logo_…>.
I am deeply impressed with the submissions, and am looking forward to the
vote.
Some of the submissions have a number of variants, and in order to avoid
the decoy effect <https://en.wikipedia.org/wiki/Decoy_effect>, we would
like to remove concepts that are too similar to each other. For Wikidata,
we made the decision inside the team about which variants to choose, and we
also dropped a number of the proposals. Here, I would like us to make the decision together; to keep things moving, we will review the status of the discussion on Monday, and the team will then finalize the exact set of candidates.
We also plan to accompany each submission with a short note (3-4 sentences), so we are asking the submitters (and everyone else who wants to join in) to write these notes for the candidates. A short explanation can do
wonders for making a logo more interesting (my second favorite example is
the seemingly simple FedEx logo <https://en.wikipedia.org/wiki/FedEx#Logo>).
This discussion period is also an opportunity to remove submissions from the pool, or to flag other concerns and make them explicit in the notes, so that voters are aware of them. This includes similarities to other, existing logos, where we should decide whether we still want to keep the candidate or drop it.
We will open the voting on Monday afternoon Pacific time, and keep it open
for two weeks until 15 March. Everyone eligible can vote for as many
different candidates as they like, and the logo with the most votes will then be submitted to the legal department for scrutiny and to the design department for refinement.
To give an example of how far the refinement could go: the winner of the Wikidata logo vote was refined by using a different ratio for the bars and by changing the wordmark considerably. We will take similar freedoms with the logo concept this time around too.
In summary, here are the next steps: until Monday, let’s agree on the set
of logo concepts to vote on, and decide on the notes to accompany them. On
Monday, we will then start sending out messages inviting people to vote.
I am very excited to see what will be in the corner of our new website, on
t-shirts, stickers, and badges. And thank you all for joining us in this
exciting process!
Hello. The following ideas about URL-addressable statements and clusters of statements (e.g. paraphrase sets or clusters) are relevant to the recent Wikifact project proposal <https://meta.wikimedia.org/wiki/Wikifact>, could be relevant to the recent Wikipragmatica project proposal <https://meta.wikimedia.org/wiki/Wikipragmatica>, and are, hopefully, also relevant and interesting to Wikidata and Abstract Wikipedia.
Each statement, claim, or fact could have a URL, for instance https://www.wikifact.org/statements/33DCF305-3A4D-4024-9AD7-CCB1A29054E2 . Similarly, each cluster of paraphrases could have a URL, for instance https://www.wikifact.org/clusters/D006871E-24A6-428F-BD1F-D20C3C7B7685 .
The URL for an individual statement, claim, or fact could, while optionally providing data, redirect to the URL of the paraphrase cluster which contains it. This would support semi-automated, collaborative paraphrasing: in the event of an erroneous paraphrasing, editors or software tools could edit the redirect page to re-cluster the individual statement, claim, or fact into an updated cluster of paraphrases. At the URL for a paraphrase cluster could be a human-editable sequence of explained annotations about a statement, claim, or fact.
URL-addressability would also make Web-based communication about statements, claims, and facts easier. End-users would be able to share hyperlinks to fact-checking articles about individual statements, claims, or facts, and a number of other, related technologies could build on this.
Also interestingly, statement patterns could be expressed, and these patterns could be used via URL query strings, with nouns or noun phrases provided as arguments. That is, arguments for thematic relations could be provided using Wikidata lexemes and entities.
https://www.wikifact.org/patterns/293FCD5D-27A7-498A-81C3-C78EF0F9D9A2?agen… could represent a set of statements expressing that “Douglas Adams ate an apple.”
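As a small, purely illustrative sketch in JavaScript, such a pattern URL could be assembled as follows. The host and pattern path are taken from the example above, but the parameter names ('agent', 'patient') and the argument identifiers are assumptions on my part rather than part of any existing API.

    // Hypothetical sketch: build a pattern URL whose thematic-relation
    // arguments are Wikidata identifiers. Parameter names are illustrative.
    const patternUrl = new URL(
      'https://www.wikifact.org/patterns/293FCD5D-27A7-498A-81C3-C78EF0F9D9A2');
    patternUrl.searchParams.set('agent', 'Q42');   // Douglas Adams
    patternUrl.searchParams.set('patient', 'Q89'); // apple
    console.log(patternUrl.toString());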
Best regards,
Adam Sobieski
The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-18
--
Development has been active. We are deep into Phase γ
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Phases#Phase_%CE%B3_(gam…>,
working on supporting the core types for Wikifunctions, including
functions, implementations, testers, errors, and so on. We are removing
some major blockers for further development. At the same time, we have
already begun our work on the larger architecture of the system, in
particular our evaluation engine with support for one native programming
language.
The evaluation engine is the part of Wikifunctions responsible for
evaluating function calls. That is, it is the part that gets asked, “Hey,
what’s the sum of 3 and 5?” and answers, “8”. Our evaluation engine is split into two main parts: the function orchestrator
<https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-…>,
which receives the calls and collates the functions and any data needed to
process and evaluate the calls; and then the function executor
<https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-…>,
which runs the contributor-written code, as instructed by the orchestrator.
As the executor can run uncontrolled native code, it lives in a tightly
controlled environment and has only minimal permissions beyond the limited
use of compute and memory.
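To make the division of labour more concrete, here is a rough sketch in JavaScript. This is not the actual Wikifunctions code or API (the real services exchange ZObjects over HTTP, and the executor runs in an isolated sandbox); it only illustrates the split between collating a call and running contributor-written code.

    // Illustrative only: an "executor" that runs code it is handed, and an
    // "orchestrator" that looks up the implementation and instructs the executor.
    function execute(implementationCode, args) {
      // In the real system this runs in a tightly controlled environment;
      // here we simply build and call a JavaScript function.
      const fn = new Function('args', implementationCode);
      return fn(args);
    }

    const implementations = {
      sum: 'return args.reduce((a, b) => a + b, 0);',
    };

    function orchestrate(call) {
      const code = implementations[call.function];
      return execute(code, call.args);
    }

    console.log(orchestrate({ function: 'sum', args: [3, 5] })); // 8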
The orchestrator will also rely heavily on caching: if we have just
calculated the sum of 3 and 5, and someone else asks for that too, we’ll
just take it from the cache instead of re-running the computation. We'll
also cache function definitions and inputs within the orchestrator, so that
if someone asks for the sum of 3 and 6 we can answer more swiftly.
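As a sketch of what that caching could look like, building on the hypothetical orchestrate() from the previous snippet (again, not the real implementation), the cache key simply combines the function and its arguments:

    // Illustrative result cache in front of the orchestrator.
    const resultCache = new Map();

    function cachedOrchestrate(call) {
      const key = JSON.stringify([call.function, call.args]);
      if (!resultCache.has(key)) {
        resultCache.set(key, orchestrate(call));
      }
      return resultCache.get(key);
    }

    cachedOrchestrate({ function: 'sum', args: [3, 5] }); // computed once
    cachedOrchestrate({ function: 'sum', args: [3, 5] }); // answered from the cache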
But this is just our production evaluation engine. We are hoping that
several other evaluation engines will be built, like the GraalVM-based one
<https://github.com/lucaswerkmeister/graaleneyj> on which Lucas Werkmeister
is already working. In order to support the development of evaluation
engines, we are working on a test suite that other evaluation engines can
use for conformance testing. If you’re interested in joining that effort,
drop a note on this task <https://phabricator.wikimedia.org/T275093>. The
test suite, as well as the common code used by several parts of our system
to handle ZObjects, will live in a new library repository, function-schemata
<https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-…>
.
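To give an idea of what conformance testing could mean, here is a hypothetical sketch; the real test cases will be defined in function-schemata, and the field names below are placeholders, not the actual schema.

    // Hypothetical conformance cases: a function call plus its expected result.
    const conformanceCases = [
      { call: { function: 'sum', args: [3, 5] }, expected: 8 },
      { call: { function: 'sum', args: [3, 6] }, expected: 9 },
    ];

    // An evaluation engine passes if it produces the expected result for each case.
    function conforms(evaluate) {
      return conformanceCases.every(
        ({ call, expected }) => evaluate(call) === expected
      );
    }

    console.log(conforms(orchestrate)); // true for the orchestrate() sketch above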
This development has happened in a somewhat different order than the original plan we conceived last August. In fact, we are thinking of reordering some of the remaining work, and we expect to do significant parts of it in parallel. Having the evaluation engine available earlier makes it possible
to start the security and performance reviews in a timely manner, and to
validate our architectural plans. Originally, we had only planned for an
evaluation engine that understands a programming language in Phase θ, and
to support only a single programming language until after launch. We have now pulled that work forward considerably, and we also plan to support at least two programming languages right at launch. This change will help us avoid the pitfall of getting stuck with a design that only works for one programming language: supporting two or more from the start commits us to a project that genuinely works across multiple programming languages.
In other news
The deadline for submissions to the Wikifunctions logo concept
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Wikifunctions_logo_conce…>
is coming closer: submissions are accepted until Tuesday, 23 February,
followed by a two-day discussion before the voting on which concept to
develop starts on Thursday, 25 February. Currently, we have 17 submissions
(and some additional variants).
There have been a number of talks and external articles which may be of
interest.
We gave a presentation at the Graph Technologies in the Humanities: 2021
Virtual Symposium
<https://graphentechnologien.hypotheses.org/tagungen/graph-technologies-in-t…>.
You can watch our pre-recorded presentation
<https://thm.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=30dcef00-63a2-44b…>
for the symposium. It was followed by ample time to discuss the project;
unfortunately, the discussion itself will not be published.
We also presented at the NSF Convergence Accelerator Series
<http://spatial.ucsb.edu/2021/Denny-Vrandecic>. The talk is very similar to
the previous talk, but this recording includes the discussion following the
talk.
The Tool Box Journal - A Computer Journal For Translation Professionals
Issue 322 <https://internationalwriters.com/toolkit/current.html> reports
on Abstract Wikipedia, Wikifunctions, and Wikidata. I found it very
interesting to see how the projects are perceived by professional
translators, and their comparison of Wikidata to a termbase.
The German magazine Der Spiegel published an interview with Denny
<https://www.spiegel.de/netzwelt/web/wikipedia-wird-20-wenn-google-das-proje…>
about Abstract Wikipedia. They also published a more comprehensive article
in their 16 January print issue, which is available in their archive for
subscribers
<https://www.spiegel.de/netzwelt/web/wie-wikipedia-zu-einer-uebersetzungsmas…>.
Both the interview and the article are in German.
The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-10
The goal of Abstract Wikipedia is to generate natural language texts from an abstract representation of their content. In order to do so, we will use lexicographic data from Wikidata. And although we are quite far from being able to generate texts, one area where we want to encourage everyone’s help already is the coverage and completeness of the lexicographic data in Wikidata.
Today we want to present prototypes of two tools that could help visualize and exemplify the coverage of lexicographic data in Wikidata, and better guide our understanding of it.
Annotation interface
The first prototype is an annotation interface that allows users to
annotate sentences in any language, associating each word or expression
with a Lexeme from Wikidata, including picking its Form and Sense.
You can see an example in the screenshot below. Each ‘word’ of the sentence
here is annotated with a Lexeme (the Lexeme ID L31818
<https://www.wikidata.org/wiki/Lexeme:L31818> is given just under the
word), followed by the lemma, the language, and the part of speech. Then
comes, if selected, the specific Form that is being used in context - for
example, on ‘dignity’ we see the Form ID L31818#F1, which is the singular
Form of the Lexeme. Lastly comes the Sense, which has the Sense ID L31818#S1 and is defined by a gloss.
At any time, you can remove any of the annotations, or add new annotations.
Some of the options will take you directly to Wikidata. For example, if you
want to add a Sense to a given Lexeme, because it has no Senses or is
missing the one you need, it will take you to Wikidata and let you do that
there in the normal fashion. Once added there, you can come back and select
the newly added Sense.
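For those curious about the underlying data, the Lexeme with its Forms and Senses can be fetched directly from the Wikidata API. Here is a small sketch (runnable in a browser console or Node 18+) using the example Lexeme from above; it is just the standard wbgetentities call, not the prototype's own code.

    // Fetch Lexeme L31818 and list its Forms and Senses.
    async function showLexeme(id) {
      const url = 'https://www.wikidata.org/w/api.php' +
        '?action=wbgetentities&ids=' + id + '&format=json&origin=*';
      const data = await (await fetch(url)).json();
      const lexeme = data.entities[id];
      console.log('Lemma:', Object.values(lexeme.lemmas)[0].value);
      for (const form of lexeme.forms) {
        console.log('Form', form.id, Object.values(form.representations)[0].value);
      }
      for (const sense of lexeme.senses) {
        const gloss = Object.values(sense.glosses)[0];
        console.log('Sense', sense.id, gloss ? gloss.value : '(no gloss)');
      }
    }

    showLexeme('L31818');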
The user interface of the prototype is a bit slow, so please give it a few
seconds when you initiate an action. It should work out of the box in
different languages. The Universal Language Selector is available at the top of the page, and you can use it to change the language. Note that
glosses of Senses are frequently only available in the language of the
Lexeme, and the UI doesn’t yet do language fallback, so if you look at
English sentences with a German UI you might often find missing glosses.
Technologically, this is a prototype entirely implemented in JavaScript and
CSS on top of a vanilla MediaWiki installation. This is likely not the best possible technical solution for such a system, but it should help determine whether there is enough user interest in the tool to justify a potential reimplementation. Also, it would be a fascinating task to agree on an API
which can be implemented by other groups to provide the selection of
Lexemes, Senses, and Forms for input sentences. The current baseline here
is extremely simple, and would not be good enough for an automated tagging
system. Having this available for many sentences in many languages could
provide a great corpus for training natural language understanding systems.
There is a lot that could be built upon that.
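To make that idea a bit more concrete, one possible shape for such an API is sketched below. This is entirely an assumption on my part, meant as a starting point for discussion rather than a proposal for the actual interface; the identifiers reuse the example Lexeme from above and the sentence is made up.

    // Hypothetical request: a sentence and its language.
    const exampleRequest = {
      language: 'en',
      text: 'Dignity is a right.',
    };

    // Hypothetical response: candidate Lexemes, Forms, and Senses per token,
    // which an annotation tool could then offer to the user for confirmation.
    const exampleResponse = {
      tokens: [
        {
          text: 'Dignity',
          candidates: [
            { lexeme: 'L31818', form: 'L31818#F1', sense: 'L31818#S1' },
          ],
        },
        // ... one entry per remaining token
      ],
    };

    console.log(JSON.stringify(exampleResponse, null, 2));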
The goal of this prototype is to make the Wikidata community's progress on the coverage of the lexicographical data more tangible. You can take a sentence in any written language, put it into this system, and find out how complete you can get with your annotations. It's a way to showcase the lexicographic data in Wikidata and to gather anecdotal experience of its coverage.
The prototype annotation interface is at:
http://annotation.wmcloud.org/
You can discuss it here:
https://annotation.wmcloud.org/wiki/Discussion
(You will need to create a new account - if you have time to set this up
with SUL, drop me a line)
Corpus coverage dashboard
The second prototype tool is a dashboard that shows how well the lexicographic data covers a corpus in each of forty languages.
Last year, whilst in my previous position at Google Research, I co-authored
a publication where we built and published language models out of the
cleaned-up text of about forty Wikipedia language editions [1]. Besides the
language models, we also published the raw data: this text has been cleaned
up by the pre-processing system that Google uses on Wikipedia text in order
to integrate the text in several of its features. So while this dataset consists of relatively clean natural language text, certainly compared to the raw wiki text, it still contains plenty of artefacts. If you know of
better large scale encyclopedic text corpora we can use, maybe better
cleaned-up versions of Wikipedia, or ones covering more languages, please let
us know <https://phabricator.wikimedia.org/T273221>.
We extracted these texts from the TensorFlow dataset
<https://www.tensorflow.org/datasets/catalog/wiki40b>. We provide the extracted
texts for download
<https://drive.google.com/drive/folders/1HfL138UCqr69w0XfAhlAEUh6VVOnzwBE>
(a task <https://phabricator.wikimedia.org/T274208> to move it to Wikimedia
servers is underway). We split the text into tokens, counted the occurrences of each word, and compared how many of these tokens appear among the Forms of Lexemes of the given language in Wikidata’s lexicographic data. If
this proves useful, we could move the cleaned-up text to a more permanent
home.
A screenshot of the current state for English is given here: we see how
many Forms for this language are available in Wikidata, and we see how many
different Forms are attested in Wikipedia (i.e., how many different words,
or word types, are in the Wikipedia of the given language). The number of
tokens is the total number of words in the given language corpus. Covered
forms says how many of the forms in the corpus are also in Wikidata's
Lexeme set, and covered tokens tells us how many of the occurrences that
covers (so, if the word ‘time’ appears 100 times in English Wikipedia, it
would be counted as one covered form, but 100 covered tokens). The two pie
charts visualize the coverage of forms and tokens respectively.
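The underlying computation is simple; here is a small sketch of it in JavaScript, with made-up word lists, to illustrate the difference between covered forms and covered tokens:

    // Illustrative corpus tokens and the set of Forms known in Wikidata.
    const corpusTokens = ['time', 'flies', 'like', 'an', 'arrow', 'time', 'time'];
    const knownForms = new Set(['time', 'an', 'arrow']);

    // Count how often each distinct form occurs in the corpus.
    const tokenCounts = new Map();
    for (const token of corpusTokens) {
      tokenCounts.set(token, (tokenCounts.get(token) || 0) + 1);
    }

    let coveredForms = 0;
    let coveredTokens = 0;
    for (const [form, count] of tokenCounts) {
      if (knownForms.has(form)) {
        coveredForms += 1;       // 'time' counts once here ...
        coveredTokens += count;  // ... but three times here
      }
    }

    console.log(coveredForms, '/', tokenCounts.size, 'forms covered');
    console.log(coveredTokens, '/', corpusTokens.length, 'tokens covered');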
Finally, there is a link to the thousand most frequent forms that are not
yet in Wikidata. This can help communities prioritise ramping up coverage
quickly. Note, though, that the report is generated manually and does not update automatically. For now, I plan to re-run it from time to time.
The prototype corpus coverage dashboard is at:
https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
You can discuss it here:
https://www.wikidata.org/wiki/Wikidata_talk:Lexicographical_coverage
Help wanted
Both prototype tools are exactly that: prototypes, not real products. We
have not committed to supporting and developing these prototypes further.
At the same time, all of the code and data is of course open sourced. If
anyone would like to pick up the development or maintenance of these
prototypes, you would be more than welcome – please let us know (on my talk
page <https://meta.wikimedia.org/wiki/User_talk:DVrandecic_(WMF)>, or via
e-mail, or on the Tool ideas page
<https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Ideas_of_tools>
).
Also, if someone likes the idea but thinks that a different implementation
would be better, please move ahead with that – I am happy to support and
talk with you. There is much to improve here, but we hope that these two
prototypes will lead to more development of content and tools
<https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data> in the
space of lexicographic data.
[1] Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou: Wiki-40B:
Multilingual Language Model Dataset, LREC 2020,
https://www.aclweb.org/anthology/2020.lrec-1.297/