It is very hard to build a large, or even medium-sized, corpus of sentences in
which each word is manually annotated with its sense.
Abstract Wikipedia not only allows generating text in many languages from one
source, but could also serve as a word sense disambiguation (WSD) corpus.
Moreover: in many languages at once.
This would support understanding natural text and operations like:
1) translating from any natural language into a disambiguated form
2) translating from this form into another natural language
And after step 1, this form would be very useful for much more than translation.
I was interested in the Abstract Wikipedia project a year ago; now I'm
not up to date on the topic.
At the Arctic Knot conference there will be a look at the project as a database of senses.
In Wikidata there is a very large number of Qnnnn entities and a much smaller set of lexemes.
One lexeme can have many senses, and many different words can be synonyms
(problem: sometimes the meanings are very close but not the same).
For example, in the multilingual WordNet, the Polish word "kot" maps to these English senses:
- kot, kot domowy → domestic cat, house cat, Felis domesticus, Felis catus
("any domesticated member of the genus Felis")
- kot → "an unskilled or low-ranking soldier or other worker"
- kot → "an inexperienced and untrained recruit"
- kot → cat, true cat ("feline mammal usually having thick soft fur and no
ability to roar: domestic cats; wildcats")
- kot → "a new military recruit"
Three of them, 10149241-n, 10508379-n and 10641301-n, are English synonyms:
very close meanings, but different lexemes. In Polish there is one lexeme,
"kot". Polish shouldn't distinguish these 3 senses; instead there should be one general sense.
Problem: in Abstract Wikipedia the source text should be language-independent
and sense-centered. Do we need a common sense for 3 different English lexemes?
Senses are distinguishable to different degrees in different languages; for
example, "snow" in African languages vs. Siberian languages.
- How will Abstract Wikipedia deal with senses?
- How can senses be viewed in Wikidata?
The on-wiki version of this week's newsletter is available here:
When we started the development effort towards the Wikifunctions site, we
subdivided the work leading up to the launch of Wikifunctions into eleven
phases, named after the first eleven letters of the Greek alphabet.
- With Phase α (alpha) completed, it became possible to create instances
of the system-provided Types in the wiki.
- With Phase β (beta), it became possible to create new Types on-wiki
and to create instances of these Types.
- With Phase γ (gamma), all the main Types of the pre-generic function
model became available.
- With Phase δ (delta), it became possible to evaluate built-in implementations.
- This week, we completed Phase ε (epsilon).
The goal of Phase ε has been to provide the capability to evaluate
contributor-written implementations in a programming language.
What does this mean? Every function in Wikifunctions can have several
implementations. There are three different ways to express an implementation:
1. As a *built-in* function, written by the development team: this means
that the implementation is handled by the evaluator as a black box.
2. As *code* in a programming language, created by the contributors of
Wikifunctions: the implementation of a function can be given in any
programming language that Wikifunctions supports. Eventually we aim to
support a large number of programming languages, but we will start small.
3. As a *composition* of other functions: this means that contributors
can use existing functions as building blocks in order to implement new
functions.
In Phase ε, we extended the infrastructure for evaluating functions to
allow for the running of contributed code in addition to the built-ins. We
are starting with Python
<https://en.wikipedia.org/wiki/Python_(programming_language)>. We plan to
add more programming languages (most notably the language most familiar to
the Wikimedia community, Lua
<https://en.wikipedia.org/wiki/Lua_(programming_language)>), and we will
document a process for requesting the support of additional programming
languages. We implemented the planned architecture:
[Image: Wikifunctions top-level architectural model]
We now have a system where the orchestrator receives the function call to
be evaluated, gathers all necessary data from Wikifunctions and potentially
other resources, and then chooses the corresponding evaluators that can run
the given programming language. Since this is contributor-created code, we
are very careful about where and how we run the code, and which
capabilities we give to the virtual machine that runs it. For example, no
network access and no persistence layer are allowed, in order to reduce the
potential for security issues.
A security review and a separate performance review of our architecture and
implementation are currently ongoing. Once we have dealt with the most
pressing issues that are uncovered by the reviews, we plan to provide a
demonstration system. This will probably be in the next Phase.
The next screenshot shows an implementation of the Function
“Concatenate”. Concatenate
takes two strings as its arguments and returns a single string consisting
of the two input strings joined end-to-end. Our implementation uses Python.
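For readers who cannot see the screenshot, here is a rough sketch of what a
contributor-written Python implementation of Concatenate might look like. The
exact signature that Wikifunctions expects from contributed code is not shown
in this newsletter, so the names below are illustrative only:

    # Illustrative sketch; the signature Wikifunctions actually expects
    # from contributed Python code may differ.
    def concatenate(first: str, second: str) -> str:
        # Join the two input strings end-to-end.
        return first + second

    # concatenate("Wiki", "functions") returns "Wikifunctions"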
The next screenshot shows the function being called using the arguments
“Wiki“ and “functions”, resulting in the string “Wikifunctions”.
[Screenshot of a function call]
We are now moving on to Phase ζ (zeta). The goal of this Phase will be to
allow for the third type of implementations: the composition of functions
in order to build new capabilities. This will also be the first Phase to
really highlight the advantages of our system for contributing
implementations in non-English languages. We have published a few examples
of composed implementations.
The example implementation of common Boolean functions might be
particularly instructive. Phase ζ is the last of the trilogy of Phases
dealing with the different ways to create an implementation.
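To give a flavor of composition (in Python rather than the actual on-wiki
notation, which the published examples show), assume functions "negate" and
"conjunction" already exist; then "nand" needs no code of its own:

    # Python analogue of composed Boolean functions; illustrative only.
    def negate(a: bool) -> bool:
        return not a

    def conjunction(a: bool, b: bool) -> bool:
        return a and b

    def nand(a: bool, b: bool) -> bool:
        # Purely a composition of the two existing functions.
        return negate(conjunction(a, b))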
During this Phase and subsequent ones, we will also spend some time
reducing the technical debt that we accumulated in the rushed development
of the last two phases, when we hurried in order to be ready for the
security and performance reviews. We also expect to begin early changes
based on our user research
and design development, replacing the current bare-bones user experience.
Structurally, Phase ζ aims to be the turning point of the development
towards Wikifunctions. It will set us up with a system that is powerful
enough to allow for its own refactoring in order to support generic types
and functions in Phase η (eta), and then to implement monitoring, UX,
security, and documentation. All the core technical capabilities should be
in place by then, and then we will need to add the necessary supporting
systems that will allow us to launch Wikifunctions.
Last week saw the Arctic Knot conference
<https://meta.wikimedia.org/wiki/Arctic_Knot_Conference_2021>, which was
about the future of indigenous and underrepresented languages and their
presence and use on the Wikimedia projects. I want to point out a few talks
that are potentially particularly interesting to the Abstract Wikipedia and
Wikifunctions communities:
- Mahir talked about preparing languages for natural language generation
using Wikidata lexicographical data
- Denny gave an introductory presentation on Abstract Wikipedia
<https://www.youtube.com/watch?v=f13c3lCghtE&t=30581s> and suggestions
on what tasks we want to tackle first
- Sadik and Mohammed, in Dagbani Wikipedia Bachinima
<https://www.youtube.com/watch?v=ee1TaLK3dJE>, talked about challenges and
successes of growing the project, including the Spell4Wiki app
Thanks everyone for attending, presenting at, and organizing the event!
We close with a reminder that there is an invitation to attend the Grammatical
Framework Summer School <http://school.grammaticalframework.org/2021//> for
free from 26 July to 6 August 2021.
The language committee has approved the creation of the Dagbani language
Wikipedia <https://phabricator.wikimedia.org/T284450>. Dagbani is one of
our focus languages, and has been the only one of the five that has been in
Incubator. Congratulations to the Dagbani community!
Next week there will be no newsletter.
The on-wiki version of this newsletter can be found here:
*Summary*: The Grammatical Framework community is inviting Wikimedians to
participate for free in the GF Summer School 2021. Participation for
Wikimedians will be sponsored by Digital Grammars.
“Grammatical Framework” (GF <https://www.grammaticalframework.org/>) is an
Open Source functional programming language and suite of tools which is
aimed at multilingual natural language generation and parsing of natural
language input. GF was first created in 1998 at Xerox Research in order to
support multilingual document authoring. GF is capable of parsing and
generating texts in several languages simultaneously while working from a
language-independent representation of meaning. GF has an active and lively
community, and offers more than 40 languages.
Here is an example of how GF works (note the syntax has been changed from a
Haskell-like syntax to a functional syntax). Given an abstract
representation such as:
mkUtt(mkS(mkCl(mkNP(aPl_Det, horse_N), mkNP(aPl_Det, animal_N))))
In order to make it a bit easier to understand, here's the same expression
with the terminology spelled out:
make Utterance (make Sentence (make Clause (make Noun Phrase (a Plural
Determiner, horse Noun), make Noun Phrase (a Plural Determiner, animal
Noun))))
Note that this structure in turn could also be abstracted away behind a
function call with a simpler structure:
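For instance, a hypothetical wrapper (the name is illustrative, not actual GF
library syntax) might read:

subclass_statement(horse_N, animal_N)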
One can linearize that abstract representation in several languages. Here
are the results as created by the cloud-based implementation of GF (which
is dated as of 2012; by now, GF has added support for dozens more languages):
- Bulgarian: коне са животни
- Chinese: 些 马 是 些 动 物
- Dutch: paarden zijn dieren
- English: horses are animals
- Spanish: caballos son animales
- Swedish: hästar är djur
Let’s make two small changes to the abstract representation: add a negative
polarity to the sentence (negativePol) and switch horse_N with tree_N, and
we get the following representation:
mkUtt(mkS(negativePol, mkCl(mkNP(aPl_Det, tree_N), mkNP(aPl_Det, animal_N))))
Just as above, this could be hidden behind a function call:
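Again with a hypothetical, purely illustrative name:

not_subclass_statement(tree_N, animal_N)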
This leads to the following linearizations:
- Bulgarian: дърва не са животни
- Chinese: 些 树 不 是 些 动 物
- Dutch: bomen zijn niet dieren
- English: trees aren't animals
- Spanish: árboles no son animales
- Swedish: träd är inte djur
While the idea for Abstract Wikipedia was being developed, GF served as an
important inspiration. It was part of AceWiki
<http://attempto.ifi.uzh.ch/acewiki/>, an extension of MediaWiki that
integrated tightly with GF and Attempto Controlled English
<https://en.wikipedia.org/wiki/Attempto_Controlled_English> (ACE) in order
to create text in several languages and also to capture the formal
semantics of the text. Whereas in AceWiki one of the main goals was to
express all sentences also in a formal logical language (in that case OWL
<https://www.w3.org/OWL/>), we are less interested in the formal semantics
of the abstract content (in fact, this is one major difference between
Abstract Wikipedia and the many predecessor projects). Other than that, you
can see how GF and AceWiki have influenced the development of Abstract
Wikipedia.
Since the announcement of Abstract Wikipedia, the GF developers and
communities have reached out to the Abstract Wikipedia developers, and we
have been discussing
<https://groups.google.com/g/gf-dev/c/A6lNwZ813b0/m/dK8G8pyLAQAJ> our plans
and ideas. In order to further the relationship between the communities and
to transfer experiences and ideas between them, we are very happy to extend
an invitation to the Abstract Wikipedia community: this year’s Grammatical
Framework Summer School will be open and free for all Wikimedians.
At this stage, it is too early to commit ourselves to using GF as the only
approach towards natural language generation in Abstract Wikipedia. There
are alternatives, and Wikifunctions will be malleable enough to support
different approaches. One example of such an alternative is HPSG
(head-driven phrase structure grammar), which will be presented in the
second week of
the summer school. But we plan to learn from the decades of work and
research into GF and the hundreds of person-years that went into its
development, and we also plan to explore whether we can reuse some of the
software or parts of the comprehensive grammar libraries that are part of
GF. In order to facilitate such reuse, it will be crucial to have more
knowledge about each other and better mutual understanding.
The GF Summer School 2021 <https://school.grammaticalframework.org/2021/> will
be held from 26 July to 6 August in Singapore, and it will be possible to
attend online. Registration will be required. *In order to register as a
Wikimedian*, please email inari(_AT_)digitalgrammars.com, stating
your Wikimedia account and your name, your country of residence, the
languages you read and write, and whether you would like to participate for
one or two weeks. This step is required so that you can avoid the
participation fee; if you sign up yourself, you will need to pay the fee. We
are very thankful to Digital Grammars for covering the fee for Wikimedians.
We are very excited about this collaboration and are looking forward to the
two communities working together and to mutually benefit from each other's
goals, experiences, and skills.
This week also saw our first office hour. We answered a lot of questions,
and you can catch up on the logs.
We plan the next office hour to be in four to six weeks, and will announce
dates also in this newsletter.
Quick reminder: our first office hour starts *in 20 minutes*. It will
be held on the Telegram channel <https://t.me/Wikifunctions> and on
Libera.Chat IRC Channel #wikipedia-abstract
<https://web.libera.chat/?channel=#wikipedia-abstract> (bridged together).
The development team will discuss what they have been working on recently,
and the community is welcome to ask questions and discuss important related
issues.
We hope to see you there! Chat logs will be available afterwards.
-- Quiddity (WMF)
The on-wiki version of this newsletter can be found here:
Apologies for missing the update last week. Times are even busier than usual.
Before we dive into today’s topic, two quick reminders: our first office
hour is on *Tuesday, 22 June 2021, at 16:00 UTC*. It will be held
on the *Telegram* channel and on Libera.Chat IRC Channel #wikipedia-abstract
On *Thursday, 24 June 2021*, we will be presenting at the Arctic Knot
conference.
Community member Mahir Morshed
<https://meta.wikimedia.org/wiki/User:Mahir256> will present on how to get
the lexicographic data ready to be used in Abstract Wikipedia at *15:00 UTC*,
and Denny <https://meta.wikimedia.org/wiki/User:Denny> will present on
Abstract Wikipedia and Wikifunctions in general at *16:00 UTC*.
Wikifunctions’s core model is centered around functions. Every function can
have several implementations. All implementations of a function should
return the same results given the same inputs.
One may ask: Why? What’s the point of having several implementations that
all do the same thing?
There are several answers to that. For example, for many functions,
different algorithms exist which could be used by different
implementations. The traditional example in computer science classes is the
sorting function: a sorting function takes two arguments, a list of
elements to be sorted (i.e. to be brought into a specific, linear order),
and a comparator operator that, given two elements, tells us which element
should go first. There are many different sorting algorithms
<https://en.wikipedia.org/wiki/Sorting_algorithm>, any of which could be
used to implement the sorting function. A particularly interesting
visualization of the different sorting algorithms can be found in the form
of traditional folk dances.
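As a concrete sketch of “several implementations, same results”, here are two
Python implementations of such a sorting function, one using insertion sort
and one delegating to Python’s built-in sorted. The names and signatures are
illustrative, not the Wikifunctions API:

    from functools import cmp_to_key

    # comparator(a, b) < 0 means a goes first, > 0 means b goes first.

    def sort_insertion(elements, comparator):
        result = []
        for element in elements:
            i = 0
            # Walk past every element that should stay in front.
            while i < len(result) and comparator(result[i], element) <= 0:
                i += 1
            result.insert(i, element)
        return result

    def sort_builtin(elements, comparator):
        return sorted(elements, key=cmp_to_key(comparator))

Both must return the same list for every input; for example,
sort_insertion([3, 1, 2], lambda a, b: a - b) and
sort_builtin([3, 1, 2], lambda a, b: a - b) both yield [1, 2, 3].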
The person calling the sorting function will often not care much about
which algorithm is being used, as long as it works, is correct, and returns
sufficiently quickly. But having different algorithms implemented allows
the service to run the different algorithms and compare their runtime
behaviors against each other. Different algorithms will often require
different amounts of memory or computation cycles. Keeping track of the
runtime behavior of the different implementations will eventually allow the
function orchestrator to predict and select the most efficient
implementation for a given input and at a given instant. When spare compute
cycles are available, it can also run some implementations with different
inputs, in order to learn more about the differing behavior of these
implementations.
One benefit of allowing for multiple implementations is that it reduces the
potential for conflicts when editing Wikifunctions. If a contributor wants
to try a different implementation, thinking it might be more efficient,
they are welcome to do so and submit their implementation to the system.
There is no need for the well-known arguments around different programming
languages and their relative merits and qualities to spill over to
Wikifunctions: everyone will be welcome to provide implementations of their
favorite algorithms in their favorite programming languages, and the system
will take care of validating, testing, and selecting the right implementation.
Another benefit of having multiple implementations is that we can test them
against each other rigorously. Sure, we will have the user-written suite of
testers for an initial correctness check (and also to start collecting
runtime metadata). But when you have several independent implementations of
a function, you can either synthetically create more inputs, or you can run
actual user-submitted function executions against different implementations
to gather more metadata about the executions. Since we have several
implementations, we can use these to cross-validate the different
implementations, compare the results from the different implementations,
and bubble up inconsistencies that arise to the community.
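A minimal sketch of such cross-validation, assuming two hypothetical
implementations impl_a and impl_b of the same function and a generator of
synthetic inputs:

    import random

    def cross_validate(impl_a, impl_b, make_input, runs=1000):
        # Run both implementations on the same inputs and collect
        # every input on which their results disagree.
        disagreements = []
        for _ in range(runs):
            args = make_input()
            if impl_a(*args) != impl_b(*args):
                disagreements.append(args)
        return disagreements

    # Synthetic inputs for a two-string function like Concatenate:
    def make_input():
        return ("".join(random.choices("abc", k=3)),
                "".join(random.choices("abc", k=3)))

Any disagreements found this way would be bubbled up to the community for review.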
Besides having implementations of different algorithms, we also expect to
have implementations in different programming languages. Implementations in
different programming languages will be useful for the same reasons that
different algorithms are useful, i.e. being able to cross-validate each
other, and to allow for the selection of the most efficient implementation
for a given function call. But they will have the additional advantage of
being able to run on different configurations of the Wikifunctions function
evaluator. What do I mean by that?
Whereas we plan to support numerous different programming languages for
adding implementations in Wikifunctions, we do not plan to actually run
evaluators for all of them in production. This is for several reasons:
the maintenance cost of keeping all these evaluators up and running and up
to date will likely be severe. The more programming languages we support,
the more likely it is that the Foundation or the community will be exposed
to bugs or security concerns in the run-times of these languages. And it is
likely that, beyond five or six programming languages, the return on
investment will greatly diminish. So what’s the point of having
implementations in programming languages that we do not plan to run in
production?
Remember that we are planning for an ecosystem around Wikifunctions where
there are many function evaluators independent of the one run by the
Wikimedia Foundation. We are hoping for evaluators to be available as apps
on your smartphone, to have evaluators available on your own machine at
home, or in your browser, or in the cloud, to have third parties embed
evaluators for certain functions within their systems, or even to have a
peer-to-peer network of evaluators exchanging resources and results. Within
these contexts, the backends may choose to support a different set of
programming languages from those supported by Wikifunctions, either because
it is favorable to their use cases, or because they are constrained to or
passionate about a specific programming language. Particularly for running
Wikifunctions functions that are embedded within the system of a third
party app, it can easily provide quite a performance boost to run these
functions in the same language as the embedding app.
Another advantage of having implementations in different programming
languages is that in case an evaluator has to be suddenly taken down, e.g.
because a security issue has been reported and not fixed yet, or because
the resource costs of that particular evaluator have developed in a
problematic way, we can take that evaluator down, and change our
configuration to run a different set of programming languages. This gives
us a lot of flexibility in how to support the operation of Wikifunctions
without disrupting people using the service.
An implementation of a function can also be given as a function
composition: instead of contributed code in a programming language, a
composition takes existing functions from Wikifunctions and nests them
together in order to implement a given function. Here’s an example: let’s
say we want to implement a function second, which returns the second letter
of a word. Assume that we already have a function first which returns the
first letter of a word, and a function tail which chops off the first
letter of a word and returns the rest, then second(w) can be implemented as
first(tail(w)), i.e. the first letter of the result after chopping off the
first letter. We will talk about function composition in more detail at a
later point.
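In Python terms (illustrative only; on Wikifunctions a composition is data on
the wiki, not code), the example reads:

    def first(word: str) -> str:
        # Return the first letter of the word.
        return word[0]

    def tail(word: str) -> str:
        # Chop off the first letter and return the rest.
        return word[1:]

    def second(word: str) -> str:
        # No logic of its own: purely a composition of existing functions.
        return first(tail(word))

    # second("wiki") returns "i"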
Composition has the advantage that we don’t require implementations of
every function in code or as a built-in, and yet we can evaluate these
functions. The backend will properly chain the function calls and pipe the
results from one function to the next until it arrives at the requested
result. This might be particularly useful for third-party evaluators who
offer a different set of programming languages, or even focus on one
specific programming language; they still might be able to use a large set
of the functions, even without local implementations.
We expect composition to usually be less performant than running code
directly. Our metadata will be able to pinpoint especially
resource-intensive function calls. We plan to surface these results on the
wiki, highlighting to the community where more efficient implementations
would have the most impact. I am hoping for a feature that, e.g., will
allow a contributor to see how many CPU cycles have been saved thanks to
their porting a function into WebAssembly.
One interesting approach to function composition could be that, if we have
code in a specific programming language for all functions participating in
a composition, it might sometimes be possible to synthesize and compile the
code for the composed function in that programming language. This might
lead to a situation where, say, two different programming languages offer
the most efficient implementation for some of the participating functions,
but the actual function call will run yet more efficiently in the newly
synthesized implementation.
And finally, there’s also caching. Any of the function calls, including
nested calls in composed functions, might be cached and re-used. This cache
would be shared across all our projects, and provide significant speed-up:
after all, it is likely that certain calculations are going to be much more
popular than others, similar to how some articles are much more popular
than others at a given time. And just as Wikipedia saves tremendous amounts
of CPU cycles by keeping pages in the cache instead of re-rendering them
every time someone wants to read them, we can reap similar benefits by
keeping a cache of function calls and their results.
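A toy sketch of such a cache in Python, keyed by function name and arguments
(the production cache would be shared across projects and persistent, unlike
this in-memory dictionary):

    cache = {}

    def cached_call(name, function, *args):
        # Re-use a previously computed result for the same call.
        key = (name, args)
        if key not in cache:
            cache[key] = function(*args)
        return cache[key]

    # Nested calls inside a composition can each hit the cache:
    # cached_call("second", second, "wiki")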
In summary: having multiple implementations for a function gives us not
only more flexibility in how to plan and run a function, and thus to
potentially save resources, but it also gives us a higher confidence in the
correctness of the system as a whole due to the cross-validation of the
different implementations and reduces the potential for conflicts when
editing.
We are very curious to see how this will play out in practice. The few
paragraphs above describe ideas that require a lot of smart development
work on the back-end, and for which we don’t really know how well each of
them will perform. We certainly expect to come across even more ideas
(maybe from you? maybe from a research lab?), and to discover that some of
the ideas sketched out here don’t work. We do not promise to implement
everything described above. The good thing is that many of those ideas are
ultimately optimizations: even simple versions of all of this should
provide correct results. But smarter approaches will likely save us huge
amounts of resources, and enable us to scale to the full potential of the
project. As with our other projects, we plan to publish our data and
metadata, and we invite external organizations, in academia and in industry
as well as hobbyists and independent developers and researchers, to help us
tackle these interesting and difficult challenges.
Again, a reminder: our first office hour is on *Tuesday, June 22, 2021,
at 16:00 UTC* on the Telegram channel and IRC Channel #wikipedia-abstract
The on-wiki version of this newsletter is available here:
We are planning for our first office hour! The Wikifunctions and Abstract
Wikipedia office hours will be online events where the development team
presents what they have been working on recently, and the community is
welcome to ask questions and discuss important related issues. They will be
announced on the mailing list and in the newsletter, and are planned to
take place every four to six weeks.
Our first office hour will be at 16:00 UTC on June 22, 2021, and will be in
the Telegram channel and IRC Channel #wikipedia-abstract
<https://web.libera.chat/?channel=#wikipedia-abstract> (bridged together).
Shani Evenstein Sigalov <https://meta.wikimedia.org/wiki/User:Esh77> is
teaching a course “From Web 2.0 to Web 3.0, from Wikipedia to Wikidata” at Tel
Aviv University <https://en.wikipedia.org/wiki/Tel_Aviv_University>. Shani
prepared a video with Denny Vrandečić
<https://meta.wikimedia.org/wiki/User:Denny> where they discuss Abstract
Wikipedia and Wikifunctions. The video is available on YouTube
In the interview, Shani and Denny discussed some of the challenges in
Wikipedia and Wikidata, and how they brought about the idea of Abstract
Wikipedia; what the differences are between "Abstract Wikipedia",
"Wikifunctions", and "WikiLambda"; what the current state of the project is;
and how it all ties into the current Internet ecosystem and things like AI
and machine learning.
Next Monday, June 7th, at 15:00 Israel time (12:00 UTC), Shani will be
hosting Denny in her course for a 45-minute Q & A session with her students
via Zoom. This part of the class will be open to anyone interested in this
topic, and you are welcome to either join them live and ask questions to
Denny (after watching the pre-recorded interview), or watch it all later.
If you are interested in joining, please write Shani an email (shani dot
even at gmail dot com) with the title "Joining the Q & A session with
Denny", by Sunday June 6th at 20:00 UTC, and she will send you the Zoom
link. This is to avoid Zoombombing.
If you are unable to participate live, but still want to engage, feel free
to send in questions via email or leave them on Shani’s Facebook post
The session will also be recorded, so if you cannot make it to the live
session, you will be able to watch it later on YouTube.
Lucas Werkmeister <https://meta.wikimedia.org/wiki/User:Lucas_Werkmeister>,
our esteemed colleague at Wikimedia Deutschland, who runs the Notwikilambda
site <https://notwikilambda.toolforge.org/> as a volunteer, has set up the
function evaluator and function orchestrator on Notwikilambda. He also
created instructive videos of him doing so on Twitch:
- Lucas setting up the function orchestrator
- Lucas setting up the function evaluator
These videos get automatically deleted after two weeks, i.e. about one week
after this newsletter is published.
The function orchestrator is the service that receives a function call,
validates it, brings all necessary information together, calls the function
evaluator as needed, and will eventually resolve function compositions. The
function evaluator, on the other hand, takes code provided by Wikifunctions
contributors and runs it in order to produce results for the orchestrator.
Both of them are now set up on Notwikilambda, thanks to Lucas' work. Thank
you, Lucas!
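As a purely illustrative Python sketch of that division of labor (not the
actual service code; the real orchestrator and evaluator are separate
services):

    def evaluate_python(code, args):
        # The evaluator runs contributed code and returns the result.
        # The real evaluator sandboxes this; exec here is only a toy.
        namespace = {}
        exec(code, namespace)
        return namespace["run"](*args)

    EVALUATORS = {"python": evaluate_python}

    def orchestrate(language, code, args):
        # The orchestrator validates the call, gathers what it needs,
        # and dispatches to the evaluator for the right language.
        return EVALUATORS[language](code, args)

    # orchestrate("python", "def run(a, b):\n    return a + b", ("Wiki", "functions"))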
Boris Shalumov <https://2020-eu.semantics.cc/users/boris-shalumov>, host of
the podcast “Chaos Orchestra” on the topic of Knowledge Graphs, has also
interviewed Denny (among other guests, such as Sören Auer, a co-founder of
DBpedia, Jans Aasman, CEO of Franz Inc, or Daniel Schwabe, Professor at the
Catholic University of Rio de Janeiro, and a few others), and they discuss
Wikidata, Wikifunctions, and many other topics.
The podcast episode is available on YouTube
and Google Podcasts.
We'll also participate in the Arctic Knot conference this year. Community
member Mahir Morshed
<https://meta.wikimedia.org/wiki/User:Mahir256> will present on how to get
the lexicographic data ready to be used in Abstract Wikipedia, and Denny
will present on Abstract Wikipedia and Wikifunctions in general.
The conference will be free, fully online, and registration is open
<https://meta.wikimedia.org/wiki/Arctic_Knot_Conference_2021> right now.
On the development side, this week saw the start of the performance and
security reviews, to make sure the architecture and implementation are on
solid ground to take us to the launch later this year. Both reviews are
scheduled to run for two weeks. We are cooperating with the respective teams
at the Wikimedia Foundation and are thankful for their support regarding
these critical aspects of the new project.