It is very hard to build a large, or even medium-sized, corpus of sentences
in which each word is manually annotated with its sense.
Abstract Wikipedia not only allows generating text in many languages from one
source, it could also serve as a word-sense disambiguation (WSD) corpus,
and moreover in many languages.
This would allow understanding natural text and operations like:
1) translation from any natural language into the disambiguated form
2) translation from this form into another natural language
and after step 1 this form would be very useful for much more than translation.
I was interested in the Abstract Wikipedia project a year ago; now I'm not
up to date on the topic.
At the Arctic Knot conference, will the project be looked at as a database of
disambiguated knowledge?
The on-wiki version of this newsletter is available here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-07-29
--
Our goal with Abstract Wikipedia is to enable everyone to write content in
any language that can be read in any language. Ultimately, the main form of
content we aim for are Wikipedia articles, in order to allow everyone to
equitably have and contribute to unbiased, up-to-date, comprehensive
encyclopedic knowledge.
In the coming months, we will reach major milestones on the way towards that goal.
Today, I want to sketch one possible milestone on our way: abstract
descriptions for Wikidata.
Every Item <https://www.wikidata.org/wiki/Help:Items> in Wikidata has a
label <https://www.wikidata.org/wiki/Help:Label>, a short description
<https://www.wikidata.org/wiki/Help:Description>, and aliases
<https://www.wikidata.org/wiki/Help:Aliases> in each language. Let’s say
you take a look at Item Q836805 <https://www.wikidata.org/wiki/Q836805>. In
English, that Item has the label *“Chalmers University of Technology”* and
the description *“university in Gothenburg, Sweden”*. In Swedish it is
*“Chalmers
tekniska högskola”* and *“universitet i Göteborg, Sverige”*. The goal of
the label is to be a common name for the Item, and together with the
description it should uniquely identify the Item in the world. That’s why,
although multiple Items can have the same label, as things in the world can
be called the same but be different, no two Items should have both the same
label and the same description in a given language. The aliases are used to
help with improving the search experience.
The meaning of the descriptions across languages is often the same, and
when it is not, although sometimes intentional, it usually differs by
accident. Given there are more than 94 million Items in Wikidata, and
Wikidata supports more than 430 languages, that would mean that if we had
perfect coverage, we would have more than 40 billion labels and as many
descriptions. And not only would the creation of all these labels and
descriptions be a huge amount of work, they would also need to be
maintained. If there are not enough contributors checking on the quality of
these, it would be unfortunately easy to sneak in vandalism.
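As a sanity check on those numbers, the arithmetic is simple (using the approximate counts quoted above):

```python
# Back-of-the-envelope scale of full coverage, using the
# approximate counts quoted above.
items = 94_000_000       # more than 94 million Wikidata Items
languages = 430          # more than 430 supported languages

labels_needed = items * languages
print(f"{labels_needed:,}")  # 40,420,000,000 -> "more than 40 billion"
```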
The Wikidata community has known about this issue for a long time, and made
great efforts to correct it. Tools such as AutoDesc
<https://autodesc.toolforge.org/> by Magnus Manske
<https://meta.wikimedia.org/wiki/User:Magnus_Manske> and bots such as
Edoderoobot <https://www.wikidata.org/wiki/User:Edoderoobot>, Mr.Ibrahembot
<https://www.wikidata.org/wiki/User:Mr.Ibrahembot>, MatSuBot
<https://www.wikidata.org/wiki/User:MatSuBot> (these were selected by
clicking “Random Item” and looking at the history) and many others have
worked on increasing the coverage. And it shows: these bots often target
descriptions, and so, even though only six languages have *labels* for more
than 10% of Wikidata Items, a whopping 64 languages have a coverage over
10% for *descriptions*! Today, we have well over two billion descriptions
in Wikidata.
These bots create descriptions, usually based on the existing statements of
the Item. And that is great. But there is no easy way to fix an error
across languages, nor is there an easy way to ensure that no vandalism has
snuck in. Also, bots give an oversized responsibility to a comparatively small
group of bot operators. Our goal is to democratize that responsibility
again and allow more people to contribute.
Descriptions in Wikidata are usually noun phrases, and generating noun phrases
is something we will need to be able to do for Abstract Wikipedia anyway. We want to
start thinking about how to implement this feature, and then derive from
there what will need to happen in Wikifunctions and in Wikidata. This work
will need to happen in close coöperation with the Wikidata team, and the
communities of both Wikidata and Wikifunctions. It will represent a way to
ramp up our capabilities towards the wider vision of Abstract Wikipedia.
Timewise, we hope to achieve that in 2022.
We don’t know yet how exactly this will work. Here are a few thoughts, but
this is really an invitation for all of us to work together on the design for
abstract descriptions:
- It must be possible to overwrite a description for a given language
- It must be possible to retract a local overwrite for a given language
- The pair of label and description still must remain unique
- It would be great if implementing this would not be a large effort
- The goal is not to create automatic descriptions
<https://www.wikidata.org/wiki/Wikidata:Automating_descriptions>, but
abstract descriptions
The last point is subtle: an automatic description is a description
generated automatically from the given statements of an Item. That’s a
valuable and very difficult task. The above mentioned AutoDesc for example,
starts the English description for Douglas Adams
<https://autodesc.toolforge.org/?q=Q42&lang=en&mode=short&links=text&redlink…>
as
follows: *“British playwright, screenwriter, novelist, children's writer,
science fiction writer, comedian, and writer (1952–2001) ♂; member of
Footlights and Groucho Club; child of Christopher Douglas Adams and Janet
Adams; spouse of Jane Belson”*. The Item <https://www.wikidata.org/wiki/Q42>'s
current manual English description is the much more succinct *“English
writer and humorist”*. There can be many subtle decisions and editorial
judgements to be made in order to create the description for a given Item,
and I think we should be working on this — but later.
Instead, we want to support abstract descriptions: a description, manually
created, but instead of being written in a specific natural language, it is
encoded in the abstract notation of Wikifunctions and then we use the
renderers to generate the natural languages text. This allows the community
to retain direct control over the content of a description.
Here are a few ideas to kick off the conversation:
- We introduce a new language code, qqz. That code is in the range
reserved for local use, and is similar to the other dummy language codes
<https://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes> in
MediaWiki, qqq and qqx. Wikidata is to support the qqz language code for
descriptions.
- The content of the qqz description is abstract content. Technically
we could store it in some string notation such as “Z12367(Q3918
<https://www.wikidata.org/wiki/Q3918>, Q25287
<https://www.wikidata.org/wiki/Q25287>, Q34
<https://www.wikidata.org/wiki/Q34>)”. Or we could store the JSON
ZObject.
- The abstract description would be edited using the same Vue components
we develop for Wikifunctions for editing abstract content.
- The abstract description is a fallback for languages without a
description. It can be overwritten by providing a description in that
language.
- Every time the renderer function or the underlying lexicographic data
changes, we also need to retrigger the relevant generations.
- One question is whether we should store the generated description in
the Item, and if so, how to change the data model in order to mark the
description as generated from the abstract description.
- We also need to figure out how to report changes to everyone who is
interested in tracking them. If we store the generated description as
proposed above, we can piggyback on the current system.
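A minimal sketch of the fallback behaviour described in the list above. The item structure, the `render` callback, and the ZID/QIDs are all invented for illustration; real abstract content would be a ZObject handled by Wikifunctions renderers:

```python
# Sketch: a manual description for a language overrides; otherwise
# the abstract (qqz) content is rendered; otherwise there is none.
# All identifiers here are illustrative, not real Wikidata data.

def get_description(descriptions, lang, render):
    if lang in descriptions:            # local overwrite wins
        return descriptions[lang]
    abstract = descriptions.get("qqz")  # abstract fallback
    if abstract is not None:
        return render(abstract, lang)   # would be marked as generated
    return None

# Stand-in for a Wikifunctions renderer:
render = lambda abstract, lang: f"[{lang} rendering of {abstract}]"

item = {"en": "university in Gothenburg, Sweden",
        "qqz": "Z12367(Q3918, Q25287, Q34)"}
print(get_description(item, "en", render))  # manual English description
print(get_description(item, "hr", render))  # generated from qqz
```

In this sketch, retracting a local overwrite simply means deleting that language's entry, after which the abstract fallback applies again.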
All of these are just ideas for discussion. Some of the major questions are
whether to store all the generated descriptions in the Item or not, how to
represent that in the edit history of the Item, how to design the caching
and retriggering of the generated descriptions, etc.
What would that look like?
Let’s take a look at an oversimplified example. The description for
Chalmers is *“university in Gothenburg, Sweden”*. That seems like a
reasonably simple case that could easily be templated into abstract content
say of the form “Z12367(Q3918 <https://www.wikidata.org/wiki/Q3918>, Q25287
<https://www.wikidata.org/wiki/Q25287>, Q34
<https://www.wikidata.org/wiki/Q34>)”, where Z12367 (that ZID is made-up)
represents the abstract content saying in English *“(institution) in
(city), (country)”*, Q3918 <https://www.wikidata.org/wiki/Q3918> the QID
for university, Q25287 <https://www.wikidata.org/wiki/Q25287> the QID for
Gothenburg, and Q34 <https://www.wikidata.org/wiki/Q34> the QID for Sweden.
(In reality, this template is nowhere near as simple as it looks - we will
discuss this more in an upcoming weekly newsletter. For now, let's assume it
really is this simple.)
Renderers would then take this abstract content and for each language
generate the description, in this case *“university in Gothenburg, Sweden”* for
English, or *“sveučilište u Göteborgu u Švedskoj”* in Croatian. Since there
is already an English description, we wouldn’t store nor actually generate
the text, but in Croatian we would generate it, store it, and mark it as a
generated description.
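A toy version of such a renderer for the made-up Z12367 function. The labels and templates here are hardcoded for illustration; a real renderer would pull labels from Wikidata and use lexicographic data for inflection (note that the Croatian "labels" below are pre-inflected into the locative case, a shortcut a real renderer would not take):

```python
# Toy renderer for the made-up Z12367: "(institution) in (city), (country)".
LABELS = {
    "en": {"Q3918": "university", "Q25287": "Gothenburg", "Q34": "Sweden"},
    # Croatian forms are pre-inflected here; a real renderer would
    # inflect nominative labels using lexeme data.
    "hr": {"Q3918": "sveučilište", "Q25287": "Göteborgu", "Q34": "Švedskoj"},
}
TEMPLATES = {
    "en": "{0} in {1}, {2}",
    "hr": "{0} u {1} u {2}",
}

def render_z12367(args, lang):
    labels = [LABELS[lang][qid] for qid in args]
    return TEMPLATES[lang].format(*labels)

print(render_z12367(["Q3918", "Q25287", "Q34"], "en"))
# university in Gothenburg, Sweden
print(render_z12367(["Q3918", "Q25287", "Q34"], "hr"))
# sveučilište u Göteborgu u Švedskoj
```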
We think of this as a good milestone on our path to Abstract Wikipedia,
with a directly useful outcome. What are your thoughts? Join us in
discussing this idea on the following talk page:
https://meta.wikimedia.org/wiki/Talk:Abstract_Wikipedia/Updates/2021-07-29
------------------------------
In other news, Lindsay has created a video of a new feature: how Testers
and Implementations work together to show whether the tests pass. The video
is available here:
https://commons.wikimedia.org/wiki/File:Wikilambda_Testers_on_Code_based_Im…
The video shows how she is changing the implementation and re-running the
testers several times. Testers will be a main component in ensuring the
quality of Wikifunctions.
The next opportunity to meet us and ask us questions will be at Wikimania.
On 14 August, at 17:00 UTC, we will host a 1.5 hour session on
Wikifunctions and Abstract Wikipedia. This year, Wikimania will be an
entirely virtual event and registration is free. Bring your questions and
discussions to Wikimania 2021.
Next week, we are skipping the weekly update.
The on-wiki version of this update is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-07-22
----
In the last few weeks, the Wikifunctions prototype has passed a few
critical milestones. We have massively improved the testability of our
codebase and increased the robustness of our tests. There’s still plenty to
do, but, considering the development ahead, it is reassuring to see the
code becoming more robust.
Another step is that the first parts of evaluating function composition are
now working. We can neatly compose any combination of built-ins,
code-based implementations, and other compositions.
I found myself having quite a bit of fun working with the prototype. Last
week, in order to capture some of the possibilities, I made a video where I
set up a new Wikilambda instance and defined a few functions for Boolean
algebra. Booleans are one of the types that come pre-loaded with a
Wikilambda instance. The main reason why they come as a pre-loaded type is
because they are necessary for the builtin If function, and the If function
is extremely useful.
In the demonstration video, I defined the Negate function, which takes one
of the two Boolean values (i.e. either True or False) and returns the
other. Then I implemented the Negate function using the If function: If
true then false else true. I followed this by implementing a few other
Boolean functions with two parameters, such as the And function
(conjunction), the Or function (disjunction), the Nand function, and the
Exclusive or function. Some of the functions are implemented using solely
the built-in If function; others combine previously composed functions
together (such as Nand, implemented as Not And).
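In Python terms, the compositions from the video can be sketched like this (here `If` stands in for the built-in function; everything else is composed from it or from earlier compositions):

```python
# Sketch of the Boolean compositions shown in the demo video.
# Only If is treated as "built in"; the rest are compositions.

def If(cond, then_val, else_val):
    return then_val if cond else else_val

def Negate(x):
    # "If x then False else True"
    return If(x, False, True)

def And(x, y):
    # conjunction: if x, the result is y; otherwise False
    return If(x, y, False)

def Or(x, y):
    # disjunction: if x, the result is True; otherwise y
    return If(x, True, y)

def Nand(x, y):
    # composed from earlier compositions: Not And
    return Negate(And(x, y))

def Xor(x, y):
    # exclusive or: if x, the result is Not y; otherwise y
    return If(x, Negate(y), y)
```

Like in the video, `Nand` is itself built from two previously composed functions, mirroring the "Not And" composition mentioned above.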
The video also shows how to call these newly-created functions and see that
they work. You will notice a number of bugs in the video. Most of them are
already filed and being worked on; some of them have even been solved
already. A number of the workflows that you see have already been improved,
such as creating an implementation directly from a newly defined function,
etc. Also, please remember that the UX is still intentionally rough, and we
will give it a complete overhaul before we launch.
The video runs for 24 minutes and is available on Commons:
https://commons.wikimedia.org/wiki/File:Boolean_Algebra_with_very_early_Wik…
Thanks so much to the team for getting the prototype so far! I am very
proud, and looking forward to what comes next.
----
We are hiring! We are looking for an Engineering Manager:
https://boards.greenhouse.io/wikimedia/jobs/3270135 Our hires can be based
remotely.
The next opportunity to meet us and ask us questions will be at Wikimania.
On 14 August, at 17:00 UTC, we will host a 1.5 hour session on
Wikifunctions and Abstract Wikipedia. This year, Wikimania will be an
entirely virtual event and registration is free. Bring your questions and
discussions to Wikimania 2021:
https://www.eventbrite.com/e/wikimania-2021-tickets-161884957265
And a reminder that all Wikimedians are invited to attend the Grammatical
Framework Summer School from 26 July to 6 August 2021 for free. The link
explains how to register and gives more background:
https://meta.wikimedia.org/wiki/Special:MyLanguage/Abstract_Wikipedia/Updat…
The on-wiki version of this newsletter is available here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-07-16
--
It is our pleasure to announce that Max Binder
<https://meta.wikimedia.org/wiki/User:MBinder_(WMF)> will join the Abstract
Wikipedia team part-time for a while in order to provide us help and
support with our processes and tools. Max is a Senior Team Effectiveness
Coach and joined the Wikimedia Foundation in 2015. I want to let Max
introduce himself with his own words:
Hello! :)
My name is Max (it’s not short for anything, which I have found is
uncommon). I use he/him pronouns. I am excited to join this team from
the Technical
Program Management
<https://www.mediawiki.org/wiki/Wikimedia_Product/Technical_Program_Manageme…>
team,
and support healthy team practices. Here are some links to things I’ve
written previously about who I am and how I approach my work:
Meta page: User:MBinder (WMF)
<https://meta.wikimedia.org/wiki/User:MBinder_(WMF)>
Approach and style: Team Effectiveness Coach Approach and Style
<https://www.mediawiki.org/wiki/Wikimedia_Product/Technical_Program_Manageme…>
I will be with this team as long as it takes to codify needs and norms
thereof, and eventually onboard a Technical Program Manager for ongoing
support thereafter.
Picking a favorite Wikipedia page is like picking a favorite child, but
I’ve always enjoyed: List of helicopter prison escapes
<https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes> and Toilet
paper orientation <https://en.wikipedia.org/wiki/Toilet_paper_orientation>
One goal for Max is to help us get our processes ready to scale for more
new members.
Speaking of which: we are hiring! We are currently hiring for two Software
Engineers
<https://boards.greenhouse.io/wikimedia/jobs/3298646?gh_src=03df28cb1us>
and an Engineering Manager
<https://boards.greenhouse.io/wikimedia/jobs/3270135> for Abstract
Wikipedia. The positions can be remote and can be outside the United
States. If you are interested, or know someone who might be, or a good
community to spread the word about the positions, please share the link.
We also want to say thanks to Carolyn Li-Madeo and Simone Cuomo, who have
been working with us for the last few months. Carolyn was helping to
kick-off the design work within Abstract Wikipedia with Aishwarya, and is
now stepping away from the day to day work on Abstract Wikipedia in order
to focus more on her primary work within the Foundation. Simone has worked
on a number of tasks on the front end, for example making the front-end
more testable and modular, and is now ramping up to provide support for the
Structured Data team. Our deepest gratitude to both of them for their
contributions to the project and the support they provide to the
organization and user experiences more generally.
We also got word that we were selected for a session at Wikimania 2021
<https://meta.wikimedia.org/wiki/Wikimania_2021>. Wikimania will be held
from August 13 to 17, 2021. On August 14, at 15:00 UTC, we will host a two
hour session on Wikifunctions and Abstract Wikipedia. This year, Wikimania
will be an entirely virtual event and registration is free
<https://www.eventbrite.com/e/wikimania-2021-tickets-161884957265>. Bring
your questions and discussions to Wikimania 2021!
Another reminder that there is an invitation to attend for free
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-06-24> the
Grammatical Framework Summer School from 26 July to 6 August 2021.
In Wikidata there is a very large number of Qnnnn entities and a much smaller
set of Lnnnn lexemes.
One lexeme can have many senses, and many different words can be synonyms
(problem: sometimes the meanings are very close but not the same).
For example, in the multilingual WordNet, the Polish word "kot" maps to
English as follows:
02121808-n
kot, kot domowy
domestic cat, house cat, Felis domesticus, Felis catus
any domesticated member of the genus Felis
10149241-n
kot
grunt
an unskilled or low-ranking soldier or other worker
10508379-n
kot
raw recruit
an inexperienced and untrained recruit
02121620-n (18)
kot
cat, true cat
feline mammal usually having thick soft fur and no ability to roar:
domestic cats; wildcats
10641301-n
kot
sprog
a new military recruit
Three of them (10149241-n, 10508379-n and 10641301-n) are English synonyms
with very close meanings but different lexemes. In Polish there is a single
lexeme, "kot"; Polish should not distinguish these three senses, and should
instead have one general definition.
Problem: in Abstract Wikipedia the source text should be language-independent
and sense-centered. Do we need a common sense for three different English
lexemes?
Senses are distinguished to different degrees in different languages;
consider, for example, "snow" in African languages versus Siberian languages.
- How will Abstract Wikipedia deal with senses?
- How can senses be viewed in Wikidata?
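One way to make the problem concrete is to group lexemes by sense, as in this sketch using the synset IDs from the WordNet example above (the data structure itself is invented for illustration):

```python
# A subset of the WordNet data above, grouped by sense: one sense maps
# to different lexemes per language. Several English lexemes share
# (nearly) the same sense, while Polish has only the one lexeme "kot".

SENSES = {
    "02121808-n": {"pl": ["kot", "kot domowy"],
                   "en": ["domestic cat", "house cat", "Felis catus"]},
    "10149241-n": {"pl": ["kot"], "en": ["grunt"]},
    "10508379-n": {"pl": ["kot"], "en": ["raw recruit"]},
    "10641301-n": {"pl": ["kot"], "en": ["sprog"]},
}

def senses_of(word, lang):
    """All sense IDs that the given lexeme can express in `lang`."""
    return sorted(s for s, by_lang in SENSES.items()
                  if word in by_lang.get(lang, []))

# Polish "kot" is ambiguous over four senses here, while English
# splits the "recruit" senses across three different lexemes.
print(senses_of("kot", "pl"))    # four sense IDs
print(senses_of("grunt", "en"))  # ['10149241-n']
```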