Hoi,
I fail to understand. You have the data in the prescribed manner for an
article, but the original is based on English. How can you generate a text
in Dutch, or any other language, from that data when you have the Senses
but not the meanings of the words?
Thanks,
GerardM
On Thu, 6 May 2021 at 23:38, Denny Vrandečić <dvrandecic(a)wikimedia.org>
wrote:
The on-wiki version of this newsletter can be
found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-05-06
In 2018, Wikidata launched a project to collect lexicographical
knowledge <https://www.wikidata.org/wiki/Wikidata:Lexicographical_data>.
Several hundred thousand Lexemes have been created since then, and this
year the tools will be further developed by Wikimedia Deutschland to make
the creation and maintenance of the lexicographic knowledge in Wikidata
easier.
The lexicographic extension to Wikidata was developed with the goal that
became Abstract Wikipedia in mind, but a recent discussion within the
community showed me that I have not made the possible connection between
these two parts clear yet. Today, I would like to sketch out a few ideas on
how Abstract Wikipedia and the lexicographic data in Wikidata could work
together.
There are two principal ways to organize a dictionary: either you
organize the entries by ‘lexemes’ or ‘words’ and describe their senses
(this is called the semasiological
<https://en.wikipedia.org/wiki/Semasiology> approach), or you organize
the entries by their ‘senses’ or ‘meanings’ (this is called the
onomasiological <https://en.wikipedia.org/wiki/Onomasiology> approach).
Wikidata has intentionally chosen the semasiological approach: the entries
in Wikidata are called Lexemes, and contributors can add Senses and Forms
to the Lexemes. Senses stand for the different meanings that a Lexeme may
regularly invoke, and the Forms are the different ways the Lexeme may be
expressed in a natural language text, e.g. in order to be in agreement with
the right grammatical number, case, tense, etc. The Lexeme “mouse” (
L1119 <https://www.wikidata.org/wiki/Lexeme:L1119>) thus has two
senses, one for the small rodent, one for the computer input device, and
two forms, “mouse” and “mice”. For an example of a multilingual
onomasiological collaborative dictionary, one can take a look at the
OmegaWiki <http://www.omegawiki.org/> project, which is primarily
organized around (currently 51,000+) Defined Meanings
<http://www.omegawiki.org/Help:DefinedMeaning> and how these are
expressed in different languages.
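
To make the semasiological layout concrete, here is a minimal sketch of a Lexeme that carries its own Senses and Forms. The class names, field names, and the Sense and Form identifiers are assumptions chosen for illustration; this is not the actual Wikidata data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    sense_id: str  # hypothetical identifier, for illustration only
    gloss: str


@dataclass
class Form:
    form_id: str  # hypothetical identifier, for illustration only
    representation: str
    grammatical_features: tuple = ()


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    language: str
    senses: list = field(default_factory=list)
    forms: list = field(default_factory=list)

    def form_for(self, *features):
        """Return the first form carrying all requested features, else None."""
        wanted = set(features)
        for form in self.forms:
            if wanted <= set(form.grammatical_features):
                return form.representation
        return None


# The "mouse" example from the text: one Lexeme, two Senses, two Forms.
mouse = Lexeme(
    lexeme_id="L1119",
    lemma="mouse",
    language="en",
    senses=[Sense("S1", "small rodent"), Sense("S2", "computer input device")],
    forms=[Form("F1", "mouse", ("singular",)), Form("F2", "mice", ("plural",))],
)
```

The entry point is the word, and the meanings hang off it. An onomasiological layout would invert this and start from the meaning.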
The reason why Wikidata chose the semasiological approach is based on
the observation that it is much simpler for a crowd-sourced collaborative
project, and has much less potential to be contentious. It is much easier
to gather a list of words used in a corpus than to gather a list of all the
meanings referred to in the same corpus. And although it is 'simpler', it
is still not trivial. We still want to collect a list of Senses for each
Lexeme, and we want to describe the connections between these Senses:
whether two Lexemes in a language have the same Sense, how the Senses
relate to the large catalog of items in Wikidata, and how Senses of
different languages relate to each other. These are all very difficult
questions that the Wikidata community is still grappling with (see also the
essay on Making Sense
<https://www.wikidata.org/wiki/Wikidata:Making_sense>).
Let’s look at an example.
“Stubbs was probably one of the youngest mayors in the history of the
world. He became mayor of Talkeetna, Alaska, at the age of three months and
six days, and retained that position until his death almost four years ago.
Also, Stubbs <https://en.wikipedia.org/wiki/Stubbs_(cat)> was a cat.”
If we want to express that last sentence - “Stubbs was a cat” - we will
have to be able to express the meaning “cat” (here, we will focus
entirely on the lexical level, and will not discuss grammatical and
idiomatic issues; we will leave those for another day). How do we refer to
the idea for cat in the abstract content? How do we end up, in English,
eventually with the word form “cat” (L7-F4
<https://www.wikidata.org/wiki/Lexeme:L7#F4>)? In French with the word
form “chat” (L511-F4 <https://www.wikidata.org/wiki/Lexeme:L511#F4>)?
And in German with the form “Kater” (L303326-F1
<https://www.wikidata.org/wiki/Lexeme:L303326#F1>)?
Note that these three words do not have quite the same meaning. The
English word “cat” refers to male and female cats equally. The French word
“chat” is grammatically masculine, but it can also refer to a cat
generically, for example if we did not know Stubbs’ gender; a female cat
would usually be referred to using the word “chatte”. The German word
“Kater”, on the other hand, can only refer to a male cat. If we did not
know whether Stubbs is male or female, we would need to use the word
“Katze” in German instead, whereas in French, as said, we could still use
“chat”.
And English also has words for male cats, e.g. “tom” or “tomcat”, but
these are much less frequently used. Searching the Web for “Stubbs is a
cat” returns more than 10,000 hits, but not a single one for “Stubbs is
a tom” nor “Stubbs is a tomcat”.
In comparison, for Félicette
<https://en.wikipedia.org/wiki/F%C3%A9licette>, the first and so far
only cat in space, the articles indeed use the words “chatte” in French
and “Katze” in German.
Here we are talking about three rather closely related languages and a
rather simple noun. This should have been a very simple
case, and yet it is not. When we talk about verbs, adjectives, or nouns
about more complex concepts (for example different kinds of human
settlements or the different ways human body parts are conceptualized in
different languages, e.g. arms and hands <https://wals.info/chapter/129>,
terms for colors), it gets much more complicated very quickly. If we were
to require that all words we want to use in Abstract Wikipedia first must
align their meanings, then that would put a very difficult task in our
critical path. So whereas it would indeed have been helpful to Abstract
Wikipedia to have followed an onomasiological approach (how wonderful would
it be to have a comprehensive catalog of meanings!), that approach was
deemed too difficult and a semasiological approach was chosen instead.
Fortunately, a catalog of meanings is not necessary. We can avoid it
because Abstract Wikipedia only needs to generate text, not parse or
understand it. This allows us to get by using a
Constructor that, for each language, uses a Renderer to select the correct
word (or other lexical representation). For example, we could have a
Constructor that may take several optional further pieces of information:
the kind of animal, the breed, the color, whether it is an adult, whether
it is neutered, the gender, the number of them, etc. For each of these
pieces of information, we could mark whether that information must be
expressed in the Rendering, or whether this information is optional and can
be ignored, and thus what is available for those Renderers to choose the
most appropriate word. Note, this is not telling the community how to do
it, merely sketching out one possible approach that would avoid relying on
a catalog of meanings.
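
A minimal sketch of such a Constructor might look as follows. The class name AnimalConstructor, its fields, and the feature names are hypothetical, chosen only to illustrate the idea of marking which pieces of information must be expressed and which are optional.

```python
from dataclasses import dataclass, field


@dataclass
class AnimalConstructor:
    # Feature name -> value, e.g. {"kind": "cat", "gender": "male"}.
    features: dict = field(default_factory=dict)
    # Names of the features that a Rendering must express.
    must_express: frozenset = frozenset()

    def missing_required(self):
        """Required features that have not been filled in yet."""
        return sorted(self.must_express - set(self.features))

    def optional_features(self):
        """Features a Renderer may use or ignore as its language prefers."""
        return {
            name: value
            for name, value in self.features.items()
            if name not in self.must_express
        }


# Abstract content for "Stubbs was a cat": only the kind must be expressed.
stubbs = AnimalConstructor(
    features={"kind": "cat", "gender": "male", "adult": "yes"},
    must_express=frozenset({"kind"}),
)
```

Nothing here refers to a shared meaning of “cat” across languages. The Constructor only records facts about the animal, and each language decides later which of them it needs.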
Each language Renderer could then use the information it needs to select
the right word. If a language has a preference to express the gender (such
as German) it can do so, whereas a language that prefers not to (such as
English) can leave it out. If for a language the age of the cat matters for the
selection of the word, it can look it up. If the color of the animal
matters (as it does for horses in German
<https://de.wikipedia.org/wiki/Fellfarben_der_Pferde#Die_einzelnen_Fellfarben>),
the respective Renderer can use the information. If required information
is missing, we could add this to a maintenance queue so that contributors
can fill it out. If a language should happen not to have a word, a
different noun phrase can be chosen, e.g. a less specific word such as
“animal” or “pet”, or a phrase such as “male kitten”, or “black horse”
for the German word “Rappen”.
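
The per-language selection can be sketched as below. This is only an illustration under stated assumptions: the function names, the plain dictionary of features, and the shape of the maintenance queue are all hypothetical, and the word choices simply follow the cat example above.

```python
# (language, missing feature) pairs that contributors could later fill in.
maintenance_queue = []


def _kind(features, language):
    """Return the kind of animal; queue a maintenance entry if it is missing."""
    kind = features.get("kind")
    if kind is None:
        maintenance_queue.append((language, "kind"))
    return kind


def render_english(features):
    # English usually does not express the gender of a cat.
    if _kind(features, "en") == "cat":
        return "cat"
    return "animal"  # fall back to a less specific word


def render_french(features):
    # "chat" also serves as the generic word when the gender is unknown.
    if _kind(features, "fr") == "cat":
        return "chatte" if features.get("gender") == "female" else "chat"
    return "animal"


def render_german(features):
    # "Kater" only for a known male cat; "Katze" otherwise.
    if _kind(features, "de") == "cat":
        return "Kater" if features.get("gender") == "male" else "Katze"
    return "Tier"


RENDERERS = {"en": render_english, "fr": render_french, "de": render_german}


def render_all(features):
    """Let every language pick its own word from the same features."""
    return {language: render(features) for language, render in RENDERERS.items()}
```

Calling render_all({"kind": "cat", "gender": "male"}) yields “cat”, “chat” and “Kater”, while the same call with "female" yields “cat”, “chatte” and “Katze”. No Renderer needs to know what the others chose.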
But the important design feature here is that we do not need to ensure
and agree on the alignment of meanings of words across different languages.
We do not need a catalog of meanings to achieve what we want.
Now, there are plenty of other use cases for having such a catalog of
meanings. It would be a tremendously valuable resource. And even without
such a catalog, the statements connecting Senses and Items in Wikidata can
be very helpful for the creation and maintenance of Renderers, but these do
not need to be used when the natural text for Wikipedia is created.
This suggestion is not meant to be prescriptive, as said. It will be up
to the community to decide on how to implement the Renderers and what
information to use. In this, I am sketching out an architecture that allows
us to avoid blocking on the availability of a (valuable but very difficult
to create) resource, a comprehensive catalog of meanings aligning words
across many different languages.
_______________________________________________
Abstract-Wikipedia mailing list --
abstract-wikipedia(a)lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikime…