[Abstract-wikipedia] Re: Newsletter #30: The missing link from Abstract Wikipedia to Lexicographic data in Wikidata

17 May 2021

      Denny,
I really appreciate your time in responding. I understand the magnitude of
the problem and the technical challenges your team is addressing. So, I
thank you for your well-reasoned and detailed response.
I would like to clarify the base capability I see in Wikipragmatica, as
well as the user community's work stream in support of its curation. My
concern is the path the team has chosen is a dead end beyond the limited
use cases of Wikipedia day zero + a few years. An ecosystem of free
knowledge certainly seems to lead outside the confines of today's wiki
markup world for data and information acquisition. At some point, you will
have to semantically disambiguate the remainder of the web. That is not in
the manual tagging solution set.
To curate Wikipragmatica, the community will first create training corpora,
per knowledge domain. This exercise is similar to lexeme tagging, but
addresses knowledge domains with unique lexicons or semantics. From there,
after training, the assignment of vectors for sentences is logic akin to
your constructors. At this point, both approaches involve the community's
understanding of the semantics and pragmatics of their knowledge areas to
enhance the curation. Also, both approaches have software logic over which
most of the community will not have purview. That is the nature of software
projects. Being new does not make it less knowable. Further,
the skills applied in curating Wikipragmatica are state of the art. Network
(graph) analysis is the future of data curation, information extraction and
analytics whether manual or automated. The process of translating a word
sense to a position in a three dimensional space is not rocket science. The
space is separated into semantic locations. So instead of lat/longs, you
have mooses and mouses. The community can understand the magic. Once the
semantic network is created, the community can read the results. I know I
harp on this, but reading the graph is key to keeping the community
involved. The output is still natural language, it's just pivoted to a
graph. You could actually browse a Wiki by reading it via the graph. The
community's next job will be to supervise the accuracy of the vector
assignments and begin the process of metadata tagging. The last step is the
paraphrase detection of nearest neighbors to de-dupe the graph into a
paraphrase graph with retained context. At this point, the community will
once again perform quality control that will be input into the models to
refine accuracy and performance. I would argue that the community is just
as involved in Wikipragmatica curation as in lexeme tagging. It's all about
semantics, and the community will be the referees. They can be trained to
understand and control the entire process including the machine models.
I would also like to stress that semantic vector spaces are not hot off the
presses. Neither is paraphrase detection. Both are well understood, robust
software approaches to data curation. Nothing in Wikipragmatica is cutting
edge R&D. It's a unique curation, but relies on well proven tools. In the
Wikipragmatica approach, you do not need the additional logic of the
constructors, as the curation contains the lookup value. Each node will
ultimately have each language's paraphrase of the main node concept. It's
just a lookup. Further, each node will inherit all of the appropriate
existing wikidata metadata. When finished, Wikipragmatica will be a machine
readable knowledge representation that can perform many functions. It is
the foundation for the knowledge as a service strategic goal. A lexeme
based approach lacks the critical component of context brokering. If you
want to respond to a request for knowledge, you must resolve the context in
order to serve the correct semantics. Lastly, I recommend the use of linked
data architectures to address scaling and privacy concerns. A linked data
architecture can address webscale technical requirements.
I apologize if I come across as overly critical and I do know I am late to
the table. However, when I decided to open source my curation, I felt the
logical place for it to grow was at the Wikimedia Foundation. The
movement's 2030 strategy requires support from the machines. I appreciate
that you may be past the point of no further use cases, but I do think that
the community as a whole should not reject vector based or machine learning
based approaches out of hand. The curation skills to get to and maintain
models are modern and can directly contribute to each community member's
professional livelihood. There are classes that I teach that help people
see the utility of organizing data into networks.
Please know that I wish you and the team nothing but success. I also stand
ready to support as you see fit.
Doug
On Thu, May 13, 2021 at 5:08 PM Denny Vrandečić dvrandecic@wikimedia.org
wrote:
...
Hi Douglas,
Thank you for your message.
Yes, you are right that if we were trying to understand a sentence such as
"The cat is digging", we would need to resolve the ambiguity in that
sentence. But, as I wrote in the newsletter, our trick is that we can avoid
the necessity to parse and understand text. The abstract content will
already be written by the contributors in a representation that
disambiguates to the level that is needed to generate the text in the
languages we support - no automatic disambiguation of natural language is
thus needed.
Thank you for publishing the Wikipragmatica proposal. I read it when
you published it back in January, I find it interesting, and I certainly
hope that you will try it out. It is a very different approach to what we
are trying to achieve with Abstract Wikipedia, where we don't aim to
annotate existing textual resources, but to create entirely new ones from
scratch. Wikipragmatica is squarely aimed at the difficult and important
task of natural language understanding. Abstract Wikipedia is, very
intentionally, trying to circumvent that task. I have not reached out for a
discussion because of these significant differences - I think we are aiming
for very different goals using very different approaches. The goals of
Wikipragmatica are to understand the content, and use that understanding
for detecting misinformation, ascertaining truth, and discovering
inconsistencies. These are extremely valuable goals, and very difficult,
and I have tried to steer explicitly away from them. The same is true for
machine learning and vector-based approaches. I cannot figure out how to
incorporate these in a way that allows the community to truly own the
system and the outputs, which I think is crucial for a Wikimedia project
where the community owns and maintains the content. I think that is a very
worthwhile question to explore, that still needs a crucial insight or two
to make it work.
Yes, FrameNet and WordNet are much more related to our approach than GPT-3
or BERT. About a decade ago, Chuck Fillmore, the creator of FrameNet, and I
were teaching together in Berkeley, and back then I learned a lot about
FrameNet from him, and how much effort is in it. Later, during my time at
Google, I had the particular luck that some of my colleagues were former
collaborators of Chuck's on FrameNet, and I discussed it with a
number of them in detail. This made it clear that one of the biggest risks
in the Abstract Wikipedia project is the absolute number of constructors
that we will need, as this will ultimately decide how much effort it will
be to make the content in Abstract Wikipedia available in a new language.
Regarding WordNet, Christiane Fellbaum was one of the initial members of
the advisory board for Wikidata, and her work and results were very
influential in designing the data model for the lexicographic space in
Wikidata (albeit, indirectly, as we settled on the Lemon model that came
later and has learned from WordNet).
You are exactly right, we are going down a well worn path. I keep saying
that in my talks: this is not a research project, we are applying
well-known results from several fields such as natural language generation,
crowd-sourcing, programming languages, etc. I still consider it a risky
project, as there are a number of unknowns (e.g. the number of
constructors, and how multilingual the constructors are) that will play a
major role in how effective our approach will be, but I also think that we
will certainly achieve something worthwhile - but we don't know yet exactly
what and how far this architecture will carry us.
Thank you for your comment,
Denny
On Fri, May 7, 2021 at 9:28 AM Douglas Clark clarkdd@gmail.com wrote:
...
Gerard and Denny,
The problem with a lexeme approach is that the constructors and renderers
will become so complex and convoluted as to be non-scalable. The use of
lexemes is problematic due to a complete lack of context awareness. Just
because you have a word, and know all of its senses, how do you know which
sense to pick?
Using Denny's example, "cat" could actually refer to the American
construction equipment maker Caterpillar. "That cat is digging!" works for
both the animal and the machine. We humans are somewhat unpredictable in
when we set context. Your constructors will have to walk up and down the
text chain to try and find context for each verb and noun. With a word
based approach, words are your granularity, so everything is a lookup for a
word, even though your application is at the sentence level. GPT-3, the
most powerful NLP tool yet created, has 175 billion parameters for its
lexeme based dataset, yet it too loses context. Humans are great at
rephrasing something to fit their complete communique. WordNet is the most
complete and scientifically accurate lexeme database on the planet, yet
very few NLP approaches use WordNet. The traversals of the WordNet
thesaurus can be compute intensive, and would be sensitive to how your
constructors' logic walks the tree. The rules alone would become massive.
You have to at least move up to phrases, and I recommend sentences
(paraphrases). As for phrases, the FrameNet
https://framenet.icsi.berkeley.edu/fndrupal/ folks can tell you how
hard it is to build a dataset of phrases for NLP.
I've asked several times to discuss this with you and to save you and the
team from going down this dead end path. The Wikipragmatica proposal
directly addresses both context and semantics. If you used Wikipragmatica,
translation logic would entail semantic disambiguation, paraphrase
detection on nearest neighbors, node assignment, and then a lookup of node
members for the appropriate language. If you decide to go down the lexeme
path, I highly recommend you spend some noodle time on context brokering.
I'm confident that in short order you will understand the magnitude of the
context problem using lexemes. You are going down a well worn path.
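A rough Python sketch of the last two steps named above (nearest-neighbor lookup, then a per-language lookup of node members). Everything here is an assumption for illustration: the three-dimensional vectors, the node contents, the distance threshold, and the function names are made up, and the step that turns a sentence into a vector is not shown.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = Tuple[float, float, float]

# Hypothetical paraphrase nodes: a position in the semantic space plus one
# paraphrase per language. The coordinates are invented for illustration.
NODES: List[Dict] = [
    {
        "centroid": (0.9, 0.1, 0.0),
        "members": {"en": "The cat is digging.", "nl": "De kat graaft."},
    },
    {
        "centroid": (0.0, 0.2, 0.9),
        "members": {"en": "The excavator is digging.", "nl": "De graafmachine graaft."},
    },
]


def nearest_node(vector: Vector, max_distance: float = 0.5) -> Optional[Dict]:
    """Return the closest node, or None if nothing is within max_distance."""
    best = min(NODES, key=lambda node: math.dist(vector, node["centroid"]))
    if math.dist(vector, best["centroid"]) <= max_distance:
        return best
    return None


def translate(vector: Vector, language: str) -> Optional[str]:
    """Look up the paraphrase for `language` in the node nearest to `vector`."""
    node = nearest_node(vector)
    if node is None:
        return None
    return node["members"].get(language)
```

In this toy setup the hard part, producing a vector that lands near the animal node rather than the machine node, is assumed to have happened upstream.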
Respectfully,
Doug
On Fri, May 7, 2021 at 8:53 AM Thad Guidry thadguidry@gmail.com wrote:
...
Denny,
Wait...
Your original posting mentions that *Constructors* would essentially
hold the conditional logic, or "rules"?
But in your followup, I see you mention *Renderers*?
I'm curious where the delineation of rules will occur, and if the answer
is "it depends"?
Have you given much thought to constraints on Constructors or Renderers
themselves (Are there high level design docs available for each of those
yet)?
Or do you think that will be something still being worked through in the
long term with community use cases, and practices that evolve?
Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/
On Fri, May 7, 2021 at 10:07 AM Denny Vrandečić <
dvrandecic@wikimedia.org> wrote:
...
Hi Gerard,
If the abstract content states (and I am further simplifying):
type: animal type phrase

type of animal: cat
sex: male

that might be represented e.g.
{
  Z1K1: Z14000,
  Z14000K1: Z14146,
  Z14000K2: Z14097
}
or it could be, if we are using QIDs for the values,
{
  Z1K1: Z14000,
  Z14000K1: Q146,
  Z14000K2: Q44148
}
so it wouldn't be based on English, it would be abstracted from the
natural language.
Now there could be a Renderer in Dutch for 'animal type phrases' that
would include:
if Z14000K1 = Q146/cat:
  if Z14000K2 = unknown or Z14000K2 = Q43445/female organism:
    return L208775/kat (Dutch, noun)
  if Z14000K2 = Q44148/male organism:
    return L.../kater (Dutch, noun)
...
etc.
This is just for selecting the right Lexeme. Further functions would
now select the right form, depending on what the sentence looks like.
But nowhere do we need to refer to the Senses or to explicitly modeled
meanings.
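For readers who prefer something executable, the Dutch Renderer rule above can be sketched in Python. This is only an illustration under assumptions: the function name is hypothetical, the IDs are copied from the example, and the Lexeme ID for "kater" is elided in the message, so a placeholder string stands in for it.

```python
from typing import Optional

# Item IDs copied from the example above.
CAT = "Q146"
FEMALE = "Q43445"
MALE = "Q44148"

# L208775 (kat) is taken from the message. The ID for "kater" is elided
# there, so this placeholder is NOT a real Lexeme ID.
KAT = "L208775"
KATER = "L.../kater"


def select_dutch_lexeme(animal: str, sex: Optional[str] = None) -> Optional[str]:
    """Pick a Dutch Lexeme for an 'animal type phrase'.

    Returns None when no rule matches, so a caller could fall back to
    some other Renderer.
    """
    if animal == CAT:
        if sex is None or sex == FEMALE:
            return KAT
        if sex == MALE:
            return KATER
    return None
```

Note that the function only compares item IDs; it never consults Senses, which is the point being made.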
On the other hand, we *could* refer to the Senses and items. (And this
is what I meant with not being prescriptive - I am just sketching out one
possibility that does *not* refer to them). Because we could also write a
multilingual Renderer (e.g. as a fallback Renderer?) that does for example
the following:
Animal = Z14000K1  // which would be Q146/cat in our example
Senses = FollowBacklink(P5137/item for this sense)
Lexemes = GetLexemesFromSenses(Senses)
DutchLexemes = FilterByLanguage(Lexemes, Q7411/Dutch)
return ChooseOne(DutchLexemes)  // that would need to be some deterministic choice
This probably would need some refinement to figure out how the sex
would play into this, but it's just the start of a sketch. You could also
imagine building something on Defined Meanings at this point.
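The fallback Renderer steps above could look roughly like this in Python. It is a sketch over a tiny in-memory stand-in for Wikidata, not real API calls: the Sense IDs, the toy data, and the helper names are assumptions, and "deterministic choice" is interpreted here as "lowest Lexeme ID in string order".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Lexeme:
    lexeme_id: str
    lemma: str
    language: str  # item ID of the language, e.g. Q7411 for Dutch


@dataclass(frozen=True)
class Sense:
    sense_id: str   # hypothetical IDs, for illustration only
    lexeme_id: str
    item: str       # value of P5137 (item for this sense)


# Toy data standing in for Wikidata.
LEXEMES = {
    "L7": Lexeme("L7", "cat", "Q1860"),
    "L208775": Lexeme("L208775", "kat", "Q7411"),
}
SENSES = [
    Sense("L7-S1", "L7", "Q146"),
    Sense("L208775-S1", "L208775", "Q146"),
]


def follow_backlink(item: str) -> List[Sense]:
    """All Senses whose P5137 statement points at `item`."""
    return [s for s in SENSES if s.item == item]


def get_lexemes_from_senses(senses: List[Sense]) -> List[Lexeme]:
    return [LEXEMES[s.lexeme_id] for s in senses]


def filter_by_language(lexemes: List[Lexeme], language: str) -> List[Lexeme]:
    return [lx for lx in lexemes if lx.language == language]


def choose_one(lexemes: List[Lexeme]) -> Optional[Lexeme]:
    """Deterministic pick: the smallest Lexeme ID, or None if empty."""
    return min(lexemes, key=lambda lx: lx.lexeme_id, default=None)


def fallback_render(animal: str, language: str) -> Optional[Lexeme]:
    senses = follow_backlink(animal)
    lexemes = get_lexemes_from_senses(senses)
    return choose_one(filter_by_language(lexemes, language))
```

For example, `fallback_render("Q146", "Q7411")` walks from the cat item to the Dutch Lexeme in the toy data.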
I hope that makes sense - happy to answer more. And again, it is all
just suggestions!
Also, Happy Birthday, Gerard!
Cheers,
Denny
On Thu, May 6, 2021 at 10:23 PM Gerard Meijssen <
gerard.meijssen@gmail.com> wrote:
...
Hoi,
I fail to understand. You have the data in the prescribed manner for
an article. The original is based on English. How can you generate from the
data a text in Dutch or any other language when you have the Senses but
not the meanings of the words?
Thanks,
      GerardM
On Thu, 6 May 2021 at 23:38, Denny Vrandečić dvrandecic@wikimedia.org
wrote:
...
The on-wiki version of this newsletter can be found here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-05-06
In 2018, Wikidata launched a project to collect lexicographical
knowledge
https://www.wikidata.org/wiki/Wikidata:Lexicographical_data.
Several hundred thousand Lexemes have been created since then, and this
year the tools will be further developed by Wikimedia Deutschland to make
the creation and maintenance of the lexicographic knowledge in Wikidata
easier.
The lexicographic extension to Wikidata was developed with the goal
that became Abstract Wikipedia in mind, but a recent discussion within the
community showed me that I have not made the possible connection between
these two parts clear yet. Today, I would like to sketch out a few ideas on
how Abstract Wikipedia and the lexicographic data in Wikidata could work
together.
There are two principal ways to organize a dictionary: either you
organize the entries by ‘lexemes’ or ‘words’ and describe their senses
(this is called the semasiological
https://en.wikipedia.org/wiki/Semasiology approach), or you
organize the entries by their ‘senses’ or ‘meanings’ (this is called the
onomasiological https://en.wikipedia.org/wiki/Onomasiology
approach). Wikidata has intentionally chosen the semasiological approach:
the entries in Wikidata are called Lexemes, and contributors can add Senses
and Forms to the Lexemes. Senses stand for the different meanings that a
Lexeme may regularly invoke, and the Forms are the different ways the
Lexeme may be expressed in a natural language text, e.g. in order to be in
agreement with the right grammatical number, case, tense, etc. The Lexeme
“mouse” (L1119 https://www.wikidata.org/wiki/Lexeme:L1119) thus
has two senses, one for the small rodent, one for the computer input
device, and two forms, “mouse” and “mice”.  For an example of a
multilingual onomasiological collaborative dictionary, one can take a look
at the OmegaWiki http://www.omegawiki.org/ project, which is
primarily organized around (currently 51,000+) Defined Meanings
http://www.omegawiki.org/Help:DefinedMeaning and how these are
expressed in different languages.
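To make the semasiological layout concrete, here is a small Python sketch of a Lexeme with its Senses and Forms, using the "mouse" example above. The class and field names are assumptions for illustration and do not mirror the actual Wikidata data model in detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Sense:
    gloss: str


@dataclass
class Form:
    representation: str
    grammatical_features: Tuple[str, ...]


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    language: str
    senses: List[Sense] = field(default_factory=list)
    forms: List[Form] = field(default_factory=list)

    def form_with(self, feature: str) -> Optional[str]:
        """First form carrying the given grammatical feature, if any."""
        for form in self.forms:
            if feature in form.grammatical_features:
                return form.representation
        return None


# The entry is organized by the word; its meanings hang off it.
mouse = Lexeme(
    lexeme_id="L1119",
    lemma="mouse",
    language="English",
    senses=[Sense("small rodent"), Sense("computer input device")],
    forms=[Form("mouse", ("singular",)), Form("mice", ("plural",))],
)
```

An onomasiological dictionary would invert this: the top-level entry would be a meaning, with words in many languages attached to it.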
The reason why Wikidata chose the semasiological approach is based on
the observation that it is much simpler for a crowd-sourced collaborative
project, and has much less potential to be contentious. It is much easier
to gather a list of words used in a corpus than to gather a list of all the
meanings referred to in the same corpus. And whereas it is 'simpler', it is
still not trivial. We still want to collect a list of Senses for each
Lexeme, and we want to describe the connections between these Senses:
whether two Lexemes in a language have the same Sense, how the Senses
relate to the large catalog of items in Wikidata, and how Senses of
different languages relate to each other. These are all very difficult
questions that the Wikidata community is still grappling with (see also the
essay on Making Sense
https://www.wikidata.org/wiki/Wikidata:Making_sense).
Let’s look at an example.
“Stubbs was probably one of the youngest mayors in the history of the
world. He became mayor of Talkeetna, Alaska, at the age of three months and
six days, and retained that position until his death almost four years ago.
Also, Stubbs https://en.wikipedia.org/wiki/Stubbs_(cat) was a cat."
If we want to express that last sentence - “Stubbs was a cat” - we
will have to be able to express the meaning “cat” (here, we will
focus entirely on the lexical level, and will not discuss grammatical and
idiomatic issues; we will leave those for another day). How do we refer to
the idea for cat in the abstract content? How do we end up, in English,
eventually with the word form “cat” (L7-F4
https://www.wikidata.org/wiki/Lexeme:L7#F4)? In French with the
word form “chat” (L511-F4
https://www.wikidata.org/wiki/Lexeme:L511#F4)? And in German with
the form “Kater” (L303326-F1
https://www.wikidata.org/wiki/Lexeme:L303326#F1)?
Note that these three words commonly do not have the same meaning.
The English word cat refers to male and female cats equally. The French
word is grammatically masculine and can refer to a cat generically, for
example if we didn’t know Stubbs’ gender, but a female cat would
usually be referred to using the word “chatte”. The German word, on
the other hand, may only refer to a male cat. If we didn’t know whether
Stubbs is male or female, we would need to use the word “Katze” in
German instead, whereas in French, as said, we could still use “chat”.
And English also has words for male cats, e.g. “tom” or “tomcat”,
but these are much less frequently used. Searching the Web for “Stubbs
is a cat” returns more than 10,000 hits, but not a single one for “Stubbs
is a tom” nor “Stubbs is a tomcat”.
In comparison, for Félicette
https://en.wikipedia.org/wiki/F%C3%A9licette, the first and so far
only cat in space, the articles indeed use the words “chatte” in
French and “Katze” in German.
Here we are talking about three rather closely related languages, and we
are talking about a rather simple noun. This should have been a very simple
case, and yet it is not. When we talk about verbs, adjectives, or nouns
about more complex concepts (for example different kinds of human
settlements or the different ways human body parts are conceptualized in
different languages, e.g. arms and hands
https://wals.info/chapter/129, terms for colors), it gets much
more complicated very quickly. If we were to require that all words we want
to use in Abstract Wikipedia first must align their meanings, then that
would put a very difficult task in our critical path. So whereas it would
indeed have been helpful to Abstract Wikipedia to have followed an
onomasiological approach (how wonderful would it be to have a comprehensive
catalog of meanings!), that approach was deemed too difficult and a
semasiological approach was chosen instead.
Fortunately, a catalog of meanings is not necessary. We can avoid it
because Abstract Wikipedia only needs to generate text, not to
parse or understand it. This allows us to get by using a
Constructor that, for each language, uses a Renderer to select the correct
word (or other lexical representation). For example, we could have a
Constructor that may take several optional further pieces of information:
the kind of animal, the breed, the color, whether it is an adult, whether
it is neutered, the gender, the number of them, etc. For each of these
pieces of information, we could mark whether that information must be
expressed in the Rendering, or whether this information is optional and can
be ignored, and thus what is available for those Renderers to choose the
most appropriate word. Note, this is not telling the community how to do
it, merely sketching out one possible approach that would avoid relying on
a catalog of meanings.
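One way to picture such a Constructor is a record with optional fields plus a set of fields marked as "must be expressed". The Python below is only a sketch under that assumption: the class name, the field names, and the idea of storing the information in a dictionary are illustrative choices, not a proposed design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AnimalConstructor:
    kind: str
    # Optional further information, e.g. {"gender": "male", "breed": "..."}.
    optional_info: Dict[str, str] = field(default_factory=dict)
    # Names of the pieces of information a Rendering must express.
    must_express: Set[str] = field(default_factory=set)

    def missing_required(self) -> List[str]:
        """Required pieces of information that have not been filled in."""
        return sorted(self.must_express - set(self.optional_info))

    def available(self) -> Dict[str, str]:
        """Everything a Renderer may draw on when choosing a word."""
        return {"kind": self.kind, **self.optional_info}


stubbs = AnimalConstructor(
    kind="cat",
    optional_info={"gender": "male"},
    must_express={"gender", "age"},
)
```

Here `stubbs.missing_required()` reports that the age is marked as required but absent, while the gender is available to any Renderer that wants it.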
Each language Renderer could then use the information it needs to
select the right word. If a language prefers to express the gender
(such as German) it can do so, whereas a language that prefers not to (such
as English) can leave it out. If for a language the age of the cat matters for the
selection of the word, it can look it up. If the color of the animal
matters (as it does for horses in German
https://de.wikipedia.org/wiki/Fellfarben_der_Pferde#Die_einzelnen_Fellfarben),
the respective Renderer can use the information. If required information
is missing, we could add this to a maintenance queue so that contributors
can fill it out. If a language should happen not to have a word, a
different noun phrase can be chosen, e.g. a less specific word such as
“animal” or “pet”, or a phrase such as “male kitten”, or “black
horse” for the German word “Rappen”.
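The per-language word choice described above can be sketched as a small lookup with a fallback, here in Python. The table layout and the function name are assumptions, and the vocabulary is limited to the cat words discussed in this newsletter plus a generic word per language.

```python
from typing import Dict, Optional, Tuple

# (kind, gender) -> word. A gender of None is the entry used when the
# gender is unknown or when the language does not express it.
LEXICON: Dict[str, Dict[Tuple[str, Optional[str]], str]] = {
    "en": {("cat", None): "cat"},
    "fr": {("cat", None): "chat", ("cat", "female"): "chatte"},
    "de": {("cat", None): "Katze", ("cat", "male"): "Kater"},
}

# Less specific word to use when a language has no entry for the animal.
FALLBACK: Dict[str, str] = {"en": "animal", "fr": "animal", "de": "Tier"}


def render_animal(language: str, kind: str, gender: Optional[str] = None) -> str:
    """Choose a word: gender-specific if available, else generic, else fallback."""
    words = LEXICON[language]
    for key in ((kind, gender), (kind, None)):
        if key in words:
            return words[key]
    return FALLBACK[language]
```

Each language uses only the information it cares about: English ignores the gender, German and French use it when present, and no table aligning the meanings of "cat", "chat", and "Kater" is consulted.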
But the important design feature here is that we do not need to
ensure and agree on the alignment of meanings of words across different
languages. We do not need a catalog of meanings to achieve what we want.
Now, there are plenty of other use cases for having such a catalog of
meanings. It would be a tremendously valuable resource. And even without
such a catalog, the statements connecting Senses and Items in Wikidata can
be very helpful for the creation and maintenance of Renderers, but these do
not need to be used when the natural text for Wikipedia is created.
This suggestion is not meant to be prescriptive, as said. It will be
up to the community to decide on how to implement the Renderers and what
information to use. In this, I am sketching out an architecture that allows
us to avoid blocking on the availability of a (valuable but very difficult
to create) resource, a comprehensive catalog of meanings aligning words
across many different languages.

Abstract-Wikipedia mailing list --
abstract-wikipedia@lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...
