The on-wiki version of this newsletter can be found here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-05-06
In 2018, Wikidata launched a project to collect lexicographical knowledge https://www.wikidata.org/wiki/Wikidata:Lexicographical_data. Several hundred thousand Lexemes have been created since then, and this year the tools will be further developed by Wikimedia Deutschland to make the creation and maintenance of the lexicographic knowledge in Wikidata easier.
The lexicographic extension to Wikidata was developed with the goal that became Abstract Wikipedia in mind, but a recent discussion within the community showed me that I have not made the possible connection between these two parts clear yet. Today, I would like to sketch out a few ideas on how Abstract Wikipedia and the lexicographic data in Wikidata could work together.
There are two principal ways to organize a dictionary: either you organize the entries by ‘lexemes’ or ‘words’ and describe their senses (this is called the semasiological https://en.wikipedia.org/wiki/Semasiology approach), or you organize the entries by their ‘senses’ or ‘meanings’ (this is called the onomasiological https://en.wikipedia.org/wiki/Onomasiology approach). Wikidata has intentionally chosen the semasiological approach: the entries in Wikidata are called Lexemes, and contributors can add Senses and Forms to the Lexemes. Senses stand for the different meanings that a Lexeme may regularly invoke, and the Forms are the different ways the Lexeme may be expressed in a natural language text, e.g. in order to be in agreement with the right grammatical number, case, tense, etc. The Lexeme “mouse” (L1119 https://www.wikidata.org/wiki/Lexeme:L1119) thus has two senses, one for the small rodent, one for the computer input device, and two forms, “mouse” and “mice”. For an example of a multilingual onomasiological collaborative dictionary, one can take a look at the OmegaWiki http://www.omegawiki.org/ project, which is primarily organized around (currently 51,000+) Defined Meanings http://www.omegawiki.org/Help:DefinedMeaning and how these are expressed in different languages.
Wikidata chose the semasiological approach because it is much simpler for a crowd-sourced collaborative project and has much less potential to be contentious: it is much easier to gather a list of the words used in a corpus than a list of all the meanings referred to in the same corpus. But while it is simpler, it is still not trivial. We still want to collect a list of Senses for each Lexeme, and we want to describe the connections between these Senses: whether two Lexemes in a language share a Sense, how the Senses relate to the large catalog of items in Wikidata, and how Senses of different languages relate to each other. These are all very difficult questions that the Wikidata community is still grappling with (see also the essay on Making Sense https://www.wikidata.org/wiki/Wikidata:Making_sense).
Let’s look at an example.
“Stubbs was probably one of the youngest mayors in the history of the world. He became mayor of Talkeetna, Alaska, at the age of three months and six days, and retained that position until his death almost four years ago. Also, Stubbs https://en.wikipedia.org/wiki/Stubbs_(cat) was a cat.”
If we want to express that last sentence - “Stubbs was a cat” - we will have to be able to express the meaning “cat” (here, we will focus entirely on the lexical level, and will not discuss grammatical and idiomatic issues; we will leave those for another day). How do we refer to the idea for cat in the abstract content? How do we end up, in English, eventually with the word form “cat” (L7-F4 https://www.wikidata.org/wiki/Lexeme:L7#F4)? In French with the word form “chat” (L511-F4 https://www.wikidata.org/wiki/Lexeme:L511#F4)? And in German with the form “Kater” (L303326-F1 https://www.wikidata.org/wiki/Lexeme:L303326#F1)?
Note that these three words commonly do not have the same meaning. The English word cat refers to male and female cats equally. The French word “chat” is grammatically masculine and could refer to a cat generically, for example if we didn’t know Stubbs’ gender, but a female cat would usually be referred to using the word “chatte”. The German word “Kater”, on the other hand, may only refer to a male cat. If we didn’t know whether Stubbs was male or female, we would need to use the word “Katze” in German instead, whereas in French, as said, we could still use “chat”. English also has words for male cats, e.g. “tom” or “tomcat”, but these are used much less frequently. Searching the Web for “Stubbs is a cat” returns more than 10,000 hits, but not a single one for “Stubbs is a tom” nor “Stubbs is a tomcat”.
In comparison, for Félicette https://en.wikipedia.org/wiki/F%C3%A9licette, the first and so far only cat in space, the articles indeed use the words “chatte” in French and “Katze” in German.
Here we are talking about three rather closely related languages and about a rather simple noun. This should have been a very simple case, and yet it is not. When we talk about verbs, adjectives, or nouns for more complex concepts (for example different kinds of human settlements, the different ways human body parts are conceptualized in different languages, e.g. arms and hands https://wals.info/chapter/129, or terms for colors), it gets much more complicated very quickly. If we were to require that all the words we want to use in Abstract Wikipedia first have their meanings aligned, that would put a very difficult task on our critical path. So whereas it would indeed have been helpful to Abstract Wikipedia to have followed an onomasiological approach (how wonderful would it be to have a comprehensive catalog of meanings!), that approach was deemed too difficult and a semasiological approach was chosen instead.
Fortunately, a catalog of meanings is not necessary. We can avoid it because Abstract Wikipedia only needs to generate text, and neither parse nor understand it. This allows us to get by with a Constructor that, for each language, uses a Renderer to select the correct word (or other lexical representation). For example, we could have a Constructor that may take several optional further pieces of information: the kind of animal, the breed, the color, whether it is an adult, whether it is neutered, the gender, the number of them, etc. For each of these pieces of information, we could mark whether it must be expressed in the Rendering, or whether it is optional and can be ignored, and thus what is available for the Renderers to choose the most appropriate word. Note that this is not telling the community how to do it; it is merely sketching out one possible approach that avoids relying on a catalog of meanings.
Each language Renderer could then use the information it needs to select the right word. If a language has a preference for expressing the gender (such as German), it can do so, whereas a language that prefers not to (such as English) can leave it out. If for a language the age of the cat matters for the selection of the word, it can look it up. If the color of the animal matters (as it does for horses in German https://de.wikipedia.org/wiki/Fellfarben_der_Pferde#Die_einzelnen_Fellfarben), the respective Renderer can use that information. If required information is missing, we could add this to a maintenance queue so that contributors can fill it in. If a language should happen not to have a word, a different noun phrase can be chosen, e.g. a less specific word such as “animal” or “pet”, or a phrase such as “male kitten”, or “black horse” for the German word “Rappen”.
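To make this a bit more concrete, here is one rough sketch of how such a Constructor and two Renderers could look, written as Python with made-up names. The real system would of course use Wikifunctions functions and Wikidata identifiers, so please treat this purely as an illustration of the idea, not as a design:

from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class AnimalPhrase:
    """Hypothetical Constructor for "<subject> was a <animal>"."""
    kind: str                                    # e.g. "cat"; in reality a Wikidata item
    sex: Optional[str] = None                    # "male", "female", or None if unknown
    must_express: Set[str] = field(default_factory=set)  # slots a Renderer may not drop

def render_en(p: AnimalPhrase) -> str:
    # English usually ignores the sex unless it is marked as "must be expressed"
    if "sex" in p.must_express and p.sex == "male":
        return "tomcat"
    return "cat"

def render_de(p: AnimalPhrase) -> str:
    # German prefers to express the sex whenever it is known
    if p.sex == "male":
        return "Kater"
    return "Katze"

stubbs = AnimalPhrase(kind="cat", sex="male")
print(render_en(stubbs))  # cat
print(render_de(stubbs))  # Kater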
But the important design feature here is that we do not need to ensure and agree on the alignment of meanings of words across different languages. We do not need a catalog of meanings to achieve what we want.
Now, there are plenty of other use cases for having such a catalog of meanings. It would be a tremendously valuable resource. And even without such a catalog, the statements connecting Senses and Items in Wikidata can be very helpful for the creation and maintenance of Renderers, but these do not need to be used when the natural text for Wikipedia is created.
This suggestion is not meant to be prescriptive, as said. It will be up to the community to decide on how to implement the Renderers and what information to use. In this, I am sketching out an architecture that allows us to avoid blocking on the availability of a (valuable but very difficult to create) resource, a comprehensive catalog of meanings aligning words across many different languages.
Hoi, I fail to understand. You have the data in the prescribed manner for an article. The original is based on English. How can you generate from the data a text in Dutch or any other language, when you do have the Senses but not the meanings of the words. Thanks, GerardM
Hi Gerard,
If the abstract content states (and I am further simplifying):
type: animal type phrase
- type of animal: cat
- sex: male
that might be represented e.g.
{ Z1K1: Z14000, Z14000K1: Z14146, Z14000K2: Z14097 }
or it could be, if we are using QIDs for the values,
{ Z1K1: Z14000, Z14000K1: Q146, Z14000K2: Q44148 }
so it wouldn't be based on English; it would be abstracted away from any particular natural language.
Now there could be a Renderer in Dutch for 'animal type phrases' that would include:
if Z14000K1 = Q146/cat:
  if Z14000K2 = unknown or Z14000K2 = Q43445/female organism:
    return L208775/kat (Dutch, noun)
  if Z14000K2 = Q44148/male organism:
    return L.../kater (Dutch, noun)
  ...
etc.
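Spelled out as runnable code, that same selection could look roughly like the following (just my pseudocode above transcribed into Python with placeholder values; the kater Lexeme ID stays unfilled, as above):

def render_animal_nl(animal, sex=None):
    # animal and sex are Wikidata item IDs, as in the abstract content above
    if animal == "Q146":                      # cat
        if sex is None or sex == "Q43445":    # unknown, or female organism
            return "L208775 (kat, Dutch noun)"
        if sex == "Q44148":                   # male organism
            return "L... (kater, Dutch noun)"  # Lexeme ID left open
    raise NotImplementedError("no Dutch rule for this animal yet")

print(render_animal_nl("Q146", "Q44148"))  # L... (kater, Dutch noun)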
This is just for selecting the right Lexeme. Further functions would then select the right Form, depending on what the sentence looks like.
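As one possible illustration of that next step (again with hypothetical data shapes; in Wikidata, Forms carry their grammatical features as items), a Form-selection function could simply filter a Lexeme's Forms by the features the sentence requires:

def select_form(lexeme, wanted_features):
    """Return the representation of the first Form carrying all wanted features."""
    for form in lexeme["forms"]:
        if wanted_features <= set(form["features"]):
            return form["representation"]
    return lexeme["lemma"]  # fall back to the lemma if no Form matches

kat = {"lemma": "kat",
       "forms": [{"representation": "kat",    "features": ["singular"]},
                 {"representation": "katten", "features": ["plural"]}]}
print(select_form(kat, {"singular"}))  # kat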
But nowhere do we need to refer to the Senses or to explicitly modeled meanings.
On the other hand, we *could* refer to the Senses and items. (And this is what I meant by not being prescriptive - I am just sketching out one possibility that does *not* refer to them.) Because we could also write a multilingual Renderer (e.g. as a fallback Renderer?) that does, for example, the following:
Animal = Z14000K1  // which would be Q146/cat in our example
Senses = FollowBacklink(P5137/item for this sense)
Lexemes = GetLexemesFromSenses(Senses)
DutchLexemes = FilterByLanguage(Lexemes, Q7411/Dutch)
return ChooseOne(DutchLexemes)  // that would need to be some deterministic choice
This would probably need some refinement to figure out how the sex would play into it, but it's just the start of a sketch. You could also imagine building something on Defined Meanings at this point.
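For illustration, a fallback lookup along those lines can already be approximated today against the Wikidata Query Service (the endpoint and the Lexeme RDF vocabulary below are real; the wrapper function and the final choice strategy are just assumptions for this sketch):

import requests

WDQS = "https://query.wikidata.org/sparql"

def lexemes_for_item(item_qid, language_qid):
    """Lexemes in the given language with a Sense linked to the item via P5137."""
    query = """
    SELECT ?lexeme ?lemma WHERE {
      ?lexeme dct:language wd:%s ;
              wikibase:lemma ?lemma ;
              ontolex:sense ?sense .
      ?sense wdt:P5137 wd:%s .
    }""" % (language_qid, item_qid)
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "abstract-wikipedia-sketch/0.1"})
    r.raise_for_status()
    return sorted(b["lemma"]["value"] for b in r.json()["results"]["bindings"])

# Q146 = house cat, Q7411 = Dutch; ChooseOne could then be e.g. "first, alphabetically"
print(lexemes_for_item("Q146", "Q7411"))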
I hope that makes sense - happy to answer more. And again, it is all just suggestions!
Also, Happy Birthday, Gerard!
Cheers, Denny
Denny,
Wait... Your original posting mentions that *Constructors* would essentially hold the conditional logic, or "rules"? But in your followup, I see you mention *Renderers*?
I'm curious where the delineation of rules will occur, and if the answer is "it depends"?
Have you given much thought to constraints on Constructors or Renderers themselves (Are there high level design docs available for each of those yet)? Or do you think that will be something still being worked through in the long term with community use cases, and practices that evolve?
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
Constructors are used for the notation of the abstract content. Constructors are language independent, and shouldn't hold conditional logic.
Renderers should hold the actual conditional logic (which gets applied to the information in the Constructors). Renderers can be per language (but can also be shared across languages).
We will write this out in more detail when we get to the second part of the Abstract Wikipedia project. For now, this separation is analogous to the separation in other NLG systems such as Grammatical Framework.
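If it helps, here is a tiny sketch of that delineation (hypothetical names and shapes, nothing more): the Constructor instance is plain data with no branching, and all conditional logic sits in per-language Renderer functions looked up from a registry.

RENDERERS = {}  # (constructor type, language code) -> renderer function

def renderer(ctype, lang):
    def register(fn):
        RENDERERS[(ctype, lang)] = fn
        return fn
    return register

@renderer("animal_type_phrase", "nl")
def animal_nl(slots):
    # the conditional logic lives here, in the Renderer
    return "kater" if slots.get("sex") == "male" else "kat"

@renderer("animal_type_phrase", "en")
def animal_en(slots):
    return "cat"

def render(content, lang):
    # the Constructor instance itself is just data
    return RENDERERS[(content["type"], lang)](content["slots"])

print(render({"type": "animal_type_phrase", "slots": {"sex": "male"}}, "nl"))  # kater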
But I am hoping that I am at least consistent about these terms :) Please let me know if I am being inconsistent and please ask more.
Thank you, Denny
Thanks!
No, you clarified enough for now. Yes, totally makes sense with other NLG systems, Agree!
I pinged a potential typo here as well that confuses me: https://meta.wikimedia.org/wiki/Talk:Abstract_Wikipedia/Tasks#Confusing_sent...
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Fri, May 7, 2021 at 11:11 AM Denny Vrandečić dvrandecic@wikimedia.org wrote:
Constructors are used for the notation of the abstract content. Constructors are language independent, and shouldn't hold conditional logic.
Renderers should hold the actual conditional logic (which gets applied to the information in the Constructors). Renderers can be per language (but can also be shared across languages).
We will write this out in more detail when we get to the second part of the Abstract Wikipedia project. For now, this separation is analogous to the separation in other NLG systems such as Grammatical Framework.
But I am hoping that I am at least consistent about these terms :) Please let me know if I am being inconsistent and please ask more.
Thank you, Denny
On Fri, May 7, 2021 at 8:53 AM Thad Guidry thadguidry@gmail.com wrote:
Denny,
Wait... Your original posting mentions that *Constructors* would essentially hold the conditional logic, or "rules"? But in your followup, I see you mention *Renderers*?
I'm curious where the delineation of rules will occur, and if the answer is "it depends"?
Have you given much thought to constraints on Constructors or Renderers themselves (Are there high level design docs available for each of those yet)? Or do you think that will be something still being worked through in the long term with community use cases, and practices that evolve?
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Fri, May 7, 2021 at 10:07 AM Denny Vrandečić dvrandecic@wikimedia.org wrote:
Hi Gerard,
If the abstract content states (and I am further simplifying):
type: animal type phrase
- type of animal: cat
- sex: male
that might be represented e.g.
{ Z1K1: Z14000, Z14000K1: Z14146, Z14000K2: Z14097 }
or it could be, if we are using QIDs for the values,
{ Z1K1: Z14000, Z14000K1: Q146, Z14000K2: Q44148 }
so it wouldn't be based on English, it would be abstracted from the natural language.
Now there could be a Renderer in Dutch for 'animal type phrases' that would include:
if Z14000K1 = Q146/cat:
    if Z14000K2 = unknown or Z14000K2 = Q43445/female organism:
        return L208775/kat (Dutch, noun)
    if Z14000K2 = Q44148/male organism:
        return L.../kater (Dutch, noun)
    ...
etc.
This is just for selecting the right Lexeme. Further functions would then select the right Form, depending on what the rest of the sentence looks like.
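For that second step, picking a Form, a sketch could look like this (the data is a made-up, heavily abbreviated stand-in for a real Lexeme; real Lexeme data is much richer):

# Hypothetical, abbreviated slice of a Lexeme with its Forms and their
# grammatical features.
kater = {
    "lemma": "kater",
    "forms": [
        {"representation": "kater",  "features": {"number": "singular"}},
        {"representation": "katers", "features": {"number": "plural"}},
    ],
}

def select_form(lexeme, **wanted):
    # Return the first Form whose features match everything the sentence
    # requires (number, case, tense, ...).
    for form in lexeme["forms"]:
        if all(form["features"].get(k) == v for k, v in wanted.items()):
            return form["representation"]
    raise LookupError("no matching Form")

print(select_form(kater, number="singular"))  # kater
print(select_form(kater, number="plural"))    # katers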
But nowhere do we need to refer to the Senses or to explicitly modeled meanings.
On the other hand, we *could* refer to the Senses and items. (And this is what I meant by not being prescriptive - I am just sketching out one possibility that does *not* refer to them). Because we could also write a multilingual Renderer (e.g. as a fallback Renderer?) that does, for example, the following:
Animal = Z14000K1  // which would be Q146/cat in our example
Senses = FollowBacklink(P5137/item for this sense)
Lexemes = GetLexemesFromSenses(Senses)
DutchLexemes = FilterByLanguage(Lexemes, Q7411/Dutch)
return ChooseOne(DutchLexemes)  // that would need to be some deterministic choice
This probably would need some refinement to figure out how the sex would play into this, but it's just the start of a sketch. You could also imagine building something on Defined Meanings at this point.
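As a very rough sketch of such a fallback in runnable form, with hard-coded stand-ins for the actual Wikidata lookups (the Sense IDs and the German entry here are made up for illustration):

# Which Senses point to a given item via P5137 "item for this sense",
# and which lemma and language each of those Senses belongs to.
# (Hard-coded stand-ins for what would really be Wikidata queries.)
senses_for_item = {
    "Q146": ["L208775-S1", "L303326-S1"],  # cat -> kat (Dutch), Kater (German)
}
sense_to_lexeme = {
    "L208775-S1": {"lemma": "kat",   "language": "Q7411"},  # Dutch
    "L303326-S1": {"lemma": "Kater", "language": "Q188"},   # German
}

def fallback_render(item_id, language_qid):
    senses = senses_for_item.get(item_id, [])
    lexemes = [sense_to_lexeme[s] for s in senses]
    candidates = [lex["lemma"] for lex in lexemes if lex["language"] == language_qid]
    # ChooseOne has to be deterministic; sorting and taking the first
    # candidate is one crude way to get that.
    return sorted(candidates)[0] if candidates else None

print(fallback_render("Q146", "Q7411"))  # kat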
I hope that makes sense - happy to answer more. And again, it is all just suggestions!
Also, Happy Birthday, Gerard!
Cheers, Denny
On Thu, May 6, 2021 at 10:23 PM Gerard Meijssen <gerard.meijssen@gmail.com> wrote:
Hoi, I fail to understand. You have the data in the prescribed manner for an article. The original is based on English. How can you generate from the data a text in Dutch or any other language, when you do have the Senses but not the meanings of the words? Thanks, GerardM
Gerard and Denny,
The problem with a lexeme approach is that the constructors and renderers will become so complex and convoluted as to be non-scalable. The use of lexemes is problematic due to a complete lack of context awareness. Just because you have a word, and know all of its senses, how do you know which sense to pick?
Using Denny's example, "cat" could actually refer to the American construction equipment maker Caterpillar. "That cat is digging!" works for both the animal and the machine. We humans are somewhat unpredictable in when we set context. Your constructors will have to walk up and down the text chain to try and find context for each verb and noun. With a word-based approach, words are your granularity, so everything is a lookup for a word, even though your application is at the sentence level. GPT-3, the most powerful NLP tool yet created, has 175 billion parameters for its lexeme-based dataset, yet it too loses context. Humans are great at rephrasing something to fit their complete communiqué. WordNet is the most complete and scientifically accurate lexeme database on the planet, yet very few NLP approaches use WordNet. The traversals of the WordNet thesaurus can be compute-intensive, and would be sensitive to how your constructors' logic walks the tree. The rules alone would become massive. You have to at least move up to phrases, and I recommend sentences (paraphrases). As for phrases, the FrameNet https://framenet.icsi.berkeley.edu/fndrupal/ folks can tell you how hard it is to build a dataset of phrases for NLP.
I've asked several times to discuss this with you and to save you and the team from going down this dead-end path. The Wikipragmatica proposal directly addresses both context and semantics. If you used Wikipragmatica, translation logic would entail semantic disambiguation, paraphrase detection on nearest neighbors, node assignment, and then a lookup of node members for the appropriate language. If you decide to go down the lexeme path, I highly recommend you spend some noodle time on context brokering. I'm confident that in short order you will understand the magnitude of the context problem using lexemes. You are going down a well-worn path.
Respectfully,
Doug
Hi Douglas,
Thank you for your message.
Yes, you are right that if we were trying to understand a sentence such as "The cat is digging", we would need to resolve the ambiguity in that sentence. But, as I wrote in the newsletter, our trick is that we can avoid the necessity to parse and understand text. The abstract content will already be written by the contributors in a representation that disambiguates to the level that is needed to generate the text in the languages we support - no automatic disambiguation of natural language is thus needed.
Thank you for publishing the Wikipragmatica proposal. I read it when you published it back in January, I find it interesting, and I certainly hope that you will try it out. It is a very different approach from what we are trying to achieve with Abstract Wikipedia, where we don't aim to annotate existing textual resources, but to create entirely new ones from scratch. Wikipragmatica is squarely aimed at the difficult and important task of natural language understanding. Abstract Wikipedia is, very intentionally, trying to circumvent that task. I have not reached out for a discussion because of these significant differences - I think we are aiming for very different goals using very different approaches. The goals of Wikipragmatica are to understand the content, and to use that understanding for detecting misinformation, ascertaining truth, and discovering inconsistencies. These are extremely valuable goals, and very difficult, and I have tried to steer explicitly away from them. The same is true for machine learning and vector-based approaches: I cannot figure out how to incorporate these in a way that allows the community to truly own the system and its outputs, which I think is crucial for a Wikimedia project where the community owns and maintains the content. I think that is a very worthwhile question to explore, one that still needs a crucial insight or two to make it work.
Yes, FrameNet and WordNet are much more closely related to our approach than GPT-3 or BERT. About a decade ago, Chuck Fillmore, the creator of FrameNet, and I were teaching together in Berkeley, and back then I learned a lot from him about FrameNet and about how much effort has gone into it. Later, during my time at Google, I had the particular luck that some of my colleagues were former collaborators of Chuck's on FrameNet, and I have discussed it with a number of them in detail. This made it clear that one of the biggest risks in the Abstract Wikipedia project is the absolute number of constructors we will need, as this will ultimately decide how much effort it takes to make the content in Abstract Wikipedia available in a new language. Regarding WordNet, Christiane Fellbaum was one of the initial members of the advisory board for Wikidata, and her work and results were very influential in designing the data model for the lexicographic space in Wikidata (albeit indirectly, as we settled on the Lemon model, which came later and learned from WordNet).
You are exactly right, we are going down a well-worn path. I keep saying that in my talks: this is not a research project; we are applying well-known results from several fields such as natural language generation, crowd-sourcing, programming languages, etc. I still consider it a risky project, as there are a number of unknowns (e.g. the number of constructors, and how multilingual the constructors are) that will play a major role in how effective our approach will be. But I also think that we will certainly achieve something worthwhile - we just don't know yet exactly what, and how far this architecture will carry us.
Thank you for your comment, Denny
Denny,
I really appreciate your time in responding. I understand the magnitude of the problem and the technical challenges your team is addressing. So, I thank you for your well reasoned and detailed response.
I would like to clarify the base capability I see in Wikipragmatica, as well as the user community's work stream in support of its curation. My concern is that the path the team has chosen is a dead end beyond the limited use cases of Wikipedia day zero + a few years. An ecosystem of free knowledge certainly seems to lead outside the confines of today's wiki markup world for data and information acquisition. At some point, you will have to semantically disambiguate the remainder of the web. That is not in the manual tagging solution set.
To curate Wikipragmatica, the community will first create training corpora, per knowledge domain. This exercise is similar to lexeme tagging, but addresses knowledge domains with unique lexicons or semantics. From there, after training, the assignment of vectors for sentences is logic akin to your constructors. At this point, both approaches involve the community's understanding of the semantics and pragmatics of their knowledge areas to enhance the curation. Also, both approaches have software logic that most of the community will not have purview into. That is the nature of software projects. Just because it is new does not make it less knowable. Further, the skills applied in curating Wikipragmatica are state of the art. Network (graph) analysis is the future of data curation, information extraction and analytics, whether manual or automated. The process of translating a word sense to a position in a three-dimensional space is not rocket science. The space is separated into semantic locations. So instead of lat/longs, you have mooses and mouses. The community can understand the magic. Once the semantic network is created, the community can read the results. I know I harp on this, but reading the graph is key to keeping the community involved. The output is still natural language, it's just pivoted to a graph. You could actually browse a Wiki by reading it via the graph. The community's next job will be to supervise the accuracy of the vector assignments and begin the process of metadata tagging. The last step is the paraphrase detection of nearest neighbors to de-dupe the graph into a paraphrase graph with retained context. At this point, the community will once again perform quality control that will be fed back into the models to refine accuracy and performance. I would argue that the community is just as involved in Wikipragmatica's curation as in lexeme tagging. It's all about semantics, and the community will be the referees. They can be trained to understand and control the entire process, including the machine models.
I would also like to stress that semantic vector spaces are not hot off the presses. Neither is paraphrase detection. Both are well-understood, robust software approaches to data curation. Nothing in Wikipragmatica is cutting-edge R&D. It's a unique curation, but it relies on well-proven tools. In the Wikipragmatica approach, you do not need the additional logic of the constructors, as the curation contains the lookup value. Each node will ultimately have each language's paraphrase of the main node concept. It's just a lookup. Further, each node will inherit all of the appropriate existing Wikidata metadata. When finished, Wikipragmatica will be a machine-readable knowledge representation that can perform many functions. It is the foundation for the knowledge-as-a-service strategic goal. A lexeme-based approach lacks the critical component of context brokering. If you want to respond to a request for knowledge, you must resolve the context in order to serve the correct semantics. Lastly, I recommend the use of linked data architectures to address scaling and privacy concerns. A linked data architecture can address web-scale technical requirements.
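To make the nearest-neighbor step concrete, here is a toy illustration (hand-made vectors and an arbitrary threshold; the real pipeline would use trained embedding models, not anything this small):

import math

# Made-up sentence vectors; in practice these would come from a trained
# embedding model rather than being written by hand.
sentences = {
    "Stubbs was a cat":      [0.90, 0.10, 0.00],
    "Stubbs was a male cat": [0.88, 0.15, 0.02],
    "Stubbs was mayor":      [0.10, 0.90, 0.30],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Group sentences whose vectors are close enough into one "paraphrase node";
# each node keeps a representative vector and its member sentences.
THRESHOLD = 0.95
nodes = []
for text, vec in sentences.items():
    for node in nodes:
        if cosine(vec, node["rep"]) >= THRESHOLD:
            node["members"].append(text)
            break
    else:
        nodes.append({"rep": vec, "members": [text]})

for node in nodes:
    print(node["members"])
# -> ['Stubbs was a cat', 'Stubbs was a male cat']
# -> ['Stubbs was mayor']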
I apologize if I come across as overly critical, and I do know I am late to the table. However, when I decided to open-source my curation, I felt the logical place for it to grow was at the Wikimedia Foundation. The movement's 2030 strategy requires support from the machines. I appreciate that you may be past the point of taking on further use cases, but I do think that the community as a whole should not reject vector-based or machine-learning-based approaches out of hand. The curation skills needed to build and maintain the models are modern and can directly contribute to each community member's professional livelihood. There are classes that I teach that help people see the utility of organizing data into networks.
Please know that I wish you and the team nothing but success. I also stand ready to support as you see fit.
Doug
On Thu, May 13, 2021 at 5:08 PM Denny Vrandečić dvrandecic@wikimedia.org wrote:
Hi Douglas,
Thank you for your message.
Yes, you are right that if we were trying to understand a sentence such as "The cat is digging", we would need to resolve the ambiguity in that sentence. But, as I wrote in the newsletter, our trick is that we can avoid the necessity to parse and understand text. The abstract content will already be written by the contributors in a representation that disambiguates to the level that is needed to generate the text in the languages we support - no automatic disambiguation of natural language is thus needed.
Thank you for publishing the Wikipragmatica proposal. I have read it when you published it back in January, and I find it interesting and I certainly hope that you will try it out. It is a very different approach to what we are trying to achieve with Abstract Wikipedia, where we don't aim to annotate existing textual resources, but to create entirely new ones from scratch. Wikipragmatica is squarely aimed at the difficult and important task of natural language understanding. Abstract Wikipedia is, very intentionally, trying to circumvent that task. I have not reached out for a discussion because of these significant differences - I think we are aiming for very different goals using very different approaches. The goals of Wikipragmatica are to understand the content, and use that understanding for detecting misinformation, ascertaining truth, and discovering inconsistencies. These are extremely valuable goals, and very difficult, and I have tried to steer explicitly away from them. The same is true for machine learning and vector-based approaches. I cannot figure out how to incorporate these in a way that allows the community to truly own the system and the outputs, which I think is crucial for a Wikimedia project where the community owns and maintains the content. I think that is a very worthwhile question to explore, that still needs a crucial insight or two to make it work.
Yes, FrameNet and WordNet are much more related to our approach than GPT-3 or Bert. About a decade ago, Chuck Filmore, the creator of FrameNet, and I were teaching together in Berkeley, and back then I learned a lot about FrameNet from him, and how much effort is in it. Later, during my time at Google I had the particular luck that some of my colleagues were a few of Chuck's former collaborators on FrameNet and have discussed it with a number of them in detail. This made it clear that one of the biggest risks in the Abstract Wikipedia project is the absolute number of constructors that we will need, as this will ultimately decide how much effort it will be to make the content in Abstract Wikipedia available in a new language. Regarding WordNet, Christiane Fellbaum was one of the initial members of the advisory board for Wikidata, and her work and results were very influential in designing the data model for the lexicographic space in Wikidata (albeit, indirectly, as we settled on the Lemon model that came later and has learned from WordNet).
You are exactly right, we are going down a well worn path. I keep saying that in my talks: this is not a research project, we are applying well-known results from several fields such as natural language generation, crowd-sourcing, programming languages, etc. I still consider it a risky project, as there are a number of unknowns (e.g. the number of constructors, and how multilingual the constructors are) that will play a major role in how effective our approach will be, but I also think that we will certainly achieve something worthwhile - but we don't know yet exactly what and how far this architecture will carry us.
Thank you for your comment, Denny
On Fri, May 7, 2021 at 9:28 AM Douglas Clark clarkdd@gmail.com wrote:
Gerard and Denny,
The problem with a lexeme approach is that the constructors and renderers will become so complex and convoluted as to be non-scalable. The use of lexemes is problematic due to a complete lack of context awareness. Just because you have a word, and know all of its senses, how do you know which sense to pick?
Using Denny's example, "cat" could actually refer to the American construction equipment maker Caterpillar. "That cat is digging!" works for both the animal and the machine. We humans are somewhat unpredictable in when we set context. Your constructors will have to walk up and down the text chain to try and find context for each verb and noun. With a word based approach, words are your granularity, so everything is a lookup for a word, even though your application is at the sentence level. GPT-3, the most powerful NLP tool yet created, has 175 billion parameters for its lexeme based dataset, yet it too loses context. Humans are great at rephrasing something to fit their complete communique. WordNet is the most complete and scientifically accurate lexeme database on the planet, yet very few NLP approaches use WordNet. The traversals of the WordNet thesaurus can be compute intensive, and would be sensitive to how your constructors' logic walks the tree. The rules alone would become massive. You have to at least move up to phrases, and I recommend sentences (paraphrases). As for phrases, the FrameNet https://framenet.icsi.berkeley.edu/fndrupal/ folks can tell you how hard it is to build a dataset of phrases for NLP.
I've asked several times to discuss this with you and to save you and the team from going down this dead end path. The Wikipragmatica proposal directly addresses both context and semantics. If you used Wikipragmatica, translation logic would entail semantic disambiguation, paraphrase detection on nearest neighbors, node assignment, and then a lookup of node members for the appropriate language. If you decide to go down the lexeme path, I highly recommend you spend some noodle time on context brokering. I'm confident that in short order you will understand the magnitude of the context problem using lexemes. You are going down a well worn path.
Respectfully,
Doug
On Fri, May 7, 2021 at 8:53 AM Thad Guidry thadguidry@gmail.com wrote:
Denny,
Wait... Your original posting mentions that *Constructors* would essentially hold the conditional logic, or "rules"? But in your followup, I see you mention *Renderers*?
I'm curious where the delineation of rules will occur, and if the answer is "it depends"?
Have you given much thought to constraints on Constructors or Renderers themselves (Are there high level design docs available for each of those yet)? Or do you think that will be something still being worked through in the long term with community use cases, and practices that evolve?
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Fri, May 7, 2021 at 10:07 AM Denny Vrandečić < dvrandecic@wikimedia.org> wrote:
Hi Gerard,
If the abstract content states (and I am further simplifying):
type: animal type phrase
- type of animal: cat
- sex: male
that might be represented e.g.
{ Z1K1: Z14000, Z14000K1: Z14146, Z14000K2: Z14097 }
or it could be, if we are using QIDs for the values,
{ Z1K1: Z14000, Z14000K1: Q146, Z14000K2: Q44148 }
so it wouldn't be based on English, it would be abstracted from the natural language.
Now there could be a Renderer in Dutch for 'animal type phrases' that would include:
if Z14000K1 = Q146/cat: if Z1400K2 = unknown or Z1400K2 = Q43445/female organism: return L208775/kat (Dutch, noun) if Z1400K2 = Q44148/male organism: return L.../kater (Dutch, noun) ...
etc.
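Written out as runnable code, that rule might look roughly like the sketch below (Python standing in for an eventual Wikifunctions implementation; the function name and the dictionary representation are invented for illustration, and the elided L... identifier is left as in the rule above):

# Rough Python stand-in for the Dutch 'animal type phrase' Renderer rule.
# The Z/Q/L identifiers come from the example; everything else is illustrative.
Q_CAT = "Q146"        # house cat
Q_FEMALE = "Q43445"   # female organism
Q_MALE = "Q44148"     # male organism

def dutch_animal_lexeme(content: dict) -> str:
    """Select a Dutch Lexeme id for an 'animal type phrase' (Z14000)."""
    animal = content.get("Z14000K1")
    sex = content.get("Z14000K2")              # may be absent/unknown
    if animal == Q_CAT:
        if sex is None or sex == Q_FEMALE:
            return "L208775"                   # 'kat' (Dutch, noun)
        if sex == Q_MALE:
            return "L..."                      # 'kater' (Dutch, noun)
    raise NotImplementedError("no Dutch rule for this case yet")

print(dutch_animal_lexeme({"Z1K1": "Z14000",
                           "Z14000K1": "Q146",
                           "Z14000K2": "Q44148"}))   # -> the 'kater' Lexeme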
This is just for selecting the right Lexeme. Further functions would now select the right form, depending on what the sentence looks like.
But nowhere do we need to refer to the Senses or to explicitly modeled meanings.
On the other hand, we *could* refer to the Senses and items. (And this is what I meant by not being prescriptive - I am just sketching out one possibility that does *not* refer to them). Because we could also write a multilingual Renderer (e.g. as a fallback Renderer?) that does for example the following:
Animal = Z14000K1                       // which would be Q146/cat in our example
Senses = FollowBacklink(P5137/item for this sense)
Lexemes = GetLexemesFromSenses(Senses)
DutchLexemes = FilterByLanguage(Lexemes, Q7411/Dutch)
return ChooseOne(DutchLexemes)          // that would need to be some deterministic choice
This probably would need some refinement to figure out how the sex would play into this, but it's just the start of a sketch. You could also imagine building something on Defined Meanings at this point.
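For completeness, a minimal sketch of that fallback against the live Wikidata SPARQL endpoint could look like this (Python with requests; the query follows the standard lexeme model with P5137 'item for this sense' and relies on the prefixes the Wikidata Query Service predefines, while the function names and the alphabetical "ChooseOne" rule are only placeholders):

# Sketch of the multilingual fallback Renderer: from an item (Q146/cat),
# follow P5137 backlinks to Senses, take their Lexemes, filter by language,
# and pick one deterministically. Names and the choice rule are illustrative.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def lexemes_for_item(item_qid: str, language_qid: str) -> list:
    query = """
    SELECT ?lexeme ?lemma WHERE {
      ?lexeme dct:language wd:%s ;
              wikibase:lemma ?lemma ;
              ontolex:sense ?sense .
      ?sense wdt:P5137 wd:%s .
    }""" % (language_qid, item_qid)
    response = requests.get(SPARQL_ENDPOINT,
                            params={"query": query, "format": "json"})
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return sorted(b["lemma"]["value"] for b in bindings)

def choose_one(lemmas: list) -> str:
    return lemmas[0]        # placeholder for "some deterministic choice"

dutch_lemmas = lexemes_for_item("Q146", "Q7411")   # Q7411 = Dutch
print(choose_one(dutch_lemmas))                    # e.g. 'kat'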
I hope that makes sense - happy to answer more. And again, it is all just suggestions!
Also, Happy Birthday, Gerard!
Cheers, Denny
On Thu, May 6, 2021 at 10:23 PM Gerard Meijssen < gerard.meijssen@gmail.com> wrote:
Hoi, I fail to understand. You have the data in the prescribed manner for an article. The original is based on English. How can you generate from the data a text in Dutch or any other language, when you do have the Senses but not the meanings of the words? Thanks, GerardM
On Thu, 6 May 2021 at 23:38, Denny Vrandečić dvrandecic@wikimedia.org wrote:
The on-wiki version of this newsletter can be found here: https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-05-06
In 2018, Wikidata launched a project to collect lexicographical knowledge https://www.wikidata.org/wiki/Wikidata:Lexicographical_data. Several hundred thousand Lexemes have been created since then, and this year the tools will be further developed by Wikimedia Deutschland to make the creation and maintenance of the lexicographic knowledge in Wikidata easier.
The lexicographic extension to Wikidata was developed with the goal that became Abstract Wikipedia in mind, but a recent discussion within the community showed me that I have not made the possible connection between these two parts clear yet. Today, I would like to sketch out a few ideas on how Abstract Wikipedia and the lexicographic data in Wikidata could work together.
There are two principal ways to organize a dictionary: either you organize the entries by ‘lexemes’ or ‘words’ and describe their senses (this is called the semasiological https://en.wikipedia.org/wiki/Semasiology approach), or you organize the entries by their ‘senses’ or ‘meanings’ (this is called the onomasiological https://en.wikipedia.org/wiki/Onomasiology approach). Wikidata has intentionally chosen the semasiological approach: the entries in Wikidata are called Lexemes, and contributors can add Senses and Forms to the Lexemes. Senses stand for the different meanings that a Lexeme may regularly invoke, and the Forms are the different ways the Lexeme may be expressed in a natural language text, e.g. in order to be in agreement with the right grammatical number, case, tense, etc. The Lexeme “mouse” (L1119 https://www.wikidata.org/wiki/Lexeme:L1119) thus has two senses, one for the small rodent, one for the computer input device, and two forms, “mouse” and “mice”. For an example of a multilingual onomasiological collaborative dictionary, one can take a look at the OmegaWiki http://www.omegawiki.org/ project, which is primarily organized around (currently 51,000+) Defined Meanings http://www.omegawiki.org/Help:DefinedMeaning and how these are expressed in different languages.
The reason why Wikidata chose the semasiological approach is based on the observation that it is much simpler for a crowd-sourced collaborative project, and has much less potential to be contentious. It is much easier to gather a list of words used in a corpus than to gather a list of all the meanings referred to in the same corpus. And whereas it is 'simpler', it is still not trivial. We still want to collect a list of Senses for each Lexeme, and we want to describe the connections between these Senses: whether two Lexemes in a language have the same Sense, how the Senses relate to the large catalog of items in Wikidata, and how Senses of different languages relate to each other. These are all very difficult questions that the Wikidata community is still grappling with (see also the essay on Making Sense https://www.wikidata.org/wiki/Wikidata:Making_sense).
Let’s look at an example.
“Stubbs was probably one of the youngest mayors in the history of the world. He became mayor of Talkeetna, Alaska, at the age of three months and six days, and retained that position until his death almost four years ago. Also, Stubbs https://en.wikipedia.org/wiki/Stubbs_(cat) was a cat."
If we want to express that last sentence - “Stubbs was a cat” - we will have to be able to express the meaning “cat” (here, we will focus entirely on the lexical level, and will not discuss grammatical and idiomatic issues; we will leave those for another day). How do we refer to the idea for cat in the abstract content? How do we end up, in English, eventually with the word form “cat” (L7-F4 https://www.wikidata.org/wiki/Lexeme:L7#F4)? In French with the word form “chat” (L511-F4 https://www.wikidata.org/wiki/Lexeme:L511#F4)? And in German with the form “Kater” (L303326-F1 https://www.wikidata.org/wiki/Lexeme:L303326#F1)?
Note that these three words commonly do not have the same meaning. The English word cat refers to both male or female cats equally; and whereas the French word could refer to a cat generically, for example if we wouldn’t know Stubbs’ gender, the word is male, but a female cat would usually be referred to using the word “chatte”. The German word, on the other hand, may only refer to a male cat. If we wouldn’t know whether Stubbs is male or female, we would need to use the word “Katze” in German instead, whereas in French, as said, we still could use “chat”. And English also has words for male cats, e.g. “tom” or “tomcat”, but these are much less frequently used. Searching the Web for “Stubbs is a cat” returns more than 10,000 hits, but not a single one for “Stubbs is a tom” nor “Stubbs is a tomcat”.
In comparison, for Félicette https://en.wikipedia.org/wiki/F%C3%A9licette, the first and so far only cat in space, the articles indeed use the words “chatte” in French and “Katze” in German.
Here we are talking about three rather closely related languages, we are talking about a rather simple noun. This should have been a very simple case, and yet it is not. When we talk about verbs, adjectives, or nouns about more complex concepts (for example different kinds of human settlements or the different ways human body parts are conceptualized in different languages, e.g. arms and hands https://wals.info/chapter/129, terms for colors), it gets much more complicated very quickly. If we were to require that all words we want to use in Abstract Wikipedia first must align their meanings, then that would put a very difficult task in our critical path. So whereas it would indeed have been helpful to Abstract Wikipedia to have followed an onomasiological approach (how wonderful would it be to have a comprehensive catalog of meanings!), that approach was deemed too difficult and a semasiological approach was chosen instead.
Fortunately, a catalog of meanings is not necessary. The way we can avoid that is because Abstract Wikipedia only needs to generate text, and neither parse nor understand it. This allows us to get by using a Constructor that, for each language, uses a Renderer to select the correct word (or other lexical representation). For example, we could have a Constructor that may take several optional further pieces of information: the kind of animal, the breed, the color, whether it is an adult, whether it is neutered, the gender, the number of them, etc. For each of these pieces of information, we could mark whether that information must be expressed in the Rendering, or whether this information is optional and can be ignored, and thus what is available for those Renderers to choose the most appropriate word. Note, this is not telling the community how to do it, merely sketching out one possible approach that would avoid to rely on a catalog of meanings.
Each language Renderer could then use the information it needs to select the right word. If a language has a preference to express the gender (such as German) it can do so, whereas a language that prefers not to (such as English) can do so. If for a language the age of the cat matters for the selection of the word, it can look it up. If the color of the animal matters (as it does for horses in German https://de.wikipedia.org/wiki/Fellfarben_der_Pferde#Die_einzelnen_Fellfarben), the respective Renderer can use the information. If a required information is missing, we could add this to a maintenance queue so that contributors can fill it out. If a language should happen not to have a word, a different noun phrase can be chosen, e.g. a less specific word such as ”animal” or “pet”, or a phrase such as “male kitten”, or “black horse” for the German word “Rappen”.
But the important design feature here is that we do not need to ensure and agree on the alignment of meanings of words across different languages. We do not need a catalog of meanings to achieve what we want.
Now, there are plenty of other use cases for having such a catalog of meanings. It would be a tremendously valuable resource. And even without such a catalog, the statements connecting Senses and Items in Wikidata can be very helpful for the creation and maintenance of Renderers, but these do not need to be used when the natural text for Wikipedia is created.
This suggestion is not meant to be prescriptive, as said. It will be up to the community to decide on how to implement the Renderers and what information to use. In this, I am sketching out an architecture that allows us to avoid blocking on the availability of a (valuable but very difficult to create) resource, a comprehensive catalog of meanings aligning words across many different languages.
On 17 May 2021 at 19:21 Douglas Clark clarkdd@gmail.com wrote: I would like to clarify the base capability I see in Wikipragmatica, as well as the user community's work stream in support of its curation. My concern is that the path the team has chosen is a dead end beyond the limited use cases of Wikipedia day zero + a few years. An ecosystem of free knowledge certainly seems to lead outside the confines of today's wiki markup world for data and information acquisition. At some point, you will have to semantically disambiguate the remainder of the web. That is not in the manual tagging solution set.
So suppose we look beyond the proof-of-concept and the immediate impacts of the "materials" of the Abstract Wikipedia project: the concrete improvements in the Lexeme space in Wikidata, for example for medical and chemical vocabulary; and the repository including what broadly could be called "conversion scripts".
Various further topics have come up on this list. Some of those might be:
(a) Authoring multiple-choice questions in AW code, as a basis for multilingual educational materials.
(b) Publication of WikiJournals - the Wikimedia contribution to learned journals - in AW code that would then translate to multilingual versions.
(c) Using AW code as the target in generalised text-mining.
I think you are foreseeing something like (c). Certainly it is more like a blue-sky problem.
Charles
Charles,
I foresee all three. Lexemes do no work. They are somewhat quantum-like masses in that all lexemes have several potential states, but you don't know what flavor they will take until observed in the wild after being acted upon by grammar and context. Grammar and context are the energy to lexemes' mass. So, tagging lexemes is the first part of using language programmatically. A critical next step is either understanding a lexeme's *current* sense state, or placing one in a sense state. This is semantic disambiguation. Grammar can give us some sense states, like setting part of speech. However, those pesky little modifiers are more sensitive to context than grammar. "The yellow digging cat," is still somewhat stateless until we look left and right and find the communication is about construction. Now our mind's eye resolves the cat into a large construction vehicle of some nature.
To get to multiple choice questions, we have to set all of the sense states to support an interrogative. To do translation, we have to precisely know the incoming sense states, and then also translate to the outgoing grammar rules from the incoming ones. The team is going to be writing a ton of rules for the constructors. The problem is that the total number of rule combinations becomes compute-hard when you start to scale. The team at FrameNet tried to shortcut the rules at the phrase level. Their work became convoluted with rules, as the exceptions became significant. We don't communicate in words or phrases; we communicate in full concepts - sentences or, on social media, sentence fragments. They are fragments mostly because they shun sentence grammar. Grammar is hard.
Wikipragmatica is designed to do the same thing at the base level as Abstract. The curation has some bonus benefits due to the embedded context. However, its basic job is to do semantic disambiguation (your lexeme tagging is one part of disambiguation) while also providing context outside the concept (sentence) for larger communication construction (e.g., sentence fragments, paragraphs, emails, web pages, etc.). A thesaurus is critical for traditional lexeme manipulation since some people say tomato and some people say nightshade. The point being, you can substitute lexemes and still mean the same thing. What Wikipragmatica does is exactly the same thing as a thesaurus except at the sentence level (complete concept). So instead of synonyms, we have paraphrases. This way, we have ignored sentence-level grammar and now we can use context for paragraph and larger grammars. Wikipragmatica skips all that messy sentence grammar stuff since we compare sentences and context to conduct semantic disambiguation. There are a ton fewer rules above the sentence level. Interestingly, we do see some new rules resolving as we get more samples of pathing (context) between concepts. Sometimes the important context is five sentences away, not our next-door neighbor. Thus, keeping track of how sentences connect together in all communications gives us another super powerful tool.
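As a toy illustration of that "thesaurus of sentences" idea (not Wikipragmatica itself): paraphrase lookup as a nearest-neighbour search over sentence embeddings, here sketched with the sentence-transformers library and an arbitrary model choice:

# Toy sketch of sentence-level paraphrase lookup: embed sentences, find the
# nearest neighbour, treat near-duplicates as members of one concept node.
# Model choice, the example "nodes", and the scoring are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

concept_nodes = [
    "Stubbs was a cat.",
    "The excavator is digging a trench.",
    "The mayor of Talkeetna was a feline.",
]
query = "That cat is digging!"

node_vectors = model.encode(concept_nodes, convert_to_tensor=True)
query_vector = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vector, node_vectors)[0]
best = int(scores.argmax())
print(concept_nodes[best], float(scores[best]))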
Denny and the team have selected a traditional path with a twist. The constructors are the first large-scale attempt that I know of to codify grammar for active use rather than passive grammar checkers. Google went with N-grams instead of grammars. GPT-3 went with extreme metadata. Grammar replication is gonna be really hard when you do not have context brokering (meaning a service that can help derive context clues for lexeme sense resolution). Even if you know the markup of the incoming sentence, sentence five may affect which translation grammar rule you apply. If the team is successful, it will be quite an achievement. I just think Wikipragmatica is a simpler, more robust solution with quite a few more use cases.
I hope this addresses your comment. Please let me know if you would like further clarification.
Doug
"Thus, keeping track of how sentences connect together in all communications give us another super powerful tool."
I agree, but that is orthogonal to Abstract Wikipedia's initial purpose. It might expand a bit in a few years perhaps, even leveraging Wikipragmatica, but that is anyone's guess at this point. I agree that some of the Abstract Wikpedia work *might* detract or pull away some users or researchers from Wikipragmatica's efforts, and for that I guess all of us can be a bit sorry for that. But at the same time, I think all of us can agree that it is not much of a detractor, but instead just another "layer" towards Natural Language Processing and human knowledge expressed in many languages.
*I am thankful for both efforts*, but at the end of the day, each is really just a different layer of understanding with different use cases and intentions (for the time being). I am supportive of the use cases that Wikipragmatica wants to make machines even more capable of understanding humans and their language. I am supportive of the use cases that Abstract Wikipedia wants to make humans even more capable of understanding other humans in their own language.
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
Thad,
Noted. But let's not forget that the constructors are software code. That code will be just as unknowable to some community members as the code driving semantic vectors. Abstract is not using stone tools to tag lexemes. Both approaches are primarily code-driven. Abstract retains the familiar use of markup to assist the software logic. However, both use humans to do the machine conditioning, via markup in both cases. On the flip side, humans have used networks to curate data since at least Turing's time. I learned the techniques when I was 18. You say markup, I say edges, nodes, and features. We can teach the community an extension to the markup world. I think coloring Wikipragmatica as machine-based, but Abstract as not, is somewhat disingenuous. Machine learning is not a black box, nor is it AI. It is a software tool, just like for loops and abstraction. Wikipragmatica will not need fewer people; it will require the same number and skills, plus more, as an unabridged thesaurus of concepts will provide new insights into human communication.
Thank you so very much for taking the time to respond, Thad.
Doug
On 18 May 2021 at 17:54 Douglas Clark clarkdd@gmail.com wrote:
<snip>
Wikipragmatica will not need fewer people; it will require the same number and skills, plus more, as an unabridged thesaurus of concepts will provide new insights into human communication.
This may be labouring the point. But when I joined the list, I commented it was interesting as a place to discuss both language and computation.
Two further use cases where the AW approach might gain traction would be abstracts of papers, and patents. I'm sure these areas have both had plenty of attention from the point of view of translation.
The problems to do with having adequate technical vocabulary are being addressed by Wikidata. Disambiguation, for example of acronyms used as index terms for papers, is already on the agenda. The issues of sentences not just having assertoric force can be seen in those contexts in a relatively controlled way.
I mentioned multiple choice questions. There is a different kind of point there, namely the lack of a de facto standard, which is essential for reuse. (Use of questions online is hardly the problem - Wikimedia places a high value on reuse of content.)
What gets called "scholarly communications" is behind my raising the Wikijournals. Here again there is an issue of standards, often posed as "scholarly HTML", because of the lack of uniformity in the way learned journals are put online. There is a whole raft of issues that comes up when you say "write in AW code, then produce HTML later" as a publishing model. Actually, Wikipedia experience in evolving a style guide can help understand the philosophical logic side of that. Definitely a hard area.
My take on text mining is certainly influenced by the couple of years I spent working with the ContentMine group. I wouldn't expect those insights to be common ground.
Charles
Denny,
This dataset https://ai.googleblog.com/2021/05/kelm-integrating-knowledge-graphs-with.html from Google AI could massively reduce the complexity of your constructors. By matching the synthetic language and accompanying metadata from Wikidata with the incoming marked-up text, your rules of action and/or translation are simplified. You have 1500 relationships that can really amp up your context awareness. You are out of the grammar business, which I think is preferable. This still does not solve how you will handle the wilds of natural language out in the interwebs, but it makes what you are doing an active knowledge representation.
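A toy sketch of what that matching could look like in practice (the triples, templates, and overlap scoring below are invented for illustration and are much cruder than the trained verbaliser behind the KELM corpus):

# Toy sketch of the idea: Wikidata-style triples verbalised into synthetic
# sentences (as in the KELM corpus), then matched against incoming text.
# The triples, templates, and scoring here are illustrative, not KELM itself.
triples = [
    ("Stubbs", "instance of", "cat"),
    ("Stubbs", "position held", "mayor of Talkeetna"),
    ("Caterpillar Inc.", "instance of", "construction equipment manufacturer"),
]

def verbalise(subj: str, pred: str, obj: str) -> str:
    # KELM uses a trained verbaliser; a template is enough for illustration.
    templates = {
        "instance of": "{s} is a {o}.",
        "position held": "{s} held the position of {o}.",
    }
    return templates.get(pred, "{s} {p} {o}.").format(s=subj, p=pred, o=obj)

synthetic = [verbalise(*t) for t in triples]

def match(text: str) -> str:
    # Crude lexical-overlap matching; a real system would use embeddings.
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    return max(synthetic, key=lambda s: overlap(s, text))

print(match("Also, Stubbs was a cat."))   # -> "Stubbs is a cat."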
I know this is data generated by the reviled vector models, but it's human readable, thus much more approachable than even markup. The Google AI article does a great job of explaining the logic of generating the synthetic text. I think this shows us how machine learning can enhance knowledge sharing, while keeping the curations human readable. We are not out of the loop. We have to make the machine communication in our language, not theirs. I'm personally more excited about this than GPT-3.
Doug
On Thu, May 13, 2021 at 5:08 PM Denny Vrandečić dvrandecic@wikimedia.org wrote:
Hi Douglas,
Thank you for your message.
Yes, you are right that if we were trying to understand a sentence such as "The cat is digging", we would need to resolve the ambiguity in that sentence. But, as I wrote in the newsletter, our trick is that we can avoid the necessity to parse and understand text. The abstract content will already be written by the contributors in a representation that disambiguates to the level that is needed to generate the text in the languages we support - no automatic disambiguation of natural language is thus needed.
Thank you for publishing the Wikipragmatica proposal. I read it when you published it back in January; I find it interesting, and I certainly hope that you will try it out. It is a very different approach to what we are trying to achieve with Abstract Wikipedia, where we don't aim to annotate existing textual resources, but to create entirely new ones from scratch. Wikipragmatica is squarely aimed at the difficult and important task of natural language understanding. Abstract Wikipedia is, very intentionally, trying to circumvent that task. I have not reached out for a discussion because of these significant differences - I think we are aiming for very different goals using very different approaches. The goals of Wikipragmatica are to understand the content, and use that understanding for detecting misinformation, ascertaining truth, and discovering inconsistencies. These are extremely valuable goals, and very difficult, and I have tried to steer explicitly away from them. The same is true for machine learning and vector-based approaches. I cannot figure out how to incorporate these in a way that allows the community to truly own the system and the outputs, which I think is crucial for a Wikimedia project where the community owns and maintains the content. I think that is a very worthwhile question to explore, one that still needs a crucial insight or two to make it work.
Yes, FrameNet and WordNet are much more related to our approach than GPT-3 or BERT. About a decade ago, Chuck Fillmore, the creator of FrameNet, and I were teaching together in Berkeley, and back then I learned a lot about FrameNet from him, and about how much effort went into it. Later, during my time at Google, I had the particular luck that some of my colleagues were former collaborators of Chuck's on FrameNet, and I have discussed it with a number of them in detail. This made it clear that one of the biggest risks in the Abstract Wikipedia project is the absolute number of constructors that we will need, as this will ultimately decide how much effort it will take to make the content in Abstract Wikipedia available in a new language. Regarding WordNet, Christiane Fellbaum was one of the initial members of the advisory board for Wikidata, and her work and results were very influential in designing the data model for the lexicographic space in Wikidata (albeit indirectly, as we settled on the Lemon model, which came later and learned from WordNet).
You are exactly right, we are going down a well worn path. I keep saying that in my talks: this is not a research project, we are applying well-known results from several fields such as natural language generation, crowd-sourcing, programming languages, etc. I still consider it a risky project, as there are a number of unknowns (e.g. the number of constructors, and how multilingual the constructors are) that will play a major role in how effective our approach will be, but I also think that we will certainly achieve something worthwhile - but we don't know yet exactly what and how far this architecture will carry us.
Thank you for your comment, Denny