INTRODUCTION
Machine-utilizable lexicons can enhance a great number of speech and natural language technologies. Linguists, computational linguists and artificial intelligence researchers eagerly await the advancement of machine lexicons which include rich, structured metadata and machine-utilizable definitions.
Wiktionary, a collaborative project to produce a free-content multilingual dictionary, aims to describe all words of all languages using definitions and descriptions. The Wiktionary project, brought online in 2002, includes 139 spoken languages and American Sign Language [1].
This letter hopes to inspire exploration of and discussion about machine wiktionaries, machine-utilizable crowdsourced lexicons, and the services which could exist at https://machine.wiktionary.org/ .
LEXICON EDITIONING
The premise of editioning is that one version of the resource can be more or less frozen, e.g. a 2018 edition, while wiki editors collaboratively work on the next version, e.g. a 2019 edition. Editioning can provide stability for complex software engineering scenarios which utilize an online resource. Some software engineering teams, however, may choose to use fresh dumps or data exports of the latest, in-progress edition.
SEMANTIC WEB
A machine-utilizable lexicon could include a semantic model of its contents and a SPARQL endpoint.
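As a sketch of how such a SPARQL endpoint might be queried, the snippet below builds a request URL for the senses of a lemma. The endpoint URL and the OntoLex-style properties are assumptions for illustration only; machine.wiktionary.org does not currently exist.

```python
import urllib.parse

# Hypothetical endpoint; machine.wiktionary.org does not (yet) exist.
ENDPOINT = "https://machine.wiktionary.org/sparql"

def build_sense_query(lemma, lang):
    """Build a SPARQL query for the senses of a lemma (OntoLex-style sketch)."""
    return f"""
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?sense ?definition WHERE {{
  ?entry ontolex:canonicalForm/ontolex:writtenRep "{lemma}"@{lang} ;
         ontolex:sense ?sense .
  ?sense skos:definition ?definition .
}}"""

def build_request_url(query):
    """Encode the query as an HTTP GET request URL for the endpoint."""
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})

url = build_request_url(build_sense_query("fly", "en"))
print(url.startswith(ENDPOINT))  # True
```

A real client would then fetch the URL and parse the SPARQL results JSON; the sketch stops at request construction so that it stands alone.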
MACHINE-UTILIZABLE DEFINITIONS
Machine-utilizable definitions, available in a number of knowledge representation formats, can be granular, detailed and nuanced.
There exist a large number of use cases for machine-utilizable definitions. One use case is providing natural language processing components with the capability to semantically interpret natural language, utilizing automated reasoning to disambiguate lexemes, phrases and sentences in context. Some contend that the best output of a natural language processing component, for a given portion of natural language, is the set of all possible interpretations, perhaps weighted statistically. In this way, (1) natural language processing components could process ambiguous language, (2) other components, e.g. automated reasoning components, could narrow the set of hypotheses utilizing dialogue context, (3) other components could narrow the set of hypotheses utilizing knowledgebase content, and (4) mixed-initiative dialogue systems could ask users questions to narrow the set of hypotheses. Such disambiguation and interpretation would utilize machine-utilizable definitions of the senses of lexemes.
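As a toy illustration of steps (1) and (2), the sketch below has a component emit every weighted sense hypothesis for an ambiguous word, and a second component narrow the set using dialogue context. The sense IDs, weights and cue words are invented for illustration.

```python
# A toy sketch of hypothesis narrowing: an NLP component emits every
# interpretation of an ambiguous word with a weight, and a later component
# prunes the set using context. All sense IDs and weights are invented.

def interpret(word):
    """Return all candidate sense hypotheses with prior weights."""
    lexicon = {
        "bank": [("bank/finance", 0.6), ("bank/river", 0.4)],
    }
    return lexicon.get(word, [])

def narrow(hypotheses, context_words):
    """Keep hypotheses compatible with the dialogue context (toy rule)."""
    cues = {"bank/finance": {"money", "loan"},
            "bank/river": {"water", "fishing"}}
    kept = [(s, w) for s, w in hypotheses if cues[s] & set(context_words)]
    return kept or hypotheses  # if the context decides nothing, keep all

hyps = interpret("bank")
print(narrow(hyps, ["water", "boat"]))  # [('bank/river', 0.4)]
```

In a fuller system the cue sets would come from machine-utilizable definitions rather than a hand-written table, and knowledgebase content and user questions (steps 3 and 4) would prune the set further.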
CONJUGATION, DECLENSION AND THE URL-BASED SPECIFICATION OF LEXEMES AND LEXICAL PHRASES
A grammatical category [2] is a property of items within the grammar of a language; it has a number of possible values, sometimes called grammemes, which are normally mutually exclusive within a given category. Verb conjugation, for example, may be affected by the grammatical categories of person, number, gender, tense, aspect, mood, voice, case, possession, definiteness, politeness, causativity, clusivity, interrogativity, transitivity, valency, polarity, telicity, volition, mirativity, evidentiality, animacy, associativity, pluractionality, reciprocity, agreement, polypersonal agreement, incorporation, noun class, noun classifiers, and verb classifiers in some languages [3].
By combining the grammatical categories from every language, we can precisely specify a conjugation or declension. For example, the URL:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-...
includes an edition, the language of a lemma, a lemma, a lexical category and, in the portion elided above, the grammatical categories which conjugate the verb in a language-independent manner.
We can further specify, via URL query string, the semantic sense of a grammatical element:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-...
Specifying a grammatical item fully in a URL query string, as indicated in the previous examples, could result in a redirection to another URL.
That is, the URL:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-...
could redirect to:
https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678
or to:
https://machine.wiktionary.org/wiki/2018/12345678/
and the URL with a specified semantic sense:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-...
could redirect to:
https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678&a...
or to:
https://machine.wiktionary.org/wiki/2018/12345678/4/
The URL https://machine.wiktionary.org/wiki/2018/12345678/ is intended to indicate a conjugation or declension with one or more meanings or senses. The URL https://machine.wiktionary.org/wiki/2018/12345678/4/ is intended to indicate a specific sense or definition of a conjugation or declension. A benefit of having URLs both for conjugations or declensions and for specific meanings or senses is that HTTP request headers [4] can specify the language and content type desired for the output at a particular URL.
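The redirection scheme described above can be illustrated in miniature: a fully specified lookup URL resolves to a canonical edition/ID URL, optionally extended with a sense index. The lookup table, ID number and URLs below are the hypothetical placeholders from the examples, not a real service.

```python
# A sketch of the proposed redirection scheme. The base URL, the toy lookup
# table, and the ID 12345678 are hypothetical placeholders from the letter.
from urllib.parse import urlencode, parse_qs, urlparse

BASE = "https://machine.wiktionary.org/wiki"
INDEX = {("2018", "en", "fly", "verb"): "12345678"}  # toy lookup table

def lookup_url(edition, language, lemma, category):
    """Build a fully specified lookup.php-style URL."""
    return f"{BASE}/lookup.php?" + urlencode(
        {"edition": edition, "language": language,
         "lemma": lemma, "category": category})

def resolve(url, sense=None):
    """Map a lookup URL to its canonical URL, optionally with a sense index."""
    q = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    ident = INDEX[(q["edition"], q["language"], q["lemma"], q["category"])]
    path = f"{BASE}/{q['edition']}/{ident}/"
    return path + (f"{sense}/" if sense is not None else "")

url = lookup_url("2018", "en", "fly", "verb")
print(resolve(url))           # https://machine.wiktionary.org/wiki/2018/12345678/
print(resolve(url, sense=4))  # https://machine.wiktionary.org/wiki/2018/12345678/4/
```

In a deployed service the resolver would be an HTTP 301/302 redirect rather than a function call, with Accept and Accept-Language request headers negotiating the representation returned at the canonical URL.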
The provided examples are intended to indicate that each complete, language-independent conjugation or declension, as opposed to each headword or lemma, can have an ID number. Instead of one ID number for all variations of “fly”, there is one ID number for “flew”, another for “have flown”, another for “flying”, and so on for each conjugation or declension. Reasons for indexing conjugations and declensions instead of traditional headwords or lemmas include that, at least for some knowledge representation formats, the formal semantics of the definitions vary per conjugation or declension.
CONCLUSION
This letter broached the topic of machine wiktionaries and some of the services which could exist at https://machine.wiktionary.org/ . It is my hope that it has indicated a few of the many exciting topics regarding machine-utilizable crowdsourced lexicons.
REFERENCES
[1] https://en.wiktionary.org/wiki/Index:All_languages#List_of_languages
[2] https://en.wikipedia.org/wiki/Grammatical_category
[3] https://en.wikipedia.org/wiki/Grammatical_conjugation
[4] https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Request_fields
Dear Adam,
Are you aware of our current efforts in Wikidata with the new lexeme support that was announced last week? Search and SPARQL support is very limited, but I suspect it might come in some months' time. Translations and senses should also be on their way.
You can search for lemma and forms at Ordia: https://tools.wmflabs.org/ordia/
For instance, "Luftballon": https://tools.wmflabs.org/ordia/L99
The Wikidata lexeme item is here https://www.wikidata.org/wiki/Lexeme:L99
We have around 2035 lexemes so far.
/Finn
Micru, Finn,
Thank you for the hyperlinks to the pertinent projects.
I’m thinking that machine lexicon services could make URL-addressable: (1) headwords and lemmas, (2) conjugations and declensions, and (3) specific senses or definitions. Each conjugation or declension could have its own URL-addressable definitions. Machine-utilizable definitions are envisioned as existing in a number of machine-utilizable knowledge representation formats.
In addition to Web-based user interfaces for content editing, machine lexicons could support bulk APIs, including those based on XML-RPC and SPARUL. With regard to the use of SPARQL and SPARUL, there may already exist a suitable ontology. Some lexical ontologies include: Lemon (https://www.w3.org/2016/05/ontolex/), LexInfo (http://www.lexinfo.net/), LIR (http://mayor2.dia.fi.upm.es/oeg-upm/index.php/en/technologies/63-lir/), LMM (http://ontologydesignpatterns.org/wiki/Ontology:LMM), semiotics.owl (http://www.ontologydesignpatterns.org/cp/owl/semiotics.owl), and Senso Comune (http://www.sensocomune.it/). It should be possible to extend existing ontologies to include machine-utilizable definitions in a number of knowledge representation formats.
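As a sketch of what a SPARUL-based bulk update might look like, the snippet below builds a SPARQL Update statement which attaches a definition to a sense, using OntoLex-style properties. The IRIs reuse the hypothetical URLs from the letter; a real client would POST the statement to the lexicon's update endpoint.

```python
# A sketch of a bulk update via SPARQL Update (SPARUL). The IRIs are
# hypothetical placeholders; the ontolex/skos properties follow the
# OntoLex-Lemon style mentioned above.
def insert_definition_update(lexeme_iri, sense_iri, definition, lang="en"):
    """Build a SPARQL Update statement attaching a definition to a sense."""
    return f"""
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
INSERT DATA {{
  <{lexeme_iri}> ontolex:sense <{sense_iri}> .
  <{sense_iri}> skos:definition "{definition}"@{lang} .
}}"""

stmt = insert_definition_update(
    "https://machine.wiktionary.org/wiki/2018/12345678/",
    "https://machine.wiktionary.org/wiki/2018/12345678/4/",
    "to move through the air with wings")
print("INSERT DATA" in stmt)  # True
```

For genuinely bulk imports (e.g. alignments from other lexical resources), many such statements would be batched per request, or a dump-based import path used instead.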
I’m thinking about topics in knowledge representation with regard to the formal semantics of nouns, verbs, adjectives, adverbs, pronouns, prepositions and conjunctions and about how automated reasoners could make use of machine-utilizable definitions to obtain and compare semantic interpretations as software systems parse natural language.
Best regards, Adam
In addition to Web-based user interfaces for content editing, machine lexicons could support bulk APIs including those based on XML-RPC and SPARUL.
That is what is planned for Wikidata lexemes. There is already a REST API. Example: https://www.wikidata.org/wiki/Special:EntityData/L42.json
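For illustration, here is a sketch of consuming such an EntityData document. The sample JSON mirrors the general shape of lexeme responses (real data would be fetched over HTTP, e.g. with urllib.request); the lemma is the "Luftballon"/L99 example from earlier in the thread, while the form value and grammatical feature ID are illustrative rather than copied from live data.

```python
import json

# A sketch of consuming the Wikidata lexeme REST API. The sample document
# mirrors the shape of Special:EntityData/L<id>.json responses for lexemes;
# the form and feature ID below are illustrative, not live data.
sample = json.loads("""
{"entities": {"L99": {"type": "lexeme",
  "lemmas": {"de": {"language": "de", "value": "Luftballon"}},
  "forms": [{"id": "L99-F1",
             "representations": {"de": {"language": "de",
                                        "value": "Luftballons"}},
             "grammaticalFeatures": ["Q146786"]}]}}}
""")

def lemma_of(doc, lexeme_id):
    """Extract the first lemma of a lexeme from an EntityData document."""
    lemmas = doc["entities"][lexeme_id]["lemmas"]
    return next(iter(lemmas.values()))["value"]

def form_values(doc, lexeme_id):
    """List every written representation of the lexeme's forms."""
    forms = doc["entities"][lexeme_id]["forms"]
    return [r["value"] for f in forms for r in f["representations"].values()]

print(lemma_of(sample, "L99"))  # Luftballon
```

The same accessors would work on a live response from Special:EntityData, since only the document shape is assumed here.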
We are currently working on an RDF output of the lexeme content using Lemon/OntoLex [1]. It is planned to import this RDF representation into https://query.wikidata.org in order to be able to execute SPARQL queries on it.
Cheers,
Thomas
[1] https://mediawiki.org/wiki/Extension:WikibaseLexeme/RDF_mapping
Thomas,
Thank you for the exciting information with regard to the future of Wikidata lexemes. With bulk upload and update capabilities, we might anticipate alignments and uploads from projects on the scale of FrameNet, PropBank, VerbNet and WordNet.
With regard to crowdsourced lexicons containing machine-utilizable definitions, we can consider a feature where, when software using the definition APIs finds that there aren’t yet definitions for a particular lexeme, a counter is incremented, such that users can observe which lexemes’ definitions are in popular demand. This could be a means of prioritizing which lexemes to rigorously define.
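A minimal sketch of such a demand counter, assuming a simple in-memory definition service; the class, method names and toy lexicon contents are invented for illustration:

```python
from collections import Counter

# Sketch of the proposed demand counter: each time a definition lookup
# misses, the lexeme is tallied so that editors can see which missing
# definitions are most requested. The lexicon contents are toy data.
class DefinitionService:
    def __init__(self, definitions):
        self.definitions = definitions  # lexeme -> list of definitions
        self.misses = Counter()         # lexeme -> number of failed lookups

    def lookup(self, lexeme):
        if lexeme not in self.definitions:
            self.misses[lexeme] += 1    # record unmet demand
            return None
        return self.definitions[lexeme]

    def most_wanted(self, n=3):
        """Lexemes whose missing definitions are in greatest demand."""
        return [lex for lex, _ in self.misses.most_common(n)]

svc = DefinitionService({"fly": ["to move through the air"]})
svc.lookup("zeitgeist"); svc.lookup("zeitgeist"); svc.lookup("petrichor")
print(svc.most_wanted())  # ['zeitgeist', 'petrichor']
```

In a deployed service the tally would of course live in shared storage rather than process memory, and could be exposed to editors as a prioritized worklist.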
We might envision natural language understanding, including semantic interpretation, of children’s books in upcoming years.
Best regards,
Adam
Thomas,
I also wanted to briefly indicate how non-trivial some of these technical topics are; for example, algorithmically determining which interpretation hypotheses are correct for a sentence, or whether one or more constituent elements of a sentence are best interpreted in ways not yet specified in a growing, dynamic lexicon.
The matter relates to language learning. There is the matter of encountering new lexemes, ones with zero senses thus far in the lexicon, and then there is the matter of encountering new senses of previously encountered lexemes.
My earlier comment was that software systems could signal machine-utilizable crowdsourced lexicon services upon certain events, so that users could utilize the data to prioritize collaborative work. I also theorize, as others do, that a viable way of sequencing the work of building natural language understanding systems and lexicons is to enter data in order of reading level, from infancy to adult reading level.
Building machine-utilizable crowdsourced lexicon software with rich, structured metadata and with extensible storage slots for definitions in multiple knowledge representation formats is a difficult task; one that makes possible other difficult tasks utilizing such lexicons.
Thank you for the enjoyable brainstorming session and for indicating the state of the art with regard to projects underway. I am interested in any of your thoughts, opinions and ideas with respect to the future of machine-utilizable crowdsourced lexicons.
Best regards,
Adam
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l