Machine-utilizable Crowdsourced Lexicons - Wikitech-l

30 May 2018

INTRODUCTION

Machine-utilizable lexicons can enhance a great number of speech and natural language
technologies. Scientists, engineers and technologists – linguists, computational linguists
and artificial intelligence researchers – eagerly await the advancement of machine
lexicons which include rich, structured metadata and machine-utilizable definitions.

Wiktionary, a collaborative project to produce a free-content multilingual dictionary,
aims to describe all words of all languages using definitions and descriptions. The
Wiktionary project, brought online in 2002, includes 139 spoken languages and American
Sign Language [1].

This letter hopes to inspire exploration into and discussion regarding machine
wiktionaries, machine-utilizable crowdsourced lexicons, and services which could exist at
https://machine.wiktionary.org/ .

LEXICON EDITIONING

The premise of editioning is that one version of the resource can be more or less frozen,
e.g. a 2018 edition, while wiki editors collaboratively work on a next version, e.g. a
2019 edition. Editioning can provide stability for complex software engineering scenarios
utilizing an online resource. Some software engineering teams, however, may prefer to
work from recent dumps or data exports of the edition still being edited.

SEMANTIC WEB

A machine-utilizable lexicon could include a semantic model of its contents and a SPARQL
endpoint.
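
As a rough illustration of how a client might address such an endpoint, the sketch below builds a SPARQL query and a GET URL for it. Everything specific here is an assumption: the endpoint address is hypothetical, and the W3C OntoLex-Lemon vocabulary is used only as one plausible semantic model, not one this letter prescribes. No request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint address; no such service is known to exist.
ENDPOINT = "https://machine.wiktionary.org/sparql"


def senses_query(written_form, language):
    """Build a SPARQL query listing entries and senses for a written form.

    Assumes the lexicon is modelled with OntoLex-Lemon (an assumption).
    """
    if not language.replace("-", "").isalnum():
        raise ValueError("unexpected language tag: %r" % language)
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = written_form.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>\n"
        "SELECT ?entry ?sense WHERE {\n"
        "  ?entry a ontolex:LexicalEntry ;\n"
        "         ontolex:canonicalForm ?form ;\n"
        "         ontolex:sense ?sense .\n"
        f'  ?form ontolex:writtenRep "{escaped}"@{language} .\n'
        "}"
    )


def query_url(query):
    """Encode the query as the 'query' parameter of a GET request URL."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = senses_query("fly", "en")
url = query_url(query)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Passing the query in a `query` parameter follows the SPARQL 1.1 Protocol convention for GET requests; a real service might equally accept POST.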

MACHINE-UTILIZABLE DEFINITIONS

Machine-utilizable definitions, available in a number of knowledge representation formats,
can be granular, detailed and nuanced.

There are many use cases for machine-utilizable definitions. One is giving natural
language processing components the ability to interpret natural language semantically and
to use automated reasoning to disambiguate lexemes, phrases and sentences in context.
Some contend that the best output after a natural
language processing component processes a portion of natural language is each possible
interpretation, perhaps weighted via statistics. In this way, (1) natural language
processing components could process ambiguous language, (2) other components, e.g.
automated reasoning components, could narrow sets of hypotheses utilizing dialogue
contexts, (3) other components, e.g. automated reasoning components, could narrow sets of
hypotheses utilizing knowledgebase content, and (4) mixed-initiative dialogue systems
could also ask users questions to narrow sets of hypotheses. Such disambiguation and
interpretation would utilize machine-utilizable definitions of senses of lexemes.
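
The narrowing of weighted hypotheses described above can be sketched as follows. This is a minimal illustration, not a proposed design: the `Interpretation` type, the sense identifiers and the weights are all made up for the example, and the two predicates merely stand in for a dialogue-context component and a knowledgebase component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    sense_id: str   # hypothetical identifier for one sense of a lexeme
    weight: float   # statistical weight assigned by the NLP component


def narrow(hypotheses, is_compatible):
    """Drop hypotheses rejected by the predicate and renormalize the rest."""
    kept = [h for h in hypotheses if is_compatible(h)]
    total = sum(h.weight for h in kept)
    if total == 0:
        return []
    return [Interpretation(h.sense_id, h.weight / total) for h in kept]


# (1) The NLP component emits every interpretation of an ambiguous "fly".
hypotheses = [
    Interpretation("fly/verb/move-through-air", 0.6),
    Interpretation("fly/noun/insect", 0.3),
    Interpretation("fly/noun/trouser-opening", 0.1),
]

# (2) A stand-in for dialogue context: the conversation is not about clothing.
after_context = narrow(
    hypotheses, lambda h: h.sense_id != "fly/noun/trouser-opening"
)

# (3) A stand-in for knowledgebase content: the word fills a noun slot.
after_kb = narrow(after_context, lambda h: "/noun/" in h.sense_id)
```

If more than one hypothesis survived step (3), a mixed-initiative dialogue system could ask the user a question and apply `narrow` once more with the answer.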

CONJUGATION, DECLENSION AND THE URL-BASED SPECIFICATION OF LEXEMES AND LEXICAL PHRASES

A grammatical category [2] is a property of items within the grammar of a language; it has
a number of possible values, sometimes called grammemes, which are normally mutually
exclusive within a given category. Verb conjugation, for example, may be affected by the
grammatical categories of: person, number, gender, tense, aspect, mood, voice, case,
possession, definiteness, politeness, causativity, clusivity, interrogativity,
transitivity, valency, polarity, telicity, volition, mirativity, evidentiality, animacy,
associativity, pluractionality, reciprocity, agreement, polypersonal agreement,
incorporation, noun class, noun classifiers, and verb classifiers in some languages [3].

By combining the grammatical categories of all languages, we can precisely specify a
conjugation or declension. For example, the URL:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en…

includes an edition, a language of a lemma, a lemma, a lexical category, and conjugates
(with ellipses) the verb in a language-independent manner.
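
A lookup URL of this general shape could be assembled as in the sketch below. Only `edition` and `language` appear in the example URL above; the remaining parameter names (`lemma`, `category`, `tense`) are hypothetical placeholders chosen for illustration, since the original query string is elided.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

LOOKUP = "https://machine.wiktionary.org/wiki/lookup.php"


def build_lookup_url(edition, language, **features):
    """Build a lookup URL from an edition, a language and grammatical features.

    The feature names passed as keyword arguments are assumptions; the
    letter does not fix the parameter names beyond edition and language.
    """
    params = {"edition": edition, "language": language}
    params.update(features)
    return LOOKUP + "?" + urlencode(params)


# Hypothetical feature names, for illustration only.
url = build_lookup_url(2018, "en", lemma="fly", category="verb", tense="past")

# The query string can be read back into its individual parameters.
params = parse_qs(urlsplit(url).query)
```

Using `urlencode` keeps lemmas that contain spaces or non-ASCII characters safe to place in a query string.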

We can further specify, via URL query string, the semantic sense of a grammatical
element:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en…

Specifying a grammatical item fully in a URL query string, as indicated in the previous
examples, could result in a redirection to another URL.

That is, the URL:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en…

could redirect to:

https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678

or to:

https://machine.wiktionary.org/wiki/2018/12345678/

and the URL with a specified semantic sense:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en…

could redirect to:

https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678&…

or to:

https://machine.wiktionary.org/wiki/2018/12345678/4/
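
The path-style URLs shown above follow a simple pattern: an edition, an ID number, and optionally a sense number. A minimal sketch of building and parsing them follows; the function names are illustrative, and the ID 12345678 is the placeholder used in this letter.

```python
from urllib.parse import urlsplit

ROOT = "https://machine.wiktionary.org/wiki"


def path_url(edition, entry_id, sense=None):
    """Return the path-style URL for an entry, or for one of its senses."""
    parts = [str(edition), str(entry_id)]
    if sense is not None:
        parts.append(str(sense))
    return ROOT + "/" + "/".join(parts) + "/"


def parse_path_url(url):
    """Split a path-style URL into (edition, entry_id, sense or None)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) not in (3, 4) or segments[0] != "wiki":
        raise ValueError("not a path-style entry URL: %r" % url)
    edition, entry_id = int(segments[1]), int(segments[2])
    sense = int(segments[3]) if len(segments) == 4 else None
    return edition, entry_id, sense


entry_url = path_url(2018, 12345678)
sense_url = path_url(2018, 12345678, sense=4)
```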

The URL https://machine.wiktionary.org/wiki/2018/12345678/ is intended to indicate a
conjugation or declension with one or more meanings or senses. The URL
https://machine.wiktionary.org/wiki/2018/12345678/4/ is intended to indicate a specific
sense or definition of a conjugation or declension. A benefit of having URLs for both
conjugations or declensions and for specific meanings or senses is that HTTP request
headers [4] can specify the languages and content types of the output desired for a
particular URL.
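
As a sketch of how a client might state those preferences, the example below prepares a request carrying the standard `Accept` and `Accept-Language` headers. The request is only constructed, never sent, since the service is hypothetical; the choice of `text/turtle` as a content type is likewise an assumption about what such a service might offer.

```python
from urllib.request import Request


def build_request(url, content_type, language):
    """Prepare (but do not send) a request with content negotiation headers."""
    return Request(
        url,
        headers={
            "Accept": content_type,        # desired representation format
            "Accept-Language": language,   # desired language of the output
        },
    )


request = build_request(
    "https://machine.wiktionary.org/wiki/2018/12345678/4/",
    content_type="text/turtle",
    language="en",
)
```

The same URL could thus yield, for instance, a human-readable page for one client and a knowledge representation format for another.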

The examples above are intended to indicate that each complete, language-independent
conjugation or declension, rather than each headword or lemma, can have its own ID number.
Instead of one ID number for all variations of “fly”, there is one ID number for “flew”,
another for “have flown”, another for “flying”, and so on for each conjugation or
declension. One reason for indexing conjugations and declensions instead of traditional
headwords or lemmas is that, at least for some knowledge representation formats, the
formal semantics of the definitions vary per conjugation or declension.
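
The indexing scheme can be pictured with a small sketch. The `FormIndex` class and the ID values it hands out are purely illustrative, and for brevity it keys each entry by surface form; a fuller version would presumably key by the complete set of grammatical category values, since distinct conjugations can share a spelling.

```python
from itertools import count


class FormIndex:
    """Assigns one ID per inflected form rather than one per lemma."""

    def __init__(self, first_id=1):
        self._next_id = count(first_id)
        self._ids = {}        # (language, lemma, form) -> ID
        self._by_lemma = {}   # (language, lemma) -> list of IDs

    def add(self, language, lemma, form):
        """Return the ID for a form, allocating a new one on first sight."""
        key = (language, lemma, form)
        if key not in self._ids:
            form_id = next(self._next_id)
            self._ids[key] = form_id
            self._by_lemma.setdefault((language, lemma), []).append(form_id)
        return self._ids[key]

    def ids_for_lemma(self, language, lemma):
        """List the IDs of every indexed form of a lemma."""
        return list(self._by_lemma.get((language, lemma), []))


index = FormIndex()
flew = index.add("en", "fly", "flew")
have_flown = index.add("en", "fly", "have flown")
flying = index.add("en", "fly", "flying")
```

A traditional headword view remains available through `ids_for_lemma`, so indexing by form does not lose the grouping by lemma.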

CONCLUSION

This letter broached machine wiktionaries and some of the services which could exist at
https://machine.wiktionary.org/ . It is my hope that this letter indicated a few of the
many exciting topics with regard to machine-utilizable crowdsourced lexicons.

REFERENCES

[1] https://en.wiktionary.org/wiki/Index:All_languages#List_of_languages
[2] https://en.wikipedia.org/wiki/Grammatical_category
[3] https://en.wikipedia.org/wiki/Grammatical_conjugation
[4] https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Request_fields