Hi Adam,
Thanks for your well-intentioned letter. Do you know about Wikidata and the
recent developments to support machine-readable lexicographical data? I
would like to invite you to take a look at:
The system is still in its early stages, but you can take a look at
examples like:
If you have any questions about this, please do ask.
Regards,
Micru
On Wed, May 30, 2018 at 3:01 AM, Adam Sobieski <adamsobieski(a)hotmail.com>
wrote:
INTRODUCTION
Machine-utilizable lexicons can enhance a great number of speech and
natural language technologies. Scientists, engineers and technologists –
linguists, computational linguists and artificial intelligence researchers
– eagerly await the advancement of machine lexicons which include rich,
structured metadata and machine-utilizable definitions.
Wiktionary, a collaborative project to produce a free-content multilingual
dictionary, aims to describe all words of all languages using definitions
and descriptions. The Wiktionary project, brought online in 2002, includes
139 spoken languages and American Sign Language [1].
This letter hopes to inspire exploration into and discussion regarding
machine wiktionaries, machine-utilizable crowdsourced lexicons, and
services which could exist at
https://machine.wiktionary.org/ .
LEXICON EDITIONING
The premise of editioning is that one version of the resource can be more
or less frozen, e.g. a 2018 edition, while wiki editors collaboratively
work on a next version, e.g. a 2019 edition. Editioning can provide
stability for complex software engineering scenarios utilizing an online
resource. Some software engineering teams, however, may choose to utilize
fresh dumps or data exports of the latest edition.
SEMANTIC WEB
A machine-utilizable lexicon could include a semantic model of its
contents and a SPARQL endpoint.
MACHINE-UTILIZABLE DEFINITIONS
Machine-utilizable definitions, available in a number of knowledge
representation formats, can be granular, detailed and nuanced.
There exist a large number of use cases for machine-utilizable
definitions. One use case is giving natural language processing
components the capability to interpret natural language semantically,
using automated reasoning to disambiguate lexemes, phrases and
sentences in context. Some contend that the best output after a
natural language processing component processes a portion of natural
language is each possible interpretation, perhaps weighted via statistics.
In this way, (1) natural language processing components could process
ambiguous language, (2) other components, e.g. automated reasoning
components, could narrow sets of hypotheses utilizing dialogue contexts,
(3) other components, e.g. automated reasoning components, could narrow
sets of hypotheses utilizing knowledgebase content, and (4)
mixed-initiative dialogue systems could also ask users questions to narrow
sets of hypotheses. Such disambiguation and interpretation would utilize
machine-utilizable definitions of senses of lexemes.
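The narrowing steps above could be sketched roughly as follows. This is only an illustration: the Hypothesis class, the sense numbers, the glosses, the weights and the two filter functions are all hypothetical, and real components would rely on reasoning over machine-utilizable definitions rather than on hard-coded rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One candidate interpretation: a lexeme sense plus a statistical weight."""
    lemma: str
    sense_id: int
    gloss: str
    weight: float


def normalize(hypotheses):
    """Rescale weights so they sum to 1; return [] if nothing is left."""
    total = sum(h.weight for h in hypotheses)
    if total == 0:
        return []
    return [Hypothesis(h.lemma, h.sense_id, h.gloss, h.weight / total)
            for h in hypotheses]


def narrow(hypotheses, is_compatible):
    """Keep the hypotheses a downstream component accepts, then renormalize."""
    return normalize([h for h in hypotheses if is_compatible(h)])


# Step (1): the NLP component emits every interpretation, weighted.
# The senses and weights below are made up for illustration.
candidates = [
    Hypothesis("fly", 1, "travel through the air", 0.6),
    Hypothesis("fly", 2, "flee or run away", 0.3),
    Hypothesis("fly", 3, "wave in the wind (of a flag)", 0.1),
]

# Step (2): a stand-in for a dialogue-context reasoner that rules out sense 2.
after_dialogue = narrow(candidates, lambda h: h.sense_id != 2)

# Step (3): a stand-in for a knowledgebase reasoner that rules out sense 3.
after_knowledge = narrow(after_dialogue, lambda h: h.sense_id != 3)
```

If more than one hypothesis survived steps (2) and (3), a mixed-initiative dialogue system could ask the user to choose among the remaining glosses, which is step (4).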
CONJUGATION, DECLENSION AND THE URL-BASED SPECIFICATION OF LEXEMES AND
LEXICAL PHRASES
A grammatical category [2] is a property of items within the grammar of a
language; it has a number of possible values, sometimes called grammemes,
which are normally mutually exclusive within a given category. Verb
conjugation, for example, may be affected by the grammatical categories of:
person, number, gender, tense, aspect, mood, voice, case, possession,
definiteness, politeness, causativity, clusivity, interrogativity,
transitivity, valency, polarity, telicity, volition, mirativity,
evidentiality, animacy, associativity, pluractionality, reciprocity,
agreement, polypersonal agreement, incorporation, noun class, noun
classifiers, and verb classifiers in some languages [3].
By combining the grammatical categories from each and every language
together, we can precisely specify a conjugation or declension. For
example, the URL:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&…
includes an edition, a language of a lemma, a lemma, a lexical category,
and conjugates (with ellipses) the verb in a language-independent manner.
We can further specify, via URL query string, the semantic sense of a
grammatical element:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&...&sense=4
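As a minimal sketch of how a client might assemble such a lookup URL, assuming the parameter names used in the examples above: the function build_lookup_url is hypothetical, the service at machine.wiktionary.org is only proposed, and the sketch passes just the five grammatical categories spelled out above rather than guessing at the elided ones.

```python
from urllib.parse import urlencode

# Proposed endpoint taken from the examples above; it does not exist yet.
BASE = "https://machine.wiktionary.org/wiki/lookup.php"


def build_lookup_url(edition, language, lemma, category, grammemes, sense=None):
    """Build a lookup URL from a lemma plus grammatical category values.

    `grammemes` maps a grammatical category (e.g. "tense") to its value
    (e.g. "past"); the insertion order of the dict is preserved in the URL.
    """
    params = [
        ("edition", edition),
        ("language", language),
        ("lemma", lemma),
        ("category", category),
    ]
    params.extend(grammemes.items())
    if sense is not None:
        params.append(("sense", sense))
    return BASE + "?" + urlencode(params)


flew = {
    "person": "first-person",
    "number": "singular",
    "tense": "past",
    "aspect": "past_simple",
    "mood": "indicative",
}

form_url = build_lookup_url(2018, "en-US", "fly", "verb", flew)
sense_url = build_lookup_url(2018, "en-US", "fly", "verb", flew, sense=4)
```

Here sense_url differs from form_url only by the trailing sense=4 parameter.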
Specifying a grammatical item fully in a URL query string, as indicated in
the previous examples, could result in a redirection to another URL.
That is, the URL:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&…
could redirect to:
https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678
or to:
https://machine.wiktionary.org/wiki/2018/12345678/
and the URL with a specified semantic sense:
https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&...&sense=4
could redirect to:
https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678&sense=4
or to:
https://machine.wiktionary.org/wiki/2018/12345678/4/
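The redirection described above could be sketched on the server side as a table lookup. Everything in this sketch is an assumption made for illustration: the FORM_IDS table, the resolve function, and the use of 12345678 (the placeholder ID from the examples) as the ID of the one form listed. Only the five categories spelled out in the examples are included.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical index: (language, lemma, lexical category, grammemes) -> form ID.
FORM_IDS = {
    ("en-US", "fly", "verb", frozenset({
        ("person", "first-person"),
        ("number", "singular"),
        ("tense", "past"),
        ("aspect", "past_simple"),
        ("mood", "indicative"),
    })): 12345678,
}


def resolve(lookup_url):
    """Return the path-style URL a lookup URL would redirect to, or None."""
    query = dict(parse_qsl(urlsplit(lookup_url).query))
    edition = query.pop("edition")
    sense = query.pop("sense", None)
    # Whatever remains after the three fixed keys is treated as grammemes.
    key = (query.pop("language"), query.pop("lemma"), query.pop("category"),
           frozenset(query.items()))
    form_id = FORM_IDS.get(key)
    if form_id is None:
        return None
    target = f"https://machine.wiktionary.org/wiki/{edition}/{form_id}/"
    if sense is not None:
        target += f"{sense}/"
    return target


lookup = ("https://machine.wiktionary.org/wiki/lookup.php?edition=2018"
          "&language=en-US&lemma=fly&category=verb&person=first-person"
          "&number=singular&tense=past&aspect=past_simple&mood=indicative")

form_target = resolve(lookup)
sense_target = resolve(lookup + "&sense=4")
```

Because the grammemes are compared as a set, the order of the query parameters does not affect which ID is found.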
The URL
https://machine.wiktionary.org/wiki/2018/12345678/ is intended to
indicate a conjugation or declension with one or more meanings or senses.
The URL
https://machine.wiktionary.org/wiki/2018/12345678/4/ is intended
to indicate a specific sense or definition of a conjugation or declension.
A benefit of having URLs for both conjugations or declensions and for
specific meanings or senses is that HTTP request headers [4] can specify
the languages and content types of the output desired for a particular URL.
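For instance, a client could state its preferences with the standard Accept and Accept-Language request headers. The sketch below only constructs the request and does not send it, since the service is merely proposed. The helper make_request is hypothetical, and the choice of application/ld+json and French is an assumption about what such a service might offer.

```python
from urllib.request import Request


def make_request(url, content_type, language):
    """Build (but do not send) a request asking for a given format and language."""
    return Request(url, headers={
        "Accept": content_type,        # desired content type of the output
        "Accept-Language": language,   # desired language of the output
    })


# Ask for sense 4 of the form with placeholder ID 12345678,
# as JSON-LD with descriptions in French.
req = make_request(
    "https://machine.wiktionary.org/wiki/2018/12345678/4/",
    "application/ld+json",
    "fr",
)
```

The same URL could then serve, say, an HTML page to a browser and a knowledge representation format to a software component.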
The provided examples are intended to indicate that each complete,
language-independent conjugation or declension can have an ID number as
opposed to each headword or lemma. Instead of one ID number for all
variations of “fly”, there is one ID number for “flew”, another for “have
flown”, another for “flying”, and one for each conjugation or declension.
Reasons for indexing the conjugations and declensions instead of
traditional headwords or lemmas include that, at least for some knowledge
representation formats, the formal semantics of the definitions vary per
conjugation or declension.
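A rough sketch of such an index, in which definitions attach to individual forms rather than to the lemma, might look like this. The Form class, the ID numbers 1001 to 1003, and the placeholder definition strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    """One conjugation or declension, with its own ID and its own senses."""
    form_id: int
    lemma: str
    written: str
    senses: dict = field(default_factory=dict)  # sense number -> definition


# Hypothetical IDs: one per form, not one for the lemma "fly" as a whole.
forms = {
    1001: Form(1001, "fly", "flew", {1: "placeholder definition for 'flew'"}),
    1002: Form(1002, "fly", "have flown", {1: "placeholder definition for 'have flown'"}),
    1003: Form(1003, "fly", "flying", {1: "placeholder definition for 'flying'"}),
}


def forms_of(lemma):
    """Return the IDs of all indexed forms that share a lemma, in ascending order."""
    return sorted(f.form_id for f in forms.values() if f.lemma == lemma)


fly_ids = forms_of("fly")
```

The lemma remains available as an attribute for grouping, but the definition for sense 1 of "flew" is stored separately from the definition for sense 1 of "flying".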
CONCLUSION
This letter broached machine wiktionaries and some of the services which
could exist at
https://machine.wiktionary.org/ . It is my hope that this
letter indicated a few of the many exciting topics with regard to
machine-utilizable crowdsourced lexicons.
REFERENCES
[1]
https://en.wiktionary.org/wiki/Index:All_languages#List_of_languages
[2]
https://en.wikipedia.org/wiki/Grammatical_category
[3]
https://en.wikipedia.org/wiki/Grammatical_conjugation
[4]
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Request_fields