The on-wiki version of this newsletter is available here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-17
--
Last week we discussed how to implement paradigms
<https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-09-10> in
Wikifunctions. This week, let’s discuss a few ideas on how this could be
used.
One may ask why this is useful, given that we are collecting all the
different forms in the lexicographic data in Wikidata anyway. We don’t need
to generate the forms if we have a full set of forms in Wikidata, surely?
There are several possible use cases:
First, we will probably never achieve full coverage in Wikidata of all
forms in all languages. In some languages, the number of forms may be
prohibitively high, and we, like every other dictionary, might need to make
a selection of forms to store. Often the forms not stored are highly
regular.
Second, even if we have really good coverage, occasionally you will need to
introduce words that are not in the dictionary: when displaying neologisms,
when generating a new lexeme by conversion from another grammatical
category (for example: verbing nouns in English, or using place names to
make demonyms), or when using loanwords from other languages. Fortunately,
such words are often regular, and having smart paradigms as described last
time can take us pretty far.
Third, the paradigms can be used in Wikidata to connect to the actual
lexemes. For example, on a lexeme such as *"cat
<https://www.wikidata.org/wiki/Lexeme:L7>"* we could link to the paradigm
that we developed last week, either the add s
<https://notwikilambda.toolforge.org/wiki/Z10110> function or the English
regular plural <https://notwikilambda.toolforge.org/wiki/Z10132> function.
Linking the lexeme with the function allows individual forms to be
re-generated, which in turn means they can be checked for correctness, thus
ensuring data quality. The English regular plural function can tell us that
the plural for *"pasty <https://www.wikidata.org/wiki/Lexeme:L24858>"*
should be *"pasties"*, but that Wikidata lexeme previously defined it as
*"pastiest
<https://www.wikidata.org/w/index.php?title=Lexeme:L24858&oldid=1033449948#F2>"*.
The plural of *"strawman"* should be *"strawmen"*, not
*"strawmans
<https://www.wikidata.org/w/index.php?title=Lexeme:L227827&oldid=1069374578>"*;
the plural for *"Frenchwoman"* should be *"Frenchwomen"*, not
*"Frenchwoman
<https://www.wikidata.org/w/index.php?title=Lexeme:L34524&oldid=1392427375>"*.
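To make the checking idea concrete, here is a minimal sketch in Python. It is not the Wikifunctions implementation: the function names and the three suffix rules are assumptions chosen for illustration, and they cover only part of what a real English regular plural paradigm would need. In particular, this sketch handles only suffix rules, so it does not cover the *-man* / *-men* examples.

```python
VOWELS = "aeiou"


def regular_plural(lemma):
    """Hypothetical, simplified English regular plural (illustrative rules only)."""
    # Sibilant endings take "es": box -> boxes, church -> churches.
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        return lemma + "es"
    # Consonant + "y" becomes "ies": pasty -> pasties (but day -> days).
    if len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ies"
    # Default case: add "s".
    return lemma + "s"


def check_plural(lemma, stored_plural):
    """Compare a stored plural with the generated one.

    Returns (matches, expected) so that a mismatch can be flagged for review.
    """
    expected = regular_plural(lemma)
    return stored_plural == expected, expected


print(check_plural("pasty", "pastiest"))  # (False, 'pasties')
print(check_plural("cat", "cats"))        # (True, 'cats')
```

A mismatch is only a hint that a human should look at the entry, since the stored form may be a legitimate irregular plural.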
One question is: if we have a paradigm that can create the forms, why even
create and store the forms in Wikidata in the first place? That’s a great
question, and a decision that can indeed be revisited by the community.
Personally, I think we need both: forms stored explicitly in Wikidata, and
generative paradigms. Without the former, it's not clear how we would
handle irregular forms. Would the onus lie on the paradigms? That seems
messy. Likewise, paradigms are crucial when, for example, a Lexeme has
thousands of possible forms. If these forms are always regular, the
community might decide not to materialize them all, especially if many
Lexemes cleave to the same regular morphological pattern.
This also seems to be the case for English nouns: almost all of the English
nouns in Wikidata have two forms, even though one could argue that English
nouns have four forms (including the possessive forms); however, the English
possessive <https://en.wikipedia.org/wiki/English_possessive> forms seem to
be generated so regularly that, so far, Wikidata contributors seem to
consider them unnecessary and usually omit them.
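As an illustration of how regular the possessive forms are, the following Python sketch derives all four forms of an English noun from the two that are typically stored. The function names and the dictionary keys are hypothetical, and the rule is the common simplified one: a plural that already ends in "s" takes only an apostrophe, and every other form takes "'s".

```python
def possessive(form, is_plural):
    """Hypothetical helper: build the possessive of a noun form."""
    # Plurals already ending in "s" take a bare apostrophe: cats -> cats'.
    if is_plural and form.endswith("s"):
        return form + "'"
    # Singulars and irregular plurals take "'s": cat -> cat's, women -> women's.
    return form + "'s"


def four_forms(singular, plural):
    """Expand the two stored forms into the four forms discussed above."""
    return {
        "singular": singular,
        "plural": plural,
        "singular possessive": possessive(singular, is_plural=False),
        "plural possessive": possessive(plural, is_plural=True),
    }


print(four_forms("cat", "cats"))
print(four_forms("Frenchwoman", "Frenchwomen"))
```

Because the possessive can be derived from the stored forms alone, it is a natural candidate for a paradigm rather than for explicit storage.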
Fourth, the paradigms can also be used to propose a starting point when
entering the data. Imagine the Wikidata Lexeme Forms
<https://www.wikidata.org/wiki/Wikidata:Wikidata_Lexeme_Forms> allowing you
to select a function on Wikifunctions that, given the lemma, generates all
likely forms for an entry. The Lexeme Forms tool has already improved the
creation of Lexemes considerably, making the entries much more consistent
and expansive. If, in addition, we could automatically generate most of
the forms, data entry would become much faster and, at the same time,
less prone to errors.
Besides all these immediate improvements, there might be many further
advantages. For example, storing an offline dictionary would require much
less storage space if we use paradigms. Developing paradigms for currently
under-resourced languages might create aids for working with those
languages. Having a knowledge base of paradigms across languages may be
interesting from the perspective of linguistic research.
Once Wikifunctions has launched, we hope that the community will develop a
library of morphological paradigms and their connection with the
lexicographical data in Wikidata. Besides this being a very helpful step on
our path to Abstract Wikipedia, we think that this will considerably expand
the content of the lexicographical data in Wikidata. That, together with
enabling access to the lexicographic data from within the Wiktionaries
<https://phabricator.wikimedia.org/T235901>, will significantly empower
the contributors to Wiktionary, particularly to the smaller Wiktionaries
and to the languages with fewer contributors in all Wiktionaries.
Thanks to User:YULdigitalpreservation
<https://www.wikidata.org/wiki/User:YULdigitalpreservation>, who created
EntitySchema E327 <https://www.wikidata.org/wiki/EntitySchema:E327> on
Wikidata for English Nouns with Genitives, and to User:VIGNERON
<https://meta.wikimedia.org/wiki/User:VIGNERON> for creating French plural
morphology on NotWikiLambda, and to User:Strobilomyces
<https://en.wikipedia.org/wiki/User:Strobilomyces> for collaborating on
that.