New subject: Lexical datas and automated learning – where it is answered to « I don’t believe in Wikidata senses developpment »

20 Sep 2019

Dear all,
Thank you for your efforts. To learn more about word embeddings and semantic similarity,
please refer to our research group's survey on the topic, available at
https://www.sciencedirect.com/science/article/pii/S0952197619301745. If you would like
us to work on using these techniques to enrich Lexicographical Data on Wikidata, we would
be honoured to do so. However, we would face two main problems. The first is funding, and
the second is that we need people to validate the information returned by these two
techniques and adjust it if needed.
Yours Sincerely,
Houcemeddine Turki (he/him)
Medical Student, Faculty of Medicine of Sfax, University of Sfax, Tunisia
Undergraduate Researcher, UR12SP36
GLAM, Research and Education Coordinator, Wikimedia TN User Group
Member, Wiki Project Med
Member, WikiIndaba Steering Committee
Member, Wikimedia and Library User Group Steering Committee
Co-Founder, WikiLingua Maghreb
____________________
+21629499418

-------- Original message --------
From: Thomas Douillard <thomas.douillard(a)gmail.com>
Date: 2019/09/20 12:08 (GMT+01:00)
To: "Discussion list for the Wikidata project."
<wikidata(a)lists.wikimedia.org>
Subject: [Wikidata] Lexical datas and automated learning – where it is answered to « I
don’t believe in Wikidata senses developpment »

I recently read the French sentence « Je ne crois pas au développement des sens. »
(translation: "I don’t believe senses will develop much"), following links from a Wikidata
Weekly Summary to the slides of a French meeting about Wikidata lexicographical data. I
do believe in it, regardless of the arguments presented in the slides, and I am writing
this email to try to explain why.

I’m curious to know whether there is already some work on the automated discovery of
lexicographical data and senses with the help of Wikidata items.

There are tools for automatically tagging terms with the corresponding Wikidata item; they
have appeared on this mailing list and/or in the Wikidata Weekly Summaries.
There are also methods that can discover senses in texts using only the terms themselves,
with no reference to any external « sense », such as
https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a2… and that
can distinguish several usages of the same word according to the context.
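
As a rough illustration of how context can separate usages of the same word, here is a
minimal sketch. It assumes toy, hand-written 3-dimensional vectors (real ones would come
from a trained word2vec or fastText model), and the names TOY_VECTORS, context_vector and
same_usage are hypothetical, chosen only for this example.

```python
import math

# Toy word vectors invented for illustration only (an assumption, not real data).
# In practice these would be loaded from a trained word2vec or fastText model.
TOY_VECTORS = {
    "river": (0.9, 0.1, 0.0),
    "water": (0.8, 0.2, 0.1),
    "shore": (0.9, 0.0, 0.1),
    "money": (0.1, 0.9, 0.1),
    "loan": (0.0, 0.9, 0.2),
    "account": (0.1, 0.8, 0.1),
}


def context_vector(context_words):
    """Average the vectors of the known words surrounding a target word."""
    known = [TOY_VECTORS[w] for w in context_words if w in TOY_VECTORS]
    if not known:
        return None  # nothing in the context is covered by the toy vocabulary
    dims = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dims))


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def same_usage(context_a, context_b, threshold=0.8):
    """Guess whether two occurrences of a word share a usage.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    va, vb = context_vector(context_a), context_vector(context_b)
    if va is None or vb is None:
        return False
    return cosine(va, vb) >= threshold


# Two occurrences of "bank": one near water words, one near finance words.
print(same_usage(["river", "water"], ["shore", "water"]))
print(same_usage(["river", "water"], ["money", "loan"]))
```

Averaging context vectors is only one simple way to compare usages; clustering many such
context vectors would be the natural next step for discovering senses without any
external inventory.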

Wikidata lexicographical data and Wikibase items could close the loop between the two
methods and allow us to semi-automatically build tools that annotate texts with Wikidata
items if there is something relevant in Wikidata and, if there is not, suggest adding
data to Wikidata, whether it is a missing item or a missing sense for the term.
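
The loop described above could be sketched as follows. This is only an illustration under
stated assumptions: LEXICON is a hypothetical in-memory stand-in for Wikidata
lexicographical data, the identifiers are placeholders rather than real Wikidata IDs, and
matching on shared cue words stands in for whatever sense-discrimination method would
really be used.

```python
# Hypothetical stand-in for Wikidata lexicographical data (an assumption).
# Sense and item identifiers are placeholders, not real Wikidata IDs.
LEXICON = {
    "bank": [
        {
            "sense": "L-EXAMPLE-S1",
            "item": "Q-EXAMPLE-RIVERBANK",
            "cues": {"river", "water", "shore"},
        },
        {
            "sense": "L-EXAMPLE-S2",
            "item": "Q-EXAMPLE-FINANCE",
            "cues": {"money", "loan", "account"},
        },
    ],
}


def annotate(term, context_words, lexicon=LEXICON):
    """Annotate a term with a known sense, or suggest what is missing."""
    context = set(context_words)
    senses = lexicon.get(term)
    if senses is None:
        # The term is not covered at all: suggest creating an entry.
        return {"term": term, "action": "suggest_new_entry"}
    # Pick the sense whose cue words overlap most with the context.
    best = max(senses, key=lambda s: len(s["cues"] & context))
    if not best["cues"] & context:
        # The term is known, but no recorded sense fits this context.
        return {"term": term, "action": "suggest_new_sense"}
    return {
        "term": term,
        "action": "annotate",
        "sense": best["sense"],
        "item": best["item"],
    }


print(annotate("bank", ["river", "fish"]))
print(annotate("bank", ["blood", "donor"]))
print(annotate("lexeme", ["word"]))
```

The two "suggest" outcomes are where user feedback would come in: a person confirms
whether a new item or a new sense is really needed before anything is added.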

It may even be possible to store word embeddings generated by word2vec-style methods in
Wikidata senses.

In conclusion, I think Wikidata senses will be used because they make it possible to close
a gap. This does not depend only on strong involvement from a traditional volunteer
lexicographic community. If researchers from the language community dive into this and
develop algorithms and easy-to-use tools to share their lexicographical data in Wikidata,
there could be a very positive feedback loop in which a lot of data ends up being added to
Wikidata, the stored data helps the algorithms enrich text annotations, for example, and
missing data is semi-automatically added thanks to user feedback.

This is all just wishful thinking, but I thought it deserved to be shared. Hopefully
this will launch at least a thread of ideas and comments here :)

Thomas
