Perhaps I'll explain this a bit better…
Words can be converted into a vector representation by a word2vec algorithm [1]. After conversion, each word becomes a point in a high-dimensional space. Relations between words are then vectors between such points. Similar (or related) relations can be found by operations on such vectors, or sets of vectors. This is often visualized as "queen is to king as woman is to man", and similar analogies.
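A minimal sketch of the idea, using made-up 3-dimensional vectors (real word2vec embeddings are learned from a corpus and have hundreds of dimensions, so both the numbers and the tiny vocabulary here are illustration-only assumptions):

```python
import numpy as np

# Toy embeddings; the values are invented for illustration, not learned.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),  # unrelated distractor word
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The relation "royalty" is the vector king - man; adding it to woman
# should land near queen ("queen is to king as woman is to man").
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # → queen
```

With real embeddings the same arithmetic is usually done over the whole vocabulary, keeping the nearest neighbour by cosine similarity.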
Some relations are quite obvious and common, but some relations simply do not exist. If we can build a probability model over relations (a regression model), then we can estimate the probability of observing a specific relation, and thus be able to say "this does not seem to be a probable word". (Typically one of several sequence models, such as a recurrent neural network [2], would be used for the estimation, and triplet loss [3] for the training phase.)
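The training-phase part can be sketched as follows. This is an assumed, minimal formulation of triplet loss over relation vectors: an anchor relation should end up closer to a similar ("positive") relation than to a dissimilar ("negative") one, by at least some margin. The vectors and margin below are illustrative, not from any real model:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss with squared-Euclidean distances.

    Zero when the positive is already closer to the anchor than the
    negative by at least `margin`; positive otherwise, which pushes
    training to pull similar relations together and dissimilar apart.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Good triplet: positive near the anchor, negative far away → loss 0.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # → 0.0
```

A relation that "does not exist" in the training data keeps ending up on the negative side of such triplets, so at inference time its distance to known, probable relations can serve as the improbability signal.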
It would be like having a "spell right"-metric for text fragments.
Note that this isn't quite as easy as described: words may have multiple interpretations, and that makes it difficult to build a stable vector representation. An example is "car", which is typically something you drive on a road, but it can also be part of a train, or a toy.
[1] https://en.wikipedia.org/wiki/Word2vec
[2] https://en.wikipedia.org/wiki/Recurrent_neural_network
[3] https://en.wikipedia.org/wiki/Triplet_loss
On Sun, May 19, 2019 at 2:55 PM John Erling Blad jeblad@gmail.com wrote:
Microsoft has unveiled an idea about a grammar and style tool for Word. [1] I proposed something similar for detecting problematic grammatical constructs in the content translation tools. [2] That was a couple of years ago now, and I have closed the task.
[1] https://venturebeat.com/2019/05/06/microsoft-debuts-ideas-in-word-a-grammar-... [2] https://phabricator.wikimedia.org/T162525