I added a short note on word embeddings to the talk page (going beyond the
word2vec already mentioned on the page). Just to extol one of fastText's
virtues: training a fastText model is extremely easy.
Thanks,
--justin
On Wed, Jul 18, 2018 at 12:05 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
Hi everyone,
I've got an update on the NLP project selection. We've narrowed things
down to a handful of projects we could work on with a consultant, and a
handful we could work on internally.
David, Erik, and I reviewed a selection of the most promising-seeming
and/or most interesting projects and gave them a very rough cost estimate
based on how big a relative impact they would have, how technologically
hard they would be, and how difficult the UI aspect would be. The scores
are not definitive, but helped guide the discussion. You can see the list
of projects we looked at and more details of the scoring on MediaWiki
<https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search#Current_Recommendations>.
For the possibility of working with an outside consultant, we also
considered how easily separated each project would be from our overall
system (making it easier for someone new to get up to speed), how projects
feed into each other, how easily we could work on projects ourselves (like,
we know pretty much what to do, we just have to do it), etc.
Our current *recommendation for an outside consultant* would be to start
with (1) *spelling correction/did you mean improvements,* with an option
to extend the project to include either (2) *"more like" suggestion
improvements,* or (3) *query reformulation mining,* specifically for typo
corrections.
For spelling correction (#1), we are envisioning an approach that
integrates generic intra-word and inter-word statistical models, optional
language-specific features, and explicit weighted corrections. We believe
we could mine redirects flagged as typo correction for explicit
corrections, and the query reformulation mining (#3) would also provide
frequency-weighted explicit corrections. Our hope is that a system built
initially for English would be readily applicable to other alphabetic
languages, most probably other Indo-European languages, based on statistics
available from Elastic; and that some elements of the system could be
applied to non-alphabetic languages and languages that are
typologically <https://en.wikipedia.org/wiki/Morphological_typology> dissimilar
to Indo-European languages.
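
To make the combination of signals described above a bit more concrete, here
is a minimal sketch of how candidate corrections might be scored. It is an
illustration under stated assumptions, not the planned design: the toy counts,
the names (UNIGRAMS, BIGRAMS, EXPLICIT, correct) and the additive log scoring
are all hypothetical, and plain edit-distance-1 candidate generation stands in
for a real intra-word statistical model.

```python
import math
from collections import Counter

# Toy statistics; every number here is made up for illustration.
UNIGRAMS = Counter({"the": 50, "search": 20, "engine": 10, "engines": 5})
BIGRAMS = Counter({("search", "engine"): 8, ("the", "search"): 6})
# Explicit corrections with hypothetical frequency weights, e.g. as they
# might be mined from typo redirects or query reformulations.
EXPLICIT = {"serach": {"search": 12}}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def candidates(word):
    """Known words near `word`, plus explicit corrections, plus word itself."""
    cands = {w for w in edits1(word) if w in UNIGRAMS}
    cands.update(EXPLICIT.get(word, {}))
    if word in UNIGRAMS:
        cands.add(word)
    return cands or {word}  # fall back to the input if nothing is known


def score(cand, word, prev):
    """Sum of log-smoothed word, context, and explicit-correction signals."""
    s = math.log(UNIGRAMS[cand] + 1)
    if prev is not None:
        s += math.log(BIGRAMS[(prev, cand)] + 1)  # inter-word context
    s += math.log(EXPLICIT.get(word, {}).get(cand, 0) + 1)
    return s


def correct(query):
    """Greedy left-to-right correction of a whitespace-tokenized query."""
    out, prev = [], None
    for word in query.lower().split():
        best = max(sorted(candidates(word)), key=lambda c: score(c, word, prev))
        out.append(best)
        prev = best
    return " ".join(out)


print(correct("serach engine"))
```

Language-specific features would presumably enter as further terms in score;
they are left out here to keep the sketch short.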
Looking at the rest of the list, (a) *wrong keyboard detection* seems
like something we should work on internally, since we already have a few
good ideas on how to approach it. (b) *Acronym support* is a pet peeve
for several members of the team, and seems to be straightforward to
improve. (c) *Automatic stemmer building* and (d) *automatic stop word generation*
aren't so much projects we should work on as things we should research to
see if there are already tools or lists out there we could use to make the
projects much easier.
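
For readers unfamiliar with wrong keyboard detection (a), one common way to
frame it is: re-map the typed characters to another keyboard layout and check
whether the result looks more like real words. The sketch below does this for
a query typed on a US QWERTY layout when the standard Russian layout was
intended. It is only an assumption about what such a check could look like,
not necessarily one of the ideas the team has in mind; the KNOWN_WORDS set
and the suggest function are hypothetical stand-ins for real index statistics.

```python
# Key-for-key correspondence between US QWERTY and the standard Russian
# layout (lowercase letter rows only).
EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"
EN_TO_RU = str.maketrans(EN, RU)

# Hypothetical vocabulary; in practice this would come from index statistics.
KNOWN_WORDS = {"википедия", "поиск", "hello"}


def known_count(text):
    """Number of whitespace-separated tokens found in the vocabulary."""
    return sum(1 for token in text.split() if token in KNOWN_WORDS)


def suggest(query):
    """Return the re-mapped query if it matches more known words, else None."""
    typed = query.lower()
    converted = typed.translate(EN_TO_RU)
    if known_count(converted) > known_count(typed):
        return converted
    return None


print(suggest("dbrbgtlbz"))
```

A real implementation would need the reverse mapping, more layouts, and a
softer measure than exact vocabulary hits, but the basic comparison stays the
same.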
Comments and questions here or on the talk page are welcome.
Cheers,
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
On Tue, May 15, 2018 at 11:30 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
Hi everyone,
I just finished putting together an annotated list of potential
applications of natural language processing to on-wiki search
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search>.
There are dozens and dozens of ideas there—including many that are
interesting but probably not practical. If you have any additional ideas,
questions, suggestions, recommendations, or preferences, please
share!—either on the mailing list or on the talk page.
The goal is to narrow it down to one or two things to pursue over the
next two to four quarters, along with other projects we are working on.
Thanks!
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
_______________________________________________
Discovery mailing list
Discovery(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery