New subject: NLP for on-wiki search

18 Jul 2018

Hi everyone,

I've got an update on the NLP project selection. We've narrowed things down
to a handful of projects we could work on with a consultant, and a handful
we could work on internally.

David, Erik, and I reviewed a selection of the most promising and/or most
interesting projects and gave each a very rough cost estimate based on how
large a relative impact it would have, how technically hard it would be,
and how difficult the UI aspect would be. The scores
are not definitive, but helped guide the discussion. You can see the list
of projects we looked at and more details of the scoring on MediaWiki
<https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search#Current_Recommendations>.

For the possibility of working with an outside consultant, we also
considered how easily separated each project would be from our overall
system (making it easier for someone new to get up to speed), how projects
feed into each other, how easily we could work on projects ourselves (like,
we know pretty much what to do, we just have to do it), etc.

Our current *recommendation for an outside consultant* would be to start
with (1) *spelling correction/did you mean improvements,* with an option to
extend the project to include either (2) *"more like" suggestion
improvements,* or (3) *query reformulation mining,* specifically for typo
corrections.

For spelling correction (#1), we are envisioning an approach that
integrates generic intra-word and inter-word statistical models, optional
language-specific features, and explicit weighted corrections. We believe
we could mine redirects flagged as typo correction for explicit
corrections, and the query reformulation mining (#3) would also provide
frequency-weighted explicit corrections. Our hope is that a system built
initially for English would be readily applicable to other alphabetic
languages, most probably other Indo-European languages, based on statistics
available from Elastic, and that some elements of the system could be
applied to non-alphabetic languages and to languages that are
typologically <https://en.wikipedia.org/wiki/Morphological_typology> dissimilar
to Indo-European languages.
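
To make the shape of the spelling correction idea above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of the planned system: the word counts, bigram counts, and explicit corrections are made-up toy data, all function names are hypothetical, and plain edit-distance candidate generation stands in for a real intra-word model. A smoothed word-bigram count stands in for the inter-word model, and the explicit corrections play the role of weighted pairs mined from typo redirects or query reformulations.

```python
import math
from collections import Counter

# Hypothetical toy data. A real system would presumably derive these from
# index statistics and from mined redirects or query reformulations.
WORD_COUNTS = Counter({"the": 500, "search": 120, "engine": 80, "sea": 40, "starch": 5})
BIGRAM_COUNTS = Counter({("search", "engine"): 60, ("the", "search"): 30, ("the", "sea"): 20})
EXPLICIT = {"serach": {"search": 25}, "teh": {"the": 90}}  # typo -> {correction: weight}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transposition, substitution, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def candidates(word):
    """Known words within one edit, explicit corrections, and the word itself if known."""
    found = {w for w in edits1(word) if w in WORD_COUNTS}
    found.update(EXPLICIT.get(word, {}))
    if word in WORD_COUNTS:
        found.add(word)
    return found


def score(word, cand, prev=None):
    """Sum of log scores: word frequency, bigram context, explicit correction weight."""
    vocab = len(WORD_COUNTS)
    total = sum(WORD_COUNTS.values())
    s = math.log((WORD_COUNTS[cand] + 1) / (total + vocab))  # add-one smoothing
    if prev is not None:
        s += math.log((BIGRAM_COUNTS[(prev, cand)] + 1) / (WORD_COUNTS[prev] + vocab))
    s += math.log(1 + EXPLICIT.get(word, {}).get(cand, 0))
    return s


def correct_query(query):
    """Correct each word left to right, using the previous output word as context."""
    out = []
    prev = None
    for word in query.lower().split():
        cands = candidates(word)
        best = max(cands, key=lambda c: score(word, c, prev)) if cands else word
        out.append(best)
        prev = best
    return " ".join(out)
```

With this toy data, `correct_query("teh serach engine")` yields `the search engine`, and a word with no known candidates is passed through unchanged. Language-specific features would be extra terms added to `score`.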

Looking at the rest of the list, (a) *wrong keyboard detection* seems like
something we should work on internally, since we already have a few good
ideas on how to approach it. (b) *Acronym support* is a pet peeve for
several members of the team, and seems to be straightforward to improve.
(c) *Automatic stemmer building* and (d) *automatic stop word generation*
aren't so much projects we should work on as things we should research to
see if there are already tools or lists out there we could use to make the
projects much easier.
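
For readers unfamiliar with the wrong keyboard problem in (a), the sketch below shows one simple way it could be approached. It is purely illustrative and does not describe the team's actual ideas, which are not spelled out here. It assumes only the standard US QWERTY and Russian ЙЦУКЕН layouts; the function name and the `known_words` set are hypothetical stand-ins for a lookup against the index.

```python
# Keys in the same physical positions on US QWERTY and Russian layouts.
EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"

RU_TO_EN = str.maketrans(RU, EN)
EN_TO_RU = str.maketrans(EN, RU)


def remap_if_wrong_keyboard(query, known_words):
    """Return the query remapped to the other layout if that yields only known words.

    known_words is a hypothetical stand-in for checking terms against the index.
    """
    query = query.lower()
    if all(w in known_words for w in query.split()):
        return query  # already looks fine as typed
    for table in (RU_TO_EN, EN_TO_RU):
        remapped = query.translate(table)
        if remapped != query and all(w in known_words for w in remapped.split()):
            return remapped
    return query
```

For example, `remap_if_wrong_keyboard("цшлшзувшф", {"wikipedia"})` gives `wikipedia`, and `remap_if_wrong_keyboard("ghbdtn", {"привет"})` gives `привет`.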

Comments and questions here or on the talk page are welcome.

Cheers,
—Trey

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation

On Tue, May 15, 2018 at 11:30 AM, Trey Jones <tjones(a)wikimedia.org> wrote:

...
  Hi everyone,

 I just finished putting together an annotated list of potential
 applications of natural language processing to on-wiki search
 <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applications_of_Natural_Language_Processing_to_On-Wiki_Search>.
 There are dozens and dozens of ideas there—including many that are
 interesting but probably not practical. If you have any additional ideas,
 questions, suggestions, recommendations, or preferences, please
 share!—either on the mailing list or on the talk page.

 The goal is to narrow it down to one or two things to pursue over the next
 two to four quarters, along with other projects we are working on.

 Thanks!
 —Trey

 Trey Jones
 Sr. Software Engineer, Search Platform
 Wikimedia Foundation
