I added a short note on word embeddings to the talk page (beyond the word2vec already mentioned there). Just to extol its virtues: training a fastText model is extremely easy.
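
For anyone curious, here's roughly what that looks like with the fasttext Python bindings. This is a minimal sketch; the corpus path and the example word are placeholders, not real files:

    import fasttext

    # Train skip-gram word vectors on a plain-text corpus
    # (one document per line); hyperparameters are the library defaults.
    model = fasttext.train_unsupervised('corpus.txt', model='skipgram')

    # Nearest neighbors by cosine similarity in the embedding space.
    print(model.get_nearest_neighbors('encyclopedia'))

Two lines of real code, and you get subword-aware vectors that handle out-of-vocabulary words for free.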

Thanks,
--justin

On Wed, Jul 18, 2018 at 12:05 PM, Trey Jones <tjones@wikimedia.org> wrote:

Hi everyone,


I've got an update on the NLP project selection. We've narrowed things down to a handful of projects we could work on with a consultant, and a handful we could work on internally.

David, Erik, and I reviewed a selection of the most promising and/or most interesting projects and gave each a very rough cost estimate based on its relative impact, its technological difficulty, and the difficulty of the UI work involved. The scores are not definitive, but they helped guide the discussion. You can see the list of projects we looked at, and more details of the scoring, on MediaWiki.

For the possibility of working with an outside consultant, we also considered how easily each project could be separated from our overall system (making it easier for someone new to get up to speed), how projects feed into one another, and how easily we could work on them ourselves (i.e., where we pretty much know what to do and just have to do it).

Our current recommendation for an outside consultant would be to start with (1) spelling correction / "did you mean" improvements, with an option to extend the project to include either (2) "more like" suggestion improvements or (3) query reformulation mining, specifically for typo corrections.

For spelling correction (#1), we envision an approach that integrates generic intra-word and inter-word statistical models, optional language-specific features, and explicit weighted corrections. We believe we could mine redirects flagged as typo corrections for explicit corrections, and the query reformulation mining (#3) would also provide frequency-weighted explicit corrections. Our hope is that a system built initially for English would be readily applicable to other alphabetic languages (most probably other Indo-European languages), based on statistics available from Elasticsearch, and that some elements of the system could be applied to non-alphabetic languages and languages typologically dissimilar to Indo-European ones.
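
To make the shape of that concrete, here is a toy sketch of how explicit weighted corrections might layer on top of a generic statistical fallback. Everything here is illustrative, not a design commitment: the data is fake, the names are made up, and the single-edit candidate generation (a la Norvig) stands in for a real intra-word model:

    from collections import Counter

    # Explicit corrections, e.g. mined from typo-correction redirects or
    # query reformulations, weighted by observed frequency (toy data).
    EXPLICIT = {'recieve': [('receive', 0.9)], 'teh': [('the', 0.95)]}

    # Word frequencies standing in for the statistical model; in practice
    # these would come from Elasticsearch term statistics.
    FREQ = Counter({'receive': 5000, 'the': 100000, 'ten': 8000})

    def edits1(word):
        """All strings one edit away from `word`."""
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        return set(deletes + replaces + inserts + transposes)

    def suggest(word):
        # Explicit weighted corrections win outright when available.
        if word in EXPLICIT:
            return max(EXPLICIT[word], key=lambda p: p[1])[0]
        # Otherwise fall back to the statistical model: the most
        # frequent known word within one edit.
        candidates = [w for w in edits1(word) if w in FREQ]
        return max(candidates, key=FREQ.get) if candidates else word

    print(suggest('recieve'))  # -> receive (explicit correction)
    print(suggest('thw'))      # -> the (statistical fallback)

A real system would of course score whole candidate rankings rather than pick a single winner, and would fold in the language-specific features mentioned above.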

Looking at the rest of the list, (a) wrong keyboard detection seems like something we should work on internally, since we already have a few good ideas on how to approach it (see the sketch below). (b) Acronym support is a pet peeve for several members of the team, and seems straightforward to improve. (c) Automatic stemmer building and (d) automatic stop word generation aren't so much projects to work on as things to research first, to see whether existing tools or lists could make them much easier.
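
For (a), the basic idea is well known: remap the query through the layout correspondence and check whether the remapped reading looks more like real language. A toy sketch for the QWERTY/ЙЦУКЕН case follows; the wordlist and function name are hypothetical, and a real version would score both readings against corpus statistics rather than use a fixed vocabulary:

    # Character-for-character correspondence between the two layouts;
    # the reverse table would handle the opposite mistake.
    QWERTY  = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
    RUSSIAN = "йцукенгшщзхъфывапролджэячсмитьбю"
    TO_RU = str.maketrans(QWERTY, RUSSIAN)

    KNOWN_RU = {'москва', 'россия'}  # stand-in for a real language model

    def fix_layout(query):
        remapped = query.lower().translate(TO_RU)
        # Suggest the remapped reading only if it hits known vocabulary.
        return remapped if remapped in KNOWN_RU else query

    print(fix_layout('vjcrdf'))  # 'москва' typed on a QWERTY layout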


Comments and questions here or on the talk page are welcome.

Cheers,
—Trey

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation


On Tue, May 15, 2018 at 11:30 AM, Trey Jones <tjones@wikimedia.org> wrote:
Hi everyone,

I just finished putting together an annotated list of potential applications of natural language processing to on-wiki search. There are dozens and dozens of ideas there, including many that are interesting but probably not practical. If you have any additional ideas, questions, suggestions, recommendations, or preferences, please share, either on the mailing list or on the talk page!

The goal is to narrow it down to one or two things to pursue over the next two to four quarters, along with other projects we are working on.

Thanks!
—Trey

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation




_______________________________________________
Discovery mailing list
Discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery