In weighing whether to work with an outside consultant, we also considered how easily each project could be separated from our overall system (making it easier for someone new to get up to speed), how the projects feed into each other, and how easily we could work on each project ourselves (that is, cases where we know pretty much what to do and just have to do it), among other factors.
Our current recommendation for an outside consultant would be to start with (1) spelling correction/"did you mean" improvements, with an option to extend the project to include either (2) "more like" suggestion improvements, or (3) query reformulation mining, specifically for typo corrections.
For spelling correction (#1), we envision an approach that integrates generic intra-word and inter-word statistical models, optional language-specific features, and explicit weighted corrections. We believe we could mine redirects flagged as typo corrections for explicit corrections, and the query reformulation mining (#3) would also provide frequency-weighted explicit corrections. Our hope is that a system built initially for English would be readily applicable to other alphabetic languages, most probably other Indo-European languages, based on statistics available from Elastic; and that some elements of the system could be applied to languages with non-alphabetic writing systems and to languages that are typologically dissimilar to Indo-European languages.
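To make the idea of combining statistical evidence with explicit weighted corrections more concrete, here is a minimal sketch in Python. It is not our design: it uses simple single-edit candidate generation plus word frequencies as a stand-in for the intra-word model, leaves out the inter-word model entirely, and all names and data (`WORD_FREQ`, `EXPLICIT`, `explicit_weight`, `suggest`) are hypothetical. In practice the frequencies would come from index statistics and the explicit corrections from typo redirects or query reformulation mining.

```python
from collections import Counter

# Hypothetical toy data. Real frequencies would come from index statistics,
# and real explicit corrections from typo redirects or reformulation mining.
WORD_FREQ = Counter({"search": 50, "wikipedia": 40, "language": 30})
EXPLICIT = {
    "wikipeda": {"wikipedia": 12},  # typo -> {correction: observed count}
    "serach": {"search": 7},
}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, explicit_weight=5.0):
    """Suggest a correction, or None if there is no evidence for one.

    explicit_weight is an assumed tuning knob that lets mined corrections
    outvote purely statistical candidates.
    """
    if word in WORD_FREQ:
        return word  # known word: nothing to correct

    scores = Counter()
    # Generic evidence: known words within one edit, scored by frequency.
    for candidate in edits1(word):
        if candidate in WORD_FREQ:
            scores[candidate] += WORD_FREQ[candidate]
    # Explicit evidence: frequency-weighted corrections mined elsewhere.
    for candidate, count in EXPLICIT.get(word, {}).items():
        scores[candidate] += explicit_weight * count

    if not scores:
        return None
    # Break ties alphabetically so the result is deterministic.
    return max(scores, key=lambda c: (scores[c], c))


if __name__ == "__main__":
    for query in ["serach", "wikipeda", "langauge", "zzzz"]:
        print(query, "->", suggest(query))
```

Keeping the two evidence sources as separate additive terms is one way the explicit corrections could stay language-independent while the generic model is swapped out or retrained per language.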
Looking at the rest of the list, (a) wrong keyboard detection seems like something we should work on internally, since we already have a few good ideas on how to approach it. (b) Acronym support is a pet peeve for several members of the team, and seems to be straightforward to improve. (c) Automatic stemmer building and (d) automatic stop word generation aren't so much projects we should work on as things we should research to see if there are already tools or lists out there we could use to make the projects much easier.
Hi everyone,

I just finished putting together an annotated list of potential applications of natural language processing to on-wiki search. There are dozens and dozens of ideas there, including many that are interesting but probably not practical. If you have any additional ideas, questions, suggestions, recommendations, or preferences, please share, either on the mailing list or on the talk page.

The goal is to narrow it down to one or two things to pursue over the next two to four quarters, along with other projects we are working on.

Thanks!
Trey

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation