+1 to the suggestion to connect with the Search team. Also a few more thoughts about vector / natural-language search and its relevance to Wikimedia from my perspective in Research:
- The common critique of lexical / keyword-based search, and the reason folks point to vector / embedding-based search, is that it handles natural-language queries poorly (e.g., "What are the different objectives of the United Nations Sustainable Development Goals?" vs. "UN SDG"). The former contains many words that produce keyword overlap with less-relevant pages, so keyword-based search doesn't do as well. The latter is much more direct and even matches an existing redirect on Wikipedia to the article on UN Sustainable Development Goals, so our existing keyword-based search handles it very well.
- Most existing users of Wikimedia's search are probably doing something closer to the latter -- i.e., using fairly exact keywords to navigate to a specific page (or to find out whether it exists). This is backed up by the data: 80% of searches on Wikipedia are auto-completed directly to article pages. In that sense, the system is working quite well! The Search team has also added quite a bit of normalization to the pipeline (see https://diff.wikimedia.org/2023/04/28/language-harmony-and-unpacking-a-year-in-the-life-of-a-search-nerd/ for a fun overview). As for the more complicated natural-language queries, my sense is that folks are probably running those in external search engines, which have huge teams/infrastructure to support this, and then clicking through to Wikipedia.
- That said, there are probably use-cases where natural-language search would be more valuable: for example, new interaction domains such as chat-bots, or new editors / developers who don't yet know the exact terminology to search for but want to do generic things like get access to Toolforge or find out how to add a link to a page. I've been putting together an example of this for Wikitech for the upcoming Hackathon (details), and others have proposed something similar for Project pages to help editors find answers to questions about editing (details).
- Finally, there's a second, related aspect to this: the size and diversity of a given document. Within the Wikipedia article namespace, documents are generally about a single, constrained topic, so the fact that lexical search systems like Elasticsearch operate at the document level is a very good fit -- i.e., all the keywords for a given article are indexed together. In other namespaces like Project/Help pages or Wikitech documentation, a single page can be much larger and cover far more diverse topics. This makes good keyword overlap harder to find, because the search would often ideally surface a very specific paragraph in a much larger document about many other things. Vector search doesn't directly solve this, but in practice folks tend to learn embeddings for passages smaller than an entire document -- e.g., sections or even paragraphs within a section. For that reason alone, I suspect vector search will do better for namespaces outside of the article namespace on Wikipedia. Whether it's worth the cost is a separate question, as it also introduces substantial new challenges in keeping the embeddings up-to-date :)
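
To make the passage-level point above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the sample page text is made up, the function names are hypothetical, and the scoring function is a toy keyword-overlap measure standing in for whatever similarity a real system would use (e.g., cosine similarity between embeddings). It is not how the Search team's pipeline or my Wikitech example works; it only shows the idea of splitting a long page into passages and ranking the passages rather than the whole page.

```python
import re

# Made-up sample page covering several unrelated topics (illustrative only).
SAMPLE_PAGE = """Accounts: request access to the hosting platform by filling in the membership form.

Links: add a link to a page by wrapping the target title in double square brackets.

Images: upload a file first, then embed it with the file syntax."""


def split_into_passages(page_text):
    """Split a page into passages on blank lines (a stand-in for sections/paragraphs)."""
    return [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, text):
    """Toy score: fraction of distinct query terms that also appear in the text.

    Assumption: a real system would swap this for an embedding similarity.
    """
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(text))) / len(query_terms)


def rank_passages(query, page_text, score=overlap_score):
    """Return (score, passage) pairs for each passage, best match first."""
    passages = split_into_passages(page_text)
    scored = [(score(query, p), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for value, passage in rank_passages("how do I add a link to a page", SAMPLE_PAGE):
        print(f"{value:.3f}  {passage}")
```

Because ranking happens per passage, the result points at the specific paragraph about links rather than at the whole multi-topic page. The trade-off mentioned above still applies: every passage has to be re-processed whenever the page is edited.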
Hope that helps.
Best,
Isaac