On 2023-05-09 22:09, Isaac Johnson wrote:
+1 to the suggestion to connect with the Search team. Also a few more thoughts about vector / natural-language search and its relevance to Wikimedia from my perspective in Research:
- The common critique of lexical / keyword-based search and why folks point to vector / embedding-based search is handling more natural-language queries (e.g., "What are the different objectives of the United Nations Sustainable Development Goals?" vs. "UN SDG"). The former has a lot of words in it that lead to keyword overlap with less-relevant pages so keyword-based search doesn't do as well. The latter is much more direct and even matches an existing redirect on Wikipedia to the article on UN Sustainable Development Goals, so our existing keyword-based search handles it very well.
- Most existing users of Wikimedia's search are probably doing something closer to the latter above -- i.e. using pretty exact keywords to navigate to a specific page (or find it exists).
I disagree. The benefit we should expect from vector search is not the ability to write questions with loose, natural-language grammar while still using exact terminology, but the ability to use fuzzy terminology. Today most users search with exact terms because that is the only thing our search function can handle: you can only search for terms that are used in the articles. That is no stranger than the observation that owners of a Fortran compiler tend to write programs in Fortran, as those are the only ones that will compile into running code. Most users would not search for "sustainable development goals" because they are not familiar with this exact UN terminology. Instead they might wonder how the UN envisions the future for humanity. And if those exact words are not in the relevant article, the current text-based search will not find it.
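To make the distinction concrete, here is a toy sketch in Python. It is not how our search or any real vector search works: the article snippets are made up, and the CONCEPTS table is a hand-written stand-in for what a learned embedding model would provide. It only illustrates that a query can share zero literal terms with the relevant article and still be close to it once words are mapped to meanings.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just enough for this example.
STOPWORDS = {"the", "a", "of", "for", "how", "are", "is", "by"}

# Hypothetical hand-written word-to-concept table. A real system would use
# learned embeddings instead; this only imitates the effect.
CONCEPTS = {
    "un": "united_nations", "united": "united_nations", "nations": "united_nations",
    "envisions": "long_term_aims", "future": "long_term_aims",
    "goals": "long_term_aims", "objectives": "long_term_aims",
    "humanity": "world", "global": "world",
    "fortran": "programming", "compiled": "programming",
    "programming": "programming", "language": "programming",
}

def tokens(text):
    """Lowercased set of words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def keyword_score(query, document):
    """Number of distinct query terms that literally occur in the document."""
    return len(tokens(query) & tokens(document))

def concept_vector(text):
    """Count how many words of the text fall under each toy concept."""
    return Counter(CONCEPTS[w] for w in tokens(text) if w in CONCEPTS)

def concept_similarity(query, document):
    """Cosine similarity between the toy concept vectors of two texts."""
    q, d = concept_vector(query), concept_vector(document)
    dot = sum(q[c] * d[c] for c in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Made-up article snippets, for illustration only.
sdg = "The Sustainable Development Goals are global objectives adopted by the United Nations"
fortran = "Fortran is a compiled programming language"
query = "how the UN envisions the future for humanity"

for title, text in [("SDG", sdg), ("Fortran", fortran)]:
    print(title, keyword_score(query, text), round(concept_similarity(query, text), 2))
```

In this toy setup the query has no keyword overlap with either snippet, yet the concept-based score clearly separates the SDG snippet from the Fortran one.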
On Meta there's a list of mailing lists that mentions "wikimedia-search", but that list seems to be dead and the archive is full of spam. Another list, called "discovery", does exist but is not listed on Meta: https://lists.wikimedia.org/hyperkitty/list/discovery@lists.wikimedia.org/