Forwarding...
---------- Forwarded message ----------
From: Trey Jones <tjones@wikimedia.org>
Date: Thu, Aug 4, 2016 at 1:25 PM
Subject: [discovery] Stripping Question Marks From Wiki Searches is Now Live!
To: A public mailing list about Wikimedia Search and Discovery projects <discovery@lists.wikimedia.org>
*Stripping Question Marks From Wiki Searches*
*Do you ask questions on Wikipedia? Would you like better results?*
*Summary:* Because the large majority of question marks are used to ask questions by users unfamiliar with bash-style wildcards https://en.wikipedia.org/wiki/Glob_(programming), the default behavior for CirrusSearch will now be to ignore question marks (replacing them with a space). Escaping them with a backslash (\?) will preserve their wildcard meaning. Regular expressions in *insource:* will not be affected and should not be escaped. This option can be modified on a per-wiki basis if needed (see $wgCirrusSearchStripQuestionMarks https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/CirrusSearch.php).
When people ask *how old is tom cruise?* on Wikipedia they almost certainly don’t expect the question mark in *cruise?* to match an additional letter. They aren’t looking for the words *cruised*, *cruiser*, or *cruises*, but that’s what they get, and it keeps them from finding the information they are really after.
Search on Wikipedia (and other Wikimedia projects) includes a lot of features that most users don’t know about. Most require special keywords, and some even require specialized knowledge, such as familiarity with regular expressions. It’s pretty difficult to invoke these special features by accident.
But search also supports two particular bash-style wildcards without any special syntax: an asterisk (*) will match any number of characters, and a question mark (?) will match exactly one. Asterisks do come up from time to time, but people use question marks all the time, because they like to ask questions!
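To make the wildcard semantics concrete, here is a minimal sketch using Python's standard-library fnmatch module as a stand-in for glob-style matching. It is only an illustration of what the two wildcards mean; it is not how CirrusSearch evaluates queries, and the word list is made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical word list, chosen only to illustrate the matching rules.
words = ["cruise", "cruised", "cruiser", "cruises", "cruisers"]

# "?" stands for exactly one character, so "cruise" itself does not match
# and neither does the longer "cruisers".
one_char = [w for w in words if fnmatchcase(w, "cruise?")]

# "*" stands for any number of characters, including none.
any_chars = [w for w in words if fnmatchcase(w, "cruise*")]

print(one_char)
print(any_chars)
```

Under these glob rules, a trailing question mark forces one extra letter after the word, which is why a question ending in *cruise?* would skip the word *cruise* itself.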
A recent review of query-string features https://commons.wikimedia.org/wiki/File:From_Zero_to_Hero_-_Anticipating_Zero_Results_From_Query_Features,_Ignoring_Content.pdf called out quotes and question marks as the two largest-impact predictors of unsuccessful queries on Wikipedia. In a follow-up survey of queries with question marks https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Dropping_Final_Question_Marks_in_the_Top_10_Wikipedias in six of the top ten Wikipedias (by search volume), most question marks are being used to ask questions (the other four of the top ten were not reviewed). In all ten of the top ten, stripping final question marks dramatically decreased the number of ?-final queries that got either no results or very few results (i.e., fewer than 3). The improvement was around 10-45% for ?-final queries, depending on the wiki. The overall impact is much more modest (less than 0.5%) because queries with question marks are not terribly common.
As a result of this analysis, we’ve implemented a change to search which will by default replace question marks with spaces (to preserve the word boundaries they intend in queries like *how?why?*). This setting can be changed on a per-wiki basis, and other options include (i) only stripping question marks at a clear word boundary (such as before a space), (ii) only stripping question marks at the end of the query, and (iii) leaving the question marks alone.
For the rarer users who do use question marks as a one-letter wildcard, when question mark stripping is enabled, question marks can be escaped with a backslash (e.g., *wiki\?edia*) to preserve their original wildcard meaning. Power searchers who use *insource:* won’t need to do anything special; queries with *insource:* will not be modified.
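The behavior described above (the default replacement, the three alternative options, backslash escaping, and the *insource:* exemption) can be sketched roughly as follows. This is a simplified illustration in Python, not the actual CirrusSearch code: the function name, the mode names, and the regular expressions are assumptions made for this example, and the real values accepted by $wgCirrusSearchStripQuestionMarks may differ.

```python
import re

# Hypothetical mode names for this sketch; the real setting's values may differ.
STRIP_ALL = "all"      # replace every unescaped "?" with a space (the default)
STRIP_BREAK = "break"  # strip "?" only at a clear word boundary
STRIP_FINAL = "final"  # strip "?" only at the end of the query
STRIP_NONE = "none"    # leave question marks alone

# A question mark that is not preceded by a backslash.
UNESCAPED_Q = r"(?<!\\)\?"


def strip_question_marks(query: str, mode: str = STRIP_ALL) -> str:
    """Return the query with question marks handled according to mode."""
    # Queries using insource: are passed through unmodified, as is
    # everything when stripping is turned off.
    if mode == STRIP_NONE or "insource:" in query:
        return query

    if mode == STRIP_ALL:
        # A space preserves word boundaries in queries like "how?why?".
        query = re.sub(UNESCAPED_Q + "+", " ", query)
    elif mode == STRIP_BREAK:
        # Only drop question marks followed by whitespace or the end.
        query = re.sub(UNESCAPED_Q + r"+(?=\s|$)", "", query)
    elif mode == STRIP_FINAL:
        # Only drop trailing question marks (and trailing whitespace).
        query = re.sub(r"(?:" + UNESCAPED_Q + r"|\s)+$", "", query)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # An escaped "\?" becomes a plain "?" and keeps its wildcard meaning.
    return query.replace("\\?", "?").strip()


print(strip_question_marks("how old is tom cruise?"))
print(strip_question_marks("how?why?"))
print(strip_question_marks("wiki\\?edia"))
```

In this sketch the escape is resolved only after stripping, so an escaped question mark survives every mode and reaches the search backend as a wildcard.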
Here's a screenshot https://commons.wikimedia.org/wiki/File:Old-are_viruses_living%3F.png of the former question mark behavior, where it is treated as a wildcard. Note that “living?” only matches the name “Livings”, leading to two very unsatisfactory results.
Here's a screenshot https://commons.wikimedia.org/wiki/File:New-are_viruses_living%3F.png of the new question mark behavior, where it is ignored. Now the question and part of the answer can be seen in the snippet for the very first result, and all of the top three results seem relevant.
(Sorry I can't embed the screenshots—the mailing list won't allow messages over 40K.)
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation