Hi everyone,
Mikhail, Data Analyst Extraordinaire, recently published his report, "From Zero to Hero"[1] on the relationship between various features of queries as strings (rather than the content of the query) and those queries getting no results.
Today for my 10% project I took a quick look at the two most impactful features, quotes and question marks. These two features stood out in Mikhail's report as having both relatively high volume and a relatively higher chance of getting no results.
I'm not planning on doing a more formal report right now, though I will probably copy this email to my Notes page.
Quotes make sense, as we try to get an exact match for strings inside quotes, which limits our options for making a match. Question marks are actually a little-known, little-used, poorly documented, and poorly understood wildcard: they stand for any single character (so a query like wom?n would match both woman and women). Most users, of course, use them to ask questions.
I took a random sample of 50,000 English Wikipedia queries (using my now-favorite criteria at [2]: basically, full text queries from normal humans (as best as we can tell) with fewer than 3 results). I extracted all the queries with quotes (170) and all the queries that ended in question marks, that is, looked like questions (274). There were 4 queries that were nothing but question marks and spaces (e.g., ???? ???????? ????); they caused problems because they are very expensive queries that repeatedly failed on the test cluster, so I discarded them. I also took a random sub-sample of 1K queries from the larger sample of 50K.
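(For anyone who wants to replicate the extraction, the logic was roughly the following. This is just a sketch; the file name and the exact regexes are illustrative, not the actual pipeline.)

import re

# Hypothetical input: one query per line, already filtered down to full-text
# queries from (apparently) human users with fewer than 3 results.
with open('enwiki_query_sample_50k.txt', encoding='utf-8') as f:
    queries = [line.rstrip('\n') for line in f]

# All queries containing a double quote.
quote_queries = [q for q in queries if '"' in q]

# All queries ending in a question mark (optionally followed by spaces).
question_queries = [q for q in queries if re.search(r'\?\s*$', q)]

# Queries that are nothing but question marks and spaces were too expensive
# to run on the test cluster, so they get discarded.
question_queries = [q for q in question_queries if not re.fullmatch(r'[?\s]+', q)]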
All samples had plenty of gibberish queries (e.g., "fhdsfhsdjkfgdsjklgsdl"?), queries in other languages, and the other usual cruft.
*For the sample with quotes,* I used Relevance Forge to compare the results of running queries as is vs replacing quotes with spaces. The summary stats are below. The zero results rate for queries with quotes went down by almost half, and more than half of the queries had changes in their top 5 results. The TotalHits stats are wildly skewed by one query that increased its results by over 300,000. (There always seems to be an outlier!)
*Metrics:*
*Query Count:* 170
*Zero Results:* 38.2% (-37.1%)
*Top 5 Sorted Results Differ:* 51.8%
*Top 5 Unsorted Results Differ:* 51.2%
Num Top 5 Results Changed: μ: 2.14; σ: 2.30; median: 1.00
Num TotalHits Changed: μ: 3049.99; σ: 26435.14; median: 1.00
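The rewrite itself is about as simple as it sounds. A sketch of what "replacing quotes with spaces" amounts to (not the actual Relevance Forge code):

def strip_quotes(query):
    """Replace double quotes with spaces, turning phrase queries into plain ones."""
    return query.replace('"', ' ').strip()

# e.g., strip_quotes('"bob dylan" discography') -> 'bob dylan  discography'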
*For the sample with question marks,* I used Relevance Forge to compare the results of running queries as is vs dropping all trailing question marks and spaces. Some queries ended in multiple question marks (all removed), and some had other question marks in the middle of the query (those were kept). The summary stats are below. The results are similar to those for quotes: almost half of the zero-results queries got results, more than half of all queries had changes to their top 5 results, and the mean change in total hits is blown out by one query that got more than 300K additional results.
*Metrics:*
*Query Count:* 274
*Zero Results:* 43.1% (-39.1%)
*Top 5 Sorted Results Differ:* 53.3%
*Top 5 Unsorted Results Differ:* 53.3%
Num Top 5 Results Changed: μ: 2.22; σ: 2.33; median: 1.00
Num TotalHits Changed: μ: 1875.48; σ: 19885.60; median: 1.00
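And the corresponding sketch for the question mark rewrite (again, illustrative only): strip any trailing run of question marks and spaces, and leave internal question marks alone.

import re

def strip_trailing_qmarks(query):
    """Drop trailing question marks and spaces; question marks mid-query are kept."""
    return re.sub(r'[?\s]+$', '', query)

# e.g., strip_trailing_qmarks('who was the first president of iceland??')
#       -> 'who was the first president of iceland'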
*For the 1K random sample,* I used Relevance Forge to compare the results of running queries as is vs (a) replacing quotes with spaces, (b) dropping all trailing question marks and spaces, and (c) doing both (there are even a very few queries with both quotes and trailing question marks!).
Keep in mind that these are all poorly performing queries (fewer than 3 results). Summary results:
(a) quotes
*Metrics:*
*Query Count:* 1000
*Zero Results:* 79.5% (-0.1%)
*Top 5 Sorted Results Differ:* 0.1%
*Top 5 Unsorted Results Differ:* 0.1%
Num Top 5 Results Changed: μ: 0.01; σ: 0.16; median: 0.00
Num TotalHits Changed: μ: 0.31; σ: 9.70; median: 0.00
(b) question marks
*Metrics:*
*Query Count:* 1000
*Zero Results:* 79.4% (-0.2%)
*Top 5 Sorted Results Differ:* 0.4%
*Top 5 Unsorted Results Differ:* 0.4%
Num Top 5 Results Changed: μ: 0.02; σ: 0.32; median: 0.00
Num TotalHits Changed: μ: 0.16; σ: 3.45; median: 0.00
(c) quotes and question marks (pretty much the sum of the previous two!)
*Metrics:*
*Query Count:* 1000
*Zero Results:* 79.3% (-0.3%)
*Top 5 Sorted Results Differ:* 0.5%
*Top 5 Unsorted Results Differ:* 0.5%
Num Top 5 Results Changed: μ: 0.03; σ: 0.35; median: 0.00
Num TotalHits Changed: μ: 0.47; σ: 10.30; median: 0.00
Overall, it's a pretty small effect, and the new results aren't always great when quotes are dropped, but making the change would take very little effort.
A quick look at the queries with question marks didn't show any that were obviously intended to be used as wildcards (except maybe all-question-marks, like ????—but who knows what that is supposed to be?).
Disabling ? as a wildcard has been suggested before, and I would now recommend it too; it causes many more problems than it solves.
Re-running poor-performing queries that have quotes without the quotes is an easy win. We should do that too!
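If we wanted to try that, one simple shape for the fallback would be something like the following (purely illustrative; run_search here is a stand-in for whatever the real search call would be, not an existing API):

def search_with_quote_fallback(query, run_search):
    """Run the query as given; if it gets zero results and contains quotes,
    retry once with the quotes replaced by spaces."""
    results = run_search(query)
    if not results and '"' in query:
        results = run_search(query.replace('"', ' ').strip())
    return results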
Thoughts, comments, and suggestions welcome!
—Trey
[1] https://github.com/wikimedia-research/Discovery-Search-Adhoc-QueryFeatures/b...
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_...

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Hi Trey,
Cool analysis. I'm curious whether the infrastructure lets you look at query sessions: do these queries with special symbols occur late in a multi-query sequence that included simpler versions earlier in the sequence?
Maybe you can segment users who are confused about the query language versus power users who are iteratively enhancing a query. The latter seem likely to generate low-result-count queries that are more acceptable, because the user twisted up the query intentionally.
John
Sent from +1-617-899-2066
Hi John,
We can look at query sessions with some level of effort, but the tools I was using do not. This was just a quick stab at a question that came up when Mikhail and I were talking: could Relevance Forge be used to see what kind of difference a simple rule makes for poorly performing queries? (Yes, yes it can!)
We don't, that I know of, track user sessions like that at query time. It's technically possible, I'm sure, but we don't want to, say, store query info in local storage of the browser, and it might be too expensive to compare earlier queries in the sequence in real time anyway.
I would definitely not suggest just turning off ? over the weekend or anything like that. I'd want to reach out to the community and also investigate more queries that use ? to try to figure out what they are doing. But it's clear from the queries I've looked at—considering how many start with who, what, when, where, how, will, do, etc.—that a lot of people are just asking questions, and those queries can do poorly as a result.
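If we wanted to flag question-like queries automatically rather than eyeballing them, a rough heuristic along these lines would probably catch most of them (the word list is just my guess, nothing official):

QUESTION_WORDS = {'who', 'what', 'when', 'where', 'why', 'how',
                  'will', 'do', 'does', 'did', 'is', 'are', 'can'}

def looks_like_a_question(query):
    """Heuristic: starts with a common question word or ends in a question mark."""
    words = query.lower().split()
    return bool(words) and (words[0] in QUESTION_WORDS
                            or query.rstrip().endswith('?'))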
We've recently brainstormed about an expert mode, where, among other things, ambiguous search syntax (like ?) could be interpreted as actual search syntax, while the default casual user mode treats question marks as just question marks. It's not going to happen anytime real soon, but it's good to think about.
—Trey