Hi everyone,
Mikhail, Data Analyst Extraordinaire, recently published his report, "From
Zero to Hero"[1] on the relationship between various features of queries as
strings (rather than the content of the query) and those queries getting no
results.
Today for my 10% project I took a quick look at the two most impactful
features, quotes and question marks. These two features stood out in
Mikhail's report as having both relatively high volume and a relatively
higher chance of getting no results.
I'm not planning on doing a more formal report right now, though I will
probably copy this email to my Notes page.
Quotes make sense, as we try to get an exact match for strings inside
quotes, which limits our options for making a match. Question marks are
actually a little-known, little-used, poorly documented, and poorly
understood wildcard: they stand for any single character. Most users use
them to ask questions.
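To make the wildcard behavior concrete, here is a minimal sketch using Python's fnmatch module, whose ? also matches exactly one character. It is only an analogy: the search engine matches against indexed terms rather than whole strings, so this illustrates the semantics and is not our actual query parser. The helper name wildcard_match is made up for the example.

```python
from fnmatch import fnmatchcase


def wildcard_match(term: str, pattern: str) -> bool:
    """Return True if `term` matches `pattern`, where "?" is any single character."""
    return fnmatchcase(term, pattern)


# Used deliberately as a wildcard: exactly one character fills the "?".
print(wildcard_match("cat", "c?t"))    # True
print(wildcard_match("cart", "c?t"))   # False

# Used to ask a question: the trailing "?" demands one extra character,
# so the plain word no longer matches.
print(wildcard_match("president", "president?"))   # False
print(wildcard_match("presidents", "president?"))  # True
```

The last two lines show why a question typed by a normal user can end up with no results: the final word is silently required to be one character longer than what was typed.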
I took a random sample of 50,000 English Wikipedia queries (using my
now-favorite criteria at [2]—basically, full text queries from normal
humans (as best as we can tell) with fewer than 3 results). I extracted all
the queries with quotes (170) and all the queries that ended in question
marks, that is, looked like questions (274). There were 4 queries that were
all question marks and spaces (e.g., ???? ???????? ????); they caused problems as
they are very expensive queries that repeatedly failed on the test cluster,
so I discarded them. I also took a random sub-sample of 1K queries from the
larger sample of 50K.
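The extraction step above could be sketched roughly as follows. This is not the script I actually ran; the function name bucket_queries is made up, and it assumes that only the ASCII double quote counts as a quote and that the all-question-mark queries are set aside before anything else is counted.

```python
import re

# Assumption: only the ASCII double quote is treated as a "quote".
QUOTE = '"'
# Entirely question marks and whitespace, with at least one question mark.
ONLY_QMARKS = re.compile(r"[?\s]*\?[?\s]*")
# Ends in a question mark, optionally followed by whitespace.
TRAILING_QMARK = re.compile(r"\?\s*$")


def bucket_queries(queries):
    """Split queries into (quoted, questions, discarded) lists.

    A query with both quotes and a trailing question mark lands in both
    of the first two lists.
    """
    quoted, questions, discarded = [], [], []
    for q in queries:
        if ONLY_QMARKS.fullmatch(q):
            discarded.append(q)  # e.g. "???? ???????? ????"
            continue
        if QUOTE in q:
            quoted.append(q)
        if TRAILING_QMARK.search(q):
            questions.append(q)
    return quoted, questions, discarded
```

A question mark in the middle of a query (as in "what? no") does not make it a question here; only a trailing one does.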
All samples had plenty of gibberish queries (e.g.,
"fhdsfhsdjkfgdsjklgsdl"?), queries in other languages, and the other usual
cruft.
*For the sample with quotes,* I used Relevance Forge to compare the results
of running queries as is vs replacing quotes with spaces. The summary stats
are below. The zero results rate for queries with quotes went down by
almost half, and more than half of the queries had changes in their top 5
results. The TotalHits stats are wildly skewed by one query that increased
its results by over 300,000. (There always seems to be an outlier!)
*Metrics:*
*Query Count:* 170
Num TotalHits Changed: μ: 3049.99; σ: 26435.14; median: 1.00
*Zero Results:* 38.2% (-37.1%)
*Top 5 Sorted Results Differ:* 51.8%
*Top 5 Unsorted Results Differ:* 51.2%
Num Top 5 Results Changed: μ: 2.14; σ: 2.30; median: 1.00
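For reference, the quote transformation is about as simple as it sounds. A minimal sketch, assuming only the ASCII double quote is treated as a quote (the function name replace_quotes is made up for the example):

```python
def replace_quotes(query: str) -> str:
    """Replace every double quote with a space, leaving the rest untouched."""
    # Assumption: curly quotes and single quotes are not handled here.
    return query.replace('"', " ")


print(repr(replace_quotes('"new york" pizza')))  # ' new york  pizza'
```

Replacing with a space rather than deleting keeps words apart when a quote sits directly between them, at the cost of some extra whitespace, which I would expect the tokenizer to ignore.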
*For the sample with question marks,* I used Relevance Forge to compare the
results of running queries as is vs dropping all trailing question marks
and spaces. Some queries ended in multiple question marks (removed), and
some queries had other question marks in the middle of the query (kept).
The summary stats are below. The picture is similar to the one for quotes:
almost half of the zero-results queries got results, more than half of all
queries had changes to their top 5 results, and the mean number of total
hits is blown out by one query that got more than 300K additional results.
*Metrics:*
*Query Count:* 274
Num TotalHits Changed: μ: 1875.48; σ: 19885.60; median: 1.00
*Zero Results:* 43.1% (-39.1%)
*Top 5 Sorted Results Differ:* 53.3%
*Top 5 Unsorted Results Differ:* 53.3%
Num Top 5 Results Changed: μ: 2.22; σ: 2.33; median: 1.00
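The question mark transformation can be sketched in one line of Python. The function name drop_trailing_question_marks is made up, and this version only treats the plain space character as a space:

```python
def drop_trailing_question_marks(query: str) -> str:
    """Remove any trailing mix of question marks and spaces."""
    # rstrip takes a set of characters, so "? ? ?" at the end is removed
    # in one go; question marks in the middle of the query are kept.
    return query.rstrip("? ")


print(drop_trailing_question_marks("what is a quark???"))  # what is a quark
print(drop_trailing_question_marks("why? because ?"))      # why? because
```

Note that a query made up entirely of question marks and spaces becomes the empty string under this transformation.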
*For the 1K query sample,* I used Relevance Forge to compare the results of
running queries as is vs (a) replacing quotes with spaces, (b) dropping all
trailing question marks and spaces, and (c) doing both (there are even a
very few queries with both quotes and trailing question marks!).
Keep in mind that these are all poorly performing queries (fewer than 3
results). Summary results:
(a) quotes
*Metrics:*
*Query Count:* 1000
Num TotalHits Changed: μ: 0.31; σ: 9.70; median: 0.00
*Zero Results:* 79.5% (-0.1%)
*Top 5 Sorted Results Differ:* 0.1%
*Top 5 Unsorted Results Differ:* 0.1%
Num Top 5 Results Changed: μ: 0.01; σ: 0.16; median: 0.00
(b) question marks
*Metrics:*
*Query Count:* 1000
Num TotalHits Changed: μ: 0.16; σ: 3.45; median: 0.00
*Zero Results:* 79.4% (-0.2%)
*Top 5 Sorted Results Differ:* 0.4%
*Top 5 Unsorted Results Differ:* 0.4%
Num Top 5 Results Changed: μ: 0.02; σ: 0.32; median: 0.00
(c) quotes and question marks (pretty much the sum of the previous two!)
*Metrics:*
*Query Count:* 1000
Num TotalHits Changed: μ: 0.47; σ: 10.30; median: 0.00
*Zero Results:* 79.3% (-0.3%)
*Top 5 Sorted Results Differ:* 0.5%
*Top 5 Unsorted Results Differ:* 0.5%
Num Top 5 Results Changed: μ: 0.03; σ: 0.35; median: 0.00
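As a sanity check on the zero results numbers above, here is a small sketch of the arithmetic. It assumes the value in parentheses is the change in percentage points from the baseline run; the helper names are made up for the example.

```python
def baseline_rate(new_rate: float, change: float) -> float:
    """Zero-results rate before the change, in percent.

    Assumes `change` is a difference in percentage points.
    """
    return round(new_rate - change, 1)


def relative_drop(new_rate: float, change: float) -> float:
    """Share of the baseline zero-results rate that went away, in percent."""
    return round(100 * -change / (new_rate - change), 1)


# (reported rate, reported change), copied from the metrics above
reported = {
    "quotes sample": (38.2, -37.1),
    "question marks sample": (43.1, -39.1),
    "1K (a) quotes": (79.5, -0.1),
    "1K (b) question marks": (79.4, -0.2),
    "1K (c) both": (79.3, -0.3),
}

for name, (rate, change) in reported.items():
    print(f"{name}: baseline {baseline_rate(rate, change)}%, "
          f"relative drop {relative_drop(rate, change)}%")
```

Under that assumption the baselines come out to 75.3% for the quotes sample and 82.2% for the question marks sample, so the relative drops are about 49% and 48%, which matches "almost half". All three 1K runs work back to the same 79.6% baseline, which is what you would expect if the reading of the parenthetical is right.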
Overall, it's a pretty small effect, and many of the results are not great
when quotes are dropped, but it's a very small effort to make the change.
A quick look at the queries with question marks didn't show any that were
obviously intended to be used as wildcards (except maybe
all-question-marks, like ????—but who knows what that is supposed to be?).
Disabling ? as a wildcard has been suggested before, and I would now
recommend it too: it causes many more problems than it solves.
Re-running poor-performing queries that have quotes without the quotes is
an easy win. We should do that too!
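A rough sketch of what that fallback could look like is below. The search argument is a hypothetical callable standing in for whatever actually runs the query, and the threshold of 3 is an assumption borrowed from the "fewer than 3 results" sampling criterion, not a settled design.

```python
from typing import Callable, List


def search_with_quote_fallback(
    query: str,
    search: Callable[[str], List[str]],  # hypothetical: query -> list of results
    min_results: int = 3,                # assumed cutoff for "poor-performing"
) -> List[str]:
    """Run `query`; if it does poorly and has quotes, retry without them."""
    results = search(query)
    if len(results) >= min_results or '"' not in query:
        return results
    retry = search(query.replace('"', " "))
    # Keep the original results unless the retry actually found more.
    return retry if len(retry) > len(results) else results
```

Queries without quotes, and quoted queries that already do well, cost exactly one search; only poorly performing quoted queries pay for a second one.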
Thoughts, comments, and suggestions welcome!
—Trey
[1] https://github.com/wikimedia-research/Discovery-Search-Adhoc-QueryFeatures/…
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization…
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation