Hey all,
This Friday, Trey Jones (our awesome Relevance Engineer) and I spent
some time playing detective with the sampled request logs and a list
of the most common queries resulting in zero results. We found a lot
of interesting things. In particular:
1. A common pattern in which queries, for no apparent reason, had a
millisecond UNIX timestamp and a colon prepended (example:
"1436336857594:2019 FIFA Women's World Cup"). This is responsible, on
its own, for 3% of zero results queries - and it appears to be caused
by the Wikimedia Apps.
2. A search for strings in quotes followed by 'film' (example:
"\"Seventh Son\" film"). This is caused by a media player and is
responsible for around 0.5% of zero results queries.
3. A search for "quot" strings (example: " quot James Tree quot").
This is from the National Library of Australia and is again around
0.5% of zero results queries.
4. A search for a page title and the name of a page that appears as a
link within that page (example: "\"2C-T-19\" AND \"JWH-081\""). This
is about 6% of zero results queries and appears to come from a single
German IP address. We're unaware of who this person is or what they're
trying to do, so if anyone knows what on earth this is, we'd
appreciate the hint ;).
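
For anyone who wants to hunt for these in their own logs, here is a
rough Python sketch of how the four patterns above could be tagged.
The regexes and the names (PATTERNS, classify, share_by_pattern) are
assumptions inferred only from the example queries in this mail, not
the analysis code itself, so treat it as a starting point.

```python
import re
from collections import Counter

# Hypothetical patterns, guessed from the four example queries only.
PATTERNS = [
    # 10-13 digits (seconds or milliseconds) followed by a colon
    ("timestamp_prefix", re.compile(r"^\d{10,13}:")),
    # a quoted string followed by the literal word "film"
    ("quoted_title_film", re.compile(r'^"[^"]+" film$')),
    # text wrapped in stray "quot" tokens
    ("quot_wrapped", re.compile(r"^\s*quot\s.+\squot\s*$")),
    # two quoted strings joined by AND
    ("quoted_and_quoted", re.compile(r'^"[^"]+" AND "[^"]+"$')),
]


def classify(query):
    """Return the name of the first matching pattern, or 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(query):
            return name
    return "other"


def share_by_pattern(queries):
    """Return each pattern's share of the given zero results queries."""
    counts = Counter(classify(q) for q in queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}
```

Feeding share_by_pattern a list of zero results queries gives a
per-pattern fraction that can be compared against the rough
percentages quoted above.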
https://phabricator.wikimedia.org/T107724 is a card representing the
need to reach out to these people, where possible (obviously this will
be easier for the app team than anyone else ;p). If we can get all of
these resolved, we could drop the zero results rate for full text by
about 10%. Obviously cutting /all/ of it out is improbable, but we're
hopeful that we can drop this number and get a better understanding of
what third-party users are trying to achieve, to boot.
--
Oliver Keyes
Count Logula
Wikimedia Foundation