In a twist of irony, this issue was actually caused by a patch I wrote https://gerrit.wikimedia.org/r/#/c/207727/ to fix an annoying little bug https://phabricator.wikimedia.org/T96944 in the app where the namespace of some pages was being set to null when they were saved to the user's storage.
You can see in the changes I made to the persistence helper https://gerrit.wikimedia.org/r/#/c/207727/3/wikipedia/src/main/java/org/wikipedia/history/HistoryEntryPersistenceHelper.java that I took the column that was the timestamp and used it for the namespace instead. This was my first change to the database layer of the app, and I didn't quite realise the ramifications of doing what I did. Since Dmitry's fix https://gerrit.wikimedia.org/r/#/c/228766/ noted that it was silly to ever use column indices rather than looking them up by name, I don't feel *too* bad about it.. ;-)
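For anyone wondering why positional lookups are so fragile, here is a minimal sketch of the failure mode. The `Row` class and both column layouts are made up for illustration; this is neither the app's real persistence code nor Android's actual cursor API. It just shows a hard-coded index silently returning a different column once the layout shifts, while a lookup by name keeps working.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnLookupSketch {

    /** Hypothetical stand-in for one database row; not the app's real classes. */
    public static final class Row {
        private final List<String> columns;
        private final Object[] values;

        public Row(List<String> columns, Object... values) {
            if (columns.size() != values.length) {
                throw new IllegalArgumentException("columns and values differ in length");
            }
            this.columns = columns;
            this.values = values;
        }

        /** Positional access: only correct while the column order never changes. */
        public Object getByIndex(int index) {
            return values[index];
        }

        /** Access by name: resolves the position at read time. */
        public Object getByName(String name) {
            int index = columns.indexOf(name);
            if (index < 0) {
                throw new IllegalArgumentException("No such column: " + name);
            }
            return values[index];
        }
    }

    /** Assumed layout before the change: the timestamp sits at index 2. */
    public static Row oldLayoutRow() {
        return new Row(Arrays.asList("site", "title", "timestamp"),
                "en", "Main Page", 1436336857594L);
    }

    /** Assumed layout after the change: namespace now occupies index 2. */
    public static Row newLayoutRow() {
        return new Row(Arrays.asList("site", "title", "namespace", "timestamp"),
                "en", "Main Page", "Talk", 1436336857594L);
    }

    public static void main(String[] args) {
        int hardCodedTimestampIndex = 2; // written against the old layout
        System.out.println("old, by index: " + oldLayoutRow().getByIndex(hardCodedTimestampIndex));
        System.out.println("new, by index: " + newLayoutRow().getByIndex(hardCodedTimestampIndex));
        System.out.println("new, by name:  " + newLayoutRow().getByName("timestamp"));
    }
}
```

The index-based read keeps compiling and running after the layout changes, which is what makes this kind of bug easy to miss in review.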
99 little bugs in the code, 99 little bugs, take one down, patch it around, 127 little bugs in the code.
Dan
On 2 August 2015 at 17:14, Oliver Keyes <okeyes@wikimedia.org> wrote:
Hey all,
This Friday, Trey Jones (our awesome Relevance Engineer) and I spent some time playing detective with the sampled request logs and a list of the most common queries resulting in zero results. We found a lot of interesting things. In particular:
1. A common pattern in which queries, for no particular reason, had a UNIX timestamp preceding them (example: "1436336857594:2019 FIFA Women's World Cup"). This is responsible, on its own, for 3% of zero results queries - and it appears to be caused by the Wikimedia Apps.
2. A search for strings in quotes followed by 'film' (example: ""Seventh Son" film"). This is caused by a media player and is responsible for around 0.5% of zero results queries.
3. A search for "quot" strings (example: " quot James Tree quot"). This is from the National Library of Australia and is again around 0.5% of zero results queries.
4. A search for a page title and the name of a page that appears as a link within that page (example: ""2C-T-19" AND "JWH-081""). This is about 6% of queries and appears to come from a German IP address. We're unaware of who this person is or what they're trying, so if anyone knows what on earth this is, we'd appreciate the hint ;).
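As a rough illustration of how the first pattern could be spotted in a list of queries, here is a minimal sketch. The class and method names are made up, and the rule it applies (10 to 13 digits followed by a colon at the very start of the query) is an assumption based on the single example above, not the method actually used on the request logs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimestampPrefixSketch {

    // Assumption: the prefix is 10 to 13 digits (epoch seconds or milliseconds)
    // followed by a colon, at the very start of the query string.
    private static final Pattern PREFIX =
            Pattern.compile("^(\\d{10,13}):(.*)$", Pattern.DOTALL);

    /** Returns true if the query starts with something that looks like a timestamp prefix. */
    public static boolean hasTimestampPrefix(String query) {
        return PREFIX.matcher(query).matches();
    }

    /** Returns the query without the prefix, or the query unchanged if there is none. */
    public static String stripTimestampPrefix(String query) {
        Matcher m = PREFIX.matcher(query);
        return m.matches() ? m.group(2) : query;
    }

    public static void main(String[] args) {
        String sample = "1436336857594:2019 FIFA Women's World Cup";
        System.out.println(hasTimestampPrefix(sample));
        System.out.println(stripTimestampPrefix(sample));
    }
}
```

A heuristic like this would also catch a genuine query that happens to begin with a long run of digits and a colon, so it is better suited to counting suspects than to rewriting queries automatically.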
https://phabricator.wikimedia.org/T107724 is a card representing the need to reach out to these people, where possible (obviously this will be easier for the app team than anyone else ;p). If we can get all of these solved for, we could drop the zero results rate for full text by about 10%. Obviously cutting /all/ of it out is improbable, but we're hopeful that we can drop this number and get a better understanding of what third-party users are trying to achieve, to boot.
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search