Hey all,
This Friday, Trey Jones (our awesome Relevance Engineer) and I spent some time playing detective with the sampled request logs and a list of the most common queries resulting in zero results. We found a lot of interesting things. In particular:
1. A common pattern in which queries, for no particular reason, had a UNIX timestamp preceding them (example: "1436336857594:2019 FIFA Women's World Cup"). This is responsible, on its own, for 3% of zero results queries - and it appears to be caused by the Wikimedia Apps.
2. A search for strings in quotes followed by 'film' (example: ""Seventh Son" film"). This is caused by a media player and is responsible for around 0.5% of zero results queries.
3. A search for "quot" strings (example: " quot James Tree quot"). This is from the National Library of Australia and is again around 0.5% of zero results queries.
4. A search for a page title and the name of a page that appears as a link within that page (example: ""2C-T-19" AND "JWH-081""). This is about 6% of queries and appears to come from a German IP address. We're unaware of who this person is or what they're trying, so if anyone knows what on earth this is, we'd appreciate the hint ;).
https://phabricator.wikimedia.org/T107724 is a card representing the need to reach out to these people, where possible (obviously this will be easier for the app team than anyone else ;p). If we can get all of these resolved, we could drop the zero results rate for full text by about 10%. Obviously cutting /all/ of it out is improbable, but we're hopeful that we can drop this number and get a better understanding of what third-party users are trying to achieve, to boot.
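For anyone wanting to look for the same patterns in a list of zero-result queries, here is a minimal Python sketch. The regular expressions, the labels and the function names are all assumptions made up for illustration; they are modelled only on the four example queries above and are not the analysis that was actually run on the sampled logs.

```python
import re
from collections import Counter

# Hypothetical patterns, modelled only on the example queries in this thread.
PATTERNS = [
    # 13-digit (millisecond) UNIX timestamp followed by a colon.
    ("timestamp_prefix", re.compile(r"^\d{13}:")),
    # Two quoted strings joined by AND.
    ("quoted_title_and_link", re.compile(r'^"[^"]+" AND "[^"]+"$')),
    # A quoted string followed by the word "film".
    ("quoted_film", re.compile(r'^"[^"]+" film$')),
    # A query wrapped in literal "quot" tokens.
    ("quot_wrapped", re.compile(r"^\s*quot\b.*\bquot\s*$")),
]


def classify(query):
    """Return the label of the first matching pattern, or "other"."""
    for label, pattern in PATTERNS:
        if pattern.search(query):
            return label
    return "other"


def pattern_shares(queries):
    """Return each label's share of the given queries as a fraction."""
    counts = Counter(classify(q) for q in queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}
```

Calling `pattern_shares` on a list of zero-result queries gives a rough per-pattern breakdown; anything the patterns miss ends up under "other".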
On Sun, Aug 2, 2015 at 6:14 PM, Oliver Keyes okeyes@wikimedia.org wrote:
- A common pattern in which queries, for no particular reason, had a UNIX timestamp preceding them (example: "1436336857594:2019 FIFA Women's World Cup"). This is responsible, on its own, for 3% of zero results queries - and it appears to be caused by the Wikimedia Apps.
Oliver: What User Agent strings do you see for this? Is it the iOS or Android app, the old Phonegap App, or even something else?
-Bernd
Hey Bernd,
It's the new Android app; I've thrown some example requests at Dmitry by way of fluorine and, being Dmitry, he's already worked out what's going on!
On 3 August 2015 at 11:16, Bernd Sitzmann bernd@wikimedia.org wrote:
Oliver: What User Agent strings do you see for this? Is it the iOS or Android app, the old Phonegap App, or even something else?
-Bernd
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
What's the phab task for this?
On Mon, Aug 3, 2015 at 9:05 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Bernd,
It's the new Android app; I've thrown some example requests at Dmitry by way of fluorine and, being Dmitry, he's already worked out what's going on!
-- Oliver Keyes Count Logula Wikimedia Foundation
https://phabricator.wikimedia.org/T107727 is the link!
On 3 August 2015 at 13:07, Tomasz Finc tfinc@wikimedia.org wrote:
What's the phab task for this?
Great, let's keep tabs on when the release is going out and align it with our metrics.
On Mon, Aug 3, 2015 at 10:15 AM, Oliver Keyes okeyes@wikimedia.org wrote:
https://phabricator.wikimedia.org/T107727 is the link!
In a twist of irony, this issue was actually caused by a patch I wrote https://gerrit.wikimedia.org/r/#/c/207727/ to fix an annoying little bug https://phabricator.wikimedia.org/T96944 in the app where the namespace of some pages was being set to null when they were saved to the user's storage.
You can see in the changes I made to the persistence helper https://gerrit.wikimedia.org/r/#/c/207727/3/wikipedia/src/main/java/org/wikipedia/history/HistoryEntryPersistenceHelper.java that I took the column that was the timestamp and used it for the namespace instead. This was my first change to the database layer of the app, and I didn't quite realise the ramifications of doing what I did. Since Dmitry's fix https://gerrit.wikimedia.org/r/#/c/228766/ noted that it was silly to ever use column indices rather than looking them up by name, I don't feel *too* bad about it.. ;-)
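To make the failure mode concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table layout, the stale index constant and the helper function are all hypothetical; this is not the app's Java persistence code, only an illustration of how reading a column by position can silently pick up the timestamp where the namespace was intended, while a lookup by name cannot.

```python
import sqlite3

# Hypothetical schema: not the app's real history table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows support access by index and by name
conn.execute(
    "CREATE TABLE history (title TEXT, namespace TEXT, timestamp INTEGER)"
)
conn.execute(
    "INSERT INTO history VALUES (?, ?, ?)",
    ("2019 FIFA Women's World Cup", None, 1436336857594),
)
row = conn.execute("SELECT * FROM history").fetchone()

# Hypothetical stale constant: it is meant to point at the namespace
# column but actually points at the timestamp column.
STALE_NAMESPACE_INDEX = 2


def prefixed_title(namespace, title):
    """Build "namespace:title", or just the title when there is no namespace."""
    return f"{namespace}:{title}" if namespace is not None else title


# Positional access picks up the timestamp and prepends it to the title.
by_index = prefixed_title(row[STALE_NAMESPACE_INDEX], row["title"])
# by_index == "1436336857594:2019 FIFA Women's World Cup"

# Access by column name is unaffected by column order.
by_name = prefixed_title(row["namespace"], row["title"])
# by_name == "2019 FIFA Women's World Cup"
```

The positional read produces exactly the shape of query seen in the logs, which is why looking columns up by name is the safer habit.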
99 little bugs in the code, 99 little bugs, take one down, patch it around, 127 little bugs in the code.
Dan
Thanks to Dan for the great writeup; I've been finding this fantastic primarily because half of my bugs are indices-related and this makes me feel better :D.
Actually, thanks to an overflow problem it's now -2147483648 bugs. We've fixed MediaWiki, let's all go home!
On 3 August 2015 at 13:30, Dan Garry dgarry@wikimedia.org wrote:
In a twist of irony, this issue was actually caused by a patch I wrote to fix an annoying little bug in the app where the namespace of some pages was being set to null when they were saved to the user's storage.
You can see in the changes I made to the persistence helper that I took the column that was the timestamp and used it for the namespace instead. This was my first change to the database layer of the app, and I didn't quite realise the ramifications of doing what I did. Since Dmitry's fix noted that it was silly to ever use column indices rather than looking them up by name, I don't feel too bad about it.. ;-)
99 little bugs in the code, 99 little bugs, take one down, patch it around, 127 little bugs in the code.
Dan
-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation
I got 99 problems, but a bug ain't one.