Hey everyone,
I got access to some logs and I've been slogging through the data. In particular, I've partially analyzed a sample of 100K zero-result full_text searches against enwiki, over the course of about an hour (2015-07-23 07:51:29 to 2015-07-23 08:55:42). My results and opinions are below.
*TL;DR Summary: If these patterns hold for another sample (and across languages), we should be able to get some decent mileage out of these simple approaches:*
- find sources of weird patterns and either ignore them, or contact the source and redirect them to a more appropriate destination
- use language or character set detection to redirect queries to other wikis
- filter the term "quot" from queries
- filter 14###########: from the front of queries
- replace _ with space in queries
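A rough sketch of what the last three filters might look like (the exact patterns here are my guesses from the samples below, not a tested implementation):

```python
import re

def normalize(query):
    """Apply the candidate cleanups from the TL;DR list above."""
    q = re.sub(r'^14\d{11}:', '', query)   # drop the 13-digit "14...:" prefix
    q = re.sub(r'\bquot\b', ' ', q)        # drop stray "quot" terms
    q = q.replace('_', ' ')                # underscores to spaces
    return ' '.join(q.split())             # collapse leftover whitespace
```

For example, normalize('1436755654740:Sherlock_Holmes') would give 'Sherlock Holmes'.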
All of this is somewhat rough, and exact numbers aren't guaranteed. Also, the categories may overlap. I intend to look for these same patterns in another sample from a different day to make sure they are general and not just temporary idiosyncrasies. I also plan to look through other-language wikis (e.g., Spanish and French to start) to see if there are cross-linguistic patterns like these.
I think we have to somehow come to terms with the fact that some queries don't deserve results, and maybe figure out the source of such "illegitimate" queries and filter them. (I'd really like to be able to track down the referrer, if there is one, for a lot of the weirder queries.)
Top query:
- 248 Dounload feer game
- all via web... and Google can't find it. That's just weird.
Some other categories of queries are below. The numbers are "<total queries> / <unique queries>". Since this is a 100K sample of zero-result queries, and zero-result queries are about 25% of all queries, each 1,000 total queries here represents about 0.25% of all search queries.
253 / 171 string of numbers
3610 / 2505 no Latin letters
- I see Korean, Thai, Japanese, Cyrillic, DOI #s (see below), Arabic, Hebrew, Greek, Armenian, Georgian, Devanagari, Burmese, Chinese, and some emoji (e.g., 11 searches for 😜💗🎨❤️💋😞☀️💦).
- I also saw instances of mixed Latin / non-Latin queries
- Includes gibberish, which is hard to grep for, but easy to spot by eye
- Lots of the non-gibberish ones are clearly in other languages, and I saw queries in other Latin-alphabet languages go by, too.
2630 / 2627 DOIs, all in quotes
3015 / 1017 have quot in them (which gets auto-corrected to "quote", obviously)
- 327 are one word: quot ... quot
- I don't know where these are coming from, but they are weird. If we strip "quot" we would fix many of these. This must be coming from some source that is adding quotes, escaping them as &quot;, and then stripping & and ;. Weird.
7155 / 6337 #:Name
- almost all are 14###########:Text
- e.g., 1436755654740:Sherlock Holmes
- These all look like Wikipedia titles!
- Two each of 0:... and 6000:...
114 / 85 actual http(s):// URLs
488 / 244 URL-like things starting with www... and ending with .com, .ru, etc.
211 / 132 other searches starting with "www."
1085 / 1083 article searches in this format: ('"<TITLE>"', '<AUTHOR(S)>')
2457 / 2060 TV episodes (based on the presence of "S#E#"—that's season #, episode #)
8419 / 7523 AND boolean searches
703 / 701 OR boolean searches
- Many of these look auto-generated, especially in the aggregate.
- For example: there are 498 / 249 "House_of_Gurieli" AND ... queries
6310 / 5742 queries with _ in them
- only 934 / 790 if we skip the 14###########:Text and boolean AND queries
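The double-escaping theory behind the quot queries is easy to sanity-check. A minimal sketch, assuming the source HTML-escapes quote characters to &quot; and something downstream then replaces the & and ; characters with whitespace:

```python
import html
import re

def mangle(query):
    # Step 1: some upstream source HTML-escapes the query, so " becomes &quot;
    escaped = html.escape(query, quote=True)
    # Step 2: something downstream strips the & and ; characters
    return re.sub(r'[&;]', ' ', escaped).strip()
```

For a quoted query like '"sherlock holmes"', this yields 'quot sherlock holmes quot' — exactly the observed quot ... quot shape.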
Other things I noticed:
- lots of queries for books, articles, movies, tv, mp3s, and porn (in multiple languages)
- lots of "building up" searches (and these are all marked full_text), for example: achevm → achevme → achevmen → achevment → achevments → achevments o → achevments of → achevments of → achevments of h → achevments of he → achevments of hell → achevments of helle → achevments of hellen → achevments of hellen k → achevments of hellen k → achevments of hellen kell → achevments of hellen kelle → achevments of hellen keller
- reasonable-looking ~ queries don't work:
intitle:George~ intitle:Washin~ gives 0 results
intitle:Washington intitle:George gives 279 results
Finally, I did see a bunch of typos, but I didn't try to quantify them because I was digging into all of these other interesting patterns.
Have a good weekend. —Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
This is awesome work! Deleting underscores and using language detection sound like great approaches to take :). What's 14####, etc?
One of the worries I have here is the fact that the cirrus logs very deliberately don't (yet) give us enough information to identify readers. Are those buildup queries, or the most common queries, automata? We simply don't know :/.
(Don't worry, Erik and I have been brainstorming on ways to get more information in)
On 24 July 2015 at 17:16, Trey Jones tjones@wikimedia.org wrote:
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
I don't know what the 14###'s are. I googled them, thinking they were IDs of some sort, but found nothing.
For those and the buildup queries, and others, I'd love to get a source—referrer for web or app for API—and tell them to do something different.
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Fri, Jul 24, 2015 at 5:20 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Word! I used to spend a lot of time doing that with the sampled logs. Can take a look Monday if you can throw me the raw queries :)
On 24 July 2015 at 17:21, Trey Jones tjones@wikimedia.org wrote:
Trey Jones, 24/07/2015 23:16:
I've partially analyzed a sample of 100K zero-result full_text searches against enwiki, over the course of about an hour
Great! Can't wait for the same across all languages.
*TL;DR Summary: If these patterns hold for another sample (and across languages), we should be able to get some decent mileage out of these simple approaches:*
- find sources of weird patterns and either ignore them, or contact the source and redirect them to a more appropriate destination
- use language or character set detection to redirect queries to other wikis
Yes, that's the most important one. Crosswiki searches are possible with Cirrus: https://phabricator.wikimedia.org/T46420
T26767 may go a long way and T3837 may be fixed by following the same approach as for sister projects (boxes on Special:Search for crosswiki results).
- filter the term "quot" from queries
quot or " ?
- filter 14###########: from the front of queries
UNIX time? 1437794758
- replace _ with space in queries
Do you have the actual URL of the search? [[Special:Search/this_format]] is supposed to ignore underscores already.
Also, a full text search URL containing "search=" but not "fulltext=" is first of all a title search (the aim is to be redirected to the title if it exists).
- lots of queries for books, articles, movies, tv, mp3s, and porn (in multiple languages)
Titles and other proper names are perfect for crosswiki search!
For the incremental searches, do we know if all browsers which integrate a Wikipedia search bar use the correct API for suggestions?
Nemo
I forgot to mention that out of 100K queries, there were 80287 unique queries: 29667 via web, 70333 via api.
I should have distinguished api from web before.
- filter the term "quot" from queries
quot or " ?
just quot—it's weird. Those queries had to go through a couple of incompatible filters to end up like that.
- filter 14###########: from the front of queries
UNIX time? 1437794758
Very nice! That would explain why they all start with 14 and are that length. But even reading them as epoch timestamps (they're 13 digits, so milliseconds), they don't make sense. Here's the range over the 7000 instances:
1410632926515 -> Sat, 13 Sep 2014 18:28:46 GMT 1440274360664 -> Sat, 22 Aug 2015 20:12:40 GMT
That's from last year through next month. Makes no sense! All of these queries were within an hour on one day.
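For what it's worth, the endpoints above are consistent with reading the 13-digit prefixes as epoch milliseconds; a quick conversion (Python used purely for illustration):

```python
from datetime import datetime, timezone

def prefix_to_utc(prefix):
    """Interpret a 13-digit 14###########-style prefix as epoch milliseconds."""
    return datetime.fromtimestamp(prefix / 1000, tz=timezone.utc)

lo = prefix_to_utc(1410632926515)  # earliest prefix seen
hi = prefix_to_utc(1440274360664)  # latest prefix seen
print(lo, '->', hi)  # spans ~11 months, vs. a one-hour log window
```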
- replace _ with space in queries
Do you have the actual URL of the search? [[Special:Search/this_format]] is supposed to ignore underscores already
Also, a full text search URL containing "search=" but not "fulltext=" is first of all a title search (the aim is to be redirected to the title if it exists).
I don't have access to the URL at the moment, just the Elasticsearch logs. I limited my initial investigation to things marked as full text ("full_text search for ...").
And some of the underscores are in the weird boolean queries.
For the incremental searches, do we know if all browsers which integrate a Wikipedia search bar use the correct API for suggestions?
I don't know, but I'll keep that in mind.
Looking more carefully at the logs, I didn't distinguish web searches from API searches, which I should have. The two most obvious incremental searches were via the API... which makes sense.
More later...
—Trey
My original sample was a 100K sample from zero-results queries to enwiki on 7/24. Today I looked at similar samples from 7/10 and 7/17 (since there is a weekly pattern to traffic) and from 7/22 to compare.
All of the patterns I detected are still present, in approximately the same volume (give or take a factor of 2), except for the ('"<TITLE>"', '<AUTHOR(S)>') pattern.
I've started looking at a 500K sample from 7/24 across all wikis. I'll have more results tomorrow, but right now it's already clear that someone is spamming useless DOI searches across wikis—and it's 9% of the wiki zero-results queries.
—Trey
On Mon, Jul 27, 2015 at 2:04 PM, Trey Jones tjones@wikimedia.org wrote:
and it's 9% of the wiki zero-results queries
That's a huge discovery to better understand our traffic.
What do we know about who this is? proxy, bot, app, other, etc?
I'm eager to have a talk with them :)
--tomasz
On Mon, Jul 27, 2015 at 3:39 PM, Tomasz Finc tfinc@wikimedia.org wrote:
The current firehose of logs doesn't contain any PII, so we basically have no idea where these come from. I've been thinking with Oliver on if/what PII should be stored (the data is under NDA anyway, but we've always erred on the side of caution).
If the signature is as specific as we're seeing here, then I'm sure we'll see them again and can easily identify them.
--tomasz
On Mon, Jul 27, 2015 at 3:48 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
On Mon, Jul 27, 2015 at 3:39 PM, Tomasz Finc tfinc@wikimedia.org wrote:
On Mon, Jul 27, 2015 at 2:04 PM, Trey Jones tjones@wikimedia.org wrote:
and it's 9% of the wiki zero-results queries
That's a huge discovery to better understand our traffic.
What do we know about who this is? proxy, bot, app, other, etc?
I'm eager to have a talk with them :)
The current firehose of logs doesn't contain any PII, so we basically have no idea where these come from. I've been thinking with oliver on if/what PII should be stored (the data is under NDA anyways, but we've always err'd on the side of caution).
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
The signature is very consistent. I only have to search for ^"10. to find them, and they all look more or less like this:
"10.####/<ID>" OR "http://<publisher_website>/.../10.####/<ID>"
If they are consistently cranking out 45K of these searches every 2 hours or so, they should be easy to find once we have a place to look.
I'm trying to make sense of it. Does it make sense as referral spam or something?
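The ^"10. grep above amounts to something like this (the sample queries here are hypothetical; real registrant codes and suffixes vary):

```python
import re

# Queries matching the DOI-spam signature start with a quoted DOI:
#   "10.####/<ID>" OR "http://<publisher_website>/.../10.####/<ID>"
doi_sig = re.compile(r'^"10\.\d+/')

queries = [
    '"10.1371/example.id" OR "http://example.org/10.1371/example.id"',  # hypothetical
    'Sherlock Holmes',
]
hits = [q for q in queries if doi_sig.match(q)]
```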
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Mon, Jul 27, 2015 at 6:49 PM, Tomasz Finc tfinc@wikimedia.org wrote:
I don't know if this is the only source, but one likely source: http://sample.lagotto.io/sources/wikipedia. It directly says their default query against 25 Wikipedia instances is "DOI" or "URL". Being an open source project, this code is likely running in many places.
Found this after Max mentioned the API logs might have something. Basically I checked for logged API requests with `srsearch="10.` and sorted by the number of times a particular IP address showed up in short (~5 min) timespans across a few days. Several AWS IPs, a few IPs that don't have a name in reverse lookup, and sample.lagotto.io.
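A sketch of that aggregation (the tuple shape and field names are made up for illustration; the real log format differs):

```python
from collections import Counter

WINDOW = 5 * 60  # seconds per bucket (~5 min timespans)

def top_ips(requests):
    """requests: iterable of (epoch_seconds, ip, srsearch) tuples -- hypothetical shape.
    Count how often each IP issues DOI-style searches per 5-minute window."""
    counts = Counter()
    for ts, ip, srsearch in requests:
        if srsearch.startswith('"10.'):
            counts[(ip, ts // WINDOW)] += 1
    return counts.most_common()
```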
On Mon, Jul 27, 2015 at 5:08 PM, Trey Jones tjones@wikimedia.org wrote:
Erik Bernhardson, 28/07/2015 03:17:
I don't know if this is the only source, but one likely source: http://sample.lagotto.io/sources/wikipedia
I filed https://github.com/lagotto/lagotto/issues/405. But then, it's fine to use search for the purpose of Wikimedia projects analysis if they really like it. :)
Nemo
Erik and Nemo—thanks!
My first pass at data gathering at the end of the day yesterday was slightly skewed, but the general trend still holds... Getting Nemo's suggested "insource:" queries into Lagotto will definitely cut down on the number of zero-result searches we get (and actually do what they intend), if the update gets pushed to their heavy users.
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Tue, Jul 28, 2015 at 4:09 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
On 28/07/2015 15:09, Trey Jones wrote:
Beware that insource: is the most expensive query, and we allow only 20 insource: queries to run concurrently (across all Wikipedia sites). I'm not sure it's a good idea to expose this tool too widely. There are several features like that (I mean syntax that's not available in any other search engine with a large audience):
- wildcard queries (*)
- insource
- fuzzy searches
While these features are very useful to "expert users" I think we should not rely on such syntax to decrease the zero result rate because it won't scale.
Another solution for this specific use case is to build a custom analyzer that will extract this information from the content and expose a scalable search field.
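For concreteness, a minimal sketch of that index-time extraction (the regex, helper name, and field handling are illustrative assumptions, not existing CirrusSearch code): pull DOIs out of the wikitext when a page is indexed, so lookups can hit a cheap exact-match field instead of an insource: scan.

```python
import re

# Loose Crossref-style DOI shape: "10." + 4-9 digit registrant + "/" + suffix.
# Stop at whitespace and wikitext delimiters so template markup isn't swallowed.
DOI_IN_TEXT = re.compile(r'\b10\.\d{4,9}/[^\s|<>{}\[\]"]+')

def extract_dois(wikitext):
    """Collect distinct DOIs from page wikitext for a dedicated search field."""
    return sorted(set(DOI_IN_TEXT.findall(wikitext)))
```

Querying an exact-match field populated this way is cheap compared to insource:, which has to scan page text at query time.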
On Tue, Jul 28, 2015 at 7:18 AM, David Causse dcausse@wikimedia.org wrote:
All of what David said.
insource: is a hack for two reasons:
1) We took away the previous default behavior when we embarked on Cirrus. insource: was basically the old lsearchd behavior anyway.
2) I really really want us to replicate the indexes to labs (like we do databases) so labs/tool users can freely query them and come up with all kinds of cool toys. I think there's a task for it...but I can't find it (need coffee).
-Chad
Nemo recommended insource: to Lagotto because it would actually work and do what they want, but didn't consider the computational cost on our end. However, if we only allow 20 at a time, they would probably monopolize it entirely. In my sample we got about 50,000 of these queries in about an hour.
David/Chad, can you look at Nemo's issue and comment there on what's plausible and what's not? https://github.com/lagotto/lagotto/issues/405
Also, is this the kind of use case that we want to support? I'm not suggesting that it isn't, I really don't know. But they aren't looking for information, they are looking for something akin to impact factor on reputable parts of the web. If that's not something we want to support, how do we let them know? If that doesn't help—e.g., because it's some other installation using their tool that's generating all the queries—do we block it?
At the very least, we should ignore these malformed queries in our own metrics.
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On 28/07/2015 16:32, Trey Jones wrote:
Nemo recommended insource: to Lagotto because it would actually work and do what they want, but didn't consider the computational cost on our end. However, if we only allow 20 at a time, they would probably monopolize it entirely. In my sample we got about 50,000 of these queries in about an hour.
David/Chad, can you look at Nemo's issue and comment there on what's plausible and what's not? https://github.com/lagotto/lagotto/issues/405
I added a comment there.
Also, is this the kind of use case that we want to support? I'm not suggesting that it isn't, I really don't know. But they aren't looking for information, they are looking for something akin to impact factor on reputable parts of the web. If that's not something we want to support, how do we let them know? If that doesn't help—e.g., because it's some other installation using their tool that's generating all the queries—do we block it?
I don't know what to do with this; they use our search engine as a workaround because, I guess, they don't want to deal with too much data, and it's pretty convenient to send queries to a system that doesn't blacklist anyone. If they were using Google they would only have been able to run something like 1 query per minute.
We should block/limit a source if:
- it hurts the system and makes the search experience bad for others
- it pollutes our stats in a way that makes it impossible for us to learn anything from search logs
When we start doing some statistical machine learning, this is something we will have to address.
Concerning the costly operators, if other tools/sources start to use them in a way that affects system performance, I'm afraid we will have to put these expert features behind permissions granted by wiki admins.
Hmm. I did a quick test on searching for some DOIs, and in fact Lagotto's syntax works fine. But most articles in the world are not in fact referenced in Wikipedia. I searched for DOI and found an example DOI: 10.1016/j.fgb.2007.07.013. All of these searches give the same 2 results:
10.1016/j.fgb.2007.07.013
"10.1016/j.fgb.2007.07.013"
"10.1016/j.fgb.2007.07.013" OR "http://www.sciencedirect.com/science/article/pii/S1087184507001259"
insource:10.1016/j.fgb.2007.07.013 gives a third result, but it's actually not relevant (it's a partial match on "10.1016/j.fgb"). So maybe Nemo should withdraw the suggestion to Lagotto entirely?
What we have here may actually just be 50,000 searches (per hour) for things that do not exist in Wikipedia, and zero results is the correct answer.
It sounds more and more like "zero results queries from known automata" is a good category for the dashboard.
By the way, while I like machine learning as much as the next math nerd, that's not the only relevant approach. I found these guys by hand very quickly, and we can definitely get low-hanging fruit like this manually. (The quot queries are another example.) I also think some minimal analysis by an expert system could identify other instances of clear categories of non-failure zero-results (like prefix searches; the series ant ... antm ... antma ... antman is clearly going somewhere, even though antma has no results).
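That prefix-search heuristic is simple enough to sketch (a hypothetical helper, assuming we can group a session's queries in time order; none of this is existing code):

```python
def is_prefix_run(queries):
    """True if each query in the time-ordered list extends the previous one,
    i.e. the user was still typing, so intermediate zero-result hits
    (like "antma" on the way to "antman") aren't real failures."""
    return all(b.startswith(a) for a, b in zip(queries, queries[1:]))
```

For example, is_prefix_run(["ant", "antm", "antma", "antman"]) is true, while an unrelated follow-up query breaks the run.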
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Mon, Jul 27, 2015 at 2:04 PM, Trey Jones tjones@wikimedia.org wrote:
My original sample was a 100K sample from zero-results queries to enwiki on 7/24. Today I looked at similar samples from 7/10 and 7/17 (since there is a weekly pattern to traffic) and from 7/22 to compare.
All of the patterns I detected are still present, in approximately the same volume (give or take a factor of 2), except for the ('"<TITLE>"', '<AUTHOR(S)>') pattern.
I've started looking at a 500K sample from 7/24 across all wikis. I'll have more results tomorrow, but right now it's already clear that someone is spamming useless DOI searches across wikis—and it's 9% of the wiki zero-results queries.
—Trey
Very interesting. I wonder if they ever get results for the DOI searches (for example, some of the references here have DOIs: https://en.wikipedia.org/wiki/DNA). If they are searching specifically for DOIs of specific reference materials, I wish we had a better way to let them query that (perhaps Wikidata eventually; I wonder how its support is for putting reference material into Wikidata).
Hi!
I've started looking at a 500K sample from 7/24 across all wikis. I'll have more results tomorrow, but right now it's already clear that someone is spamming useless DOI searches across wikis—and it's 9% of the wiki zero-results queries.
In case it is not some crazy bot, could we detect that the query looks like a DOI and output something like:
Hey, this looks like a DOI. Would you like to check out https://dx.doi.org/$doi ?
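A minimal sketch of that detection (the pattern and helper name are assumptions, not existing code): match queries that are nothing but a bare DOI and suggest the resolver URL.

```python
import re

# A query that is just a bare DOI, optionally prefixed with "doi:".
BARE_DOI = re.compile(r'^(?:doi:)?\s*(10\.\d{4,9}/\S+)\s*$', re.IGNORECASE)

def doi_redirect_hint(query):
    """If a zero-result query looks like a bare DOI, suggest the resolver URL."""
    m = BARE_DOI.match(query)
    return "https://dx.doi.org/" + m.group(1) if m else None
```

Ordinary queries fall through and get normal search handling.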
better way to let them query that (perhaps wikidata eventually, i wonder how their support is for putting reference material into wikidata).
Wikidata may not be the best for it, as DOIs, as I understand it, are used for documents, and Wikidata is a repository of information about things (entities), most of which aren't specific documents. I.e., unless it's related to Wikisource or Wikibooks, we usually wouldn't expect a specific document to have an entry on Wikidata.
Okay, I have a slightly better sample this morning. (I accidentally left out Wikipedias with abbreviations longer than 2 letters).
My new sample:
- 500K zero-result full_text queries (web and API) across the Wikipedias with 100K+ articles
- 383,433 unique search strings (that's a long, long tail)
- The sample covers a little over an hour: 2015-07-23 07:51:29 to 2015-07-23 08:55:42
- The top 10 (en, de, pt, ja, ru, es, it, fr, zh, nl) account for >83% of queries
Top 10 counts, for reference:
221618 enwiki
51936 dewiki
25500 ptwiki
24206 jawiki
21891 ruwiki
19913 eswiki
18303 itwiki
14443 frwiki
11730 zhwiki
7685 nlwiki
-----
417225
The DOI searches that appear to come from Lagotto installations hit 25 wikis (as the Lagotto docs said they would), with en getting a lot more, and ru getting fewer in this sample, and the rest *very* evenly distributed. (I missed ceb and war before—apologies). The total is just over 50K queries, or >10% of the full text queries against larger wikis that result in zero results.
===DOI
6050 enwiki
1904 nlwiki
1902 cebwiki
1901 warwiki
1900 viwiki
1900 svwiki
1900 jawiki
1899 frwiki
1899 eswiki
1899 dewiki
1898 zhwiki
1898 ukwiki
1898 plwiki
1898 itwiki
1897 ptwiki
1897 nowiki
1897 fiwiki
1896 huwiki
1896 fawiki
1896 cswiki
1896 cawiki
1895 kowiki
1895 idwiki
1895 arwiki
475 ruwiki
-----
50181
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation