Hi everyone,
I've broadened my analysis from enwiki to the other larger wikis, looking at the same phenomena I found in enwiki.
While the DOI searches are definitely an issue across 25 wikis, the other earlier-identified issues are a mix: some are cross-wiki and some are not.
*TL;DR: After DOI searches, "unix timestamp" searches are the biggest cross-wikipedia issue. Weird AND queries and quot queries are big contributors on enwiki, which makes them important overall. We could easily fix the unix timestamp queries (either auto-correcting them or making suggestions), and we could fix lots of the quot queries. All of these could be included in the category of "automata" that could potentially be separated from regular queries, and it wouldn't hurt to track down their sources and help people search better.*
The <unix-timestamp-looking number>:<wiki title> format (with a small number with a space after the colon) is spread across 45 wikis, with 28,089 instances out of 500K (~5.6%). More than half of the results are enwiki (15,961), but there are 3133 on ru, 2986 on it, 1889 on ja, and hundreds on tr, fa, nl, ar, he, hi, id, and cs. At a cursory glance, all seem to be largely named entities or queries in the appropriate language. Removing the "14###########:", tracking down the source, or putting this on the automata list would help a lot.
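For what it's worth, stripping that prefix could be as simple as something like this (a sketch only; I'm assuming the prefix is a 10-13 digit timestamp-looking number starting with 14, followed by a colon, which would need to be verified against the logs):

import re

# Hypothetical cleanup for the "<unix-timestamp-looking number>:<title>" queries.
# The 10-13 digit assumption (seconds or milliseconds, starting with "14")
# is mine and should be checked against the actual logs.
TIMESTAMP_PREFIX = re.compile(r'^\s*14\d{8,11}\s*:\s*')

def strip_timestamp_prefix(query):
    """Return the query with any leading timestamp-looking prefix removed."""
    return TIMESTAMP_PREFIX.sub('', query)

# e.g. "1437638472000: Some Article Title" -> "Some Article Title"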
The boolean AND queries are largely in enwiki (17607: ~3.5% overall, ~7.9% in enwiki), and they are a mixed bag, but many (626) appear with quot, and most (16657) are of the form "article_title_with_underscore" AND "article title without underscores" where the first half is repeated over and over and the second half is something linked to in the first article. Find the source and add to the automata list.
In plwiki (263), the AND queries are all of the form *<musical thing>* AND (muzyk* OR Dyskografia) where <musical thing> seems to be an artist, band, album, or something similar. This looks like an automaton, but may not be worth pursuing. Similarly the ones from nl.
Globally, OR queries are much more common: 46,035 (~9.2%), spread much more evenly over all the wikis. These are almost all the DOI queries.
quot is totally an enwiki thing. It's ~1.2% overall and ~2.8% in enwiki in this sample, which is a lot for one small thing. We should either create a secondary search with filtered quot or track down the source and help them figure out how to do better.
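If these really are mangled &quot; entities (that's my guess, not confirmed), the filtered re-query could be roughly:

import re

# Hypothetical cleanup, assuming the stray "quot" tokens are &quot; entities
# that lost their "&" and ";" somewhere upstream. Dropping them is the safe
# option for a re-query; restoring real quote characters would be the fancier one.
def filter_quot(query):
    query = query.replace('&quot;', ' ')
    query = re.sub(r'\bquot\b', ' ', query)
    return ' '.join(query.split())

# e.g. 'quot Some Title quot' -> 'Some Title'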
TV episodes and films ("<title> S#E#" film) are mostly on enwiki (~1.1% overall, ~2.4% of enwiki queries), with some on ja, fr, and de, and single digits on it and ru. I'd count this as automata, though finding a source would be nice.
Strings of numbers do happen everywhere, but are only common on enwiki, with less on jawiki, and much less on de, fr, ru, vi, and nl.
My last bit of analysis will come later this week: I'll try to look at non-English and/or cross-wiki stuff, write it all up in Phabricator, and move on.
On Tue, Jul 28, 2015 at 9:51 AM, Trey Jones tjones@wikimedia.org wrote:
Okay, I have a slightly better sample this morning. (I accidentally left out Wikipedias with abbreviations longer than 2 letters).
My new sample:
- 500K zero-result full_text queries (web and API) across the Wikipedias with 100K+ articles
- 383,433 unique search strings (that's a long, long tail)
- The sample covers a little over an hour: 2015-07-23 07:51:29 to 2015-07-23 08:55:42
- The top 10 (en, de, pt, ja, ru, es, it, fr, zh, nl) account for >83% of queries
Top 10 counts, for reference:
221618 enwiki
 51936 dewiki
 25500 ptwiki
 24206 jawiki
 21891 ruwiki
 19913 eswiki
 18303 itwiki
 14443 frwiki
 11730 zhwiki
  7685 nlwiki
------
417225 total
The DOI searches that appear to come from Lagotto installations hit 25 wikis (as the Lagotto docs said they would), with en getting a lot more, and ru getting fewer in this sample, and the rest *very* evenly distributed. (I missed ceb and war before—apologies). The total is just over 50K queries, or >10% of the full text queries against larger wikis that result in zero results.
===DOI
 6050 enwiki
 1904 nlwiki
 1902 cebwiki
 1901 warwiki
 1900 viwiki
 1900 svwiki
 1900 jawiki
 1899 frwiki
 1899 eswiki
 1899 dewiki
 1898 zhwiki
 1898 ukwiki
 1898 plwiki
 1898 itwiki
 1897 ptwiki
 1897 nowiki
 1897 fiwiki
 1896 huwiki
 1896 fawiki
 1896 cswiki
 1896 cawiki
 1895 kowiki
 1895 idwiki
 1895 arwiki
  475 ruwiki
-----
50181 total
On Mon, Jul 27, 2015 at 5:04 PM, Trey Jones tjones@wikimedia.org wrote:
I've started looking at a 500K sample from 7/24 across all wikis. I'll have more results tomorrow, but right now it's already clear that someone is spamming useless DOI searches across wikis—and it's 9% of the wiki zero-results queries.
On Tue, Jul 28, 2015 at 12:57 PM, Trey Jones tjones@wikimedia.org wrote:
The boolean AND queries are largely in enwiki (17607: ~3.5% overall, ~7.9% in enwiki), and they are a mixed bag, but many (626) appear with quot, and most (16657) are of the form "article_title_with_underscore" AND "article title without underscores" where the first half is repeated over and over and the second half is something linked to in the first article. Find the source and add to the automata list.
We can probably do better on the underscores thing. Nik even said as much back in November[0].
-Chad

[0] https://phabricator.wikimedia.org/T64059
It seems we will have a number of different options to try; I wonder if it's better to have independent rules or tie them all together into a more generic rule.
For example:
- Underscore stripping
- Converting + into space (or just urldecoding)
- Quote stripping (the bad `quot` ones, but also things that are legitimately quoted but where the quoted query has no results)
- Timestamp stripping?

A highly generic rule that would probably get more (but worse) results: either remove or convert into a space everything that's not alphanumeric, and maybe even join the words with 'OR' instead of 'AND' if there are enough tokens.
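For what it's worth, here's a rough sketch of how these could be chained as fallbacks, from least to most destructive (purely illustrative Python; the patterns and thresholds are made up):

import re
import urllib.parse

def rewrite_candidates(query):
    """Yield progressively more aggressive rewrites of a zero-result query,
    to be retried (or offered as suggestions) in order."""
    # 1. Cheap normalization: urldecode, then underscores to spaces.
    q = urllib.parse.unquote_plus(query).replace('_', ' ')
    yield q
    # 2. Strip stray quot tokens and quotes.
    q = re.sub(r'\bquot\b', ' ', q).replace('"', ' ')
    yield ' '.join(q.split())
    # 3. Strip timestamp-looking prefixes.
    q = re.sub(r'^\s*14\d{8,11}\s*:\s*', '', q)
    yield ' '.join(q.split())
    # 4. Last ditch: everything that's not alphanumeric becomes a space;
    #    join with OR instead of AND if there are enough tokens.
    tokens = re.findall(r'\w+', q)
    yield (' OR ' if len(tokens) > 3 else ' ').join(tokens)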
If we go the route of attempting to rewrite the query into something more plausible, is that something we would be building into elasticsearch, or cirrussearch? I could come up with plausible reasons for it being on either side, but I'm leaning towards some sort of custom suggester implementation that does our own thing (although that may be due to not knowing the internal API limitations there).
On 29/07/2015 00:32, Erik Bernhardson wrote:
It seems we will have a number of different options to try, I wonder if it's better to have independent rules or tie them all together into a more generic rule.
For example: Underscore stripping Converting + into space (or just urldecoding)
_ and + are already handled by the lucene analysis chain. If the query "article_title" doesn't match, then "article title" won't match either:
- third_term[0]
- third+term[1]
Do you have an example where query_with_underscore returned no result and query with underscore returned a result?
Quote stripping (the bad `quot` ones, but also things that are legitimately quoted but the quoted query has no results) Timestamp stripping?
A highly generic rule that would probably get more (but worse) results: Either remove or convert into a space everything that's not alphanumeric Maybe even join the words with 'OR' instead of 'AND' if there are enough tokens
Reformatting the query at the character level can be quite dangerous because it can conflict with the analysis chain. Concerning OR and AND I agree, but we have to make sure it won't hurt the scoring. This is the purpose of query expansion[2]. Today we have only one query expansion profile, which permits using the full syntax offered by cirrus. IMHO the current profile is optimized for precision, but we could implement different profiles. To illustrate this idea, look at the query word1 word2[3]: today the expansion is an AND query over all.plain with boost 1 and all with boost 0.5.
- all.plain contains exact words
- all contains exact words + stems

Another expansion profile could be:
- AND over all.plain with boost 1
- AND over all with boost 0.5
- OR over all.plain with boost 0.2
- OR over all with boost 0.1
This is oversimplified, but it would be great if we could refactor cirrus in a way that makes it easy to implement different query expansion profiles. We could get rid of query_string for some profiles and use more advanced DSL query clauses (dis_max, boosting query, common terms query...).
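To make the idea concrete, here is roughly (and only as an illustration; the boosts are the example values above and the field names follow the all/all.plain convention) what that second profile could look like for "word1 word2" as an elasticsearch bool query, written as a Python dict:

# Illustrative only: the recall-oriented profile above, expanded for "word1 word2".
expanded_query = {
    "bool": {
        "should": [
            {"match": {"all.plain": {"query": "word1 word2", "operator": "and", "boost": 1.0}}},
            {"match": {"all":       {"query": "word1 word2", "operator": "and", "boost": 0.5}}},
            {"match": {"all.plain": {"query": "word1 word2", "operator": "or",  "boost": 0.2}}},
            {"match": {"all":       {"query": "word1 word2", "operator": "or",  "boost": 0.1}}},
        ],
        "minimum_should_match": 1,
    }
}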
[0] https://en.wikipedia.org/w/api.php?action=query&format=json&srsearch... [1] https://en.wikipedia.org/w/api.php?action=query&format=json&srsearch... [2] https://en.wikipedia.org/wiki/Query_expansion [3] https://en.wikipedia.org/w/index.php?search=word1+word2&title=Special%3A...
One issue I’ve had in the back of my mind but haven’t really made explicit is the question of exactly how to deal with these oddball queries.
There’s query normalization—converting + and _ to spaces, converting curly quotes to straight quotes, and the like—which should do at least whatever normalization the indexer does. (Is there documentation on what normalization the indexer does do?)
Then there are more destructive/transformative techniques, like stripping quot and timestamps, that should perhaps only be used for suggestions (which can be rolled over into re-queries when the original gives zero results).
URL decoding is sort of in between the two, to me. It’s more transformative than straightening quotes, but probably preserves the original intent of the searcher.
And then there are the last ditch attempts to get something vaguely relevant, like converting all non-alphanumerics to spaces—which is a very good last ditch effort, but should only be used if the original and maybe other backoffs fail.
In theory I also like David’s idea of refactoring things so multiple query expansion profiles are possible. Of course multiple searches are expensive. But even if we can’t run too many queries per search, the ability to run different expanded queries for different classes of queries is cool. (e.g., one word query => aggressive expansion; 50 word query => minimal expansion; 500 word query (see below!) => no expansion.)
In addition to identifying patterns and categories of searches that we can treat differently for analytics, it would make sense to do the same for actual queries. In our quest for non-zero results we shouldn’t favor recall over precision so much that we lose relevance.
One category I found was even crazier than I had anticipated was length. I don’t have the full details at hand at the moment, but out of 500K zero-result queries, there were thousands that were more than 100 characters long, and many that were over a thousand characters long. The longest were over 5000 characters. We should have a heuristic to not do any query expansion for queries longer than x characters, or z tokens or something. Doing OR expansion on hundreds of words—they often look like excerpts from books or articles—is a waste of our computational resources.
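Something as simple as the following would probably do it (a sketch; the thresholds are completely made up and would need tuning against real data):

# Hypothetical guard: pick a query expansion level based on query size.
def expansion_level(query):
    tokens = query.split()
    if len(query) > 300 or len(tokens) > 50:
        return "none"        # book-excerpt monsters: don't expand at all
    if len(tokens) > 25:
        return "minimal"
    if len(tokens) <= 2:
        return "aggressive"  # short queries can afford broad expansion
    return "default"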
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On 29/07/2015 16:53, Trey Jones wrote:
One issue I’ve had in the back of my mind but haven’t really made explicit is the question of exactly how to deal with these oddball queries.
There’s query normalization—converting + and _ to spaces, converting curly quotes to straight quotes, and the like—which should do at least whatever normalization the indexer does. (Is there documentation on what normalization the indexer does do?)
Unfortunately no, and I'm afraid the analysis chain is too complex to write an exhaustive description of it. The easiest way to check is to use vagrant and run these elasticsearch requests:

A simple fulltext query will target the title and redirects data with the all_near_match field:
curl -XGET 'localhost:9200/wiki_content/_analyze?field=all_near_match&pretty' -d 'article_title article+title'; echo
{
  "tokens" : [ {
    "token" : "article title article+title",
    "start_offset" : 0,
    "end_offset" : 27,
    "type" : "word",
    "position" : 1
  } ]
}

A fulltext search will also query the all.plain field (all fields with a standard analyzer):

curl -XGET 'localhost:9200/wiki_content/_analyze?field=all.plain&pretty' -d 'article_title article+title'; echo
{
  "tokens" : [ {
    "token" : "article_title",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "article",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "title",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

And finally the all field, which depends on the language (stems, stopwords):

curl -XGET 'localhost:9200/wiki_content/_analyze?field=all&pretty' -d 'article_title article+title'; echo
{
  "tokens" : [ {
    "token" : "article",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "title",
    "start_offset" : 8,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "article",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "title",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
So in this case + and _ should not prevent the query from matching the doc. It would be even worse to remove them, because it would then be impossible to find words that have been indexed with an '_'. IMHO we should let lucene do its job, because we would run into many subtle bugs if we try to normalize anything beforehand.
But the query "article_title" (with quotes) will target only all.plain, and underscores are kept. I think the proper fallback method here is to drop the quotes when there's no match with quotes[0].
[0] https://www.google.fr/search?q=%22google+you+don%27t+have+this+page"
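A trivial sketch of that fallback (hypothetical names; the real logic would live wherever we end up doing zero-result retries):

# Hypothetical fallback: if a quoted (phrase) query gets zero hits,
# retry it with the quotes removed.
def search_with_quote_fallback(search, query):
    results = search(query)
    if not results and '"' in query:
        results = search(query.replace('"', ' '))
    return results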
Okay, this is my final big blob of output on this topic. I'll put my results on a wiki page and link to it in Phabricator.
—Trey
TL;DR: We get some ridiculously long queries (up to 5K characters)—lots are junk. We are also getting a Zerg Rush from bots:
- DOI and "timestamp" queries are everywhere (automata).
- We have lots of searches for energy-related articles (automaton?).
- There's a likely automaton searching for term+term+term country
- spam-looking queries for <manufacturing terms> ## de tel fax
- paint bot: ""<artist>"" paint and <wikimedia commons file name> paint
- Chinese product descriptions and part numbers/phone numbers/etc.
• I reviewed two samples similar to my original one, from a week before and from two weeks before, and found the distribution of zero-result queries was generally similar, though one week dewiki got ~30K (~60%) more. DOI searches ranged from ~15K to ~100K (!!) and "unix timestamp" searches ranged from 26K to 42K.
• Boolean AND queries were also within a factor of two in the other samples. quot queries and " film" queries were very similar.
• I mentioned very long queries before. Here's a breakdown by length (the categories overlap):
length  count
  150+   2262
  200+    725
  300+    435
  400+    331
  500+    261
 1000+    120
 2000+     59
 3000+     24
 4000+     10
 5000+      1
Some of the DOI queries are over 150 characters, and make the list. The really long ones often look like random bits of text.
• A regular fixture in the top 100 zero-result queries per day is a 173-character string from a particular novel (Google Books found it right away). That's just weird.
• I looked at crosswiki zero-results searches and broke out the individual words in the queries to find recurring patterns. I've noted ones that have ~1000 instances. (Even 0.2% is a fair chunk for a single phenomenon.)
There are lots of DOI-related terms, of course, and our old friend "quot", lots of URL bits. {searchTerms} shows up 1998 times (mostly in ru). search_suggest_query shows up 440 times (en, de, fr, sv, nl and others).
There are lots of words related to searching for articles about energy: wind, power, turbine, energy, etc. Lots of long titles are included in several formats. I think this may also be a bot.
• There's a weird pattern, largely in eswiki, but some in enwiki and frwiki, where there are a bunch of search terms joined by +, followed by a space and the name of a country. Australia, Austria, Bangladés, Bélgica, and Argentina are most used, but there are ~90 different countries (sometimes the same country with its name in different languages), for 5600 total instances. The alphabetic skew may or may not be related to the size of my sample.
1719 Australia
1682 Austria
 659 Bangladés
 537 Bélgica
 519 Argentina
 119 Bolivia
...
There are 2380 more instances in eswiki of a bunch of words mushed together with +'s. Looking at them in order in the logs, they are largely in alphabetical order. A one-week earlier sample had fewer; a two-week earlier sample had even more. Another likely bot.
• There are lots of intitle searches. Many in nlwiki (out of 355) and frwiki (out of 414) are for names that don't seem to be in that wiki. Most failed intitle searches in enwiki (out of 504) are in Spanish or Portuguese. The rest are <100 instances, so I didn't investigate.
Similar patterns in the other two samples.
• There's a weird pattern (1293 instances), all in dewiki, like this: <manufacturing terms> ## de tel fax
<manufacturing terms> includes injection molding, stone cutting die casting, etc.
Other samples have a similar pattern.
• Another weird pattern (953 instances): ""<artist>"" paint <-- literally double double-quoted. And ~140 instances of: <wikimedia commons file name> paint
All on enwiki, and the commons file names don't include the file type (e.g., ".jpg").
The same or more in other samples.
• 989 instances of this on enwiki <descriptions of products in chinese> *###########*QQ########座机###########*<misc>.<misc>.xyz
座机 = "landline"; <misc> = letters, numbers, and transliterated Chinese (Pinyin?)
Online searches for parts of these reveal a similar pattern on Chinese-language business/manufacturing sites.
The same in samples from other weeks.
• Finally, I reviewed the larger collections of zero-result queries (10K+ from a given wiki). My ability to analyze languages I don't know is limited, but here are some very brief impressions:
- dewiki has a few hundred OR'd-together wildcard searches, some of which seem to be trying to handle variations in declension.
- jawiki has lots of " film" searches.
- ruwiki has a few non-Cyrillic searches.
- itwiki has lots of queries that are multi-word phrases with underscores instead of spaces.
- eswiki and frwiki have a fair number of build up searches and searches in Arabic, and frwiki has a fair number of searches in Chinese.
- zhwiki has lots of non-Chinese searches in various languages.
Trey Jones Software Engineer, Discovery Wikimedia Foundation
A summary... for those who haven't been able to keep up with the voluminous emails:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Result...
—Trey
Many thanks for the summary. Now time to twiddle bits and learn if we're right.