TL;DR We get some ridiculously long queries (up to 5K characters)—lots are junk. We are also getting a Zerg Rush from bots:
- DOI and "timestamp" queries are everywhere (automata).
- We have lots of searches for energy-related articles (automaton?).
- Chinese product descriptions and part numbers/phone numbers/etc.
• I reviewed two samples similar to my original one, from a week before and from two weeks before, and found the distribution of zero-result queries was generally similar, though in one week dewiki got ~30K (~60%) more. DOI searches ranged from ~15K to ~100K (!!) and "unix timestamp" searches ranged from 26K to 42K.
• Boolean AND queries were also within a factor of two in the other samples. quot queries and " film" queries were very similar.
• I mentioned very long queries before. Here's a breakdown by length (the categories overlap):
length count
150+ 2262
200+ 725
300+ 435
400+ 331
500+ 261
1000+ 120
2000+ 59
3000+ 24
4000+ 10
5000+ 1
Some of the DOI queries are over 150 characters, and make the list. The really long ones often look like random bits of text.
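A breakdown like the one above could be produced with something along these lines. This is only a sketch: the function name `length_breakdown` is made up for illustration, and it assumes the zero-result queries are already available as a list of strings, one per query.

```python
# Thresholds copied from the table above; buckets overlap (each is "at least N").
THRESHOLDS = [150, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000]


def length_breakdown(queries, thresholds=THRESHOLDS):
    """Map each threshold to the number of queries with len(query) >= threshold."""
    counts = {t: 0 for t in thresholds}
    for query in queries:
        n = len(query)  # counts Unicode code points, not bytes
        for t in thresholds:
            if n >= t:
                counts[t] += 1
    return counts
```

Because a query of 450 characters lands in the 150+, 200+, 300+ and 400+ rows at once, the rows are cumulative rather than disjoint, which is what "the categories overlap" means here.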
• A regular fixture in the top 100 zero-result queries per day is a 173-character string from a particular novel (Google Books found it right away). That's just weird.
• I looked at crosswiki zero-results searches and broke out the individual words in the queries to find recurring patterns. I've noted ones that have ~1000 instances. (Even 0.2% is a fair chunk for a single phenomenon.)
There are lots of DOI-related terms, of course, and our old friend "quot", and lots of URL bits. {searchTerms} shows up 1998 times (mostly in ru). search_suggest_query shows up 440 times (en, de, fr, sv, nl and others).
There are lots of words related to searching for articles about energy: wind, power, turbine, energy, etc. Lots of long titles are included in several formats. I think this may also be a bot.
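For reference, breaking queries into words and keeping the frequent ones can be sketched roughly as below. The name `recurring_tokens` and the whitespace-only splitting are my assumptions, not a record of the exact procedure used; splitting on whitespace alone keeps tokens such as {searchTerms} and search_suggest_query in one piece.

```python
from collections import Counter


def recurring_tokens(queries, min_count=1000):
    """Return (token, count) pairs seen at least min_count times, most common first."""
    counts = Counter()
    for query in queries:
        # casefold() merges "Wind" and "wind"; split() breaks on any whitespace
        counts.update(query.casefold().split())
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]
```

The default of 1000 mirrors the ~1000-instance cutoff mentioned above; a smaller sample would need a lower value.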
• There's a weird pattern, largely in eswiki, but some in enwiki and frwiki, where there are a bunch of search terms joined by +, followed by a space and the name of a country. Australia, Austria, Bangladés, Bélgica, and Argentina are most used, but there are ~90 different countries (sometimes the same country with its name in different languages), for >5600 total instances. The alphabetic skew may or may not be related to the size of my sample.
1719 Australia
1682 Austria
659 Bangladés
537 Bélgica
519 Argentina
119 Bolivia
...
There are 2380 more instances in eswiki of a bunch of words mushed together with +'s.
Looking at them in order in the logs, they are largely in alphabetical order.
A one-week earlier sample had fewer; a two-week earlier sample had even more.
Another likely bot.
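The "terms joined by +, then a space, then a country" shape can be picked out with a regular expression. The sketch below is a guess at a workable pattern; `PLUS_JOINED`, `trailing_name` and `tally_trailing_names` are illustrative names, and the code does not check that the trailing text really is a country name.

```python
import re
from collections import Counter

# Two or more "+"-joined terms, whitespace, then whatever trails (the candidate country).
PLUS_JOINED = re.compile(r"^([^\s+]+(?:\+[^\s+]+)+)\s+(\S.*)$")


def trailing_name(query):
    """Return the text after the +-joined terms, or None if the query has another shape."""
    m = PLUS_JOINED.match(query.strip())
    return m.group(2) if m else None


def tally_trailing_names(queries):
    """Count how often each trailing name occurs across the matching queries."""
    names = (trailing_name(q) for q in queries)
    return Counter(name for name in names if name is not None)
```

Queries that are only +-joined words with nothing after them (like the 2380 extra eswiki instances) do not match and would need a separate, simpler pattern.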
• There are lots of intitle searches. Many in nlwiki (out of 355) and frwiki (out of 414) are for names that seem not to be in that wiki. Most failed intitle searches in enwiki (out of 504) are in Spanish or Portuguese. The rest are <100 instances, so I didn't investigate.
Similar patterns in the other two samples.
• There's a weird pattern (1293 instances), all in dewiki, like this:
<manufacturing terms> ## de tel fax
<manufacturing terms> includes injection molding, stone cutting, die casting, etc.
Other samples have a similar pattern.
• Another weird pattern (953):
""<artist>"" paint <-- literally double double quoted.
and (~140)
<wikimedia commons file name> paint
All on enwiki, and the commons file names don't include the file type (e.g., ".jpg").
The same or more in other samples.
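A possible way to flag the double-double-quoted variant is sketched below. The pattern and the name `quoted_paint_subject` are illustrative assumptions; the file-name variant is not covered because nothing in its shape is as distinctive.

```python
import re

# Two literal double quotes, some text, two more double quotes, then the word "paint".
DOUBLE_QUOTED_PAINT = re.compile(r'^""(.+?)""\s+paint$')


def quoted_paint_subject(query):
    """Return the text between the doubled quotes, or None if the query does not match."""
    m = DOUBLE_QUOTED_PAINT.match(query.strip())
    return m.group(1) if m else None
```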
• 989 instances of this on enwiki
<descriptions of products in chinese> *###########*QQ########座机###########*<misc>.<misc>.xyz
座机 = "landline"
<misc> = letters, numbers, and transliterated Chinese (Pinyin?)
Online searches for parts of these reveal a similar pattern on Chinese-language business/manufacturing sites.
The same in samples from other weeks.
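A loose heuristic for this shape might look like the sketch below. It assumes each # in the template stands for a digit and only checks for the QQ number, the 座机 number, and the trailing .xyz; `SPAM_SHAPE` and `looks_like_contact_spam` are made-up names.

```python
import re

# "QQ" + digits, later "座机" + digits, and the query ends in ".xyz".
SPAM_SHAPE = re.compile(r"QQ\d+.*座机\d+.*\.xyz\s*$")


def looks_like_contact_spam(query):
    """True if the query contains the QQ / 座机 / .xyz sequence described above."""
    return SPAM_SHAPE.search(query) is not None
```

The digit runs are matched with `\d+` rather than fixed lengths, so this is deliberately more permissive than the template.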
• Finally, I reviewed the larger collections of zero-results (10K+ from a given wiki). My ability to analyze languages I don't know is limited, but here are some very brief impressions:
- dewiki has a few hundred OR'd together wildcard searches, some of which seem to be trying to handle variations in declension.
- jawiki has lots of " film" searches.
- ruwiki has a few non-Cyrillic searches
- itwiki has lots of queries that are multi-word phrases with underscores instead of spaces
- eswiki and frwiki have a fair number of build up searches and searches in Arabic, and frwiki has a fair number of searches in Chinese
- zhwiki has lots of non-Chinese searches in various languages
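Spotting queries written in an unexpected script (non-Cyrillic on ruwiki, non-Chinese on zhwiki, Arabic on eswiki and frwiki) does not require knowing the language. The sketch below uses the first word of each letter's Unicode character name as a rough stand-in for its script; that is an approximation, not the real Unicode Script property, and `scripts_in` and `lacks_script` are illustrative names.

```python
import unicodedata


def scripts_in(query):
    """Return rough script labels (e.g. LATIN, CYRILLIC, CJK, ARABIC) for the letters in query."""
    scripts = set()
    for ch in query:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # "" for characters without a name
            if name:
                scripts.add(name.split()[0])
    return scripts


def lacks_script(query, expected):
    """True if the query has letters but none of them are in the expected script."""
    found = scripts_in(query)
    return bool(found) and expected not in found
```

For example, `lacks_script(q, "CYRILLIC")` would flag the non-Cyrillic ruwiki queries. Han characters used in Japanese also come out as CJK, so this alone cannot separate Chinese from Japanese text.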