Hi everyone,
I've broadened my analysis from enwiki to the other larger wikis, looking
at the same phenomena I found in enwiki.
The DOI searches are definitely an issue across 25 wikis. Of the other
issues I identified earlier, some are cross-wiki and some are not.
*TL;DR: After DOI searches, "unix timestamp" searches are the biggest
cross-wikipedia issue. Weird AND queries and quot queries are big
contributors on enwiki, which makes them important overall. We could easily
fix the unix timestamp queries (either auto-correct or make suggestions),
and we could fix lots of the quot queries. All of these could be included
in the category of "automata" that could potentially be separated from
regular queries, and it wouldn't hurt to track down their sources and help
people search better.*
The <unix-timestamp-looking number>:<wiki title> format (a small number
of these have a space after the colon) is spread across 45 wikis, with 28,089
instances out of 500K (~5.6%). More than half of the results are enwiki
(15,961), but there are 3133 on ru, 2986 on it, 1889 on ja, and hundreds on
tr, fa, nl, ar, he, hi, id, and cs. At a cursory glance, all seem to be
largely named entities or queries in the appropriate language. Removing the
"14###########:", tracking down the source, or putting this on the automata
list would help a lot.
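
A minimal sketch of the auto-correct idea, assuming the prefix is always "14" followed by eleven more digits (my reading of the "14###########:" placeholder), then a colon and at most one space. The function name and the example queries are made up for illustration; they are not taken from the sample.

```python
import re

# Assumption: a 13-digit, millisecond-style timestamp starting with "14",
# followed by a colon and an optional single whitespace character.
TIMESTAMP_PREFIX = re.compile(r"^14\d{11}:\s?")


def strip_timestamp_prefix(query: str) -> str:
    """Return the query without a leading '14###########:' prefix, if any."""
    return TIMESTAMP_PREFIX.sub("", query, count=1)


# Made-up example queries, not real log entries.
examples = [
    "1437637889123:Albert Einstein",
    "1437637889123: Albert Einstein",
    "Albert Einstein",
]
for q in examples:
    print(repr(q), "->", repr(strip_timestamp_prefix(q)))
```

Queries without the prefix pass through unchanged, so the same function could feed either an automatic rewrite or a "did you mean" suggestion.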
The boolean AND queries are largely in enwiki (17607: ~3.5% overall, ~7.9%
in enwiki), and they are a mixed bag, but many (626) appear with quot, and
most (16657) are of the form
"article_title_with_underscore" AND "article title without underscores"
where the first half is repeated over and over and the second half is
something linked to in the first article. We should find the source and add
these to the automata list.
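
One way to flag this pattern, as a rough sketch: match two quoted halves joined by AND, require an underscore and no spaces in the first half, and count how often each first half repeats. The regex, the function names, and the example queries are my own assumptions for illustration; the real queries may vary in ways this does not cover.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Assumed shape: "Title_with_underscores" AND "title with spaces"
AND_PAIR = re.compile(r'^"([^"\s]+)"\s+AND\s+"([^"]+)"$')


def parse_and_pair(query: str) -> Optional[Tuple[str, str]]:
    """Return (first_half, second_half) if the query fits the pattern."""
    m = AND_PAIR.match(query.strip())
    if m is None or "_" not in m.group(1):
        return None
    return m.group(1), m.group(2)


def repeated_first_halves(queries: Iterable[str]) -> Counter:
    """Count how many matching queries share each first half."""
    counts: Counter = Counter()
    for q in queries:
        pair = parse_and_pair(q)
        if pair is not None:
            counts[pair[0]] += 1
    return counts


# Made-up example queries, not real log entries.
sample = [
    '"Albert_Einstein" AND "Theory of relativity"',
    '"Albert_Einstein" AND "Nobel Prize in Physics"',
    "cats AND dogs",
]
print(repeated_first_halves(sample).most_common(5))
```

A first half that shows up many times in a short window is a decent hint that one client is walking the links of a single article.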
In plwiki (263), the AND queries are all of the form
*<musical thing>* AND (muzyk* OR Dyskografia)
where <musical thing> seems to be an artist, band, album, or something
similar. This looks like an automaton, but may not be worth pursuing. The
same goes for the ones from nl.
Globally, OR queries are much more common: 46,035 (~9.2%), spread much more
evenly over all the wikis. These are almost all the DOI queries.
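
For anyone who wants to re-derive the percentages so far, here is a small sketch using only counts from this thread (the 500K sample size and the enwiki total of 221,618 from the top-10 list quoted below). The helper name `pct` and the dictionary labels are just for illustration.

```python
SAMPLE_SIZE = 500_000      # zero-result queries in the sample
ENWIKI_QUERIES = 221_618   # enwiki count from the quoted top-10 list


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


shares = {
    "timestamp queries, overall": pct(28_089, SAMPLE_SIZE),
    "timestamp queries, enwiki share": pct(15_961, 28_089),
    "AND queries, overall": pct(17_607, SAMPLE_SIZE),
    "AND queries, within enwiki": pct(17_607, ENWIKI_QUERIES),
    "OR queries, overall": pct(46_035, SAMPLE_SIZE),
}
for label, value in shares.items():
    print(f"{label}: {value}%")
```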
quot is totally an enwiki thing. It's ~1.2% overall and ~2.8% in enwiki in
this sample, which is a lot for one small thing. We should either create a
secondary search with filtered quot or track down the source and help them
figure out how to do better.
TV episodes and films ("<title> S#E#" film) are mostly on enwiki (~1.1%
overall, ~2.4% of enwiki queries), with some on ja, fr, and de, and single
digits on it and ru. I'd count this as automata, though finding a source
would be nice.
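
If we do put these on the automata list, a rough detector could look for the S#E# token itself. This is only a sketch: I am assuming one or two season digits, up to three episode digits, and case-insensitive matching, and the example queries are made up.

```python
import re
from typing import Optional, Tuple

# Assumption: season is 1-2 digits, episode is 1-3 digits, any letter case.
EPISODE_TAG = re.compile(r"\bS(\d{1,2})E(\d{1,3})\b", re.IGNORECASE)


def find_episode_tag(query: str) -> Optional[Tuple[int, int]]:
    """Return (season, episode) if the query contains an S#E# token."""
    m = EPISODE_TAG.search(query)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


# Made-up example queries, not real log entries.
for q in ["Some Show S02E05", "s1e10 some show", "Some Show"]:
    print(repr(q), "->", find_episode_tag(q))
```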
Strings of numbers do happen everywhere, but are only common on enwiki,
with fewer on jawiki, and far fewer on de, fr, ru, vi, and nl.
My last bit of analysis will come later this week, and I'll try to look at
non-English and/or cross-wiki stuff, write it all up in Phabricator, and
move on.
On Tue, Jul 28, 2015 at 9:51 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
> Okay, I have a slightly better sample this morning. (I accidentally left
> out Wikipedias with abbreviations longer than 2 letters).
>
> My new sample:
> 500K zero-result full_text queries (web and API) across the Wikipedias
> with 100K+ articles
> 383,433 unique search strings (that's a long, long tail)
> The sample covers a little over an hour: 2015-07-23 07:51:29 to 2015-07-23
> 08:55:42
> The top 10 (en, de, pt, ja, ru, es, it, fr, zh, nl) account for >83% of
> queries
>
> Top 10 counts, for reference:
> 221618 enwiki
> 51936 dewiki
> 25500 ptwiki
> 24206 jawiki
> 21891 ruwiki
> 19913 eswiki
> 18303 itwiki
> 14443 frwiki
> 11730 zhwiki
> 7685 nlwiki
> -----
> 417225
>
> The DOI searches that appear to come from Lagotto installations hit 25
> wikis (as the Lagotto docs said they would), with en getting a lot more,
> and ru getting fewer in this sample, and the rest *very* evenly
> distributed. (I missed ceb and war before—apologies). The total is just
> over 50K queries, or >10% of the full text queries against larger wikis
> that result in zero results.
>
> ===DOI
> 6050 enwiki
> 1904 nlwiki
> 1902 cebwiki
> 1901 warwiki
> 1900 viwiki
> 1900 svwiki
> 1900 jawiki
> 1899 frwiki
> 1899 eswiki
> 1899 dewiki
> 1898 zhwiki
> 1898 ukwiki
> 1898 plwiki
> 1898 itwiki
> 1897 ptwiki
> 1897 nowiki
> 1897 fiwiki
> 1896 huwiki
> 1896 fawiki
> 1896 cswiki
> 1896 cawiki
> 1895 kowiki
> 1895 idwiki
> 1895 arwiki
> 475 ruwiki
> -----
> 50181
>
> On Mon, Jul 27, 2015 at 5:04 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
>
>> I've started looking at a 500K sample from 7/24 across all wikis. I'll
>> have more results tomorrow, but right now it's already clear that someone
>> is spamming useless DOI searches across wikis—and it's 9% of the wiki
>> zero-results queries.
>>
>>