Okay, I have a slightly better sample this morning. (The earlier one
accidentally left out Wikipedias with language codes longer than 2 letters.)
My new sample:
500K zero-result full_text queries (web and API) across the Wikipedias with
100K+ articles
383,433 unique search strings (that's a long, long tail)
The sample covers a little over an hour: 2015-07-23 07:51:29 to 2015-07-23
08:55:42
The top 10 (en, de, pt, ja, ru, es, it, fr, zh, nl) account for >83% of
queries
Top 10 counts, for reference:
221618 enwiki
51936 dewiki
25500 ptwiki
24206 jawiki
21891 ruwiki
19913 eswiki
18303 itwiki
14443 frwiki
11730 zhwiki
7685 nlwiki
-----
417225
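For anyone who wants to recheck the arithmetic, here is a minimal Python sketch that uses only the counts listed above. The variable names are illustrative, and it assumes the "500K" sample is exactly 500,000 queries, which may be off slightly.

```python
# Zero-result query counts per wiki, copied from the top-10 list above.
top10 = {
    "enwiki": 221618,
    "dewiki": 51936,
    "ptwiki": 25500,
    "jawiki": 24206,
    "ruwiki": 21891,
    "eswiki": 19913,
    "itwiki": 18303,
    "frwiki": 14443,
    "zhwiki": 11730,
    "nlwiki": 7685,
}

# Assumption: "500K" is treated as exactly 500,000 queries.
SAMPLE_SIZE = 500_000

top10_total = sum(top10.values())
top10_share = top10_total / SAMPLE_SIZE

print("top-10 total:", top10_total)
print("top-10 share of sample:", top10_share)

# Per-wiki share of the whole sample, largest first.
for wiki, count in sorted(top10.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{wiki:8s} {count:7d} {count / SAMPLE_SIZE:.3f}")
```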
The DOI searches that appear to come from Lagotto installations hit 25
wikis (as the Lagotto docs said they would). In this sample en gets about
three times as many as the others, ru gets about a quarter as many, and
the rest are *very* evenly distributed (1,895 to 1,904 each). (I missed
ceb and war before; apologies.) The total is just over 50K queries, or
>10% of the zero-result full-text queries against the larger wikis.
===DOI
6050 enwiki
1904 nlwiki
1902 cebwiki
1901 warwiki
1900 viwiki
1900 svwiki
1900 jawiki
1899 frwiki
1899 eswiki
1899 dewiki
1898 zhwiki
1898 ukwiki
1898 plwiki
1898 itwiki
1897 ptwiki
1897 nowiki
1897 fiwiki
1896 huwiki
1896 fawiki
1896 cswiki
1896 cawiki
1895 kowiki
1895 idwiki
1895 arwiki
475 ruwiki
-----
50181
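The "very evenly distributed" observation can be checked with a short Python sketch over the counts above. The 10% tolerance around the median is an arbitrary threshold picked for illustration, not something taken from the analysis itself, and the sample is again assumed to be exactly 500,000 queries.

```python
from statistics import median

# DOI zero-result query counts per wiki, copied from the list above.
doi_counts = {
    "enwiki": 6050, "nlwiki": 1904, "cebwiki": 1902, "warwiki": 1901,
    "viwiki": 1900, "svwiki": 1900, "jawiki": 1900, "frwiki": 1899,
    "eswiki": 1899, "dewiki": 1899, "zhwiki": 1898, "ukwiki": 1898,
    "plwiki": 1898, "itwiki": 1898, "ptwiki": 1897, "nowiki": 1897,
    "fiwiki": 1897, "huwiki": 1896, "fawiki": 1896, "cswiki": 1896,
    "cawiki": 1896, "kowiki": 1895, "idwiki": 1895, "arwiki": 1895,
    "ruwiki": 475,
}

# Assumption: "500K" is treated as exactly 500,000 queries.
SAMPLE_SIZE = 500_000
# Assumption: arbitrary cutoff for "far from typical", chosen for illustration.
TOLERANCE = 0.10

doi_total = sum(doi_counts.values())
doi_share = doi_total / SAMPLE_SIZE
typical = median(doi_counts.values())

# Wikis whose count differs from the median by more than the tolerance.
outliers = {
    wiki: count
    for wiki, count in doi_counts.items()
    if abs(count - typical) / typical > TOLERANCE
}

print("wikis hit:", len(doi_counts))
print("DOI total:", doi_total)
print("DOI share of sample:", doi_share)
print("median per wiki:", typical)
print("outliers:", outliers)
```

With these counts only enwiki and ruwiki fall outside the tolerance; the other 23 wikis sit within a few queries of the median.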
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Mon, Jul 27, 2015 at 5:04 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
My original sample was a 100K sample from zero-results queries to enwiki
on 7/24. Today I looked at similar samples from 7/10 and 7/17 (since there
is a weekly pattern to traffic) and from 7/22 to compare.
All of the patterns I detected are still present, in approximately the
same volume (give or take a factor of 2), except for the
('"<TITLE>"', '<AUTHOR(S)>') pattern.
I've started looking at a 500K sample from 7/24 across all wikis. I'll
have more results tomorrow, but right now it's already clear that someone
is spamming useless DOI searches across wikis—and it's 9% of the wiki
zero-results queries.
—Trey