Hey everyone,
I got access to some logs and I've been slogging through the data. In
particular, I've partially analyzed a sample of 100K zero-result full_text
searches against enwiki, over the course of about an hour (2015-07-23
07:51:29 to 2015-07-23 08:55:42). My results and opinions are below.
TL;DR Summary: If these patterns hold for another sample (and across
languages), we should be able to get some decent mileage out of these
simple approaches:
- find sources of weird patterns and either ignore them, or contact the
source and redirect them to a more appropriate destination
- use language or character set detection to redirect queries to other
wikis
- filter the term "quot" from queries
- filter 14###########: from the front of queries
- replace _ with space in queries
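For the last three bullets, the cleanup pass could be as simple as
something like this (a rough sketch only; I haven't run exactly this):

import re

def clean_query(q):
    # strip the mysterious 14###########: prefix (more on it below)
    q = re.sub(r'^14\d{11}:', '', q)
    # drop stray "quot" tokens (more on where those come from below)
    q = re.sub(r'\bquot\b', ' ', q)
    # treat underscores as spaces, as in wiki page titles
    q = q.replace('_', ' ')
    return ' '.join(q.split())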
All of this is somewhat rough, and exact numbers aren't guaranteed; the
categories may also overlap. I intend to look for these same patterns in
another sample from a different day, to make sure they are general and not
just temporary idiosyncrasies. I also plan to look through other language
wikis (e.g., Spanish and French to start) to see if there are
cross-linguistic patterns like these.
I think we have to somehow come to terms with the fact that some queries
don't deserve results, and maybe figure out the source of such
"illegitimate" queries and filter them. (I'd really like to be able to
track down the referrer, if there is one, for a lot of the weirder queries.)
Top query:
- 248 Dounload feer game
- all via web... and Google can't find it. That's just weird.
Some other categories of queries are below. The numbers are "<total
queries> / <unique queries>". Since this is a 100K sample of zero-result
queries, and zero-result queries are about 25% of all queries, each 1,000
total queries here is 1% of the sample, or about 0.25% of all search
queries.
253 / 171 strings of numbers
3610 / 2505 no Latin letters
- I see Korean, Thai, Japanese, Cyrillic, doi #s (see below), Arabic,
Hebrew, Greek, Armenian, Georgian, Devanagari, Burmese, Chinese, and some
emoji (e.g., 11 searches for 😜💗🎨❤️💋😞☀️💦).
- I also saw instances of mixed Latin / non-Latin queries
- Includes gibberish, which is hard to grep for, but easy to spot by eye
- Lots of the non-gibberish ones are clearly in other languages, and I saw
queries in other Latin-alphabet languages go by, too.
2630 / 2627 DOIs, all in quotes
3015 / 1017 have quot in them (which gets auto-corrected to "quote",
obviously)
- 327 are one word: quot ... quot
- I don't know where these are coming from, but they are weird. If we
stripped "quot", many of these would return results. They must be coming
from some source that is adding quotes, escaping them as "&quot;", and
then stripping the "&" and ";". Weird. (See the sketch after this list.)
7155 / 6337 #:Name
- almost all are 14###########:Text
- e.g., 1436755654740:Sherlock Holmes
- These all look like Wikipedia titles! (See the sketch after this list.)
- Two each of 0:... and 6000:...
114 / 85 actual http(s):// URLs
488 / 244 URL-like things starting with www... and ending with .com, .ru,
etc.
211 / 132 other searches starting with “www.”
1085 / 1083 article searches in this format: ('"<TITLE>"', '<AUTHOR(S)>')
2457 / 2060 TV episodes (based on the presence of "S#E#"—that's season #,
episode #)
8419 / 7523 AND boolean searches
703 / 701 OR boolean searches
- Many of these look auto-generated, especially in the aggregate.
- For example: there are 498 / 249 "House_of_Gurieli" AND ... queries
6310 / 5742 queries with _ in them
- only 934 / 790 if we skip the 14###########:Text and boolean AND queries
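For anyone who wants to poke at this themselves, here is roughly the kind
of pattern matching involved; these are illustrative approximations, not
the exact expressions I used:

import re, html
from datetime import datetime, timezone

# rough stand-ins for the categories above
PATTERNS = {
    'numbers':    re.compile(r'^[\d\s.,-]+$'),
    'doi':        re.compile(r'\b10\.\d{4,9}/\S+'),
    'url':        re.compile(r'^https?://'),
    'www':        re.compile(r'^www\.'),
    'tv_episode': re.compile(r'\bS\d{1,2} ?E\d{1,2}\b', re.IGNORECASE),
    'ts_prefix':  re.compile(r'^(14\d{11}):(.+)'),
}

def categorize(query):
    return [name for name, pat in PATTERNS.items() if pat.search(query)]

# The 14###########: prefixes look like millisecond Unix timestamps:
m = PATTERNS['ts_prefix'].match('1436755654740:Sherlock Holmes')
print(datetime.fromtimestamp(int(m.group(1)) / 1000, tz=timezone.utc))
# -> 2015-07-13 02:47:34.740000+00:00

# And the "quot" queries are consistent with quotes being HTML-escaped
# and then having "&" and ";" stripped:
print(html.escape('"some title"', quote=True).replace('&', ' ').replace(';', ' '))
# -> ' quot some title quot '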
Other things I noticed:
- lots of queries for books, articles, movies, tv, mp3s, and porn (in
multiple languages)
- lots of "building up" searches (and these are all marked full_text), for
example:
achevm
achevme
achevmen
achevment
achevments
achevments o
achevments of
achevments of
achevments of h
achevments of he
achevments of hell
achevments of helle
achevments of hellen
achevments of hellen k
achevments of hellen k
achevments of hellen kell
achevments of hellen kelle
achevments of hellen keller
- reasonable-looking ~ queries don't work:
intitle:George~ intitle:Washin~ gives 0 results
intitle:Washington intitle:George gives 279 results
Finally, I did see a bunch of typos, but I didn't try to quantify them
because I was digging into all of these other interesting patterns.
Have a good weekend.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
This thread started between a few of us, but has some good ideas and
thoughts. Forwarding into the search mailing list (where we will endeavour
to have these conversations in the future).
Erik B
---------- Forwarded message ----------
From: Oliver Keyes <okeyes(a)wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse(a)wikimedia.org>
Cc: Trey Jones <tjones(a)wikimedia.org>, Erik Bernhardson <
ebernhardson(a)wikimedia.org>
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> On 22 July 2015 at 10:55, David Causse <dcausse(a)wikimedia.org> wrote:
>> On 22/07/2015 15:21, Oliver Keyes wrote:
>>>
>>> Thanks; much appreciated. Point 3 directly relates to my work so it's
>>> good to be CCd :).
>>>
>>> FWIW, this kind of detail on the specific things we're doing is
>>> missing from the main search mailing list, and sharing it there would
>>> do a lot to inform people.
>>
>>
>> I agree. My intent right now is still to learn from each other and
>> build/use a friendly environment where engineers with an NLP background
>> like Trey can work efficiently. When things are clearer it'd be great
>> to share our plan.
>>
>>>
>>> Oliver is already handling the executor IDs and distinguishing full
>>> and prefix search, so nyah ;p.
>>
>> Great!
>>
>> Just to be sure: does this mean that a search count will be reduced to
>> its executorID:
>> - if all requests with the same executorID return zero results -> add
>> 1 to the zero-result counter
>> - if one of the requests returns a result -> do not increment the
>> zero-result counter
>> If yes, I think this will be the killer patch for Q1 :)
>>
>
> Executor IDs are stored and if a match is found in executor IDs <=120
> seconds after that one, the later outcome is considered "the outcome".
> If not, we assume no second round-trip was made and so go with
> whatever happened first.
>
> So if you make a request and it round-trips once and fails, failure.
> Round-trip once and succeeds, success. Round-trip twice and fail both
> times, failure. Round-trip twice and fail the first time and succeed
> the second - one success, zero failures :). Erik wrote it, and I grok
> the logic.
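>
> In code terms, roughly (hypothetical field names, not the actual log
> schema):
>
> def final_outcomes(requests):
>     # requests: dicts with 'executor_id', 'ts' (epoch seconds) and
>     # 'hits', sorted by timestamp
>     outcome = {}  # executor_id -> (ts, hits)
>     for r in requests:
>         prev = outcome.get(r['executor_id'])
>         if prev is None or r['ts'] - prev[0] <= 120:
>             # first sighting, or a second round-trip within 120s
>             # (in which case the later outcome wins)
>             outcome[r['executor_id']] = (r['ts'], r['hits'])
>     return outcome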
>
>>> On the language detection - actually
>>> Kolkus and Rehurek published a work in 2009 that handles small amounts
>>> of text really really well (n-gram based approaches /suck at this/)
>>> and there's a Java implementation I've been playing with. Want me to
>>> run it across some search strings and we can look at the results? Or
>>> just send the code across.
>>
>> If you ask I'd say both! ;)
>>
>> We evaluated this kind of dictionary-based language detection (though
>> not this one specifically); the problem for us was mostly performance:
>> it takes time to tokenize the input string correctly, and the
>> dictionary we used was rather big. But we worked mainly on large
>> content (web news, press articles).
>> In our case the input strings will be very small, so it makes more
>> sense. We should be able to train the dictionary against the "all
>> titles in ns0" dumps, though.
>>
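>> For illustration, this kind of word-relevance scoring might look
>> roughly like this (weights trained from the title dumps; a sketch of
>> the general dictionary method, not the exact published algorithm):
>>
>> def detect_language(query, weights):
>>     # weights: {language: {word: relevance}}, built from the
>>     # "all titles in ns0" dumps
>>     scores = {lang: sum(vocab.get(w, 0.0) for w in query.lower().split())
>>               for lang, vocab in weights.items()}
>>     best = max(scores, key=scores.get)
>>     return best if scores[best] > 0 else None  # None = no confident guess
>>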
>> This is also a great example to explain why I feel stuck sometimes:
>> how will you be able to test it?
>> - I'm not allowed to download search logs locally.
>> - I think I won't be able to install Java and play with this kind of
>> tool on fluorine.
>>
>
> Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
> right? If yes to all three, I don't see a problem with me squirting
> you a sample of logs (and the Java). I figure if we find the
> methodology works we can look at speedups to the code, which is a lot
> easier a task than looking at fast code and trying to improve the
> methodology.
>
>> Another point:
>> concerning the tasks described below, I think this overlaps with
>> analytics work (because it's mainly about learning from search logs).
>> I don't know how you work today; maybe this is something you've
>> already done, or it's obviously wrong.
>> I think you're one of the best people to help us sort this out, so
>> your feedback on the following lines will be greatly appreciated :)
>>
>> Thanks!
>
> Yes! Okay, thoughts on the below:
>
> 1. Build a search log parser - we sort of have that through the
> streaming python script. It depends whether you mean a literal parser
> or something to pick out all the "important" bits. See point 4.
> 2. Big machine: I'd love this. But see point 4.
> 3. Improve search logs for us: when we say improve for us do we mean
> for analytics/improvements purposes? Because if so we've been talking
> about having the logs in HDFS which would make things pretty easy for
> all and sundry and avoid the need for a parser.
>
> One way of neatly handling all of this would be:
>
> 1. Get the logs in a format that has the fields we want and stream it
> into Hadoop. No parser necessary.
> 2. Stick the big-ass machine in the analytics cluster, where it has
> default access to Hadoop and can grab data trivially, but doesn't have
> to break anyone else's stuff.
> 3. Fin.
>
> What am I missing? Other than "setting up a MediaWiki kafka client is
> going to be kind of a bit of work".
>
>>>>
>>>> On 22/07/2015 14:38, David Causse wrote:
>>>>>
>>>>> It's still not very clear in my mind, but things could look like this:
>>>>>
>>>>> * Epic: Build a toolbox to learn from search logs
>>>>> - Create a script to run search queries against the production
>>>>> index
>>>>> - Build a search log parser that provides all the needed details:
>>>>> time, search type, wiki origin, target search index, search query,
>>>>> search query ID, number of results, offset of the results (search
>>>>> page)
>>>>> (side note: Erik, will it be possible to pass the queryID from
>>>>> page to page when the user clicks "next page"?)
>>>>> - Have a decent machine (64GB RAM would be great) in the
>>>>> production cluster where we can
>>>>> - download production search logs
>>>>> - install the tools we want
>>>>> - stress it not being afraid to kill it
>>>>> - do all the stuff we want to learn from data and search logs
>>>>>
>>>>> * Epic: Improve search logs for us
>>>>> - Add an "incognito parameter" to cirrus that could be used by
the
>>>>> toolbox script not to pollute our search logs when running our "search
>>>>> script".
>>>>> - Add a log when the user click on a search result to have a
>>>>> mapping
>>>>> between the queryID, the result choosen and the offset of the chosen
>>>>> link in
>>>>> the result list.
>>>>> - This task is certainly complex and highly depends on the
>>>>> client,
>>>>> I don't know if we will be able to track this down on all clients but
>>>>> it'd
>>>>> be great for us.
>>>>> - More things will be added as we learn
>>>>>
>>>>> * Epic: start to measure and control relevance
>>>>> - Create a corpus of search queries for each wiki with their
>>>>> expected results
>>>>> - Run these queries weekly/monthly and compute the F1-score for
>>>>> each wiki (a rough sketch of the scoring follows this list)
>>>>> - Continuously enhance the search query corpus
>>>>> - Provide a weekly/monthly perf score for each wiki
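>>>>>
>>>>> Something like this, maybe (set-based sketch; assumes we store
>>>>> expected page titles per query):
>>>>>
>>>>> def f1(expected, returned):
>>>>>     hits = len(set(expected) & set(returned))
>>>>>     if hits == 0:
>>>>>         return 0.0
>>>>>     p = hits / len(returned)   # precision
>>>>>     r = hits / len(expected)   # recall
>>>>>     return 2 * p * r / (p + r)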
>>>>>
>>>>> As you can see this is mostly about tools; I propose to start
>>>>> with batch tools and think later about how we could make this
>>>>> more real-time.
>>>>>
>>>>>
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hey all,
So, the data for the Search dashboards
(http://searchdata.wmflabs.org/metrics/) comes from a variety of
sources, one of which is the daily logs of all Cirrus search requests
- about 46GB of data a day. We set up a pipeline over this to report
the "zero rate" - how many queries return zero results. It was a
pretty shaky pipeline, because it was an ultra-urgent,
need-it-for-a-presentation thing.
Good news: my prediction that it needed work was accurate. Bad news:
my prediction that it needed work was accurate ;).
When Erik and I went through all of the scripts and rewrote
them on the 15th we discovered a lot of maintenance tasks that were
being identified as searches. These are now being excluded, but we
have to backfill 1.5 months of data. I've chosen to eliminate the old
data and then backfill, because it means we avoid having data from
multiple, dissonant software versions, and because it just makes the
backfilling task a bit easier.
As a result, the dashboards may look a bit odd over the next couple of
days; they have data from the 15th onwards that we're comfortable with,
and we are gradually backfilling 1 June to 14 July, starting from 1
June. So at the moment we have 1 June and 15-21 July; next it will be
1-2 June and 15-22 July, and so on.
Expect the graphs to get steadily less weird until they're back to
normal (and more consistent and sane-looking than before). Until then:
yeah, they're going to look a bit weird.
Thanks,
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
In our recent (July) team retrospective, we didn't have a chance to review
the action items that came out of our June retrospective. However, I have
posted those previous items, with status updates (as best I know them)[1].
Of the 18 items, 5 are "done", and several others are improved or in
progress.
[1]
https://www.mediawiki.org/wiki/Wikimedia_Search_Team/Retrospective_2015-07-…
That page will also contain our July retrospective notes, after they have
been processed.
Kevin Smith
Agile Coach
Wikimedia Foundation
*Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment. Help us make it a reality.*
I'm having trouble enabling the analytics role on vagrant. Does this mean
anything to anyone?
==> default: Error: Puppet::Parser::AST::Resource failed with error
ArgumentError: Could not find declared class ::cdh::hadoop at
/vagrant/puppet/modules/role/manifests/hadoop.pp:45
on node mediawiki-vagrant.dev
I even tried vagrant destroying, and starting from scratch. It seems like
maybe I need to apt-get install something Hadoop related, but my Google-fu
isn't helping.
We had a meeting today with Giuseppe and Andrew from Ops, and clarified our
path toward getting WDQS deployed in production (as a test service). Here
are the takeaways/action items I'm aware of:
1. We need to specify our hardware needs ASAP
---> I think this means we should unstall
https://phabricator.wikimedia.org/T86561 and assign it to Stas.
2. Most likely the service will run on existing hardware (and ops will want
to deploy it in both data centers)
3. Debian packaging is not required--we'll use maven+archiva+git deploy (?)
4. Andrew can help Stas with archiva (which Stas and Nik have already used)
5. Giuseppe can help Stas with puppet, which should be pretty easy
6. The puppet work should include basic health and performance monitoring
7. Stas will consider using jmx for additional logging
Full notes of the meeting are here:
http://etherpad.wikimedia.org/p/DiscoveryOpsWDQS
Kevin Smith
Agile Coach
Wikimedia Foundation
*Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment. Help us make it a reality.*
If the query returned 0 results and didn't have any syntax in it (no
intitle:foo), should we try _harder_ to get suggestions? I don't know
exactly what changes that would mean, but we can totally implement the
retry if we think it'll help.
The idea is that it might not be performant enough to run super-duper
strong suggester settings all the time, but when there are no results
it's important to have suggestions.
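Something like this, maybe (hypothetical names; run_search stands in
for the real Cirrus call):

import re

SYNTAX = re.compile(r'\w+:|["*~]')  # crude check for intitle:, quotes, etc.

def run_search(query, suggest_profile):
    # placeholder for the real search call; returns (results, suggestion)
    raise NotImplementedError

def search(query):
    results, suggestion = run_search(query, suggest_profile='cheap')
    if not results and not SYNTAX.search(query):
        # no hits and no special syntax: pay for the strong suggester once
        _, suggestion = run_search(query, suggest_profile='strong')
    return results, suggestion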
For reference, only 20% of the zero-result queries I counted this morning
returned a suggestion. I don't know how many asked for one, though.
Hi all,
As a reminder, all[1] of your Discovery-related research and coding work
should be tracked in phabricator. During our Tuesday/Thursday standups,
most of what you talk about should be tasks on one of the "sprint"
workboards. If you are working on a task that isn't in the sprint board,
please a) re-check to be sure that is the highest priority thing you should
be working on, and b) if it is, add it to phabricator and/or to the sprint
board as needed.
When you pick what to work on, try to grab something from near the top of
the sprint's Backlog column, and move it to In Progress. Please use the
Needs Review column as needed, and when the task is really done, move it to
Done.
Each sub-team should be focused on its quarterly goal. Please be sure that
Dan is aware of any work you do outside that. If you have any questions,
check with him, me, or a team lead.
[1] If you do a 15-minute task here or there, it doesn't need to be
tracked in phab. But any substantive work should be. Personally I would
set the threshold at about an hour, but your mileage may vary.
Thanks much!
Kevin Smith
Agile Coach
Wikimedia Foundation
*Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment. Help us make it a reality.*
Let's say, hypothetically, that I wanted to measure information about HTTP
requests coming into the Wikipedia Portal (www.wikipedia.org).
* Do we record this information?
* If so, is it accessible via analytical tools?
* If so, how do I get my mitts on it?
* If not, is it accessible from a database or similar?
Context: https://phabricator.wikimedia.org/T100673
In our neverending march towards progress I've created a phabricator task
<https://phabricator.wikimedia.org/T103598> to upgrade beta to
Elasticsearch 1.6.0. That requires a few things:
* Release our plugins to archiva
* Propose a patch to upgrade to those new versions
* Manually land the patch in beta and sync those versions of the plugins
* On every Elasticsearch node (deployment-elastic0[5678]) download the
elasticsearch 1.6 package, install it, and restart elasticsearch.
It's not a ton of work, but in our effort to get non-Nik people used to
doing Elasticsearch maintenance I'd love for someone else to grab it. In
our effort to upgrade to 1.6 soon, it'd be cool if someone could grab it
in the next few days. We need at least a week of beta testing 1.6.0
before we upgrade production, just to be sure.
So, anyone want to do it? I don't expect you'll need special permissions
that are hard to get, because it's beta. We can grant you whatever
permissions you lack in just a few minutes.
Nik