New subject: [Wikimedia-search] Fwd: Zero search results—how can I help?

22 Jul 2015

      This thread started between a few of us, but has some good ideas and
thoughts. Forwarding into the search mailing list (where we will endeavour
to have these conversations in the future).
Erik B
---------- Forwarded message ----------
From: Oliver Keyes okeyes@wikimedia.org
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse dcausse@wikimedia.org
Cc: Trey Jones tjones@wikimedia.org, Erik Bernhardson <
ebernhardson@wikimedia.org>
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes okeyes@wikimedia.org wrote:
...
On 22 July 2015 at 10:55, David Causse dcausse@wikimedia.org wrote:
...
Le 22/07/2015 15:21, Oliver Keyes a écrit :
...
Thanks; much appreciated. Point 3 directly relates to my work so it's
good to be CCd :).
FWIW, this kind of detail on the specific things we're doing is
missing from the main search mailing list and could be used very much
there to inform people.
I agree, my intent right now is still to learn from each others and
build/use a friendly environment where engineer with NLP background like
Trey can work efficiently. When things will be clearer it'd be great to
share our plan.
...
Oliver is already handling the executor IDs and distinguishing full
and prefix search, so nyah ;p.
Great!
Just to be sure : does this means that a search count will be reduced to
its
...
...
executorID  :

all request with the same executorID return 0 zero result -> add 1 to

the
...
...
zero result counter

if one of the request returns a result -> do not increment the zero

result
...
...
counter
If yes I think this will be the killer patch for Q1 :)
Executor IDs are stored and if a match is found in executor IDs <=120
seconds after that one, the later outcome is considered "the outcome".
If not, we assume no second round-trip was made and so go with
whatever happened first.
So if you make a request and it round-trips once and fails, failure.
Round-trip once and succeeds, success. Round-trip twice and fail both
times, failure. Round-trip twice and fail the first time and succeed
the second - one success, zero failures :). Erik wrote it, and I grok
the logic.
...
...
On the language detection - actually
Kolkus and Rehurek published a work in 2009 that handles small amounts
of text really really well (n-gram based approaches /suck at this/)
and there's a Java implementation I've been playing with. Want me to
run it across some search strings and we can look at the results? Or
just send the code across.
If you ask I'd say both! ;)
We evaluated this kind of dictionary-based language detection (but this
not
...
...
this one specifically), problem for us was mostly due to performance: it
takes time to tokenize the input string correctly and the dictionary we
used
...
...
was rather big. But we worked mainly on large content (webnews, press
articles).
In our case input strings should be very small so it makes more sense. We
should be able to train the dictionary against the "all title in ns0"
dumps
...
...
though.
This is also a great example to explain why I feel stuck sometimes:
How will you be able to test it?

I'm not allowed to download search logs locally.
I think I won't be able to install java and play with this kind of

tools
...
...
on fluorine.
Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
right? If yes to all three, I don't see a problem with me squirting
you a sample of logs (and the Java). I figure if we find the
methodology works we can look at speedups to the code, which is a lot
easier a task than looking at fast code and trying to improve the
methodology.
...
Another point:
concerning the following tasks described below, I think it overlaps
analytics tasks (because it's mainly related to learning from search
logs).
...
...
I don't know how you work today and maybe this is something you've
already
...
...
done or is obviously wrong.
I think you're one of the best person today to help us to sort this out,
so
...
...
your feedback concerning the following lines will be greatly appreciated
:)
...
...
Thanks!
Yes! Okay, thoughts on the below:

Build a search log parser - we sort of have that through the

streaming python script. It depends whether you mean a literal parser
or something to pick out all the "important" bits. See point 4.
2. Big machine: I'd love this. But see point 4.
3. Improve search logs for us: when we say improve for us do we mean
for analytics/improvements purposes? Because if so we've been talking
about having the logs in HDFS which would make things pretty easy for
all and sundry and avoid the need for a parser.
One way of neatly handling all of this would be:

Get the logs in a format that has the fields we want and stream it

into Hadoop. No parser necessary.
2. Stick the big-ass machine in the analytics cluster, where it has
default access to Hadoop and can grab data trivially, but doesn't have
to break anyone else's stuff.
3. Fin.
What am I missing? Other than "setting up a MediaWiki kafka client is
going to be kind of a bit of work".
...
...
...
Le 22/07/2015 14:38, David Causse a écrit :
...
It's still not very clear in my mind but things could look like :

Epic: Build a toolbox to learn from search logs
Create a script to run search queries against the production

index
     - Build search logs parser that provide all the needed details :
time,
search type, wiki origin, target search index, search query, search
query
ID, number of results, offset of the results (search page)
         (side note : Erik will it be possible to pass the queryID
from
...
...
...
...
...
page to page when user clicks "next page"?)
     - Have a descent machine (64g RAM would be great) in the
production
...
...
...
...
...
cluster where we can
         - download production search logs
         - install the tools we want
         - stress it not being afraid to kill it
         - do all the stuff we want to learn from data and search logs

Epic: Improve search logs for us
Add an "incognito parameter" to cirrus that could be used by

the
...
...
...
...
...
toolbox script not to pollute our search logs when running our "search
script".
     - Add a log when the user click on a search result to have a
mapping
between the queryID, the result choosen and the offset of the chosen
link in
the result list.
         - This task is certainly complex and highly depends on the
client,
I don't know if we will be able to track this down on all clients but
it'd
be great for us.
     - More things will be added as we learn

Epic: start to measure and control relevance
Create a corpus of search queries for each wiki with their

expected
results
     - Run these queries weekly/monthly and compute the F1-Score for
each
wiki
     - Continuously enhance the search queries corpus
     - Provide a weekly/monthly perf score for each wiki
As you can see this is mostly about tools, I propose to start with
batch
...
...
...
...
...
tools and think later of how we could make this more real-time.
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
--
Oliver Keyes
Research Analyst
Wikimedia Foundation