Thank you Trey!
These are all excellent ideas and I just added my 2 cents inline :)
On 22/07/2015 21:54, Trey Jones wrote:
Hey Wikimedia-search!
I’m Trey Jones, and I’m new to WMF (this is only my third week), and
I started this thread, though David really got it going.
There’s lots to digest here, and I’m sure I’ll retread certain ground
already covered, but below are my initial thoughts. Let me know if you
think any of these notes should end up in a wiki or Phab ticket
somewhere—I'm still trying to grok where to best document things. (And
think about everyone's comments, too, and whether they should be
copied elsewhere—it’s always a shame to lose track of good ideas.)
You're right; I think there are some Phab tickets where you can put
the ideas you described here.
=Meta stuff=
Sorry this message is so long. I didn’t have time to write a short
one. (Alas, this is my greatest weakness, but at least I can admit it.)
I’ve tried to label ideas that could use some additional discussion
with (L)etters at the beginning of the first relevant paragraph.
=Results from other wikis=
I agree with the general consensus that n-grams aren’t great for
language detection on short strings. A quick skim of literature
related to Oliver’s cite (Kolkus and Rehurek 2009) points to Naive
Bayes as a good method on short strings.
I did notice that the slides attached to the old Cybozu lang-detect
project home page mention that short strings are a problem—but the
slides are from 2010. David also mentioned that in his comments on
T104505. Is Cybozu lang-detect still a contender? Has anyone had a
chance to run either the latest version or the ES plugin on anything?
I've never used Cybozu inside the Elasticsearch plugin (but I can
confirm that it works poorly on small texts like tweets), and I don't
know if it's still a contender. But consider this citation (extracted
from http://ceur-ws.org/Vol-1228/tweetlid-1-gamallo.pdf):
"This is in accordance with Rehurek and Kolkus (2009), who tried to
prove that dictionary-based methods are more reliable than
character-based systems for language identification with noisy short
texts among similar languages."
My understanding is that the method used by Kolkus and Rehurek is
dictionary-based (word unigrams), and that it will outperform Cybozu
(char n-gram based) on small texts. I think that's true if the text is
tweet-like, with short phrases, but it may not work properly for
names. This certainly deserves some testing on real data.
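For concreteness, here's a minimal sketch of the dictionary-based
(word-unigram) Naive Bayes approach; all the word counts below are
invented toy numbers, and real models would be built from wiki dumps
or query logs:

```python
import math
from collections import Counter

# Toy word-unigram "dictionaries"; the counts are made up for
# illustration, not taken from any real corpus.
MODELS = {
    "en": Counter({"the": 50, "of": 30, "search": 5, "results": 5}),
    "fr": Counter({"le": 40, "de": 35, "recherche": 5, "résultats": 5}),
}

def detect(query, models=MODELS, alpha=0.1):
    """Naive Bayes over word unigrams with add-alpha smoothing."""
    words = query.lower().split()
    best_lang, best_score = None, float("-inf")
    for lang, counts in models.items():
        total = sum(counts.values())
        vocab = len(counts)
        # Sum of smoothed log-probabilities of each query word.
        score = sum(
            math.log((counts[w] + alpha) / (total + alpha * vocab))
            for w in words
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

With single-word queries (like many names) the decision rests on one
lookup, which is where I'd expect this to get shaky.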
(A) I like the idea of running a cross-wiki test, though I can think
of a couple more ways to analyze the results than listed in T104505. I
assume there are plenty of repeats in the top-N “no-results” queries,
and probably a Zipf/power law distribution. (I’m very curious to see
what the distribution actually looks like. What’s the max frequency /
percentage over a day for a given zero-results query?)
So, it would make sense to me to track not only raw numbers, but also
weighted numbers if the distribution in the top-N is very unequal.[1]
And of course, the “zero result” decrease should be weighted. It might
also make sense to look at the distribution of “zero result decrease”
by number of additional wikis searched. For example, what if all 234
results from the French wiki for English queries (in David’s example
table in T104505) are subsumed by the 324 German wiki results. Is it
still worth searching in French?
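As a sketch of the raw-vs-weighted distinction (the query strings,
counts, and outcomes below are all invented):

```python
from collections import Counter

# Hypothetical sample of zero-result queries, one entry per search;
# real data would come from the query logs.
zero_result_queries = ["foo", "foo", "foo", "foo", "bar", "baz"]

# Hypothetical outcome of re-running each distinct query elsewhere.
now_has_results = {"foo": True, "bar": False, "baz": True}

counts = Counter(zero_result_queries)
total_searches = sum(counts.values())

# Raw decrease: fraction of *distinct* queries that now get results.
raw_decrease = sum(now_has_results[q] for q in counts) / len(counts)

# Weighted decrease: fraction of *searches* that now get results,
# which matters more when the head of the distribution is very heavy
# (Zipf-like).
weighted_decrease = sum(
    c for q, c in counts.items() if now_has_results[q]
) / total_searches
```

Here fixing the one frequent query moves the weighted number (5/6)
well past the raw one (2/3), which is exactly the gap worth tracking.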
Yes, you're right, I hadn't thought about that, and it's hard to
tell... I guess it will depend on the idea you described below related
to interwiki links.
This raises another question as we add more fall-back methods to
decrease the zero-result rate: how will we prioritize the fall-back
methods? I mean, if I can re-run a "Did you mean" query, and if I know
that running the original query against another wiki has a good chance
of giving results, which one should I try first?
[1] Caveat: it wouldn’t hurt to review the very top queries in any
sample by hand to look for trending topics that could skew the results
over a small time period. During the Women’s World Cup, I bet there
were more searches for names of various players, for example, than
there normally would be.
I think it's worth running this test regularly and seeing how the results change.
On the other hand—I read French much better than I read German—so I’d
prefer French results even if all the French results are duplicates of
the German results. Are results in a language I can’t read really any
better than no results?
This leads to a few new (to me) ideas:
(B) Make multilingual results configurable—If we know, say, the top
four wikis likely to give good results for queries from the English
wiki are Spanish, French, German, and Japanese, we could have an
expanding section (excuse any UI ugliness—someone with UI smarts can
help us figure out how to make it pretty, right?) to enable
multilingual searching, so on
English Wikipedia I could ask for “back up results” in Spanish and
French, but not German and Japanese. Store those settings in a cookie
for later, too, possibly with some UI indicator that multilingual
backup results are enabled. (Also, if the cookie is available at query
time, we could save unnecessary cross-wiki searches the user couldn’t
possibly use.)
Maybe there are sensible defaults per language?
(C) And/or, multilingual results could be an extra click—“we didn’t
find English wiki results, but we found results that match your query
in Spanish and German, would you like to see them?” with links on
“Spanish” and “German”. I’d click the Spanish link, not the German link.
(D) Another sneakier idea that came to mind—which may not be
technically plausible—would be to find good results in another
language and then check for links back to wiki articles in the wiki
the search came from. I do this manually when I find something Google
translate can’t handle in a confidence-inspiring way: I search on
Russian or Arabic Wikipedia, then look on the nav bar for the
“English” link. There are lots of options here—showing just the
English results with a link back to the language it went through, or
showing summaries for both, etc.
A silly example: search for “Виллальверния” in en wiki gives no
results. But there is a ru wiki page with that exact title. It has a
link to the English wiki page for “Villalvernia”. (Don’t ask why
someone is searching for the Russian name of a tiny Italian commune on
the English Wikipedia. The answer is “because multilingualism”.)
Search: Виллальверния
Results: Villalvernia (crosswiki link from *Виллальверния*)
I don't know if it's technically plausible, but AFAIK we have the
Wikibase ID in the index, so it should be pretty simple to extract it.
Interwiki links are stored in Wikidata; could we use WDQS for that
purpose? With the entity ID it should be easy to request the interwiki
link for a specific language. Is WDQS designed for this usage (a high
number of queries/sec on rather simple queries)?
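If WDQS turns out to be suitable, the lookup itself should be a small
SPARQL query. A rough Python sketch, assuming the schema:about /
schema:isPartOf pattern Wikidata uses to model sitelinks (whether WDQS
can sustain the query rate is the open question):

```python
# Given a Wikibase entity ID from our index, build a SPARQL query
# asking WDQS for that entity's sitelink on one specific wiki.
def sitelink_query(entity_id, lang):
    """Return a SPARQL query string for the interwiki (sitelink)
    target of `entity_id` on the `lang` Wikipedia."""
    return (
        "SELECT ?article WHERE {\n"
        f"  ?article schema:about wd:{entity_id} ;\n"
        f"           schema:isPartOf <https://{lang}.wikipedia.org/> .\n"
        "}"
    )
```

The alternative, if the Wikibase ID is already in our index, might be
to index the sitelinks themselves and skip the extra service call.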
(E) Another simpler idea than language detection would be basic
character set detection. A query in Cyrillic might get better results
from the Russian, Ukrainian, and Bulgarian wikis than the French and
German ones, even if French and German do better overall. Similarly
Arabic script and perhaps the Arabic, Persian, and Urdu wikis.
This might also be a reason why decent language detection is okay if
it is computationally much cheaper than excellent detection—we don’t
have to commit to “the one true answer”; maybe we could search the top
two or three other wikis.
Yes, I think Cybozu can help here to do what you describe, and it
would be relatively "cheap".
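A character-set heuristic along these lines can be very cheap indeed.
Here's a rough Python sketch using Unicode character names; the
script-to-wiki mapping is an illustrative guess, not a tuned choice:

```python
import unicodedata

# Illustrative guess at which wikis to try first for each script;
# the right lists would come out of the cross-wiki test data.
SCRIPT_TO_WIKIS = {
    "CYRILLIC": ["ru", "uk", "bg"],
    "ARABIC": ["ar", "fa", "ur"],
}

def candidate_wikis(query):
    """Pick fallback wikis from the script of the query's letters.
    Unicode character names start with the script name, e.g.
    "CYRILLIC SMALL LETTER A", so the first word is a cheap proxy."""
    for ch in query:
        if ch.isalpha():
            script = unicodedata.name(ch, "?").split()[0]
            if script in SCRIPT_TO_WIKIS:
                return SCRIPT_TO_WIKIS[script]
    return []
```

An empty result (e.g. for Latin-script queries) would mean falling
back to whatever the real language detector says.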
=Misspellings=
(F) I had a good chat with Erik earlier this afternoon, and I just
mentioned his “saerch” example that’s in T104468. Having recently
looked at the ES suggester docs at David’s suggestion, I asked Erik
about the prefix length… he was able to quickly find that it’s set to
2, so only words that start with the two letters “sa” could ever be
suggested. As Erik suggested in T104468, this would be a great
less-performant option to try if we get no results (or crappy
results)—we could loosen the params, for example going back to
prefix=1. For zero results, this may make sense—but the old suggestion
Erik noted, /saeqeh,/ and the current one, /samech,/ both seem kinda
unlikely—we could probably quantify that, esp. with some user feedback.
And we should definitely look at the various params and decide what
are reasonable settings for “cheap and good” and what’s “more
expensive but better”.
Reducing the prefix length to 1 character can hurt performance, and
it's certainly a good idea to do this in two passes, as Erik suggested.
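To make the two-pass idea concrete, here's a sketch of the
term-suggester request body we might send on each pass; the field name
"title" and the exact numbers are placeholders to tune, with
prefix_length and min_doc_freq being the real suggester options from
the ES docs:

```python
def suggest_body(text, cheap=True):
    """Build an ES term-suggester request body. With prefix_length=2,
    only candidates sharing the first two letters ("sa" for "saerch")
    can ever be suggested; the looser second pass drops it to 1."""
    return {
        "suggest": {
            "spelling": {
                "text": text,
                "term": {
                    "field": "title",
                    "prefix_length": 2 if cheap else 1,
                    # Drop very rare index terms (often typos already
                    # present in the wiki) from the candidate pool.
                    "min_doc_freq": 2,
                },
            }
        }
    }
```

The cheap body would be the default; the expensive one would only run
when the first pass comes back with zero (or crappy) results.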
While working on prefixes I tried to analyze data from a Simple
English Wikipedia dump and extracted the distribution of term
frequency by prefix length. I haven't managed to make good use of the
data yet, but I'm sure you will :)
I described a way to analyze the content we have in the index here:
https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
It's still on a very small dataset but if you find it useful maybe we
could try on a larger one?
David’s idea of a spelling dictionary makes sense, in that it limits
the scope of possibilities to compare against. But it probably won’t
handle names, or, probably, technical terms (e.g., “phonestheme”—or,
in hard mode, its plural).
It would be interesting to see the results of dropping the long tail
from what ES considers a match—min_doc_freq (
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-suggeste…
) would help with that.
(How concerned are we with finding spelling errors in the wiki based
on a properly spelled search term? I used to hunt for and correct
commonly misspelled words in en wiki as a hobby.)
My point here is (in the long term): maybe it's difficult to build
good suggestions from the data directly, so why not build a custom
dictionary/index to handle "Did you mean" suggestions? According to
https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from
search queries to build these suggestions. Is this something worth trying?
=Misc=
(G) Another interesting question: if we end up implementing several
options for improving search results, we will have to figure out how
to stage them and in what order to try/test them.
And of course almost all of these will make more sense once we've
looked at some query data. That's my next task—to get access myself
and start trying to decide what seems most likely to have most impact.
Okay… I’m running out of steam a little, so I’m going to wrap it up
for now. I’ll think more about David’s comments on the three Epics and
maybe some other replies later.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
[removed the old message because it was too big]