This thread started between a few of us, but it contains some good ideas and thoughts. Forwarding it to the search mailing list (where we will endeavour to have these conversations in the future).

Erik B
---------- Forwarded message ----------
From: Oliver Keyes <okeyes@wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse@wikimedia.org>
Cc: Trey Jones <tjones@wikimedia.org>, Erik Bernhardson <ebernhardson@wikimedia.org>


Whoops; I guess point 4 is the second list ;p.

On 22 July 2015 at 11:30, Oliver Keyes <okeyes@wikimedia.org> wrote:
> On 22 July 2015 at 10:55, David Causse <dcausse@wikimedia.org> wrote:
>> On 22/07/2015 15:21, Oliver Keyes wrote:
>>>
>>> Thanks; much appreciated. Point 3 directly relates to my work so it's
>>> good to be CCd :).
>>>
>>> FWIW, this kind of detail on the specific things we're doing is
>>> missing from the main search mailing list and would very much help
>>> inform people there.
>>
>>
>> I agree; my intent right now is still to learn from each other and to
>> build/use a friendly environment where engineers with an NLP background,
>> like Trey, can work efficiently. When things are clearer it'd be great to
>> share our plan.
>>
>>>
>>> Oliver is already handling the executor IDs and distinguishing full
>>> and prefix search, so nyah ;p.
>>
>> Great!
>>
>> Just to be sure: does this mean that a search count will be reduced to its
>> executorID, i.e.:
>> - all requests with the same executorID return zero results -> add 1 to the
>> zero result counter
>> - one of the requests returns a result -> do not increment the zero result
>> counter
>> If so, I think this will be the killer patch for Q1 :)
>>
>
> Executor IDs are stored, and if a matching executor ID shows up <=120
> seconds after the first request, the later outcome is considered "the
> outcome". If not, we assume no second round-trip was made and go with
> whatever happened first.
>
> So if you make a request and it round-trips once and fails, failure.
> Round-trip once and succeeds, success. Round-trip twice and fail both
> times, failure. Round-trip twice and fail the first time and succeed
> the second - one success, zero failures :). Erik wrote it, and I grok
> the logic.
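> 
> (To make that concrete, here's a minimal sketch of the deduplication,
> assuming each log event carries an executor ID, a timestamp in seconds,
> and a zero-results flag; the field names are illustrative and this is not
> Erik's actual code:)
> 
> def count_zero_results(events, window=120):
>     """One outcome per executor ID: if a second round-trip with the same
>     ID shows up within `window` seconds, its outcome supersedes the first."""
>     outcome = {}  # executor_id -> (first_ts, zero_results)
>     for ev in sorted(events, key=lambda e: e["ts"]):
>         eid = ev["executor_id"]
>         if eid not in outcome:
>             outcome[eid] = (ev["ts"], ev["zero_results"])
>         elif ev["ts"] - outcome[eid][0] <= window:
>             # later outcome within the window wins
>             outcome[eid] = (outcome[eid][0], ev["zero_results"])
>     return sum(1 for _, zero in outcome.values() if zero)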
>
>>> On the language detection - actually,
>>> Kolkus and Rehurek published an approach in 2009 that handles small
>>> amounts of text really, really well (n-gram based approaches /suck at
>>> this/) and there's a Java implementation I've been playing with. Want me
>>> to run it across some search strings so we can look at the results? Or
>>> just send the code across.
>>
>> If you ask I'd say both! ;)
>>
>> We evaluated this kind of dictionary-based language detection (though not
>> this one specifically); the problem for us was mostly performance: it
>> takes time to tokenize the input string correctly and the dictionary we used
>> was rather big. But we worked mainly on large content (web news, press
>> articles).
>> In our case input strings will be very short, so it makes more sense. We
>> should be able to train the dictionary against the "all titles in ns0" dumps
>> though.
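>>
>> (As a rough illustration of the dictionary-based idea, assuming one
>> word-frequency dictionary per language, e.g. trained from the "all titles
>> in ns0" dumps; this is a sketch, not the implementation we evaluated:)
>>
>> import re
>>
>> def detect_language(query, dictionaries):
>>     """Pick the language whose dictionary best covers the query's tokens.
>>     `dictionaries` maps language code -> {word: relative frequency}."""
>>     tokens = re.findall(r"\w+", query.lower(), re.UNICODE)
>>     best_lang, best_score = None, 0.0
>>     for lang, freq in dictionaries.items():
>>         score = sum(freq.get(t, 0.0) for t in tokens)
>>         if score > best_score:
>>             best_lang, best_score = lang, score
>>     return best_lang  # None if nothing matched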
>>
>> This is also a great example of why I feel stuck sometimes:
>> how will you be able to test it?
>> - I'm not allowed to download search logs locally.
>> - I don't think I'll be able to install Java and play with this kind of tool
>> on fluorine.
>>
>
> Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
> right? If yes to all three, I don't see a problem with me squirting
> you a sample of logs (and the Java). I figure if we find the
> methodology works we can look at speedups to the code, which is a lot
> easier a task than looking at fast code and trying to improve the
> methodology.
>
>> Another point:
>> concerning the tasks described below, I think they overlap with
>> analytics tasks (because they're mainly about learning from search logs).
>> I don't know how you work today; maybe this is something you've already
>> done, or it's obviously wrong.
>> I think you're one of the best people to help us sort this out, so
>> your feedback on the following lines will be greatly appreciated :)
>>
>> Thanks!
>
> Yes! Okay, thoughts on the below:
>
> 1. Build a search log parser - we sort of have that through the
> streaming Python script. It depends on whether you mean a literal parser
> or something to pick out all the "important" bits (hypothetical sketch
> below). See point 4.
> 2. Big machine: I'd love this. But see point 4.
> 3. Improve search logs for us: when we say "improve for us", do we mean
> for analytics/improvement purposes? Because if so, we've been talking
> about having the logs in HDFS, which would make things pretty easy for
> all and sundry and avoid the need for a parser.
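> 
> (For what it's worth, a hypothetical sketch of the "pick out the important
> bits" version, assuming tab-separated log lines carrying the fields David
> lists below; the actual log format on fluorine may well differ:)
> 
> import csv
> import sys
> from collections import namedtuple
> 
> # Hypothetical field layout; the real log format may differ.
> SearchEvent = namedtuple(
>     "SearchEvent",
>     ["ts", "search_type", "wiki", "index", "query", "query_id", "hits", "offset"],
> )
> 
> def parse_log(stream):
>     """Yield one SearchEvent per tab-separated line, skipping malformed lines."""
>     for row in csv.reader(stream, delimiter="\t"):
>         if len(row) != len(SearchEvent._fields):
>             continue
>         yield SearchEvent(*row)
> 
> if __name__ == "__main__":
>     zero = sum(1 for ev in parse_log(sys.stdin) if ev.hits == "0")
>     print("zero-result queries:", zero)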
>
> One way of neatly handling all of this would be:
>
> 1. Get the logs into a format that has the fields we want and stream them
> into Hadoop. No parser necessary. (A hypothetical illustration of such an
> event is below.)
> 2. Stick the big-ass machine in the analytics cluster, where it has
> default access to Hadoop and can grab data trivially, but doesn't risk
> breaking anyone else's stuff.
> 3. Fin.
>
> What am I missing? Other than "setting up a MediaWiki Kafka client is
> going to be a fair bit of work".
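> 
> (Purely as an illustration of "a format that has the fields we want": a
> hypothetical structured event pushed to Kafka, using the kafka-python
> client just for the sketch - the real producer would live in MediaWiki -
> and the topic name, broker address, and field names are all assumptions:)
> 
> import json
> from kafka import KafkaProducer  # pip install kafka-python
> 
> producer = KafkaProducer(
>     bootstrap_servers="localhost:9092",
>     value_serializer=lambda v: json.dumps(v).encode("utf-8"),
> )
> 
> # Hypothetical per-request search event; not the actual schema.
> event = {
>     "ts": "2015-07-22T15:21:00Z",
>     "wiki": "enwiki",
>     "search_type": "full_text",
>     "query": "some user query",
>     "query_id": "abc123",
>     "hits": 0,
>     "offset": 0,
> }
> producer.send("cirrussearch-requests", event)
> producer.flush()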
>
>>>>
>>>> On 22/07/2015 14:38, David Causse wrote:
>>>>>
>>>>> It's still not very clear in my mind, but things could look like:
>>>>>
>>>>> * Epic: Build a toolbox to learn from search logs
>>>>>      - Create a script to run search queries against the production index
>>>>>      - Build a search log parser that provides all the needed details:
>>>>>        time, search type, wiki origin, target search index, search query,
>>>>>        search query ID, number of results, offset of the results (search page)
>>>>>          (side note: Erik, will it be possible to pass the queryID from
>>>>>          page to page when the user clicks "next page"?)
>>>>>      - Have a decent machine (64 GB of RAM would be great) in the production
>>>>>        cluster where we can
>>>>>          - download production search logs
>>>>>          - install the tools we want
>>>>>          - stress it without being afraid of killing it
>>>>>          - do all the stuff we want to learn from data and search logs
>>>>>
>>>>> * Epic: Improve search logs for us
>>>>>      - Add an "incognito parameter" to Cirrus that could be used by the
>>>>>        toolbox script so it does not pollute our search logs when running
>>>>>        our "search script".
>>>>>      - Add a log entry when the user clicks on a search result, giving a
>>>>>        mapping between the queryID, the chosen result and the offset of
>>>>>        the chosen link in the result list.
>>>>>          - This task is certainly complex and highly depends on the client;
>>>>>            I don't know if we will be able to track this down on all
>>>>>            clients, but it'd be great for us.
>>>>>      - More things will be added as we learn
>>>>>
>>>>> * Epic: Start to measure and control relevance
>>>>>      - Create a corpus of search queries for each wiki with their
>>>>>        expected results
>>>>>      - Run these queries weekly/monthly and compute the F1-Score for
>>>>>        each wiki (a rough sketch of that computation is below)
>>>>>      - Continuously enhance the search query corpus
>>>>>      - Provide a weekly/monthly performance score for each wiki
>>>>>
>>>>> As you can see this is mostly about tools; I propose to start with batch
>>>>> tools and think later about how we could make this more real-time.
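>>>>>
>>>>> (A rough sketch of the F1-Score computation per wiki, assuming each corpus
>>>>> entry pairs a query with its set of expected page titles and we compare
>>>>> that against what the search actually returned; names are illustrative:)
>>>>>
>>>>> def f1_for_query(expected, returned):
>>>>>     """F1 = harmonic mean of precision and recall for one query;
>>>>>     `expected` and `returned` are sets of page titles."""
>>>>>     expected, returned = set(expected), set(returned)
>>>>>     if not expected or not returned:
>>>>>         return 0.0
>>>>>     hits = len(expected & returned)
>>>>>     precision = hits / len(returned)
>>>>>     recall = hits / len(expected)
>>>>>     if precision + recall == 0:
>>>>>         return 0.0
>>>>>     return 2 * precision * recall / (precision + recall)
>>>>>
>>>>> def f1_for_wiki(corpus, run_query):
>>>>>     """Average F1 over a corpus of (query, expected_titles) pairs;
>>>>>     `run_query` runs a query against the production index and returns titles."""
>>>>>     scores = [f1_for_query(expected, run_query(q)) for q, expected in corpus]
>>>>>     return sum(scores) / len(scores) if scores else 0.0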
>>>>>
>>>>>
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation



--
Oliver Keyes
Research Analyst
Wikimedia Foundation