This thread started between a few of us, but has some good ideas and
thoughts. Forwarding into the search mailing list (where we will endeavour
to have these conversations in the future).
Erik B
---------- Forwarded message ----------
From: Oliver Keyes <okeyes(a)wikimedia.org>
Date: Wed, Jul 22, 2015 at 8:31 AM
Subject: Re: Zero search results—how can I help?
To: David Causse <dcausse(a)wikimedia.org>
Cc: Trey Jones <tjones(a)wikimedia.org>, Erik Bernhardson <
ebernhardson(a)wikimedia.org>
Whoops; I guess point 4 is the second list ;p.
On 22 July 2015 at 11:30, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
On 22 July 2015 at 10:55, David Causse
<dcausse(a)wikimedia.org> wrote:
> On 22/07/2015 15:21, Oliver Keyes wrote:
>>
>> Thanks; much appreciated. Point 3 directly relates to my work so it's
>> good to be CCd :).
>>
>> FWIW, this kind of detail on the specific things we're doing is
>> missing from the main search mailing list and could very much be
>> used there to inform people.
>
>
> I agree; my intent right now is still to learn from each other and
> build/use a friendly environment where an engineer with an NLP background
> like Trey can work efficiently. When things are clearer it'd be great to
> share our plan.
>
>>
>> Oliver is already handling the executor IDs and distinguishing full
>> and prefix search, so nyah ;p.
>
> Great!
>
> Just to be sure: does this mean that a search count will be reduced to its
> executorID:
> - all requests with the same executorID return zero results -> add 1 to
> the zero result counter
> - if one of the requests returns a result -> do not increment the zero
> result counter
> If yes I think this will be the killer patch for Q1 :)
Executor IDs are stored, and if a matching executor ID shows up <=120
seconds after that one, the later outcome is considered "the outcome".
If not, we assume no second round-trip was made and so go with
whatever happened first.
So if you make a request and it round-trips once and fails, failure.
Round-trips once and succeeds, success. Round-trips twice and fails both
times, failure. Round-trips twice, failing the first time and succeeding
the second - one success, zero failures :). Erik wrote it, and I grok
the logic.
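In rough Python, the way I read it (just my sketch of the idea, not
Erik's actual patch; the event fields are invented):

    # Sketch: collapse requests sharing an executor ID into one outcome.
    # If a later request with the same executor ID lands within 120 seconds
    # of the first one, its outcome wins; otherwise the first outcome stands.

    def collapse_outcomes(events):
        """events: dicts with 'executor_id', 'timestamp' (s), 'num_results'."""
        first_seen = {}
        outcomes = {}
        for e in sorted(events, key=lambda e: e['timestamp']):
            eid = e['executor_id']
            outcome = 'success' if e['num_results'] > 0 else 'failure'
            if eid not in first_seen:
                first_seen[eid] = e['timestamp']
                outcomes[eid] = outcome
            elif e['timestamp'] - first_seen[eid] <= 120:
                outcomes[eid] = outcome  # later outcome within the window wins
        return outcomes

So two failed round-trips count once as a failure, and a failure rescued
by a success within the window counts once as a success.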
>> On the language detection - Kolkus and Rehurek actually published an
>> approach in 2009 that handles small amounts of text really really well
>> (n-gram based approaches /suck at this/)
>> and there's a Java implementation I've been playing with. Want me to
>> run it across some search strings and we can look at the results? Or
>> just send the code across.
>
> If you ask I'd say both! ;)
>
> We evaluated this kind of dictionary-based language detection (but not
> this one specifically); the problem for us was mostly performance: it
> takes time to tokenize the input string correctly and the dictionary we
> used was rather big. But we worked mainly on large content (web news,
> press articles).
> In our case input strings should be very small, so it makes more sense. We
> should be able to train the dictionary against the "all titles in ns0"
> dumps though.
>
> This is also a great example to explain why I feel stuck sometimes:
> How will you be able to test it?
> - I'm not allowed to download search logs locally.
> - I think I won't be able to install Java and play with this kind of
> tool on fluorine.
Ahh, but! You're NDAd, your laptop is a work laptop, and you have FDE,
right? If yes to all three, I don't see a problem with me squirting
you a sample of logs (and the Java). I figure if we find the
methodology works we can look at speedups to the code, which is a lot
easier a task than looking at fast code and trying to improve the
methodology.
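For what it's worth, on short strings the dictionary-based idea looks
roughly like this (a toy sketch only, not the Kolkus and Rehurek
algorithm or the Java library; training on the ns0 title dumps is just
the idea from your mail):

    # Toy dictionary-based detector: score each language by the share of
    # query tokens found in a word set built from its title dump.
    import re

    def build_dictionaries(titles_by_lang):
        """titles_by_lang: {'en': iterable of titles, ...} -> word sets."""
        return {lang: {w.lower() for t in titles for w in re.findall(r"\w+", t)}
                for lang, titles in titles_by_lang.items()}

    def detect(query, dictionaries):
        tokens = [w.lower() for w in re.findall(r"\w+", query)]
        if not tokens:
            return None
        scores = {lang: sum(tok in words for tok in tokens) / len(tokens)
                  for lang, words in dictionaries.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    dicts = build_dictionaries({'en': ['Zero results rate'],
                                'fr': ['Taux de résultats nuls']})
    print(detect('résultats de recherche', dicts))  # -> 'fr'

On query-length input the tokenisation cost you mention should be much
less of an issue, as you say.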
> Another point:
> Concerning the tasks described below, I think they overlap with
> analytics tasks (because it's mainly related to learning from search
> logs).
> I don't know how you work today, and maybe this is something you've
> already done or is obviously wrong.
> I think you're one of the best people today to help us sort this out, so
> your feedback concerning the following lines will be greatly
> appreciated :)
Thanks!
Yes! Okay, thoughts on the below:
1. Build a search log parser - we sort of have that through the
streaming python script. It depends whether you mean a literal parser
or something to pick out all the "important" bits. See point 4.
2. Big machine: I'd love this. But see point 4.
3. Improve search logs for us: when we say "improve for us" do we mean
for analytics/improvement purposes? Because if so, we've been talking
about having the logs in HDFS, which would make things pretty easy for
all and sundry and avoid the need for a parser.
One way of neatly handling all of this would be:
1. Get the logs in a format that has the fields we want and stream it
into Hadoop. No parser necessary.
2. Stick the big-ass machine in the analytics cluster, where it has
default access to Hadoop and can grab data trivially, but doesn't have
to break anyone else's stuff.
3. Fin.
What am I missing? Other than "setting up a MediaWiki kafka client is
going to be kind of a bit of work".
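To make point 1 concrete, the sort of structured event I mean would look
something like this (Python just for illustration; the real producer
would live in CirrusSearch, and the topic and field names are made up):

    # Illustration: a structured search-log event going straight to Kafka,
    # so the Hadoop side never needs a log parser. Topic and field names
    # are invented for the example.
    import json
    import time
    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )

    def log_search(wiki, search_type, query, query_id, num_results, offset):
        event = {
            'timestamp': int(time.time()),
            'wiki': wiki,                # e.g. 'enwiki'
            'search_type': search_type,  # 'full_text' or 'prefix'
            'query': query,
            'query_id': query_id,
            'num_results': num_results,
            'offset': offset,            # which result page was requested
        }
        producer.send('cirrussearch-request', event)

    log_search('enwiki', 'full_text', 'zero results example', 'abc123', 0, 0)
    producer.flush()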
>>>
>>> On 22/07/2015 14:38, David Causse wrote:
>>>>
>>>> It's still not very clear in my mind, but things could look like:
>>>>
>>>> * Epic: Build a toolbox to learn from search logs
>>>>     - Create a script to run search queries against the production
>>>> index
>>>>     - Build a search log parser that provides all the needed details:
>>>> time, search type, wiki origin, target search index, search query,
>>>> search query ID, number of results, offset of the results (search page)
>>>>       (side note: Erik, will it be possible to pass the queryID from
>>>> page to page when the user clicks "next page"?)
>>>>     - Have a decent machine (64 GB of RAM would be great) in the
>>>> production cluster where we can
>>>>         - download production search logs
>>>>         - install the tools we want
>>>>         - stress it without being afraid to kill it
>>>>         - do all the stuff we want to learn from data and search logs
>>>>
>>>> * Epic: Improve search logs for us
>>>>     - Add an "incognito" parameter to Cirrus that could be used by the
>>>> toolbox script so as not to pollute our search logs when running our
>>>> "search script".
>>>>     - Add a log entry when the user clicks on a search result so we
>>>> have a mapping between the queryID, the chosen result and the offset
>>>> of the chosen link in the result list.
>>>>         - This task is certainly complex and highly depends on the
>>>> client; I don't know if we will be able to track this down on all
>>>> clients, but it'd be great for us.
>>>>     - More things will be added as we learn
>>>>
>>>> * Epic: Start to measure and control relevance
>>>>     - Create a corpus of search queries for each wiki with their
>>>> expected results
>>>>     - Run these queries weekly/monthly and compute the F1 score for
>>>> each wiki
>>>>     - Continuously enhance the search query corpus
>>>>     - Provide a weekly/monthly performance score for each wiki
>>>>
>>>> As you can see this is mostly about tools; I propose to start with
>>>> batch tools and think later about how we could make this more real-time.
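One more thought on the relevance epic, since my numbered points above
didn't cover it: the F1 computation is mechanical once the corpus
exists. A toy sketch (corpus shape and function names entirely made up):

    # Toy per-wiki F1 over a labelled query corpus: compare returned titles
    # with expected titles per query, average precision/recall, combine.

    def f1_for_corpus(corpus, run_query):
        """corpus: list of (query, set_of_expected_titles).
        run_query: function returning a list of result titles."""
        precisions, recalls = [], []
        for query, expected in corpus:
            results = set(run_query(query))
            hits = len(results & expected)
            precisions.append(hits / len(results) if results else 0.0)
            recalls.append(hits / len(expected) if expected else 0.0)
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        return 2 * p * r / (p + r) if (p + r) else 0.0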
--
Oliver Keyes
Research Analyst
Wikimedia Foundation