Recently I've been doing some investigation into how we can collect enough
data to plausibly train an ML model for search re-ranking. As with all ML
training, the labeled dataset to train against is a critical piece. Many
approaches seem to use human-labeled relevance judgements, and we have a
platform for collecting this data which has proven to have decent
predictive capability in offline tests of changes to our search. But the
volume of data necessary for training ML models simply isn't there.
In my research I've come across the paper "A Dynamic Bayesian Network Click
Model for Web Search Ranking"[1] and a related open source
implementation[2]. Machine generation of relevance labels seems promising,
because I can collect a reasonable amount of information about
clickthroughs and the search results that were shown to users.
For one week of enwiki traffic I have ~20k queries that were each issued by
more than 10 identities (roughly, distinct search sessions). This breaks
down as:

- ~135k distinct (query, identity) pairs
- ~140k distinct (query, identity, clicked page id) pairs
- ~414k distinct (query, result page id) pairs
- ~3M results shown to users (~20 per page) that could be converted into
  relevance judgements

I'm not sure which of these to train the final model on, though: the 414k
distinct (query, result_page_id) pairs, or the full 3M impressions, which
duplicate pairs from the 414k each time the same (query, result_page_id)
pair was shown.
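To make the label-generation idea concrete, here is a minimal sketch of the
"simplified" DBN estimator (the special case where the perseverance
parameter gamma = 1), which has a closed-form solution rather than needing
the EM procedure from the paper. The session tuple format and function name
are my own invention, not from the clickmodels library:

```python
# Sketch of simplified-DBN (gamma = 1) relevance estimation.
# relevance(q, u) = a_u * s_u, where a_u is attractiveness (probability of
# a click given examination) and s_u is satisfaction (probability the click
# ended the session). Under gamma = 1, every result at or above the last
# clicked position is assumed examined.
from collections import defaultdict

def sdbn_relevance(sessions):
    """sessions: iterable of (query, ranked_page_ids, clicked_page_ids)."""
    shown = defaultdict(int)         # times (query, page) was examined
    clicked = defaultdict(int)       # times (query, page) was clicked
    last_click = defaultdict(int)    # times (query, page) was the last click

    for query, results, clicks in sessions:
        if not clicks:
            continue  # no-click sessions carry no signal when gamma = 1
        last_pos = max(results.index(c) for c in clicks)
        for page in results[: last_pos + 1]:
            shown[(query, page)] += 1
            if page in clicks:
                clicked[(query, page)] += 1
        last_click[(query, results[last_pos])] += 1

    labels = {}
    for key, n in shown.items():
        a = clicked[key] / n                                   # a_u
        s = last_click[key] / clicked[key] if clicked[key] else 0.0  # s_u
        labels[key] = a * s
    return labels

# Toy example: one session where the second result was clicked.
labels = sdbn_relevance([("q", ["a", "b", "c"], {"b"})])
```

In the toy session, result "b" gets label 1.0 (always clicked, always the
last click) while "a" gets 0.0; "c" is below the last click and is never
counted as examined.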
I was also curious about a part in the appendix of the paper, labeled
"Confidence". It states:

    Remember that the latent variables a_u and s_u will later be used as
    targets for learning a ranking function. It is thus important to know
    the confidence associated with these values.
Why is it important to know the confidence, and how does that play into
training a model? This is probably basic ML stuff but I'm new to all of
this.
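From what I understand, one common way confidence enters training is as
per-example sample weights: a label estimated from thousands of impressions
should pull on the loss harder than one estimated from a handful. Here is a
sketch under that assumption, using impression counts as a stand-in for
confidence and a closed-form weighted least-squares fit (the function and
data are illustrative, not from the paper):

```python
# Sketch: fold label confidence into training as sample weights, so the
# model pays less attention to noisy low-traffic estimates.
def weighted_linear_fit(xs, ys, weights):
    """Closed-form weighted least squares for y ~ slope * x + intercept."""
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    slope = cov / var
    return slope, my - slope * mx

# One feature, four labels; the last label is an outlier backed by only
# 2 impressions, so its weight is tiny and it barely moves the fit.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 10.0]
weights = [100, 120, 90, 2]   # impressions behind each label
slope, intercept = weighted_linear_fit(xs, ys, weights)
```

With uniform weights the same data gives a slope of 2.8; with
impression-based weights the slope stays close to the 1.0 trend of the
high-confidence points.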
And finally, are there better ways of generating relevance labels from
clickthrough data, ideally with open source implementations? This is just
something I happened to stumble upon in my research, and it's certainly not
the only thing out there.
[1] http://www2009.eprints.org/1/1/p1.pdf
[2] https://github.com/varepsilon/clickmodels