We seem to have some consensus that for the upcoming learning-to-rank work
we will build out a python library to handle the bulk of the backend data
plumbing work. The library will primarily be code integrating with pyspark,
covering pieces such as:
# Sampling from the click logs to generate the set of queries + pages that
will be labeled with click models
# Distributing the work of running click models against those sampled data
sets
# Pushing queries we use for feature generation into kafka, and reading
back the resulting feature vectors (the other end of this will run those
generated queries against either the hot-spare elasticsearch cluster or the
relforge cluster to get feature scores)
# Merging feature vectors with labeled data, splitting into
test/train/validate sets, and writing out files formatted for whichever
training library we decide on (xgboost, lightgbm and ranklib are in the
running currently)
# Whatever plumbing is necessary to run the actual model training and do
hyperparameter optimization
# Converting the resulting models into a format suitable for use with the
elasticsearch learn to rank plugin
# Reporting on the quality of models vs some baseline
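To make the merge/split/write step a bit more concrete, here is a minimal sketch in plain python rather than pyspark, so it can be read without a cluster. Everything in it is an assumption for illustration: the function names, the (query, page_id) join key, the 80/10/10 split, and the choice of a ranklib-style text format (`<label> qid:<n> <feature>:<value> ... # <info>`) as the output. The one design point it tries to show is that the split is decided per query, not per row, so all pages for a query end up in the same set.

```python
import hashlib
from collections import defaultdict


def assign_split(query, train=0.8, test=0.1):
    """Deterministically map a query to 'train', 'test' or 'validate'.

    Hashing the query string (rather than sampling rows) keeps every
    page for a given query inside the same split. The 80/10/10 default
    is an illustrative assumption.
    """
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0
    if bucket < train:
        return "train"
    if bucket < train + test:
        return "test"
    return "validate"


def merge_and_split(labels, features):
    """Join labels and feature vectors on (query, page_id), then split.

    labels:   {(query, page_id): relevance_label}
    features: {(query, page_id): [feature values]}
    Pairs missing from either side are dropped (an inner join).
    """
    splits = defaultdict(list)
    # Sorting keeps rows for the same query adjacent within each split.
    for key in sorted(labels.keys() & features.keys()):
        query, page_id = key
        splits[assign_split(query)].append(
            (labels[key], query, page_id, features[key]))
    return splits


def to_ranklib_lines(rows):
    """Format rows as '<label> qid:<n> 1:<v> 2:<v> ... # <page_id>'."""
    qids = {}
    lines = []
    for label, query, page_id, vector in rows:
        # Number queries in order of first appearance, starting at 1.
        qid = qids.setdefault(query, len(qids) + 1)
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(vector, start=1))
        lines.append(f"{label} qid:{qid} {feats} # {page_id}")
    return lines
```

In the real library the join and the split would presumably be DataFrame operations, and the writer would depend on which training library we settle on; the sketch only pins down the shape of the data moving between the steps.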
The high-level goal is that we would have relatively simple python scripts
in our analytics repository that are called from oozie. Those scripts would
know the appropriate locations to load/store data and pass them into this
library for the bulk of the processing. There will also be some script,
probably within the library, that combines many of these steps for feature
engineering purposes: it takes some set of features and runs the whole
thing.
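As a rough illustration of that division of labor, the sketch below shows a thin script that only knows paths and hands them to library code, which then chains the steps. All of it is hypothetical: the step functions are stand-ins that just build strings, the paths are placeholders, and `run_pipeline` is not an existing API, only one possible shape for the "run the whole thing" script.

```python
def run_pipeline(steps, context):
    """Run (name, function) steps in order, threading a dict through.

    Each step receives the current context and returns a dict of new
    entries, so later steps can read what earlier steps produced.
    Returns the final context and the list of completed step names.
    """
    completed = []
    for name, step in steps:
        context = {**context, **step(context)}
        completed.append(name)
    return context, completed


# Hypothetical stand-ins for library functions; the real ones would
# run pyspark jobs instead of formatting strings.
def sample_queries(ctx):
    return {"samples": f"sampled from {ctx['click_log_path']}"}


def label_with_click_model(ctx):
    return {"labels": f"labels for ({ctx['samples']})"}


STEPS = [
    ("sample", sample_queries),
    ("label", label_with_click_model),
]


def main():
    # The calling script owns the locations (placeholder paths here);
    # the library owns the processing.
    context = {
        "click_log_path": "/placeholder/click_logs",
        "output_path": "/placeholder/ltr_output",
    }
    return run_pipeline(STEPS, context)
```

Keeping the paths in the caller means the same library code could be driven by oozie in production or by a one-off script during feature engineering.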
So, what do we call this thing? Horrible first attempts:
* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???