We seem to have some consensus that for the upcoming learning to rank work we will build out a Python library to handle the bulk of the backend data plumbing. The library will primarily be code integrating with pyspark to do various pieces such as:

# Sampling from the click logs to generate the set of queries and pages that will be labeled with click models
# Distributing the work of running click models against those sampled data sets
# Pushing the queries we use for feature generation into Kafka, and reading back the resulting feature vectors (the other end of this will run those generated queries against either the hot-spare Elasticsearch cluster or the relforge cluster to get feature scores)
# Merging feature vectors with labeled data, splitting into test/train/validate sets, and writing out files formatted for whichever training library we decide on (xgboost, lightgbm and ranklib are currently in the running)
# Whatever plumbing is necessary to run the actual model training and do hyperparameter optimization
# Converting the resulting models into a format suitable for use with the Elasticsearch learn to rank plugin
# Reporting on the quality of models vs some baseline

The high level goal is to have relatively simple Python scripts in our analytics repository that are called from Oozie. Those scripts would know the appropriate locations to load and store data, and would pass into this library for the bulk of the processing. There will also be some script, probably within the library, that combines many of these steps for feature engineering purposes: given some set of features, it runs the whole thing.

So, what do we call this thing? Horrible first attempts:

* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???
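
To make the merge-and-split step from the list above a bit more concrete, here is a rough sketch in plain Python (no pyspark, so it runs anywhere). Everything in it is an assumption for illustration: the function names `assign_split` and `merge_and_split`, the `(query, page_id)` keys, and the 80/10/10 ratios are all hypothetical, not a proposed API. The one idea it tries to show is splitting by query rather than by row, so that all pages for a given query land in the same set.

```python
import hashlib
from collections import defaultdict


def assign_split(query, train=0.8, test=0.1):
    """Deterministically map a query string to 'train', 'test' or 'validate'.

    Hypothetical helper: hashing the query (instead of sampling per row)
    keeps every page of one query in the same split.
    """
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0
    if bucket < train:
        return "train"
    if bucket < train + test:
        return "test"
    return "validate"


def merge_and_split(labels, features):
    """Join labels with feature vectors and group the rows by split.

    labels:   {(query, page_id): relevance label from a click model}
    features: {(query, page_id): [feature scores]}
    Both shapes are assumptions made for this sketch.
    """
    splits = defaultdict(list)
    for (query, page_id), label in labels.items():
        vector = features.get((query, page_id))
        if vector is None:
            # No feature vector came back for this pair; drop it.
            continue
        splits[assign_split(query)].append((query, page_id, label, vector))
    return dict(splits)
```

In the real library this would presumably be a pyspark join followed by a split keyed on the query, and the rows would then be written out in whatever format the chosen training library expects.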
discovery mailing list