We seem to have some consensus that for the upcoming learning to rank work we will build out a python library to handle the bulk of the backend data plumbing work. The library will primarily be code integrating with pyspark to do various pieces such as:
1. Sampling from the click logs to generate the set of queries + pages that will be labeled with click models
2. Distributing the work of running click models against those sampled data sets
3. Pushing queries we use for feature generation into kafka, and reading back the resulting feature vectors (the other end of this will run those generated queries against either the hot-spare elasticsearch cluster or the relforge cluster to get feature scores)
4. Merging feature vectors with labeled data, splitting into test/train/validate sets, and writing out files formatted for whichever training library we decide on (xgboost, lightgbm and ranklib are in the running currently)
5. Whatever plumbing is necessary to run the actual model training and do hyperparameter optimization
6. Converting the resulting models into a format suitable for use with the elasticsearch learn to rank plugin
7. Reporting on the quality of models vs some baseline
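To make the merge/split/write piece a bit more concrete, here is a minimal sketch in plain python (no pyspark), not a proposal for the real implementation. Everything named here is an assumption for illustration: the function names, the (query, page_id) join key, the 80/10/10 split, and the choice of the "label qid:N 1:v 2:v" text format that ranklib reads. Splitting by a hash of the query keeps all results for one query in the same set.

```python
import hashlib
from collections import defaultdict


def assign_split(query, train=0.8, test=0.1):
    """Deterministically map a query to train/test/validate by hashing it."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000 / 1000.0
    if bucket < train:
        return "train"
    if bucket < train + test:
        return "test"
    return "validate"


def merge_and_split(labels, features):
    """Join labels and feature vectors on (query, page_id), grouped by split.

    labels:   {(query, page_id): relevance_label}   (hypothetical shape)
    features: {(query, page_id): [feature values]}  (hypothetical shape)
    """
    splits = defaultdict(list)
    # Sorting keeps rows for the same query next to each other.
    for key, label in sorted(labels.items()):
        if key not in features:
            continue  # no feature vector came back for this pair
        query, _page_id = key
        splits[assign_split(query)].append((query, label, features[key]))
    return splits


def to_ranking_lines(rows):
    """Format rows as '<label> qid:<n> 1:<v> 2:<v> ...' lines."""
    qids = {}
    lines = []
    for query, label, vector in rows:
        qid = qids.setdefault(query, len(qids) + 1)
        feats = " ".join("%d:%g" % (i, v) for i, v in enumerate(vector, 1))
        lines.append("%d qid:%d %s" % (label, qid, feats))
    return lines
```

In the real library the join and split would presumably be done on spark dataframes rather than dicts; the point of the sketch is only the per-query split and the output format.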
The high-level goal is to have relatively simple python scripts in our analytics repository that are called from oozie. Those scripts would know the appropriate locations to load and store data, and would pass them into this library for the bulk of the processing. There will also be some script, probably within the library, that combines many of these steps for feature engineering purposes: it takes some set of features and runs the whole thing.
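As a rough illustration of what one of those thin scripts might look like, here is a sketch under heavy assumptions. None of this exists yet: the flag names, `parse_args`, `run_pipeline` and the idea of passing the library steps in as a list of (name, callable) pairs are all made up for the example. The script only knows paths and options, and hands everything else to the library.

```python
import argparse


def parse_args(argv):
    """Parse the locations and options the wrapper script is responsible for."""
    parser = argparse.ArgumentParser(
        description="Thin wrapper: knows where data lives, delegates the rest")
    parser.add_argument("--click-logs", required=True,
                        help="where to read the sampled click logs from")
    parser.add_argument("--output-dir", required=True,
                        help="where to write training files and models")
    parser.add_argument("--features", nargs="+", default=[],
                        help="feature names to evaluate in this run")
    return parser.parse_args(argv)


def run_pipeline(args, steps):
    """Run each (name, callable) step in order, feeding it the previous result.

    Each step is a hypothetical library function taking (data, args).
    """
    result = args.click_logs
    completed = []
    for name, step in steps:
        result = step(result, args)
        completed.append(name)
    return result, completed
```

An oozie action would then call something like `run_pipeline(parse_args(sys.argv[1:]), steps)`, with `steps` built from whatever functions the library ends up exposing.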
So, what do we call this thing? Horrible first attempts:
* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???
Link to code?
"ltr" means "left to right" to me. Maybe you could do something like "ltrank"
On Wed, Apr 5, 2017 at 2:28 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
We seem to have some consensus that for the upcoming learning to rank work we will build out a python library to handle the bulk of the backend data plumbing work. The library will primarily be code integrating with pyspark to do various pieces such as:
1. Sampling from the click logs to generate the set of queries + pages that will be labeled with click models
2. Distributing the work of running click models against those sampled data sets
3. Pushing queries we use for feature generation into kafka, and reading back the resulting feature vectors (the other end of this will run those generated queries against either the hot-spare elasticsearch cluster or the relforge cluster to get feature scores)
4. Merging feature vectors with labeled data, splitting into test/train/validate sets, and writing out files formatted for whichever training library we decide on (xgboost, lightgbm and ranklib are in the running currently)
5. Whatever plumbing is necessary to run the actual model training and do hyperparameter optimization
6. Converting the resulting models into a format suitable for use with the elasticsearch learn to rank plugin
7. Reporting on the quality of models vs some baseline
The high-level goal is to have relatively simple python scripts in our analytics repository that are called from oozie. Those scripts would know the appropriate locations to load and store data, and would pass them into this library for the bulk of the processing. There will also be some script, probably within the library, that combines many of these steps for feature engineering purposes: it takes some set of features and runs the whole thing.
So, what do we call this thing? Horrible first attempts:
- ltr-pipeline
- learn-to-rank-pipeline
- bob
- cirrussearch-ltr
- ???
AI mailing list AI@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/ai
On Wed, Apr 5, 2017 at 12:55 PM, Aaron Halfaker aaron.halfaker@gmail.com wrote:
Link to code?
No code yet, although there is proof of concept code which will
inform this work at stat1002.eqiad.wmnet:/a/ebernhardson/spark_feature_log/code
"ltr" means "left to right" to me. Maybe you could do something like "ltrank"
Sounds like LTR is out as the term is already used elsewhere and is more
widely known. LTRank isn't a bad compromise with spelling out the whole thing.