On Wed, Apr 5, 2017 at 12:55 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
wrote:
Link to code?
No code yet, although there is proof-of-concept code that will inform this
work at
stat1002.eqiad.wmnet:/a/ebernhardson/spark_feature_log/code
"ltr" means "left to right" to me.
Maybe you could do something like "ltrank"
Sounds like LTR is out, as the term is already used elsewhere and is more
widely known. LTRank isn't a bad compromise compared with spelling out the
whole thing.
On Wed, Apr 5, 2017 at 2:28 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
We seem to have some consensus that for the upcoming learning to rank
work we will build out a python library to handle the bulk of the backend
data plumbing work. The library will primarily be code integrating with
pyspark to do various pieces such as:
# Sampling from the click logs to generate the set of queries + pages
that will be labeled with click models
# Distributing the work of running click models against those sampled
data sets
# Pushing queries we use for feature generation into kafka, and reading
back the resulting feature vectors (the other end of this will run those
generated queries against either the hot-spare elasticsearch cluster or the
relforge cluster to get feature scores)
# Merging feature vectors with labeled data, splitting into
test/train/validate sets, and writing out files formatted for whichever
training library we decide on (xgboost, lightgbm and ranklib are in the
running currently)
# Whatever plumbing is necessary to run the actual model training and do
hyperparameter optimization
# Converting the resulting models into a format suitable for use with the
elasticsearch learn to rank plugin
# Reporting on the quality of models vs some baseline
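As a rough illustration of the "merge, split, and write out" step in the list above, here is a minimal sketch in plain Python rather than pyspark. Everything in it is an assumption made for illustration: the function names, the dict-based inputs, the 80/10/10 split, and the choice to split by query hash. The output lines follow the LibSVM-style ranking format that RankLib reads (label, qid, then numbered features); other training libraries may need a different layout.

```python
import hashlib
from collections import defaultdict


def assign_split(query, train=0.8, validate=0.1):
    """Deterministically map a query string to 'train', 'validate' or 'test'.

    Hashing the query (rather than each row) keeps every page for a given
    query in the same split, so no query appears in more than one split.
    """
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0
    if bucket < train:
        return "train"
    if bucket < train + validate:
        return "validate"
    return "test"


def merge_and_format(labels, features):
    """Join labels with feature vectors and emit LibSVM-style ranking lines.

    labels:   dict mapping (query, page_id) -> relevance label
    features: dict mapping (query, page_id) -> list of feature values
    Returns a dict mapping split name -> list of formatted lines.
    """
    qids = {}
    out = defaultdict(list)
    for key in sorted(labels):
        if key not in features:
            continue  # no feature vector came back for this pair
        query, page_id = key
        # Assign each distinct query a small integer id on first sight.
        qid = qids.setdefault(query, len(qids) + 1)
        feats = " ".join(
            "%d:%s" % (i, v) for i, v in enumerate(features[key], start=1)
        )
        line = "%s qid:%d %s # %s" % (labels[key], qid, feats, page_id)
        out[assign_split(query)].append(line)
    return dict(out)
```

Splitting on a hash of the query instead of sampling rows at random is one way to keep the split reproducible across runs; whether that is the right choice here is an open question.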
The high level goal is that we would have relatively simple python
scripts in our analytics repository that are called from oozie. Those
scripts would know the appropriate locations to load/store data and pass
them into this library for the bulk of the processing. There will also be
some script, probably within the library, that combines many of these steps
for feature engineering purposes, taking some set of features and running
the whole thing.
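To make the "simple script that knows the locations and hands off to the library" idea concrete, here is a hedged sketch. The argument names, the `run_steps` helper, and the placeholder steps are all hypothetical; nothing here reflects an actual library API, since the library does not exist yet.

```python
import argparse


def parse_args(argv=None):
    """Parse the data locations that the calling job would pass in."""
    parser = argparse.ArgumentParser(description="Hypothetical oozie-invoked wrapper")
    parser.add_argument("--input", required=True, help="where to load click logs from")
    parser.add_argument("--output", required=True, help="where to store the results")
    return parser.parse_args(argv)


def run_steps(steps, data):
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step(data)
    return data


def main(argv=None):
    args = parse_args(argv)
    # Placeholder steps; a real script would call into the library here.
    steps = [
        lambda d: dict(d, sampled=True),
        lambda d: dict(d, labeled=True),
    ]
    return run_steps(steps, {"input": args.input, "output": args.output})
```

Calling `main(["--input", "some/input/path", "--output", "some/output/path"])` runs the placeholder chain; the combined feature engineering script would presumably pass a longer list of steps to the same kind of helper.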
So, what do we call this thing? Horrible first attempts:
* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???
_______________________________________________
AI mailing list
AI(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai