I think the PII impact in releasing a dataset w/ only numerical feature
vectors is extremely low.
The privacy impact is greater, but having the original query would be
useful for folks wanting to create their own query-level and
query-dependent features. You do have a great set of features listed
<https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
action, and release what's possible currently, letting folks play with the
dataset.
I'd recommend having a groupId which is unique for each instance of a user
running a query. This is used to group together all of the results in a
viewed SERP, and allows the ranking function to worry only about rank order
instead of absolute scoring; aka, the scoring only matters relative to the
other viewed documents.
I'd try out LightGBM & XGBoost in their ranking modes for creating a model.
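To make the groupId idea concrete, here's a minimal sketch of turning
per-row groupIds into the group-size array that the LightGBM and XGBoost
rankers consume (LightGBM via `lgb.Dataset(..., group=...)` with the
`lambdarank` objective, XGBoost via `DMatrix.set_group(...)`). The names
`group_sizes` and `ids` are illustrative, not from the actual dataset:

```python
# Each row of the dataset is one (query, result) pair; groupId ties
# together all results shown on one viewed SERP, so the ranker only
# learns relative order within a group, not absolute scores.
from itertools import groupby

def group_sizes(group_ids):
    """Collapse per-row groupIds into a list of group sizes.

    This is the format LightGBM and XGBoost ranking modes expect.
    Assumes rows belonging to the same SERP are contiguous, which is
    how those libraries want the training data laid out anyway.
    """
    return [sum(1 for _ in rows) for _, rows in groupby(group_ids)]

# Hypothetical example: three SERPs with 3, 2, and 4 results each.
ids = ["q1", "q1", "q1", "q2", "q2", "q3", "q3", "q3", "q3"]
print(group_sizes(ids))  # -> [3, 2, 4]
```

The key design point is that nothing about the absolute score survives:
the loss inside lambdarank-style objectives only compares documents that
share a group.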
--justin
On Thu, Dec 22, 2016 at 4:00 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
gh it with 100 normalized queries to get a count, and there are 4852
features. Lots of them are probably useless, but choosing which ones is
probably half the battle. These are ~230MB in pickle format, which stores
the floats in binary. This can then be compressed to ~20MB with gzip, so
the data size isn't particularly insane. In a released dataset I would
probably use 10k normalized queries, meaning about 100x this size. Could
plausibly release as CSVs instead of pickled numpy arrays. That will
probably increase the data size further,
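The ~230MB-to-~20MB gzip ratio Erik mentions is plausible because
feature matrices like this tend to be sparse-ish (lots of repeated
zeros), which gzip handles well. A minimal sketch with a synthetic
matrix, purely to illustrate the effect (the sizes and the 90%-zero
density here are made up, not measured from the real data):

```python
# Fake feature matrix: 100 rows x 50 features, roughly 90% zeros,
# standing in for sparse query/document features. Pickle stores the
# floats in binary; gzip then collapses the repeated zero entries.
import gzip
import pickle
import random

random.seed(0)
rows = [
    [random.random() if random.random() < 0.1 else 0.0 for _ in range(50)]
    for _ in range(100)
]

raw = pickle.dumps(rows)
compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # compressed is much smaller
```

A text format like CSV trades that binary compactness for portability,
which is why releasing as CSV would likely grow the files.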