I think the PII impact in releasing a dataset w/ only numerical feature
vectors is extremely low.
The privacy impact is greater, but having the original query would be
useful for folks wanting to create their own query-level and
query-dependent features. You do have a great set of features listed
<https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
action, and release what's possible currently, letting folks play with the
dataset.
I'd recommend having a groupId which is unique for each instance of a user
running a query. This is used to group together all of the results in a
viewed SERP, and allows the ranking function to worry only about rank order
instead of absolute scoring; aka, the scoring only matters relative to the
other viewed documents.
I'd try out LightGBM & XGBoost in their ranking modes for creating a model.
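To make the groupId idea concrete, here's a minimal sketch of turning
per-row groupIds into the group-size array that the LightGBM and XGBoost
rankers consume (LightGBM via `lgb.Dataset(..., group=...)` with the
`lambdarank` objective, XGBoost via `DMatrix.set_group(...)`). The names
`group_sizes` and `ids` are illustrative, not from the actual dataset:

```python
# Each row of the dataset is one (query, result) pair; groupId ties
# together all results shown on one viewed SERP, so the ranker only
# learns relative order within a group, not absolute scores.
from itertools import groupby

def group_sizes(group_ids):
    """Collapse per-row groupIds into a list of group sizes.

    This is the format LightGBM and XGBoost ranking modes expect.
    Assumes rows belonging to the same SERP are contiguous, which is
    how those libraries want the training data laid out anyway.
    """
    return [sum(1 for _ in rows) for _, rows in groupby(group_ids)]

# Hypothetical example: three SERPs with 3, 2, and 4 results each.
ids = ["q1", "q1", "q1", "q2", "q2", "q3", "q3", "q3", "q3"]
print(group_sizes(ids))  # -> [3, 2, 4]
```

The key design point is that nothing about the absolute score survives:
the loss inside lambdarank-style objectives only compares documents that
share a group.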
--justin
On Thu, Dec 22, 2016 at 4:00 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
gh it with 100 normalized queries to get a count, and there are 4852
features. Lots of them are probably useless, but choosing which ones is
probably half the battle. These are ~230MB in pickle format, which stores
the floats in binary. This can then be compressed to ~20MB with gzip, so
the data size isn't particularly insane. In a released dataset I would
probably use 10k normalized queries, meaning about 100x this size. Could
plausibly release as CSVs instead of pickled numpy arrays. That will
probably increase the data size further,
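The ~230MB-to-~20MB gzip ratio Erik mentions is plausible because
feature matrices like this tend to be sparse-ish (lots of repeated
zeros), which gzip handles well. A minimal sketch with a synthetic
matrix, purely to illustrate the effect (the sizes and the 90%-zero
density here are made up, not measured from the real data):

```python
# Fake feature matrix: 100 rows x 50 features, roughly 90% zeros,
# standing in for sparse query/document features. Pickle stores the
# floats in binary; gzip then collapses the repeated zero entries.
import gzip
import pickle
import random

random.seed(0)
rows = [
    [random.random() if random.random() < 0.1 else 0.0 for _ in range(50)]
    for _ in range(100)
]

raw = pickle.dumps(rows)
compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # compressed is much smaller
```

A text format like CSV trades that binary compactness for portability,
which is why releasing as CSV would likely grow the files.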