I think the PII impact of releasing a dataset with only numerical feature vectors is extremely low.
Including the original query raises the privacy impact, but it would be useful for folks wanting to create their own query-level and query-dependent features. You do have a great set of features listed at https://phabricator.wikimedia.org/P4677. As always, I'd bias for action: release what's currently possible and let folks play with the dataset.
I'd recommend having a groupId which is unique for each instance of a user running a query. This is used to group together all of the results in a viewed SERP, and allows the ranking function to worry only about rank order instead of absolute scoring; that is, a document's score only matters relative to the other documents viewed in the same SERP.
I'd try out LightGBM & XGBoost in their ranking modes for creating a model.
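To make the groupId idea concrete, here's a rough sketch in plain Python. The rows, group ids, and relevance labels are made up for illustration, and the helper names are my own. It shows the two things the groupId buys you: per-group sizes (as far as I know, this is the shape LightGBM's LGBMRanker.fit takes via its `group` argument and XGBoost takes via DMatrix.set_group), and preference pairs that are only ever formed within a single SERP.

```python
from itertools import groupby

# Hypothetical rows: (group_id, relevance_label, feature_vector).
# Rows are assumed to be sorted so that each group is contiguous.
rows = [
    ("q1-u1", 2, [0.9, 0.1]),
    ("q1-u1", 0, [0.2, 0.4]),
    ("q1-u1", 1, [0.5, 0.3]),
    ("q1-u2", 1, [0.7, 0.2]),
    ("q1-u2", 0, [0.1, 0.8]),
]


def group_sizes(rows):
    """Return the number of rows in each contiguous group, in order."""
    return [sum(1 for _ in members)
            for _, members in groupby(rows, key=lambda r: r[0])]


def preference_pairs(rows):
    """Return (better_idx, worse_idx) pairs, never crossing a group boundary."""
    pairs = []
    start = 0
    for size in group_sizes(rows):
        indices = range(start, start + size)
        for i in indices:
            for j in indices:
                # Only relative order inside one viewed SERP matters.
                if rows[i][1] > rows[j][1]:
                    pairs.append((i, j))
        start += size
    return pairs


sizes = group_sizes(rows)
pairs = preference_pairs(rows)
print(sizes)
print(pairs)
```

Note that the same query run by two different users ends up in two different groups, so a label of 1 in one SERP is never compared against a label of 1 in another.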
--justin
On Thu, Dec 22, 2016 at 4:00 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
gh it with 100 normalized queries to get a count, and there are 4852 features. Lots of them are probably useless, but choosing which ones is probably half the battle. These are ~230MB in pickle format, which stores the floats in binary. This can then be compressed to ~20MB with gzip, so the data size isn't particularly insane. In a released dataset I would probably use 10k normalized queries, meaning about 100x this size. Could plausibly release as CSVs instead of pickled numpy arrays. That will probably increase the data size further,