tl;dr: Can feature vectors describing the relevance of (query, page_id) pairs be released to the public if the final dataset represents queries only by numeric IDs?
Over the past 2 months I've been spending free time investigating machine learning for ranking. One of the earlier things I tried, to get some semblance of proof that it could improve our search results, was to port a set of text-ranking features from an open source Kaggle competitor to a dataset I could create from our own data. For relevance targets I took queries that had clicks from at least 50 unique sessions over a 60-day period and ran them through a click model (DBN). Perhaps not as useful as human judgements, but I'm working with what I have available.
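For anyone not familiar with DBN, the simplified variant (the gamma = 1 case from Chapelle & Zhang's paper) boils down to roughly the sketch below. The session format, helper names, and thresholds here are illustrative; this is not the actual pipeline I ran:

# Rough sketch of the simplified DBN click model (gamma = 1); illustrative,
# not the production pipeline. sessions: (query, [(page_id, clicked), ...])
# with results in rank order.
from collections import defaultdict

def dbn_relevance(sessions, min_sessions=50):
    stats = defaultdict(lambda: {'examined': 0, 'clicked': 0, 'last': 0, 'seen': 0})
    for query, results in sessions:
        clicked_ranks = [i for i, (_, c) in enumerate(results) if c]
        if not clicked_ranks:
            continue  # the simplified model learns nothing from click-free sessions
        last = max(clicked_ranks)
        for rank, (page_id, clicked) in enumerate(results):
            s = stats[(query, page_id)]
            s['seen'] += 1
            if rank <= last:
                s['examined'] += 1  # gamma=1: everything up to the last click was examined
            if clicked:
                s['clicked'] += 1
                if rank == last:
                    s['last'] += 1  # last click taken as evidence of satisfaction
    labels = {}
    for key, s in stats.items():
        if s['seen'] < min_sessions or not s['clicked']:
            continue
        attractiveness = s['clicked'] / s['examined']
        satisfaction = s['last'] / s['clicked']
        labels[key] = attractiveness * satisfaction  # P(relevant) under DBN
    return labels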
This actually showed some promise, and I've been moving further along. Along the way an idea was suggested to me: release the feature vectors from my initial investigation in an open format that might be useful to others. Each feature vector is for a (query, hit_page_id) pair that was displayed to at least 50 users.
I don't have my original data, but I have all the code, and I just ran through it with 100 normalized queries to get a count: there are 4852 features. Lots of them are probably useless, but choosing which ones is probably half the battle. These come to ~230MB in pickle format, which stores the floats in binary; that can then be compressed to ~20MB with gzip, so the data size isn't particularly insane. In a released dataset I would probably use 10k normalized queries, meaning about 100x this size. We could plausibly release CSVs instead of pickled numpy arrays. That would probably increase the data size further, but since we are only talking ~2GB after compression it could go either way.
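For the curious, both serialization options are only a couple of lines of numpy; the array shape below is made up for illustration:

# Illustrative comparison of the two release formats discussed above.
import gzip, pickle
import numpy as np

features = np.random.rand(1000, 4852)  # (rows, features); shape made up

# Binary floats via pickle + gzip: compact and fast to reload in python.
with gzip.open('features.pkl.gz', 'wb') as f:
    pickle.dump(features, f)

# CSV alternative: tool-agnostic, but decimal text compresses worse than
# binary. numpy gzips automatically for filenames ending in .gz.
np.savetxt('features.csv.gz', features, delimiter=',', fmt='%.6g')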
The list of feature names is in https://phabricator.wikimedia.org/P4677. A few example feature names and their meanings, which hopefully are enough to understand the rest of the feature names:
DiceDistance_Bigram_max_norm_query_x_outgoing_link_1D.pkl - dice distance of bigrams in the normalized (stemmed) query string versus outgoing links. Outgoing links are an array field, so the dice distance is calculated per item and this feature takes the max value.
DigitCount_query_1D.pkl - number of digits in the raw user query.
ES_TFIDF_Unigram_Top50_CosineSim_norm_query_category.plain_termvec_x_category.plain_termvec_1D.pkl - cosine similarity of the top 50 terms, as reported by the elasticsearch termvectors API, of the normalized query vs the category.plain field of the matching document. More terms would perhaps have been nice, but doing this all offline in python made that a bit of a time+space tradeoff.
Ident_Log10_score_mean_query_opening_text_termvec_1D.pkl - log base 10 of the score from the elasticsearch termvectors API on the raw user query applied to the opening_text field analysis chain.
LongestMatchSize_mean_query_x_heading_1D.pkl - mean longest match, in number of characters, of the query vs the list of headings for the page.
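To make the naming scheme concrete, here is a simplified sketch of how a feature like the first one can be computed. Character bigrams are an assumption on my part here; the actual feature code differs in its details:

# Simplified sketch of DiceDistance_Bigram_max_norm_query_x_outgoing_link;
# character bigrams are an assumption.
def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def dice_distance(a, b):
    """1 - Dice coefficient over the two bigram sets."""
    A, B = bigrams(a), bigrams(b)
    if not A and not B:
        return 0.0
    return 1.0 - 2.0 * len(A & B) / (len(A) + len(B))

def dice_distance_bigram_max(norm_query, outgoing_links):
    # outgoing_link is an array field: compute per item, keep the max
    return max((dice_distance(norm_query, link) for link in outgoing_links),
               default=1.0)  # no links at all: treat as maximally distant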
The main question here, I think, is whether this is still PII. The exact queries would be normalized into IDs and not released. We could leave the page_id in or out of the dataset; with it left in, people using the dataset could plausibly come up with their own query-independent features to add. With a large enough feature vector for a (query_id, page_id) pair the query could theoretically be reverse engineered, but from a more practical side I'm not sure that's really a valid concern.
Thoughts? Concerns? Questions?
I think the PII impact in releasing a dataset w/ only numerical feature vectors is extremely low.
The privacy impact is greater, but having the original query would be useful for folks wanting to create their own query-level and query-dependent features. You do have a great set of features listed at https://phabricator.wikimedia.org/P4677. As always, I'd bias for action and release what's possible currently, letting folks play with the dataset.
I'd recommend having a groupId which is unique for each instance of a user running a query. This is used to group together all of the results in a viewed SERP, and allows the ranking function to worry only about rank order instead of absolute scoring; i.e., the scoring only matters relative to the other viewed documents.
I'd try out LightGBM & XGBoost in their ranking modes for creating a model.
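A minimal sketch of that setup with LightGBM's lambdarank objective (XGBoost's rank:ndcg is analogous); shapes, labels, and group sizes below are invented:

# Group-aware ranking sketch: one group per viewed SERP (the groupId above).
import lightgbm as lgb
import numpy as np

X = np.random.rand(300, 4852)           # feature vectors
y = np.random.randint(0, 4, size=300)   # graded relevance labels per row
group_sizes = [10] * 30                 # rows per SERP, in row order

train = lgb.Dataset(X, label=y, group=group_sizes)
params = {'objective': 'lambdarank', 'metric': 'ndcg', 'ndcg_eval_at': [10]}
model = lgb.train(params, train, num_boost_round=100)
# model.predict(X) yields scores that are only meaningful within a group.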
--justin
Right now the standard is that all queries that are released must be reviewed by humans. A query data dump had to be retracted in the past for containing PII, so I don't see us getting around that (nor would I want to, really, having seen the kind of info that can be in there).
We did the manual review for the Discernatron query data, but it's not scalable for the size of dataset needed to do machine learning. However, if anyone has any good ideas for features, please let us know, and maybe we can generate those features and share them, too, time permitting.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Quite understandable. It's also possible to augment the dataset w/ some percentage (perhaps ~5%) of the data having the human-reviewed, PII-safe query.
On the PII topic, one missing feature is user geolocation. This would help disambiguate user intent for queries that are geolocal. For instance, [civic center] (https://en.wikipedia.org/w/index.php?search=civic+center) is a location search, [john marks] (https://en.wikipedia.org/w/index.php?search=john+marks) is a people query, and [air marshal] (https://en.wikipedia.org/w/index.php?search=air+marshal) has alternative meanings in the US/UK. Reducing the lat/lng to the metropolitan area, or even state level, may mitigate the PII impact. You can likely see examples of Google/Bing/DDG doing geo-based ranking by using a VPN and running [xyz site:wikipedia.org] queries.
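Coarsening could be as cheap as truncating the coordinates; mapping to a named metro area or state would need a reverse-geocoding step (not shown):

# Degree-level truncation (~100km) as a crude stand-in for metro/state bucketing.
def coarsen_location(lat, lng, precision=0):
    return (round(lat, precision), round(lng, precision))

coarsen_location(37.7793, -122.4193)  # -> (38.0, -122.0)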
Another feature I'd like to try: one-hot encoding of the top 1-5k page categories. That is, create N binary columns (one for each of the top categories across enwiki) in the dataset, where each column holds a 1/0 indicating whether the page for that training row is in that column's category. This would help uprank certain types of page categories, and can usefully interact w/ the word embeddings (word2vec) you're using.
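A sketch of that encoding; the category data here is invented for illustration:

# One binary column per top-N category; a page's row has 1s for each top
# category it belongs to.
from collections import Counter
import numpy as np

def top_category_encoder(page_categories, n=1000):
    """page_categories: dict of page_id -> list of category names."""
    counts = Counter(c for cats in page_categories.values() for c in cats)
    top = [c for c, _ in counts.most_common(n)]
    index = {c: i for i, c in enumerate(top)}
    def encode(page_id):
        row = np.zeros(len(top), dtype=np.uint8)
        for c in page_categories.get(page_id, []):
            if c in index:
                row[index[c]] = 1
        return row
    return top, encode

cats = {1: ['Living people', 'American singers'], 2: ['Living people']}
top, encode = top_category_encoder(cats, n=2)
encode(1)  # -> array([1, 1], dtype=uint8)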
--justin