Re: [AI] Ideas around a public release of ML training set for search - AI

3 Jan 2017

\o/ for a few days!  Thanks for the quick response Jacob!

On Tue, Jan 3, 2017 at 3:05 PM, Jacob Rogers <jrogers(a)wikimedia.org> wrote:

  Hi all,

 I'm not sure that I'll be able to get to this in a timely manner, so I'm
 ccing Aeryn as well. I should be able to review in a few days if it's not
 done by then. Apologies for the delay.

 Best,
 Jacob

 On Tue, Jan 3, 2017 at 9:42 AM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
 wrote:

  * fixed Jacob's email address

 On Tue, Jan 3, 2017 at 11:42 AM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
 wrote:
  Hey Erik,

 Sorry to be so late to respond, but I wanted to take the time to read
 your message and the holidays left me a bit scattered.

 Reviewing what you intend to release, I don't see any clear problems
 with the dataset.   I think the primary concern is really finding PII in
 the search string itself.  Generally this is due to people accidentally
 pasting PII into the query box and then having other queries linked to them
 through some sort of identifier.

 Is there any way (even very confounded) that you could positively
 identify any content from a query string?  It seems like the
 LongestMatchSize_mean_query_x_heading could be a strong indicator of
 content from the query string.  However we don't have anything in article
 headers that couldn't be innocently placed in a query string. E.g. let's
 say we had a header with "Homer Simpson, 742 Evergreen Terrace,
 636-555-1024"[1] and someone searched for that and got a perfect match.
 That wouldn't strongly imply that the person searching was Homer Simpson,
 as they could very likely be searching for a legitimate header (not
 suppressed or deleted) in Wikipedia.

 Either way, I've worked with Jacob Rogers (+CC) to review dataset
 publications like this in the past (see [2] for an example).  I think he'll
 have some more specific concerns and advice.

 1. https://en.wikipedia.org/wiki/The_Simpsons_house#Address_and_phone_number
 -- just so you know I'm not outing someone who might live at this address
 and phone number :)
 2. https://figshare.com/articles/Deleted_Wikipedia_articles_spam_vandalism_attack_/4245035

 -Aaron

 On Thu, Dec 22, 2016 at 6:00 PM, Erik Bernhardson
 <ebernhardson(a)wikimedia.org> wrote:

 tl/dr: Can feature vectors about the relevance of (query, page_id) pairs be
 released to the public if the final dataset only represents queries by
 numeric IDs?

 Over the past 2 months I've been spending free time investigating machine
 learning for ranking. One of the earlier things I tried, to get some
 semblance of proof that it could improve our search results, was to port a
 set of features for text ranking from an open source Kaggle competitor to a
 dataset I could create from our own data. For relevance targets I took
 queries that had clicks from at least 50 unique sessions over a 60 day
 period and ran them through a click model (DBN). Perhaps not as useful as
 human judgements, but I'm working with what I have available.
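
 A minimal sketch of the session-count filter described above, not the
 code actually used. It assumes a hypothetical click log of
 (session_id, normalized_query) tuples already restricted to the 60 day
 window; the function name and log layout are illustrative only, and the
 DBN click model step is not shown.

```python
from collections import defaultdict


def queries_with_enough_sessions(click_log, min_sessions=50):
    """Return normalized queries clicked in at least min_sessions sessions.

    click_log: iterable of (session_id, norm_query) tuples. This layout is
    an assumption for illustration, not the real log schema.
    """
    sessions_by_query = defaultdict(set)
    for session_id, norm_query in click_log:
        # A set keeps each session once, so repeat clicks are not double counted.
        sessions_by_query[norm_query].add(session_id)
    return {
        query
        for query, sessions in sessions_by_query.items()
        if len(sessions) >= min_sessions
    }
```

 Only the queries returned here would then be passed on to the click
 model to derive relevance targets.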

 This actually showed it has some promise, and I've been moving further
 along. An idea was provided to me though about releasing the feature
 vectors from my initial investigation in an open format that might be
 useful for others. Each feature vector is for a (query, hit_page_id) pair
 that was displayed to at least 50 users.

 I don't have my original data, but I have all the code and just ran
 through it with 100 normalized queries to get a count, and there are 4852
 features. Lots of them are probably useless, but choosing which ones is
 probably half the battle. These are ~230MB in pickle format, which stores
 the floats in binary. This can then be compressed to ~20MB with gzip, so
 the data size isn't particularly insane. In a released dataset I would
 probably use 10k normalized queries, meaning about 100x this size. We could
 plausibly release as CSVs instead of pickled numpy arrays. That will
 probably increase the data size further, but since we are only talking ~2GB
 after compression it could go either way.
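
 The size estimate above works out as follows. This is back-of-envelope
 arithmetic only, and it assumes the data size scales linearly with the
 number of normalized queries, which the measurements on 100 queries
 suggest but do not guarantee.

```python
# Figures quoted in the message for the 100-query sample.
SAMPLE_QUERIES = 100
RELEASE_QUERIES = 10_000
SAMPLE_PICKLE_MB = 230   # pickled numpy arrays, floats stored in binary
SAMPLE_GZIP_MB = 20      # same data after gzip

# Assumption: size grows linearly with the number of queries.
scale = RELEASE_QUERIES // SAMPLE_QUERIES

est_pickle_gb = SAMPLE_PICKLE_MB * scale / 1000
est_gzip_gb = SAMPLE_GZIP_MB * scale / 1000

print(f"scale factor: {scale}x")
print(f"uncompressed pickle: ~{est_pickle_gb:.0f} GB")
print(f"gzip compressed: ~{est_gzip_gb:.0f} GB")
```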

 The list of feature names is in https://phabricator.wikimedia.org/P4677
 A few example feature names and their meaning, which hopefully is enough to
 understand the rest of the feature names:

 DiceDistance_Bigram_max_norm_query_x_outgoing_link_1D.pkl
 - Dice distance of bigrams in the normalized (stemmed) query string versus
 outgoing links. Outgoing links are an array field, so the Dice distance is
 calculated per item and this feature has the max value.

 DigitCount_query_1D.pkl
 - Number of digits in the raw user query.

 ES_TFIDF_Unigram_Top50_CosineSim_norm_query_category.plain_termvec_x_category.plain_termvec_1D.pkl
 - Cosine similarity of the top 50 terms, as reported by the elasticsearch
 termvectors api, of the normalized query vs the category.plain field of the
 matching document. More terms would perhaps have been nice, but doing this
 all offline in python made that a bit of a time+space tradeoff.

 Ident_Log10_score_mean_query_opening_text_termvec_1D.pkl
 - Log base 10 of the score from the elasticsearch termvectors api on
 the raw user query applied to the opening_text field analysis chain.

 LongestMatchSize_mean_query_x_heading_1D.pkl
 - Mean longest match, in number of characters, of the query vs the list
 of headings for the page.
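
 To make three of these definitions concrete, here is a rough sketch. It
 is not the code that produced the dataset. Assumptions: bigrams are
 word bigrams over whitespace tokens, stemming is skipped, the "dice"
 value is computed as the Sorensen-Dice coefficient 2|A n B| / (|A| + |B|)
 (the message does not say whether the feature stores this or 1 minus
 it), and the longest match is the longest common contiguous substring.
 All function names are hypothetical.

```python
from difflib import SequenceMatcher


def digit_count(query):
    """Count characters in the raw query for which str.isdigit() is true."""
    return sum(ch.isdigit() for ch in query)


def word_bigrams(text):
    """Set of adjacent lowercase token pairs (assumed bigram definition)."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def dice_coefficient(a, b):
    """Sorensen-Dice coefficient of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def max_bigram_dice(query, items):
    """Score the query against each item of an array field, keep the max."""
    q = word_bigrams(query)
    return max((dice_coefficient(q, word_bigrams(i)) for i in items), default=0.0)


def longest_match_size(a, b):
    """Length in characters of the longest common contiguous substring."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def mean_longest_match(query, headings):
    """Mean of longest_match_size over all headings; 0.0 if there are none."""
    if not headings:
        return 0.0
    return sum(longest_match_size(query, h) for h in headings) / len(headings)
```

 Note that a query identical to one heading yields a longest match equal
 to the query length for that heading, which is why this kind of feature
 can hint at the query content.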

 The main question here, I think, is whether this is still PII. The
 exact queries would be normalized into IDs and not released. We could
 leave the page_id in or out of the dataset. With it left in, people using
 the dataset could plausibly come up with their own query-independent
 features to add. With a large enough feature vector for (query_id, page_id)
 the query could theoretically be reverse engineered, but from a more
 practical side I'm not sure that's really a valid concern.
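
 One possible way to replace queries with numeric IDs before release,
 sketched under assumptions: rows are hypothetical
 (normalized_query, page_id, features) tuples, and IDs are assigned in
 shuffled order so that the ID itself does not encode alphabetical or
 popularity order. This only hides the query text; it does not address
 reverse engineering from the feature values.

```python
import random


def assign_query_ids(norm_queries, rng=None):
    """Map each distinct normalized query to an ID in 0..n-1, in random order.

    rng: optional random.Random for reproducibility; defaults to SystemRandom.
    """
    unique = sorted(set(norm_queries))
    (rng or random.SystemRandom()).shuffle(unique)
    return {query: query_id for query_id, query in enumerate(unique)}


def to_release_rows(rows, query_ids, keep_page_id=True):
    """Replace the query text in each row with its ID; optionally drop page_id.

    rows: iterable of (norm_query, page_id, features) tuples (assumed layout).
    """
    released = []
    for norm_query, page_id, features in rows:
        key = (query_ids[norm_query],)
        if keep_page_id:
            key += (page_id,)
        released.append(key + tuple(features))
    return released
```

 The query_ids mapping itself would stay private; only the output of
 to_release_rows would be published.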

 Thoughts? Concerns? Questions?


 --

 Jacob Rogers
 Legal Counsel
 Wikimedia Foundation

 NOTICE: This message might have confidential or legally privileged
 information in it. If you have received this message by accident, please
 delete it and let us know about the mistake. As an attorney for the
 Wikimedia Foundation, for legal/ethical reasons I cannot give legal advice
 to, or serve as a lawyer for, community members, volunteers, or staff
 members in their personal capacity. For more on what this means, please see
 our legal disclaimer
 <https://meta.wikimedia.org/wiki/Wikimedia_Legal_Disclaimer>.