tl;dr: Can feature vectors about the relevance of (query, page_id) pairs be
released to the public if the final dataset only represents queries with
clicks from at least 50 unique sessions?
Over the past 2 months I've been spending free time investigating machine
learning for ranking. One of the earlier things I tried, to get some
semblance of proof it had the ability to improve our search results, was to
port a set of features for text ranking from an open source Kaggle
competitor to a dataset I could create from our own data. For relevance
targets I took queries that had clicks from at least 50 unique sessions
over a 60 day period and ran them through a click model (DBN). Perhaps not
as useful as human judgements, but I'm working with what I have.
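The session-threshold filter above is straightforward to sketch. This is a
minimal illustration with a made-up click log and helper name
(queries_with_min_sessions is hypothetical, and the threshold is lowered to
2 for the toy data; the real pipeline uses 50 and then feeds the surviving
queries into a DBN click model):

```python
from collections import defaultdict

# Hypothetical click log rows: (query, session_id, clicked_page_id).
click_log = [
    ("cat", "s1", 10), ("cat", "s2", 10), ("cat", "s3", 11),
    ("dog", "s1", 20), ("dog", "s1", 20),
]

def queries_with_min_sessions(rows, min_sessions=50):
    """Return queries whose clicks come from at least min_sessions unique sessions."""
    sessions = defaultdict(set)
    for query, session_id, _page_id in rows:
        sessions[query].add(session_id)
    return {q for q, s in sessions.items() if len(s) >= min_sessions}

# "cat" has clicks from 3 distinct sessions, "dog" from only 1.
print(queries_with_min_sessions(click_log, min_sessions=2))  # {'cat'}
```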
This actually showed some promise, and I've been moving further along. An
idea was floated to me, though, about releasing the feature vectors from my
initial investigation in an open format that might be useful for others.
Each feature vector is for a (query, hit_page_id) pair that was displayed
to at least 50 users.
I don't have my original data, but I have all the code and just ran through
it with 100 normalized queries to get a count: there are 4852 features.
Lots of them are probably useless, but choosing which ones is probably half
the battle. These are ~230MB in pickle format, which stores the floats in
binary. This can then be compressed to ~20MB with gzip, so the data size
isn't particularly insane. In a released dataset I would probably use 10k
normalized queries, meaning about 100x this size. We could plausibly
release csv's instead of pickled numpy arrays. That would probably increase
the data size further, but since we are only talking ~2GB after
compression, we could go either way.
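The pickle+gzip pipeline above is just the standard library. Here is a
minimal sketch with a toy feature matrix matching the counts mentioned (100
query vectors x 4852 features); the sparsity injected below is an
assumption to mimic the high compression ratio seen on the real data, not a
property of the actual feature set:

```python
import gzip
import pickle

import numpy as np

# Toy feature matrix: 100 vectors x 4852 float64 features.
features = np.random.default_rng(0).random((100, 4852))
# Assumption for illustration: zero out ~90% of entries so the matrix
# compresses well, loosely mimicking mostly-inactive features.
features[features < 0.9] = 0.0

raw = pickle.dumps(features)          # binary floats, as in the real dump
compressed = gzip.compress(raw)       # what would actually be shipped

print(len(raw), len(compressed))
```

Swapping in csv output (e.g. np.savetxt) trades the compact binary float
representation for readability, which is why the csv variant would come out
larger before compression.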
The list of feature names is in https://phabricator.wikimedia.org/P4677. A
few example feature names and their meanings, which hopefully are enough to
understand the rest of the feature names:
- Dice distance of bigrams in the normalized (stemmed) query string versus
outgoing links. Outgoing links are an array field, so the dice distance is
calculated per item and this feature takes the max value.
- Number of digits in the raw user query
- Cosine similarity of the top 50 terms, as reported by the elasticsearch
termvectors API, of the normalized query vs the category.plain field of the
matching document. More terms would perhaps have been nice, but doing this
all offline in python made that a bit of a time+space tradeoff.
- Log base 10 of the score from the elasticsearch termvectors API on the
raw user query applied to the opening_text field analysis chain.
- Mean longest match, in number of characters, of the query vs the list of
headings for the page
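To make the first feature concrete, here is a sketch of a Dice distance
over bigrams, reduced with max across an array field. I'm assuming
word-level bigrams and these helper names (word_bigrams, dice_distance,
max_dice_distance are illustrative, not the actual feature code):

```python
def word_bigrams(text):
    """Set of adjacent word pairs in a lowercased text. Assumes word-level
    bigrams; the real feature may tokenize differently."""
    tokens = text.lower().split()
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def dice_distance(a, b):
    """1 - Dice coefficient (2*|A&B| / (|A|+|B|)) of the two bigram sets."""
    ba, bb = word_bigrams(a), word_bigrams(b)
    if not ba and not bb:
        return 0.0
    return 1.0 - 2.0 * len(ba & bb) / (len(ba) + len(bb))

def max_dice_distance(query, outgoing_links):
    """Distance is computed per item of the array field; the feature keeps
    the max value, as described above."""
    if not outgoing_links:
        return 1.0
    return max(dice_distance(query, link) for link in outgoing_links)

print(dice_distance("machine learning", "machine learning"))  # 0.0
print(max_dice_distance("machine learning ranking",
                        ["machine learning", "learning to rank"]))  # 1.0
```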
The main question here, I think, is whether this is still PII. The exact
queries would be normalized into id's and not released. We could leave the
page_id in or out of the dataset. With it left in, people using the dataset
could plausibly come up with their own query-independent features to add.
With a large enough feature vector for (query_id, page_id) the query could
theoretically be reverse engineered, but from a more practical side I'm not
sure that's really a valid concern.
Thoughts? Concerns? Questions?
Due to an API change in Wikibase which happened several weeks ago, one of
ORES' dependencies (pywikibase) broke and caused ORES to fail on a small
proportion of Wikidata edits. That proportion got bigger and bigger until
today, when we had an unscheduled deployment of ORES. Everything is okay
now, but we might lose some scores on edits in the ORES review tool; I am
running a maintenance script to fill them in now.
More details can be found in https://phabricator.wikimedia.org/T154168
Amir Sarabadani Tafreshi
Software Engineer (contractor)
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
I was extracting the Wikipedia cirrus dump of articles using
?action=cirrusDump for feature extraction from articles and noticed two
keys, "score" and "popularity_score". Can anyone tell me what exactly these
keys denote and how they are calculated?
I'm curious to know the possible use cases of these scores in Machine
Learning as I'm currently processing articles.