tl;dr: Can feature vectors about the relevance of (query, page_id) pairs be
released to the public if the final dataset represents queries only as
numeric IDs?
Over the past 2 months I've been spending free time investigating machine
learning for ranking. One of the earlier things I tried, to get some
semblance of proof that it could improve our search results, was porting a
set of features for text ranking from an open source Kaggle competitor to a
dataset I could create from our own data. For relevance targets I took
queries that had clicks from at least 50 unique sessions over a 60-day
period and ran them through a click model (DBN). Perhaps not as useful as
human judgements, but I'm working with what I have available.
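For concreteness, the session-count filter is simple to express; here is a
minimal pandas sketch, assuming a click log with one row per impression and
hypothetical file/column names (the DBN training itself happens separately):

import pandas as pd

# Hypothetical click log, one row per (session, query, page) impression;
# the file name and column names are assumptions, not our actual schema.
clicks = pd.read_csv("click_log_60d.csv")

# Keep only queries clicked in at least 50 unique sessions.
per_query = clicks.groupby("query")["session_id"].nunique()
eligible = per_query[per_query >= 50].index
filtered = clicks[clicks["query"].isin(eligible)]

# `filtered` is what would be fed to the DBN click model to produce
# relevance labels for each (query, page_id) pair.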
This showed some promise, and I've been moving further along. An idea was
floated to me, though, about releasing the feature vectors from my initial
investigation in an open format that might be useful to others. Each
feature vector is for a (query, hit_page_id) pair that was displayed to at
least 50 users.
I don't have my original data, but I have all the code and just ran through
it with 100 normalized queries to get a count: there are 4852 features.
Lots of them are probably useless, but choosing which ones is probably half
the battle. These are ~230MB in pickle format, which stores the floats in
binary. That can be compressed to ~20MB with gzip, so the data size isn't
particularly insane. In a released dataset I would probably use 10k
normalized queries, meaning about 100x this size. We could plausibly
release CSVs instead of pickled numpy arrays. That would probably increase
the data size further, but since we are only talking ~2GB after
compression it could go either way.
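As a rough illustration of the pickle vs. CSV trade-off (the matrix shape
and file names here are made up, not the real dataset):

import gzip
import pickle
import shutil
import numpy as np

# Stand-in feature matrix: a few hundred rows x 4852 features.
X = np.random.rand(500, 4852)

# Binary pickle, gzipped.
with gzip.open("features.pkl.gz", "wb") as f:
    pickle.dump(X, f)

# Plain-text CSV, gzipped: bigger on disk, but readable from anything.
np.savetxt("features.csv", X, delimiter=",")
with open("features.csv", "rb") as src, \
        gzip.open("features.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)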
The list of feature names is in https://phabricator.wikimedia.org/P4677. A
few example feature names and their meanings, which hopefully are enough to
understand the rest (a rough Python sketch of a few of them follows the
list):
DiceDistance_Bigram_max_norm_query_x_outgoing_link_1D.pkl
- dice distance of bigrams in the normalized (stemmed) query string versus
outgoing links. Outgoing links are an array field, so the dice distance is
calculated per item and this feature takes the max value.
DigitCount_query_1D.pkl
- Number of digits in the raw user query
ES_TFIDF_Unigram_Top50_CosineSim_norm_query_category.plain_termvec_x_category.plain_termvec_1D.pkl
- Cosine similarity of the top 50 terms, as reported by the elasticsearch
termvectors API, of the normalized query vs the category.plain field of the
matching document. More terms would perhaps have been nice, but doing this
all offline in Python made that a bit of a time+space tradeoff.
Ident_Log10_score_mean_query_opening_text_termvec_1D.pkl
- log base 10 of the score from the elasticsearch termvectors API on the
raw user query applied to the opening_text field analysis chain.
LongestMatchSize_mean_query_x_heading_1D.pkl
- mean longest match, in number of characters, of the query vs the list of
headings for the page
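Here is a rough Python reimplementation of a few of the features above,
for illustration only; the real extraction code differs, and the
term -> weight dicts for the cosine similarity would come from the
elasticsearch termvectors API:

import math
from difflib import SequenceMatcher

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_distance(a, b):
    # 1 - Dice coefficient of two bigram sets.
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))

def dice_bigram_max(norm_query, outgoing_links):
    # DiceDistance_Bigram_max_norm_query_x_outgoing_link:
    # per-item distance over the array field, then the max.
    qb = bigrams(norm_query)
    return max(dice_distance(qb, bigrams(link)) for link in outgoing_links)

def digit_count(query):
    # DigitCount_query: number of digits in the raw user query.
    return sum(c.isdigit() for c in query)

def longest_match_mean(query, headings):
    # LongestMatchSize_mean_query_x_heading: mean longest common
    # substring length, in characters, against each heading.
    sizes = [SequenceMatcher(None, query, h)
             .find_longest_match(0, len(query), 0, len(h)).size
             for h in headings]
    return sum(sizes) / len(sizes)

def cosine_sim(tv_a, tv_b):
    # Cosine similarity of two term -> weight dicts, e.g. the top 50
    # TF-IDF terms reported by the termvectors API.
    dot = sum(w * tv_b[t] for t, w in tv_a.items() if t in tv_b)
    norm = (math.sqrt(sum(w * w for w in tv_a.values()))
            * math.sqrt(sum(w * w for w in tv_b.values())))
    return dot / norm if norm else 0.0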
The main question here, I think, is: is this still PII? The exact queries
would be normalized into IDs and not released. We could leave the page_id
in or out of the dataset; with it left in, people using the dataset could
plausibly come up with their own query-independent features to add. With a
large enough feature vector for a (query_id, page_id) pair the query could
theoretically be reverse engineered, but from a practical standpoint I'm
not sure that's really a valid concern.
Thoughts? Concerns? Questions?
Hey,
Due to an API change in Wikibase which happened several weeks ago, one of
ORES' dependencies (pywikibase) broke, causing ORES to fail on a small
proportion of Wikidata edits. The proportion kept growing until today, when
we had an unscheduled deployment of ORES. Everything is okay now, but we
may have lost some scores on edits in the ORES review tool; I'm running a
maintenance script to fill them in now.
More details can be found in https://phabricator.wikimedia.org/T154168
Happy holidays,
Cheers!
--
Amir Sarabadani Tafreshi
Software Engineer (contractor)
-------------------------------------
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
Hi,
I was extracting the Wikipedia cirrus dump of articles using
?action=cirrusDump for feature extraction and noticed two keys, "score" and
"popularity_score". Can anyone tell me what exactly these keys denote and
how they're calculated?
I'm curious to know the possible use cases of these scores in machine
learning, as I'm currently processing articles.
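In case it helps others poking at the same data, a minimal fetch looks
something like this (the article title is just an example, and I'm assuming
the array-of-documents JSON layout; adjust if the structure differs):

import requests

# Fetch the cirrus (search index) document for one page.
resp = requests.get(
    "https://en.wikipedia.org/wiki/Albert_Einstein",
    params={"action": "cirrusdump"},
)
dump = resp.json()

# Print the indexed fields, including the two score keys asked about
# above.
for key in sorted(dump[0]["_source"]):
    print(key)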
--
-Thanks,
Sumit <http://mediawiki.org/wiki/User:Sumit.iitp>