AI September 2016

ai@lists.wikimedia.org

7 participants
17 discussions

Fwd: [Commons-l] Programmatically categorizing media in the Commons with Machine Learning

by Pine W

Forwarding.

Pine

---------- Forwarded message ----------
From: "Jordan Adler" <jmadler(a)google.com>
Date: Aug 11, 2016 13:06
Subject: [Commons-l] Programmatically categorizing media in the Commons with Machine Learning
To: "commons-l(a)wikimedia.org" <commons-l(a)lists.wikimedia.org>
Cc: "Ray Sakai" <rsakai(a)reactive.co.jp>, "Ram Ramanathan" <ramramanathan(a)google.com>, "Kazunori Sato" <kazsato(a)google.com>

Hey folks!

A few months back a colleague of mine was looking for some unstructured images to analyze as part of a demo for the Google Cloud Vision API <https://cloud.google.com/blog/big-data/2016/05/explore-the-galaxy-of-images…>. Luckily, I knew just the place <https://commons.wikimedia.org/wiki/Category:Media_needing_categories>, and the resulting demo <http://vision-explorer.reactive.ai/>, built by Reactive Inc., is pretty awesome. It was shared on-stage by Jeff Dean during the keynote <https://www.youtube.com/watch?v=HgWHeT_OwHc&feature=youtu.be&t=2h1m19s> at GCP NEXT 2016.

I wanted to quickly share the data from the programmatically identified images so it could be used to help categorize the media in the Commons. There's about 80,000 images worth of data:

- map.txt <https://storage.googleapis.com/gcs-samples2-explorer/reprocess/map.txt> (5.9MB): A single text file mapping id to filename in a "id : filename" format, one per line
- results.tar.gz <https://storage.googleapis.com/gcs-samples2-explorer/reprocess/results.tar.…> (29.6MB): a tgz'd directory of json files representing the output of the API <https://cloud.google.com/vision/reference/rest/v1/images/annotate#response-…>, in the format "${id}.jpg.json"

We're making this data available under the CC0 license, and these links will likely be live for at least a few weeks.

If you're interested in working with the Cloud Vision API to tag other images in the Commons, talk to the WMF Community Tech team.

Thanks for your help!
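A minimal sketch of reading the map file described above, assuming only what the message states: one "id : filename" pair per line, and result files named "${id}.jpg.json". The function names and the sample ids are hypothetical, chosen for illustration.

```python
from typing import Dict


def parse_map(text: str) -> Dict[str, str]:
    """Parse "id : filename" lines into a dict keyed by id."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first " : " only, since filenames may contain colons.
        image_id, sep, filename = line.partition(" : ")
        if not sep:
            continue  # skip lines that do not match the described format
        mapping[image_id.strip()] = filename.strip()
    return mapping


def result_name(image_id: str) -> str:
    """Name of the JSON result file for an id, per the "${id}.jpg.json" pattern."""
    return f"{image_id}.jpg.json"


# Hypothetical sample lines; the real ids and filenames come from map.txt.
sample = "0001 : Example_photo.jpg\n0002 : File: with colon.png\n"
ids = parse_map(sample)
```

With the mapping in hand, each id can be joined to its JSON result via `result_name(image_id)` after extracting the archive.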

7 years

WikiDev17

by Gergo Tisza

I made a suggestion [1] in the ongoing discussion about the Wikimedia Developer Summit [2] in January that AI should be a major topic. I am not exactly an expert on it, but my impression is that the Wikimedia movement is largely failing to notice the beginnings of a huge shift in user expectations towards smarter tools and interfaces. While there is some attention to it (as the existence of this list proves), I don't think it is proportional to the importance of the topic, and the summit might be a good chance to raise attention.

Input from people who, unlike me, actually know what they are talking about would be very welcome on the wiki page :)

[1] https://www.mediawiki.org/wiki/Topic:Tcfsas6exo2gd3ug
[2] https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit

7 years, 7 months

The Revision Scoring weekly update

by Aaron Halfaker

Hey,

This is the 23rd weekly update from the revision scoring team that we have sent to this mailing list.

New development
- We implemented and demonstrated a linguistic/stylometric processing strategy that should give us more signal for finding vandalism and spam[1]. See the discussion on the AI list[2].
- As part of our support for the Collaboration Team, we've been producing tables of model statistics that correspond to a set of thresholds[3]. This helps their designers work on strategies for reporting prediction confidence in an intuitive way.

Maintenance and robustness
- We had a major downtime event that was caused by our logs being too verbose. We've recovered and turned down the log level[4].
- We made sure that halfak gets pings when ores.wikimedia.org goes down[5].

Datasets
- We created a database on Wikimedia Labs that provides access to a dataset containing a complete set of article quality predictions for English Wikipedia[6]. See our announcements[7,8,9].

1. https://phabricator.wikimedia.org/T146335 -- Implement a basic scoring strategy for PCFGs
2. https://lists.wikimedia.org/pipermail/ai/2016-September/000098.html
3. https://phabricator.wikimedia.org/T146280 -- Produce tables of stats for damaging and goodfaith models
4. https://phabricator.wikimedia.org/T146581 -- celery log level is INFO causing disruption on ORES service
5. https://phabricator.wikimedia.org/T146720 -- Ensure that halfak gets emails when ores.wikimedia.org goes down
6. https://phabricator.wikimedia.org/T106278 -- Setup a db on labsdb for article quality that is publicly accessible
7. https://phabricator.wikimedia.org/T146156 -- Announce article quality database in labsdb
8. https://lists.wikimedia.org/pipermail/ai/2016-September/000091.html
9. https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_14…

Sincerely,
Aaron from the Revision Scoring team
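To make the "tables of model statistics that correspond to a set of thresholds" concrete, here is a rough sketch. The update does not say which statistics the team reported, so precision and recall are assumed here, and the scores, labels and function name are made up for illustration.

```python
def stats_at_thresholds(scores, labels, thresholds):
    """Return one (threshold, precision, recall) row per threshold.

    An item is predicted positive when its score is >= the threshold.
    A statistic is None when its denominator is zero.
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        rows.append((t, precision, recall))
    return rows


# Hypothetical model scores and true labels (True = damaging edit).
scores = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [True, True, False, True, False]
table = stats_at_thresholds(scores, labels, [0.5, 0.9])
```

Raising the threshold from 0.5 to 0.9 in this toy data trades recall for precision, which is the kind of relationship a designer would need to see when deciding how to present prediction confidence.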

7 years, 7 months

Re: [AI] [Wiki-research-l] Google open source research on automatic image captioning

by Pine W

Also:

1. If someone is paid to do captioning and/or categorization work, such as by a GLAM institution or a Wikimedia affiliate with a budget that supports this kind of work, then integrating this research into Wikimedia workflows could significantly increase that person's cost-effectiveness.

2. If volunteers are uploading large quantities of photos, this may make captioning and categorization much less time consuming and therefore volunteers may be more likely to do substantial captioning and categorization work instead of doing the minimum amount of work necessary.

Pine

On Wed, Sep 28, 2016 at 12:19 AM, Jan Dittrich <jan.dittrich(a)wikimedia.de> wrote:

> I find it interesting which impact this could have on the sense of
> achievement for volunteers, if captions are autogenerated or suggested and
> then possibly affirmed or corrected.
> On one hand one could assume a decreased sense of ownership,
> on the other hand, it might be easier to comment/correct than to
> write from scratch and feel much more efficient.
>
> Jan
>
> 2016-09-27 23:08 GMT+02:00 Dario Taraborelli <dtaraborelli(a)wikimedia.org>:
>
>> I forwarded this separately internally at WMF a few days ago. Clearly
>> – before thinking of building workflows for human contributors to generate
>> captions or rich descriptors of media files in Commons – we should look at
>> what's available in terms of off-the-shelf machine learning services and
>> libraries.
>>
>> #1 rule of sane citizen science/crowdsourcing projects: don't ask humans
>> to perform tedious tasks machines are pretty good at, get humans to curate
>> inputs and outputs of machines instead.
>>
>> D
>>
>> On Mon, Sep 26, 2016 at 5:55 PM, Pine W <wiki.pine(a)gmail.com> wrote:
>>
>>> Perhaps of interest: "...We’re making the latest version of our image
>>> captioning system available as an open source model in TensorFlow."
>>> https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
>>>
>>> Pine
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>> --
>> *Dario Taraborelli*
>> Head of Research, Wikimedia Foundation
>> wikimediafoundation.org • nitens.org • @readermeter
>> <http://twitter.com/readermeter>
>
> --
> Jan Dittrich
> UX Design/ User Research
>
> Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
> Phone: +49 (0)30 219 158 26-0
> http://wikimedia.de
>
> Imagine a world, in which every single human being can freely share in the
> sum of all knowledge. That's our commitment.
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
> Registered in the register of associations of the Berlin-Charlottenburg
> district court under number 23855 B. Recognized as a charitable organization
> by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.

7 years, 7 months

NLP (PCFG) work

by Aaron Halfaker

I've been looking at some recent work that used Probabilistic Context-free Grammars[1,2] to detect vandalism in Wikipedia. I wanted to send a quick message to share some progress. I've built a python library[3] that implements a really simple PCFG training and scoring strategy and written a quick demo of how it can work.

In the following demo, I show how we can build a probabilistic grammar using the I'm a Little Teapot song[4]. Note how sentences that are not characteristic of the song score lower. Note that scores are log-scaled.

>>> sentences = [
...     "I am a little teapot",
...     "Here is my handle",
...     "Here is my spout",
...     "When I get all steamed up I just shout tip me over and pour me out",
...     "I am a very special pot",
...     "It is true",
...     "Here is an example of what I can do",
...     "I can turn my handle into a spout",
...     "Tip me over and pour me out"]
>>>
>>> teapot_grammar = TreeScorer.from_tree_bank(bllip_parse(s) for s in sentences)
>>>
>>> teapot_grammar.score(bllip_parse("Here is a little teapot"))
-9.392661928770137
>>> teapot_grammar.score(bllip_parse("It is my handle"))
-10.296301543090733
>>> teapot_grammar.score(bllip_parse("I am a spout"))
-10.40166205874856
>>> teapot_grammar.score(bllip_parse("Your teapot is gay"))
-12.96352974967269
>>> teapot_grammar.score(bllip_parse("Your mom's teapot is asldasnldansldal"))
-19.424997926026403

This work is inspired by work that Arthur Tilley did on our team last year[5]. The 'kasami' library represents a narrow slice of Arthur's work.

Next, I'm working on building out revscoring to implement some features that use the scoring strategy on sentences modified in an edit. I'm hoping that this type of feature engineering will allow us to catch edits that make articles more/less notable. I'm also targeting spammy language and insults.

1. https://en.wikipedia.org/wiki/Stochastic_context-free_grammar
2. http://pub.cs.sunysb.edu/~rob/papers/acl11_vandal.pdf
3. https://github.com/halfak/kasami
4. https://en.wikipedia.org/wiki/I%27m_a_Little_Teapot
5. https://github.com/aetilley/pcfg

-Aaron
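For readers who want to see the general idea behind log-scaled PCFG scores without a parser installed, here is a toy sketch. It is not the kasami implementation: the nested-tuple tree format, the `ToyTreeScorer` name, and the fixed floor probability for unseen productions are all assumptions made for illustration. Rule probabilities are plain relative frequencies, and a tree's score is the sum of the log probabilities of its productions.

```python
import math
from collections import Counter, defaultdict


def productions(tree):
    """Yield (lhs, rhs) pairs for every internal node of a nested-tuple tree."""
    label, *children = tree
    # A child is either a word (str) or a subtree whose first item is its label.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


class ToyTreeScorer:
    """Relative-frequency PCFG with a fixed floor for unseen productions."""

    def __init__(self, tree_bank, unseen_prob=1e-6):
        counts = defaultdict(Counter)
        for tree in tree_bank:
            for lhs, rhs in productions(tree):
                counts[lhs][rhs] += 1
        # log P(rhs | lhs) estimated as count(lhs -> rhs) / count(lhs)
        self.log_probs = {
            (lhs, rhs): math.log(n / sum(rhs_counts.values()))
            for lhs, rhs_counts in counts.items()
            for rhs, n in rhs_counts.items()
        }
        self.unseen_log_prob = math.log(unseen_prob)  # assumed smoothing choice

    def score(self, tree):
        """Sum of log probabilities over all productions in the tree."""
        return sum(
            self.log_probs.get(p, self.unseen_log_prob) for p in productions(tree)
        )


def sentence(subject, verb, noun):
    """Build a hand-written parse tree; a real setup would use a parser."""
    return ("S", ("NP", subject), ("VP", ("V", verb), ("NP", noun)))


bank = [sentence("I", "am", "teapot"), sentence("I", "am", "pot")]
scorer = ToyTreeScorer(bank)
seen = scorer.score(sentence("I", "am", "teapot"))
unseen = scorer.score(sentence("I", "am", "asldasnl"))
```

As in the demo above, the tree containing a production never seen in training receives a much lower (more negative) score than one built entirely from familiar productions.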

7 years, 7 months

Google open source research on automatic image captioning

by Pine W

Perhaps of interest: "...We’re making the latest version of our image captioning system available as an open source model in TensorFlow."

https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open…

Pine

7 years, 7 months

Unexpected down time for ORES and unscheduled deployment right now

by Amir Ladsgroup

Today ORES in production was sending out an unreasonable number of timeout errors, causing icinga to scream and a 14% failure rate on average for ORES review tool jobs. It turned out that the ORES workers were logging too much, causing the nodes to run out of disk space [1]. I suspect we had a similar issue on our labs nodes. I made changes for prod and labs and deployed them today. You can find more details in the phab card.

[1]: https://phabricator.wikimedia.org/T146581

Cheers
Best

7 years, 7 months

The Revision Scoring weekly update

by Aaron Halfaker

Hey,

This is the 22nd weekly update from the revision scoring team that we have sent to this mailing list.

UI work:
- We configured the default threshold for the ORES review tool on Wikidata to be more strict (higher recall, lower precision)[1].
- We fixed a display issue on Special:Contributions where the filters would not wrap[2].

Increasing model fitness:
- We finished demonstrating model fitness gains using hash-vector features[3]. Next, we'll be working to get the hash-vector features implemented in revscoring/ORES[4].
- We implemented a new strategy for training and testing on all data using cross-validation[5]. This will both increase the fitness of the models and make the statistics reported more robust.

Maintenance and robustness
- We fixed an indexing issue in ores_model that prevented the deployment of updated models[6].
- We did a minor investigation into a short period of degraded service quality on WMF Labs[7].

1. https://phabricator.wikimedia.org/T144784 -- Change default threshold for Wikidata to high
2. https://phabricator.wikimedia.org/T143518 -- Filter on user contribs has nowrap, causing issues
3. https://phabricator.wikimedia.org/T128087 -- [Spike] Investigate HashingVectorizer
4. https://phabricator.wikimedia.org/T145812 -- Implement ~100 most important hash vector features in editquality models
5. https://phabricator.wikimedia.org/T142953 -- Train on all data, Report test statistics on cross-validation
6. https://phabricator.wikimedia.org/T144432 -- oresm_model index should not be unique
7. https://phabricator.wikimedia.org/T145353 -- Investigate short period of ores-web-03 insanity

Sincerely,
Aaron from the Revision Scoring team
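As background on the hash-vector features mentioned above, the following is a small sketch of the general hashing trick: each token is hashed to one of a fixed number of buckets, and the bucket counts become the feature vector. This is an illustration only, not the revscoring code and not scikit-learn's HashingVectorizer; the function name, the use of MD5, and the vector size of 16 are arbitrary assumptions.

```python
import hashlib


def hash_vector(tokens, n_features=16):
    """Map a list of tokens to a fixed-length vector of bucket counts."""
    vec = [0] * n_features
    for tok in tokens:
        # A stable hash (unlike Python's built-in hash()) keeps the mapping
        # identical across processes and runs.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_features
        vec[index] += 1
    return vec


vector = hash_vector(["teapot", "spout", "teapot"])
```

The vector length stays fixed no matter how large the vocabulary grows, at the cost of occasional collisions where unrelated tokens share a bucket.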

7 years, 7 months

ORES article quality data as a database table

by Amir Ladsgroup

One of ORES's [1] applications is determining article quality: for example, what would be the best assessment of an article at a given revision? Users in WikiProjects use ORES data to check whether articles need re-assessment, e.g. if an article is at the "Start" level and is now good enough to be a "B" article.

As part of our Q4 goals, we made a dataset of article quality scores for all articles in English Wikipedia [2] (here's the link to download the dataset [3]), and we are publishing it on figshare as something you can cite [4]. We are also working on publishing monthly data so that researchers can track changes in article quality over time [5].

As a pet project of mine, I always wanted to put this data in a database, so that we can query it and get much more useful data, for example the quality of articles in the category 'History_of_Essex' [6] [7]. The weighted sum is a measure of quality: a decimal number between 0 (really a stub) and 5 (definitely a featured article). We also have a prediction column, which is a number in this map [8]; for example, if the prediction is 5, it means ORES thinks it should be a featured article. I leave more use cases to your imagination :)

I'm looking for a more permanent place to put this data; please tell me if it's useful for you.

[1] ORES is not an anti-vandalism tool, it's an infrastructure to use AI in Wikipedia.
[2] https://phabricator.wikimedia.org/T135684
[3] (117 MBs) https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/wp10-…
[4] https://phabricator.wikimedia.org/T145332
[5] https://phabricator.wikimedia.org/T145655
[6] https://quarry.wmflabs.org/query/12647
[7] https://quarry.wmflabs.org/query/12662
[8] https://github.com/wiki-ai/wikiclass/blob/3ff2f6c44c52905c7202515c5c8b525fb…

Have fun!
Amir
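One plausible reading of the weighted sum, sketched under stated assumptions: each quality class gets an integer weight, and the weighted sum is the probability-weighted average of those weights. The message only confirms the endpoints (0 for a stub, 5 for a featured article); the intermediate class names and weights, the function names, and the sample probabilities below are assumptions for illustration, not the actual code behind the table.

```python
# Assumed class-to-number map; only 0 (stub) and 5 (featured article)
# are stated in the message.
WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}


def weighted_sum(probabilities):
    """Probability-weighted average of the class weights, between 0 and 5."""
    return sum(WEIGHTS[cls] * p for cls, p in probabilities.items())


def prediction(probabilities):
    """Number of the single most probable class."""
    best = max(probabilities, key=probabilities.get)
    return WEIGHTS[best]


# Hypothetical class probabilities for one article revision.
probs = {"Stub": 0.05, "Start": 0.10, "C": 0.20, "B": 0.40, "GA": 0.15, "FA": 0.10}
quality = weighted_sum(probs)
predicted = prediction(probs)
```

Because the weighted sum moves whenever probability mass shifts between classes, it can register small quality changes that leave the predicted class unchanged.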

7 years, 7 months

Re: [AI] all article ORES scores

by Peter Ekman

Amir,

Thanks for this. I mean WOW for lack of better words.

I'm especially impressed with the inclusion of the weighted scores, which allows the observation of small changes in quality. I was going to suggest that you could do the same thing for the end of every year, say for the last 5 years, so that we can see the improvement in articles - but that would be too much to ask. But then I noticed you are planning on doing this monthly. Double WOW.

Minor quibble - there are lots of disambiguation pages included. I'd delete those if possible.

Thanks again,

Pete

PS - WOW

7 years, 7 months
