AI April 2017

ai@lists.wikimedia.org

4 participants
7 discussions

Re: [AI] [Commons-l] Programmatically categorizing media in the Commons with Machine Learning
by Jordan Adler 19 Apr '17

GCP has a number of models-as-a-service <https://cloud.google.com/products/machine-learning/> that might be useful.

On Mon, Apr 3, 2017 at 6:46 PM Daniel Mietchen <daniel.mietchen(a)googlemail.com> wrote:

> Hi Jordan,
> can your pipeline help with video or perhaps even audio as well?
> There are lots of such files as well that need categorization.
> Thanks,
> Daniel
>
> On Tue, Apr 4, 2017 at 12:05 AM, Jordan Adler <jmadler(a)google.com> wrote:
> > Looks like some of these images still need categorization. I think there's
> > still an unrealized opportunity here to use the results I shared to work the
> > backlog of the category on the Commons.
> >
> > On Thu, Aug 11, 2016 at 1:47 PM Pine W <wiki.pine(a)gmail.com> wrote:
> >>
> >> Forwarding.
> >>
> >> Pine
> >>
> >> ---------- Forwarded message ----------
> >> From: "Jordan Adler" <jmadler(a)google.com>
> >> Date: Aug 11, 2016 13:06
> >> Subject: [Commons-l] Programmatically categorizing media in the Commons
> >> with Machine Learning
> >> To: "commons-l(a)wikimedia.org" <commons-l(a)lists.wikimedia.org>
> >> Cc: "Ray Sakai" <rsakai(a)reactive.co.jp>, "Ram Ramanathan"
> >> <ramramanathan(a)google.com>, "Kazunori Sato" <kazsato(a)google.com>
> >>
> >> Hey folks!
> >>
> >> A few months back a colleague of mine was looking for some unstructured
> >> images to analyze as part of a demo for the Google Cloud Vision API.
> >> Luckily, I knew just the place, and the resulting demo, built by Reactive
> >> Inc., is pretty awesome. It was shared on-stage by Jeff Dean during the
> >> keynote at GCP NEXT 2016.
> >>
> >> I wanted to quickly share the data from the programmatically identified
> >> images so it could be used to help categorize the media in the Commons.
> >> There's about 80,000 images worth of data:
> >>
> >> map.txt (5.9MB): A single text file mapping id to filename in a "id :
> >> filename" format, one per line
> >>
> >> results.tar.gz (29.6MB): a tgz'd directory of json files representing the
> >> output of the API, in the format "${id}.jpg.json"
> >>
> >> We're making this data available under the CC0 license, and these links
> >> will likely be live for at least a few weeks.
> >>
> >> If you're interested in working with the Cloud Vision API to tag other
> >> images in the Commons, talk to the WMF Community Tech team.
> >>
> >> Thanks for your help!
> >>
> >> _______________________________________________
> >> Commons-l mailing list
> >> Commons-l(a)lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/commons-l

Scoring Platform Team update
by Aaron Halfaker 14 Apr '17

Hey folks,

In this update, I'm going to change some things up to try and make this update easier for you to consume. The biggest change you'll notice is that I've broken up the [#] references in each section. I hope that saves you some scrolling and confusion. You'll also notice that I have changed the subject line from "Revision scoring" to "Scoring Platform" because it's now clear that, come July, I'll be leading a new team with that name at the Wikimedia Foundation. There'll be an announcement about that coming once our budget is finalized. I'll try to keep this subject consistent for the foreseeable future so that your email clients will continue to group the updates into one big thread.

*Deployments & maintenance:*

In this cycle, we've gotten better at tracking our deployments and noting what changes go out with each deployment. You can click on the phab task for a deployment and observe the sub-tasks to find out what was deployed. We had 3 deployments for ORES since mid-March[1,2,3]. We've had two deployments to Wikilabels[4,5] and we've added a maintenance notice for a short period of downtime that's coming up on April 21st[6,7].

1. https://phabricator.wikimedia.org/T160279 -- Deploy ores in prod (Mid-March)
2. https://phabricator.wikimedia.org/T160638 -- Deploy ORES late march
3. https://phabricator.wikimedia.org/T161748 -- Deploy ORES early April
4. https://phabricator.wikimedia.org/T161002 -- Late march wikilabels deployment
5. https://phabricator.wikimedia.org/T163016 -- Deploy Wikilabels mid-April
6. https://phabricator.wikimedia.org/T162888 -- Add header to Wikilabels that warns of upcoming maintenance.
7. https://phabricator.wikimedia.org/T162265 -- Manage wikilabels for labsdb1004 maintenance

*Making ORES better:*

We've been working to make ORES easier to extend and more useful. ORES now reports its relevant versions at https://ores.wikimedia.org/versions[8]. We've also reduced the complexity of our "precaching" system that scores edits before you ask for them[9,10]. We're taking advantage of logstash to store and query our logs[11]. We've also implemented some nice abstractions for requests and responses in ORES[13] that allowed us to improve our metrics tracking substantially[12].

8. https://phabricator.wikimedia.org/T155814 -- Expose version of the service and its dependencies
9. https://phabricator.wikimedia.org/T148714 -- Create generalized "precache" endpoint for ORES
10. https://phabricator.wikimedia.org/T162627 -- Switch `/precache` to be a POST end point
11. https://phabricator.wikimedia.org/T149010 -- Send ORES logs to logstash
12. https://phabricator.wikimedia.org/T159502 -- Exclude precaching requests from cache_miss/cache_hit metrics
13. https://phabricator.wikimedia.org/T161526 -- Implement ScoreRequest/ScoreResponse pattern in ORES

*New functionality:*

In the last month and a half, we've added basic support to Korean Wikipedia[14,15]. Props to Revi for helping us work through a bunch of issues with our Korean language support[16,17,18]. We've also gotten the ORES Review tool deployed to Hebrew Wikipedia[19,20,21,22] and Estonian Wikipedia[23,24,25]. We're also working with the Collaboration team to implement the threshold test statistics that they need to tune their new Edit Review interface[26] and we're working towards making this kind of work self-serve so that that product team and other tool developers won't have to wait on us to implement these threshold stats in the future[27].

14. https://phabricator.wikimedia.org/T161617 -- Deploy reverted model for kowiki
15. https://phabricator.wikimedia.org/T161616 -- Train/test reverted model for kowiki
16. https://phabricator.wikimedia.org/T160752 -- Korean generated word lists are in chinese
17. https://phabricator.wikimedia.org/T160757 -- Add language support for Korean
18. https://phabricator.wikimedia.org/T160755 -- Fix tokenization for Korean
19. https://phabricator.wikimedia.org/T161621 -- Deploy ORES Review Tool for hewiki
20. https://phabricator.wikimedia.org/T130284 -- Deploy edit quality models for hewiki
21. https://phabricator.wikimedia.org/T160930 -- Train damaging and goodfaith models for hewiki
22. https://phabricator.wikimedia.org/T130263 -- Complete hewiki edit quality campaign
23. https://phabricator.wikimedia.org/T159609 -- Deploy ORES review tool to etwiki
24. https://phabricator.wikimedia.org/T130280 -- Deploy edit quality models for etwiki
25. https://phabricator.wikimedia.org/T129702 -- Complete etwiki edit quality campaign
26. https://phabricator.wikimedia.org/T162377 -- Implement additional test_stats in editquality
27. https://phabricator.wikimedia.org/T162217 -- Implement "thresholds", deprecate "pile of tests_stats"

*ORES training / labeling campaigns:*

Thanks to a lot of networking at Wikimedia Conference and some help from Ijon (Asaf Bartov), we've found a bunch of new collaborators to help us deploy ORES to new wikis. As is critical in this process, we need to deploy labeling campaigns so that Wikipedians can help us train ORES. We've got new editquality labeling campaigns deployed to Albanian[28], Finnish[29], Latvian[30], Korean[31], and Turkish[32] Wikipedias.

We've also been working on a new type of model: "Item quality" in Wikidata. We've deployed, labeled, and analyzed a pilot[33], fixed some critical bugs that came up[34,35], and we've finally launched a 5k item campaign which is already 17% done[36]! See https://www.wikidata.org/wiki/Wikidata:Item_quality_campaign if you'd like to help us out.

28. https://phabricator.wikimedia.org/T161981 -- Edit quality campaign for Albanian Wikipedia
29. https://phabricator.wikimedia.org/T161905 -- Edit quality campaign for Finnish Wikipedia
30. https://phabricator.wikimedia.org/T162032 -- Edit quality campaign for Latvian Wikipedia
31. https://phabricator.wikimedia.org/T161622 -- Deploy editquality campaign in Korean Wikipedia
32. https://phabricator.wikimedia.org/T161977 -- Start v2 editquality campaign for trwiki
33. https://phabricator.wikimedia.org/T159570 -- Deploy the pilot of Wikidata item quality campaign
34. https://phabricator.wikimedia.org/T160256 -- Wikidata items render badly in Wikilabels
35. https://phabricator.wikimedia.org/T162530 -- Implement "unwanted pages" filtering strategy for Wikidata
36. https://phabricator.wikimedia.org/T157493 -- Deploy Wikidata item quality campaign

*Bug fixing:*

As usual, we had a few weird bugs that got in our way. We needed to move to a bigger virtual machine in "Beta Labs" because our models take up a bunch of hard drive space[37]. We found that Wikilabels wasn't removing expired tasks correctly and that this was making it difficult to finish labeling campaigns[38]. We also had a lot of right-to-left issues when we did an upgrade of OOjs UI[39]. There was an old bug we had with https://translatewiki.net in one of our message keys[40].

37. https://phabricator.wikimedia.org/T160762 -- deployment-ores-redis /srv/ redis is too small (500MBytes)
38. https://phabricator.wikimedia.org/T161521 -- Wikilabels is not cleaning up expired tasks for Wikidata item quality campaign
39. https://phabricator.wikimedia.org/T161533 -- Fix RTL issues in Wikilabels after OOjs UI upgrade
40. https://phabricator.wikimedia.org/T132197 -- qqq for a wiki-ai message cannot be loaded

-Aaron
Principal Research Scientist
Head of the Scoring Platform Team
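For readers wondering what "threshold test statistics" look like in practice, here is a minimal sketch in plain Python. It is not the editquality or ORES implementation; the function names and the choice of statistics (precision, recall, filter rate) are assumptions made for illustration. Given model probabilities paired with human labels, it reports what happens when every edit scoring at or above a threshold gets flagged.

```python
from typing import Dict, List, Optional, Tuple

# Each item is (model probability that the edit is damaging, human label).
Scored = List[Tuple[float, bool]]


def threshold_stats(scored: Scored, threshold: float) -> Dict[str, float]:
    """Precision, recall and filter rate when flagging scores >= threshold."""
    if not scored:
        raise ValueError("need at least one scored observation")
    tp = sum(1 for p, y in scored if p >= threshold and y)
    fp = sum(1 for p, y in scored if p >= threshold and not y)
    fn = sum(1 for p, y in scored if p < threshold and y)
    flagged = tp + fp
    return {
        # With nothing flagged there are no false positives, so report 1.0.
        "precision": tp / flagged if flagged else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of edits a reviewer would NOT have to look at.
        "filter_rate": 1 - flagged / len(scored),
    }


def lowest_threshold_with_precision(
    scored: Scored, min_precision: float
) -> Optional[float]:
    """Smallest observed score that still meets the precision target."""
    for candidate in sorted({p for p, _ in scored}):
        if threshold_stats(scored, candidate)["precision"] >= min_precision:
            return candidate
    return None
```

A tool developer tuning an interface would typically pick the lowest threshold that still meets a precision target, since that keeps recall as high as possible for that target.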

Re: [AI] [discovery] Another round of name that thing
by Erik Bernhardson 10 Apr '17

Something about PLURAL just doesn't strike me. MjoLniR on the other hand doesn't seem too bad, if a little esoteric. And sorry but I think using non-ascii in the name of a git repository is just asking for trouble somewhere :P. I'm also not opposed to being very boring and calling it cirrusearch-mlr or cirrussearch-ltrank

On Thu, Apr 6, 2017 at 9:46 AM, Mikhail Popov <mpopov(a)wikimedia.org> wrote:

> OH I JUST GOT WHY WE CAN CAPITALIZE THE FINAL R.
>
> Okay, so MjöLniR => the hammer used for _M_achine _L_earning & _R_anking,
> with the added benefit of the pronunciation being "myol-near" => ML-near =>
> learning to rank articles _near_ the query.
>
> BOOM! *mic drop*
>
> On Thu, Apr 6, 2017 at 9:40 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
>
>> Got to capitalize the final R or don't capitalize the L!
>>
>> Plus, whatever are the two main components that go into building MjöLniR
>> would be, somewhat opaquely, Sindri and Brokkr
>> <https://en.wikipedia.org/wiki/Mj%C3%B6lnir>.
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>> On Thu, Apr 6, 2017 at 12:32 PM, Mikhail Popov <mpopov(a)wikimedia.org>
>> wrote:
>>
>>> MjöLnir?
>>>
>>> P.S. I like PLURAL.
>>>
>>> On Thu, Apr 6, 2017 at 7:30 AM, David Causse <dcausse(a)wikimedia.org>
>>> wrote:
>>>
>>>> I don't have good suggestions, I like PLURAL.
>>>>
>>>> On Thu, Apr 6, 2017 at 6:01 AM, Pine W <wiki.pine(a)gmail.com> wrote:
>>>>
>>>>> +1 for PLURAL.
>>>>>
>>>>> Pine
>>>>>
>>>>> _______________________________________________
>>>>> discovery mailing list
>>>>> discovery(a)lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/discovery

Wikimania Submissions (Deadline today!)
by Aaron Halfaker 10 Apr '17

Hey folks,

I've been working on a couple of submissions for Wikimania. I wanted to take the opportunity to call your attention to them and possibly to entice anyone out there to file their own.

https://wikimania2017.wikimedia.org/wiki/Submissions/The_story_of_building_…
https://wikimania2017.wikimedia.org/wiki/Submissions/The_Keilana_Effect:_Vi…

-Aaron

WikiProject assessments vs. external reviewers vs. ORES
by Aaron Halfaker 06 Apr '17

Cross-posting from: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(miscellaneous)#WikiPr…

Hey folks,

I have been collaborating with some researchers who are publishing a dataset of externally reviewed Wikipedia articles (the sample was taken back in 2006). I'd like to take the opportunity to compare the prediction quality of ORES' article quality model with these external reviewers, but in order to get a good picture of the situation, it would also be very helpful to get a set of Wikipedian assessments[1] for the same dataset.

So, I have gathered all of the versions of externally reviewed articles in User:EpochFail/ORES_audit[2] and I'm asking for your help to gather assessments. There's 90 old revisions of articles that I need your help assessing. I don't think this will take long, but I need to borrow your judgement here to make sure I'm not biasing things.

To help out, see User:EpochFail/ORES_audit[2].

The more we know about how ORES performs against important baselines, the better use we can make of it to measure Wikipedia and direct wiki work.

1. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment
2. https://en.wikipedia.org/wiki/User:EpochFail/ORES_audit

Thanks!
-Aaron
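One simple way such a comparison could be summarized is sketched below. This is not the analysis planned for the audit; it assumes both sets of ratings use the ordered English Wikipedia scale Stub, Start, C, B, GA, FA, and the function name and the three summary numbers are choices made for illustration only.

```python
from typing import Dict, List

# Assumption: the ordered quality scale, lowest to highest.
SCALE = ["Stub", "Start", "C", "B", "GA", "FA"]
RANK = {label: position for position, label in enumerate(SCALE)}


def agreement(ratings_a: List[str], ratings_b: List[str]) -> Dict[str, float]:
    """Compare two lists of quality classes for the same revisions, in order."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty lists of equal length")
    # Distance in classes between the two ratings of each revision.
    diffs = [abs(RANK[a] - RANK[b]) for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mean_abs_diff": sum(diffs) / n,
    }
```

Running the same function on ORES vs. Wikipedians, ORES vs. external reviewers, and Wikipedians vs. external reviewers would give three comparable baselines.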

Another round of name that thing
by Erik Bernhardson 05 Apr '17

We seem to have some consensus that for the upcoming learning to rank work we will build out a python library to handle the bulk of the backend data plumbing work. The library will primarily be code integrating with pyspark to do various pieces such as:

# Sampling from the click logs to generate the set of queries + pages that will be labeled with click models
# Distributing the work of running click models against those sampled data sets
# Pushing queries we use for feature generation into kafka, and reading back the resulting feature vectors (the other end of this will run those generated queries against either the hot-spare elasticsearch cluster or the relforge cluster to get feature scores)
# Merging feature vectors with labeled data, splitting into test/train/validate sets, and writing out files formatted for whichever training library we decide on (xgboost, lightgbm and ranklib are in the running currently)
# Whatever plumbing is necessary to run the actual model training and do hyper parameter optimization
# Converting the resulting models into a format suitable for use with the elasticsearch learn to rank plugin
# Reporting on the quality of models vs some baseline

The high level goal is that we would have relatively simple python scripts in our analytics repository that are called from oozie, those scripts would know the appropriate locations to load/store data and pass into this library for the bulk of the processing. There will also be some script, probably within the library, that combines many of these steps for feature engineering purposes to take some set of features and run the whole thing.

So, what do we call this thing? Horrible first attempts:

* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???
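To make the "merge, split, write out" step above concrete, here is a rough sketch in plain Python rather than pyspark. Nothing in it comes from the planned library: the function names, the (query, page id) key, the 80/10/10 split, and the choice of the RankLib-style text format (`label qid:N 1:value 2:value ... # comment`) are all assumptions for illustration. Splitting is done per query, so every result for a query lands in the same set.

```python
import hashlib
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (query text, page id); hypothetical join key
Row = Tuple[str, int, int, List[float]]  # query, page id, label, features


def assign_split(query: str, test_pct: int = 10, validate_pct: int = 10) -> str:
    """Deterministically place a whole query in train, test or validate."""
    bucket = int(hashlib.md5(query.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + validate_pct:
        return "validate"
    return "train"


def merge(labels: Dict[Key, int], features: Dict[Key, List[float]]) -> List[Row]:
    """Inner join of labels and feature vectors, grouped by query."""
    return [
        (query, page_id, labels[(query, page_id)], features[(query, page_id)])
        for (query, page_id) in sorted(labels)
        if (query, page_id) in features
    ]


def to_ranking_lines(rows: List[Row]) -> List[str]:
    """Format rows as 'label qid:N 1:v1 2:v2 ... # page_id' lines."""
    query_ids: Dict[str, int] = {}
    lines = []
    for query, page_id, label, vector in rows:
        # Number queries 1, 2, 3, ... in order of first appearance.
        qid = query_ids.setdefault(query, len(query_ids) + 1)
        feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1))
        lines.append(f"{label} qid:{qid} {feats} # {page_id}")
    return lines
```

In a real pipeline the rows would first be partitioned with something like `assign_split` and each partition written to its own file; hashing the query keeps the assignment stable across reruns without storing it anywhere.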

Fwd: [Commons-l] Programmatically categorizing media in the Commons with Machine Learning
by Pine W 03 Apr '17

Forwarding.

Pine

---------- Forwarded message ----------
From: "Jordan Adler" <jmadler(a)google.com>
Date: Aug 11, 2016 13:06
Subject: [Commons-l] Programmatically categorizing media in the Commons with Machine Learning
To: "commons-l(a)wikimedia.org" <commons-l(a)lists.wikimedia.org>
Cc: "Ray Sakai" <rsakai(a)reactive.co.jp>, "Ram Ramanathan" <ramramanathan(a)google.com>, "Kazunori Sato" <kazsato(a)google.com>

Hey folks!

A few months back a colleague of mine was looking for some unstructured images to analyze as part of a demo for the Google Cloud Vision API <https://cloud.google.com/blog/big-data/2016/05/explore-the-galaxy-of-images…>. Luckily, I knew just the place <https://commons.wikimedia.org/wiki/Category:Media_needing_categories>, and the resulting demo <http://vision-explorer.reactive.ai/>, built by Reactive Inc., is pretty awesome. It was shared on-stage by Jeff Dean during the keynote <https://www.youtube.com/watch?v=HgWHeT_OwHc&feature=youtu.be&t=2h1m19s> at GCP NEXT 2016.

I wanted to quickly share the data from the programmatically identified images so it could be used to help categorize the media in the Commons. There's about 80,000 images worth of data:

- map.txt <https://storage.googleapis.com/gcs-samples2-explorer/reprocess/map.txt> (5.9MB): A single text file mapping id to filename in a "id : filename" format, one per line
- results.tar.gz <https://storage.googleapis.com/gcs-samples2-explorer/reprocess/results.tar.…> (29.6MB): a tgz'd directory of json files representing the output of the API <https://cloud.google.com/vision/reference/rest/v1/images/annotate#response-…>, in the format "${id}.jpg.json"

We're making this data available under the CC0 license, and these links will likely be live for at least a few weeks.

If you're interested in working with the Cloud Vision API to tag other images in the Commons, talk to the WMF Community Tech team.

Thanks for your help!
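For anyone wanting to work with the two files described in the forwarded message, a small Python sketch follows. It only relies on what the message states (the "id : filename" line format and one JSON file per image) plus one assumption: that each JSON file holds a Cloud Vision annotate response whose `labelAnnotations` entries carry `description` and `score` fields, possibly wrapped in a `responses` list. The function names and the 0.8 score cutoff are made up for illustration.

```python
import json
from typing import Dict, List, Tuple


def parse_map(text: str) -> Dict[str, str]:
    """Parse map.txt content: one 'id : filename' pair per line."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first ' : ' only, since filenames may contain colons.
        image_id, separator, filename = line.partition(" : ")
        if not separator:
            continue  # blank or malformed line
        mapping[image_id.strip()] = filename.strip()
    return mapping


def top_labels(result_json: str, min_score: float = 0.8) -> List[Tuple[str, float]]:
    """Return (label, score) pairs at or above min_score from one result file."""
    doc = json.loads(result_json)
    # Assumption: the file may wrap the per-image response in a list.
    if "responses" in doc:
        doc = doc["responses"][0]
    return [
        (annotation["description"], annotation["score"])
        for annotation in doc.get("labelAnnotations", [])
        if annotation.get("score", 0.0) >= min_score
    ]
```

Joining the two (id to filename, id to labels) would give a list of candidate categories per Commons file for a human to review, rather than something to apply automatically.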
