GCP has a number of models-as-a-service offerings
<https://cloud.google.com/products/machine-learning/> that might be useful.
On Mon, Apr 3, 2017 at 6:46 PM Daniel Mietchen <
daniel.mietchen(a)googlemail.com> wrote:
> Hi Jordan,
> can your pipeline help with video or perhaps even audio as well?
> There are lots of such files as well that need categorization.
> Thanks,
> Daniel
>
> On Tue, Apr 4, 2017 at 12:05 AM, Jordan Adler <jmadler(a)google.com> wrote:
> > Looks like some of these images still need categorization. I think
> there's
> > still an unrealized opportunity here to use the results I shared to work
> the
> > backlog of the category on the Commons.
> >
> > On Thu, Aug 11, 2016 at 1:47 PM Pine W <wiki.pine(a)gmail.com> wrote:
> >>
> >> Forwarding.
> >>
> >> Pine
> >>
> >> ---------- Forwarded message ----------
> >> From: "Jordan Adler" <jmadler(a)google.com>
> >> Date: Aug 11, 2016 13:06
> >> Subject: [Commons-l] Programmatically categorizing media in the Commons
> >> with Machine Learning
> >> To: "commons-l(a)wikimedia.org" <commons-l(a)lists.wikimedia.org>
> >> Cc: "Ray Sakai" <rsakai(a)reactive.co.jp>, "Ram Ramanathan"
> >> <ramramanathan(a)google.com>, "Kazunori Sato" <kazsato(a)google.com>
> >>
> >> Hey folks!
> >>
> >>
> >> A few months back a colleague of mine was looking for some unstructured
> >> images to analyze as part of a demo for the Google Cloud Vision API.
> >> Luckily, I knew just the place, and the resulting demo, built by
> Reactive
> >> Inc., is pretty awesome. It was shared on-stage by Jeff Dean during the
> >> keynote at GCP NEXT 2016.
> >>
> >>
> >> I wanted to quickly share the data from the programmatically identified
> >> images so it could be used to help categorize the media in the Commons.
> >> There's about 80,000 images' worth of data:
> >>
> >>
> >> map.txt (5.9MB): A single text file mapping id to filename in a "id :
> >> filename" format, one per line
> >>
> >> results.tar.gz (29.6MB): a tgz'd directory of json files representing
> the
> >> output of the API, in the format "${id}.jpg.json"
> >>
> >>
> >> We're making this data available under the CC0 license, and these links
> >> will likely be live for at least a few weeks.
> >>
> >>
> >> If you're interested in working with the Cloud Vision API to tag other
> >> images in the Commons, talk to the WMF Community Tech team.
> >>
> >>
> >> Thanks for your help!
> >>
> >>
Hey folks,
In this update, I'm going to change some things up to try to make this
update easier for you to consume. The biggest change you'll notice is that
I've broken the [#] references up by section. I hope that saves you
some scrolling and confusion. You'll also notice that I have changed the
subject line from "Revision scoring" to "Scoring Platform" because it's now
clear that, come July, I'll be leading a new team with that name at the
Wikimedia Foundation. There'll be an announcement about that coming once
our budget is finalized. I'll try to keep this subject consistent for the
foreseeable future so that your email clients will continue to group the
updates into one big thread.
*Deployments & maintenance:*
In this cycle, we've gotten better at tracking our deployments and noting
what changes go out with each deployment. You can click on the Phabricator
task for a deployment and review its sub-tasks to find out what was deployed.
We've had three deployments of ORES since mid-March[1,2,3] and two
deployments of Wikilabels[4,5], and we've added a maintenance notice for a
short period of downtime that's coming up on April 21st[6,7].
1. https://phabricator.wikimedia.org/T160279 -- Deploy ores in prod
(Mid-March)
2. https://phabricator.wikimedia.org/T160638 -- Deploy ORES late march
3. https://phabricator.wikimedia.org/T161748 -- Deploy ORES early April
4. https://phabricator.wikimedia.org/T161002 -- Late march wikilabels
deployment
5. https://phabricator.wikimedia.org/T163016 -- Deploy Wikilabels mid-April
6. https://phabricator.wikimedia.org/T162888 -- Add header to Wikilabels
that warns of upcoming maintenance.
7. https://phabricator.wikimedia.org/T162265 -- Manage wikilabels for
labsdb1004 maintenance
*Making ORES better:*
We've been working to make ORES easier to extend and more useful. ORES now
reports its relevant versions at https://ores.wikimedia.org/versions[8].
We've also reduced the complexity of our "precaching" system that scores
edits before you ask for them[9,10]. We're taking advantage of logstash to
store and query our logs[11]. We've also implemented some nice
abstractions for requests and responses in ORES[12] that allowed us to
improve our metrics tracking substantially[13].
8. https://phabricator.wikimedia.org/T155814 -- Expose version of the
service and its dependencies
9. https://phabricator.wikimedia.org/T148714 -- Create generalized
"precache" endpoint for ORES
10. https://phabricator.wikimedia.org/T162627 -- Switch `/precache` to be a
POST end point
11. https://phabricator.wikimedia.org/T149010 -- Send ORES logs to logstash
12. https://phabricator.wikimedia.org/T159502 -- Exclude precaching
requests from cache_miss/cache_hit metrics
13. https://phabricator.wikimedia.org/T161526 -- Implement
ScoreRequest/ScoreResponse pattern in ORES
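If you want to poke at these endpoints yourself, here's a rough sketch (not
official documentation; the exact response shapes may differ) of hitting the
versions endpoint and requesting a score through the v3 API:

    import requests

    # Ask ORES which versions of the service and its dependencies are running.
    # (Field names in the response are whatever the service reports.)
    print(requests.get("https://ores.wikimedia.org/versions").json())

    # Request a damaging-model score for a single revision via the v3 API.
    # The wiki ("enwiki"), model name, and revision id here are placeholders.
    response = requests.get(
        "https://ores.wikimedia.org/v3/scores/enwiki/",
        params={"models": "damaging", "revids": "123456789"},
    )
    print(response.json())

The POST-based `/precache` endpoint mentioned in [10] is intended for event
streams rather than hand-written requests, so it isn't shown here.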
*New functionality:*
In the last month and a half, we've added basic support for Korean
Wikipedia[14,15]. Props to Revi for helping us work through a bunch of
issues with our Korean language support[16,17,18].
We've also gotten the ORES Review tool deployed to Hebrew
Wikipedia[19,20,21,22] and Estonian Wikipedia[23,24,25]. We're also
working with the Collaboration team to implement the threshold test
statistics that they need to tune their new Edit Review interface[26] and
we're working towards making this kind of work self-serve so that the
product team and other tool developers won't have to wait on us to
implement these threshold stats in the future[27].
14. https://phabricator.wikimedia.org/T161617 -- Deploy reverted model for
kowiki
15. https://phabricator.wikimedia.org/T161616 -- Train/test reverted model
for kowiki
16. https://phabricator.wikimedia.org/T160752 -- Korean generated word
lists are in chinese
17. https://phabricator.wikimedia.org/T160757 -- Add language support for
Korean
18. https://phabricator.wikimedia.org/T160755 -- Fix tokenization for Korean
19. https://phabricator.wikimedia.org/T161621 -- Deploy ORES Review Tool
for hewiki
20. https://phabricator.wikimedia.org/T130284 -- Deploy edit quality models
for hewiki
21. https://phabricator.wikimedia.org/T160930 -- Train damaging and
goodfaith models for hewiki
22. https://phabricator.wikimedia.org/T130263 -- Complete hewiki edit
quality campaign
23. https://phabricator.wikimedia.org/T159609 -- Deploy ORES review tool to
etwiki
24. https://phabricator.wikimedia.org/T130280 -- Deploy edit quality models
for etwiki
25. https://phabricator.wikimedia.org/T129702 -- Complete etwiki edit
quality campaign
26. https://phabricator.wikimedia.org/T162377 -- Implement additional
test_stats in editquality
27. https://phabricator.wikimedia.org/T162217 -- Implement "thresholds",
deprecate "pile of tests_stats"
*ORES training / labeling campaigns:*
Thanks to a lot of networking at Wikimedia Conference and some help from
Ijon (Asaf Bartov), we've found a bunch of new collaborators to help us
deploy ORES to new wikis. A critical part of this process is deploying
labeling campaigns so that Wikipedians can help us train ORES.
We've got new editquality labeling campaigns deployed to Albanian[28],
Finnish[29], Latvian[30], Korean[31], and Turkish[32] Wikipedias.
We've also been working on a new type of model: "Item quality" in
Wikidata. We've deployed, labeled, and analyzed a pilot[33], fixed some
critical bugs that came up[34,35], and we've finally launched a 5k item
campaign which is already 17% done[36]! See
https://www.wikidata.org/wiki/Wikidata:Item_quality_campaign if you'd like
to help us out.
28. https://phabricator.wikimedia.org/T161981 -- Edit quality campaign for
Albanian Wikipedia
29. https://phabricator.wikimedia.org/T161905 -- Edit quality campaign for
Finnish Wikipedia
30. https://phabricator.wikimedia.org/T162032 -- Edit quality campaign for
Latvian Wikipedia
31. https://phabricator.wikimedia.org/T161622 -- Deploy editquality
campaign in Korean Wikipedia
32. https://phabricator.wikimedia.org/T161977 -- Start v2 editquality
campaign for trwiki
33. https://phabricator.wikimedia.org/T159570 -- Deploy the pilot of
Wikidata item quality campaign
34. https://phabricator.wikimedia.org/T160256 -- Wikidata items render
badly in Wikilabels
35. https://phabricator.wikimedia.org/T162530 -- Implement "unwanted pages"
filtering strategy for Wikidata
36. https://phabricator.wikimedia.org/T157493 -- Deploy Wikidata item
quality campaign
*Bug fixing:*
As usual, we had a few weird bugs that got in our way. We needed to move
to a bigger virtual machine in "Beta Labs" because our models take up a
bunch of hard drive space[37]. We found that Wikilabels wasn't removing
expired tasks correctly and that this was making it difficult to finish
labeling campaigns[38]. We also had a lot of right-to-left issues when we
did an upgrade of OOjs UI[39]. There was also an old bug we had with
https://translatewiki.net in one of our message keys[40].
37. https://phabricator.wikimedia.org/T160762 -- deployment-ores-redis
/srv/ redis is too small (500MBytes)
38. https://phabricator.wikimedia.org/T161521 -- Wikilabels is not cleaning
up expired tasks for Wikidata item quality campaign
39. https://phabricator.wikimedia.org/T161533 -- Fix RTL issues in
Wikilabels after OOjs UI upgrade
40. https://phabricator.wikimedia.org/T132197 -- qqq for a wiki-ai message
cannot be loaded
-Aaron
Principal Research Scientist
Head of the Scoring Platform Team
Something about PLURAL just doesn't strike me. MjoLniR on the other hand
doesn't seem too bad, if a little esoteric. And sorry, but I think using
non-ASCII in the name of a git repository is just asking for trouble
somewhere :P. I'm also not opposed to being very boring and calling it
cirrussearch-mlr or cirrussearch-ltrank.
On Thu, Apr 6, 2017 at 9:46 AM, Mikhail Popov <mpopov(a)wikimedia.org> wrote:
> OH I JUST GOT WHY WE CAN CAPITALIZE THE FINAL R.
>
> Okay, so MjöLniR => the hammer used for _M_achine _L_earning & _R_anking,
> with the added benefit of the pronunciation being "myol-near" => ML-near =>
> learning to rank articles _near_ the query.
>
> BOOM! *mic drop*
>
> On Thu, Apr 6, 2017 at 9:40 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
>
>> Got to capitalize the final R or don't capitalize the L!
>>
>> Plus, whatever the two main components that go into building MjöLniR
>> turn out to be, they could be named, somewhat opaquely, Sindri and Brokkr
>> <https://en.wikipedia.org/wiki/Mj%C3%B6lnir>.
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>> On Thu, Apr 6, 2017 at 12:32 PM, Mikhail Popov <mpopov(a)wikimedia.org>
>> wrote:
>>
>>> MjöLnir?
>>>
>>> P.S. I like PLURAL.
>>>
>>>
>>> On Thu, Apr 6, 2017 at 7:30 AM, David Causse <dcausse(a)wikimedia.org>
>>> wrote:
>>>
>>>> I don't have good suggestions, I like PLURAL.
>>>>
>>>> On Thu, Apr 6, 2017 at 6:01 AM, Pine W <wiki.pine(a)gmail.com> wrote:
>>>>
>>>>> +1 for PLURAL.
>>>>>
>>>>> Pine
>>>>>
Cross-posting from:
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(miscellaneous)#WikiPr…
Hey folks, I have been collaborating with some researchers who are
publishing a dataset of externally reviewed Wikipedia articles (the sample
was taken back in 2006). I'd like to take the opportunity to compare the
prediction quality of ORES' article quality model with these external
reviewers, but in order to get a good picture of the situation, it would
also be very helpful to get a set of Wikipedian assessments[1] for the same
dataset. So, I have gathered all of the versions of externally reviewed
articles in User:EpochFail/ORES_audit[2], and I'm asking for your help to
gather assessments. There are 90 old revisions of articles that I need your
help assessing. I don't think this will take long, but I need to borrow
your judgement here to make sure I'm not biasing things.
To help out, see User:EpochFail/ORES_audit[2].
The more we know about how ORES performs against important baselines, the
better use we can make of it to measure Wikipedia and direct wiki work.
1. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment
2. https://en.wikipedia.org/wiki/User:EpochFail/ORES_audit
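If it helps to see what the comparison will look like, here's a rough sketch
of lining ORES article quality predictions up against human assessments. The
revision ids and assessments below are placeholders (not the real sample),
and the exact v3 response shape may differ from what I assume here:

    import requests

    # Placeholder inputs: old revision ids and the corresponding Wikipedian
    # assessments, keyed by revision id. These are examples, not the sample.
    revids = [979423, 1105432]
    human = {979423: "B", 1105432: "Start"}

    # Ask ORES for article quality (wp10) predictions for those revisions.
    response = requests.get(
        "https://ores.wikimedia.org/v3/scores/enwiki/",
        params={"models": "wp10", "revids": "|".join(str(r) for r in revids)},
    )
    scores = response.json()["enwiki"]["scores"]

    matches = 0
    for revid in revids:
        prediction = scores[str(revid)]["wp10"]["score"]["prediction"]
        matches += prediction == human[revid]
        print(revid, prediction, human[revid])

    print("agreement:", matches / len(revids))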
Thanks!
-Aaron
We seem to have some consensus that for the upcoming learning to rank work
we will build out a Python library to handle the bulk of the backend
data-plumbing work. The library will primarily be code integrating with pyspark
to do various pieces such as:
# Sampling from the click logs to generate the set of queries + pages that
will be labeled with click models
# Distributing the work of running click models against those sampled data
sets
# Pushing queries we use for feature generation into kafka, and reading
back the resulting feature vectors (the other end of this will run those
generated queries against either the hot-spare elasticsearch cluster or the
relforge cluster to get feature scores)
# Merging feature vectors with labeled data, splitting into
test/train/validate sets, and writing out files formatted for whichever
training library we decide on (xgboost, lightgbm and ranklib are in the
running currently)
# Whatever plumbing is necessary to run the actual model training and do
hyperparameter optimization
# Converting the resulting models into a format suitable for use with the
elasticsearch learning-to-rank plugin
# Reporting on the quality of models vs some baseline
The high-level goal is that we would have relatively simple Python scripts
in our analytics repository that are called from oozie; those scripts would
know the appropriate locations to load/store data and would hand off to this
library for the bulk of the processing. There will also be some script,
probably within the library, that combines many of these steps for
feature-engineering purposes: take some set of features and run the whole thing.
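To make the data-plumbing role concrete, here's a rough sketch of what the
merge/split/write step might look like in pyspark. Column names and paths
here are illustrative only, not a design commitment:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ltr-split-sketch").getOrCreate()

    # Hypothetical inputs: click-model labels and feature vectors keyed by
    # (query, page_id). The schemas and locations are placeholders.
    labels = spark.read.parquet("hdfs:///tmp/ltr/labels")
    features = spark.read.parquet("hdfs:///tmp/ltr/features")

    # Merge labels with feature vectors, then split into train/validate/test.
    merged = labels.join(features, on=["query", "page_id"])
    train, validate, test = merged.randomSplit([0.8, 0.1, 0.1], seed=42)

    # Write each split somewhere a later step can format for whichever
    # trainer we settle on (xgboost, lightgbm or ranklib).
    for name, df in [("train", train), ("validate", validate), ("test", test)]:
        df.write.mode("overwrite").parquet("hdfs:///tmp/ltr/%s" % name)

In practice the split would probably need to happen per query so that all
results for a given query land in the same set, but that detail belongs to
the library.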
So, what do we call this thing? Horrible first attempts:
* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???