Just came across https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-tens...
In it, the author discusses some of what he calls the 'impedance mismatch' between data engineers and production engineers. The links to Uber's Michelangelo https://eng.uber.com/michelangelo/ (which, as far as I can tell, has not been open-sourced) and the Hidden Technical Debt in Machine Learning Systems paper https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf are also very interesting!
At All Hands I've been hearing more and more about using ML in production, so these things seem very relevant to us. I'd love it if we had a working group (or whatever) that focused on how to standardize how we train and deploy ML for production use.
:)
Hi Andrew,
I have recently started a six-month AI/Machine Learning Engineering course that focuses on exactly the topics you've shown interest in.
So,
> I'd love it if we had a working group (or whatever) that focused on
> how to standardize how we train and deploy ML for production use.
Count me in.
Regards, Goran
Goran S. Milovanović, PhD
Data Scientist, Software Department
Wikimedia Deutschland

------------------------------------------------
"It's not the size of the dog in the fight, it's the size of the fight in the dog."
- Mark Twain
------------------------------------------------
Just gave the article a quick read; it pushes on some key issues for sure. I definitely agree with the focus on Python/Jupyter as essential for a productive workflow that leverages the best from research scientists. We've been thinking about what ORES 2.0 would look like, and event streams are the dominant proposal for improving on the limitations of our queue-based worker pool.
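(To make the event-stream idea concrete, here is a rough sketch of a scorer driven by a Kafka topic instead of a queue-based worker pool. The topic name and event fields are illustrative, not our actual schemas, and it assumes the kafka-python client:)

    from kafka import KafkaConsumer
    import json

    # Subscribe to a (hypothetical) revision-create topic and score each event
    # as it arrives, instead of having workers pull jobs off a queue.
    consumer = KafkaConsumer(
        "mediawiki.revision-create",  # illustrative topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        event = message.value
        rev_id = event.get("rev_id")
        # ... run feature extraction and scoring for rev_id here ...
        print("scored revision", rev_id)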
One of the nice things about ORES/revscoring is that it provides a framework for operating with the *exact same code* no matter the environment. E.g., it doesn't matter if we're calling out to an API to get data for feature extraction or providing it via a stream. By investing in a dependency injection strategy, we get that flexibility. So to me, the hardest problem -- the one I don't quite know how to solve -- is how we'll mix and merge streams to get all of the data we want available for feature extraction. If I understand correctly, that's where Kafka shines. :)
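(For the sake of illustration, here is a rough sketch of that dependency injection idea in Python. The names are made up for the example, not the actual revscoring API; the point is that the feature code never knows whether its data came from an API call or from a stream:)

    from abc import ABC, abstractmethod

    class Datasource(ABC):
        """Anything that can hand us the raw data a feature needs."""
        @abstractmethod
        def get_revision_text(self, rev_id: int) -> str:
            ...

    class ApiDatasource(Datasource):
        """Fetches revision text on demand, e.g. via an API client."""
        def __init__(self, fetch):
            self.fetch = fetch  # a callable rev_id -> text (an API client in the real system)

        def get_revision_text(self, rev_id: int) -> str:
            return self.fetch(rev_id)

    class StreamDatasource(Datasource):
        """Serves revision text that a stream has already delivered."""
        def __init__(self, cache: dict):
            self.cache = cache

        def get_revision_text(self, rev_id: int) -> str:
            return self.cache[rev_id]

    def extract_features(rev_id: int, datasource: Datasource) -> list:
        # The same feature code runs no matter which datasource was injected.
        text = datasource.get_revision_text(rev_id)
        return [len(text), text.count("[[")]  # toy features: length, wikilink count

    # Same call, two environments:
    features = extract_features(123, StreamDatasource({123: "Some [[wikitext]] here."}))
    print(features)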
I'm definitely interested in fleshing out this proposal. We should probably be exploring processes for training new types of models (e.g. image processing) using different strategies than we use in ORES. In ORES, we're almost entirely focused on sklearn, but we have some basic abstractions for other estimator libraries. We also make some strong assumptions about running on a single CPU that could probably be relaxed for some performance gains using real concurrency.
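(A toy example, not ORES code, of what "real concurrency" can look like without leaving sklearn: many estimators and the cross-validation helpers accept n_jobs, so training can fan out across cores with no change to the surrounding code:)

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    # n_jobs=-1 uses all available cores for building the trees.
    model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

    # Cross-validation folds can also run in parallel, independently of the model.
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    print(scores.mean())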
-Aaron
Hey Andrew!
Thank you so much for sharing this and for starting this conversation. We had a meeting at All Hands with all the people interested in "Image Classification" https://phabricator.wikimedia.org/T215413 , and one of the open questions was exactly how to find a "common repository" for ML models that different groups and products within the organization can use. So, please, count me in!
Thanks,
M
Team,
Since everyone is here: we will be working on a machine learning infrastructure program this year. I will set up meetings with everyone on this thread, and with some others in SRE and Audiences, to gather a "bag of requests" of things that are missing; the first round of talks, which I hope to finish next week, is to hear what everyone's requests and ideas are. I will be sending meeting invites today and tomorrow. I think some themes will emerge from those. Thus far, it is pretty clear that we need a better way to deploy models to production (right now we deploy them to Elasticsearch in very crafty ways, for example), we need an answer to the GPU question for training models, we need a "recommended way" in which we train and compute, we need some unified system for tracking models + data + tests, and finally, there are probably many learnings from the work done on ORES thus far.
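(As one illustration of the "tracking models + data + tests" idea -- the field names and values below are invented for the example, not a proposal for the actual schema -- a single record could tie a deployed model to the exact dataset, code revision, and test metrics it came from:)

    from dataclasses import dataclass, field, asdict
    import json

    @dataclass
    class ModelRecord:
        name: str                 # e.g. "enwiki-damaging" (illustrative)
        version: str              # version of the model being deployed
        training_data: str        # pointer to the exact dataset snapshot used
        code_revision: str        # git commit of the training code
        test_metrics: dict = field(default_factory=dict)

    record = ModelRecord(
        name="enwiki-damaging",
        version="0.5.0",
        training_data="hdfs:///path/to/dataset-snapshot",  # hypothetical path
        code_revision="abc1234",
        test_metrics={"roc_auc": 0.92},
    )

    # Serialized, this could live next to the model binary that gets deployed.
    print(json.dumps(asdict(record), indent=2))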
Thanks,
Nuria
FYI, folks might be interested in what we've been doing with The Met Museum in NYC and machine learning. There's a write-up in the latest GLAM newsletter:
https://outreach.wikimedia.org/wiki/GLAM/Newsletter/January_2019/Contents/US...
TL;DR - Andrew worked with Jennie Choi, The Met's General Manager of Collection Information, and Nina Diamond, Managing Editor and Producer, along with Microsoft researchers Patrick Buehler, J.S. Tan, and Sam Kazemi Nafchi to train a machine learning model on Microsoft Azure that could predict labels for artworks. Using the Met's roughly 1,000-word art vocabulary and representative images to train the model, a proof-of-concept app was developed at the hackathon. The results were impressive enough that Andrew went on to create a Wikidata Distributed Game - Depicts - to connect the subject keyword recommendations to Wikidata.
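(For anyone curious about that last step, here is a rough sketch -- not the code from the hackathon -- of taking a predicted subject keyword and searching Wikidata for candidate items to suggest as "depicts" (P180) statements, using the public wbsearchentities API:)

    import requests

    def wikidata_candidates(label, limit=5):
        """Search Wikidata for items matching a predicted label."""
        response = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbsearchentities",
                "search": label,
                "language": "en",
                "type": "item",
                "limit": limit,
                "format": "json",
            },
        )
        response.raise_for_status()
        return [(hit["id"], hit.get("description", "")) for hit in response.json()["search"]]

    # e.g. a label the classifier predicted for a painting:
    print(wikidata_candidates("portrait"))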
-Andrew
FYI, since it seems most of the movement in image classification has been in GLAM/arts, I thought folks might find this interesting: in the GLAM world, the Barnes Foundation has been a leader, as their collections management has been unconventional and based more on image similarity than on traditional taxonomic classification.
You may find their observations on experimenting with machine learning interesting:
https://medium.com/barnes-foundation/using-computer-vision-to-tag-the-collec... https://github.com/BarnesFoundation/barnes-tms-extract/blob/master/DATASCIEN...
https://collection.barnesfoundation.org/ http://www.attractionsmanagement.com/index.cfm?pagetype=news&codeID=3383...
--
-Andrew Lih
Author of The Wikipedia Revolution
US National Archives Citizen Archivist of the Year (2016)
Knight Foundation grant recipient - Wikipedia Space (2015)
Wikimedia DC - Outreach and GLAM
Previously: professor of journalism and communications, American University, Columbia University, USC
Email: andrew@andrewlih.com
WEB: https://muckrack.com/fuzheado
PROJECT: Wikipedia Space: http://en.wikipedia.org/wiki/WP:WPSPACE