I've been working for a while now on splitting the code that does
searching - and more specifically, searching using
ElasticSearch/CirrusSearch - out from Wikibase extension code and into a
separate extension (see https://phabricator.wikimedia.org/T190022). If
you don't know what I'm talking about here (or are not interested in
this topic), you can safely skip the rest of this message.
The WikibaseCirrusSearch extension is meant to hold all the code that
integrates Wikibase with ElasticSearch and the CirrusSearch extension, so
that the main Wikibase repo does not have any Elastic-specific code. This
means that if you have your own Wikibase install, you'll need (after the
migration is done) to install WikibaseCirrusSearch to get search
functionality like we have on Wikidata now. There will also be changes in
configuration - I'll write a migration document and announce it
separately. We're now working on deploying and testing it on
Beta/testwiki. After that we'll start migrating production to run the
search code from this extension, and then the search code in the
Wikibase repo itself will be removed. You can track the progress in
the Phabricator task mentioned above.
Since the code migration is at a pretty advanced stage now, I'd like to
ask: if you make any changes to any code under repo/includes/Search or
repo/config in the Wikibase repo, or to any tests or configs related to
those, please inform me (by adding me to the patch reviewers/CC, by email,
or by any other reasonable means) so that these changes won't be lost in
the migration. I'll periodically look through the latest patches for
anything related, but I might miss things.
WikibaseLexeme code that relates to search will also be migrated to a
separate extension (WikibaseLexemeCirrusSearch); that work will be
starting soon. So the request above also applies to the search parts of
the WikibaseLexeme code.
If you have any questions/comments, please feel free to ask me, on the
lists or on IRC.
Since everyone is here, we will be working on a machine learning
infrastructure program this year. I will set up meetings with everyone on
this thread and some others in SRE and Audiences to get a "bag of requests"
of things that are missing. The first round of talks, which I hope to
finish next week, is to hear what everyone's requests/ideas are. I will be
sending meeting invites today and tomorrow. I think some themes will
emerge from those.
Thus far, it is pretty clear that we need a better way to deploy models to
production (right now we deploy those to ElasticSearch in very crafty
ways, for example), we need an answer to the GPU issues around training
models, we need a "recommended way" to train and compute, and we need
some unified system for tracking models+data+tests. Finally, there are
probably many lessons to take from the work done in ORES thus far.
On Thu, Feb 7, 2019 at 8:40 AM Miriam Redi <mredi(a)wikimedia.org> wrote:
> Hey Andrew!
> Thank you so much for sharing this and starting this conversation. We had a
> meeting at All Hands with all people interested in "Image Classification"
> https://phabricator.wikimedia.org/T215413 , and one of the open questions
> was exactly how to find a "common repository" for ML models that different
> groups and products within the organization can use. So, please, count me
> On Thu, Feb 7, 2019 at 4:38 PM Aaron Halfaker <ahalfaker(a)wikimedia.org>
>> Just gave the article a quick read. I think this article pushes on some
>> key issues for sure. I definitely agree with the focus on python/jupyter
>> as essential for a productive workflow that leverages the best from
>> research scientists. We've been thinking about what ORES 2.0 would look
>> like and event streams are the dominant proposal for improving on the
>> limitations of our queue-based worker pool.
>> One of the nice things about ORES/revscoring is that it provides a nice
>> framework for operating using the *exact same code* no matter the
>> environment. E.g. it doesn't matter if we're calling out to an API to get
>> data for feature extraction or providing it via a stream. By investing in
>> a dependency injection strategy, we get that flexibility. So to me, the
>> hardest problem -- the one I don't quite know how to solve -- is how we'll
>> mix and merge streams to get all of the data we want available for feature
>> extraction. If I understand correctly, that's where Kafka shines. :)
>> I'm definitely interested in fleshing out this proposal. We should
>> probably be exploring the processes for training new types of models (e.g.
>> image processing) using different strategies than ORES. In ORES, we're
>> almost entirely focused on using sklearn but we have some basic
>> abstractions for other estimator libraries. We also make some strong
>> assumptions about running on a single CPU that could probably be broken for
>> some performance gains using real concurrency.
>> On Thu, Feb 7, 2019 at 10:05 AM Goran Milovanovic <
>> goran.milovanovic_ext(a)wikimedia.de> wrote:
>>> Hi Andrew,
>>> I have recently started a six month AI/Machine Learning Engineering
>>> course which focuses exactly on the topics that you've shown interest in.
>>>> I'd love it if we had a working group (or whatever) that focused
>>>> on how to standardize how we train and deploy ML for production use.
>>> Count me in.
>>> Goran S. Milovanović, PhD
>>> Data Scientist, Software Department
>>> Wikimedia Deutschland
>>> "It's not the size of the dog in the fight,
>>> it's the size of the fight in the dog."
>>> - Mark Twain
>>> On Thu, Feb 7, 2019 at 4:16 PM Andrew Otto <otto(a)wikimedia.org> wrote:
>>>> Just came across
>>>> In it, the author discusses some of what he calls the 'impedance
>>>> mismatch' between data engineers and production engineers. The links to
>>>> Uber's Michelangelo <https://eng.uber.com/michelangelo/> (which as far
>>>> as I can tell has not been open sourced) and the Hidden Technical Debt
>>>> in Machine Learning Systems paper
>>>> <https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning…> are
>>>> also very interesting!
>>>> At All Hands I've been hearing more and more about using ML in
>>>> production, so these things seem very relevant to us. I'd love it if we
>>>> had a working group (or whatever) that focused on how to standardize how we
>>>> train and deploy ML for production use.
>>>> Analytics mailing list
>> Aaron Halfaker
>> Principal Research Scientist
>> Head of the Scoring Platform team
>> Wikimedia Foundation
>> Research-Internal mailing list