Discovery February 2019

discovery@lists.wikimedia.org

4 participants
8 discussions

Upcoming Search Platform Office Hours—March 6th

by Trey Jones

The Search Platform Team <https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds office hours the first Wednesday of each month. Come ask us anything about Wikimedia search!

We’re particularly interested in:

* Opportunities for collaboration, internally or externally to the Wikimedia Foundation
* Challenges you have with on-wiki search, in any of the languages we support

But we're happy to talk about anything search-related. Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:

Date: Wednesday, March 6th, 2019
Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vyc-jvgq-dww

*N.B.:* Google Meet System Requirements <https://support.google.com/meet/answer/7317473>

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
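For anyone double-checking the meeting time in their own zone, here is a minimal sketch of the conversion. It assumes the meeting falls in 2019 (the year this digest covers, when March 6th is a Wednesday) and uses fixed standard-time offsets for the four abbreviations in the announcement; the `local_times` helper is a made-up name for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets for the abbreviations in the announcement.
# Assumption: no zone is on daylight saving time on the meeting date.
ZONES = {
    "GMT": timezone(timedelta(hours=0), "GMT"),
    "PST": timezone(timedelta(hours=-8), "PST"),
    "EST": timezone(timedelta(hours=-5), "EST"),
    "CET": timezone(timedelta(hours=1), "CET"),
}


def local_times(start_utc, duration=timedelta(hours=1)):
    """Return the meeting span as HH:MM-HH:MM for each listed zone."""
    spans = {}
    for name, tz in ZONES.items():
        start = start_utc.astimezone(tz)
        end = (start_utc + duration).astimezone(tz)
        spans[name] = f"{start:%H:%M}-{end:%H:%M}"
    return spans


# Assumption: year 2019, start at 16:00 GMT as stated in the announcement.
meeting = datetime(2019, 3, 6, 16, 0, tzinfo=timezone.utc)
spans = local_times(meeting)
for zone, span in spans.items():
    print(zone, span)
```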

5 years, 1 month

Discovery Weekly Update for the week starting 2019-02-18

by Chris Koerner

Hello,

This is the weekly update from the Search Platform team for the week starting 2019-02-18. As always, feedback and questions are welcome.

== Discussions ==

=== Search ===

* A new Korean language analyzer has been configured for Korean-language wikis,[0] however it won't be activated until after we finish the upgrade to Elasticsearch 6, which is ongoing.
* SDC [Structured Data on Commons] wanted to know if we could add an 'inlabel' search keyword, and after lots of discussion it was merged into the new WikibaseCirrusSearch extension, which has yet to be deployed to the beta cluster [1]
* Erik and the team worked on how to measure mutation latency across the newly split elasticsearch clusters and decided that the default timeout of 30 seconds was good [2]
* Mathew and Gehel worked on testing the spicerack elasticsearch module with quite a few patches that are linked in the ticket [3]
* Gehel worked on getting CI set up for search/glent (a Maven project) with the same options that we use for search/extra [4]
* A link-breaking typo was found in the automatic API documentation for action=query&prop=cirrusbuilddoc, and Erik fixed it by correcting the API docs for cirrusbuilddoc [5]
* As we now have different APT components to differentiate the elasticsearch versions, we needed a new component for the new version, and Gehel fixed it all up [6]
* David prepared a Debian package with search plugins compatible with elastic 5.6.14, which Gehel merged [7]
* David also did quite a bit of work to fix and add integration tests for several language analyzers [8]
* Erik worked on updating the ttmserver for elasticsearch 6 and removed elastic 2.x compatibility [9]

== Did you know? ==

Grammatical gender [10] often confuses speakers of English and other languages without a similar system. “Why is a bridge feminine in German (Brücke [11]) and masculine in Spanish and French (puente [12] & pont [13])?” they ask, though usually without links to Wiktionary.

Grammatical gender is really just a system of noun classes [14] where there are two or three classes, and most things classified as male or female end up in different classes. Other languages have noun classes based on whether or not the nouns are animate, whether they are human or animal, by shape, and sometimes just arbitrary groupings; languages can have nearly two dozen noun classes, like some of the Niger–Congo languages![15]

Now hold on while we veer off on a brief tangent: diminutives are words that convey a smaller, lesser, or more intimate sense of their root form.[16] They are common in American nicknames, often showing up as a -y or -ie ending (Billy vs. Bill, Peggy vs. Peg, Bobbie vs. Roberta). Sometimes diminutives, especially when applied to small cute things, can become the main or only form of a word. For example, English baby [17] from babe, or kitty from kit.

Diminutives and grammatical gender collide in German Mädchen [18] (“girl”), which is historically from Magd (cognate with English “maid”) plus the diminutive suffix -chen; all diminutives formed with -chen have neuter gender in German. Over time, Mädchen became the predominant term for a girl, despite the fact that the word is grammatically “neuter”.

[0] https://phabricator.wikimedia.org/T206874
[1] https://phabricator.wikimedia.org/T215967
[2] https://phabricator.wikimedia.org/T215969
[3] https://phabricator.wikimedia.org/T207920
[4] https://phabricator.wikimedia.org/T216599
[5] https://phabricator.wikimedia.org/T216256
[6] https://phabricator.wikimedia.org/T216047
[7] https://phabricator.wikimedia.org/T215932
[8] https://phabricator.wikimedia.org/T215594
[9] https://phabricator.wikimedia.org/T192680
[10] https://en.wikipedia.org/wiki/Grammatical_gender
[11] https://en.wiktionary.org/wiki/Br%C3%BCcke#German
[12] https://en.wiktionary.org/wiki/puente#Spanish
[13] https://en.wiktionary.org/wiki/pont#French
[14] https://en.wikipedia.org/wiki/Noun_class
[15] https://en.wikipedia.org/wiki/Noun_class#Niger%E2%80%93Congo_languages
[16] https://en.wikipedia.org/wiki/Diminutive
[17] https://en.wiktionary.org/wiki/baby#Etymology
[18] https://en.wiktionary.org/wiki/M%C3%A4dchen#Etymology

----

Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation

5 years, 2 months

Discovery Weekly Update for the week starting 2019-02-11

by Chris Koerner

Hello again,

This is the weekly update from the Search Platform team for the week starting 2019-02-11. As always, feedback and questions are welcome.

== Discussions ==

=== Search ===

* Stas and Trey worked on creating a textcat package to deploy [0]
* Mathew and Gehel collaborated on creating an Icinga check for failed shard allocation [1]
* Search, SRE, and WMCS created a cloudelastic-root group that refines certain access to the search clusters [2]
* Erik ran a Wikidata entity autocomplete A/B test on the de, fr, and es wikis. The results were good, and the new wbsearchentities profiles have been deployed [3]
* Erik worked to create a metastore if it is missing from indexNamespaces.php (installs were failing while running updateSearchIndexConfig.php) [4]
* David reworked how the source_regex timeout is done in Cirrus (to keep a source_regex query from consuming all the cluster resources) [5]
* David also confirmed that ApiFeatureUsage still works with ElasticSearch 6.5.4 [6]
* Erik noted that as production search indices are now split across three clusters per datacenter, mwgrep needs to be able to query multiple ElasticSearch clusters [7]
* Erik ensured that the mjolnir daemons will work seamlessly with ElasticSearch 5 or 6 [8]
* Trey and David ensured that the Elastic language analysis components, our internal components, and third-party components are all working as expected in ElasticSearch 6 [9]

[0] https://phabricator.wikimedia.org/T213936
[1] https://phabricator.wikimedia.org/T212850
[2] https://phabricator.wikimedia.org/T214922
[3] https://phabricator.wikimedia.org/T214515
[4] https://phabricator.wikimedia.org/T215369
[5] https://phabricator.wikimedia.org/T198734
[6] https://phabricator.wikimedia.org/T215621
[7] https://phabricator.wikimedia.org/T215199
[8] https://phabricator.wikimedia.org/T215475
[9] https://phabricator.wikimedia.org/T194849

----

Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation
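To give a rough idea of what a monitoring check for shard allocation can look like, here is a minimal sketch. It is not the check from the task linked in the update above: it assumes a response shaped like Elasticsearch's cluster health API (which reports an `unassigned_shards` count) and the usual Nagios/Icinga plugin exit codes. The function name and the thresholds are hypothetical.

```python
from typing import Mapping, Tuple

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_unassigned_shards(
    health: Mapping[str, object], warn: int = 1, crit: int = 5
) -> Tuple[int, str]:
    """Map a cluster-health-style dict to (exit_code, message).

    Assumption: `health` mirrors the JSON body of a cluster health
    request; the warn/crit thresholds are made up for illustration.
    """
    count = health.get("unassigned_shards")
    if not isinstance(count, int):
        return UNKNOWN, "UNKNOWN: no unassigned_shards field in response"
    if count >= crit:
        return CRITICAL, f"CRITICAL: {count} unassigned shards"
    if count >= warn:
        return WARNING, f"WARNING: {count} unassigned shards"
    return OK, "OK: all shards assigned"


code, message = check_unassigned_shards({"status": "yellow", "unassigned_shards": 2})
print(code, message)
```

Keeping the decision logic in a pure function means it can be tested without a live cluster; a real plugin would fetch the health document and call `sys.exit(code)` after printing the message.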

5 years, 2 months

WikibaseCirrusSearch extension

by Stas Malyshev

Hi!

I've been working for a while now on splitting the code that does searching - and more specifically, searching using ElasticSearch/CirrusSearch - out from the Wikibase extension code and into a separate extension (see https://phabricator.wikimedia.org/T190022). If you don't know what I'm talking about here (or are not interested in this topic), you can safely skip the rest of this message.

The WikibaseCirrusSearch extension is meant to hold all the code that integrates ElasticSearch and the CirrusSearch extension with Wikibase, so that the main Wikibase repo does not have any Elastic-specific code. This means that if you have your own Wikibase install, you'll need (after the migration is done) to install WikibaseCirrusSearch to get search functionality like we have on Wikidata now. There will also be changes in configuration - I'll make a migration document and announce it separately.

We're now working on deploying and testing it on Beta/testwiki, after which we'll start migrating production to run the code in this extension for search, after which the search code in the Wikibase repo itself will be removed. You can track the progress in the Phabricator task mentioned above.

Since the code migration is at a pretty advanced stage now, I'd like to ask that if you make any changes to any code under repo/includes/Search or repo/config in the Wikibase repo, or to any tests or configs related to those, you please inform me (by adding me to patch reviewers/CC, by email, or by any other reasonable means) so that these changes won't be lost in the migration. I'll be periodically looking through the latest patches for anything related, but I might miss things.

WikibaseLexeme code that relates to search will also be migrated to a separate extension (WikibaseLexemeCirrusSearch); that work will be starting soon. So the request above applies to the search parts of the WikibaseLexeme code as well.

If you have any questions/comments, please feel free to ask me, on the lists or on IRC.

Thanks,
--
Stas Malyshev
smalyshev(a)wikimedia.org

5 years, 2 months

Discovery Weekly Update for the week starting 2019-02-04

by Chris Koerner

Hello,

This is the weekly update from the Search Platform team for the week starting 2019-02-04. As always, feedback and questions are welcome.

== Discussions ==

=== Search ===

* Based on feedback received during a conversation at All Hands, Erik wrote up a page documenting how to utilize dependencies inside pyspark with a custom setup at the start of a notebook. [0]
* Julia (NLP contractor) is working on various things and will use glent in part of that research. :) [1]
* Nuria is reading / watching all sorts of documentation regarding search; here's an interesting video. [2]
* Erik paid down more documentation debt by documenting the search sort options that are available in the API [3]
* David worked on using subphrase matching for autocomplete by default on specific sites [4]
* Erik worked on starting a new browser bot instance running elastic 6 to help get ready for the upcoming es6 upgrade [5]
* Erik noticed that once the elasticsearch cluster split is complete we will need to be ready to start deploying the archive index split, and went ahead and dropped the "archive" type from the general index on testwiki [6]
* Trey reviewed David's port of all of our internal and external language analyzers to Elasticsearch 6. There are some unexpected upgrades, and there were a few problems, which David immediately fixed (and added new tests to cover). [7]

[0] https://wikitech.wikimedia.org/wiki/User:EBernhardson/pyspark_on_SWAP
[1] https://en.wiktionary.org/wiki/glent
[2] https://commons.wikimedia.org/wiki/File:BareBonesSearch.webm
[3] https://phabricator.wikimedia.org/T215198
[4] https://phabricator.wikimedia.org/T212788
[5] https://phabricator.wikimedia.org/T214422
[6] https://phabricator.wikimedia.org/T213851
[7] https://phabricator.wikimedia.org/T194849

----

Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation
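As an illustration of the search sort options mentioned in the update above, here is a small sketch that builds a search request URL with a sort order. It assumes the `srsort` parameter of the MediaWiki `list=search` API module and the value `last_edit_desc`; check the API help page of your wiki for the values it actually accepts. The `search_url` helper and the choice of the English Wikipedia endpoint are for illustration only, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(query, sort="relevance", api="https://en.wikipedia.org/w/api.php"):
    """Build (but do not send) a full-text search request URL."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srsort": sort,  # assumption: sort order is selected via srsort
        "format": "json",
    }
    return f"{api}?{urlencode(params)}"


url = search_url("grammatical gender", sort="last_edit_desc")
print(url)

# Round-trip the query string to confirm the parameters were encoded as intended.
decoded = parse_qs(urlsplit(url).query)
print(decoded["srsearch"], decoded["srsort"])
```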

5 years, 2 months

Re: [discovery] [Research-Internal] [Analytics] Article about ML in production woes

by Nuria Ruiz

Team,

Since everyone is here, we will be working on a machine learning infrastructure program this year. I will set up meetings with everyone on this thread and some others in SRE and Audiences to get a "bag of requests" of things that are missing. The first round of talks, which I hope to finish next week, is to hear what everyone's requests/ideas are. I will be sending meeting invites today and tomorrow. I think some themes will emerge from those.

Thus far, it is pretty clear that we need a better way to deploy models to production (right now we deploy those to Elasticsearch in very crafty manners, for example), we need to have an answer to GPU issues to train models, we need to have a "recommended way" in which we train and compute, we need some unified system for tracking models+data+tests, and finally, there are probably many learnings from the work that has been done in ORES thus far.

Thanks,

Nuria

On Thu, Feb 7, 2019 at 8:40 AM Miriam Redi <mredi(a)wikimedia.org> wrote:

> Hey Andrew!
>
> Thank you so much for sharing this and starting this conversation. We had a meeting at All Hands with all people interested in "Image Classification" https://phabricator.wikimedia.org/T215413 , and one of the open questions was exactly how to find a "common repository" for ML models that different groups and products within the organization can use. So, please, count me in!
>
> Thanks,
>
> M
>
> On Thu, Feb 7, 2019 at 4:38 PM Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
>
>> Just gave the article a quick read. I think this article pushes on some key issues for sure. I definitely agree with the focus on python/jupyter as essential for a productive workflow that leverages the best from research scientists. We've been thinking about what ORES 2.0 would look like and event streams are the dominant proposal for improving on the limitations of our queue-based worker pool.
>>
>> One of the nice things about ORES/revscoring is that it provides a nice framework for operating using the *exact same code* no matter the environment. E.g. it doesn't matter if we're calling out to an API to get data for feature extraction or providing it via a stream. By investing in a dependency injection strategy, we get that flexibility. So to me, the hardest problem -- the one I don't quite know how to solve -- is how we'll mix and merge streams to get all of the data we want available for feature extraction. If I understand correctly, that's where Kafka shines. :)
>>
>> I'm definitely interested in fleshing out this proposal. We should probably be exploring the processes for training new types of models (e.g. image processing) using different strategies than ORES. In ORES, we're almost entirely focused on using sklearn but we have some basic abstractions for other estimator libraries. We also make some strong assumptions about running on a single CPU that could probably be broken for some performance gains using real concurrency.
>>
>> -Aaron
>>
>> On Thu, Feb 7, 2019 at 10:05 AM Goran Milovanovic <goran.milovanovic_ext(a)wikimedia.de> wrote:
>>
>>> Hi Andrew,
>>>
>>> I have recently started a six month AI/Machine Learning Engineering course which focuses exactly on the topics that you've shown interest in.
>>>
>>> So,
>>>
>>> I'd love it if we had a working group (or whatever) that focused on how to standardize how we train and deploy ML for production use.
>>>
>>> Count me in.
>>>
>>> Regards,
>>> Goran
>>>
>>> Goran S. Milovanović, PhD
>>> Data Scientist, Software Department
>>> Wikimedia Deutschland
>>>
>>> ------------------------------------------------
>>> "It's not the size of the dog in the fight,
>>> it's the size of the fight in the dog."
>>> - Mark Twain
>>> ------------------------------------------------
>>>
>>> On Thu, Feb 7, 2019 at 4:16 PM Andrew Otto <otto(a)wikimedia.org> wrote:
>>>
>>>> Just came across
>>>>
>>>> https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-ten…
>>>>
>>>> In it, the author discusses some of what he calls the 'impedance mismatch' between data engineers and production engineers. The links to Uber's Michelangelo <https://eng.uber.com/michelangelo/> (which as far as I can tell has not been open sourced) and the Hidden Technical Debt in Machine Learning Systems paper <https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning…> are also very interesting!
>>>>
>>>> At All Hands I've been hearing more and more about using ML in production, so these things seem very relevant to us. I'd love it if we had a working group (or whatever) that focused on how to standardize how we train and deploy ML for production use.
>>>>
>>>> :)
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics(a)lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>
>>
>> --
>> Aaron Halfaker
>> Principal Research Scientist
>> Head of the Scoring Platform team
>> Wikimedia Foundation
>> _______________________________________________
>> Research-Internal mailing list
>> Research-Internal(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/research-internal
>>
> _______________________________________________
> Research-Internal mailing list
> Research-Internal(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/research-internal
>
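Aaron's point in the thread above, that dependency injection lets the same feature-extraction code run whether the data comes from an API call or from a stream, can be illustrated with a toy sketch. This is not the revscoring API: every name here (`extract_features`, `api_source`, `stream_source`) is hypothetical, and in-memory data stands in for both the API and the stream.

```python
from typing import Callable, Dict, Iterable, Tuple

# A text source is any callable mapping a revision id to its text.
RevisionText = Callable[[int], str]


def extract_features(rev_id: int, get_text: RevisionText) -> Dict[str, float]:
    """Compute features; where the text comes from is injected by the caller."""
    text = get_text(rev_id)
    words = text.split()
    return {
        "chars": float(len(text)),
        "words": float(len(words)),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


def api_source(store: Dict[int, str]) -> RevisionText:
    """Stand-in for an API client that looks revisions up on demand."""
    return lambda rev_id: store[rev_id]


def stream_source(events: Iterable[Tuple[int, str]]) -> RevisionText:
    """Stand-in for a stream consumer that serves already-delivered events."""
    seen = dict(events)
    return lambda rev_id: seen[rev_id]


from_api = extract_features(42, api_source({42: "Hello wiki world"}))
from_stream = extract_features(42, stream_source([(42, "Hello wiki world")]))
print(from_api == from_stream)
```

The extraction function never learns which backend produced the text, so the hard part Aaron describes (mixing and merging several streams so that all required data is available) stays isolated in the source implementations.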

5 years, 2 months

Upcoming Search Platform Office Hours—February 6th

by Trey Jones

The Search Platform Team <https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds office hours the first Wednesday of each month (that's tomorrow!). Come ask us anything about Wikimedia search!

We’re particularly interested in:

* Opportunities for collaboration, internally or externally to the Wikimedia Foundation
* Challenges you have with on-wiki search, in any of the languages we support

But we're happy to talk about anything search-related. Feel free to add your items to the Etherpad Agenda for the next meeting.

Details for our next meeting:

Date: Wednesday, February 6th, 2019
Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vyc-jvgq-dww

*N.B.:* Google Meet System Requirements <https://support.google.com/meet/answer/7317473>

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation

5 years, 2 months

Discovery Weekly Update for the week starting 2019-01-21

by Chris Koerner

Hello,

This is the weekly update from the Search Platform team for the week starting 2019-01-21 (acknowledging a little delay). As always, feedback and questions are welcome.

== Discussions ==

=== Search ===

* Trey wrote up his ideas for the wrong-keyboard implementation and UI, and for the new TextCat language models and parameter optimization used in wrong-keyboard and wrong-encoding detection. [0]
* Trey also reviewed, updated and manually re-built the Hebmorph plugin for ElasticSearch 6 with help from the rest of the Search team [1]
* David double checked that chi to psi/omega indices are no longer used and deleted them. [2]
* David also worked on dropping the "archive" type in the general index [3]
* David wrapped up moving the phrase suggest related code to its FallbackMethod [4]
* Erik worked on modifying Wikidata entity completion search for per-language tuning parameters [5]
* Erik also worked on a bug where we didn't retain the cross-cluster identifier in OtherIndexes [6]
* Erik fulfilled a feature request to add chronological sorting by page creation timestamp for search results [7]
* David also upgraded the Elasticsearch plugins extra, extra-analysis and highlighter to elasticsearch 6.5.4 [8]
* Stas is working on separating the ElasticSearch parts of Wikibase into a separate extension [9]

[0] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Implementation_Desig…
[1] https://phabricator.wikimedia.org/T214439
[2] https://phabricator.wikimedia.org/T214052
[3] https://phabricator.wikimedia.org/T200198
[4] https://phabricator.wikimedia.org/T213098
[5] https://phabricator.wikimedia.org/T213106
[6] https://phabricator.wikimedia.org/T214050
[7] https://phabricator.wikimedia.org/T195071
[8] https://phabricator.wikimedia.org/T214312
[9] https://phabricator.wikimedia.org/T190022

----

Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation

5 years, 2 months
