Hi!
I've been working for a while now on splitting the code that does
searching - and more specifically, searching using
ElasticSearch/CirrusSearch - out from Wikibase extension code and into a
separate extension (see https://phabricator.wikimedia.org/T190022). If
you don't know what I'm talking about here (or are not interested in
this topic), you can safely skip the rest of this message.
The WikibaseCirrusSearch extension is meant to contain all the code
for integrating the ElasticSearch and CirrusSearch extensions with
Wikibase, so that the main Wikibase repo does not have any
Elastic-specific code. This means that if you have your own Wikibase
install, you'll need (after the migration is done) to install
WikibaseCirrusSearch to get search functionality like we have on
Wikidata now. There will also be changes in configuration - I'll write
a migration document and announce it separately. We're now working on
deploying and testing it on Beta/testwiki. After that, we'll start
migrating production to run the search code from this extension, and
then the search code in the Wikibase repo itself will be removed. You
can track the progress in the Phabricator task mentioned above.
Since the code migration is at a pretty advanced stage now, I'd like
to ask: if you make any changes to any code under repo/includes/Search
or repo/config in the Wikibase repo, or to any tests or configs
related to those, please inform me (by adding me to patch
reviewers/CC, by email, or by any other reasonable means) so that
those changes won't be lost in the migration. I'll be looking through
the latest patches periodically for anything related, but I might miss
things.
The search-related code in WikibaseLexeme will also be migrated to a
separate extension (WikibaseLexemeCirrusSearch); that work will be
starting soon. So the request above also applies to the search parts
of the WikibaseLexeme code.
If you have any questions or comments, please feel free to ask me, on
the lists or on IRC.
Thanks,
--
Stas Malyshev
smalyshev(a)wikimedia.org
Team,
Since everyone is here: we will be working on a machine learning
infrastructure program this year. I will set up meetings with everyone
on this thread, and some others in SRE and Audiences, to gather a "bag
of requests" of things that are missing. The first round of talks,
which I hope to finish next week, is to hear what everyone's requests
and ideas are. I will be sending meeting invites today and tomorrow. I
think some themes will emerge from those.
Thus far, it is pretty clear that we need a better way to deploy
models to production (right now we deploy them to ElasticSearch in
very crafty ways, for example); we need an answer to the GPU issues
around training models; we need a "recommended way" in which we train
and compute; we need a unified system for tracking models, data, and
tests; and finally, there are probably many learnings from the work
done on ORES thus far.
Thanks,
Nuria
On Thu, Feb 7, 2019 at 8:40 AM Miriam Redi <mredi(a)wikimedia.org> wrote:
> Hey Andrew!
>
> Thank you so much for sharing this and starting this conversation. We had a
> meeting at All Hands with all the people interested in "Image Classification"
> https://phabricator.wikimedia.org/T215413 , and one of the open questions
> was exactly how to find a "common repository" for ML models that different
> groups and products within the organization can use. So, please, count me
> in!
>
> Thanks,
>
> M
>
>
> On Thu, Feb 7, 2019 at 4:38 PM Aaron Halfaker <ahalfaker(a)wikimedia.org>
> wrote:
>
>> Just gave the article a quick read. I think this article pushes on some
>> key issues for sure. I definitely agree with the focus on python/jupyter
>> as essential for a productive workflow that leverages the best from
>> research scientists. We've been thinking about what ORES 2.0 would look
>> like and event streams are the dominant proposal for improving on the
>> limitations of our queue-based worker pool.
>>
>> One of the nice things about ORES/revscoring is that it provides a nice
>> framework for operating using the *exact same code* no matter the
>> environment. E.g. it doesn't matter if we're calling out to an API to get
>> data for feature extraction or providing it via a stream. By investing in
>> a dependency injection strategy, we get that flexibility. So to me, the
>> hardest problem -- the one I don't quite know how to solve -- is how we'll
>> mix and merge streams to get all of the data we want available for feature
>> extraction. If I understand correctly, that's where Kafka shines. :)
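>>
>> As a minimal sketch (this is not the actual revscoring API, and the
>> fetchers below are hypothetical), the pattern looks roughly like
>> this: the extractor depends only on an injected fetcher, and the
>> caller decides whether the data comes from an API or a stream.
>>
>>     # fetch_text is the injected dependency: any callable rev_id -> str.
>>     def extract_features(rev_id, fetch_text):
>>         text = fetch_text(rev_id)
>>         return {'chars': len(text), 'words': len(text.split())}
>>
>>     # One fetcher calls out to a (hypothetical) API...
>>     def api_fetcher(rev_id):
>>         import requests
>>         return requests.get('https://example.org/rev/%d/text' % rev_id).text
>>
>>     # ...another closes over an event already consumed from a stream.
>>     def make_stream_fetcher(event):
>>         return lambda rev_id: event['text']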
>>
>> I'm definitely interested in fleshing out this proposal. We should
>> probably be exploring the processes for training new types of models (e.g.
>> image processing) using different strategies than ORES. In ORES, we're
>> almost entirely focused on using sklearn but we have some basic
>> abstractions for other estimator libraries. We also make some strong
>> assumptions about running on a single CPU that could probably be broken for
>> some performance gains using real concurrency.
>>
>> -Aaron
>>
>> On Thu, Feb 7, 2019 at 10:05 AM Goran Milovanovic <
>> goran.milovanovic_ext(a)wikimedia.de> wrote:
>>
>>> Hi Andrew,
>>>
>>> I have recently started a six-month AI/Machine Learning Engineering
>>> course which focuses exactly on the topics that you've shown interest in.
>>>
>>> So,
>>>
>>> >>> I'd love it if we had a working group (or whatever) that focused
>>> on how to standardize how we train and deploy ML for production use.
>>>
>>> Count me in.
>>>
>>> Regards,
>>> Goran
>>>
>>>
>>> Goran S. Milovanović, PhD
>>> Data Scientist, Software Department
>>> Wikimedia Deutschland
>>>
>>> ------------------------------------------------
>>> "It's not the size of the dog in the fight,
>>> it's the size of the fight in the dog."
>>> - Mark Twain
>>> ------------------------------------------------
>>>
>>>
>>> On Thu, Feb 7, 2019 at 4:16 PM Andrew Otto <otto(a)wikimedia.org> wrote:
>>>
>>>> Just came across
>>>>
>>>> https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-ten…
>>>>
>>>> In it, the author discusses some of what he calls the 'impedance
>>>> mismatch' between data engineers and production engineers. The links to
>>>> Uber's Michelangelo <https://eng.uber.com/michelangelo/> (which as far
>>>> as I can tell has not been open sourced) and the Hidden Technical Debt
>>>> in Machine Learning Systems paper
>>>> <https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning…> are
>>>> also very interesting!
>>>>
>>>> At All Hands I've been hearing more and more about using ML in
>>>> production, so these things seem very relevant to us. I'd love it if we
>>>> had a working group (or whatever) that focused on how to standardize how we
>>>> train and deploy ML for production use.
>>>>
>>>> :)
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics(a)lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>
>>
>> --
>>
>> Aaron Halfaker
>>
>> Principal Research Scientist
>>
>> Head of the Scoring Platform team
>> Wikimedia Foundation
>> _______________________________________________
>> Research-Internal mailing list
>> Research-Internal(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/research-internal
>>
The Search Platform Team
<https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds
office hours the first Wednesday of each month—that's tomorrow! Come ask us
anything about Wikimedia search!
We’re particularly interested in:
* Opportunities for collaboration—internally or externally to the Wikimedia
Foundation
* Challenges you have with on-wiki search, in any of the languages we
support
But we're happy to talk about anything search-related. Feel free to add
your items to the Etherpad Agenda for the next meeting.
Details for our next meeting:
Date: Wednesday, February 6th, 2019
Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vyc-jvgq-dww
*N.B.:* Google Meet System Requirements
<https://support.google.com/meet/answer/7317473>
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hello,
This is the weekly update from the Search Platform team for the week
starting 2019-01-14.
As always, feedback and questions welcome.
== Discussions ==
=== Search ===
* Trey updated TextCat with models for detecting Russian typed on an
English keyboard (and vice versa), and UTF-8 Russian text improperly
encoded as Windows-1251, [0] as a precursor to providing
wrong-keyboard/encoding detection and suggestions [1] (a sketch of the
n-gram idea TextCat is built on follows this list)
* Erik and the team did a lot of work on an epic ticket (with several
subtasks) to explore and figure out next steps in using user click
data to tune Wikidata search parameters [2] [3]. The team will ship
the newly tuned wbsearchentities profile for en soon, with de, fr, and
es to follow.
* The team also had lots of discussion and exploration on how to
transform Wikidata autocomplete click logs into a useful dataset. That
work is done: Relevance Forge now has a utility for taking in the
Wikidata completion search logs and tuning the parameters of search
based on those logs. [4]
* David fixed a minor regression where search requests failed when
offset+limit was out of bounds (cirrussearch-backend-error) [5]
* Mathew discovered that the required metrics were exposed by the
Prometheus exporter but were not displaying, and fixed the issue with
help from David and Gehel [6]
* David reconfigured the ElasticSearch cross-cluster settings on
production search servers to be persistent [7]
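
As background: TextCat-style language identification builds a ranked
character n-gram profile for each language and compares a text's
profile against each one using an "out-of-place" distance. Here is a
minimal sketch of that general Cavnar & Trenkle idea (this is not the
actual TextCat code, and the profile sizes are illustrative):

    from collections import Counter

    def ngram_profile(text, max_n=3, top_k=300):
        # Rank the most frequent character n-grams, n = 1..max_n.
        counts = Counter()
        text = text.lower()
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        return [gram for gram, _ in counts.most_common(top_k)]

    def out_of_place(doc_profile, lang_profile):
        # Sum of rank differences; unknown n-grams get a maximum penalty.
        rank = {gram: i for i, gram in enumerate(lang_profile)}
        penalty = len(lang_profile)
        return sum(abs(rank.get(gram, penalty) - i)
                   for i, gram in enumerate(doc_profile))

    def detect(text, lang_profiles):
        # Pick the language whose profile is closest to the text's.
        doc = ngram_profile(text)
        return min(lang_profiles,
                   key=lambda lang: out_of_place(doc, lang_profiles[lang]))

A model for "Russian typed on an English keyboard" would then just be
a profile built from Russian text mapped through the keyboard
correspondence.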
=== WDQS ===
* Stas & Guillaume finished moving categories namespace into a
separate Blazegraph instance [8]
== Did you know? ==
English text, like that of many other languages, is written left-to-right (LTR), but
some languages—most notably Arabic, Hebrew, Persian, and Urdu, but
also many others [9]—are written right-to-left (RTL). To handle
different writing directions—especially in mixed LTR and RTL
texts—Unicode classifies characters as having "strong", "weak", or
"neutral" directionality. Strong characters definitely go in a
particular direction, like ABC or אבג. Weak characters have a "vague"
directionality, but can be changed in context, mostly numbers. Neutral
characters pick up their directionality from context, like punctuation
and whitespace characters used across scripts.
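
Python's unicodedata module exposes these classes (the Unicode
Bidi_Class property), which makes for a quick way to poke at them:

    import unicodedata

    # 'L' = strong LTR, 'R' = strong RTL, 'EN' = weak (European number),
    # 'ON' and 'WS' = neutral "other" and whitespace classes.
    for ch in ['A', '\u05d0', '7', '!', ' ']:
        print(repr(ch), unicodedata.bidirectional(ch))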
Mirrored characters change the way they display based on context. For
example "A>B>C" and "א>ב>ג" both only have the greater than character
(>) in them, but, if you are reading this somewhere that follows the
Unicode bidirectional algorithm, the ones between Latin letters point
to the right and those between Hebrew letters point to the left.
The algorithms are complicated [10], and when they don't work, there
are explicit characters that indicate things like "text should flow
left to right from here". The explicit formatting characters have the
most potential to cause trouble for search because they are usually
invisible, and you can pick one up without realizing it. For example,
when copying an Arabic word from a page in English, or a French word
from a page in Hebrew, the word that is "the other way around" from
the main text might have one of these marks at the beginning or end of
it. Fortunately, we can usually identify them and strip them out.
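
As a sketch of that stripping step (not the actual CirrusSearch code),
one can simply filter out the known bidi control characters:

    # Invisible Unicode bidi control characters that can sneak into queries.
    BIDI_CONTROLS = {
        '\u200e', '\u200f',            # LRM, RLM
        '\u202a', '\u202b', '\u202c',  # LRE, RLE, PDF
        '\u202d', '\u202e',            # LRO, RLO
        '\u2066', '\u2067', '\u2068', '\u2069',  # LRI, RLI, FSI, PDI
    }

    def strip_bidi_controls(query):
        return ''.join(ch for ch in query if ch not in BIDI_CONTROLS)

    # A query copied from mixed-direction text, with a stray RLM at the end:
    print(repr(strip_bidi_controls('caf\u00e9\u200f')))  # 'café'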
Finally, there are some scripts that have been written in other
interesting directions. Vertical text includes Chinese, Japanese, and
Korean [11], and Mongolian [12]. Hanunó'o [13] and Ogham [14] were
written bottom-to-top! My [Trey's] favorite "direction" is
"boustrophedon," [15] which means "like an ox ploughs" and alternates
left-to-right and right-to-left; it was used particularly in old
manuscripts and inscriptions in many writing systems. Why jump from one
side of the page to the other when you can just curve around where you
are or flip to mirrored letters and keep going?!
[0] https://phabricator.wikimedia.org/T213931
[1] https://phabricator.wikimedia.org/T138958
[2] https://phabricator.wikimedia.org/T193701
[3] https://phabricator.wikimedia.org/T213105
[4] https://phabricator.wikimedia.org/T205111
[5] https://phabricator.wikimedia.org/T213745
[6] https://phabricator.wikimedia.org/T210592
[7] https://phabricator.wikimedia.org/T213150
[8] https://phabricator.wikimedia.org/T213212
[9] https://en.wikipedia.org/wiki/Right-to-left#List_of_RTL_scripts
[10] https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
[11] https://en.wikipedia.org/wiki/Horizontal_and_vertical_writing_in_East_Asian…
[12] https://en.wikipedia.org/wiki/Mongolian_script
[13] https://en.wikipedia.org/wiki/Hanun%C3%B3%27o_alphabet
[14] https://en.wikipedia.org/wiki/Ogham
[15] https://en.wikipedia.org/wiki/Boustrophedon
----
Subscribe to receive on-wiki (or opt-in email) notifications of the
Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" or
"Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation
This is the weekly update from the Search Platform team for the week
starting 2019-01-07.
As always, feedback and questions welcome.
== Discussions ==
=== Search ===
* David discovered an issue with the click-through rate on one of the
Search dashboards for mobile apps [0] and enlisted Chelsy's help in
fixing it quickly (done!) [1]
* Mathew worked on increasing the number of shards for enwiki_general
[2] (a sketch of the general approach follows this list)
* David helped augment the list of known clusters using the cluster
conf for Mjolnir [3]
* David updated the completion suggester: TP50 [the 50th-percentile
latency] increased from 9ms to 24ms [4]
* The Search team worked on supporting searching multiple filetypes at
once, based on input from the Multimedia team [5]
* David and Mathew worked on allowing ElasticSearch machines to
communicate with each other on ports 9500 and 9700 [6]
* We found that most of the dashboards in Grafana are designed around
one cluster per DC, so we refactored them to allow selecting a
specific cluster (by adding chi, psi, and omega selectors) [7]
* The multi-instance support code added for ExternalIndex was designed
without the group+replica concepts in mind, so we fixed ExternalIndex
to support groups & replica topology [8]
* There was a recent spike of fatal timeouts from API search
suggestions (prefixsearch) that caused a number of user queries to
stall for 60 seconds and then receive a generic error page without any
results. We fixed this by merging a patch so that language detection
is not run when rewriting is not enabled [9]
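
On the shards work: a live index's primary shard count can't be
changed in place, so increasing it generally means creating a new
index and copying the documents over. A minimal sketch using
Elasticsearch's REST API (the host, new index name, and shard count
are illustrative, and this is not the team's exact procedure):

    import requests

    ES = 'http://localhost:9200'  # hypothetical cluster address

    # 1. Create a new index with more primary shards (fixed at creation).
    requests.put(ES + '/enwiki_general_v2',
                 json={'settings': {'index': {'number_of_shards': 8}}})

    # 2. Copy the documents over with the Reindex API.
    requests.post(ES + '/_reindex',
                  json={'source': {'index': 'enwiki_general'},
                        'dest': {'index': 'enwiki_general_v2'}})

In production one would then repoint an index alias at the new index
rather than renaming anything.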
=== WDQS ===
* We have added new keyboard shortcuts to the WDQS UI for those
systems where Ctrl-Space is already taken: Ctrl-Alt-Space and
Alt-Enter [10]
* Stas found an issue where the WDQS puppet/hiera configs were spread
across too many places; Mathew and Gehel worked on it with assistance
from SRE (thanks!) [11]
* Our database in WDQS seems to be hitting Blazegraph internal limits,
which requires some careful work of rearranging the data to stay away
from the limit. This work has now started [12]
* Stas fixed an issue where a large update could crash the Updater [13]
* Stas fixed an issue where, due to database replication delay, the
Updater could read an old version of the data from Wikidata [14]
* Stas fixed an issue where the SERVICE SILENT construct was producing
errors even though the standard says it should not [15] (a usage
sketch follows this list)
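
For context, SERVICE SILENT is the SPARQL 1.1 federated-query form
that ignores failures from a remote endpoint instead of failing the
whole query. A sketch against the public WDQS endpoint (the remote
service URL is deliberately a non-existent placeholder, so SILENT
should simply leave ?label unbound):

    import requests

    query = '''
    SELECT ?item ?label WHERE {
      ?item wdt:P31 wd:Q146 .            # items that are house cats
      SERVICE SILENT <https://example.org/sparql> {
        ?item rdfs:label ?label .
      }
    } LIMIT 5
    '''

    r = requests.get('https://query.wikidata.org/sparql',
                     params={'query': query, 'format': 'json'},
                     headers={'User-Agent': 'service-silent-demo/0.1'})
    print(r.json()['results']['bindings'])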
[0] http://discovery.wmflabs.org/metrics/#app_events
[1] https://phabricator.wikimedia.org/T211306
[2] https://phabricator.wikimedia.org/T212224
[3] https://phabricator.wikimedia.org/T211752
[4] https://phabricator.wikimedia.org/T212768
[5] https://phabricator.wikimedia.org/T212776
[6] https://phabricator.wikimedia.org/T212434
[7] https://phabricator.wikimedia.org/T211956
[8] https://phabricator.wikimedia.org/T212120
[9] https://phabricator.wikimedia.org/T212455
[10] https://phabricator.wikimedia.org/T203320
[11] https://phabricator.wikimedia.org/T210431
[12] https://phabricator.wikimedia.org/T213210
[13] https://phabricator.wikimedia.org/T210235
[14] https://phabricator.wikimedia.org/T210901
[15] https://phabricator.wikimedia.org/T196859
----
Subscribe to receive on-wiki (or opt-in email) notifications of the
Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" or
"Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner (he/him)
Community Relations Specialist
Wikimedia Foundation
Hello all!
We are having some issues with two of the Wikidata Query Service
servers. So far, the issue looks like data corruption, probably
related to an issue in Blazegraph itself (the database engine behind
Wikidata Query Service). The issue prevents updates to the data, but
reads are unaffected as far as we can tell.
The two affected servers are part of the internal WDQS cluster, so
the public WDQS endpoint [1] is not affected. Data is lagging on the
internal eqiad endpoint, so MediaWiki functionality that uses WDQS is
not currently seeing the latest updates to Wikidata.
We are reaching out to the Blazegraph team via Github [2] and via
private contacts that we have. We hope to identify the root cause of
the issue so that we can fix it for good, but this looks like a hard
problem. Failing that, we will reimport the full data set.
You can follow the upstream issue on Github [2] and on Phabricator on
our side [3].
Sorry for the inconvenience and thank you for your patience!
Have fun,
Guillaume
[1] https://query.wikidata.org/
[2] https://github.com/blazegraph/database/issues/114
[3] https://phabricator.wikimedia.org/T213134
--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+1 / CET
The Search Platform Team
<https://www.mediawiki.org/wiki/Wikimedia_Search_Platform> usually holds
office hours the first Wednesday of each month—but since this month that
would have been Jan 2nd, we’ve delayed for a week. Come ask us anything
about Wikimedia search!
We’re particularly interested in:
* Opportunities for collaboration—internally or externally to the Wikimedia
Foundation
* Challenges you have with on-wiki search, in any of the languages we
support
But we're happy to talk about anything search-related. Feel free to add
your items to the Etherpad Agenda for the next meeting.
Details for our next meeting:
Date: Wednesday, January 9th, 2019
Time: 16:00-17:00 GMT / 08:00-09:00 PST / 11:00-12:00 EST / 17:00-18:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vyc-jvgq-dww
*N.B.:* Google Meet System Requirements
<https://support.google.com/meet/answer/7317473>
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hi,
The previously announced schedule for Search Platform Team office hours was
that these office hours would happen on the first Wednesday of each month.
My guess is that January 1st is the last day of WMF's end-of-year
holidays, but maybe WMF's holiday break extends further than the 1st.
There has been no announcement of an office hour on January 2nd. Am I
correct in guessing
that the office hour will occur on January 9th?
Pine
( https://meta.wikimedia.org/wiki/User:Pine )