Hey everyone,
As part of T195491 <https://phabricator.wikimedia.org/T195491>, Erik has
been looking into the details of our regex processing and ways to handle
ridiculously long-running regex queries. He pulled all the regex queries
over the last 90 days to get a sense of what features people are using and
what impact certain changes he was considering would have on users. Turns
out there are a lot more users than I would have thought—which is good
news! And a lot of them look like bots.
He also made the mistake of pointing me to the data and highlighting a
common pattern: searches for interwiki links. I couldn't help myself; I
started digging around and found that the majority of the searches are
looking for those interwiki links, and that the vast majority of regex
searches fall into three types: interwiki links, URLs, and Library of
Congress collection IDs.
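For a purely illustrative example: CirrusSearch exposes regex matching
through the insource keyword, so a search for interwiki links to German
Wikipedia might look like insource:/\[\[de:/, which scans the raw wikitext
for a literal "[[de:" prefix.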
Overall, there are 5,613,506 regexes total across all projects and all
languages, over a 90-day period. That comes out to ~62K/day—which is a lot
more than I'd expected, though I hadn't thought about bots using regexes.
Read more on MediaWiki
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Regular_Ex…>
.
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
At the Barcelona Hackathon, one of my projects was to carry around a sign
that said, “Tell Me Why Your Search Sucks!” in about 20 languages. A number
of people shared their thoughts, which I've summarized on Phab ticket
T189791 <https://phabricator.wikimedia.org/T189791#4226596>.
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hello,
We recently announced the new Wikimedia Technical Conference (TechConf)
during the closing session of the Barcelona Hackathon on May 20, 2018. We
are sending this email to give an update on the planning and organization,
and also to let everyone know how the nomination process will work for
those interested in attending.
The Wikimedia Technical Conference will take place in Portland, OR, USA on
October 22-25, 2018. As mentioned in previous emails [1][2] and on the wiki
page [3], this conference will be focused on the cross-departmental program
called Platform Evolution. We will be providing more information and
context as we go along in the process.
For this conference, we are looking for diverse stakeholders, perspectives,
and experiences that will help us make informed decisions for the future
evolution of the platform. We need people who can create and architect
solutions, as well as those who actually make decisions on funding and
prioritization for the projects.
Later this week, we will send out a form with more detailed information on
the nomination process and how to nominate people (including yourself) to
attend this conference, along with the skills, experiences, and/or
backgrounds that we are looking for. Due to the time needed for visa
applications and other constraints, the deadline for nominations will be
June 8th. Please make sure that you don't miss the deadline!
If you have any questions, please post them on the talk page [4].
[1] https://lists.wikimedia.org/pipermail/mediawiki-l/2018-April/047367.html
[2] https://lists.wikimedia.org/pipermail/wikitech-l/2018-April/089738.html
[3] https://mediawiki.org/wiki/Wikimedia_Technical_Conference/2018
[4] https://www.mediawiki.org/wiki/Talk:Wikimedia_Technical_Conference/2018
Cheers from the Program Committee:
Kate, Corey, Joaquin, Greg, Birgit and TheDJ
--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation
Hi everyone,
I just finished putting together an annotated list of potential
applications of natural language processing to on-wiki search
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applicatio…>.
There are dozens and dozens of ideas there, including many that are
interesting but probably not practical. If you have any additional ideas,
questions, suggestions, recommendations, or preferences, please share
them, either on the mailing list or on the talk page!
The goal is to narrow it down to one or two things to pursue over the next
two to four quarters, along with other projects we are working on.
Thanks!
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Привет! (Hello!)
Another update from the Search Platform team for the week starting 2018-05-07
**Programming note:** Due to the upcoming Wikimedia Hackathon and some
(personal) holiday time, the next update will be the week of
2018-05-28. Until then, and as always, feedback and questions are
welcome.
== Highlights ==
* Map internationalization launched everywhere, and embedded maps
(mapframe) are now live on 276 Wikipedias [0]
* ''"Hello, my name is _____"'' is an in-depth blog post by Trey that
was published earlier this week where he details the irony that
searching for names is not always as straightforward as you might
think. [1]
== Discussions ==
=== Search ===
* Erik updated a script that was generating lots of 500 errors in the logs [2]
* Erik also did a lot of research to evaluate the impact of adding ~2,700
new shards to the production cluster (a PDF attached to the last comment
on the ticket contains more information) [3]. There is also a follow-up
ticket for the next steps [4]
* Trey worked on the analysis config for the new Slovak stemmer, which
was deployed this week; the plugin still needs to be deployed and the
wikis re-indexed. [5]
* Stas and others worked on looking up entities by external
identifiers; the work is done for now, but it needs a re-index to be
fully ready [6]
* David worked on externalizing the parsing logic from
SimpleKeywordFeature and FullTextQueryStringQueryBuilder, which was
pushed into production in April 2018 [7]
== Other Noteworthy Stuff ==
* Trey's most recent updates to transliteration on the Crimean Tatar
Wikipedia are live; after a year of part-time 10% project work, the
transliteration infrastructure for Crimean Tatar is done and the
accuracy is in the high 90% range. [8]
== Did you know? ==
* The English word “dove”, as the past tense of “dive”, is one of the
rare cases where a conjugation has become more irregular over time.
The verb “dive” picked up the strong conjugation [9] by analogy with
other strong verbs, particularly “drive/drove”. [10] Going in the more
typical direction of regularization, Swedish strong verbs slowly lost
some of their distinctive plural forms. [11] The change started in the
16th century, and was still in progress as late as the 1940s. From
the search perspective, regular forms are easier to deal with—so, way
to go Swedish!
[0] https://lists.wikimedia.org/pipermail/wikitech-l/2018-May/089964.html
[1] https://blog.wikimedia.org/2018/05/08/searching-for-names-is-not-always-str…
[2] https://phabricator.wikimedia.org/T179266
[3] https://phabricator.wikimedia.org/T192972
[4] https://phabricator.wikimedia.org/T193654
[5] https://phabricator.wikimedia.org/T191544
[6] https://phabricator.wikimedia.org/T99899
[7] https://phabricator.wikimedia.org/T188530
[8] https://phabricator.wikimedia.org/T188321
[9] https://en.wikipedia.org/wiki/Germanic_strong_verb
[10] https://en.wiktionary.org/wiki/dove#Etymology_2
[11] https://en.wikipedia.org/wiki/Swedish_grammar#Historical_plural_forms
---
Subscribe to receive on-wiki (or opt-in email) notifications of the
Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" or
"Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation
Hi all,
sorry for the mix-up; it somehow fell through the cracks that deepcategory
is not yet working on all wikis. On wikis that are not yet indexed,
deepcategory currently seems to return all results from just the category
whose name was entered.
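For an illustrative example: on an indexed wiki, a search like
deepcategory:"Maps" should match pages in the category "Maps" or any of
its subcategories; on a wiki that is not yet indexed, it currently returns
only pages directly in the named category, as described above.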
>Now, since we are now indexing categories only for select wikis - the
>list is here:
> https://noc.wikimedia.org/conf/dblists/categories-rdf.dblist
<https://noc.wikimedia.org/conf/dblists/categories-rdf.dblist> - we may
>consider adding more wikis to it. E.g. see:
>https://phabricator.wikimedia.org/T194139
<https://phabricator.wikimedia.org/T194139>
>So which wikis need to be added?
Deep category search should work on all wikis, so can we enable it on all
of them? Stas, I saw that you already added all wikis with more than 1,000
categories for indexing. Do you know when this will start working, and
when we can have it on all wikis?
For the time being, we will add a note to the "Pages in this category"
info text, so people don't get confused.
Best,
Lea
--
Lea Voget
Product Manager Technical Wishlist
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
(cross-post)
Dear all,
we are really happy to announce that the new AdvancedSearch interface was
just deployed as a beta feature to all wikis. [1]
The search has great options for performing advanced queries, e.g. by
using keywords like "hastemplate" or "intitle", but often even experienced
editors don't know about them. This is what we found out in a workshop
series on advanced searches in 2016, and it is why we have built the
AdvancedSearch extension. [2]
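As a purely illustrative example of such a query:
intitle:lighthouse hastemplate:"Coord" finds pages with "lighthouse" in
the title that also use the Coord template; the new form builds queries
like this without users having to type the keywords by hand.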
AdvancedSearch enhances Special:Search with an advanced parameters form.
It serves as an interface to some of the search options that the Wikimedia
Foundation's search team has implemented over the past years. With this
interface, users don't have to know the syntax behind each search field,
but they can learn about it if they want to.
*From small beta to full beta*
The feature has already been available as a beta feature on deWP, arWP,
huWP, faWP and mediawiki.org for more than 5 months. During this "small
beta" phase (a base version with a set of features, deployed to a few
wikis, both LTR and RTL), support for more search options was added:
searches in categories and subcategories, searches for content in a
specific language on wikis that have the Translate extension enabled, and
searches for subpages of a page. The way namespaces are selected and
configured was also improved, and several bugs were fixed.
Everyone is invited to test the feature, now in full beta!
If you want to give us feedback or if you find a bug, please use the main
feedback page (or file a ticket in phabricator):
https://www.mediawiki.org/wiki/Help_talk:Extension:AdvancedSearch
If you want to learn more about the project, the functional scope of the
AdvancedSearch extension and the usage, please see
* the help page: https://www.mediawiki.org/wiki/Help:AdvancedSearch
* the main project page:
https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes/AdvancedSearch
* the list of supported search options:
https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes/AdvancedSearch/Functi…
*Thanks, thanks, thanks :-)*
A huge thanks to everyone who has tested the feature and given feedback
over the last 5 months, and to everyone who has translated software
messages and announcements - this is much appreciated! And a huge thanks
to the WMF's search team, who did all the backend work and built great
options for advanced search queries that can now be accessed through the
AdvancedSearch interface. It was and is great to work with you :-)
Looking forward to more testing and feedback to further improve the feature,
Thanks a lot,
Birgit
(for WMDE's Technical Wishes team)
[1] https://phabricator.wikimedia.org/T193182 (deployment ticket)
[2] https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes/AdvancedSearch/Workshop
--
Birgit Müller
Community Communications Manager
Software Development and Engineering
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
With the hackathon coming up, I thought we could ponder what could be done
while there. I've been constructing a list of horrible ideas over the last
couple of weeks:
Web UI for cirrus debug/devel features:
- Settings dump
- Mappings dump
- Copy version of settings+mappings suitable to create index with curl
- cirrusDumpQuery
- cirrusDumpResult
- cirrusExplain
- cirrusUserTesting
The top-level idea is to make it easy to access all of these things. It
could be a userscript run on-page in the wiki, or an SPA run from Tool
Labs (or even people.wikimedia.org). A sketch of fetching one of these
debug outputs is below.
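As a starting point, here is a minimal sketch in Python, assuming the
cirrusDumpQuery debug parameter is accepted as a URL parameter on a search
request and returns JSON (the wiki and search term are placeholders):

    import json
    import requests

    # Fetch the Elasticsearch query CirrusSearch would build for a search,
    # via the cirrusDumpQuery debug parameter.
    def dump_cirrus_query(wiki, term):
        resp = requests.get(
            wiki + "/w/index.php",
            params={"search": term, "cirrusDumpQuery": 1},
            headers={"User-Agent": "cirrus-debug-sketch/0.1 (hackathon toy)"},
        )
        resp.raise_for_status()
        return resp.json()

    print(json.dumps(dump_cirrus_query("https://en.wikipedia.org",
                                       "intitle:lighthouse"), indent=2))

The same shape would presumably work for cirrusDumpResult and friends by
swapping the parameter name.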
============
docker setup to initialize elasticsearch, import the latest cirrus dump,
and attach a kibana instance for the UI. Probably with a modified mapping
more amenable to kibana inspection. (A sketch of the import step follows.)
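The import step could look roughly like this sketch, assuming a local
Elasticsearch on port 9200, an index that already exists with a compatible
mapping, and one of the public cirrus dumps from
dumps.wikimedia.org/other/cirrussearch/ (the filename and index name are
placeholders; the dump files are already in Elasticsearch bulk format,
with alternating action and document lines):

    import gzip
    import requests

    DUMP = "simplewiki-cirrussearch-content.json.gz"             # placeholder
    BULK_URL = "http://localhost:9200/simplewiki_content/_bulk"  # placeholder
    BATCH = 1000  # even, so action/document line pairs are never split

    def batches(path, size):
        buf = []
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                buf.append(line)
                if len(buf) >= size:
                    yield "".join(buf)
                    buf = []
        if buf:
            yield "".join(buf)

    for body in batches(DUMP, BATCH):
        resp = requests.post(BULK_URL, data=body.encode("utf-8"),
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()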
============
Some script to manage elasticsearch shard allocation manually via the API?
Pointless, but perhaps fun. (A sketch is below.)
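The manual piece could be as small as this sketch, using Elasticsearch's
cluster reroute API to move a single shard between nodes (the index, shard
number, and node names are placeholders):

    import requests

    # Ask the cluster to move one shard copy from one node to another.
    command = {
        "commands": [{
            "move": {
                "index": "enwiki_content",   # placeholder
                "shard": 0,
                "from_node": "elastic1001",  # placeholder
                "to_node": "elastic1002",    # placeholder
            }
        }]
    }
    resp = requests.post("http://localhost:9200/_cluster/reroute", json=command)
    resp.raise_for_status()
    print(resp.json()["acknowledged"])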
===========
phabricator formatted export for jupyter
- problem: images?
-- seems we would need to upload them separately and then reference them
in the final output
-- there is an API for this, but then we can't just emit something to
paste into a field; the whole export would need to happen over the API
- better, but worse: data URIs would be great, but I don't know if Phab is
built for megabyte-sized posts, and it doesn't support data URIs anyway.
Browsers also hate it when you copy/paste excessive amounts of data.
(A text-only starting point is sketched below.)
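Ignoring images entirely, a text-only export could start from something
like this sketch, which uses nbformat to read a notebook and emits
markdown cells as-is plus code cells wrapped in Remarkup-style
triple-backtick blocks (the filename is a placeholder, and how far plain
markdown diverges from Remarkup is an open question):

    import nbformat

    def to_remarkup(path):
        nb = nbformat.read(path, as_version=4)
        chunks = []
        for cell in nb.cells:
            if cell.cell_type == "markdown":
                chunks.append(cell.source)
            elif cell.cell_type == "code":
                chunks.append("```\n" + cell.source + "\n```")
        return "\n\n".join(chunks)

    print(to_remarkup("analysis.ipynb"))  # placeholder filename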
==========
Custom implementation to find similar images in commons:
-
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.5151&rep=rep1&…
-
http://www.deepideas.net/building-content-based-search-engine-quantifying-s…
- Convert each image into a feature vector
- Use clustering to generate an image signature
- Find k-nearest-neighbors via Earth Mover's Distance (EMD); the pyemd
library can be used
- It's not obvious how the signature + weights get plugged into pyemd
(see the sketch after this list)
- EMD is expensive; no clue how this would scale to millions of images
- This would probably perform poorly, but it's interesting as a way to
understand some of the history of similar-image retrieval
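On the pyemd question: its emd() function takes two fixed-length float64
histograms plus a precomputed ground-distance matrix between bins, so
variable-size signatures would first have to be mapped onto a shared set
of bins. A toy sketch under that assumption, using 16-bin grayscale
histograms:

    import numpy as np
    from pyemd import emd

    def emd_distance(hist_a, hist_b, bin_centers):
        # Normalize the histograms and compute pairwise bin distances.
        hist_a = (hist_a / hist_a.sum()).astype(np.float64)
        hist_b = (hist_b / hist_b.sum()).astype(np.float64)
        dist = np.abs(bin_centers[:, None] - bin_centers[None, :])
        return emd(hist_a, hist_b, dist.astype(np.float64))

    bins = np.linspace(0.0, 1.0, 16)  # toy 16-bin grayscale space
    a, b = np.random.rand(16), np.random.rand(16)
    print(emd_distance(a, b, bins))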
=========
https://github.com/beniz/deepdetect.git ?
- Use a pre-trained ML model to detect objects in images and then label
those objects.
- Can compare the sets of detected objects to find similar images. Could
probably be extended with color information.
- Do we actually have a use case for finding images similar to other
images? Perhaps on upload? (A sketch of querying it is below.)
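For reference, querying a DeepDetect server over its REST API could look
like this sketch, assuming a server is already running locally with an
image classification service created beforehand (the service name, port,
and image URL are placeholders; the exact parameters should be checked
against the DeepDetect docs):

    import requests

    payload = {
        "service": "imageserv",  # placeholder: a previously created service
        "parameters": {"output": {"best": 5}},  # top-5 predicted labels
        "data": ["https://example.org/some-image.jpg"],  # placeholder
    }
    resp = requests.post("http://localhost:8080/predict", json=payload)
    resp.raise_for_status()
    print(resp.json())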
==========
Elasticsearch cluster balance simulator
- Simulate and evaluate how the cluster balancing performs under various
conditions
- No way this could be done in a weekend hackathon. It would probably be
completely wrong as well, simulating some idealized cluster that doesn't
act like ours.
==========
Prototype Lire plugin for elasticsearch
- Lire = Lucene Image REtrieval
- I know nothing about it, other than that it exists
- A plugin already exists that plugs it into Solr, so how hard could it be?
- Maybe try it out standalone with a small test set to see what it does