Discovery June 2018

discovery@lists.wikimedia.org

6 participants
8 discussions

Discovery Weekly Update for the week starting 2018-06-18

by Chris Koerner

Here's the update from the Search Platform team for the week of 2018-06-18. As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* We merged a basic analysis chain for Mirandese, [0] and we should be reindexing the Mirandese Wikipedia in a week or so. See T197890. [1]
* Bosnian, Croatian, and Serbo-Croatian wikis have been reindexed and are now using the new stemmer, which also supports cross-script searching. See T196658. [2]
* Trey did an analysis [3] of an Esperanto stemmer that could be converted for use with Elasticsearch. If you know some Esperanto, join the discussion on the talk page, or on the Esperanto Wikipedia and Wiktionary Village Pumps. [4] [5] See also T197240. [6]

== Did you know? ==
* Esperanto is a constructed language, created in the late 19th century. It is one of the most widely spoken constructed languages (with millions of speakers) and probably the only one with native speakers, who number in the thousands. [7]

[0] https://en.wikipedia.org/wiki/Special:MyLanguage/Mirandese_language
[1] https://phabricator.wikimedia.org/T197890
[2] https://phabricator.wikimedia.org/T196658
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Esperanto_Stemmer_An…
[4] https://eo.wikipedia.org/wiki/Vikipedio:Diskutejo/Diversejo#Need_help_revie…
[5] https://eo.wiktionary.org/wiki/Vikivortaro:Diskutejo#Need_help_reviewing_Es…
[6] https://phabricator.wikimedia.org/T197240
[7] https://en.wikipedia.org/wiki/Special:MyLanguage/Esperanto

---
Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

5 years, 10 months

WDQS timeout and slowdown - Incident report

by Guillaume Lederrey

Hello!

As you might already know, Wikidata Query Service has been misbehaving in the last 24 hours. Our public SPARQL endpoint [1] was slow and throwing timeouts. Sadly, exposing a public SPARQL endpoint is a hard problem and we don't have a final solution to this. Still, we have some improvements. Have a look at the incident report [2] if you want details.

I also started to write a runbook for WDQS [3]. This should be interesting mostly to our SRE team, but feel free to also have a look and suggest improvements / clarifications.

Note that our internal WDQS endpoint was stable during that time (as expected).

Thanks for your help and your patience!

Guillaume

[1] https://query.wikidata.org/
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180625-wdqs
[3] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+2 / CEST

5 years, 10 months

Discovery Weekly Update for the week starting 2018-06-11

by Chris Koerner

Hi,

Here's the weekly update from the Search Platform team. As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* Trey completed a technical review of the available Estonian morphological library with help from Guillaume and David. Unfortunately, it's not usable, and the stemming algorithm is not easily ported. See T178928. [0]
* Trey did an analysis [1] of the effect of using the Elasticsearch Indonesian analysis chain on Malay-language data. (See Wikipedia [2] for details on Malay and Indonesian.) Next step is getting speaker review of the stemming quality, then hopefully on to reindexing wikis in both Malay and Indonesian.
* Trey did a write-up about the weirdness that comes from searching for single punctuation characters without good redirect support [3] to explain why searching for a hyphen on Farsi Wikipedia redirects you to the article on the apostrophe. See also T196826. [4]
* Erik and David looked at adding a 'type' field to store the same information as was in the es5 types in the metastore [5]
* David worked on investigating (and implementing) how the prefix keyword should augment and not override the list of requested namespaces [6]
* Trey got the feedback he needed to go ahead and create and merge Croatian, Serbo-Croatian, and Bosnian Analysis Chains Using Serbian Morphological Libraries [7]
* Gehel found that when we freeze writes to Elasticsearch, jobs pile up in the job queue, and we needed an alert to tell us that the writes aren't getting thawed in a timely manner [8]
* Trey worked on moving Serbian language wikis from the extra-analysis plugin to the extra-analysis-serbian plugin (it went into production a week ago with the re-indexing) [9]
* Erik and Gehel resolved current deprecation warnings in Elasticsearch 5 [10]
* David worked on adding support for boosting keywords [11] and adding support for filtering keywords (FilterQueryFeature) [12]
* Erik did quite a bit of research on how to ensure that the regex highlighting doesn't always time out as expected because @ apparently matches "any string" in the Lucene regex syntax; Trey helped with the analysis and it got pushed into production in early June [13]
* Stas added lemma & form representation texts to the fulltext search index, which allows (very primitive) fulltext search for Lexemes [14]. Better search coming soon!

== Other Noteworthy Stuff ==
* Wikidata Quality Constraints violations can now be exported into RDF. [15] Loading to Wikidata Query Service coming soon. [16]

[0] https://phabricator.wikimedia.org/T178928#4267448
[1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Analysis_of_Applying…
[2] https://en.wikipedia.org/wiki/Comparison_of_Standard_Malay_and_Indonesian
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Searching_for_Punctu…
[4] https://phabricator.wikimedia.org/T196826
[5] https://phabricator.wikimedia.org/T192615
[6] https://phabricator.wikimedia.org/T195815
[7] https://phabricator.wikimedia.org/T192395
[8] https://phabricator.wikimedia.org/T193605
[9] https://phabricator.wikimedia.org/T193734
[10] https://phabricator.wikimedia.org/T192614
[11] https://phabricator.wikimedia.org/T195305
[12] https://phabricator.wikimedia.org/T195788
[13] https://phabricator.wikimedia.org/T195491
[14] https://phabricator.wikimedia.org/T195912
[15] https://www.wikidata.org/wiki/Q42?action=constraintsrdf
[16] https://phabricator.wikimedia.org/T172380

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

5 years, 10 months

Discovery Weekly Update for the week starting 2018-06-04

by Chris Koerner

Howdy,

Here's the weekly update from the Search Platform team. As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* After lots of talk about stemmers getting committed and plugins getting deployed, the Slovak-language wikis have finally been *reindexed*, and stemming [0] is now happening on the Slovak wikis! [1]

=== Search: Time Machine Edition ===
A few things from May that got missed:
* Trey wrote up some potential applications of natural language processing (NLP) to on-wiki search [2]. We're still going through them to pick out a couple that we'll turn into projects, probably next quarter. Right now, spelling correction and entity extraction are high on the list, but more questions, comments, and suggestions are welcome.
* Erik pulled 90 days' worth of regular expression (regex) searches across all wikis, and Trey did a quick survey of the most common patterns. [3] There are a lot more regex searches than we thought (5.6 million in 90 days!), and three apparently automated processes (bots, apps, or tools of some kind) are responsible for more than 90% of the regex searches.

[0] https://en.wikipedia.org/wiki/Stemming
[1] https://phabricator.wikimedia.org/T190815
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applicatio…
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Regular_Ex…

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

5 years, 10 months

Presentation on UX design, mental models, and behavioral vs. survey data

by Pine W

I thought that this video, published in May 2018, was somewhat interesting, and I am sharing it in case others are also interested.

The presenter uses a 2010 change to the design of Wikipedia's front page search box (see https://blog.wikimedia.org/2010/06/15/usability-why-did-we-move-the-search-…) as an example, though I would hope that the lesson from this video isn't that it's okay to frequently disrupt the workflows of existing users with design changes regardless of the number of complaints from existing users.

The main points that I drew from this presentation are that interfaces should be intuitive and should have relatively light cognitive load. Those points may sound obvious to experienced UX designers, but may be of interest to people whose areas of expertise are in other domains. I also appreciated that the presenter shared an example of a situation in which people said one thing in surveys but behaved in the opposite way in practice.

Here is the link to the video: https://www.youtube.com/watch?v=mxzK4sWfvH8

Regards,
Pine
( https://meta.wikimedia.org/wiki/User:Pine )

5 years, 10 months

From Wikimedia-l: Most wanted articles across languages

by Chris Koerner

Not directly a search feature in the sense of our normal queries, indexes, and results pages, but I thought this research by Amir Elisha Aharoni is worth mentioning on this list. Amir has logged the queries people search for via the Compact Language Links search box. He then provides a list of "most wanted" articles.

From Amir, "This is a report of the articles that people most often try to find in a different language, and cannot find. This is done by logging the searches in the Compact Language Links' language search box that don't yield any results. For example, if somebody goes to the English Wikipedia article en:Newspaper, searches for "telugu", and this article doesn't exist in the Telugu Wikipedia, this is logged and counted here."

https://lists.wikimedia.org/pipermail/wikimedia-l/2018-May/090376.html
https://meta.wikimedia.org/wiki/Most_wanted_articles_across_languages

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

5 years, 10 months

Survey of Regular Expression Searches

by Trey Jones

Hey everyone,

As part of T195491 <https://phabricator.wikimedia.org/T195491>, Erik has been looking into the details of our regex processing and ways to handle ridiculously long-running regex queries. He pulled all the regex queries over the last 90 days to get a sense of what features people are using and what impact certain changes he was considering would have on users. Turns out there are a lot more users than I would have thought, which is good news! And a lot of them look like bots.

He also made the mistake of pointing me to the data and highlighting a common pattern: searches for interwiki links. I couldn't help myself. I started digging around and found that the majority of the searches are looking for those interwiki links, and the vast majority of regex searches fall into three types: interwiki links, URLs, and Library of Congress collection IDs.

Overall, there are 5,613,506 regexes total across all projects and all languages, over a 90-day period. That comes out to ~62K/day, which is a lot more than I'd expected, though I hadn't thought about bots using regexes.

Read more on MediaWiki <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Regular_Ex…>.

Trey

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
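As a quick check of the daily rate quoted in the message above, the arithmetic can be sketched in a few lines of Python. The two constants come straight from the message; the function name `per_day` is just an illustrative helper, not anything from the survey itself.

```python
# Figures quoted in the message: total regex searches and the window length.
TOTAL_REGEX_QUERIES = 5_613_506
WINDOW_DAYS = 90


def per_day(total: int, days: int) -> float:
    """Average number of events per day over a window of `days` days."""
    return total / days


rate = per_day(TOTAL_REGEX_QUERIES, WINDOW_DAYS)
print(f"{rate:,.0f} regex searches per day")
```

Dividing the total by the window length gives a little over 62,000 searches per day, which matches the "~62K/day" figure.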

5 years, 10 months

Discovery Weekly Update for the week starting 2018-05-28

by Chris Koerner

Hello,

We're back! Here is our post-Hackathon update this week from the Search Platform team. As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* Erik worked on evaluating and building out features provided by the `query_explorer` functionality of the learning-to-rank plugin; there is lots of good info in the ticket [0]
* David worked on allowing searches with the "all:" keyword to also work on non-English projects, and not only with its translations ("searchall") [1]
* David increased the cirrus indices to have more shards for enwiki_general, viwiki_general and wikidatawiki_content [2]
* David reverted an earlier patch that had deprecated the global namespace handling of the prefix keyword; the new patch was deployed, so prefix and associated InputBox forms should work as before. [3]
* Trey and Gehel deployed the updated search/extra plugin and search/extra-analysis-slovak plugin with the Slovak Stemmer [4]; it will be available after a re-indexing [5]
* Stas enabled deep category support on all wikis (except private) [6]
* Stas reindexed wikidata to enable support for searching for string & external ID property values [7], [8]
* Stas implemented basic text indexing for Lexemes [9]

[0] https://phabricator.wikimedia.org/T187148#4086754
[1] https://phabricator.wikimedia.org/T165110
[2] https://phabricator.wikimedia.org/T192064
[3] https://phabricator.wikimedia.org/T193392
[4] https://phabricator.wikimedia.org/T191543
[5] https://phabricator.wikimedia.org/T191545
[6] https://phabricator.wikimedia.org/T194260
[7] https://phabricator.wikimedia.org/T163642
[8] https://phabricator.wikimedia.org/T99899
[9] https://phabricator.wikimedia.org/T195912

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

5 years, 10 months
