Discovery

discovery@lists.wikimedia.org

Language Analysis Glossary
by Trey Jones 19 Jul '18

Happy Friday the 13th,* everyone!

I realized that I've gotten more lax about defining terms in my write-ups over time, because I am always talking about types (pre- and post-analysis) and tokens and monolithic and unpacked analyzers, etc., etc. So, I reorganized the Search Glossary a bit into topical sections, and added a big section on Language Analysis, which I will point to in my write-ups.

Please review the new section if you have time and interest. If you know this stuff, please correct any errors. If you don't know this stuff, please ask questions about anything that's unclear so I can improve it. Thanks!

The new section of the glossary is under "Language Analysis <https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary#Language…>".

Cheers,
—Trey

*Word of the day: friggatriskaidekaphobia <https://en.wiktionary.org/wiki/friggatriskaidekaphobia>.

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
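
For readers new to the terms in this message, here is a minimal sketch of the difference between tokens, pre-analysis types, and post-analysis types. It assumes the usual meanings (tokens are word instances, types are distinct words, and analysis is whatever normalization the search engine applies). The `toy_analyze` function is a hypothetical stand-in, not one of the analysis chains actually used on the wikis.

```python
import re


def tokenize(text):
    """Split text into word tokens (a toy tokenizer, not a real analysis chain)."""
    return re.findall(r"\w+", text)


def toy_analyze(token):
    """Hypothetical stand-in for an analyzer: lowercase and strip a final 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s"):
        token = token[:-1]
    return token


text = "Dogs chase dogs; the dog chases cats and cats"

tokens = tokenize(text)                          # every word instance
pre_types = set(tokens)                          # distinct surface forms
post_types = {toy_analyze(t) for t in tokens}    # distinct analyzed forms

print(len(tokens), "tokens")
print(len(pre_types), "pre-analysis types:", sorted(pre_types))
print(len(post_types), "post-analysis types:", sorted(post_types))
```

The point of the sketch is that analysis merges several pre-analysis types into one post-analysis type, which is why the two counts are easy to conflate when reporting results.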

Re: [discovery] NLP for on-wiki search
by Trey Jones 19 Jul '18

Hi everyone,

I've got an update on the NLP project selection. We've narrowed things down to a handful of projects we could work on with a consultant, and a handful we could work on internally.

David, Erik, and I reviewed a selection of the most promising-seeming and/or most interesting projects and gave them a very rough cost estimate based on how big of a relative impact they would have, technologically how hard they would be, and how difficult the UI aspect would be. The scores are not definitive, but helped guide the discussion. You can see the list of projects we looked at and more details of the scoring on MediaWiki <https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Potenti…>.

For the possibility of working with an outside consultant, we also considered how easily separated each project would be from our overall system (making it easier for someone new to get up to speed), how projects feed into each other, how easily we could work on projects ourselves (like, we know pretty much what to do, we just have to do it), etc.

Our current *recommendation for an outside consultant* would be to start with (1) *spelling correction/did you mean improvements,* with an option to extend the project to include either (2) *"more like" suggestion improvements,* or (3) *query reformulation mining,* specifically for typo corrections.

For spelling correction (#1), we are envisioning an approach that integrates generic intra-word and inter-word statistical models, optional language-specific features, and explicit weighted corrections. We believe we could mine redirects flagged as typo correction for explicit corrections, and the query reformulation mining (#3) would also provide frequency-weighted explicit corrections.

Our hope is that a system built initially for English would be readily applicable to other alphabetic languages, most probably other Indo-European languages, based on statistics available from Elastic; and that some elements of the system could be applied to other non-alphabetic languages and languages that are typologically <https://en.wikipedia.org/wiki/Morphological_typology> dissimilar to Indo-European languages.

Looking at the rest of the list, (a) *wrong keyboard detection* seems like something we should work on internally, since we already have a few good ideas on how to approach it. (b) *Acronym support* is a pet peeve for several members of the team, and seems to be straightforward to improve. (c) *Automatic stemmer building* and (d) *automatic stop word* generation aren't so much projects we should work on as things we should research to see if there are already tools or lists out there we could use to make the projects much easier.

Comments and questions here or on the talk page are welcome.

Cheers,
—Trey

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation

On Tue, May 15, 2018 at 11:30 AM, Trey Jones <tjones(a)wikimedia.org> wrote:

> Hi everyone,
>
> I just finished putting together an annotated list of potential
> applications of natural language processing to on-wiki search
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applicatio…>.
> There are dozens and dozens of ideas there—including many that are
> interesting but probably not practical. If you have any additional ideas,
> questions, suggestions, recommendations, or preferences, please
> share!—either on the mailing list or on the talk page.
>
> The goal is to narrow it down to one or two things to pursue over the next
> two to four quarters, along with other projects we are working on.
>
> Thanks!
> —Trey
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
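
The "explicit weighted corrections" idea mentioned for spelling correction (#1) can be illustrated with a small sketch. This is only an assumption about what such a table might look like, not the design the team settled on: the function names and the example pairs and counts are invented for illustration, and the statistical intra-word and inter-word models are not covered here.

```python
from collections import Counter, defaultdict


def build_explicit_corrections(pairs):
    """Aggregate (typo, correction, count) triples into a weighted lookup table.

    The triples stand in for data that might be mined from typo redirects or
    query reformulations; the ones used below are made up.
    """
    table = defaultdict(Counter)
    for typo, correction, count in pairs:
        table[typo][correction] += count
    return table


def suggest(query, table):
    """Return the highest-weighted explicit correction, or None if unknown."""
    candidates = table.get(query)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Invented example data: (typo, correction, observed frequency).
example_pairs = [
    ("recieve", "receive", 120),
    ("recieve", "relieve", 3),
    ("teh", "the", 500),
]

corrections = build_explicit_corrections(example_pairs)
print(suggest("recieve", corrections))
print(suggest("wikipedia", corrections))
```

In a fuller system, a miss in this table would presumably fall through to the generic statistical models rather than returning nothing.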

Discovery Weekly Update for the week starting 2018-07-02
by Chris Koerner 14 Jul '18

Hello again,

This is the weekly update from the Search Platform team for the weeks starting 2018-07-02 and 2018-07-09.

Programming Note: With the Wikimania Hackathon, Wikimania proper, and resulting travel for folks in coming weeks, the next update will be for the week starting 2018-07-30.

As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* David and Stas worked on fine tuning of search configs to mediawiki-config for Wikidata [0]
* Stas and Addshore helped to catch and clean up some bad lookups and report them properly [1]
* A "Wrong document type" error was corrected by Erik by fixing Sanitizer MetaStore integration [2]
* Erik worked on tracking queries that run on the Elasticsearch clusters longer than both server-side and client-side timeouts by fixing some slow logging functionality [3]
* There was a Meta-wiki error where search suggests a non-existent title due to a namespace/redirect mixup. Erik's note: "it's a bit awkward, but typing Help:Glo into autocomplete on metawiki suggests 'Global Account' from the main namespace, and selecting it takes you to Help:Unified login" [4]
* In order to dispatch queries to a particular search setup (cirrus defaults vs wikibase custom query builder), David created a flexible way to classify queries, meant to replace the 'getSyntaxUsed' approach currently in SearchContext. [5]
* Trey and a community volunteer, Athena, created a basic Mirandese analysis chain. It was tested on RelForge and pushed into production this week [6]. Trey kicked off, completed, and tested the re-indexing of the Mirandese Wikis [7].
* The Re-Re-Index of the Serbian Wikis after refactored plugins were deployed has been completed [8] and the re-index of the Croatian, Serbo-Croatian, and Bosnian Wikis was also done [9]
* We currently mix a tiny number of namespace documents into the regular indices, which seems inefficient; so Erik built a unified namespace index [10]
* Erik updated the 'OtherIndex' to operate on a cluster other than the one holding the wiki [11]
* Trey updated a variety of things on the Analysis Tools with lots of little fixes and improvements, and fixed a few small errors in the analysis code that conflated post-analysis types and pre-analysis types [12]

== Did you know? ==
* The period of this status update includes Friday, July 13, 2018. The fear of the number thirteen is called "triskaidekaphobia" [13]. There are two words for fear of Friday the 13th: "paraskavedekatriaphobia" [14] and "friggatriskaidekaphobia" [15]—the first maintains a consistent etymology with the Greek word for Friday, "Paraskeví", while the second invokes "Frigg", the Norse Goddess after whom Friday is named in English.

[0] https://phabricator.wikimedia.org/T182717
[1] https://phabricator.wikimedia.org/T198091
[2] https://phabricator.wikimedia.org/T197446
[3] https://phabricator.wikimedia.org/T196180
[4] https://phabricator.wikimedia.org/T115756
[5] https://phabricator.wikimedia.org/T197774
[6] https://phabricator.wikimedia.org/T194941
[7] https://phabricator.wikimedia.org/T197890
[8] https://phabricator.wikimedia.org/T196404
[9] https://phabricator.wikimedia.org/T196658
[10] https://phabricator.wikimedia.org/T192699
[11] https://phabricator.wikimedia.org/T194678
[12] https://phabricator.wikimedia.org/T199273
[13] https://en.wiktionary.org/wiki/triskaidekaphobia
[14] https://en.wiktionary.org/wiki/paraskavedekatriaphobia
[15] https://en.wiktionary.org/wiki/friggatriskaidekaphobia

---
Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Relations Specialist
Wikimedia Foundation

Fwd: [MediaWiki-l] New episode of "Between the Brackets": Stas Malyshev
by Deborah Tankersley 10 Jul '18

Cross-posting...!

--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation

---------- Forwarded message ---------
From: Yaron Koren <yaron(a)wikiworks.com>
Date: Tue, Jul 10, 2018 at 11:01 AM
Subject: [MediaWiki-l] New episode of "Between the Brackets": Stas Malyshev
To: MediaWiki announcements and site admin list <mediawiki-l(a)lists.wikimedia.org>

Hi,

A new episode of the MediaWiki podcast "Between the Brackets" has been released, featuring an interview with Wikimedia Foundation developer Stas Malyshev, who works on search in both MediaWiki and Wikidata. You can listen to the interview here:

http://betweenthebrackets.libsyn.com/episode-12-stas-malyshev

-Yaron

_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Discovery Weekly Update for the week starting 2018-06-25
by Chris Koerner 03 Jul '18

Hello,

Here's the update from the Search Platform team for the week of 2018-06-25.

As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* Stas has added Lexemes to the list of namespaces for instant indexing on Wikidata [0].
* David has merged refactoring of Wikidata search configs [1].

=== Wikidata Query Service ===
* Stas has fixed several bugs in WDQS: global limit on MWAPI [2], correct reporting of results order [3] and "q" option in regexps [4].

[0] https://phabricator.wikimedia.org/T196896
[1] https://phabricator.wikimedia.org/T182717
[2] https://phabricator.wikimedia.org/T197495
[3] https://phabricator.wikimedia.org/T197496
[4] https://phabricator.wikimedia.org/T197566

---
Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Relations Specialist
Wikimedia Foundation

Discovery Weekly Update for the week starting 2018-06-18
by Chris Koerner 26 Jun '18

Here's the update from the Search Platform team for the week of 2018-06-18.

As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* We merged a basic analysis chain for Mirandese, [0] and we should be reindexing the Mirandese Wikipedia in a week or so. See T197890. [1]
* Bosnian, Croatian, and Serbo-Croatian wikis have been reindexed and are now using the new stemmer, which also supports cross-script searching. See T196658. [2]
* Trey did an analysis [3] of an Esperanto stemmer that could be converted for use with Elasticsearch. If you know some Esperanto, join the discussion on the talk page, or on the Esperanto Wikipedia and Wiktionary Village Pumps. [4] [5] See also T197240. [6]

== Did you know? ==
* Esperanto is a constructed language, created in the late 19th century. It is one of the most widely spoken constructed languages—with millions of speakers—and probably the only one with native speakers, who number in the thousands. [7]

[0] https://en.wikipedia.org/wiki/Special:MyLanguage/Mirandese_language
[1] https://phabricator.wikimedia.org/T197890
[2] https://phabricator.wikimedia.org/T196658
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Esperanto_Stemmer_An…
[4] https://eo.wikipedia.org/wiki/Vikipedio:Diskutejo/Diversejo#Need_help_revie…
[5] https://eo.wiktionary.org/wiki/Vikivortaro:Diskutejo#Need_help_reviewing_Es…
[6] https://phabricator.wikimedia.org/T197240
[7] https://en.wikipedia.org/wiki/Special:MyLanguage/Esperanto

---
Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

WDQS timeout and slowdown - Incident report
by Guillaume Lederrey 25 Jun '18

Hello!

As you might already know, Wikidata Query Service has been misbehaving in the last 24 hours. Our public SPARQL endpoint [1] was slow and throwing timeouts. Sadly, exposing a public SPARQL endpoint is a hard problem and we don't have a final solution to this. Still, we have some improvements. Have a look at the incident report [2] if you want details.

I also started to write a runbook for WDQS [3]. This should be interesting mostly to our SRE team, but feel free to also have a look and suggest improvements / clarifications.

Note that our internal WDQS endpoint was stable during that time (as expected).

Thanks for your help and your patience!

Guillaume

[1] https://query.wikidata.org/
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180625-wdqs
[3] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+2 / CEST

Discovery Weekly Update for the week starting 2018-06-11
by Chris Koerner 20 Jun '18

Hi,

Here's the weekly update from the Search Platform team.

As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* Trey completed a technical review of the available Estonian morphological library with help from Guillaume and David, and unfortunately it's not usable, and the stemming algorithm is not easily ported. See T178928. [0]
* Trey did an analysis [1] of the effect of using the Elasticsearch Indonesian analysis chain on Malay-language data. (See Wikipedia [2] for details on Malay and Indonesian.) Next step is getting speaker review of the stemming quality, then hopefully on to reindexing wikis in both Malay and Indonesian.
* Trey did a write-up about the weirdness that comes from searching for single punctuation characters without good redirect support [3] to explain why searching for a hyphen on Farsi Wikipedia redirects you to the article on the apostrophe. See also T196826. [4]
* Erik and David looked at adding a 'type' field to store the same information as was in es5 types in metastore [5]
* David did work on investigating (and implementing) how the prefix keyword should augment and not override the list of requested namespaces [6]
* Trey got the feedback he needed to go ahead and create and merge Croatian, Serbo-Croatian, and Bosnian Analysis Chains Using Serbian Morphological Libraries [7]
* Gehel found that when we freeze writes to Elasticsearch, jobs pile up in the job queue and we needed an alert to tell us that the writes aren't getting thawed in a timely manner [8]
* Trey worked on moving Serbian language wikis from extra-analysis to extra-analysis-serbian plugin (it went into production a week ago with the re-indexing) [9]
* Erik and Gehel resolved current deprecation warnings in Elasticsearch 5 [10]
* David worked on adding support for boosting keywords [11] and adding support for Filtering keyword (FilterQueryFeature) [12]
* Erik did quite a bit of research on how to ensure that the regex highlighting doesn't always time out as expected because @ apparently matches "any string" in the Lucene regex syntax; Trey helped with the analysis and it got pushed into production in early June [13]
* Stas added lemma & form representation texts to fulltext search index, which allows (very primitive) fulltext search for Lexemes [14]. Better search coming soon!

== Other Noteworthy Stuff ==
* Wikidata Quality Constraints violations can now be exported into RDF. [15] Loading to Wikidata Query Service coming soon. [16]

[0] https://phabricator.wikimedia.org/T178928#4267448
[1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Analysis_of_Applying…
[2] https://en.wikipedia.org/wiki/Comparison_of_Standard_Malay_and_Indonesian
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Searching_for_Punctu…
[4] https://phabricator.wikimedia.org/T196826
[5] https://phabricator.wikimedia.org/T192615
[6] https://phabricator.wikimedia.org/T195815
[7] https://phabricator.wikimedia.org/T192395
[8] https://phabricator.wikimedia.org/T193605
[9] https://phabricator.wikimedia.org/T193734
[10] https://phabricator.wikimedia.org/T192614
[11] https://phabricator.wikimedia.org/T195305
[12] https://phabricator.wikimedia.org/T195788
[13] https://phabricator.wikimedia.org/T195491
[14] https://phabricator.wikimedia.org/T195912
[15] https://www.wikidata.org/wiki/Q42?action=constraintsrdf
[16] https://phabricator.wikimedia.org/T172380

---
Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation
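
The update above notes that `@` matches "any string" in the Lucene regex syntax. As a rough illustration of why that matters for user-supplied patterns, here is a minimal sketch that backslash-escapes bare `@` characters so they are treated literally. The helper name `escape_at_sign` is hypothetical, and this is not the fix that was deployed; it only assumes that a backslash makes the following character literal in that syntax.

```python
def escape_at_sign(pattern):
    """Backslash-escape every '@' that is not already escaped.

    Walks the pattern left to right, copying existing backslash escapes
    through unchanged so that an already-escaped '@' is not escaped twice.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an existing escape sequence (backslash plus next char) as-is.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "@":
            out.append("\\@")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(escape_at_sign("user@example"))
print(escape_at_sign(r"user\@example"))
```

Other optional operators in that syntax (such as `#`, `&`, `~`, and `<>`) would need similar care; they are left out here to keep the sketch short.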

Discovery Weekly Update for the week starting 2018-06-04
by Chris Koerner 11 Jun '18

Howdy,

Here's the weekly update from the Search Platform team.

As always, feedback and questions welcome.

== Discussions ==

=== Search ===
* After lots of talk about stemmers getting committed and plugins getting deployed, the Slovak-language wikis have finally been *reindexed*, and stemming [0] is now happening on the Slovak wikis! [1]

=== Search—Time Machine Edition ===
A few things from May that got missed:
* Trey wrote up some potential applications of natural language processing (NLP) to on-wiki search [2]. We're still going through them to pick out a couple that we'll turn into projects, probably next quarter. Right now, spelling correction and entity extraction are high on the list, but more questions, comments, and suggestions are welcome.
* Erik pulled 90 days worth of regular expression (regex) searches across all wikis, and Trey did a quick survey of the most common patterns. [3] There are a lot more regex searches than we thought—5.6 million in 90 days!—and three apparently automated processes (bots, apps, or tools of some kind) are responsible for more than 90% of the regex searches.

[0] https://en.wikipedia.org/wiki/Stemming
[1] https://phabricator.wikimedia.org/T190815
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Potential_Applicatio…
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Regular_Ex…

---
Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

Presentation on UX design, mental models, and behavioral vs. survey data
by Pine W 10 Jun '18

I thought that this video, published in May 2018, was somewhat interesting and I am sharing it in case others are also interested. The presenter uses a change of design of Wikipedia's front page search box from 2010 (see https://blog.wikimedia.org/2010/06/15/usability-why-did-we-move-the-search-…) as an example, though I would hope that the lesson from this video isn't that it's okay to frequently disrupt the workflows of existing users with design changes regardless of the number of complaints from existing users.

The main points that I drew from this presentation are that interfaces should be intuitive and should have relatively light cognitive load. Those points may sound obvious to experienced UX designers, but may be of interest to people whose areas of expertise are in other domains. I also appreciated that the presenter shared an example of a situation in which people said one thing in surveys but behaved in the opposite way in practice.

Here is the link to the video: https://www.youtube.com/watch?v=mxzK4sWfvH8

Regards,
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
