On Wed, Jan 25, 2017 at 10:15 AM, Brad Jorsch (Anomie) <
> On Wed, Jan 25, 2017 at 2:09 AM, <byeh(a)yahoo-inc.com> wrote:
>> While I was developing some services based on API:Opensearch, I found
>> that the response to the same URL request can be either Simplified Chinese
>> or Traditional Chinese. To be more specific, I would love to know how I
>> can determine the response language variant from the API layer (or what
>> other factors may have an impact), since the documentation of
>> API:Opensearch doesn't seem to take language into consideration.
> The OpenSearch Suggestions extension specification does not allow for
> returning additional metadata, such as language, with the response. You may
> want to look at the prefixsearch query module instead, which returns the
> same results in a different format, although I don't know the details of
> how language variants are handled in the search output.
> : http://www.opensearch.org/Specifications/OpenSearch/
> : https://www.mediawiki.org/wiki/API:Prefixsearch
> Brad Jorsch (Anomie)
> Senior Software Engineer
> Wikimedia Foundation
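As an illustration of the difference between the two modules, here is a
minimal sketch against zh.wikipedia.org (the endpoint and query are just
examples; variant handling itself is wiki-dependent, as noted above):

```python
# Minimal sketch: the same prefix query via both API modules.
import requests

API = "https://zh.wikipedia.org/w/api.php"

# OpenSearch: a flat suggestion list with no per-result metadata
# such as the language variant used.
r = requests.get(API, params={
    "action": "opensearch",
    "search": "维基",
    "format": "json",
})
print(r.json()[1])  # response is [query, [titles], [descriptions], [urls]]

# prefixsearch: the same matches in the richer query-module envelope.
r = requests.get(API, params={
    "action": "query",
    "list": "prefixsearch",
    "pssearch": "维基",
    "format": "json",
})
for hit in r.json()["query"]["prefixsearch"]:
    print(hit["pageid"], hit["title"])
```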
The Interactive Team in Discovery is in the process of putting its work on
pause. The team's aim during this period is to get its work to a stable and
maintainable state. Currently, work on new features is on hold. It is not
yet known what the
timeline is for this transition to a paused state, or whether there will be
further deployments of features that have already been completed. I will
update this list when there is more information.
Lead Product Manager, Discovery
What started out as an attempt to derive useful confidence measures for
language identification (with TextCat
<https://www.mediawiki.org/wiki/TextCat>) turned into a generalized
improvement effort. We still don't have useful external confidence
measures—though there's a little work yet to be done there (T149323
<https://phabricator.wikimedia.org/T155670>). However, I did get a sizable
improvement to the F0.5 <https://en.wikipedia.org/wiki/F1_score> accuracy
scores by improving TextCat internals that don't really generalize to
externally useful measures. The result was a mean improvement of just under
5% across the corpora from nine Wikipedias. The two worst performing
corpora, enwiki and nlwiki, each went up around 10%! All nine are now above
90% F0.5 score.
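For reference, F0.5 is the F-beta score with beta = 0.5, which weights
precision more heavily than recall. A quick sketch (the example numbers are
illustrative, not from our corpora):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# e.g. a language identifier with precision 0.95 and recall 0.80:
print(round(f_beta(0.95, 0.80), 3))  # 0.916
```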
You can read the final summary and recommendations
or read the rest of the page, too, if you want to know more about the whole
odyssey, or if you have trouble sleeping. ;)
Next steps for language identification are to get these changes deployed,
and then to look at other measures of confidence, and/or extend language
identification to more wikis, though the latter two may take a backseat to
working on new and improved language analyzers
<https://phabricator.wikimedia.org/T154511> for the rest of this quarter.
Software Engineer, Discovery
As we keep coming up with more ways to try to rescue unsuccessful
queries—"Did you mean" suggestions, language detection, quote stripping,
wrong keyboard detection, etc.—we need a plan for how they interact with
each other.
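To make the kind of interaction in question concrete, here is a minimal
sketch (the names are hypothetical, not the actual CirrusSearch code) of one
possible policy: an ordered fallback chain that stops at the first rewrite
producing results:

```python
# Hypothetical ordered fallback chain: each rescue method runs only if
# the original query and all earlier rewrites produced no results.
def search_with_fallbacks(query, search, rewrites):
    results = search(query)
    if results:
        return query, results
    # e.g. rewrites = [strip_quotes, fix_keyboard, detect_language]
    for rewrite in rewrites:
        new_query = rewrite(query)
        if new_query and new_query != query:
            results = search(new_query)
            if results:
                return new_query, results
    return query, []
```

Whether the chain should stop at the first hit, or try several rewrites and
merge, is exactly the sort of question a coordinated plan needs to settle.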
I've put together a straw man proposal for how to deal with all of this, so
we can have a more co-ordinated conversation:
Comments and questions here or on the talk page are welcome!
Software Engineer, Discovery
tl;dr: Can feature vectors describing the relevance of (query, page_id)
pairs be released to the public if the final dataset only represents queries
with clicks from at least 50 unique sessions?
Over the past 2 months I've been spending free time working on
investigating machine learning for ranking. One of the earlier things I
tried, to get some semblance of proof it had the ability to improve our
search results, was to port a set of features for text ranking from an open
source Kaggle competitor to a dataset I could create from our own data. For
relevance targets I took queries that had clicks from at least 50 unique
sessions over a 60 day period and ran them through a click model (DBN).
Perhaps not as useful as human judgements, but I'm working with what I have.
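As a rough sketch of that query-selection step (the column and file names
here are hypothetical; the real pipeline differs):

```python
# Keep only queries clicked in >= 50 unique sessions over the window,
# assuming one row per (session_id, query, page_id) click event.
import pandas as pd

clicks = pd.read_parquet("click_log_60d.parquet")  # hypothetical source
sessions_per_query = clicks.groupby("query")["session_id"].nunique()
popular = sessions_per_query[sessions_per_query >= 50].index
training_log = clicks[clicks["query"].isin(popular)]
# training_log then feeds a click model such as a DBN, which produces
# the relevance label for each (query, page_id) pair.
```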
This actually showed some promise, and I've been moving further along. An
idea was suggested to me, though, about releasing the feature vectors from
my initial investigation in an open format that might be useful to others.
Each feature vector is for a (query, hit_page_id) pair that was displayed to
at least 50 users.
I don't have my original data, but I have all the code, and I just ran
through it with 100 normalized queries to get a count: there are 4852
features. Lots of them are probably useless, but choosing which ones is
probably half the battle. These are ~230MB in pickle format, which stores
the floats in binary. This can then be compressed to ~20MB with gzip, so the
data size isn't particularly insane. In a released dataset I would probably
use 10k normalized queries, meaning about 100x this size. We could plausibly
release CSVs instead of pickled numpy arrays. That will probably increase
the data size further, but since we are only talking ~2GB after compression
it could go either way.
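For a feel for the format tradeoff, a minimal sketch (the placeholder matrix
is random, so the sizes it prints won't match the ~230MB/~20MB figures
above, which come from the real, more compressible data):

```python
# Compare on-disk formats for a feature matrix of 4852 columns.
import gzip
import pickle
import numpy as np

vectors = np.random.rand(1000, 4852)  # placeholder feature matrix

raw = pickle.dumps(vectors, protocol=pickle.HIGHEST_PROTOCOL)
print(len(raw) / 1e6, "MB pickled")

compressed = gzip.compress(raw)
print(len(compressed) / 1e6, "MB pickled+gzipped")

# CSV alternative: larger, but readable outside Python.
np.savetxt("vectors.csv.gz", vectors, delimiter=",")
```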
The list of feature names is in https://phabricator.wikimedia.org/P4677.
A few example features and their meanings, which hopefully are enough to
understand the rest of the feature names:
- Dice distance of bigrams in the normalized (stemmed) query string versus
outgoing links. Outgoing links are an array field, so the Dice distance is
calculated per item and this feature takes the max value (see the sketch
after this list).
- Number of digits in the raw user query
- Cosine similarity of the top 50 terms, as reported by the elasticsearch
termvectors API, of the normalized query vs the category.plain field of the
matching document. More terms would perhaps have been nice, but doing this
all offline in python made that a bit of a time+space tradeoff.
- Log base 10 of the score from the elasticsearch termvectors API on the
raw user query applied to the opening_text field analysis chain.
- Mean longest match, in number of characters, of the query vs the list of
headings for the page.
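A minimal sketch of the bigram Dice distance from the first example above
(word bigrams here; the real feature may use character bigrams):

```python
# Dice distance = 1 - Dice coefficient over the sets of bigrams.
def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def dice_distance(a_tokens, b_tokens):
    a, b = bigrams(a_tokens), bigrams(b_tokens)
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))

# Per the feature description: computed per outgoing link, max value kept.
query = "world cup 2014".split()
links = ["2014 fifa world cup".split(), "football".split()]
feature = max(dice_distance(query, link) for link in links)
```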
The main question here, I think, is: is this still PII? The exact queries
would be normalized into IDs and not released. We could leave the page_id
in or out of the dataset. With it left in, people using the dataset could
plausibly come up with their own query-independent features to add. With a
large enough feature vector for (query_id, page_id) the query could
theoretically be reverse engineered, but from a more practical side I'm not
sure that's really a valid concern.
Thoughts? Concerns? Questions?
As part of our goals for Q3 FY 2016-17
(Jan - Mar 2017), the Search Team will be researching, testing, and
deploying new language analysers.
Language analysers are features in Elasticsearch that analyse and alter
queries to give users better results. Language analysers perform important
functions such as tokenisation
<https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)>, and can
also alter queries with language-specific features, such as:
- The English analyser would make the query "john's" also search for
"john".
- The German analyser would make the query "äußerst" also search for
"ausserst".
These alterations to users' queries improve the relevance of the results
compared to not analysing the queries, because they can add extra documents
that may be relevant into the results. Elastic has a bunch of documentation
if you want to read more about what the language analysers do.
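A minimal sketch of inspecting this behaviour with Elasticsearch's _analyze
API (assuming a local 5.x cluster; the exact token output depends on the
analyser version):

```python
import requests

def analyze(analyzer, text, host="http://localhost:9200"):
    """Return the tokens an Elasticsearch analyser produces for `text`."""
    resp = requests.get(host + "/_analyze",
                        json={"analyzer": analyzer, "text": text})
    return [t["token"] for t in resp.json()["tokens"]]

print(analyze("english", "john's"))  # expected: ['john']
print(analyze("german", "äußerst"))  # expected: ['ausserst']
```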
Some of the criteria we'll be using to evaluate the new analysers are:
- how much better we expect the analyser to be than the one we have
- the maturity and maintainability of the code of the analyser
- flexibility of customisation of the plugin
We'll be testing using our standard search metrics, such as the zero
results rate.
We'll be starting with Polish, since we already have some ideas for
possible new plugins, and that'll allow us to more precisely figure out
what criteria we want to use when evaluating the plugin.
As always, if there are any questions, please let me know!
Lead Product Manager, Discovery