Hi everyone,
As we keep coming up with more ways to try to rescue unsuccessful
queries ("Did you mean" suggestions, language detection, quote stripping,
wrong keyboard detection, etc.), we need a plan for how they interact
with each other.
I've put together a straw-man proposal for how to deal with all of this,
so we can have a more coordinated conversation:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/So_Many_Search_Optio…
Comments and questions here or on the talk page are welcome!
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
tl;dr: Can feature vectors about the relevance of (query, page_id) pairs be
released to the public if the final dataset represents queries only by
numeric IDs?
Over the past two months I've been spending free time investigating
machine learning for ranking. One of the earlier things I tried, to get
some semblance of proof that it could improve our search results, was to
port a set of features for text ranking from an open source Kaggle
competitor to a dataset I could create from our own data. For relevance
targets I took queries that had clicks from at least 50 unique sessions
over a 60-day period and ran them through a click model (DBN). Perhaps not
as useful as human judgements, but I'm working with what I have available.
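For concreteness, here's a minimal sketch of that session-threshold
filtering (the DBN click model itself is out of scope here, and the
click-log row layout is hypothetical):

```python
from collections import defaultdict

def filter_queries(click_rows, min_sessions=50):
    """Keep queries whose clicks come from at least min_sessions unique sessions.

    click_rows is assumed to be an iterable of
    (query, session_id, page_id, clicked) tuples from a click log.
    """
    sessions_per_query = defaultdict(set)
    for query, session_id, page_id, clicked in click_rows:
        if clicked:
            sessions_per_query[query].add(session_id)
    return {q for q, sessions in sessions_per_query.items()
            if len(sessions) >= min_sessions}
```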
This actually showed some promise, and I've been moving further along. An
idea was suggested to me, though, about releasing the feature vectors from
my initial investigation in an open format that might be useful to others.
Each feature vector is for a (query, hit_page_id) pair that was displayed
to at least 50 users.
I don't have my original data, but I have all the code and just ran through
it with 100 normalized queries to get a count: there are 4852 features.
Lots of them are probably useless, but choosing which ones is probably half
the battle. These are ~230MB in pickle format, which stores the floats in
binary. That can then be compressed to ~20MB with gzip, so the data size
isn't particularly insane. In a released dataset I would probably use 10k
normalized queries, meaning about 100x this size. We could plausibly
release CSVs instead of pickled NumPy arrays. That would probably increase
the data size further, but since we are only talking ~2GB after
compression, it could go either way.
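As a rough sanity check on those numbers, a sketch of the pickle-plus-gzip
math (real feature matrices contain many repeated and zero values, so they
compress far better than the random data here; the sizes below are
illustrative only):

```python
import gzip
import pickle

import numpy as np

# ~100 queries x ~50 hits each x 4852 float64 features is on the order
# of a couple hundred MB raw, in line with the ~230MB figure above.
features = np.random.rand(100 * 50, 4852)
raw = pickle.dumps(features, protocol=pickle.HIGHEST_PROTOCOL)
compressed = gzip.compress(raw)
print(f"pickle: {len(raw) / 1e6:.0f} MB, gzipped: {len(compressed) / 1e6:.0f} MB")
```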
The full list of feature names is in https://phabricator.wikimedia.org/P4677
A few example feature names and their meanings follow, which are hopefully
enough to understand the rest; sketched reconstructions of a few of them
appear after the examples:
DiceDistance_Bigram_max_norm_query_x_outgoing_link_1D.pkl
- Dice distance of bigrams in the normalized (stemmed) query string versus
outgoing links. Outgoing links are an array field, so the dice distance is
calculated per item and this feature takes the max value.
DigitCount_query_1D.pkl
- Number of digits in the raw user query
ES_TFIDF_Unigram_Top50_CosineSim_norm_query_category.plain_termvec_x_category.plain_termvec_1D.pkl
- Cosine similarity of the top 50 terms, as reported by the Elasticsearch
termvectors API, of the normalized query vs the category.plain field of the
matching document. More terms would perhaps have been nice, but doing this
all offline in Python made that a bit of a time/space tradeoff.
Ident_Log10_score_mean_query_opening_text_termvec_1D.pkl
- Log base 10 of the score from the Elasticsearch termvectors API for the
raw user query applied to the opening_text field's analysis chain.
LongestMatchSize_mean_query_x_heading_1D.pkl
- Mean longest match, in number of characters, of the query vs the list of
headings for the page.
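Here are minimal sketches of a few of these features to make the naming
scheme concrete. These are my reconstructions from the names and
descriptions above, not the actual pipeline code:

```python
def bigrams(tokens):
    """Set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))

def dice_distance(a, b):
    """1 minus the Dice coefficient of two sets (0 = identical, 1 = disjoint)."""
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))

def dice_bigram_max(norm_query, outgoing_links):
    """DiceDistance_Bigram_max_...: per-item distance over an array field, max."""
    q = bigrams(norm_query.split())
    return max((dice_distance(q, bigrams(link.split())) for link in outgoing_links),
               default=0.0)

def digit_count(query):
    """DigitCount_query: number of digit characters in the raw user query."""
    return sum(ch.isdigit() for ch in query)

def longest_match_size(query, heading):
    """Length in characters of the longest common substring of query and heading."""
    best = 0
    # Dynamic programming over substring end positions.
    prev = [0] * (len(heading) + 1)
    for qc in query:
        cur = [0]
        for j, hc in enumerate(heading, 1):
            cur.append(prev[j - 1] + 1 if qc == hc else 0)
        best = max(best, max(cur))
        prev = cur
    return best

def longest_match_mean(query, headings):
    """LongestMatchSize_mean_query_x_heading: mean over the page's headings."""
    if not headings:
        return 0.0
    return sum(longest_match_size(query, h) for h in headings) / len(headings)
```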
The main question here, I think, is whether this is still PII. The exact
queries would be normalized into IDs and not released. We could leave the
page_id in or out of the dataset. With it left in, people using the dataset
could plausibly come up with their own query-independent features to add.
With a large enough feature vector per (query_id, page_id) pair the query
could theoretically be reverse-engineered, but from a more practical side
I'm not sure that's really a valid concern.
Thoughts? Concerns? Questions?
Season's Greetings,
A few updates from the Discovery department this week.
This is the last weekly update from the Discovery department for the year.
We'll be skipping next week due to the holidays and will see you all in
January with a fresh 2017 edition.
== Highlights ==
* Secondary result functionality will be available over the search API in
early January! Currently, this allows consumers of the search API to
benefit from automated language detection ([[TextCat]]) and forwarding of
search queries. [0] [1]
== Discussions ==
=== Search ===
* Secondary result functionality will be available over the search API in
early January! Currently, this allows consumers of the search API to
benefit from automated language detection ([[TextCat]]) and forwarding of
search queries. [0] [1]
* Corrected an error on Hebrew wikis where searches without diacritics
sometimes failed to find appropriate results that contained diacritics. [3]
Feedback and suggestions on this weekly update are welcome.
[0] https://www.mediawiki.org/wiki/TextCat
[1] https://phabricator.wikimedia.org/T142795
[3] https://phabricator.wikimedia.org/T3836
----
The full update, and archive of all past updates, can be found on
MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" [4] or
"Volunteer needed" [5] in Phabricator.
[4] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[5] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation
Hello all,
Yesterday, an announcement ("Now live: Shared structured data") incorrectly
stated that Structured Data had been launched on Commons.
The feature, which was inaccurately named "Structured Data", enables users
to add tabular data to the Data namespace on Commons via the regular page
editor and to display and/or visualize that data from other wikis.
This work is unrelated to an ongoing project called Structured Data on
Commons. For more on the newly launched feature, see the Tabular Data [1]
and Map Data [2] help pages on MediaWiki.org.
For information on the Structured Data on Commons project, designed to
associate structured data with media files on Commons to improve their
discoverability, please visit the project page on Commons. [3]
Thank you,
-Katie
[1] - https://www.mediawiki.org/wiki/Help:Tabular_Data
[2] - https://www.mediawiki.org/wiki/Help:Map_Data
[3] - https://commons.wikimedia.org/wiki/Commons:Structured_data
Micru, thanks, I think Datasets sounds like a good name too!
On Thu, Dec 22, 2016 at 2:44 PM David Cuenca Tudela <dacuetu(a)gmail.com>
wrote:
> On Thu, Dec 22, 2016 at 8:38 PM, Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org> wrote:
>
> > On Thu, Dec 22, 2016 at 2:30 PM, Yuri Astrakhan <yastrakhan(a)wikimedia.org> wrote:
> >
> > > Gift season! We have launched structured data on Commons, available
> > > from all wikis.
> > >
> >
> > I was momentarily excited, then I read a little farther and discovered
> > this isn't about https://commons.wikimedia.org/wiki/Commons:Structured_data.
> >
>
> Same here, I think it needs a better name...
>
> What about calling it datasets or structured datasets?
>
> Cheers,
> Micru
Yes, there seems to have been a bit of a naming collision. Tabular data and
map data have been jointly known as structured data, but there is also the
Structured Data project, which IMO should be called the Structured Metadata
project :) Naming suggestions are welcome!
P.S. Brad, I'm sorry tabular and map data did not excite you :(
On Thu, Dec 22, 2016 at 2:38 PM Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org>
wrote:
> On Thu, Dec 22, 2016 at 2:30 PM, Yuri Astrakhan <yastrakhan(a)wikimedia.org>
> wrote:
>
> > Gift season! We have launched structured data on Commons, available from
> > all wikis.
> >
>
> I was momentarily excited, then I read a little farther and discovered this
> isn't about https://commons.wikimedia.org/wiki/Commons:Structured_data.
>
>
> --
> Brad Jorsch (Anomie)
> Senior Software Engineer
> Wikimedia Foundation
Gift season! We have launched structured data on Commons, available from
all wikis.
TL;DR: One data store, used everywhere. Upload tabular data to Commons,
with localization, and use it to create wiki tables and lists, or use it
directly in graphs. Works for GeoJSON maps too. Data must be licensed as
CC0. Try this per-state GDP map demo, and select multiple years. More demos
at the bottom.
Data can now be stored as *.tab and *.map pages in the Data namespace on
Commons. That data may contain localization, so a table cell can exist in
multiple languages. And that data is accessible from any wiki, by Lua
scripts, Graphs, and Maps.
Lua lets you generate wiki tables from the data by filtering, converting,
mixing, and formatting the raw data. Lua also lets you generate lists, or
any other wiki markup.
Graphs can use both .tab and .map pages directly to visualize the data and
let users interact with it. The GDP demo above uses a map from Commons and
colors each segment based on values from a data table.
Kartographer (<maplink>/<mapframe>) can use the .map data as an extra layer
on top of the base map. This way we can show, for example, an endangered
species' habitat.
== Demos ==
* Raw data example
<https://commons.wikimedia.org/wiki/Data:Weather/New_York_City.tab>
* Interactive Weather data
<https://en.wikipedia.org/wiki/Template:Graph:Weather_monthly_history>
* Same data in Weather template
<https://en.wikipedia.org/wiki/User:Yurik/WeatherDemo>
* Interactive GDP map
<https://en.wikipedia.org/wiki/Template:Graph:US_Map_state_highlight>
* Endangered Jemez Mountains salamander - habitat
<https://en.wikipedia.org/wiki/Jemez_Mountains_salamander#/maplink/0>
* Population history
<https://en.wikipedia.org/wiki/Template:Graph:Population_history>
* Line chart <https://en.wikipedia.org/wiki/Template:Graph:Lines>
== Getting started ==
* Try creating a page at data:Sandbox/<user>.tab on Commons. Don't forget
the .tab extension, or it won't work. (A sketch of the page format appears
below.)
* Try using some data with the Line chart graph template
A thorough guide is needed, help is welcome!
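For the curious, here's a minimal sketch of what a .tab page body contains,
built in Python for clarity. The key names follow my reading of the Tabular
help page linked below; double-check there before relying on them:

```python
import json

# Hypothetical minimal .tab page body: the license must be CC0, and each
# schema field has a name, a type, and an optional localized title.
page = {
    "license": "CC0-1.0",
    "description": {"en": "Sandbox example table"},
    "schema": {
        "fields": [
            {"name": "year", "type": "number", "title": {"en": "Year"}},
            {"name": "value", "type": "number", "title": {"en": "Value"}},
        ]
    },
    "data": [
        [2015, 1.25],
        [2016, 1.5],
    ],
}
print(json.dumps(page, indent=2))
```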
== Documentation links ==
* Tabular help <https://www.mediawiki.org/wiki/Help:Tabular_Data>
* Map help <https://www.mediawiki.org/wiki/Help:Map_Data>
If you find a bug, create a Phabricator ticket with the #tabular-data tag,
or comment on the documentation talk pages.
== FAQ ==
* Relation to Wikidata: Wikidata is about "facts" (small pieces of
information). Structured data is about "blobs": large amounts of data, like
historical weather records or the outline of the state of New York.
== TODOs ==
* Add a nice "table editor" - editing JSON by hand is cruel. T134618
* "What links here" should track data usage across wikis. Will allow
quicker auto-refresh of the pages too. T153966
* Support data redirects. T153598
* Mega epic: Support external data feeds.