Hello discovery mailing list members.
We'd like to ask for some feedback on how the Discovery team can best use
this list. We don't want to muddle the list with information that folks
outside the team don't care about, and we want to make sure that what we
post is of value.
What sort of information do you think we should share here?
It seems clear that items like these should go on this list:
- Some interesting findings - we ran a test and found x=y!
- Something new and we want to suss out interest - What if we moved the
search box to a different position on the page‽
- Upcoming projects, or other milestones related to discovery - X users
opted-in to the new autocompletion suggester.
- Discussions about quarterly goals and roadmaps
There are other items that we could share on this list, but would anyone
outside the team be interested? For example, process-related items like:
- Portal team decided to have 3x standups per week instead of 2.
- Team structure changes, or plans to change how we do retrospectives
- Notifications of meeting minutes being posted (e.g. Weekly team
meeting minutes, retrospective minutes, etc.)
For Discovery team members reading this: What criteria do you use (or wish
you could use) when thinking, "Should I post this to the public list?"
Agile Coach, Wikimedia Foundation
You may also find these diagrams useful:
On Feb 24, 2016 6:58 PM, "Mukunda Modell" <mmodell(a)wikimedia.org> wrote:
> >> On Feb 17, 2016 1:50 AM, "Guillaume Lederrey" <glederrey(a)wikimedia.org>
> >> wrote:
> >>> * I still have not found a global architecture schema (something like
> >>> a high level component or deployment diagram). But I have never seen
> >>> any company having those...
> I made a diagram of the scap (mediawiki) deployment architecture a while
> back: https://commons.wikimedia.org/wiki/File:Scap-diagram.png ..
> That does not exactly apply to the new scap3 architecture but it's not too
> far off.
> On Thu, Feb 18, 2016 at 10:37 AM, Giuseppe Lavagetto
> <glavagetto(a)wikimedia.org> wrote:
> > About cherry-picks in beta: the problem is not cherry-picking (I think
> > it's a reasonable way to test things) but persistent cherry-picking to
> > monkey patch problems is. I think if we follow the flow of:
> > - writing a patch
> > - testing it on beta with a cherry-pick
> > - get it merged on ops/puppet and production
> There are a lot of patches on beta these days and there have been a lot of
> different people cherry-picking without much coordination. This has led
> to breakage quite often. Patches also get lost regularly. I assume this
> usually happens because someone has rebased the HEAD and accidentally
> dropped a patch.
> It can be really difficult to get a patch merged in ops/puppet within a
> week (or even a month). I've seen a lot of patches sit around for weeks and
> even now with the Puppet SWAT windows, it's still sometimes unrealistic to
> expect patches get merged into production that quickly. (+CC Tyler)
> Without a system to manage things, and with very little coordination
> between everyone who is working on beta, I don't expect the situation to
> improve too much.
> I intend to propose a solution for beta & puppet patch cherry-picks very
> soon, however, I haven't fully formulated my proposal yet. I will write to
> the ops list when I have something written in a clear and presentable way.
> Ops mailing list
Searching for articles is getting a much needed boost in intelligence. When
searching for an article on any Wikimedia project, a list of possible
matches appears as you type. This incremental search "completion
suggester" helps narrow down possible results for your search query.
Currently the completion suggester is very literal - mistype a word and you
won't see any suggestions. The Discovery team has an update that will make
the search better at detecting typos and spelling mistakes. The list of
suggestions also tends to be more accurate and relevant to the original
search query. You can see the improvement for yourself as a beta feature.
The plan is to begin rolling out this update in the coming month.
This improvement will impact all Wikimedia sites (with the exception of
Wikidata). With the initial rollout, there will be a few limitations. For
technical reasons, the completion suggester will only affect articles in
the mainspace, not other pages like policy or user pages. Those searches
will continue to use the existing incremental search. We're hoping to
expand the scope of the completion suggester to include other pages in the
future, but this will not be a part of the initial release.
Why are we making this change? The success of our A/B tests on the
completion suggester show a reduction in users who find zero results when
searching. We also have a fairly large number of Wikimedians (nearly 19,000
editors since December 2015) using the beta feature, and we've received
positive feedback on the feature so far. The completion suggester has the
potential of lessening the need for redirects based on spelling mistakes as
well.
The goal is to bring these updates to the default search across all
Wikimedia projects in the coming month. This change will affect the
completion suggester in the main search box on desktop, mobile apps, adding
links in VisualEditor, and the search box on the Wikimedia Portal.
Here are two animations showing the completion suggester before and after
(using a misspelling of "Abendessen", with a missing "s").
* Completion Suggester results for "Abendesen" on German Wikipedia before
* Completion Suggester results for "Abendesen" on German Wikipedia after
Since December 2015, nearly 19,000 editors have already opted into the
completion suggester beta feature. We encourage you to try it out and share
your feedback on the Completion Suggester discussion page.
If you'd like to read a little more about the work of the Discovery
Department and other improvements to search, you can read about CirrusSearch
<https://www.mediawiki.org/wiki/Extension:CirrusSearch>, the MediaWiki
extension that powers our search.
Community Liaison - Discovery
It has happened a few times and it's quite annoying; could we raise the size
limit on this list? (It seems to be set to 40kb.)
I've just answered a mail that was crossposted to mobile-l and
discovery-l and I had no problem with the former.
As mentioned previously, the current version of the Android app contains an
A/B test where it presents "read more" suggestions to the user, based on
(a) the standard "morelike" query, or (b) the new "opening_text" query.
Here are the results from the last ~10 days of the test:
- The clickthrough rate using the default morelike query is (and has been)
- With the new opening_text query, the clickthrough rate decreases to about
[image: Inline image 1]
Therefore, it seems that the new query has a nontrivial negative effect on
clickthrough.
We'll plan on removing this test in the next release of the app, but we'll
be happy to plug in a different or updated query, if it will be of further
use to Discovery.
(queries embedded as comments in the headers)
Senior Software Engineer / Product Owner (Android)
Hi Discovery staff,
I've read your notes from your February 16 meeting that were posted to
I just want to express moral support for the engineers, analysts,
designers, managers, and other well-intentioned staff in the Discovery
team. As far as I can tell, the problems relating to the Knowledge Engine
that are being discussed in public don't reflect any desire from the
community to shut down Discovery or improvements to internal search. I hope
that you know that improvements to 0RR, maps, WDQS, search from
www.wikipedia.org, search APIs, and other present and future Discovery
projects may yet receive positive community feedback.
Thank you for your efforts and perseverance.
Speaking in my personal capacity only,
These are my notes, pretty much completely raw, from elasticon last week.
Scaling 5 nodes -> 1,000
significant terms query (expensive) may help with trending articles?
preferred log ingestion path: app -> kafka -> logstash -> elasticsearch
- es has features (node affinity) to have high power "hot" nodes and
lower power / spinning disk "warm" or "cold" nodes and ensure indices
end up on the right machines.
- **run master nodes on same physical hardware, separate JVM**
-- dedicate cores to master node
new node type: ingest
- specialized for indexing
- most common logstash filters implemented in library, used in elasticsearch
- can enrich data on indexing
- makes logstash not necessary for some workloads
10 nodes (128G memory, 32 threads, 1TB SSD) can handle 150k+ indexes per
second for logging
**make clusters a first class citizen in your application**
**multi-tenant scaling -> multiple clusters**
new features: cluster state diffs -> allows more machines/indices/shards
- may help with our master timeout issues, but talked to some core devs and
they weren't really sure. Preferred the idea to split into multiple
independent clusters for multi-tenant use case
- **more indices does matter for master complexity. All about the total
number of shards**
types are just a filter on special _type column
new feature: columnar data store (doc values) on by default in 2.0
new software: beats
- is about data collection and shipping
- may take on similar role as diamond, shipping to elasticsearch or logstash
new goal: "release bonanza" release multiple products on same date to
simplify compatibility between products
- es will go v2 -> v5, skipping between, to match other product versioning
new tool for extracting graphs from elasticsearch data
- can find strongly connected nodes, etc.
- threat detection, possibly other interesting things.
- inverted index
- document store
- column store in v4
- k-d tree - faster structured search in v6
-- replaces inverted index in certain limited circumstances
-- old structured search is based on cells, optimizations are based around
visiting as few cells as possible
-- cells can have different sizes, 2x2, 3x3, (always square?) etc.
-- k/d replacement - recursively splits the search space in two, aiming for
an equal number of points on each side, until each fraction of the search
space has only one point in it.
-- works similar to traditional databases with binary search trees
-- elasticsearch will stop at 1000 docs per side rather than just 1
-- it's a memory trade-off, the tree needs to be kept in memory at all times
-- Mike McCandless is a primary author
-- Better in almost all aspects for spatial
-- ElasticSearch 2.2 - geopoint searches
-- for 1d numeric fields the merge process is much faster: segments are
presorted, so only the final stage of a merge sort is performed.
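A toy 1-D version of that recursive split and the resulting range search may make the idea concrete (leaf size and data here are invented; Lucene's BKD trees stop at roughly 1000 points per leaf and handle multiple dimensions):

```python
def build_bkd(points, leaf_size=2):
    """Recursively split sorted 1-D points in half until leaves are small.
    Elasticsearch stops at ~1000 docs per leaf; leaf_size=2 is for illustration."""
    points = sorted(points)
    if len(points) <= leaf_size:
        return points                      # leaf: a small sorted block of points
    mid = len(points) // 2
    return (points[mid],                   # split value
            build_bkd(points[:mid], leaf_size),
            build_bkd(points[mid:], leaf_size))

def range_search(node, lo, hi):
    """Visit only the subtrees that can intersect [lo, hi]."""
    if isinstance(node, list):             # leaf: check each point
        return [p for p in node if lo <= p <= hi]
    split, left, right = node
    out = []
    if lo < split:
        out += range_search(left, lo, hi)
    if hi >= split:
        out += range_search(right, lo, hi)
    return out

tree = build_bkd([7, 1, 9, 4, 12, 3, 8, 5])
print(range_search(tree, 4, 9))   # -> [4, 5, 7, 8, 9]
```

The memory trade-off mentioned above comes from keeping the inner split nodes resident while leaves live on disk.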
BM25 by default in ES v5.0
- this is massively oversimplified, please read the code and the commit logs
- common words are naturally discriminated
- no more need to exclude stop words.
-- doc freq contribution in bm25 goes all the way to 0, rather than 1.0 in
classic tf/idf
- term freq contribution
-- bm25 saturates earlier. bounded at 2.2 by default
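Those two properties can be seen in a from-memory sketch of a Lucene-style BM25 term score (defaults k1=1.2, b=0.75; treat the exact formula as illustrative rather than Lucene's actual code):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """One term's BM25 contribution, roughly as Lucene computes it."""
    # IDF goes to 0 as a term appears in (nearly) every document, so common
    # words are naturally discriminated -- no stop word list needed.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # TF saturates early; with default k1 the multiplier is bounded by
    # k1 + 1 = 2.2, no matter how often the term repeats.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    tf_part = tf * (k1 + 1) / (tf + norm)
    return idf * tf_part

rare = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100, doc_freq=1, num_docs=10000)
common = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100, doc_freq=10000, num_docs=10000)
# rare >> common: a term in every document contributes almost nothing
```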
Faster queries - two phase iteration
- query can be divided into
-- a fast approximation
-- a (slower) match
- approximation = conjunction
-- match = position check
- Geo polygon query
-- approximation = points in cells that are in or cross the polygon
-- match = check the point against the polygon
Match cost API
1. approximate - (description:search AND description:engine) AND
(body:postings AND body:list)
** 1. iterate body:postings (1k)
** 2. Check description:engine (15k)
** 3. check description:search (200k)
** 4. check body:list (370k);
2. match - description: "search engine" body: "postings list"
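A toy model of that cost ordering: drive the conjunction from the rarest term's postings list and merely verify candidates against the larger lists. The doc-id sets below are invented; the real sizes from the slide ran from ~1k to ~370k:

```python
# Hypothetical postings lists (doc id sets); relative sizes mirror the slide.
postings = {
    "body:postings": {1, 4, 9, 12},                  # rarest term drives iteration
    "description:engine": {1, 2, 4, 9, 12, 20},
    "description:search": set(range(0, 25)),
    "body:list": set(range(0, 40)),
}

def conjunction(terms):
    """Iterate the cheapest (smallest) postings list and check each
    candidate doc against the remaining lists, cheapest first."""
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    lead, rest = ordered[0], ordered[1:]
    return [doc for doc in sorted(postings[lead])
            if all(doc in postings[t] for t in rest)]

print(conjunction(["description:search", "description:engine",
                   "body:postings", "body:list"]))  # -> [1, 4, 9, 12]
```

Only the small `body:postings` list is fully scanned; the expensive position check (the "match" phase) would then run on just these survivors.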
* better query-time synonyms
- disk based norms
- more memory efficient doc values
- more disk efficient sparse doc values
-- lucene 5.4
- BooleanQuery simplification in rewrite
-- for instance imagine you are searching a matching docs query
-- filter (constant score query)
-- this is going to perform a conjunction (all docs that match both queries)
-- lucene 5.5
- bulk scorer specialization for MatchAllDocsQuery and MUST_NOT clauses
-- matching docs iterator must return sorted by score
-- rebalancing prioritizations was causing slowdowns
-- instead of doing by document, now doing it in batches of 2048
-- a NOT b
- improve file truncation detection
-- leading cause of index corruption
-- checksums added in 4.8 notice the error
-- advances in 5.x add nice error message
Also, SVN -> git
Scaling log ingestion at ????
10-12 nodes, old DB servers. Lots of memory, disk, fast CPUs
- number of indexes doesn't matter, it's about the number of shards
- stores 13 months worth of data
- two JVMs per server. One "hot" ingest instance pointed at fast disks.
one "warm" instance pointed at slower spinning disks for querying
- 384G ram. 60TB of disk, 96 hardware threads, etc.
(initially) The bad:
- inefficient use of hardware
- limited number of nodes
- larger amounts of data to replicate on node failure
- run all independent instances? Run many instances on one node?
- hardware overpowered for data nodes
(reworked cluster config) The good:
- Run elasticsearch instances completely independent. Separate yml config,
separate data directories, etc.
- One hot ingest instance, 4+ warm instances
-- uses ~50% of memory for JVM
-- no matter how big your box is, keep jvm heap < 50%
-- they run 5 to 7 instances; on 48 core machines (96 with HT) running 5
instances does not need cpu affinity. Running 15 does (but 15 is a bad idea)
- increase nodes
-- more resources for queries
-- better distribution of the larger number of shards
-- more shards supported, thus more indexes, thus more data available online
- faster replication on node failure due to smaller data sets
- increased peak query load possible by 10x (probably due to more jvm heap?)
- # of shards per instance affects scaling, query patterns
- requires creative allocation rules to keep primary and replica off of the
same physical machine
-- quite a few. Requires many additional tags as well.
- get much more out of the hardware with smaller (~30G) heaps and more JVMs
The Bad: What did it cost us?
- more complicated cluster management
-- need to address each instance for all management
-- need to set up proper allocation rules to protect data from physical node
failure
--- originally intended to define racks. Allows separating primary/replicas
--- reuse this to support many independent instances per machine
- Non standard configuration means more difficult to get assistance from
public resources and documentation
-- (paid) Support is your friend here.
- physical node loss means greater cluster impact
-- When a physical node goes down 40T+ of data needs to re-replicate
-- this takes about an hour in their network/disk configuration
- higher cost of hardware for data nodes
- more power and environment controls necessary
The End: well not really we are still growing
- capable of processing 5.75 billion events a day
- peak is 2x of trough (as high as 10x)
- 24-30 physical data nodes
- 5 instances (1 hot, 4 warm), lots of storage, fast cpus, lots of memory
-- can handle hardware failure and/or node crashes
-- can handle maintenance/upgrades
- small footprint in DC
- tested JVM heaps down to 24G, was not enough for ES caching. Optimal is
just under 32G in his instances
- main difference between hot and warm instances is how expensive the disk is
- running es 1.4.4, testing 2.2 but have not finished converting everything
- nodes are not separated on the boxes (VM, docker, etc), one kernel
Relevant config variables:
node.name - to identify the specific instance. For consistency suggest
naming the process, configuration files and node all the same
node.server - to identify the physical server
node.tag - optional - to be able to move shards around
path.conf - to identify the unique configuration directory
cluster.routing.allocation.same_shard.host: true - enable check to prevent
allocation of multiple instances of the same shard on a single host
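Put together, the per-instance settings above might look something like this in each instance's elasticsearch.yml (hostnames, paths, and tag values here are hypothetical; `node.server` and `node.tag` are custom node attributes referenced by the allocation rules):

```yaml
# /etc/elasticsearch/hot-1/elasticsearch.yml -- instance "hot-1" on es-phys-01
node.name: es-phys-01-hot-1        # unique per instance; matches process/config naming
node.server: es-phys-01            # custom attribute identifying the physical machine
node.tag: hot                      # optional tag used to move shards between tiers
path.conf: /etc/elasticsearch/hot-1
path.data: /var/lib/elasticsearch/hot-1

# prevent a primary and its replica from landing on the same physical host
cluster.routing.allocation.same_shard.host: true
```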
Stories from Support: Top Problems and Solutions
The Big Issues
1. Fielddata (always single word)
- most common problem.
- 60% of total heap can be taken up in 1.x and 2.x
-- more likely to be used in 1.x
- 100% of data nodes will be impacted
- 2.x (Mostly) Solves This - Thanks to doc_values being on by default
-- Wouldn't help for wikimedia use case, we have very few aggregations that
would use doc values. We also have relatively small fielddata though
- Columnar store of values
- Written at index time to disk
-- Adds an extra file next to the segment on disk
-- takes more indexing time
-- Means you don't have to un-invert the inverted index at query time
-- Part of the reason JVM should not take more than 50% of memory
- Leverages the operating system's filesystem cache
- When upgrading from 1.x to 2.x you have to reindex to get the change
-- You cannot enable doc values after the fact.
- GET /_cat/fielddata?v is your friend
-- Couple hundred megabytes per node. Basically nothing.
- Mostly for aggregates (as in, not useful at wikimedia)
- analyzed strings do not currently support doc_values, which means that
you must avoid using such fields for sorting, aggregating and scripting
- analyzed strings are generally tokenized into multiple terms, which means
that there is an array of values
- with few exceptions (e.g., significant terms), aggregating against
analyzed strings is not doing what you want
- unless you want the individual tokens, scripting is largely not useful
- Big improvements coming in ES 2.3 ("keyword" field)
-- Redo not analyzed strings to allow them to use analyzers that produce
- use multi fields to create analyzed and un-analyzed fields
2. Cluster state
- every cluster state change is sent to every node
-- requires a lot of short lived, potentially large network messages
-- gets worse with more nodes or indices
-- mappings tend to be the largest portion
- GET /_cluster/state?pretty
- Not stored in memory as JSON, so this is just to give the idea (it's
likely 5% of it ballpark)
-- elastic1001 reports JSON as 25MB. 5% would be 1.25MB (tiny)
-- a Gig would be devastating for any cluster size
-- 100M for a 100 node cluster would be "reasonable"
-- 100M for a 10 node cluster would be odd
-- Worst seen (by presenter) was 4G
- ES 2.0 introduces cluster state diffs between nodes
-- changes become far more manageable and a large cluster state is no longer
such a problem
-- reducing your mapping size helps too
-- Do not allow dynamic mappings in production
- Do not use _types to separate data
-- duplicating mapping if they are the same
-- if they are different, fields become unnecessarily sparse or conflicting
-- prefer to create separate indices
- Create a "type" field to do this for you
- Prefer changes in bulk rather than one-by-one (allows changes to be
batched)
** Create 900 new titlesuggest indices in one cluster operation?**
-- there is no recommended size. Maybe 5MB payload size? Play around with it
3. Recovery
- Changed dramatically in ES since 1.4
- restarting a node or otherwise needing to replicate shards
- terribly slow process
- segment by segment
- minor risk of corruption pre-es 1.5 with sketchy network
-- if 90% done and network drops there could be issues
-- don't use 1.5
1.6 -> asynchronous allocation and synced flushing
1.7 -> delayed allocation and prioritized allocation
2.0 -> Cancelled allocation
2.1 -> prioritized allocation for replicas
If you are not on 1.7.5 get there SOON.
4. Node Sizing
- elasticsearch is a parallel processing machine
- java can be a slow garbage collecting calculator
- slow disks. The problem for every data store?
-- Do not use SAN. Really. Don't do it
- A few huge instances aren't best, keep JVMs <= 30G
And how long is a piece of string?
- 50% of system ram to heap
- Up to 30500M - no more or your heap loses optimizations!
-- compressed pointers
- indexing tends to be CPU bound
- At least 2 cores per instance
-- more you give it the faster it can go (duh)
- disks get hammered for other reasons, including write-impacting
- translog in 2.0 fsyncs for every index operation
-- major major change in 2.0
- SSDs or Flash are always welcome
- ES 1.x defaults to 5 primary, 1 replica
- ES 2.x defaults to 2 primary, 1 replica
- 50GB is MAX shard size. More for recovery than performance reasons
- Increase primaries for higher write throughput and to spread load
- Shard because you need write capacity.
- If you need better search response (latency) increase number of shards
- Replicas are not backups. Rarely see a benefit with more than 1
-- They are for HA, not backups
-- They do not save you if "everything goes to hell"
-- Snapshots are your friends
--- Incremental, next snapshot gets only segments that are "new"
-- Oversharding is becoming an issue
--- 41k shards in a 10 node cluster is extreme
- Deep pagination
-- ES 2.0 has a soft limit of 10k hits per request. Linearly more expensive
-- Use scan and/or scroll API
- Leading wildcards
-- Equivalent to a full table scan (BAD!!)
- Scripts
-- Without parameters
-- dynamically (inline)
- Search is always faster than aggregations
- Don't aggregate when you want search
- Disable it on SSDs
Use bulk processing
- Indexing 1 by 1 you will have a bad day (if trying to index many things)
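For reference, a sketch of what an Elasticsearch `_bulk` request body looks like: newline-delimited JSON with one action line, then one source line, per document. The index and field names below are made up:

```python
import json

def bulk_body(index, doc_type, docs):
    """Build an Elasticsearch _bulk request body: one action line and one
    source line per document, separated by newlines."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index,
                                           "_type": doc_type,
                                           "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # the body must end with a newline

body = bulk_body("titlesuggest", "page", [
    (1, {"title": "Abendessen"}),
    (2, {"title": "Abendbrot"}),
])
# POST this body to /_bulk in batches of a few MB rather than one doc at a time.
```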
Pitfalls securing search - or why REST based security solutions rarely work
Why would you?
- Most of the clients are using HTTP protocol
- Easy to implement using standard HTTP reverse proxy
- Many elasticsearch operations are REST compliant
-- PUT /index/type/doc_id - indexes a document
-- GET /index/type/doc_id - gets a document
- Reverse proxy isn't enough, you also need firewalls
-- (I don't think that's a good idea at all. Use SSL)
-- must establish security perimeter
What does it get you?
- Works great for authentication. Access/No access to everything
- Fails for authorization
-- Role based access to certain indices, certain parts of indices, certain
documents or fields
Limiting access to an index by using URL matching
- Theoretically easy! Just setup the following URL matching patterns:
-- GET /myindex/*/*
-- PUT /myindex/*/*
-- POST /myindex/*/*
- But what about?
- POST /myindex/type/_bulk
- Could disable multi-index rest actions
-- rest.action.multi.allow_multi.??? = none
Search is not always limited to the index on the url
Cross index search operation:
- More like this query
- Terms lookup mechanism in the terms query
- you end up needing to whitelist/blacklist query patterns
Be really careful with URL matching rules!
- Documented exceptions on the elasticsearch website
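A toy demonstration of the pitfall above: a URL whitelist that only allows paths under `/myindex/` still forwards a `_bulk` request whose body targets another index. The pattern and index names are invented for illustration:

```python
import json
import re

# A naive reverse-proxy whitelist: only paths under /myindex/ are allowed.
allowed = re.compile(r"^/myindex/[^/]+/[^/]+$")

path = "/myindex/type/_bulk"
body = "\n".join([
    json.dumps({"index": {"_index": "secretindex", "_type": "doc", "_id": 1}}),
    json.dumps({"leaked": True}),
]) + "\n"

assert allowed.match(path)  # the proxy happily forwards this request...
# ...but each _bulk action line can name its own index, overriding the URL:
target = json.loads(body.splitlines()[0])["index"]["_index"]
print(target)  # -> secretindex
```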
Document level security
- A typical solution for document level access control
-- Add an ACL token(s) to each document
-- Create an alias with ACL filters for each user
-- Limit users access to their aliases
- Works *most* of the time
-- Aliases are not considered in
--- Suggest API
--- Children Aggs
If you are trying to use aliases as a security solution, augment them with
query whitelisting
Filtered aliases are not a security mechanism
- More like working together with other systems
Field level security
- A naive solution - filtering of source in response
-- This is not enough; if you can search it you have access to it. There is
no exception to this.
A field in the source can appear in multiple ways
-- multi match, aggregations, etc. etc.
-- Requires basically whitelisting queries
Securing search is a difficult problem
- URL path is not a filter
- Some operations are working on entire index
- It's a search and analytics engine, which makes frequency analysis attacks
so much easier
-- Lots of nice statistics like term frequency, etc.
Is there a solution?
- Move all security to business logic layer. Deny raw access
- Move security to transport layer
-- This is what shield (now xpack) does
-- They believe that securing at the lowest level is the only way to secure it
- Hard to know from the REST api what will happen with a given request
- Authorization is secured at the individual TransportAction level
-- Prevents the user from accessing a terms lookup (for example) on an
index they don't have access to
- TLS encryption for inter-node transport and http
- Message signing means even without TLS a node cannot join the cluster
- Secured down to the lucene level - integrating with how elasticsearch
reads the index
- In a request with document level security, the query defined on a role
will get added to the index reader. Documents that aren't allowed look
deleted to the reader.
- Field level security is similar. At the lucene level fields look like
they don't exist
Boosting Search Accuracy with Engine Scoring and Predictive Analytics
Chief Architect, Search Technologies
(note to self: This will probably be advertising to hire them as
(talked to paul after talk, he had previously talked with nik and is
interested in working with us. I think what he really wants is to use our
click through data to improve his own business case though. Tomasz will be
in contact and see if we can work together).
- Speaker has been involved in this since mid 90s
- Many of these ideas came from TREC, updated with big data and logs
- Common customer complaint
-- Q: Whats wrong?
-- A: Our search is bad
-- Q: How bad?
-- A: Bad
-- Q: Scale 1-10?
-- A: 8? 9? Let's call it 9.23
-- (completely bogus, means nothing without statistically based numbers)
- Golden query set
-- Key documents
-- Just because it's called golden doesn't mean it's any good
- Top 100 / Top 1000 queries analysis
-- May not represent actual usage of your system
-- Taking a random selection can be more useful. For example, for one customer
40% of queries were human names; that large variance didn't show up in the top 1000
-- The long tail could be 85%+ of queries
- Zero result queries
- Abandonment rate
- Queries with click
These all have a problem: You need to put your search engine into
production to compute them which hurts your bottom line.
- (if you are ecommerce or advertising)
What are we trying to achieve?
* Reliable metrics for search accuracy
* Can run analysis off-line
-- Does not require production deployment (!)
* Can accurately compare two engines
-- Different technologies
-- Different versions of same engine
-- Different scoring on same engine
* Runs quickly = agility = high quality
-- Iterating engine quickly is key
* Can handle different user types / personalization
-- Broad coverage
-- Modify relevancy scoring based on source website, or the "cluster" the
user is in
* Provides lots of data to analyze what's going on
-- Data to decide how best to improve the engine
Leverage logs for accuracy testing
Query logs, Click Logs -> Engine scoring framework -> Search engine under
test
* engine scores - have one (or two, or three) top level numbers that give
your engine a score for the query set
* other metrics and histograms - generate useful metrics, such as a
histogram of where recorded click through were in the final positions
* scoring database - Record historical scores, and what changes led to
those scores. Re-review prior changes after making new changes. They might
not be as useful any more!
From Queries -> Users
* User by User metrics
-- change in focus
-- Queries don't matter. People matter. It does *not* matter if a query got
the right answer. It matters if the user got their right answer (maybe it
took two queries? Not the end of the world)
* Group activity by session and/or user
-- call this an "activity set"
-- Merge session and users
* Use Big Data to analyze *ALL* users
-- There are no stupid queries and no stupid users
-- Overall performance based on the experience of the users
- Group activity by session and/or user (Queries and Clicks)
-- Trying to create a model for the user. Everything the user has found
Determine "relevant" documents
-- Lots of people don't have user ids, only sessions
-- What did the user view? Add to cart? Purchase?
-- Did the search engine return what the user ultimately wanted?
* Determine engine score per user per query
-- Σ power(FACTOR, position) * isRelevant[user, position]
-- Evaluated from each user's point of view
-- (Note: Many other formula possible, MRR, MAP, DCG, etc.)
-- Speaker happens to like this one. Some algorithms prefer top results, some
prefer results "on page"
* Average score for all user queries = user scores
* Average scores across all users = final engine score
-- Important to average per user (or per session) and not per query
-- Prevent single user with 99 queries from changing the score
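A toy version of the per-user scoring just described (FACTOR and the sample data are made up; the key point is averaging per user before averaging across users):

```python
FACTOR = 0.5  # decay per result position (hypothetical value)

def query_score(relevant_positions):
    """Score one query for one user: sum FACTOR^position over the
    0-based positions where that user's relevant documents appeared."""
    return sum(FACTOR ** pos for pos in relevant_positions)

def engine_score(per_user_queries):
    """Average per user first, then across users, so one user with 99
    queries cannot dominate the overall score."""
    user_means = [sum(query_score(q) for q in queries) / len(queries)
                  for queries in per_user_queries.values()]
    return sum(user_means) / len(user_means)

score = engine_score({
    "user_a": [[0], [1]],        # relevant doc at position 0, then position 1
    "user_b": [[0], [0], [2]],   # three queries, so averaged over three
})
print(round(score, 3))  # -> 0.75
```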
Off Line engine Analysis
-- Done by re-executing the users queries offline
Continuous improvement cycle
Modify engine -> execute queries -> compute engine score -> evaluate
results -> back to beginning
* Kind of like sports, you can have a personal best engine score
* Scoring never stops, generating one score is not enough. Generate scores
and track over time
What else can we do with engine scoring?
- predictive analytics
- The brutal truth about search engine scores
-- They are not based on science.
-- Random ad-hoc formula put together
- BM25 invented in 1980. Not state of the art. Predictive analytics is
the more modern approach
Use Big Data to predict relevancy
Document Signals, Comparison Signals, Query Signals, User Signals
The score predicts probability of relevancy
Value from 0 -> 1
* Can be used for threshold processing
* All documents too weak? Try something else!
* Can combine results from different sources / constructions together
- Identifies what's important
-- Machine learning optimizes for parameters
-- identifies the impact and contribution of every parameter
- If a parameter does not improve relevancy -> REMOVE IT!!!
Come out of the darkness
- Ultimate message is use science
- Not time based, just a list of queries and clicks
- Don't have to tie clicks to users
- Single relevancy score not as useful - Build relevancy lab to have
- tf/idf may be a signal, but the resulting formula usually replaces tf/idf
-- tf/idf only used to surface likely documents
NY Times Search
explicit search, formal user queries, used but not the main use case
implicit search powers much of nytimes.com
Few updates - A few thousand updates per day
Sources - CMS, Legacy systems, file based archives, OCR
Latency - <1s, Low latency is important. Especially for implicit search.
Content typically doesn't show up on website until it's indexed
Typical Use Cases
- not comprehensive
- People search for something they read. They read an article, then want to
read it again
-- Only 1 correct result.
-- Goal (not accomplished): recognize these queries and give a single,
definitive result
- Find something you wrote
-- Less common, but happens surprisingly often
- Find reviews for movies/books
-- Typically has a single correct result
- Find recipes
-- Similar, but more than a single correct result
- "Manual sharing"
-- Instead of sending a link to someone you tell them you read an article
on xyz and they search for it
- "Casual" news search
-- Looking for information on recent events
- Serious research
-- More interesting
-- Makes up a large share of queries on the website
-- Typically journalists or historians
Why not just use google?
1. Keep the customers on site
-- With Google there's a risk the customer goes to a different site
2. There is no google for native apps
-- Native apps need to provide good search
-- Because they have multiple apps that do slightly different things they
need special cases
3. We know our content better
-- Most important
-- Have editorial teams, taxonomy teams. They know which article is
relevant for a specific case
-- Elasticsearch has allowed them to tune this exactly how they want
Search should be targeted at their most advanced readers
- A movie review can be found on google
- If you want to know the 10 most important articles on Syria, NY Times can
do that much better
-- Not there yet, lots of work still to do but that is the direction they
are heading
Our current Setup
- The search stack runs entirely on AWS
-- Case for NYTimes website
- Deployed using an in-house infrastructure tool
- Nagios and New Relic for monitoring
- Sumo Logic for log management and analytics
- Two full production clusters
-- Not just for elasticsearch, but the whole stack (ingestion pipeline, etc)
- DNS based failover
- Also used for maintenance/major changes/reindexing
- 16 node ES clusters in production (m1.xlarge)
- Many intermediate layers read/write data from S3
- Beanstalk handles queueing (i think?). Replacing with kafka
- Large number of rules defining how to extract fields from different data
sources
- Constructs canonical document which is same for all data sources
- Canonical document pushed into ES
- Ensures that only the latest version is indexed
- Pushes documents into elasticsearch and mongodb
- powered by elasticsearch and mongodb
- For each search result looks up full document from mongodb and returns
that to client
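The lookup step described above can be sketched as follows. This is a minimal illustration with hypothetical names, using plain dicts to stand in for the real Elasticsearch response and MongoDB collection:

```python
def join_results(es_hits, mongo_store):
    """es_hits: list of {'_id': ...} entries from an Elasticsearch response.
    mongo_store: stand-in for a MongoDB collection, keyed by document id."""
    results = []
    for hit in es_hits:
        doc = mongo_store.get(hit["_id"])
        if doc is not None:  # skip ids deleted from MongoDB since indexing
            results.append(doc)
    return results
```

In the real stack the `mongo_store.get` call would be a MongoDB query by `_id`; the point is that every search requires a second round-trip to another datastore.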
Two clusters make maintenance work easy.
- The way the handler works makes it relatively easy to handle the failover
Number of issues with this setup:
- Too many moving parts
- Too many things that can go wrong
- Difficult with this setup to index new fields
-- New fields are added quite frequently
-- Have to go back to normalization, add new fields, reindex everything
-- Many times requires going all the way back to CMS
-- Simple task that shouldn't be this hard
- Despite the fact they have two clusters they don't know if the offline
cluster is still working
-- Cache goes cold
-- Have to warm up cache before putting online
-- Offline cluster doesn't receive production traffic
-- Only discover failure when they try to fail over
- Hard to run a complex setup like this on a laptop
-- Want to run entire pipeline locally
- Overall, this is a setup that is so complicated that they spend a good
amount of their time doing operations work instead of search
-- Seen this issue in many places
Future Work - Simplify
1. Full document in elasticsearch
- store and index full documents in elasticsearch
-- Not done previously for historical reasons of support for document sizes
(alt search engine?)
- No external lookup necessary in the API, no MongoDB
- Demand-driven API - clients decide what they want.
- Not in production yet, but some tests show this is better
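One way the demand-driven idea could look (an illustrative sketch, not the NYT API): the client names the fields it wants, and the request body uses Elasticsearch `_source` filtering, so no external lookup is needed:

```python
def build_search_body(query_text, requested_fields):
    """Build an ES request body; requested_fields come from the client.
    The "body" field name is illustrative."""
    return {
        "query": {"match": {"body": query_text}},
        "_source": list(requested_fields),  # ES returns only these fields
    }
```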
2. Use kafka
- All content will go through kafka
- Kafka will persist all content, source-of-truth for search
- Easy to create new search clusters through Kafka replay
- Goal: Never upgrade cluster. Boot new cluster, replay kafka to populate,
switch over and kill old cluster
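The replay-to-populate idea can be sketched like this. Messages are modeled as `(doc_id, version, doc)` tuples from a Kafka-style log, and only the newest version of each id survives, matching the "only the latest version is indexed" guarantee above. Names are illustrative, not the NYT implementation:

```python
def replay(messages):
    """messages: iterable of (doc_id, version, doc), in any order.
    Returns the rebuilt index state with only the newest version kept."""
    index = {}
    versions = {}
    for doc_id, version, doc in messages:
        if doc_id not in versions or version > versions[doc_id]:
            versions[doc_id] = version
            index[doc_id] = doc
    return index
```

A new cluster can then be booted and populated by running exactly this loop over the full log, with the indexing call replacing the dict write.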
3. Keep all clusters busy
- All production traffic will be replayed from active production cluster to
standby production and to staging
- Uses gor
- Makes sure that standby cache is always warm
- Changes to staging will be exposed to production traffic
- If standby cluster crashes they will know
4. Vagrant boxes for everything
- Make it easy to run full pipeline locally.
- Provision vagrant boxes using same system as production servers
-- *exactly* the same. No differences between local and prod per-box
- No S3. No beanstalk. Much simpler.
Conclusion and comments
1. Figure out how you can be better than Google
-- If you don't think you can be better, let Google do it
-- Know your users
-- Search should be an integrated part of the product
--- Search should not be an afterthought
--- Plan it from the beginning
-- Be consistent across devices and platforms
-- Focus on where you can make a difference for the end user
2. Infrastructure is hard
Think about how to
- do deployments
-- Far too often people set up search clusters without planning how to
deploy new versions
-- Know how you can easily upgrade
- do upgrades
- reindex content
- change the way you normalize data
- create new search clusters
... and make it easy.
Things to do
- Put your infrastructure in version control
-- Nothing should be done from the command line to set up infrastructure
-- When everything crashes, and some day it will, it needs to be easy to rebuild
- Vagrant everything
- Have immutable servers, replace instead of changing
-- Only really possible on cloud-based infrastructure; raw hardware requires
a different approach
- Make deploying a new stack easy and automatic
Is elasticsearch for text search?
- Background for this question: they attended a one-day Elastic conference
in New York
-- None of the talks were about text search
-- This week is very similar
-- Looked over new features in ES2.0. Categorized into
analytics/search/other stuff(mostly cluster)
-- Found 6 issues out of 119 dealing with search. 5%.
-- Skeptical that Elastic wants to focus on text search as an important use
case
-- Those features are not revolutionary in any way, just regular tweaks
New things necessary in elasticsearch for text search:
- Query log handling
-- Need to know what users are searching for and what they are doing
-- Completely missing from elasticsearch
-- Considers it a big problem
-- Elasticsearch has plenty of ways to analyze logs, but nothing that logs
and ingests its own search history
- Easier relevancy tuning
-- Difficult to change things: getting the right score for one document
without changing all other searches is hard
-- Needs tooling. Providing a set of benchmark queries
-- When you change things for one query, how does that affect other queries?
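One simple way the benchmark-query tooling could work (an assumption for illustration, not the speaker's tooling): for each query with a known best document, check whether a candidate scoring function still ranks that document first, so a tuning change can be measured rather than guessed at:

```python
def benchmark_top1(queries, score):
    """queries: list of (query, expected_doc_id, candidate_docs).
    score(query, doc) -> float.
    Returns the fraction of queries whose expected doc ranks first."""
    hits = 0
    for query, expected, docs in queries:
        best = max(docs, key=lambda d: score(query, d))  # top-ranked doc
        if best["id"] == expected:
            hits += 1
    return hits / len(queries)
```

Running this before and after a relevancy change shows whether a fix for one query regressed the others.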
- Improved linguistics
-- No built in language detection
-- No morphologies - lemmatization
-- No compounding
-- No entity extraction
- Really wants to see some leadership from elastic on the text search space
Optimizations around analyzers?
- Not much. Most is done pre-indexing by adding keywords and custom scoring
Primary data store?
- Source of truth is custom CMS
Moving away from AWS?
- not trying to, just want to make sure it's possible (using kafka instead
of SQS, etc.)
Follow up on latency, time from publish to shown on site?
- Less than 1s
Geospatial data structures in elasticsearch and lucene
Geo field types
-- used to have 8 or 9 parameters; as of ES 2.2 only 3 parameters necessary
-- others still there, but deprecated and being removed
-- type, geohash_precision, geohash_prefix, ignore_malformed is all that is
left
-- only type and ignore_malformed are being kept
- geo shapes
-- used to have 12 parameters, many expert only
-- going forward: type, ignore_malformed, tree, precision,
distance_error_pct, orientation, ???_only
-- orientation is going away; the standards define one convention: the right-hand rule.
-- ES 5.0 is ISO compliant and will reject non-compliant shapes
-- points_only allows short-circuiting the quad tree logic, optimizes for
point data
--- points_only vs a regular points index is 10-15% better performance
--- only geo shape no points only,
OGC simple feature access
ISO Geographic information - spatial schema 19107:2003
GeoPointField change in lucene 5.4
- creates a quad tree encoding: first bit is world-level lat, next bit
world-level lon
-- rotates back and forth every two bits
-- Space filling curve, dimensional reduction
-- Their implementation is a Z-curve (Morton curve); nearest neighbors in
space are not necessarily close on the curve.
-- investigating the Hilbert curve. The current encoding takes 4 operations,
Hilbert takes longer. Tradeoff of accuracy/locality vs performance
-- precision step parameter affects size of index: more precision -> larger
index
-- "optimal" precision step for smallest index possible; currently 9
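The Z-order (Morton) interleaving described above can be shown with a toy encoder (illustrative only, not Lucene's GeoPointField implementation): the bits of quantized lat and lon are interleaved, lat first, so points that share a quad-tree cell share a code prefix:

```python
def morton_encode(lat_bits, lon_bits, nbits=8):
    """Interleave two nbits-wide integers, lat bit first, then lon bit,
    from most significant bit down. Nearby points share long prefixes."""
    code = 0
    for i in range(nbits - 1, -1, -1):
        code = (code << 1) | ((lat_bits >> i) & 1)
        code = (code << 1) | ((lon_bits >> i) & 1)
    return code
```

Each pair of output bits corresponds to one quad-tree level, which is why the notes describe the encoding as alternating between lat and lon.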
2.1 -> 2.3 large change in performance of geo. ~30% index size, 12% heap,
40% time (indexing or query?)
5.0 -> geo shapes and geo points merged as geo @experimental
- balanced k-d trees
A free text search is a very inaccurate description of our information need
What you want
- quick learner
- works hard
But you type: "hard-working, self-motivated, masochist"
The purpose of this talk
- know the monster, understand what the parameters of BM25 do
- know why it has the label "probabilistic"
- mathematical concepts behind it: more for your entertainment
- be convinced that switching to BM25 is the right thing to do
-- hardest part of this talk
- be able to impress people with your in-depth knowledge of probabilistic
ranking
The current default - TF/IDF
- search in self-description of application for these words
- we want to order applications by their relevance to the query
Evidence for relevance - term frequencies
- Use term frequencies in description, title, etc.
- term freq: more is better
-- the more often it occurs the more relevant this document is
-- square root of the term frequency. Sum it up
-- increases non linearly
- inverse document frequency
-- common words are less important
-- a, is, the, etc.
-- also non-linear: multiply tf with idf and sum that
- long documents with same tf are less important: norm
-- also non-linear, idf is more L shaped than this though
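The shape described in these bullets can be sketched as a scoring function. This is a rough approximation of Lucene's classic similarity (square-root tf, logarithmic idf, inverse-square-root length norm), not its exact practical scoring formula:

```python
import math

def tfidf_score(query_terms, doc_terms, doc_freq, num_docs):
    """doc_terms: list of tokens in the document.
    doc_freq: term -> number of documents containing it."""
    length_norm = 1.0 / math.sqrt(len(doc_terms))  # long docs score lower
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        idf = 1.0 + math.log(num_docs / (doc_freq[term] + 1))  # rare is better
        score += math.sqrt(tf) * idf * length_norm  # sublinear in tf
    return score
```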
Bool query and the coord-factor
- doc a: holiday: 4, china: 5
- doc b: holiday: 0 china: 15
- Coord factor: reward document a with both terms matched
-- basically a hack
- successful since the beginning of lucene, since first release
- well studied
- easy to understand
- one size fits most
So...can we do better?
- It is somewhat a guess
- sum of square roots of term frequencies, which was a guess by the original
author
-- ad hoc
Probabilistic ranking and how it led to BM25
The root of BM25: probability ranking principle (abridged)
-"if retrieved documents are ordered by decreasing probability of relevance
on the data available, then the system's effectiveness is the best that can
be obtained for the data."
- Chance to deploy prior knowledge of relevance and scoring to make results
better
- if mapped to a mathematical framework, we can take advantage of
mathematicians' prior work
Getting into math:
1. Don't care how much heap or how many CPUs. Get to that later, forget
about it now
2. Think of yourself as super smart, beyond 2 standard deviations
- simplification: relevance is binary!
- get a dataset of queries with relevant/irrelevant documents
-- use that to estimate relevancy
- how to use this to actually get a better scoring?
-- P(A|B) = probability of A given B, R = relevancy (1/0), d = document
- For each document, query pair: what is the probability that the document
is relevant?
-- Need some sort of matrix, document vs query
...here be math...
- paper: The Probabilistic Relevance Framework: BM25 and Beyond
- Stephen Robertson and Hugo Zaragoza
..and we get to..
- W(d) = sum log( P(..)P(..) / P(..)P(..) )
BM25: how to estimate all these probabilities
- two assumptions
-- the binary independence model - a dramatic but useful simplification
-- query term occurs in a document or doesn't, we don't care how often
- Robertson/Sparck Jones weight
-- another equation
-- assumes unlimited supply of interns
-- still use robertson/sparck jones weight but assume that the number of
relevant documents is negligible (R=0,r=0)
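For reference, the Robertson/Sparck Jones weight mentioned above is usually written as follows (standard form; N = total documents, n = documents containing the term, R = relevant documents, r = relevant documents containing the term):

```latex
w = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}
% with the R = r = 0 assumption it reduces to an idf-like weight:
w = \log \frac{N - n + 0.5}{n + 0.5}
```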
- BM25 has S shape, goes to zero. tf/idf is L shaped
- In TF/IDF the more often the term occurs the better
- but, is a document about a term just because it occurs a certain number
of times?
- this property is called "eliteness" in the literature
Example for "eliteness"
- Look at wikipedia: Many documents are about tourism
- Many documents contain the word tourism - but are about something
completely different, like for example just a country
Can we use prior knowledge on the distribution to ...
Eliteness as Poisson Distribution
- document is not about the term
- document is about the term
How to estimate this?
- gather data on eliteness for term
- many term frequencies -> do for many documents
How relevance ties into that
- suppose we knew the relationship of ???
Tf saturation curve
- limits influence of tf
- allows to tune influence by tweaking k
- major difference from tf to bm25
So...we assume all documents have same length?
- Poisson distribution: assumes a fixed length of documents
- but they don't have that (most of the time)
- we have to incorporate this too!
- scale tf by it like so: (long equation)
-- includes interpolation between 1 and document length over average
Influence of b: length weighting
- tweak influence of document length
- similar to tf/idf, depending on b value
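The saturation curve and the influence of b can be sketched with the standard BM25 per-term tf component (the k1 and b defaults below are the common ones, not a recommendation):

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100, avg_len=100):
    """BM25 per-term tf component: saturates toward k1 + 1 as tf grows
    (the S shape); b scales tf by document length relative to the
    corpus average (b=0 disables length normalization)."""
    length_norm = 1 - b + b * (doc_len / avg_len)
    return (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

No matter how often a term occurs, its contribution stays below k1 + 1, which is exactly the "limits influence of tf" point above.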
Is BM25 probabilistic?
- many approximations
- really hard to get the probabilities right even with unlimited data
-- mathematicians have been trying to solve this for years, no success
- BM25 is "inspired" by probabilistic ranking
-- 1992 Trec-2 took a "leap of faith" (?)
-- 1993 Trec-3 BM25 final!
-- 1999 First lucene release (tf/idf)
-- 2011 Pluggable similarities + BM25 in lucene
So..will I get better scoring with BM25?
Pros with the frequency cutoff:
- TF/IDF: common words can still influence score
- BM25: limits influence of term frequency
-- less influence of common terms
-- no more coord factor!
-- check if you should disable coord for bool queries?
--- index.similarity.default.type: BM25
- parameters can be tweaked. Not sure if many people actually need to. You
need data to measure if scoring is getting better or not. Do not tweak
blindly.
- close index
- update mapping (or settings)
- re-open index
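The close / update / re-open steps above, with the settings body Elasticsearch expects. The commented client calls are assumptions (elasticsearch-py naming), shown only to illustrate the sequence:

```python
# index.similarity.default.type: BM25, plus its two tunable parameters
BM25_SETTINGS = {
    "index": {
        "similarity": {"default": {"type": "BM25", "k1": 1.2, "b": 0.75}}
    }
}

# The sequence against a live cluster would look roughly like:
# es.indices.close(index="my_index")
# es.indices.put_settings(index="my_index", body=BM25_SETTINGS)
# es.indices.open(index="my_index")
```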
Mathematical framework to include non-textual features
- go read the paper
A warning: lower automatic boost for short fields
- With TF/IDF: short fields (title,...) are automatically scored higher
- BM25: Scales field length with average
-- field length treatment does not automatically boost short fields (you
have to explicitly boost)
Is BM25 better?
- literature suggests so
- challenges suggests so (TREC, ...)
- Users say so
- Lucene developers say so
- Konrad Beiske says so: "BM25 vs Lucene default similarity"
- But: It depends on the features of your corpus.
- Finally: You can try it out now! Lucene stores everything necessary
- Manning et al, Introduction to Information Retrieval
- Robertson and Zaragoza, The Probabilistic Relevance Framework: BM25 and
Beyond
- Robertson et al, Okapi at TREC-3
Quantitative Cluster Sizing
1. Understanding why "it depends"
- Size of shards
- Number of shards on each node
- Size of each document
- Mapping configuration
-- which fields are searchable
-- automatic multi-fields
-- whether messages and _all are enabled
- Backing server capacity (SSD vs HD, CPU, etc.)
Your organization requirements / SLAs
- retention period of data
- ratio and quantity of index vs search
- nature of use case
- continuous vs bulk indexing
- kinds of queries being executed
- desired response time for queries that are run frequently vs occasionally
- required sustained vs peak indexing rate
- budget and failure tolerance
Let's try to determine
- How much disk storage will N documents require
- When is a single shard too big for my requirements
-- When will that shard reach a point where the search queries will no
longer be within SLA
- How many active shards saturate my particular hardware
-- Keep putting shards on a node until performance degrades, then back it
down
- How many shards/nodes will I need to sustain X index rate and Y search
rate
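Those questions reduce to simple arithmetic once the limits are measured. A back-of-the-envelope sketch, where the per-shard size limit and shards-per-node saturation point are assumed to come from the experiments in the methodology that follows (the numbers in the test are placeholders, not recommendations):

```python
import math

def estimate_cluster(total_gb, max_shard_gb, shards_per_node, replicas=1):
    """Returns (primary shards, total shards, nodes) needed for a dataset
    of total_gb, given a measured per-shard limit and per-node capacity."""
    primaries = math.ceil(total_gb / max_shard_gb)
    total_shards = primaries * (1 + replicas)   # replicas also occupy nodes
    nodes = math.ceil(total_shards / shards_per_node)
    return primaries, total_shards, nodes
```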
2. Sizing methodology
- run 4 experiments
- 1. Determine various disk utilization
-- Use a single node cluster with one index
-- 1 primary, 0 replica
-- Index a decent amount of data (1GB or about 10 million docs)
-- Calculate storage on disk
--- Mostly for logging use case, before and after _forcemerge
-- Repeat the above calculations with different mapping configurations
--- _all both enabled and disabled
--- settings for each field
- 2. Determine breaking point of a shard
-- Use a single node cluster with one index: 1 primary, 0 replica
-- Index realistic data and use realistic queries
-- Plot index speed and query response time
--- Build some dashboards. Latency increases with shard size
-- Determine where point of diminishing returns is for your requirements
- 3. Determine saturation point of a node
-- Use a single node cluster with one index
--- 2 primary, 0 replica
--- repeat experiment two to see how performance varies
--- keep adding more shards to see when point of diminishing returns occurs
- 4. Test desired configuration on small cluster
-- Configure small representative cluster
-- Add representative data volume
-- Run realistic benchmarks:
--- Max indexing rate
--- Querying across varying data volumes
--- Benchmark concurrent querying and indexing at various levels
-- Measure resource usage, overall docs, disk usage, etc.
3. Scenario and experiment results
- Practical Sizing Example
-- show some patterns, draw some conclusions about how it works
-- no numbers, all your numbers will be different anyways
- Graph latency, record count, shard size
-- Latency increases with record count proportionally
- ... lots of stuff about log ingestion use case ...
4. Interpreting results and expanding to other scenarios