Hello discovery mailing list members.
We'd like to ask for some feedback on how the Discovery team can best use
this list. We don't want to muddle the list with information that folks
outside the team don't care about, and we want to make sure that what we
post is of value.
What sort of information do you think we should share here?
It seems clear that items like these should go on this list:
- Some interesting findings - we ran a test and found x=y!
- Something new and we want to suss out interest - What if we moved the
search box to a different position on the page‽
- Upcoming projects, or other milestones related to discovery - X users
opted-in to the new autocompletion suggester.
- Discussions about quarterly goals and roadmaps
There are other items that we could share on this list, but would anyone
outside the team be interested? For example, process-related items like:
- Portal team decided to have 3x standups per week instead of 2.
- Team structure changes, or plans to change how we do retrospectives
- Notifications of meeting minutes being posted (e.g. Weekly team
meeting minutes, retrospective minutes, etc.)
For Discovery team members reading this: What criteria do you use (or wish
you could use) when thinking, "Should I post this to the public list?"
Agile Coach, Wikimedia Foundation
You may also find these diagrams useful:
On Feb 24, 2016 6:58 PM, "Mukunda Modell" <mmodell(a)wikimedia.org> wrote:
> >> On Feb 17, 2016 1:50 AM, "Guillaume Lederrey" <glederrey(a)wikimedia.org>
> >> wrote:
> >>> * I still have not found a global architecture schema (something like
> >>> a high level component or deployment diagram). But I have never seen
> >>> any company having those...
> I made a diagram of the scap (mediawiki) deployment architecture a while
> back: https://commons.wikimedia.org/wiki/File:Scap-diagram.png ..
> That does not exactly apply to the new scap3 architecture but it's not too
> far off.
> On Thu, Feb 18, 2016 at 10:37 AM, Giuseppe Lavagetto
> <glavagetto(a)wikimedia.org> wrote:
> > About cherry-picks in beta: the problem is not cherry-picking (I think
> > it's a reasonable way to test things) but persistent cherry-picking to
> > monkey patch problems is. I think if we follow the flow of:
> > - writing a patch
> > - testing it on beta with a cherry-pick
> > - get it merged on ops/puppet and production
> There are a lot of patches on beta these days and there have been a lot of
> different people cherry-picking without much coordination. This has led
> to breakage quite often. Patches also get lost regularly. I assume this
> usually happens because someone has rebased the HEAD and accidentally
> dropped a patch.
> It can be really difficult to get a patch merged in ops/puppet within a
> week (or even a month). I've seen a lot of patches sit around for weeks and
> even now with the Puppet SWAT windows, it's still sometimes unrealistic to
> expect patches get merged into production that quickly. (+CC Tyler)
> Without a system to manage things, and with very little coordination
> between everyone who is working on beta, I don't expect the situation to
> improve too much.
> I intend to propose a solution for beta & puppet patch cherry-picks very
> soon, however, I haven't fully formulated my proposal yet. I will write to
> the ops list when I have something written in a clear and presentable way.
> Ops mailing list
Searching for articles is getting a much needed boost in intelligence. When
searching for an article on any Wikimedia project, a list of possible
matches appears as you type. This incremental search "completion
suggester" helps narrow down possible results for your search query.
Currently the completion suggester is very literal - mistype a word and you
won't see any suggestions. The Discovery team has an update that will make
the search better at detecting typos and spelling mistakes. The list of
suggestions also tends to be more accurate and relevant to the original
search query. You can see the improvement for yourself as a beta feature.
The plan is to begin rolling out this update in the coming month.
This improvement will impact all Wikimedia sites (with the exception of
Wikidata). With the initial rollout, there will be a few limitations. For
technical reasons, the completion suggester will only affect articles in
the mainspace, not other pages like policy or user pages. Those searches
will continue to use the existing incremental search. We're hoping to
expand the scope of the completion suggester to include other pages in the
future, but this will not be a part of the initial release.
Why are we making this change? The success of our A/B tests on the
completion suggester show a reduction in users who find zero results when
searching. We also have a fairly large number of Wikimedians (nearly 19,000
editors since December 2015) using the beta feature, and we've received
positive feedback on the feature so far. The completion suggester has the
potential of lessening the need for redirects based on spelling mistakes as
well.
The goal is to bring these updates to the default search across all
Wikimedia projects in the coming month. This change will affect the
completion suggester in the main search box on desktop, mobile apps, adding
links in VisualEditor, and the search box on the Wikimedia Portal.
Here are two animations showing the completion suggester before and after
(using a misspelling of "Abendessen", with a missing "s").
* Completion Suggester results for "Abendesen" on German Wikipedia before
* Completion Suggester results for "Abendesen" on German Wikipedia after
Since December 2015, nearly 19,000 editors have already opted into the
completion suggester beta feature. We encourage you to try it out and share
your feedback on the Completion Suggester discussion page.
If you'd like to read a little more about the work of the Discovery
Department and other improvements to search, you can read about CirrusSearch
<https://www.mediawiki.org/wiki/Extension:CirrusSearch>, the MediaWiki
extension that powers our search.
Community Liaison - Discovery
It has happened a few times and it's quite annoying; could we raise the size
limit on this list? (It seems to be set to 40kb.)
I've just answered a mail that was crossposted to mobile-l and
discovery-l and I had no problem with the former.
As mentioned previously, the current version of the Android app contains an
A/B test where it presents "read more" suggestions to the user, based on
(a) the standard "morelike" query, or (b) the new "opening_text" query.
Here are the results from the last ~10 days of the test:
- The clickthrough rate using the default morelike query is (and has been)
- With the new opening_text query, the clickthrough rate decreases to about
[image: Inline image 1]
Therefore, it seems that the new query has a nontrivial negative effect on
clickthrough.
We'll plan on removing this test in the next release of the app, but we'll
be happy to plug in a different or updated query, if it will be of further
use to Discovery.
(queries embedded as comments in the headers)
Senior Software Engineer / Product Owner (Android)
Hi Discovery staff,
I've read your notes from your February 16 meeting that were posted to
I just want to express moral support for the engineers, analysts,
designers, managers, and other well-intentioned staff in the Discovery
team. As far as I can tell, the problems relating to the Knowledge Engine
that are being discussed in public don't reflect any desire from the
community to shut down Discovery or improvements to internal search. I hope
that you know that improvements to 0RR, maps, WDQS, search from
www.wikipedia.org, search APIs, and other present and future Discovery
projects may yet receive positive community feedback.
Thank you for your efforts and perseverance.
Speaking in my personal capacity only,
These are my notes, pretty much completely raw, from elasticon last week.
Scaling 5 nodes -> 1,000
significant terms query (expensive) may help with trending articles?
preferred log ingestion path: app -> kafka -> logstash -> elasticsearch
- es has features (node affinity) to have high power "hot" nodes and
lower power / spinning disk "warm" or "cold" nodes and ensure indices
end up on the right machines.
- **run master nodes on same physical hardware, separate JVM**
-- dedicate cores to master node
new node type: ingest
- specialized for indexing
- most common logstash filters implemented in library, used in elasticsearch
- can enrich data on indexing
- makes logstash not necessary for some workloads
10 nodes (128G memory, 32 threads, 1TB SSD) can handle 150k+ indexes per
second for logging
**make clusters a first class citizen in your application**
**multi-tenant scaling -> multiple clusters**
new features: cluster state diffs -> allows more machines/indices/shards
- may help with our master timeout issues, but talked to some core devs and
they weren't really sure. Preferred the idea to split into multiple
independent clusters for multi-tenant use case
- **more indices does matter for master complexity. All about the total
number of shards**
types are just a filter on special _type column
new feature: columnar data store (doc values) on by default in 2.0
new software: beats
- is about data collection and shipping
- may take on similar role as diamond, shipping to elasticsearch or logstash
new goal: "release bonanza" release multiple products on same date to
simplify compatibility between products
- es will go v2 -> v5, skipping between, to match other product versioning
new tool for extracting graphs from elasticsearch data
- can find strongly connected nodes, etc.
- threat detection, possibly other interesting things.
- inverted index
- document store
- column store in v4
- k-d tree - faster structured search in v6
-- replaces inverted index in certain limited circumstances
-- old structured search is based on cells, optimizations are based around
visiting as few cells as possible
-- cells can have different sizes, 2x2, 3x3, (always square?) etc.
-- k/d replacement - recursively splits the search space in two, aiming for
an equal number of points on each side, until each fraction of the search
space has only one point in it.
-- works similar to traditional databases with binary search trees
-- elasticsearch will stop at 1000 docs per side rather than just 1
-- it's a memory trade-off, the tree needs to be kept in memory at all times
-- Mike McCandless is a primary author
-- Better in almost all aspects for spatial
-- ElasticSearch 2.2 - geopoint searches
-- for 1d numeric fields the merge process is much faster: segments are
presorted, so only the final stage of a merge sort is performed.
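A toy 1-D version of that recursive split and the resulting range search may make the idea concrete (leaf size and data here are invented; Lucene's BKD trees stop at roughly 1000 points per leaf and handle multiple dimensions):

```python
def build_bkd(points, leaf_size=2):
    """Recursively split sorted 1-D points in half until leaves are small.
    Elasticsearch stops at ~1000 docs per leaf; leaf_size=2 is for illustration."""
    points = sorted(points)
    if len(points) <= leaf_size:
        return points                      # leaf: a small sorted block of points
    mid = len(points) // 2
    return (points[mid],                   # split value
            build_bkd(points[:mid], leaf_size),
            build_bkd(points[mid:], leaf_size))

def range_search(node, lo, hi):
    """Visit only the subtrees that can intersect [lo, hi]."""
    if isinstance(node, list):             # leaf: check each point
        return [p for p in node if lo <= p <= hi]
    split, left, right = node
    out = []
    if lo < split:
        out += range_search(left, lo, hi)
    if hi >= split:
        out += range_search(right, lo, hi)
    return out

tree = build_bkd([7, 1, 9, 4, 12, 3, 8, 5])
print(range_search(tree, 4, 9))   # -> [4, 5, 7, 8, 9]
```

The memory trade-off mentioned above comes from keeping the inner split nodes resident while leaves live on disk.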
BM25 by default in ES v5.0
- this is massively oversimplified, please read the code and the commit logs
- common words are naturally discriminated
- no more need to exclude stop words.
-- doc freq contribution in bm25 goes all the way to 0, rather than 1.0 in
classic tf/idf
- term freq contribution
-- bm25 saturates earlier. bounded at 2.2 by default
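Those two properties can be seen in a from-memory sketch of a Lucene-style BM25 term score (defaults k1=1.2, b=0.75; treat the exact formula as illustrative rather than Lucene's actual code):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """One term's BM25 contribution, roughly as Lucene computes it."""
    # IDF goes to 0 as a term appears in (nearly) every document, so common
    # words are naturally discriminated -- no stop word list needed.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # TF saturates early; with default k1 the multiplier is bounded by
    # k1 + 1 = 2.2, no matter how often the term repeats.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    tf_part = tf * (k1 + 1) / (tf + norm)
    return idf * tf_part

rare = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100, doc_freq=1, num_docs=10000)
common = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100, doc_freq=10000, num_docs=10000)
# rare >> common: a term in every document contributes almost nothing
```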
Faster queries - two phase iteration
- query can be divided into
-- a fast approximation
-- a (slower) match
- approximation = conjunction
-- match = position check
- Geo polygon query
-- approximation = points in cells that are in or cross the polygon
-- match = check the point against the polygon
Match cost API
1. approximate - (description:search AND description:engine) AND
(body:postings AND body:list)
** 1. iterate body:postings (1k)
** 2. Check description:engine (15k)
** 3. check description:search (200k)
** 4. check body:list (370k);
2. match - description: "search engine" body: "postings list"
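A toy model of that cost ordering: drive the conjunction from the rarest term's postings list and merely verify candidates against the larger lists. The doc-id sets below are invented; the real sizes from the slide ran from ~1k to ~370k:

```python
# Hypothetical postings lists (doc id sets); relative sizes mirror the slide.
postings = {
    "body:postings": {1, 4, 9, 12},                  # rarest term drives iteration
    "description:engine": {1, 2, 4, 9, 12, 20},
    "description:search": set(range(0, 25)),
    "body:list": set(range(0, 40)),
}

def conjunction(terms):
    """Iterate the cheapest (smallest) postings list and check each
    candidate doc against the remaining lists, cheapest first."""
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    lead, rest = ordered[0], ordered[1:]
    return [doc for doc in sorted(postings[lead])
            if all(doc in postings[t] for t in rest)]

print(conjunction(["description:search", "description:engine",
                   "body:postings", "body:list"]))  # -> [1, 4, 9, 12]
```

Only the small `body:postings` list is fully scanned; the expensive position check (the "match" phase) would then run on just these survivors.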
* better query-time synonyms
- disk based norms
- more memory efficient doc values
- more disk efficient sparse doc values
-- lucene 5.4
- BooleanQuery simplification in rewrite
-- for instance imagine you are searching a matching docs query
-- filter (constant score query)
-- this is going to perform a conjunction (all docs that match both queries)
-- lucene 5.5
- bulk scorer specialization for MatchAllDocsQuery and MUST_NOT clauses
-- matching docs iterator must return sorted by score
-- rebalancing prioritizations was causing slowdowns
-- instead of doing by document, now doing it in batches of 2048
-- a NOT b
- improve file truncation detection
-- leading cause of index corruption
-- checksums added in 4.8 notice the error
-- advances in 5.x add nice error message
Also, SVN -> git
Scaling log ingestion at ????
10-12 nodes, old DB servers. Lots of memory, disk, fast CPUs
- number of indexes doesn't matter, it's about the number of shards
- stores 13 months worth of data
- two JVMs per server. One "hot" ingest instance pointed at fast disks.
one "warm" instance pointed at slower spinning disks for querying
- 384G ram. 60TB of disk, 96 hardware threads, etc.
(initially) The bad:
- inefficient use of hardware
- limited number of nodes
- larger amounts of data to replicate on node failure
- run all independent instances? Run many instances on one node?
- hardware overpowered for data nodes
(reworked cluster config) The good:
- Run elasticsearch instances completely independent. Separate yml config,
separate data directories, etc.
- One hot ingest instance, 4+ warm instances
-- uses ~50% of memory for JVM
-- no matter how big your box is, keep jvm heap < 50%
-- they run 5 to 7 instances; on 48 core machines (96 with HT) running 5
instances does not need cpu affinity. Running 15 does (but 15 is a bad idea)
- increase nodes
-- more resources for queries
-- better distribution of the larger number of shards
-- more shards supported, thus more indexes, thus more data available online
- faster replication on node failure due to smaller data sets
- increased peak query load possible by 10x (probably due to more jvm heap?)
- # of shards per instance affects scaling, query patterns
- requires creative allocation rules to keep primary and replica off of the
same physical machine
-- quite a few. Requires many additional tags as well.
- get much more out of the hardware with smaller (~30G) heaps and more JVMs
The Bad: What did it cost us?
- more complicated cluster management
-- need to address each instance for all management
-- need to set up proper allocation rules to protect data from physical node
failure
--- originally intended to define racks. Allows separating primary/replicas
--- reuse this to support many independent instances per machine
- Non standard configuration means more difficult to get assistance from
public resources and documentation
-- (paid) Support is your friend here.
- physical node loss means greater cluster impact
-- When a physical node goes down 40T+ of data needs to re-replicate
-- this takes about an hour in their network/disk configuration
- higher cost of hardware for data nodes
- more power and environment controls necessary
The End: well not really we are still growing
- capable of processing 5.75 billion events a day
- peak is 2x of trough (as high as 10x)
- 24-30 physical data nodes
- 5 instances (1 hot, 4 warm), lots of storage, fast cpus, lots of memory
-- can handle hardware failure and/or node crashes
-- can handle maintenance/upgrades
- small footprint in DC
- tested JVM heaps down to 24G, was not enough for ES caching. Optimal is
just under 32G in his instances
- main difference between hot and warm instances is how expensive the disk is
- running es 1.4.4, testing 2.2 but have not finished converting everything
- nodes are not separated on the boxes (VM, docker, etc), one kernel
Relevant config variables:
node.name - to identify the specific instance. For consistency suggest
naming the process, configuration files and node all the same
node.server - to identify the physical server
node.tag - optional - to be able to move shards around
path.conf - to identify the unique configuration directory
cluster.routing.allocation.same_shard.host: true - enable check to prevent
allocation of multiple instances of the same shard on a single host
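Put together, the per-instance settings above might look something like this in each instance's elasticsearch.yml (hostnames, paths, and tag values here are hypothetical; `node.server` and `node.tag` are custom node attributes referenced by the allocation rules):

```yaml
# /etc/elasticsearch/hot-1/elasticsearch.yml -- instance "hot-1" on es-phys-01
node.name: es-phys-01-hot-1        # unique per instance; matches process/config naming
node.server: es-phys-01            # custom attribute identifying the physical machine
node.tag: hot                      # optional tag used to move shards between tiers
path.conf: /etc/elasticsearch/hot-1
path.data: /var/lib/elasticsearch/hot-1

# prevent a primary and its replica from landing on the same physical host
cluster.routing.allocation.same_shard.host: true
```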
Stories from Support: Top Problems and Solutions
The Big Issues
1. Fielddata (always single word)
- most common problem.
- 60% of total heap can be taken up in 1.x and 2.x
-- more likely to be used in 1.x
- 100% of data nodes will be impacted
- 2.x (Mostly) Solves This - Thanks to doc_values being on by default
-- Wouldn't help for wikimedia use case, we have very few aggregations that
would use doc values. We also have relatively small fielddata though
- Columnar store of values
- Written at index time to disk
-- Adds an extra file next to the segment on disk
-- takes more indexing time
-- Means you don't have to un-invert the inverted index at query time
-- Part of the reason JVM should not take more than 50% of memory
- Leverages the operating system's filesystem cache
- When upgrading from 1.x to 2.x you have to reindex to get the change
-- You cannot enable doc values after the fact.
- GET /_cat/fielddata?v is your friend
-- Couple hundred megabytes per node. Basically nothing.
- Mostly for aggregates (as in, not useful at wikimedia)
- analyzed strings do not currently support doc_values, which means that
you must avoid using such fields for sorting, aggregating and scripting
- analyzed strings are generally tokenized into multiple terms, which means
that there is an array of values
- with few exceptions (e.g., significant terms), aggregating against
analyzed strings is not doing what you want
- unless you want the individual tokens, scripting is largely not useful
- Big improvements coming in ES 2.3 ("keyword" field)
-- Redo not analyzed strings to allow them to use analyzers that produce
- use multi fields to create analyzed and un-analyzed fields
2. Cluster state
- every cluster state change is sent to every node
-- requires a lot of short lived, potentially large network messages
-- gets worse with more nodes or indices
-- mappings tend to be the largest portion
- GET /_cluster/state?pretty
- Not stored in memory as JSON, so this is just to give the idea (it's
likely 5% of it ballpark)
-- elastic1001 reports JSON as 25MB. 5% would be 1.25MB (tiny)
-- a Gig would be devastating for any cluster size
-- 100M for a 100 node cluster would be "reasonable"
-- 100M for a 10 node cluster would be odd
-- Worst seen (by presenter) was 4G
- ES 2.0 introduces cluster state diffs between nodes
-- changes become far more manageable and a large cluster state is no longer
such a problem
-- reducing your mapping size helps too
-- Do not allow dynamic mappings in production
- Do not use _types to separate data
-- duplicating mapping if they are the same
-- if they are different, fields become unnecessarily sparse or conflicting
-- prefer to create separate indices
- Create a "type" field to do this for you
- Prefer changes in bulk rather than one-by-one (allows changes to be
batched)
** Create 900 new titlesuggest indices in one cluster operation?**
-- there is no recommended size. Maybe 5MB payload size? Play around with it
3. Recovery
- Changed dramatically in ES since 1.4
- restarting a node or otherwise needing to replicate shards
- terribly slow process
- segment by segment
- minor risk of corruption pre-es 1.5 with sketchy network
-- if 90% done and network drops there could be issues
-- don't use 1.5
1.6 -> asynchronous allocation and synced flushing
1.7 -> delayed allocation and prioritized allocation
2.0 -> Cancelled allocation
2.1 -> prioritized allocation for replicas
If you are not on 1.7.5 get there SOON.
4. Node Sizing
- elasticsearch is a parallel processing machine
- java can be a slow garbage collecting calculator
- slow disks. The problem for every data store?
-- Do not use SAN. Really. Don't do it
- A few huge instances aren't best, keep JVMs <= 30G
And how long is a piece of string?
- 50% of system ram to heap
- Up to 30500M - no more or your heap loses optimizations!
-- compressed pointers
- indexing tends to be CPU bound
- At least 2 cores per instance
-- more you give it the faster it can go (duh)
- disks get hammered for other reasons, including write-impacting
- translog in 2.0 fsyncs for every index operation
-- major major change in 2.0
- SSDs or Flash are always welcome
- ES 1.x defaults to 5 primary, 1 replica
- ES 2.x defaults to 2 primary, 1 replica
- 50GB is MAX shard size. More for recovery than performance reasons
- Increase primaries for higher write throughput and to spread load
- Shard because you need write capacity.
- If you need better search response (latency) increase number of shards
- Replicas are not backups. Rarely see a benefit with more than 1
-- They are for HA, not backups
-- They do not save you if "everything goes to hell"
-- Snapshots are your friends
--- Incremental, next snapshot gets only segments that are "new"
-- Oversharding is becoming an issue
--- 41k shards in a 10 node cluster is extreme
- Deep pagination
-- ES 2.0 has a soft limit of 10k hits per request. Linearly more expensive
-- Use scan and/or scroll API
- Leading wildcards
-- Equivalent to a full table scan (BAD!!)
- Scripts
-- Without parameters
-- dynamically (inline)
- Search is always faster than aggregations
- Don't aggregate when you want search
- Disable it on SSDs
Use bulk processing
- Indexing 1 by 1 you will have a bad day (if trying to index many things)
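For reference, a sketch of what an Elasticsearch `_bulk` request body looks like: newline-delimited JSON with one action line, then one source line, per document. The index and field names below are made up:

```python
import json

def bulk_body(index, doc_type, docs):
    """Build an Elasticsearch _bulk request body: one action line and one
    source line per document, separated by newlines."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index,
                                           "_type": doc_type,
                                           "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # the body must end with a newline

body = bulk_body("titlesuggest", "page", [
    (1, {"title": "Abendessen"}),
    (2, {"title": "Abendbrot"}),
])
# POST this body to /_bulk in batches of a few MB rather than one doc at a time.
```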
Pitfalls securing search - or why REST based security solutions rarely work
Why would you?
- Most of the clients are using HTTP protocol
- Easy to implement using standard HTTP reverse proxy
- Many elasticsearch operations are REST compliant
-- PUT /index/type/doc_id - indexes a document
-- GET /index/type/doc_id - gets a document
- Reverse proxy isn't enough, you also need firewalls
-- (I don't think that's a good idea at all. Use SSL)
-- must establish security perimeter
What does it get you?
- Works great for authentication. Access/No access to everything
- Fails for authorization
-- Role based access to certain indices, certain parts of indices, certain
documents or fields
Limiting access to an index by using URL matching
- Theoretically easy! Just setup the following URL matching patterns:
-- GET /myindex/*/*
-- PUT /myindex/*/*
-- POST /myindex/*/*
- But what about?
- POST /myindex/type/_bulk
- Could disable multi-index rest actions
-- rest.action.multi.allow_multi.??? = none
Search is not always limited to the index on the url
Cross index search operation:
- More like this query
- Terms lookup mechanism in the terms query
- you end up needing to whitelist/blacklist query patterns
Be really careful with URL matching rules!
- Documented exceptions on the elasticsearch website
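A toy demonstration of the pitfall above: a URL whitelist that only allows paths under `/myindex/` still forwards a `_bulk` request whose body targets another index. The pattern and index names are invented for illustration:

```python
import json
import re

# A naive reverse-proxy whitelist: only paths under /myindex/ are allowed.
allowed = re.compile(r"^/myindex/[^/]+/[^/]+$")

path = "/myindex/type/_bulk"
body = "\n".join([
    json.dumps({"index": {"_index": "secretindex", "_type": "doc", "_id": 1}}),
    json.dumps({"leaked": True}),
]) + "\n"

assert allowed.match(path)  # the proxy happily forwards this request...
# ...but each _bulk action line can name its own index, overriding the URL:
target = json.loads(body.splitlines()[0])["index"]["_index"]
print(target)  # -> secretindex
```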
Document level security
- A typical solution for document level access control
-- Add an ACL token(s) to each document
-- Create an alias with ACL filters for each user
-- Limit users access to their aliases
- Works *most* of the time
-- Aliases are not considered in
--- Suggest API
--- Children Aggs
If you are trying to use aliases as a security solution, augment them with
query whitelisting
Filtered aliases are not a security mechanism
- More like working together with other systems
Field level security
- A naive solution - filtering of source in response
-- This is not enough; if you can search it you have access to it. There is
no exception to this.
A field in the source can appear in multiple ways
-- multi match, aggregations, etc. etc.
-- Requires basically whitelisting queries
Securing search is a difficult problem
- URL path is not a filter
- Some operations are working on entire index
- It's a search and analytics engine, which makes frequency analysis attacks
so much easier
-- Lots of nice statistics like term frequency, etc.
Is there a solution?
- Move all security to business logic layer. Deny raw access
- Move security to transport layer
-- This is what shield (now xpack) does
-- They believe that securing at the lowest level is the only way to secure it
- Hard to know from the REST api what will happen with a given request
- Authorization is secured at the individual TransportAction level
-- Prevents the user from accessing a terms lookup (for example) on an
index they don't have access to
- TLS encryption for inter-node transport and http
- Message signing means even without TLS a node cannot join the cluster
- Secured down to the lucene level - integrating with how elasticsearch
reads the index
- In a request with document level security, the query defined on a role
will get added to the index reader. Documents that aren't allowed look
deleted to the reader.
- Field level security is similar. At the lucene level fields look like
they don't exist
Boosting Search Accuracy with Engine Scoring and Predictive Analytics
Chief Architect, Search Technologies
(note to self: This will probably be advertising to hire them as
(talked to paul after talk, he had previously talked with nik and is
interested in working with us. I think what he really wants is to use our
click through data to improve his own business case though. Tomasz will be
in contact and see if we can work together).
- Speaker has been involved in this since mid 90s
- Many of these ideas came from TREC, updated with big data and logs
- Common customer complaint
-- Q: Whats wrong?
-- A: Our search is bad
-- Q: How bad?
-- A: Bad
-- Q: Scale 1-10?
-- A: 8? 9? Let's call it 9.23
-- (completely bogus, means nothing without statistically based numbers)
- Golden query set
-- Key documents
-- Just because it's called golden doesn't mean it's any good
- Top 100 / Top 1000 queries analysis
-- May not represent actual usage of your system
-- Taking a random selection can be more useful. For example, for one customer
40% of queries were human names; that large variance didn't show up in the top 1000
-- The long tail could be 85%+ of queries
- Zero result queries
- Abandonment rate
- Queries with click
These all have a problem: You need to put your search engine into
production to compute them which hurts your bottom line.
- (if you are ecommerce or advertising)
What are we trying to achieve?
* Reliable metrics for search accuracy
* Can run analysis off-line
-- Does not require production deployment (!)
* Can accurately compare two engines
-- Different technologies
-- Different versions of same engine
-- Different scoring on same engine
* Runs quickly = agility = high quality
-- Iterating engine quickly is key
* Can handle different user types / personalization
-- Broad coverage
-- Modify relevancy scoring based on source website, or the "cluster" the
user is in
* Provides lots of data to analyze what's going on
-- Data to decide how best to improve the engine
Leverage logs for accuracy testing
Query logs, Click Logs -> Engine scoring framework -> Search engine under
test
* engine scores - have one (or two, or three) top level numbers that give
your engine a score for the query set
* other metrics and histograms - generate useful metrics, such as a
histogram of where recorded click through were in the final positions
* scoring database - Record historical scores, and what changes led to
those scores. Re-review prior changes after making new changes. They might
not be as useful any more!
From Queries -> Users
* User by User metrics
-- change in focus
-- Queries don't matter. People matter. It does *not* matter if a query got
the right answer. It matters if the user got their right answer (maybe it
took two queries? Not the end of the world)
* Group activity by session and/or user
-- call this an "activity set"
-- Merge session and users
* Use Big Data to analyze *ALL* users
-- There are no stupid queries and no stupid users
-- Overall performance based on the experience of the users
- Group activity by session and/or user (Queries and Clicks)
-- Trying to create a model for the user. Everything the user has found
Determine "relevant" documents
-- Lots of people don't have user ids, only sessions
-- What did the user view? Add to cart? Purchase?
-- Did the search engine return what the user ultimately wanted?
* Determine engine score per user per query
-- Σ power(FACTOR, position) * isRelevant[user, position]
-- Evaluated from each user's point of view
-- (Note: Many other formula possible, MRR, MAP, DCG, etc.)
-- Speaker happens to like this one. Some algorithms prefer top results, some
prefer results "on page"
* Average score for all user queries = user scores
* Average scores across all users = final engine score
-- Important to average per user (or per session) and not per query
-- Prevent single user with 99 queries from changing the score
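A toy version of the per-user scoring just described (FACTOR and the sample data are made up; the key point is averaging per user before averaging across users):

```python
FACTOR = 0.5  # decay per result position (hypothetical value)

def query_score(relevant_positions):
    """Score one query for one user: sum FACTOR^position over the
    0-based positions where that user's relevant documents appeared."""
    return sum(FACTOR ** pos for pos in relevant_positions)

def engine_score(per_user_queries):
    """Average per user first, then across users, so one user with 99
    queries cannot dominate the overall score."""
    user_means = [sum(query_score(q) for q in queries) / len(queries)
                  for queries in per_user_queries.values()]
    return sum(user_means) / len(user_means)

score = engine_score({
    "user_a": [[0], [1]],        # relevant doc at position 0, then position 1
    "user_b": [[0], [0], [2]],   # three queries, so averaged over three
})
print(round(score, 3))  # -> 0.75
```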
Off Line engine Analysis
-- Done by re-executing the users queries offline
Continuous improvement cycle
Modify engine -> execute queries -> compute engine score -> evaluate
results -> back to beginning
* Kind of like sports, you can have a personal best engine score
* Scoring never stops, generating one score is not enough. Generate scores
and track over time
What else can we do with engine scoring?
- predictive analytics
- The brutal truth about search engine scores
-- They are not based on science.
-- Random ad-hoc formula put together
- BM25 invented in 1980. Not state of the art. Predictive analytics is
the more modern approach
Use Big Data to predict relevancy
Document Signals, Comparison Signals, Query Signals, User Signals
The score predicts probability of relevancy
Value from 0 -> 1
* Can be used for threshold processing
* All documents too weak? Try something else!
* Can combine results from different sources / constructions together
- Identifies what's important
-- Machine learning optimizes for parameters
-- identifies the impact and contribution of every parameter
- If a parameter does not improve relevancy -> REMOVE IT!!!
Come out of the darkness
- Ultimate message is use science
- Not time based, just a list of queries and clicks
- Don't have to tie clicks to users
- Single relevancy score not as useful - Build relevancy lab to have
- tf/idf may be a signal, but the resulting formula usually replaces tf/idf
-- tf/idf only used to surface likely documents
NY Times Search
explicit search, formal user queries, used but not the main use case
implicit search powers much of nytimes.com
Few updates - A few thousand updates per day
Sources - CMS, Legacy systems, file based archives, OCR
Latency - <1s, Low latency is important. Especially for implicit search.
Content typically doesn't show up on website until it's indexed
Typical Use Cases
- not comprehensive
- People search for something they read. They read an article, then want to
read it again
-- Only 1 correct result.
-- Goal (not accomplished): recognize these queries and give a single,
definitive result
- Find something you wrote
-- Less common, but happens surprisingly often
- Find reviews for movies/books
-- Typically has a single correct result
- Find recipes
-- Similar, but more than a single correct result
- "Manual sharing"
-- Instead of sending a link to someone you tell them you read an article
on xyz and they search for it
- "Casual" news search
-- Looking for information on recent events
- Serious research
-- More interesting
-- Makes up a large share of queries on the website
-- Typically journalists or historians
Why not just use google?
1. Keep the customers on site
-- With Google there's a risk the customer goes to a different site
2. There is no google for native apps
-- Native apps need to provide good search
-- Because they have multiple apps that do slightly different things they
need special cases
3. We know our content better
-- Most important
-- Have editorial teams, taxonomy teams. They know which article is
relevant for a specific case
-- Elasticsearch has allowed them to tune this exactly how they want
Search should be targeted at their most advanced readers
- A movie review can be found on google
- If you want to know the 10 most important articles on Syria, NY Times can
do that much better
-- Not there yet, lots of work still to do but that is the direction they
are heading
Our current Setup
- The search stack runs entirely on AWS
-- Case for NYTimes website
- Deployed using an in-house infrastructure tool
- Nagios and New Relic for monitoring
- Sumo Logic for log management and analytics
- Two full production clusters
-- Not just for elasticsearch, but the whole stack (ingestion pipeline, etc)
- DNS based failover
- Also used for maintenance/major changes/reindexing
- 16 node ES clusters in production (m1.xlarge)
- Many intermediate layers read/write data from S3
- Beanstalk handles queueing (i think?). Replacing with kafka
- Large number of rules defining how to extract fields from different data
sources
- Constructs canonical document which is same for all data sources
- Canonical document pushed into ES
- Ensures that only the latest version is indexed
- Pushes documents into elasticsearch and mongodb
- powered by elasticsearch and mongodb
- For each search result looks up full document from mongodb and returns
that to client
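The lookup step described above can be sketched as follows. This is a minimal illustration with hypothetical names, using plain dicts to stand in for the real Elasticsearch response and MongoDB collection:

```python
def join_results(es_hits, mongo_store):
    """es_hits: list of {'_id': ...} entries from an Elasticsearch response.
    mongo_store: stand-in for a MongoDB collection, keyed by document id."""
    results = []
    for hit in es_hits:
        doc = mongo_store.get(hit["_id"])
        if doc is not None:  # skip ids deleted from MongoDB since indexing
            results.append(doc)
    return results
```

In the real stack the `mongo_store.get` call would be a MongoDB query by `_id`; the point is that every search requires a second round-trip to another datastore.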
Two clusters make maintenance work easy.
- The way the handler works makes it relatively easy to handle the failover
Number of issues with this setup:
- Too many moving parts
- Too many things that can go wrong
- Difficult with this setup to index new fields
-- New fields are added quite frequently
-- Have to go back to normalization, add new fields, reindex everything
-- Many times requires going all the way back to CMS
-- Simple task that shouldn't be this hard
- Despite the fact they have two clusters they don't know if the offline
cluster is still working
-- Cache goes cold
-- Have to warm up cache before putting online
-- Offline cluster doesn't receive production traffic
-- Only discover failure when they try to fail over
- Hard to run a complex setup like this on a laptop
-- Want to run entire pipeline locally
- Overall, this is a setup that is so complicated that they spend a good
amount of their time doing operations work instead of search
-- Seen this issue in many places
Future Work - Simplify
1. Full document in elasticsearch
- store and index full documents in elasticsearch
-- Not done previously for historical reasons of support for document sizes
(alt search engine?)
- No external lookup necessary in the API, no MongoDB
- Demand-driven API - clients decide what they want.
- Not in production yet, but some tests show this is better
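One way the demand-driven idea could look (an illustrative sketch, not the NYT API): the client names the fields it wants, and the request body uses Elasticsearch `_source` filtering, so no external lookup is needed:

```python
def build_search_body(query_text, requested_fields):
    """Build an ES request body; requested_fields come from the client.
    The "body" field name is illustrative."""
    return {
        "query": {"match": {"body": query_text}},
        "_source": list(requested_fields),  # ES returns only these fields
    }
```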
2. Use kafka
- All content will go through kafka
- Kafka will persist all content, source-of-truth for search
- Easy to create new search clusters through Kafka replay
- Goal: Never upgrade cluster. Boot new cluster, replay kafka to populate,
switch over and kill old cluster
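The replay-to-populate idea can be sketched like this. Messages are modeled as `(doc_id, version, doc)` tuples from a Kafka-style log, and only the newest version of each id survives, matching the "only the latest version is indexed" guarantee above. Names are illustrative, not the NYT implementation:

```python
def replay(messages):
    """messages: iterable of (doc_id, version, doc), in any order.
    Returns the rebuilt index state with only the newest version kept."""
    index = {}
    versions = {}
    for doc_id, version, doc in messages:
        if doc_id not in versions or version > versions[doc_id]:
            versions[doc_id] = version
            index[doc_id] = doc
    return index
```

A new cluster can then be booted and populated by running exactly this loop over the full log, with the indexing call replacing the dict write.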
3. Keep all clusters busy
- All production traffic will be replayed from active production cluster to
standby production and to staging
- Uses gor
- Makes sure that standby cache is always warm
- Changes to staging will be exposed to production traffic
- If standby cluster crashes they will know
4. Vagrant boxes for everything
- Make it easy to run full pipeline locally.
- Provision vagrant boxes using same system as production servers
-- *exactly* the same. No differences between local and prod per-box
- No S3. No beanstalk. Much simpler.
Conclusion and comments
1. Figure out how you can be better than Google
-- If you don't think you can be better, let Google do it
-- Know your users
-- Search should be an integrated part of the product
--- Search should not be an afterthought
--- Plan it from the beginning
-- Be consistent across devices and platforms
-- Focus on where you can make a difference for the end user
2. Infrastructure is hard
Think about how to
- do deployments
-- Far too often people set up search clusters without planning how to
deploy new versions
-- Know how you can easily upgrade
- do upgrades
- reindex content
- change the way you normalize data
- create new search clusters
... and make it easy.
Things to do
- Put your infrastructure in version control
-- Nothing should be done from the command line to set up infrastructure
-- When everything crashes, and some day it will, it needs to be easy to rebuild
- Vagrant everything
- Have immutable servers, replace instead of changing
-- Only really possible on cloud-based infrastructure; raw hardware requires
a different approach
- Make deploying a new stack easy and automatic
Is elasticsearch for text search?
- Background for this question: they attended a one-day Elastic conference
in New York
-- None of the talks were about text search
-- This week is very similar
-- Looked over new features in ES2.0. Categorized into
analytics/search/other stuff(mostly cluster)
-- Found 6 issues out of 119 dealing with search. 5%.
-- Skeptical that Elastic wants to focus on text search as an important use
case
-- Those features are not revolutionary in any way, just regular tweaks
New things necessary in elasticsearch for text search:
- Query log handling
-- Need to know what users are searching for and what they are doing
-- Completely missing from elasticsearch
-- Considers it a big problem
-- Elasticsearch has plenty of ways to analyze logs, but nothing that logs
and ingests its own search history
- Easier relevancy tuning
-- Difficult to change things: getting the right score for one document
without changing all other searches is hard
-- Needs tooling. Providing a set of benchmark queries
-- When you change things for one query, how does that affect other queries?
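One simple way the benchmark-query tooling could work (an assumption for illustration, not the speaker's tooling): for each query with a known best document, check whether a candidate scoring function still ranks that document first, so a tuning change can be measured rather than guessed at:

```python
def benchmark_top1(queries, score):
    """queries: list of (query, expected_doc_id, candidate_docs).
    score(query, doc) -> float.
    Returns the fraction of queries whose expected doc ranks first."""
    hits = 0
    for query, expected, docs in queries:
        best = max(docs, key=lambda d: score(query, d))  # top-ranked doc
        if best["id"] == expected:
            hits += 1
    return hits / len(queries)
```

Running this before and after a relevancy change shows whether a fix for one query regressed the others.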
- Improved linguistics
-- No built in language detection
-- No morphologies - lemmatization
-- No compounding
-- No entity extraction
- Really wants to see some leadership from elastic on the text search space
Optimizations around analyzers?
- Not much. Most is done pre-indexing by adding keywords and custom scoring
Primary data store?
- Source of truth is custom CMS
Moving away from AWS?
- not trying to, just want to make sure it's possible (using kafka instead
of SQS, etc.)
Follow up on latency, time from publish to shown on site?
- Less than 1s
Geospatial data structures in elasticsearch and lucene
Geo field types
-- used to have 8 or 9 parameters; as of ES 2.2 only 3 parameters necessary
-- others still there, but deprecated and being removed
-- type, geohash_precision, geohash_prefix, ignore_malformed is all that is
left
-- only type and ignore_malformed are being kept
- geo shapes
-- used to have 12 parameters, many expert only
-- going forward: type, ignore_malformed, tree, precision,
distance_error_pct, orientation, ???_only
-- orientation is going away; the standards define one convention: the right-hand rule.
-- ES 5.0 is ISO compliant and will reject non-compliant shapes
-- points_only allows short-circuiting the quad tree logic, optimizes for
point data
--- points_only vs a regular points index is 10-15% better performance
--- only geo shape no points only,
OGC simple feature access
ISO Geographic information - spatial schema 19107:2003
GeoPointField change in lucene 5.4
- creates a quad tree encoding: first bit is world-level lat, next bit
world-level lon
-- rotates back and forth every two bits
-- Space filling curve, dimensional reduction
-- Their implementation is a Z-curve (Morton curve); nearest neighbors in
space are not necessarily close on the curve.
-- investigating the Hilbert curve. The current encoding takes 4 operations,
Hilbert takes longer. Tradeoff of accuracy/locality vs performance
-- precision step parameter affects size of index: more precision -> larger
index
-- "optimal" precision step for smallest index possible; currently 9
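The Z-order (Morton) interleaving described above can be shown with a toy encoder (illustrative only, not Lucene's GeoPointField implementation): the bits of quantized lat and lon are interleaved, lat first, so points that share a quad-tree cell share a code prefix:

```python
def morton_encode(lat_bits, lon_bits, nbits=8):
    """Interleave two nbits-wide integers, lat bit first, then lon bit,
    from most significant bit down. Nearby points share long prefixes."""
    code = 0
    for i in range(nbits - 1, -1, -1):
        code = (code << 1) | ((lat_bits >> i) & 1)
        code = (code << 1) | ((lon_bits >> i) & 1)
    return code
```

Each pair of output bits corresponds to one quad-tree level, which is why the notes describe the encoding as alternating between lat and lon.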
2.1 -> 2.3 large change in performance of geo. ~30% index size, 12% heap,
40% time (indexing or query?)
5.0 -> geo shapes and geo points merged as geo @experimental
- balanced k-d trees
A free text search is a very inaccurate description of our information need
What you want
- quick learner
- works hard
But you type: "hard-working, self-motivated, masochist"
The purpose of this talk
- know the monster, understand what the parameters of BM25 do
- know why it has the label "probabilistic"
- mathematical concepts behind it: more for your entertainment
- be convinced that switching to BM25 is the right thing to do
-- hardest part of this talk
- be able to impress people with your in-depth knowledge of probabilistic
ranking
The current default - TF/IDF
- search in self-description of application for these words
- we want to order applications by their relevance to the query
Evidence for relevance - term frequencies
- Use term frequencies in description, title, etc.
- term freq: more is better
-- the more often it occurs the more relevant this document is
-- square root of the term frequency. Sum it up
-- increases non linearly
- inverse document frequency
-- common words are less important
-- a, is, the, etc.
-- also non-linear: multiply tf with idf and sum that
- long documents with same tf are less important: norm
-- also non-linear, idf is more L shaped than this though
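The shape described in these bullets can be sketched as a scoring function. This is a rough approximation of Lucene's classic similarity (square-root tf, logarithmic idf, inverse-square-root length norm), not its exact practical scoring formula:

```python
import math

def tfidf_score(query_terms, doc_terms, doc_freq, num_docs):
    """doc_terms: list of tokens in the document.
    doc_freq: term -> number of documents containing it."""
    length_norm = 1.0 / math.sqrt(len(doc_terms))  # long docs score lower
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        idf = 1.0 + math.log(num_docs / (doc_freq[term] + 1))  # rare is better
        score += math.sqrt(tf) * idf * length_norm  # sublinear in tf
    return score
```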
Bool query and the coord-factor
- doc a: holiday: 4, china: 5
- doc b: holiday: 0 china: 15
- Coord factor: reward document a with both terms matched
-- basically a hack
- successful since the beginning of lucene, since first release
- well studied
- easy to understand
- one size fits most
So...can we do better?
- It is somewhat a guess
- sum of square roots of term frequencies, which was a guess by the original
author
-- ad hoc
Probabilistic ranking and how it led to BM25
The root of BM25: probability ranking principle (abridged)
-"if retrieved documents are ordered by decreasing probability of relevance
on the data available, then the system's effectiveness is the best that can
be obtained for the data."
- Chance to deploy prior knowledge of relevance and scoring to make results
better
- if mapped to a mathematical framework, we can take advantage of
mathematicians' prior work
Getting into math:
1. Don't care how much heap or how many CPUs. Get to that later, forget
about it now
2. Think of yourself as super smart, beyond 2 standard deviations
- simplification: relevance is binary!
- get a dataset of queries with relevant/irrelevant documents
-- use that to estimate relevancy
- how to use this to actually get a better scoring?
-- P(A|B) = probability of A given B, R = relevancy (1/0), d = document
- For each document, query pair: what is the probability that the document
is relevant?
-- Need some sort of matrix, document vs query
...here be math...
- paper: The Probabilistic Relevance Framework: BM25 and Beyond
- Stephen Robertson and Hugo Zaragoza
..and we get to..
- W(d) = sum log( P(..)P(..) / P(..)P(..) )
BM25: how to estimate all these probabilities
- two assumptions
-- the binary independence model - a dramatic but useful simplification
-- query term occurs in a document or doesn't, we don't care how often
- Robertson/Sparck Jones weight
-- another equation
-- assumes unlimited supply of interns
-- still use robertson/sparck jones weight but assume that the number of
relevant documents is negligible (R=0,r=0)
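For reference, the Robertson/Sparck Jones weight mentioned above is usually written as follows (standard form; N = total documents, n = documents containing the term, R = relevant documents, r = relevant documents containing the term):

```latex
w = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}
% with the R = r = 0 assumption it reduces to an idf-like weight:
w = \log \frac{N - n + 0.5}{n + 0.5}
```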
- BM25 has S shape, goes to zero. tf/idf is L shaped
- In TF/IDF the more often the term occurs the better
- but, is a document about a term just because it occurs a certain number
of times?
- this property is called "eliteness" in the literature
Example for "eliteness"
- Look at wikipedia: Many documents are about tourism
- Many documents contain the word tourism - but are about something
completely different, like for example just a country
Can we use prior knowledge on the distribution to ...
Eliteness as Poisson Distribution
- document is not about the term
- document is about the term
How to estimate this?
- gather data on eliteness for term
- many term frequencies -> do for many documents
How relevance ties into that
- suppose we knew the relationship of ???
Tf saturation curve
- limits influence of tf
- allows to tune influence by tweaking k
- major difference from tf to bm25
So...we assume all documents have same length?
- Poisson distribution: assumes a fixed length of documents
- but they don't have that (most of the time)
- we have to incorporate this too!
- scale tf by it like so: (long equation)
-- includes interpolation between 1 and document length over average
Influence of b: length weighting
- tweak influence of document length
- similar to tf/idf, depending on b value
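The saturation curve and the influence of b can be sketched with the standard BM25 per-term tf component (the k1 and b defaults below are the common ones, not a recommendation):

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100, avg_len=100):
    """BM25 per-term tf component: saturates toward k1 + 1 as tf grows
    (the S shape); b scales tf by document length relative to the
    corpus average (b=0 disables length normalization)."""
    length_norm = 1 - b + b * (doc_len / avg_len)
    return (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

No matter how often a term occurs, its contribution stays below k1 + 1, which is exactly the "limits influence of tf" point above.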
Is BM25 probabilistic?
- many approximations
- really hard to get the probabilities right even with unlimited data
-- mathematicians have been trying to solve this for years, no success
- BM25 is "inspired" by probabilistic ranking
-- 1992 Trec-2 took a "leap of faith" (?)
-- 1993 Trec-3 BM25 final!
-- 1999 First lucene release (tf/idf)
-- 2011 Pluggable similarities + BM25 in lucene
So..will I get better scoring with BM25?
Pros with the frequency cutoff:
- TF/IDF: common words can still influence score
- BM25: limits influence of term frequency
-- less influence of common terms
-- no more coord factor!
-- check if you should disable coord for bool queries?
--- index.similarity.default.type: BM25
- parameters can be tweaked. Not sure if many people actually need to. You
need data to measure if scoring is getting better or not. Do not tweak
blindly.
- close index
- update mapping (or settings)
- re-open index
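The close / update / re-open steps above, with the settings body Elasticsearch expects. The commented client calls are assumptions (elasticsearch-py naming), shown only to illustrate the sequence:

```python
# index.similarity.default.type: BM25, plus its two tunable parameters
BM25_SETTINGS = {
    "index": {
        "similarity": {"default": {"type": "BM25", "k1": 1.2, "b": 0.75}}
    }
}

# The sequence against a live cluster would look roughly like:
# es.indices.close(index="my_index")
# es.indices.put_settings(index="my_index", body=BM25_SETTINGS)
# es.indices.open(index="my_index")
```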
Mathematical framework to include non-textual features
- go read the paper
A warning: lower automatic boost for short fields
- With TF/IDF: short fields (title,...) are automatically scored higher
- BM25: Scales field length with average
-- field length treatment does not automatically boost short fields (you
have to explicitly boost)
Is BM25 better?
- literature suggests so
- challenges suggests so (TREC, ...)
- Users say so
- Lucene developers say so
- Konrad Beiske says so: "BM25 vs Lucene default similarity"
- But: It depends on the features of your corpus.
- Finally: You can try it out now! Lucene stores everything necessary
- Manning et al, Introduction to Information Retrieval
- Robertson and Zaragoza, The Probabilistic Relevance Framework: BM25 and
Beyond
- Robertson et al, Okapi at TREC-3
Quantitative Cluster Sizing
1. Understanding why "it depends"
- Size of shards
- Number of shards on each node
- Size of each document
- Mapping configuration
-- which fields are searchable
-- automatic multi-fields
-- whether messages and _all are enabled
- Backing server capacity (SSD vs HD, CPU, etc.)
Your organization requirements / SLAs
- retention period of data
- ratio and quantity of index vs search
- nature of use case
- continuous vs bulk indexing
- kinds of queries being executed
- desired response time for queries that are run frequently vs occasionally
- required sustained vs peak indexing rate
- budget and failure tolerance
Let's try to determine
- How much disk storage will N documents require
- When is a single shard too big for my requirements
-- When will that shard reach a point where the search queries will no
longer be within SLA
- How many active shards saturate my particular hardware
-- Keep putting shards on a node until performance degrades, then back it
down
- How many shards/nodes will I need to sustain X index rate and Y search
rate
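Those questions reduce to simple arithmetic once the limits are measured. A back-of-the-envelope sketch, where the per-shard size limit and shards-per-node saturation point are assumed to come from the experiments in the methodology that follows (the numbers in the test are placeholders, not recommendations):

```python
import math

def estimate_cluster(total_gb, max_shard_gb, shards_per_node, replicas=1):
    """Returns (primary shards, total shards, nodes) needed for a dataset
    of total_gb, given a measured per-shard limit and per-node capacity."""
    primaries = math.ceil(total_gb / max_shard_gb)
    total_shards = primaries * (1 + replicas)   # replicas also occupy nodes
    nodes = math.ceil(total_shards / shards_per_node)
    return primaries, total_shards, nodes
```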
2. Sizing methodology
- run 4 experiments
- 1. Determine various disk utilization
-- Use a single node cluster with one index
-- 1 primary, 0 replica
-- Index a decent amount of data (1GB or about 10 million docs)
-- Calculate storage on disk
--- Mostly for logging use case, before and after _forcemerge
-- Repeat the above calculations with different mapping configurations
--- _all both enabled and disabled
--- settings for each field
- 2. Determine breaking point of a shard
-- Use a single node cluster with one index: 1 primary, 0 replica
-- Index realistic data and use realistic queries
-- Plot index speed and query response time
--- Build some dashboards. Latency increases with shard size
-- Determine where point of diminishing returns is for your requirements
- 3. Determine saturation point of a node
-- Use a single node cluster with one index
--- 2 primary, 0 replica
--- repeat experiment two to see how performance varies
--- keep adding more shards to see when point of diminishing returns occurs
- 4. Test desired configuration on small cluster
-- Configure small representative cluster
-- Add representative data volume
-- Run realistic benchmarks:
--- Max indexing rate
--- Querying across varying data volumes
--- Benchmark concurrent querying and indexing at various levels
-- Measure resource usage, overall docs, disk usage, etc.
3. Scenario and experiment results
- Practical Sizing Example
-- show some patterns, draw some conclusions about how it works
-- no numbers, all your numbers will be different anyways
- Graph latency, record count, shard size
-- Latency increases with record count proportionally
- ... lots of stuff about log ingestion use case ...
4. Interpreting results and expanding to other scenarios