wow! thank you for the detailed notes :-)

also, I see that https://www.elastic.co/elasticon/videos has some videos, though not sure how complete those are.

Cheers,
Katie

On Tue, Feb 23, 2016 at 7:13 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
These are my notes, pretty much completely raw, from elasticon last week.

----------------------------------
Scaling 5 nodes -> 1,000
----------------------------------
significant terms query (expensive) may help with trending articles?
preferred log ingestion path: app -> kafka -> logstash -> elasticsearch
- es has features (node affinity) to have high power "hot" nodes and
  lower power / spinning disk "warm" or "cold" nodes and ensure indices
  end up on the right machines.
- **run master nodes on same physical hardware, separate JVM**
-- dedicate cores to master node
new node type: ingest
- specialized for indexing
- most common logstash filters implemented in library, used in elasticsearch
- can enrich data on indexing
- makes logstash not necessary for some workloads
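
  As a rough sketch of what that enrichment looks like with the ingest API that ships in ES 5.x (pipeline id, field names and the sample document below are made up):

  PUT _ingest/pipeline/access-log
  {
    "description": "parse raw log lines at index time, no logstash needed",
    "processors": [
      { "grok":   { "field": "message", "patterns": ["%{COMMONAPACHELOG}"] } },
      { "remove": { "field": "message" } }
    ]
  }

  PUT logs-2016.02.23/log/1?pipeline=access-log
  { "message": "127.0.0.1 - - [23/Feb/2016:19:13:00 -0800] \"GET / HTTP/1.1\" 200 1043" }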

10 nodes (128G memory, 32 threads, 1TB SSD) can handle 150k+ index operations per second for logging
**make clusters a first class citizen in your application**
**multi-tenant scaling -> multiple clusters**
new features: cluster state diffs -> allows more machines/indices/shards per cluster
- may help with our master timeout issues, but I talked to some core devs and they weren't really sure. They preferred the idea of splitting into multiple independent clusters for the multi-tenant use case
- **more indices does matter for master complexity. All about the total number of shards**
types are just a filter on special _type column
new feature: columnar data store (doc values) on by default in 2.0
new software: beats
- is about data collection and shipping
- may take on similar role as diamond, shipping to elasticsearch or logstash
new goal: "release bonanza" release multiple products on same date to simplify compatibility between products
- es will go v2 -> v5, skipping between, to match other product versioning
new tool for extracting graphs from elasticsearch data
- can find strongly connected nodes, etc.
- threat detection, possibly other interesting things.


----------
Lucene
----------

segment
- inverted index
- norms
- document store
- column store in v4
- k-d tree - faster structured search in v6
-- replaces inverted index in certain limited circumstances
-- old structured search is based on cells, optimizations are based around visiting as few cells as possible
-- cells can have different sizes, 2x2, 3x3, (always square?) etc.
-- k/d replacement - recursively  splits the search space in two aiming for
   equal number of points on each side until each fraction of the search space has
   one point in it. 
-- works similar to traditional databases with binary search trees
-- elasticsearch will stop at 1000 docs per side rather than just 1
-- it's a memory trade-off, the tree needs to be kept in memory at all times
-- Mike McCandless is a primary author
-- Better in almost all aspects for spatial
-- ElasticSearch 2.2 - geopoint searches
-- for 1d numeric fields merge process is much faster, segments are presorted so it is only the final stage of a merge sort performed.

BM25 by default in ES v5.0
--
- this is massively oversimplified, please read the code and the commit logs
- common words are naturally discriminated
- no more need to exclude stop words.
-- doc freq contribution in bm25 goes all the way to 0, rather than 1.0 in tf-idf
- term freq contribution
-- bm25 saturates earlier. bounded at 2.2 by default

Faster queries - two phase iteration
- query can be divided into
-- a fast approximation
-- a (slower) match
- phrase
- approximation = conjunction
-- match = position check
- Geo polygon query
-- approximation = points in cells that are in or cross the polygon
-- match = check the point against the polygon

Match cost API
1. approximate - (description:search AND description:engine) AND (body:postings AND body:list)
** 1. iterate body:postings (1k)
** 2. check description:engine (15k)
** 3. check description:search (200k)
** 4. check body:list (370k)
2. match - description: "search engine" body: "postings list"

Other changes
* better query-time synonyms
- disk based norms
- more memory efficient doc values
- more disk efficient sparse doc values
-- lucene 5.4
- BooleanQuery simplification in rewrite
-- for instance imagine you are searching a matching docs query 
-- filter  (constant score query)
-- this is going to perform a conjunction (all docs that match both queries)
-- lucene 5.5
- bulk scorer specialization for MatchAllDocsQuery and MUST_NOT clauses
-- matching docs iterator must return sorted by score
-- re balancing prioritizations was causing slowdowns
-- instead of doing by document, now doing it in batches of 2048
-- a NOT b
- improve file truncation detection
-- leading cause of index corruption
-- checksums (added in 4.8) notice the error
-- advances in 5.x add nice error message

Also, SVN -> git


Scaling log ingestion at ????

10-12 nodes, old DB servers. Lots of memory, disk, fast CPUs

- number of indexes doesn't matter, it's about the number of shards
- stores 13 months worth of data

Initial config
- two JVMs per server. One "hot" ingest instance pointed at fast disks, one "warm" instance pointed at slower spinning disks for querying historical data
- 384G ram. 60TB of disk, 96 hardware threads, etc.

(initially) The bad:
- inefficient use of hardware
- limited number of nodes
- larger amounts of data to replicate on node failure

Thoughts:
- run all independent instances? Run many instances on one node?
- hardware overpowered for data nodes

(reworked cluster config) The good:
- Run elasticsearch instances completely independent. Separate yml config,
  separate data directories, etc.
- One hot ingest instance, 4+ warm instances
-- uses ~50% of memory for JVM
-- no matter how big your box is, keep jvm heap < 50%
-- at 5 to 7 instances on 48-core machines (96 with HT): running 5 instances does not need CPU affinity, running 15 does (but 15 is a bad idea)
- increase nodes
-- more resources for queries
-- better distribution of the larger number of shards
-- more shards supported, thus more indexes, thus more data available online
- faster replication on node failure due to smaller data sets

- increased peak query load possible by 10x (probably due to more jvm heap? didn't say)
- # of shards per instance affects scaling, query patterns
- requires creative allocation rules to keep primary and replica off of the same physical machine
-- quite a few. Requires many additional tags as well.
- get much more out of the hardware with smaller (~30G) heaps and more JVM instances


The Bad: What did it cost us?
- more complicated cluster management
-- need to address each instance for all management
-- need to setup proper allocation rules to protect data from physical node crash
--- originally intended to define racks. Allows separating primary/replicas by rack
--- reuse this to support many independent instances per machine
- Non standard configuration means more difficult to get assistance from public resources and documentation
-- (paid) Support is your friend here.
- physical node loss means greater cluster impact
-- When a physical node goes down 40T+ of data needs to re-replicate
-- this takes about an hour in their network/disk configuration
- higher cost of hardware for data nodes
- more power and environment controls necessary

The End: well not really we are still growing
- capable of processing 5.75 billion events a day
- peak is 2x of trough (as high as 10x)
- 24-30 physical data nodes
- 5 instances (1 hot, 4 warm), lots of storage, fast CPUs, lots of memory
- resilient
-- can handle hardware failure and/or node crashes
-- can handle maintenance/upgrades
- small footprint in DC


Questions?

- tested JVM heaps down to 24G, was not enough for ES caching. Optimal is just under 32G in his instances
- main difference between hot and warm instances is how expensive the disk is
- running es 1.4.4, testing 2.2 but have not finished converting everything
- nodes are not separated on the boxes (VM, docker, etc), one kernel

Relevant config variables:
node.name - to identify the specific instance. For consistency suggest naming the process, configuration files and node all the same
node.server - to identify the physical server
node.tag - optional - to be able to move shards around
path.conf - to identify the unique configuration directory
path.data 
path.logs
path.work
cluster.routing.allocation.same_shard.host: true - enable check to prevent allocation of multiple instances of the same shard on a single host
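
For illustration, a hypothetical elasticsearch.yml for one of several instances sharing a physical host, wiring the variables above together (all names and paths are made up):

  node.name: es1001-warm-2
  node.server: es1001
  node.tag: warm
  path.data: /srv/elasticsearch/warm-2/data
  path.logs: /var/log/elasticsearch/warm-2
  cluster.routing.allocation.same_shard.host: true

path.conf maps to the startup flag, e.g. bin/elasticsearch -Des.path.conf=/etc/elasticsearch/warm-2 (1.x syntax), so each instance reads its own configuration directory.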


--------------------------------------------------------------------
Stories from Support: Top Problems and Solutions
--------------------------------------------------------------------
"It Depends"

The Big Issues
1. Fielddata (always single word)
- most common problem
- 60% of total heap can be taken up in 1.x and 2.x
- more likely to be used in 1.x
- 100% of data nodes will be impacted
- 2.x (Mostly) Solves This - Thanks to doc_values being on by default
-- Wouldn't help for wikimedia use case, we have very few aggregations that would use doc values. We also have relatively small fielddata though

Doc Values!
- Columnar store of values
- Written at index time to disk
-- Adds an extra file next to the segment on disk
-- takes more indexing time
-- Means you don't have to un-invert the inverted index at query time
-- Part of the reason JVM should not take more than 50% of memory
- Leverages the operating system's filesystem cache
- When upgrading from 1.x to 2.x you have to reindex to get the change
-- You cannot enable doc values after the fact.
****
- GET /_cat/fielddata?v is your friend
-- Couple hundred megabytes per node.  Basically nothing.
- Mostly for aggregates (as in, not useful at wikimedia)
****
- analyzed strings do not currently support doc_values, which means that you must avoid using such fields for sorting, aggregating and scripting
- analyzed strings are generally tokenized into multiple terms, which means that there is an array of values
- with few exceptions (e.g., significant terms), aggregating against analyzed strings is not doing what you want
- unless you want the individual tokens, scripting is largely not useful
- Big improvements coming in ES 2.3 ("keyword" field)
-- Redo not analyzed strings to allow them to use analyzers that produce single tokens
- use multi fields to create analyzed and un-analyzed fields
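
  A minimal sketch of that multi-field pattern in 1.x/2.x string syntax (index, type and field names are hypothetical):

  PUT /myindex/_mapping/page
  {
    "properties": {
      "title": {
        "type": "string",
        "fields": {
          "raw": { "type": "string", "index": "not_analyzed", "doc_values": true }
        }
      }
    }
  }

  Search against title, sort and aggregate against title.raw.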

2. Cluster state
- every cluster state change  is sent to every node
-- requires a lot of short lived, potentially large network messages
-- gets worse with more nodes or indices
-- mappings tend to be the largest portion
- GET /_cluster/state?pretty
- Not stored in memory as JSON, so this is just to give the idea (it's likely 5% of it ballpark)
-- elastic1001 reports JSON as 25MB. 5% would be 1.25MB (tiny)
-- a Gig would be devastating for any cluster size
-- 100M for a 100 node cluster would be "reasonable"
-- 100M for a 10 node cluster would be odd
-- Worst seen (by presenter) was 4G

ES 2.0 introduces cluster state diffs between nodes
-- changes become far more manageable and a large cluster state is no longer as problematic
-- reducing your mapping size helps too
-- Do not allow dynamic mappings in production
- Do not use _types to separate data
-- duplicates the mapping if they are the same
-- if they are different, the combined mapping grows unnecessarily
-- prefer to create separate indices
- Create a "type" field to do this for you (sketch after this list)
- Prefer changes in bulk rather than one-by-one (allow changes to be batched)
** Create 900 new titlesuggest indices in one cluster operation?**
-- there is no recommended size. Maybe 5MB payload size? Play around with it
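
  Sketch of the "type" field approach mentioned above (2.x syntax; index, type and field names are hypothetical):

  PUT /logs/event/1
  { "type": "apache", "message": "GET /index.html 200" }

  GET /logs/event/_search
  {
    "query": { "bool": { "filter": { "term": { "type": "apache" } } } }
  }

  One index, one _type, and the "type" field does the separation that multiple _types would otherwise do.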


3. Recovery
- Changed dramatically in ES since 1.4
- restarting a node or otherwise needing to replicate shards
- terribly slow process
- segment by segment
- minor risk of corruption pre-es 1.5 with a sketchy network
-- if 90% done and network drops there could be issues
-- don't use 1.5

1.6 -> asynchronous allocation and synced flushing
1.7 -> delayed allocation and prioritized allocation
2.0 -> Cancelled allocation
2.1 -> prioritized allocation for replicas

If you are not on 1.7.5 get there SOON.


4. Node Sizing

- elasticsearch is a parallel processing machine
- java can be a slow garbage collecting calculator
- slow disks. The problem for every data store?
-- Do not use SAN. Really. Don't do it
- A few huge instances isn't best, keep JVMs <= 30G

And how long is a piece of string?

Memory
- 50% of system ram to heap
- Up to 30500M - no more or your heap loses optimizations!
-- compressed pointers
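
  On the packaged 1.x/2.x installs that usually means something like the following (exact path depends on the distribution):

  # /etc/default/elasticsearch (Debian) or /etc/sysconfig/elasticsearch (RPM)
  ES_HEAP_SIZE=30500m

  which keeps the JVM just under the compressed-pointers threshold mentioned above.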

CPU
- indexing tends to be CPU bound
- At least 2 cores per instance
-- more you give it the faster it can go (duh)

IO
- disks get hammered for other reasons, including write-impacting
****
- translog in 2.0 fsyncs for every index operation
-- major major change in 2.0
****
- SSDs or Flash are always welcome

5. Sharding
- ES 1.x defaults to 5 primary, 1 replica
- ES 2.x defaults to 2 primary, 1 replica
**
- 50GB is MAX shard size. More for recovery than performance reasons
**
- Increase primaries for higher write throughput and to spread load
- Shard because you need write capacity.
- If you need better search response (latency) increase number of shards
- Replicas are not backups. Rarely see a benefit with more than 1
-- They are for HA, not backups
-- They do not save you if "everything goes to hell"
-- Snapshots are your friends
--- Incremental, next snapshot gets only segments that are "new"
-- Oversharding is becoming an issue
--- 41k shards in a 10 node cluster is extreme

Bad Queries
- Deep pagination
-- ES 2.0 has a soft limit of 10k hits per request. Linearly more expensive per shard
-- Use the scan and/or scroll API (example below)
- Leading wildcards
-- Equivalent to a full table scan (BAD!!)
- Scripting
-- Without parameters
-- dynamically (inline)
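
  A scroll sketch for the deep-pagination case above (2.x syntax, index name hypothetical):

  GET /myindex/_search?scroll=1m
  {
    "size": 1000,
    "sort": ["_doc"],
    "query": { "match_all": {} }
  }

  GET /_search/scroll
  { "scroll": "1m", "scroll_id": "<scroll_id from the previous response>" }

  Repeat the second call until it returns no hits.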

Aggregations
- Search is always faster than aggregations
- Don't aggregate when you want search

Merge throttling
- Disable it on SSDs

Use bulk processing
- Indexing documents 1 by 1 you will have a bad day (if you have many things to index)
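
  For reference, the bulk API body is newline-delimited action/source pairs (names hypothetical):

  POST /_bulk
  { "index": { "_index": "myindex", "_type": "page", "_id": "1" } }
  { "title": "first document" }
  { "index": { "_index": "myindex", "_type": "page", "_id": "2" } }
  { "title": "second document" }

  A few MB per request is a common starting point; measure and tune from there.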

-------------------------------
Securing Elasticsearch
-------------------------------

Pitfalls securing search - or why REST based security solutions rarely work

Why would you?
- Most of the clients are using HTTP protocol
- Easy to implement using standard HTTP reverse proxy
- Many elasticsearch operations are REST compliant
-- PUT /index/type/doc_id  - indexes a document
-- GET /index/type/doc_id - gets a document
- Reverse proxy isn't enough, you also need firewalls
-- (I don't think that's a big idea, at all. Use SSL)
-- must establish security perimeter

What does it get you?
- Works great for authentication. Access/No access to everything
- Fails for authorization
-- Role based access to certain indices, certain parts of indices, certain fields

Limiting access to an index by using URL matching
- Theoretically easy! Just setup the following URL matching patterns:
-- GET /myindex/*/*
-- PUT /myindex/*/*
-- POST /myindex/*/*

- But what about?
- POST /myindex/type/_bulk
-  /myindex1,myindex2/*/*
- Could disable multi-index rest actions
-- rest.action.multi.allow_multi.??? = none

Search is not always limited to the index on the url
Cross index search operation:
- More like this query
- Terms lookup mechanism in the terms query
- percolation
- you end up needing to whitelist/blacklist query patterns

Be really careful with URL matching rules!
- Documented exceptions on the elasticsearch website

Document level security
- A typical solution for document level access control
-- Add an ACL token(s) to each document
-- Create an alias with ACL filters for each user (sketch below)
-- Limit users access to their aliases
- Works *most* of the time
-- Aliases are not considered in
--- Suggest API
--- Children Aggs
If you are trying to use aliases as a security solution, augment it with respect to query types
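
  Sketch of the filtered-alias pattern described above (index, alias, field and user names are hypothetical) - and per the next point, treat it as a convenience, not a security boundary:

  POST /_aliases
  {
    "actions": [
      { "add": {
          "index": "documents",
          "alias": "documents-alice",
          "filter": { "term": { "acl": "alice" } }
      } }
    ]
  }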

Filtered aliases are not a security mechanism
- More like working together with other systems

Field level security
- A naive solution - filtering of source in response
-- /_search?_source_exclude=secret_part.*
-- This is not enough. If you can search it you have access to it. There is no exception to this.
- A field in the source can appear in multiple ways
-- multi match, aggregations, etc.
-- Requires basically whitelisting queries

Securing search is a difficult problem
- URL path is not a filter
- Some operations are working on entire index
- It's a search and analytics engine, which makes frequency analysis attacks so much easier
-- Lots of nice statistics like term frequency, etc.

Is there a solution?
- Move all security to business logic layer. Deny raw access
- Move security to transport layer
-- This is what shield (now xpack) does
-- They believe that securing at the lowest level is the only way to secure elasticsearch
- Hard to know from the REST api what will happen with a given request
- Authorization is secured at the individual TransportAction level
-- Prevents the user from accessing a terms lookup (for example) on an index they don't have access to
- TLS encryption for inter-node transport and http
- Message signing means even without TLS a node cannot join the cluster
- Secured down to the lucene level - integrating with how elasticsearch reads the index
- In a request with document level security the query defined on a role will get added to
  the index reader. Documents that aren't allowed look deleted to the IndexReader
- Field level security is similar. At the lucene level fields look like they don't exist

------------------------------------------------------------------------------------------------
Boosting Search Accuracy with Engine Scoring and Predictive Analytics
------------------------------------------------------------------------------------------------
Paul Nelson
Chief Architect, Search Technologies

(note to self: This will probably be advertising to hire them as consultants)
(talked to paul after talk, he had previously talked with nik and is interested in working with us. I think what he really wants is to use our click through data to improve his own business case though. Tomasz will be in contact and see if we can work together).

- Speaker has been involved in this since mid 90s
-- NIST
- Many of these ideas came from TREC, updated with big data and logs

- Common customer complaint 
-- Q: Whats wrong?
-- A: Our search is bad
-- Q: How bad?
-- A: Bad
-- Q: Scale 1-10?
-- A: 8? 9? Let's call it 9.23
-- (completely bogus, means nothing without statistically based numbers)

- Golden query set
-- Key documents
-- Just because it's called golden doesn't mean its any good
- Top 100 / Top 1000 queries analysis
-- May not represent actual usage of your system
-- Taking a random selection can be more useful. For example, for one customer 40% of queries were human names; that large variance didn't show up in the top 1000 queries.
-- The long tail could be 85%+ of queries
- Zero result queries
- Abandonment rate
- Queries with click
- Conversion

These all have a problem: You need to put your search engine into production to compute them which hurts your bottom line.
- (if you are ecommerce or advertising)

What are we trying to achieve?
* Reliable metrics for search accuracy
-- repeatable
* Can run analysis off-line
-- Does not require production deployment (!)
* Can accurately compare two engines
-- Different technologies
-- Different versions of same engine
-- Different scoring on same engine
* Runs quickly = agility = high quality
-- Iterating engine quickly is key
* Can handle different user types / personalization
-- Broad coverage
-- Modify relevancy scoring based on source website, "cluster" user is in
* Provides lots of data to analyze what's going on
-- Data to decide how best to improve the engine

Leverage logs for accuracy testing
Query logs, Click Logs -> Engine scoring framework -> Search engine under evaluation
* engine scores - have one (or two, or three) top level numbers that give your engine a score for the query set
* other metrics and histograms - generate useful metrics, such as a histogram of where recorded click-throughs fell in the result positions
* scoring database - Record historical scores, and what changes led to those scores. Re-review prior changes after making new changes. They might not be as useful any more!

From Queries -> Users
* User by User metrics
-- change in focus
-- Queries don't matter. People matter. It does *not* matter if a query was the right answer. It matters if the user got their right answer (maybe two queries? Not the end of the world)
* Group activity by session and/or user
-- call this an "activity set"
-- Merge session and users
* Use Big Data to analyze *ALL* users
-- There are no stupid queries and no stupid users
-- Overall performance based on the experience of the users

Engine Score
- Group activity by session and/or user (Queries and Clicks)
-- Trying to create a model for the user. Everything the user has found useful
Determine "relevant" documents
-- Lots of people don't have user ids, only sessions
-- What did the user view? Add to cart? Purchase?
-- Did the search engine return what the user ultimately wanted?

* Determine engine score per user per query
-- Σ power(FACTOR, position) * isRelevant[user, searchResult[Q,position].DocID]
-- Evaluated for each user's point of view
-- (Note: Many other formula possible, MRR, MAP, DCG, etc.)
-- Speaker happens to like this one. Some algorithms prefer top results, some prefer results "on page" (see the worked example after this list)
* Average score for all user queries = user scores
* Average scores across all users = final engine score
-- Important to average per user (or per session) and not per query
-- Prevent single user with 99 queries from changing the score
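
Rough worked example of the formula above, assuming FACTOR = 0.5 and positions counted from 1 (the talk gave no concrete values): if the document a user wanted comes back at position 1 the query scores 0.5, at position 3 it scores 0.125, and if it never comes back it scores 0. Average each user's query scores into a user score, then average the user scores into the engine score, so a user with 99 queries counts the same as a user with one.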

Off Line engine Analysis
- searchResult[Q,position].DocID
-- Done by re-executing the users' queries offline

Continuous improvement cycle
Modify engine -> execute queries -> compute engine score -> evaluate results -> back to beginning
* Kind of like sports, you can have a personal best engine score
* Scoring never stops, generating one score is not enough. Generate scores and track over time

What else can we do with engine scoring?
- predictive analytics
- The brutal truth about search engine scores
-- They are not based on science.
-- Random ad-hoc formula put together
- BM25 invented in 1980. Not state of the art. Predictive analytics is

Use Big Data to predict relevancy
Document Signals, Comparison Signals, Query Signals, User Signals

The score predicts probability of relevancy
Value from 0 -> 1
* Can be used for threshold processing
* All documents too weak? Try something else!
* Can combine results from different sources / constructions together
- Identifies what's important
-- Machine learning optimizes for parameters
-- identifies the impact and contribution of every parameter
- If a parameter does not improve relevancy -> REMOVE IT!!!

Come out of the darkness
- Ultimate message is use science


DEMO
- Note time based, just a list of queries and clicks
- Don't have to tie clicks to users

Q/A

- Single relevancy score not as useful - Build relevancy lab to have multi-dimensional scores

- tf/idf may be a signal, but the resulting formula usually replaces tf/idf completely
-- tf/idf only used to surface likely documents

-------------------------
NY Times Search
-------------------------

explicit search, formal user queries, used but not the main use case
implicit search powers much of nytimes.com

Ingestion
---------
Few updates - A few thousand updates per day
Sources - CMS, Legacy systems, file based archives, OCR
Latency - <1s, Low latency is important. Especially for implicit search.

Content typically doesn't show up on website until it's indexed

Typical Use Cases
-----------------
- not comprehensive
- People search for something they read. They read an article and want to read it again
-- Only 1 correct result.
-- Goal (not accomplished): recognize these queries and give a single, correct, result
- Find something you wrote
-- Less common, but happens surprisingly often
- Find reviews for movies/books
-- Typically has a single correct result
- Find recipes
-- Similar, but more than a single correct result
- "Manual sharing"
-- Instead of sending a link to someone you tell them you read an article on xyz and they search for it
- "Casual" news search
-- Looking for information on recent events
- Serious research
-- More interesting
-- Makes up a big amount of queries on the website
-- Typically journalists or historians

Why not just use google?
1. Keep the customers on site
-- Google risks the customer goes to a different site
2. There is no google for native apps
-- Native apps need to provide good search
-- Because they have multiple apps that do slightly different things they need special cases
3. We know our content better
-- Most important
-- Have editorial teams, taxonomy teams. They know which article is relevant for a specific case
-- Elasticsearch has allowed them to tune this exactly how they want

Search should be targeted at their most advanced readers
- A movie review can be found on google
- If you want to know the 10 most important articles on syria, NY Times can do that much better
-- Not there yet, lots of work still to do but that is the direction they are headed

Our current Setup
-----------------
- The search stack runs entirely on AWS
-- Case for NYTimes website
- Deployed using an in-house infrastructure tool
- Nagios and New Relic for monitoring
- Sumo Logic for log management and analytics

System Setup
- Two full production clusters
-- Not just for elasticsearch, but the whole stack (ingestion pipeline, etc)
- DNS based failover
- Also used for maintenance/major changes/reindexing
- 16 node ES clusters in production (m1.xlarge)

CMS->SQS->Handler->Normalization->Merge->Indexer->Elasticsearch->Search API<->Semantic platform
- Many intermediate layers read/write data from S3
- Beanstalk handles queueing (I think?). Replacing with Kafka


Normalization
- Large number of rules defining how to extract fields from different data sources
- Constructs canonical document which is same for all data sources
- Canonical document pushed into ES

Merge
- Ensures that only the latest version is indexed

Indexer
- Pushes documents into elasticsearch and mongodb

Search API
- powered by elasticsearch and mongodb
- For each search result looks up full document from mongodb and returns that to client

Two clusters makes maintenance work easy.
- The way the handler works makes it relatively easy to handle the disparate formats

Number of issues with this setup:
- Too many moving parts
- Too many things that can go wrong
- Difficult with this setup to index new fields
-- New fields are added quite frequently
-- Have to go back to normalization, add new fields, reindex everything
-- Many times requires going all the way back to CMS
-- Simple task that shouldn't be this hard
- Despite the fact they have two clusters they don't know if the offline cluster is still working
-- Cache goes cold
-- Have to warm up cache before putting online
-- Offline cluster doesn't receive production traffic
-- Only discover failure when they try to fail over
- Hard to run a complex setup like this on a laptop
-- Want to run entire pipeline locally
- Overall this is a setup so complicated that they spend a good amount of their time doing operations work instead of search
-- Seen this issue in many places

Future Work - Simplify
1. Full document in elasticsearch
- store and index full documents in elasticsearch
-- Not done previously for historical reasons of support for document sizes (alt search engine?)
- No external lookup necessary in the API, no MongoDB
- Demand driven API - clients decide what they want.
- Not in production yet, but some tests show this is better

2. Replay
- Use kafka
- All content will go through kafka
- Kafka will persist all content, source-of-truth for search
- Easy to create new search clusters through Kafka replay
- Goal: Never upgrade cluster. Boot new cluster, replay kafka to populate, switch over and kill old cluster

3. Keep all clusters busy
- All production traffic will be replayed from active production cluster to standby production and to staging
- Uses gor
- Makes sure that standby cache is always warm
- Changes to staging will be exposed to production traffic
- If standby cluster crashes they will know

4. Virtualize
- Vagrant boxes for everything
- Make it easy to run full pipeline locally.
- Provision vagrant boxes using same system as production servers
-- *exactly* the same. No differences between local and prod per-box configuration

CMS->Kafka->logstash->elasticsearch->Search API
- No S3. No beanstalk. Much simpler.

Conclusion and comments
1. Figure out how you can be better than Google
-- If you don't think you can be better, let google do it
-- Know your users
-- Search should be an integrated part of the product
--- Search should not be an afterthought
--- Plan it from the beginning
-- Be consistent across devices and platforms
-- Focus on where you can make a difference for the end user

2. Infrastructure is hard
Think about how to
- do deployments
-- Far too often people setup search clusters without planning how to deploy new versions
-- Know how you can easily upgrade
- do upgrades
- reindex content
- change the way you normalize data
- create new search clusters
... and make it easy.

Things to do
- Put your infrastructure in version control
-- Nothing should be done from command line to setup infrastructure
-- When everything crashes, and some day it will, it needs to be easy
- Vagrant everything
- Have immutable servers, replace instead of changing
-- Only really possible on cloud based, raw hardware requires different approach
- Make deploying a new stack easy and automatic

Is elasticsearch for text search?
- Background for this question is they attended a one day elastic conference in new york
-- None of the talks were about text search
-- This week is very similar
-- Looked over new features in ES2.0. Categorized into analytics/search/other stuff(mostly cluster)
-- Found 6 issues out of 119 dealing with search. 5%.
-- Skeptical that elastic wants to focus on text search as an important use case
-- Those features are not revolutionary in any way, just regular tweaks

New things necessary in elasticsearch for text search:
- Query log handling
-- Need to know what users are searching for and what they are doing
-- Completely missing from elasticsearch
-- Considers it a big problem
-- Elasticsearch has plenty of ways to analyze logs. But nothing that logs and ingests its own search history
- Easier relevancy tuning
-- Difficult to change things. Getting the right score for one document without changing all other searches
-- Needs tooling. Providing a set of benchmark queries
-- When you change things for one query, how does that affect other queries?
- Improved linguistics
-- No built in language detection
-- No morphologies - lemmatization
-- No compounding
-- No entity extraction
- Really wants to see some leadership from elastic on the text search space

Q/A:
Optimizations around analyzers?
- Not much. Most is done pre-indexing by adding keywords and custom scoring
Primary data store?
- Source of truth is custom CMS
Moving away from AWS?
- not trying to, just want to make sure it's possible (using kafka instead of SQS, etc.)
Follow up on latency, time from publish to shown on site?
- Less than 1s



-------------------------------------------------------
Geospatial data structures in elasticsearch and lucene
-------------------------------------------------------

Geo field types
- geo point
-- used to have 8 or 9 parameters, as of ES 2.2 only 3 parameters necessary
-- others still there, but deprecated and being removed
-- type, geohash_precision, geohash_prefix, ignore_malformed is all that is left
-- only type, ignore_malformed are being kept
- geo shapes
-- used to have 12 parameters, many expert only
-- going forward: type, ignore_malformed, tree, precision, distance_error_pct, orientation, points_only
-- orientation going away. The standards define it: right-hand rule.
-- ES 5.0 is ISO compliant, will reject non-compliant shapes
-- points_only allows short-circuiting the quad tree logic, optimizes for point-only types
--- points_only vs a points index is 10-15% better performance
--- geo shape only, no points_only
OGC simple feature access
ISO Geographic information - spatial schema 19107:2003
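
  A minimal mapping sketch using the two field types above (2.x syntax; index, type and field names are hypothetical):

  PUT /places
  {
    "mappings": {
      "poi": {
        "properties": {
          "location": { "type": "geo_point" },
          "boundary": { "type": "geo_shape", "precision": "10m" }
        }
      }
    }
  }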

Geo indexing
GeoPointField change in lucene 5.4
- create a quad tree encoding, first bit is world level lat, next bit world level longitude
-- rotates back and forth each two bits
-- Space filling curve, dimensional reduction
-- Their implementation is a Z-curve (Morton curve). Nearest neighbors are not necessarily so close.
-- investigating Hilbert curve. Current is 4 operations to encode, Hilbert takes longer. Tradeoff of accuracy/locality vs performance
-- precision step parameter affects size of index. More precision -> more terms
-- "optimal" precision step for smallest index possible. Currently 9

2.1 -> 2.3 large change in performance of geo. ~30% index size, 12% heap, 40% time (indexing or query?)

5.0 -> geo shapes and geo points merged as geo @experimental
- balanced k-d trees

Geo search

Geo Aggregations

---------
BM25
---------

A free text search is a very inaccurate description of our information need
What you want
- quick learner
- works hard
- reliable
- enduring
- ...

But you type: "hard-working, self-motivated, masochist"

The purpose of this talk
- know the monster, understand what the parameters of BM25 do
- know why it has the label "probabilistic"
- mathematical concepts behind it: more for your entertainment
- be convinced that switching to BM25 is the right thing to do
-- hardest part of this talk
- be able to impress people with your in-depth knowledge of probabilistic scoring

The current default - TF/IDF
- search in self-description of application for these words
- we want to order applications by their relevance to the query

Evidence for relevance - term frequencies
- Use term frequencies in description, title, etc.

Major tweaks:
- term freq: more is better
-- the more often it occurs the more relevant this document is
-- square root of the term frequency. Sum it up
-- increases non linearly
- inverse document frequency
-- common words are less important
-- a, is, the, etc.
-- also non-linear: multiply tf with idf and sum that
- long documents with same tf are less important: norm
-- also non-linear, idf is more L shaped than this though

Bool query and the coord-factor
- doc a: holiday: 4, china: 5
- doc b: holiday: 0 china: 15
- Coord factor: reward document a with both terms matched
-- basically a hack

TF/IDF
- successful since the beginning of lucene, since first release
- well studied
- easy to understand
- one size fits most

So...can we do better?
- It is somewhat a guess
- sum of the square root of term frequencies, which was a guess of an original developer
-- ad hoc

Probabilistic ranking and how it led to BM25
The root of BM25: probability ranking principle (abridged)
-"if retrieved documents are ordered by decreasing probability of relevance on the data available, then the system's effectiveness is the best that can be obtained for the data."
- Chance to deploy prior knowledge of relevance and scoring to make results better
- if mapped to a mathematical framework, can take advantage of mathematicians' past proofs

Getting into math:
1. Don't care how much heap, how many CPUs. Get to that later, forget about it now
2. Think of yourself as super smart, beyond 2 standard deviations

Estimate relevancy:
- simplification: relevance is binary!
- get a dataset queries - relevant / irrelevant documents
-- use that to estimate relevancy
- how to use this to actually get a better scoring?
- P(R=1|d,q)
-- P(A|B) = probability of A given B, R = relevancy (1/0), d = document
- For each document, query pair - what is the probability that the document is relevant
-- Need some sort of matrix, document vs query
...here be math...
- paper: The Probabilistic Relevance Framework: BM25 and Beyond
- Stephen Robertson and Hugo Zaragoza
..and we get to..
- W(d) = sum log( P(..)P(..) / P(..)P(..) )

BM25: how to estimate all these probabilities
- two assumptions
-- the binary independence model - a dramatic but useful simplification
-- query term occurs in a document or doesn't, we don't care how often

- Robertson/Sparck Jones weight
-- another equation
-- assumes unlimited supply of interns
- simplified
-- still use robertson/sparck jones weight but assume that the number of relevant documents is negligible (R=0,r=0)

- BM25 has S shape, goes to zero. tf/idf is L shaped

- In TF/IDF the more often the term occurs the better
- but, is a document about a term just because it occurs a certain number of times?
- this property is called eliteness
-- in the literature

Example for "eliteness"
- "tourism"
- Look at wikipedia: Many documents are about tourism
- Many documents contain the word tourism - but are about something completely different, like for example just a country

Can we use prior knowledge on the distribution to ...

Eliteness as Poisson Distribution
Two cases:
- document is not about the term
- document is about the term

How to estimate this?
- gather data on eliteness for the term
- many term frequencies -> do for many documents

How relevance ties into that
- suppose we knew the relationship of  ???

....

Tf saturation curve
- limits influence of tf
- allows to tune influence by tweaking k
- major difference from tf to bm25

So...we assume all documents have same length?
- poisson distribution: assumes a fixed length of documents
- but they don't have that (most of the time)
- we have to incorporate this too!
- scale tf by it like so: (long equation)
-- includes interpolation between 1 and document length over average document length
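
For reference, the standard BM25 term-frequency/length scaling (the slide's exact notation may have differed):

  score(q,d) = sum over query terms t of:
    idf(t) * tf(t,d) * (k1 + 1) / (tf(t,d) + k1 * (1 - b + b * |d| / avgdl))

where k1 controls the saturation (the 2.2 bound mentioned earlier is k1 + 1 with the default k1 = 1.2) and b controls how much document length matters.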

Influence of b: length weighting
- tweak influence of document length
- similar to tf/idf, depending on b value

Is BM25 probabilistic?
- many approximations
- really hard to get the probabilities right even with unlimited data
-- mathematicians have been trying to solve this for years, no success
- BM25 is "inspired" by probabilistic ranking


-- 1992 TREC-2 took a "leap of faith" (?)
-- 1993 TREC-3 BM25 final!
-- 1999 First lucene release (tf/idf)
-- 2011 Pluggable similarities + BM25 in lucene

So..will I get better scoring with BM25?

Pros with the frequency cutoff:
- TF/IDF: common words can still influence score
- BM25: limits influence of term frequency
-- less influence of common terms
-- no more coord factor!
-- check if you should disable coord for bool queries?
--- index.similarity.default.type: BM25

Other benefits:

- parameters can be tweaked. Not sure if many people actually need to. You need data to measure if scoring is getting better or not. Do not tweak without data
To update:
- close index
- update mapping (or settings)
- re-open index
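
  A sketch of that sequence (index name hypothetical; 1.2 and 0.75 are the usual BM25 defaults):

  POST /myindex/_close

  PUT /myindex/_settings
  {
    "index": {
      "similarity": {
        "default": { "type": "BM25", "k1": 1.2, "b": 0.75 }
      }
    }
  }

  POST /myindex/_open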

Mathematical framework to include non-textual features
- go read the paper

A warning: lower automatic boost for short fields
- With TF/IDF: short fields (title,...) are automatically scored higher
- BM25: Scales field length with average

-- field length treatment does not automatically boost short fields (you have to explicitly boost)

Is BM25 better?
- literature suggests so
- challenges suggests so (TREC, ...)
- Users say so
- Lucene developers say so
- Konrad Beiske says so: "BM25 vs Lucene default similarity"
- But: It depends on the features of your corpus.
- Finally: You can try it out now! Lucene stores everything necessary already

Useful literature:
- Manning et al., Introduction to Information Retrieval
- Robertson and Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond
- Robertson et al, Okapi at TREC-3

----------------------------
Quantitative Cluster Sizing
----------------------------

1. Understanding why "it depends"
Factors
- Size of shards
- Number of shards on each node
- Size of each document
- Mapping configuration
-- which fields are searchable
-- automatic multi-fields
-- whether messages and _all are enabled
- Backing server capacity (SSD vs HD, CPU, etc.)

Your organization requirements / SLAs
- retention period of data
- ratio and quantity of index vs search
- nature of use case
- continuous vs bulk indexing
- kinds of queries being executed
- desired response time for queries that are run frequently vs occasionally
- required sustained vs peak indexing rate
- budget and failure tolerance

Let's try to determine
- How much disk storage will N documents require
- When is a single shard too big for my requirements
-- When will that shard reach a point where the search queries will no longer be within SLA
- How many active shards saturate my particular hardware
-- Keep putting shards on a node? Back it down
- How many shard/nodes will i need to sustain X index rate and Y search response

2. Sizing methodology
- run 4 experiments
- 1. Determine various disk utilization
-- Use a single node cluster with one index
-- 1 primary, 0 replica
-- Index a decent amount of data (1GB or about 10 million docs)
-- Calculate storage on disk
--- Mostly for logging use case, before and after _forcemerge
-- Repeat the above calculations with different mapping configurations
--- _all both enabled and disabled
--- settings for each field
- 2. Determine breaking point of a shard
-- Use a single node cluster with one index: 1 primary, 0 replica
-- Index realistic data and use realistic queries
-- Plot index speed and query response time
--- Build some dashboards. Latency increases with shard size
-- Determine where point of diminishing returns is for your requirements
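
  The single-shard, zero-replica test index used in experiments 1 and 2 might be created like this (index name hypothetical):

  PUT /sizing_test
  {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }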

- 3. Determine saturation point of a node
-- Use a single node cluster with one index
--- 2 primary, 0 replica
--- repeat experiment two to see how performance varies
--- keep adding more shards to see when point of diminishing returns occurs

- 4. Test desired configuration on small cluster
-- Configure small representative cluster
-- Add representative data volume
-- Run realistic benchmarks:
--- Max indexing rate
--- Querying across varying data volumes
--- Benchmark concurrent querying and indexing at various levels
-- Measure resource usage, overall docs, disk usage, etc.

3. Scenario and experiment results
- Practical Sizing Example
-- show some patterns, draw some conclusions about how it works
-- no numbers, all your numbers will be different anyways

- Graph latency, record count, shard size
-- Latency increases with record count proportionally

- ... lots of stuff about log ingestion use case ...

4. Interpreting results and expanding to other scenarios

...

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery




--
@wikidata