Lucene @ Yelp

Page 1

Lucene @ Yelp Sudarshan Gaikaiwari


Bio 1. Over a decade of experience in information retrieval 2. Used IR techniques at Symantec's DLP group 3. Search Engineer at Yelp


Outline 1. Overview of search services at Yelp 2. Federation Motivation 3. Lucy Indexing 4. Lucy Searching 5. Efficiently Retrieving top k hits


The services we provide


Lucy: business search


Lucy also powers phone search


Cathy: she 'talks' a lot


Listsearch: it searches lists....


Reviewsearch: it searches reviews....


DYM: did you really mean that?


Suggest: auto completion


Federation Motivation


Problem

Search is too slow


Hard Disk Seek Latency Disk seek 10,000,000 ns

Source Software Engineering Advice from Building Large-Scale Distributed Systems Jeffery Dean


RAM read latency Main memory reference 100 ns


Pinning Index in RAM ● vmtouch ● mlock ● http://hoytech.com/vmtouch/


Problem Index is too large fit in memory on a single machine


Geographical sharding


Geographical Sharding drawbacks 1. Cumbersome manual process to determine shard boundary 2. No guarantee that a boundary can be found.


Federation 1. Split index across multiple machines 2. Shard on business id 3. TF-IDF scores from different machines should be comparable


Mapping businesses to shards 1. Assigning businesses to shards shard = shardlist[hash(business_id) % len(shardlist)] Problems 1. Involves re-indexing all the businesses if we want to add a new shard


Virtual Nodes


Advantages 1. Flexibility (move vbuckets from one shard to another) 2. Split hot spot shards


Lucy Master Slave Architecture Separate indexing (masters) A master for each shard of a service Searching (slaves) A slave for every replica of a service


Lucy Indexing






Lucy Searching



Federator: Combining results across shards 1. Once we distribute an index across shards we need a component which will search all these shards and combine their results. 2. Written in Python (runs inside a python web process). 3. Uses Tornado IO loop to send requests to all shards. 4. The transfer protocol for the requests in JSON RPC


Lucy Server




Tokens to Business Attributes


Executing queries 1. Gather the top results for a query 2. Collect attribute statitics for attributes like places, categories


Lucene 1. Efficiently executes queries over the index 2. Provides how relevant the business is to the words in the query (word score) 3. Upgrading lucene to 2.9/3.1 is WIP



Successive geobounds relaxation


Successive geobounds relaxation


Federation


Efficiently Retrieving top k hits 1. When user moves through multiple pages the number of hits to be returned increases num hits = start + count 2. So if we need to retrieve 500 hits the naive way would be to retrieve 500 hits from each shard and then sort them


Distribution of hits in shards



Probability a hit is in a shard


Binomial Distribution Probability (r of top k hits) are in a particular shard

Mean

Variance


Formula Std Deviation

Formula


Simulation

Formula

Hits selected from each shard k = 100 p = 0.2

Results Missed (%)

24

0.017

32

0.0001407

44

0.00000


Simulation Graph


Results 1. ~ 50% savings over 100 hits (44 hits requested from each shard) 2. 77% savings over 1000 hits (228 hits requested from each shard)


Future work 1. In memory index 2. Move towards real time search


Come Join Us!


Thank You

smg@yelp.com


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.