Recruiting Solutions 1 AsifDaniel Structure, Personalization, Scale: A Deep Dive into LinkedIn Search.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Suleyman Cetintas 1, Monica Rogati 2, Luo Si 1, Yi Fang 1 Identifying Similar People in Professional Social Networks with Discriminative Probabilistic.
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Recruiting Solutions 1 Daniel My Three Ex’s: A Data Science Approach for Applied Machine Learning.
Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.
Company confidential Prepared by HERE Transit Sr. Product Manager, HERE Transit Product Overview David Volpe.
Search Engines. 2 What Are They?  Four Components  A database of references to webpages  An indexing robot that crawls the WWW  An interface  Enables.
Information Retrieval in Practice
Search Engines and Information Retrieval
Personalizing Search via Automated Analysis of Interests and Activities Jaime Teevan Susan T.Dumains Eric Horvitz MIT,CSAILMicrosoft Researcher Microsoft.
1 The Four Dimensions of Search Engine Quality Jan Pedersen Chief Scientist, Yahoo! Search 19 September 2005.
Web Projections Learning from Contextual Subgraphs of the Web Jure Leskovec, CMU Susan Dumais, MSR Eric Horvitz, MSR.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
A Search-based Method for Forecasting Ad Impression in Contextual Advertising Defense.
Graph databases …the other end of the NoSQL spectrum. Material taken from NoSQL Distilled and Seven Databases in Seven Weeks.
Overview of Search Engines
An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Search Engines and Information Retrieval Chapter 1.
Improving the Catalogue Interface using Endeca Tito Sierra NCSU Libraries.
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
DBease: Making Databases User-Friendly and Easily Accessible Guoliang Li, Ju Fan, Hao Wu, Jiannan Wang, Jianhua Feng Database Group, Department of Computer.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Kelly Boccia Abi Natarajan Konstantin Livitski Senthil Anand Subbanan Meyyappan 1.
SUMMON ® 2.0 DISCOVERY REINVENTED. What is Summon 2.0? A new, streamlined, modern interface New and enhanced features providing layers of contextual guidance.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Wei Feng , Jiawei Han, Jianyong Wang , Charu Aggarwal , Jianbin Huang
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Curtis Spencer Ezra Burgoyne An Internet Forum Index.
Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.
Search Engine Architecture
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
L JSTOR Tools for Linguists 22nd June 2009 Michael Krot Clare Llewellyn Matt O’Donnell.
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
Monster cloud search Access and search resumes from all talent sources, no matter where they're located august 2014.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.
Concept-based P2P Search How to find more relevant documents Ingmar Weber Max-Planck-Institute for Computer Science Joint work with Holger Bast Torino,
1/16/20161 Introduction to Graphs Advanced Programming Concepts/Data Structures Ananda Gunawardena.
Advisor: Koh Jia-Ling Nonhlanhla Shongwe EFFICIENT QUERY EXPANSION FOR ADVERTISEMENT SEARCH WANG.H, LIANG.Y, FU.L, XUE.G, YU.Y SIGIR’09.
Document Clustering and Collection Selection Diego Puppin Web Mining,
Yahoo! BOSS Open up Yahoo!’s Search data via web services Developer & Custom Tracks Big Goal – If you’re in a vertical and you perform a search, you should.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
University Of Seoul Ubiquitous Sensor Network Lab Query Dependent Pseudo-Relevance Feedback based on Wikipedia 전자전기컴퓨터공학 부 USN 연구실 G
Information Retrieval in Practice
Information Retrieval in Practice
Information Organization: Overview
Search Engine Architecture
Information Retrieval in Practice
Search Engine Architecture
SIS: A system for Personal Information Retrieval and Re-Use
The Four Dimensions of Search Engine Quality
Ashutosh Rana Rahul Nori 7/17/2018
Early Profile Pruning on XML-aware Publish-Subscribe Systems
Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers Berlin |
Charles Tappert Seidenberg School of CSIS, Pace University
The Search Engine Architecture
Academic & More Group 4 谢知晖 王逸雄 郭嘉宋 程若愚.
Information Retrieval and Web Design
INF 141: Information Retrieval
Presentation transcript:

Recruiting Solutions 1 AsifDaniel Structure, Personalization, Scale: A Deep Dive into LinkedIn Search

Overview What is LinkedIn search and why should you care? What are our systems challenges? What are our relevance challenges? 2

3

4

Search helps members find and be found. 5

Search for people, jobs, groups, and more. 6

7

A separate product for recruiters. 8

Search is the core of key LinkedIn use cases. 9

What’s unique  Personalized  Part of a larger product experience –Many products –Big part  Task-centric –Find a job, hire top talent, find a person, … 10

Systems Challenges 11

Evolution of LinkedIn’ Search Architecture 2004:  No Search Engine  Iterate through your network and filter

Lucene (Single Shard) Lucene (Single Shard) Updates Queries Results 2007: Introducing Lucene (single shard, multiple replicas)

Lucene Updates Queries Results Updater Lucene Zoie 2008: Zoie - real-time search (search without commits/shutdown)

Lucene Source 1 Queries Results Updater Lucene Zoie Source 2 …. Source N Content Store …. 2008: Content Store (aggregating multiple input sources)

Source 1 Queries Results Updater Source 2 …. Source N Content Store …. Sharded Broker 2008: Sharded search

Source 1 Queries Results Updater Source 2 …. Source N Content Store …. Sensei Broker Lucene Zoie Bobo 2009: Bobo – Faceted Search

Updater 2010: SenseiDB (cluster management, new query language, wrapping existing pieces)

Updater 2011: Cleo (instant typeahead results)

Updater 2013: Too many stacks Group Search Article/Post Search And more…

Challenges  Index rebuilding very difficult  Live updates are at an entity granularity  Scoring is inflexible  Lucene limitations  Fragmentation – too many components, too many stacks  Economic Graph 21 Opportunity

Updater 2014: Introducing Galene

Life of a Query 23 Query Rewriter/ Planner Results Merging User Query Search Results Search Shard

Life of a Query – Within A Search Shard 24 Rewritten Query Top Results From Shard INDEX Top Results Retrieve a Document Score the Document

Life of a Query – Within A Rewriter 25 Query DATA MODEL Rewriter State Rewriter Module DATA MODEL Rewritten Query Rewriter Module

Life of Data - Offline 26 INDEX Derived Data Raw Data DATA MODEL

Improvements  Regular full index builds using Hadoop –Easier to reshard, add fields  Improved Relevance –Offline relevance, query rewriting frameworks  Partial Live Updates Support –Allows efficient updates of high frequency fields (no sync) –Goodbye Content Store, Goodbye Zoie  Early termination –Ultra low latency for instant results –Goodbye Cleo  Indexing and searching across graph entities/attributes  Single engine, single stack 27

Galene Deep dive 28

Primer on Search 29

Lucene An open source API that supports search functionality:  Add new documents to index  Delete documents from the index  Construct queries  Search the index using the query  Score the retrieved documents 30

The Search Index  Inverted Index: Mapping from (search) terms to list of documents (they are present in)  Forward Index: Mapping from documents to metadata about them 31

32 BLAH BLAH BLAH Daniel BLAH BLAH LinkedIn BLAH BLAH BLAH BLAH BLAH BLAH Asif BLAH LinkedIn BLAH BLAH BLAH BLAH BLAH BLAH BLAH DanielAsifLinkedIn 2 1 Inverted Index Forward Index

The Search Index  The lists are called posting lists  Upto hundreds of millions of posting lists  Upto hundreds of millions of documents  Posting lists may contain as few as a single hit and as many as tens of millions of hits  Terms can be –words in the document –inferred attributes about the document 33

Lucene Queries  term:“asif makhani”  term:asif term:daniel  +term:daniel +prefix:tunk  +asif +linkedIn  +term:daniel connection:50510  +term:daniel industry:software connection:50510^4 34

Early termination  We order documents in the index based on a static rank – from most important to least important  An offline relevance algorithm assigns a static rank to each document on which the sorting is performed  This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)  Also works well with personalized search –+term:asif +prefix:makh +(connection:35176 connection: connection: ) 35

Partial Updates  Lucene segments are “document-partitioned”  We have enhanced Lucene with “term-partitioned” segments  We use 3 term-partitioned segments: –Base index (never changed) –Live update buffer –Snapshot index 36

37 Base Index Snapshot Index Live Update Buffer

Going Forward  Consolidation across verticals  Improved Relevance Support –Machine-learned models, query rewriting, relevant snippets,…  Improved Performance  Search as a Service (SeaS)  Exploring the Economic Graph 38

Quality Challenges 39

The Search Quality Pipeline 40 New document FeaturesFeatures FeaturesFeatures Machin e learning model score New document FeaturesFeatures FeaturesFeatures Machin e learning model score New document FeaturesFeatures FeaturesFeatures Machine learning model score Ordered list spellcheckquery taggingvertical intentquery expansion

Spellcheck 41 PEOPLE NAMES COMPANIES TITLES PAST QUERIES n-grams marissa => ma ar ri is ss sa metaphone mark/marc => MRK co-occurrence counts marissa:mayer = 1000 marisa meyer yahoo marissa marisa meyer mayer yahoo

Query Tagging 42 machine learning data scientist brooklyn

Vertical Intent: Results Blending 43 [company] [employees] [jobs] [name search]

Vertical Intent: Typeahead 44 P(mongodb | mon) = 5% P(monsanto | mons): 50% P(mongodb | mong): 80%

Query Expansion 45

Ranking 46 New document FeaturesFeatures FeaturesFeatures Machin e learning model score New document FeaturesFeatures FeaturesFeatures Machin e learning model score New document FeaturesFeatures FeaturesFeatures Machine learning model score Ordered list

Ranking is highly personalized. 47 kevin scott

Not just for name search. 48

Relevance Model 49 keywords document FeaturesFeatures FeaturesFeatures Machine learning model

Examples of Features 50 Search keywords matching title = 3 Searcher location = Result location Searcher network distance to result = 2 …

Model Training: Traditional Approach 51 Documents for training FeaturesFeatures FeaturesFeatures Human evaluation LabelsLabels LabelsLabels Machine learning model

Model Training: LinkedIn’s Approach 52 Documents for training FeaturesFeatures FeaturesFeatures Human evaluation Search logs Human evaluation Search logs LabelsLabels LabelsLabels Machine learning model

Fair Pairs and Easy Negatives 53 Flipped [Radlinski and Joachims, 2006] Sample negatives from bottom results But watch out for variable length result sets. Compromise, e.g., sample from page 10.

Model Selection  Select model based on user and query features. –e.g., person name queries, recruiters making skills queries  Resulting model is a tree with logistic regression leaves.  Only one regression model evaluated for each document. 54 X 2 =? X 10 < ?

Summary What is LinkedIn search and why should you care? LinkedIn search enables the participants in the economic graph to find and be found. What are our systems challenges? Indexing rich, structured content; retrieving using global and social factors; real-time updates. What are our relevance challenges? Query understanding, personalized machine-learned ranking models. 55

56 Asif MakhaniDaniel Tunkelang