TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National University of Singapore


TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets
Chun Chen (1), Feng Li (2), Beng Chin Ooi (2), and Sai Wu (2)
(1) Zhejiang University, (2) National University of Singapore
SIGMOD '11
18 May 2011, Taewhi Lee

Outline
- Introduction
- Related Work
- System Overview
- Content-Based Indexing Scheme
- Ranking Function
- Experimental Evaluation
- Conclusion

Real-Time Search for SNS
- High update and query loads
- Lack of effective ranking functions
  - Timestamp + relevance

Main Idea: Tweet Index (TI)
- Classifying the tweets into two types
  - Distinguished tweets: real-time indexing
  - Noisy tweets: background batch indexing
- Ranking function
  - User's PageRank
  - Popularity of topics
  - Similarity between data and query
  - Timestamp

Example of Search Results (figure)

Related Work
- Partial indexing and view materialization
  - Adaptive & automatic creation
- Microblog search
  - Google & Twitter: results are sorted by time
  - Google: adaptively crawls the microblogs
  - Twitter: relies on an existing technique (e.g., Lucene)
  - Proposed ranking schemes are too complex and time-consuming
  - Forum search: posts to the same thread are organized as a tree

Social Graphs
- User graph G_u = (U, E)
  - U: set of users
  - E: friend links
- Relationships of tweets (reply or RT)
  - A tree-encoding ID is assigned to each tweet

Architecture of the TI (figure; labels: distinguished tweets, noisy tweets)

Structure of Inverted Index (figure)

Tweet Table
- Metadata of tweets stored in the database
- ID of the replied-to tweet
- # of tweets that reply to this tweet
- Offset in the log file (for unindexed tweets)
- B+-tree indexes are built on TID and UID
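To make the tweet table fields concrete, here is a minimal sketch of one record and of the reply-count update. The names `TweetRecord`, `register_reply`, and every field name are hypothetical, and a plain dict stands in for the B+-tree on TID; the slide does not give the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TweetRecord:
    """One row of the tweet table (field names are illustrative only)."""
    tid: str                          # tweet ID
    uid: str                          # author's user ID
    parent_tid: Optional[str] = None  # ID of the replied-to tweet, None for a root
    reply_count: int = 0              # number of tweets that reply to this tweet
    log_offset: Optional[int] = None  # offset in the log file while unindexed


def register_reply(table: Dict[str, TweetRecord], reply: TweetRecord) -> None:
    """Store a tweet and bump the reply counter of its parent, if known."""
    table[reply.tid] = reply
    parent = table.get(reply.parent_tid) if reply.parent_tid else None
    if parent is not None:
        parent.reply_count += 1


# Usage: a root tweet followed by one reply.
table: Dict[str, TweetRecord] = {}
register_reply(table, TweetRecord(tid="t1", uid="u1"))
register_reply(table, TweetRecord(tid="t2", uid="u2", parent_tid="t1"))
```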

Data Flow of Index Processor (figure)

Tweet Classification
- Query-based classification approach
  - A tweet by itself does not provide much information
- Assumption
  - Users are only interested in the top-K results
- Given a tweet t and a user's query set Q:
  - If ∃ q_i ∈ Q such that t is a top-K result for q_i under the ranking function F, then t is a distinguished tweet
  - Otherwise, t is a noisy tweet

Maintaining Query Set
- Suppose the n-th query appears with a probability given by Zipf's distribution (formula lost in extraction)
- Let s be the # of submitted queries per second
- This gives the probability that the n-th query appears within a second (formula lost in extraction)
- t(n): expected time interval of the n-th query
- We keep the n-th query in Q only if t(n) < t'
  - t': batch indexing interval
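The formulas on this slide did not survive extraction, so the following is only a sketch of the selection rule under stated assumptions: the n-th query has probability proportional to 1/n^alpha (a standard Zipf form), the s queries in a second are independent draws, and t(n) is the reciprocal of the per-second appearance probability. The function names and the `alpha` parameter are hypothetical.

```python
def zipf_probabilities(num_queries, alpha=1.0):
    """Assumed Zipf form: p(n) proportional to 1 / n**alpha, normalized to sum to 1."""
    weights = [1.0 / (n ** alpha) for n in range(1, num_queries + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_interval(p_n, s):
    """Expected seconds between appearances of a query with probability p_n.

    Assumes s independent query submissions per second, so the query shows up
    in a given second with probability 1 - (1 - p_n)**s.
    """
    p_per_second = 1.0 - (1.0 - p_n) ** s
    return float("inf") if p_per_second == 0.0 else 1.0 / p_per_second


def select_query_set(num_queries, s, batch_interval, alpha=1.0):
    """Keep the ranks n whose expected interval t(n) is below the batch interval t'."""
    probs = zipf_probabilities(num_queries, alpha)
    return [n for n, p in enumerate(probs, start=1)
            if expected_interval(p, s) < batch_interval]


# Usage: 1000 distinct queries, 100 submissions per second, t' = 10 seconds.
kept = select_query_set(num_queries=1000, s=100, batch_interval=10.0)
```

Under these assumptions only the most frequent queries are kept, because t(n) grows with the rank n.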

Naïve Classifier
- For each q_i in Q:
  - If ds(q_i, t).size < K for some q_i, t is a distinguished tweet
  - Otherwise, t is a noisy tweet
- Dominant set ds(q_i, t)
  - The tweets that rank higher than t for a query q_i
- Performance problems
  - A full scan of the tweet set is needed (to compute the dominant set)
  - Every tweet must be tested against every query
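The naïve classifier can be sketched as below. The scoring function `toy_score` (keyword overlap times a per-tweet weight) is a placeholder for the ranking function F, and the dict-based tweet layout is hypothetical; only the dominant-set test follows the slide.

```python
def toy_score(query_words, tweet):
    """Placeholder for F: number of shared keywords times a per-tweet weight."""
    return len(query_words & tweet["words"]) * tweet["weight"]


def dominant_set(query_words, tweet, tweets, score):
    """Tweets that rank strictly higher than `tweet` for this query (full scan)."""
    own = score(query_words, tweet)
    return [u for u in tweets if u is not tweet and score(query_words, u) > own]


def is_distinguished_naive(tweet, queries, tweets, score, k):
    """Distinguished if fewer than k tweets dominate it for at least one query."""
    return any(len(dominant_set(q, tweet, tweets, score)) < k for q in queries)


# Usage: three tweets, one query, K = 2.
tweets = [
    {"id": 1, "words": {"sigmod", "paper"}, "weight": 3.0},
    {"id": 2, "words": {"sigmod"}, "weight": 2.0},
    {"id": 3, "words": {"sigmod", "lunch"}, "weight": 1.0},
]
queries = [{"sigmod"}]
labels = [is_distinguished_naive(t, queries, tweets, toy_score, k=2) for t in tweets]
```

Each call scans every tweet for every query, which is exactly the cost the two optimizations on the following slides try to avoid.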

Opt. 1: Top-K Threshold
- Observation
  - The scores of the 10th- and 100th-ranked tweets are quite stable
- Computing the dominant set is replaced by a score comparison
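One way to read this optimization: cache the K-th best score per query and compare a new tweet's score against it instead of rebuilding the dominant set. The sketch below assumes that reading; `toy_score`, the function names, and the dict of named queries are hypothetical, and how the thresholds are refreshed over time is left out.

```python
def toy_score(query_words, tweet):
    """Placeholder for F: number of shared keywords times a per-tweet weight."""
    return len(query_words & tweet["words"]) * tweet["weight"]


def kth_score_thresholds(queries, tweets, score, k):
    """Per query, the score of the current K-th ranked tweet (-inf if fewer than k)."""
    thresholds = {}
    for name, words in queries.items():
        ranked = sorted((score(words, u) for u in tweets), reverse=True)
        thresholds[name] = ranked[k - 1] if len(ranked) >= k else float("-inf")
    return thresholds


def is_distinguished_by_threshold(tweet, queries, thresholds, score):
    """A new tweet is distinguished if it beats the cached threshold of any query."""
    return any(score(words, tweet) > thresholds[name]
               for name, words in queries.items())


# Usage: thresholds are computed once, then reused for each incoming tweet.
indexed = [
    {"words": {"sigmod", "paper"}, "weight": 3.0},
    {"words": {"sigmod"}, "weight": 2.0},
    {"words": {"sigmod", "lunch"}, "weight": 1.0},
]
queries = {"q1": {"sigmod"}}
thresholds = kth_score_thresholds(queries, indexed, toy_score, k=2)
incoming = {"words": {"sigmod", "demo"}, "weight": 2.5}
result = is_distinguished_by_threshold(incoming, queries, thresholds, toy_score)
```

The comparison costs one score evaluation per query, at the price of some inaccuracy whenever the cached threshold is stale.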

Opt. 2: Matrix Index for Queries
- Candidate query set
  - Determined by the keywords that appear in both the tweet and the query
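The slide does not describe the layout of the matrix index, so the sketch below swaps in a plain keyword-to-queries map to illustrate the goal only: restricting the test to queries that share a keyword with the tweet. All names are hypothetical, and treating "at least one shared keyword" as the candidate condition is an assumption.

```python
from collections import defaultdict


def build_keyword_to_queries(queries):
    """Map each keyword to the IDs of the queries that contain it."""
    index = defaultdict(set)
    for qid, words in queries.items():
        for word in words:
            index[word].add(qid)
    return index


def candidate_queries(tweet_words, index):
    """IDs of queries sharing at least one keyword with the tweet."""
    hits = set()
    for word in tweet_words:
        hits |= index.get(word, set())
    return hits


# Usage: only q1 and q3 need to be tested against this tweet.
queries = {"q1": {"sigmod", "paper"}, "q2": {"lunch"}, "q3": {"tweets"}}
index = build_keyword_to_queries(queries)
candidates = candidate_queries({"sigmod", "tweets", "index"}, index)
```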

Implementation of Indexes
Real-time indexing
1. Retrieve the parent tweet (2-3 I/Os via the index on TID) and update the count number in the parent tweet (1 I/O)
2. Insert the tweet into the tweet data table (insert: 1 I/O, index update: 2-3 I/Os)
3. Insert the tweet into the inverted index (n I/Os)
Batch indexing
1. Append the tweet to the log file (1 I/O)
2. Insert the tweet into the tweet data table (insert: 1 I/O, index update: 2-3 I/Os)
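Adding up the per-step figures above gives the immediate cost of each path. The sketch below only sums the I/O counts listed on the slide; the function names are hypothetical, `n` is the slide's count of inverted-index I/Os, and the later background work for batch-indexed tweets is not included because the slide gives no figure for it.

```python
def realtime_io_cost(n):
    """(low, high) I/Os to index one tweet in real time, given n inverted-index I/Os."""
    low = 2 + 1 + (1 + 2) + n    # parent lookup, count update, table insert, postings
    high = 3 + 1 + (1 + 3) + n
    return low, high


def batch_io_cost():
    """(low, high) immediate I/Os for a tweet deferred to batch indexing."""
    low = 1 + (1 + 2)            # log append, table insert with index update
    high = 1 + (1 + 3)
    return low, high


# Usage: a tweet that needs 5 inverted-index I/Os.
realtime = realtime_io_cost(5)   # (11, 13)
deferred = batch_io_cost()       # (4, 5)
```

So real-time indexing costs 6 + n to 8 + n I/Os per tweet, against 4 to 5 immediate I/Os for a tweet that is only logged.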

Ranking Function
- User's PageRank
  - V: user, E: following link
- Popularity of topics (= tweet tree)
  - Only the popularities of active trees are computed and kept in memory
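For readers unfamiliar with PageRank on a follow graph, here is a textbook power-iteration sketch. It is not taken from the slides: the damping factor 0.85, the fixed iteration count, the uniform handling of users who follow nobody, and the name `user_pagerank` are all assumptions. An edge u -> v means u follows v, so rank flows toward followed users.

```python
def user_pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank; `follows` maps a user to the users they follow."""
    users = set(follows)
    for targets in follows.values():
        users.update(targets)
    users = sorted(users)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        nxt = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = follows.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # A user who follows nobody spreads rank evenly over all users.
                share = damping * rank[u] / n
                for v in users:
                    nxt[v] += share
        rank = nxt
    return rank


# Usage: a and b follow c, and c follows a.
ranks = user_pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph c ends up with the highest rank and b, whom nobody follows, with the lowest.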

Ranking Function (cont'd)
- Time-based ranking
  - F is monotonically decreasing with time
- Problem
  - Search performance is affected by the size of the inverted index

Adaptive Index Search
- Read a block of the index iteratively
- Stop reading if the max. score before ts < T_Θ(q)
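The stopping rule can be sketched as an early-terminating scan. This is an illustration under assumptions, not the paper's algorithm: postings are stored newest first in blocks of (timestamp, tweet ID), the score is relevance times an exponential time decay (the slides only say F decreases with time), relevance is bounded by `max_relevance`, and the current K-th best score plays the role of the threshold. All names and the `half_life` parameter are hypothetical.

```python
import heapq


def adaptive_search(blocks, now, k, relevance, max_relevance, half_life=3600.0):
    """Scan newest-first blocks and stop once older tweets cannot enter the top k."""
    def decay(age):
        return 0.5 ** (age / half_life)

    top = []          # min-heap of (score, tweet_id); top[0] is the current K-th best
    blocks_read = 0
    for block in blocks:
        blocks_read += 1
        for ts, tid in block:
            score = relevance(tid) * decay(now - ts)
            if len(top) < k:
                heapq.heappush(top, (score, tid))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, tid))
        # Every unread tweet is older than the oldest timestamp in this block,
        # so its score is at most max_relevance * decay(now - oldest_ts).
        oldest_ts = block[-1][0]
        if len(top) == k and max_relevance * decay(now - oldest_ts) < top[0][0]:
            break
    return sorted(top, reverse=True), blocks_read


# Usage: three blocks, of which only the first two are read.
rel = {"t1": 0.9, "t2": 0.8, "t3": 1.0, "t4": 1.0, "t5": 1.0, "t6": 1.0}
blocks = [[(1000, "t1"), (990, "t2")], [(500, "t3"), (400, "t4")], [(100, "t5"), (50, "t6")]]
results, blocks_read = adaptive_search(blocks, now=1000, k=2, relevance=rel.get,
                                       max_relevance=1.0, half_life=100.0)
```

Because the bound only shrinks as the scan moves back in time, the number of blocks read depends on how quickly scores decay rather than on the full length of the posting list.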

Experimental Setting
- Dataset
  - Twitter data collected over 3 years (Oct 2006 ~ Nov 2009)
  - ~465K users, 25M+ tweets
- Experiments
  - Queries are generated by randomly combining keywords
    - The # of keywords per query follows Zipf's distribution (1-word: 60%, 2-word: 30%, 3+-word: 10%)
  - Queries are submitted at random timestamps
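A query workload with the stated length mix could be generated roughly as follows. This is a sketch, not the authors' generator: the vocabulary, the uniform choice of keywords, the fixed seed, and treating "3+" as exactly three keywords are all assumptions, and `generate_queries` is a hypothetical name.

```python
import random


def generate_queries(vocabulary, count, seed=0):
    """Draw `count` keyword queries with the 60/30/10 length mix from the slide."""
    rng = random.Random(seed)
    lengths = rng.choices([1, 2, 3], weights=[0.6, 0.3, 0.1], k=count)
    # random.sample picks distinct keywords, so a query never repeats a word.
    return [rng.sample(vocabulary, n) for n in lengths]


# Usage: 1000 queries over a tiny illustrative vocabulary.
vocabulary = ["sigmod", "tweets", "index", "search", "ranking", "realtime"]
queries = generate_queries(vocabulary, count=1000)
```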

# of Indexed Tweets in Real-Time (figure)

Indexing Cost (per 10K Tweets) (figure)

Accuracy (Adaptive Threshold) (figure)

Performance of Query Processing
- The size of the inverted index for a keyword k_i is proportional to the # of tweets containing k_i

Distribution of Results (figure)

Conclusion
- Classifying the tweets into two types
  - Distinguished tweets: real-time indexing
  - Noisy tweets: background batch indexing
- Ranking function
  - User's PageRank
  - Popularity of topics
  - Similarity between data and query
  - Timestamp

Thank you!