TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets
Pete Bohman, Adam Kunk

 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion


 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion

 Requirements ◦ Contents searchable immediately following creation ◦ Scale to thousands of updates per second (e.g., ~5,000 tweets/sec at the news of Osama bin Laden's death) ◦ Results relevant to the query via cost-efficient ranking

TI Rank vs. Time Rank

 Real-time search of microblogging applications is provided via two components: ◦ Indexing mechanism – prunes tweets, indexing only a subset of all tweets (allows for speed) ◦ Ranking mechanism – surfaces relevant tweets, weeding out tweets not deemed important enough  Main idea: index and rank only the important tweets

 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion

 The Case for Partial Indexes ◦ Stonebraker, 1989 ◦ Index only a portion of a column  User-specified index predicates (e.g., WHERE salary > 500)  Build the index as a side effect of query processing  Incremental index building

 An application of materialized views is to use cost models to automatically select which views to materialize. ◦ Materialized views can be thought of as snapshots of a database, in which the results of a query are stored in an object.  The concept of indexing only essential tweets in real time was borrowed from view materialization.

 Google and Twitter have both released real-time search engines. ◦ Google's engine adaptively crawls the microblog ◦ Twitter's engine relies on Apache Lucene (a high-performance, full-featured text search engine library)  But both the Google and Twitter engines use only time in their ranking algorithms.  TI's ranking algorithm takes much more than time into account.

 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion

 Certain structures are kept in-memory to support indexing and ranking ◦ Keyword threshold – records statistics of recent popular queries ◦ Candidate topic list – information about recent topics ◦ Popular topic list – information about highly discussed topics

 Twitter users have links to other friends  A user graph captures this relationship  G_u = (U, E) ◦ U is the set of users in the system ◦ E is the set of friend links between them
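The user graph G_u = (U, E) can be sketched as an adjacency map; this is a minimal illustration (class and user names are my own, not from the paper's code):

```python
from collections import defaultdict

# Sketch of the user graph G_u = (U, E): each user maps to the set of
# friends they link to.
class UserGraph:
    def __init__(self):
        self.edges = defaultdict(set)  # user -> set of friend user IDs

    def add_friend(self, u, v):
        self.edges[u].add(v)

    def friends(self, u):
        return self.edges[u]

g = UserGraph()
g.add_friend("alice", "bob")
g.add_friend("alice", "carol")
```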

 Nodes represent tweets  Directed edges indicate replies or retweets  Implemented by assigning tweets a tree encoding ID Tweet Tree Structure
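The slide does not detail the encoding scheme, so the following is an assumed Dewey-style sketch: each reply's ID extends its parent's encoding, which makes ancestry tests simple prefix checks.

```python
# Hypothetical tree-encoding sketch: the root tweet gets encoding (1,),
# and each reply appends its sibling position to the parent's encoding.
class TweetTree:
    def __init__(self, root_tid):
        self.root = root_tid
        self.encoding = {root_tid: (1,)}
        self.child_count = {root_tid: 0}  # replies seen per tweet

    def add_reply(self, parent_tid, child_tid):
        self.child_count[parent_tid] += 1
        self.encoding[child_tid] = (
            self.encoding[parent_tid] + (self.child_count[parent_tid],)
        )
        self.child_count[child_tid] = 0

    def is_ancestor(self, a, b):
        # a is an ancestor of b iff a's encoding is a proper prefix of b's
        ea, eb = self.encoding[a], self.encoding[b]
        return len(ea) < len(eb) and eb[:len(ea)] == ea
```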

 Search is handled via an inverted index for tweets  Given a keyword, the inverted index returns a tweet list, T ◦ T contains set of tweets sorted by timestamp

 TID = Tweet ID  U-PageRank = Used for ranking  TF = Term Frequency  tree = TID of root node of tweet tree  time = timestamp TI Inverted Index
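The posting fields above (TID, U-PageRank, TF, tree root, time) can be sketched as follows; the concrete layout is my assumption, with postings kept newest-first so a keyword lookup returns tweets sorted by timestamp:

```python
from collections import defaultdict
from bisect import insort

# Sketch of TI's inverted index: each posting stores
# (time, TID, U-PageRank, TF, tree root), sorted newest-first.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, keyword, tid, u_pagerank, tf, tree_root, time):
        # store negative time so the sorted list is newest-first
        insort(self.postings[keyword], (-time, tid, u_pagerank, tf, tree_root))

    def lookup(self, keyword):
        # return (TID, timestamp) pairs, most recent first
        return [(tid, -neg_t)
                for (neg_t, tid, *_rest) in self.postings[keyword]]
```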

 In order to help ranking, TI keeps a table of metadata for each tweet ◦ TID = tweet ID ◦ RID = ID of replied tweet (to find parent) ◦ tree = TID of root node of tweet tree ◦ time = timestamp ◦ count = number of tweets replying to this tweet Ranking Support

 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion

 Observation ◦ Users are only interested in the top-K results for a query  Given a tweet t and a user query set Q: ◦ If ∃ q_i ∈ Q such that t is a top-K result for q_i under the ranking function F, then t is a distinguished tweet  What is the maintenance cost for the query set Q?

 Observation ◦ 20% of queries account for 80% of user requests (Zipf's distribution)  Suppose the n-th most popular query appears with probability p(n) ∝ 1/n (Zipf's distribution)  Let s be the number of queries submitted per second; the expected time interval between arrivals of the n-th query is then t(n) = 1 / (s · p(n))  The n-th query is kept in Q only if t(n) < t′  Batch processing occurs every t′ seconds
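The retention rule above can be sketched numerically. This assumes p(n) ∝ 1/n with a harmonic-sum normalizer (the slide only states Zipf's distribution, so the normalization is my assumption):

```python
# Sketch: keep the n-th query in Q only if its expected inter-arrival
# time t(n) = 1 / (s * p(n)) is below the batch period t'.
def retained_queries(num_queries, s, t_batch):
    H = sum(1.0 / k for k in range(1, num_queries + 1))  # Zipf normalizer
    kept = []
    for n in range(1, num_queries + 1):
        p_n = (1.0 / n) / H          # probability of the n-th query
        t_n = 1.0 / (s * p_n)        # expected seconds between arrivals
        if t_n < t_batch:
            kept.append(n)
    return kept
```

Because t(n) grows with n, the retained set is always a prefix of the popularity ranking: popular queries are maintained online, rare ones fall through to batch processing.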

 Dominant set ds(q_i, t) ◦ The tweets that rank higher than t for query q_i  Performance problems ◦ Computing the dominant set requires a full scan of the tweet set ◦ Each tweet must be tested against every query

 Observation ◦ The ranks of the lower results are stable  Replace the dominant-set computation with a comparison against the score of Q's K-th result.
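The K-th-score comparison can be sketched with a small min-heap per query; the heap-based structure is my illustration of the idea, not the paper's implementation:

```python
import heapq

# Sketch: each query keeps only its current top-K scores; a new tweet
# is distinguished for that query iff it beats the K-th (lowest kept) score.
class TopK:
    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap holding the K best scores seen so far

    def offer(self, score):
        """Return True if `score` enters the top-K."""
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, score)
            return True
        if score > self.heap[0]:          # beats the K-th score
            heapq.heapreplace(self.heap, score)
            return True
        return False
```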

 Compare a tweet to similar queries

[Table: keyword–query matrix mapping keywords k1–k4 to Query 1–Query 4, with a count column per query]

 Given a tweet t = ⟨…⟩, compare t to Q1, Q3, Q4
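The keyword-to-query routing can be sketched as an inverted map over query keywords (function names and the example queries are illustrative), so a tweet is only tested against queries that share at least one of its keywords:

```python
from collections import defaultdict

# Sketch: build an index from keyword -> queries containing it, then
# route a tweet to the union of queries matching any of its keywords.
def build_query_index(queries):
    idx = defaultdict(set)
    for qid, words in queries.items():
        for w in words:
            idx[w].add(qid)
    return idx

def candidate_queries(idx, tweet_words):
    cands = set()
    for w in tweet_words:
        cands |= idx[w]
    return cands
```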

 New tweets categorized as distinguished are indexed immediately: 1. If the tweet belongs to an existing tweet tree, retrieve its parent tweet to get the root ID and generate the encoding; update the count in the parent 2. Insert the tweet into the tweet data table 3. Insert the tweet into the inverted index  The main cost is updating the inverted index (one update per keyword in the tweet)

 New tweets categorized as noisy are indexed at a later time  Instead of updating the inverted index immediately, the tweet is appended to a log file  A batch indexing process periodically scans the log file and indexes the tweets found there
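The deferred path can be sketched as an append-only log plus a periodic flush (class names are my own; the real system logs to a file rather than memory):

```python
# Sketch of deferred indexing for noisy tweets: append to a log now,
# flush into the index during the periodic batch pass.
class ListIndex:
    """Stand-in for the real inverted index; just records tweets."""
    def __init__(self):
        self.items = []

    def add(self, tweet):
        self.items.append(tweet)

class DeferredIndexer:
    def __init__(self, index):
        self.index = index   # any object exposing .add(tweet)
        self.log = []

    def defer(self, tweet):
        self.log.append(tweet)

    def batch_flush(self):
        """Index everything logged so far; return how many were flushed."""
        for tweet in self.log:
            self.index.add(tweet)
        n, self.log = len(self.log), []
        return n
```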

 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion

 Ranking functions are completely separate from the indexing mechanism ◦ New ranking functions could be used  TI’s proposed ranking function is based on: ◦ User’s PageRank ◦ Popularity of the topic ◦ Timestamp (self-explanatory) ◦ Similarity between tweet and the query

 Twitter has two types of links between users ◦ f(u): the set of users who follow user u ◦ f⁻¹(u): the set of users whom user u follows  A matrix M_f[i][j] records the following links between users  A weight factor is given for each user: V = (w_1, w_2, …, w_n)

 The user PageRank is computed as P = V · M_f  So a user's PageRank combines their user weight with how many followers they have ◦ The more popular the user, the higher the PageRank
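The product P = V · M_f can be sketched in pure Python. This shows only the single vector-matrix step named on the slide, not the iterative normalization of full PageRank:

```python
# Sketch of P = V · M_f: M_f[i][j] = 1 when user i follows user j, so
# user j's score aggregates the weights of everyone following them.
def user_pagerank(V, M_f):
    n = len(V)
    return [sum(V[i] * M_f[i][j] for i in range(n)) for j in range(n)]
```

For example, with three users where each follows the next in a cycle, each user's score is exactly the weight of their single follower.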

 Users can retweet or reply to tweets.  Popularity can be determined by looking at the largest tweet trees.  Popularity of tree is equal to the sum of the U-PageRank values of all tweets in the tree
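The tree-popularity definition above is a direct sum, sketched here (function name is illustrative):

```python
# Sketch: a tweet tree's popularity is the sum of the U-PageRank values
# of all tweets in the tree.
def tree_popularity(tree_tweets, u_pagerank):
    return sum(u_pagerank[t] for t in tree_tweets)
```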

 The similarity of a query q and a tweet t is computed as: sim(q, t) = (q · t) / (|q| |t|)  q and t are turned into bags of words, then viewed as vectors

 q.timestamp = query submission time  tree.timestamp = timestamp of the tree that t belongs to (timestamp of the root node)  w_1, w_2, w_3 are weight factors for each component (all set to 1)

 The size of the inverted index limits the performance of the search for tweets ◦ The size of the inverted index grows with the number of tweets  To alleviate this problem, adaptive indexing is proposed:

 The main idea: ◦ Iteratively read one block of the inverted index at a time (rather than the entire list) ◦ Stop iterating once the timestamp component bounds the score low enough to discard the remaining results  Safe to stop because all remaining tweets in the inverted index are older and thus score even lower
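The early-stopping scan can be sketched as follows; `score_upper_bound` is a hypothetical parameter standing in for the time-decayed bound derived from the ranking function, and blocks are assumed newest-first:

```python
# Sketch of the adaptive scan: read blocks of postings newest-first and
# stop once no remaining (older) block could beat the current K-th score.
def scan_with_early_stop(blocks, score_upper_bound, kth_score):
    """blocks: lists of postings, newest-first across and within blocks.
    score_upper_bound(posting): best achievable score for that timestamp."""
    results = []
    for block in blocks:
        # the first posting in a block is its newest, hence its best bound
        if block and score_upper_bound(block[0]) <= kth_score:
            break  # every later block is older and scores even lower
        results.extend(block)
    return results
```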

 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion

 Evaluation performed on real dataset ◦ Dataset collected for 3 years (October 2006 to November 2009) ◦ 500 random users picked as seeds (from which other users are integrated into the social graphs) ◦ 465,000 total users ◦ 25,000,000 total tweets  Experiments typically 10 days long ◦ 5 days training, 5 days measuring performance

 Query lengths are distributed as follows: ◦ ~60% are 1 word ◦ ~30% are 2 words ◦ ~10% are more than 2 words  Queries are submitted at random; tweets are inserted into the system based on their original timestamps (from the dataset)

 TimeBased ranks using only the tweet timestamp (as Google does)

 Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion

 Current search engines are unable to index social networking data in real time  Adaptive indexing mechanism reduces update cost  Cost-efficient and effective ranking function  Successful evaluation using a real dataset from Twitter