
Beyond Pages: Supporting Efficient, Scalable Entity Search with Dual-Inversion Index
Tao Cheng and Kevin Chang
{tcheng3,kcchang}@cs.uiuc.edu
Computer Science Department, University of Illinois at Urbana-Champaign

Users in Frustration
Customer service phone number of Amazon?
Search on Amazon? Search on a search engine?

Even More Frustration
Professors in the area of data mining?
cs.uiuc.edu, cs.uiuc.edu/research, cs.uiuc.edu/research/data, …
cs.stanford.edu, cs.stanford.edu/research, cs.stanford.edu/research/faculty, …

Many, many such cases:
- The email of Kevin Chang?
- The papers and presentations of ICDE 2010?
- Conferences and their due dates on databases in 2010?
- Sale price of “Canon PowerShot A400”?
Often, we are looking for data entities (e.g., emails, dates, prices), not pages.
Indeed, according to a recent survey, 52.9% of queries directly target structured entities [DE Bulletin’09].
[DE Bulletin’09]: R. Kumar and A. Tomkins, “A Characterization of Online Search Behavior”

Recent Trends: WQA
Web-based Question Answering (WQA) (Wu 2007, Lin 2003, Brill 2002)
Question: Who is CEO of Dell?
Keywords: “CEO Dell”, then parse top-k results
Answer: Michael Dell

Recent Trends: WIE
Web Information Extraction (WIE) (Marius 2006, Cafarella 2005, Etzioni 2004)
Specialized information extractors
Pattern: “X is CEO of Y”
Company    CEO
Google     Eric Schmidt
IBM        S. Palmisano
…          …
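To make the pattern-based extraction on this slide concrete, here is a minimal sketch that applies the pattern “X is CEO of Y” with a regular expression. The regex, the function name `extract_company_ceo`, and the sample sentence are assumptions made for illustration; real WIE systems use far more robust extractors.

```python
import re

# Hypothetical regex for the slide's pattern "X is CEO of Y".
# X: one or more capitalized tokens (initials such as "S." allowed).
# Y: a single capitalized token.
CEO_PATTERN = re.compile(
    r"(?P<ceo>[A-Z][\w.]*(?: [A-Z][\w.]*)*) is CEO of (?P<company>[A-Z]\w*)"
)

def extract_company_ceo(text):
    """Return (company, ceo) pairs for every match of the pattern in text."""
    return [(m.group("company"), m.group("ceo"))
            for m in CEO_PATTERN.finditer(text)]

# Sample sentence invented to mirror the table on the slide.
sample = "Eric Schmidt is CEO of Google. S. Palmisano is CEO of IBM."
pairs = extract_company_ceo(sample)
```

Each match fills one row of a Company/CEO relation, which is the kind of table the slide shows.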

Recent Trends: TAS
Typed-Annotated Search (TAS) (Cheng 2007, Cafarella 2007, Chakrabarti 2006)
Query: Inventor of television?
Finding person names near the keywords “invent” and “television”
Result: ranked entity list with scores (e.g., 0.90, 0.80, 0.60, …)

From Pages to Data Entities
Traditional Search: Keywords
Entity Search: Keywords & Entity Type
Results Support

Concretely, what do we mean by Entity Search?
Online demo: 3TB corpus of 150M pages, 16-machine cluster, 24 entity types

Entity Search Problem Abstraction
Given: an entity collection over a document collection
Input: a query made up of a tuple pattern, entity types, and keywords, e.g. ow(David DeWitt #phone #email)
Output: a ranked list of tuples t, sorted by Score(q(t)), the query score of t
In short:
Input: keywords and entity type (optionally with a pattern), e.g. Amazon Customer Service #phone
Output: ranked entity instances, ordered by Score(e), where e is an entity instance

Unanimous Requirements across the Trends
Context Matching (in document)
- Match the target type (say, #location) by keywords (e.g., “louvre museum”) that appear in its surrounding context, in certain desired patterns
Global Aggregation (across documents)
- Match an entity (say, #location = Paris) as many times as it appears across numerous pages
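The two requirements can be illustrated with a small sketch: a per-document proximity match followed by a corpus-wide sum of the match scores. The toy corpus, the window size, and the scoring function `1 / (1 + sum of distances)` are all assumptions chosen for illustration; the slides do not specify a scoring model.

```python
from collections import defaultdict

# Toy corpus (invented): each document is (tokens, entities), where
# entities are (entity_type, instance, position) annotations.
toy_corpus = [
    (["louvre", "museum", "is", "in", "paris"],
     [("#location", "Paris", 4)]),
    (["visit", "the", "louvre", "museum", "paris", "france"],
     [("#location", "Paris", 4), ("#location", "France", 5)]),
    (["london", "has", "many", "museum"],
     [("#location", "London", 0)]),
]

def context_match(tokens, entities, keywords, entity_type, window=5):
    """Context matching: yield (instance, local_score) for each entity of
    entity_type that has every keyword within `window` positions."""
    positions = {k: [i for i, t in enumerate(tokens) if t == k]
                 for k in keywords}
    if any(not p for p in positions.values()):
        return  # some keyword is missing from this document
    for etype, instance, epos in entities:
        if etype != entity_type:
            continue
        # Distance from the entity to the nearest occurrence of each keyword.
        dists = [min(abs(epos - p) for p in positions[k]) for k in keywords]
        if max(dists) <= window:
            yield instance, 1.0 / (1 + sum(dists))  # assumed scoring

def global_aggregate(corpus, keywords, entity_type, window=5):
    """Global aggregation: sum local scores per instance over all documents."""
    scores = defaultdict(float)
    for tokens, entities in corpus:
        for instance, s in context_match(tokens, entities, keywords,
                                         entity_type, window):
            scores[instance] += s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = global_aggregate(toy_corpus, ["louvre", "museum"], "#location")
```

In this toy run, Paris is matched in two documents and so accumulates a higher score than France, while London is never matched because its document lacks “louvre”.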

Computation Challenges
Expensive Context Matching (Join)
- Need to perform proximity matching in documents
- Beyond simple containment checking
Extensive Global Aggregation (G)
- Need to perform corpus-scale aggregation
- A layer that is non-existent in online page retrieval

Traditional Page Retrieval based Approach
Question: Who is the CEO of Dell?
Keywords: “CEO Dell”, then analyze top-k results
Answer: Michael Dell
Limitation:
- Only top-k documents
- Many random seeks

Our Proposal: Entity-aware Indexing
Inspired by the success of the inverted index in enabling efficient IR for searching documents
However, the traditional inverted index is only aware of keywords and documents
- How can we make the index entity-aware?
Our proposal: Dual-Inversion Index
- Principle I: Document-inverted Index
- Principle II: Entity-inverted Index

Entity-as-keyword: Document-inverted Index
[Figure: index layout with labels keyword, doc id, pos; example entity instances 800-201-7575 and 408-376-7400]
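One way to read "entity-as-keyword" is that an entity type such as #phone gets its own posting list, just like a keyword, with each posting also carrying the extracted instance. The sketch below assumes that reading; the posting layout `(doc_id, pos, instance)`, the function name, and the toy documents are hypothetical, with only the two phone numbers taken from the slide.

```python
from collections import defaultdict

def build_d_inverted(docs):
    """Build a document-inverted index.

    docs: {doc_id: [(term, pos, instance_or_None), ...]}
    Plain keywords carry instance None; entity occurrences use the
    entity type (e.g. "#phone") as the term and carry the instance.
    Returns: term -> postings (doc_id, pos, instance) sorted by doc, pos.
    """
    index = defaultdict(list)
    for doc_id, occurrences in docs.items():
        for term, pos, instance in occurrences:
            index[term].append((doc_id, pos, instance))
    for postings in index.values():
        postings.sort(key=lambda p: (p[0], p[1]))
    return index

# Toy documents (invented contents).
docs = {
    2: [("amazon", 0, None), ("#phone", 2, "408-376-7400")],
    1: [("amazon", 0, None), ("customer", 1, None), ("service", 2, None),
        ("#phone", 3, "800-201-7575")],
}
d_index = build_d_inverted(docs)
```

Because keyword and entity postings share the same doc-ordered layout, a query can intersect them on doc id and then compare positions, as in ordinary phrase or proximity search.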

Document Space Partitioning
[Figure: documents partitioned across Node 1 … Node 10]

Distributed Query Processing over D-inverted Index
[Figure: Node 1 … Node 10 each run Join locally and send results and scores; Aggregation and Ranking run globally]
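A rough sketch of this flow, under the assumption that documents are partitioned across nodes so each node can only join within its own documents and the per-entity aggregation has to happen after results are transferred. The posting layout, window, scoring, and node contents are hypothetical.

```python
from collections import defaultdict

def local_join(node_index, keyword, entity_type, window=3):
    """Local step on one node: join keyword and entity postings on doc_id
    and keep entity occurrences within `window` positions of the keyword.
    node_index: term -> [(doc_id, pos, instance), ...]"""
    kw_by_doc = defaultdict(list)
    for doc_id, pos, _ in node_index.get(keyword, []):
        kw_by_doc[doc_id].append(pos)
    matches = []
    for doc_id, epos, instance in node_index.get(entity_type, []):
        dists = [abs(epos - kpos) for kpos in kw_by_doc.get(doc_id, [])]
        if dists and min(dists) <= window:
            matches.append((instance, 1.0 / (1 + min(dists))))  # assumed score
    return matches

def global_aggregate_and_rank(per_node_matches, k=10):
    """Global step: the same instance may be matched on many nodes, so sum
    its scores across all nodes, then rank."""
    totals = defaultdict(float)
    for matches in per_node_matches:
        for instance, score in matches:
            totals[instance] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Two toy nodes (invented postings).
node_1 = {"amazon": [(1, 0, None)], "#phone": [(1, 2, "800-201-7575")]}
node_10 = {"amazon": [(7, 5, None)],
           "#phone": [(7, 6, "800-201-7575"), (7, 20, "408-376-7400")]}
results = global_aggregate_and_rank(
    [local_join(n, "amazon", "#phone") for n in (node_1, node_10)])
```

Note that every local match must be shipped to the global step, since no single node knows an entity's total score.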

Entity-as-document: Entity-inverted Index
[Figure: index layout with labels keyword pos, entity id, entity pos]
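One possible reading of "entity-as-document" is that each keyword's posting list is organized by entity instance rather than by document, recording where the keyword occurs relative to that entity. The sketch below follows that reading; the nested layout, the co-occurrence window used at build time, and the toy documents are assumptions, not details given on the slide.

```python
from collections import defaultdict

def build_e_inverted(docs, entity_type, window=3):
    """Build an entity-inverted index for one entity type.

    docs: {doc_id: [(term, pos, instance_or_None), ...]}
    Returns: keyword -> entity instance -> [(doc_id, keyword_pos, entity_pos)]
    Only keyword/entity co-occurrences within `window` positions are kept
    (an assumed pruning rule).
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, occurrences in docs.items():
        entities = [(pos, inst) for term, pos, inst in occurrences
                    if term == entity_type]
        for term, kpos, inst in occurrences:
            if inst is not None:
                continue  # skip entity occurrences; index plain keywords only
            for epos, instance in entities:
                if abs(kpos - epos) <= window:
                    index[term][instance].append((doc_id, kpos, epos))
    return index

# Toy documents (invented contents).
docs = {
    1: [("amazon", 0, None), ("service", 1, None),
        ("#phone", 2, "800-201-7575")],
    2: [("amazon", 0, None), ("#phone", 1, "800-201-7575"),
        ("#phone", 9, "408-376-7400")],
}
e_index = build_e_inverted(docs, "#phone")
```

With this layout, all evidence for one entity instance sits together under each keyword, so a node can score that instance without consulting other nodes.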

Entity Space Partitioning
[Figure: entities partitioned across Node 1 … Node 9]

Distributed Query Processing over E-inverted Index
[Figure: Node 1 … Node 9 each run Join and Aggregation locally and send results and scores; Ranking runs globally]
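A sketch of this flow, assuming entity instances are partitioned so that each instance lives on exactly one node. Under that assumption a node can finish both the join and the aggregation for its own entities, and the global step only merges already-final scores. The index layout, scoring, and node contents are hypothetical.

```python
import heapq

def local_join_aggregate(node_index, keywords, k=10):
    """Local step: for each entity instance on this node, join the keyword
    postings on doc_id and sum scores over its documents.
    node_index: keyword -> {instance: [(doc_id, kw_pos, ent_pos), ...]}"""
    per_keyword = [node_index.get(kw, {}) for kw in keywords]
    if not per_keyword:
        return []
    candidates = set(per_keyword[0])
    for postings in per_keyword[1:]:
        candidates &= set(postings)  # instance must co-occur with every keyword
    scored = []
    for instance in candidates:
        doc_sets = [{d for d, _, _ in m[instance]} for m in per_keyword]
        score = 0.0
        for doc_id in set.intersection(*doc_sets):
            # Sum, over keywords, of the nearest keyword-to-entity distance.
            dist = sum(min(abs(kp - ep) for d, kp, ep in m[instance]
                           if d == doc_id)
                       for m in per_keyword)
            score += 1.0 / (1 + dist)  # assumed scoring
        if score > 0:
            scored.append((score, instance))
    return heapq.nlargest(k, scored)

def global_rank(per_node_topk, k=10):
    """Global step: instances are disjoint across nodes, so merging the
    local top-k lists yields the global top-k."""
    return heapq.nlargest(k, (item for node in per_node_topk for item in node))

# Two toy nodes holding disjoint entity instances (invented postings).
node_1 = {"amazon": {"800-201-7575": [(1, 0, 2), (2, 0, 1)]},
          "service": {"800-201-7575": [(1, 1, 2)]}}
node_9 = {"amazon": {"408-376-7400": [(3, 4, 6)]},
          "service": {"408-376-7400": [(3, 0, 6)]}}
merged = global_rank(
    [local_join_aggregate(n, ["amazon", "service"]) for n in (node_1, node_9)])
```

Compared with the document-partitioned flow, each node here transfers at most k scored instances, which is one plausible reason the transfer and global processing costs differ between the two schemes.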

Experiment Setup
Corpus: general crawl of the Web (Aug. 2007), around 3TB with 150M pages
Entities: 24 diverse entity types
Concrete applications (benchmark queries):
- Yellowpage: #email, #phone, #state, #location, #zipcode
- CSAcademia: #university, #professor, #research, #email, #phone

Metrics Used for Evaluation to Measure Throughput & Response Time
Local Processing Time
- Overall local processing time
- Max local processing time
Transfer Time
- Overall transfer time
- Max transfer time
Global Processing Time

Local Processing Time Comparison

Network Transfer Comparison

Global Processing Time Comparison

Overall Time/Space Summary
Generally, roughly 2 to 4 orders of magnitude speedup, with reasonable space overhead

Dual-Inversion Index
The two types of indexes can co-exist and complement each other

Indexing Configuration
Entity Type Level Configuration:
- Create E-Inverted Index only for popular, space-efficient entities
- Use D-Inverted Index for less popular, space-expensive entities
Keyword Level Configuration:
- Only create E-Inverted Index for keyword and entity type pairs that are related, e.g., queried together often according to the query log
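The two configuration levels could be expressed as simple selection rules. The sketch below is only one way to do it: the statistics, thresholds, function names, and query log are all invented for illustration, and the slides do not say how popularity or space cost are measured.

```python
from collections import Counter

def choose_e_inverted_types(type_stats, min_queries, max_space_gb):
    """Entity type level: pick types that are popular and cheap enough for
    an E-inverted index; all other types fall back to the D-inverted index.
    type_stats: {entity_type: (query_count, estimated_space_gb)}"""
    return {t for t, (queries, space) in type_stats.items()
            if queries >= min_queries and space <= max_space_gb}

def choose_keyword_pairs(query_log, min_count):
    """Keyword level: pick (keyword, entity_type) pairs that are queried
    together at least `min_count` times.
    query_log: [(keywords, entity_type), ...]"""
    counts = Counter()
    for keywords, entity_type in query_log:
        for kw in keywords:
            counts[(kw, entity_type)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}

# Invented statistics and query log, for illustration only.
type_stats = {"#phone": (900, 40), "#email": (700, 55), "#research": (30, 400)}
query_log = [(["amazon", "service"], "#phone"),
             (["amazon"], "#phone"),
             (["dewitt"], "#email")]

e_types = choose_e_inverted_types(type_stats, min_queries=100, max_space_gb=100)
pairs = choose_keyword_pairs(query_log, min_count=2)
```

Any query not covered by the selected types or pairs would still be answerable through the D-inverted index, which is how the two indexes complement each other.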

Conclusion
Identify essential computation requirements for entity search
Dual-inversion indexing and partition schemes for efficient and scalable query processing
- Document-inverted index
- Entity-inverted index
Verify over large-scale corpus with real applications

Thanks much for coming! Questions?

TopK Convergence

References of Related Work
Index Design
- Junghoo Cho and Sridhar Rajagopalan. A fast regular expression indexing engine. In ICDE, 2002.
- Hugh E. Williams, Justin Zobel, and Dirk Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 22(4):573–594, 2004.
- Xiaohui Long and Torsten Suel. Three-level caching for efficient query processing in large web search engines. In WWW, 2005.
- Michael Cafarella and Oren Etzioni. A search engine for large-corpus language applications. In WWW, 2005.
Question Answering
- S. Abney, M. Collins, and A. Singhal. Answer extraction. In ANLP, 2000.
- E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system. In EMNLP, 2002.
- Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. Scaling question answering to the web. In WWW, 2001.
- Jimmy J. Lin and Boris Katz. Question answering from the web using knowledge annotation and knowledge mining techniques. In CIKM, 2003.

Search Interface

Query I: Amazon Customer Service Phone
[Screenshot columns: Results, # of Supporting Pages, Representative Supporting Pages]

Query II: Professors in Data Mining

Query III: University of California Locations
