1 EntityRank: Searching Entities Directly and Holistically Tao Cheng Joint work with : Xifeng Yan, Kevin Chang VLDB 2007, Vienna, Austria.


1 EntityRank: Searching Entities Directly and Holistically. Tao Cheng. Joint work with: Xifeng Yan, Kevin Chang. VLDB 2007, Vienna, Austria

2 Motivating Scenario: Customer service phone number of Amazon?

3 Search on Amazon?

4 Search on Google?

5 Many similar cases: The email of Luis Gravano? Which profs are doing databases at UIUC? The papers and presentations of ICDE 2007? Due date of SIGMOD 2008? Sale price of "Canon PowerShot A400"? "Hamlet" books available at bookstores? Oftentimes, we are looking for data entities (e.g. emails, dates, prices), not pages.

6 What you search is not what you want.

7 From pages to entities [Figure: Traditional Search vs. Entity Search; labels: Keywords, Entities, Results, Support]

8 Concretely, what do we mean by Entity Search? Online Demo. Yellowpage: Comprehensive corpus. Special Thanks: ~100M Pages from Stanford WebBase

9 Entity Search Problem: Given: Entity Collection over Document Collection. Input: a query made of a tuple pattern, entity types, and keywords, e.g. ow(David DeWitt #phone #email). Output: Ranked list of entity tuples t sorted by Score(q(t)), the query score of t. In short: Input: Keywords & Entities (optionally with a pattern), e.g. Amazon Customer Service #phone. Output: Ranked Entity Tuples [Figure: example scores 0.60, 0.80, 0.90]
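As a rough illustration of the query format on this slide, the sketch below splits a query string into plain keywords and entity types. The function name `parse_query` and the rule that any token starting with `#` names an entity type are assumptions inferred from the example queries, not something the talk specifies; the optional tuple pattern (such as `ow(...)`) is not handled.

```python
from typing import List, Tuple


def parse_query(query: str) -> Tuple[List[str], List[str]]:
    """Split a query into plain keywords and '#'-prefixed entity types.

    Hypothetical helper: the '#' convention is taken from the slide's
    example queries; pattern operators are ignored in this sketch.
    """
    keywords: List[str] = []
    entity_types: List[str] = []
    for token in query.split():
        if token.startswith("#") and len(token) > 1:
            entity_types.append(token[1:])  # drop the leading '#'
        else:
            keywords.append(token)
    return keywords, entity_types


keywords, entity_types = parse_query("Amazon Customer Service #phone")
```

Here `keywords` holds the three words to match in page text and `entity_types` holds the single type `phone` whose instances should be ranked.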

10 How to rank Entities? Why a novel Problem? Challenge:

11 Characteristics I: Contextual - Utilize entities' surrounding context [Figure labels: Content, Context]

12 Characteristics II: Uncertain - Extractions are not "perfect"

13 Characteristics III: Holistic - Much evidence from multiple sources

14 Characteristics IV: Discriminative - Web pages are of varying quality

15 Characteristics V: Associative - Tell true associations from accidental ones. Example: Finding Prof. Luis Gravano's email. Observation: info@acm.org appears very frequently with the keywords "Luis" and "Gravano". However, this association is only accidental, as info@acm.org appears on many pages.

16 EntityRank: The Impression Model [Figure: a tireless observer; Access Layer: Global Aggregation; Recognition Layer: Local Assessment; Validation Layer: Hypothesis Testing; example output scores 0.60, 0.80, 0.90]

17 Recognition Layer: Local Assessment. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative. [Figure: Input: occurrences L1 and L2, with Extraction Conf = 1.0 and Extraction Conf = 0.3; Output: not recoverable from the transcript]
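To make the local assessment idea concrete, here is a minimal sketch that scores one occurrence of a keyword-entity match inside a single page. It combines the two characteristics this layer seems to address: the extraction confidence (Uncertain) and how tightly the keywords and the entity sit together (Contextual). The linear decay over a fixed `window` and the function name `local_score` are assumptions made for illustration; the talk does not give the actual weighting.

```python
from typing import Iterable


def local_score(keyword_positions: Iterable[int],
                entity_position: int,
                extraction_conf: float,
                window: int = 20) -> float:
    """Toy local assessment of one occurrence within one document.

    Assumption: context strength decays linearly with the token span
    covering all keywords and the entity, reaching 0 at `window`.
    """
    positions = list(keyword_positions) + [entity_position]
    span = max(positions) - min(positions)
    if span >= window:
        return 0.0  # too spread out to count as one context
    context_weight = 1.0 - span / window
    return extraction_conf * context_weight


# Same context, different extraction confidences (1.0 vs 0.3 as on the slide).
confident = local_score([10, 11], 13, extraction_conf=1.0)
doubtful = local_score([10, 11], 13, extraction_conf=0.3)
```

With identical positions, the occurrence with the lower extraction confidence receives a proportionally lower score.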

18 Access Layer: Global Aggregation. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative (this layer: Holistic, Discriminative). [Figure: Input and Output not recoverable from the transcript]
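One plausible reading of global aggregation is a weighted sum of local scores across pages: summing over pages reflects the Holistic characteristic, and a per-page weight reflects the Discriminative one. The sketch below follows that reading. The name `aggregate`, the page weights, and the observation values are all hypothetical; the talk does not state which page-quality measure is used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(observations: Iterable[Tuple[str, str, float]],
              doc_weight: Dict[str, float]) -> Dict[str, float]:
    """Sum local scores per entity instance, weighted by page quality.

    observations: (doc_id, entity_instance, local_score) triples.
    doc_weight: assumed quality weight per document; unknown docs get 0.
    """
    scores: Dict[str, float] = defaultdict(float)
    for doc_id, instance, local in observations:
        scores[instance] += doc_weight.get(doc_id, 0.0) * local
    return dict(scores)


# Made-up observations and weights, for illustration only.
observations = [
    ("d3", "800-202-7575", 0.5),
    ("d7", "800-202-7575", 0.6),
    ("d7", "800-322-9266", 0.2),
]
global_scores = aggregate(observations, {"d3": 0.5, "d7": 0.25})
ranked = sorted(global_scores, key=global_scores.get, reverse=True)
```

An instance seen on several well-weighted pages accumulates a higher score than one seen once on a weaker page.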

19 Validation Layer: Hypothesis Testing. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative. Input: Collection E over D. Output: Virtual Collection E' over D' (obtained by randomizing).
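The validation idea can be sketched by comparing how often a keyword and an entity instance actually co-occur with how often they would co-occur if they were placed independently, which stands in for the randomized virtual collection. The statistic below (observed probability times the log ratio of observed to random) is a simplified assumption, not necessarily the test used in the talk, and all counts are invented to mirror the info@acm.org example.

```python
import math


def association_score(n_docs: int, n_keyword: int,
                      n_entity: int, n_both: int) -> float:
    """Contrast observed co-occurrence with an independence baseline.

    Returns roughly 0 when the pair co-occurs as often as chance predicts
    and a larger positive value when it co-occurs far more often than chance.
    """
    p_observed = n_both / n_docs
    p_random = (n_keyword / n_docs) * (n_entity / n_docs)
    if p_observed == 0.0 or p_random == 0.0:
        return 0.0
    return p_observed * math.log(p_observed / p_random)


# Invented counts: a very common address co-occurs often, but only
# as often as chance predicts; a rare address co-occurs far above chance.
accidental = association_score(1_000_000, 100, 500_000, 50)
genuine = association_score(1_000_000, 100, 60, 40)
```

Under these made-up counts the frequent address gets a score near zero despite more raw co-occurrences, while the rare address scores clearly higher.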

20 EntityRank: The Scoring Function [Figure: formula annotated with three parts: Local Recognition, Global Aggregation, Validation]

21 Query Processing: Sort-merge Join [Figure: document posting lists for the keywords "Amazon", "Customer", "Service" and for the entity #phone are joined by document; #phone postings carry triples such as (13, 800-202-7575, 1.0), (78, 800-322-9266, 1.0), (18, 800-202-7575, 1.0), (42, 851-0400, 0.8); joined matches yield per-instance scores such as 800-202-7575: 0.5 and 800-322-9266: 0.2, which pass through Aggregation and the Hypothesis Test to produce the Result]
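The join step named on this slide can be sketched as a multi-way sort-merge intersection over posting lists sorted by document ID. This is a generic textbook version, not the system's actual code; the function name and the integer document IDs are illustrative, and per-posting payloads (positions, instances, confidences) are left out to keep the merge logic visible.

```python
from typing import List, Sequence


def sort_merge_join(posting_lists: Sequence[Sequence[int]]) -> List[int]:
    """Return doc IDs present in every posting list.

    Each posting list must be sorted in ascending order of doc ID.
    """
    if not posting_lists:
        return []
    cursors = [0] * len(posting_lists)
    matches: List[int] = []
    while all(c < len(pl) for c, pl in zip(cursors, posting_lists)):
        current = [pl[c] for c, pl in zip(cursors, posting_lists)]
        target = max(current)
        if all(doc == target for doc in current):
            matches.append(target)  # every list agrees on this document
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the lists that are behind the largest doc ID.
            cursors = [c + 1 if doc < target else c
                       for c, doc in zip(cursors, current)]
    return matches


# Illustrative posting lists for three keywords and one entity type.
amazon = [1, 3, 6, 7, 9]
customer = [3, 5, 7]
service = [3, 7, 8]
phone = [2, 3, 7]
joined = sort_merge_join([amazon, customer, service, phone])
```

Only documents that contain all keywords and at least one instance of the entity type survive the join, so later scoring touches far fewer pages.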

22 Experiment Setup. Corpus: General crawl of the Web (Aug, 2006), around 2TB with 93M pages. Entities: Phone (8.8M distinct instances), Email (4.6M distinct instances). System: A cluster of 34 machines

23 Comparing EntityRank to the Following Different Approaches. Characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative. Approaches: Naïve, Local, Global, Combine, Without, EntityRank

24 Example Query Results

25 Comparison. Approaches: EntityRank; Naïve approach; Local only; Global only; Combine L by simple summation; L+G without hypothesis testing. Metric: %Satisfied Queries at #Rank. Query Type I: Phone for Top-30 Fortune500 Companies. Query Type II: Email for 51 of 88 SIGMOD07 PC
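The metric on this slide, "%Satisfied Queries at #Rank", is read here as the percentage of queries whose first correct answer appears at or above a given rank. That reading, the function name, and the sample ranks are assumptions; the slide's actual numbers are not in the transcript.

```python
from typing import Optional, Sequence


def satisfied_at_rank(first_correct_ranks: Sequence[Optional[int]],
                      k: int) -> float:
    """Percentage of queries whose first correct answer is within rank k.

    first_correct_ranks: 1-based rank of the first correct result per
    query, or None when no correct result was returned.
    """
    if not first_correct_ranks:
        return 0.0
    hits = sum(1 for rank in first_correct_ranks
               if rank is not None and rank <= k)
    return 100.0 * hits / len(first_correct_ranks)


# Invented ranks for four queries, for illustration only.
sample_ranks = [1, 2, None, 5]
at_1 = satisfied_at_rank(sample_ranks, 1)
at_2 = satisfied_at_rank(sample_ranks, 2)
```

Plotting this value for increasing `k` gives one curve per approach, which is how the six approaches listed above could be compared.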

26 Conclusions: Formulate the entity search problem. Study and define the characteristics of entity search. Conceptual Impression Model and concrete EntityRank framework for ranking entities. An online prototype with real Web corpus.

27 Thanks! Questions?

