Presentation is loading. Please wait.

Presentation is loading. Please wait.

EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Similar presentations


Presentation on theme: "EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam."— Presentation transcript:

1 EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam PhD Student CSE Department, UTA

2 Motivating Scenario Customer service phone number of Amazon?

3 Search on Amazon?

4 Search on Google?

5 Many many Similar Cases The email of Luis Gravano? What profs are doing databases at UIUC? The papers and presentations of ICDE 2007? Due date of SIGMOD 2008? Sale price of “Canon PowerShot A400”? “Hamlet” books available at bookstores? Often times, we are looking for data entities, e.g. emails, dates, prices, etc, not pages.

6 What you search is not what you want.

7 From pages to entities Traditional SearchEntity Search Keywords Entities Results Support

8 Concretely, what is meant by Entity Search?

9 9 Entity Search Problem:   Given: Entity Collection over Document Collection  Input: where is a tuple pattern,, and is a keyword e.g. ow(David DeWitt #phone #email )  Output: Ranked list of sorted by Score(q(t)), the query score of t   Given: Entity Collection over Document Collection  Input: where is a tuple pattern,, and is a keyword e.g. ow(David DeWitt #phone #email )  Output: Ranked list of sorted by Score(q(t)), the query score of t Given: Input: Keywords & Entities (optionally with a pattern) E.g. Amazon Customer Service #phone Output: Ranked Entity Tuples …… 0.60 0.80 0.90

10 10 How to rank Entities? Challenge: Challenge:

11 Characteristics I: Contextual -Utilize Entities’ Surrounding Context Characteristics I: Contextual -Utilize Entities’ Surrounding Context Content Context

12 Characteristics II: Uncertain -Extractions are non”prefect” Characteristics II: Uncertain -Extractions are non”prefect”

13 Characteristics III: Holistic -Many evidences from multiple sources Characteristics III: Holistic -Many evidences from multiple sources

14 Characteristics IV: Discriminative - Web Pages are of Varying Quality Characteristics IV: Discriminative - Web Pages are of Varying Quality

15 Characteristics V: Associative -Tell True Associations from Accidental Characteristics V: Associative -Tell True Associations from Accidental Example: Finding Prof. Luis Gravano’s Email Observation: info@acm.org appears very frequently with keywords “Luis”, “Gravano”info@acm.org However, such association is only accidental as info@acm.org appears on many pages. info@acm.org

16 EntityRank: The Impression Model EntityRank: The Impression Model Tireless Observer......... ?? Access Layer: Global Aggregation Recognition Layer: Local Assessment Validation Layer: Hypothesis Testing …… 0.60 0.80 0.90

17 17 Recognition Layer: Local Assessment Recognition Layer: Local Assessment C ontextual U ncertain H olistic D iscriminative A ssociative Input: L1L1 L2L2 Output:

18 18 Access Layer: Global Aggregation Access Layer: Global Aggregation C ontextual U ncertain H olistic D iscriminative A ssociative Holistic Discriminative Output: Input:

19 19 Validation Layer: Hypothesis Testing Validation Layer: Hypothesis Testing C ontextual U ncertain H olistic D iscriminative A ssociative Input: Collection E over D Output: Virtual Collection E’ over D’ randomize

20 EntityRank: The Scoring Function EntityRank: The Scoring Function Local RecognitionGlobal Aggregation Validation

21 21  Sort-merge Join Query Processing 7, 33d9d9 3d7d7 10d6d6 5d3d3 8, 25d1d1 Doc Posting Doc 8, 24d7d7 66d5d5 11d3d3 Posting 44d8d8 9d7d7 12d3d3 Doc Posting AmazonCustomer Service (13,800-202-7575,1.0) (78,800-322-9266,1.0) d7d7 (18,800-202-7575,1.0)d3d3 (42,851-0400,0.8)d2d2 Doc Posting #phone Aggregation 800-202-7575: p1 800-322-9266: p3 800-202-7575: p2 800-322-9266: p5 800-202-7575: p4 Hypothesis Test Result

22 22 Experiment Setup Experiment Setup Corpus: General crawl of the Web(Aug, 2006), around 2TB with 93M pages. Entities: Phone (8.8M distinctive instances) Email (4.6M distinctive instances) System: A cluster of 34 machines

23 23 Comparing EntityRank to the Following Different Approaches C ontextual U ncertain H olistic D iscriminative A ssociative N aïve L ocal G lobal C ombine W ithout E ntity R ank

24 Online Demo.

25 25 Example Query Results

26 26 Conclusions Formulate the entity search problem Study and define the characteristics of entity search Conceptual Impression Model and concrete EntityRank framework for ranking entities An online prototype with real Web corpus

27 Thank You ! Thank You ! Questions?

28 Reference EntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. In Proceedings of the 33rd Very Large Data Bases Conference (VLDB 2007), pages 387-398, Vienna, Austria, September 2007 http://www-forward.cs.uiuc.edu/talks/2007/entityrank- vldb07-cyc-sep07.ppt


Download ppt "EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam."

Similar presentations


Ads by Google