Data and Information Systems Laboratory University of Illinois Urbana-Champaign CS 512 Jan 18, 2010 WinaCS Project Web Entity Extraction and Mapping Discovering.


1 Data and Information Systems Laboratory, University of Illinois Urbana-Champaign. CS 512, Jan 18, 2010. WinaCS Project: Web Entity Extraction and Mapping. Discovering and Propagating Context. Tim Weninger, Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL

2 Past, Present, Future
Past: Entity search and retrieval is one of the dreams of the Web (TBL)
Present: Ranking and retrieval, a bi-directional approach
  1) Information Networks
  2) Web mining and Information Extraction
     a) List Finding
     b) Entity-page Discovery
     c) Entity-page Mapping
Future: InfoBase Project, information extraction via Schema Discovery

3 Finding lists on the Web is Hard! (KDD Explorations Dec. 2010)
1. Google Sets
2. WebTables
3. Mining Data Records (MDR)
4. World Wide Tables (WWT)
5. Tag Path Clustering
6. RoadRunner
7. SEAL
8. Visual List Extraction
9. VIsion-based Page Segmentation (VIPS)
10. Visualized Element Nodes Table extraction (VENTex)

4 Why is finding lists important?
Jiawei Han, ChengXiang Zhai, Kevin Chang, Dan Roth, Marianne Winslett
Jiawei Han, ChengXiang Zhai, Kevin Chang, Dan Roth, Marianne Winslett
Sarita Adve, Tarek Abdelzaher, Vikram Adve, Gul Agha, …
Charu Aggarwal, Deepayan Chakrabarti, Ed Chang, Kevin Chang, Olivier Chapelle, Chris Clifton, Jiawei Han, …
Correction, Inference, Disambiguation, Recommendation, etc.

5 Our list-finding algorithm (Accepted: WWW 2011)

6 List Finding for Entity Page Discovery

7 Growing Parallel Paths (Accepted: WWW 2011) Result:

8 Mapping Pages to Records (CIKM’10)

9 Mapping Pages to Records (CIKM’10)
Example
A_{p1} = {People, Faculty, Dan Roth, Personal Site}
A_{p2} = {Research, Data Mining, Dan Roth, Personal Site}
Bag of Anchors: {Research: 1, People: 1, Faculty: 1, Data Mining: 1, Dan Roth: 2, Personal Site: 2}
Sorted Bag of Anchors: A_{u;v1} = {Dan Roth: 2/2 = 1, Research: 1/2 = 0.5, Data Mining: 1/2 = 0.5, Personal Site: 2/5 = 0.4, People: 1/3 = 0.33, Faculty: 1/3 = 0.33}
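The bag-of-anchors example on this slide can be reproduced with a short Python sketch. The two anchor paths and the denominators (2, 2, 2, 5, 3, 3) are copied from the slide; the slide does not say how the denominators are obtained, so treating them as a per-anchor normalizing count is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Anchor-text paths leading to the same page, copied from the slide.
anchor_paths = [
    ["People", "Faculty", "Dan Roth", "Personal Site"],
    ["Research", "Data Mining", "Dan Roth", "Personal Site"],
]

# Denominators as printed on the slide. Their definition is not given
# there; using them as per-anchor normalizers is an assumption.
normalizers = {
    "Dan Roth": 2, "Research": 2, "Data Mining": 2,
    "Personal Site": 5, "People": 3, "Faculty": 3,
}


def bag_of_anchors(paths):
    """Count how often each anchor text occurs across all paths."""
    bag = Counter()
    for path in paths:
        bag.update(path)
    return bag


def sorted_bag(bag, norm):
    """Divide each count by its normalizer and sort by descending score."""
    scored = [(anchor, count / norm[anchor]) for anchor, count in bag.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


bag = bag_of_anchors(anchor_paths)
ranking = sorted_bag(bag, normalizers)
for anchor, score in ranking:
    print(f"{anchor}: {score:.2f}")
```

Under these assumptions the anchor that is both frequent in the paths and rare elsewhere ("Dan Roth") rises to the top, while a generic anchor such as "Personal Site" is pushed down despite having the same raw count.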

10 CSMap: locations of the top 25 computer science departments, automatically generated by extracting and ranking 5-digit numbers from entity Web pages.
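A minimal Python sketch of the "extract and rank 5-digit numbers" idea follows. The slide does not say how candidates are ranked, so ranking by raw frequency across an entity's pages is an assumption; the function name and the sample page texts are hypothetical.

```python
import re
from collections import Counter

# Exactly five digits, not embedded in a longer run of digits.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")


def rank_five_digit_numbers(pages):
    """Return (number, count) pairs, most frequent first.

    Frequency ranking is an assumed stand-in for whatever ranking
    the original system used.
    """
    counts = Counter()
    for text in pages:
        counts.update(FIVE_DIGITS.findall(text))
    return counts.most_common()


# Hypothetical page texts for one entity (a department).
pages = [
    "Department of Computer Science, Example Hall, Urbana, IL 61801",
    "Contact us: Urbana, IL 61801. Room 12345. Grant no. 1234567.",
]

ranked = rank_five_digit_numbers(pages)
print(ranked)
```

The top-ranked number is then taken as the candidate ZIP code for the entity; numbers with more than five digits are skipped by the lookaround guards in the pattern.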

11 Next Steps: The hard part!
Infer categories/schemas from a set of Web pages
Example: What do these entities have in common?
Name, Address, ZipCode, Publications, Collaborators, Organizations
How can we infer this schema? Wikipedia?
How can we populate it?

12 Idea! Propagating schemas

13 Next Steps: The hardest part!
Columns: Name, Address, ZipCode, Organizations, Collaborators, Publications
Jiawei Han: A, 1, FK
Tarek Abdelzaher: B, 2, FK
Gerald DeJong: C, 3, FK
Michael Heath: D, 4, FK
Given / Inferred
This can be modeled as a heterogeneous information network.
Thus, ranking and clustering are possible.
So are semantic search, keyword search and typal search.
Cube operations are possible.
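One way to picture the heterogeneous information network mentioned on this slide is as typed nodes joined by labeled edges. The Python sketch below is only an illustration under stated assumptions: it uses a subset of the slide's rows with their placeholder values (A, C, D and 1, 3, 4), and the node types, relation names, and function names are hypothetical rather than taken from WinaCS.

```python
from collections import defaultdict

# A subset of the slide's rows: (Name, Address, ZipCode).
# The letters and digits are the slide's own placeholder values.
rows = [
    ("Jiawei Han", "A", "1"),
    ("Gerald DeJong", "C", "3"),
    ("Michael Heath", "D", "4"),
]


def build_network(table_rows):
    """Map each typed node to its set of (relation, neighbor) edges."""
    edges = defaultdict(set)
    for name, address, zipcode in table_rows:
        person = ("Person", name)
        addr = ("Address", address)
        zc = ("ZipCode", zipcode)
        # Store both directions so the network can be walked either way.
        edges[person].add(("has_address", addr))
        edges[person].add(("has_zipcode", zc))
        edges[addr].add(("address_of", person))
        edges[zc].add(("zipcode_of", person))
    return edges


def nodes_by_type(edges):
    """Group node values by node type."""
    grouped = defaultdict(set)
    for node_type, value in edges:
        grouped[node_type].add(value)
    return grouped


network = build_network(rows)
types = nodes_by_type(network)
print(sorted(types["Person"]))
```

Because every node carries its type, operations such as ranking or clustering can be restricted to one type (for example, only Person nodes) while still following edges through the other types.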

14 WinaCS: An information-network-based Web search engine

15 Questions? Challenges?

