A Platform for Personal Information Management and Integration Xin (Luna) Dong and Alon Halevy University of Washington.


Presentation transcript:

Is Your Personal Information a Mine or a Mess? [Figure labels: Intranet, Internet]

Questions Hard to Answer
- Find my SEMEX paper and the presentation slides (maybe in an attachment).

Index Data from Different Sources
- E.g., Google, MSN desktop search
[Figure labels: Intranet, Internet]

Questions Hard to Answer
- Find my SEMEX paper and the presentation slides (maybe in an attachment).
- Find me the people working on SEMEX.
- Find me all the “schema matching” papers by my advisor.
- List me the phone numbers of my coauthors.

Organize Data in a Semantically Meaningful Way [Figure labels: Intranet, Internet, Co-authors]

Questions Hard to Answer
- Find my SEMEX paper and the presentation slides (maybe in an attachment).
- Find me the people working on SEMEX.
- Find me all the “schema matching” papers by my advisor.
- List me the phone numbers of my coauthors.
- Find me the authors of CIDR’05 papers who have sent me emails in the last 2 years.

Integrate Organizational and Public Data with Personal Data [Figure labels: Intranet, Internet]

SEMEX (SEMantic EXplorer) – I. Provide a Logical View of Data
[Figure labels: HTML, Mail & calendar, Papers, Files, Presentations; Person, Paper, Message, Document, Web Page, Presentation, Event; Cites, Sender, Recipients, Organizer, Participants, Author, Homepage, Cached, Softcopy]

SEMEX (SEMantic EXplorer) – II. On-the-fly Data Integration
[Figure labels: Person, Paper, Message, Document, Web Page, Presentation, Event; Cites, Sender, Recipients, Organizer, Participants, Author, Homepage, Cached, Softcopy]

Browse by Associations

[Figure labels: Publication, Bernstein]
- “A survey of approaches to automatic schema matching”
- “Corpus-based schema matching”
- “Database management for peer-to-peer computing: A vision”
- “Matching schemas by learning from others”

Browse by Associations [Figure labels: Bernstein, Publication, Cited by, Citations]
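
Browsing by associations can be pictured as walking a small graph of objects connected by named links. The sketch below is only an illustration under stated assumptions: the `AssociationStore` class, its `add` and `follow` methods, and the sample citation link are invented for this example and are not Semex's actual repository interface.

```python
from collections import defaultdict


class AssociationStore:
    """Toy in-memory store of (subject, association, object) triples (hypothetical)."""

    def __init__(self):
        # association name -> subject -> set of associated objects
        self._forward = defaultdict(lambda: defaultdict(set))

    def add(self, subject, association, obj):
        self._forward[association][subject].add(obj)

    def follow(self, subjects, association):
        """Return every object reachable from any subject via one association."""
        found = set()
        for subject in subjects:
            found |= self._forward[association].get(subject, set())
        return found


SURVEY = "A survey of approaches to automatic schema matching"
CORPUS = "Corpus-based schema matching"

store = AssociationStore()
store.add("Bernstein", "Publication", SURVEY)
store.add("Bernstein", "Publication", CORPUS)
# Assumption: this citation link is made up purely to have something to browse.
store.add(CORPUS, "Citations", SURVEY)

# Step 1: from a person to that person's publications.
papers = store.follow({"Bernstein"}, "Publication")
# Step 2: from those publications to the papers they cite.
cited = store.follow(papers, "Citations")
```

Each browsing step is one `follow` call, so a chain such as Person, then Publication, then Citations is a sequence of set-valued lookups.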

An Ideal PIM is a Magic Wand

Main Goals of Semex
- How can we create an ‘AHA!’ browsing experience?
- How can we leverage the PIM (Personal Information Management) environment and knowledge to increase productivity?

Outline
- Problem definition and project goals
- Technical issues:
  - Semex architecture
  - Reference reconciliation
  - Importing external data sources
  - Domain model personalization
- Overarching PIM Themes

System Architecture
[Figure labels: HTML, Mail & calendar, Papers, Files, Presentations; Person, Paper, Message, Document, Web Page, Presentation, Event; Cites, Sender, Recipients, Organizer, Participants, Author, Homepage, Cached, Softcopy]

System Architecture
[Figure labels: Word, Excel, PPT, PDF, Bibtex, Latex, Contacts; Simple, Extracted, External, Defined; Reference Reconciliation; Data Repository; Domain Model (Objects, Associations)]

System Architecture
[Figure labels: Word, Excel, PPT, PDF, Bibtex, Latex, Contacts; Simple, Extracted, External, Defined; Reference Reconciliation; Data Repository; Domain Model (Objects, Associations); Core; Searcher and browser; Data analyzer; External data importer; Extractor plug-ins; Domain model personalization]

Reference Reconciliation

- A very active area of research in Databases, Data Mining and AI
- Typically assumes matching tuples from a single table
  - Approaches based on pair-wise comparisons
- Harder in our context

Challenges
Article:
  a1 = (“Bounds on the Sample Complexity of Bayesian Learning”, “ ”, {p1, p2, p3}, c1)
  a2 = (“Bounds on the sample complexity of bayesian learning”, “ ”, {p4, p5, p6}, c2)
Venue:
  c1 = (“Computational learning theory”, “1992”, “Austin, Texas”)
  c2 = (“COLT”, “1992”, null)
Person:
  p1 = (“David Haussler”, null)
  p2 = (“Michael Kearns”, null)
  p3 = (“Robert Schapire”, null)
  p4 = (“Haussler, D.”, null)
  p5 = (“Kearns, M. J.”, null)
  p6 = (“Schapire, R.”, null)

Challenges
Article:
  a1 = (“Bounds on the Sample Complexity of Bayesian Learning”, “ ”, {p1, p2, p3}, c1)
  a2 = (“Bounds on the sample complexity of bayesian learning”, “ ”, {p4, p5, p6}, c2)
Venue:
  c1 = (“Computational learning theory”, “1991”, “Austin, Texas”)
  c2 = (“COLT”, “1992”, null)
Person:
  p1 = (“David Haussler”, null)
  p2 = (“Michael Kearns”, null)
  p3 = (“Robert Schapire”, null)
  p4 = (“Haussler, D.”, null)
  p5 = (“Kearns, M. J.”, null)
  p6 = (“Schapire, R.”, null)
  p7 = (“Robert Schapire”, [email address])
  p8 = (null, [email address])
  p9 = (“mike”, [email address])
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes

Intuition: Exploit Context Information
- Exploit context information
  - E.g., name vs. email
  - E.g., contact list
- Propagate similarities between different types of objects
  - E.g., reconciling papers helps reconcile conferences
- Exploit richness of merged references
  - E.g., remember alternate representations of entities
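
The propagation idea can be sketched on the article, venue, and person references from the Challenges slide. This is a deliberately crude illustration, not the reconciliation algorithm used in Semex: `name_key`, `UnionFind`, and `reconcile` are hypothetical helpers, and the matching rule (same title ignoring case, same last name and first initial for every author) is an assumption chosen to keep the example short.

```python
def name_key(name):
    """Reduce 'David Haussler' or 'Haussler, D.' to ('haussler', 'd')."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = parts[0], parts[-1]
    return last.lower(), first[0].lower()


class UnionFind:
    """Tracks which references have been merged into the same entity."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


persons = {
    "p1": "David Haussler", "p2": "Michael Kearns", "p3": "Robert Schapire",
    "p4": "Haussler, D.", "p5": "Kearns, M. J.", "p6": "Schapire, R.",
}
# article id -> (title, author reference ids, venue reference id)
articles = {
    "a1": ("Bounds on the Sample Complexity of Bayesian Learning", ["p1", "p2", "p3"], "c1"),
    "a2": ("Bounds on the sample complexity of bayesian learning", ["p4", "p5", "p6"], "c2"),
}


def reconcile(persons, articles):
    uf = UnionFind()
    ids = sorted(articles)
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            title_l, authors_l, venue_l = articles[left]
            title_r, authors_r, venue_r = articles[right]
            keys_l = {name_key(persons[p]): p for p in authors_l}
            keys_r = {name_key(persons[p]): p for p in authors_r}
            # Context evidence: same title and compatible author sets.
            if title_l.lower() == title_r.lower() and keys_l.keys() == keys_r.keys():
                uf.union(left, right)
                # Propagate: merged articles imply merged venues and authors,
                # even though "COLT" and the full venue name share no words.
                uf.union(venue_l, venue_r)
                for key, p in keys_l.items():
                    uf.union(p, keys_r[key])
    return uf


merged = reconcile(persons, articles)
```

Here the venue references c1 and c2 end up in the same cluster only because the articles that point to them were reconciled first, which is the "reconciling papers helps reconcile conferences" effect in miniature.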

Importing External Data Sources
[Figure labels: Person, Paper, Message, Document, Web Page, Presentation, Event; Cites, Sender, Recipients, Organizer, Participants, Author, Homepage, Cached, Softcopy]

Challenges: On-the-fly Data Integration
- Current data integration research focuses on integrating enterprise data
  - Large-scale, heavy-weight
  - Performed by professional technicians
  - Built to support very frequently occurring queries
- The PIM context presents unique challenges
  - Small-scale, light-weight
  - Performed by non-technical users
  - Serving transient queries (run only once or twice, or each using different pieces of data)

Intuition: Using Past Experiences and Knowledge
- We have a large number of instances
  - E.g., importing DBLP: help from overlapping paper instances [Doan et al, Sigmod’04][Etzioni et al, 1995]
- We know a lot about the domain model
  - Schema matching work [Doan et al, Sigmod’01][Madhavan et al, ICDE’05]
- Others have imported similar (or the same) data sources
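
One way to read "help from overlapping instances" is that a column of an external source can be matched to a domain-model attribute when many of its values are already known locally. The sketch below assumes a simple Jaccard overlap with a fixed threshold; the function names, the attribute names such as `Paper.title`, the threshold value, and the sample rows are all hypothetical and do not reproduce any of the cited matching techniques.

```python
def normalize(values):
    """Lower-case and trim values so trivial formatting differences do not matter."""
    return {v.strip().lower() for v in values}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(external_columns, known_attributes, threshold=0.3):
    """Map each external column to the known attribute it overlaps with most."""
    mapping = {}
    for column, values in external_columns.items():
        scores = {
            attr: jaccard(normalize(values), normalize(known))
            for attr, known in known_attributes.items()
        }
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            mapping[column] = best
    return mapping


# Values already present in the personal repository (illustrative data).
known_attributes = {
    "Paper.title": [
        "Corpus-based schema matching",
        "Matching schemas by learning from others",
        "A survey of approaches to automatic schema matching",
    ],
    "Person.name": ["Alon Halevy", "AnHai Doan", "Luna Dong"],
}
# Columns of an external source whose schema is not yet understood.
external_columns = {
    "col_a": ["corpus-based schema matching",
              "A Survey of Approaches to Automatic Schema Matching",
              "Some other paper"],
    "col_b": ["alon halevy", "anhai doan"],
    "col_c": ["1992", "2005"],
}

mapping = match_columns(external_columns, known_attributes)
```

Columns with no overlapping instances, such as `col_c` here, are left unmapped rather than forced onto the nearest attribute, which suits a setting where the person doing the import is not a specialist.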

The Domain Model
- The Semex core provides very basic classes and associations
- Users will need to personalize further
[Figure labels: Person, Paper, Message, Document, Web Page, Presentation, Event; Sender, Recipients, Organizer, Participants, Author, Homepage, Cached, Softcopy, cite]

Challenges
- Easy-to-use for non-technical users
  - Suggest appropriate modifications
- Make the fragments fit together
- Guarantee high efficiency of updating and querying

Intuition: Suggest Changes from Past Experiences
- Strategy: mix and match from small components
  - May come with extractor plug-ins
  - A by-product of importing external data sources
  - Learn from other people’s domain models

PERSONAL INFORMATION MANAGEMENT
- It is PERSONAL data!
  - What is the right granularity for modeling personal data?
- Manipulate any kind of INFORMATION
  - How to combine structured and unstructured data?
- Data and “schema” evolve over time
  - How to do life-long data management?
- Bring the benefits of data MANAGEMENT to users
  - How to build a system supporting users in their own habitat?

Related Work
Personal Information Management Systems
- Indexing
  - Stuff I’ve Seen (MSN Desktop Search) [Dumais et al., 2003]
  - Google Desktop Search [2004]
- Richer relationships
  - LifeStreams [Freeman and Gelernter, 1996]
  - Placeless Documents [Dourish et al., 2000]
  - MyLifeBits [Gemmell et al., 2002]
- Objects and Associations
  - Haystack [Karger et al., 2005]

Summary
- 60 years have passed since the personal Memex was envisioned
  - It’s time to get serious
  - Great challenges for data management
- The goal of Semex
  - Set up a platform for applications that increase users’ productivity
  - Bring benefits of data management to ordinary users
- There is a lot of technology to build on. It is not a pipe dream!

A Platform for Personal Information Management and Integration, 2005
Xin (Luna) Dong and Alon Halevy, University of Washington
data.cs.washington.edu/semex