Data Integration: The Teenage Years Alon Halevy (Google) Anand Rajaraman (Kosmix) Joann Ordille (Avaya) VLDB 2006.

Agenda A few perspectives on the last 10 years –Technical, commercial Perspectives from our personal paths Wild speculations about the future This is not a survey on data integration (See the paper in the proceedings for another non-survey)

Acknowledgements Other members of the Information Manifold Project: –Jaewoo Kang (NCSU, Korea Univ.) –Divesh Srivastava (AT&T Labs) –Shuky Sagiv (Hebrew U.) –Tom Kirk

Acknowledgements To the SIGMOD 1996 program committee, for rejecting the earlier version of the paper.

Timeline: 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006

Data Integration [Figure labels: Legacy Databases; Services and Applications; Enterprise Databases; schema entities Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment]

The Information Manifold Goal: integrate data from multiple sources on the web: Find the Woody Allen movies playing in my area, and their reviews Need to describe the data sources: –Contents, constraints, access patterns

[Architecture figure labels: Mediated Schema; Semantic mappings; wrapper; query reformulation; optimization & execution. The figure is divided into Design time and Run time.]

Semantic Mappings [a.k.a. Source Descriptions]
Source schemas:
–Books(Title, ISBN, Price, DiscountPrice, Edition)
–CDs(Album, ASIN, Price, DiscountPrice, Studio)
–BookCategories(ISBN, Category)
–CDCategories(ASIN, Category)
–Artists(ASIN, ArtistName, GroupName)
–Authors(ISBN, FirstName, LastName)
Mediated Schema:
–CD: ASIN, Title, Genre, …
–Artist: ASIN, name, …
Sources and mediated schema are related by logic.

Global-as-View (GAV)
Sources: R1, R2, R3, R4, R5
Mediated Schema:
–CD: ASIN, Title, Genre, …
–Artist: ASIN, name, …
Mapping:
CD(A,T,G) :- R1(A,T,G)
CD(A,T,G) :- R2(A,T), R3(T,G)
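Under GAV each mediated relation is defined as a view over the sources, so a query over CD can be answered by unfolding the two rules above. A minimal Python sketch, assuming small in-memory relations; the function name `cd_from_sources` and the sample tuples are invented for illustration and are not from the talk:

```python
def cd_from_sources(r1, r2, r3):
    """Evaluate the two GAV rules for CD(A, T, G) over in-memory source relations."""
    # Rule 1: CD(A,T,G) :- R1(A,T,G)
    cd = set(r1)
    # Rule 2: CD(A,T,G) :- R2(A,T), R3(T,G)   (join on the title T)
    genres_by_title = {}
    for title, genre in r3:
        genres_by_title.setdefault(title, set()).add(genre)
    for asin, title in r2:
        for genre in genres_by_title.get(title, ()):
            cd.add((asin, title, genre))
    return cd


# Hypothetical source contents (placeholders, not real catalog data).
r1 = {("A1", "T1", "Jazz")}
r2 = {("A2", "T2")}
r3 = {("T2", "Rock")}
cd = cd_from_sources(r1, r2, r3)
```

With these placeholder tuples, CD receives one tuple from each rule. Adding a new source under GAV means revisiting the rules for every mediated relation it contributes to.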

Local-as-View (LAV)
Sources: R1, R2, R3, R4, R5
Mediated Schema:
–CD: ASIN, Title, Genre, Year
–Artist: ASIN, Name, …
Mapping:
R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970
R2(A,T) :- CD(A,T,"French",Y)
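In LAV the direction is reversed: each source is described as a view over the mediated schema. The sketch below evaluates the two view definitions over a hypothetical mediated instance, to show which tuples each source may hold (at most these, if the sources are assumed incomplete). The data and function names are invented for illustration. A real system never has the mediated instance and must instead rewrite queries in terms of R1 and R2.

```python
def r1_view(cd, artist):
    """R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970"""
    asins_with_artist = {asin for asin, _name in artist}
    return {(a, t, g) for a, t, g, y in cd if y < 1970 and a in asins_with_artist}


def r2_view(cd):
    """R2(A,T) :- CD(A,T,"French",Y)"""
    return {(a, t) for a, t, g, _y in cd if g == "French"}


# Hypothetical mediated instance (placeholder values only).
cd = {
    ("A1", "T1", "Jazz", 1959),
    ("A2", "T2", "French", 1965),
    ("A3", "T3", "French", 1988),
}
artist = {("A1", "N1"), ("A2", "N2"), ("A3", "N3")}

r1 = r1_view(cd, artist)
r2 = r2_view(cd)
```

Here the 1988 French CD appears only in R2, and the 1959 jazz CD only in R1, which is why a query over CD generally has to combine several sources.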

Query Answering in LAV = Answering queries using views. Given a set of views V1, …, Vn and a query Q, can we answer Q using only the answers to V1, …, Vn?

AQUV (I) [Larson et al., 85 & 87], [Tsatalos et al., 94], [Chaudhuri et al., 95] Focus on AQUV for: –Query optimization –Supporting physical data independence Every commercial DBMS supports AQUV.

AQUV (II) AQUV for data integration: –Find maximally contained rewriting –Not necessarily equivalent rewriting Algorithms: –Bucket algorithm [LRO, 96] –Inverse rules [Duschka, 97] –MiniCon [Pottinger and Halevy, 2000] Views and security: [Miklau and Suciu, 04] Survey: Halevy, VLDB Journal, 2001

Some Subsequent Results Semantics of data integration: –Abiteboul & Duschka, 1998: certain answers –Open vs. closed world assumption CWA is bad complexity news! Survey: Lenzerini, PODS 2002

Certain Answers
Mediated schema: Route (Origin, Destination)
Source 1: Origins = {SF, NY}
Source 2: Destinations = {Seattle, Seoul}
Query: Route (SF, Seattle)?
Possible databases:
–{(SF, Seattle), (NY, Seoul)}
–{(SF, Seoul), (NY, Seattle)}
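The example can be checked by brute force: enumerate every Route instance consistent with the two sources and keep only the tuples common to all of them. A minimal sketch, assuming the sources are exact projections of Route (a closed-world reading) and that only the values the sources mention can occur; the function names are illustrative:

```python
from itertools import combinations, product


def possible_databases(origins, destinations):
    """Yield every Route instance whose projections equal the two sources."""
    candidates = sorted(product(sorted(origins), sorted(destinations)))
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            route = set(subset)
            if ({o for o, _ in route} == origins
                    and {d for _, d in route} == destinations):
                yield route


def certain_answers(databases):
    """A tuple is a certain answer only if it appears in every possible database."""
    databases = list(databases)
    return set.intersection(*databases) if databases else set()


origins = {"SF", "NY"}
destinations = {"Seattle", "Seoul"}
databases = list(possible_databases(origins, destinations))
certain = certain_answers(databases)
```

Both databases listed on the slide are among the enumerated ones and share no tuple, so Route (SF, Seattle) is possible but not certain.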

Some Subsequent Results Limitations due to binding patterns –Input title, get book info [Rajaraman et al., 95] Additional query processing capabilities –Form applies multiple predicates Disjunction, negation in sources. Ordering sources, probabilistic mappings –[Florescu et al., 97, Doan et al., Dong et al.] GLAV [Millstein et al., 99] Survey: Lenzerini, PODS 2002

A word on Description Logics Selecting relevant sources = reasoning. Description logics to the rescue: –[Catarci and Lenzerini, 93] Information Manifold –Combined the Classic DL with Datalog (CARIN) –See AAAI-96 (not SIGMOD) Brought DL and DB closer together. –A very active area of research today.

XML and Semi-structured Data Tsimmis: semi-structured data for integration. XML: whetted the integration appetites –We have the syntax –Now just solve the silly semantics problems –Don’t bother: we’ll all standardize on DTDs. XML will have a significant role in the data integration industry and research.

Back in the Lab… Two observations: –Who’s going to write all these LAV/GAV formulas? –This was the bottleneck. Once we have mappings, how can we execute queries? –Traditional plan-then-execute doesn’t work.

Semantic Mappings
Inventory Database A:
–BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
–Books(Title, ISBN, Price, DiscountPrice, Edition)
–CDs(Album, ASIN, Price, DiscountPrice, Studio)
–BookCategories(ISBN, Category)
–CDCategories(ASIN, Category)
–Artists(ASIN, ArtistName, GroupName)
–Authors(ISBN, FirstName, LastName)
“Standards are great, but there are too many of them.”

Techniques for Schema Mapping [Survey by Rahm and Bernstein, VLDBJ 2001] Compare schema elements based on: –Names (or n-grams) –Data types and instances –Text descriptions, integrity constraints Combine multiple techniques: –[Momis, Cupid, LSD, Coma] Create mappings from matches –[Clio @ IBM + Miller]
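As a rough illustration of the name-based comparison, the sketch below scores attribute names by Jaccard similarity over character trigrams and keeps the best match above a threshold. The threshold of 0.3, the `#` padding, and the function names are assumptions for this example; the attribute names are the ones used in the real-estate example in this talk.

```python
def ngrams(name, n=3):
    """Set of character n-grams of a lowercased name, padded with '#'."""
    padded = "#" * (n - 1) + name.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two names."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def match_by_name(source_attrs, mediated_attrs, threshold=0.3):
    """Map each source attribute to its most similar mediated attribute, or None."""
    matches = {}
    for attr in source_attrs:
        best = max(mediated_attrs, key=lambda m: similarity(attr, m))
        matches[attr] = best if similarity(attr, best) >= threshold else None
    return matches


mediated = ["address", "price", "agent-phone", "description"]
source = ["location", "listed-price", "phone", "comments"]
matches = match_by_name(source, mediated)
```

Names alone find the price and phone correspondences but not location/address or comments/description, which is one reason to combine name evidence with data types, instances, and descriptions.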

A Machine Learning Approach [Doan et al., 2001, ACM Distinguished Dissertation 2003] Many mapping tasks are repetitive Learn from previous experience: –Build a classifier for every element of the mediated schema. –Many kinds of cues → meta-strategy learning [Figure labels: Mediated schema; Given matches; Predict new ones]

Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data from realestate.com:
–location: Miami, FL; Boston, MA; ...
–listed-price: $250,000; $110,000; ...
–phone: (305) 729 0831; (617) 253 1429; ...
–comments: Fantastic house; Great location; ...
Data from homes.com:
–price: $550,000; $320,000; ...
–contact-phone: (278) 345 7215; (617) 335 2315; ...
–extra-info: Beautiful yard; Great beach; ...
Learned hypotheses:
–If “phone” occurs in the name => agent-phone
–If “fantastic” & “great” occur frequently in data values => description
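The "learn from given matches, predict new ones" idea can be sketched with a toy instance-based learner. It is trained on the realestate.com columns, whose matches to the mediated schema are given, and then applied to the homes.com columns. This is an illustrative stand-in and not the learner from the cited work: the feature choice (word tokens plus a character "shape") and all function names are assumptions.

```python
import re
from collections import Counter, defaultdict


def features(value):
    """Word tokens plus a character 'shape' (digits -> 9, letters -> a)."""
    text = value.lower()
    words = re.findall(r"[a-z]+", text)
    shape = re.sub(r"[a-z]", "a", re.sub(r"[0-9]", "9", text))
    return words + [shape]


def train(labeled_columns):
    """labeled_columns: iterable of (mediated_element, list_of_values)."""
    model = defaultdict(Counter)
    for label, values in labeled_columns:
        for value in values:
            model[label].update(features(value))
    return model


def predict(model, values):
    """Return the mediated element whose training features overlap most, or None."""
    scores = {
        label: sum(counts[f] for v in values for f in features(v))
        for label, counts in model.items()
    }
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


# Given matches: realestate.com columns labeled with mediated-schema elements.
model = train([
    ("address", ["Miami, FL", "Boston, MA"]),
    ("price", ["$250,000", "$110,000"]),
    ("agent-phone", ["(305) 729 0831", "(617) 253 1429"]),
    ("description", ["Fantastic house", "Great location"]),
])

# Predict matches for the homes.com columns.
predicted = {
    "price": predict(model, ["$550,000", "$320,000"]),
    "contact-phone": predict(model, ["(278) 345 7215", "(617) 335 2315"]),
    "extra-info": predict(model, ["Beautiful yard", "Great beach"]),
}
```

The extra-info column is matched to description only because of the shared word "great", which shows how fragile a single cue is and why several kinds of cues are combined.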

Reference Reconciliation To Join or not to Join? Many ways to refer to the same object in the world: –“IBM”, “International Business Machines” –Alon Levy, Alon Halevy Automated methods are a necessity –Can’t go through all the data manually Very active area in ML, KDD, DB, UAI, …
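A small sketch of an automated check that covers the two examples above: an acronym test for the company name and a string-similarity test for the person name. The 0.85 threshold and the function names are assumptions; production systems combine many more signals than this.

```python
from difflib import SequenceMatcher


def acronym(name):
    """Upper-cased initials of the words in a name."""
    return "".join(word[0] for word in name.split()).upper()


def likely_same(a, b, threshold=0.85):
    """Heuristic: same object if one name is the other's acronym or the strings are close."""
    if a.upper() == acronym(b) or b.upper() == acronym(a):
        return True
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


same_company = likely_same("IBM", "International Business Machines")
same_person = likely_same("Alon Levy", "Alon Halevy")
different = likely_same("IBM", "Avaya")
```

Heuristics like these produce false positives on short names, so their output is better treated as candidate pairs than as final decisions.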

Query Processing To Plan or to Execute? In addition to distributed query processing issues: –Few statistics, if any. –Network behavior issues: latency, burstiness,… –Garlic @IBM “Adaptive query processing”: –Stonebraker saw it coming in Ingres. –Revivals by Graefe (1993) and DeWitt (1998). –Query scrambling [Urhan & Franklin] –Eddies [Avnur & Hellerstein] –Convergent query processing [Ives et al.]
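To make the contrast with plan-then-execute concrete, the sketch below picks the order of selection predicates per tuple, based on pass rates observed so far, instead of fixing the order before execution. It is loosely in the spirit of the adaptive techniques listed above and does not reproduce the routing policy of any cited system; the function name and the toy predicates are assumptions.

```python
def adaptive_filter(tuples, predicates):
    """Apply all predicates to each tuple, reordering them by observed pass rate."""
    seen = [0] * len(predicates)    # tuples each predicate has evaluated
    passed = [0] * len(predicates)  # tuples each predicate has let through
    output = []
    for row in tuples:
        # Try the most selective predicate first (lowest observed pass rate);
        # predicates with no statistics yet get rate 0.0 so they are tried early.
        order = sorted(
            range(len(predicates)),
            key=lambda i: passed[i] / seen[i] if seen[i] else 0.0,
        )
        keep = True
        for i in order:
            seen[i] += 1
            if predicates[i](row):
                passed[i] += 1
            else:
                keep = False
                break
        if keep:
            output.append(row)
    return output


rows = list(range(1, 101))
predicates = [lambda x: x % 2 == 0, lambda x: x > 90]
result = adaptive_filter(rows, predicates)
```

The result is the same as for any fixed order, since the predicates are conjunctive; only the amount of work per tuple changes as statistics accumulate.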

Commercialization Late 90’s – anything goes. Want money from VC’s? –Say “XML” 3 times loud and clear. Academia at the forefront: –Nimble (UW), Cohera (Berkeley), Enosys (UCSD),… Big companies took notice –Some faster than others

Commercialization Retrospective [See Panel-of-Experts, SIGMOD 05] Uphill battle vs. the warehousing folks –Virtual integration was more “pay-as-you-go” Another battle with the EAI folks –Should really be a symbiosis there. Go vertical or horizontal? –Obvious: go vertical if you can find the right one. The technology worked –But it’s all in the timing…

After $30M… [Nimble architecture figure labels: User Applications; Front-End (Lens™, FileInfoBrowser™, XML Lens Builder™); Software Developers Kit (NIMBLE™ APIs); XML Query; Management Tools (Integration Builder, Security Tools, Data Administrator, Concordance Developer); Integration Layer (Nimble Integration Engine™: Compiler, Executor, Metadata Server, Cache); Common XML View over Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages]

[Timeline figure: 1995 to 2006, labeled NASDAQ]

So… Back in the Lab Model management Peer data management systems Data exchange

Model Management [Bernstein et al.] Generic infrastructure for managing schemas and mappings: –Manipulate models and mappings as bulk objects –Operators to create & compose mappings, merge & diff models –Short operator scripts can solve schema integration, schema evolution, reverse engineering, etc. First challenge: semantics of operators.

Peer Data Management Systems [Figure: peers Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), CiteSeer, UW (Waterloo); query labels Q and Q1 to Q6; mappings: LAV, GLAV]

PDMS-Related Projects Piazza (Washington) Hyperion (Toronto) PeerDB (Singapore) Local relational models (Trento, Toronto) Active XML (INRIA) Edutella (Hannover, Germany) Semantic Gossiping (EPFL Lausanne) Raccoon (UC Irvine) Orchestra (U. Penn)

PDMS Challenges
–Semantics: careful about cycles
–Optimization: compose mappings, prune paths
–Manage networks: consistency, quality, caching
[Figure: peers Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), CiteSeer, UW (Waterloo)]

Data Exchange Key question: given an instance of S and a mapping, create an instance for T. [Fagin, Kolaitis, Popa & Tan] [Figure: source schema S, target schema T, mapping M]
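A minimal sketch of the key question, assuming a single hypothetical mapping R2(A,T) → ∃G CD(A,T,G) from a source relation R2 to a target relation CD (the mapping and function name are invented for illustration). The target needs a genre value that the source does not supply, so each tuple receives a fresh labeled null as a placeholder.

```python
from itertools import count


def exchange(r2_rows):
    """Materialize CD(A,T,G) from R2(A,T), inventing one labeled null per tuple."""
    fresh = count(1)
    target = set()
    for asin, title in sorted(r2_rows):
        # "_N1", "_N2", ... stand for unknown genre values, not real data.
        target.add((asin, title, f"_N{next(fresh)}"))
    return target


source = {("A1", "T1"), ("A2", "T2")}
target = exchange(source)
```

Each null is distinct because nothing in the mapping says that two CDs share a genre.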

[Timeline figure: 1995 to 2006, then "?"]

2006 Status Report [The People Angle] Joann @ Avaya –Integrating communications into business processes Anand @ Kosmix – Creating a new kind of search company Alon @ Google –Working for Joann’s old boss –Deep web evangelist –Pondering data management for the masses

2006 Status Report [Enterprise Angle] Enterprise Information Integration is established: – IBM, BEA, Oracle, MetaMatrix, Composite, Actuate, … Impact on design tools: –IBM Rational Data Architect –ADO.NET v. 3

Forrester Says… "Enterprises are facing the growing challenges of using disparate sources of data managed by different applications, including problems with data integration, security, performance, availability and quality.... New technology is emerging that Forrester has coined "information fabric," a term defined as a virtualized data layer that integrates heterogeneous data and content repositories in real time.... The potential benefits of this technology are so great that enterprises should develop a strategy to leverage information fabric technology as it becomes more widely available."

2006 Status Report [Web Angle] Vertical search engines: one domain At scale: need even better source descriptions –deep web can be surfaced Terminology: Data integration = mashups!

Wikipedia: A mashup is a website or Web 2.0 application that uses content from more than one source to create a completely new service. This is akin to transclusion.

Looking Ahead Data management: from the enterprise to the masses Challenges: –Databases of everything –Need support for collaboration –Help people structure their data –Pay-as-you-go data management

Pay-as-you-go Data Management [Figure: Benefit versus Investment (time, cost), with curves for Dataspaces and Data integration solutions. Artist: Mike Franklin] Dataspaces: Franklin, Halevy, Maier [see PODS 2006]

Big Carrots

Reusing Human Attention
Principle:
–User action = statement of semantic relationship
–Leverage actions to infer other semantic relationships
Examples:
–Providing a semantic mapping: infer other mappings
–Writing a query: infer content of sources, relationships between sources
–Creating a “digital workspace”: infer “relatedness” of documents/sources; infer co-reference between objects in the dataspace
–Annotating, cutting & pasting, browsing among docs

Conclusion We’ve done extremely well as a community! Next challenge: data management and integration tools for the masses
