
1 Integrating Structured & Unstructured Data

2 Goals
• Identify some applications that have a crucial requirement for integrating unstructured and structured data
• Identify key technical issues in integrating unstructured and structured data
• Identify potential approaches

3 Definitions (simplified)
• Structured object:
  – <{<attribute, value>}>
• Unstructured object:
  – <{word}>
• Semi-structured object:
  – <{<attribute, value>}, {word}>
  – <attribute, value> pairs may be
      Given (e.g. author, title, etc.)
      Extracted (e.g. date, zipcode, etc.)
      Inferred (e.g. topic)
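A minimal sketch of these definitions, assuming a semi-structured object is a set of attribute-value pairs plus a bag of words, and that each pair is tagged as given, extracted, or inferred. The class name, the `from_text` helper, the sample text, and the five-digit zipcode pattern are all hypothetical choices for illustration; the slide does not prescribe a representation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SemiStructuredObject:
    """Attribute-value pairs plus a bag of words (an assumed reading of the slide)."""
    pairs: dict = field(default_factory=dict)   # attribute -> value
    origin: dict = field(default_factory=dict)  # attribute -> "given" | "extracted" | "inferred"
    words: list = field(default_factory=list)   # the unstructured part

    def add_pair(self, attribute, value, origin):
        self.pairs[attribute] = value
        self.origin[attribute] = origin


def from_text(text, given):
    """Build an object from raw text plus attributes supplied with it."""
    obj = SemiStructuredObject(words=re.findall(r"[a-z0-9]+", text.lower()))
    for attribute, value in given.items():
        obj.add_pair(attribute, value, "given")
    # Extracted pair: a hypothetical rule that treats any 5-digit token as a zipcode.
    match = re.search(r"\b\d{5}\b", text)
    if match:
        obj.add_pair("zipcode", match.group(), "extracted")
    return obj


# Made-up sample document.
doc = from_text("Ship to 12 Main Street, Springfield 62704.", {"author": "Smith"})
print(doc.pairs, doc.origin)
```

With `pairs` empty the object degenerates to an unstructured one, and with `words` empty to a structured one, which is one way to read the three definitions as points on a single spectrum.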

4 Representative Applications
• BPI: messages (unstructured)
• Web applications: unstructured pages
• Corporate portals:
• DSS involving a combination of simulation with a database system
• News syndication: author etc. + story
• Call centers: customer interaction + structured component of complaint
• Mail systems/document systems
• Tourist information systems
• Product catalogs/engineering spec sheets
• Patents/chemistry documents
• Matching legal documents (with cross citations) with building codes (representative)

5 Key Technical Issues
• Query language & data model
  – Sharp vs fuzzy / complete vs best-effort
  – Boolean vs similarity queries (relationship to “value”)
• Integration strategies
  – Loose vs. tight coupling
      Architectures (many possibilities)
  – Search engine into DBMS or DBMS into search engine
  – Late & early binding (warehousing vs virtual)
  – Integration vs articulation (union vs intersection)
• Feature extraction from unstructured data
• Role of metadata & integrity constraints
• Inconsistency of data sources
  – Priority rules for mediation
• Management & data organization issues
  – Version management, freshness, security
• Continuous queries over streams
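The Boolean-versus-similarity distinction under "Query language & data model" can be illustrated with a small sketch. It assumes whitespace tokenization and cosine similarity over raw term counts; both are stand-ins chosen for brevity, not techniques named on the slide, and the three documents are made up.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


def boolean_match(doc, required_terms):
    """Sharp query: a document either contains every required term or it does not."""
    return set(required_terms) <= set(tokens(doc))


def cosine(a, b):
    """Fuzzy query: cosine similarity between raw term-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


# Hypothetical document collection.
docs = [
    "privacy preserving data mining",
    "data mining on streams",
    "query optimization",
]

# Boolean query: complete answer set, no ordering.
sharp = [d for d in docs if boolean_match(d, ["privacy", "mining"])]

# Similarity query: every document gets a score; the caller decides where to cut.
ranked = sorted(docs, key=lambda d: cosine("privacy data mining", d), reverse=True)
print(sharp)
print(ranked)
```

The Boolean form returns one document and silently drops the partially relevant second one, while the similarity form keeps it with a lower score, which is the "complete vs best-effort" trade-off in miniature.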

6
• Structured: People(firstname, lastname, company, location)
• Semi-structured: Papers(title, {authors}, text)
• Unstructured: Reviews
Q1: Reviews of papers by Almaden authors on II
• Search reviews using Join(People., Papers.authors).keywords
Q2: Folks in Almaden and Watson working on the same topic
• Join on Papers.text, followed by a join with names in People
Q3: Papers on privacy & data mining by Agarwal in Watson
• Combine ranks of results from People and Papers
Q4: Almaden authors whose papers had negative reviews
• Infer sentiment of a review and interesting joins
Q5: Current research topics in Almaden
• Join People and Papers followed by clustering
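Q1 can be sketched as a structured join followed by a keyword search. The sketch assumes that People joins to Papers.authors on last name, that a paper's topic is tested by substring match on Papers.text, and that a review mentions the title of the paper it discusses. All rows and review texts are invented, and `q1` is a hypothetical helper name.

```python
# Made-up rows following the People and Papers schemas on the slide.
people = [
    {"firstname": "Ann", "lastname": "Lee", "company": "IBM", "location": "Almaden"},
    {"firstname": "Bob", "lastname": "Roy", "company": "IBM", "location": "Watson"},
]
papers = [
    {"title": "Federated Query Processing", "authors": ["Lee"],
     "text": "information integration across autonomous sources"},
    {"title": "Stream Sketches", "authors": ["Roy"],
     "text": "approximate counting over streams"},
]
# Reviews are plain text with no schema.
reviews = [
    "Federated Query Processing is a solid contribution.",
    "Stream Sketches lacks an evaluation.",
]


def q1(location, topic):
    """Reviews of papers on `topic` written by authors at `location`."""
    # Structured step: select People by location, join to Papers on author name.
    names = {p["lastname"] for p in people if p["location"] == location}
    titles = [
        paper["title"]
        for paper in papers
        if names & set(paper["authors"]) and topic in paper["text"]
    ]
    # Unstructured step: use the joined titles as keywords against the reviews.
    return [r for r in reviews if any(t.lower() in r.lower() for t in titles)]


print(q1("Almaden", "information integration"))
```

Joining on last name alone is ambiguous in practice; it is used here only to keep the example short.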

7 Combining Scores
Query: Papers on privacy & data mining by Agarwal in Watson
• DB:
  – Aggarwal, Watson, s1
  – Agarwal, Almaden, s2
  – Agrawal, Almaden, s3
• IR:
  – Sigmod 00 paper, r2
  – PODS 01 papers, r1
  – KDD00 paper, r3
[Diagram components: Query, Chopper, DB, IR, Combiner, Result]
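One possible combiner is sketched below, assuming a weighted sum of a DB match score and an IR relevance score. The slide does not say how scores are combined, so the weighting scheme, every numeric score, and the links between people and papers are invented purely for illustration.

```python
def combine(db_hits, ir_hits, wrote, w_db=0.5):
    """Merge DB and IR scores for each (person, paper) link.

    db_hits: {person: score}, ir_hits: {paper: score},
    wrote: list of (person, paper) pairs. A weighted sum is an assumption.
    """
    results = []
    for person, paper in wrote:
        if person in db_hits and paper in ir_hits:
            score = w_db * db_hits[person] + (1 - w_db) * ir_hits[paper]
            results.append((paper, person, round(score, 3)))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Approximate matches for "Agarwal in Watson" (scores are made up).
db_hits = {
    ("Aggarwal", "Watson"): 0.9,
    ("Agarwal", "Almaden"): 0.6,
    ("Agrawal", "Almaden"): 0.4,
}
# Relevance to "privacy & data mining" (scores are made up).
ir_hits = {"Sigmod 00 paper": 0.7, "PODS 01 paper": 0.9, "KDD00 paper": 0.5}
# Hypothetical person-to-paper links, needed to connect the two result lists.
wrote = [
    (("Aggarwal", "Watson"), "KDD00 paper"),
    (("Agarwal", "Almaden"), "PODS 01 paper"),
    (("Agrawal", "Almaden"), "Sigmod 00 paper"),
]

ranked = combine(db_hits, ir_hits, wrote)
for paper, person, score in ranked:
    print(paper, person, score)
```

With equal weights, a strong IR score can outrank the best name-and-location match, which shows why the choice of combining function matters for queries like this one.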

8 Query Processing
[Two diagrams, each with the components: Query, Chopper & Router, DB, IR, Result]
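A rough sketch of what a chopper and router might do, assuming queries follow the fixed pattern "<kind> on <topic> by <author> in <location>". The pattern, the `chop` function, and the output layout are hypothetical; the slide shows only the component names.

```python
import re

# Assumed query shape; a real chopper would need a far more general parser.
PATTERN = re.compile(
    r"^(?P<kind>\w+) on (?P<topic>.+?) by (?P<author>\w+) in (?P<location>\w+)$"
)


def chop(query):
    """Split a query into structured predicates (for DB) and keywords (for IR)."""
    m = PATTERN.match(query)
    if not m:
        # Fallback: nothing recognizably structured, so route everything to IR.
        return {"db": {}, "ir": re.findall(r"\w+", query.lower())}
    return {
        "db": {"lastname": m["author"], "location": m["location"]},
        "ir": re.findall(r"\w+", m["topic"].lower()),
    }


parts = chop("Papers on privacy & data mining by Agarwal in Watson")
print(parts["db"])  # predicates routed to the DBMS
print(parts["ir"])  # keywords routed to the search engine
```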

9 Approaches (1)
• Query languages
  – XML-based extensions for queries
      W3C working group on XQuery considering extension for full text
      XXL (Weikum), XIRQL (Fuhr)
  – Specialized languages for highly structured data (e.g. chemical molecules)?
  – Graph-based models & languages (RDF; Protégé, Stanford)
  – Extended relational (e.g. SQL/MM)
  – Inverse queries on business events
  – Reasoning systems
  – Statistical approaches (approximate / data mining)

10 Approaches (2)
• Pluses of tight coupling
  – Enforcement of ontologies, schemas
  – Security, management, query optimization, integrity constraints
• Negatives of tight coupling
  – Does not address federation issues/autonomy
• Pluses of loose coupling
  – Flexibility
• Negatives of loose coupling
And the dinner bell rings …

11 Concluding Remarks
• We need further discussion on issues and approaches during the rest of the workshop

