Informationsintegration Information Quality 26.1.2006 Felix Naumann.

1 Informationsintegration Information Quality 26.1.2006 Felix Naumann

2 Overview (26.1.2006, Felix Naumann, VL Informationsintegration, WS 05/06)
- Motivation: IQ for integrated IS
- Definition of IQ
- Optimizing IQ
- IQ assessment
- IQ model
- IQ query answering in DBMS
- IQ query answering in IIS
- IQ-driven integration

3 Database Management Systems vs. Integrated Information Systems
- DBMS
- IIS

4 DBMS Quality vs. IIS Quality
- DBMS: complete (assumed), accurate, trusted, fast, free. High expectations, high quality.
- IIS: incomplete, inaccurate, untrusted, slow, possible cost. Low expectations, low quality.

5 Data Quality vs. Data Errors
- Quality cannot be raised by data cleansing alone.
- Reputation, objectivity, …
- (Figure labels: Accuracy, Quality; Duplicates, Quality)

6 Data Quality in IIS
- Integrated information systems are especially prone to quality problems.
- Problems accumulate:
  - Quality of the original data (data entry, third-party companies, ...)
  - Quality of the source systems (consistency, constraints, errors, ...)
  - Quality of the integration processes (parsing, transforming, mappings)
- Problems only surface in the integrated view.

7 Example: Customer Relationship Management (CRM)
Problems in the CRM of a multi-channel sales organization:
- Customers recorded twice
- Customers rated incorrectly
- Wrong addresses
- Households / corporate structures not recognized
Consequences:
- False positives: annoyed customers due to multiple / unsuitable mailings
- False negatives: missed opportunities due to missing / wrong assignment (cross-selling)
- Pointless postage costs for wrong addresses
Source: Prof. Ulf Leser (VL Data Warehouses)

8 Cost of Dirty Data
- A.T. Kearney: 25%-40% of operating costs are caused by poor data quality.
- Data Warehouse Institute: industry and government in the USA lose 600 billion USD annually.
- SAS study: only 18% of German companies trust their data.
- AT&T (1970s): 20-30% of all lines unused because of bad data.
- 80% of all hospital records contain errors.
- Hmmm......

9 Optimize IQ!
- Fixed quality: complete & correct
- Optimize cost: time & throughput
- Fixed cost: price, patience, …
- Optimize IQ: IQ criteria

10 Overview
- Motivation: IQ for integrated IS
- Definition of IQ
- Optimizing IQ
- IQ assessment
- IQ model
- IQ query answering in DBMS
- IQ query answering in IIS
- IQ-driven integration

11 Information Quality (IQ)
What is information quality?
- Fitness for use
- User satisfaction
- Application-dependent
Consequences of low data quality:
- Wrong forecasts
- Missed business
Quality is especially interesting for integrated information:
- Often no control over the information sources (autonomy!)
- Often doubtful quality
- The Internet makes publishing easy
- Large number of available sources

12 IIS Quality Criteria
IQ := "Even though quality cannot be defined, you know what it is." (Robert Pirsig)

13 Information Quality (IQ)
IQ := {Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ...}

14 IQ Classification of [WS96]
- Intrinsic IQ: believability, accuracy, objectivity, reputation
- Contextual IQ: value-added, relevancy, timeliness, completeness, amount
- Representational IQ: interpretability, understandability, representational consistency, representational conciseness
- Accessibility IQ: accessibility, security

15 Content-based IQ Criteria
…concern the actual data.
- Accuracy is the extent to which data is correct, reliable, and certified free of error. [WS96]
- Completeness is the extent to which data is not missing and is of sufficient breadth, depth, and scope for the task at hand. [WS96]
- Customer support is the amount and usefulness of human help via email or telephone.
- Documentation is the amount and usefulness of documents with metadata.
- Interpretability is the extent to which data is in appropriate languages, symbols, and units, and the definitions are clear. [WS96]
- Relevancy (or relevance) is the extent to which data is applicable and helpful for the task at hand. [WS96]
- Reliability is the degree to which the user can trust the information.
- Value-added is the extent to which data is beneficial and provides advantages from its use. [WS96]

16 Technical IQ Criteria
…concern software and hardware.
- Accessibility (or availability) of a DBMS is the probability that a feasible query is correctly answered in a given time range. Alternatively, it is the extent to which data are available or easily and quickly retrievable [WS96].
- Latency is the amount of time in seconds from issuing the query until the first data item reaches the user.
- Price (cost effectiveness) is the amount of money a user has to pay for a query. Alternatively, it is the extent to which the cost of collecting appropriate data is reasonable [WS96].
- Response time measures the delay in seconds between submission of a query by the user and reception of the complete response from the IS.
- Security is the extent to which access to data is restricted appropriately to maintain its security [WS96].
- Timeliness is the extent to which the age of the data is appropriate for the task at hand [WS96].

17 Intellectual IQ Criteria
…concern subjective aspects.
- Believability is the extent to which data is regarded as true, real, and credible [WS96].
- Objectivity is the extent to which data is unbiased, unprejudiced, and impartial [WS96].
- Reputation is the extent to which data is trusted or highly regarded in terms of its source or content [WS96].

18 Instantiation-related IQ Criteria
…concern the presentation of retrieved data.
- Amount of data is the extent to which the quantity or volume of available data is appropriate [WS96].
- Representational conciseness is the extent to which data is compactly represented without being overwhelming [WS96].
- Representational consistency is the extent to which data is always represented in the same format and is compatible with previous data [WS96].
- Understandability (ease of understanding) is the extent to which data is clear, without ambiguity, and easily comprehended [WS96].
- Verifiability (traceability) is the extent to which data is well documented, verifiable, and easily attributed to a source [WS96].

19 IQ Criteria (classical): Accuracy
Definition:
- Usually: percentage of incorrect tuples
- For integration: percentage of incorrect data values
Assessment:
- Domain and constraint testing
- Lookup tables
- Scientific measurements
- Data-input experience
Improvement:
- Often: deletion
- Better: data scrubbing
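The two definitions on this slide can give quite different numbers for the same data. The following is a minimal sketch of domain testing that reports both; the attribute names (`age`, `zip`) and the checks themselves are hypothetical and only serve as an illustration.

```python
# Hypothetical domain checks; attribute names and rules are illustrative only.
CHECKS = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "zip": lambda v: isinstance(v, str) and len(v) == 5 and v.isdigit(),
}

def error_rates(rows):
    """Return (share of tuples with at least one failing value,
    share of failing values among all checked values)."""
    bad_tuples = 0
    bad_values = 0
    total_values = 0
    for row in rows:
        failures = sum(1 for attr, ok in CHECKS.items() if not ok(row.get(attr)))
        bad_values += failures
        total_values += len(CHECKS)
        if failures:
            bad_tuples += 1
    return bad_tuples / len(rows), bad_values / total_values

rows = [
    {"age": 34, "zip": "10117"},
    {"age": 250, "zip": "10117"},  # age outside the assumed domain
    {"age": 41, "zip": "1011"},    # zip too short
    {"age": 29, "zip": "14482"},
]
tuple_rate, value_rate = error_rates(rows)
print(tuple_rate, value_rate)
```

At tuple level half of the sample is flagged, while at value level only a quarter is, which is why the value-level view can be more informative for integrated data.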

20 IQ Criteria (classical): Response Time
Definition:
- Usually: time until the complete query result is received
- For integration: latency
Assessment:
- Cost calibration
- Continuous assessment
Improvement:
- Source selection
- Classical optimization
- Federated optimization

21 IQ Criteria (new): Completeness
Definition:
- Coverage: number of real-world objects represented
- Density: number of attributes covered
- For IIS: NULL values
Assessment:
- Sampling
- Existing metadata
Improvement:
- Source selection
- Best k vs. k best
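Coverage and density can be sketched as simple ratios. The snippet below is an illustration under assumptions: it normalizes both to fractions in [0, 1], treats `None` as the NULL value, and uses made-up object identifiers and attribute names.

```python
def coverage(source_ids, world_ids):
    """Fraction of the real-world objects that the source represents."""
    world = set(world_ids)
    return len(set(source_ids) & world) / len(world)

def density(rows, attributes):
    """Fraction of attribute values that are not NULL (None)."""
    total = len(rows) * len(attributes)
    filled = sum(1 for r in rows for a in attributes if r.get(a) is not None)
    return filled / total if total else 0.0

world_ids = {1, 2, 3, 4, 5}            # assumed set of real-world objects
rows = [
    {"id": 1, "title": "A", "price": 10.0},
    {"id": 2, "title": "B", "price": None},
    {"id": 3, "title": "C", "price": 7.5},
]
cov = coverage((r["id"] for r in rows), world_ids)
den = density(rows, ["title", "price"])
print(cov, den)
```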

22 IQ Criteria (new): Reputation / Trust
Definition:
- Reputation: memory and summary of behavior from past transactions
- Trust: expectation about future behavior
Assessment:
- Individual experience
- Corporate guidance
- Trust networks
Improvement: ???

23 Overview
- Motivation: IQ for integrated IS
- Definition of IQ
- Optimizing IQ
- IQ assessment
- IQ model
- IQ query answering in DBMS
- IQ query answering in IIS
- IQ-driven integration

24 Optimize IQ!
- Fixed quality: complete & correct
- Optimize cost: time & throughput
- Fixed cost: price, patience, …
- Optimize IQ: IQ criteria

25 A New Optimization Paradigm: Many Changes
- DBMS: cost criteria, cost model, optimization algorithm
- Integrated IS: quality criteria, quality model, optimization algorithm + information integration

26 DB Cost Criteria
- Response time
- Execution time
- Latency
- Throughput
- Cardinality
- …
Assessed through system parameters and statistics.

27 IIS Quality Criteria
IQ := {Understandability, Reputation, Reliability, Timeliness, Availability, Price, Consistency, Coverage, Response time, Density, Completeness, Amount, Accuracy, Relevancy, ...}
Assessed in 3 classes…

28 IQ Assessment
(Figure labels: subject, query, process, object)
- Completeness, timeliness, ...
- Availability, response time, ...
- Relevance, believability, ...

29 IQ Assessment

30 Overview
- Motivation: IQ for integrated IS
- Definition of IQ
- Optimizing IQ
- IQ assessment
- IQ model
- IQ query answering in DBMS
- IQ query answering in IIS
- IQ-driven integration

31 DB Cost Models
Operators:
- + (add)
- max
- x (multiply)

32 A Quality Model for Integrated IS
2 problems:
- Many dimensions: multidimensional
- Merging operators
IQ-vectors:
- S1 (95, 0, 0.7, 1, 99.95, 60, 48.2, 0)
- S2 (99, 0, 1, 0.2, 99.9, 80, 52.8, 0)
- S3 (95, 0, 0.7, 1, 99.95, 60, 38, 3)
Aggregated IQ-vector: (?, ?, ?, ?, ?, ?, ?, ?)

33 IQ Merge Functions
- Availability: A · B
- Price: A + B
- Response time: max[A, B]
- Coverage: Sylvester
Merge IQ in many dimensions:
- S1 (95, 0, 0.7, 1, 99.95, 60, 48.2, 0) merged with S2 (99, 0, 1, 0.2, 99.9, 80, 52.8, 0) gives (94.05, 0, 1, 1, 99.85, 48, 54.86, 0)
- Merging that result with S3 (95, 0, 0.7, 1, 99.95, 60, 38, 3) gives (89.35, 0, 1, 1, 99.8, 28.8, 76.06, 3)
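The merge functions can be sketched per criterion. The code below is an illustration, not the lecture's implementation: the `IQ` record and its four fields are hypothetical, availability and coverage are expressed as fractions in [0, 1], and coverage is merged by inclusion-exclusion with an explicitly supplied overlap, which is one reading of the "Sylvester" entry on the slide. The coverage and overlap values are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IQ:
    availability: float   # probability in [0, 1]
    price: float          # money per query
    response_time: float  # seconds
    coverage: float       # fraction of world objects in [0, 1]

def merge(a, b, overlap):
    """Merge two IQ records; `overlap` is the fraction of world objects
    contained in both sources (an assumed, externally known value)."""
    return IQ(
        availability=a.availability * b.availability,           # both must answer
        price=a.price + b.price,                                 # pay both
        response_time=max(a.response_time, b.response_time),    # wait for the slower
        coverage=a.coverage + b.coverage - overlap,              # inclusion-exclusion
    )

s1 = IQ(availability=0.95, price=0.0, response_time=0.7, coverage=0.5)
s2 = IQ(availability=0.99, price=0.0, response_time=1.0, coverage=0.4)
merged = merge(s1, s2, overlap=0.2)
print(merged)
```

With availabilities of 95% and 99% the product is 94.05%, which matches the first component of the merged vector on the slide.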

34 Multidimensional IQ
IQ-criteria have
- Different units, so convert
- Different ranges, so scale
- Different importance, so weight
Is (89.35, 0, 1, 1, 99.8, 28.8, 76.06, 3) > (82.35, 0, 2, 1.5, 95, 32, 71.77, 2)?
MADM methods: SAW, TOPSIS, ELECTRE, AHP, DEA
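Of the listed MADM methods, SAW (Simple Additive Weighting) is the easiest to sketch: scale each criterion to [0, 1], flip cost criteria so that larger is always better, then take a weighted sum. The source names, criterion values, weights, and the choice of min-max scaling below are all assumptions made for illustration.

```python
def saw_scores(alternatives, weights, benefit):
    """Simple Additive Weighting with min-max scaling.

    alternatives: name -> tuple of raw criterion values
    weights:      one weight per criterion (assumed to sum to 1)
    benefit:      True if larger is better, False for cost criteria
    """
    names = list(alternatives)
    columns = [[alternatives[s][j] for s in names] for j in range(len(weights))]
    scores = {}
    for s in names:
        total = 0.0
        for j, w in enumerate(weights):
            lo, hi = min(columns[j]), max(columns[j])
            if hi == lo:
                scaled = 1.0  # criterion does not discriminate
            else:
                scaled = (alternatives[s][j] - lo) / (hi - lo)
                if not benefit[j]:
                    scaled = 1.0 - scaled
            total += w * scaled
        scores[s] = total
    return scores

# Hypothetical criteria: (availability in %, price, response time in s)
alternatives = {"A": (95, 0, 0.7), "B": (99, 2, 1.0), "C": (90, 1, 0.2)}
scores = saw_scores(alternatives, weights=(0.5, 0.2, 0.3),
                    benefit=(True, False, False))
best = max(scores, key=scores.get)
print(scores, best)
```

The ranking depends heavily on the weights, which is where user preferences enter the comparison.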

35 Overview
- Motivation: IQ for integrated IS
- Definition of IQ
- Optimizing IQ
- IQ assessment
- IQ model
- IQ query answering in DBMS
- IQ query answering in IIS
- IQ-driven integration

36 DB-type Optimization
Goal:
- Minimize response time
- Maximize throughput
Restrictions:
- Complete
- Correct (not just accurate: filter conditions…)
Find best plan!

37 DB-type Optimization
SELECT name, company FROM emp, comp WHERE emp.compID = comp.ID AND emp.salary > 1000
(Figure: two operator trees for this query over emp and comp, each with a projection (name, company), a join (compID=ID), and a selection (salary > 1000).)
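A DBMS may reorder such operators only if the result stays complete and correct. As a small sketch, the two plans below apply the selection either after or before the join and return the same answer. The sample rows and the helper functions are invented; which of the two trees on the slide corresponds to which plan is an assumption.

```python
emp = [
    {"name": "Ann", "compID": 1, "salary": 1500},
    {"name": "Bob", "compID": 2, "salary": 900},
    {"name": "Cy", "compID": 1, "salary": 2000},
]
comp = [{"ID": 1, "company": "Acme"}, {"ID": 2, "company": "Globex"}]

def join(left, right):
    """Nested-loop equi-join on emp.compID = comp.ID."""
    return [{**e, **c} for e in left for c in right if e["compID"] == c["ID"]]

def select(rows):
    """Selection salary > 1000."""
    return [r for r in rows if r["salary"] > 1000]

def project(rows):
    """Projection on (name, company), sorted for easy comparison."""
    return sorted((r["name"], r["company"]) for r in rows)

plan_a = project(select(join(emp, comp)))   # selection above the join
plan_b = project(join(select(emp), comp))   # selection pushed below the join
print(plan_a == plan_b, plan_b)
```

Plan B joins fewer tuples, so it is cheaper, yet it loses nothing: this is the kind of transformation a cost-based optimizer is allowed to make.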

38 DB-type Optimization
(Figure: operator tree with projection (name, company), join (compID=ID), selection (salary > 1000), and inputs comp and emp_1 ... emp_n.)

39 DB-type Optimization
(Figure: two operator trees with projection (name, company), join (compID=ID), selection (salary > 1000), and a MERGE of emp_1 ... emp_n joined with comp. The second tree contains an additional selection (salary > 1000).)

40 IIS-type Optimization
- Change is efficient
- But: result can be incomplete.
- Preferences?
(Figure: operator tree with projection (name, company), join (compID=ID), selections (salary > 1000), and a MERGE of emp_1 ... emp_n joined with comp.)

41 Overview
- Motivation: IQ for integrated IS
- Definition of IQ
- Optimizing IQ
- IQ assessment
- IQ model
- IQ query answering in DBMS
- IQ query answering in IIS
- IQ-driven integration

42 IIS-type Optimization
Goal:
- Maximize information quality
- (Maximize completeness)
Restrictions:
- Price
- Bandwidth
- Time (user patience)
Find K best sources vs. find best K sources

43 IIS-type Optimization
K best sources:
- Simple IQ model, but
- Sources may not complement each other
  - at tuple level (replication)
  - at attribute level
Best K sources:
- Finds optimal query result
- Uses IQ merging
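The difference between the two strategies shows up as soon as sources overlap. The sketch below uses coverage as the only quality criterion and an exhaustive search over combinations; the three sources and their object sets are invented for illustration.

```python
from itertools import combinations

# Hypothetical sources, each described by the set of objects it contains.
sources = {
    "S1": {1, 2, 3, 4},
    "S2": {1, 2, 3},     # largely replicated in S1
    "S3": {5, 6},
}

def k_best(sources, k):
    """Rank sources individually and take the top k."""
    ranked = sorted(sources, key=lambda s: len(sources[s]), reverse=True)
    return set(ranked[:k])

def best_k(sources, k):
    """Pick the k-combination whose merged coverage is largest."""
    def covered(combo):
        return len(set().union(*(sources[s] for s in combo)))
    return set(max(combinations(sorted(sources), k), key=covered))

print(k_best(sources, 2))  # picks the two individually largest sources
print(best_k(sources, 2))  # picks the pair that covers the most objects
```

The K best sources are S1 and S2, which together cover only four objects because S2 is replicated in S1. The best K sources are S1 and S3, which cover all six. Exhaustive search is exponential in general, so it only serves to make the distinction concrete.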

44 Naive (3-phase) approach [NLF99]
- Input: query, views, IQ scores
- Phase 1: source selection
- Phase 2: query planning
- Phase 3: plan selection
- Output: quality-ranked plans
Good: executes only best plans
Bad: still needs to compute all plans

45 Integrated (single-phase) approach [LN00]
- Input: query, views, and IQ scores
- HiQ B&B: one phase, quality-based branch & bound algorithm
- Output: quality-ranked plans
Good: executes only best plans
Good: computes only a fraction of all plans

46 Overview
- Motivation: IQ for integrated IS
- Definition of IQ
- Optimizing IQ
- IQ assessment
- IQ model
- IQ query answering in DBMS
- IQ query answering in IIS
- IQ-driven integration

47 Conflict Resolution
- amazon.com: $5.99, Moby Dick, Herman Melville, 0766607194
- bn.com: $3.98, H. Melville, 0766607194
- Resolution per column: ID, max length, MIN, CONCAT
These are IQ considerations!

48 Conflict Resolution
- null = unknown
- Internal conflict-resolution function

49 Conflict Resolution
- Numerical: SUM, AVG, MAX, MIN, …
- Non-numerical: MAXLENGTH, CONCAT, AnnCONCAT, …
- Special: RANDOM, COUNT, CHOOSE, FAVOR, MaxIQ, …
- Domain-specific: …
Human specification of IQ; automated specification of IQ
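A few of these functions can be sketched on the Moby Dick tuples from the amazon.com / bn.com example. The attribute names, the treatment of `None` as "unknown", the CONCAT separator, and the assignment of a function to each attribute are assumptions, since the layout of the original figure is not preserved.

```python
def max_length(values):
    """MAXLENGTH: the longest non-null string."""
    present = [v for v in values if v is not None]
    return max(present, key=len) if present else None

def minimum(values):
    """MIN: the smallest non-null number."""
    present = [v for v in values if v is not None]
    return min(present) if present else None

def concat(values, sep=" | "):
    """CONCAT: all non-null values joined by a separator."""
    present = [v for v in values if v is not None]
    return sep.join(present) if present else None

# Assumed mapping of attributes to resolution functions.
RESOLVE = {"title": concat, "author": max_length, "price": minimum}

def fuse(a, b):
    """Fuse two tuples that describe the same object (same id)."""
    if a["id"] != b["id"]:
        raise ValueError("tuples describe different objects")
    fused = {"id": a["id"]}
    for attr, resolve in RESOLVE.items():
        fused[attr] = resolve([a[attr], b[attr]])
    return fused

amazon = {"id": "0766607194", "title": "Moby Dick",
          "author": "Herman Melville", "price": 5.99}
bn = {"id": "0766607194", "title": None,
      "author": "H. Melville", "price": 3.98}
fused = fuse(amazon, bn)
print(fused)
```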

50 Conflict Resolution
- amazon.com: $5.99, Moby Dick, Herman Melville, 0766607194
- bn.com: $3.98, H. Melville, 0766607194
- Resolution per column: ID, max length, MIN, CONCAT
Source annotations: slow, up-to-date, complete, …; fast, outdated, incomplete, …

51 MaxIQ Resolution Function
- Per attribute: choose value of higher quality source.
- Per tuple: choose tuple of higher quality source.
- Per source:
  - Choose K best sources
  - Choose best K sources
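The per-tuple and per-attribute variants can be sketched as follows. The quality scores are invented, and skipping null values in the per-attribute variant is an assumption rather than something the slide states.

```python
def max_iq_per_tuple(tuples_by_source, source_iq):
    """Take the whole tuple from the source with the highest IQ score."""
    best = max(tuples_by_source, key=lambda s: source_iq[s])
    return dict(tuples_by_source[best])

def max_iq_per_attribute(tuples_by_source, attr_iq):
    """For each attribute, take the value from the best-scored source
    that actually has a value (assumption: nulls are skipped)."""
    attrs = next(iter(tuples_by_source.values())).keys()
    fused = {}
    for attr in attrs:
        candidates = [s for s, t in tuples_by_source.items()
                      if t.get(attr) is not None]
        if not candidates:
            fused[attr] = None
            continue
        best = max(candidates, key=lambda s: attr_iq[s][attr])
        fused[attr] = tuples_by_source[best][attr]
    return fused

tuples_by_source = {
    "amazon.com": {"title": "Moby Dick", "author": "Herman Melville", "price": 5.99},
    "bn.com": {"title": None, "author": "H. Melville", "price": 3.98},
}
# Hypothetical quality scores in [0, 1].
source_iq = {"amazon.com": 0.6, "bn.com": 0.8}
attr_iq = {
    "amazon.com": {"title": 0.9, "author": 0.9, "price": 0.4},
    "bn.com": {"title": 0.5, "author": 0.5, "price": 0.8},
}
by_tuple = max_iq_per_tuple(tuples_by_source, source_iq)
by_attr = max_iq_per_attribute(tuples_by_source, attr_iq)
print(by_tuple)
print(by_attr)
```

Per-tuple resolution keeps the missing title of the higher-rated source, whereas per-attribute resolution can combine the strengths of both sources.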

52 References
- [RD00] Erhard Rahm and Hong Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23(4), 2000.
- [MF03] Heiko Müller and Johann-Christoph Freytag. Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt University Berlin, 2003.
- [WS96] Richard Y. Wang and Diane M. Strong. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4):5-34, 1996.
- [NLF99] Felix Naumann, Ulf Leser, and Johann-Christoph Freytag. Quality-driven Integration of Heterogeneous Information Systems. VLDB 1999.
- [LN00] Ulf Leser and Felix Naumann. Query Planning with Information Quality Bounds. FQAS 2000.

