Presentation on theme: "Data Cleaning and Transformation"— Presentation transcript:
1 Data Cleaning and Transformation. Helena Galhardas, DEI IST (based on the slides: “A Survey of Data Quality Issues in Cooperative Information Systems”, Carlo Batini, Tiziana Catarci, Monica Scannapieco, 23rd International Conference on Conceptual Modelling (ER 2004))
2 Agenda. Introduction; Data Quality Problems; Data Quality Dimensions; Relevant Activities in Data Quality.
3 When materializing the integrated data (data warehousing)… [Figure: ETL flow between SOURCE DATA and TARGET DATA, with the stages Data Extraction, Data Transformation, and Data Loading.] ETL: Extraction, Transformation and Loading. 70% of the time in a data warehousing project is spent on the ETL process.
4 Why Data Cleaning and Transformation? Data in the real world is dirty. Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., occupation=“”. Noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field); e.g., Salary=“-10”. Inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials); e.g., Age=“42” with Birthday=“03/07/1997”; e.g., rating was “1, 2, 3”, now rating is “A, B, C”; e.g., discrepancy between duplicate records.
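The three kinds of dirtiness on this slide can be sketched as simple record checks. This is a minimal Python illustration, not a method from the slides: the field names, the reading of “03/07/1997” as 3 July 1997, and the reference date are all assumptions made for the example.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of (kind, message) pairs for one record (a dict)."""
    problems = []
    # Incomplete: a required attribute is missing or empty.
    if not record.get("occupation"):
        problems.append(("incomplete", "occupation is empty"))
    # Noisy: a value outside its plausible range.
    if record.get("salary") is not None and record["salary"] < 0:
        problems.append(("noisy", "salary is negative"))
    # Inconsistent: the stored age disagrees with the age derived from the birthday.
    birthday = record.get("birthday")
    if birthday is not None and record.get("age") is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived = today.year - birthday.year - (0 if had_birthday else 1)
        if derived != record["age"]:
            problems.append(("inconsistent", f"age {record['age']} vs derived {derived}"))
    return problems

# Hypothetical record combining the slide's three examples.
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2004, 11, 8))  # assumed reference date
kinds = [kind for kind, _ in problems]
```

Each check is deliberately independent, so a record can be reported as incomplete, noisy, and inconsistent at the same time.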
5 Why Is Data Dirty? Incomplete data comes from: data values not available when collected; different criteria between the time when the data was collected and when it is analyzed; human/hardware/software problems. Noisy data comes from: data collection (faulty instruments); data entry (human or computer errors); data transmission. Inconsistent (and redundant) data comes from: different data sources, hence non-uniform naming conventions/data codes; functional dependency and/or referential integrity violations.
6 Why is Data Quality Important? Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get high-quality data! No quality data, no quality decisions! Quality decisions must be based on good quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics). Notes: Example: if there are duplicates or errors, the correspondence mailed to people's homes may lead to errors and, consequently, to incorrect decisions. The problem may lie in the data used to send the correspondence.
7 Research issues related to DQ. [Figure: Data Quality and its related research areas: Data Integration, Data Cleaning, Statistical Data Analysis, Data Mining, Management Information Systems, Knowledge Representation. Issues labeled in the figure: Source Selection, Source Composition, Query Result Selection, Time Synchronization, Record Matching (deduplication), Data Transformation, Conflict Resolution, Edit-imputation, Record Linkage, Error Localization, DB Profiling, Patterns in Text Strings, Assessment, Process Improvement, Tradeoff Cost/Optimization.]
8 Research areas in DQ systems. [Figure: Models; Dimensions; Methodologies; Measurement/Improvement Techniques; Measurement/Improvement Tools and Frameworks; Application Domains: EGov, Scientific Data, Web Data.] Notes: Obtaining good-quality data is a complex, multidisciplinary activity, given its nature, its importance, and the variety of data types and information systems. It involves several research topics and several application areas. Dimensions: used to measure the level of data quality. Models: the models used in databases to represent data and schemas, as well as the business processes of an organization, must be enriched so that they represent the dimensions and other aspects of data quality. Techniques: algorithms, heuristics, knowledge-based procedures, and learning processes that provide a solution to a specific DQ problem, e.g., identifying that two records in a database refer to the same real-world object. Methodologies: provide guidelines for choosing the most effective data quality measurement and improvement process, starting from a set of available techniques and tools. Tools: automated procedures, usually with an interface, that relieve the user of executing some techniques manually; they implement the methodologies and techniques. EGov: the main goal of e-Gov projects is to help improve the relationships between government, agencies, and businesses through the use of information and communication technologies. This implies: complete automation of governmental administrative processes; creation of an architecture; creation of portals. e-Gov projects generally face the problem that similar information about a citizen or business is likely to appear in multiple databases. The situation is made worse by the fact that errors may have been introduced over time for various reasons. These misalignments between pieces of information frequently lead to additional costs.…
9 Application contexts. Integrate data from different sources, e.g., populating a DW from different operational data stores. Eliminate errors and duplicates within a single source, e.g., duplicates in a file of customers. Migrate data from a source schema into a different fixed target schema, e.g., discontinued application packages. Convert poorly structured data into structured data, e.g., processing data collected from the Web. Notes: 1) When a single source contains erroneous data (for example, if several records refer to the same client within a list of customers used for marketing purposes). 2) When two or more data sources are integrated, for instance to build a data warehouse. Data coming from distinct sources, produced by different people using different conventions, must be consolidated and enhanced to conform to the DW schema. More than 50% of the total cost is spent on cleaning data. 3) When dealing with legacy sources or data coming from the Web, it is important to extract relevant information and transform and convert it into the appropriate schema. For example, the Citeseer web site, which enables users to browse through citations, was built from a set of textual records that correspond to all bibliographic references in CS that appear in ps and pdf documents available on the Web.
10 Data Quality Dimensions. Accuracy: errors in data; example: “Jhn” vs. “John”. Currency: lack of updated data; example: residence (permanent) address, out-of-date vs. up-to-date. Consistency: discrepancies in the data; example: ZIP code and city consistent. Completeness: lack of data; partial knowledge of the records in a table or of the attributes in a record. Notes: Regarding accuracy, “Jhn” is intuitively less accurate than “John”. Consistency pertains to several aspects, e.g., the coherence of ZIP code and city in a database of permanent addresses of citizens. Concerning completeness, intuitively it may deal with full or only partial knowledge of the records in a table (we name this entity completeness) or else full or partial knowledge of the attributes of a record (attribute completeness).
12 Existing technology. Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language: programs difficult to optimize and maintain. RDBMS mechanisms for guaranteeing integrity constraints: do not address important data instance problems. Data transformation scripts using an ETL (Extraction-Transformation-Loading) or data quality tool. Notes: The commercial tools available for data cleaning… concerning data transformation….
13 Typical architecture of a DQ system. [Figure: components SOURCE DATA, Data Extraction, Data Transformation, Data Loading, TARGET DATA, Data Analysis, Schema Integration, Metadata, Dictionaries, Human Knowledge.] Notes: A typical architectural solution to these problems contains the following components. There are three main modules: extraction from data sources, transformation, and loading of data into heterogeneous data sources. Additional data analysis steps make it possible to automatically detect data errors and derive some data transformation rules; schema integration techniques derive the appropriate target schema from the source data schemas and allow the specification of some transformation rules. This type of information can be stored in a metadata repository that feeds information to the extraction, transformation and loading activities. In addition, dictionaries containing reference data can be used when integrating schemas and transforming data. Throughout the cleaning process, human participation may be required to supply extraction, transformation, and loading rules, or even to transform some data items manually.
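The three main modules of this architecture (extraction, transformation, loading) and the use of a reference dictionary can be sketched in a few lines of Python. This is only an illustrative toy pipeline: the CSV content, the `CITY_DICTIONARY` reference data, and the function names are hypothetical and do not come from the slides or from any specific tool.

```python
import csv
import io

# Hypothetical source data and reference dictionary (assumptions for the example).
SOURCE_CSV = "name,city\nBob ,lisboa\nAna,PORTO\n"
CITY_DICTIONARY = {"lisboa": "Lisboa", "porto": "Porto"}

def extract(text):
    """Extraction: read source rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, dictionary):
    """Transformation: trim whitespace and map city spellings to reference values."""
    cleaned = []
    for row in rows:
        city_key = row["city"].strip().lower()
        # Values missing from the dictionary are kept unchanged for manual review.
        city = dictionary.get(city_key, row["city"])
        cleaned.append({"name": row["name"].strip(), "city": city})
    return cleaned

def load(rows, target):
    """Loading: append the cleaned rows to the target table (a list here)."""
    target.extend(rows)
    return target

target_table = load(transform(extract(SOURCE_CSV), CITY_DICTIONARY), [])
```

Keeping the dictionary as a parameter of `transform` mirrors the idea that reference data and rules live outside the pipeline code, where a human or a metadata repository can supply them.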
15 Several taxonomies. Barateiro, J. and Galhardas, H. (2005). “A survey of data quality tools”. Datenbank-Spektrum, 14:15-21. Oliveira, P. (2009). “Detecção e correcção de problemas de qualidade de dados: Modelo, Sintaxe e Semântica”. PhD thesis, U. do Minho. Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K., and Lee, D. (2003). “A taxonomy of dirty data”. Data Mining and Knowledge Discovery, 7:81-99. Mueller, H. and Freytag, J.-C. (2003). “Problems, methods, and challenges in comprehensive data cleansing”. Technical report, Humboldt-Universitaet zu Berlin. Rahm, E. and Do, H. H. (2000). “Data cleaning: Problems and current approaches”. Bulletin of the Technical Committee on Data Engineering, Special Issue on Data Cleaning, 23:3-13.
16 Data quality problems (1/3). Schema-level data quality problems: prevented with better schema design, schema translation and integration. Instance-level data quality problems: errors and inconsistencies of data that are not prevented at schema level.
17 Data quality problems (2/3). Schema-level data quality problems. Avoided by an RDBMS: missing data (product price not filled in); wrong data type (“abc” in product price); wrong data value (0.5 in product tax (IVA, the Portuguese VAT)); dangling data (category identifier of product does not exist); exact duplicate data (different persons with same ssn); generic domain constraints (incorrect invoice price). Not avoided by an RDBMS: wrong categorical data (countries and corresponding states); outdated temporal data (just-in-time requirement); inconsistent spatial data (coordinates and shapes); name conflicts (person vs person or person vs client); structural conflicts (addresses).
18 Data quality problems (3/3). Instance-level data quality problems. Single record: missing data in a not null field (ssn:); erroneous data (price: 5 but real price: 50); misspellings (José Maria Silva vs José Maria Sliva); embedded values (Prof. José Maria Silva); misfielded values (city: Portugal); ambiguous data (J. Maria Silva; Miami Florida, Ohio). Multiple records: duplicate records (Name: Jose Maria Silva, Birth: 01/01/1950 and Name: José Maria Sliva, Birth: 01/01/1950); contradicting records (Name: José Maria Silva, Birth: 01/01/1950 and Name: José Maria Silva, Birth: 01/01/1956); non-standardized data (José Maria Silva vs Silva, José Maria).
20 Traditional data quality dimensions. Accuracy; Completeness; Time-related dimensions (Currency, Timeliness, and Volatility); Consistency. Their definitions do not provide quantitative measures, so one or more metrics have to be associated with each of them. For each metric, one or more measurement methods have to be provided regarding: (i) where the measurement is taken; (ii) what data are included; (iii) the measurement device; and (iv) the scale on which results are reported. Schema quality dimensions are also defined.
21 Accuracy. Closeness between a value v and a value v’, considered as the correct representation of the real-world phenomenon that v aims to represent. Ex: for a person name “John”, v’=John is correct, v=Jhn is incorrect. Syntactic accuracy: closeness of a value v to the elements of the corresponding definition domain D. Ex: if v=Jack, even if v’=John, v is considered syntactically correct. Measured by means of comparison functions (e.g., edit distance) that return a score. Semantic accuracy: closeness of the value v to the true value v’. Measured with a <yes, no> or <correct, not correct> domain. Coincides with correctness. The corresponding true value has to be known. Notes: Semantic accuracy is harder to compute than syntactic accuracy. One technique for checking semantic accuracy is to look at the same data in different data sources and find the correct data by comparison. This is related to the object identification problem, i.e., the problem of determining whether or not two tuples refer to the same real-world object.
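The difference between syntactic and semantic accuracy can be made concrete with an edit-distance comparison function. The sketch below uses the standard Levenshtein distance as the comparison function; the domain `D` and the helper names are assumptions chosen to match the slide's “John”/“Jack”/“Jhn” example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def syntactic_distance(v, domain):
    """Distance from v to the closest element of the domain D (0 means v is in D)."""
    return min(edit_distance(v, d) for d in domain)

def semantically_accurate(v, true_value):
    """Semantic accuracy is a yes/no answer and requires knowing the true value."""
    return v == true_value

D = {"John", "Jack", "Mary"}            # hypothetical definition domain of first names
jhn_score = syntactic_distance("Jhn", D)    # close to "John" but not in D
jack_score = syntactic_distance("Jack", D)  # in D, so syntactically correct
jack_semantic = semantically_accurate("Jack", "John")
```

Here “Jack” has syntactic distance 0 even though it is semantically wrong for a person whose true name is “John”, which is the point of the slide's example.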
22 Granularity of accuracy definition. Accuracy may refer to: a single value of a relation attribute; an attribute or column; a relation; the whole database.
23 Metrics for quantifying accuracy. Weak accuracy error: characterizes accuracy errors that do not affect identification of tuples. Strong accuracy error: characterizes accuracy errors that affect identification of tuples. Percentage of accurate tuples: characterizes the fraction of accurate matched tuples.
24 Completeness. “The extent to which data are of sufficient breadth, depth, and scope for the task in hand.” Three types. Schema completeness: degree to which concepts and their properties are not missing from the schema. Column completeness: evaluates the missing values for a specific property or column in a table. Population completeness: evaluates missing values with respect to a reference population.
25 Completeness of relational data. The completeness of a table characterizes the extent to which the table represents the real world. It can be characterized wrt: (1) The presence/absence and meaning of null values. Example: in Person(name, surname, birthdate, email), a null email may indicate that the person has no email (no incompleteness), that the email exists but is not known (incompleteness), or that it is not known whether the person has an email (incompleteness may not be the case). (2) The validity of the open world assumption (OWA) or closed world assumption (CWA). OWA: we can state neither the truth nor the falsity of facts not represented in the tuples of a relation. CWA: only the values actually present in a relational table, and no other values, represent facts of the real world.
26 Metrics for quantifying completeness (1). Model without null values with OWA. Needs a reference relation ref(r) for a relation r, containing all the tuples that satisfy the schema of r. C(r) = |r|/|ref(r)|. Example: according to a registry of Lisbon municipality, the number of citizens is 2 million. If a company stores data about Lisbon citizens for the purpose of its business and that number is 1,400,000, then C(r) = 0.7.
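The Lisbon example is a direct application of C(r) = |r|/|ref(r)|. As a small Python sketch (the function name is ours, not from the slides):

```python
def completeness_owa(r_size, ref_size):
    """C(r) = |r| / |ref(r)|: fraction of the reference relation present in r."""
    if ref_size <= 0:
        raise ValueError("the reference relation must contain at least one tuple")
    return r_size / ref_size

# Slide example: 1,400,000 stored citizens out of 2,000,000 registered ones.
lisbon_completeness = completeness_owa(1_400_000, 2_000_000)
```

In practice the hard part is obtaining |ref(r)|; here it is simply taken from the municipal registry figure quoted on the slide.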
27 Metrics for quantifying completeness (2). Model with null values with CWA: specific definitions for different granularities. Values: to capture the presence of null values for some fields of a tuple. Tuple: to characterize the completeness of a tuple wrt the values of all its fields; evaluates the % of specified values in the tuple wrt the total number of attributes of the tuple itself. Example: Student(stID, name, surname, vote, examdate): equal to 1 for (6754, Mike, Collins, 29, 7/17/2004); equal to 0.8 for (6578, Julliane, Merrals, NULL, 7/17/2004).
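Tuple completeness is the fraction of non-null fields in a tuple. A minimal Python sketch of the Student example, assuming NULL is represented by `None`:

```python
def tuple_completeness(t):
    """Fraction of fields in the tuple that hold a specified (non-null) value."""
    return sum(value is not None for value in t) / len(t)

# Student(stID, name, surname, vote, examdate), with None standing for NULL.
mike = (6754, "Mike", "Collins", 29, "7/17/2004")
julliane = (6578, "Julliane", "Merrals", None, "7/17/2004")

mike_completeness = tuple_completeness(mike)
julliane_completeness = tuple_completeness(julliane)
```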
28 Metrics for quantifying completeness (3). Attribute: to measure the number of null values of a specific attribute in a relation; evaluates the % of specified values in the column corresponding to the attribute wrt the total number of values that should have been specified. Example: for calculating the average of votes in Student, a notion of the completeness of Vote would be useful. Relation: to capture the presence of null values in the whole relation; measures how much information is represented in the relation by evaluating the content of the information actually available wrt the maximum possible content, i.e., without null values.
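Attribute and relation completeness follow the same counting idea at coarser granularities. The sketch below assumes a relation stored as a list of dictionaries with `None` for NULL; the two Student rows are illustrative.

```python
def attribute_completeness(relation, attribute):
    """Fraction of rows in which the given attribute is not null."""
    values = [row[attribute] for row in relation]
    return sum(v is not None for v in values) / len(values)

def relation_completeness(relation):
    """Fraction of all cells in the relation that are not null."""
    cells = [v for row in relation for v in row.values()]
    return sum(v is not None for v in cells) / len(cells)

# Illustrative Student relation; None stands for NULL.
student = [
    {"stID": 6754, "name": "Mike", "surname": "Collins", "vote": 29, "examdate": "7/17/2004"},
    {"stID": 6578, "name": "Julliane", "surname": "Merrals", "vote": None, "examdate": "7/17/2004"},
]

vote_completeness = attribute_completeness(student, "vote")
student_completeness = relation_completeness(student)
```

A low value for `vote_completeness` signals that an average computed over the Vote column rests on only part of the students.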
29 Time-related dimensions. Currency: concerns how promptly data are updated. Example: if the residential address of a person is updated (it corresponds to the address where the person lives), then the currency is high. Volatility: characterizes the frequency with which data vary in time. Example: birth dates (volatility zero) vs stock quotes (high degree of volatility). Timeliness: expresses how current data are for the task in hand. Example: the timetable for university courses can be current by containing the most recent data, but it is not timely if it is available only after the start of the classes. Notes: An important aspect of data is how they change and are updated over time.
30 Metrics of time-related dimensions. Currency: last-update metadata; straightforward for data types that change with a fixed frequency. Volatility: length of time that data remain valid. Timeliness: currency plus a check that data are available before the planned usage time.
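One possible reading of these metrics, sketched in Python: currency is the age of the data computed from last-update metadata, and data are timely when they are available before the planned usage time and still within their validity period at that time. The exact combination rule, the parameter names, and the dates are assumptions for the example.

```python
from datetime import datetime, timedelta

def currency(last_update, now):
    """Age of the data: time elapsed since the last recorded update."""
    return now - last_update

def is_timely(last_update, available_at, planned_usage, validity):
    """Timely = available before planned usage AND not older than its validity period."""
    available_in_time = available_at <= planned_usage
    still_valid = currency(last_update, planned_usage) <= validity
    return available_in_time and still_valid

# University timetable example with hypothetical dates.
classes_start = datetime(2004, 9, 15)
timetable_updated = datetime(2004, 9, 10)
validity = timedelta(days=120)  # assumed length of time the timetable remains valid

published_late = is_timely(timetable_updated, datetime(2004, 9, 20), classes_start, validity)
published_early = is_timely(timetable_updated, datetime(2004, 9, 12), classes_start, validity)
```

The late-published timetable is current (recently updated) yet not timely, matching the distinction drawn between the two dimensions.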
31 Consistency. Captures the violation of semantic rules defined over a set of data items, where data items can be tuples of relational tables or records in a file. Integrity constraints in relational data: domain constraints; key, inclusion and functional dependencies. Data edits: semantic rules in statistics.
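As an illustration of checking one kind of integrity constraint, the sketch below looks for violations of a functional dependency such as ZIP code → city. The rows and ZIP values are made up for the example.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value (violating lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

# Hypothetical address rows; the second row contradicts the first.
addresses = [
    {"zip": "1000-001", "city": "Lisboa"},
    {"zip": "1000-001", "city": "Porto"},
    {"zip": "4000-001", "city": "Porto"},
]

violations = fd_violations(addresses, "zip", "city")
```

The check only reports that the tuples disagree; deciding which city is correct is a separate correction step.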
32 Evolution of dimensions. Traditional dimensions are Accuracy, Completeness, Timeliness, Consistency. With the advent of networks, sources increase dramatically, and data often become “found data”. Federated data, where many disparate data are integrated, are highly valued. Data collection and analysis are frequently disconnected. As a consequence, we have to revisit the concept of DQ, and new dimensions become fundamental.
33 Other dimensions. Interpretability: concerns the documentation and metadata that are available to correctly interpret the meaning and properties of data sources. Synchronization between different time series: concerns proper integration of data having different time stamps. Accessibility: measures the ability of the user to access the data given his/her own culture, physical status/functions, and available technologies.
36 Standardization/normalization. Modification of data with new data according to defined standards or reference formats. Example: change “Bob” to “Robert”; change “Channel Str.” to “Channel Street”.
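A dictionary-driven sketch of the two examples in Python. The lookup tables `NICKNAMES` and `ABBREVIATIONS` are hypothetical reference data containing only the slide's two cases.

```python
import re

# Hypothetical reference tables holding the slide's two examples.
NICKNAMES = {"Bob": "Robert"}
ABBREVIATIONS = {"Str.": "Street"}

def standardize_name(name):
    """Replace a known nickname by its standard form; leave other names unchanged."""
    return NICKNAMES.get(name, name)

def standardize_address(address):
    """Expand known abbreviations that appear as whole tokens in the address."""
    for abbreviation, full in ABBREVIATIONS.items():
        pattern = r"\b" + re.escape(abbreviation) + r"(?=\s|$)"
        address = re.sub(pattern, full, address)
    return address

standard_name = standardize_name("Bob")
standard_address = standardize_address("Channel Str.")
```

Escaping the abbreviation matters because the trailing dot in “Str.” would otherwise match any character.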
37 Record Linkage/Object identification/Entity identification/Record matching/Duplicate detection. Activity required to identify whether data in the same source or in different ones represent the same object of the real world. Notes: Record linkage, object identification, entity identification, and record matching are all names used in the literature to denote similar problems, i.e., given two tables or two sets of tables, representing two entities/objects of the real world, find and cluster all records in the tables referring to the same entity/object instance. We will dedicate a lot of time to this activity, so I move on to the next slide.
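A naive pairwise sketch of duplicate detection in Python, using only the standard library. It is not one of the record linkage algorithms covered later in the course: the similarity function (`difflib.SequenceMatcher` on accent-stripped, lower-cased names), the exact-match rule on birth date, and the 0.85 threshold are all assumptions for illustration.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations

def normalize(text):
    """Lower-case the text and strip accents (e.g., 'José' -> 'jose')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def similarity(a, b):
    """Similarity score in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def find_matches(records, threshold=0.85):
    """Compare every pair of records; report index pairs that likely match."""
    matches = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        same_birth = r1["birth"] == r2["birth"]
        if same_birth and similarity(r1["name"], r2["name"]) >= threshold:
            matches.append((i, j))
    return matches

records = [
    {"name": "Jose Maria Silva", "birth": "01/01/1950"},
    {"name": "José Maria Sliva", "birth": "01/01/1950"},
    {"name": "Ana Costa", "birth": "01/01/1950"},  # hypothetical non-matching record
]
matches = find_matches(records)
```

Comparing all pairs is quadratic in the number of records, so this only suits small inputs.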
38 Data integration. Task of presenting a unified view of data owned by heterogeneous and distributed data sources. Two sub-activities: quality-driven query processing, the task of providing query results on the basis of a quality characterization of data at the sources; instance-level conflict resolution, the task of identifying and solving conflicts of values referring to the same real-world objects. Notes: In the area of data (extensional) integration, as opposed to schema (intensional) integration, all the topics in which data are affected by errors pertain to data quality issues. Schema matching can be considered as belonging to this area, since the activity of matching also involves comparison among values.
39 Instance-level conflict resolution. Instance-level conflicts can be of three types: representation conflicts, e.g., dollar vs. euro; key equivalence conflicts, i.e., same real-world objects with different identifiers; attribute value conflicts, i.e., instances corresponding to the same real-world objects and sharing an equivalent key differ on other attributes.
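Before a conflict can be resolved it has to be identified. As a small Python sketch of detecting attribute value conflicts only (the record layout and the `id` key are hypothetical, and no resolution strategy is implied):

```python
def attribute_conflicts(rec_a, rec_b, key):
    """For two records with the same key value, return the attributes on which they differ.

    Returns an empty dict when the keys differ, since the records are then
    not known to describe the same real-world object.
    """
    if rec_a[key] != rec_b[key]:
        return {}
    shared = (rec_a.keys() & rec_b.keys()) - {key}
    return {
        attr: (rec_a[attr], rec_b[attr])
        for attr in sorted(shared)
        if rec_a[attr] != rec_b[attr]
    }

# Two hypothetical sources describing the same person (same key, different birth date).
source_1 = {"id": 1, "name": "José Maria Silva", "birth": "01/01/1950"}
source_2 = {"id": 1, "name": "José Maria Silva", "birth": "01/01/1956"}

conflicts = attribute_conflicts(source_1, source_2, key="id")
```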
40 Error localization/Data auditing. Given one/two/n tables or groups of tables, and a group of integrity constraints/qualities (e.g., completeness, accuracy), find records that do not respect the constraints/qualities. Data editing-imputation: focus on integrity constraints. Deviation detection: data checking that marks deviations as possible data errors.
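Deviation detection can be illustrated with a robust outlier rule. The Python sketch below uses the modified z-score based on the median absolute deviation (MAD) with the commonly used cutoff of 3.5; this particular rule is our choice for the example, and the salary values are made up.

```python
from statistics import median

def flag_deviations(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    Flagged values are only *possible* errors and still need review.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half of the values equal the median; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

salaries = [1000, 1100, 950, 1050, -10]  # hypothetical column containing one suspicious value
suspects = flag_deviations(salaries)
```

A median-based score is used because a single extreme value inflates the mean and standard deviation enough to hide itself in small samples.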
41 Data Profiling. Evaluating statistical properties and intensional properties of tables and records. Structure induction: derivation of a structural description, i.e., “any form of regularity that can be found”.
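A minimal column profiler in Python showing both ideas: basic statistics (counts, nulls, distinct values) and a simple form of structure induction that abstracts each value into a character pattern. The pattern alphabet (`9` for digits, `A` for letters) and the sample ZIP values are assumptions for the example.

```python
from collections import Counter

def pattern(value):
    """Abstract a string into a pattern: digits -> '9', letters -> 'A', others kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile(values):
    """Compute simple statistics and the pattern frequencies of one column."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "patterns": Counter(pattern(v) for v in non_null),
    }

zip_codes = ["1000-001", "4000-123", None, "10000"]  # hypothetical column
zip_profile = profile(zip_codes)
```

Rare patterns (here `99999` among `9999-999` values) are natural candidates for closer inspection.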
42 Data correction/data cleaning/data scrubbing. Given one/two/n tables or groups of tables, and a set of identified errors in records wrt given qualities, generate probable corrections and correct the records, in such a way that the new records respect the qualities.
43 Schema cleaning. Transform the conceptual schema in order to achieve or optimize a given set of qualities (e.g., readability, normalization), while preserving other properties (e.g., equivalence of content).
44 References. “Data Quality: Concepts, Methodologies and Techniques”, C. Batini and M. Scannapieco, Springer-Verlag, 2006 (Chapters 1, 2, and 4). “A Survey of Data Quality Tools”, J. Barateiro, H. Galhardas, Datenbank-Spektrum 14: 15-21, 2005.
45 Next lectures. Data Cleaning and Transformation tools; The Ajax framework; Record Linkage; Data Fusion.