ISV Innovation Presented by ISV Innovation Presented by Business Intelligence Fundamentals: Data Cleansing Ola Ekdahl IT Mentors 9/12/08.


1 Business Intelligence Fundamentals: Data Cleansing
Ola Ekdahl, IT Mentors, 9/12/08

2 Agenda
1. Introduction to data cleansing
2. SSIS tasks
   - Data Profiling
   - Fuzzy Lookup
   - Fuzzy Grouping

3 Introduction to Data Cleansing
Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been caused by different data dictionary definitions of similar entities in different stores, by user entry errors, or by corruption in transmission or storage.
Data cleansing differs from data validation in that validation is performed at entry time, where it almost invariably means rejecting bad data from the system, rather than on batches of data.
The actual process of data cleansing may involve removing typos or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
Source: http://en.wikipedia.org/wiki/Data_cleansing
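The difference between strict and fuzzy validation can be illustrated with a minimal Python sketch. It is not part of the original deck: the city list, the ZIP pattern, the 0.8 cutoff, and the function names are all assumptions chosen for illustration, and the standard-library `difflib` matcher stands in for whatever similarity measure a real cleansing tool would use.

```python
import difflib
import re

# Hypothetical list of known, clean values (illustrative only).
KNOWN_CITIES = ["Bellevue", "Redmond", "Seattle", "Tacoma"]

# Assumed pattern for a well-formed US ZIP code: 5 digits, optional +4 suffix.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def strict_zip_ok(zip_code):
    """Strict validation: accept only values that match the pattern exactly."""
    return ZIP_PATTERN.fullmatch(zip_code) is not None


def fuzzy_correct(value, known=KNOWN_CITIES, cutoff=0.8):
    """Fuzzy validation: return the closest known value, or None if none is close enough."""
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A strict check can only accept or reject a value, whereas the fuzzy check can repair a misspelling such as "Seatle" by mapping it to the nearest known entry.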

4 Data Profiling
- The Data Profiling task computes various profiles that help you become familiar with a data source and identify problems in the data that have to be fixed.
- After using the task to compute data profiles and save them in a file, you can use the stand-alone Data Profile Viewer to review the profile output.

5 Fuzzy Lookup
- Fuzzy Lookup enables you to match input records with clean, standardized records in a reference table.
- Fuzzy Lookup returns the closest match and indicates the quality of the match.

6 Fuzzy Grouping
- Fuzzy Grouping enables you to identify groups of records in a table where each record in the group potentially corresponds to the same real-world entity.
- The grouping is resilient to commonly observed errors in real data, because records in each group may not be identical to each other but are very similar to each other.


8 Data Profiling – Profile Requests
Profiles that analyze individual columns:
- Column Length Distribution Profile: Reports all the distinct lengths of string values in the selected column and the percentage of rows in the table that each length represents. This profile helps you identify problems in your data, such as values that are not valid. For example, you profile a column of United States state codes that should be two characters and discover values longer than two characters.
- Column Null Ratio Profile: Reports the percentage of null values in the selected column. This profile helps you identify problems in your data, such as an unexpectedly high ratio of null values in a column. For example, you profile a Zip Code/Postal Code column and discover an unacceptably high percentage of missing codes.
- Column Pattern Profile: Reports a set of regular expressions that cover the specified percentage of values in a string column. This profile helps you identify problems in your data, such as strings that are not valid. It can also suggest regular expressions that can be used in the future to validate new values. For example, a pattern profile of a United States Zip Code column might produce the regular expressions \d{5}-\d{4}, \d{5}, and \d{9}. If you see other regular expressions, your data likely contains values that are not valid or are in an incorrect format.
- Column Statistics Profile: Reports statistics such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for datetime columns. This profile helps you identify problems in your data, such as dates that are not valid. For example, you profile a column of historical dates and discover a maximum date that is in the future.
- Column Value Distribution Profile: Reports all the distinct values in the selected column and the percentage of rows in the table that each value represents. It can also report only the values that represent more than a specified percentage of rows in the table. This profile helps you identify problems in your data, such as an incorrect number of distinct values in a column. For example, you profile a column that is supposed to contain states in the United States and discover more than 50 distinct values.
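To make the single-column profiles concrete, here is a rough Python sketch of what a few of them compute. It is not the SSIS implementation: the function names and the assumed ZIP patterns are illustrative, and the pattern check only tests values against patterns you supply rather than discovering patterns as the Column Pattern Profile does.

```python
import re
from collections import Counter


def null_ratio(values):
    """Fraction of rows whose value is None (compare: Column Null Ratio)."""
    return sum(v is None for v in values) / len(values)


def length_distribution(values):
    """Share of all rows for each distinct string length (compare: Column Length Distribution)."""
    counts = Counter(len(v) for v in values if v is not None)
    return {length: n / len(values) for length, n in counts.items()}


def value_distribution(values):
    """Share of all rows for each distinct non-null value (compare: Column Value Distribution)."""
    counts = Counter(v for v in values if v is not None)
    return {value: n / len(values) for value, n in counts.items()}


def unmatched(values, patterns):
    """Non-null values that match none of the given regular expressions."""
    return [v for v in values
            if v is not None and not any(re.fullmatch(p, v) for p in patterns)]


# Assumed patterns, taken from the Zip Code example in the slide text.
ZIP_PATTERNS = [r"\d{5}-\d{4}", r"\d{5}", r"\d{9}"]
```

For a state-code column such as `["WA", "WA", "OR", "Wash", None, "CA", "WA", "OR"]`, the length distribution exposes the four-character outlier and the null ratio exposes the missing value.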

9 Data Profiling – Profile Requests
Profiles that analyze multiple columns:
- Candidate Key Profile: Reports whether a column or set of columns is a key, or an approximate key, for the selected table. This profile also helps you identify problems in your data, such as duplicate values in a potential key column.
- Functional Dependency Profile: Reports the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column). This profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the dependency between a column that contains United States Zip Codes and a column that contains states in the United States. The same Zip Code should always have the same state, but the profile discovers violations of this dependency.
- Value Inclusion Profile: Computes the overlap in the values between two columns or sets of columns. This profile can determine whether a column or set of columns is appropriate to serve as a foreign key between the selected tables. This profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the ProductID column of a Sales table and discover that the column contains values that are not found in the ProductID column of the Products table.
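The multi-column profiles can be approximated in a few lines of Python. This is only a sketch under stated assumptions: rows are plain dictionaries, the function names are hypothetical, and `key_strength` is a simplified measure (distinct key combinations divided by row count) rather than the exact figure the Data Profiling task reports.

```python
from collections import defaultdict


def key_strength(rows, columns):
    """Distinct combinations of `columns` divided by row count; 1.0 means an exact key."""
    keys = {tuple(row[c] for c in columns) for row in rows}
    return len(keys) / len(rows)


def dependency_violations(rows, determinant, dependent):
    """Determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {value: deps for value, deps in seen.items() if len(deps) > 1}


def missing_from_reference(child_values, parent_values):
    """Child-column values that never appear in the parent column (failed inclusion)."""
    return set(child_values) - set(parent_values)


# Illustrative data: row 4 breaks the rule that a Zip Code implies one state.
rows = [
    {"id": 1, "zip": "98005", "state": "WA"},
    {"id": 2, "zip": "98005", "state": "WA"},
    {"id": 3, "zip": "97201", "state": "OR"},
    {"id": 4, "zip": "98005", "state": "OR"},
]
```

Here `id` is an exact key, `zip` is not, and the Zip Code 98005 is reported as violating the zip-to-state dependency.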

10 Fuzzy Lookup – ETI and Q-Grams
Error-Tolerant Index
- Fuzzy Lookup uses the Error-Tolerant Index (ETI) to find matching rows in the reference table.
- Each record in the reference table is broken up into words (also known as tokens), and the ETI keeps track of all the places in the reference table where a particular token occurs.
- In addition, Fuzzy Lookup indexes substrings, known as q-grams, so that it can better match records that contain errors.
Example: the reference data "13831 N.E. 8th St" is stored in the ETI as the tokens 13831, N, E, 8th, St.
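The idea of indexing both tokens and q-grams can be sketched as a small inverted index in Python. This is an illustration only, not the actual ETI layout: the delimiter rule (anything that is not a letter or digit), the q-gram length of 3, and the function names are assumptions.

```python
import re
from collections import defaultdict


def tokens(text):
    """Split on any run of characters that are not letters or digits."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def qgrams(token, q=3):
    """All substrings of length q; tokens no longer than q are returned whole."""
    if len(token) <= q:
        return [token]
    return [token[i:i + q] for i in range(len(token) - q + 1)]


def build_index(reference_rows, q=3):
    """Map each lowercase token and q-gram to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(reference_rows):
        for token in tokens(text.lower()):
            index[token].add(row_id)
            for gram in qgrams(token, q):
                index[gram].add(row_id)
    return index
```

With this tokenizer, `tokens("13831 N.E. 8th St")` produces the same five tokens shown in the slide's example, and a misspelled input such as "13381" can still reach the right row through q-grams it shares with "13831".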

11 Fuzzy Lookup – Runtime
- The task takes an input row and tries to find the best match or matches in the reference table as efficiently as possible.
- By default, this is done by using the ETI to find candidate reference records that share tokens or q-grams in common with the input.
- The best candidates are retrieved from the reference table and a more careful comparison is made between the two records.
- Once there are no more candidates that could be better than any match found so far, Fuzzy Lookup stops and moves on to the next input row.
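The two-phase lookup described above (cheap candidate retrieval, then a more careful comparison) might look roughly like the following Python sketch. It makes several simplifying assumptions: it indexes q-grams of the whole string rather than per token, uses `difflib.SequenceMatcher` as a stand-in similarity measure, and caps the candidates at a fixed `max_candidates` instead of the stopping rule Fuzzy Lookup applies. All names are hypothetical.

```python
import difflib
from collections import Counter, defaultdict


def qgrams(text, q=3):
    """Set of lowercase substrings of length q (the whole string if it is shorter)."""
    text = text.lower()
    return {text[i:i + q] for i in range(max(len(text) - q + 1, 1))}


def build_index(reference):
    """Map each q-gram to the ids of the reference rows that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(reference):
        for gram in qgrams(text):
            index[gram].add(row_id)
    return index


def fuzzy_lookup(value, reference, index, max_candidates=5):
    """Return (best_match, similarity), or (None, 0.0) when no candidate shares a q-gram."""
    # Phase 1: count shared q-grams per reference row using the index only.
    shared = Counter()
    for gram in qgrams(value):
        for row_id in index.get(gram, ()):
            shared[row_id] += 1
    # Phase 2: run the more expensive comparison on the top candidates only.
    best, best_score = None, 0.0
    for row_id, _ in shared.most_common(max_candidates):
        score = difflib.SequenceMatcher(
            None, value.lower(), reference[row_id].lower()).ratio()
        if score > best_score:
            best, best_score = reference[row_id], score
    return best, best_score
```

The returned score plays the role of the match quality that Fuzzy Lookup reports alongside the closest match.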

12 Fuzzy Grouping – Details
- Fuzzy Grouping uses Fuzzy Lookup under the covers to perform the grouping.
- Fuzzy Grouping passes its tokenization string intact to Fuzzy Lookup.
- At run time, Fuzzy Grouping uses Fuzzy Lookup to build a temporary ETI against the input data and uses it to determine which input rows are close to each other.
- Depending on the number of results it gets back and the resulting similarities between records, it generates groups.
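As a rough illustration of grouping rows that are close to each other, the Python sketch below assigns each row to the first existing group whose canonical row is similar enough. It is a deliberately simple stand-in, not how Fuzzy Grouping works internally: it compares rows pairwise instead of using a temporary ETI, and the 0.8 threshold, the `difflib` similarity, and the function names are assumptions.

```python
import difflib


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] from the standard-library matcher."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_group(rows, threshold=0.8):
    """Greedy grouping: a row joins the first group whose canonical row is similar enough."""
    groups = []  # list of (canonical_row, members)
    for row in rows:
        for canonical, members in groups:
            if similarity(row, canonical) >= threshold:
                members.append(row)
                break
        else:
            # No existing group was close enough, so this row starts a new one.
            groups.append((row, [row]))
    return [members for _, members in groups]
```

Each resulting group collects spelling variants that likely refer to the same real-world entity, with the first row seen acting as the group's canonical record.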


14 Setup Considerations
- Use the more lightweight DTExec.exe rather than the full SSIS Designer to execute packages in production.
- Drop unused columns in your pipeline because they require memory.
- For recurring Fuzzy Lookup tasks in which the reference table is considerably larger than the typical input table, you should consider pre-computing the index.
- By default, Fuzzy Lookup will load the ETI and reference table into available memory before starting to process rows. If you only have a few rows to process in a particular run, you can reduce this time by setting the WarmCaches property to False.

15 Measurements
- The number of rows and columns has the greatest impact on performance. The more data you have, the more resources Fuzzy Lookup and Fuzzy Grouping require. The figures in the following sections show specific data for various scenarios.
- The average number of tokens per string column on which a fuzzy match is performed also has an impact on performance. Fuzzy transforms are not meant for document retrieval. For longer fields (greater than 20 tokens), it might be more efficient to use the SQL Server full-text indexing features.

16 Thank You
Fuzzy Lookup and Grouping: http://msdn.microsoft.com/en-us/library/ms345128.aspx
Data Profiler: http://www.sqlservercentral.com/articles/Integration+Services/64133

17 More recordings available at: www.isvinnovation.com

