Oracle Enterprise Data Quality Product Briefing
EDQ Fundamentals: Profile, Audit, Transform and Other Basic Functionality
Objectives. After completing this course, you will be able to: Profile data to understand it. Check data quality using audit processors. Transform data to improve its quality or to re-purpose it. Export data from Enterprise Data Quality (EDQ). Automate and schedule processes using jobs. Re-use your configuration with different processes and data sets. Describe the key features and high-level architecture of EDQ.
Agenda
Main Body of the Course: Enterprise Data Quality Overview. The User Interface and its Key Objects. Profiling. Auditing. Transforming. Writing and Exporting Data. Automated Processing: Jobs. Web Services & Data Interfaces. Re-using Your Work: Publishing, Packaging and Copying. Publishing to the Dashboard.
In the Appendix: Sampling. Users and Security.
Oracle Enterprise Data Quality Product Briefing
Lesson 1: Enterprise Data Quality Overview
Enterprise Data Quality Features (1)
Full range of data quality functionality in a single user interface: profiling, auditing, transforming, parsing and matching. Read, write and report on data. Batch and real-time.
Enterprise Data Quality Features (2)
Collaborative environment: multi-user, multi-project, multi-server. Simple, drag-and-drop user interface. Designed for data owners as well as data administrators.
For example, in a migration project some users profile data while others review their work: one person finds and fixes, another checks. Projects might be HR, Products, or QA, Development and Production environments; servers might serve different geographical regions, or separate test and production boxes. No SQL or scripting knowledge is needed.
Collaboration Across User Communities
Diagram: Director is used across the business by Executives & Stakeholders, Business Analysts, Data Analysts, Data Stewards and Reviewers. The key point is that Director can be used by different audiences across the business. Involve representatives of these groups to get business context and consensus.
Why Enterprise Data Quality?
Main advantages: Ease of use: Powerful, but simple and intuitive. Designed with business analysts in mind. Quick installation and time to productivity. Enjoyable to use. Extensible and open. Friendly to Service Oriented Architecture. All Data Quality functionality available: Via a single user interface. To any process.
Enterprise Data Quality Architecture
Client: rich Java Web Start user interface used to configure, initiate and monitor processes, and to browse results. Business Layer: deployed on a Java application server; captures and processes data and updates the UI with results. Data Storage: a repository consisting of two database schemas: Director (persistent configuration data) and Results (temporary data, refreshed when processes are run).
Enterprise Data Quality Deployment Options (1)
Client: requires Java running on Windows. Business Layer: must be deployed in a web server, e.g. WebLogic, Tomcat, WebSphere or GlassFish. Data Storage: requires a PostgreSQL or Oracle database. See the EDQ Certification Matrix for details of supported platforms.
Enterprise Data Quality Deployment Options (2)
All three layers can be installed on a single machine (e.g. a laptop), or each layer can be installed on a separate machine. Many clients can connect to the same 'server'.
Integration With Siebel: Functional Overview
Enterprise Data Quality for Siebel provides: Real-time: prevention of duplicate accounts and contacts; standardization of accounts and contacts; verification of addresses. Batch: de-duplication and standardization of accounts and contacts.
There is basically a three-step de-duplication / duplicate-prevention process, though it varies between real-time and batch and also varies slightly depending on whether UCM or CRM is involved: 1. Key generation: Siebel identifies the driving record and generates a key. 2. Candidate selection: Siebel looks up entities that share the driving record's key and passes them to EDQ via web service or shared staging database. 3. Matching: EDQ identifies possible duplicate records and passes them back to Siebel with a match score. CRM/UCM then handles the transaction commit (including commit to the Cluster Key table), or not (if the user picks an existing record).
Integration With Siebel: Architecture
Diagram: the Siebel Server connects to the EDQ Server through Siebel's Universal DQ Connector interface and the EDQ Siebel Connector, using web services and jobs, with a shared staging database between them. Real-time integration is via EDQ web services; batch is via EDQ jobs and the shared staging database. EDQ does not hold a complete copy of Siebel working data, which avoids synchronization issues.
Real-Time Duplicate Prevention
Flow between Siebel (UCM or CRM) and EDQ over web services: a key is created for the driving record (e.g. Id 1-LJZJ, ClusterKey MATTCB23); candidate records are identified and sent; EDQ identifies possible duplicates and returns match scores; an automatic or manual decision is made; Siebel commits if necessary.
Siebel's identification of candidates (effectively clustering) is rudimentary in that it works from a key created by a simple query. This is difficult to tune and can result in clusters that are too big to allow optimum performance or too small to ensure that all matches will be found. In UCM, an automatic decision is made if the match score is greater than 95; if it is greater than 70 but less than 96, a manual decision must be made. In UCM, duplicate records are merged by Siebel according to its survivorship rules. In CRM, the user picks an existing record and the other records are 'discarded'.
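The UCM thresholds above lend themselves to a tiny decision routine. A minimal Python sketch, assuming integer match scores; the function name and return labels are illustrative, not part of EDQ's or Siebel's API:

```python
def ucm_decision(match_score: int) -> str:
    """Route a candidate pair using the UCM thresholds described above."""
    if match_score > 95:
        return "automatic-merge"   # Siebel merges per its survivorship rules
    if match_score > 70:
        return "manual-review"     # a user must decide
    return "no-match"              # treated as distinct records

print(ucm_decision(97))  # automatic-merge
print(ucm_decision(85))  # manual-review
```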
Batch De-Duplication
Flow between Siebel (UCM or CRM) and EDQ through the shared staging database: all records are sent; an EDQ job reads the data; clusters are created and possible duplicates are identified; the EDQ job returns match scores; an automatic or manual decision is made; Siebel commits if necessary. Incremental batch de-duplication is also available: e.g. match only records updated since a specific date against their candidate duplicates.
Complementary Modes of Operation
Real-Time Duplicate Prevention: prevents duplicates from being entered, but Siebel candidate identification may limit effectiveness; run constantly. Batch De-duplication: considers all records; may be time/resource intensive on large data sets; run occasionally. Incremental Batch De-duplication: considers only 'delta' records; may not find all duplicates, but is less time/resource intensive; run regularly.
Implement Siebel Integration Through Templates
To ease initial deployment, tuneable EDQ templates are provided for: Account standardization, duplicate prevention and de-duplication. Contact standardization, duplicate prevention and de-duplication. Address standardization. Full configuration instructions also provided: See Oracle Enterprise Data Quality for Siebel guide, provided with the release.
EDQ With Oracle Data Integrator: Use Cases
One-off Profiling: understand data to build ODI transformation and mapping processes. Automated Processes: de-duplication, complex transformation and parsing called during the ODI data flow (ODI calls an EDQ job). Measure Ongoing DQ: assess the quality of data in the target system; how well is the ETL working? Diagram: Sources feed Oracle Data Integrator (with EDQ), loading Target(s), e.g. a data warehouse such as Exadata.
EDQ and ODI: Complementary Features
Oracle Data Integrator ('EtL'): extracts, transforms and loads, but 'majors' on extraction and loading. May also be used to perform simple transformations, e.g. removing inappropriate values from a gender field. Enterprise Data Quality ('eTl'): basic extraction and loading capabilities, merely a means to an end; strong matching, transformation and parsing capabilities, e.g. standardizing address fields and de-duplicating data. Use its strong profiling capabilities to design ODI processes.
EDQ Standard Installation
Windows installation (covered in this course): admin privileges are required. Copy the distribution to your local drive and run dnDirectorSetup.exe. The installer will: install the Director repository (PostgreSQL); install the Director App Server (Tomcat); deploy Director in Tomcat; create Start Menu shortcuts in 'Enterprise Data Quality'. Requires Java 1.5 or 1.6. For more information, see the Oracle EDQ Installation Notes, provided with the release.
Other EDQ Installation Options
The Business Layer and Data Storage can be installed on Unix or Linux instead of Windows. You can use an Oracle database instead of PostgreSQL, and WebLogic or another web server instead of Tomcat. These options are not covered in this course; for more information see the Oracle EDQ Advanced Installation Notes, provided with the release.
Practice 1 Overview Install Enterprise Data Quality on Windows.
Standard installation. Create a shortcut for the Launchpad. Launch Enterprise Data Quality.
Oracle Enterprise Data Quality Product Briefing
Lesson 2: The User Interface and its Key Objects
Director - User Interface
Screenshot: the main areas of the user interface are the Project Browser, Process Canvas, Tool Palette and Results Browser.
Key Objects in The Project Browser
Projects: hold related data and processes. Data Stores: connectors/interfaces to data sources outside EDQ, e.g. a file, directory or database connection. Staged Data: copies of data from Data Stores (snapshots), and data created by EDQ processes. Processes: take data, process it and deliver results. Reference Data: used by processes as part of their configuration. Jobs: used to automate and schedule processes.
Be clear on the difference between Staged Data and a snapshot: a process starts with a snapshot and writes to staged data, but a second process can read from that staged data.
The Tool Palette. Processors are grouped into families at the top; each family has its own icon, and each processor has a unique name and icon. More processors can be imported as extensions. Drag and drop processors onto the Process Canvas to use them. Use the case-insensitive search box to quickly find processors. (Demo: show the processor search and mention examples of what is in each family.)
The Process Canvas. The main work area for configuring processes. Drag and drop processors onto the canvas and connect them together to form a process. Processors on the canvas can be renamed, and connectors can be set as 'elbow' or 'straight'. Processors can be grouped together into one object and re-expanded to show all. Notes can be added for clarity. (Demo: show canvas notes, double-clicking on connectors, grouping/ungrouping and renaming.)
The Results Browser. Displays results from the process or data in snapshots. Blue text is hyperlinked: drill down to see more data. Results can be exported to Excel, Reference Data or a Results Book. (Demo: show drill-down and sorting.)
Results Books. A convenient method of capturing data from within a project. Used to report on data across several processes within a project. They can be exported and distributed. New Results Book pages can be created at any time from the Results Browser.
Interactive Help. Press F1 to display context-sensitive help.
Projects. Hold related data and processes. You can set permissions at project level. To create: right-click Projects and select New Project... We will create a new project and give it a name.
Data Stores. Connectors/interfaces to data sources outside EDQ, e.g. a file, directory or database connection. Used to import or export data. To create: right-click Data Stores and select New Data Store... We will create a new data store connecting to an MS Access file.
Snapshots. Copies of data from Data Stores (e.g. from a file or database), stored ('staged') in the EDQ repository. This reduces reads on the source system. Commonly read at the start of a process. To create: right-click Staged Data and select New Snapshot... We will create a new snapshot and import a table from the database.
Processes. Take data, process it and deliver results. Made up of a series of connected processors, beginning with a Reader. To create: right-click Processes and select New Process... We will create a new process and read and analyze data from a table.
Practices 2 – 4 Overview
Create: a Project, a Data Store, a Snapshot, a Process. Run your process. Save your data to a Results Book.
Oracle Enterprise Data Quality Product Briefing
Lesson 3: Profiling
Profiling – Fundamentals
Profiling processors: Help you to quickly understand your data. Show you: Outliers. The frequency with which particular values occur. Whether your data is in a consistent format. Number of duplicate rows. Distinct values. Etc. You can: Drill down on results to see underlying records. Save findings to results books. Create issues from results browser.
Profiling – Process Wizard
You can add profiling from within the New Process Wizard. Provides instant information from a range of processors: syntax, not meaning. Most appropriate for small samples: mass profiling takes time on long or wide data sets.
An Introduction to ComputaMend
Fictional company offering technical services: Repair/upgrade of computer & networking hardware. Data held in Service Management database. Tables: Customers Workorders Parts Payments Employees
Service Management Data Model
What are we looking for? Fitness for purpose: uniqueness (keys, redundancy); completeness; consistency (format, content); correctness (which can't be checked with profiling). Applies to: tables, columns, rows and relationships.
Practice 5 Overview: Profiling Data Using the Wizard
This practice covers the following topics: Use profiling to analyze the data. Answer the series of questions on the data.
Grouping Processors You can: Group two or more processors.
Collapse and expand the group. Grouping saves space on the Process Canvas.
Recording Issues You can: Create issues to record anomalous data.
Assign issues to other users for investigation. Use Issue Manager to view and manage issues.
Further Profiling. Wizard profiling can yield valuable information. The Tool Palette contains further profiling tools: dig deeper; target your search for data issues; profile on results from another process.
Number Profiler. The number profiler shows the distribution of numbers. Sorts numbers into pre-configured bands; the bands are set in a Reference Data table.
Practice 6 Overview: Further Profiling
This practice covers the following topics: Profile stock on hand using the Number Profiler. Check for negative numbers.
Reference Data Lists used in lookups by processors.
Can contain, for example: valid or invalid entries; valid or invalid patterns; noise values; delimiters; replacement words or characters. Integral to the operation of many processors.
Reference Data Lookups and Returns
Diagram: lookup only vs. lookup and return.
System Level and Project Level Reference Data
EDQ ships with System Level reference data that is: linked to key processors by default; used by every project in your EDQ instance; overwritten when you upgrade. You can create your own Project Level reference data that: is specific to one project only; is not overwritten when you upgrade; can be based on a copy of the system level reference data.
Oracle Enterprise Data Quality Product Briefing
Lesson 4: Auditing
Audit Processors – Fundamentals
Categorize records as: Containing valid or invalid data. Containing data in key fields or no data. Etc. Add flags to records. Facilitate branching in processes. May rely on: Reference data. Regular expressions (Regex). Can be configured via options.
Audit Processors Many audit processors are available.
Use Online help to investigate them. Through slides and practices we will examine some examples: List Check. No Data Check. Logic Check. GBR Postcode Format Check.
Categorization and Branching – List Check Processor Example
The list check processor: Compares a field's value against the contents of a reference data list. You create the reference data list. Categorizes records as valid, invalid or unknown. Facilitates branching: Valid records in one direction, invalid records in another.
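To make the categorization concrete, here is a minimal Python sketch of the same idea; the reference list and field values are invented for illustration (in EDQ the check is configured through the UI, not coded):

```python
VALID_TITLES = {"Mr", "Mrs", "Ms", "Miss", "Dr"}  # example reference data list

def list_check(value):
    """Categorize a field value as valid, invalid or unknown."""
    if value is None or not value.strip():
        return "unknown"                 # nothing to check
    return "valid" if value.strip() in VALID_TITLES else "invalid"

# Branching: each category can be routed down a different path.
for title in ("Mr", "Mrz", None):
    print(title, "->", list_check(title))
```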
Reference Data For Audit Processors
You can create reference data: For use in processes. Directly from the Results Browser.
Flags. Flags are added to data records by processors. They are hidden in the Results Browser by default, with an option to display them, and persist through a process: you can branch on a flag's value downstream of the processor that added it. Flags can be renamed and can be output by a writer.
Processor Options Most processors have configuration options.
Use F1 context-sensitive help to investigate the options for particular processors.
Audit Techniques Audit processes are typically shaped by profiling.
Verification: check against external data or a list of values (reference tables, snapshots, files). E.g. does an address exist? A marketing campaign doesn't want to send leaflets to invalid addresses. E.g. check allowed genders or area codes against a list. Validation: test or cross-check data against a business rule. E.g. is the date of death after the date of birth? Is the account code of the correct format pattern? Matching: find duplicates and redundancy within lists and between multiple lists. E.g. when an electronic parts catalogue is taken over by a larger company: which parts are in common, which belong to company A, which to company B? E.g. de-duplicate customer records before mailing a catalogue at £3 postage per address. Parsing: split up strings. E.g. product names or part numbers containing noise; an account number such as 'PH...' broken into its parts (from 1997: a number plus the account manager's initials).
No Data Check Processor
Identifies records without data. Checks attribute(s) for: Nulls Empty strings. Values consisting entirely of spaces or non-printing characters. Filter on data / no data in attribute(s).
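A rough Python equivalent of the checks listed above, purely to illustrate what 'no data' means here:

```python
def has_no_data(value):
    """True for nulls, empty strings, and values made only of spaces or non-printing characters."""
    if value is None:
        return True
    return all(ch.isspace() or not ch.isprintable() for ch in value)

for v in (None, "", "   ", "\t\n", "Smith"):
    print(repr(v), "->", "No Data" if has_no_data(v) else "Data")
```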
GBR Postcode Format Check
Uses a regular expression (regex) to categorize UK postcodes as valid or invalid. Fails if other data is contained in the string. Multiple attributes: passes if a valid postcode is found in any attribute.
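The slide does not show EDQ's shipped expression, so the sketch below uses a simplified, commonly seen UK postcode pattern to illustrate the technique; treat the pattern and function as assumptions:

```python
import re

# Simplified UK postcode shape: outward code, optional space, inward code.
# An illustrative approximation, not EDQ's actual regular expression.
POSTCODE_RE = re.compile(r"[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}")

def postcode_check(*attributes):
    """Pass if any attribute consists entirely of a validly formatted postcode."""
    return any(
        a and POSTCODE_RE.fullmatch(a.strip().upper()) for a in attributes
    )

print(postcode_check("cb4 0ws"))            # True
print(postcode_check("123 Main Street"))    # False: other data fails the check
```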
Logic Check Processor. Compares attributes with values using logical checks: for strings, equal (=) and not equal (!=); for numbers and dates, =, !=, >, <, >=, <=. You can build multiple comparisons using 'and' or 'or', and use a logic check downstream to route records depending on the values of several flags.
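For instance, routing on the values of several upstream flags might look like this in Python; the flag names are invented:

```python
def route(record):
    """Combine earlier audit flags with 'and' logic to route a record."""
    postcode_ok = record.get("PostcodeValid") == "Y"
    gender_ok = record.get("GenderValid") == "Y"
    return "pass" if (postcode_ok and gender_ok) else "review"

print(route({"PostcodeValid": "Y", "GenderValid": "Y"}))  # pass
print(route({"PostcodeValid": "Y", "GenderValid": "N"}))  # review
```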
Practices 7 and 8 Overview:
These practices cover the following topics: Practice 7: Audit the Postcode. Practice 8: Audit the GENDER Attribute: will it be possible to derive a gender?
Lookup Check Processor
Used for referential data integrity checks: can determine whether values in one table exist in another. Categorizes records as valid if the value exists in the second table, or invalid if not. See online help (F1) for processor options. (Diagram: ACCOUNT and ACCOUNT_HAS_PRODUCT tables.)
Lookup on Staged Data. You can also look up on staged data, which is used as if it were reference data. Define the lookup column (input) and, optionally, return columns (output).
Practice 9 Overview: The Lookup Check Processor
This practice covers the following topics: Use the lookup check processor to check referential data integrity between the following tables: WORKORDERS. Contains details of a repair to be carried out, including the customer's ID. E.g. Customer ID, employee to carry out repair, date targeted, equipment to be repaired etc. CUSTOMERS. Contains the customer's contact details. E.g. CU_NO, customer's name, title, address, mailing address etc. The CustomerID in the WORKORDERS table should exist as a CU_NO in the CUSTOMERS table.
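The underlying check is simple set membership. A toy Python sketch using the two tables above; the data rows are invented:

```python
cu_nos = {10001, 10002, 13977}                 # CU_NO values from CUSTOMERS
workorders = [
    {"WO_NO": 1, "CustomerID": 13977},
    {"WO_NO": 2, "CustomerID": 99999},          # orphan: no matching customer
]

# Lookup check: a work order is valid only if its CustomerID exists as a CU_NO.
for wo in workorders:
    status = "valid" if wo["CustomerID"] in cu_nos else "invalid"
    print(wo["WO_NO"], "->", status)
```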
Oracle Enterprise Data Quality Product Briefing
Lesson 5: Transforming
Transformation Processors – Fundamentals
Add: New attributes. Or New versions of attributes. May rely on: Reference data. Regular expressions (Regex). Can be configured via options. Can be used to: Prepare data for auditing or matching. Create output data to re-import back to source.
New Attributes and New Versions of Attributes
Transformation processors create either new attributes or new versions of attributes. (Diagram: a new version of an attribute vs. a new attribute.)
Transformation Processors
Many transformation processors are available. Use Online help to investigate them. Through slides and practices we will examine selected examples: Lookup and Return. Denoise. Data Type Conversion. Add Attribute. Merge Attributes.
Denoise Removes noise characters from string attributes.
May include #, !, %, * and so on. Can be defined in: Reference data. System level reference data provided with the release. You can create your own project level reference data. Directly in the processor's Options tab. Both. May be used to clean free text fields (for example, names).
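An illustrative Python version of the stripping behaviour; the noise set here stands in for what EDQ would take from reference data or the Options tab:

```python
NOISE = set("#!%*~")                 # example noise characters

def denoise(text):
    """Remove noise characters from a free-text field."""
    return "".join(ch for ch in text if ch not in NOISE)

print(denoise("J#ohn* Sm!ith"))      # John Smith
```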
Data Type Conversion Processors
Date to string. Number to date. Number to string. String to date. String to number. Useful for processors that require a specific data type, or that require multiple attributes of the same data type, e.g. the Cross Attribute Check processor.
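As an example of what such a conversion does, a hedged Python sketch of string-to-date, where the expected input format is an assumption:

```python
from datetime import datetime

def string_to_date(value):
    """Convert a string to a date; None signals a failed conversion."""
    try:
        return datetime.strptime(value, "%d/%m/%Y")   # assumed source format
    except ValueError:
        return None

print(string_to_date("25/12/2010"))   # 2010-12-25 00:00:00
print(string_to_date("not a date"))   # None
```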
Add Attribute Processors
Add String Attribute. Add Numeric Attribute. Add Date Attribute. Add Current Date. Useful for adding new data to records.
Merge Attributes Processor
Combines multiple attributes within a record into a new attribute. Takes first non-NULL value only from input attributes. The order in which attributes are merged is vital.
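The behaviour is essentially a coalesce: first non-null wins, so ordering matters. A small Python sketch with invented attribute names:

```python
def merge_attributes(record, *attrs):
    """Return the first non-null value among the named attributes, in order."""
    for attr in attrs:
        if record.get(attr) is not None:
            return record[attr]
    return None

row = {"POSTCODE": None, "ADDRESS3": "CB4 0WS"}
# POSTCODE is consulted first, so list the most trusted attribute first.
print(merge_attributes(row, "POSTCODE", "ADDRESS3"))   # CB4 0WS
```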
Practice 10 Overview: Merge Valid Postcodes
This practice covers the following topics: In the previous lesson you checked for valid postcodes in both the POSTCODE and ADDRESS3 attributes. Now merge all the valid postcodes you have found into a single new attribute…
Lookup and Return Takes values from an attribute of your snapshot.
Locates these values in lookup data. Returns other values from the same row in your lookup data. (Diagram: lookup and return.)
Practice 11 Overview: Derive Gender from Title Using Lookup and Return
This practice covers the following topics: Use the Lookup and Return processor in association with reference data to derive a gender from a title.
Oracle Enterprise Data Quality Product Briefing
Lesson 6: Writing and Exporting Data
Writer Processor. Writes to a Staged Data table. Define the attributes to output. Staged Data can then be exported to a file or database, or used in a new process.
Exporting Data Export from Staged Data or Results Book.
Exports to a defined Data Store – file/database. Can wrap exports into jobs for automation.
Practice 12 Overview: Writing out Postcode Data
This practice covers the following topics: In the previous lesson you cleansed your data by creating a new attribute to contain all valid postcodes. You are now going to stage this data and export it as a CSV file. Optional Practice 13 Overview: Writing Out Data With Derived Genders. This practice covers the following topics: In this practice you will write out two data streams: one for records that now include a gender, another for any records where the gender is still NULL.
Oracle Enterprise Data Quality Product Briefing
Lesson 7: Automated Processing: Jobs
Jobs: Use Case Problem: you have one or more processes that deal with lots of data. They: Need to be run regularly. Use lots of processing power. May take a long time to run. Solution: schedule a job to run overnight.
Jobs: Key Features
Executes in sequence one or more: snapshots, processes, exports, results book exports or external tasks. May be scheduled. Can be configured with notification on completion, including job status, warnings and errors. May be divided into phases. Controls process-level configuration: sampling, dashboard publishing.
Data Import and Export: Server Side Only
Data Store connections for imports (snapshots) and exports must be server side: Data files for imports must be placed in the landing area. Export files will be written to the landing area. Default landing area is: C:\Program Files\Datanomic\dnDirector\config\landingarea Databases with JDBC connections do not have to be in landing area. Client side snapshots and exports cannot be included in jobs.
Phases of a Job
A job can be divided into sequential phases that are executed conditionally (execute on success, execute on failure) or that execute regardless of the previous phase's success or failure.
Triggers
Jobs can include: Pre-configured triggers: run job triggers, used to start one job from within another; shutdown web service triggers, used to shut down real-time processing. Customized triggers: e.g. send a JMS message or call a web service (contact Oracle Consulting for more information). Pre-configured triggers can be accessed from the Tool Palette.
Running Jobs from the Command Line
You can run jobs from the command line: Allows integration with external scheduler or application. Look in online help for syntax and more information: see the Jobs topic.
The Event Log. Used to debug errors in jobs and processes.
Optional Practice 14 Overview: Jobs and Scheduling
This practice covers the following topics: Create Server-side Data Stores. Alter your configuration to use the new Data Stores. Create a New Job. Run Your New Job. Examine the Event Log. Schedule Your Job to Run Automatically.
Oracle Enterprise Data Quality Product Briefing
Lesson 8: Web Services & Data Interfaces
Why Process in Real-Time?
Why deal with data quality in real-time? To control data quality at the point of entry. For example: real-time duplicate prevention, via real-time matching. Check data in real-time: e.g. pattern-check an account code, check that a numeric field is within a given range, perform a suspect-data check, etc. Transform data as it is entered: e.g. denoise, trim whitespace, proper-case, etc.
How Does Real-Time Processing Work?
Real-time processing uses web services. Web services: allow applications to exchange data over HTTP(S); are used as delivery mechanisms for XML messages; can be created automatically in Enterprise Data Quality; deliver inputs to and carry outputs from an EDQ process; are accessed via a reader (inputs) and a writer (outputs); can be tested via EDQ's Web Service Tester; and are described by an automatically generated WSDL file.
Configuring Web Services
Create a single Web Service within a Director Project. Define its inputs and outputs. Create an EDQ process: Configure a reader to accept input from your Web Service. Configure a writer to deliver output to your Web Service. Run the process. Use EDQ's Web Service Tester to test the Web Service. Find your WSDL (Web Service Description Language) file and display it in a browser.
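To make the message flow concrete, here is a hedged sketch of calling an EDQ-style web service from Python. The endpoint URL, XML namespace and element names are invented for illustration; in practice you would take all of them from the WSDL file that EDQ generates:

```python
import urllib.request

# Hypothetical endpoint and payload; the real values come from the generated WSDL.
URL = "http://edq-server:8080/webservices/CheckContact"
SOAP_BODY = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <request xmlns="http://example.com/edq">
      <title>Mr</title>
      <postcode>CB4 0WS</postcode>
    </request>
  </soap:Body>
</soap:Envelope>"""

req = urllib.request.Request(
    URL,
    data=SOAP_BODY.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))   # XML response carrying the process output
```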
Why Use Data Interfaces?
Why build a generic process? To create process templates, which can be used by system integrators and consultants who deal with many different clients with similar types of data, and by customers that need to process multiple data sources without creating individually tailored processes. For example: processes that require different input and output sources (eliminates the need to modify a process when the data sources change); processes that must run in both batch and real-time (configure jobs with different input and output sources depending on the runtime requirements). Leverage the maturity of existing processes: enables very fast results by leveraging existing processes with different data sources.
How Do Data Interfaces Work?
Available data sources (input and output): relational databases, Excel spreadsheets, text files; reference data; real-time/web services. Predefined structures (input and output): define the input and output structures once, then create mappings for the many different data sources. Can be tested via EDQ Director and the Web Service Tester.
Configuring Data Interfaces
Create a Data Interface within a Director Project. Define its attribute structure. Create at least one associated mapping that links a data file to the Data Interface structure. Create an EDQ process: Configure a reader to accept input from the Data Interface. Configure a writer to deliver output to the Data Interface. Run the process. Use EDQ's Director or Web Service Tester to test the Data Interface. You will require two unique Data Interfaces if the input and output formats are different.
Oracle Enterprise Data Quality Product Briefing
Lesson 9: Re-Using Your Work: Publishing, Packaging and Copying
Publishing Processors: Use Case
You have configured, refined and tested a process featuring several processors. E.g. profile a date held in a string attribute: a No Data Check on the attribute; for records with data, convert string to date; for records that are successfully converted, the Date Profiler. Your process is generic and could be re-used: as part of another process in your project, or in different projects.
Publishing Processors: Step 1
Make a processor: take a process (a group of linked processors) and make it into a processor that: will have a common set of inputs and outputs (you can rename the inputs); includes the functionality of all the processors within it; cannot include Readers or Writers.
Publishing Processors: Step 2
Publish the processor: Give it a name. It becomes available from the tool palette. Your new processor can be dragged into any process in any project. It may be modified if necessary. See online help for more information.
Packaging Objects into a .dxi File
Packaging objects creates a .dxi file. This can be: stored offline, e-mailed, or imported into different projects on different servers. It is portable and reusable.
Packaging Objects Package includes: Configuration. Reference data.
Does not include staged data.
What can be packaged? Most Director objects can be packaged. E.g.:
Whole Project. All projects on a server. All processes within a project. A single process. Reference data. Published processors etc.
Copying via Drag and Drop
To copy: You can drag objects from one project to another. Useful for: Reference Data. Not as useful for: Objects that are linked to other objects: E.g. a results book (results come from processes)
Optional Practice 15 Overview: Publishing a New Processor
This practice covers the following topics: Make and Publish a Processor. Optional Practice 16 Overview: Working With Packages. This practice covers the following topics: Creating and importing packages.
Oracle Enterprise Data Quality Product Briefing
Lesson 10: Publishing to the Dashboard
Dashboard Overview (1). Used to display data quality key performance indicators over time.
Dashboard Overview (2). The Dashboard displays results from audit and parse processors, known as rules. It can aggregate and summarize results, and different KPIs can be presented to different user groups. You can adjust success thresholds. You can add three kinds of element to the dashboard: summaries, indexes and real-time aggregations.
Summaries Display results from a group of rules.
Based on traffic light principle: Number of rules that are red, amber, green. Created: Automatically when you publish a process to the dashboard. Manually in Dashboard Administration.
Indexes Provide a numerical value between 1 & 1000 for a group of rules. You can weight each rule’s contribution to the index. Based on stock-market index principle. Value changes over time to reflect data quality. Created manually in Dashboard Administration.
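A toy illustration of a weighted index over a group of rules; the weights, pass rates and the exact scaling are assumptions based only on the 1 to 1000 description above:

```python
def dq_index(pass_rates, weights):
    """Weighted average of rule pass rates (0.0 to 1.0), scaled to 1-1000."""
    total = sum(weights.values())
    score = sum(pass_rates[rule] * weights[rule] for rule in pass_rates) / total
    return max(1, round(score * 1000))

rates = {"postcode_valid": 0.92, "gender_derived": 0.75}
weights = {"postcode_valid": 2.0, "gender_derived": 1.0}  # postcode counts double
print(dq_index(rates, weights))   # 863
```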
Thresholds Define red, amber and green.
Applied to: rules, summaries, indexes and real-time aggregators. Default thresholds: apply globally; can be based on percentages, counts or index values. Custom thresholds: apply to specific dashboard elements only; based on counts or index values.
Drill-Down You can drill-down on summaries and indexes:
Displays statistics from each rule. You can view a rule's graph of data quality over time.
Optional Practice 17 Overview: Configuring the Dashboard
This practice covers the following topics: Add a Summary to the Dashboard. Adjust Thresholds. Add an Index.
Oracle Enterprise Data Quality Product Briefing
Appendix: Profile, Audit, Transform and Other Fundamentals
Oracle Enterprise Data Quality Product Briefing
Appendix 1: Sampling
Sampling – Why do it?
You can profile 50 million records, but it will take some time and you will need powerful hardware; you can iterate more quickly with a subset of the data. Recommended approach: profile a sample; iterate to derive audit rules; run the audit rules on the entire data set. A 1% sample will tell you how big the data set is (it gives a record count): build some processes on ~50,000 records, then let them loose on the full data set.
What Sort of Samples are Supported?
There are two stages at which you can sample: when creating or modifying a snapshot, and in the run preferences of a process. Methods of sampling: percentage or count. Directions: ascending and descending. You can also offset the sample.
Percentage and Count
Percentage: takes X contiguous records from every 100. E.g. 20%: takes the first 20 records, ignores the next 80, takes the next 20, ignores the next 80, and so on. Count: takes a group of contiguous records. E.g. 20 ascending: takes the first 20 records.
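A small Python sketch of the two methods exactly as described (contiguous blocks per hundred for percentage, a leading block for count); purely illustrative:

```python
def percentage_sample(records, pct):
    """Take the first `pct` records of every contiguous block of 100."""
    return [r for i, r in enumerate(records) if i % 100 < pct]

def count_sample(records, n):
    """Take the first n records (count method, ascending)."""
    return records[:n]

data = list(range(250))
print(len(percentage_sample(data, 20)))  # 60: 20 from each block starting at 0, 100, 200
print(count_sample(data, 5))             # [0, 1, 2, 3, 4]
```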
Suggested Sampling Strategy
Sample randomly: some from the top, some from the middle, some from the end. Be careful with SQL joins. Take enough to be representative, but too much will slow down your investigations: start small and build up iteratively.
Joins can give falsely duplicated lines: e.g. two people at one address means the joined table returns two rows although there is only one row in the ADDRESSES table, which matters when sending out expensive catalogues in the post. An equijoin can also hide the very noise you are trying to detect: the 'WHERE column1 = column2' clause will drop values that should not be there.
Using Filters (1) Filtering returns a sub-set of data. Examples:
CU_NO = 13977. BALANCE > 0. NAME like 'Pet'. Only the sub-set is processed. Isolates a single record or a group of similar records. Can be used in combination with sampling, or instead of sampling.
Using Filters (2). You can filter: when creating or modifying a snapshot (includes an Advanced option to write your own SQL WHERE clause), or in the run preferences of a process. You can only filter on one attribute at a time.
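The Advanced option takes a hand-written WHERE clause, which (unlike the one-attribute filter) can combine conditions. A self-contained Python sketch using SQLite and the filter examples from the previous slide; the table and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (CU_NO INTEGER, NAME TEXT, BALANCE REAL)")
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?)",
    [(13977, "Peter Pan", 10.0), (10001, "Wendy Darling", 0.0)],
)

# Equivalent of an Advanced snapshot filter: your own SQL WHERE clause.
where = "BALANCE > 0 AND NAME LIKE 'Pet%'"
rows = conn.execute(f"SELECT CU_NO, NAME FROM CUSTOMERS WHERE {where}").fetchall()
print(rows)   # [(13977, 'Peter Pan')]
```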
Oracle Enterprise Data Quality Product Briefing
Appendix 2: Users and Security
Overview of User Security
You can control access to: applications, functions and projects. (Diagram: users belong to groups; groups carry application permissions, functional permissions and project access.)
User Configuration (1). Users have a case-sensitive username and password, with an option to force a password change on first login. Password strength and account security can be set globally. A user's email address is used to send job execution messages, issue notifications, etc.; this requires SMTP server configuration.
User Configuration (2) Users cannot be assigned permissions directly.
Users should be assigned to one or more groups. Assign permissions to users through those groups.
Group Configuration Groups can have one or more:
Application permissions e.g. permission to use Director. Functional Permissions: Add Process. Modify Job. Administer Issues. View Dashboard etc. Projects that they can access. E.g. Finance group can access the financial data project.
Default Groups Default groups are created at install: Administrators.
Data Analysts. Data Stewards. Executives. Match Reviewers. Review Managers. You can: Create your own groups. Modify the default groups.
The dnadmin User The dnadmin user: Is created at install.
Is assigned to the Administrators group. Functions as a superuser. Has a default password of dnadmin. To secure your system, change dnadmin’s default password.
The Add Project Permission
A group that has the Add Project permission can view any project. By default the Administrators and Data Analysts groups both have the Add Project permission. You can regard the default Data Analyst group as ‘power users’: By default, data analysts have wide-ranging permissions over all projects.
Partitioning Your Data
To partition your data (e.g. by business units – HR, Marketing etc.): Create a replica of each default group for each business unit, e.g: HR Data Analyst, HR Data Steward, Marketing Data Analyst, Marketing Data Steward, Etc. Ensure none of the groups have the add project permission. Assign groups appropriately to users and projects. E.g. So that users in the HR group can only access the HR project etc.
Application Permissions
Remember that: Groups need to be assigned application permissions (e.g. permission to use Director). Without this your security configuration will not work.
Optional Practice 18 Overview
Users and Security: Create a Group. Create a User. Assign Your Group to an Application. Assign Your Group to a Project. Test Your Configuration.
Continuing To Learn After the Product Briefing
To continue your learning after the briefing has finished: Audit Case Study: Assess Contact Details: Using a range of Audit processors to check the quality of customers' contact details. Further Exercises: Four further exercises are available. These will help you to practice the techniques that you have learned during the product briefing. See the EDQ microsite for worked solutions for the case study and further exercises.
Summary. In this course, you have learned how to: Profile data to understand it. Check data quality using audit processors. Transform data to improve its quality or to re-purpose it. Export data from Enterprise Data Quality (EDQ). Automate and schedule processes using jobs. Re-use your configuration with different processes and data sets. Describe the key features and high-level architecture of EDQ.