1
DataStage Enterprise Edition
Enterprise Edition DSEE Contact: Learning Services Ascential Software 50 Washington Street Westboro, MA 01581 x3345
2
Proposed Course Agenda
DataStage Enterprise Edition Proposed Course Agenda Day 1 Review of EE Concepts Sequential Access Best Practices DBMS as Source Day 2 EE Architecture Transforming Data DBMS as Target Sorting Data Day 3 Combining Data Configuration Files Extending EE Meta Data in EE Day 4 Job Sequencing Testing and Debugging Draft agenda only.
3
Contents Introduction DataStage Administrator DataStage Manager
DataStage Enterprise Edition Table of Contents (Topic / Page)
Agenda
Starting point for those new to DataStage:
1: Intro – Part 1, Introduction to DataStage EE
2: Intro – Part 2, Configuring Projects, 21
3: Intro – Part 3, Managing Meta Data, 32
4: Intro – Part 4, Designing Jobs, 51
5: Intro – Part 5, Running Jobs, 80
Starting point for those new to Enterprise Edition:
1: Review, 89
2: Sequential Access, 127
3: Standards, 158
4: DBMS Access, 176
5: Platform Architecture, 196
6: Transforming Data
7: Sorting Data
8: Combining Data
9: Configuration Files, 277
10: Extending DataStage EE, 294
11: Meta Data, 334
12: Job Control, 355
13: Testing
Additional topics: Introduction, DataStage Administrator, DataStage Manager, DataStage Designer, DataStage Director, ODBC/Relational Stages, Intelligent Assistants, Hashed Files, Transforming Data, DataStage BASIC, Repository Objects, Performance Enhancements, Job Sequencer
4
DataStage Enterprise Edition
The Course Material Course Manual Exercise Files and Exercise Guide The course material includes the following: Course Manual is a duplication of the slides used during the course with notes at the bottom of the page to provide additional topic information Exercises reinforce the concepts taught during the lecture Online Help
5
Using the Course Material
DataStage Enterprise Edition Using the Course Material Suggestions for learning Take notes Review previous material Practice Learn from errors Everyone has their style of learning that works best for them. Here are a few suggestions for working with this course material. Take notes Write comments and notes on the slides and in the lined note section Review previous material This course material contains a great deal of information. Remind yourself of what you’ve covered. Don’t let it evaporate Practice Explore the interfaces and see how you can customize your tasks or find shortcuts Learn from errors Record solutions to your errors to avoid repeating the same errors Work with other students
6
DataStage Enterprise Edition Introduction to DataStage EE
Intro Part 1 Introduction to DataStage EE
7
DataStage Enterprise Edition
What is DataStage? Design jobs for Extraction, Transformation, and Loading (ETL) Ideal tool for data integration projects – such as data warehouses, data marts, and system migrations Import, export, create, and manage metadata for use within jobs Schedule, run, and monitor jobs all within DataStage Administer your DataStage development and execution environments DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need. With DataStage you can: · Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart. · Create and reuse metadata and job components. · Run, monitor, and schedule these jobs. · Administer your development and execution environments.
8
DataStage Server and Clients
DataStage Enterprise Edition DataStage Server and Clients The DataStage client components are: Administrator Administers DataStage projects and conducts housekeeping on the server Designer Creates DataStage jobs that are compiled into executable programs Director Used to run and monitor the DataStage jobs Manager Allows you to view and edit the contents of the repository
9
DataStage Administrator
DataStage Enterprise Edition DataStage Administrator Use the Administrator to specify general server defaults, add and delete projects, and to set project properties. The Administrator also provides a command interface to the UniVerse repository. · Use the Administrator Project Properties window to: · Set job monitoring limits and other Director defaults on the General tab. · Set user group privileges on the Permissions tab. · Enable or disable server-side tracing on the Tracing tab. · Specify a user name and password for scheduling jobs on the Schedule tab. · Specify hashed file stage read and write cache sizes on the Tunables tab.
10
DataStage Enterprise Edition
Client Logon
11
DataStage Enterprise Edition
DataStage Manager Use the Manager to store and manage reusable metadata for the jobs you define in the Designer. This metadata includes table and file layouts and routines for transforming extracted data. Manager is also the primary interface to the DataStage repository. In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. Custom routines and transforms can also be created in Manager.
12
DataStage Enterprise Edition
DataStage Designer The DataStage Designer allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating and loading data into warehouse tables. The Designer provides a “visual data flow” method to easily interconnect and configure reusable components.
13
DataStage Enterprise Edition
DataStage Director Use the Director to validate, run, schedule, and monitor your DataStage jobs. You can also gather statistics as the job runs.
14
Developing in DataStage
DataStage Enterprise Edition Developing in DataStage Define global and project properties in Administrator Import meta data into Manager Build job in Designer Compile Designer Validate, run, and monitor in Director · Define your project’s properties: Administrator · Open (attach to) your project · Import metadata that defines the format of data stores your jobs will read from or write to: Manager · Design the job: Designer - Define data extractions (reads) - Define data flows - Define data integration - Define data transformations - Define data constraints - Define data loads (writes) - Define data aggregations · Compile and debug the job: Designer · Run and monitor the job: Director
15
DataStage Enterprise Edition
DataStage Projects All your work is done in a DataStage project. Before you can do anything, other than some general administration, you must open (attach to) a project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator. A project is associated with a directory. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata. You must open (attach to) a project before you can do any work in it. Projects are self-contained. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them. Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from accessing the same job at the same time.
16
DataStage Enterprise Edition
Quiz– True or False DataStage Designer is used to build and compile your ETL jobs Manager is used to execute your jobs after you build them Director is used to execute your jobs after you build them Administrator is used to set global and project properties
17
DataStage Enterprise Edition Configuring Projects
Intro Part 2 Configuring Projects
18
DataStage Enterprise Edition
Module Objectives After this module you will be able to: Explain how to create and delete projects Set project properties in Administrator Set EE global properties in Administrator
19
DataStage Enterprise Edition
Project Properties Projects can be created and deleted in Administrator Project properties and defaults are set in Administrator Recall from module 1: In DataStage all development work is done within a project. Projects are created during installation and after installation using Administrator. Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project. Before you can work in a project you must attach to it (open it). You can set the default properties of a project using DataStage Administrator.
20
Setting Project Properties
DataStage Enterprise Edition Setting Project Properties To set project properties, log onto Administrator, select your project, and then click “Properties” The logon screen for Administrator does not provide the option to select a specific project (unlike the other DataStage clients).
21
DataStage Enterprise Edition
Licensing Tab The Licensing Tab is used to change DataStage license information.
22
DataStage Enterprise Edition
Projects General Tab Click Properties on the DataStage Administration window to open the Project Properties window. There are nine tabs. (The Mainframe tab is only enabled if your license supports mainframe jobs.) The default is the General tab. If you select the Enable job administration in Director box, you can perform some administrative functions in Director without opening Administrator. When a job is run in Director, events are logged describing the progress of the job. For example, events are logged when a job starts, when it stops, and when it aborts. The number of logged events can grow very large. The Auto-purge of job log option allows you to specify conditions for purging these events. You can limit the logged events either by number of days or number of job runs.
23
Environment Variables
DataStage Enterprise Edition Environment Variables
24
DataStage Enterprise Edition
Permissions Tab Use this page to set user group permissions for accessing and using DataStage. All DataStage users must belong to a recognized user role before they can log on to DataStage. This helps to prevent unauthorized access to DataStage projects. There are three roles of DataStage user: · DataStage Developer, who has full access to all areas of a DataStage project. · DataStage Operator, who can run and manage released DataStage jobs. · <None>, who does not have permission to log on to DataStage. UNIX note: In UNIX, the groups displayed are defined in /etc/group.
25
DataStage Enterprise Edition
Tracing Tab This tab is used to enable and disable server-side tracing. The default is for server-side tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files. Users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client. Warning: Tracing causes a lot of server system overhead. This should only be used to diagnose serious problems.
26
DataStage Enterprise Edition
Tunables Tab On the Tunables tab, you can specify the sizes of the memory caches used when reading rows in hashed files and when writing rows to hashed files. Hashed files are mainly used for lookups and are discussed in a later module.
27
DataStage Enterprise Edition
Parallel Tab You should enable OSH for viewing – OSH is generated when you compile a job.
28
DataStage Enterprise Edition Managing Meta Data
Intro Part 3 Managing Meta Data
29
DataStage Enterprise Edition
Module Objectives After this module you will be able to: Describe the DataStage Manager components and functionality Import and export DataStage objects Import metadata for a sequential file
30
DataStage Enterprise Edition
What Is Metadata? (Diagram: data flows from Source through Transform to Target; meta data describing each is stored in the Repository.) Metadata is “data about data” that describes the formats of sources and targets. This includes general format information such as whether the record columns are delimited and, if so, the delimiting character. It also includes the specific column definitions.
31
DataStage Enterprise Edition
DataStage Manager DataStage Manager is a graphical tool for managing the contents of your DataStage project repository, which contains metadata and other DataStage components such as jobs and routines. The left pane contains the project tree. There are seven main branches, but you can create subfolders under each. Select a folder in the project tree to display its contents.
32
DataStage Enterprise Edition
Manager Contents Metadata describing sources and targets: Table definitions DataStage objects: jobs, routines, table definitions, etc. DataStage Manager manages two different types of objects: · Metadata describing sources and targets: - Called table definitions in Manager. These are not to be confused with relational tables. DataStage table definitions are used to describe the format and column definitions of any type of source: sequential, relational, hashed file, etc. - Table definitions can be created in Manager or Designer and they can also be imported from the sources or targets they describe. · DataStage components - Every object in DataStage (jobs, routines, table definitions, etc.) is stored in the DataStage repository. Manager is the interface to this repository. - DataStage components, including whole projects, can be exported from and imported into Manager.
33
DataStage Enterprise Edition
Import and Export Any object in Manager can be exported to a file Can export whole projects Use for backup Sometimes used for version control Can be used to move DataStage objects from one project to another Use to share DataStage jobs and projects with other developers Any set of DataStage objects, including whole projects, which are stored in the Manager Repository, can be exported to a file. This export file can then be imported back into DataStage. Import and export can be used for many purposes, including: · Backing up jobs and projects. · Maintaining different versions of a job or project. · Moving DataStage objects from one project to another. Just export the objects, move to the other project, then re-import them into the new project. · Sharing jobs and projects between developers. The export files, when zipped, are small and can be easily sent from one developer to another.
34
DataStage Enterprise Edition
Export Procedure In Manager, click “Export>DataStage Components” Select DataStage objects for export Specify the type of export: DSX or XML Specify file path on client machine Click Export>DataStage Components in Manager to begin the export process. Any object in Manager can be exported to a file. Use this procedure to backup your work or to move DataStage objects from one project to another. Select the types of components to export. You can select either the whole project or select a portion of the objects in the project. Specify the name and path of the file to export to. By default, objects are exported to a text file in a special format. By default, the extension is dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server.
35
DataStage Enterprise Edition
Quiz: True or False? You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file. True or False? You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file. True: Incorrect. Metadata describing files and relational tables are stored as "Table Definitions". Table definitions can be exported and imported as any DataStage objects can. False: Correct! Metadata describing files and relational tables are stored as "Table Definitions". Table definitions can be exported and imported as any DataStage objects can.
36
DataStage Enterprise Edition
Quiz: True or False? The directory to which you export is on the DataStage client machine, not on the DataStage server machine. True or False? The directory you export to is on the DataStage client machine, not on the DataStage server machine. True: Correct! The directory you select for export must be addressable by your client machine. False: Incorrect. The directory you select for export must be addressable by your client machine.
37
Exporting DataStage Objects
DataStage Enterprise Edition Exporting DataStage Objects
38
Exporting DataStage Objects
DataStage Enterprise Edition Exporting DataStage Objects
39
DataStage Enterprise Edition
Import Procedure In Manager, click “Import>DataStage Components” Select DataStage objects for import To import DataStage components, click Import>DataStage Components. Select the file to import. Click Import all to begin the import process or Import selected to view a list of the objects in the import file. You can import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning.
40
Importing DataStage Objects
DataStage Enterprise Edition Importing DataStage Objects
41
DataStage Enterprise Edition
Import Options
42
DataStage Enterprise Edition
Exercise Import DataStage Component (table definition)
43
DataStage Enterprise Edition
Metadata Import Import format and column definitions from sequential files Import relational table column definitions Imported as “Table Definitions” Table definitions can be loaded into job stages Table definitions define the formats of a variety of data files and tables. These definitions can then be used and reused in your jobs to specify the formats of data stores. For example, you can import the format and column definitions of the Customers.txt file. You can then load this into the sequential source stage of a job that extracts data from the Customers.txt file. You can load this same metadata into other stages that access data with the same format. In this sense the metadata is reusable. It can be used with any file or data store with the same format. If the column definitions are similar to what you need you can modify the definitions and save the table definition under a new name. You can import and define several different kinds of table definitions including: Sequential files and ODBC data sources.
44
Sequential File Import Procedure
DataStage Enterprise Edition Sequential File Import Procedure In Manager, click Import>Table Definitions>Sequential File Definitions Select directory containing sequential file and then the file Select Manager category Examine format and column definitions and edit if necessary To start the import, click Import>Table Definitions>Sequential File Definitions. The Import Meta Data (Sequential) window is displayed. Select the directory containing the sequential files. The Files box is then populated with the files you can import. Select the file to import. Select or specify a category (folder) to import into. · The format is: <Category>\<Sub-category> · <Category> is the first-level sub-folder under Table Definitions. · <Sub-category> is (or becomes) a sub-folder under the type.
45
Manager Table Definition
DataStage Enterprise Edition Manager Table Definition In Manager, select the category (folder) that contains the table definition. Double-click the table definition to open the Table Definition window. Click the Columns tab to view and modify any column definitions. Select the Format tab to edit the file format specification.
46
Importing Sequential Metadata
DataStage Enterprise Edition Importing Sequential Metadata
47
DataStage Enterprise Edition Designing and Documenting Jobs
Intro Part 4 Designing and Documenting Jobs
48
DataStage Enterprise Edition
Module Objectives After this module you will be able to: Describe what a DataStage job is List the steps involved in creating a job Describe links and stages Identify the different types of stages Design a simple extraction and load job Compile your job Create parameters to make your job flexible Document your job
49
DataStage Enterprise Edition
What Is a Job? Executable DataStage program Created in DataStage Designer, but can use components from Manager Built using a graphical user interface Compiles into Orchestrate shell language (OSH) A job is an executable DataStage program. In DataStage, you can design and run jobs that perform many useful data integration tasks, including data extraction, data conversion, data aggregation, data loading, etc. DataStage jobs are: · Designed and built in Designer. · Scheduled, invoked, and monitored in Director. · Executed under the control of DataStage.
50
Job Development Overview
DataStage Enterprise Edition Job Development Overview In Manager, import metadata defining sources and targets In Designer, add stages defining data extractions and loads Add Transformers and other stages to define data transformations Add links defining the flow of data from sources to targets Compile the job In Director, validate, run, and monitor your job In this module, you will go through the whole process with a simple job, except for the first bullet. In this module you will manually define the metadata.
51
DataStage Enterprise Edition
Designer Work Area The appearance of the designer work space is configurable; the graphic shown here is only one example of how you might arrange components. In the right center is the Designer canvas, where you create stages and links. On the left is the Repository window, which displays the branches in Manager. Items in Manager, such as jobs and table definitions can be dragged to the canvas area. Click View>Repository to display the Repository window.
52
DataStage Enterprise Edition
Designer Toolbar Provides quick access to the main functions of Designer Show/hide metadata markers Job properties Compile
53
DataStage Enterprise Edition
Tools Palette The tool palette contains icons that represent the components you can add to your job design. You can also install additional stages called plug-ins for special purposes.
54
Adding Stages and Links
DataStage Enterprise Edition Adding Stages and Links Stages can be dragged from the tools palette or from the stage type branch of the repository view Links can be drawn from the tools palette or by right clicking and dragging from one stage to another
55
DataStage Enterprise Edition
Sequential File Stage Used to extract data from, or load data to, a sequential file Specify full path to the file Specify a file format: fixed width or delimited Specify column definitions Specify write action The Sequential stage is used to extract data from a sequential file or to load data into a sequential file. The main things you need to specify when editing the sequential file stage are the following: · Path and name of file · File format · Column definitions · If the sequential stage is being used as a target, specify the write action: Overwrite the existing file or append to it.
56
Job Creation Example Sequence
DataStage Enterprise Edition Job Creation Example Sequence Brief walkthrough of procedure Presumes meta data already loaded in repository
57
Designer - Create New Job
DataStage Enterprise Edition Designer - Create New Job Several types of DataStage jobs: Server – not covered in this course. However, you can create server jobs, convert them to a container, then use this container in a parallel job. However, this has negative performance implications. Shared container (parallel or server) – contains reusable components that can be used by other jobs. Mainframe – DataStage 390, which generates Cobol code Parallel – this course will concentrate on parallel jobs. Job Sequence – used to create jobs that control execution of other jobs.
58
Drag Stages and Links Using Palette
DataStage Enterprise Edition Drag Stages and Links Using Palette The tools palette may be shown as a floating dock or placed along a border. Alternatively, it may be hidden and the developer may choose to pull needed stages from the repository onto the design work area.
59
DataStage Enterprise Edition
Assign Meta Data Meta data may be dragged from the repository and dropped on a link.
60
Editing a Sequential Source Stage
DataStage Enterprise Edition Editing a Sequential Source Stage Any required properties that are not completed will appear in red. You are defining the format of the data flowing out of the stage, that is, to the output link. Define the output link listed in the Output name box. You are defining the file from which the job will read. If the file doesn’t exist, you will get an error at run time. On the Format tab, you specify a format for the source file. You will be able to view its data using the View data button. Think of a link as like a pipe. What flows in one end flows out the other end (at the transformer stage).
61
Editing a Sequential Target
DataStage Enterprise Edition Editing a Sequential Target Defining a sequential target stage is similar to defining a sequential source stage. You are defining the format of the data flowing into the stage, that is, from the input links. Define each input link listed in the Input name box. You are defining the file the job will write to. If the file doesn’t exist, it will be created. Specify whether to overwrite or append the data in the Update action set of buttons. On the Format tab, you can specify a different format for the target file than you specified for the source file. If the target file doesn’t exist, you will not (of course!) be able to view its data until after the job runs. If you click the View data button, DataStage will return a “Failed to open …” error. The column definitions you defined in the source stage for a given (output) link will appear already defined in the target stage for the corresponding (input) link. Think of a link as like a pipe. What flows in one end flows out the other end. The format going in is the same as the format going out.
62
DataStage Enterprise Edition
Transformer Stage Used to define constraints, derivations, and column mappings A column mapping maps an input column to an output column In this module we will just define column mappings (no derivations) In the Transformer stage you can specify: · Column mappings · Derivations · Constraints A column mapping maps an input column to an output column. Values are passed directly from the input column to the output column. Derivations calculate the values to go into output columns based on values in zero or more input columns. Constraints specify the conditions under which incoming rows will be written to output links.
63
Transformer Stage Elements
DataStage Enterprise Edition Transformer Stage Elements There are two: transformer and basic transformer. Both look the same but access different routines and functions. Notice the following elements of the transformer: The top, left pane displays the columns of the input links. The top, right pane displays the contents of the stage variables. The lower, right pane displays the contents of the output link. Unresolved column mapping will show the output in red. For now, ignore the Stage Variables window in the top, right pane. This will be discussed in a later module. The bottom area shows the column definitions (metadata) for the input and output links.
64
Create Column Mappings
DataStage Enterprise Edition Create Column Mappings
65
Creating Stage Variables
DataStage Enterprise Edition Creating Stage Variables Stage variables are used for a variety of purposes: Counters Temporary registers for derivations Controls for constraints
66
DataStage Enterprise Edition
Result
67
DataStage Enterprise Edition
Adding Job Parameters Makes the job more flexible Parameters can be: Used in constraints and derivations Used in directory and file names Parameter values are determined at run time
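As a hedged illustration of how parameters are referenced (the parameter name and path here are hypothetical, not part of the course files), a job parameter such as SourceDir can be embedded in a stage property by surrounding its name with # characters:

  File = #SourceDir#/Customers.txt

At run time the Job Run Options dialog prompts for a value (for example /data/dev), and DataStage substitutes it before the job runs, so the same compiled job can point at development, test, or production directories without being edited.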
68
Adding Job Documentation
DataStage Enterprise Edition Adding Job Documentation Job Properties Short and long descriptions Shows in Manager Annotation stage Is a stage on the tool palette Shows on the job GUI (work area)
69
Job Properties Documentation
DataStage Enterprise Edition Job Properties Documentation
70
Annotation Stage on the Palette
DataStage Enterprise Edition Annotation Stage on the Palette Two versions of the annotation stage are available: Annotation Description Annotation The difference will be evident on the following slides.
71
Annotation Stage Properties
DataStage Enterprise Edition Annotation Stage Properties You can type in whatever you want; the default text comes from the short description of the job's properties you entered, if any. Add one or more Annotation stages to the canvas to document your job. An Annotation stage works like a text box with various formatting options. You can optionally show or hide the Annotation stages by pressing a button on the toolbar. There are two Annotation stages. The Description Annotation stage is discussed in a later slide.
72
Final Job Work Area with Documentation
DataStage Enterprise Edition Final Job Work Area with Documentation
73
DataStage Enterprise Edition
Compiling a Job Before you can run your job, you must compile it. To compile it, click File>Compile or click the Compile button on the toolbar. The Compile Job window displays the status of the compile. A compile will generate OSH.
74
Errors or Successful Message
DataStage Enterprise Edition Errors or Successful Message If an error occurs: Click Show Error to identify the stage where the error occurred. This will highlight the stage in error. Click More to retrieve more information about the error. This can be lengthy for parallel jobs.
75
DataStage Enterprise Edition Running Jobs
Intro Part 5 Running Jobs
76
DataStage Enterprise Edition
Module Objectives After this module you will be able to: Validate your job Use DataStage Director to run your job Set run options Monitor your job’s progress View job log messages
77
Prerequisite to Job Execution
DataStage Enterprise Edition Prerequisite to Job Execution Result from Designer compile
78
DataStage Enterprise Edition
DataStage Director Can schedule, validate, and run jobs Can be invoked from DataStage Manager or Designer Tools > Run Director As you know, you run your jobs in Director. You can open Director from within Designer by clicking Tools>Run Director. In a similar way, you can move between Director, Manager, and Designer. There are two methods for running a job: · Run it immediately. · Schedule it to run at a later time or date. To run a job immediately: · Select the job in the Job Status view. The job must have been compiled. · Click Job>Run Now or click the Run Now button in the toolbar. The Job Run Options window is displayed.
79
DataStage Enterprise Edition
Running Your Job This shows the Director Status view. To run a job, select it and then click Job>Run Now. Better yet: Shift to log view from main Director screen. Then click green arrow to execute job.
80
Run Options – Parameters and Limits
DataStage Enterprise Edition Run Options – Parameters and Limits The Job Run Options window is displayed when you click Job>Run Now. This window allows you to stop the job after: · A certain number of rows. · A certain number of warning messages. You can validate your job before you run it. Validation performs some checks that are necessary in order for your job to run successfully. These include: · Verifying that connections to data sources can be made. · Verifying that files can be opened. · Verifying that SQL statements used to select data can be prepared. Click Run to run the job after it is validated. The Status column displays the status of the job run.
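The same run options can also be supplied from the command line with the dsjob utility that ships with the DataStage server. A minimal sketch, assuming a project named devproject and a job named LoadCustomers (both hypothetical; exact options can vary by release):

  $ dsjob -run -mode VALIDATE devproject LoadCustomers
  $ dsjob -run -param SourceDir=/data/dev -rows 1000 -warn 50 devproject LoadCustomers
  $ dsjob -logsum devproject LoadCustomers

The first call validates the job, the second runs it with a parameter value plus row and warning limits, and the third summarizes the job log, mirroring what Director shows in its Status and Log views.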
81
DataStage Enterprise Edition
Director Log View Click the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job. These events include control events, such as the starting, finishing, and aborting of a job; informational messages; warning messages; error messages; and program-generated messages.
82
Message Details are Available
DataStage Enterprise Edition Message Details are Available
83
Other Director Functions
DataStage Enterprise Edition Other Director Functions Schedule job to run on a particular date/time Clear job log Set Director options Row limits Abort after x warnings
84
DataStage Enterprise Edition DSEE – DataStage EE Review
Module 1 DSEE – DataStage EE Review
85
Ascential’s Enterprise Data Integration Platform
DataStage Enterprise Edition Ascential’s Enterprise Data Integration Platform (Diagram: DISCOVER – Data Profiling – gather relevant information for target enterprise applications; PREPARE – Data Quality – cleanse, correct and match input data; TRANSFORM – Extract, Transform, Load – standardize and enrich data and load to targets; connecting ANY SOURCE (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, Web services, data warehouse, other apps.) to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps.), underpinned by Command & Control, Parallel Execution, and Meta Data Management.) Ascential Software looked at what’s required across the whole lifecycle to solve enterprise data integration problems, not just once for an individual project, but for multiple projects connecting any type of source (ERP, SCM, CRM, legacy systems, flat files, external data, etc.) with any type of target. Ascential’s uniqueness comes from the combination of proven best-in-class functionality across data profiling, data quality and matching, and ETL, combined to provide a complete end-to-end data integration solution on a platform with unlimited scalability and performance through parallelization. We can therefore deal not only with gigabytes of data, but terabytes and petabytes of data – data volumes that are becoming more and more common – and do so with complete management and integration of all the meta data in the enterprise environment. This is indeed an end-to-end solution, which customers can implement in whatever phases they choose, and which, by virtue of its completeness, breadth, and robustness of solution, helps our customers get the quickest possible time to value and time to impact from strategic applications. It assures good data is being fed into the informational systems and that our solution provides the ROI and economic benefit customers expect from investments in strategic applications, which are very large investments and which, done right, should command very large returns.
86
DataStage Enterprise Edition
Course Objectives You will learn to: Build DataStage EE jobs using complex logic Utilize parallel processing techniques to increase job performance Build custom stages based on application needs Course emphasis is: Advanced usage of DataStage EE Application job development Best practices techniques Students will gain hands-on experience with DSEE by building an application in class exercises. This application will use complex logic structures.
87
DataStage Enterprise Edition
Course Agenda Day 1 Review of EE Concepts Sequential Access Standards DBMS Access Day 2 EE Architecture Transforming Data Sorting Data Day 3 Combining Data Configuration Files Day 4 Extending EE Meta Data Usage Job Control Testing Suggested agenda – actual classes will vary.
88
DataStage Enterprise Edition
Module Objectives Provide a background for completing work in the DSEE course Tasks Review concepts covered in DSEE Essentials course Skip this module if you recently completed the DataStage EE essentials modules In this module we will review many of the concepts and ideas presented in the DataStage EE essentials modules. At the end of this module students will be asked to complete a brief exercise demonstrating their mastery of that basic material.
89
DataStage Enterprise Edition
Review Topics DataStage architecture DataStage client review Administrator Manager Designer Director Parallel processing paradigm DataStage Enterprise Edition (DSEE) Our topics for review will focus on overall DataStage architecture and a review of the parallel processing paradigm in DataStage Enterprise Edition.
90
Client-Server Architecture
DataStage Enterprise Edition Client-Server Architecture (Diagram: the Manager, Designer, Director, and Administrator clients run on Microsoft® Windows NT/2000/XP and connect to the server engine and Repository on Microsoft® Windows NT or UNIX; the engine spans Discover, Prepare, Transform, and Extend (extract, cleanse, transform, integrate), connecting ANY SOURCE to ANY TARGET – CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps. – with Command & Control, Parallel Execution, and Meta Data Management.) DataStage Enterprise Edition consists of four clients connected to the DataStage Enterprise Edition engine. The client applications are: Manager, Administrator, Designer, and Director. These clients connect to a server engine that is located on either a UNIX (Enterprise Edition only) or NT operating system.
91
DataStage Enterprise Edition
Process Flow Administrator – add/delete projects, set defaults Manager – import meta data, backup projects Designer – assemble jobs, compile, and execute Director – execute jobs, examine job run logs A typical DataStage workflow consists of: Setting up the project in Administrator Including metadata via Manager Building and assembling the job in Designer Executing and testing the job in Director.
92
Administrator – Licensing and Timeout
DataStage Enterprise Edition Administrator – Licensing and Timeout Change licensing, if appropriate. Timeout period should be set to large number or choose “do not timeout” option.
93
Administrator – Project Creation/Removal
DataStage Enterprise Edition Administrator – Project Creation/Removal Functions specific to a project. Available functions: Add or delete projects. Set project defaults (properties button). Cleanup – perform repository functions. Command – perform queries against the repository.
94
Administrator – Project Properties
DataStage Enterprise Edition Administrator – Project Properties RCP for parallel jobs should be enabled Recommendations: Check enable job administration in Director Check enable runtime column propagation May check auto purge of jobs to manage messages in director log Variables for parallel processing
95
Administrator – Environment Variables
DataStage Enterprise Edition Administrator – Environment Variables Variables are category specific You will see different environment variables depending on which category is selected.
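A few of the parallel-processing variables come up repeatedly in later modules. As a hedged illustration (the paths and values shown are assumptions; in practice these are usually set per project on this Administrator page rather than in a shell), the same names can also be exported in the server environment:

  $ export APT_CONFIG_FILE=/opt/ds/configs/dev_2node.apt
  $ export APT_DUMP_SCORE=1

APT_CONFIG_FILE tells a job which configuration file to read at run time, and so controls the degree of parallelism; APT_DUMP_SCORE writes the parallel job "score" to the log, which is useful when checking how stages were actually partitioned.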
96
DataStage Enterprise Edition
OSH is what is run by the EE Framework Reading OSH will be covered in a later module. Since DataStage Enterprise Edition writes OSH, you will want to check this option.
97
DataStage Enterprise Edition
DataStage Manager To attach to the DataStage Manager client, one first enters through the logon screen. Logons can be either by DNS name or IP address. Once logged onto Manager, users can import meta data; export all or portions of the project, or import components from another project’s export. Functions: Backup project Export Import Import meta data Table definitions Sequential file definitions Can be imported from metabrokers Register/create new stages
98
Export Objects to MetaStage
DataStage Enterprise Edition Export Objects to MetaStage Push meta data to MetaStage DataStage objects can now be pushed from DataStage to MetaStage.
99
DataStage Enterprise Edition
Designer Workspace Job design process: Determine data flow Import supporting meta data Use designer workspace to create visual representation of job Define properties for all stages Compile Execute Can execute the job from Designer
100
DataStage Generated OSH
DataStage Enterprise Edition DataStage Generated OSH The EE Framework runs OSH The DataStage GUI now generates OSH when a job is compiled. This OSH is then executed by the Enterprise Edition engine.
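To give a feel for the generated code, here is a deliberately simplified, hand-written OSH sketch (the file name and schema are hypothetical, and the OSH the GUI actually generates is considerably more verbose, with stage names and option blocks):

  osh "import -file '/data/dev/Customers.txt' -schema record ( custid:int32; name:string[max=30]; ) | peek"

The pipe character connects operators into a flow, much as links connect stages on the Designer canvas; once OSH viewing is enabled on the Administrator Parallel tab (as described earlier), the real generated OSH can be inspected after each compile.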
101
Director – Executing Jobs
DataStage Enterprise Edition Director – Executing Jobs Messages from previous runs are kept in a different color from the current run.
102
DataStage Enterprise Edition
Stages Can now customize the Designer’s palette Select desired stages and drag to favorites In Designer View > Customize palette This window will allow you to move icons into your Favorites folder plus many other customization features.
103
Popular Developer Stages
DataStage Enterprise Edition Popular Developer Stages Row generator The row generator and peek stages are especially useful during development to generate test data and display data in the message log. Peek
104
DataStage Enterprise Edition
Row Generator Can build test data Edit row in column tab Depending on the type of data, you can set values for each column in the row generator. Repeatable property
105
DataStage Enterprise Edition
Peek Displays field values Will be displayed in job log or sent to a file Skip records option Can control number of records to be displayed Can be used as stub stage for iterative development (more later) The peek stage will display column values in a job's output messages log.
106
DataStage Enterprise Edition
Why EE is so Effective Parallel processing paradigm More hardware, faster processing Level of parallelization is determined by a configuration file read at runtime Emphasis on memory Data read into memory and lookups performed like hash table EE takes advantage of the machine’s hardware architecture -- this can be changed at runtime.
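Because the configuration file comes up throughout the course, here is a minimal two-node sketch of what one looks like (the host name and directory paths are hypothetical; configuration files are covered in detail in a later module):

  {
    node "node1" {
      fastname "etlserver"
      pools ""
      resource disk "/data/ds/node1" {pools ""}
      resource scratchdisk "/scratch/node1" {pools ""}
    }
    node "node2" {
      fastname "etlserver"
      pools ""
      resource disk "/data/ds/node2" {pools ""}
      resource scratchdisk "/scratch/node2" {pools ""}
    }
  }

Pointing a job at a file with more node entries increases its degree of parallelism without modifying or recompiling the job design.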
107
Parallel Processing Systems
DataStage Enterprise Edition Parallel Processing Systems DataStage EE Enables parallel processing = executing your application on multiple CPUs simultaneously If you add more resources (CPUs, RAM, and disks) you increase system performance (Diagram: example system containing 6 CPUs, or processing nodes, and disks.) DataStage Enterprise Edition can take advantage of multiple processing nodes to instantiate multiple instances of a DataStage job.
108
Scaleable Systems: Examples
DataStage Enterprise Edition Scaleable Systems: Examples Three main types of scalable systems Symmetric Multiprocessors (SMP): shared memory and disk Clusters: UNIX systems connected via networks MPP: Massively Parallel Processing You can describe an MPP as a bunch of connected SMPs.
109
SMP: Shared Everything
DataStage Enterprise Edition SMP: Shared Everything Multiple CPUs with a single operating system Programs communicate using shared memory All CPUs share system resources (OS, memory with single linear address space, disks, I/O) When used with Enterprise Edition: Data transport uses shared memory Simplified startup A typical SMP machine has multiple CPUs that share both disks and memory. Enterprise Edition treats NUMA (NonUniform Memory Access) as plain SMP.
110
Traditional Batch Processing
DataStage Enterprise Edition Traditional Batch Processing (Diagram: Operational Data and Archived Data sources are transformed, cleaned, and loaded into the Data Warehouse target, landing to disk between each step.) Traditional approach to batch processing: Write to disk and read from disk before each processing operation Sub-optimal utilization of resources: a 10 GB stream leads to 70 GB of I/O; processing resources can sit idle during I/O Very complex to manage (lots and lots of small jobs) Becomes impractical with big data volumes: disk I/O consumes the processing time; terabytes of disk required for temporary staging The traditional data processing paradigm involves dropping data to disk many times throughout a processing run.
111
Pipeline Multiprocessing
DataStage Enterprise Edition Pipeline Multiprocessing Data Pipelining: Transform, clean, and load processes are executing simultaneously on the same processor; rows are moving forward through the flow. (Diagram: Operational Data and Archived Data sources flow through Transform, Clean, and Load into the Data Warehouse target.) Start a downstream process while an upstream process is still running. This eliminates intermediate storing to disk, which is critical for big data. This also keeps the processors busy. Still has limits on scalability. The parallel processing paradigm, by contrast, rarely drops data to disk unless necessary for business reasons -- such as backup and recovery. Think of a conveyor belt moving the rows from process to process!
112
Partition Parallelism
DataStage Enterprise Edition Partition Parallelism Data Partitioning: Break up big data into partitions Run one partition on each processor 4X faster on 4 processors; with data big enough, 100X faster on 100 processors This is exactly how the parallel databases work! Data partitioning requires that the same transform be applied to all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform. (Diagram: source data is range-partitioned by last name – A-F, G-M, N-T, U-Z – across Node 1 through Node 4, each running the same Transform.) Data may actually be partitioned in several ways -- range partitioning is only one example. We will explore others later.
113
Combining Parallelism Types
DataStage Enterprise Edition Combining Parallelism Types Putting It All Together: Parallel Dataflow. (Diagram: data flows from Source through Transform, Clean, and Load to Target, using both pipelining and partitioning.) Pipelining and partitioning can be combined together to provide a powerful parallel processing paradigm.
114
DataStage Enterprise Edition
Repartitioning Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly. (Diagram: the same pipelined, partitioned flow from Source through Transform, Clean, and Load to Target, but the data is repartitioned between stages – for example from customer last name (A-F, G-M, N-T, U-Z) to customer zip code to credit card number – without landing to disk.) In addition, data can change partitioning from stage to stage. This can either happen explicitly at the desire of the programmer or be performed implicitly by the engine.
115
DataStage Enterprise Edition
EE Program Elements Dataset: uniform set of rows in the Framework's internal representation - Three flavors: 1. file sets *.fs : stored on multiple Unix files as flat files 2. persistent: *.ds : stored on multiple Unix files in Framework format, read and written using the DataSet Stage 3. virtual: *.v : links, in Framework format, NOT stored on disk - The Framework processes only datasets -- hence the possible need for Import - Different datasets typically have different schemas - Convention: "dataset" = Framework data set. Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file). - All the partitions of a dataset follow the same schema: that of the dataset. Enterprise Edition deals with several different kinds of data sets: file sets, persistent data sets, and virtual (non-persistent) data sets.
116
DataStage EE Architecture
DataStage Enterprise Edition DataStage EE Architecture DataStage: Provides data integration platform Orchestrate Framework: Provides application scalability DataStage Enterprise Edition: Best-of-breed scalable data integration platform No limitations on data volumes or throughput The Enterprise Edition engine was derived from DataStage and Orchestrate.
117
Introduction to DataStage EE
DataStage Enterprise Edition Introduction to DataStage EE DSEE: Automatically scales to fit the machine Handles data flow among multiple CPU’s and disks With DSEE you can: Create applications for SMP’s, clusters and MPP’s… Enterprise Edition is architecture-neutral Access relational databases in parallel Execute external applications in parallel Store data across multiple disks and nodes Enterprise Edition is architecturally neutral -- it can run on SMP's, clusters, and MPP's. The configuration file determines how Enterprise Edition will treat the hardware.
118
Job Design VS. Execution
DataStage Enterprise Edition Job Design VS. Execution Developer assembles data flow using the Designer …and gets: parallel access, propagation, transformation, and load. The design is good for 1 node, 4 nodes, or N nodes. To change # nodes, just swap configuration file. No need to modify or recompile the design Much of the parallel processing paradigm is hidden from the programmer -- they simply designate process flow as shown in the upper portion of this diagram. Enterprise Edition, using the definitions in that configuration file, will actually execute UNIX processes that are partitioned and parallelized.
119
Partitioners and Collectors
DataStage Enterprise Edition Partitioners and Collectors Partitioners distribute rows into partitions and implement data-partition parallelism. Collectors = inverse partitioners. Both live on input links: partitioners on stages running in parallel, collectors on stages running sequentially. Use a choice of methods. Partitioners and collectors work in opposite directions -- however, they frequently appear together in job designs.
120
Example Partitioning Icons
DataStage Enterprise Edition Example Partitioning Icons Partitioners and collectors have no stages or icons of their own. They live on input links of stages running in parallel (respectively, sequentially). Link markings indicate their presence: sequential to sequential, no marking; sequential to parallel, fan-out (partitioner); parallel to sequential, fan-in (collector); parallel to parallel with a box marking, no reshuffling (partitioner using the "SAME" method); parallel to parallel with a bow-tie marking, reshuffling (partitioner using another method). Collectors = inverse partitioners: they recollect rows from partitions into a single input stream to a sequential stage. They are responsible for some surprising behavior: the default (Auto) is "eager" to output rows and typically causes non-determinism; row order may vary from run to run with identical input.
121
DataStage Enterprise Edition
Exercise Complete exercises 1-1, 1-2, and 1-3.
122
DataStage Enterprise Edition DSEE Sequential Access
Module 2 DSEE Sequential Access
123
DataStage Enterprise Edition
Module Objectives You will learn to: Import sequential files into the EE Framework Utilize parallel processing techniques to increase sequential file access Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages Manage partitioned data stored by the Framework In this module students will concentrate their efforts on sequential file access in Enterprise Edition jobs.
124
Types of Sequential Data Stages
DataStage Enterprise Edition Types of Sequential Data Stages Sequential Fixed or variable length File Set Lookup File Set Data Set Several stages handle sequential data. Each stage has its own advantages and differences from the other stages that handle sequential data. Sequential data can come in a variety of types -- including both fixed length and variable length.
125
Sequential Stage Introduction
DataStage Enterprise Edition Sequential Stage Introduction The EE Framework processes only datasets For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations – this is performed by import and export OSH operators generated by Sequential or FileSet stages During import or export DataStage performs format translations – into, or out of, the EE internal format Data is described to the Framework in a schema The DataStage sequential stage writes OSH – specifically the import and export Orchestrate operators. Q: Why import data into an Orchestrate data set? A: Partitioning works only with data sets. You must use data sets to distribute data to the multiple processing nodes of a parallel system. Every Orchestrate program has to perform some type of import operation, from: a flat file, COBOL data, an RDBMS, or a SAS data set. This section describes how to get your data into Orchestrate. Also talk about getting your data back out. Some people will be happy to leave data in Orchestrate data sets, while others require their results in a different format.
126
How the Sequential Stage Works
DataStage Enterprise Edition How the Sequential Stage Works Generates Import/Export operators, depending on whether stage is source or target Performs direct C++ file I/O streams Behind each parallel stage is one or more Orchestrate operators. Import and Export are both operators that deal with sequential data.
127
Using the Sequential File Stage
DataStage Enterprise Edition Using the Sequential File Stage Both import and export of general files (text, binary) are performed by the Sequential File Stage. (Diagram: data import converts external data into the EE internal format; data export converts it back out.) When data is imported, the import operator translates that data into the Enterprise Edition internal format. The export operator performs the reverse action.
128
Working With Flat Files
DataStage Enterprise Edition Working With Flat Files Sequential File Stage Normally will execute in sequential mode Can be parallel if reading multiple files (file pattern option) Can use multiple readers within a node DSEE needs to know How file is divided into rows How row is divided into columns Both export and import operators are generated by the sequential stage -- which one you get depends on whether the sequential stage is used as source or target.
129
Processes Needed to Import Data
DataStage Enterprise Edition Processes Needed to Import Data Recordization Divides input stream into records Set on the format tab Columnization Divides the record into columns Default set on the format tab but can be overridden on the columns tab Can be “incomplete” if using a schema or not even specified in the stage if using RCP These two processes must work together to correctly interpret data -- that is, to break a data string down into records and columns.
130
DataStage Enterprise Edition
DataStage Enterprise Edition File Format Example All columns (fields) are defined by delimiters. Similarly, records are defined by terminating characters.
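As a hedged illustration of how those choices are described to the Framework (the column names are made up, and in practice the same information is entered on the Format and Columns tabs rather than typed by hand), a comma-delimited, newline-terminated file might be described by a schema along these lines:

  record {record_delim='\n', delim=',', quote=double, final_delim=end}
  ( custid:  int32;
    name:    string[max=30];
    balance: decimal[8,2]; )

The properties in braces say how the file is divided into records and how each record is divided into fields; the parenthesized list supplies the column definitions.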
131
DataStage Enterprise Edition
Sequential File Stage To set the properties, use stage editor Page (general, input/output) Tabs (format, columns) Sequential stage link rules One input link One output link (except for reject link definition) One reject link Will reject any records not matching meta data in the column definitions The DataStage GUI allows you to determine properties that will be used to read and write sequential files.
132
Job Design Using Sequential Stages
DataStage Enterprise Edition Job Design Using Sequential Stages Source stage Multiple output links - however, note that one of the links is represented by a broken line. This is a reject link, not to be confused with a stream link or a reference link. Target One input link Stage categories
133
General Tab – Sequential Source
DataStage Enterprise Edition General Tab – Sequential Source Show records Multiple output links If multiple links are present you'll need to down-click to see each.
134
Properties – Multiple Files
DataStage Enterprise Edition Properties – Multiple Files If specified individually, you can make a list of files that are unrelated in name. If you select “read method” and choose file pattern, you effectively select an undetermined number of files. Click to add more files having the same meta data.
135
Properties - Multiple Readers
DataStage Enterprise Edition Properties - Multiple Readers Multiple readers option allows you to set number of readers To use multiple readers on a sequential file, must be fixed-length records.
136
DataStage Enterprise Edition
Format Tab DSEE needs to know: How a file is divided into rows How a row is divided into columns Column properties set on this tab are defaults for each column; they can be overridden at the column level (from the columns tab). (Diagram callouts: file into records; record into columns.)
137
DataStage Enterprise Edition
Read Methods
138
DataStage Enterprise Edition
Reject Link Reject mode = output Source All records not matching the meta data (the column definitions) Target All records that are rejected for any reason Meta data – one column, datatype = raw The sequential stage can have a single reject link. This is typically used when you are writing to a file and provides a location where records that have failed to be written to a file for some reason can be sent. When you are reading files, you can use a reject link as a destination for rows that do not match the expected column definitions.
139
DataStage Enterprise Edition
File Set Stage Can read or write file sets Files suffixed by .fs File set consists of: Descriptor file – contains location of raw data files + meta data Individual raw data files Can be processed in parallel Number of raw data files depends on: the configuration file – more on configuration files later.
140
DataStage Enterprise Edition
File Set Stage Example Descriptor file The descriptor file shows both the record metadata and the location of the raw data files. The location is determined by the configuration file.
141
DataStage Enterprise Edition
File Set Usage Why use a file set? 2G limit on some file systems Need to distribute data among nodes to prevent overruns If used in parallel, runs faster than a sequential file File sets, while yielding faster access than simple text files, are not in the Enterprise Edition internal format.
142
DataStage Enterprise Edition
Lookup File Set Stage Can create file sets Usually used in conjunction with Lookup stages The lookup file set is similar to the file set but also contains information about the key columns. These keys will be used later in lookups.
143
Lookup File Set > Properties
DataStage Enterprise Edition Lookup File Set > Properties Key column specified Key column dropped in descriptor file
144
DataStage Enterprise Edition
Data Set Operating system (Framework) file Suffixed by .ds Referred to by a control file Managed by Data Set Management utility from GUI (Manager, Designer, Director) Represents persistent data Key to good performance in set of linked jobs Data sets represent persistent data maintained in the internal format.
145
DataStage Enterprise Edition
Persistent Datasets Accessed from/to disk with the DataSet stage. Two parts: Descriptor file: contains metadata and data location, but NOT the data itself Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel Example descriptor schema for input.ds: record ( partno: int32; description: string; ) The descriptor file has a user-specified name, e.g. "input.ds", and contains the paths of the data files, the metadata (unformatted table definition, "unformatted core" schema, no formats), and the config file used to store the data. The data files contain the data itself and have system-generated long file names to avoid naming conflicts, e.g. node1:/local/disk1/… node2:/local/disk2/… The icon shown here is the DataSet stage, used to access persistent datasets.
146
DataStage Enterprise Edition
Quiz! True or False? Everything that has been data-partitioned must be collected in the same job. For example, if we have four nodes corresponding to four regions, we'll have four reports; there is no need to recollect if we do not need inter-region correlations.
147
DataStage Enterprise Edition
Data Set Stage Is the data partitioned? In both cases the answer is Yes.
148
Engine Data Translation
DataStage Enterprise Edition Engine Data Translation Occurs on import From sequential files or file sets From RDBMS Occurs on export From datasets to file sets or sequential files From datasets to RDBMS Engine most efficient when processing internally formatted records (I.e. data contained in datasets)
149
DataStage Enterprise Edition
Managing DataSets GUI (Manager, Designer, Director) – Tools > Data Set Management Alternative methods orchadmin – Unix command-line utility; lists records, removes data sets (will remove all components) dsrecords – lists the number of records in a data set Both dsrecords and orchadmin are Unix command-line utilities. The DataStage Designer GUI provides a mechanism to view and manage data sets.
150
DataStage Enterprise Edition
Data Set Management Display data This screen (Data Set Management) is available from Manager, Designer, and Director. Schema
151
Data Set Management From Unix
DataStage Enterprise Edition Data Set Management From Unix Alternative method of managing file sets and data sets dsrecords Gives record count Unix command-line utility $ dsrecords ds_name e.g. $ dsrecords myDS.ds returns the record count orchadmin Manages EE persistent data sets e.g. $ orchadmin rm myDataSet.ds
152
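As a minimal command-line sketch (myDS.ds is a hypothetical data set name; orchadmin rm removes the descriptor and all of its data files, so use it with care):
    $ dsrecords myDS.ds
    $ orchadmin rm myDS.ds
The first command reports the number of records in the data set; the second removes the data set and all of its components.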
DataStage Enterprise Edition
Exercise Complete exercises 2-1, 2-2, 2-3, and 2-4.
153
DataStage Enterprise Edition Standards and Techniques
Module 3 Standards and Techniques
154
DataStage Enterprise Edition
Objectives Establish standard techniques for DSEE development Will cover: Job documentation Naming conventions for jobs, links, and stages Iterative job design Useful stages for job development Using configuration files for development Using environmental variables Job parameters
155
DataStage Enterprise Edition
Job Presentation Document using the annotation stage
156
Job Properties Documentation
DataStage Enterprise Edition Job Properties Documentation Organize jobs into categories Description shows in DS Manager and MetaStage
157
DataStage Enterprise Edition
Naming conventions Stages named after the Data they access Function they perform DO NOT leave defaulted stage names like Sequential_File_0 Links named for the data they carry DO NOT leave defaulted link names like DSLink3
158
DataStage Enterprise Edition
Stage and Link Names Stages and links renamed to data they handle
159
Create Reusable Job Components
DataStage Enterprise Edition Create Reusable Job Components Use Enterprise Edition shared containers when feasible Container
160
Use Iterative Job Design
DataStage Enterprise Edition Use Iterative Job Design Use copy or peek stage as stub Test job in phases – small first, then increasing in complexity Use Peek stage to examine records
161
DataStage Enterprise Edition
Copy or Peek Stage Stub Copy stage
162
Transformer Stage Techniques
DataStage Enterprise Edition Transformer Stage Techniques Suggestions - Always include reject link. Always test for null value before using a column in a function. Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later. Be aware of Column and Stage variable Data Types. Often user does not pay attention to Stage Variable type. Avoid type conversions. Try to maintain the data type as imported.
163
DataStage Enterprise Edition
The Copy Stage With 1 link in, 1 link out: the Copy Stage is the ultimate "no-op" (place-holder): Partitioners, Sort / Remove Duplicates, Rename, Drop column … can be inserted on: input link (Partitioning tab): Partitioners, Sort, Remove Duplicates output link (Mapping page): Rename, Drop Sometimes replaces the Transformer: Rename, Drop, Implicit type conversions Link constraint – break up schema
164
DataStage Enterprise Edition
Developing Jobs Keep it simple Jobs with many stages are hard to debug and maintain. Start small and build up to the final solution Use view data, copy, and peek. Start from the source and work outward. Develop with a 1-node configuration file. Solve the business problem before the performance problem. Don’t worry too much about partitioning until the sequential flow works as expected. If you have to write to disk, use a persistent data set.
165
DataStage Enterprise Edition
Final Result
166
Good Things to Have in each Job
DataStage Enterprise Edition Good Things to Have in each Job Use job parameters Some helpful environment variables to add as job parameters: $APT_DUMP_SCORE Reports the OSH score to the message log $APT_CONFIG_FILE Establishes runtime parameters for the EE engine, e.g. the degree of parallelization
167
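For example, the same variables can be set from the shell before a run, or surfaced as job parameters so they can be changed per invocation (the configuration file path below is hypothetical):
    $ export APT_DUMP_SCORE=True
    $ export APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/4node.apt
    $ echo $APT_CONFIG_FILE
APT_DUMP_SCORE adds the score report to the job log; APT_CONFIG_FILE selects the configuration file and therefore the degree of parallelism.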
Setting Job Parameters
DataStage Enterprise Edition Setting Job Parameters Click to add environment variables
168
DataStage Enterprise Edition
DUMP SCORE Output Setting APT_DUMP_SCORE yields: Double-click Partitioner and Collector Mapping Node --> partition
169
Use Multiple Configuration Files
DataStage Enterprise Edition Use Multiple Configuration Files Make a set for 1X, 2X,…. Use different ones for test versus production Include as a parameter in each job
170
DataStage Enterprise Edition
Exercise Complete exercise 3-1
171
DataStage Enterprise Edition DBMS Access
Module 4 DBMS Access
172
DataStage Enterprise Edition
Objectives Understand how DSEE reads and writes records to an RDBMS Understand how to handle nulls on DBMS lookup Utilize this knowledge to: Read and write database tables Use database tables to lookup data Use null handling options to clean data
173
Parallel Database Connectivity
DataStage Enterprise Edition Parallel Database Connectivity (diagram contrasts a Traditional Client-Server topology with the Enterprise Edition topology against a Parallel RDBMS) Enterprise Edition: Parallel server runs APPLICATIONS Application has parallel connections to RDBMS Suitable for large data volumes Higher levels of integration possible Traditional Client-Server: Only RDBMS is running in parallel Each application has only one connection Suitable only for small data volumes
174
RDBMS Access Supported Databases
DataStage Enterprise Edition RDBMS Access Supported Databases Enterprise Edition provides high performance / scalable interfaces for: DB2 Informix Oracle Teradata
175
DataStage Enterprise Edition
RDBMS Access Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions RDBMS nulls converted to/from nullable field values Support for standard SQL syntax for specifying: field list for SELECT statement filter for WHERE clause Can write an explicit SQL query to access RDBMS EE supplies additional information in the SQL query RDBMS access is relatively easy because Orchestrate extracts the schema definition for the imported data set. Little or no work is required from the user.
176
DataStage Enterprise Edition
RDBMS Stages DB2/UDB Enterprise Informix Enterprise Oracle Enterprise Teradata Enterprise
177
DataStage Enterprise Edition
RDBMS Usage As a source Extract data from table (stream link) Extract as table, generated SQL, or user-defined SQL User-defined can perform joins, access views Lookup (reference link) Normal lookup is memory-based (all table data read into memory) Can perform one lookup at a time in DBMS (sparse option) Continue/drop/fail options As a target Inserts Upserts (Inserts and updates) Loader
178
RDBMS Source – Stream Link
DataStage Enterprise Edition RDBMS Source – Stream Link Stream link
179
DBMS Source - User-defined SQL
DataStage Enterprise Edition DBMS Source - User-defined SQL Columns in SQL statement must match the meta data in columns tab
180
DataStage Enterprise Edition
Exercise User-defined SQL Exercise 4-1
181
DBMS Source – Reference Link
DataStage Enterprise Edition DBMS Source – Reference Link Reject link
182
DataStage Enterprise Edition
Lookup Reject Link All columns from the input link will be placed on the rejects link. Therefore, no column tab is available for the rejects link. “Output” option automatically creates the reject link
183
DataStage Enterprise Edition
Null Handling Must handle null condition if lookup record is not found and “continue” option is chosen Can be done in a transformer stage
184
DataStage Enterprise Edition
Lookup Stage Mapping The mapping tab will show all columns from the input link and the reference link (less the column used for key lookup). Link name
185
Lookup Stage Properties
DataStage Enterprise Edition Lookup Stage Properties Reference link Must have same column name in input and reference links. You will get the results of the lookup in the output column. If the lookup results in a non-match and the action was set to continue, the output column will be null.
186
DataStage Enterprise Edition
DBMS as a Target
187
DataStage Enterprise Edition
DBMS As Target Write Methods Delete Load Upsert Write (DB2) Write mode for load method Truncate Create Replace Append
188
DataStage Enterprise Edition
Target Properties Generated code can be copied Upsert mode determines options
189
DataStage Enterprise Edition
Checking for Nulls Use Transformer stage to test for fields with null values (Use IsNull functions) In Transformer, can reject or load default value
190
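As a sketch (the link and column names are hypothetical), a Transformer derivation can substitute a default value when the lookup did not find a match:
    If IsNull(lkpCustomer.description) Then 'UNKNOWN' Else lkpCustomer.description
Alternatively, a constraint such as IsNull(lkpCustomer.description) can route those rows down a reject link instead of loading a default.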
DataStage Enterprise Edition
Exercise Complete exercise 4-2
191
DataStage Enterprise Edition Platform Architecture
Module 5 Platform Architecture
192
DataStage Enterprise Edition
Objectives Understand how Enterprise Edition Framework processes data You will be able to: Read and understand OSH Perform troubleshooting
193
DataStage Enterprise Edition
Concepts The Enterprise Edition Platform Script language - OSH (generated by DataStage Parallel Canvas, and run by DataStage Director) Communication - conductor, section leaders, players. Configuration files (only one active at a time, describes H/W) Meta data - schemas/tables Schema propagation - RCP EE extensibility - Buildop, Wrapper Datasets (data in Framework's internal representation)
194
DataStage Enterprise Edition
DS-EE Stage Elements EE Stages Involve A Series Of Processing Steps Input Data Set schema: prov_num:int16; member_num:int8; custid:int32; Output Data Set schema: prov_num:int16; member_num:int8; custid:int32; Piece of Application Logic Running Against Individual Records Parallel or Sequential Input Interface Business Logic Output Interface Partitioner Each stage has input and output interface schemas, a partitioner and business logic. Interface schemas define the names and data types of the required fields of the component’s input and output Data Sets. The component’s input interface schema requires that an input Data Set have the named fields and data types exactly compatible with those specified by the interface schema for the input Data Set to be accepted by the component. A component ignores any extra fields in a Data Set, which allows the component to be used with any data set that has at least the input interface schema of the component. This property makes it possible to add and delete fields from a relational database table or from the Orchestrate Data Set without having to rewrite code inside the component. In the example shown here, Component has an interface schema that requires three fields with named fields and data types as shown in the example. In this example, the output schema for the component is the same as the input schema. This does not always have to be the case. The partitioner is key to Orchestrate’s ability to deliver parallelism and unlimited scalability. We’ll discuss exactly how the partitioners work in a few slides, but here it’s important to point out that partitioners are an integral part of Orchestrate components. EE Stage
195
DataStage Enterprise Edition
DSEE Stage Execution Dual Parallelism Eliminates Bottlenecks! EE Delivers Parallelism in Two Ways Pipeline Partition Block Buffering Between Components Eliminates Need for Program Load Balancing Maintains Orderly Data Flow Producer Pipeline There is one more point to be made about the DSEE execution model. DSEE achieves parallelism in two ways. We have already talked about partitioning the records and running multiple instances of each component to speed up program execution. In addition to this partition parallelism, Orchestrate is also executing pipeline parallelism. As shown in the picture on the left, as the Orchestrate program is executing, a producer component is feeding records to a consumer component without first writing the records to disk. Orchestrate is pipelining the records forward in the flow as they are being processed by each component. This means that the consumer component is processing records fed to it by the producer component before the producer has finished processing all of the records. Orchestrate provides block buffering between components so that producers cannot produce records faster than consumers can consume those records. This pipelining of records eliminates the need to store intermediate results to disk, which can provide significant performance advantages, particularly when operating against large volumes of data. Consumer Partition
196
Stages Control Partition Parallelism
DataStage Enterprise Edition Stages Control Partition Parallelism Execution Mode (sequential/parallel) is controlled by Stage default = parallel for most Ascential-supplied Stages Developer can override default mode Parallel Stage inserts the default partitioner (Auto) on its input links Sequential Stage inserts the default collector (Auto) on its input links Developer can override default execution mode (parallel/sequential) of Stage > Advanced tab choice of partitioner/collector on Input > Partitioning tab
197
DataStage Enterprise Edition
How Parallel Is It? Degree of parallelism is determined by the configuration file Total number of logical nodes in default pool, or a subset if using "constraints". Constraints are assigned to specific pools as defined in configuration file and can be referenced in the stage
198
DataStage Enterprise Edition
OSH DataStage EE GUI generates OSH scripts Ability to view OSH turned on in Administrator OSH can be viewed in Designer using job properties The Framework executes OSH What is OSH? Orchestrate shell Has a UNIX command-line interface
199
DataStage Enterprise Edition
OSH Script An osh script is a quoted string which specifies: The operators and connections of a single Orchestrate step In its simplest form, it is: osh “op < in.ds > out.ds” Where: op is an Orchestrate operator in.ds is the input data set out.ds is the output data set
200
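A sketch of what such scripts can look like (copy and peek are existing operators, but the data set names here are hypothetical); operators can also be chained with Unix-style pipes, which correspond to virtual data sets between stages:
    $ osh "copy < input.ds > output.ds"
    $ osh "copy < input.ds | peek > output.ds"
The first step copies a persistent data set; the second pipes the rows through the peek operator, which prints sample records to the job log.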
DataStage Enterprise Edition
OSH Operators OSH Operator is an instance of a C++ class inheriting from APT_Operator Developers can create new operators Examples of existing operators: Import Export RemoveDups Operators are the basic functional units of an Orchestrate application. Operators read records from input data sets, perform actions on the input records, and write results to output data sets. An operator may perform an action as simple as copying records from an input data set to an output data set without modification. Alternatively, an operator may modify a record by adding, removing, or modifying fields during execution.
201
Enable Visible OSH in Administrator
DataStage Enterprise Edition Enable Visible OSH in Administrator Will be enabled for all projects
202
DataStage Enterprise Edition
View OSH in Designer Operator Schema
203
DataStage Enterprise Edition
OSH Practice Exercise 5-1 – Instructor demo (optional)
204
Elements of a Framework Program
DataStage Enterprise Edition Elements of a Framework Program Operators Datasets: set of rows processed by Framework Orchestrate data sets: persistent (terminal) *.ds, and virtual (internal) *.v. Also: flat “file sets” *.fs Schema: data description (metadata) for datasets and links.
205
DataStage Enterprise Edition
Datasets Consist of Partitioned Data and Schema Can be Persistent (*.ds) or Virtual (*.v, Link) Overcome 2 GB File Limit What you program (GUI or OSH): $ osh "operator_A > x.ds" What gets processed: an instance of Operator A runs on each node (Node 1 through Node 4) What gets generated: the data files of x.ds – multiple files per partition, each file up to 2 GB (or larger)
206
Computing Architectures: Definition
DataStage Enterprise Edition Computing Architectures: Definition (diagram contrasts dedicated disk, shared disk, and shared-nothing architectures) Uniprocessor (dedicated disk): PC / workstation, single-processor server Shared disk, shared memory: SMP System (Symmetric Multiprocessor) – IBM, Sun, HP, Compaq; 2 to 64 processors; majority of installations Shared nothing: Clusters and MPP Systems – 2 to hundreds of processors; MPP: IBM and NCR Teradata; each node is a uniprocessor or SMP
207
Job Execution: Orchestrate
DataStage Enterprise Edition Job Execution: Orchestrate (diagram: a Conductor node plus Processing nodes, each with a Section Leader and Players) Conductor - the initial DS/EE process: composes the step, creates Section Leader processes (one per node), consolidates messages and outputs them, manages orderly shutdown. Section Leader - forks Player processes (one per Stage), manages up/down communication. Players - the actual processes associated with Stages (combined players: one process only); send stderr to the Section Leader; establish connections to other players for data flow; clean up upon completion. Communication: - SMP: Shared Memory - MPP: TCP
208
Working with Configuration Files
DataStage Enterprise Edition Working with Configuration Files You can easily switch between config files: '1-node' file for sequential execution, lighter reports (handy for testing) 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism 'BigN-nodes' file - aims at full data-partitioned parallelism Only one file is active while a step is running The Framework queries (first) the environment variable $APT_CONFIG_FILE The # of nodes declared in the config file need not match the # of CPUs The same configuration file can be used on development and target machines
209
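As a sketch, a '1-node' file modeled on the samples later in this course might look like this (the host name and paths are hypothetical):
    {
      node "node1"
      {
        fastname "devbox"
        pools ""
        resource disk "/home/dsadm/datasets" {pools ""}
        resource scratchdisk "/home/dsadm/scratch" {pools ""}
      }
    }
Point $APT_CONFIG_FILE at this file for sequential test runs, then switch to a larger file for parallel runs without recompiling the job.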
Scheduling Nodes, Processes, and CPUs
DataStage Enterprise Edition Scheduling Nodes, Processes, and CPUs DS/EE does not: know how many CPUs are available schedule Who knows what? Who does what? DS/EE creates (Nodes*Ops) Unix processes The O/S schedules these processes on the CPUs Nodes = # logical nodes declared in config. file Ops = # ops. (approx. # blue boxes in V.O.) Processes = # Unix processes CPUs = # available CPUs
210
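As a rough worked example (the numbers are illustrative, not from the course notes): a job with 3 operators running against a 4-node configuration file creates about 4 x 3 = 12 player processes, plus one section leader per node and one conductor, i.e. roughly 17 Unix processes for the operating system to schedule across however many CPUs actually exist.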
Configuring DSEE – Node Pools
DataStage Enterprise Edition Configuring DSEE – Node Pools
{
  node "n1"
  {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2"
  {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {}
    resource scratchdisk "/temp" {}
  }
  node "n3"
  {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
  }
  node "n4"
  {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
211
Configuring DSEE – Disk Pools
DataStage Enterprise Edition Configuring DSEE – Disk Pools
{
  node "n1"
  {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2"
  {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3"
  {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
  }
  node "n4"
  {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
212
DataStage Enterprise Edition
Re-Partitioning Parallel to parallel flow may incur reshuffling: Records may jump between nodes node 1 node 2 partitioner
213
DataStage Enterprise Edition
Partitioning Methods Auto Hash Entire Range Range Map
214
DataStage Enterprise Edition
Collectors Collectors combine partitions of a dataset into a single input stream to a sequential Stage ... data partitions collector sequential Stage Collectors do NOT synchronize data
215
Partitioning and Repartitioning Are Visible On Job Design
DataStage Enterprise Edition Partitioning and Repartitioning Are Visible On Job Design Partitioners and collectors have no stages or icons of their own. They live on the input links of stages running in parallel (respectively, sequentially). Link markings indicate their presence:
S---------->S (no marking)
S--(fan out)-->P (partitioner)
P--(fan in)--->S (collector)
P----(box)---->P (no reshuffling: partitioner using "SAME" method)
P--(bow tie)-->P (reshuffling: partitioner using another method)
Collectors = inverse partitioners; they recollect rows from partitions into a single input stream to a sequential stage
216
Partitioning and Collecting Icons
DataStage Enterprise Edition Partitioning and Collecting Icons Partitioner Collector
217
Setting a Node Constraint in the GUI
DataStage Enterprise Edition Setting a Node Constraint in the GUI
218
Reading Messages in Director
DataStage Enterprise Edition Reading Messages in Director Set APT_DUMP_SCORE to true Can be specified as job parameter Messages sent to Director log If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
219
Messages With APT_DUMP_SCORE = True
DataStage Enterprise Edition Messages With APT_DUMP_SCORE = True
220
DataStage Enterprise Edition
Exercise Complete exercise 5-2
221
DataStage Enterprise Edition Transforming Data
Module 6 Transforming Data
222
DataStage Enterprise Edition
Module Objectives Understand ways DataStage allows you to transform data Use this understanding to: Create column derivations using user-defined code or system functions Filter records based on business criteria Control data flow based on data conditions
223
DataStage Enterprise Edition
Transformed Data Transformed data is: Outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields May be composed of system variables Frequently uses functions performed on something (i.e. incoming columns) Divided into categories – e.g. Date and time Mathematical Logical Null handling More
224
DataStage Enterprise Edition
Stages Review Stages that can transform data Transformer Parallel Basic (from Parallel palette) Aggregator (discussed in later module) Sample stages that do not transform data Sequential FileSet DataSet DBMS
225
Transformer Stage Functions
DataStage Enterprise Edition Transformer Stage Functions Control data flow Create derivations
226
DataStage Enterprise Edition
Flow Control Separate records flow down links based on data condition – specified in Transformer stage constraints Transformer stage can filter records Other stages can filter records but do not exhibit advanced flow control Sequential can send bad records down reject link Lookup can reject records based on lookup failure Filter can select records based on data value
227
DataStage Enterprise Edition
Rejecting Data Reject option on sequential stage Data does not agree with meta data Output consists of one column with binary data type Reject links (from Lookup stage) result from the drop option of the property “If Not Found” Lookup “failed” All columns on reject link (no column mapping option) Reject constraints are controlled from the constraint editor of the transformer Can control column mapping Use the “Other/Log” checkbox
228
Rejecting Data Example
DataStage Enterprise Edition Rejecting Data Example Constraint – Other/log option “If Not Found” property Property Reject Mode = Output
229
Transformer Stage Properties
DataStage Enterprise Edition Transformer Stage Properties Link naming conventions are important because they identify appropriate links in the stage properties screen shown above. Four quadrants: Incoming data link (one only) Outgoing links (can have multiple) Meta data for all incoming links Meta data for all outgoing links – may have multiple tabs if there are multiple outgoing links Note the constraints bar – if you double-click on any you will get screen for defining constraints for all outgoing links.
230
Transformer Stage Variables
DataStage Enterprise Edition Transformer Stage Variables First of transformer stage entities to execute Execute in order from top to bottom Can write a program by using one stage variable to point to the results of a previous stage variable Multi-purpose Counters Hold values for previous rows to make comparisons Hold derivations to be used in multiple field derivations Can be used to control execution of constraints
231
DataStage Enterprise Edition
Stage Variables Show/Hide button
232
DataStage Enterprise Edition
Transforming Data Derivations Using expressions Using functions Date/time Transformer Stage Issues Sometimes require sorting before the transformer stage – I.e. using stage variable as accumulator and need to break on change of column value Checking for nulls
233
DataStage Enterprise Edition
Checking for Nulls Nulls can get introduced into the dataflow because of failed lookups and the way in which you chose to handle this condition Can be handled in constraints, derivations, stage variables, or a combination of these If you perform a lookup from a lookup stage and choose the continue option for a failed lookup, you have the possibility of nulls entering your data flow.
234
Transformer - Handling Rejects
DataStage Enterprise Edition Transformer - Handling Rejects Constraint Rejects All expressions are false and reject row is checked
235
Transformer: Execution Order
DataStage Enterprise Edition Transformer: Execution Order Derivations in stage variables are executed first Constraints are executed before derivations Column derivations in earlier links are executed before later links Derivations in higher columns are executed before lower columns
236
Parallel Palette - Two Transformers
DataStage Enterprise Edition Parallel Palette - Two Transformers All > Processing > Transformer Is the non-Universe transformer Has a specific set of functions No DS routines available Parallel > Processing Basic Transformer Makes server style transforms available on the parallel palette Can use DS routines There is no longer a need to use shared containers to get Universe functionality on the parallel palette. The Basic transformer is slow in that records need to be exported by the framework to Universe functions and then imported. Program in Basic for both transformers
237
Transformer Functions From Derivation Editor
DataStage Enterprise Edition Transformer Functions From Derivation Editor Date & Time Logical Null Handling Number String Type Conversion
238
DataStage Enterprise Edition
Exercise Complete exercises 6-1, 6-2, and 6-3
239
DataStage Enterprise Edition Sorting Data
Module 7 Sorting Data
240
DataStage Enterprise Edition
Objectives Understand DataStage EE sorting options Use this understanding to create sorted list of data to enable functionality within a transformer stage
241
DataStage Enterprise Edition
Sorting Data Important because Some stages require sorted input Some stages may run faster – e.g. Aggregator Can be performed As an option within stages (use the input > partitioning tab and set partitioning to anything other than auto) As a separate stage (more complex sorts)
242
DataStage Enterprise Edition
Sorting Alternatives One of the nation's largest direct marketing outfits has been using this simple program in DS-EE (and its previous instantiations) for years. Householding yields enormous savings by avoiding mailing the same material (in particular expensive catalogs) to the same household. Alternative representation of same flow:
243
Sort Option on Stage Link
DataStage Enterprise Edition Sort Option on Stage Link Stable will not rearrange records that are already in a properly sorted data set. If set to false no prior ordering of records is guaranteed to be preserved by the sorting operation.
244
DataStage Enterprise Edition
Sort Stage
245
DataStage Enterprise Edition
Sort Utility DataStage (the default) or UNIX
246
DataStage Enterprise Edition
Sort Stage - Outputs Specifies how the output is derived
247
Sort Specification Options
DataStage Enterprise Edition Sort Specification Options Input Link Property Limited functionality Max memory/partition is 20 MB, then spills to scratch Sort Stage Tunable to use more memory before spilling to scratch. Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
248
DataStage Enterprise Edition
Removing Duplicates Can be done by Sort stage Use unique option OR Remove Duplicates stage Has more sophisticated ways to remove duplicates
249
DataStage Enterprise Edition
Exercise Complete exercise 7-1
250
DataStage Enterprise Edition Combining Data
Module 8 Combining Data
251
DataStage Enterprise Edition
Objectives Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages Use this understanding to create jobs that will Combine data from separate input streams Aggregate data to form summary totals
252
DataStage Enterprise Edition
Combining Data There are two ways to combine data: Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g., Joins Lookup Merge Vertically: One input link, one output link with column combining values from all input rows. E.g., Aggregator
253
Join, Lookup & Merge Stages
DataStage Enterprise Edition Join, Lookup & Merge Stages These "three Stages" combine two or more input links according to values of user-designated "key" column(s). They differ mainly in: Memory usage Treatment of rows with unmatched key values Input requirements (sorted, de-duplicated)
254
Not all Links are Created Equal
DataStage Enterprise Edition Not all Links are Created Equal Enterprise Edition distinguishes between: - The Primary Input (Framework port 0) - Secondary - in some cases "Reference" (other ports) Naming convention: Tip: Check "Input Ordering" tab to make sure intended Primary is listed first The Framework concept of Port # is translated in the GUI by Primary/Reference
255
DataStage Enterprise Edition
Join Stage Editor Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge) One of four variants: Inner Left Outer Right Outer Full Outer Several key columns allowed
256
DataStage Enterprise Edition
1. The Join Stage Four types: Inner, Left Outer, Right Outer, Full Outer 2 sorted input links, 1 output link "left outer" on primary input, "right outer" on secondary input Pre-sort makes joins "lightweight": few rows need to be in RAM Follows the RDBMS-style relational model: the operations Join and Load in RDBMS commute. Cross-products in case of duplicates; matching entries are reusable No fail/reject/drop option for missed matches
257
DataStage Enterprise Edition
2. The Lookup Stage Combines: one source link with one or more duplicate-free table links Source input One or more tables (LUTs) no pre-sort necessary allows multiple keys LUTs flexible exception handling for source input rows with no match 1 2 1 Contrary to Join, Lookup and Merge deal with missing rows. Obviously, a missing row cannot be captured, since it is missing. The closest thing one can capture is the corresponding unmatched row. Lookup can capture in a reject link unmatched rows from the primary input(Source). That is why it has only one reject link (there is only one primary). We'll see the reject option is exactly the opposite with the Merge stage. Lookup Output Reject
258
DataStage Enterprise Edition
The Lookup Stage Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging) On an MPP you should partition the lookup tables using entire partitioning method, or partition them the same way you partition the source link On an SMP, no physical duplication of a Lookup Table occurs
259
DataStage Enterprise Edition
The Lookup Stage Lookup File Set Like a persistent data set only it contains metadata about the key. Useful for staging lookup tables RDBMS LOOKUP NORMAL Loads to an in memory hash table first SPARSE Select for each row. Might become a performance bottleneck.
260
DataStage Enterprise Edition
3. The Merge Stage Combines one sorted, duplicate-free master (primary) link with one or more sorted update (secondary) links. Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup). Follows the Master-Update model: Master row and one or more update rows are merged if they have the same value in user-specified key column(s). A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored) Unmatched ("Bad") master rows can be either kept or dropped Unmatched ("Bad") update rows in an input link can be captured in a "reject" link Matched update rows are consumed.
261
DataStage Enterprise Edition
The Merge Stage Allows composite keys Multiple update links Matched update rows are consumed Unmatched updates can be captured Lightweight Space/time tradeoff: presorts vs. in-RAM table Master One or more updates 1 2 1 2 Merge Contrary to Lookup, Merges captures unmatched secondary (update) rows. Since there may be several update links, there may be several reject links. Output Rejects
262
DataStage Enterprise Edition
Synopsis: Joins, Lookup, & Merge This table contains everything one needs to know to use the three stages. In this table: , <comma> = separator between primary and secondary input links (out and reject links)
263
DataStage Enterprise Edition
The Aggregator Stage Purpose: Perform data aggregations Specify: Zero or more key columns that define the aggregation units (or groups) Columns to be aggregated Aggregation functions: count (nulls/non-nulls) sum max/min/range The grouping method (hash table or pre-sort) is a performance issue
264
DataStage Enterprise Edition
Grouping Methods Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed doesn’t require sorted data good when number of unique groups is small. Running tally for each group’s aggregate calculations need to fit easily into memory. Require about 1KB/group of RAM. Example: average family income by state, requires .05MB of RAM Sort: results for only a single aggregation group are kept in memory; when new group is seen (key value changes), current group written out. requires input sorted by grouping keys can handle unlimited numbers of groups Example: average daily balance by credit card WARNING! Hash has nothing to do with the Hash Partitioner! It says one hash table per group must be carried in RAM Sort has nothing to do with the Sort Stage. Just says expects sorted input.
265
DataStage Enterprise Edition
Aggregator Functions Sum Min, max Mean Missing value count Non-missing value count Percent coefficient of variation
266
Aggregator Properties
DataStage Enterprise Edition Aggregator Properties
267
DataStage Enterprise Edition
Aggregation Types Aggregation types
268
DataStage Enterprise Edition
Containers Two varieties Local Shared Simplifies a large, complex diagram Creates reusable object that many jobs can include
269
DataStage Enterprise Edition
Creating a Container Create a job Select (loop) portions to containerize Edit > Construct container > local or shared
270
DataStage Enterprise Edition
Using a Container Select as though it were a stage
271
DataStage Enterprise Edition
Exercise Complete exercise 8-1
272
DataStage Enterprise Edition Configuration Files
Module 9 Configuration Files
273
DataStage Enterprise Edition
Objectives Understand how DataStage EE uses configuration files to determine parallel behavior Use this understanding to Build a EE configuration file for a computer system Change node configurations to support adding resources to processes that need them Create a job that will change resource allocations at the stage level
274
Configuration File Concepts
DataStage Enterprise Edition Configuration File Concepts Determine the processing nodes and disk space connected to each node When system changes, need only change the configuration file – no need to recompile jobs When DataStage job runs, platform reads configuration file Platform automatically scales the application to fit the system
275
DataStage Enterprise Edition
Processing Nodes Are Locations on which the framework runs applications Logical rather than physical construct Do not necessarily correspond to the number of CPUs in your system Typically one node for two CPUs Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node
276
Optimizing Parallelism
DataStage Enterprise Edition Optimizing Parallelism Degree of parallelism determined by number of nodes defined Parallelism should be optimized, not maximized Increasing parallelism distributes work load but also increases Framework overhead Hardware influences degree of parallelism possible System hardware partially determines configuration The hardware that makes up your system partially determines configuration. For example, applications with large memory requirements, such as sort operations, are best assigned to machines with a lot of memory. Applications that will access an RDBMS must run on its server nodes; operators using other proprietary software, such as SAS or SyncSort, must run on nodes with licenses for that software.
277
More Factors to Consider
DataStage Enterprise Edition More Factors to Consider Communication amongst operators Should be optimized by your configuration Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link SMP – leave some processors for operating system Desirable to equalize partitioning of data Use an experimental approach Start with small data sets Try different parallelism while scaling up data set sizes
278
Factors Affecting Optimal Degree of Parallelism
DataStage Enterprise Edition Factors Affecting Optimal Degree of Parallelism CPU intensive applications Benefit from the greatest possible parallelism Applications that are disk intensive Number of logical nodes equals the number of disk spindles being accessed
279
DataStage Enterprise Edition
Configuration File Text file containing string data that is passed to the Framework Sits on server side Can be displayed and edited Name and location found in environmental variable APT_CONFIG_FILE Components Node Fast name Pools Resource
280
DataStage Enterprise Edition
Node Options Node name – name of a processing node used by EE Typically the network name Use command uname -n to obtain network name Fastname – Name of node as referred to by fastest network in the system Operators use physical node name to open connections NOTE: for SMP, all CPUs share single connection to network Pools Names of pools to which this node is assigned Used to logically group nodes Can also be used to group resources Resource Disk Scratchdisk Set of node pool reserved names: DB2 Oracle Informix Sas Sort Syncsort
281
Sample Configuration File
DataStage Enterprise Edition Sample Configuration File
{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
For a single-node system the node name is usually set to the value of the UNIX command uname -n. The Fastname attribute is the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fast name is the physical node name that operators use to open connections for high-volume data transfers. Typically this is the principal node name as returned by the UNIX command uname -n.
282
DataStage Enterprise Edition
Disk Pools Disk pools allocate storage By default, EE uses the default pool, specified by “” pool "bigdata"
283
DataStage Enterprise Edition
Sorting Requirements Resource pools can also be specified for sorting: The Sort stage looks first for scratch disk resources in a “sort” pool, and then in the default disk pool Recommendations: Each logical node defined in the configuration file that will run sorting operations should have its own sort disk. Each logical node's sorting disk should be a distinct disk drive or a striped disk, if it is shared among nodes. In large sorting operations, each node that performs sorting should have multiple disks, where a sort disk is a scratch disk available for sorting that resides in either the sort or default disk pool.
284
Another Configuration File Example
DataStage Enterprise Edition Another Configuration File Example
{
  node "n1"
  {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2"
  {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3"
  {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
  }
  node "n4"
  {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    ...
  }
}
285
DataStage Enterprise Edition
Resource Types Disk Scratchdisk DB2 Oracle Saswork Sortwork Can exist in a pool Groups resources together
286
Using Different Configurations
DataStage Enterprise Edition Using Different Configurations In this instance, since a sparse lookup is viewed as the bottleneck, the stage has been set to execute on multiple nodes. Lookup stage where DBMS is using a sparse lookup type
287
Building a Configuration File
DataStage Enterprise Edition Building a Configuration File Scoping the hardware: Is the hardware configuration SMP, Cluster, or MPP? Define each node structure (an SMP would be single node): Number of CPUs CPU speed Available memory Available page/swap space Connectivity (network/back-panel speed) Is the machine dedicated to EE? If not, what other applications are running on it? Get a breakdown of the resource usage (vmstat, mpstat, iostat) Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?
288
DataStage Enterprise Edition
Exercise Complete exercise 9-1 and 9-2
289
DataStage Enterprise Edition Extending DataStage EE
Module 10 Extending DataStage EE
290
DataStage Enterprise Edition
Objectives Understand the methods by which you can add functionality to EE Use this understanding to: Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages Build a DataStage EE job that uses the new stage
291
EE Extensibility Overview
DataStage Enterprise Edition EE Extensibility Overview Sometimes it will be to your advantage to leverage EE’s extensibility. This extensibility includes: Wrappers Buildops Custom Stages
292
When To Leverage EE Extensibility
DataStage Enterprise Edition When To Leverage EE Extensibility Types of situations: Complex business logic, not easily accomplished using standard EE stages Reuse of existing C, C++, Java, COBOL, etc…
293
Wrappers vs. Buildop vs. Custom
DataStage Enterprise Edition Wrappers vs. Buildop vs. Custom Wrappers are good if you cannot or do not want to modify the application and performance is not critical. Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces. Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.
294
Building “Wrapped” Stages
DataStage Enterprise Edition Building “Wrapped” Stages You can “wrapper” a legacy executable: Binary Unix command Shell script … and turn it into a Enterprise Edition stage capable, among other things, of parallel execution… As long as the legacy executable is: amenable to data-partition parallelism no dependencies between rows pipe-safe can read rows sequentially no random access to data
295
DataStage Enterprise Edition
Wrappers (Cont’d) Wrappers are treated as a black box EE has no knowledge of contents EE has no means of managing anything that occurs inside the wrapper EE only knows how to export data to and import data from the wrapper User must know at design time the intended behavior of the wrapper and its schema interface If the wrappered application needs to see all records prior to processing, it cannot run in parallel.
296
DataStage Enterprise Edition
LS Example Can this command be wrappered?
297
DataStage Enterprise Edition
Creating a Wrapper To create the “ls” stage Example: wrapping ls Unix command: Ls /opt/AdvTrain/dnich would yield a list of files and subdirectories. The wrapper is thus comprised of the command and a parameter that contains a disk location. Used in this job ---
298
Wrapper Starting Point
DataStage Enterprise Edition Wrapper Starting Point Creating Wrapped Stages From Manager: Right-Click on Stage Type > New Parallel Stage > Wrapped We will "Wrapper” an existing Unix executables – the ls command
299
DataStage Enterprise Edition
Wrapper - General Page Name of stage Unix ls command can take several arguments but we will use it in its simplest form: Ls location Where location will be passed into the stage with a job parameter. Unix command to be wrapped
300
DataStage Enterprise Edition
The "Creator" Page Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.
301
Wrapper – Properties Page
DataStage Enterprise Edition Wrapper – Properties Page If your stage will have properties appear, complete the Properties page This will be the name of the property as it appears in your stage
302
DataStage Enterprise Edition
Wrapper - Wrapped Page Interfaces – input and output columns - these should first be entered into the table definitions meta data (DS Manager); let’s do that now. The Interfaces > input and output describe the meta data for how you will communicate with the wrapped application.
303
DataStage Enterprise Edition
Interface schemas Layout interfaces describe what columns the stage: Needs for its inputs (if any) Creates for its outputs (if any) Should be created as tables with columns in Manager
304
Column Definition for Wrapper Interface
DataStage Enterprise Edition Column Definition for Wrapper Interface
305
How Does the Wrapping Work?
DataStage Enterprise Edition How Does the Wrapping Work? Define the schema for export and import Schemas become interface schemas of the operator and allow for by-name column access (data flow: input schema -> export -> stdin or named pipe -> UNIX executable -> stdout or named pipe -> import -> output schema) QUIZ: Why does export precede import? Answer: You must first EXIT the DS-EE environment to access the vanilla Unix environment. Then you must reenter the DS-EE environment.
306
Update the Wrapper Interfaces
DataStage Enterprise Edition Update the Wrapper Interfaces This wrapper will have no input interface – i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
307
DataStage Enterprise Edition
Resulting Job Wrapped stage
308
DataStage Enterprise Edition
Job Run Show file from Designer palette
309
Wrapper Story: Cobol Application
DataStage Enterprise Edition Wrapper Story: Cobol Application Hardware Environment: IBM SP2, 2 nodes with 4 CPU’s per node. Software: DB2/EEE, COBOL, EE Original COBOL Application: Extracted source table, performed lookup against table in DB2, and Loaded results to target table. 4 hours 20 minutes sequential execution Enterprise Edition Solution: Used EE to perform Parallel DB2 Extracts and Loads Used EE to execute COBOL application in Parallel EE Framework handled data transfer between DB2/EEE and COBOL application 30 minutes 8-way parallel execution
310
DataStage Enterprise Edition
Buildops Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (like the wrapper) Reasons to use Buildop include: Speed / Performance Complex business logic that cannot be easily represented using existing stages Lookups across a range of values Surrogate key generation Rolling aggregates Build once and reusable everywhere within project, no shared container necessary Can combine functionality from different stages into one
311
DataStage Enterprise Edition
BuildOps The DataStage programmer encapsulates the business logic The Enterprise Edition interface called “buildop” automatically performs the tedious, error-prone tasks: invoke needed header files, build the necessary “plumbing” for a correct and efficient parallel execution. Exploits extensibility of EE Framework
312
BuildOp Process Overview
DataStage Enterprise Edition BuildOp Process Overview From Manager (or Designer): Repository pane: Right-Click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped} "Build" stages from within Enterprise Edition "Wrapping” existing “Unix” executables
313
DataStage Enterprise Edition
General Page Identical to Wrappers, except: Under the Build Tab, your program!
314
Logic Tab for Business Logic
DataStage Enterprise Edition Logic Tab for Business Logic Enter business C/C++ logic and arithmetic in four pages under the Logic tab The main code section goes in the Per-Record page – it will be applied to all rows NOTE: Code will need to be ANSI C/C++ compliant. If code does not compile outside of EE, it won’t compile within EE either! The four tabs are Definitions, Pre-Loop, Per-Record, Post-Loop. The main action is in Per-Record. Definitions is used to declare and initialize variables Pre-Loop has code to be performed prior to processing the first row. Post-Loop has code to be performed after processing the last row
315
Code Sections under Logic Tab
DataStage Enterprise Edition Code Sections under Logic Tab Temporary variables declared [and initialized] here Logic here is executed once BEFORE processing the FIRST row Logic here is executed once AFTER processing the LAST row
316
DataStage Enterprise Edition
I/O and Transfer Under Interface tab: Input, Output & Transfer pages First line: output 0 This is the Output page. This Input page is the same, except it has an "Auto Read," instead of the "Auto Write" column. The Input/Output interface TDs must be prepared in advance and put in the repository. Write row Input page: 'Auto Read' Read next row In-Repository Table Definition 'False' setting, not to interfere with Transfer page Optional renaming of output port from default "out0"
317
DataStage Enterprise Edition
I/O and Transfer First line: Transfer of index 0 Transfer all columns from input to output. If page left blank or Auto Transfer = "False" (and RCP = "False") Only columns in output Table Definition are written The role of Transfer will be made clearer soon with examples.
318
BuildOp Simple Example
DataStage Enterprise Edition BuildOp Simple Example Example - sumNoTransfer Add input columns "a" and "b"; ignore other columns that might be present in input Produce a new "sum" column Do not transfer input columns sumNoTransfer a:int32; b:int32 Only the column(s) explicitly listed in the output TD survive. sum:int32
319
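A minimal sketch of the Per-Record code for this example, assuming the interface table definitions above and that columns are referenced directly by name in buildop code (an assumption, not spelled out in these slides), might be a single line:
    sum = a + b;
With Auto Read and Auto Write enabled, the stage reads the next row, evaluates this statement, and writes the output row on each pass of the loop.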
DataStage Enterprise Edition
No Transfer From Peek: NO TRANSFER RCP set to "False" in stage definition and Transfer page left blank, or Auto Transfer = "False" Effects: input columns "a" and "b" are not transferred only new column "sum" is transferred Compare with transfer ON…
320
DataStage Enterprise Edition
Transfer TRANSFER RCP set to "True" in stage definition or Auto Transfer set to "True" Effects: new column "sum" is transferred, as well as input columns "a" and "b" and input column "ignored" (present in input, but not mentioned in stage) All the columns in the input link are transferred, irrespective of what the Input/Output TDs say.
321
Columns vs. Temporary C++ Variables
DataStage Enterprise Edition Columns vs. Temporary C++ Variables Temp C++ variables C/C++ type Need declaration (in Definitions or Pre-Loop page) Value persistent throughout "loop" over rows, unless modified in code Columns DS-EE type Defined in Table Definitions Value refreshed from row to row ANSWER to QUIZ Replacing index = count++; with index++ ; would result in index=1 throughout. See bottom bullet in left column.
322
DataStage Enterprise Edition
Exercise Complete exercise 10-1 and 10-2
323
DataStage Enterprise Edition
Exercise Complete exercises 10-3 and 10-4
324
DataStage Enterprise Edition
Custom Stage Reasons for a custom stage: Add EE operator not already in DataStage EE Build your own Operator and add to DataStage EE Use EE API Use Custom Stage to add new operator to EE canvas
325
DataStage Enterprise Edition
Custom Stage DataStage Manager > select Stage Types branch > right click
326
DataStage Enterprise Edition
Custom Stage Number of input and output links allowed Name of Orchestrate operator to be used
327
Custom Stage – Properties Tab
DataStage Enterprise Edition Custom Stage – Properties Tab
328
DataStage Enterprise Edition
The Result
329
DataStage Enterprise Edition Meta Data in DataStage EE
Module 11 Meta Data in DataStage EE
330
DataStage Enterprise Edition
Objectives Understand how EE uses meta data, particularly schemas and runtime column propagation Use this understanding to: Build schema definition files to be invoked in DataStage jobs Use RCP to manage meta data usage in EE jobs
331
Establishing Meta Data
DataStage Enterprise Edition Establishing Meta Data Data definitions Recordization and columnization Fields have properties that can be set at individual field level Data types in GUI are translated to types used by EE Described as properties on the format/columns tab (outputs or inputs pages) OR Using a schema file (can be full or partial) Schemas Can be imported into Manager Can be pointed to by some job stages (i.e. Sequential)
332
Data Formatting – Record Level
DataStage Enterprise Edition Data Formatting – Record Level Format tab Meta data described on a record basis Record level properties To view documentation on each of these properties, open a stage > input or output > format. Now hover your cursor over the property in question and help text will appear.
333
Data Formatting – Column Level
DataStage Enterprise Edition Data Formatting – Column Level Defaults for all columns
334
DataStage Enterprise Edition
Column Overrides Edit row from within the columns tab Set individual column properties
335
Extended Column Properties
DataStage Enterprise Edition Extended Column Properties Field and string settings
336
Extended Properties – String Type
DataStage Enterprise Edition Extended Properties – String Type Note the ability to convert ASCII to EBCDIC
337
DataStage Enterprise Edition
Editing Columns Properties depend on the data type
338
DataStage Enterprise Edition
Schema Alternative way to specify column definitions for data used in EE jobs Written in a plain text file Can be written as a partial record definition Can be imported into the DataStage repository The format of each line describing a column is: column_name:[nullability]datatype
339
DataStage Enterprise Edition
Creating a Schema Using a text editor: follow the correct syntax for the definitions. OR import from an existing data set or file set: in DataStage Manager, Import > Table Definitions > Orchestrate Schema Definitions, then select the checkbox for a file with a .fs or .ds extension.
340
DataStage Enterprise Edition
Importing a Schema The schema location can be on the server or on the local workstation
341
DataStage Enterprise Edition
Data Types Date Decimal Floating point Integer String Time Timestamp Vector Subrecord Raw Tagged Raw – a collection of untyped bytes Vector – a one-dimensional array of elements of a single type Subrecord – a record within a record (the elements of a group level) Tagged – a column linked to another column that defines its data type
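A rough sketch of how some of the complex types appear in a schema file: scores is a fixed-length vector of int32, address is a subrecord, and payload is a fixed-length raw field. The field names are hypothetical, and the exact keyword spellings should be checked against the Orchestrate schema documentation:

    record (
      scores[5]: int32;
      address: subrec (street: string[max=40]; city: string[max=25];);
      payload: raw[16];
    )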
342
Runtime Column Propagation
DataStage Enterprise Edition Runtime Column Propagation DataStage EE is flexible about meta data: it can cope with situations where the meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). RCP is always on at runtime; the RCP setting controls column-mapping enforcement at design and compile time. RCP is off by default. Enable it first at the project level (Administrator, project properties), then at the job level (Job Properties, General tab), and at the stage level (output link Columns tab).
343
Enabling RCP at Project Level
DataStage Enterprise Edition Enabling RCP at Project Level
344
Enabling RCP at Job Level
DataStage Enterprise Edition Enabling RCP at Job Level
345
Enabling RCP at Stage Level
DataStage Enterprise Edition Enabling RCP at Stage Level Go to the output link's Columns tab. For the Transformer, you can find the output link's Columns tab by first going to Stage Properties.
346
Using RCP with Sequential Stages
DataStage Enterprise Edition Using RCP with Sequential Stages To use runtime column propagation in the Sequential stage you must use the "use schema" option. Stages with this restriction: Sequential File, File Set, External Source, External Target. What is runtime column propagation? Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define just the columns you are interested in using in a job and ask DataStage to propagate the other columns through the various stages, so such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema that describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import, Column Export.
347
Runtime Column Propagation
DataStage Enterprise Edition Runtime Column Propagation When RCP is Disabled DataStage Designer will enforce Stage Input Column to Output Column mappings. At job compile time modify operators are inserted on output links in the generated osh. Modify operators can add or change columns in a data flow.
348
Runtime Column Propagation
DataStage Enterprise Edition Runtime Column Propagation When RCP is Enabled DataStage Designer will not enforce mapping rules. No Modify operator is inserted at compile time. There is a danger of a runtime error if incoming column names do not match the column names on the outgoing link (column names are case sensitive).
349
DataStage Enterprise Edition
Exercise Complete exercises 11-1 and 11-2
350
DataStage Enterprise Edition Job Control Using the Job Sequencer
Module 12 Job Control Using the Job Sequencer
351
DataStage Enterprise Edition
Objectives Understand how the DataStage job sequencer works Use this understanding to build a control job to run a sequence of DataStage jobs
352
DataStage Enterprise Edition
Job Control Options Manually write job control – code is written in DataStage BASIC; use the Job Control tab on the Job Properties page, which generates BASIC code that you can modify. Job Sequencer – build a controlling job much the same way you build other jobs; composed of stages and links; no BASIC coding.
353
DataStage Enterprise Edition
Job Sequencer Build like a regular job Type “Job Sequence” Has stages and links Job Activity stage represents a DataStage job Links represent passing control Stages
354
DataStage Enterprise Edition
Example Job Activity stage – contains conditional triggers
355
Job Activity Properties
DataStage Enterprise Edition Job Activity Properties Job to be executed – select from dropdown Job parameters to be passed
356
DataStage Enterprise Edition
Job Activity Trigger Trigger appears as a link in the diagram Custom options let you define the code
357
DataStage Enterprise Edition
Options Use the custom option for conditionals, e.g. execute only if the job run was successful or finished with warnings only. Can add a "Wait For File" activity to control execution. Add an "Execute Command" stage to drop the real tables and rename the new tables to the current tables.
358
Job Activity With Multiple Links
DataStage Enterprise Edition Job Activity With Multiple Links Different links can have different triggers
359
DataStage Enterprise Edition
Sequencer Stage Build a job sequencer to control jobs for the collections application The sequencer mode can be set to All or Any
360
DataStage Enterprise Edition
Notification Stage Notification
361
Notification Activity
DataStage Enterprise Edition Notification Activity
362
Sample DataStage log from Mail Notification
DataStage Enterprise Edition Sample DataStage log from Mail Notification
363
Notification Activity Message
DataStage Enterprise Edition Notification Activity Message Message
364
DataStage Enterprise Edition
Exercise Complete exercise 12-1
365
DataStage Enterprise Edition Testing and Debugging
Module 13 Testing and Debugging
366
DataStage Enterprise Edition
Objectives Understand the spectrum of tools available for testing and debugging Use this understanding to troubleshoot a DataStage job
367
Environment Variables
DataStage Enterprise Edition Environment Variables Environment variables fall into broad categories, listed in the left pane. We'll look at these categories one by one. All environment variable values listed in the Administrator are the project-wide defaults. They can be modified per job in the Designer and, again, per run in the Director. The default values are reasonable; there is no need for the beginning user to modify them, or even to know much about them, with one possible exception: APT_CONFIG_FILE (see the next slide).
368
Parallel Environment Variables
DataStage Enterprise Edition Parallel Environment Variables Highlighted: APT_CONFIG_FILE, which contains the path (on the server) of the active configuration file. The main aspect of a given configuration file is the number of nodes it declares. In the labs we used two files: one with one node declared, for sequential execution, and one with two nodes declared, for parallel execution.
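For reference, a two-node configuration file of the kind used for parallel execution might look roughly like this (the host name and disk paths are placeholders, not the actual lab values):

    {
      node "node1"
      {
        fastname "devserver"
        pools ""
        resource disk "/data/ds/disk1" {pools ""}
        resource scratchdisk "/data/ds/scratch1" {pools ""}
      }
      node "node2"
      {
        fastname "devserver"
        pools ""
        resource disk "/data/ds/disk2" {pools ""}
        resource scratchdisk "/data/ds/scratch2" {pools ""}
      }
    }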
369
Environment Variables Stage Specific
DataStage Enterprise Edition Environment Variables Stage Specific The correct settings for these should be set at install. If you need to modify them, first check with your DBA.
370
Environment Variables
DataStage Enterprise Edition Environment Variables These are for the user to experiment with. They are easy to use: they take only TRUE/FALSE values and control the verbosity of the log file. The defaults are set for minimal verbosity. The top one, APT_DUMP_SCORE, is an old favorite: it tracks data sets, nodes, partitions, and combinations, all to be discussed soon. APT_RECORD_COUNTS helps you detect load imbalance. APT_PRINT_SCHEMAS shows the textual representation of the unformatted meta data at all stages. Online descriptions are available with the "Help" button.
371
Environment Variables Compiler
DataStage Enterprise Edition Environment Variables Compiler You need to have these right to use the Transformer and the Custom stages. Only these stages invoke the C++ compiler. The correct values are listed in the Release Notes.
372
DataStage Enterprise Edition
The Director Typical job log messages: environment variables, configuration file information, framework info/warning/error messages, output from the Peek stage, additional info with the "Reporting" environment variables, and tracing/debug output (the job must be compiled in trace mode, which adds overhead).
373
Job Level Environment Variables
DataStage Enterprise Edition Job Level Environment Variables Job Properties, from the Designer menu bar. The Director will prompt you before each run. Project-wide environment variables set by the Administrator can be modified on a per-job basis in the Designer's Job Properties, and on a per-run basis in the Director. This provides great flexibility.
374
DataStage Enterprise Edition
Troubleshooting If you get an error during compile, check the following: Compilation problems – if a Transformer is used, check the C++ compiler settings and LD_LIBRARY_PATH; if there are Buildop errors, try running buildop from the command line. Some stages may not support RCP, which can cause a column mismatch. Use the Show Error and More buttons. Examine the generated OSH. Check environment variable settings. Very little integrity checking is done during compile, so you should run Validate from the Director, which highlights the source of the error.
375
DataStage Enterprise Edition
Generating Test Data The Row Generator stage can be used. Column definitions are data-type dependent. Row Generator plus Lookup stages provide a good way to create robust test data from pattern files.