What Does Ab Initio Mean?
Ab Initio is a Latin phrase that means:
- Of, relating to, or occurring at the beginning; first
- "From first principles," in scientific circles
- "From the beginning," in legal circles
About Ab Initio
Ab Initio is a general-purpose data processing platform for enterprise-class, mission-critical applications such as data warehousing, clickstream processing, data movement, data transformation, and analytics. It supports integration of arbitrary data sources and programs, and provides complete metadata management across the enterprise. It is a proven, best-of-breed ETL solution.
Applications of Ab Initio:
- ETL for data warehouses, data marts, and operational data stores
- Parallel data cleansing and validation
- Parallel data transformation and filtering
- High-performance analytics
- Real-time, parallel data capture
Ab Initio Platforms
No problem is too big or too small for Ab Initio: it runs on a few processors or a few hundred. Ab Initio runs on virtually every kind of hardware:
- SMP (Symmetric Multiprocessor) systems
- MPP (Massively Parallel Processor) systems
- Clusters
- PCs
Ab Initio runs on many operating systems
- Compaq Tru64 UNIX / Digital UNIX
- Hewlett-Packard HP-UX
- IBM AIX
- NCR MP-RAS
- Red Hat Linux
- IBM/Sequent DYNIX/ptx
- Siemens Pyramid Reliant UNIX
- Silicon Graphics IRIX
- Sun Solaris
- Windows NT and Windows 2000
Ab Initio base software consists of three main pieces:
- Ab Initio Co>Operating System and core components
- Graphical Development Environment (GDE)
- Enterprise Metadata Environment (EME)
Ab Initio Architecture
The Ab Initio software stack, from top to bottom:
- Applications
- Ab Initio Metadata Repository
- Application development environments: Graphical, C, Shell
- Component Library, user-defined components, third-party components
- Ab Initio Co>Operating System
- Native operating system (UNIX, Windows NT)
Ab Initio Overview
- EME: stores all variables in a repository; it is also used for control, and collects all metadata about graphs developed in the GDE.
- GDE: used to create all your graphs.
- Co>Operating System: runs all your graphs; a graph, when deployed, generates a .ksh script.
- DTM: used to schedule graphs developed in the GDE; it can also maintain dependencies between graphs.
Co>Operating System
The Co>Operating System is the core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe reliability. The Co>Operating System is layered on top of the native operating systems of a collection of computers. It provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging.
Graphical Development Environment (GDE)
The GDE lets you create applications by dragging and dropping components onto a canvas, configuring them with familiar, intuitive point-and-click operations, and connecting them into executable flowcharts. These diagrams are architectural documents that developers and managers alike can understand and use. The Co>Operating System executes these flowcharts directly, which means there is a seamless and solid connection between the abstract picture of the application and the concrete reality of its execution.
Graphical Development Environment (GDE)
The Graphical Development Environment (GDE) provides a graphical user interface to the services of the Co>Operating System.
- Unlimited scalability: data parallelism yields speedups proportional to the hardware resources provided; double the number of CPUs and execution time is halved.
- Flexibility: provides a powerful and efficient data transformation engine and an open component model for extending and customizing Ab Initio's functionality.
- Portability: runs heterogeneously across a huge variety of operating systems and hardware platforms.
Graphical Method for Building Business Applications
A graph is a picture that represents the various processing stages of a task and the streams of data as they move from one stage to another. If one picture is worth a thousand words, is one graph worth a thousand lines of code? Ab Initio application graphs often represent in a diagram or two what might have taken hundreds to thousands of lines of code. This can dramatically reduce the time it takes to develop, test, and maintain applications.
What is Graph Programming?
Ab Initio has based the GDE on the data flow model. Data flow diagrams allow you to think in terms of meaningful processing steps, not microscopic lines of code, and they capture the movement of information through the application. Ab Initio calls this development method Graph Programming.
Graph Programming
The process of constructing Ab Initio applications is called Graph Programming. In Ab Initio's Graphical Development Environment, you build an application by manipulating components, the building blocks of the graph. Ab Initio graphs are based on the data flow model; even the symbols are similar. The basic parts of Ab Initio graphs are shown below.
Symbols
- Boxes for processing and data transforms
- Arrows for data flows between processes
- Cylinders for serial I/O files
- Divided cylinders for parallel I/O files
- Grid boxes for database tables
Graph Programming
Working with the GDE on your desktop is easier than drawing a data flow diagram on a whiteboard. You simply drag and drop functional modules called components and link them with a swipe of the mouse. When it's time to run the application, the Ab Initio Co>Operating System turns the diagram into a collection of processes running on servers. The Ab Initio term for a running data flow diagram is a graph: the inputs and outputs are dataset components, the processing steps are program components, and the data conduits are flows.
Anatomy of a Running Job
What happens when you push the "Run" button?
- Your graph is translated into a script that can be executed in the Shell Development Environment.
- This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
- The script is invoked (via REXEC or TELNET) on the server.
- The script creates and runs a job that may run across many nodes.
- Monitoring information is sent back to the GDE client.
Anatomy of a Running Job
Host Process Creation
- Pushing the "Run" button generates a script.
- The script is transmitted to the Host node.
- The script is invoked, creating the Host process.
Anatomy of a Running Job
Agent Process Creation
The Host process spawns Agent processes on the processing nodes.
Anatomy of a Running Job
Component Process Creation
Agent processes create Component processes on each processing node.
Anatomy of a Running Job
Component Execution
Component processes do their jobs, communicating directly with datasets and with each other to move data around.
Anatomy of a Running Job
Successful Component Termination
As each Component process finishes with its data, it exits with success status.
Anatomy of a Running Job
Agent Termination
When all of an Agent's Component processes exit, the Agent informs the Host process that those components are finished. The Agent process then exits.
Anatomy of a Running Job
Host Termination
When all Agents have exited, the Host process informs the GDE that the job is complete. The Host process then exits.
Ab Initio Software Versions & File Extensions
Software versions:
Co>Operating System version =>
GDE version =>
File extensions:
- .mp: stored Ab Initio graph or graph component
- .mpc: program or custom component
- .mdc: dataset or custom dataset component
- .dml: Data Manipulation Language file, or record type definition
- .xfr: transform function file
- .dat: data file (either serial file or multifile)
Versions
To find the GDE version, select Help >> About Ab Initio from the GDE window. To find the Co>Operating System version, select Run >> Settings from the GDE window and look for the Detected Base System Version.
Connecting to the Co>Operating System Server from the GDE
Host Profile Setting
Choose Settings from the Run menu, then:
1. Check the Use Host Profile Setting checkbox.
2. Click the Edit button to open the Host Profile dialog.
3. If running Ab Initio on your local NT system, check the Local Execution (NT) checkbox and go to step 6.
4. If running Ab Initio on a remote UNIX system, fill in the Host, Host Login, and Password.
5. Type the full path of the Host directory.
6. Select the Shell Type from the pull-down menu.
7. Test the login and, if necessary, make changes.
Host Profile
Enter the Host, Login, Password, and Host directory, and select the Shell Type.
Ab Initio Components
Ab Initio provides a library of components; the Dataset, Partition, Transform, Sort, and Database components are the most frequently used.
Creating a Graph
Type the label and specify the input .dat file.
Creating a Graph – DML
Ways to specify a record format:
- Propagate from Neighbors: copy record formats from a connected flow.
- Same As: copy record formats from a specific component's port.
- Path: store record formats in a local file, a host file, or the Ab Initio repository.
- Embedded: type the record format directly as a string.
Specify the .dml file.
Creating a Graph – DML
DML is Ab Initio's Data Manipulation Language. DML describes data in terms of:
- Record formats that list the fields and format of input, output, and intermediate records.
- Expressions that define simple computations, for example, selection.
- Transform functions that control reformatting, aggregation, and other data transformations.
- Keys that specify grouping, ordering, and partitioning relationships between records.
You can edit a .dml file through the Record Format Editor (Grid View). An example record format and key specifier follow.
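To make these terms concrete, here is the record format DML that reappears later in this presentation (on the "Sample Data to be Partitioned" slide), together with a sketch of a key specifier. The key fields come from that record format; the exact specifier syntax should be checked against your GDE version:

  record
    decimal(2) id;
    string(5) name;
    decimal(5) zipcode;
    decimal(3) amount;
    string(1) newline;
  end

  /* A key specifier, e.g. for grouping, ordering, or partitioning */
  {zipcode; amount descending}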
Creating a Graph – Transform
A transform function is either a DML file or a DML string that describes how you manipulate your data. Ab Initio transform functions consist mainly of a series of assignment statements; each statement is called a business rule. When Ab Initio evaluates a transform function, it performs the following tasks:
- Initializes local variables.
- Evaluates statements.
- Evaluates rules.
Transform function files have the .xfr extension. Specify the .xfr file.
Creating a Graph – XFR
- Transform function: a set of rules that compute output values from input values.
- Business rule: the part of a transform function that describes how you manipulate one field of your output data.
- Variable: an optional part of a transform function that provides storage for temporary values.
- Statement: an optional part of a transform function that assigns values to variables in a specific order.
These pieces are illustrated below.
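As a sketch tying these terms together, assuming a Reformat-style component: the function name and the total field are hypothetical, and the assignment details should be treated as illustrative rather than a definitive DML reference. The local variable provides temporary storage, the statement assigns it, and each rule computes one output field:

  out::reformat(in) =
  begin
    let decimal(8) total;       /* variable: temporary storage */
    total :: in.amount + 10;    /* statement: assigns the variable */
    out.id :: in.id;            /* business rule: one output field */
    out.total :: total;         /* business rule: another output field */
  end;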
Sample Components
Sort, Dedup, Join, Replicate, Rollup, Filter by Expression, Merge, Lookup, Reformat, etc. (a small DML sketch follows).
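Many of these components are parameterized by ordinary DML expressions rather than code. As a minimal sketch (the threshold is arbitrary; amount comes from the sample customer record used later in this deck), a Filter by Expression condition could be:

  /* keep only records whose purchase amount exceeds 25 */
  amount > 25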
Creating a Graph – Sort Component
The Sort component reorders data. It has two parameters, key and max-core:
- Key: describes the collation order.
- Max-core: controls how often the Sort component dumps data from memory to disk.
Specify the key for the Sort.
Creating a Graph – Dedup Component
The Dedup component removes duplicate records. The dedup criterion is one of unique-only, First, or Last. Select the dedup criterion.
Creating a Graph – Replicate Component
Replicate combines the data records from its inputs into one flow and writes a copy of that flow to each of its output ports. Use Replicate to support component parallelism.
Creating a Graph – Join Component
Specify the key for the join and the type of join.
Database Configuration (.dbc)
A file with a .dbc extension provides the GDE with the information it needs to connect to a database. A configuration file contains the following information:
- The name and version number of the database to which you want to connect.
- The name of the computer on which the database instance or server runs, or on which the database remote access software is installed.
- The name of the database instance, server, or provider to which you want to connect.
You generate a configuration file by using the Properties dialog box for one of the Database components (an illustrative sketch follows).
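For orientation only, a sketch of what such a file might contain for an Oracle connection. Every field name and value below is an assumption written from memory, not a definitive .dbc reference, so always generate the real file from the Properties dialog as described above (the # annotations are explanatory and may not be valid .dbc syntax):

  dbms: oracle                        # database product            (assumption)
  db_version: 11.2                    # database version number     (assumption)
  db_home: /opt/oracle/product/11.2   # database software location  (assumption)
  db_name: ORCL                       # instance/server to use      (assumption)
  db_nodes: dbhost01                  # machine running the instance (assumption)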
Creating Parallel Applications
Types of parallel processing:
- Component parallelism: an application with multiple components running simultaneously on separate data.
- Pipeline parallelism: an application with multiple components running simultaneously on the same data.
- Data parallelism: an application with data divided into segments that it operates on simultaneously.
Partition Components
- Partition by Expression: divides data according to a DML expression.
- Partition by Key: groups data by a key.
- Partition with Load Balance: dynamic load balancing.
- Partition by Percentage: distributes data so the output is proportional to fractions of 100.
- Partition by Range: divides data evenly among nodes, based on a key and a set of partitioning ranges.
- Partition by Round-robin: distributes data evenly, in block-size chunks, across the output partitions.
Departition Components
- Concatenate: produces a single output flow that contains all the records from the first input partition, then all the records from the second input partition, and so on.
- Gather: collects inputs from multiple partitions in an arbitrary manner and produces a single output flow; does not maintain sort order.
- Interleave: collects records from many sources in round-robin fashion.
- Merge: collects inputs from multiple sorted partitions and maintains the sort order.
Multifile Systems
A multifile system is a specially created set of directories, possibly on different machines, that have an identical substructure. Each directory is a partition of the multifile system. When a multifile is placed in a multifile system, its partitions are files within each of the partitions of the multifile system. A multifile system gives better performance than a flat file system because it can divide your data among multiple disks or CPUs. Typically (an SMP machine is the exception), a multifile system is created with the control partition on one node and data partitions on other nodes, to distribute the work and improve performance. To do this, use full Internet URLs that specify file and directory names and locations on remote machines.
Multifile
SANDBOX
A sandbox is a collection of graphs and related files that are stored in a single directory tree and treated as a group for purposes of version control, navigation, and migration. A sandbox can be a file-system copy of a datastore project. In a graph, instead of specifying the entire path for a file location, we specify only a sandbox parameter variable, for example $AI_IN_DATA/customer_info.dat, where $AI_IN_DATA contains the entire path with reference to the sandbox $AI_HOME variable. The actual in_data directory is $AI_HOME/in_data in the sandbox.
SANDBOX
The sandbox provides an excellent mechanism for maintaining uniqueness while moving from the development to the production environment, by means of switch parameters. We can define parameters in a sandbox that can be used across all the graphs belonging to that sandbox. The topmost variable, $PROJECT_DIR, contains the path of the home directory.
SANDBOX
Deploying
Every graph, after validation and testing, has to be deployed as a .ksh file into the run directory on UNIX. This .ksh file is an executable that forms the backbone of the entire automation/wrapper process. The wrapper automation consists of .run and .env files, a dependency list, a job list, etc. For a detailed description of the wrapper and the different directories and files, please refer to the documentation on the wrapper / UNIX presentation.
Parallelism
- Component parallelism
- Pipeline parallelism
- Data parallelism
Component Parallelism
Example: a graph sorting Customers and sorting Transactions runs both Sort components simultaneously.
Component Parallelism
Comes “for free” with graph programming. Limitation: Scales to number of “branches” a graph.
54
Pipeline Parallelism
Example: while one component is processing record 100, the downstream component is already processing record 99.
Pipeline Parallelism
Comes "for free" with graph programming. Limitations: scales only to the length of "branches" in a graph, and some operations, like sorting, do not pipeline.
Data Parallelism
Data is divided into partitions, and the partitions are processed simultaneously.
Two Ways of Looking at Data Parallelism
The same data-parallel graph can be drawn in an expanded view (every partition shown separately) or a global view (partitioned components shown once).
Data Parallelism
Scales with data. Requires data partitioning; different partitioning methods suit different operations.
Data Partitioning
(Expanded and global views of a partitioning step.)
Data Partitioning: The Global View
In the global view, a fan-out flow indicates the partitioning step; the number of output partitions is the degree of parallelism.
Session III Partitioning
Partitioning Review
A partitioner appears as a fan-out flow. For the various partitioning components, ask:
- Is it key-based? Does the problem require a key-based partition?
- Performance: are the partitions balanced or skewed?
Partitioning: Performance
- Balanced: processors get neither too much nor too little.
- Skewed: some processors get too much, others too little.
Sample Data to be Partitioned
Customers:
42John
43Mark
44Bob
45Sue
46Rick
47Bill
48Mary
49Jane

Record format:
record
  decimal(2) id;
  string(5) name;
  decimal(5) zipcode;
  decimal(3) amount;
  string(1) newline;
end
Partition by Round-robin
Partition 0: 42John, 45Sue, 48Mary
Partition 1: 43Mark, 46Rick, 49Jane
Partition 2: 44Bob, 47Bill
Partition by Round-robin
- Not key-based.
- Results in very well balanced data, especially with a block size of 1.
- Useful for record-independent parallelism.
Partition by Key
Partition on zipcode:
Partition 0: 43Mark, 45Sue, 47Bill, 49Jane
Partition 1: 42John, 44Bob, 46Rick, 48Mary
Partition by Key is often followed by a Sort
Sort on zipcode:
Partition 0: 43Mark, 47Bill, 45Sue, 49Jane
Partition 1: 42John, 44Bob, 46Rick, 48Mary
A Rollup by zipcode then produces totals by zipcode within each partition (a sketch follows).
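A minimal sketch of the rollup transform this implies: the output field total is hypothetical, while sum is a DML aggregation function and zipcode/amount come from the sample record format above:

  out::rollup(in) =
  begin
    out.zipcode :: in.zipcode;      /* the grouping key */
    out.total :: sum(in.amount);    /* total amount for each zipcode group */
  end;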
Partition by Key
- Key-based.
- Usually results in well balanced data.
- Useful for key-dependent parallelism.
Partition by Expression
Expression: amount/33
Partition 0: 42John, 43Mark, 44Bob, 46Rick, 47Bill, 49Jane
Partition 1: 48Mary
Partition 2: 45Sue
Partition by Expression
- Key-based, depending on the expression (sketched below).
- Resulting balance is highly dependent on the expression and on the data.
- Various application-dependent uses.
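The partitioning expression is ordinary DML evaluated once per record, and its value (typically taken modulo the number of partitions) selects the output partition. Two sketches over the sample data, the second with arbitrary thresholds:

  /* the expression from the slide above: integer division bands the amounts */
  amount / 33

  /* an explicit three-way routing using a DML if-else expression */
  if (amount < 10) 0 else if (amount < 100) 1 else 2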
Partition by Range
With splitter values of 9 and 23:
Partition 0: 43Mark, 44Bob, 49Jane
Partition 1: 46Rick, 47Bill
Partition 2: 42John, 45Sue, 48Mary
Range + Sort: Global Ordering
Sort following a Partition by Range:
Partition 0: 49Jane, 44Bob, 43Mark
Partition 1: 47Bill, 46Rick
Partition 2: 42John, 48Mary, 45Sue
Partition by Range
- Key-based.
- Resulting balance depends on the set of splitters chosen.
- Useful for "binning" and global sorting.
Partition with Load Balance
If the middle node is highly loaded:
Partition 0: 42John, 43Mark, 44Bob, 49Jane
Partition 1: 45Sue
Partition 2: 46Rick, 47Bill, 48Mary
Partition with Load Balance
- Not key-based.
- Results in a skewed data distribution that complements the skewed load.
- Useful for record-independent parallelism.
Partition by Percentage
With percentages 4 and 20:
Partition 0: 42John, 43Mark, 44Bob, 45Sue
Partition 1: 46Rick, 47Bill, 48Mary, 49Jane (the next 16 records would go here)
Partition 2: (the next 76 records would go here)
Partition by Percentage
- Not key-based.
- Results in a (usually skewed) distribution conforming to the provided percentages.
- Useful for record-independent parallelism.
Broadcast (as a Partitioner)
Unlike all other partitioners, which write each record to ONE output flow, Broadcast writes each record to EVERY output flow: with the sample data, each of the three output partitions receives all eight records, 42John through 49Jane.
Broadcast
- Not key-based.
- Results in perfectly balanced partitions.
- Useful for record-independent parallelism.
Session IV De-Partitioning
Departitioning
Departitioning combines many flows of data to produce one flow; it is the opposite of partitioning. Each departition component combines flows in a different manner.
Departitioning
(Expanded and global views: Score 1, Score 2, and Score 3 flows are departitioned into one output file.)
Departitioning
A departitioner appears as a fan-in flow. For the various departitioning components, ask:
- Is it key-based?
- What is the result ordering?
- What is the effect on parallelism?
- What are its uses?
Concatenation
Globally ordered, partitioned data: 49Jane, 44Bob, 43Mark | 47Bill, 46Rick | 42John, 48Mary, 45Sue
Sorted data, following concatenation: 49Jane, 44Bob, 43Mark, 47Bill, 46Rick, 42John, 48Mary, 45Sue
Concatenation
- Not key-based. Result ordering is by partition.
- Serializes pipelined computation.
- Useful for creating a serial flow from partitioned data, appending headers and trailers, and writing DML.
- Used infrequently.
Merge
Round-robin partitioned and sorted by amount: 42John, 48Mary, 45Sue | 49Jane, 43Mark, 46Rick | 44Bob, 47Bill
Sorted data, following a merge on amount: 49Jane, 44Bob, 43Mark, 47Bill, 46Rick, 42John, 48Mary, 45Sue
Merge
- Key-based. Result ordering is sorted if each input is sorted.
- Possibly synchronizes pipelined computation; may even serialize it.
- Useful for creating ordered data flows.
- Used more than Concatenate, but still infrequently.
Interleave
Round-robin partitioned and scored: 42John A, 45Sue A, 48Mary A | 43Mark C, 46Rick B, 49Jane C | 44Bob C, 47Bill B
Scored dataset in original order, following interleave: 42John A, 43Mark C, 44Bob C, 45Sue A, 46Rick B, 47Bill B, 48Mary A, 49Jane C
Interleave
- Not key-based. Result ordering is the inverse of round-robin.
- Synchronizes pipelined computation.
- Useful for restoring original order after a record-independent parallel computation partitioned by round-robin.
- Used in rare circumstances.
Gather
Round-robin partitioned and scored: 42John A, 45Sue A, 48Mary A | 43Mark C, 46Rick B, 49Jane C | 44Bob C, 47Bill B
Scored dataset in random order, following gather: 43Mark C, 46Rick B, 42John A, 45Sue A, 48Mary A, 44Bob C, 47Bill B, 49Jane C
Gather
- Not key-based. Result ordering is unpredictable.
- Neither serializes nor synchronizes pipelined computation.
- Useful for efficient collection of data from multiple partitions and for repartitioning.
- Used most frequently.
Layout
A layout determines the location of a resource. A layout is either serial or parallel:
- A serial layout specifies one node and one directory.
- A parallel layout specifies multiple nodes and multiple directories (the same node may be repeated).
Layout
The location of a dataset is one or more places on one or more disks. The location of a computing component is one or more directories on one or more nodes. By default, the node and directory are unknown: computing components propagate their layouts from their neighbors unless the user explicitly assigns one.
Session V Join
Join Types
- Inner join: sets the record-required parameters for all ports to True.
- Outer join: sets the record-required parameters for all ports to False.
- Explicit: allows you to set the record-required parameter for each port individually.
Join Types (contd.)
- Case 1: join-type: Inner Join
- Case 2: join-type: Full Outer Join
- Case 3: join-type: Explicit; record-required0: false, record-required1: true
- Case 4: join-type: Explicit; record-required0: true, record-required1: false
A sketch of the accompanying transform function follows.
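For the transform side of a Join, a hedged sketch in DML: in0 and in1 are Join's input ports, id/name/amount reuse the sample customer fields, and the output format is hypothetical:

  out::join(in0, in1) =
  begin
    out.id :: in0.id;           /* the matched key value */
    out.name :: in0.name;       /* field taken from the first input */
    out.amount :: in1.amount;   /* field taken from the second input */
  end;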
Some Key Join Parameters
- key: the name(s) of the field(s) in the input records that must have matching values for Join to call the transform function.
- driving: the number of the port to which you want to connect the driving input. The driving input is the largest input; all other inputs are read into memory. The driving parameter is available only when the sorted-input parameter is set to "In memory: Input need not be sorted".
Some Key Join Parameters
- dedupn: set dedupn to true to remove duplicates from the corresponding inn port before joining. This lets you choose only one record from a group with matching key values as the argument to the transform function. The default is false, which does not remove duplicates.
- override-keyn: alternative name(s) for the key field(s) for a particular inn port.
References
- Ab Initio Tutorial
- Ab Initio Online Help
- Website: abinitio.com