HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing. PIs: Alvin M. Despain and Jean-Luc Gaudiot, University of Southern California (USC).

Presentation transcript:

Slide 1: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
University of Southern California, May 2001

Slide 2: Outline
- HiDISC Project Description
- Experiments and Accomplishments
- Work in Progress
- Summary

Slide 3: HiDISC: Hierarchical Decoupled Instruction Set Computer
[Figure: HiDISC system overview - sensor inputs and applications (FLIR, SAR, video, ATR/SLD, scientific) feed a decoupling compiler and the HiDISC processor, backed by registers, cache, memory, and a dynamic database, to provide situational awareness]
- Technological trend: memory latency is growing relative to microprocessor speed (about 40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling on memory
- Domain: benchmarks with large data sets - symbolic, signal processing, and scientific programs
- Present solutions: multithreading (homogeneous), larger caches, prefetching, software multithreading

Slide 4: Present Solutions
Larger caches:
- Slow
- Work well only if the working set fits in the cache and there is temporal locality
Hardware prefetching:
- Cannot be tailored for each application
- Behavior is based on past and present execution-time behavior
Software prefetching:
- The overheads of prefetching must not outweigh the benefits -> conservative prefetching
- Adaptive software prefetching is required to change the prefetch distance at run time
- Hard to insert prefetches for irregular access patterns
Multithreading:
- Solves the throughput problem, not the memory latency problem
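To make the software-prefetching limitation concrete, here is a minimal C sketch of a conventional prefetched loop (illustrative only, not HiDISC code). It uses the GCC/Clang __builtin_prefetch intrinsic; PREFETCH_DISTANCE is an assumed, machine-dependent tuning value. Choosing it poorly either wastes bandwidth or leaves prefetches late, which is why HiDISC adapts the distance at run time instead.

/* Illustrative conventional software prefetching (not HiDISC output).
 * PREFETCH_DISTANCE is a hypothetical, machine-dependent tuning value. */
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* elements ahead of the compute loop */

double dot(const double *x, const double *h, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 1);  /* read, low reuse */
            __builtin_prefetch(&h[i + PREFETCH_DISTANCE], 0, 1);
        }
        sum += x[i] * h[i];   /* the prefetch instructions share issue slots with this work */
    }
    return sum;
}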

Slide 5: The HiDISC Approach
Observation:
- Software prefetching impacts compute performance
- PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
Approach:
- Add a processor to manage prefetching -> hide the prefetch overhead
- The compiler explicitly manages the memory hierarchy
- The prefetch distance adapts to the program's runtime behavior

Slide 6: What is HiDISC?
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput
[Figure: the compiler splits the program into computation, access, and cache management instruction streams, executed by the Computation Processor (CP), the Access Processor (AP), and the Cache Management Processor (CMP), which sit above the registers, cache, and second-level cache/main memory]

Slide 7: Decoupled Architectures
[Figure: comparison of four organizations - MIPS (conventional, a single 8-issue processor), DEAP and CAPP (decoupled, with a computation processor and an access processor connected by queues such as the LQ, SDQ, SAQ, and SCQ), and HiDISC (new decoupled, adding a cache management processor); each sits above registers, cache, and second-level cache/main memory, and each configuration is shown with a total issue width of 8 split across its processors]
DEAP: [Kurian, Hulina, & Coraor '94]; PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WM

Slide 8: Slip Control Queue
- The Slip Control Queue (SCQ) adapts dynamically
- Late prefetches: prefetched data arrived after the load had been issued
- Useful prefetches: prefetched data arrived before the load had been issued

  if (prefetch_buffer_full())
      don't change size of SCQ;
  else if ((2 * late_prefetches) > useful_prefetches)
      increase size of SCQ;
  else
      decrease size of SCQ;
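A compact C rendering of the slide's SCQ adaptation policy is given below as a sketch. The struct, the SCQ_MIN/SCQ_MAX bounds, and the per-interval counter reset are illustrative assumptions, not the documented HiDISC hardware interface.

/* Minimal sketch of the SCQ resizing policy described on the slide.
 * Names and bounds are illustrative, not the actual HiDISC interface. */
typedef struct {
    int scq_size;            /* current slip (prefetch) distance      */
    int late_prefetches;     /* data arrived after the load issued    */
    int useful_prefetches;   /* data arrived before the load issued   */
    int prefetch_buffer_full;
} scq_state_t;

enum { SCQ_MIN = 1, SCQ_MAX = 128 };

void scq_adapt(scq_state_t *s)
{
    if (s->prefetch_buffer_full) {
        /* buffer saturated: leave the slip distance alone */
    } else if (2 * s->late_prefetches > s->useful_prefetches) {
        if (s->scq_size < SCQ_MAX)
            s->scq_size++;   /* prefetches arrive too late: run further ahead */
    } else {
        if (s->scq_size > SCQ_MIN)
            s->scq_size--;   /* mostly useful: shrink to reduce pressure */
    }
    s->late_prefetches = 0;  /* restart counters for the next interval (assumed) */
    s->useful_prefetches = 0;
}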

Slide 9: Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner loop (convolution):
  for (j = 0; j < i; ++j)
      y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
  while (not EOD)
      y = y + (x * h);
  send y to SDQ

Access Processor code:
  for (j = 0; j < i; ++j) {
      load (x[j]);
      load (h[i-j-1]);
      GET_SCQ;
  }
  send (EOD token)
  send address of y[i] to SAQ

Cache Management code:
  for (j = 0; j < i; ++j) {
      prefetch (x[j]);
      prefetch (h[i-j-1]);
      PUT_SCQ;
  }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
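The following self-contained C sketch mimics this decoupling in software: an "access stream" function pushes operands for the convolution into a queue and a "computation stream" function drains them, never computing an address. It is a conceptual illustration only; the queue type, its capacity, and the function names are assumptions, and the real hardware uses load/store/slip-control queues and an end-of-data token rather than a shared loop count.

/* Conceptual software model of CP/AP decoupling (not HiDISC compiler output). */
#include <stdio.h>

#define QCAP 4096
typedef struct { double v[QCAP]; int head, tail; } queue_t;

static void push(queue_t *q, double x) { q->v[q->tail++ % QCAP] = x; }
static double pop(queue_t *q)          { return q->v[q->head++ % QCAP]; }

/* Access stream: issues the loads for x[j] and h[i-j-1] */
static void access_stream(queue_t *q, const double *x, const double *h, int i)
{
    for (int j = 0; j < i; ++j) {
        push(q, x[j]);
        push(q, h[i - j - 1]);
    }
}

/* Computation stream: consumes operand pairs, never computes an address */
static double compute_stream(queue_t *q, int i, double y_i)
{
    for (int j = 0; j < i; ++j) {
        double xv = pop(q), hv = pop(q);
        y_i += xv * hv;
    }
    return y_i;   /* in HiDISC the result would go to the store data queue */
}

int main(void)
{
    double x[8] = {1,2,3,4,5,6,7,8}, h[8] = {1,1,1,1,1,1,1,1}, y[8] = {0};
    queue_t q = {0};
    for (int i = 1; i < 8; ++i) {        /* one convolution output per i */
        access_stream(&q, x, h, i);
        y[i] = compute_stream(&q, i, y[i]);
    }
    printf("y[7] = %g\n", y[7]);
    return 0;
}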

Slide 10: Where We Were
- HiDISC compiler: frontend selection (gcc); single thread running without conditionals
- Hand-compiling of benchmarks: Livermore loops, Tomcatv, MXM, Cholsky, Vpenta, and Qsort

Slide 11: Benchmarks
Benchmark | Source | Lines of source code | Description | Data set size
LLL1 | Livermore Loops [45] | | element arrays, 100 iterations | 24 KB
LLL2 | Livermore Loops | | element arrays, 100 iterations | 16 KB
LLL3 | Livermore Loops | | element arrays, 100 iterations | 16 KB
LLL4 | Livermore Loops | | element arrays, 100 iterations | 16 KB
LLL5 | Livermore Loops | | element arrays, 100 iterations | 24 KB
Tomcatv | SPECfp95 [68] | | x33-element matrices, 5 iterations | <64 KB
MXM | NAS kernels [5] | 113 | Unrolled matrix multiply, 2 iterations | 448 KB
CHOLSKY | NAS kernels | 156 | Cholsky matrix decomposition | 724 KB
VPENTA | NAS kernels | 199 | Invert three pentadiagonals simultaneously | 128 KB
Qsort | Quicksort sorting algorithm [14] | 58 | Quicksort | 128 KB

Slide 12: Simulation Parameters
Parameter | Value
L1 cache size | 4 KB
L1 cache associativity | 2
L1 cache block size | 32 B
L2 cache size | 16 KB
L2 cache associativity | 2
L2 cache block size | 32 B
Memory latency | variable (0-200 cycles)
Memory contention time | variable
Victim cache size | 32 entries
Prefetch buffer size | 8 entries
Load queue size | 128
Store address queue size | 128
Store data queue size | 128
Total issue width | 8

Slide 13: Simulation Results
[Figure: simulation results charts for the hand-compiled benchmarks]

Slide 14: Accomplishments
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
- 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

Slide 15: Work in Progress
- Silicon space for VLSI layout
- Compiler progress
- Simulator integration
- Hand-compiling of benchmarks
- Architectural enhancement for data-intensive applications

Slide 16: VLSI Layout Overhead
- Goal: evaluate the layout effectiveness of the HiDISC architecture
- Cache has become a major portion of the chip area
- Methodology: extrapolate the HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996)
- The space overhead is 11.3% over a comparable MIPS processor

Slide 17: VLSI Layout Overhead
Component | Original MIPS R10K (0.35 μm) | Extrapolation (0.15 μm) | HiDISC (0.15 μm)
D-Cache (32 KB) | 26 mm² | |
I-Cache (32 KB) | 28 mm² | 7 mm² | 14 mm²
TLB part | 10 mm² | |
External interface unit | 27 mm² | |
Instruction fetch unit and BTB | 18 mm² | |
Instruction decode section | 21 mm² | |
Instruction queue | 28 mm² | 7 mm² | 0 mm²
Reorder buffer | 17 mm² | | 0 mm²
Integer functional unit | 20 mm² | 5 mm² | 15 mm²
FP functional units | 24 mm² | 6 mm² |
Clocking & overhead | 73 mm² | |
Total size without L2 cache | 292 mm² | |
Total size with on-chip L2 cache | | 129.2 mm² |

Slide 18: Compiler Progress
- Preprocessor: support for library calls; the GNU preprocessor handles special compiler directives (#include, #define)
- Nested loops: nested loops without data dependencies (for and while); support for cases where the index variable of an inner loop is not a function of some outer-loop computation
- Conditional statements: the CP performs all computations; the condition needs to be moved to the AP

Slide 19: Nested Loops (Assembly)

High-level code (C):
  for (i = 0; i < 10; ++i)
      for (k = 0; k < 10; ++k)
          ++j;

Assembly code:
  # 6   for(i=0;i<10;++i)
        sw   $0, 12($sp)
  $32:
  # 7   for(k=0;k<10;++k)
        sw   $0, 4($sp)
  $33:
  # 8   ++j;
        lw   $15, 8($sp)
        addu $24, $15, 1
        sw   $24, 8($sp)
        lw   $25, 4($sp)
        addu $8, $25, 1
        sw   $8, 4($sp)
        blt  $8, 10, $33
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32

Slide 20: Nested Loops

CP stream:
  # 6   for(i=0;i<10;++i)
  $32:  b_eod loop_i
  # 7   for(k=0;k<10;++k)
  $33:  b_eod loop_k
  # 8   ++j;
        addu $15, LQ, 1
        sw   $15, SDQ
        b    $33
  loop_k:
        b    $32
  loop_i:

AP stream:
  # 6   for(i=0;i<10;++i)
        sw   $0, 12($sp)
  $32:
  # 7   for(k=0;k<10;++k)
        sw   $0, 4($sp)
  $33:
  # 8   ++j;
        lw   LQ, 8($sp)
        get  SCQ
        sw   8($sp), SAQ
        lw   $25, 4($sp)
        addu $8, $25, 1
        sw   $8, 4($sp)
        blt  $8, 10, $33
        s_eod
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32
        s_eod

CMP stream:
  # 6   for(i=0;i<10;++i)
        sw   $0, 12($sp)
  $32:
  # 7   for(k=0;k<10;++k)
        sw   $0, 4($sp)
  $33:
  # 8   ++j;
        pref 8($sp)
        put  SCQ
        lw   $25, 4($sp)
        addu $8, $25, 1
        sw   $8, 4($sp)
        blt  $8, 10, $33
        lw   $9, 12($sp)
        addu $10, $9, 1
        sw   $10, 12($sp)
        blt  $10, 10, $32

Slide 21: HiDISC Compiler
- Gcc backend:
  - Use input from the parsing phase to perform loop optimizations
  - Extend the compiler to generate MIPS4 code
  - Handle loops with dependencies: nested loops where the computation depends on the indices of more than one loop, e.g. X(i,j) = i*Y(j,i), where i and j are index variables and j is a function of i
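For reference, a small C version of the loop class mentioned above (illustrative only, not compiler output): the inner loop's bound and the computed value both depend on the outer index, which is what makes separating the access stream harder.

/* Illustrative dependent nested loop: j's range is a function of i,
 * and the computation uses both indices. Names and sizes are assumed. */
enum { N = 16 };

void dependent_loops(double X[N][N], const double Y[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j <= i; j++) {   /* inner bound depends on the outer index */
            X[i][j] = i * Y[j][i];       /* computation uses both indices */
        }
    }
}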

Slide 22: HiDISC Stream Separator
[Figure: compiler flow. A sequential source program is turned into a flow graph; address registers are classified and instructions are allocated to an access stream and a computation stream. Subsequent passes fix conditional statements, move queue accesses into instructions, move loop invariants out of loops, add Slip Control Queue instructions, substitute prefetches for loads, remove global stores, reverse the SCQ direction, and add global data communication and synchronization, finally producing computation, access, and cache management assembly code. The diagram distinguishes previous work from current work.]

Slide 23: Simulator Integration
- Based on the MIPS RISC pipeline architecture (dsim)
- Supports the MIPS1 and MIPS2 ISAs
- Supports dynamic linking for shared libraries (loading shared libraries in the simulator)
- Hand-compiling flow:
  - Use SGI cc or gcc as the front end (cc -mips2 -S on the .c source)
  - Split the resulting .s code into three streams (.cp.s, .ap.s, .cmp.s) for dsim
  - Modify the three .s files into HiDISC assembly, then convert .hs back to .s for all three streams (hs2s)
  - Use SGI cc as the back end to link the .cp, .ap, and .cmp codes with the shared library (cc -mips2 -o)

Slide 24: DIS Benchmarks
- Atlantic Aerospace DIS benchmark suite:
  - Application-oriented benchmarks
  - Many defense applications employ large data sets with non-contiguous memory access and no temporal locality
  - Too large for hand-compiling; wait until the compiler is ready
  - Requires a linker that can handle multiple object files
- Atlantic Aerospace Stressmark suite:
  - Smaller, more specific procedures
  - Seven individual data-intensive benchmarks
  - Directly illustrate particular elements of the DIS problem

Slide 25: Stressmark Suite
Stressmark | Problem | Memory access
Pointer | Pointer following | Small blocks at unpredictable locations; can be parallelized
Update | Pointer following with memory update | Small blocks at unpredictable locations
Matrix | Conjugate gradient simultaneous equation solver | Dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse
Neighborhood | Calculate image texture measures by finding sum and difference histograms | Regular access to pairs of words at arbitrary distances
Field | Collect statistics on a large field of words | Regular, with little reuse
Corner-Turn | Matrix transposition | Block movement between processing nodes with practically nil computation
Transitive Closure | Find the all-pairs shortest-path solution for a directed graph | Dependent on matrix representation, but requires reads and writes to different matrices concurrently

* DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division

Slide 26: Example of Stressmarks
- Pointer Stressmark
  - Basic idea: repeatedly follow pointers to randomized locations in memory (a minimal pointer-chasing loop is sketched below)
  - The memory access pattern is unpredictable
  - A randomized access pattern offers too little temporal and spatial locality for conventional cache architectures
  - The HiDISC architecture provides lower memory access latency
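The sketch below is a pointer-chasing kernel in the spirit of the Pointer stressmark; it is not the DIS reference code, and the field size, xorshift initializer, and hop count are assumptions chosen for illustration. Each load depends on the previous one, so a conventional cache or fixed-distance prefetcher gains little, which is the behavior HiDISC targets.

/* Illustrative pointer-chasing kernel (not the DIS reference implementation). */
#include <stdint.h>
#include <stdlib.h>

enum { FIELD_WORDS = 1 << 20 };   /* large field of words, exceeds the cache */

/* Fill the field so every word holds the index of the next hop. */
static void init_field(uint32_t *field)
{
    uint32_t x = 0x9E3779B9u;                      /* arbitrary nonzero seed */
    for (uint32_t i = 0; i < FIELD_WORDS; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;   /* xorshift step */
        field[i] = x % FIELD_WORDS;
    }
}

/* Follow the chain of indices for `hops` steps; each load depends on the last. */
static uint32_t chase(const uint32_t *field, uint32_t start, long hops)
{
    uint32_t index = start;
    for (long h = 0; h < hops; h++)
        index = field[index];
    return index;
}

int main(void)
{
    uint32_t *field = malloc(FIELD_WORDS * sizeof *field);
    if (!field) return 1;
    init_field(field);
    volatile uint32_t sink = chase(field, 0, 10 * 1000 * 1000);
    (void)sink;                                    /* keep the result live */
    free(field);
    return 0;
}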

Slide 27: Decoupling of Pointer Stressmarks

Inner loop (for the next indexing):
  for (i = j+1; i < w; i++) {
      if (field[index+i] > partition) balance++;
  }
  if (balance + high == w/2) break;
  else if (balance + high > w/2) {
      min = partition;
  } else {
      max = partition;
      high++;
  }

Computation Processor code:
  while (not EOD)
      if (field > partition) balance++;
  if (balance + high == w/2) break;
  else if (balance + high > w/2) {
      min = partition;
  } else {
      max = partition;
      high++;
  }

Access Processor code:
  for (i = j+1; i < w; i++) {
      load (field[index+i]);
      GET_SCQ;
  }
  send (EOD token)

Cache Management code:
  for (i = j+1; i < w; i++) {
      prefetch (field[index+i]);
      PUT_SCQ;
  }

Slide 28: Stressmarks
- Hand-compile the 7 individual benchmarks
  - Use gcc as the front end
  - Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
- Evaluate architectural trade-offs
  - Updated simulator characteristics such as out-of-order issue
  - Large L2 cache and enhanced main memory systems such as RAMBUS and DDR

Slide 29: Architectural Enhancement for Data-Intensive Applications
- Enhanced memory systems such as RAMBUS DRAM and DDR DRAM
  - Provide high memory bandwidth
  - Latency does not improve significantly
- A decoupled access processor can fully utilize the enhanced memory bandwidth
  - The access processor generates more outstanding requests
  - The prefetching mechanism hides memory access latency

Slide 30: Flexi-DISC
- Fundamental characteristic: inherently highly dynamic at execution time
- Dynamically reconfigurable central computational kernel (CK)
- Multiple levels of caching and processing around the CK, with adjustable prefetching
- Multiple processors on a chip, allowing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

Slide 31: Flexi-DISC
- Partitioning of the computation kernel: it can be allocated to different portions of one application or to different applications
- The CK requires separation of the next ring to feed it with data
- The variety of target applications makes the memory accesses unpredictable
- Identical processing units for the outer rings, so that highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved

Slide 32: Summary
- Gcc backend:
  - Use the parsing tree to extract the load instructions
  - Handle loops with dependencies where the index variable of an inner loop is not a function of some outer-loop computation
- A robust compiler is being designed to experiment with and analyze additional benchmarks, and will eventually be extended to the DIS benchmarks
- Additional architectural enhancements have been introduced to make HiDISC amenable to the DIS benchmarks

Slide 33: HiDISC: Hierarchical Decoupled Instruction Set Computer

New ideas:
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

Impact:
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply over an in-order issue superscalar processor
- 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

Schedule (start April 2001):
- Defined benchmarks
- Completed simulator
- Performed instruction-level simulations on hand-compiled benchmarks
- Continue simulations of more benchmarks (SAR)
- Define HiDISC architecture
- Benchmark results
- Update simulator
- Develop and test a full decoupling compiler
- Generate performance statistics and evaluate the design

[Figure: HiDISC system overview - sensor inputs and applications (FLIR, SAR, video, ATR/SLD, scientific) feed the decoupling compiler and HiDISC processor, backed by registers, cache, memory, and a dynamic database, to provide situational awareness]