SAXS Scatter Performance Analysis CHRIS WILCOX 2/6/2008.


Scatter Status
- Prototype of the basic algorithm; handles an arbitrary number of atoms and arbitrary topology.
- Atom types: C, N, O, H, P, S, Zn; more are very easy to add.
- Matches the results of Stefan's original R prototype for several small molecules.
- Computes the intensity function divided into a specified number of steps.

Scatter Performance (Current)
- Original algorithm, no optimization, debug version: 5000 atoms = ~60 hours.
- Original algorithm, no optimization, release version: 5000 atoms = ~4 hours.
- Obvious restructuring, pre-computed factors, release version: 5000 atoms = ~39 minutes.
- Redundant work avoided, compiler flags, release version: 5000 atoms = ~19 minutes.
- Test machine: Pentium Core Duo, mobile CPU, 1.66 GHz.
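The relative gains between these stages can be checked with a few lines of arithmetic. This is only a sketch: the times are the approximate figures quoted above, and the labels and the helper name `speedups` are illustrative.

```python
# Reported run times for 5000 atoms, in minutes (approximate slide values).
timings_min = [
    ("debug, no optimization", 60 * 60),
    ("release, no optimization", 4 * 60),
    ("release, pre-computed factors", 39),
    ("release, redundant work avoided + compiler flags", 19),
]

def speedups(timings):
    """Return (label, speedup over previous stage, speedup over first stage)."""
    base = timings[0][1]
    rows = []
    for (_, prev), (label, cur) in zip(timings, timings[1:]):
        rows.append((label, prev / cur, base / cur))
    return rows

for label, step, total in speedups(timings_min):
    print(f"{label}: {step:.1f}x over previous, {total:.0f}x overall")
```

Most of the gain comes from switching from the debug to the release build (about 15x); the two source-level changes together contribute roughly another 12x.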

Scatter Performance (Analysis)
- Scatter factors are pre-computed and take ~0% of the time in the fastest calculation.
- Distance calculations are step independent and take ~3%, only because of the SQRT function.
- The FSIN function appears to consume ~60% of processor cycles; is there an alternative?
- The intensity calculation itself uses ~86% of the cycles; this needs to be verified again on the latest calculation.
- No real optimization yet; the compiler wins anyway!
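The breakdown above suggests a loop structure like the following. The slides do not name the formula, so this sketch assumes a Debye-style pairwise sum, which is consistent with one sine per atom pair per step; the function names and the `factors[s][i]` layout are hypothetical, not taken from the actual program.

```python
import math

def pair_distances(coords):
    """Distances for all unordered atom pairs (i < j); independent of the step."""
    dists = []
    for i in range(len(coords)):
        xi, yi, zi = coords[i]
        for j in range(i + 1, len(coords)):
            xj, yj, zj = coords[j]
            r = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
            dists.append((i, j, r))
    return dists

def intensity(coords, factors, q_values):
    """Assumed Debye-style sum; factors[s][i] is the pre-computed
    scatter factor of atom i at step s."""
    dists = pair_distances(coords)  # computed once, reused for every step
    result = []
    for s, q in enumerate(q_values):
        f = factors[s]
        total = sum(fi * fi for fi in f)  # i == j terms, where sin(x)/x -> 1
        for i, j, r in dists:
            x = q * r
            sinc = math.sin(x) / x if x != 0.0 else 1.0
            total += 2.0 * f[i] * f[j] * sinc  # pair (i, j) also covers (j, i)
        result.append(total)
    return result
```

In this shape the sine call sits in the innermost loop and runs once per pair per step, which would explain why it dominates the cycle count while the distance and factor work barely registers.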

Scatter Performance (Model)
- N = # of atoms, S = # of steps, A = # of types.
- Scatter factors are O(SA) * (4 exp + 4 pow + 4 fmul), i.e. 10K iterations for 1000 steps, 10 types.
- Distance math is O(N^2/2) * (1 sqrt + 3 fmul + 2 fadd), i.e. 12.5M iterations for 5000 atoms (independent of the step count).
- Intensity math is O(SN^2/2) * (1 fsin + 9 fmul + 2 fadd), i.e. 12.5G iterations for the same 1000 steps and 5000 atoms.
- Operations shown are based on code reading; actual floating point instructions are ~2X more frequent.
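The iteration counts in the model can be reproduced directly. The sketch below uses the per-iteration operation counts quoted above; the function and dictionary names are illustrative, and N^2/2 is used as the slide's approximation of the N(N-1)/2 unordered pairs.

```python
# Per-iteration floating point operations as read from the code (slide values).
OPS_PER_ITERATION = {
    "scatter_factors": 12,  # 4 exp + 4 pow + 4 fmul
    "distances": 6,         # 1 sqrt + 3 fmul + 2 fadd
    "intensity": 12,        # 1 fsin + 9 fmul + 2 fadd
}

def iteration_counts(n_atoms, n_steps, n_types):
    """Loop iteration counts for each phase of the model."""
    pairs = n_atoms * n_atoms // 2  # N^2/2, approximating N(N-1)/2
    return {
        "scatter_factors": n_steps * n_types,
        "distances": pairs,
        "intensity": n_steps * pairs,
    }

def operation_counts(n_atoms, n_steps, n_types):
    """Iterations multiplied by the operations per iteration."""
    counts = iteration_counts(n_atoms, n_steps, n_types)
    return {name: n * OPS_PER_ITERATION[name] for name, n in counts.items()}

counts = iteration_counts(n_atoms=5000, n_steps=1000, n_types=10)
print(counts)
```

For 5000 atoms, 1000 steps and 10 types this gives 10K, 12.5M and 12.5G iterations, so the intensity loop outweighs the distance loop by a factor equal to the number of steps.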

Scatter Performance (Future)
- Complete the optimizations and convert the sine function to a lookup table: 5000 atoms = ~500 seconds?
- Find faster floating point performance; the current CPU should not be hard to beat by 8x: 5000 atoms = ~60 seconds?
- Intensity calculations are independent, so use more processors: 5000 atoms = ~10 seconds?
- Question: how many molecules need to be run to represent a non-rigid structure?
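One possible form of the sine lookup table is sketched below, using linear interpolation between pre-computed samples. The table size of 4096 and the names are assumptions chosen for illustration; the slides do not specify the table design or the accuracy required.

```python
import math

TABLE_SIZE = 4096  # hypothetical size; trades memory for accuracy
STEP = 2.0 * math.pi / TABLE_SIZE
# One extra entry so interpolation at the last index needs no wraparound.
SIN_TABLE = [math.sin(k * STEP) for k in range(TABLE_SIZE + 1)]

def sin_lookup(x):
    """Approximate sin(x) by linear interpolation in a pre-computed table."""
    pos = (x % (2.0 * math.pi)) / STEP  # fractional table position
    k = min(int(pos), TABLE_SIZE - 1)   # clamp guards against rounding at 2*pi
    frac = pos - k
    return SIN_TABLE[k] + frac * (SIN_TABLE[k + 1] - SIN_TABLE[k])
```

With this table size the interpolation error stays below about 3e-7, since linear interpolation of sine is bounded by STEP^2/8. The Python version only illustrates the technique; any speed benefit would have to be measured in the compiled inner loop, where the table also competes for cache space.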

Next Steps (Short Term)
- Add precise timing; develop a model to predict performance for an arbitrary number of atoms.
- Analyze the instructions in the inner loop of scatter, although it may be impossible to improve on the compiler.
- Extend to read the .pdb file format, or integrate with existing Python code.
- Try a processor with better floating point, or a parallel machine; what is required to do this?
- Project setup takes precedence for several weeks.
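Reading atom positions from a .pdb file could start from something like the sketch below. It relies on the fixed-column layout of ATOM and HETATM records in the PDB format (x, y, z in columns 31-54, element symbol in columns 77-78); the function name is hypothetical, and nothing here reflects the existing Python code mentioned above.

```python
def read_pdb_atoms(lines):
    """Return a list of (element, x, y, z) tuples from ATOM/HETATM records."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            # PDB columns are 1-based and fixed width; slices are 0-based.
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            # The element field may be blank in older files; that case
            # yields an empty string and would need a fallback.
            element = line[76:78].strip().capitalize()
            atoms.append((element, x, y, z))
    return atoms
```

A typical call would be `read_pdb_atoms(open(path))`, where `path` names a local .pdb file. The element symbol is normalized to forms such as `Zn`, so it can be matched against the supported atom types.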

Next Steps (Long Term)
- Close the loop with experimental data on a known molecule; change algorithms as necessary.
- Develop a streaming version of the program that accepts multiple molecules and averages the results.
- New program for modeling elastic topology, previously called the “parametric” model.
- Investigate a change to a streaming architecture; may prototype a simple framework user interface.
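The averaging step of a streaming version could be as small as the sketch below, which keeps a running mean of the intensity curves so that individual molecules need not be stored. It assumes an unweighted average over curves with the same number of steps; the class name and interface are hypothetical.

```python
class RunningAverage:
    """Incrementally average intensity curves, one molecule at a time."""

    def __init__(self, n_steps):
        self.count = 0
        self.mean = [0.0] * n_steps

    def add(self, curve):
        """Fold one molecule's intensity curve into the running mean."""
        if len(curve) != len(self.mean):
            raise ValueError("curve length must match the number of steps")
        self.count += 1
        for s, value in enumerate(curve):
            # Incremental mean update: mean += (x - mean) / n
            self.mean[s] += (value - self.mean[s]) / self.count

avg = RunningAverage(n_steps=2)
avg.add([1.0, 2.0])
avg.add([3.0, 6.0])
print(avg.mean)
```

Because each curve is folded in as it arrives, memory use depends only on the number of steps, not on how many molecules are needed to represent the non-rigid structure.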