CO405H Computing in Space with OpenSPL Topic 10: Correlation Example Oskar Mencer Georgi Gaydadjiev Department of Computing Imperial College London

Slides:



Advertisements
Similar presentations
CO405H Computing in Space with OpenSPL Topic 9: Programming DFEs (Loops II) Oskar Mencer Georgi Gaydadjiev special thanks: Jacob Bower Department of Computing.
Advertisements

CSCI 4717/5717 Computer Architecture
1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.
Assembly Language for Intel-Based Computers, 4 th Edition Chapter 1: Basic Concepts (c) Pearson Education, All rights reserved. You may modify and.
Computer Organization and Architecture
CO405H Computing in Space with OpenSPL Topic 10: Correlation Example Oskar Mencer Georgi Gaydadjiev Department of Computing Imperial College London
 2005 Pearson Education, Inc. All rights reserved Introduction.
Programing Concept Ken Youssefi/Ping HsuIntroduction to Engineering – E10 1 ENGR 10 Introduction to Engineering (Part A)
Lecture 7: Linear Programming in Excel AGEC 352 Spring 2011 – February 9, 2011 R. Keeney.
Instruction Level Parallelism (ILP) Colin Stevens.
Computer Programming and Basic Software Engineering 4. Basic Software Engineering 1 Writing a Good Program 4. Basic Software Engineering 3 October 2007.
RISC By Don Nichols. Contents Introduction History Problems with CISC RISC Philosophy Early RISC Modern RISC.
Pipelined Computations Divide a problem into a series of tasks A processor completes a task sequentially and pipes the results to the next processor Pipelining.
1 Lab Session-IV CSIT-120 Spring 2001 Lab 3 Revision and Exercises Rev: Precedence Rules Lab Exercise 4-A Machine Language Programming The “Micro” Machine.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 5: February 2, 2009 Architecture Synthesis (Provisioning, Allocation)
Chapter 8: Introduction to High-level Language Programming Invitation to Computer Science, C++ Version, Third Edition.
Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.
CSC103: Introduction to Computer and Programming
Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.
Topics Introduction Hardware and Software How Computers Store Data
Introduction to Engineering Computing GEEN 1300 Lecture 7 15 June 2010 Review for midterm.
Nyhoff, ADTs, Data Structures and Problem Solving with C++, Second Edition, © 2005 Pearson Education, Inc. All rights reserved ADT Implementation:
Computer Programming Basics Assistant Professor Jeon, Seokhee Assistant Professor Department of Computer Engineering, Kyung Hee University, Korea.
IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.
Copyright © 2012 Pearson Education, Inc. Publishing as Pearson Addison-Wesley C H A P T E R 1 Introduction to Computers and Programming.
Programmer's view on Computer Architecture by Istvan Haller.
INTRODUCTION SOFTWARE HARDWARE DIFFERENCE BETWEEN THE S/W AND H/W.
CSCI 256 Data Structures and Algorithm Analysis Lecture 14 Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley all rights reserved, and some.
Numerical Computation Lecture 2: Introduction to Matlab Programming United International College.
Building a Modern Computer From First Principles
Problem Solving and Program Design in C (5th Edition) by Jeri R. Hanly and Elliot B. Koffman CP 202 Chapter
Multivariate Statistics Matrix Algebra I W. M. van der Veld University of Amsterdam.
Rules for Means and Variances. Rules for Means: Rule 1: If X is a random variable and a and b are constants, then If we add a constant a to every value.
Function Design in LISP. Program Files n LISP programs are plain text –DOS extensions vary; use.lsp for this course n (load “filename.lsp”) n Can use.
Chapter 9 DTW and VQ Algorithm  9.1 Basic idea of DTW  9.2 DTW algorithm  9.3 Basic idea of VQ  9.4 LBG algorithm  9.5 Improvement of VQ.
Lecture 2b Dr. Robert D. Kent.  Structured program development  Program control.
C o n f i d e n t i a l 1 Course: BCA Semester: III Subject Code : BC 0042 Subject Name: Operating Systems Unit number : 1 Unit Title: Overview of Operating.
Integrating QDEC with Slicer3 Click to add subtitle.
1 Fundamentals of Computer Science Combinational Circuits.
GPU Functional Simulator Yi Yang CDA 6938 term project Orlando April. 20, 2008.
Binary! Objectives Recap what a instruction is Look at how binary can be used to store – Opcode – data.
SCRIPTS AND FUNCTIONS DAVID COOPER SUMMER Extensions MATLAB has two main extension types.m for functions and scripts and.mat for variable save files.
FUNCTIONS (C) KHAERONI, M.SI. OBJECTIVE After this topic, students will be able to understand basic concept of user defined function in C++ to declare.
COMPUTER PROGRAMMING Year 9 – lesson 1. Objective and Outcome Teaching Objective We are going to look at how to construct a computer program. We will.
Computer Programming II Lecture 9. Files Processing - We have been using the iostream standard library, which provides cin and cout methods for reading.
1 Introduction to Engineering Spring 2007 Lecture 18: Digital Tools 2.
A few words on locality and arrays
CO405H Department of Computing Imperial College London
CO405H Department of Computing Imperial College London
User-Written Functions
WORKSHOP 1 CUSTOM TIRE SUBROUTINE
Topics Introduction Hardware and Software How Computers Store Data
Statistical Analysis with Excel
Functions CIS 40 – Introduction to Programming in Python
Introduction of Transport Protocols
Statistical Analysis with Excel
Statistical Analysis with Excel
Alternate Version of STARTING OUT WITH C++ 4th Edition
CS703 - Advanced Operating Systems
Topics Introduction Hardware and Software How Computers Store Data
Sequential Circuits - 2 Constructive Computer Architecture Arvind
MARIE: An Introduction to a Simple Computer
Little Man Computer (continued).
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 16 – RISC-V Vectors Krste Asanovic Electrical Engineering and.
CSE 1020:Software Development
CPU Structure CPU must:
Using C++ Arithmetic Operators and Control Structures
Computer Programming-1 CSC 111
Electrical and Computer Engineering Department SUNY – New Paltz
Presentation transcript:

CO405H Computing in Space with OpenSPL Topic 10: Correlation Example Oskar Mencer Georgi Gaydadjiev Department of Computing Imperial College London CO405H course page: WebIDE: OpenSPL consortium page:

Let’s create a spatial implementation of correlation: r is the correlation coefficient for a pair of data timeseries x,y over a window n Our example computes correlation for N= data timeseries and returns the top 10 correlations at any point in time. DFE Input: Interleaved stream of N timeseries, plus precalculations from the CPU 2 This Lecture: Correlation Example src: n n n n n n n w n

Product-moment correlation: Calculate mean (i.e. first moment) of product of mean-adjusted values: For sample data, calculate as: 3 Pearson product-moment correlation expected value mean value standard deviation with

Computation of correlation is based on 5 sums: sumXY: sumX: sumY: sumX2: sumY2: Typical application involves continuous correlation of time series: use sliding window approach. 4 How to compute a correlation in practice x1x1 x2x2 x4x4 x5x5 x6x6 x0x0 x3x3 data flow x7x7 … oldXnewX No need to recompute entire sum, add next element and subtract previous: sumX += newX – oldX; For every time step, complexity of computing sums = O(1) regardless of window size!

Application in finance requires correlations between 200 to 6000 time series. – Complexity arises from number of time series m to be correlated, not from number of data elements n to be summed in sliding window. 5 Correlation between many time series O(m) sums O(m 2 ) sums_xy O(m) sums_sq O(m 2 ) correlations

for (uint64_t s=0; s<numTimesteps; s++) { index_correlation = 0; for (uint64_t i=0; i<numTimeseries; i++) { double old = (s>=windowSize ? data[i][s-windowSize] : 0); double new = data [i][s]; sums[i] += new - old; sums_sq[i] += new*new - old*old; // start and end of window => let’s call them data pairs } for (uint64_t i=0; i<numTimeseries; i++) { double old_x = (s>=windowSize ? data[i][s-windowSize] : 0); double new_x = data [i][s]; for (uint64_t j=i+1; j<numTimeseries; j++) { double old_y = (s>=windowSize ? data[j][s-windowSize] : 0); double new_y = data [j][s]; sums_xy[index_correlation] += new_x*new_y - old_x*old_y; correlations_step[index_correlation] = (windowSize*sums_xy[index_correlation]-sums[i]*sums[j])/ (sqrt(windowSize*sums_sq[i]sums[i]*sums[i])* sqrt(windowSize*sums_sq[j]-sums[j]*sums[j])); indices_step[2*index_correlation] = j; indices_step[2*index_correlation+1] = i; index_correlation++; } 6 Original CPU Version for Correlation Precompute? Move to DFE LMEM?

for (uint64_t s=0; s<numTimesteps; s++) { index_correlation = 0; for (uint64_t i=0; i<numTimeseries; i++) { double old = (s>=windowSize ? data[i][s-windowSize] : 0); double new = data[i][s]; sums[i] += new - old; sums_sq[i] += new*new - old*old; } for (uint64_t i=0; i<numTimeseries; i++) { double old_x = (s>=windowSize ? data[i][s-windowSize] : 0); double new_x = data [i][s]; for (uint64_t j=i+1; j<numTimeseries; j++) { double old_y = (s>=windowSize ? data[j][s-windowSize] : 0); double new_y = data[j][s]; sums_xy[index_correlation] += new_x*new_y - old_x*old_y; correlations_step[index_correlation] = (windowSize*sums_xy[index_correlation] - sums[i]*sums[j]) / (sqrt(windowSize*sums_sq[i] - sums[i]*sums[i])* sqrt(windowSize*sums_sq[j] -sums[j]*sums[j])); indices_step[2*index_correlation] = j; indices_step[2*index_correlation+1] = i; index_correlation++; } 7 Opportunities for acceleration 1 O(numTimeseries 2 ) O(numTimeseries)

for (uint64_t s=0; s<numTimesteps; s++) { index_correlation = 0; for (uint64_t i=0; i<numTimeseries; i++) { double old = (s>=windowSize ? data[i][s-windowSize] : 0); double new = data[i][s]; sums[i] += new - old; sums_sq[i] += new*new - old*old; } for (uint64_t i=0; i<numTimeseries; i++) { double old_x = (s>=windowSize ? data[i][s-windowSize] : 0); double new_x = data [i][s]; for (uint64_t j=i+1; j<numTimeseries; j++) { double old_y = (s>=windowSize ? data[j][s-windowSize] : 0); double new_y = data[j][s]; sums_xy[index_correlation] += new_x*new_y - old_x*old_y; correlations_step[index_correlation] = (windowSize*sums_xy[index_correlation] - sums[i]*sums[j]) / (sqrt(windowSize*sums_sq[i] - sums[i]*sums[i])* sqrt(windowSize*sums_sq[j] -sums[j]*sums[j])); indices_step[2*index_correlation] = j; indices_step[2*index_correlation+1] = i; index_correlation++; } 8 Opportunities for acceleration 2 reorder data precompute split code

9 Split correlation on CPU and DFE pairs old, new precalculations correlations r indicies x,y CPU 12 pipes fill the chip DFE each pipe has a feedback loop of 12 ticks => 144 calculations running at the same time

for (uint64_t s=0; s<numTimesteps; s++) { index_correlation = 0; for (uint64_t i=0; i<numTimeseries; i++) { double old = (s>=windowSize ? data[i][s-windowSize] : 0); double new = data[i][s]; sums[i] += new - old; sums_sq[i] += new*new - old*old; } for (uint64_t i=0; i<numTimeseries; i++) { double old_x = (s>=windowSize ? data[i][s-windowSize] : 0); double new_x = data [i][s]; for (uint64_t j=i+1; j<numTimeseries; j++) { double old_y = (s>=windowSize ? data[j][s-windowSize] : 0); double new_y = data[j][s]; sums_xy[index_correlation] += new_x*new_y - old_x*old_y; correlations_step[index_correlation] = (windowSize*sums_xy[index_correlation] - sums[i]*sums[j]) / (sqrt(windowSize*sums_sq[i] - sums[i]*sums[i])* sqrt(windowSize*sums_sq[j] -sums[j]*sums[j])); indices_step[2*index_correlation] = j; indices_step[2*index_correlation+1] = i; index_correlation++; } 10 DFE implementation accelerate with DFE precompute sums and inverse of sqrt on CPU store pairs of new and old in LMEM

prepare_data_for_dfe() – The Movie n n n n n n n n first and last element of each window creates a data_pair, then go round robin… data timeseries window n n n n n n n n n reordering for data_pairs

1 1 n n 1 1 n n 1 1 n n Example: Sum of a Moving Window keep the window sum by subtracting the top and adding the new entrant Note: this is a Case 2 Loop (Lecture 9), with a sequential streaming loop (par_loop) 1 1 n n DFEParLoop lp = new DFEParLoop(this,“lp”); DFEVector data_pairs = io.input(”dp", dfeVector(…), lp.ndone); lp.set_input(sum.getType(), 0.0); DFEVar sum = (lp.feedback + data_pairs[0]) - data_pairs[1]; lp.set_output(sum); io.output(”sum", sum, sum.getType(), lp.done); multiplied data_pairs + - sum ×

13 Correlation.max file plotted in Space 12 pipes LMEM LOAD

14 Correlation Pipes: Dataflow Graph

15 Correlation Dataflow Graph: Zoom In

16 File: correlationCPUCode.c Purpose: calling correlationSAPI.h for correlation.max Correlation formula: scalar r(x,y) = (n*SUM(x,y) - SUM(x)*SUM(y))*SQRT_INVERSE(x)*SQRT_INVERSE(y) where: x,y,... - Time series data to be correlated n - window for correlation (minimum size of 2) SUM(x) - sum of all elements inside a window SQRT_INVERSE(x) - 1/sqrt(n*SUM(x^2)- (SUM(x)^2)) Action 'loadLMem': [in] memLoad - initializeLMem, used as temporary storage Action 'default': [in] precalculations: {SUM(x), SQRT_INVERSE(x)} for all timeseries for every timestep [in] data_pair: {..., x[i], x[i-n], y[i], y[i-n],..., x[i+1], x[i-n+1], y[i+1], y[i-n+1],...} for all timeseries for every timestep [out] correlation r: numPipes * CorrelationKernel_loopLength * topScores correlations for every timestep [out] indices x,y: pair of timeseries indices for each result r in the correlation stream

17 prepare_data_for_dfe() – The Code // 2 DFE input streams: precalculations and data pairs for (uint64_t i=0; i<numTimesteps; i++) { old_index = i - (uint64_t)windowSize; for (uint64_t j=0; j<numTimeseries; j++) { if (old_index<0) old = 0; else old = data [j][old_index]; new = data [j][i]; if (i==0) { sums [i][j] = new; sums_sq [i][j] = new*new; }else { sums [i][j] = sums [i-1][j] + new - old; sums_sq [i][j] = sums_sq [i-1][j] + new*new - old*old; } inv [i][j] = 1/sqrt((uint64_t)windowSize*sums_sq[i][j] - sums[i][j]*sums[i][j]); // precalculations REORDERED in DFE ORDER precalculations [2*i*numTimeseries + 2*j] = sums[i][j]; precalculations [2*i*numTimeseries + 2*j + 1] = inv [i][j]; // data_pairs REORDERED in DFE ORDER data_pairs[2*i*numTimeseries + 2*j] = new; data_pairs[2*i*numTimeseries + 2*j + 1] = old; }

DFE Systems Architectures 18 EXAMPLES User’s CPU Code High Level SW Platform HLSWP Risk Analytics Library Video Encoding CPU Code.max Files DFE User’s CPU Code SAPI DFE Programmer’s Platform User’s MaxJ Code DAPI DFEPP Max MPT Video transcoding/processing DFE User’s CPU Code Low Level SW Platform LLSWP Correlation App DFE Config Options (reorder data) SAPI.max File MAPI

 Open correlation project at h​ttps://openspl.doc.ic.ac.uk/  Full exercise specification on CATe  Four versions of correlation: 1.Original: Original software in C 2.Split Control and Data: C software split into dataflow and controlflow 3.Single Pipe: CPU code + DFE code. Runs in simulation and on DFE. 4.Multi Pipe: CPU code + DFE binary. Runs in simulation and on DFE. Task 1: Run the Original C code for correlating 256, 512, 1024 and 2048 time series for 10, 100 and 1000 time steps. Task 2: Run the Split C code, correlating 256, 512, 1024 and 2048 time series for 1000 time steps. Task 3: Run the Single Pipe and Multi Pipe dataflow designs for correlating 256 time series and 10 time steps in simulation mode (SIM). Task 4: Run the Single Pipe and Multi Pipe dataflow designs for correlating 265, 512, 1024, 2048, and 4096 time series for 10, 100 and 100 time steps on the DFE. 19 Correlation Lab Exercise (due 4 Dec 2015)