Improving Loop-Level Parallelism 陈健 (Chen Jian), 2002/11 Copyright © 2002 Intel Corporation.



Agenda  Introduction  Who Cares?  Definition  Loop Dependence and Removal  Dependency Identification Lab  Summary

Introduction  Loops must meet certain criteria… –Iteration Independence –Memory Disambiguation –High Loop Count –Etc…

Who Cares  实现真正的并行 : –OpenMP –Auto Parallelization…  显式的指令级并行 ILP (Instruction Level Parallelism) –Streaming SIMD (MMX, SSE, SSE2, …) –Software Pipelining on Intel® Itanium™ Processor –Remove Dependencies for the Out-of-Order Core –More Instructions run in parallel on Intel Itanium- Processor  自动编译器并行 –High Level Optimizations

Definition  Loop Independence: Iteration Y of a loop is independent of when or whether iteration X happens

int a[MAX];
for (J=0; J<MAX; J++) {
    a[J] = b[J];
}

Legend  OpenMP: True Parallelism  SIMD: Vectorization  SWP: Software Pipelining  OOO: Out-of-Order Core  ILP: Instruction Level Parallelism  Green: benefits from the concept  Yellow: some benefit from the concept  Red: no benefit from the concept

Agenda

Flow Dependency  Read After Write  Cross-Iteration Flow Dependence: variables written, then read, in different iterations

for (J=1; J<MAX; J++) {
    A[J] = A[J-1];
}

First iterations: A[1]=A[0]; A[2]=A[1];

Anti-Dependency  Write After Read  Cross-Iteration Anti-Dependence: variables read, then written, in different iterations

for (J=1; J<MAX; J++) {
    A[J] = A[J+1];
}

First iterations: A[1]=A[2]; A[2]=A[3];

Output Dependency  Write After Write  Cross-Iteration Output Dependence: variables written, then written again, in a different iteration

for (J=1; J<MAX; J++) {
    A[J]   = B[J];
    A[J+1] = C[J];
}

First iterations: A[1]=B[1]; A[2]=C[1]; A[2]=B[2]; A[3]=C[2];

Intra-Iteration Dependency  Dependency within a single iteration  Hurts ILP  May be removed automatically by the compiler

K = 1;
for (J=1; J<MAX; J++) {
    A[J] = A[J] + 1;
    B[K] = A[K] + 1;
    K = K + 2;
}

First iteration: A[1] = A[1] + 1; B[1] = A[1] + 1;

Remove Dependencies  Best choice  Requirement for true parallelism  Not all dependencies can be removed

Before (loop-carried flow dependence):
for (J=1; J<MAX; J++) {
    A[J] = A[J-1] + 1;
}

After (each iteration is a pure function of the loop index):
for (J=1; J<MAX; J++) {
    A[J] = A[0] + J;
}
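A compilable sketch of this transformation, assuming A holds at least MAX ints and OpenMP is available; once the recurrence is replaced by a closed form of the loop index, the iterations no longer have to run in order.

    #define MAX 1024

    void remove_flow_dependence(int *A)
    {
        /* Before: A[J] = A[J-1] + 1;  every iteration waits for the previous one.
           After:  each element is a pure function of A[0] and J, so iterations
           are independent and may run in parallel. */
        #pragma omp parallel for
        for (int J = 1; J < MAX; J++) {
            A[J] = A[0] + J;
        }
    }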

Increasing ILP without removing dependencies  Good: unroll the loop  Make sure the compiler can't or didn't already do this for you  The compiler should not apply common sub-expression elimination  Also notice that if this is floating-point data, precision could be altered

Original:
for (J=1; J<MAX; J++) {
    A[J] = A[J-1] + B[J];
}

Unrolled by two (the two statements in the body are independent of each other):
for (J=1; J<MAX; J+=2) {
    A[J]   = A[J-1] + B[J];
    A[J+1] = A[J-1] + (B[J] + B[J+1]);
}

Induction Variables  Induction variables are incremented on each trip through the loop  Fix by replacing the increment expressions with a pure function of the loop index

Before:
i1 = 0; i2 = 0;
for (J=0; J<MAX; J++) {
    i1 = i1 + 1;  B[i1] = ...
    i2 = i2 + J;  A[i2] = ...
}

After (i1 == J+1 and i2 == (J*J + J)/2 on trip J):
for (J=0; J<MAX; J++) {
    B[J+1] = ...
    A[(J*J + J)/2] = ...
}
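A compilable version of the rewritten loop, assuming B has at least MAX+1 elements and A at least (MAX*MAX+MAX)/2; the stored values are placeholders, the point is that the former counters i1 and i2 are now pure functions of J, so the iterations carry no state.

    #define MAX 100

    void fill_arrays(double *B, double *A)
    {
        /* i1 == J + 1 and i2 == (J*J + J)/2 on trip J, so the running counters
           can be replaced by closed-form index expressions. */
        #pragma omp parallel for
        for (int J = 0; J < MAX; J++) {
            B[J + 1] = (double)J;            /* was B[i1] with i1 = i1 + 1 */
            A[(J * J + J) / 2] = (double)J;  /* was A[i2] with i2 = i2 + J */
        }
    }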

Reductions  Reductions collapse array data to scalar data via an associative operation  Take advantage of associativity and compute partial sums or local maxima in private storage  Next, combine the partial results into the shared result, taking care to synchronize access

for (J=0; J<MAX; J++)
    sum = sum + c[J];
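A sketch using OpenMP's standard reduction clause, which does what the slide describes: each thread accumulates into a private copy of sum, and the private copies are combined into the shared result when the loop ends. The array name follows the slide; MAX is assumed to be defined.

    #define MAX 1000000

    double sum_array(const double *c)
    {
        double sum = 0.0;
        /* Each thread gets a private 'sum'; partial sums are combined at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (int J = 0; J < MAX; J++) {
            sum = sum + c[J];
        }
        return sum;
    }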

Data Ambiguity and the Compiler

void func(int *a, int *b) {
    for (J=0; J<MAX; J++) {
        a[J] = b[J];
    }
}

 Are the loop iterations independent?  The C++ compiler has no idea  No chance for optimization: to run error-free, the compiler must assume that a and b may overlap
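One common remedy (not shown on the slide) is to promise the compiler that the pointers never overlap, for example with the C99 restrict qualifier (spelled __restrict by many C++ compilers); a minimal sketch:

    #define MAX 1024

    /* 'restrict' asserts that a and b do not alias, so the compiler is free to
       treat the iterations as independent and vectorize the copy. */
    void func(int * restrict a, const int * restrict b)
    {
        for (int J = 0; J < MAX; J++) {
            a[J] = b[J];
        }
    }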

Function Calls

for (J=0; J<MAX; J++) {
    compute(a[J], b[J]);
    a[J][1] = sin(b[J]);
}

 Generally, function calls inhibit ILP  Exceptions: –Transcendentals (such as sin above) –IPO (interprocedural optimization) compilations

Function Calls with State  Many routines maintain state across calls: –Memory allocation –Pseudo-random number generators –I/O routines –Graphics libraries –Third-party libraries  Parallel access to such routines is unsafe unless synchronized  Check the documentation for specific functions to determine thread-safety
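As an illustration (not from the slides), a pseudo-random number generator whose state is supplied by the caller, such as POSIX rand_r(), can be called from parallel iterations safely, whereas rand() updates hidden global state and would need synchronization.

    #include <stdlib.h>   /* rand_r (POSIX), RAND_MAX */
    #include <omp.h>      /* omp_get_thread_num       */
    #define MAX 1024

    void fill_random(double *a)
    {
        #pragma omp parallel
        {
            /* Each thread owns its seed, so no hidden shared state is touched. */
            unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
            #pragma omp for
            for (int J = 0; J < MAX; J++) {
                a[J] = rand_r(&seed) / (double)RAND_MAX;
            }
        }
    }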

A Simple Test 1. Reverse the loop order and rerun in serial 2. If the results are unchanged, the loop is independent*

Original:
for (J=0; J<MAX; J++) {
    compute(J, ...);
}

Reversed:
for (J=MAX-1; J>=0; J--) {
    compute(J, ...);
}

*Exception: loops with induction variables
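A sketch of the test as a small harness (names and loop body are illustrative): run the loop forward and reversed over copies of the same data and compare; identical results suggest, but do not prove, that the iterations are independent.

    #include <string.h>
    #define MAX 1024

    static void compute(int J, double *a) { a[J] = a[J] * 2.0 + 1.0; }  /* stand-in body */

    /* Returns 1 if forward and reversed execution produce the same result. */
    int looks_independent(const double *input)
    {
        double fwd[MAX], rev[MAX];
        memcpy(fwd, input, sizeof fwd);
        memcpy(rev, input, sizeof rev);

        for (int J = 0; J < MAX; J++)       compute(J, fwd);   /* original order */
        for (int J = MAX - 1; J >= 0; J--)  compute(J, rev);   /* reversed order */

        return memcmp(fwd, rev, sizeof fwd) == 0;
    }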

Summary  Loop Independence: loop iterations are independent of each other  Explained its importance –ILP and parallelism  Identified common causes of loop dependence –Flow dependency, anti-dependency, output dependency  Taught some methods of fixing loop dependences  Reinforced the concepts through the lab