Overview of Intel® Core 2 Architecture and Software Development Tools June 2009.


1 Overview of Intel® Core 2 Architecture and Software Development Tools June 2009

2 Overview of Architecture & Tools We will discuss:
- What lecture materials are available
- What labs are available
- What target courses could be impacted
- Some high-level discussion of underlying technology

3 Objectives After completing this module, you will:
- Be aware of and have access to several hours' worth of MC topics including Architecture, Compiler Technology, Profiling Technology, OpenMP, & Cache Effects
- Be able to create exercises on how to avoid coding common threading hazards associated with some MC systems, such as poor cache utilization, false sharing, and threading load imbalance
- Be able to create exercises on how to use selected compiler directives & switches to improve behavior on each core
- Be able to create exercises on how to take advantage of the VTune analyzer to quickly identify load imbalance issues, poor cache reuse, and false sharing issues

4 Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

5 Why is the Industry moving to Multi-core? In order to increase performance and reduce power consumption. It is much more efficient to run several cores at a lower frequency than one single core at a much faster frequency.

6 Power and Frequency [Chart: power (W) vs. frequency (GHz) curve for a single-core architecture] Dropping frequency = large drop in power. Lower frequency allows headroom for a 2nd core.

7 Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

8 Processor-independent optimizations
/Od   Disables optimizations
/O1   Optimizes for binary size and for speed: server code
/O2   Optimizes for speed (default): vectorization on Intel 64
/O3   Optimizes for data cache: loopy floating-point code
/Zi   Creates symbols for debugging
/Ob0  Turns off inlining, which can sometimes help the analysis tools do a more thorough job

9 AutoVectorization optimizations
/QaxSSE2      Intel Pentium 4 and compatible Intel processors.
/QaxSSE3      Intel(R) Core(TM) processor family with Streaming SIMD Extensions 3 (SSE3) instruction support.
/QaxSSE3_ATOM Can generate MOVBE instructions for Intel processors and can optimize for the Intel(R) Atom(TM) Processor and Intel(R) Centrino(R) Atom(TM) Processor Technology.
/QaxSSSE3     Intel(R) Core(TM)2 processor family with SSSE3.
/QaxSSE4.1    Intel(R) 45nm Hi-k next-generation Intel Core(TM) microarchitecture with support for SSE4 Vectorizing Compiler and Media Accelerator instructions.
/QaxSSE4.2    Can generate Intel(R) SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel(R) Core(TM) i7 processors. Can also generate Intel(R) SSE4 Vectorizing Compiler and Media Accelerator, Intel(R) SSSE3, SSE3, SSE2, and SSE instructions, and can optimize for the Intel(R) Core(TM) processor family.
Intel has a long history of providing auto-vectorization switches along with support for new processor instructions, and backward support for older instructions is maintained. Developers should keep an eye on new developments in order to leverage the power of the latest processors.

10 More Advanced optimizations
/Qipo      Interprocedural optimization performs a static, topological analysis of your application. With /Qipo (-ipo), the analysis spans all of your source files built with /Qipo (-ipo); in other words, code generation in module A can be improved by what is happening in module B. May enable other optimizations like auto-parallelization and auto-vectorization.
/Qparallel Enables the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel.
/Qopenmp   Enables the compiler to generate multi-threaded code based on the OpenMP* directives.

11 Lab 1 - AutoParallelization Objective: Use auto-parallelization on a simple code to gain experience with the compiler's auto-parallelization feature.
- Follow the VectorSum activity in the student lab doc
- Try AutoParallel compilation on the lab called VectorSum
- Extra credit: parallelize manually and see if you can beat the auto-parallel option - see the OpenMP section for constructs to try this

12 Parallel Studio to find where to parallelize Parallel Studio will be used in several labs to find appropriate locations to add parallelism to the code. Parallel Amplifier specifically is used to find hotspot information: where in your code the application spends most of its time. Parallel Amplifier does not require instrumenting your code in order to find hotspots, but compiling with symbol information (/Zi) is a good idea. Compiling with /Ob0 turns off inlining and sometimes gives a more thorough analysis in Parallel Studio.

13 Parallel Amplifier Hotspots

14 What does hotspot analysis show?

15 What about drilling down?

16 The call stack The call stack shows the callee/caller relationship among functions in the code.

17 Found potential parallelism

18 Lab 2 – Mandelbrot Hotspot Analysis Objective: Use sampling to find some parallelism in the Mandelbrot application.
- Follow the activity called Mandelbrot Sampling in the student lab doc
- Identify candidate loops that could be parallelized

19 Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
  - High-level overview – Intel® Core Architecture
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

20 Intel® Core 2 Architecture (snapshot in time during Penryn, Yorkfield, Harpertown)
- Mobile platform optimized: 1-4 execution cores; 3/6 MB L2 cache sizes; 64-byte L2 cache line; 64-bit
- Desktop platform optimized: 2-4 execution cores; 2x3, 2x6 MB L2 cache sizes; 64-byte L2 cache line; 64-bit
- Server platform optimized: 4 execution cores; 2x6 MB L2 caches; 64-byte L2 cache line; DP/MP support; 64-bit
Software developers should know the number of cores, the cache line size, and the cache sizes to tackle the Cache Effects materials.

21 Memory Hierarchy Approximate access costs: CPU to L1 cache ~1 cycle; L2 cache ~1-10 cycles; main memory ~100s of cycles; magnetic disk ~1000s of cycles.

22 High Level Architectural view (Intel® Core™ Microarchitecture memory sub-system) Legend: A = architectural state; E = execution engine & interrupt; C = 2nd-level cache; B = bus interface. The Intel Core 2 Duo processor has two cores with a shared L2 cache, connected to memory via 64-byte cache lines. The Intel Core 2 Quad processor pairs two such dies, so the quad core has both shared and separated cache.

23 With a separated cache (Intel® Core™ Microarchitecture memory sub-system) CPU1 and CPU2 each have their own L2 cache and share the front side bus (FSB) to memory. When one core needs a line the other holds, the cache line is shipped between the caches, at roughly half the cost of an access to memory.

24 Advantages of Shared Cache, using Advanced Smart Cache® Technology (Intel® Core™ Microarchitecture memory sub-system) CPU1 and CPU2 share one L2 cache, so there is no need to ship the cache line between cores.

25 False Sharing A performance issue in programs where cores write to different memory addresses BUT in the same cache line. Known as ping-ponging: the cache line is shipped back and forth between the cores. Example timeline: core 0 writes X[0] while core 1 writes X[1]; because X[0] and X[1] share a cache line, every write forces the line to migrate to the writing core. False sharing is not an issue in a shared cache; it is an issue in a separated cache.

26 Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

27 Super Scalar Execution Multiple execution units (FP, SIMD, INT) allow SIMD parallelism. Many instructions can be retired in a clock cycle, so multiple operations are executed within a single core at the same time.

28 History of SSE Instructions
- Intel SSE (1999): 70 instructions; single-precision vectors; streaming operations
- Intel SSE2 (2000): 144 instructions; double-precision vectors; 8/16/32/64/128-bit vector integer
- Intel SSE3 (2004): 13 instructions; complex data
- Intel SSSE3 (2006): 32 instructions; decode
- Intel SSE4.1 (2007): 47 instructions; video accelerators; graphics building blocks; advanced vector instructions
- Will be continued by Intel SSE4.2 (XML processing, end 2008)
See http://download.intel.com/technology/architecture/new-instructions-paper.pdf
Long history of new instructions; most require using packing & unpacking instructions.

29 SSE Data Types & Speedup Potential SSE: 4x floats. SSE-2 adds: 2x doubles; 16x bytes; 8x 16-bit shorts; 4x 32-bit integers; 2x 64-bit integers; 1x 128-bit integer. SSE-3 and SSE-4 operate on the same vector types. Potential speedup (in the targeted loop) is roughly the same as the amount of packing, i.e. for floats, speedup ~4X.

30 Goal of SSE(x) Scalar processing (traditional mode): one instruction produces one result, X + Y. SIMD processing (with SSE 2/3/4): one instruction produces multiple results: [x3 x2 x1 x0] + [y3 y2 y1 y0] = [x3+y3 x2+y2 x1+y1 x0+y0]. Uses the full width of the XMM registers; many functional units; choice of many instructions. Not all loops can be vectorized, and most function calls can't be vectorized.

31 Lab 3 – IPO assisted Vectorization Objective: Explore how inlining a function can dramatically improve performance by allowing vectorization of a loop containing a function call.
- Open the SquareChargeCVectorizationIPO folder and use "nmake all" to build the project from the command line
- To add switches to the make environment, use nmake all CF="/QxSSE3", for example

32 Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects

33 Cache effects Cache effects can sometimes impact the speed of an application by as much as 10X or even 100X. To take advantage of the cache hierarchy in your machine, use and re-use data already in cache as much as possible. Avoid accessing non-contiguous memory locations, especially in loops. You may need to consider a loop interchange to access data in a more efficient manner.

34 Loop Interchange Very important for the vectorizer!

Before (the inner k loop strides down the columns of b):

for (i = 0; i < NUM; i++)
  for (j = 0; j < NUM; j++)
    for (k = 0; k < NUM; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

After interchange (the inner j loop is the fast loop index, accessing b and c with unit stride):

for (i = 0; i < NUM; i++)
  for (k = 0; k < NUM; k++)
    for (j = 0; j < NUM; j++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

Non-unit-stride skipping in memory can cause cache thrashing, particularly for array sizes of 2^n.

35 Unit Stride Memory Access (C/C++) [Diagram: elements a[i][k] and b[k][j] laid out in row-major order] In C/C++ the rightmost array index varies fastest in memory. Making j the fastest-incremented loop index gives consecutive memory accesses for b[k][j]; k is the next-fastest loop index, giving consecutive accesses for a[i][k].

36 Poor Cache Utilization - with Eggs Analogy: a carton represents a cache line; the refrigerator represents main memory; the table represents the cache; the pan is ready to fry eggs. A request for an egg not already on the table brings a whole new carton of eggs from the refrigerator, but the user fries only one egg from each carton. When the table fills up, old cartons are evicted and most of their eggs are wasted.

37 Good Cache Utilization - with Eggs A request for one egg still brings a new carton of eggs from the refrigerator, but now the user fries all the eggs in a carton before an egg from the next carton is requested, and eventually asks for all the eggs. Carton eviction doesn't hurt, because all the eggs in the cartons on the table have already been fried.

38 Lab 4 – Matrix Multiply Cache Effects Objective: Explore the impact of poor cache utilization on performance with Parallel Studio, and explore how to restructure loops to achieve significantly better cache utilization & performance.


40 BACKUP

