Class Overview: Parallel Computer Architecture

1 Class Overview: Parallel Computer Architecture

2 What you can expect from this course?
If you earn a good grade in this course, you will know what computer architects do and be ready to carry out your own simulation-based research.
You will also be better prepared for graduation and for strong Ph.D. work.

3 Bibles in Computer Architecture
VS.

4 Bibles in Computer Architecture
Quantitative Design and Analysis; Memory Hierarchy Design; Instruction-level Parallelism; Data-level Parallelism; Thread-level Parallelism
VS.
Computer Abstraction; Basics of Instruction; Arithmetic for Computers; The Processor; Storage Concept; Multicore, Multiprocessor, and Clusters

5 Bibles in Computer Architecture
“A little bit” Advanced Topics VS. Conceptual Idea

6 Computer Architecture Definition
Computer architecture bridges application and technology.
The gap between application and technology is too large to bridge in one step.
Computer Architecture: develop abstraction and implementation layers to execute information-processing applications efficiently using available fabrication technology.

7 Computer System Stack
Application example: sort an array of numbers, 2,6,3,8,4,5 -> 2,3,4,5,6,8
Algorithm: out-of-place selection sort
1. Find the minimum number in the array
2. Move the minimum number into the output array
3. Repeat steps 1 and 2 until finished
Programming Language: C implementation of selection sort
Stack layers, top to bottom: Application, Algorithm, Programming Language, Operating Systems, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate Level, Circuits, Devices, Technology
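The three selection-sort steps above can be sketched in a few lines. The slide refers to a C implementation that is not reproduced in this transcript, so the Python version below is only an illustration; the function name `selection_sort_out_of_place` is an assumption, not taken from the course material.

```python
def selection_sort_out_of_place(values):
    """Return a new sorted list; the input list is left unchanged."""
    remaining = list(values)  # working copy, so the caller's data is untouched
    output = []               # the separate output array ("out-of-place")
    while remaining:
        # Step 1: find the position of the minimum number still remaining.
        min_index = 0
        for i in range(1, len(remaining)):
            if remaining[i] < remaining[min_index]:
                min_index = i
        # Step 2: move that minimum into the output array.
        output.append(remaining.pop(min_index))
        # Step 3: the while loop repeats steps 1 and 2 until nothing remains.
    return output
```

Calling `selection_sort_out_of_place([2, 6, 3, 8, 4, 5])` returns `[2, 3, 4, 5, 6, 8]`, the example from the slide.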

8 Computer System Stack
Operating Systems: Mac OS X, Windows, Linux; handles low-level HW management
Instruction Set Architecture: MIPS32 instruction set; the instructions that the machine executes

9 Computer System Stack
Microarchitecture and Register-Transfer Level: how data flows through the system
Gate Level: Boolean logic gates and functions
Circuits: combining devices to do useful work
Devices: transistors and wires
Technology: silicon process technology

10 Logic, State, and Interconnect
Digital systems are implemented with three basic building blocks:
Logic to process data
State to store data
Interconnect to move data

11 General-Purpose Computing: Processors, Memories, and Networks
Basic building blocks of computer engineering:
Processors for computation
Memories for storage
Networks for communication
Figure: input data and output data pass through the network.

12 System Research as a Scientific Approach
General Science: discover truths about nature
Ask question about nature; construct hypothesis; test with experiment; analyze results & draw conclusions
Computer Engineering: explore design space for a new system
Design and model baseline system; ask question about system; design and model alternative system; test with experiment; analyze results & draw conclusions

13 Let’s Look at the Details of Mainboard
PCIx16 slots, PCI slot, North bridge, CPU socket, USB headers, South bridge, RAM slots, SATA plugs, PATA connectors

14 CPU socket
Many different physical socket standards exist.
The physical standard is less important than the Instruction Set Architecture (ISA).
IBM PCs are Intel compatible: original x86 design; Intel, AMD, VIA.
Today’s dominant ISA: x86-64, developed by AMD.

15 North bridge and RAM slots
North bridge: coordinates access to main memory
RAM slots:
Pre-1993: DRAM (Dynamic RAM)
Post-1993: SDRAM (Synchronous DRAM)
Current standard: Double data rate SDRAM (DDR SDRAM)

16 South bridge
South bridge: facilitates I/O between devices, the CPU, and main memory
Built-in I/O
I/O device slots: the slightly less old standard is PCI slots (or AGP/PCI-X slots), attached to the south-bridge bus

17 Storage connectors
Controlled by the South Bridge
Old standard: Parallel ATA (P-ATA); AT Attachment Packet Interface (ATAPI); an evolution of the Integrated Drive Electronics (IDE) standard
Other standards: Small Computer System Interface (SCSI); Serial ATA (SATA)

18 All devices compete for access to memory
Figure: CPU(s) with L1, L2, L3 cache; North/South Bridge; Memory; Graphics; I/O

19 CPU
Figure: Main Memory and System Bus; L1 (and L2, L3) Cache; Instruction Fetch; Decode; Control Unit; Registers; Floating Point (FPU); two Arithmetic and Logic units (ALU)

20 Processor Performance Increase
Figure: machines plotted on the performance chart: Intel Pentium 4/3000, DEC Alpha 21264A/667, DEC Alpha 21264/600, Intel Xeon/2000, DEC Alpha 5/500, DEC Alpha 4/266, DEC Alpha 5/300, DEC AXP/500, IBM POWER 100, HP 9000/750, IBM RS6000, MIPS M2000, MIPS M/120, SUN-4/260

21 Impacts of Advancing Technology
Capacity: 2x every 2 years
Speed: 1.5x every 10 years
Cost per bit: decreases 25% per year
Capacity: increases 60% per year
Capacity: increases 30% per year
Performance: 2x every 1.5 years
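The rates above mix two conventions ("2x every N years" and "X% per year"), which makes them hard to compare directly. A minimal sketch of the conversion, assuming steady compound growth, follows; the helper names `annual_growth_rate` and `years_to_multiply` are illustrative and not from the slides.

```python
import math


def annual_growth_rate(factor, years):
    """Compound growth per year equivalent to 'factor'x every 'years' years."""
    return factor ** (1.0 / years) - 1.0


def years_to_multiply(annual_rate, factor=2.0):
    """Years needed to change by 'factor' at a fixed annual rate.

    Use a negative rate with a factor below 1 for quantities that shrink,
    e.g. annual_rate=-0.25 and factor=0.5 for a cost that falls 25% per year.
    """
    return math.log(factor) / math.log(1.0 + annual_rate)
```

Under this assumption, "2x every 2 years" is about 41% per year, "2x every 1.5 years" is about 59% per year, a 60% annual increase doubles in roughly 1.5 years, and a 25% annual decrease halves in roughly 2.4 years.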

22 Performance vs. Power Density Trend
$\text{Power} \sim \tfrac{1}{2} C V^{2} A f$
C (capacitance): function of wire length and transistor size
V (supply voltage): has been dropping with successive fab generations
A (activity factor): how often, on average, do wires switch?
f (clock frequency): increasing…
Figure: power density (W/cm2, log scale from 1 to 10000) versus year (1970 to 2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6 processors, with Hot Plate, Nuclear Reactor, Rocket Nozzle and Surface of the Sun as reference levels.
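The power relation on this slide is easy to explore numerically. The sketch below simply evaluates the formula as written on the slide; the function name and the sample values are made up for illustration and do not describe any real chip.

```python
def dynamic_power(capacitance, voltage, activity, frequency):
    """Evaluate 1/2 * C * V^2 * A * f (SI units in, watts out)."""
    return 0.5 * capacitance * voltage ** 2 * activity * frequency


# Illustrative values only: 1 nF switched capacitance, 1.0 V supply,
# activity factor 0.5, 1 GHz clock.
baseline = dynamic_power(1e-9, 1.0, 0.5, 1e9)
# Same design with the supply voltage lowered by 10%.
lower_voltage = dynamic_power(1e-9, 0.9, 0.5, 1e9)
```

Because voltage enters squared, the 10% voltage drop leaves 0.9^2 = 81% of the baseline power, whereas a 10% drop in frequency alone would leave 90%.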

23 Thermal map: 1.5 GHz Itanium-2
Figure: thermal map with a temperature scale in °C; labeled regions: Cache, Execution core (120 °C)
Source: Intel Corporation and Prof. V. Oklobdzija

24 The Result
Raising the clock frequency enough to keep up with the expected performance trend is non-trivial.
Architectural solutions help CPUs reach higher performance; in this course we will learn the answers to the questions that follow.

25 Architectural solutions
How can we improve performance at the same or a similar clock frequency?
How can we schedule instructions in hardware to improve performance?
Die photo labels: L1 D$, Execution Units, L2 CACHE, Instruction Scheduling, Retirement, L1 I$
*Die photo credit: Intel WikiChips

26 Multi-Core Systems
Die photo labels: CORE 2, CORE 3, L2 CACHE 2, L2 CACHE 3, SHARED L3 CACHE, DRAM INTERFACE, DRAM MEMORY CONTROLLER, DRAM BANKS, NVM BANKS, PCIe
*Die photo credit: AMD Barcelona/ETHZ Prof. Onur

27 Multi-Core Systems
How can we interconnect the cores?
Caches are now shared by multiple cores; how can we keep them consistent and coherent?
How can we schedule and synchronize threads across multiple cores?
Do we need to be aware of the cache affinity imposed by multiple cores?
Figure labels: CORE 0, CORE 1, CORE 2, CORE 3
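The synchronization question can be seen in miniature at the software level. The sketch below is a generic illustration using Python's standard `threading` module, not the course's own material: several threads update one shared counter, and a lock makes each read-modify-write step mutually exclusive. It says nothing about how hardware keeps caches coherent; the function name `parallel_count` is an assumption.

```python
import threading


def parallel_count(num_threads, increments_per_thread):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments_per_thread):
            # Only one thread at a time may execute the update below.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every worker has finished
    return counter
```

With the lock in place, `parallel_count(4, 1000)` always returns 4000, because no update can be lost between a thread's read and its write.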

28 Multi-Core Systems
DRAM is also shared by all of the cores; what do we need in order to handle large memory footprints?
How can we make storage performance keep up, and make storage sharable?
Figure label: PCIe

29 Course Logistics
All materials will be available through the KLMS system (as PDF files).
The reference textbook is “Computer Architecture: A Quantitative Approach” (AQA).
Note that, because this course sits between advanced computer architecture and the undergraduate course (EE312), the topics are not strictly aligned with the AQA textbook.

30 Syllabus: Parallel Computer Architecture
Parallelism levels covered: Instruction-level, Memory-level, Data-level, Thread-level, NVM-level
Topics:
Quick Review (ISA, MIPS/RISC, Pipelining, Hazard, etc.)
Parallel Processing
Static Scheduling (Loop unrolling, etc.)
Dynamic Scheduling (Scoreboard, Tomasulo algorithm, ROB, etc.) and Precise Exception Management
Branch Prediction (correlated, tournament, RAS/BTB)
$ Reviews
SRAM/DRAM/NVM Bank parallelism
Persistency control
Multi/many-core architecture
GPU architecture
Coherence, Synchronization, Cache affinity, lock, lock-free, interconnect
SSD internal parallelism
RAID-level parallelism
All-Flash Array (NVMe over Fabric, Storage Lock, etc.)
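To give a flavor of the branch-prediction topic, here is a minimal sketch of a single 2-bit saturating counter, the basic building block that the more advanced predictors named in the syllabus extend. It is a deliberately simpler technique than the correlated or tournament predictors listed, and the function name and default state are assumptions for illustration.

```python
def simulate_two_bit_predictor(outcomes, initial_state=2):
    """Count correct predictions of one 2-bit saturating counter.

    States 0 and 1 predict not-taken; states 2 and 3 predict taken.
    'outcomes' is an iterable of booleans (True means the branch was taken).
    """
    state = initial_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# A loop branch that is taken 9 times and then falls through, run twice.
loop_branch = ([True] * 9 + [False]) * 2
hits = simulate_two_bit_predictor(loop_branch)
```

Here `hits` is 18 out of 20: the counter mispredicts only the two loop exits, and one not-taken outcome is not enough to flip its prediction for the next run of the loop.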

31 Syllabus: Parallel Computer Architecture
Lab projects: gem5-based projects. Most projects are step-by-step tutorials that teach you how to carry out simulation-based architectural explorations and studies.
Parallelism levels covered: Instruction-level, Memory-level, Data-level, Thread-level, NVM-level
Topics:
Quick Review (ISA, MIPS/RISC, Pipelining, Hazard, etc.)
Parallel Processing
Branch Prediction
Static Scheduling (Loop unrolling, etc.)
Dynamic Scheduling (Scoreboard, Tomasulo algorithm, ROB, etc.) and Precise Exception Management
Bank parallelism
Cache affinity, memory lock, persistent control
Multi/many-core architecture
Vector architecture, GPGPU architecture
Coherence control, Synchronization, MPI, etc.
Interconnect
SSD internal parallelism
RAID-level parallelism
All-Flash Array (NVMe over Fabric, Storage Lock, etc.)
NOTE: Specific items in the syllabus may change based on lecture progress and student demand. Please check the “updated syllabus” at klms.kaist.ac.kr

32 Prerequisites
I will assume you have detailed knowledge of:
Pipelining: classic 5-stage pipeline, pipeline hazards, stalls, etc.
Caches: Tag/Index/Offset, hit/miss, set-associativity, replacement policies, write-through, etc.
Assembler and ISAs: RISC, load/store, instruction encoding, caller-saved/callee-saved registers, stack pointer, frame pointer, function call/return code, etc.
If you don’t remember a few of these: read Appendices A, B, C.
If you don’t know what some of these are: review “Computer Organization and Design” and take EE312. Take Undergraduate Computer Architecture before you take this class, or read the “Computer Organization and Design” textbook before next week’s lectures.
Plus, don’t miss the “Quick Reviews” in the CPU and cache lectures.
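As a quick self-check on the Tag/Index/Offset prerequisite, the sketch below splits a byte address into those three fields. It assumes a set-associative cache whose block size and number of sets are both powers of two; the function name `split_address` and the example cache parameters are illustrative, not taken from the course.

```python
def split_address(address, block_size, num_sets):
    """Split a byte address into (tag, index, offset).

    Assumes block_size (in bytes) and num_sets are powers of two.
    """
    for value in (block_size, num_sets):
        if value <= 0 or value & (value - 1):
            raise ValueError("block_size and num_sets must be powers of two")
    offset_bits = block_size.bit_length() - 1  # log2(block_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    offset = address & (block_size - 1)        # byte position within the block
    index = (address >> offset_bits) & (num_sets - 1)  # which set to look in
    tag = address >> (offset_bits + index_bits)        # identifies the block
    return tag, index, offset


# Hypothetical cache: 32 KiB, 4-way set-associative, 64-byte blocks,
# which gives 32768 / (64 * 4) = 128 sets.
tag, index, offset = split_address(0x12345678, block_size=64, num_sets=128)
```

For this hypothetical cache the address 0x12345678 has 6 offset bits and 7 index bits, giving offset 56, index 89, and tag 0x91A2.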

33 Grades
This course is project-oriented, but it will have two “lightweight” exams.

34 Exams: a kind of quiz (30%)
The exams will not be heavy; they simply touch on key components. The schedule below may be updated; any change will be announced in class.
Closed book, no calculator, and no cheat sheet.
Exam1 (15%) and Exam2 (15%), scheduled across the topics: Instruction-level parallelism, Memory-level parallelism, Data-level parallelism, Thread-level parallelism, NVM-level parallelism

35 Paper Critiques (10%)
Done in groups of 3~4 students.
We will have 2~3 lectures (the number is subject to change) strictly focused on paper reviews, on three themes: i) Processors (CPU/GPU) and Cache, ii) Memory, iii) NVM.
For each theme, we will review a small number of papers (probably 3).
The TA will give you the reading assignment (you don’t need to pick a paper) and a review template.
Each group is expected to submit one summary of the presented papers.
Grades are based on the submitted paper summary and on your classmates’ peer reviews of your group’s presentation.

36 Lab Projects (35%)
Lab projects are done individually.
Five projects guide you through learning how to use gem5 and perform experimental evaluations with system-call mode and full-system mode simulations.
Lab #1: Installation, configuration, and evaluation of gem5 + CPU design analysis
Lab #2: Cache design and optimizations
Lab #3: Exploring different branch predictors and performance analysis
Lab #4: Enabling full-system (w/ $ coherence) and multi-threading
Lab #5: Enabling an NVMe SSD and internal parallelism analysis

37 Final Projects (25%)
A team of 2 members is recommended.
Undergraduates may evaluate or analyze an existing system (with different configurations), but graduate students must add at least an incremental improvement idea to the project.
Presentations: the proposal presentation takes place 2 weeks after Exam 1 (each team has around 5 minutes for a short presentation). Final presentations take place right after the last lecture.
You will submit the project as a website (use github.io) that includes the presentation slides, the final report, and the source code on GitHub.

38 Final Project examples
CacheSim, Parallel Memory Allocators, LockFree, SIMD

39 Plagiarism
Collaboration and discussion are strongly encouraged, but the work you submit must be your own.
Turnitin will check your reports for plagiarism by comparing them not only with your classmates’ reports but also with all resources available on the web.
If you have questions about an online resource, ask us.
If your report is found to contain plagiarism (stealing another author’s language or expression), or if you cheat on an exam, you will automatically receive ZERO. Penalties for plagiarism and cheating follow the KAIST guideline.

40 Teaching Assistant
Myoungsoo Jung (Instructor)
Associate Professor, KAIST, 2019~present
Assistant Professor, Yonsei Univ, 2015~2019
Assistant Professor, The University of Texas at Dallas, 2013~2015
Guest Scientist, Lawrence Berkeley National Laboratory, Computer Architecture Group, 2011~2015
Researcher/Engineer, Samsung Electronics, Memory Division, 2006~2009
Miryeong Kwon (TA)
Office: N1 421
Mail:
Thanks to MK, who deserves much of the credit for this lecture!

41 Class Overview: Parallel Computer Architecture

