The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals. 1 、 What is Computer Architecture? 2 、 our purpose Enable better systems: make computers faster, cheaper, smaller, more reliable, … Enable new applications Enable better solutions to problems ： Software innovation is built into trends and changes in computer architecture, > 50% performance improvement per year has enabled this innovation
3 、 problems are solved by electrons Microarchitecture ISA (Architecture) Program/Language Algorithm Problem Logic Circuits Runtime System (VM, OS, MM) Electrons ISA (instruction set architecture) is the interface between hardware and software, It is a contract that the hardware promises to satisfy. Microarchitecture : Specific implementation of an ISA; Not visible to the software. Microprocessor : ISA, Microarchitecture, circuits “Architecture” = ISA + microarchitecture Implementation (uarch) can be various as long as it satisfies the specification (ISA), Microarchitecture usually changes faster than ISA. Few ISAs (x86, ARM, SPARC, MIPS, Alpha) but many uarchs.
Also called stored program computer (instructions in memory). Two key properties: （ 1 ） Stored program Instructions stored in a linear memory array ； Memory is unified between instructions and data ； The interpretation of a stored value depends on the control signals ； （ 2 ） Sequential instruction processing One instruction processed (fetched, executed, and completed) at a time ； Program counter (instruction pointer) identifies the current instr ； Program counter is advanced sequentially except for control transfer instructions ； 4 、 The Von Neumann Model/Architecture
CONTROL UNIT IPInst Register PROCESSING UNIT ALU TEMP MEMORY Mem Addr Reg Mem Data Reg INPUTOUTPUT The Von Neumann Model (of a Computer)
Dataflow model: An instruction is fetched and executed in data flow order. properties: when its operands are ready; there is no instruction pointer; Instruction ordering specified by data flow dependence; Each instruction specifies “who” should receive the result; An instruction can “fire” whenever all operands are received; Potentially many instructions can execute at the same time; In a data flow machine, a program consists of data flow nodes. A data flow node fires (fetched and executed) when all it inputs are ready. Von Neumann model: An instruction is fetched and executed in control flow order. As specified by the instruction pointer; Sequential unless explicit control flow instruction; 5 、 The Dataflow Model (of a Computer )
ISA: Specifies how the programmer sees instructions to be executed Programmer sees a sequential, control-flow execution order Programmer sees a data-flow execution order Microarchitecture: How the underlying implementation actually executes instructions Microarchitecture can execute instructions in any order as long as it obeys the semantics specified by the ISA when making the instruction results visible to software Programmer should see the order specified by the ISA 6 、 The distinctions between ISA and uarch
All major instruction set architectures today use this Von Neumann Model : x86, ARM, MIPS, SPARC, Alpha, POWER Underneath (at the microarchitecture level), the execution model of almost all implementations (or, microarchitectures) is very different: Pipelined instruction execution: Intel uarch Multiple instructions at a time: Intel Pentium uarch Out-of-order execution: Intel Pentium Pro uarch Separate instruction and data caches But, what happens underneath that is not consistent with the von Neumann model is not exposed to software.
(1)Instructions Opcodes, Addressing Modes, Data Types Instruction Types and Formats Registers, Condition Codes (2)MEMORY Address space, Addressability, Alignment Virtual memory management Call, Interrupt/Exception Handling (3)Access Control, Priority/Privilege (4)I/O: memory-mappedvs. instr. (5)Task/thread Management (6)Power and Thermal Management (7)Multi-threading support, Multiprocessor support ISA
Implementation of the ISA under specific design constraints and goals Anything done in hardware without exposure to software (1)Pipelining (2)In-order versus out-of-order instruction execution (3)Memory access scheduling policy (4)Speculative execution (5)Superscalar processing (multiple instruction issue) (6)Clock gating (7)Caching, Levels, size, associativity, replacement policy (8)Prefetching (9)Voltage/frequency scaling (10)Error correction Microarchitecture
(1)Denial of Memory Service in Multi-Core System (2) DRAM Refresh (3) DRAM Row Hammer (or DRAM Disturbance Errors) 7 、 Three examples in lecture 1 There are three examples (questions) in lecture 1. Because I mainly study about the second example, DRAM Refresh, I will talk more about that and describe others briefly.
In a multi-core chip, different cores share some hardware resources. In particular, they share the DRAM memory system. Multiple applications share the DRAM controller, DRAM controllers designed to maximize DRAM data throughput DRAM scheduling policies are unfair to some applications Row-hit first: unfairly prioritizes apps with high row buffer locality. Threads that keep on accessing the same row. Oldest-first: unfairly prioritizes memory-intensive applications. (1)Denial of Memory Service in Multi-Core System (3) DRAM Row Hammer (or DRAM Disturbance Errors) Repeatedly opening and closing a row enough times within a refresh interval induces disturbance errors in adjacent rows in most real DRAM chips.
A DRAM cell consists of a capacitor and an access transistor. It stores data in terms of charge in the capacitor. DRAM capacitor charge leaks over time. The memory controller needs to refresh each row periodically to restore charge, Typical Activate each row every 64 ms. Downsides of refresh -- Energy consumption: Each refresh consumes energy -- Performance degradation: DRAM rank/bank unavailable while refreshed -- predictability impact: (Long) pause times during refresh -- Refresh rate limits DRAM capacity scaling (2) DRAM Refresh
Existing DRAM devices refresh all cells at a rate determined by the leakiest cell in the device. However, most DRAM cells can retain data for significantly longer. Therefore, many of these refreshes are unnecessary. Solution: RAIDR (Retention-Aware Intelligent DRAM Refresh), a low-cost mechanism that can identify and skip unnecessary refreshes using knowledge of cell retention times. The key idea is to group DRAM rows into retention time bins and apply a different refresh rate to each bin. As a result, rows containing leaky cells are refreshed as frequently as normal, while most rows are refreshed less frequently. RAIDR requires no modification to DRAM and minimal modification to the memory controller, a modest storage overhead of 1.25 KB in the memory controller.
More details about RAIDR A retention time profiling step determines each row’s retention time ((1) in Figure 4). For each row, if the row’s retention time is less than the new default refresh interval, the memory controller inserts it into the appropriate bin (2). During system operation (3), the memory controller ensures that each row is chosen as a refresh candidate every 64 ms.
Three key components: (1) retention time profiling; (2) storing rows into retention time bins; (3) issuing refreshes to rows when necessary; (1)Retention time profiling The straightforward method of conducting these measurements is to write a small number of static patterns (such as “all 1s” or “all 0s”), turning off refreshes, and observing when the first bit changes. Before the row retention times for a system are collected, the memory controller performs refreshes using the baseline auto- refresh mechanism. After the row retention times for a system have been measured, the results can be saved in a file by the operating system.
(2)Storing Retention Time Bins: Bloom Filters A Bloom filter is a structure that provides a compact way of representing set membership and can be implemented efficiently in hardware. A Bloom filter consists of a bit array of length m and k distinct hash functions that map each element to positions in the array. A Bloom filter can contain any number of elements; the probability of a false positive gradually increases with the number of elements inserted into the Bloom filter, but false negatives will never occur. This means that rows may be refreshed more frequently than necessary, but a row is never refreshed less frequently than necessary.
(3) Performing Refresh Operations Selecting A Refresh Candidate Row : We choose all refresh intervals to be multiples of 64 ms, This is implemented with a row counter that counts through every row address sequentially. Determining Time Since Last Refresh : Determining if a row needs to be refreshed requires determining how many 64 ms intervals have elapsed since its last refresh. Issuing Refreshes : In order to refresh a specific row, the memory controller simply activates that row, essentially performing a RAS- only refresh.
An auto-refresh operation occupies all banks on the rank simultaneously (preventing the rank from servicing any requests) for a length of time tRFC(the average time between auto-refresh commands), where tRFC depends on the number of rows being refreshed. Previous DRAM generations also allowed the memory controller to perform refreshes by opening rows one-by-one (called RAS- only refresh), but this method has been deprecated due to the additional power required to send row addresses on the bus. Auto-refresh
A microprocessor processes instructions, it has to do three things: 1) supply instructions to the core of the processor where each instruction can do its job; 2) supply data needed by each instruction; 3) perform the operations required by each instruction; (1) Instruction Supply The number that can be fetched at one time has grown from one to four, and shows signs of soon growing to six or eight. Three things can get in the way of fully supplying the core with instructions to process: (a)Instruction cache misses,(b)fetch breaks,(c) conditional branch mispredictions. 8 、 Something new in “Requirements, Bottlenecks, and Good Fortune: Agents for Microprocessor Evolution”
(2) Data Supply To supply data needed by an instruction, one needs the ability to have available an infinite supply of needed data, to supply it in zero time, and at reasonable cost. Thebest we can do is a storage hierarchy: where a small amount of data can be accessed (on-chip) in one to three cycles, a lot more data can be accessed (also, on-chip) in ten to 16 cycles, and still more data can be accessed (off chip) in hundreds of cycles. (3) Instruction Processing To perform the operations required by these instructions, one needs a sufficient number of functional units to process the data as soon as the data is available, and sufficient interconnections to instantly supply a result produced by one functional unit to the functional unit that needs it as a source.