Computer Architectures M


1 Computer Architectures M
Core

2 CMP Chip Multi Processor
In this context, I/O indicates any communication with the external world (real I/O, memory, external caches). «Shared cache» indicates L2 or L3; very often L2 too is integrated in the processor.

3 Advantages
Minimum latency time for data transfer
No bus use for interprocessor communication
Possible dynamic cache allocation between the processors
Disadvantages
Complexity: the controller must evaluate in real time the needs of the two CPUs, and an error can block one of them
The cache bandwidth must be much higher in order to serve two CPUs => if the cache access is multiport => further increased complexity; if queue only => reduced efficiency
This design does not cater for scaling (the cache cannot be divided)

4 Advantages
Reduced handling complexity
Easy scaling (i.e. one CPU only)
No bus involvement
Disadvantages
No dynamic balancing
Requires an accurate I/O controller design
Performance reduced because of the traffic between the two CPUs (which affects the I/O too)

5 Shared package
Advantages
It is a dual-CPU and therefore easier design
Easy scaling with one CPU only
Reduced test complexity (one CPU at a time can be tested)
Shorter «time to market»
Disadvantages
Very important: CPU communication affects the bus
Double electrical load on the bus capacitance => slower behaviour

6 Enhanced SpeedStep Technology
It allows reducing the operating voltage while reducing the clock frequency.
Pentium 1.6 GHz:
Voltage   Clock
1.484 V   1.6 GHz
1.42 V    1.4 GHz
1.276 V   1.2 GHz
1.164 V   1 GHz
1.036 V   800 MHz
0.956 V   600 MHz
There are different power lines for the functional units, which can be selectively switched off.
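Since dynamic CMOS power scales roughly as P ∝ C·V²·f, lowering voltage together with frequency gives a more-than-linear saving. Below is a minimal sketch of that arithmetic in Python, applied to the table's voltage/clock pairs (the quadratic power model and the normalization to the top operating point are illustrative assumptions, not Intel data):

  # Relative dynamic power P ~ C * V^2 * f, normalized to the top point.
  points = [  # (voltage in V, clock in MHz) from the table above
      (1.484, 1600), (1.42, 1400), (1.276, 1200),
      (1.164, 1000), (1.036, 800), (0.956, 600),
  ]
  v0, f0 = points[0]
  for v, f in points:
      rel = (v * v * f) / (v0 * v0 * f0)
      print(f"{v:.3f} V @ {f:4d} MHz -> {rel:5.1%} of peak dynamic power")

At the lowest point (0.956 V, 600 MHz) this model gives roughly 16% of peak dynamic power, which is why scaling voltage with frequency is far more effective than scaling frequency alone.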

7 64-bit extensions
They allow the execution of 64-bit OSs and programs.
Addressing space up to 16 exabytes (2^64 bytes = 2^32 x 2^32 bytes = 4G x 4G).
8 additional 64-bit registers/accumulators R8-R15.
All other accumulators extended to 64 bits.
MMX registers: 64 bit.
[Figure: register layout – RAX (64 bit) extends EAX (32 bit), which contains AX (16 bit) and its AH/AL byte halves]
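The address-space figure can be checked with a couple of lines of Python (using binary exabytes, 1 EB = 2^60 bytes):

  space = 2 ** 64                           # bytes addressable with 64 bits
  assert space == (2 ** 32) * (2 ** 32)     # i.e. 4G x 4G
  print(space // 2 ** 60, "EB")             # -> 16 EB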

8 Core (2005-2007)
New architecture named Core, designed for multicore
Low power consumption
14-stage pipeline
Developed in Israel
Multicore with Out-Of-Order execution
While the PIV tried to increase efficiency by increasing the clock, the Core relies on multiprocessing, made possible by the reduced transistor size. This also allows an increase of the cache size.
No trace cache

9 Core
[Figure: two CORE pipelines (0 and 1), each with its own L1 (L1 – Core 0, L1 – Core 1), a shared L2, and the FSB interface. NB: in this figure the prefetcher includes the L1 cache]

10 Core

11 Core
1 + 3 decoders – 7 u-ops/clock, greater ROB, increased number of FUs
2.66 GHz
Smart power reduction: for instance, not only are the unused FUs powered down, but the internal bus paths are activated only when necessary for each instruction.
For each core, two L1 caches (data and instructions): Instructions => 32 (or 64) KB, 8-way; Data => 32 (or 64) KB, 2/8-way.
L2 cache: 2-4 MB unified.
The two cores' L1 caches can exchange information directly without using the bus.

12 Core Microarchitecture
Dual-core, superscalar of order 4; 36-bit physical addresses.
L2 shared, inclusive, unified (Data and Instructions): each core uses the portion it needs, and if the two cores use the same instructions these can be shared.
The Core has no multithreading, which returns in the following processor generation.

13 Advanced Smart Cache
NON-shared L2 has the following disadvantages:
possible replication of the same data in the two caches
snoop through the FSB
static partitioning of the silicon
Shared L2 advantages:
none of the previous disadvantages
[Figure: independent L2 caches vs. shared L2 cache]

14 Core Microarchitecture
4 + 3 u-ops per clock (up to 4 from the complex decoder plus 3 from the simple ones)
6 ports
Higher-efficiency ALUs

15 Core Pipeline
The pipeline has 14 stages (P6: 12 stages); the two additional stages were inserted for delay and fusion handling.
The ROB stores 96 u-ops (126 in the Xeon, because of multithreading).
Unified RS handling (memory/non-memory – no difference between the FUs) with increased entries for better FU exploitation.
It must be noted that the number of «in flight» instructions is further increased by fusion (see later). The Core window is therefore greater than the increase of the RS and ROB alone would suggest.

16 Core Architecture – 6 Ports
Data paths restructured for the vector ALUs: operations on 128-bit data are split into two 64-bit operations.
The two ROBs shown in the figure are the same ROB!

17 Core Microarchitecture
Intelligent prefetcher: 2x16=32-byte buffer (as in P6). The system tries to guess the required data: for instance, when the data at address N are requested, the system reads in advance the data at address N+1 (if the bus is available).
More precisely, each Core chip has 2x3+2=8 prefetchers (per core, two for data and one for instructions, plus two prefetchers for the shared L2). The prefetch policies differ between models according to the intended use (mobile, server, desktop).
The prefetch algorithm (secret) considers the access sequences in order to predict the next requests and anticipate them.
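The real prefetch algorithm is secret, but a minimal sketch of the general idea is possible: watch the access sequence, and when a constant stride emerges, request the address that continues the pattern. Everything below (class name, structure) is a hypothetical illustration, not Intel's algorithm:

  class StridePrefetcher:
      """Toy predictor: if recent accesses show a constant stride,
      prefetch the address that continues the pattern."""
      def __init__(self):
          self.last_addr = None
          self.last_stride = None

      def access(self, addr):
          prefetch = None
          if self.last_addr is not None:
              stride = addr - self.last_addr
              if stride == self.last_stride:   # pattern confirmed
                  prefetch = addr + stride     # read ahead (bus permitting)
              self.last_stride = stride
          self.last_addr = addr
          return prefetch

  p = StridePrefetcher()
  for a in (100, 104, 108, 112):
      print(a, "->", p.access(a))   # starts prefetching once the stride repeats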

18 Loop detector
Exploits hardware loop detection: the loop detector analyzes the branches and determines whether they form a loop.
Avoids the repetitive fetch and branch prediction... but still requires decoding each cycle.
[Figure: Fetch, Branch Prediction and Decode with a Loop Stream Detector holding 18 instructions]
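As a rough illustration of the detection step, a taken backward branch whose body fits the 18-instruction buffer can be treated as a streamable loop. The sketch below assumes one instruction per address unit; the function and its criteria are hypothetical, not the documented mechanism:

  LSD_CAPACITY = 18  # instructions, per the slide

  def is_lsd_candidate(branch_pc, target_pc, taken):
      """A taken backward branch with a body small enough to fit the
      buffer can be streamed, skipping repeated fetch and prediction
      (decode still happens every cycle on this design)."""
      if not taken or target_pc >= branch_pc:
          return False
      body_len = branch_pc - target_pc + 1   # assumes 1 instruction per pc
      return body_len <= LSD_CAPACITY

  print(is_lsd_candidate(120, 110, True))   # True: 11-instruction loop
  print(is_lsd_candidate(150, 100, True))   # False: body exceeds the buffer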

19 Macrofusion
In the complex decoder, pairs of machine instructions can be fused (typically compare and test instructions are fused with branch instructions). The only limit is that only one «macrofused» instruction per cycle can be generated.
This requires a more complex decoder, ALU and Branch FU, but grants a reduced number of «in flight» u-ops, faster ROB and RS emptying, and an apparently higher efficiency of the ALUs. This means a lower power consumption for the same program.
The sequence
load EAX, [mem1]
cmp EAX, [mem2]
jne Target
becomes
cmp EAX, [mem2] + jne Target (test and branch)
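The pairing rule described above can be sketched as follows: the decoder scans the instruction group, fuses a cmp/test with the conditional jump that immediately follows it, and emits at most one fused u-op per cycle. The tuple encoding is a hypothetical simplification:

  def decode_cycle(instrs):
      """Fuse cmp/test + following conditional jump into one u-op;
      at most one macrofused u-op per decode cycle."""
      uops, fused_done, i = [], False, 0
      while i < len(instrs):
          op = instrs[i]
          nxt = instrs[i + 1] if i + 1 < len(instrs) else None
          if (not fused_done and op[0] in ("cmp", "test")
                  and nxt is not None and nxt[0].startswith("j")):
              uops.append(("fused", op, nxt))   # one test-and-branch u-op
              fused_done = True
              i += 2
          else:
              uops.append(op)
              i += 1
      return uops

  seq = [("load", "EAX", "[mem1]"), ("cmp", "EAX", "[mem2]"), ("jne", "Target")]
  print(decode_cycle(seq))   # the load stays; cmp+jne become a single u-op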

20 Microfusion
Two distinct u-ops are allocated in the same bit string.
When a microfused instruction reaches the RS, the two u-ops are separately sent to the respective FUs, either in parallel or serially when they require the same FU (e.g. the case of LOAD and STORE).
STORE operations are normally subdivided into two u-ops: one for the data and one for the address (two separate FUs). The data are sent to the store buffer while the address is calculated; when ready, it is retired by the store buffer. The same applies to LOAD or READ-MODIFY: in this case the two operations are serially executed.
The number of u-ops is reduced on average by 10%. The efficiency increase is 5% for integer operations and 10% for FP operations.
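For the store case, the idea can be sketched like this: one fused entry occupies a single slot through allocation, and the RS splits it at dispatch into the address u-op and the data u-op, each sent to its own FU. The structures and names below are hypothetical:

  # A store travels as ONE fused entry (saving ROB/RS slots) and is
  # split only at dispatch into its address and data u-ops.
  def dispatch(fused_store):
      address_uop = ("store_addr", fused_store["addr_expr"])   # to the AGU
      data_uop    = ("store_data", fused_store["src_reg"])     # to store-data
      return [address_uop, data_uop]   # distinct FUs: may issue in parallel

  entry = {"addr_expr": "EBX+8", "src_reg": "EAX"}   # store [EBX+8] <- EAX
  print(dispatch(entry))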

21 Core Front End
Predecode and fusion stage: it detects the instruction lengths and their boundaries.
The trace cache is no longer present because of its statistically poor performance.
4 decoders (one complex and three simple) – 7 u-ops/clock, one more than P6.
Combining macrofusion and microfusion, an average 10% u-op reduction is achieved: higher use of the FUs, higher parallelism, and a higher number of u-ops among which to choose the OOO sequence.

22 Front End: P6 vs. Core
One more simple decoder: 7 u-ops/cycle.
The simple decoders are able to decode a larger number of instructions: almost one u-op per instruction is achieved.

23 Dispatching architecture

24 Core – 6 ports
One more dispatch port is dedicated to the logical and arithmetical u-ops; increased number of integer units.
Up to 3 u-ops (ports 0, 1 and 2) can be executed per clock (not counting the Branch Execution Unit and the Memory Address Unit – ports 3, 4 and 5 – which don't produce results).
The system is not symmetrical: FP multiplications can be executed only in one FPU, and the same holds for the FADD.
Many u-ops require multiple clocks for execution, but this doesn't block the ports: for instance port 1, once an FADD is started, is free for the IFU.

25 Mathematical EUs
Integer execution units
Three integer FUs, each able to execute a 64-bit u-op per clock: one for complex u-ops (CIU – Complex Integer Unit) and two for simple u-ops (SIU) like additions. All of them operate in parallel with the branch execution unit.
Floating point execution units
Two units able to execute scalar and vector FP u-ops; one unit for simple operations (e.g. FADD).

26 Memory access instructions
Load and Store are much more complex than – for instance – an addition: first because they require access to the RF (for the address computation), and then because they must access the data cache. L1 access is much slower than access to the renamed registers, and there is always the risk of an L2 access.
Loads and Stores – when committed – are moved from the ROB to a FIFO called MOB (Memory Reorder Buffer), which in some cases allows «overtakings» by the Loads.

27 Memory “disambiguation”
u-op commitments (and therefore memory and register updates and reads) must necessarily be executed «in order». But...
Memory aliasing: there are two cases – either the Store uses the same address as the Load (case A) or it doesn't (case B). In case A the Store must precede the Load; in case B it need not. Case A is «memory aliasing».
Statistically, case B occurs 97% of the time (it depends on the compiler too!), but in the P6 and PIV, because of the A cases (3%), no Load can be executed before a Store: a big performance loss.
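The check at the heart of the problem can be sketched as follows: a load may only move ahead of older stores if none of them targets the same address (case B). On the Core this decision is a prediction, since the store addresses may not be known yet, with a pipeline flush on a wrong guess; the exact-address version below is a hypothetical simplification:

  def load_can_overtake(load_addr, older_store_addrs):
      """Case B (no older store to the same address): the load may be
      executed early. Case A (aliasing): the store must complete first."""
      return all(addr != load_addr for addr in older_store_addrs)

  print(load_can_overtake(0x1000, [0x2000, 0x3000]))   # True  (case B)
  print(load_can_overtake(0x1000, [0x1000]))           # False (case A)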

28 Memory “disambiguation”
[Figure: timelines over clock cycles 1-6]
In case A the address is computed at clock 1 and the store is executed at clock 2. Another cycle must elapse for the memory update (clock 3); then the load can be executed, which requires cycles 4 and 5 for the register update. (It must be remembered that a Load in any case «occupies» the memory location, which cannot be used at the same time by a Store.) Eventually, on the 6th clock cycle, the sum can be executed.
If the processor detects that we are in case B, we need not wait for the memory update related to the Store and we can overlap the operations as in the figure: one clock cycle is saved (more than 16%).

29 Memory “disambiguation”
[Figure: timelines over clock cycles -2 to 6]
In the Core we have situation B-2, where the load is anticipated before the store, saving 3 cycles in comparison with A and two cycles in comparison with B. This is possible thanks to an algorithm which analyses the u-ops and predicts memory aliasing. In this case too the prediction can be wrong and the pipeline must then be flushed, but the average advantage is very significant.
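The percentages quoted on these two slides follow directly from the completion times; a quick check of the arithmetic (cycle counts as read off the timelines):

  case_a, case_b, case_b2 = 6, 5, 3   # completion cycles of the three schedules
  print(f"B   vs A: {case_a - case_b} cycle saved,  {(case_a - case_b) / case_a:.0%}")
  print(f"B-2 vs A: {case_a - case_b2} cycles saved, {(case_a - case_b2) / case_a:.0%}")
  print(f"B-2 vs B: {case_b - case_b2} cycles saved")

The one-cycle gain of B over A is 1/6, about 17% (the slide's «more than 16%»); B-2 saves 3 cycles versus A and 2 versus B, as stated.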

