Comparison of Two Processors

1 Comparison of Two Processors
P-3 vs. Athlon

2 Objectives
Similarities
Architectural Differences
Head-to-Head Comparisons
Questions
Road map: start at a high level, then go into detail on the differences.

3 Similarities
Microarchitecture
4 FP Operations per Cycle
Caching, Pre-fetching, Streaming Controls
Multiple Processor Support
Both microarchitectures are three-issue, out-of-order designs with long pipelines. Both can produce four single-precision FP results per clock cycle. The primary difference between 3DNow! and ISSE is that 3DNow! operates on two single-precision values in parallel and has two pipelines, while ISSE operates on four values in parallel but has only one pipeline. Either way, both peak at 4 SIMD FP operations per cycle.
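
The peak-throughput claim above is plain arithmetic: values processed per SIMD operation times the number of pipelines. A minimal sketch, taking the lane and pipeline counts from the notes; the function name is illustrative only.

```python
def peak_fp_ops_per_cycle(lanes_per_op: int, pipelines: int) -> int:
    """Peak single-precision FP results per cycle = lanes x pipelines."""
    return lanes_per_op * pipelines


# Figures from the slide notes: 3DNow! is 2-wide with two pipelines,
# ISSE is 4-wide with one pipeline.
athlon_3dnow = peak_fp_ops_per_cycle(lanes_per_op=2, pipelines=2)
p3_isse = peak_fp_ops_per_cycle(lanes_per_op=4, pipelines=1)

print("Athlon 3DNow! peak:", athlon_3dnow)
print("P-3 ISSE peak:", p3_isse)
```

These are peak figures only; sustained throughput depends on instruction mix and scheduling.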

4 Athlon’s Architecture
ISA Instruction Stream Data Stream

5 ISA
Athlon – 3DNow!
5 new DSP instructions
45 new instructions
Athlon’s 3DNow! implements 19 new MMX/SSE and cache-control instructions that are identical to the P-3’s new media instructions. This move lets the Athlon compete with the P-3 in video encoding and decoding and other multimedia work. These instructions do not, however, bring the Athlon up to SSE’s SIMD FP standard. But given Intel’s half-wide implementation of SIMD FP, the Athlon can still provide stiff competition for the P-3. In addition to these 19, AMD added 5 new DSP instructions to improve the performance of applications such as soft modems, Dolby AC-3, and MP3 audio codecs. Including the other instructions, AMD totals 45 new instructions compared to the P-3’s 71.

6 Instruction Stream Instruction Decode Instruction Control Unit
Execution Units Branch Prediction

7 Instruction Decode
The Athlon takes 16 bytes from the I-cache and can produce 3 micro-ops per cycle using the three x86 decoders in the direct path; a fourth decoder in the vector path handles complex instructions. The decoded instructions are stored in the 72-entry ROB. In comparison, the P-3 has 2 simple decoders and 1 complex decoder, with only a 20-entry reservation station queue and a 40-entry ROB.

8 Instruction Control Unit
Up to 72 instructions can be in flight at any given time, a result of the 72-entry ROB. From here the ICU takes over. The ICU controls instruction issue, register renaming, and out-of-order execution of integer instructions. It has two distinct 24-entry register files. The Future File holds the current state of the processor and is updated as instructions complete execution. The results of execution are stored in the ROB and are retired to the Architectural File in program order. On an exception, the processor can quickly be restored to the correct architectural state using a broadside copy from the Architectural File into the Future File, hence the 24 ports. The advantage of the Future File is that it reduces the lookup time and complexity associated with a classical ROB.
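
The Future File / Architectural File split described above can be sketched as a toy model. This is an illustration under simplifying assumptions, not the Athlon's actual logic: the class and method names are hypothetical, and instructions are assumed to complete in the order `complete()` is called, which the model treats as program order.

```python
class FutureFileModel:
    """Toy model of a future file backed by an architectural file."""

    def __init__(self, num_regs: int = 24):
        self.future = [0] * num_regs         # speculative state, updated at completion
        self.architectural = [0] * num_regs  # committed state, updated at retirement
        self.rob = []                        # (reg, value) results awaiting retirement

    def complete(self, reg: int, value: int) -> None:
        """An instruction finishes executing: update the future file, log in the ROB."""
        self.future[reg] = value
        self.rob.append((reg, value))

    def retire_oldest(self) -> None:
        """Retire the oldest ROB entry into the architectural file."""
        reg, value = self.rob.pop(0)
        self.architectural[reg] = value

    def exception(self) -> None:
        """Drop unretired results and copy every architectural register back at once."""
        self.rob.clear()
        self.future = list(self.architectural)


model = FutureFileModel()
model.complete(1, 10)
model.retire_oldest()   # register 1 = 10 is now committed
model.complete(1, 99)   # speculative result, not yet retired
model.exception()       # recover: the future file is restored from committed state
print(model.future[1], model.architectural[1])
```

Reading operands from the future file needs only a direct register lookup, which is the reduced lookup complexity the notes refer to.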

9 Execution Units
FP Units: Athlon 3, P-3 1
Integer Units: Athlon 3
Address Calculation Units: Athlon 3, P-3 2
Athlon: 10-stage integer pipe and 15-stage FP pipe; all pipes are out-of-order and superscalar, each with 1-cycle throughput, vs. the P-3’s 17-stage integer pipe and 30-stage FP pipe.

10 Branch Prediction
The Athlon has a 4096-entry BHT of 2-bit saturating counters. The BHT is accessed using a gshare technique, in which an 8-bit global history register is hashed with 4 bits of the PC. The Athlon also has a 4K Branch Target Address Cache integrated with the instruction cache. For each 16-byte fetch quantum, the I-cache keeps two branch target addresses and a 2-bit selector that indicates whether the branch is sequential or provided by the return stack. The Athlon has a 12-entry return stack to take advantage of procedure call branches, and achieves an average hit rate of 95%. In comparison, the P-3 has a 512-entry Branch Target Buffer with a 4K BHT and a 4-entry return stack, also with a 95% hit rate.
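
A global-history predictor built from 2-bit saturating counters can be sketched as follows. The notes do not specify the hash, so this sketch assumes the 8 history bits are concatenated with the low 4 PC bits to form a 12-bit index (which yields exactly 4096 entries); the choice of PC bits, the initial counter value, and all names are assumptions for illustration.

```python
class TwoBitPredictor:
    """Global-history branch predictor with 2-bit saturating counters."""

    def __init__(self, history_bits: int = 8, pc_bits: int = 4):
        self.history_bits = history_bits
        self.pc_bits = pc_bits
        self.history = 0  # outcomes of the most recent branches, newest in bit 0
        # Counter values: 0-1 predict not taken, 2-3 predict taken.
        self.table = [1] * (1 << (history_bits + pc_bits))

    def _index(self, pc: int) -> int:
        # Assumed hash: history bits concatenated with the low PC bits.
        pc_part = pc & ((1 << self.pc_bits) - 1)
        return (self.history << self.pc_bits) | pc_part

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


predictor = TwoBitPredictor()
for _ in range(20):                 # train on a branch that is always taken
    predictor.update(0x40, True)
print(len(predictor.table), predictor.predict(0x40))
```

A true gshare design XORs the history with PC bits instead of concatenating them; swapping the body of `_index` is enough to try that variant, though the table size would then follow the wider of the two fields.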

11 Data Stream Cache Bus Architecture

12 Cache
L1 Cache: Athlon 128KB, P-3 32KB
L2 Cache: Athlon 256KB (on chip); P-3 256KB (on chip), 512KB (external)
Cache Controller
Fill Buffers
Write Back Buffers
L1 cache: The Athlon has 4 times as much L1 cache as the P-3 (128KB vs. 32KB), split into a 64KB data cache and a 64KB instruction cache, both 2-way set associative. AMD has a dedicated snoop port to keep system coherency traffic from interfering with application performance; the P-3 has none. The L1 also supports concurrent accesses by two 64-bit loads or stores.
L2 cache: The on-chip L2 cache is 16-way set associative in the Athlon; in comparison, the P-3 has an 8-way set associative cache.
Controller: The P-3’s controller sits as a separate component on the module, while AMD’s is integrated into the die.
The Athlon has a multi-level 512-entry TLB (Translation Look-aside Buffer).
The Athlon has 8-bit ECC (error correction code).
The Athlon supports 64-byte cache line transfers, twice the size of the P-3’s.
The Athlon’s cache architecture is the first to incorporate system-based MOESI (Modified, Owned, Exclusive, Shared, Invalid) for multiprocessing support.
Fill, Bus: Athlon 8 of each vs. P-3 6 and 4.
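
The size and associativity figures above determine how many sets each cache has. A minimal sketch of that arithmetic, assuming the cache line size equals the 64-byte transfer size quoted in the notes (the notes do not state the line size separately); the function name is illustrative.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for power-of-two sizes
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


# Athlon L1 data (or instruction) cache: 64KB, 2-way, assumed 64-byte lines.
l1 = cache_geometry(64 * 1024, ways=2, line_bytes=64)
# Athlon L2 cache: 256KB, 16-way, assumed 64-byte lines.
l2 = cache_geometry(256 * 1024, ways=16, line_bytes=64)

print("L1 (sets, offset bits, index bits):", l1)
print("L2 (sets, offset bits, index bits):", l2)
```

Higher associativity at a fixed size means fewer sets, so the 16-way L2 ends up with fewer index bits than the 2-way L1 despite being four times larger.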

13 Bus Architecture
System Bus
Address Size
Multiprocessor Support
The Athlon uses Digital Equipment Corp.’s EV6 bus protocol, which is designed primarily for bursty traffic. The bus has 13 address channels per direction (26 in total) and 64 data channels. The processor needs the read direction for snooping. The address size is 43 bits for the Athlon and 36 bits for the P-3, so the Athlon can address up to 8 TB (2^43 bytes) while the P-3 can address only 64 GB (2^36 bytes, about 68.7 billion bytes). The Athlon has point-to-point multiprocessor support, while the P-3 uses a shared bus. In the P-3, the processors must share the bandwidth and channels. With the Athlon, each processor has its own path to the chipset and hence the full bandwidth. This complicates board design for multiprocessor systems; to address the problem, AMD is using Digital’s Tsunami dual-processor chipset.
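
The address-space figures follow directly from the address widths. A short sketch of the arithmetic, showing why 36 bits is sometimes quoted as 68 GB (decimal) and sometimes as 64 GB (binary); the function name is illustrative.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits


athlon_bytes = addressable_bytes(43)
p3_bytes = addressable_bytes(36)

print("Athlon:", athlon_bytes // 2**40, "TB (binary)")
print("P-3:   ", p3_bytes // 2**30, "GB (binary),",
      round(p3_bytes / 1e9, 1), "GB (decimal)")
print("Ratio: ", athlon_bytes // p3_bytes)
```

The 7 extra address bits give the Athlon 128 times the addressable range.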

14 Data Stream
Bus Speed: Athlon 200 – 266 MHz vs. P-3 – 166 MHz
Peak Bandwidth: Athlon 1.6 – 2.1 GB/sec vs. P-3 – 1060 MB/sec
Outstanding Transactions: Athlon 24 per processor vs. P-3 4 – 8 per processor
System Clock: Athlon source synchronous vs. P-3 common clock
Multiprocessing: Athlon point-to-point vs. P-3 shared
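
The Athlon peak-bandwidth figures can be checked against the bus speed, assuming peak bandwidth is simply the transfer rate times the width of the 64-bit data bus described in the bus architecture notes. The function name and the decimal GB convention are assumptions of this sketch.

```python
def peak_bandwidth_gb_s(transfer_rate_mhz: float, bus_width_bits: int = 64) -> float:
    """Peak bandwidth in GB/sec (decimal) = transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_width_bits // 8
    return transfer_rate_mhz * 1e6 * bytes_per_transfer / 1e9


athlon_low = peak_bandwidth_gb_s(200)
athlon_high = peak_bandwidth_gb_s(266)

print(f"Athlon at 200 MHz: {athlon_low:.1f} GB/sec")
print(f"Athlon at 266 MHz: {athlon_high:.1f} GB/sec")
```

Both results round to the 1.6 and 2.1 GB/sec quoted on the slide.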

15 Head to Head Integer Performance Floating Point Performance Multimedia
3D Now Implementation Application Level

16 Benchmarked Specs This is the hardware and software used to perform the benchmark.

17 Integer Performance
Athlon leads by 8-17%

18 Floating Point Performance
Athlon leads by 7-11%

19 Multimedia Performance
Athlon outperforms by 5-25%

20 3DNow vs. SSE Implementation
Athlon the winner by 7-41%

21 Application Level Performance
Athlon leads by 4-15%

22 Conclusion
GO ATHLON!!
The Athlon is superior in every sense of the word. It outperformed the P-3 without problems.

