
1 Computer Science and Engineering Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing Barroso, Gharachorloo, McNamara, et al. Proceedings of the 27th Annual ISCA, June 2000 Presented by Wael Kdouh Spring 2006 Professor: Dr. Hisham El Rewini

2 Computer Science and Engineering Motivation
- Economic: high demand for OLTP (on-line transaction processing) machines, yet a disconnect between the industry's ILP focus and this demand.
- OLTP characteristics: high memory latency, little ILP (get, process, store), large TLP.
- OLTP is poorly served by aggressive ILP machines.
- Use "old" cores and an ASIC design methodology to build "glueless," scalable OLTP machines with low development cost and short time to market.
- Short wires, as opposed to costly and slow long wires that can limit cycle time.
- Amdahl's Law: speedup is bounded by the fraction of the work you can actually accelerate (see the sketch below).
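A quick way to see why TLP beats ILP for OLTP is Amdahl's Law. Below is a minimal Python sketch; the workload fractions and per-fraction speedup factors are illustrative assumptions, not figures from the paper.

    # Amdahl's Law: overall speedup when only part of the work is accelerated.
    # The OLTP fractions below are assumed for illustration only.

    def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
        """Overall speedup when 'accelerated_fraction' of the work runs 'factor'x faster."""
        return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

    # Assume an aggressive ILP core speeds up only ~20% of OLTP work by 4x,
    # while an 8-core CMP speeds up ~95% of it (the thread-parallel part) by 8x.
    print(f"ILP core:  {amdahl_speedup(0.20, 4):.2f}x")   # ~1.18x
    print(f"8-way CMP: {amdahl_speedup(0.95, 8):.2f}x")   # ~5.93x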

3 Computer Science and Engineering Other Innovations
- The shared second-level cache uses a sophisticated protocol that does not enforce inclusion of the first-level instruction and data caches, in order to maximize utilization of the on-chip caches.
- The inter-node cache coherence protocol incorporates a number of unique features that result in fewer protocol messages and lower protocol-engine occupancies than other designs.
- A unique I/O architecture: the I/O node is a full-fledged member of the interconnect and of the global shared-memory coherence protocol.

4 Computer Science and Engineering The Piranha Processing Node
- Separate I/D L1 caches (64 KB, 2-way set-associative) for each CPU.
- Logically shared, interleaved L2 cache (1 MB); the sketch below illustrates bank interleaving.
- Eight memory controllers interface to banks of up to 32 Rambus DRAM chips; aggregate maximum bandwidth of 12.8 GB/sec.
- 180 nm process (2000); almost entirely ASIC design: roughly 50% of the clock speed and 200% of the area of a full-custom methodology.
- CPU: a simple Alpha core (cf. ECE152 coursework): single in-order 8-stage pipeline at 500 MHz.
- The Intra-Chip Switch (ICS) is a unidirectional crossbar.
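To make "interleaved L2" concrete, here is a minimal Python sketch of address-interleaved bank selection. The 64-byte line size and low-order-bit bank indexing are assumptions for illustration; the slide only states that the 1 MB L2 is interleaved.

    # Interleaved L2 bank selection: consecutive cache lines map to consecutive
    # banks, spreading traffic. Line size and bit choice are assumed, not taken
    # from the paper.

    LINE_BYTES = 64   # assumed cache-line size
    NUM_BANKS = 8     # one bank per CPU in the 8-CPU Piranha chip

    def l2_bank(phys_addr: int) -> int:
        line = phys_addr // LINE_BYTES   # drop the byte-within-line offset
        return line % NUM_BANKS          # low-order line bits pick the bank

    for addr in range(0, 8 * LINE_BYTES, LINE_BYTES):
        print(hex(addr), "-> bank", l2_bank(addr))  # banks 0..7 in turn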

5 Computer Science and Engineering Communication Assist
- Home Engine (exporting) and Remote Engine (importing) support shared memory across multiple nodes.
- System Control handles system miscellany: interrupts, exceptions, initialization, monitoring, etc.
- Standard OQ (output queue), Router, IQ (input queue), and Switch blocks link multiple Piranha chips.
- Total inter-node I/O bandwidth: 32 GB/sec.
- Each link and block here corresponds to actual wiring and a real module; the processing node has NO INHERENT I/O CAPABILITY.

6 Computer Science and Engineering I/O Organization
- The I/O node is smaller than a processing node.
- Its router has only 2 links, which alleviates the need for a routing table.
- Its memory is globally visible and part of the coherence scheme.
- Its CPU gives optimized placement for drivers, translations, etc. that need low-latency access to I/O.
- A re-used dL1 design provides the interface to the PCI/X bus.
- Supports an arbitrary I/O-to-processor ratio and network topology.
- Glueless scaling up to 1024 nodes of any type supports application-specific customization.

7 Computer Science and Engineering Piranha System

8 Computer Science and Engineering Coherence: Local
- Each L2 bank and its associated controller holds the directory data for intra-chip requests: a centralized directory.
- The on-chip ICS is responsible for all on-chip communication.
- The L2 is "non-inclusive": it acts as a large victim buffer for the L1s and keeps duplicate copies of the L1 tags and state.
- The L2 controller can therefore determine whether data is cached elsewhere, and whether it is held exclusively; the majority of L1 requests then require no communication-assist involvement.
- On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch the line from memory (sketched below).
- While a forward is outstanding, the L2 blocks conflicting requests to the same line.
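The four possible outcomes of an L1 miss listed above can be captured as a small dispatch function. The Python sketch below is illustrative only; the field names and their encoding are assumptions, not the actual duplicate-tag structure in Piranha.

    # L2 controller dispatch on an L1 miss. Field names are illustrative
    # assumptions; Piranha tracks this via duplicate L1 tag/state copies in L2.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LineState:
        in_l2: bool                 # data present in this L2 bank (non-inclusive L2)
        owner_cpu: Optional[int]    # on-chip L1 holding the line, if any
        home_is_remote: bool        # the line's home memory is on another node

    def handle_l1_miss(s: LineState) -> str:
        if s.in_l2:
            return "service directly from L2"
        if s.owner_cpu is not None:
            return f"forward to owner L1 (CPU {s.owner_cpu})"
        if s.home_is_remote:
            return "hand off to the protocol (remote) engine"
        return "fetch from local memory"

    print(handle_l1_miss(LineState(in_l2=False, owner_cpu=3, home_is_remote=False)))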

9 Computer Science and Engineering Coherence: Global
- Trades ECC granularity for "free" directory storage: computing ECC at 4x coarser granularity leaves 44 bits per 64-byte line for directory state (worked through below).
- Invalidation-based distributed directory protocol.
- Some optimizations. No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes. L: requests to the home node, low priority. H: forwarded requests and replies, high priority.
- Also guarantees that forwards are always serviced by their targets: e.g. an owner writing back to home holds the data until the home acknowledges.
- Removes NACK/retry traffic, as well as "ownership change" (DASH), retry counts (Origin), and "No, seriously" (Token Coherence persistent requests).
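The 44-bit figure can be checked with standard SEC-DED Hamming sizing: protecting 64-bit chunks of a 64-byte line costs 8 x 8 = 64 check bits, while protecting 256-bit chunks (4x coarser) costs 2 x 10 = 20, freeing 44 bits for the directory. A small Python check, assuming the usual Hamming SEC-DED bound:

    # SEC-DED check-bit count: smallest r with 2^r >= data_bits + r + 1,
    # plus one extra parity bit for double-error detection.

    def secded_bits(data_bits: int) -> int:
        r = 0
        while 2 ** r < data_bits + r + 1:
            r += 1
        return r + 1

    LINE_BITS = 64 * 8                                    # 64-byte cache line
    fine   = (LINE_BITS // 64)  * secded_bits(64)         # 8 chunks x 8  = 64 bits
    coarse = (LINE_BITS // 256) * secded_bits(256)        # 2 chunks x 10 = 20 bits
    print("freed for directory:", fine - coarse, "bits")  # -> 44 bits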

10 Computer Science and Engineering Evaluation Methodology
- Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D).
- Simulated and compared against an aggressive out-of-order core (Alpha 21364) with integrated coherence and cache hardware.
- Results "fudged" to estimate the effect of a full-custom design.
- Four configurations evaluated: P1 (one-CPU Piranha at 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz out-of-order core), and P8 (the full eight-CPU Piranha chip).

11 Computer Science and Engineering Parameters for different processor designs.

12 Results

13 Computer Science and Engineering Performance Evaluation
- OLTP and DSS workloads: TPC-B/D on an Oracle database; SimOS-Alpha simulation environment.
- Compared: Piranha (P8) at 500 MHz, full-custom Piranha (P8F) at 1.25 GHz, and a next-generation out-of-order microprocessor (OOO) at 1 GHz.
- Single-chip evaluation: OOO outperforms P1 (a single Piranha CPU) by 2.3x; P8 outperforms OOO by about 3x; the speedup of P8 over P1 is about 7x (checked below).
- Multi-chip configurations: four chips (with only 4 CPUs per chip?!). Results show that Piranha scales better than OOO.
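As a sanity check, the quoted single-chip speedups compose as expected:

    # Speedups quoted on the slide:
    ooo_over_p1 = 2.3          # OOO vs. one Piranha CPU (P1)
    p8_over_ooo = 3.0          # eight-CPU Piranha (P8) vs. OOO
    print(f"{ooo_over_p1 * p8_over_ooo:.1f}x")  # ~6.9x, matching the quoted ~7x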

14 Computer Science and Engineering Questions/Discussion
- Is the evaluation methodology sound?
- Would the Piranha design still be worthwhile against a well-designed SMT processor (with 4 or 8 threads)?
- Is reliability better or worse with multiple CPUs per chip?
- What about power consumption?

15 Conclusion
The authors maintain that:
1) The use of chip multiprocessing is inevitable in future microprocessor designs.
2) As more transistors become available, further increasing on-chip cache sizes or building more complex cores will only lead to diminishing performance gains and possibly longer design cycles.
Given the enormous emphasis that Intel engineers are placing on massive L2 caches, they appear to disagree. Given the huge investment that both Intel and Compaq/HP have put into the Itanium family, and the fact that Alpha is a moribund architecture, it is unlikely that the innovative Piranha microprocessor will ever see the light of day.

16 Computer Science and Engineering The Future
No more penguins to eat...
Harvey G. Cragon argues in his paper "Forty Five Years of Computer Architecture—All That's Old is New Again" that most performance-improvement advances in computer micro-architecture have been based on the exploitation of only two ideas: locality and pipelining. In my personal opinion, the upcoming years are going to exploit two ideas: SMT and CMP.

17 Computer Science and Engineering Questions

