
1 ECE8833 Polymorphous and Many-Core Computer Architecture Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Lecture 4 Billion-Transistor Architecture 97 (Part II)

2 ECE8833 H.-H. S. Lee 2009 2 Practitioners’ Groups Everyone has an acronym! IRAM –Implementation at Berkeley CMP –Led to Sun Niagara and the multicore (r)evolution SMT –Intel HyperThreading (arguably Intel first envisioned the idea), IBM Power5, Alpha 21464 –Many credit this technology to UCSB’s multistreaming work in the early 1990s RAW –Led to the Tilera TILE64

3 ECE8833 H.-H. S. Lee 2009 3 C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick

4 ECE8833 H.-H. S. Lee 2009 4 Mission Statement

5 ECE8833 H.-H. S. Lee 2009 5 Future Roadblocks that Inspired IRAM Latency issues –Continually increasing performance gap between processor and memory –DRAM optimized for density, not speed Bandwidth issues –Off-chip bus: slow and narrow, high capacitance, high energy –Especially hurts scientific codes, databases, etc.

6 ECE8833 H.-H. S. Lee 2009 6 IRAM Approach Move DRAM closer to processor –Enlarge on-chip bandwidth Fewer I/O pins –Smaller package –Serial interface → Anything look familiar?

7 ECE8833 H.-H. S. Lee 2009 7 IRAM Chip Design Research How much larger and slower is a processor designed in a straight DRAM process vs. a standard logic process? –Microprocessor fabs offer fast transistors for fast logic and many metal layers for accelerating communication and simplifying power distribution –DRAM fabs offer many poly layers to give small DRAM cells and low leakage for a low refresh rate Speed of page buffer vs. registers and cache New DRAM interface based on fast serial links (2.5 Gbit/s, or about 300 MB/s per pin) Quantify bandwidth vs. area/power tradeoff Area overhead for IRAM vs. a DRAM Extra power dissipation for IRAM vs. a DRAM Performance of IRAM with same area and power as DRAM (“processor for free”) Source: David Patterson’s slide in his IRAM Overview talk

8 ECE8833 H.-H. S. Lee 2009 8 IRAM Architecture Research How much slower can a processor with a high-bandwidth memory be and yet be as fast as a conventional computer? (very interesting point) Compare memory management schemes (e.g., vector registers, scratch pad, wide TLB/cache) Compare schemes for running large programs, i.e., spanning multiple IRAMs Quantify value of compact programs and data (e.g., compact code, on-the-fly compression) Quantify pros and cons of a standard instruction set vs. a custom IRAM instruction set Source: David Patterson’s slide in his IRAM Overview talk

9 ECE8833 H.-H. S. Lee 2009 9 IRAM Compiler Research Explicit SW control of memory management vs. conventional implicit HW designs –Protection (software fault isolation) –Paging (dynamic relocation, overlapped I/O accesses) –“Cache” control (vector register, scratch pad) –I/O interrupt/polling Evaluate benchmark performance in conjunction with architectural research –Number crunching (vector vs. superscalar) –Memory intensive (database, operating system) –Real-time benchmarks (stability and performance) –Pointer intensive (GCC compiler) Impact of language on IRAM (Fortran 77 vs. HPF, C/C++ vs. Java) Source: David Patterson’s slide in his IRAM Overview talk

10 ECE8833 H.-H. S. Lee 2009 10 Potential IRAM Architecture “New Model”: VSIW = Very Short Instruction Word! –Compact: describe N operations with 1 short (vector) instruction –Predictable: (real-time) perf. vs. statistical perf. (cache) –Multimedia ready: choose Nx64b, 2Nx32b, 4Nx16b –Easy to get high performance; the N operations: Are independent Use the same functional unit Access disjoint registers Access registers in the same order as previous instructions Access contiguous memory words or a known pattern Hide memory latency (and any other latency) –Compiler technology already developed. Source: David Patterson’s slide in his IRAM talk
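To make those properties concrete, here is a loop that fits the list exactly (an illustrative C sketch, not from the slides; "saxpy" is just the usual BLAS-style example): the n multiply-adds are independent, use the same functional unit, and touch contiguous memory, so a single vector instruction can describe all of them.

#include <stddef.h>

/* A loop with the properties a VSIW/vector ISA exploits. A
 * vectorizing compiler can express the whole body as a handful
 * of vector instructions (load, multiply-add, store) instead of
 * n separate scalar instruction sequences. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* one vector op describes all n */
}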

11 ECE8833 H.-H. S. Lee 2009 11 Berkeley Vector-Intelligent RAM Why vector processing? Scalable design Higher code density Runs at a higher clock rate Better energy efficiency due to easier clock gating for vector / scalar units Lower die temperature to keep a good data retention rate On-chip DRAM is sufficient for embedded applications Use external off-chip DRAM as secondary memory –Pages swapped between on-chip and off-chip DRAMs

12 ECE8833 H.-H. S. Lee 2009 12 VIRAM-1 Floorplan 180 nm CMOS, 6-layer copper 125 million transistors, 325 mm² 2 watts @ 200 MHz 13 MB of embedded DRAM (IBM eDRAM macros, 13 Mbit each) and 4 vector units (8 KB of vector registers total, custom layout, ¼ of the VRF per unit) VRF = 32x64b, 64x32b, or 128x16b Scalar core: 64-bit MIPS M5Kc [Gebis et al. DAC student contest 04]

13 ECE8833 H.-H. S. Lee 2009 13 S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, D. M. Tullsen

14 ECE8833 H.-H. S. Lee 2009 14 SMT Concept vs. Other Alternatives (Figure: execution slots across FU1–FU4 over time for a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (or Intel’s HT), and a chip multiprocessor (CMP); threads 1–5 fill slots, unused slots are wasted execution time.) The early SMT idea was developed at UCSB (Mario Nemirovsky’s group, HICSS’94) The name SMT was christened by the group at the University of Washington (ISCA’95)

15 ECE8833 H.-H. S. Lee 2009 15 Exploiting Choice: SMT Inst Fetch Policies FIFO, Round Robin: simple but may be too naive RR.X.Y –Fetch from X threads, Y instructions each –RR.1.8 –RR.2.4 or RR.4.2 –RR.2.8 What are the main design and/or performance issues when X > 1? [Tullsen et al. ISCA96]

16 ECE8833 H.-H. S. Lee 2009 16 Exploiting Choice: SMT Inst Fetch Policies Adaptive fetch policies –BRCOUNT (reduce wrong-path issuing) Count # of branch insts in decode/rename/IQ stages Give top priority to the thread with the least BRCOUNT –MISSCOUNT (reduce IQ clog) Count # of outstanding D-cache misses Give top priority to the thread with the least MISSCOUNT –ICOUNT (reduce IQ clog) Count # of insts in decode/rename/IQ stages Give top priority to the thread with the least ICOUNT –IQPOSN (reduce IQ clog) Give lowest priority to threads with insts closest to the head of the INT or FP instruction queues –Because threads with the oldest instructions are most prone to IQ clog No counter needed [Tullsen et al. ISCA96]
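A minimal sketch of the ICOUNT selection step in C (illustrative only; the function and array names are ours, not the paper’s simulator): each cycle, fetch goes to the active thread with the fewest instructions in the decode/rename/IQ stages, which keeps fast-moving threads fed and stops a stalled thread from clogging the issue queue.

#define NTHREADS 4

/* icount[t]: # of thread t's instructions in decode/rename/IQ.
 * active[t]: nonzero if thread t can fetch this cycle. */
int pick_fetch_thread(const int icount[NTHREADS],
                      const int active[NTHREADS])
{
    int best = -1;
    for (int t = 0; t < NTHREADS; t++) {
        if (!active[t])
            continue;
        if (best < 0 || icount[t] < icount[best])
            best = t;   /* lowest ICOUNT wins top priority */
    }
    return best;        /* -1 if no thread is ready */
}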

17 ECE8833 H.-H. S. Lee 2009 17 Exploiting Choice: SMT Inst Fetch Policies [Tullsen et al. ISCA96]

18 ECE8833 H.-H. S. Lee 2009 18 Alpha 21464 (EV8) Leading-edge process technology –1.2 to 2.0 GHz –0.125 µm CMOS –SOI-compatible –Cu interconnect, 7 metal layers –Low-k dielectrics Chip characteristics –1.2 V Vdd, 250 W (EV6: 72 W and EV7: 125 W) –250 million transistors, 350 mm² –1100 signal pins in flip-chip packaging Slide Source: Dr. Joel Emer

19 ECE8833 H.-H. S. Lee 2009 19 EV8 Architecture Overview Enhanced OoO execution 8-wide issue superscalar processor Large on-die L2 (1.75MB) 8 DRDRAM channels On-chip router for system interconnect Directory-based ccNUMA for up to 512-way SMP 4-way SMT Slide Source: Dr. Joel Emer

20 ECE8833 H.-H. S. Lee 2009 20 SMT Pipeline Replicated per thread –PCs –Register maps Shared resources –RF –Instruction queue –First- and second-level caches –Translation buffers –Branch predictor (Pipeline figure: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with the PC, register map, registers, Icache, and Dcache annotated.) Slide Source: Dr. Joel Emer

21 ECE8833 H.-H. S. Lee 2009 21 Intel HyperThreading Intel Xeon Processor, Xeon MP Processor, and ATOM Enable Simultaneous Multi-Threading (SMT) –Exploit ILP through TLP (Thread-Level Parallelism) –Issue and execute instructions from multiple threads in the same cycle Appears to be 2 logical processors Share the same execution resources Duplicate architectural state and certain microarchitectural state –IPs, iTLB, streaming buffer –Architectural register file –Return stack buffer –Branch history buffer –Register Alias Table
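The thread-level parallelism HT exploits is just ordinary OS-visible threads: the OS schedules two of them onto the two logical processors and the core issues from both instruction streams at once. A minimal, self-contained pthreads sketch of such a workload (illustrative only, not Intel sample code):

#include <pthread.h>
#include <stdio.h>

/* Two independent software threads; on an HT core the hardware
 * can issue from both streams while they share one set of
 * execution resources. */
static void *worker(void *arg)
{
    long id = (long)arg, sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i ^ id;              /* independent per-thread work */
    printf("thread %ld: %ld\n", id, sum);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}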

22 ECE8833 H.-H. S. Lee 2009 22 Sharing Resources in Intel HT P4’s trace cache (TC, or µROM) is accessed in alternate cycles by each logical processor unless one is stalled on a TC miss TLB shared (entries tagged with logical processor ID) but partitioned –x86 does not employ ASIDs –Hard partitioning appears to be the only option to allow HT Halved structures (one half per logical processor): µop queue (after fetch from the TC), ROB (126/2 in P4), LB (48/2 in P4), SB (24/2 or 32/2 in P4), general µop queue and memory µop queue Retirement: alternating between the 2 logical processors

23 ECE8833 H.-H. S. Lee 2009 23 HT in Intel ATOM Source: Microprocessor Report and Intel First in-order processor with HT HT claimed to add 8% to silicon area Claimed 30% performance increase at 15% power increase Shared cache space is competed for between threads No dedicated multiplier – uses the SIMD multiplier No dedicated integer divider – uses the FP divider (Die annotations: 25 mm² @ 45 nm; 32 KB I-cache, 24 KB D-cache, 512 KB L2)

24 ECE8833 H.-H. S. Lee 2009 24 L. Hammond, B. A. Nayfeh, K. Olukotun

25 ECE8833 H.-H. S. Lee 2009 25 Main Argument A single thread of control has limited parallelism (ILP is dead) The cost of extracting it is prohibitive due to complexity Achieve parallelization with SW, not HW –Inherently parallel multimedia applications –Widespread multi-tasking OSes –Emerging parallel compilers (ref. SUIF), mainly for loop-level parallelism (see the sketch below) Why not SMT? –Interconnect delay issue –Partitioning is less localized than in a CMP Bottom line: use relatively simple single-thread processors –Exploit only a “modest” amount of ILP –Execute multiple threads in parallel
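As a concrete picture of the loop-level parallelism such compilers target, here is an illustrative C/OpenMP sketch (ours, not from the paper): the iterations are independent, so they can be split across the simple cores of a CMP.

/* Independent loop iterations divided across CMP cores; the
 * pragma stands in for what a SUIF-style parallelizing compiler
 * would derive automatically. */
void scale(int n, double a, double *x)
{
    #pragma omp parallel for   /* each core takes a chunk of i */
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];
}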

26 ECE8833 H.-H. S. Lee 2009 26 Architectural Comparison

27 ECE8833 H.-H. S. Lee 2009 27 Single Chip Multiprocessor

28 ECE8833 H.-H. S. Lee 2009 28 Commercial CMP (AMD Phenom II Quad-Core) AMD K10 (Barcelona), code name “Deneb” 45 nm process 4 cores, private 512 KB L2 per core Shared 6 MB L3 (2 MB in Phenom) Integrated northbridge –Up to 4 DIMMs Sideband Stack Optimizer (SSO) –Parallelizes many POPs and PUSHes (which were dependent on each other) by converting them into pure load/store instructions –No µops in the FUs for stack-pointer adjustment

29 ECE8833 H.-H. S. Lee 2009 29 Intel Core i7 (Nehalem) 4 cores, HT support in each core 8 MB shared L3 3 DDR3 channels 25.6 GB/s memory BW Turbo Boost Technology –New P-state (Performance) –DVFS when workloads operate under the max power limit –Same frequency for all cores

30 ECE8833 H.-H. S. Lee 2009 30 UltraSPARC T1 Up to eight cores, each 4-way threaded Fine-grained multithreading –Thread-selection logic takes out threads that encounter long-latency events –Round-robin, cycle-by-cycle –4 threads in a group share a processing pipeline (SPARC pipe) 1.2 GHz (90 nm) In-order, 8 instructions per cycle (single issue from each core) 1 shared FPU Caches –16 KB 4-way 32 B L1-I –8 KB 4-way 16 B L1-D –Blocking caches (reason for MT) –4-banked 12-way 3 MB L2 + 4 memory controllers (shared by all) –Data moves between the L2 and the cores over an integrated crossbar switch for high throughput (200 GB/s)

31 ECE8833 H.-H. S. Lee 2009 31 UltraSPARC T1 Thread-select logic marks a thread inactive based on –Instruction type (a predecode bit in the I-cache indicates a long-latency instruction) –Misses –Traps –Resource conflicts
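An illustrative C model of that selection step (ours, not Sun’s RTL): round-robin, cycle by cycle, skipping any thread currently marked inactive by a long-latency event.

#define NTHREADS 4

/* last:  thread issued most recently.
 * ready: 0 if the thread is marked inactive (miss, trap,
 *        long-latency instruction, resource conflict). */
int select_thread(int last, const int ready[NTHREADS])
{
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;  /* least-recently issued first */
        if (ready[t])
            return t;
    }
    return -1;  /* all four threads stalled this cycle */
}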

32 ECE8833 H.-H. S. Lee 2009 32 UltraSPARC T2 A fatter version of T1 1.4 GHz (65 nm) 8 threads per core, 8 cores on-die 1 FPU per core (1 FPU per die in T1), 16 INT EUs (8 in T1) L2 increased to 8-banked 16-way 4 MB shared 8-stage integer pipeline (as opposed to 6 for T1) 16 instructions per cycle One PCI Express port (x8 1.0) Two 10 Gigabit Ethernet ports with packet classification and filtering Eight encryption engines Four dual-channel FBDIMM memory controllers 711 signal I/Os, 1831 total Subsequent T2 Plus supports 2 sockets: 16 cores / 128 threads

33 ECE8833 H.-H. S. Lee 2009 33 Sun ROCK Processor 16 cores, two threads per core Hardware scout threading (runahead) –Invisible to SW –A long-latency instruction automatically starts the HW scout: L1 D$ miss Micro-DTLB miss Divide –Warms up the branch predictor –Prefetches memory Execute Ahead (EXE) –Retire independent instructions while scouting Simultaneous Speculative Threading (SST) [ISCA’09] –Two hardware threads for one program –Runahead speculatively executes under a cache miss –OoO retirement HTM support
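A toy, self-contained C model of the scout/runahead control flow (entirely illustrative; ROCK does this in hardware, and every name and constant below is invented for the sketch): checkpoint on a long-latency miss, keep executing to warm the caches and predictors, then roll back and replay once the miss returns.

#include <stdbool.h>

enum mode { NORMAL, SCOUT };

struct core {
    int pc;            /* toy program counter                     */
    int checkpoint_pc; /* state saved when scouting begins        */
    int miss_timer;    /* cycles until the outstanding miss fills */
    int filled_pc;     /* last miss already serviced (now a hit)  */
    enum mode mode;
};

/* Pretend every 100th instruction misses to memory for 50 cycles. */
static bool misses(const struct core *c)
{
    return c->pc > 0 && c->pc % 100 == 0 && c->pc != c->filled_pc;
}

void step(struct core *c)   /* one cycle of the toy core */
{
    if (c->mode == NORMAL && misses(c)) {
        c->checkpoint_pc = c->pc;        /* checkpoint arch. state   */
        c->miss_timer    = 50;
        c->mode          = SCOUT;        /* run ahead under the miss */
    } else if (c->mode == SCOUT && --c->miss_timer == 0) {
        c->filled_pc = c->checkpoint_pc;
        c->pc        = c->checkpoint_pc; /* discard scout results... */
        c->mode      = NORMAL;           /* ...prefetched lines stay */
        return;
    }
    c->pc++;   /* "execute" the next instruction */
}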

34 ECE8833 H.-H. S. Lee 2009 34 Many-Core Processors: Intel Teraflops (Polaris) Per tile: 2 KB data memory, 3 KB instruction memory, 2 FMACs No coherence support Next-gen will have 3D-integrated memory –SRAM first –DRAM in the future

35 ECE8833 H.-H. S. Lee 2009 35 E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal

36 ECE8833 H.-H. S. Lee 2009 36 MIT RAW Design Tenets Long wires across the chip will be the constraint Expose the architecture to software (parallelizing compilers) –Explicit parallelization –Pins –Communication Use a tile-based architecture –Similar designs sponsored by the DARPA PCA program: UT TRIPS, Stanford Smart Memories Simple point-to-point static routing network –One cycle across each tile –More scalable than a bus –Harnessed by the compiler with a precise count of wire hops –A dynamic router supports memory accesses that cannot be analyzed statically

37 ECE8833 H.-H. S. Lee 2009 37 Application Mapping on RAW [Taylor IEEE MICRO’02] (Figure: different workloads occupy different regions of the tile array at once: four-way parallelized scalar code, a two-way threaded Java program, an httpd server, a video data stream driving a frame buffer and screen through a custom datapath pipeline built by the compiler, and idle tiles in sleep mode for power saving. Fast inter-tile ALU forwarding: 3 cycles.)

38 ECE8833 H.-H. S. Lee 2009 38 Scalar Operand Network Design [Taylor et al. HPCA’03] (Figure panels: a non-pipelined scalar operand network; pipelined with a bypass link; pipelined with a bypass link and multiple ALUs. Takeaway: lots of live values in the SON.)

39 ECE8833 H.-H. S. Lee 2009 39 Communication Scalability Issue RB (# of result buses) × WS (window size) compares are made per cycle Long, dense wires elongate cycle time –Pipeline the wire The cost of processing incoming information is high A similar problem exists in bus-based snoopy cache protocols: routing area, large MUXes, complex compare logic
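To make the RB × WS cost concrete (our arithmetic, not a figure from the slide): a machine broadcasting on 8 result buses into a 32-entry window performs 8 × 32 = 256 tag comparisons every cycle, and doubling both the width and the window quadruples that to 16 × 64 = 1024, while the broadcast wires must still span the whole window.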

40 ECE8833 H.-H. S. Lee 2009 40 Scalar Operand Network [Taylor et al. HPCA’03] (Figure: a conventional scalar operand network built around a central RegFile and bypass bus, vs. a multiscalar operand network for a distributed ILP machine, where the RegFile is distributed and operands travel over a 2-D point-to-point interconnect with a switch per tile, e.g., Raw or TRIPS.)

41 ECE8833 H.-H. S. Lee 2009 41 Mapping Operations to a Tile-based Architecture Done at –Compile time (RAW) –Or runtime “Point-to-point” 2D mesh Tradeoffs –Computation vs. communication –Compute affinity (data flows through fewer hops) How to maintain control flow (Figure: the dataflow graph of the fragment below, with nodes ld a, ld b, st b, *, >>, +, and a RegFile, placed onto tiles.) i = a[j]; q = b[i]; r = q+j; s = q >> 3; t = r * s; b[j] = l; b[t] = t;
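A hypothetical greedy placer for the computation-vs-communication tradeoff above (illustrative; this is not RAW’s actual placement algorithm): put each operation on the tile that minimizes total Manhattan distance to the tiles producing its operands.

#include <stdlib.h>

#define GRID 4   /* a 4x4 tile array, tiles numbered 0..15 */

static int hops(int a, int b)   /* Manhattan distance on the mesh */
{
    return abs(a / GRID - b / GRID) + abs(a % GRID - b % GRID);
}

/* src1/src2: tiles holding the operands (-1 if the op has none). */
int place_op(int src1, int src2)
{
    int best = 0, best_cost = 1 << 30;
    for (int t = 0; t < GRID * GRID; t++) {
        int cost = 0;
        if (src1 >= 0) cost += hops(t, src1);
        if (src2 >= 0) cost += hops(t, src2);
        if (cost < best_cost) { best_cost = cost; best = t; }
    }
    return best;   /* tile with the fewest operand-routing hops */
}

A real placer must also balance load across tiles (pure compute affinity would pull everything onto one tile), which is exactly the tradeoff the slide names.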

42 ECE8833 H.-H. S. Lee 2009 42 RAW Core-to-Core Communication Static router –Wires place-and-routed by software –P2P scalar transport –Compilers (or assembly writers) handle predictable communication Dynamic router –Transports dynamic, unpredictable operations: Interrupts Cache misses –Communication that is unpredictable at compile time

43 ECE8833 H.-H. S. Lee 2009 43 Architectural Comparison Raw replaces the bus of a superscalar with a switched network The switched network is tightly integrated into the processor’s pipeline to support single-cycle message injection and receive operations Raw software (the compiler) has to implement functions such as instruction scheduling, dependency checking, etc. Raw yields complexity to software so that more hardware can be used for ALUs and memory (Figure compares Raw, superscalar, and multiprocessor organizations.)

44 ECE8833 H.-H. S. Lee 2009 44 RAW’s Four On-Chip Mesh Networks Compute pipeline Registered at input → longest wire = length of one tile 8 32-bit channels [Slide Source: Michael B. Taylor]

45 ECE8833 H.-H. S. Lee 2009 45 Raw Architecture [Slide Source: Volker Strumpen]

46 ECE8833 H.-H. S. Lee 2009 46 Raw Compute Processor Pipeline [Taylor IEEE MICRO’02] Fast ALU-to-network path (4 cycles) R24-R27 map to the 4 on-chip physical networks 0-cycle local bypass

47 ECE8833 H.-H. S. Lee 2009 47 RAW Processor Tile Each tile contains Tile processor –32-bit MIPS, 8-stage in-order, single issue –32 KB instruction memory –32 KB data cache (not coherent, user managed) Switch processor –8K-instruction memory –Executes basic move and branch instructions –Transfers data between the local tile and neighbor switches (see the sketch below) Dynamic router –Hardware controlled (not directly under the programmer’s control)
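An illustrative C model of one switch-processor cycle (the port and field names are invented for the sketch): each switch instruction is essentially a set of moves that the routing crossbar performs in parallel, copying words between the local processor port and the neighbor ports.

enum port { LOCAL, NORTH, EAST, SOUTH, WEST, NPORTS };

struct sw {
    unsigned in[NPORTS];   /* registered input latches          */
    unsigned out[NPORTS];  /* outputs to the proc and neighbors */
};

struct route { enum port src, dst; };

/* Execute the moves named by one switch instruction. */
void switch_step(struct sw *s, const struct route *r, int nroutes)
{
    for (int i = 0; i < nroutes; i++)
        s->out[r[i].dst] = s->in[r[i].src];  /* crossbar copy */
}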

48 ECE8833 H.-H. S. Lee 2009 48 Raw Programming Compute the sum c=a+b across four tiles:

49 ECE8833 H.-H. S. Lee 2009 49 Data Path: Zoom 1 Stateful hardware: local data memory (a,c), register (b) and both static networks (snet1 and 2)

50 ECE8833 H.-H. S. Lee 2009 50 Zoom 2: Processor Datapaths

51 ECE8833 H.-H. S. Lee 2009 51 Zoom 2: Switch Datapaths (plus tile processor)

52 ECE8833 H.-H. S. Lee 2009 52 Raw Assembly

53 ECE8833 H.-H. S. Lee 2009 53 RAW On-Chip Network 2D mesh –Longest wire is no greater than one side of a tile –Worst case: 6 hops (or cycles) across 16 tiles 2 static routers, “point-to-point,” each has –A 64 KB SW-managed instruction cache –A pair of routing crossbars –Example (Tile 0 sends a word to Tile 1):
Tile 0 (sender):
  or $csto, $0, $5
  nop route $csto->$cEo2   #SWITCH0
Tile 1 (receiver):
  nop route $cWi2->$csti2  #SWITCH1
  and $5, $5, $csti2
2 dynamic routers –Dimension-ordered routing by hardware –Example (Tile 0 sends a two-word message to Tile 15):
Tile 0 (sender):
  lui $3, $0, 15
  ihdr $cgno, $3, 0x0200   #header, msg len=2
  or $cgno, $0, $9         #send word 1
  ld $cgno, $0, $csti      #send word 2
Tile 15 (receiver):
  or $2, $cgni, $0         #word 1
  or $3, $cgni, $0         #word 2

54 ECE8833 H.-H. S. Lee 2009 54 Control Orchestration Optimization Orchestrated by the Raw compiler Control localization –Hide a control-flow sequence within a “macro-instruction” (macroins) assigned to one tile, so it appears as one instruction to the rest of the schedule (see the sketch below) [Lee et al. ASPLOS’98]
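A small illustrative C analogy for control localization (ours, not compiler output): the branch lives entirely on the tile that owns it, which exports a single value, so the other tiles and the static network schedule see straight-line code.

/* The whole if/else is confined to one tile's "macroins"... */
int macroins(int p, int a, int b)
{
    return p ? (a << 1) : (b - 1);
}
/* ...consumers on other tiles just receive macroins()'s result. */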

55 ECE8833 H.-H. S. Lee 2009 55 Example of RAW Compiler Transformation [Lee et al. ASPLOS’98] Source: y = a+b; z = a*a; a = y*a*5; y = y*b*6; Initial code transformation produces: read(a) read(b) y_1 = a+b z_1 = a*a tmp_1 = y_1*a a_1 = tmp_1*5 tmp_2 = y_1*b y_2 = tmp_2*6 write(z) write(a) write(y) Compiler phases: Instruction Partitioner → Global Data Partitioner → Data & Inst Placer → Communication Code Gen → Event Scheduler

56 ECE8833 H.-H. S. Lee 2009 56 Example of RAW Compiler Transformation [Lee et al. ASPLOS’98] Instruction Partitioner splits the code into two streams: Stream 1: read(a) z_1 = a*a write(z) tmp_1 = y_1*a a_1 = tmp_1*5 write(a) Stream 2: read(b) y_1 = a+b tmp_2 = y_1*b y_2 = tmp_2*6 write(y) Global Data Partitioner groups the data: {a,z} and {b,y} Data & Inst Placer assigns stream 1 with {a,z} to P0 and stream 2 with {b,y} to P1

57 ECE8833 H.-H. S. Lee 2009 57 Example of RAW Compiler Transformation [Lee et al. ASPLOS’98] Communication Code Gen inserts sends, receives, and switch routes: P0: send(a), later y_1 = rcv() P1: a = rcv(), later send(y_1) Switch S0: route(P0,S1), route(S1,P0) Switch S1: route(S0,P1), route(P1,S0)

58 ECE8833 H.-H. S. Lee 2009 58 Example of RAW Compiler Transformation [Lee et al. ASPLOS’98] Event Scheduler emits the final per-tile schedules: P0: read(a) z_1 = a*a write(z) send(a) y_1 = rcv() tmp_1 = y_1*a a_1 = tmp_1*5 write(a) S0: route(P0,S1) route(S1,P0) S1: route(S0,P1) route(P1,S0) P1: read(b) a = rcv() y_1 = a+b send(y_1) tmp_2 = y_1*b y_2 = tmp_2*6 write(y)

59 ECE8833 H.-H. S. Lee 2009 59 Raw Compiler Example [Slide Source: Michael B. Taylor] Source: tmp3 = (seed*6+2)/3 v2 = (tmp1 - tmp3)*5 v1 = (tmp1 + tmp2)*3 v0 = tmp0 - v1 … Expanded instruction stream (shown twice on the slide: once unplaced, once assigned to tiles): pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.0*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 Assign instructions to the tiles, maximizing locality. Generate the static router instructions to transfer operands & streams between tiles.

60 ECE8833 H.-H. S. Lee 2009 60 Scalability Just stamp out more tiles! (Figure: a 16-tile array at 180 nm becomes a 64-tile array at 90 nm; a signal still crosses one tile in 1 cycle.) Longest wire, frequency, design and verification complexity are all independent of issue width. The architecture is backwards compatible. [Slide Source: Michael B. Taylor]

