Yuchun Ma Joint Work with Jason Cong, Yongxiang Liu, Glenn Reinman, and Yan Zhang International Center for Design on Nanotechnologies Workshop.

2 Outline
- Micro-architecture Design
- 3-D IC Technology
- 3D Architecture Exploration with 2D blocks
- 3D Architecture Design with cubic folded blocks
  - 3D cubic packing algorithm
  - 3D architecture exploration with folded blocks
- Pipelining Optimization with Throughput-Aware Floorplanning
- Summary and Future Work

4 Superscalar Processors
- Superscalar processing is the ability of a microprocessor to issue multiple instructions into multiple pipelines in the same cycle, so that instructions that do not depend on each other can execute in parallel.

5 Alpha 21264

6 Performance of a microprocessor
- Performance is measured as the time taken to complete a given task, which depends on:
  - Operating systems
  - Compiler optimizations
  - Workload used for studying the performance
  - Microprocessor organization
- Typically, processor performance is reported in MIPS or BIPS (millions or billions of instructions per second).
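BIPS combines the two quantities that a design trades against each other: instructions per cycle (IPC) and clock frequency. The following is a minimal sketch, assuming the standard relation BIPS = IPC x frequency in GHz. The function name and the sample values are illustrative and are not taken from the slides.

```python
def bips(ipc: float, freq_ghz: float) -> float:
    """Billions of instructions per second for a given IPC and clock in GHz."""
    return ipc * freq_ghz

# Illustrative values only: a faster clock does not help if IPC drops too far.
slow_clock = bips(ipc=1.0, freq_ghz=3.0)
fast_clock = bips(ipc=0.7, freq_ghz=4.0)
print(slow_clock, fast_clock)
```

With these made-up numbers the 3 GHz design delivers more BIPS than the 4 GHz one, which is why neither IPC nor frequency alone is a sufficient metric.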

8 Motivations of 3-D ICs
- Alternative ways for device integration as we approach the limit of CMOS scaling
- Interconnect length/delay reduction
  - System performance improvement [Black04]
  - Power reduction [Black04]
- Integration of heterogeneous technologies
- No existing flow to evaluate 3D implementations of architectures systematically
  - Performance
  - Thermal [Black04]

9 Technology background
- Wafer-bonding 3D IC technologies
  - With flipping the top layer
  - Without flipping the top layer
[Figure: A 3D IC example with two device layers. (a) With flipping the top layer; (b) without flipping the top layer.]

10 Thermal Resistive Network [Wilkerson04]
- Circuit stack partitioned into tiles
- Tiles connected through thermal resistances
  - Lateral resistances: fixed
  - Vertical resistances ∝ 1/#vias
- Heat sources modeled as current sources
  - Current value = power
- Heat sinks modeled as ground nodes
- Thermal vias:
  - After floorplanning, the temperature can be reduced further by thermal via insertion.
[Figure: (a) Tiles stack array; (b) single tile stack with power sources P1 to P5, resistances R1 to R5, and a lateral resistance R_lateral.]
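The electrical analogy can be sketched for a single tile stack: power plays the role of current, temperature the role of voltage, and the heat sink is ground. The code below is a simplified illustration that ignores lateral resistances. The function names, the per-via resistance, and the sample numbers are assumptions made for the example, not values from the slides.

```python
def via_resistance(r_single_via: float, n_vias: int) -> float:
    """Vertical resistance of n parallel vias (proportional to 1 / number of vias)."""
    return r_single_via / n_vias

def stack_temperatures(powers, resistances, sink_temp=0.0):
    """Temperature of each tile in one vertical stack.

    powers[i]      -- heat generated in tile i (W); tile 0 is nearest the heat sink
    resistances[i] -- thermal resistance (K/W) between tile i and the node below it
                      (the heat sink for tile 0)
    Lateral resistances are ignored in this sketch.
    """
    temps = []
    below = sink_temp
    for i, r in enumerate(resistances):
        heat_through = sum(powers[i:])  # all heat from tile i and above crosses r
        below = below + r * heat_through
        temps.append(below)
    return temps

powers = [2.0, 1.0]  # hypothetical tile powers in watts
few_vias = stack_temperatures(powers, [1.0, via_resistance(8.0, 2)])
many_vias = stack_temperatures(powers, [1.0, via_resistance(8.0, 8)])
print(few_vias, many_vias)
```

In this example, raising the via count from 2 to 8 lowers the temperature rise of the upper tile from 7.0 to 4.0 above the sink, while the tile next to the sink is unchanged.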

12 MEVA-3D
- An Automated Design Flow for 3D Architecture Evaluation (MEVA-3D)
  - Evaluates 3D implementations of micro-architectures systematically, from both performance and thermal perspectives.
- MEVA-3D Flow
  - Automated 2D/3D floorplanning: reduces the latency along critical loops in the micro-architecture by considering interconnect pipelining at a given target frequency.
  - Thermal evaluation: resistive network model considering white space and thermal via insertion.
  - 3D router

13 3D Architecture Evaluation with Physical Planning
- Optimize
  - BIPS (not IPC or frequency)
    - Consider interconnect pipelining based on early floorplanning for critical paths
    - Use IPC sensitivity model [Jagannathan05]
  - Area/wirelength
  - Temperature
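The interaction between interconnect pipelining and BIPS can be sketched as follows. A wire whose delay exceeds the clock period needs pipeline registers, each register adds a cycle to the loop it sits on, and each added cycle costs some IPC. The linear sensitivity model below is a hypothetical stand-in and does not reproduce the model of [Jagannathan05]. All loop names, delays, and sensitivity values are invented for illustration.

```python
def extra_cycles(wire_delay_ps: int, period_ps: int) -> int:
    """Pipeline registers needed so that no wire segment exceeds one clock period."""
    if wire_delay_ps <= 0:
        return 0
    return -(-wire_delay_ps // period_ps) - 1  # ceil(delay / period) - 1

def estimate_bips(base_ipc, freq_ghz, wire_delays_ps, sensitivity):
    """BIPS estimate under a hypothetical linear IPC sensitivity model."""
    period_ps = round(1000 / freq_ghz)
    ipc_loss = sum(
        sensitivity[loop] * extra_cycles(delay, period_ps)
        for loop, delay in wire_delays_ps.items()
    )
    return base_ipc * (1.0 - ipc_loss) * freq_ghz

# Hypothetical fractional IPC loss per extra cycle on each critical loop.
sensitivity = {"wakeup": 0.05, "branch_mispredict": 0.02}
# Hypothetical wire delays (ps) extracted from two candidate floorplans.
layout_2d = {"wakeup": 300, "branch_mispredict": 600}
layout_3d = {"wakeup": 200, "branch_mispredict": 400}

bips_2d = estimate_bips(1.0, 4.0, layout_2d, sensitivity)
bips_3d = estimate_bips(1.0, 4.0, layout_3d, sensitivity)
print(bips_2d, bips_3d)
```

Because the floorplan determines the wire delays, a floorplanner that evaluates this kind of estimate in its cost function optimizes BIPS directly rather than IPC or frequency in isolation.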

14 Design Example
- An out-of-order superscalar processor micro-architecture with 4 banks of L2 cache in 70nm technology
- Critical paths

15 Baseline Processor Parameters

16 2D vs 3D Layout
- 2D EV6-like core: BIPS = 2.75
- 3D EV6-like core (2 layers): BIPS = 2.94
- Wakeup loop: the extra cycle is eliminated.
- Branch misprediction resolution loop and L2 cache access latency: some of the extra cycles are eliminated.
- Assume two device layers.

17 Simulation Results
- The 3D architecture outperforms the 2D design by about 11.7% at a frequency of 4 GHz.

18 Performance for the micro-architecture with 2D and 3D layout at different target frequencies
- 3D integration improves performance by 11% by eliminating most of the wire latencies present in 2D.

19 Maximum On-Chip Temperature
- HS denotes a heat sink; 3D integration allows thermal vias to be inserted to reduce the temperature.
- 3D integration shows a temperature increase of over 4.78 on average. After thermal via insertion, the maximum on-chip temperature is reduced by about 62% on average.

21 3D Design w/ Component Folding and Stacking
- Explore 3D design of architectural structures that are
  - Timing/throughput critical
  - Expensive in terms of power consumption and/or thermal output
- Possible candidates for 3D component folding
  - Instruction scheduling window: the issue queue can be partitioned into multiple levels via matchlines or taglines.
  - On-chip caches: regular structure lends itself to a wide range of partitionings.
  - Register file: thermally critical resource that also has a regular structure.

22 3D Architectural Block Design and Modeling
- First explore how to design blocks in 3D
  - Wordline folding: fold block horizontally
  - Port partitioning: extend ports to different layers
- Tools
  - CACTI: caches and cache-like structures, register files
  - HSpice: issue queue
- Then explore the design space for a microprocessor with these blocks

23 3D Issue Queue
[Figure: (a) 2D issue queue with 4 taglines; (b) block folding; (c) port partitioning.]
- Block folding
  - Fold the entries and place them on different layers
  - Effectively shortens the tag lines
- Port partitioning
  - Place tag lines and ports on multiple layers, reducing both the height and width of the issue queue (ISQ).
  - The reduction in tag and matchline wires can help reduce both power and delay.

24 Benefits from IQ folding
- Maximum delay reduction of 50%, maximum area reduction of 90%, and maximum power reduction of 40%
- Legend: nL = n layers; FB = folding banks; TP = tag/port partitioning

25 Improvements for blocks
- Port folding performs better than wordline folding for area (72% vs 51%).
- Wordline folding is more effective in reducing the block delay (13% vs 5%).
- Port folding also performs better in reducing power (13% vs 5%).

26 3D packing with folded blocks
- Exploring vertical integration in microprocessor design requires considering both physical design and architecture.
  - True 3D packing
  - Architectural alternative selection
    - The number of layers in folded blocks
    - The partitioning method: block folding or port partitioning

27 3D Corner Block List Representation
- (S, L, T) composes a 3D CBL.
  - S: a record of block names
  - L: corner cubic block orientation (X-, Y-, or Z-oriented)
  - T: the sequence {T_n, T_{n-1}, ..., T_2} recording the number of attached tri-branches covered by each corner cubic block
- Example: S = (1 2 3 4 5), L = (Y, Z, Y, X), T = (10, 110, 10, 1110)
[Figure: 3D packing of blocks 1 to 5.]

28 Packings with folded blocks

30 Performance
- On average, multi-layer (3D) block configurations have 11% lower temperature and a 14% improvement in BIPS.

31 Temperatures
- Temperatures can be kept below 100 degrees with thermal vias inserted.

32 Temperature profile
[Figure: temperature profiles for 1 layer and for 2 layers with no vias inserted.]

33 Temperature profile (2 layers with thermal vias)

35 Micro-architecture Pipelining Optimization
- Previous works assume that blocks are designed separately for a given clock frequency, and wire pipelining is then carried out on the global wires of the circuit.
  - Sub-optimal because slack inside the block pipeline designs may go unused
- We propose a novel optimization methodology that combines architecture pipelining with physical design, so that block pipelining and interconnect pipelining are considered simultaneously.
[Figure: blocks A and B pipelined with pre-designed blocks versus with a path-based pipeline.]

36 Simultaneous Block and Interconnect Pipelining
- We define path-based pipelining as the Simultaneous Block and Interconnect Pipelining (SBIP) problem
  - Represent the micro-architecture design by a path graph G(V, E).
  - The delay between any two flip-flops along the same path is less than the clock period.
  - The performance of the architecture can be evaluated by the weighted sum of the number of flip-flops (FFs) on e_i (n_ei) along the paths.
  - Therefore the objective is to find a feasible solution with the optimal performance.
[Figure: path graph with nodes A to E and their copies A' to E'.]

37 MILP Formulation
- We define a term a(P, v), the arrival time at node v along path P: the longest delay from a flip-flop to node v along path P.
- With the given clock period $T$ and the set of paths $P$, the problem can be formulated as the following MILP:

Obj.: minimize the weighted sum of $n_{e_i}$ along the paths, subject to

\begin{align}
0 \le a(P_i, v) \le T, &\quad \forall v \in V \text{ and } P_i \text{ passing through } v \tag{1}\\
n_{e_i} \ge 0, &\quad \forall e_i \in E \tag{2}\\
a(P_i, v) \ge a(P_i, u) + d_{e_i} - T \cdot n_{e_i}, &\quad \forall e_i \in E \text{ connecting node } u \text{ to node } v \text{ along path } P_i \tag{3}
\end{align}
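To make constraints (1) and (3) concrete, the sketch below checks whether a given flip-flop assignment is feasible along one path by propagating the smallest arrival times the constraints allow. It is an illustration under stated assumptions, not the MILP itself: it assumes the path starts at a flip-flop (arrival time 0), uses integer picoseconds, and the function name and sample delays are invented for the example.

```python
def arrival_times(delays, n_ff, period):
    """Tightest arrival times along one path under constraints (1) and (3).

    delays[k] -- delay d_e of the k-th edge on the path (integer ps)
    n_ff[k]   -- number of flip-flops n_e placed on that edge
    period    -- clock period (integer ps)
    Returns the arrival time at each node after the first, or None if some
    node violates constraint (1), i.e. its arrival time exceeds the period.
    """
    a = 0  # assumption: the path starts at a flip-flop
    times = []
    for d, n in zip(delays, n_ff):
        # Smallest value satisfying a(v) >= a(u) + d_e - period * n_e and a(v) >= 0.
        a = max(0, a + d - period * n)
        if a > period:
            return None
        times.append(a)
    return times

delays = [700, 600, 400]  # hypothetical edge delays in ps
print(arrival_times(delays, [0, 0, 0], 1000))  # infeasible: 1300 ps without a flip-flop
print(arrival_times(delays, [0, 1, 0], 1000))
```

With a 1000 ps period, placing no flip-flops is infeasible, while a single flip-flop on the second edge yields arrival times of 700, 300, and 700 ps.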

38 Graph-based heuristic algorithm
- Traverse the graph to decide the optimal insertion of flip-flops such that the weighted sum of the cycle counts of the paths is minimized
  - Dynamic scanning for combinational circuits
  - Slacks along paths are used to compute the optimal positions for FFs.
  - Near-optimal method for sequential circuits: break each cycle into a path from s to t
- Throughput-aware floorplanning with pipelining
  - The path-based pipelining design guides the block design to optimize the performance of the whole design.
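For a single path, the idea of scanning forward and letting slack carry across edges can be sketched with a greedy pass that inserts a flip-flop only when the accumulated delay would exceed the clock period. This is a simplified illustration, not the graph-based heuristic from the slides, which must also handle weights, multiple paths, and cycles. The function name and delays are hypothetical, and delays are integer picoseconds.

```python
def min_flipflops(delays, period):
    """Greedy flip-flop counts per edge for one path.

    delays -- edge delays along the path (integer ps)
    period -- clock period (integer ps)
    Assumes the path starts at a flip-flop, so the initial arrival time is 0.
    """
    n_ff = []
    a = 0  # arrival time at the current node
    for d in delays:
        total = a + d
        # Fewest flip-flops keeping the next arrival time within one period:
        # ceil(total / period) - 1, clamped at zero.
        n = max(0, -(-total // period) - 1)
        n_ff.append(n)
        a = total - period * n  # leftover delay carried to the next edge
    return n_ff

print(min_flipflops([700, 600, 400], 1000))  # -> [0, 1, 0]
```

The three edges total 1700 ps, so one flip-flop suffices at a 1000 ps period. Registering every block boundary separately would instead spend a full cycle on each edge and leave the 300 ps, 400 ps, and 600 ps of slack unused.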

39 Experimental Results
- We compare wire pipelining (WP), the solutions obtained from the MILP solver (MILP), the ideal upper bound used in [6][8] (UB), and our graph-based heuristic approach (GH).
- Impact of frequencies
  - Path-based pipelining gives about a 27% performance improvement over wire pipelining.

40 Integrated with floorplanning optimization
- UB+post_MILP: the MILP approach applied as a post-process at the end of floorplanning
- GH: our approach integrated with throughput-driven floorplanning

Frequency    UB+post_MILP                          GH
(GHz)        Area (mm^2)   Wire (mm)   BIPS        Area (mm^2)   Wire (mm)   BIPS
2            32.1          15.6        1.492       31.8          142         1.714
3            34.6          103.7       2.139       33.3          108.4       2.22
4            32.4          98.7        2.776       36.1          124.3       2.828
5            32.8          126.2       2.885       32.6          94.17       3.35
6            36.0          108.4       3.636       33.7          100.3       3.882
7            35.9          112.5       3.479       36.8          129.9       3.906
Comparison   1             1           1           1.003         1.05        1.091

41 Summary
- 3D architecture exploration
  - Coupled with 3D physical planning
  - Considers both 3D component stacking and folding
- MEVA-3D can systematically evaluate a 3D architecture from both the performance side and the thermal side.
- We propose an optimization methodology for architecture pipelining with physical design that simultaneously optimizes the pipeline design and the physical packing in terms of system throughput. Performance improves by about 27% over wire pipelining.

42 Ongoing Work
- 3D multi-core architecture design and implementation
- Deep pipeline design in micro-architecture with interconnect considered
- The slack in a 3D design may be used to enlarge blocks and obtain better performance.

Thank You! Mayuchun@tsinghua.org.cn
