Issues on Designing Many-core Architectures. Seokhyun Lee, Hanmin Park, Kyoung Hoon Kim, Jinho Lee and Junwhan Ahn. Design Automation Laboratory, Seoul National University.


1 SEOUL NATIONAL UNIVERSITY DESIGN AUTOMATION LAB
Issues on Designing Many-core Architectures
Seokhyun Lee, Hanmin Park, Kyoung Hoon Kim, Jinho Lee and Junwhan Ahn
Many-SC project
Design Automation Laboratory, Seoul National University

2 Trends
- Increase in the number of cores
  - Larger bandwidth demand
  - Interference in shared resources (caches, off-chip links, ...)
  - Larger working set size
- Limitations
  - Low off-chip bandwidth, high off-chip link energy
  - Limited on-chip cache capacity
  - Limited power budget
  - Etc.

3 Outline
- Memory System
  - On-chip Caches
  - Main Memory
- Interconnection Network
- Power Budget Issue
  - Near-Threshold Computing
- Workload Characterization
  - Vision Applications

4 Outline (repeated; next section: Memory System)

5 On-chip Caches: Partitioning
- Objective: isolation of per-core data
  - Eliminate interference among cores
  - Better throughput by allocating capacity based on demand
- Examples of partitioning schemes
  - Way partitioning: limited scalability
  - Set partitioning: limited scalability and complex decode logic
  - Replacement-policy-based: no guarantee of strict isolation
- Few schemes provide scalability together with strict isolation (e.g., Vantage [Sanchez+ ISCA’11])
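To make the way-partitioning entry concrete, the following is a minimal sketch of a single cache set in which each core may only evict lines within its own quota of ways. The class name `WayPartitionedSet`, the per-core LRU policy, and the assumption that cores do not share lines are illustrative choices, not details taken from the slides.

```python
from collections import OrderedDict


class WayPartitionedSet:
    """One set of a set-associative cache with per-core way quotas (hypothetical model)."""

    def __init__(self, quotas):
        # quotas: {core_id: number of ways reserved for that core}
        self.quotas = dict(quotas)
        # Per-core tags kept in LRU order (oldest first).
        self.lines = {core: OrderedDict() for core in self.quotas}

    def access(self, core, tag):
        """Return True on a hit, False on a miss (the line is then filled)."""
        lines = self.lines[core]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            return True
        if self.quotas[core] == 0:
            return False  # no ways assigned: the line bypasses this set
        if len(lines) >= self.quotas[core]:
            lines.popitem(last=False)  # evict the LRU line of the same core only
        lines[tag] = None
        return False


# A streaming core 1 cannot evict the two lines held by core 0.
cache_set = WayPartitionedSet({0: 2, 1: 2})
cache_set.access(0, "a")
cache_set.access(0, "b")
for t in range(100):
    cache_set.access(1, t)
print(cache_set.access(0, "a"), cache_set.access(0, "b"))
```

Because each partition needs at least one way, the number of partitions in this scheme cannot exceed the associativity, which is one reading of the "limited scalability" remark above.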

6 On-chip Caches: 3D Stacked DRAM
- Use 3D stacked DRAM as a very large on-chip cache backed by large off-chip main memory
- Existing approaches
  - LH-cache with MissMap [Loh+ MICRO’11]
  - Alloy cache [Qureshi+ MICRO’12]
  - Hit speculation & self-balancing dispatch [Sim+ MICRO’12]
  - Footprint cache [Jevdjic+ ISCA’13]
  - Dynamic resizing of DRAM caches [Chang+ CMU TR]

7 On-chip Caches: STT-RAM
- Research in DAL
  - LASIC [Ahn+ IEEE TVLSI]
  - Lower-bits cache [Ahn+ ISCAS’12]
  - Selectively protecting ECC [Ahn+ ASP-DAC’13]
  - Write intensity prediction [Ahn+ ISLPED’13]
  - DASCA [Ahn+ HPCA’14]
- Other research related to multi/many-core systems
  - STT-RAM aware NoC [Mishra+ ISCA’11]
  - PVA-NUCA [Sun+ ISLPED’12]
  - OAP [Wang+ DATE’12]

8 Main Memory: Memory Controllers
- Numerous proposals for memory scheduling
  - ATLAS [Kim+ HPCA’10] for multiple MCs
  - SMS [Ausavarungnirun+ ISCA’12] for CPU-GPU systems
- Other research for many-core systems
  - Page placement/migration for multiple MCs [Awasthi+ PACT’10]
  - Application-aware channel partitioning & scheduling [Muralidhara+ MICRO’11]

9 Main Memory: Interface
- Various styles of interfaces/technologies
  - Slow parallel buses (e.g., DDR3: multi-drop, DDR4: P2P)
  - SerDes-based high-speed serial links (e.g., FB-DIMM, HMC)
  - Silicon interposer (e.g., HBM)
  - TSVs (e.g., Wide I/O)
  - Photonic interconnect

10 Summary
- Memory hierarchy for many-core systems
  - Bandwidth limitation vs. increasing bandwidth demand
  - Becomes more important as more cores are integrated
- Two main components
  - On-chip caches: data placement, partitioning, emerging memory technologies, ...
  - Main memory: memory controllers, interface, ...

11 Outline (repeated; next section: Interconnection Network)

12 (Flat) Mesh NoCs
- Single-Chip Cloud (SCC) *
- TILE64™ †
* J. Howard, et al., “A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,” ISSCC, 2010.
† S. Bell, “TILE64™ processor: A 64-core SoC with mesh interconnect,” ISSCC, 2008.

13 Alternatives: High-radix topology / Hierarchical topology
High-radix topologies
- Butterfly *
- Flattened butterfly †
* W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2004.
† J. Kim, et al., “Flattened butterfly: A cost-efficient topology for high-radix networks,” ISCA, 2007.

14 Alternatives: High-radix topology / Hierarchical topology
Hierarchical topology
- Concentrated mesh *
- Bus-mesh hybrid †
* J. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip networks,” ICS, 2006.
† R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.

15 Conclusion
- Mesh is not scalable in terms of latency and energy consumption.
- High-radix and hierarchical topologies are possible alternatives.
- The NoC architecture can be chosen based on the target application.
- (Possible) new issues include:
  - DSE (design space exploration) on hierarchical NoCs
  - DSE on bus-NoC combinations
  - Topology combinations & cluster sizes
  - 3D stacking → thermal issues
  - Task mapping, topology, routing, etc. perspectives
* S. Bourduas and Z. Zilic, “A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing,” NOCS, 2007.
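The latency side of the scalability claim can be illustrated with a rough hop-count calculation. The sketch below computes the average Manhattan distance between nodes of a k x k mesh under uniform random traffic, and the same figure for a concentrated mesh in which c x c neighbouring tiles share one router. The function names, the uniform-traffic assumption, and the concentration factor are illustrative assumptions; router delay, link length, and contention are ignored.

```python
from itertools import product


def mesh_avg_hops(k):
    """Average hop count over all ordered node pairs (self pairs included) in a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = sum(
        abs(x1 - x2) + abs(y1 - y2)
        for (x1, y1) in nodes
        for (x2, y2) in nodes
    )
    return total / len(nodes) ** 2


def cmesh_avg_hops(k, c=2):
    """Concentrated mesh: c x c tiles share a router, so routers form a (k // c) x (k // c) mesh."""
    return mesh_avg_hops(k // c)


for k in (4, 8, 16):  # 16, 64, and 256 nodes
    print(k * k, "nodes:", mesh_avg_hops(k), "hops (mesh),",
          cmesh_avg_hops(k), "hops (concentrated, c=2)")
```

The brute-force result agrees with the closed form 2(k^2 - 1) / (3k), so the average hop count grows roughly linearly in k, that is, with the square root of the node count.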

16 Outline (repeated; next section: Power Budget Issue)

17 Dark Silicon
The end of multicore scaling
[Figure: the same chip at 65 nm and 32 nm]
- 65 nm: 4 cores @ 1.8 GHz
- 32 nm: 2x4 cores @ 1.8 GHz (8 dark), or 4 cores @ 2x1.8 GHz (12 dark)
Source: http://darksilicon.org
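The numbers on this slide can be reproduced with simple arithmetic. The sketch below assumes a leakage-limited (post-Dennard) scaling model: shrinking by a linear factor s gives s^2 times more cores in the same area, capacitance per transistor drops by 1/s, and the supply voltage stays constant, so a fixed power budget supports only s times more "cores x frequency". The model and the function name `dark_silicon_options` are assumptions made for illustration, not taken from the slides.

```python
def dark_silicon_options(cores, freq_ghz, s):
    """Options under a fixed area and power budget after scaling by linear factor s (assumed model)."""
    total_cores = cores * s ** 2  # density grows by s^2 in the same area
    # Energy per operation shrinks by 1/s (capacitance scales, Vdd does not),
    # so the power budget covers s times the original cores x frequency.
    more_cores_active = cores * s
    return {
        "total_cores": total_cores,
        # Option A: spend the gain on more active cores at the old frequency.
        "more_cores": {
            "active": more_cores_active,
            "dark": total_cores - more_cores_active,
            "freq_ghz": freq_ghz,
        },
        # Option B: keep the old core count and spend the gain on frequency.
        "higher_freq": {
            "active": cores,
            "dark": total_cores - cores,
            "freq_ghz": freq_ghz * s,
        },
    }


# 65 nm -> 32 nm is roughly s = 2, starting from 4 cores @ 1.8 GHz.
print(dark_silicon_options(4, 1.8, 2))
```

With s = 2 this gives 16 cores in total, of which either 8 run at 1.8 GHz (8 dark) or 4 run at 3.6 GHz (12 dark), matching the two 32 nm cases listed above.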

18 Dark Silicon
[Figure: power consumption as process scales] [Taylor, DAC 2012]

19 Near-Threshold Computing
[Figure: Claremont, Intel]

20 Near-Threshold Computing
[Figure: energy per cycle]

21 Architecture Candidate
- Only super-threshold
- Near-threshold, or near- to super-threshold

22 Outline (repeated; next section: Workload Characterization)

23 Parallelism level for vision applications
[Figure: parallelism level for vision applications, from [1], [2], [3]]
[1] C. Shi, N. Wu, and Z. Wang, "A high-speed vision processor based on pixel-parallel PE array and its applications," IEEE Youth Conference on Information Computing and Telecommunications (YC-ICT), 2010, pp. 57-60.
[2] C. Wu, H. Aghajan, and R. Kleihorst, "Real-Time Human Posture Reconstruction in Wireless Smart Camera Networks," International Conference on Information Processing in Sensor Networks (IPSN '08), 2008, pp. 321-331.
[3] S. Kyo, S. i. Okazaki, and T. Arai, "An integrated memory array processor architecture for embedded image recognition systems," 32nd International Symposium on Computer Architecture (ISCA '05), 2005, pp. 134-145.

24 Vision Benchmark
SD-VBS
S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, et al., "SD-VBS: The San Diego Vision Benchmark Suite," IEEE International Symposium on Workload Characterization (IISWC), 2009, pp. 55-64.

25 Workload Characterization
- Vision benchmarks: SD-VBS, MEVBench
- Features (compared with SPEC 2006)
  - Less ILP (instruction-level parallelism)
    - Small register dependency distance
    - Small basic block size
  - Instruction mix ratio
    - Computation intensive: many fp & int operations
    - Not memory intensive: fewer load/store operations
  - Memory (VGA & HD)
    - Less memory stress
    - Requires only a small cache

26 Q&A

27 Appendix

28 Latency comparison
DSE in latency & energy perspectives
[Figure: latency comparison for 16, 64, and 256 nodes]
* R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.
Footer: 2014-01-16, Case studies & DSE on NoCs in homogeneous many-core architectures

29 Power comparison (1/2)
DSE in latency & energy perspectives
[Figure: power comparison for 16, 64, and 256 nodes]
* R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.

30 Power comparison (2/2)
DSE in latency & energy perspectives
[Figure: power comparison]
* R. Das, et al., “Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs,” HPCA, 2009.

31 Workload Characterization (SD-VBS)
- Benchmark comparison
  - Condition: serial execution (not parallel)
- Vision applications
  - Smaller basic block size
  - Less ILP than SPEC (register dependency distance & basic block size)
[Figure: benchmark comparison; labels: 6 fp, 4 int, Scientific App., Vision SD-VBS, Bio-informatics]
W. Alkohlani and J. Cook, "Towards Performance Predictive Application-Dependent Workload Characterization," SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), 2012, pp. 426-436.

32 Workload Characterization (SD-VBS)
- Instruction mix ratio
  - Load/store, float, integer, branch
- Many fp & int operations
- Fewer load/store operations

33 Workload Characterization (SD-VBS)
- Memory footprint (VGA & HD)
- Lower memory stress

34 Workload Characterization (SD-VBS)
- Temporal locality → cache size
  - Metric: number of unique cache lines between two accesses to the same cache line
- Vision applications
  - High temporal locality
  - Do not need a large cache
- Cache line size: 64 bytes
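The metric on this slide (unique cache lines touched between two accesses to the same line) is commonly called reuse distance. A minimal sketch of how it could be computed from an address trace with 64-byte lines follows; the function names and the tiny example trace are hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
def reuse_distances(addresses, line_size=64):
    """Per access: unique other cache lines touched since the last access to the same line.

    The first access to a line has no previous use and yields None.
    """
    last_pos = {}  # cache line -> index of its most recent access
    lines = []     # cache line touched by each access so far
    out = []
    for i, addr in enumerate(addresses):
        line = addr // line_size
        if line in last_pos:
            out.append(len(set(lines[last_pos[line] + 1:i])))
        else:
            out.append(None)
        last_pos[line] = i
        lines.append(line)
    return out


def lru_hits(distances, capacity_lines):
    """Hits in a fully associative LRU cache: an access hits iff its reuse distance < capacity."""
    return sum(1 for d in distances if d is not None and d < capacity_lines)


trace = [0, 64, 128, 0, 70, 0]  # byte addresses; lines 0, 1, 2, 0, 1, 0
dists = reuse_distances(trace)
print(dists)
print(lru_hits(dists, 2), lru_hits(dists, 3))
```

Short reuse distances mean that a small cache already captures most reuses, which is the link between high temporal locality and modest cache-size requirements drawn above.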

35 Workload Characterization (SD-VBS)
- Less ILP
- Less cache & memory pressure

36 Workload Characterization (MEVBench)
[Figure: ILP, DLP, TLP]

