CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips. Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli.

Presentation transcript:

1 CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli
LSI Integrated Systems Laboratory

2 NoCs: Major Power Consumer
• Move towards manycore: tiled architectures
• Network-on-Chip (NoC): significant power consumer (40% in MIT RAW, 30% in Intel Tera-scale)
• Cache-coherent CMP: server workloads
[Figure: tiled CMP in which each tile pairs a core (C) with a cache ($). Labels: Core, $, Crossbar]

3 Proposals to Reduce NoC Power
• Multiple networks: better area and power [Balfour & Dally, ICS 2006]
• Commercial server workloads: traffic patterns are different
• Run on cache-coherent CMPs: strong relation between coherence protocol and NoC
• Not optimized for commercial server workload traffic

4 Contributions
• Commercial server workloads: optimized for reuse in L1, little sharing; CMPs implement a full-blown coherence protocol, but only some transitions are frequent
• Duality in request/response message size
• CCNoC: takes full advantage of heterogeneity; same number of buffers; 16% less power at the same performance as Mesh

5 Outline
• Overview
• Why CCNoC?
• Dual-router design
• Evaluation
• Conclusions

6 Dual Router is More Efficient
• Dual router: two crossbars per routing node
• Wires are less expensive on-chip: use more wires for better performance
• Area and power grow faster than connectivity [Balfour & Dally, ICS 2006]
• Dual router: better performance, power and area
[Figure labels: N bit wide, N/2 bit wide]

7 Right Dual Router Design
• Avoid protocol-level deadlock: separate requests and responses; use virtual channels
• CCNoC sub-networks: request / response; no VCs needed; same number of buffers
• Buffers are power hungry [H.S. Wang & L.S. Peh, MICRO 2003]

8 Protocol Activity
• CMPs implement a full-blown coherence protocol
• Some transitions are frequent [Hardavellas, ISCA 2009]: read clean block; evict clean block; write to unshared block
• Other transitions are needed for correctness (infrequent): read dirty block; evict dirty block; write to shared block

9 Frequent Read Protocol Activity
[Figure: message sequence among Reader, Directory and Writer. Messages: Read Req, Read Resp, Evict Clean Req. Legend: Short Req, Short Resp, Long Resp]

10 Frequent Write Protocol Activity
[Figure: message sequence between Writer and Directory. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp. Legend: Short Req, Short Resp, Long Resp]

11 Infrequent Read Protocol Activity
[Figure: message sequence among Reader, Directory and Writer. Messages: Read Req, Read Resp, Downgrade Req, Downgrade Resp. Legend: Short Req, Short Resp, Long Resp]

12 Infrequent Write Protocol Activity
[Figure: message sequence among Writer, Directory, Reader 1 and Reader 2. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp, Inv Req, Inv Resp, Evict Dirty Req. Legend: Short Req, Short Resp, Long Resp]
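The message names on slides 9 to 12 can be gathered into a small lookup table. The sketch below is illustrative only. The request/response class follows each message's name, but the short/long size per message is an assumption (messages presumed to carry a cache block are marked long), because the diagrams did not survive in this transcript. The names `MsgClass`, `Size`, `MsgInfo`, `MESSAGES` and `long_messages` are hypothetical.

```python
from enum import Enum
from typing import NamedTuple


class MsgClass(Enum):
    REQUEST = "request"
    RESPONSE = "response"


class Size(Enum):
    SHORT = "short"
    LONG = "long"


class MsgInfo(NamedTuple):
    msg_class: MsgClass  # taken from the message name (Req / Resp)
    size: Size           # ASSUMED: long if the message likely carries a block


REQ, RESP = MsgClass.REQUEST, MsgClass.RESPONSE
SHORT, LONG = Size.SHORT, Size.LONG

# Message names as they appear on slides 9-12; sizes are assumptions.
MESSAGES = {
    "Read Req": MsgInfo(REQ, SHORT),
    "Read Resp": MsgInfo(RESP, LONG),
    "Evict Clean Req": MsgInfo(REQ, SHORT),
    "Fetch/Upgrade Req": MsgInfo(REQ, SHORT),
    "Fetch Resp": MsgInfo(RESP, LONG),
    "Upgrade Resp": MsgInfo(RESP, SHORT),
    "Downgrade Req": MsgInfo(REQ, SHORT),
    "Downgrade Resp": MsgInfo(RESP, LONG),
    "Inv Req": MsgInfo(REQ, SHORT),
    "Inv Resp": MsgInfo(RESP, SHORT),
    "Evict Dirty Req": MsgInfo(REQ, LONG),
}


def long_messages(msg_class: MsgClass) -> list[str]:
    """Return the names of long messages in one class, sorted by name."""
    return sorted(
        name
        for name, info in MESSAGES.items()
        if info.msg_class is msg_class and info.size is Size.LONG
    )
```

Under these assumptions the only long request is the dirty eviction, which would be in line with requests being dominated by short messages.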

13 Traffic Analysis
• Request: 93% short
• Response: 86% long

14 CCNoC Router
• Request network is narrow: optimized for short messages
• Response network is wide: optimized for long messages
[Figure labels: Request Switch, Response Switch, NI, Router]
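One way to picture the router on slide 14 is a network interface (NI) that steers each message into one of two physically separate sub-networks according to its class. The following is a minimal sketch under that assumption, not the authors' design: `Message`, `NetworkInterface` and `inject` are hypothetical names, and each switch is reduced to a plain queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    name: str          # e.g. "Read Req"
    is_response: bool  # message class: request or response
    dest: int          # destination tile id (hypothetical field)


@dataclass
class NetworkInterface:
    # One injection queue per sub-network; the two never share buffers,
    # so a blocked request cannot hold up a response.
    request_queue: deque = field(default_factory=deque)
    response_queue: deque = field(default_factory=deque)

    def inject(self, msg: Message) -> str:
        """Queue msg on the sub-network for its class; return that network's name."""
        if msg.is_response:
            self.response_queue.append(msg)  # wide network
            return "response"
        self.request_queue.append(msg)       # narrow network
        return "request"
```

Because the separation is physical, the sketch needs no virtual-channel bookkeeping, which mirrors the "no VCs needed" point on slide 7.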

15 Previous Work
• Balfour et al., ICS 2006: better than a single large router; read/write traffic; same number of reads and writes
• Yoon et al., DAC 2010: physical channels better than virtual channels
• Not optimized for cache-coherent CMPs running commercial server workloads

16 Outline
• Overview
• Why CCNoC?
• Dual-router design
• Evaluation
• Conclusions

17 Evaluation Methodology
• FLEXUS: full-system simulation; 16 or 8 UltraSPARC III ISA cores; split I/D, 64KB L1; 1 or 2 MB L2
• ORION 2.0: power estimation; area estimation
• Workloads:
  OLTP: TPC-C (IBM DB2 and Oracle)
  DSS: TPC-H (IBM DB2; Q1, Q6, Q13, Q16)
  Web: SPECweb99 (Apache and Zeus)
  Scientific: EM3D
  Multiprogrammed: SPEC2K (2x: gcc, twolf, art, mcf)

18 Evaluation NoCs
• Mesh-128 (baseline): 128-bit flit width
• Torus (reference): 128-bit flit width
• Mesh-176 (high performance): 176-bit flit width
• CCNoC: request network with 48-bit flit width; response network with 128-bit flit width
• Switches: wormhole flow control; input queued; On/Off transmission protocol; 2-entry input buffers
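The flit widths above lend themselves to a quick back-of-the-envelope check. Note that the two CCNoC widths sum to 48 + 128 = 176 bits, the same as Mesh-176. The sketch below counts flits per message and combines them with the short/long fractions from slide 13. The message sizes are assumptions, not figures from the slides: a short message is taken to be 48 bits and a long message a 48-bit header plus a 64-byte cache block.

```python
import math

# Flit widths in bits, from slide 18.
REQUEST_WIDTH = 48
RESPONSE_WIDTH = 128
MESH176_WIDTH = 176

# ASSUMED message sizes in bits (not given in the slides).
SHORT_BITS = 48
LONG_BITS = 48 + 64 * 8  # header + 64-byte block = 560 bits


def flits(message_bits: int, flit_width: int) -> int:
    """Number of flits needed to carry a message over a link of given width."""
    return math.ceil(message_bits / flit_width)


def expected_flits(short_fraction: float, flit_width: int) -> float:
    """Average flits per message for a mix of short and long messages."""
    return (short_fraction * flits(SHORT_BITS, flit_width)
            + (1.0 - short_fraction) * flits(LONG_BITS, flit_width))


# Traffic mix from slide 13: 93% of requests short, 86% of responses long.
avg_request_flits = expected_flits(0.93, REQUEST_WIDTH)
avg_response_flits = expected_flits(1.0 - 0.86, RESPONSE_WIDTH)
```

With these assumed sizes, a short message fits in one flit on either network, while a long message needs 5 flits on the 128-bit response network and would need 12 on the 48-bit request network, which illustrates why the rare long requests matter little when 93% of requests are short.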

19 Performance
• Performance loss: 2% against Torus, 8% against Mesh-176

20 Power Savings
• Power savings: 16% against Mesh-128, 22% against Torus, 38% against Mesh-176
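The three savings figures can be put on a common scale. The sketch below assumes each percentage is CCNoC's power saving relative to the named network, so that CCNoC's power equals (1 - saving) times that network's power. Under that reading, the relative power of all four designs follows by simple division. `SAVINGS` and `relative_power` are hypothetical names.

```python
# CCNoC power savings relative to each network, from slide 20.
SAVINGS = {"Mesh-128": 0.16, "Torus": 0.22, "Mesh-176": 0.38}


def relative_power(baseline: str = "Mesh-128") -> dict[str, float]:
    """Power of each design, normalized so that `baseline` equals 1.0.

    ASSUMES CCNoC = (1 - saving) * network for every listed network.
    """
    ccnoc = 1.0 - SAVINGS[baseline]  # CCNoC power in units of the baseline
    power = {"CCNoC": ccnoc}
    for name, saving in SAVINGS.items():
        power[name] = ccnoc / (1.0 - saving)
    return power


power = relative_power()
```

Normalized to Mesh-128, this reading puts the Torus at roughly 1.08 and Mesh-176 at roughly 1.35, with CCNoC at 0.84.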

21 Conclusions
• Duality in request/response traffic: requests are dominated by short messages; responses are dominated by long messages
• Proposed CCNoC: narrow request network; wide response network
• Showed significant power savings: 22% against Torus; 38% against Mesh-176

22 Thank you! Q&A

