Diamonds are a Memory Controller’s Best Friend* *Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core.

Diamonds are a Memory Controller’s Best Friend*
*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.
Dennis Abts (Google), Natalie Enright Jerger (University of Toronto), John Kim (KAIST), Dan Gibson (Univ of Wisconsin), Mikko Lipasti (Univ of Wisconsin)

Executive Summary
On what tiles should memory controllers reside?
–Three-tiered simulation approach: heuristic-guided search, detailed network simulation, full-system simulation
Diamond MC placement works well for on-chip meshes and tori
–Diamonds minimize maximum channel load
–Diamonds deliver lower and more predictable runtimes

Background
Diverse on-chip communication
–Cache-to-cache
–LD/ST to Memory
–Off-chip traffic (e.g., I/O)
Processors/chip on the rise
–Pins available for memory not rising as fast: Memory bandwidth becomes more precious
–Reality: Many Cores, Few Memory Controllers
Tiled architectures gaining popularity
–Commonly employ on-chip meshes or tori

The Problem
What Memory Controller placement is best overall?
–Flip-chip packaging allows flexible escape routes
–n tiles and m ports: Don’t worry, there are only $\binom{n}{m}$ configurations!
–What are the characteristics of the best configuration?
  Performance: Low runtime for a set of objective workloads
  Throughput: Low latency as a function of offered load
  Fairness: Similar (low) average memory latency across all nodes
  Predictability: Low latency and runtime variance
Slight Simplification: Assume $n = k^2$ and $m = 2k$

Baseline Placement: row0_7
Ports to MCs located at top and bottom of chip
Conceptually similar to real parts:
–Tilera’s Tile64: 64 cores, 4 MCs (4 ports each, top/bottom of chip)
–Intel TeraFLOPs: 80 cores, 2 MCs (8 ports each, top/bottom of chip)
X-Dimension Traffic Encounters Congestion on Rows with Memory Controllers

Three-Tiered Approach
Link Contention Simulation, then Detailed Network Simulation, then Full System
[Figure: the three tiers along a trade-off axis labeled More Runs, Shorter Runtimes, More Detail]

Tier 0.5: Exhaustive Search
It turns out that searching all $\binom{k^2}{2k}$ configurations is tractable for k<7
–(At least on the link contention simulator: only 3,268,760 possibilities for k=5)
Patterns Emerge!
Another Contender
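The size of the search space quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the simplification from the problem statement (n = k² tiles, m = 2k ports); the function name `num_placements` is made up for illustration.

```python
import math

def num_placements(k):
    """Number of ways to choose m = 2k MC tiles out of n = k*k tiles."""
    return math.comb(k * k, 2 * k)

if __name__ == "__main__":
    # Print the search-space size for small meshes to see how fast it grows.
    for k in range(2, 9):
        print(k, num_placements(k))
```

For k=5 this reproduces the 3,268,760 figure on the slide, which is also why exhaustive search stops being practical a little beyond that size.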

Tier 1: Heuristic-Guided Search
k>6: Intractable to search all configurations
–Use search heuristics and random search
Genetic Algorithm:
–Represent designs as a population of strings (Bit Vectors)
–Generate new designs by combining members of the population via genetic crossover (Bit Selection)
–Occasionally, mutate new population members (Swap adjacent bits)
–Reduce population size by removing least-fit members (Survival of the Fittest)
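The genetic algorithm steps listed above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The fitness function here is a stand-in (the largest number of MCs sharing a row or column), whereas the talk scores designs with the link contention simulator. The crossover rule that keeps the port count fixed, the population size, and the mutation rate are all assumptions.

```python
import random

K = 8            # mesh radix; the chip has K*K tiles
PORTS = 2 * K    # number of MC ports (m = 2k on the slides)

def fitness(bits):
    """Stand-in cost, lower is better: most MCs found in any one row or column."""
    rows = [sum(bits[r * K:(r + 1) * K]) for r in range(K)]
    cols = [sum(bits[c::K]) for c in range(K)]
    return max(rows + cols)

def crossover(a, b, rng):
    """Bit selection: keep bits set in both parents, then fill the remaining
    ports from positions where the parents disagree (assumed repair rule)."""
    child = [x & y for x, y in zip(a, b)]
    candidates = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    rng.shuffle(candidates)
    for i in candidates[:PORTS - sum(child)]:
        child[i] = 1
    return child

def mutate(bits, rng):
    """Swap two adjacent bits in place; the number of ports is unchanged."""
    i = rng.randrange(len(bits) - 1)
    bits[i], bits[i + 1] = bits[i + 1], bits[i]

def evolve(generations=200, pop_size=20, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    base = [1] * PORTS + [0] * (K * K - PORTS)
    population = [rng.sample(base, len(base)) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng)
            if rng.random() < mutation_rate:
                mutate(child, rng)
            children.append(child)
        # Survival of the fittest: keep only the pop_size best designs.
        population = sorted(population + children, key=fitness)[:pop_size]
    return population[0]

if __name__ == "__main__":
    best = evolve()
    print(fitness(best), sum(best))
```

Because parents compete with their children for survival, the best design in the population never gets worse from one generation to the next.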

Genetic MC Placement
[Figure: two parent bit vectors, 0x00AA550000AA5500 and 0x0000FF0000FF0000, are combined into 0x00AAF00000F25100, which is then mutated into 0x00AAF00000F25080]
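To make the bit-vector encoding concrete, the sketch below renders a 64-bit placement as an 8x8 grid and checks that the mutation shown on the slide is a swap of two adjacent bits. The mapping of one byte per mesh row, most significant bit first, is an assumption, since the transcript does not state the bit ordering. The helper names are made up.

```python
def grid(placement, k=8):
    """Render a k*k-bit placement as k strings: '#' marks an MC port, '.' none.
    Assumes one row per k-bit group, most significant bits first."""
    bits = format(placement, "0{}b".format(k * k))
    return ["".join("#" if b == "1" else "." for b in bits[r * k:(r + 1) * k])
            for r in range(k)]

def is_adjacent_swap(before, after):
    """True if 'after' differs from 'before' by swapping two neighbouring bits."""
    diff = before ^ after
    low = diff & -diff  # lowest bit position where the two vectors differ
    same_ports = bin(before).count("1") == bin(after).count("1")
    return diff != 0 and diff == low * 3 and same_ports

if __name__ == "__main__":
    print("\n".join(grid(0x00AA550000AA5500)))
    print(is_adjacent_swap(0x00AAF00000F25100, 0x00AAF00000F25080))
```

All four vectors on the slide have 16 bits set, which matches m = 2k ports for k = 8.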

Link Contention Results (k=8)
Max Channel Load by configuration:
  Config.    Mesh    Torus
  row0_7     13.5    9.25
  X          8.93    7.72
  Diamond    8.90    7.72
GA Selected Diamond as most fit solution for 8x8
–Minimizes MCs in a single row/column
–Spreads DOR load
Sanity Check: GA also prefers Diamond for 4x4, 5x5, and 6x6
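A link contention model of the kind used in this tier can be approximated in a few lines. The sketch below is a simplified stand-in, not the authors' simulator. It assumes a mesh (no torus wraparound), XY dimension-order routing for both requests and replies, every tile sending equal traffic to every MC, and equal weight for requests and replies. The exact tile coordinates used for the diamond and X placements are also assumptions, so the numbers it prints will not match the table above.

```python
from collections import defaultdict

K = 8  # mesh radix

def xy_route(src, dst):
    """Yield the directed links on the X-then-Y path from src to dst."""
    x, y = src
    dx, dy = dst
    while x != dx:
        nx = x + (1 if dx > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        yield (x, y), (x, ny)
        y = ny

def max_channel_load(mcs, k=K):
    """Heaviest link load when each tile spreads one request and one reply
    evenly over all MCs."""
    load = defaultdict(float)
    share = 1.0 / len(mcs)
    for x in range(k):
        for y in range(k):
            for mc in mcs:
                for link in xy_route((x, y), mc):   # request: tile to MC
                    load[link] += share
                for link in xy_route(mc, (x, y)):   # reply: MC back to tile
                    load[link] += share
    return max(load.values())

def row0_7(k=K):
    """MCs along the top and bottom rows."""
    return {(x, y) for x in range(k) for y in (0, k - 1)}

def x_placement(k=K):
    """MCs on both diagonals."""
    return {(i, i) for i in range(k)} | {(i, k - 1 - i) for i in range(k)}

def diamond(k=K):
    """Two MCs per row and column, arranged as a diamond (assumed layout)."""
    half = k // 2
    mcs = set()
    for y in range(k):
        d = y if y < half else k - 1 - y
        mcs.add((half - 1 - d, y))
        mcs.add((half + d, y))
    return mcs

if __name__ == "__main__":
    for name, mcs in (("row0_7", row0_7()), ("X", x_placement()), ("Diamond", diamond())):
        print(name, max_channel_load(mcs))
```

Even this crude model shows the effect described on the slide: with row0_7, replies all travel along the two MC rows first, so those rows carry far more load than any link does when the MCs are spread two per row and column.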

Network Simulation: Open-Loop Evaluation
Detailed simulation of all network events (buffers, links, etc.)
Cores are Bernoulli injection processes, uniform random traffic
Measure latency vs. offered load
  Parameter             Value
  Router latency        1 cycle (aggressive)
  Inter-router delay    1 cycle
  Buffers               32-flit sized per port
  Packet size           Request: 1 flit, Reply: 4 flits
  Virtual channels      4 (XY-YX routing)
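The traffic source for an open-loop run can be sketched as a Bernoulli process per core: each cycle, a core injects a request with a fixed probability, independent of how the network is doing. The code below is only an illustration of that injection model with uniform random destinations. It does not model routers, buffers, or latency, and the function name and the tuple layout are made up.

```python
import random

def bernoulli_injections(num_cores, num_mcs, rate, cycles, seed=0):
    """Return (cycle, core, mc) tuples for every injected 1-flit request.
    Each core injects with probability 'rate' per cycle and picks an MC
    uniformly at random."""
    rng = random.Random(seed)
    packets = []
    for cycle in range(cycles):
        for core in range(num_cores):
            if rng.random() < rate:
                packets.append((cycle, core, rng.randrange(num_mcs)))
    return packets

if __name__ == "__main__":
    pkts = bernoulli_injections(num_cores=64, num_mcs=16, rate=0.3, cycles=1000)
    # Measured offered load in flits/cycle per core; it should sit close to 'rate'.
    print(len(pkts) / (64 * 1000))
```

Sweeping `rate` and measuring packet latency at each point is what produces a latency versus offered load curve.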

Open-Loop Results
[Figure: Latency (cycles) vs. offered load (flits/cycle) for the row0_7, row2_5, Diamond, and X placements]

Closed-Loop Evaluation
Each processor executes N memory operations
Up to r operations outstanding at a time
–Models MSHRs
Uniform Random requests, and real request streams with ‘hot spot’ behavior
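The closed-loop rule (N operations, at most r in flight) can be illustrated for a single processor as below. This is a sketch under strong assumptions: the latency of each operation is supplied by a caller-provided function `latency_of` (a hypothetical hook), whereas in the real evaluation it comes from the network simulation and depends on contention.

```python
import heapq

def completion_time(n_ops, max_outstanding, latency_of):
    """Cycle at which the last of n_ops memory operations finishes when at most
    max_outstanding may be in flight (an MSHR-like limit). Requires n_ops >= 1."""
    in_flight = []  # min-heap of finish times of outstanding operations
    now = 0
    for op in range(n_ops):
        if len(in_flight) == max_outstanding:
            # All slots busy: stall until the earliest outstanding op returns.
            now = heapq.heappop(in_flight)
        heapq.heappush(in_flight, now + latency_of(op))
    return max(in_flight)

if __name__ == "__main__":
    # With a constant 5-cycle latency, more outstanding operations finish sooner.
    for r in (1, 2, 10):
        print(r, completion_time(10, r, lambda op: 5))
```

Because a stalled processor cannot issue new requests, higher memory latency feeds back into a lower injection rate, which is what distinguishes this setup from the open-loop runs.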

Closed-Loop Results
[Figure: distribution of completion time (x-axis) against number of processors (y-axis) for the Diamond and row0_7 placements, shown in two panels]

Full System Results
[Figure: average and standard deviation of network latency (cycles) for requests to the memory controller, for the JBB, WEB, TPC-W, TPC-W+H, and TPC-H workloads]
Diamond placement yields lower latency and lower latency variance.

Conclusion
MC Placement Matters!
–Diamond reduces contention, improves latency, and reduces latency/runtime variance
–X does fairly well
