From Adaptive to Self-Tuning Systems Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos School of Electrical and Computer Engineering.


1 From Adaptive to Self-Tuning Systems Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos School of Electrical and Computer Engineering

2 SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Architectural Challenges
Walls named on the slide: Power Wall, Frequency Wall, Memory Wall, Economic Wall, Single Thread Performance
- Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays) [4]
- ILP pipeline: in-order, OOO, aggressive OOO
- Negative returns with power
- Increasing inefficiencies due to speculation, control flow
- Cache area
  - 80% of transistor budget, 50% of total area [1]
  - Defects in cache affect processor yield
  - Significant power consumers (e.g. > 40% of total power in StrongARM) [2]
- On-chip-DRAM gap continues to grow
- Costs of developing next generation processors
  - Design & manufacturing costs
  - Extreme device variability
- Power: leakage current increases 7.5X with each generation [3]
Image source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
References:
1. P. Ranganathan, S. Adve, N. Jouppi. Reconfigurable Caches and their Application to Media Processing. ISCA.
2. Michael Zhang, Krste Asanovic. Fine-Grain CAM-Tag Cache Resizing Using Miss Tags. ISLPED 02.
3. S. Borkar. Design Challenges of Technology Scaling. Micro.
4. Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In ISCA 2000.

3 System View
[Figure: large scale many-core, heterogeneous system drawn as a grid of P and M nodes]
1. Capture and adapt to intrinsic application behavior
  - Static, off-line characterizations
  - Dynamic, on-line, evolutionary behaviors
2. Device-level variations reduce architecture yield
Solution: systems are self-tuning

4 The Space of Solutions
[Figure: solution space. Axis labels: Structured Workloads to Ill-Structured Workloads; Rigid HW/SW Boundaries to Evolutionary or Self-Tuning Systems. Each point is drawn as a P/M pair.]
- State of the Practice
- Traditional Architectures (Fixed)
- Ability to Customize Architectures Before Application Deployment
- Architectures Change at SW-determined Points of Execution
- Architectures continuously and autonomously evolve and adapt

5 From Adaptive to Self Tuning
- Where do we make future investments in transistors and software?
- Hardware software co-design for continuous monitoring and/or tuning
- Expose and (dynamically) eliminate design redundancies
- Two Examples
  - Cache memory hierarchy
  - On-Chip Networks

6 Generational Behavior of Caches
[Figure: memory lines plotted against time; each new generation of a line starts with a miss, is followed by hits, and ends with an idle interval]
References:
- Kaxiras, S., Hu, Z. and Martonosi, M. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. ISCA 2001.
- Jaume Abella, Antonio González, Xavier Vera, Michael F. P. O'Boyle. IATAC: a smart predictor to turn-off L2 cache lines. TACO.
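The generational pattern on this slide (a miss starts a generation, hits follow, then the line sits idle) is what decay-style schemes exploit to switch idle lines off. The following is a minimal sketch in Python, assuming a toy model in which each cache line always holds the same block and a line is powered off after `decay_interval` idle cycles. The function name, the trace format and the accounting are illustrative assumptions, not taken from the slides or the cited papers.

```python
def simulate_decay(accesses, num_lines, decay_interval):
    """Replay a trace of (cycle, line) accesses against a decaying cache.

    Toy assumptions: every line always holds the same block, and a line is
    powered off once it has been idle for more than decay_interval cycles.
    Returns (hits, misses, powered_cycles); powered_cycles ignores the
    time after the final access to each line.
    """
    last_access = [None] * num_lines  # None: the line has never been filled
    hits = misses = powered_cycles = 0
    for cycle, line in accesses:
        prev = last_access[line]
        if prev is not None and cycle - prev <= decay_interval:
            hits += 1                          # still within the generation
            powered_cycles += cycle - prev
        else:
            misses += 1                        # a new generation starts
            if prev is not None:
                powered_cycles += decay_interval  # on until it decayed
        last_access[line] = cycle
    return hits, misses, powered_cycles


# Hypothetical trace: two accesses, a long idle interval, two more accesses.
trace = [(0, 0), (5, 0), (100, 0), (102, 0)]
print(simulate_decay(trace, num_lines=1, decay_interval=10))
```

In this trace the line is powered for 17 of the 102 cycles, at the price of one extra miss when it is touched again after the idle interval. Choosing the decay interval is exactly that trade between leakage saved and misses induced.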

7 Cache Tuning: Conceptual Model
- Remap memory into the cache: shape the cache
- Match the program footprint: resize the cache
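One way to picture the two knobs (remap and resize) is a cache whose set index passes through a small lookup table. The sketch below is a hypothetical Python model, not the authors' hardware: the class name `RemappableCache`, the direct-mapped organization and the rule for redirecting disabled sets are all assumptions made for illustration.

```python
class RemappableCache:
    """Direct-mapped cache whose set index goes through a lookup table."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.lut = list(range(num_sets))   # identity table = modulo placement
        self.lines = [None] * num_sets     # line address held by each set
        self.hits = 0
        self.misses = 0

    def resize(self, active_sets):
        """Keep only active_sets powered; remap all other indices onto them."""
        active = sorted(active_sets)
        if not active:
            raise ValueError("at least one set must stay active")
        self.lut = [
            s if s in active_sets else active[s % len(active)]
            for s in range(self.num_sets)
        ]
        for s in range(self.num_sets):
            if s not in active_sets:
                self.lines[s] = None       # contents of a disabled set are lost

    def access(self, line_addr):
        """Look up one memory line; returns True on a hit."""
        s = self.lut[line_addr % self.num_sets]
        if self.lines[s] == line_addr:
            self.hits += 1
            return True
        self.misses += 1
        self.lines[s] = line_addr
        return False


cache = RemappableCache(num_sets=4)
for addr in [0, 1, 0, 1]:
    cache.access(addr)
cache.resize({0, 1})          # the program only touches two sets
print(cache.lut, cache.hits, cache.misses)
```

Storing the full line address in each set keeps lookups correct after the table changes. Real hardware would need a tag scheme that survives remapping, which this sketch does not address.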

8 Cache Tuning: System Model & Opportunities
[Figure: system model. Labels recovered from the diagram:]
- Program view: a loop (loop ... statement ... end loop) marked Region A, annotated with a remapping directive Placement( B[][], param )
- Directive sources: static analysis or programmer supplied; profile based insertion
- Hardware view: P, AT, L1, L2, M, with Thread 1 and Thread 2
- Alternative implementations: LUT, logic
- Run-time tuning
- Structured accesses (x, y, z)

9 Static Tuning: Scientific Applications
- Targeted to programs with predictable access patterns
- Compiler can both resize and remap
- Advanced compiler optimizations made possible

10 Dynamic Tuning: Folding Heuristics
- Find and utilize redundancies in the design
- Miss folding: fold misses via re-mapping memory lines into the same cache set
Comparisons shown for a 256KB L2 cache
Reference: S. Ramaswamy, S. Yalamanchili. Improving Cache Efficiency via Resizing + Remapping. ICCD 2007.

11 Tuning for Yield: Decreasing Defect Sensitivity*
- Performance yield: yield at a given performance (e.g. AMAT) for 1000 units
- Up to four times greater than modulo placement
- Exploiting redundancies: application to power management
Recovering Design Inefficiencies
* S. Ramaswamy, S. Yalamanchili. Customizable Fault Tolerant Caches for Embedded Processors. ICCD 2006.
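Performance yield, as defined on this slide, counts the units that still meet a performance target rather than the units that are merely defect-free. The sketch below computes it in Python using the standard average memory access time formula (AMAT = hit time + miss rate x miss penalty). The miss rates, latencies and target are made-up illustrative values, and both function names are assumptions, not part of the cited work.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time."""
    return hit_time + miss_rate * miss_penalty


def performance_yield(miss_rates, hit_time, miss_penalty, amat_target):
    """Fraction of units whose AMAT is at or below amat_target.

    miss_rates holds one measured (or simulated) miss rate per
    manufactured unit; defects that disable cache capacity show up
    as a higher miss rate for that unit.
    """
    if not miss_rates:
        raise ValueError("need at least one unit")
    passing = sum(
        1 for m in miss_rates
        if amat(hit_time, m, miss_penalty) <= amat_target
    )
    return passing / len(miss_rates)


# Hypothetical per-unit miss rates for four units (not measured data).
units = [0.02, 0.05, 0.10, 0.50]
print(performance_yield(units, hit_time=1, miss_penalty=100, amat_target=7.0))
```

A placement scheme that steers memory lines away from defective sets lowers the per-unit miss rates, which is how remapping can raise this number without changing the defect rate itself.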

12 Opportunities
- Voltage scaling
  - Combine voltage scaling and remapping for program phase dependent power management
- Compiler-directed hardware optimizations
  - For example concurrent data layout + cache placement
- Application to multi-threaded and multi-core domains
  - Cache sharing across threads
  - Challenge: coherency traffic

13 The On-Chip Network
- The network is in the critical path (performance)
  - Operand networks
  - Cache hierarchy
  - System on Chip
- Increasing impact of wire (channel) delays
  - Wire delays must be actively managed
  - On-demand resource management
- Initial studies: link tuning
  - Reference: Research at EPFL & Stanford on robust link design

14 A System for Tuning and Actively Reconfiguring SoC Links (STARS)
- Variable delays and cascaded registers measure link delay
- Digital PLL tunes the clock to match the link delay
[Figure: timing diagram over time showing Latch 1, Latch 2 and Latch 3 sampling a link that changes from Value 1 to Value 2, in three cases: Well Tuned, Too Slow, Too Fast]
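The three-latch diagram suggests a simple way to judge whether the clock matches the link. The Python sketch below rests on an assumed reading of the figure, which the transcript does not spell out: Latch 1 samples slightly before the clock edge, Latch 2 on the edge and Latch 3 slightly after, and each latch records whether the new value has already arrived. The function names and the `margin_ps` parameter are hypothetical.

```python
def sample_latches(arrival_ps, period_ps, margin_ps):
    """Return (early, on_time, late): True where the new value was captured.

    Assumed sampling points: period - margin, period, period + margin,
    all measured from the launch of the new value onto the link.
    """
    sample_times = (period_ps - margin_ps, period_ps, period_ps + margin_ps)
    return tuple(arrival_ps <= t for t in sample_times)


def classify(samples):
    """Map the three samples onto the verdicts named on the slide."""
    early, on_time, _late = samples
    if early:
        return "too slow"    # data arrived with room to spare
    if on_time:
        return "well tuned"  # data arrived inside the margin window
    return "too fast"        # the clock edge beat the data


for arrival in (900, 980, 1020):
    samples = sample_latches(arrival, period_ps=1000, margin_ps=50)
    print(arrival, samples, classify(samples))
```

In this model the late sample does not change the verdict; it only shows whether a "too fast" clock missed the data by less than one margin.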

15 FPGA Tests
[Figure: control flow with these labels, in order:]
- Monitoring
- Find End of Link Transition
- Tuning
- Find Start of Link Transition
- Determine Slack in the Link
- Adjust Clock Frequency
Low speed tests to validate the control strategy
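The steps listed on this slide can be sketched as a sweep followed by an adjustment. The Python model below is an illustration under stated assumptions, not the tested FPGA design: the link is simulated as a window in which the signal is changing, the variable delay is swept in fixed steps of `step_ps`, and the new clock period is set to the end of the transition plus a guard band. All names and parameters are hypothetical.

```python
def make_link(start_ps, end_ps):
    """Simulated link: 'old' before start_ps, 'new' from end_ps onwards."""
    def sample(t_ps):
        if t_ps < start_ps:
            return "old"
        if t_ps < end_ps:
            return "changing"
        return "new"
    return sample


def find_transition_window(sample, step_ps, max_code):
    """Sweep the delay code upward; return (start_ps, end_ps) as observed.

    start is the first sample time that no longer reads the old value,
    end is the first sample time that reads the settled new value.
    Either may be None if the sweep never sees it.
    """
    start = end = None
    for code in range(max_code + 1):
        t_ps = code * step_ps
        state = sample(t_ps)
        if start is None and state != "old":
            start = t_ps
        if state == "new":
            end = t_ps
            break
    return start, end


def retune(period_ps, window_end_ps, guard_ps):
    """Return (slack_ps, new_period_ps) for the measured end of transition."""
    slack = period_ps - window_end_ps
    return slack, window_end_ps + guard_ps


link = make_link(start_ps=430, end_ps=610)
start, end = find_transition_window(link, step_ps=20, max_code=100)
print(start, end, retune(period_ps=1000, window_end_ps=end, guard_ps=40))
```

The measured window is quantized to the sweep step, so the guard band should be at least one step wide in a model like this.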

16 Variable Delay Elements (VDE)
- Variable delay from 118ps to 1.47ns
- 10 bits of resolution
- 502 transistors
Digitally Controlled Oscillator (DCO)
- Clock period from 240ps to 2.97ns
- 10 bits of resolution
- 528 transistors
Digital Clock Divider (DCD)
- Min input clock period 480ps
- 8 bits of resolution
- 1127 transistors
Allows tuning links up to GHz
From reference clock of 8.13MHz
Prototyping: 180nm
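The ranges and bit widths above imply rough step sizes. The arithmetic below is a sketch in Python, assuming the control codes are evenly spaced across each range, which the slide does not state; the helper names are illustrative.

```python
def average_step_ps(min_ps, max_ps, bits):
    """Average step per control code, assuming evenly spaced settings."""
    return (max_ps - min_ps) / (2 ** bits - 1)


def period_ps_to_mhz(period_ps):
    """Convert a clock period in picoseconds to a frequency in MHz."""
    return 1e6 / period_ps


vde_step = average_step_ps(118, 1470, bits=10)    # variable delay element
dco_step = average_step_ps(240, 2970, bits=10)    # oscillator period
dco_max_mhz = period_ps_to_mhz(240)               # shortest DCO period
divided_mhz = period_ps_to_mhz(480) / 2 ** 8      # 480 ps clock divided by 256

print(f"VDE step  ~ {vde_step:.2f} ps")
print(f"DCO step  ~ {dco_step:.2f} ps")
print(f"DCO max   ~ {dco_max_mhz:.0f} MHz")
print(f"480 ps / 2^8 ~ {divided_mhz:.2f} MHz")
```

Dividing the 480 ps minimum input clock by 2^8 gives about 8.14 MHz, close to the 8.13MHz reference quoted on the slide. Reading that as the intended relationship between the divider and the reference clock is an assumption.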

17 Extensions
- Modulate link widths
- Modulate buffer organizations
  - Channels/depth
- Feedback between local congestion detection and link and buffer resources

18 Summary
- Application demands will be time varying
- Technology will introduce time-varying hardware characteristics
- Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns
- Need the support of abstractions for tuning
- Influence of prior applications to datapaths (Razor-UMich), communication systems (Vizor-GT), and reliable links (Stanford/EPFL)
- Build on existing research in cache performance & power management

