Published by Candace Gibson. Modified over 9 years ago.
1
Configurational Workload Characterization
Hashem H. Najaf-abadi, Eric Rotenberg
2
Heterogeneity
A Single Core: [Figure labels: Program 1, Program 2, Processor.]
3
Heterogeneity
Multiple Cores: [Figure labels: Program 1, Program 2, Processor, Processor.]
4
Heterogeneity
Multiple Cores: [Figure labels: Program 1, Program 2, Processor, Processor.]
5
Heterogeneity
Heterogeneous Cores: [Figure labels: Program 1, Program 2, Processor.]
6
Heterogeneous CMP Design
Must determine:
1) Best processor configuration for a group of workloads.
2) Best way to group workloads together.
7
The Challenge: Communal Customization
[Figure: a workload space (workloads A through N) mapped to the best core configurations (Core 1, Core 2).]
8
Existing Approaches
Regression models: Enable speedy exploration.
Subsetting: Reduce workloads to a representative subset based on characteristics.
9
The Argument
Subsetting is not a valid substitute for, or facilitator of, communal customization.
Reason: complex interdependencies between different architectural units.
10
Ties that bind
1) The global clock intertwines the sizing of different architectural units.
2) The burden of compromise in one unit can be passed on to another.
11
Example: The Global Clock
[Figure: cache and issue-queue delays at clock periods of 1 ns and 0.66 ns. Solid line: delay of the issue queue. Dashed line: access delay of the cache. Annotations: slack, less slack, pipeline too deep, small issue queue, needlessly large cache.]
12
Example: The Global Clock
The clock period, issue-queue size, and cache size cannot be optimized independently of one another.
[Figure: cache and issue-queue delays at clock periods of 1 ns and 0.66 ns.]
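The coupling through the clock can be made concrete with a small sketch. It assumes each unit is pipelined into a whole number of stages, so any time left over in the last stage is slack. The unit delays below are hypothetical and are not taken from the slides; only the 1 ns and 0.66 ns clock periods come from the example.

```python
def stages_and_slack(delay_ps, period_ps):
    """Return (pipeline stages needed, unused time in ps) for one unit."""
    stages = -(-delay_ps // period_ps)  # ceiling division on integers
    return stages, stages * period_ps - delay_ps

# Hypothetical unit delays in picoseconds (illustrative, not from the slides).
UNIT_DELAY_PS = {"issue_queue": 600, "cache": 1000}

for period_ps in (1000, 660):  # the 1 ns and 0.66 ns clocks from the example
    for unit, delay_ps in UNIT_DELAY_PS.items():
        stages, slack = stages_and_slack(delay_ps, period_ps)
        print(f"clock={period_ps} ps {unit}: {stages} stage(s), slack={slack} ps")
```

With these made-up delays, the issue queue leaves 400 ps unused at the 1 ns clock, while at 0.66 ns the cache spills into a second stage. Changing the clock period therefore changes which unit sizes are worthwhile, which is why the three cannot be tuned separately.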
13
Ties that bind
1) The global clock intertwines the sizing of different architectural units.
2) The burden of compromise in one unit can be passed on to another.
14
Example: Passing on the Burden
A) Working-set size
B) Branch predictability
C) Density of dependence chains
D) Frequency of loads
E) Frequency of conditional branches
* All normalized to a scale of 0 to 10.
[Figure: workloads α, β, γ plotted on characteristics A to E.]
15
Example: Passing on the Burden
A) Working-set size
B) Branch predictability
C) Density of dependence chains
D) Frequency of loads
E) Frequency of conditional branches
* All normalized to a scale of 0 to 10.
[Figure: workloads α, β, γ plotted on characteristics A to E. Customized architectures: core and cache speed, each marked L or H.]
16
Example: Passing on the Burden
A) Working-set size
B) Branch predictability
C) Density of dependence chains
D) Frequency of loads
E) Frequency of conditional branches
* All normalized to a scale of 0 to 10.
[Figure: workloads α, β, γ plotted on characteristics A to E. Customized architectures: core and cache speed, each marked L or H.]
17
A More Accurate Solution
Represent workloads by their customized architectural configurations.
This allows direct and accurate evaluation of how well different workloads do on customized configurations.
We call this Configurational Workload Characterization.
18
Design Process Overview
How not to do it: Important workloads -> Select representative workloads based on workload behavior -> Rep. workloads -> Search for opt. core combination -> Optimal core combination
How to do it: Important workloads -> Customize a core for each workload (configurational characterization) -> Customized architectures -> Search for opt. core combination -> Optimal core combination
19
Pros & Cons
- More costly to determine.
+ Provides a closer-to-optimal design solution.
+ Provides a systematic approach.
+ Can be performed prior to the design phase that is critical for time-to-market.
20
XP-SCALAR
A superscalar design-space exploration framework: www4.ncsu.edu/~hhashem/xpscalar.htm
Uses SimpleScalar to perform cycle-accurate simulations.
Uses the CACTI model to approximate the access latency of the different units.
21
XP-SCALAR
Which parameters are varied:
Clock period
Processor width
Size of the issue queue
Size of the register file
Size of the load-store queue
Size of the L1 and L2 caches
22
XP-SCALAR
How they are varied:
a) The clock period is varied, and architecture parameters are adjusted to make latencies fit within pipeline stages.
b) The number of pipeline stages of a unit is varied, and its configuration is adjusted accordingly.
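The two adjustment modes above can be sketched as follows. This is a minimal illustration, not XP-SCALAR's actual code: the function name `largest_fitting_size` and the latency table are hypothetical, and the table stands in for the latencies that XP-SCALAR obtains from CACTI.

```python
# Hypothetical access latencies in picoseconds, keyed by cache size in KB.
# In XP-SCALAR these would come from CACTI; the numbers here are made up.
CACHE_LATENCY_PS = {8: 450, 16: 600, 32: 800, 64: 1100, 128: 1500}

def largest_fitting_size(latency_ps_by_size, period_ps, stages):
    """Largest size whose latency fits in `stages` cycles of `period_ps`."""
    budget_ps = period_ps * stages
    fitting = [size for size, lat in latency_ps_by_size.items() if lat <= budget_ps]
    return max(fitting) if fitting else None  # None: nothing fits the budget

# (a) Vary the clock period with a fixed single-stage access.
for period_ps in (1000, 660, 400):
    print(period_ps, largest_fitting_size(CACHE_LATENCY_PS, period_ps, stages=1))

# (b) Vary the number of pipeline stages at a fixed clock period.
for stages in (1, 2, 3):
    print(stages, largest_fitting_size(CACHE_LATENCY_PS, 660, stages))
```

In both modes the unit's size is derived from the time budget rather than chosen freely, which keeps every explored configuration consistent with its clock.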
23
Determining the Best Cores
Execute all benchmarks on each other's customized configurations.
From that, determine the best grouping through a complete search.
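A complete search of this kind might look like the sketch below. It rests on two assumptions that the slides do not state: each benchmark runs on whichever of the chosen cores gives it the highest IPT, and "avg." and "har." IPT mean the arithmetic and harmonic means across benchmarks. The benchmark names and IPT values are hypothetical.

```python
from itertools import combinations
from statistics import fmean, harmonic_mean

# IPT[benchmark][core] = IPT of `benchmark` on the core customized for `core`.
# Hypothetical numbers for illustration only.
IPT = {
    "a": {"a": 2.0, "b": 1.0, "c": 1.5},
    "b": {"a": 1.0, "b": 3.0, "c": 2.0},
    "c": {"a": 1.2, "b": 2.0, "c": 2.4},
}

def best_cores(ipt, k, mean):
    """Exhaustively pick the k customized cores that maximize mean(IPT)."""
    def score(cores):
        # Assumption: each benchmark runs on whichever chosen core suits it best.
        return mean([max(ipt[b][c] for c in cores) for b in ipt])
    return max(combinations(sorted(ipt), k), key=score)

print(best_cores(IPT, 1, fmean))          # best single core for average IPT
print(best_cores(IPT, 1, harmonic_mean))  # best single core for harmonic IPT
print(best_cores(IPT, 2, fmean))          # best pair for average IPT
```

With this toy matrix the best single core differs between the two means (core `b` for the average, core `c` for the harmonic mean), so the choice of metric alone can change the selected cores. The search enumerates every combination, so it is only practical for small benchmark sets.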
24
Best Core Results
                                                    customized core(s)          avg. IPT  har. IPT
best config for avg. & har. IPT                     gcc                         2.06      1.57
2 best configs for avg. IPT                         parser, twolf               2.27      1.76
2 best configs for har. IPT                         gcc, mcf                    2.12      1.88
3 best configs for avg. IPT                         crafty, parser, twolf       2.35      1.82
3 best configs for har. IPT                         crafty, mcf, twolf          2.27      2.05
4 best configs for avg. & har. IPT                  crafty, mcf, parser, twolf  2.32      2.08
each benchmark on its own customized architecture   -                           2.38      2.12
25
The Effect of Subsetting
Subsetting a single pair of benchmarks results in the extraction of a totally different set of best cores.
26
Representation
[Figure: dendrogram.]
27
Conclusions
Architectural units are interdependent in how they are customized.
In the design of a heterogeneous CMP, subsetting can lead to performance degradation.