
1 Warehouse-Scale Computing Mu Li, Kiryong Ha 10/17/2012 15-740 Computer Architecture

2 Overview
Motivation: explore architectural issues as computing moves toward the cloud
1. Impact of sharing memory-subsystem resources (LLC, memory bandwidth, ...)
2. Maximizing resource utilization by co-locating applications without hurting QoS
3. Inefficiencies of traditional processors for running scale-out workloads

3 Overview

Paper                                                                       | Problem                                   | Approach
The Impact of Memory Subsystem Resource Sharing on Datacenter Applications | Sharing in the memory subsystem           | Software
Bubble-Up                                                                   | Resource utilization                      | Software
Clearing the Clouds                                                         | Inefficiencies for scale-out workloads    | Software
Scale-Out Processors                                                        | Improving scale-out workload performance  | Hardware

4 Impact of memory subsystem sharing

5 Motivation & problem definition
– Machines are multi-core and multi-socket.
– For better utilization, applications must share the Last Level Cache (LLC) and Front Side Bus (FSB).
→ It is important to understand the memory-sharing interactions between (datacenter) applications.

6 Impact of thread-to-core mapping
– Sharing cache, separate FSBs (XX..XX..)
– Sharing cache, sharing FSBs (XXXX....)
– Separate caches, separate FSBs (X.X.X.X.)
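
On Linux, these mappings can be reproduced by pinning threads to specific cores. A minimal sketch, assuming a hypothetical dual-socket, 8-core layout where cores share a cache in pairs (0-1, 2-3, 4-5, 6-7) and each socket of four cores has its own FSB; the real topology must be checked with lstopo or /proc/cpuinfo:

```python
import os

# Hypothetical topology: L2 caches shared by core pairs (0-1, 2-3, 4-5, 6-7);
# cores 0-3 sit behind FSB 0, cores 4-7 behind FSB 1. Verify on the machine.
MAPPINGS = {
    "XX..XX..": {0, 1, 4, 5},  # pairs share caches, pairs split across FSBs
    "XXXX....": {0, 1, 2, 3},  # pairs share caches, all threads on one FSB
    "X.X.X.X.": {0, 2, 4, 6},  # every thread on its own cache, FSBs split
}

def pin(mapping: str, pid: int = 0) -> None:
    """Restrict a process (0 = the caller) to the cores of one TTC mapping."""
    os.sched_setaffinity(pid, MAPPINGS[mapping])

if __name__ == "__main__":
    pin("XX..XX..")
    print("running on cores:", sorted(os.sched_getaffinity(0)))
```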

7 Impact of thread-to-core mapping
– Performance varies by up to 20% across mappings.
– Each application shows a different trend.
– TTC behavior changes depending on the co-located application.

8 Observations
1. Performance can swing significantly based simply on how application threads are mapped to cores.
2. The best TTC mapping changes depending on the co-located program.
3. Application characteristics that impact performance: memory bus usage, cache-line sharing, and cache footprint (see the counter sketch below).
– Example: CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC or FSB. STITCH uses more bus bandwidth, so a co-located CONTENT ANALYZER will contend with it on the FSB.
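
These characteristics can be approximated with hardware performance counters. A minimal sketch driving Linux `perf stat` from Python; the generic event aliases (LLC-loads, LLC-load-misses, bus-cycles) may need remapping on a given machine, and ./workload is a hypothetical benchmark binary:

```python
import subprocess

def profile(cmd: list[str]) -> str:
    """Run `cmd` under perf stat, sampling LLC misses (a proxy for cache
    footprint) and bus cycles (a proxy for memory-bus usage)."""
    perf = ["perf", "stat",
            "-e", "LLC-loads,LLC-load-misses,bus-cycles",
            "--", *cmd]
    result = subprocess.run(perf, capture_output=True, text=True)
    return result.stderr  # perf stat prints its summary to stderr

if __name__ == "__main__":
    print(profile(["./workload"]))  # hypothetical binary
```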

9 Increasing Utilization in Warehouse-Scale Computers via Co-location

10 Increasing utilization via co-location
Motivation
– Cloud providers want higher resource utilization.
– However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization.
→ Precise prediction of shared-resource interference enables better utilization without violating QoS.

11 Bubble-Up methodology
1. QoS sensitivity curve: measure the application's sensitivity by iteratively increasing the amount of pressure on the memory subsystem.
2. Bubble score: measure the amount of pressure the application puts on a reporter.

12 Better utilization
Now we know:
1) how QoS changes with bubble size (the QoS sensitivity curve)
2) how much pressure an application puts on others (the bubble score)
→ We can co-locate applications while estimating the resulting change in QoS, as in the sketch below.
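
A minimal sketch of the prediction step, assuming the sensitivity curve and bubble score have already been measured; the names, numbers, and step-function lookup are illustrative, not the paper's implementation:

```python
import bisect

def predict_qos(sensitivity_curve: dict[int, float], bubble_score: int) -> float:
    """Predict QoS (fraction of solo performance) when co-located with an
    application whose measured memory pressure equals `bubble_score`.
    `sensitivity_curve` maps bubble size (MB of pressure) -> measured QoS."""
    sizes = sorted(sensitivity_curve)
    # Conservatively round the co-runner's pressure up to the next
    # bubble size that was actually measured.
    i = min(bisect.bisect_left(sizes, bubble_score), len(sizes) - 1)
    return sensitivity_curve[sizes[i]]

# Illustrative numbers only.
curve = {5: 0.99, 10: 0.97, 20: 0.92, 40: 0.85}
print(predict_qos(curve, bubble_score=15))           # -> 0.92
print(predict_qos(curve, bubble_score=15) >= 0.95)   # QoS-target check -> False
```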

13 Scale-out workloads

14 Examples of scale-out workloads:
– Data Serving
– MapReduce
– Media Streaming
– SAT Solver
– Web Frontend
– Web Search

15 Execution-time breakdown
A major share of execution time is spent waiting on cache misses → a clear micro-architectural mismatch.

16 Frontend inefficiencies
– Cores idle due to high instruction-cache miss rates.
– L2 caches increase average I-fetch latency.
– Excessive LLC capacity leads to long I-fetch latency.
How to improve? Bring instructions closer to the cores. (See the latency sketch below.)
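
The latency claim can be made concrete with the standard average-memory-access-time identity. A sketch with illustrative miss rates and latencies, not measurements from the paper:

```python
def ifetch_amat(l1i_hit: float, l1i_mr: float, l2_lat: float, l2_mr: float,
                llc_lat: float, llc_mr: float, mem_lat: float) -> float:
    """Average instruction-fetch latency (cycles) through a 3-level hierarchy:
    each level's latency is paid only on a miss at the level above."""
    return l1i_hit + l1i_mr * (l2_lat + l2_mr * (llc_lat + llc_mr * mem_lat))

# A large instruction footprint misses often in L1I and L2, so the
# tens-of-cycles latency of a big LLC is paid on the critical fetch path.
print(ifetch_amat(l1i_hit=2, l1i_mr=0.10,
                  l2_lat=12, l2_mr=0.50,
                  llc_lat=40, llc_mr=0.10,
                  mem_lat=200))   # -> 6.2 cycles per fetch on average
```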

17 Core inefficiencies
– Low instruction-level parallelism precludes effective use of the full core width.
– Low memory-level parallelism underutilizes reorder buffers and load-store queues.
How to improve? Run many threads at once: a multi-threaded, multi-core architecture. (A toy calculation follows.)
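
A toy calculation of the width argument, with made-up numbers:

```python
def issue_slot_utilization(ilp_per_thread: float, threads: int,
                           issue_width: int) -> float:
    """Fraction of issue slots filled when each thread sustains
    `ilp_per_thread` independent instructions per cycle (idealized:
    threads' demands simply add until the core width saturates)."""
    return min(ilp_per_thread * threads, issue_width) / issue_width

# One scale-out thread at ILP ~2 leaves half of a 4-wide core idle;
# a second hardware thread can (ideally) fill the remaining slots.
print(issue_slot_utilization(2.0, threads=1, issue_width=4))  # 0.5
print(issue_slot_utilization(2.0, threads=2, issue_width=4))  # 1.0
```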

18 Data-access inefficiencies
– A large LLC consumes area but does not improve performance.
– Simple data prefetchers are ineffective.
How to improve? Shrink the LLC and leave room for more cores.

19 Bandwidth inefficiencies
– The lack of data sharing makes rich coherence and interconnect support unnecessary.
– Off-chip bandwidth exceeds the workloads' needs by an order of magnitude.
How to improve? Scale back the on-chip interconnect and off-chip memory bus to make room for more cores.

20 Scale-out processors
In summary: conventional designs spend too much area on the LLC, interconnect, and memory bus, and too little on cores. Scale-out processors rebalance the die and improve throughput by 5x-6.5x. (A first-order area sketch follows.)
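
The rebalancing argument can be sketched as first-order area arithmetic; all numbers below are made up for illustration and are not the paper's figures:

```python
def core_count(die_area_mm2: float, llc_area_mm2: float,
               area_per_core_mm2: float = 10.0) -> int:
    """First-order model: throughput scales with core count, and LLC
    capacity beyond the workload's needs contributes nothing, so
    shrinking the LLC buys cores."""
    return int((die_area_mm2 - llc_area_mm2) // area_per_core_mm2)

# Illustrative 400 mm^2 die: shrinking a 200 mm^2 LLC to 40 mm^2
# nearly doubles the core count, and scale-out throughput with it.
print(core_count(400, llc_area_mm2=200))  # 20 cores
print(core_count(400, llc_area_mm2=40))   # 36 cores
```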

21 Q&A or Discussion

22 Supplementary slides

23 Datacenter applications

Application      | Metric             | Type
content analyzer | throughput         | latency-sensitive
bigtable         | average latency    | latency-sensitive
websearch        | queries per second | latency-sensitive
stitcher         | –                  | batch
protobuf         | –                  | batch

– Google's production applications

24 Key takeaways
TTC behavior is mostly determined by:
– Memory bus usage (for FSB sharing)
– Data sharing: cache-line sharing
– Cache footprint: use last-level cache misses to estimate footprint size
Example
– CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC or FSB.
– STITCH actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share an FSB with STITCH.

25 Prediction accuracy for pairwise co-locations of Google applications: 1% prediction error on average.

