Towards a Smart Workload Generator on RAMP Archana Ganapathi, David Patterson, Anthony Joseph {archanag, pattrsn, cs.berkeley.edu.

Towards a Smart Workload Generator on RAMP Archana Ganapathi, David Patterson, Anthony Joseph {archanag, pattrsn, adj} @ cs.berkeley.edu

RADLAB goals Help Single Operator/ Developer sites become Google-scale Eliminate SW/HW obstacles for scaling Tools to identify/fix problems Use RAMP to move websites from right to left Source: Washington Post 3/31/2006 Top 50 Web Domains

RADLAB Goals (2) YouTube.com 2006 Daily Traffic Ranking Challenges: Scalability Configurability Single person operation Cost-effectiveness Reproducibility and Observability Source: Alexa.com

RAMP for Time Travel UCB goal for RAMP: Data center in a box Google/Amazon.com O(10,000) processors in data center Anticipate load of 3-6 months in future for fast moving company Smart Workload Generation: “smart gateware” design informed by SML analysis of workload Time Dilation on Emulated Machines: Try software/ config changes and observe behavior prior to deployment Targeted Component-specific Load Generation: Stress-test components in the critical path to determine performance limitations RAMP as emulation environment + workload generator

Building Blocks Workload generation engine Parser to extract data from server response Workload description language to specify primitives to compile onto an FPGA Machine Learning techniques to discover web interactions ML to determine interactions User scripts Real System Workload Generator

Naïve Workload Generator (CS252 class project by Lorenzo Orecchia and Madhur Tulsiani) Generate the data-set using analytical models server file size distribution request size distribution relative file popularity Derive URL connectivity graph, load in memory Circuit logic to perform random walk on graph. We achieve our goal of scalability: Graph Size: 1048576 Memory Usage: 21 MB Total data size: 21083 MB

Scalability: RAMP DRAM limits Given: 2GB DRAMs, 4 DRAM banks per FPGA, 100 MHz clock cycle ~ 10 ns. Per cycle = 21 bits per walk (21Mbits for 1M walks) Assume 10 clock cycles per access => 100 ns 10 million accesses per second per bank per walk Given four DRAM banks = 40M accesses per sec Compare: Google receives 2000 requests per second

Scalability: RAMP Ethernet limits Given: 20 10Gbit/sec Ethernet ports per board Assume can generate 100 million accesses/sec Naïve Response-ignorant workload generation: 4 bytes for URL + check sum + header… ~ 32 bytes => 3 GB for 100 million accesses (per second) Smart Response-driven workload generation: Google: 23KB Flickr: 45KB CNN: 100KB Assume up to 200KB response can receive 50K responses per second per port = 1M responses handled by 20 10G ports.

Ethernet limits cont. About 1000X Google average with Smart (response-driven) generation Mixed RAMP emulation/workload generation even higher BW inside box Have plenty of headroom to tradeoff speed for greater accuracy of workload

Some Open Questions Limits on types of workloads? Workload trace sources? Web services, existing traffic generators, …? Role of Response-Ignorant trace generation? UDP, error/congestion-free TCP, …? Required level of fidelity for Response- Driven trace generation? How much of TCP FSM to model?

Question/Feedback?

Backup Slides

State of the art Hardware vs. software based Hammer, Optixia vs SURGE, SLAMd Tunability vs Automation SPECweb, TPC-W, Harpoon vs Optixia, SLAMd Realistic vs Synthetic SURGE, SLAMd, Harpoon vs TPC-W Generic vs App-Specific SLAMd, Harpoon vs TPC-W, Hammer Open-loop vs Closed-loop Partly-open loop is most realistic for web services

Workload Generator Next Steps Handle server responses Include “server response” states in logic Parse server response to identify current state Include think-time distribution User think-time + server response time What happens when things go wrong? Improve temporal/spatial locality Prefetch other URLs a page is linked to Take advantage of Zipfian popularity distribution

Sketch of Random Walk Module MEMORYMEMORY

Data-set parameters Graph Size Memory Usage Total data size Average file size 409650 KB2138 MB521 KB 655361 MB3121 MB47.6 KB 104857621 MB21083 MB20.11 KB

Circuit properties Device : Virtex-E Maximum delay Walk Module: 3.99 ns Memory Module: 5.82 ns Estimated frequency = (1000/9.81) = 101.93 MHz Number of LUTs per walk: 593 Number of slices per walk: 307

Towards a Smart Workload Generator on RAMP Archana Ganapathi, David Patterson, Anthony Joseph {archanag, pattrsn, cs.berkeley.edu.

Similar presentations

Presentation on theme: "Towards a Smart Workload Generator on RAMP Archana Ganapathi, David Patterson, Anthony Joseph {archanag, pattrsn, cs.berkeley.edu."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Towards a Smart Workload Generator on RAMP Archana Ganapathi, David Patterson, Anthony Joseph {archanag, pattrsn, cs.berkeley.edu.

Similar presentations

Presentation on theme: "Towards a Smart Workload Generator on RAMP Archana Ganapathi, David Patterson, Anthony Joseph {archanag, pattrsn, cs.berkeley.edu."— Presentation transcript:

Similar presentations

About project

Feedback