High Performance Stream Processing for Mobile Sensing Applications


1 High Performance Stream Processing for Mobile Sensing Applications
Graduate Research Symposium 2015. Farley Lai. Advisor: Dr. Octav Chipara.

2 Mobile Sensing Applications (MSAs)
Introduction. [Diagram: Speaker Identification pipeline: Speech Recording and Voice Detection (sensing), then Feature Extraction and HTTP Upload to Speaker Models (stream processing).] MSAs are an emerging class of applications. Consider an example, Speaker Identification, which has a sensing phase and a stream-processing phase. The sensing phase records speech from the microphone; the stream-processing phase involves voice detection and feature extraction, and the extracted features may be uploaded for matching against speaker models. While sensing is straightforward, stream processing can be arbitrarily complex and compute intensive. Moreover, these applications are expected to run in the background for long periods and deliver continuous results. It is therefore essential to achieve high-performance real-time processing and efficient resource management.

3 A Model for Stream Applications
StreamIt, a simple stream language from MIT, serves as our application model: a static schedule and pass-by-value semantics across FIFO channels. Let's take a closer look at the model, using the band-pass filter program, which allows only data samples in a particular frequency range to pass. At a high level, the program is a pipeline of filters connected through FIFO channels: Source, a Duplicate splitter feeding LPF1 and LPF2, a Round-Robin joiner, Subtract, and Sink. A filter is essentially a function with at most one input channel and one output channel; the only way to access its input channel is through the peek() and pop() operations, and the only way to access its output channel is through the push() operation. The pipeline may contain the split-join construct, which lets the data stream branch: here, the splitter duplicates its input and the joiner combines its inputs in round-robin fashion. Program execution follows a static schedule with an init phase that executes once and a steady phase that may repeat forever; each phase specifies the order and number of filter invocations. INIT PHASE: Source,3; DUP,3; LPF1,1; LPF2,1. STEADY PHASE: Source,1; DUP,1; RR,1; Sub,1; Sink. Although this model is simple, the pass-by-value semantics across FIFO channels may be inefficient: a filter may not reuse its input memory for its output.
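The peek/pop/push semantics over pass-by-value FIFO channels can be illustrated with a minimal Python sketch. This is an illustration only, not the StreamIt runtime; the channel class, coefficient values, and sample data are invented for the example.

```python
from collections import deque

class Channel:
    """FIFO channel with pass-by-value semantics."""
    def __init__(self):
        self.q = deque()
    def push(self, v):
        self.q.append(v)
    def pop(self):
        return self.q.popleft()
    def peek(self, i):
        return self.q[i]          # read element i without consuming it

def lpf(inp, out, coeff):
    """Low-pass filter: peek 3, pop 1, push 1 per invocation."""
    s = sum(inp.peek(i) * coeff[i] for i in range(3))
    inp.pop()
    out.push(s)

src_out, lpf_out = Channel(), Channel()
coeff = [0.25, 0.5, 0.25]          # made-up coefficients

# Init phase: the source fires 3 times to fill the LPF's peek window.
for v in [1.0, 2.0, 3.0]:
    src_out.push(v)

# Steady phase: one LPF firing and one source push per iteration.
results = []
for v in [4.0, 5.0]:
    lpf(src_out, lpf_out, coeff)
    src_out.push(v)
    results.append(lpf_out.pop())
print(results)                     # [2.0, 3.0]
```

Note that every value crosses a channel by copy; the inefficiency ESMS targets is exactly that a filter cannot reuse its input buffer for its output under these semantics.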

4 The Memory Management Challenge
Workload: memory-intensive operations on data streams, e.g., windowing, splitting, or appending. Goal: implement stream operations efficiently by reducing the memory footprint and the number of memory accesses. Challenges: capture component memory behaviors, avoid unnecessary memory copies, and exploit data sharing between components. Our answer is ESMS (Efficient Static Memory management for Streaming), which reduces data memory usage by up to 96% and yields up to an 8.7X speedup. We take on the memory management challenge because many stream operations, such as windowing, splitting, and appending, are memory intensive. The optimization consists of a component analysis of each filter, a whole-program analysis across filters, and layout generation.

5 Component Analysis of LPF (1)
Low Pass Filter (LPF). The goal of the component analysis is to capture the live range of each I/O element within one filter invocation: a live range tells us when an element is produced and when it is no longer used. Consider the low-pass filter from the band-pass filter program. Its work function computes a linear combination of the input elements with pre-computed coefficients:

work pop 1 push 1 peek 3 {
    float sum = 0;
    sum += peek(0) * coeff[0];
    sum += peek(1) * coeff[1];
    sum += peek(2) * coeff[2];
    pop();
    push(sum);
}

To facilitate the analysis, the function is converted to a control-flow graph (CFG): Entry, sum = 0, the three peek statements, pop(), push(), Exit. We then traverse the graph to process the statements.

6 Component Analysis of LPF (2)
To begin, each stream operation is labeled with a monotonic counter (MC) to record its order: peek(0) is MC 0, peek(1) is MC 1, peek(2) is MC 2, pop() is MC 3, and push() is MC 4. The input element live ranges LIN are initialized to the interval [0,0] and the output live ranges LOUT to empty. Since the analysis concerns only the stream operations peek(), pop(), and push(), it proceeds directly to the first one, peek(0), whose transfer function joins the current counter into the element's range: LIN[0] = LIN[0] ⊔ [0,0]. STATE: LIN[0] = [0,0], LIN[1] = [0,0], LIN[2] = [0,0]; LOUT empty.

7 Component Analysis of LPF (3)
Next, peek(1) at MC 1 extends the live range of input element 1: LIN[1] = [0,1]. STATE: LIN[0] = [0,0], LIN[1] = [0,1], LIN[2] = [0,0]; LOUT empty.

8 Component Analysis of LPF (4)
peek(2) at MC 2 extends the live range of input element 2: LIN[2] = [0,2]. STATE: LIN[0] = [0,0], LIN[1] = [0,1], LIN[2] = [0,2]; LOUT empty.

9 Component Analysis of LPF (5)
pop() at MC 3 consumes input element 0, extending its live range: LIN[0] = [0,3]. STATE: LIN[0] = [0,3], LIN[1] = [0,1], LIN[2] = [0,2]; LOUT empty.

10 Component Analysis of LPF (6)
Finally, push() at MC 4 produces output element 0: LOUT[0] = [4,4]. Final STATE: LIN[0] = [0,3], LIN[1] = [0,1], LIN[2] = [0,2]; LOUT[0] = [4,4].
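The walkthrough above can be condensed into a short sketch of the component analysis. This is our simplification to the straight-line, single-pop case of the LPF work function (the real analysis traverses a CFG); the function and variable names are ours.

```python
def component_analysis(ops):
    """Assign each stream operation a monotonic counter (MC) and record
    live ranges: inputs live from MC 0 to their last use, outputs live
    at the MC of their push."""
    lin, lout = {}, {}     # element index -> (first MC, last MC)
    pushed = 0
    for mc, op in enumerate(ops):
        if op[0] == 'peek':
            lin[op[1]] = (0, mc)      # extend last use of input element op[1]
        elif op[0] == 'pop':
            lin[0] = (0, mc)          # pop consumes input element 0
        elif op[0] == 'push':
            lout[pushed] = (mc, mc)
            pushed += 1
    return lin, lout

# LPF: peek(0); peek(1); peek(2); pop(); push(sum)
lin, lout = component_analysis(
    [('peek', 0), ('peek', 1), ('peek', 2), ('pop',), ('push',)])
print(lin)   # {0: (0, 3), 1: (0, 1), 2: (0, 2)}
print(lout)  # {0: (4, 4)}
```

The result matches the final state on the slide: input element 0 is live over MC [0,3], elements 1 and 2 over [0,1] and [0,2], and the output over [4,4].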

11 Whole Program Analysis
The component analysis captures live ranges within one filter invocation; the next step is to extend them to cover all the I/O elements through a whole-program analysis. A data element typically begins life as one filter's output and ends as another filter's input, so it has a start live range and an end live range:

Element      Start (as producer)   End (as consumer)
(LPF1, O0)   [4,4]                 [0,3] as (Subtract, I0)
(LPF2, O0)   [4,4]                 [0,4] as (Subtract, I1)

The whole-program analysis relates the two views and extends each live range by prefixing the schedule phase number and the filter invocation index to the original MC, giving tuples of the form (phase, invocation, MC):

Element      Start       End
(LPF1, O0)   (0, 6, 4)   (0, 9, 3)
(LPF2, O0)   (0, 7, 4)   (0, 9, 4)

In this form it is straightforward to check whether two live ranges overlap; here they do, since 6 < 7 < 9.
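Because the extended live ranges are ordered lexicographically, the overlap check reduces to ordinary interval logic. A minimal sketch, relying on Python's lexicographic tuple comparison (names are ours):

```python
def overlaps(a, b):
    """a, b: (start, end) pairs of (phase, invocation, MC) tuples.
    Two closed ranges overlap iff each starts before the other ends."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start <= b_end and b_start <= a_end

lpf1_o0     = ((0, 6, 4), (0, 9, 3))
lpf2_o0     = ((0, 7, 4), (0, 9, 4))
source_o0   = ((0, 0, 0), (0, 7, 3))
subtract_o0 = ((0, 9, 2), (0, 10, 0))

print(overlaps(lpf1_o0, lpf2_o0))        # True: 6 < 7 < 9
print(overlaps(source_o0, subtract_o0))  # False: Source:O0 dies first
```

Non-overlapping elements, like Source:O0 and Subtract:O0 above, are exactly the ones that may share a memory location during layout generation.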

12 Band Pass Filter Layout (1)
With the live-range information for the entire program, generating the memory layout is straightforward. The layout starts empty; beginning with the initialization phase, we simulate the schedule once. First, the Source filter executes three times and produces three output elements. Since their live ranges overlap, they must occupy three distinct memory locations: 0, 1, and 2.

Memory layout (initialization): 0: Source:O0; 1: Source:O1; 2: Source:O2

Live ranges, as (phase, invocation, MC):
Element         Start      End
(Source, O0)    (0,0,0)    (0,7,3)
(Source, O1)    (0,1,0)    (1,3,3)
(Source, O2)    (0,2,0)    (1,6,1)
(LPF1, O0)      (0,6,4)    (0,9,3)
(LPF2, O0)      (0,7,4)    (0,9,4)
(Subtract, O0)  (0,9,2)    (0,10,0)

13 Band Pass Filter Layout (2)
The splitter and joiner can be skipped because they do not produce new elements, so the next producer is LPF1. Its output (LPF1, O0), live from (0,6,4) to (0,9,3), overlaps the three Source elements, so it is placed at the next free location, 3.

Memory layout (initialization): 0: Source:O0; 1: Source:O1; 2: Source:O2; 3: LPF1:O0

14 Band Pass Filter Layout (3)
Next comes LPF2. Its output (LPF2, O0) starts at (0,7,4), after (Source, O0) ends at (0,7,3), so it can reuse location 0.

Memory layout (initialization): 0: Source:O0, then LPF2:O0; 1: Source:O1; 2: Source:O2; 3: LPF1:O0

15 Band Pass Filter Layout (4)
Subtract's output (Subtract, O0), live from (0,9,2) to (0,10,0), overlaps every element currently in the layout, so it takes the new location 4.

Memory layout (initialization): 0: Source:O0, then LPF2:O0; 1: Source:O1; 2: Source:O2; 3: LPF1:O0; 4: Subtract:O0

16 Band Pass Filter Layout (5)
Next is the steady phase. Source output elements 1 and 2 are still live, so they are copied and shifted to the beginning of the layout to preserve the same memory-access pattern, and the new element (Source, O3), live from (1,0,0) to (1,6,2), takes location 2. The remaining placements follow the same procedure.

Memory layout (steady): 0: Source:O1; 1: Source:O2; 2: Source:O3

Live ranges for the steady phase:
Element         Start      End
(Source, O1)    (0,1,0)    (1,3,3)
(Source, O2)    (0,2,0)    (1,6,1)
(Source, O3)    (1,0,0)    (1,6,2)
(LPF1, O1)      (1,2,4)    (1,5,3)
(LPF2, O1)      (1,3,4)    (1,5,4)
(Subtract, O1)  (1,5,2)    (1,6,0)

17 Band Pass Filter Layout (6)
(LPF1, O1), live from (1,2,4) to (1,5,3), reuses location 3, which is free once (LPF1, O0) ends at (0,9,3).

Memory layout (steady): 0: Source:O1; 1: Source:O2; 2: Source:O3; 3: LPF1:O1

18 Band Pass Filter Layout (7)
(LPF2, O1) starts at (1,3,4), after (Source, O1) ends at (1,3,3), so it reuses location 0.

Memory layout (steady): 0: Source:O1, then LPF2:O1; 1: Source:O2; 2: Source:O3; 3: LPF1:O1

19 Band Pass Filter Layout (8)
Finally, (Subtract, O1), live from (1,5,2) to (1,6,0), is placed at location 4. The resulting memory layout size decreases from 14 locations in the original StreamIt model to 7.

Memory layout (steady): 0: Source:O1, then LPF2:O1; 1: Source:O2; 2: Source:O3; 3: LPF1:O1; 4: Subtract:O1
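The placement decisions in the walkthrough follow a pattern that can be sketched as greedy first-fit over lexicographically ordered live ranges: each element takes the lowest location whose previous occupant has already died. This reproduces the initialization-phase layout above, but it is our reconstruction for illustration, not necessarily the exact ESMS algorithm (which also handles the steady-phase copy-shift).

```python
def generate_layout(live_ranges):
    """live_ranges: list of (name, start, end) with (phase, invocation, MC)
    tuples, sorted by start. Returns element name -> memory location."""
    slots = []      # slots[k] = end of the live range currently in slot k
    layout = {}
    for name, start, end in live_ranges:
        for k, busy_until in enumerate(slots):
            if busy_until < start:     # previous occupant is dead: reuse slot
                slots[k] = end
                layout[name] = k
                break
        else:                          # every slot is live: grow the layout
            layout[name] = len(slots)
            slots.append(end)
    return layout

ranges = [
    ('Source:O0',   (0, 0, 0), (0, 7, 3)),
    ('Source:O1',   (0, 1, 0), (1, 3, 3)),
    ('Source:O2',   (0, 2, 0), (1, 6, 1)),
    ('LPF1:O0',     (0, 6, 4), (0, 9, 3)),
    ('LPF2:O0',     (0, 7, 4), (0, 9, 4)),
    ('Subtract:O0', (0, 9, 2), (0, 10, 0)),
]
layout = generate_layout(ranges)
print(layout)   # LPF2:O0 reuses location 0; Subtract:O0 gets location 4
```

Running this yields the same placements as the slides: the Source elements in locations 0 to 2, LPF1:O0 in 3, LPF2:O0 reusing 0, and Subtract:O0 in 4.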

20 Evaluation: Memory Usage on Intel x86_64
ESMS removes the channel buffer allocations for splitters and joiners and enables more data reuse. ESMS may use different strategies to handle cases where an output element cannot reuse its input memory, but the trend is the same across strategies. Compared with the StreamIt cache optimization, ESMS reduces the data size of each benchmark by 45% to 96%, because it eliminates the channel buffer allocations for splitters and joiners and reuses more data; the saved memory can be used to buffer more sensor data. The code size is also reduced, by 73% on average. Location sharing prevents unnecessary memory copies of shared elements, so ESMS reduces both the channel buffer sizes and the number of memory operations from splitters, joiners, and reordering filters.

21 Evaluation: Speedup on Intel x86_64
Finally, we evaluate the performance speedup against the baseline StreamIt implementation. The average speedups for the AA, AoC, and IP strategies are 3, 3.1, and 3, respectively, while StreamIt's CacheOpt achieves only 1.07; CacheOpt is also not applicable to our macro benchmarks because it runs out of memory under large fine-grained FFT settings. ESMS improves performance by eliminating unnecessary memory operations and reducing the number of cache/memory references with a smaller working set. Since MSAs are supposed to run continuously, a significant speedup means lower CPU utilization, which may extend battery life and make the system more responsive.

22 Conclusions and Future Work
ESMS is effective for stream languages: it captures whole-program memory behavior, exploits reuse opportunities, and achieves significant performance improvements. Future work is a predictable performance/energy model for stream processing: among the many optimization configurations for single-, dual-, and quad-core platforms, programmers would specify real-time constraints such as latency, and the compiler would search for the configuration with the least power consumption. We believe this will be useful for the long-term use of MSAs.

23 Mobile Sensing Laboratory
Finally, here are the members of our Mobile Sensing Laboratory, led by Dr. Chipara: Dr. Leo, Dr. Marjan, and Dr. Behnam, along with Ph.D. students Shabih, Ryan, and myself (Farley). Please feel free to talk to them at the symposium. Now I think it's time to take your questions.

24 Thank You Questions?

