1
Static Memory Management for Efficient Mobile Sensing Applications
Farley Lai, Daniel Schmidt, Octav Chipara. Department of Computer Science, The University of Iowa. EMSOFT 2015. My name is Farley, from the University of Iowa in the US. Daniel just graduated, and he wrote some of the benchmarks. Prof. Chipara contributed significant ideas to this work and supervised the research. I am going to present our static memory management for mobile sensing applications.
2
Emerging Mobile Sensing Applications
A class of applications that process continuous input data streams and may produce continuous output streams, requiring real-time processing and efficient resource management. Example pipeline: Speech Recording → VAD → Feature Extraction → HTTP Upload → Speaker Identifier (with Speaker Models). A mobile sensing application needs to process continuous input and output data streams; high performance and efficient resource management are both important. Such applications usually consist of a sensing phase and a stream-processing phase. In the speaker identifier example, the sensing is the speech recording, while the stream processing involves voice activity detection (VAD) and feature extraction. The features are uploaded to a remote server for identification. Unlike the sensing, the stream processing can be arbitrarily complex and compute intensive, so it is essential to develop compiler optimizations for streaming.
3
The Memory Management Challenge
Workload: stream operations on frames of samples, e.g., windowing, splitting, or appending; stream operations tend to be memory intensive. Goal: implement stream operations efficiently, reducing both the memory footprint and the number of memory accesses. Challenges: handle complex interactions between components, avoid unnecessary memory copies, and enable data sharing between components. In terms of the performance bottleneck of stream processing, stream operations such as windowing, splitting, and appending tend to be memory intensive. Therefore, the goal is to implement stream operations efficiently, which involves reducing the memory footprint and the number of memory accesses. To achieve this, we have to handle complex component interactions, avoid unnecessary memory copies, and enable data sharing between components.
4
Approaches to Memory Management
Dynamic memory management: specialized data structures implement memory management, e.g., SigSeg [Girod et al. 2008], a linked list of buffered samples, which adds a level of indirection in accessing streaming data. Static memory management: no runtime overhead, but it requires precise knowledge of the variable live ranges, which is difficult to achieve in complex applications, and the analysis must be time-efficient to be included in compilers. Traditionally, memory management can be dynamic or static. Dynamic memory management can simply rely on garbage collection, or on runtime analysis and data structures. For example, the previous work SigSeg uses a linked-list-like structure to manage buffered samples. However, runtime management overhead due to a level of indirection is inevitable. On the other hand, static memory management does not suffer runtime overhead, but it requires precise knowledge of the variable live ranges. This is difficult in complex applications, and the analysis must also be efficient for practical use. [Girod 2008] L. Girod, Y. Mei, R. Newton, S. Rost, A. Thiagarajan, H. Balakrishnan, and S. Madden, "XStream: A Signal-Oriented Data Stream Management System," in ICDE, 2008.
5
Outline: Application model, Static analysis, Memory layout, Evaluation, Conclusions
For the remainder of the talk, I will first give an overview of the application model that our optimization applies to. Next, I will go through the static analysis and the memory layout, followed by the evaluation and conclusions.
6
A Model for Stream Applications
StreamIt – a synchronous data flow (SDF) language. An application is a graph of filters connected with FIFO channels. Filters use the limited memory operations pop(), peek(), and push(), and have known consumption and production rates. (Diagram: Filter::work() reads its INPUT channel via pop/peek and writes its OUTPUT channel via push.) We adopt a well-defined synchronous data-flow language from MIT called StreamIt to facilitate our optimization. In StreamIt, a program is represented as a graph of filters connected with FIFO channels. A filter has a work() function and serves as the basic processing unit. A filter must use peek() and pop() to access its input channel, and push() to access its output channel. Moreover, the data consumption and production rates in one filter invocation are known and fixed at compile time, as the sketch below shows.
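As a concrete illustration, here is a minimal low-pass filter written in StreamIt syntax. This is an illustrative sketch rather than code from the paper: the name, the parameters, and the uniform placeholder coefficients are our assumptions.

    float->float filter LowPassFilter(int taps, float cutoff) {
        float[taps] coeff;
        init {
            // Placeholder: uniform weights; a real design would derive
            // windowed-sinc coefficients from 'cutoff'.
            for (int i = 0; i < taps; i++)
                coeff[i] = 1.0 / taps;
        }
        work pop 1 push 1 peek taps {
            float sum = 0;
            for (int i = 0; i < taps; i++)
                sum += peek(i) * coeff[i];  // peek(i) reads without consuming
            push(sum);                      // produce exactly one output sample
            pop();                          // consume exactly one input sample
        }
    }

The declared rates (pop 1, push 1, peek taps) are what make the consumption and production rates known at compile time.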
7
A Model for Stream Applications
StreamIt – a synchronous data flow language. Applications are constructed hierarchically from pipelines of streams and split-joins (a splitter and a joiner). Data exchange follows pass-by-value semantics, so a naïve implementation would incur a significant number of copies. To compose a complex stream program, StreamIt provides hierarchical stream constructs, including pipelines and split-joins. Here, a stream is a placeholder that can be a filter, another pipeline, or a split-join. A pipeline composes a sequence of streams, and a split-join allows parallel data-flow branches. Because data exchange follows pass-by-value semantics, a naïve implementation would incur significant memory-copy overhead. Consider a band-pass filter example (Source → duplicate splitter → LPF1 and LPF2 → round-robin joiner → Subtract → Sink). The top-level stream is always a pipeline. A split-join splits the source input for the downstream low-pass filters to process; the results are joined in a round-robin fashion and subtracted to produce the final output, as sketched below.
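The band-pass example can be sketched as follows, modeled on the StreamIt BandPassFilter benchmark; the exact parameter names are our assumption, and LowPassFilter is the filter sketched on the previous slide.

    float->float pipeline BandPass(int taps, float low, float high) {
        add splitjoin {
            split duplicate;               // copy every sample to both branches
            add LowPassFilter(taps, low);
            add LowPassFilter(taps, high);
            join roundrobin;               // interleave the two branch outputs
        };
        add Subtract();                    // high-cutoff output minus low-cutoff output
    }

    float->float filter Subtract {
        work pop 2 push 1 {
            push(peek(1) - peek(0));       // one output per interleaved pair
            pop(); pop();
        }
    }

Under pass-by-value semantics, the duplicate splitter alone would copy every sample twice; this is exactly the kind of overhead the optimization targets.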
8
Insight: SDFs may be executed in a cyclo-static schedule
The complete memory behavior of the program may be observed within one execution of the schedule. Our solution: static analysis + memory layout. Schedule for the band-pass filter: INIT PHASE: Source×3, DUP×3, LPF1×1, LPF2×1, RR×1, Sub×1, Sink; STEADY PHASE: Source×1, DUP×1, LPF1×1, LPF2×1, RR×1, Sub×1, Sink. The insight of using StreamIt is that its model of computation follows a static schedule that describes the filter invocation order and counts. A schedule may have an optional initialization phase that executes only once, and a steady-state phase that can repeat forever. It is therefore possible to capture the complete memory behavior in one schedule iteration. Our solution is to first apply the static analysis to the schedule and then generate an efficient memory layout. In the example schedule, the init phase executes the Source and the duplicate splitter three times and the other filters once; in the steady phase, all the filters execute once.
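The steady-phase repetition counts follow from the standard SDF balance equations (standard SDF theory, stated here for context rather than taken from the slides): for every channel from filter A to filter B,

    \[ r_A \cdot \mathit{push}_A \;=\; r_B \cdot \mathit{pop}_B . \]

In the band-pass graph every push and pop rate is 1, so the minimal solution fires each filter once per steady iteration, matching the STEADY phase above; the extra Source and Duplicate firings in the INIT phase are consistent with prefilling the peek windows of the low-pass filters.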
9
Component Analysis: Location Sharing and Temporal Sharing
Location sharing: an output element is pushed from an unmodified input element; each I/O element is associated with a pop/push index. Temporal sharing: an output element reuses an input element's storage; each I/O element is associated with a live range [i, j]. The analysis builds on abstract interpretation: we build a control-flow graph (CFG) for each filter and abstractly interpret its memory operations. Our static analysis consists of the component analysis and the whole-program analysis. The component analysis analyzes the work() function of each filter; the goal is to capture location-sharing and temporal-sharing opportunities. Location sharing associates unmodified input elements with the corresponding output elements, where the input and output elements are identified by their respective pop and push indices. Temporal sharing allows an output element to reuse an input element's storage; the input and output element live ranges must be captured for safe reuse. This framework builds on abstract interpretation and data-flow analysis: each filter's work() function is represented as a CFG, and each memory operation is described by abstract interpretation.
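To make the two notions concrete, consider this hypothetical filter (ours, not from the paper):

    int->int filter ShareExample {
        work pop 2 push 2 {
            push(peek(0));      // pushed unmodified: a location-sharing candidate,
                                // so the output element may alias the input element
            push(peek(1) + 1);  // a computed value: no location sharing, but it may
                                // still reuse the storage of a dead input (temporal)
            pop(); pop();
        }
    }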
10
Component Analysis: Abstract interpretation of memory operations
The abstract state tracks: a memory counter (MC) giving the relative order of operations; the indices of the current push (out) and pop (in); and a live range for each input (L_IN) and output (L_OUT) element. Indices and live ranges are represented as intervals. Subset of rules for determining live ranges:
pop():  L_IN[in] ← L_IN[in] ⊔ [MC, MC];  in ← in + 1;  MC ← MC + 1
push(): L_OUT[out] ← L_OUT[out] ⊔ [MC, MC];  out ← out + 1;  MC ← MC + 1
join:   (MC1, in1, out1) joined with (MC2, in2, out2) gives (MC = max(MC1, MC2), in = in1 ⊔ in2, out = out1 ⊔ out2)
The abstract interpretation tracks the memory counter, the element push and pop indices, and the live ranges. The memory counter records the relative order of operations; out and in denote the push and pop indices, and L denotes an element live range. Indices and live ranges are treated as intervals for the set operators.
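The next three slides trace these rules over a small CFG. A hypothetical work() body consistent with that CFG (our reconstruction; note that the branch makes the push count input dependent, which a fixed StreamIt rate declaration would not normally allow and which forces the conservative estimate on the last slide of the example):

    int->int filter Example {
        work pop 1 push 2 {
            int v = pop();      // MC=0: L_IN[0] joins [0,0]
            if (v > 0)
                push(0);        // MC=1 on this path only: L_OUT[0] joins [1,1]
            push(v);            // after the join, the push index is [0,1],
                                // so this write is input dependent
        }
    }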
11
Example of Component Analysis
RULE applied: pop. L_IN[in] ← L_IN[in] ⊔ [MC, MC]; in++; MC++.
STATE before: MC = 0, in = 0, out = 0; L_IN[0] = [0,0], L_OUT[0] = L_OUT[1] = ∅.
Effect: L_IN[0] = L_IN[0] ⊔ [0,0] = [0,0].
STATE after: MC = 1, in = 1, out = 0.
Let's go through an example CFG. The MC is initialized to zero, the input element live range is initialized to the interval [0,0], and the output element live ranges are initialized to be empty. Then, a pop() is applied first: the input element live range joined with MC = 0 is still [0,0]. After that, the MC and the input index are incremented.
12
Example of Component Analysis
RULE applied: push. L_OUT[out] ← L_OUT[out] ⊔ [MC, MC]; out++; MC++.
STATE before: MC = 1, in = 1, out = 0; L_IN[0] = [0,0], L_OUT[0] = L_OUT[1] = ∅.
Effect: L_OUT[0] = L_OUT[0] ⊔ [1,1] = [1,1].
STATE after: MC = 2, in = 1, out = 1.
Next, we reach the push(0) in the right branch. The output element live range is evaluated to [1,1] because MC = 1. Then, the MC and the push index are incremented.
13
Example of Component Analysis
RULE applied: join. (MC1, in1, out1) joined with (MC2, in2, out2) gives (MC = max(MC1, MC2), in = in1 ⊔ in2, out = out1 ⊔ out2).
Branch states: left branch (no push): MC = 1, in = 1, out = 0; right branch (after push(0)): MC = 2, in = 1, out = 1.
STATE after join: MC = 2, in = 1, out = [0,1]; L_IN[0] = [0,0], L_OUT[0] = [1,1].
Next, we reach the join block and merge the information from both branches. We take the maximum MC; the union of the pop indices is unchanged; the union of the push indices becomes [0,1].
14
Example of Component Analysis
RULE applied: push. L_OUT[out] ← L_OUT[out] ⊔ [MC, MC]; out++; MC++.
STATE before: MC = 2, in = 1, out = [0,1]; L_IN[0] = [0,0], L_OUT[0] = [1,1], L_OUT[1] = ∅.
Effect: L_OUT[[0,1]] = L_OUT[[0,1]] ⊔ [2,2], i.e., both L_OUT[0] and L_OUT[1] join [2,2].
STATE after: MC = 3, in = 1, out = [1,2].
There is one last push(x). Its live range is evaluated to [2,2] because the current MC is 2, but its push index is between 0 and 1. This implies the memory behavior is input dependent and non-deterministic, so the compiler must take a conservative estimate.
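Collecting the whole trace (our summary, applying the interval join to the recorded ranges), the fragment for this work() ends with

    \[ L_{\mathit{IN}}[0] = [0,0], \qquad L_{\mathit{OUT}}[0] = [1,1] \sqcup [2,2] = [1,2], \qquad L_{\mathit{OUT}}[1] = [2,2], \]

and the final abstract state is (MC, in, out) = (3, 1, [1,2]).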
15
Whole Program Analysis
The component analysis constructs a memory fragment that captures the live ranges for temporal reuse and the location-sharing edges. The whole-program analysis constructs a memory graph: it stitches together the memory fragments, simulates the schedule to connect location-sharing edges into paths, and extends live ranges with the phase number and invocation index. Our approach: the analysis is precise when there is no input dependency; otherwise, it is a sound approximation. After the component analysis is done, the live ranges and the location-sharing edges between input and output elements are saved in a memory fragment. The whole-program analysis then constructs a memory graph by stitching the fragments within one schedule iteration. The location-sharing edges between filters are connected into paths, and a location-shared element's live range takes the union of the live ranges along its path. We then scale the live ranges to cover all the input and output elements across schedule phases and invocation indices; the intuition is to combine the phase number and the invocation index with the MC in the live ranges. As a result, given no input dependency, our analysis characterizes the complete memory behavior of the entire program, because there is no pointer aliasing in StreamIt, and it is guaranteed to terminate in one schedule iteration. If there is input dependency, our compiler simply enforces FIFO access and enlarges the memory layout for safety.
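One plausible encoding of the extended live ranges (our reading of the description above, not necessarily the paper's exact formulation) makes timestamps tuples ordered lexicographically, so ranges from different phases and invocations remain comparable:

    \[ t = (\mathit{phase},\ \mathit{invocation},\ \mathit{MC}), \qquad t_1 < t_2 \iff t_1 \text{ precedes } t_2 \text{ lexicographically.} \]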
16
Memory Layout: Empirical insights
Split-joins can be eliminated when they only move location-shared elements, and a filter usually can reuse its input memory. Heuristic approaches resolve temporal-reuse conflicts. Based on the static analysis, it is straightforward to generate an efficient memory layout following the data-sharing insights. First, location sharing avoids the memory copies caused by split-joins and reordering filters. Second, temporal sharing allows a filter to reuse its input memory. However, we still need to resolve temporal-reuse conflicts due to non-empty live-range intersections. Consider deciding the output memory layout of a filter B that consumes the output of a filter A. If all the input elements are temporally reusable, B's output simply reuses its input memory from A (left figure). Otherwise, we must resolve the live-range conflicts, and we currently offer three strategies. Always Append (AA) appends to enlarge the layout regardless of temporal reuse (central figure). Append on Conflict (AoC) appends to enlarge the layout whenever there is any live-range conflict, so it acts either as the left figure or as the central figure. Insert in Place (IP) reuses as much as possible and inserts the output by shifting the conflicted region to enlarge the layout; in the right figure, B's output reuses storage up to the conflicted region, then inserts by shifting that region. (Figures: no conflict; Append on Conflict (AoC); Insert-in-Place (IP).)
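The conflict test implied above can be stated compactly: an output element o may safely reuse the slot of an input element i only when their live ranges do not intersect, i.e., the input dies before the output is first written (our formalization of the slide's wording):

    \[ L_{\mathit{IN}}(i) \sqcap L_{\mathit{OUT}}(o) = \emptyset . \]

Any non-empty intersection is a temporal-reuse conflict, which AA, AoC, and IP resolve by enlarging the layout in different ways.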
17
Experimental Setup: Intel x86_64 on Mac OS X 10.10.3
Evaluation: a 3 GHz Intel Xeon E v2 CPU with 32 KB L1 instruction and 32 KB L1 data caches, 256 KB L2, and 25 MB L3 caches. StreamIt compiler: the baseline uses the default settings without optimizations; the cache optimizations are enabled with –cacheopt; gcc –O3 compiles the generated C/C++ code. Benchmarks: 11 micro-benchmarks from StreamIt and 3 macro-benchmarks from real mobile sensing applications: BeepBeep [Peng, C., et al. 2007], MFCC, and Crowd [Xu, C., et al. 2013]. Next, I will present the experimental results on the Intel platform; the results on the ARM Android platform are available in the paper. Our baseline is the default StreamIt compiler without optimizations. We also compare with the StreamIt cache optimization, which increases the number of filter invocations to trade space for cache locality. For a fair comparison, 11 of the benchmarks are from the StreamIt package, and we implemented 3 macro-benchmarks extracted from real mobile sensing applications: BeepBeep performs audio localization, MFCC is the feature extraction of the speaker identifier, and Crowd counts co-located speakers.
18
Memory Usage on Intel x86_64
ESMS reduces both the channel buffer sizes and the number of memory operations from splitters, joiners, and reordering filters. (Left figure: code size, 73% reduction on average; right figure: data size, 45% to 96% reductions.) The right figure shows data-size reductions of 45% to 96% compared with the cache optimization, and the left figure shows a code-size reduction of 73% on average. This is because location sharing prevents unnecessary memory copies of shared elements; therefore, ESMS reduces both the channel buffer sizes and the number of memory operations from splitters, joiners, and reordering filters.
19
Speedup on Intel x86_64 Compared with baseline StreamIt
The average speedups of AA, AoC, and IP are 3.0, 3.1, and 3.0, while the average speedup of CacheOpt is merely 1.07. ESMS improves performance by eliminating unnecessary memory operations and reducing cache/memory references. Finally, we evaluate the performance speedup against the baseline StreamIt. Overall, the average speedup of ESMS is about 3×, while the average speedup of the StreamIt cache optimization is merely 1.07. The StreamIt cache optimization is not applicable to our macro-benchmarks because it runs out of memory due to large fine-grained FFT settings. To sum up, ESMS improves performance by removing unnecessary memory operations and by reducing the number of cache/memory references through a smaller working set.
20
Conclusions: Static memory management is effective for stream languages
Whole-program memory behavior can be characterized, and both location- and temporal-sharing opportunities are exploited; the performance improvement comes from fewer memory operations and references. ESMS provides significant performance improvements: 45% to 96% data-size reduction, 73% code-size reduction, and a 3× speedup. The per-component fragment information is reusable and can be exposed without the source code.
21
Acknowledgements. CSense Toolkit. National Science Foundation (NeTS grant #1144664) and the Carver Foundation. We especially thank and acknowledge our funding sources. Now, I think it's time to take your questions.
22
Thank You. Questions?