Slide 1: Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures
Amir Hormati¹, Yoonseo Choi¹, Manjunath Kudlur³, Rodric Rabbah², Trevor Mudge¹, and Scott Mahlke¹
¹University of Michigan, ²IBM T.J. Watson Research Lab, ³NVIDIA Corp.
Slide 2: Cores are the New Gates (Shekhar Borkar, Intel)

[Chart: # cores/chip (1–512) vs. year (1975–2010), populated with unicore processors (8086 through Core2Duo, Itanium 2, Power6), homogeneous multicores (RAW, Niagara, Cell, AMBRIC, PicoChip, and others), heterogeneous multicores (Xbox 360, NVIDIA G80/CUDA, Larrabee, AMD Fusion), and the programming approaches targeting each class (C/C++/Java, X10, Fortress, Ct, Peakstream, Rapidmind, streaming languages). Courtesy: Gordon'06.]

Speaker notes: Unicore processors worked well for 30 years, primarily because of shrinking transistors and increasing operating frequencies. Around 2000, the industry hit what can collectively be called the complexity wall (power, memory, and ILP limits), mainly caused by large monolithic structures. In the past few years, processor designers have turned to multicore designs to continue performance scaling, and we now see designs ranging from 2 to 256 cores on a chip. Some are homogeneous; others are heterogeneous, pairing general-purpose processors with data engines and special accelerators.

How were these processors programmed? Uniprocessors were programmed in sequential von Neumann languages; C/C++ is shown simply because it was the most popular, and language research focused on programmer productivity. Parallel libraries and languages targeted networks of workstations and were mainly used by scientists working on supercomputers. Now that multicores have become commodity hardware, every company is scrambling to figure out how to program them. The chart samples what companies, both small and large, have tried in just the past five years: some are new languages, some are extensions, and others take a library approach. They vary widely in their underlying programming model. One notable model is streaming, and the approaches above provide support for streaming in some form.
Slide 3: Streaming Computing Is Everywhere!
- Prevalent computing domain with applications in embedded systems, desktops, and high-end servers
  - Emphasis here is on embedded systems (cell phones, handhelds, DSPs), but it also extends to desktops and high-end servers
- Performance and power are major concerns in embedded systems
  - Important to make execution as efficient as possible: optimizations are key
- In streaming codes:
  - Locality: computation and data locality
  - Easy to extract and exploit both parallelism and locality
  - Can have a major impact on the memory system
Slide 4: StreamIt
- Exposes different types of parallelism
- Main constructs:
  - Filter: encapsulates computation; may be stateful or stateless
  - Pipeline: expresses pipeline parallelism
  - Splitjoin: expresses task/data-level parallelism
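To make the three constructs concrete, here is a small Python model (hypothetical helper names, not StreamIt syntax) in which a filter is a function over a list, `pipeline` chains filters, and `splitjoin` duplicates the stream to each branch and joins the results round-robin:

```python
def pipeline(*filters):
    """Chain filters so that each filter's output feeds the next."""
    def run(stream):
        for f in filters:
            stream = f(stream)
        return stream
    return run

def splitjoin(*branches):
    """Duplicate the stream to every branch, then round-robin-join results."""
    def run(stream):
        outputs = [list(b(stream)) for b in branches]
        # Interleave the branch outputs element by element (round-robin joiner).
        return [x for group in zip(*outputs) for x in group]
    return run

# Two stateless filters over integer streams.
double = lambda s: [x * 2 for x in s]
square = lambda s: [x * x for x in s]

graph = pipeline(double, splitjoin(double, square))
print(graph([1, 2, 3]))  # [4, 4, 8, 16, 12, 36]
```

Real StreamIt filters additionally declare `push`/`pop` rates and may carry state; this sketch only shows how the graph composes.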
Slide 5: StreamIt Graph Tuning
- Parallelism can be tuned in streaming programs:
  - Horizontal replication
  - Horizontal fusion
  - Vertical fusion
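Continuing the Python model from the previous slide, horizontal replication of a stateless filter can be sketched as a round-robin splitter dealing items to n replicas and a round-robin joiner interleaving their outputs, which preserves the filter's output order (this sketch assumes a stateless filter and a stream length divisible by n; fusion is the inverse transformation, collapsing neighboring filters into one):

```python
def replicate(f, n):
    """Horizontal replication of a stateless filter f: a round-robin splitter
    deals items to n replicas, and a round-robin joiner interleaves their
    outputs, preserving the original output order. Assumes f is stateless
    and the stream length is divisible by n."""
    def run(stream):
        shards = [stream[i::n] for i in range(n)]   # round-robin splitter
        results = [f(shard) for shard in shards]    # n parallel replicas
        return [x for group in zip(*results) for x in group]  # joiner
    return run

double = lambda s: [x * 2 for x in s]
print(replicate(double, 2)([1, 2, 3, 4]))  # [2, 4, 6, 8] == double([1, 2, 3, 4])
```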
Slide 6: StreamIt Example
[Figure: three versions of an example stream graph A → splitter → B → joiner → C → D → E → F, annotated with work estimates (A: 10, splitter/joiner: 6, C: 246, D: 326, E: 566, F: 10): the original with B1 and B2 at 43 each, a horizontally replicated version with B1–B4 at 21.5 each, and a fused version with B at 86 and C, D, E fused into a single CDE actor at 1138.]
Slide 7: What Are We Solving?
[Figure: a stream graph (A, splitter, B1–B4, joiner, C, D1, D2, E, F) being mapped onto cores, each with an attached memory.]
- Performing graph modulo scheduling on a stream graph statically.
- What happens in case of dynamic resource changes?
Slide 8: Target Architecture
- A master processor acts as a controller.
- Each slave processor has its own local store and DMA engine.
- An interconnect network connects all the components together.
[Figure: master processor, main memory, and slave processors (each with a local store and DMA engine) connected by an interconnect.]
Slide 9: Overview of Flextream
Goal: perform adaptive stream graph modulo scheduling.
Static (offline) phases find an optimal schedule for a virtualized member of a family of processors:
- Prepass replication: adjusts the amount of parallelism for the target system by replicating actors.
- Work partitioning: finds an optimal modulo schedule for the virtualized target.
Dynamic (online) phases perform light-weight adaptation of the schedule to the current configuration of the target hardware:
- Partition refinement: tunes the actor-to-processor mapping to the real configuration of the underlying hardware (load balancing).
- Stage assignment: specifies how actors execute in time under the new actor-to-processor mapping.
- Buffer allocation: tries to efficiently fit the storage requirements of the new schedule into the available memory units, emitting MSL commands.
Slide 10: MSL: Multi-Core Streaming Layer
- An instruction set for heterogeneous multi-core systems: a set of high-level commands for
  - Actor commands (loading/unloading)
  - Buffer commands (allocating local/global buffers)
  - Data-transfer commands (managing DMAs)
- Flextream's online layer uses these commands to adapt the static schedule.
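As a rough illustration of how the online layer might use such commands, the sketch below invents stand-in opcode names for the three command groups (the slide does not give the real MSL opcodes) and expresses one adaptation step, migrating an actor between slave processors, as a short command sequence:

```python
from enum import Enum, auto

class MSL(Enum):
    """Hypothetical stand-ins for the three MSL command groups;
    the real opcode names are not shown on this slide."""
    LOAD_ACTOR = auto()     # actor commands: load/unload actor code
    UNLOAD_ACTOR = auto()
    ALLOC_BUFFER = auto()   # buffer commands: local or global buffers
    FREE_BUFFER = auto()
    DMA_TRANSFER = auto()   # data-transfer commands: program a DMA engine

def migrate(actor, src, dst):
    """One schedule-adaptation step expressed as an MSL command sequence."""
    return [(MSL.UNLOAD_ACTOR, actor, src),
            (MSL.ALLOC_BUFFER, actor, dst),
            (MSL.LOAD_ACTOR, actor, dst),
            (MSL.DMA_TRANSFER, actor, (src, dst))]

print(len(migrate("B1", 2, 5)))  # 4
```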
Slide 11: Overall Execution Flow
Every application may see multiple iterations of:
- Resource-change request
- Resource-change granted
Slide 13: Work Partitioning [static 2]
- Finds an optimal actor-to-processor mapping considering:
  - Actors' work estimates
  - Communication cost
  - DMA cost
  - Memory requirements
- At the end, each actor is assigned to exactly one processor.
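Flextream's partitioner solves this as an optimization over all four costs; as an illustration of the load-balancing core alone, a greedy longest-processing-time heuristic (actor names and work values below are invented) looks like this:

```python
def greedy_partition(work, n_procs):
    """Greedy longest-processing-time heuristic: repeatedly place the
    heaviest unassigned actor on the least-loaded processor. Each actor
    ends up on exactly one processor. This ignores communication, DMA,
    and memory costs, which the real partitioner also weighs."""
    loads = [0] * n_procs
    mapping = {}
    for actor in sorted(work, key=work.get, reverse=True):
        p = loads.index(min(loads))   # least-loaded processor wins
        mapping[actor] = p
        loads[p] += work[actor]
    return mapping, loads

mapping, loads = greedy_partition({"A": 10, "B": 43, "C": 24, "F": 10}, 2)
print(loads)  # [43, 44]
```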
Slide 14: Partition Refinement [dynamic 1]
- The resources available at runtime can be more limited than the resources in the static target architecture.
- Partition refinement tunes the actor-to-processor mapping to the active configuration.
- A greedy iterative algorithm is used to achieve this goal.
Slide 15: Partition Refinement Example
- Pick the processors with the largest number of actors.
- Sort their actors by work.
- Find the processor with the maximum work.
- Reassign minimum-work actors until the load falls below a threshold.
[Figure: successive snapshots of an eight-processor mapping (P0–P7) with per-processor load totals (P0: 184.5, P1: 283 → 141.5, P2: 171.5, P3: 289 → 141.5, P4: 151.5 → 274.5, P5: 270.5 → 193 → 183 → 173, P6: 140, P7: 159.5) as actors such as B, C0–C3, E0–E3, S0–S2, and J0–J2 migrate off overloaded processors.]
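One simplified step of this greedy idea, reassigning actors stranded on processors that are no longer available to the least-loaded survivors, can be sketched as follows (function and variable names are invented; the full algorithm also rebalances overloaded surviving processors against a threshold, as in the steps above):

```python
def refine_partition(mapping, work, alive):
    """Reassign actors whose processor disappeared: smallest-work actors
    first, each one goes to the least-loaded surviving processor."""
    loads = {p: 0 for p in alive}
    orphans = []
    for actor, p in mapping.items():
        if p in alive:
            loads[p] += work[actor]
        else:
            orphans.append(actor)
    for actor in sorted(orphans, key=work.get):   # smallest work first
        p = min(loads, key=loads.get)             # least-loaded survivor
        mapping[actor] = p
        loads[p] += work[actor]
    return mapping, loads

mapping, loads = refine_partition({"A": 0, "B": 1, "C": 2},
                                  {"A": 5, "B": 7, "C": 3}, alive={0, 1})
print(mapping["C"], loads)  # actor C moves to processor 0
```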
Slide 16: Stage Assignment [dynamic 2]
- Processor assignment only specifies how actors are overlapped across processors.
- Stage assignment finds how actors are overlapped in time.
- The relative start times of the actors are based on their stage numbers.
- DMA operations get a separate stage of their own.
Slide 17: Stage Assignment Example
[Figure: the example stream graph annotated with stage numbers from 0 through 18; actors at the same depth share a stage, cross-processor edges get intervening DMA stages, and F lands in the final stage.]
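One way to realize this, sketched below with invented helper names, is to give each actor the longest-path depth from a source, charging an extra stage on every cross-processor edge so the DMA transfer gets a stage of its own and can overlap with computation:

```python
def assign_stages(edges, proc_of):
    """Stage assignment sketch: an actor's stage is the longest path from a
    source; a cross-processor edge costs 2 (one stage for the DMA) while a
    same-processor edge costs 1. The graph must be acyclic."""
    stage = {}
    def stage_of(a):
        if a not in stage:
            s = 0
            for src, dst in edges:
                if dst == a:
                    gap = 2 if proc_of[src] != proc_of[a] else 1
                    s = max(s, stage_of(src) + gap)
            stage[a] = s
        return stage[a]
    for a in proc_of:
        stage_of(a)
    return stage

stages = assign_stages([("A", "B"), ("B", "C")], {"A": 0, "B": 0, "C": 1})
print(stages)  # {'A': 0, 'B': 1, 'C': 3}: stage 2 holds the B-to-C DMA
```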
Slide 18: Buffer Allocation [dynamic 3]
- Slave processors have limited local stores, but a local store is faster than main memory.
- So local stores are utilized first, with spills going to main memory.
- In case of spilling, the DMAs have to be adjusted.
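A minimal sketch of this greedy placement (buffer names and sizes are invented; the real allocator must also rewrite the affected DMA commands when a buffer spills):

```python
def allocate_buffers(buffers, local_capacity):
    """Place each buffer in the fast local store while it fits,
    largest first; spill whatever does not fit to main memory."""
    local, spilled, used = [], [], 0
    for name, size in sorted(buffers.items(), key=lambda kv: -kv[1]):
        if used + size <= local_capacity:
            local.append(name)
            used += size
        else:
            spilled.append(name)   # this buffer's DMAs must be redirected
    return local, spilled

local, spilled = allocate_buffers({"a_b": 64, "b_c": 96, "c_d": 16},
                                  local_capacity=128)
print(local, spilled)  # ['b_c', 'c_d'] ['a_b']
```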
Slide 19: Methodology
- StreamIt compiler; METIS for graph partitioning
- 32-core heterogeneous distributed-memory multi-core system
- Each slave core has a DMA engine and a 128 KB local store
- System simulator to simulate the interconnect traffic
Slide 29: Introduction
- Single-core performance has stopped scaling.
- Multi-core and many-core systems are everywhere, and these systems have different configurations.
- Resource management is a challenging problem.
[Figure: Cell processor and Intel Larrabee.]