Evaluation of Message Passing Synchronization Algorithms in Embedded Systems

Presentation transcript:

Slide 1: Evaluation of Message Passing Synchronization Algorithms in Embedded Systems
Lazaros Papadopoulos, Ivan Walulya, Philippas Tsigas, Dimitrios Soudris and Brendan Barry
National Technical University of Athens, School of Electrical and Computer Engineering, Division of Computer Science

Slide 2: “Watt’s Next?”
Power consumption
- Design decisions
- Performance/watt metric
Improvements in compute performance
- More power budget
- Cooling problems

Slide 3: GPU FLOPS/W Trend

Slide 4: Emerging Embedded Systems Trend
[Figure: GPU FLOPS/W trend with Myriad data points. Surviving annotations: GPU rate of increase, Myriad rate of increase (x per year), 65nm, years to hit 50 GFLOPS/W. The numeric values are missing from the transcript.]

Slide 5: But how?
Old Approach / New Approach

Slide 6: Now that I’ve got an ultra-low-power compute platform, what can I do with it?
- What is the potential of such low-power processors for high-end computations?
- Can they offer a solution to power problems?
- Can high-performance computing techniques be deployed on these processors?

Slide 7: Outline
Introduction
- Synchronization on multi-core platforms
- Movidius SoC
Algorithmic Designs
Experimental results
Conclusions

Slide 8: Synchronization
Hardware support
Mutexes
- Scalability
- Busy waiting
Lock alternatives or lock-free designs?
Message-passing techniques from the HPC domain

Slide 9: Myriad architecture
Processors:
- 32-bit general-purpose RISC SPARC processor (LEON)
- 8 SHAVE (Streaming Hybrid Architecture Vector Engine) processors for computational processing
Memory:
- CMX (Connection Matrix): 1 MB on-chip RAM (with 128 KB per SHAVE core)
- SDRAM: 64 MB
Synchronization support on Myriad1: mutexes, FIFO registers

Slide 10: Algorithmic Designs
- Single Lock
- Double Lock
- Client-Server
- Remote Core Locking (RCL)

Slide 11: Single Lock
- No concurrency
- Busy waiting
- No scalability
[Figure: waiting cores repeatedly poll the lock, each asking “Done yet?”]
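The single-lock design can be illustrated with a minimal sketch. It is written in Python for readability and is not the authors' Myriad code: `SingleLockQueue` and `run_single_lock_demo` are hypothetical names, and `threading.Lock` stands in for a hardware mutex. Every operation polls the same lock, so no two operations ever overlap.

```python
import threading
from collections import deque


class SingleLockQueue:
    """FIFO queue guarded by one lock, so all operations are serialized."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()  # assumption: stand-in for a hardware mutex

    def _spin_acquire(self):
        # Busy waiting: keep polling ("Done yet?") until the lock is free.
        while not self._lock.acquire(blocking=False):
            pass

    def enqueue(self, value):
        self._spin_acquire()
        try:
            self._items.append(value)
        finally:
            self._lock.release()

    def dequeue(self):
        """Return the oldest item, or None if the queue is empty."""
        self._spin_acquire()
        try:
            return self._items.popleft() if self._items else None
        finally:
            self._lock.release()


def run_single_lock_demo(num_threads=4, ops_per_thread=200):
    """Have several threads enqueue concurrently, then drain the queue."""
    q = SingleLockQueue()

    def worker(tid):
        for i in range(ops_per_thread):
            q.enqueue((tid, i))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    drained = []
    while (item := q.dequeue()) is not None:
        drained.append(item)
    return drained
```

Because enqueuers and dequeuers contend for the same lock, adding cores only adds waiting, which is the scalability problem the slide points at.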

Slide 12: Multiple Locks
- Better concurrency
- Improved scalability
- Busy waiting
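One common way to use more than one lock on a FIFO queue is the classic two-lock linked-list queue, with one lock for the head and one for the tail. The sketch below assumes that design; the slides do not say which variant was implemented. All names are hypothetical. A dummy node keeps enqueuers and dequeuers on separate nodes, so one enqueue and one dequeue can proceed at the same time.

```python
import threading


class _Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None


class TwoLockQueue:
    """Linked-list FIFO with separate head and tail locks."""

    def __init__(self):
        dummy = _Node()                  # dummy node decouples head from tail
        self._head = dummy
        self._tail = dummy
        self._head_lock = threading.Lock()
        self._tail_lock = threading.Lock()

    def enqueue(self, value):
        node = _Node(value)
        with self._tail_lock:            # only enqueuers contend here
            self._tail.next = node
            self._tail = node

    def dequeue(self):
        """Return the oldest item, or None if the queue is empty."""
        with self._head_lock:            # only dequeuers contend here
            first = self._head.next
            if first is None:
                return None
            value = first.value
            first.value = None           # 'first' becomes the new dummy node
            self._head = first
            return value


def run_two_lock_demo(producers=2, consumers=2, items_per_producer=500):
    """Run producers and consumers concurrently; return everything consumed."""
    q = TwoLockQueue()
    producers_done = threading.Event()
    results = [[] for _ in range(consumers)]

    def produce(pid):
        for i in range(items_per_producer):
            q.enqueue((pid, i))

    def consume(cid):
        while True:
            finished = producers_done.is_set()  # sample before dequeuing
            item = q.dequeue()
            if item is not None:
                results[cid].append(item)
            elif finished:
                break                    # empty and no producer is left

    p_threads = [threading.Thread(target=produce, args=(p,))
                 for p in range(producers)]
    c_threads = [threading.Thread(target=consume, args=(c,))
                 for c in range(consumers)]
    for t in p_threads + c_threads:
        t.start()
    for t in p_threads:
        t.join()
    producers_done.set()
    for t in c_threads:
        t.join()
    return [item for per_consumer in results for item in per_consumer]
```

Threads on the same end of the queue still wait for each other, which matches the slide's note that busy waiting remains.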

Slide 13: Client-Server arbitration (C-S)
- Request for access
- Spin on local variable
- Access granted by server
- Limited concurrency
[Figure labels: Thread, Server, Queue, Pend, Post]
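The request, spin, and grant steps can be sketched as follows. This is an illustration under assumptions, not the Myriad implementation: `queue.Queue` stands in for the hardware FIFO registers, and `ClientSlot`, `arbitration_server`, `client` and `run_client_server_demo` are hypothetical names. The server only arbitrates; each client still runs its own critical section.

```python
import queue
import threading
import time


class ClientSlot:
    """Per-request flags; a client spins only on its own slot."""

    def __init__(self):
        self.granted = False
        self.released = threading.Event()


def arbitration_server(requests):
    """Grant access to one client at a time, in request order."""
    while True:
        slot = requests.get()        # pend on the request queue
        if slot is None:             # shutdown marker (hypothetical)
            break
        slot.granted = True          # post: access granted
        slot.released.wait()         # wait until the client leaves


def client(requests, shared, increments):
    for _ in range(increments):
        slot = ClientSlot()          # fresh slot per request keeps the sketch simple
        requests.put(slot)           # request for access
        while not slot.granted:      # spin on a local variable
            time.sleep(0)            # yield so other Python threads can run
        shared["counter"] += 1       # critical section, executed by the client
        slot.released.set()          # tell the server we are done


def run_client_server_demo(num_clients=4, increments=100):
    requests = queue.Queue()         # assumption: stand-in for FIFO registers
    shared = {"counter": 0}
    server = threading.Thread(target=arbitration_server, args=(requests,))
    clients = [threading.Thread(target=client,
                                args=(requests, shared, increments))
               for _ in range(num_clients)]
    server.start()
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    requests.put(None)               # stop the server
    server.join()
    return shared["counter"]
```

Spinning on a per-client flag avoids hammering one shared lock word, but only one client is inside the critical section at any time, hence the limited concurrency.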

Slide 14: Remote Core Locking (RCL)
- Migrate critical section
- No shared data transfers
- Reduced bus traffic
[Figure labels: Thread, Server, Queue, Post]

Slide 15: Remote Core Locking - RCL
[Figure: threads Th-1 and Th-2 post enq() and deq() requests to the server, which holds the queue's head and tail pointers in memory. Labels visible: elements e0, e1, e4, e5, e6; enq() &e6; deq() &e1; deq(&e1).]
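A minimal sketch of the idea behind RCL follows: clients ship the operation to a dedicated server, which executes the critical section on a queue that only it touches. It is an illustration under assumptions rather than the authors' code. `queue.Queue` stands in for the hardware FIFO registers, and `Request`, `rcl_server`, `remote_call` and `run_rcl_demo` are hypothetical names.

```python
import queue
import threading
import time
from collections import deque

_PENDING = object()                  # marks a request that has no reply yet


class Request:
    def __init__(self, op, arg=None):
        self.op = op                 # "enq" or "deq"
        self.arg = arg
        self.result = _PENDING


def rcl_server(requests):
    """Execute every critical section on behalf of the clients."""
    fifo = deque()                   # private to the server: no shared data
    while True:
        req = requests.get()
        if req is None:              # shutdown marker (hypothetical)
            break
        if req.op == "enq":
            fifo.append(req.arg)
            req.result = True
        else:                        # "deq"
            req.result = fifo.popleft() if fifo else None


def remote_call(requests, op, arg=None):
    """Post a request and spin on its private result field."""
    req = Request(op, arg)
    requests.put(req)
    while req.result is _PENDING:
        time.sleep(0)                # yield so other Python threads can run
    return req.result


def run_rcl_demo(num_clients=4, ops_per_client=100):
    requests = queue.Queue()         # assumption: stand-in for FIFO registers
    server = threading.Thread(target=rcl_server, args=(requests,))
    server.start()

    def worker(tid):
        for i in range(ops_per_client):
            remote_call(requests, "enq", (tid, i))

    clients = [threading.Thread(target=worker, args=(t,))
               for t in range(num_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()

    drained = []
    while (item := remote_call(requests, "deq")) is not None:
        drained.append(item)
    requests.put(None)               # stop the server
    server.join()
    return drained
```

The queue needs no lock at all because a single thread owns it. The price is a request and reply per operation, plus one core that does nothing but serve.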

Slide 16: RCL - Drawbacks
- Client-server communication costs
- Serialization of a concurrent data structure
- Losing one core, which is dedicated to the server

Slide 17: Experimental evaluation
FIFO queues
- Cores execute Enqueue and Dequeue operations
- High contention
Test configurations
1. Random, initial size 0
2. N/2 producers / N/2 consumers, initial size ,000 Operations
Measured execution time in cycles

Slide 18: Experimental Results

Slide 19: Experimental Results

Slide 20: Conclusions
- Complex data structures can be deployed on ultra-low-power processors.
- With relatively low absolute performance, can they be viable for high-end computing?
- With 3D stacking, it may become possible to stack many processors for very fast and energy-efficient communication.

Slide 21: Questions?