Yiannis Nikolakopoulos

Presentation on theme: "Yiannis Nikolakopoulos"— Presentation transcript:

1 Concurrent Data Structures in Architectures with Limited Shared Memory Support
Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden

2 Concurrent Data Structures
Parallel/concurrent programming:
Share data among threads/processes through a uniform address space (shared memory)
Inter-process/thread communication and synchronization
Both a tool and a goal

3 Concurrent Data Structures: Implementations
Coarse-grained locking: easy but slow
Fine-grained locking: fast and scalable, but error-prone (deadlocks)
Non-blocking: atomic hardware primitives (e.g. TAS, CAS), good progress guarantees (lock-/wait-freedom), scalable

4 What’s happening in hardware?
Multi-cores → many-cores
“Cache coherency wall” [Kumar et al 2011]: a shared address space will not scale
Universal atomic primitives (CAS, LL/SC) become harder to implement
Shared memory → message passing

5 Can we have data structures that are fast, scalable, and offer good progress guarantees?
Networks on chip (NoC):
Short distance between cores
Message passing model support
Shared memory support: cache coherency eliminated, limited support for synchronization primitives

6 Outline
Concurrent Data Structures
Many-core architectures: Intel’s SCC
Concurrent FIFO Queues
Evaluation
Conclusion

7 Single-chip Cloud Computer (SCC)
Experimental processor by Intel
48 independent x86 cores arranged on 24 tiles
NoC connects all tiles
One TestAndSet register per core
The chip itself is not available, but it remains relevant because similar architectures are appearing

8 SCC: Architecture Overview
Message Passing Buffer (MPB): 16 KB
Memory controllers: access to private and shared main memory

9 Programming Challenges in SCC
Message passing, but:
MPB is too small for large data transfers
Data replication is difficult
No universal atomic primitives (CAS), hence no wait-free implementations [Herlihy91]

10 Outline
Concurrent Data Structures
Many-core architectures: Intel’s SCC
Concurrent FIFO Queues
Evaluation
Conclusion

11 Concurrent FIFO Queues
Main idea: data are stored in shared off-chip memory; message passing is used for communication/coordination
Two design methodologies:
Lock-based synchronization (2-lock Queue)
Message passing-based synchronization (MP-Queue, MP-Acks)

12 2-lock Queue
Array-based, in shared off-chip memory (SHM)
Head/Tail pointers in MPBs
One lock for each pointer [Michael&Scott96]
TAS-based locks on 2 cores
Two separate aspects: the algorithmic contribution (flag bit) and the implementation (lock placement)
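
To make "TAS-based lock" concrete, here is a minimal Python sketch of a spin lock built on a test-and-set register. It is an illustration under stated assumptions, not the SCC hardware interface: `TASRegister` and `TASLock` are hypothetical names, and the inner `threading.Lock` exists only to model the atomicity that the hardware register would provide.

```python
import threading
import time


class TASRegister:
    """Simulated test-and-set register (hypothetical, not the SCC API)."""

    def __init__(self):
        self._set = False
        self._atomic = threading.Lock()  # models hardware atomicity only

    def test_and_set(self):
        """Atomically set the register and return its previous value."""
        with self._atomic:
            old = self._set
            self._set = True
            return old

    def clear(self):
        with self._atomic:
            self._set = False


class TASLock:
    """Spin lock: retry test-and-set until the previous value was False."""

    def __init__(self):
        self._reg = TASRegister()

    def acquire(self):
        while self._reg.test_and_set():
            time.sleep(0)  # yield to other threads while spinning

    def release(self):
        self._reg.clear()


def demo(threads=4, increments=1000):
    """Increment a shared counter under the lock from several threads."""
    lock = TASLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]


print(demo())  # 4000
```

In the 2-lock queue, one such lock would guard the Head pointer and another the Tail pointer.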

13 2-lock Queue: “Traditional” Enqueue Algorithm
Acquire lock
Read & update Tail pointer (MPB)
Add data (SHM)
Release lock

14 2-lock Queue: Optimized Enqueue Algorithm
Acquire lock
Read & update Tail pointer (MPB)
Release lock
Add data to node (SHM)
Set memory flag to dirty
Why? No cache coherency!

15 2-lock Queue: Dequeue Algorithm
Acquire lock
Read & update Head pointer
Release lock
Check flag
Read node data
What about progress?
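
The optimized enqueue and the dequeue steps above can be sketched as follows. This is a minimal Python sketch, not the authors' SCC implementation: `threading.Lock` stands in for the TAS-based locks, plain lists stand in for the SHM nodes and their flags, and the names `TwoLockQueue`, `CLEAN` and `DIRTY` are hypothetical. Two simplifications are assumptions of the sketch: the array is never reused (no wraparound), and dequeue checks for an empty queue under the head lock.

```python
import threading
import time

CLEAN, DIRTY = 0, 1


class TwoLockQueue:
    def __init__(self, capacity):
        self.nodes = [None] * capacity    # stands in for data nodes in SHM
        self.flags = [CLEAN] * capacity   # one flag per node
        self.head = 0                     # stands in for pointers in the MPB
        self.tail = 0
        self.head_lock = threading.Lock()
        self.tail_lock = threading.Lock()

    def enqueue(self, value):
        # The lock covers only the pointer update.
        with self.tail_lock:
            if self.tail == len(self.nodes):
                raise OverflowError("sketch does not wrap around")
            slot = self.tail
            self.tail += 1
        # The data write happens outside the critical section;
        # the flag tells dequeuers that the node is complete.
        self.nodes[slot] = value
        self.flags[slot] = DIRTY

    def dequeue(self):
        with self.head_lock:
            if self.head == self.tail:    # assumption: empty check
                return None
            slot = self.head
            self.head += 1
        # The slot is claimed, but its enqueuer may not have finished:
        # spin until the flag is set, then read the data.
        while self.flags[slot] != DIRTY:
            time.sleep(0)
        return self.nodes[slot]


q = TwoLockQueue(capacity=4)
q.enqueue("a")
q.enqueue("b")
print(q.dequeue(), q.dequeue(), q.dequeue())  # a b None
```

The spin loop in `dequeue` is where the progress question arises: if the enqueuer that claimed the slot never sets the flag, the matching dequeuer waits forever.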

16 2-lock Queue: Implementation
Locks: on which tile(s)?
Head/Tail pointers (MPB)
Data nodes

17 Message Passing-based Queue
Data nodes in SHM
Access is coordinated by a server node that keeps the Head/Tail pointers
Enqueuers/dequeuers request access through dedicated slots in the MPB
Successfully enqueued data are flagged with a dirty bit

18 MP-Queue
[Diagram labels: ENQ, DEQ, TAIL, HEAD, ADD DATA, SPIN]
What if adding the data fails and the node is never flagged?
“Pairwise blocking”: only one dequeue blocks

19 Adding Acknowledgements
No more flags!
Enqueue sends an ACK when done
Server maintains a private queue of pointers in SHM
On ACK: server adds the data location to its private queue
On Dequeue: server returns only ACKed locations
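
The server-side bookkeeping described above can be sketched as a small state machine. This is a single-threaded Python illustration under stated assumptions, not the authors' code: messages are modelled as direct method calls, `AckServer` and its method names are hypothetical, the front of the private queue plays the role of the head, and locations are never reused.

```python
from collections import deque


class AckServer:
    def __init__(self, capacity):
        self.shm = [None] * capacity  # stands in for data nodes in SHM
        self.tail = 0                 # next free location
        self.acked = deque()          # private queue of ACKed locations

    def on_enqueue_request(self):
        """Hand the next free location to an enqueuer (None if full)."""
        if self.tail == len(self.shm):
            return None
        loc = self.tail
        self.tail += 1
        return loc

    def on_ack(self, loc):
        """The enqueuer has finished writing: the location is now visible."""
        self.acked.append(loc)

    def on_dequeue_request(self):
        """Return only a location whose data is known to be complete."""
        return self.acked.popleft() if self.acked else None


server = AckServer(capacity=4)
loc_a = server.on_enqueue_request()  # enqueuer A gets a location, then stalls
loc_b = server.on_enqueue_request()  # enqueuer B gets the next one
server.shm[loc_b] = "b"
server.on_ack(loc_b)                 # B finishes first

loc = server.on_dequeue_request()    # the dequeuer is not held up by A
print(server.shm[loc])               # b

server.shm[loc_a] = "a"
server.on_ack(loc_a)
print(server.shm[server.on_dequeue_request()])  # a
print(server.on_dequeue_request())              # None
```

In this sketch items become visible in ACK order, so a stalled enqueuer delays only its own item and no dequeuer ever waits on an unfinished node.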

20 MP-Acks
[Diagram labels: TAIL, HEAD, ACK]
No blocking between enqueues/dequeues

21 Outline
Concurrent Data Structures
Many-core architectures: Intel’s SCC
Concurrent FIFO Queues
Evaluation
Conclusion

22 Evaluation
Performance? Scalability? Is it the same for all cores?
Benchmark: each core performs Enq/Deq at random
High/low contention

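23 Measures
Throughput: data structure operations completed per time unit.
Fairness [Cederman et al 2013]:
$$\mathit{fairness}_{\Delta t} = \min\left( \frac{\min_i(n_i)}{\sum_i n_i / N},\; \frac{\sum_i n_i / N}{\max_i(n_i)} \right)$$
where $n_i$ is the number of operations by core $i$ and $\sum_i n_i / N$ is the average number of operations per core.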
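
As a worked example of the fairness measure, the following Python sketch computes it from a list of per-core operation counts. The function name `fairness` and its argument are hypothetical; the arithmetic follows the definition above, with the number of cores taken as the length of the list.

```python
def fairness(ops_per_core):
    """Fairness over one interval: 1.0 means every core did equal work."""
    n_cores = len(ops_per_core)
    total = sum(ops_per_core)
    if n_cores == 0 or total == 0:
        raise ValueError("need at least one core and one completed operation")
    average = total / n_cores  # average operations per core
    return min(min(ops_per_core) / average, average / max(ops_per_core))


print(fairness([100, 100, 100, 100]))  # 1.0
print(fairness([50, 100, 150]))        # 0.5
print(fairness([0, 100]))              # 0.0
```

The value drops both when one core is starved (first term) and when one core dominates (second term).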

24 Throughput – High Contention

25 Fairness – High Contention

26 Throughput VS Lock Location

27 Throughput VS Lock Location

28 Conclusion
Lock-based queue: high throughput, less fair, sensitive to lock locations and NoC performance
MP-based queues: lower throughput, fairer, better liveness properties, promising scalability

29 Thank you! ivanw@chalmers.se ioaniko@chalmers.se

30 Backup slides

31 Experimental Setup
533 MHz cores, 800 MHz mesh, 800 MHz DDR3
Randomized Enq/Deq operations
High/low contention
One thread per core
600 ms per execution
Averaged over 12 runs

32 Concurrent FIFO Queues
Typical 2-lock queue [Michael&Scott96]

