Application-Specific Signatures for Transactional Memory in Soft Processors Martin Labrecque Mark Jeffrey Gregory Steffan ECE Dept. University of Toronto.

1 Application-Specific Signatures for Transactional Memory in Soft Processors Martin Labrecque Mark Jeffrey Gregory Steffan ECE Dept. University of Toronto

2 2 FPGAs for Systems-on-Chip
- Increasingly large Systems-on-Chip
- Many CPUs, accelerators, IP blocks
- Processors are easier to program than hardware
- FPGAs & multicores: similar parallel programming challenge
Diagram labels: FPGA, Soft Processor, DDR controller, Ethernet MAC controllers
Why are parallel programs challenging?

3 3 Packet Processing Example
SINGLE-THREADED:
  packet = get_packet();
  …
  connection = database->lookup(packet);
  if(connection == NULL)
    connection = database->add(packet);
  connection->count++;
  …
  global_packet_count++;
MULTI-THREADED: the same code, marked Atomic
Challenges:
1- Must correctly delimit atomic operations
2- Improve performance by finer-grain locking

4 4 Packet Processing Example
MULTI-THREADED, marked Atomic:
  packet = get_packet();
  …
  connection = database->lookup(packet);
  if(connection == NULL)
    connection = database->add(packet);
  connection->count++;
  …
  global_packet_count++;
Labels: No Parallelism; Optimistic Parallelism across Connections; Opportunity for Parallelism

5 5 Exploit Opportunity for Parallelism
- Allow more than 1 thread in a critical section
- Will succeed if threads access different data
Transactional Memory
- the new hot topic for multiprocessor computers
- how to map TM to FPGAs?

6 6 Our Transactional Approach
- Modify main memory directly: reduce copies, faster commit
- Detect conflicts prior to corrupting main memory
- Undo changes on transaction abort
Diagram labels: processor1, processor2, Data Cache, Data, Off-chip DDR (with x marks)
How to efficiently detect conflicts?
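
The "modify in place, undo on abort" idea on this slide can be illustrated with a small software undo log. This is only a sketch under stated assumptions: the deck describes hardware, and the names `undo_log_t`, `tx_write`, `tx_commit`, `tx_abort` and the capacity `UNDO_LOG_CAPACITY` are hypothetical, not taken from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDO_LOG_CAPACITY 64  /* hypothetical capacity, not from the slides */

/* One saved (address, previous value) pair. */
typedef struct {
    uint32_t *addr;
    uint32_t  old_value;
} undo_entry_t;

typedef struct {
    undo_entry_t entries[UNDO_LOG_CAPACITY];
    size_t       count;
} undo_log_t;

/* Speculative write: save the old value, then update memory in place. */
void tx_write(undo_log_t *log, uint32_t *addr, uint32_t value)
{
    assert(log->count < UNDO_LOG_CAPACITY);
    log->entries[log->count].addr = addr;
    log->entries[log->count].old_value = *addr;
    log->count++;
    *addr = value;
}

/* Commit: memory already holds the new values, so only the log is dropped. */
void tx_commit(undo_log_t *log)
{
    log->count = 0;
}

/* Abort: restore old values in reverse order so the oldest value wins. */
void tx_abort(undo_log_t *log)
{
    while (log->count > 0) {
        log->count--;
        *log->entries[log->count].addr = log->entries[log->count].old_value;
    }
}
```

Commit is cheap here because nothing has to be copied back, which matches the "reduce copies, faster commit" bullet; the cost moves to the abort path.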

7 7 Conflict Detection
- Must detect all conflicts for correctness
- Reporting false conflicts is acceptable
Tracking speculative reads and writes
Compare accesses across transactions (Transaction1 / Transaction2):
- Read A: OK
- Read B / Write B: CONFLICT
- Write C / Read C: CONFLICT
- Write D: CONFLICT
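
The comparison rule on this slide (reads alone never conflict; any overlap involving a write does) can be sketched with exact read and write sets stored as bit masks. The type `access_set_t`, the function names, and the 64-location limit are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Exact access sets for up to 64 locations: bit i set means location i. */
typedef struct {
    uint64_t reads;
    uint64_t writes;
} access_set_t;

void record_read(access_set_t *s, unsigned loc)
{
    assert(loc < 64);
    s->reads |= UINT64_C(1) << loc;
}

void record_write(access_set_t *s, unsigned loc)
{
    assert(loc < 64);
    s->writes |= UINT64_C(1) << loc;
}

/* Conflict if either side wrote something the other side read or wrote. */
bool sets_conflict(const access_set_t *a, const access_set_t *b)
{
    uint64_t write_vs_any  = a->writes & (b->reads | b->writes);
    uint64_t read_vs_write = a->reads & b->writes;
    return (write_vs_any | read_vs_write) != 0;
}
```

With locations A to D numbered 0 to 3, two reads of A report no conflict, while a read and a write of B, or two writes of D, do.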

8 8 Related Work on Conflict Detection
FPGAs: test speculative bits in the cache
- Complex to evict cache lines
- Lots of additional state
- Too restrictive in terms of storage capacity
ASIC: compare signatures
- Signature: bit vector recording TM memory accesses
- No previous signature FPGA implementation
Signatures well suited to FPGA bitwise operations
How can signatures be efficiently implemented?

9 9 Conflict Detection with Signatures
Hash of an address indexes into a bit vector
- More bits per signature → more resolution
- FPGA timing and area limit the number of bits
- Hash functions have varying complexity/accuracy
Diagram labels: processor1 load, processor2 store, Hash Function, Read/Write Signatures, AND
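
A minimal software sketch of the signature scheme described here: each load or store hashes its address to one bit of a per-processor read or write signature, and conflict detection is a bitwise AND. The 16-bit signature length, the placeholder `toy_hash`, and all other names are assumptions for illustration; the slides do not give these details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIG_BITS 16u  /* hypothetical signature length */

typedef struct {
    uint32_t read_sig;   /* bits set by speculative loads  */
    uint32_t write_sig;  /* bits set by speculative stores */
} tx_sig_t;

/* Placeholder hash: word address modulo the signature length. */
unsigned toy_hash(uint32_t addr)
{
    return (addr >> 2) % SIG_BITS;
}

void sig_load(tx_sig_t *t, uint32_t addr)
{
    t->read_sig |= UINT32_C(1) << toy_hash(addr);
}

void sig_store(tx_sig_t *t, uint32_t addr)
{
    t->write_sig |= UINT32_C(1) << toy_hash(addr);
}

/* AND the signatures: a non-zero result may be a true or a false conflict. */
bool sig_conflict(const tx_sig_t *a, const tx_sig_t *b)
{
    return ((a->write_sig & (b->read_sig | b->write_sig)) |
            (a->read_sig & b->write_sig)) != 0;
}

/* Clearing a signature is a single assignment per vector. */
void sig_clear(tx_sig_t *t)
{
    t->read_sig = 0;
    t->write_sig = 0;
}
```

With this placeholder hash, addresses 0x1000 and 0x1040 land on the same bit, so a load of one and a store to the other is reported as a conflict even though the addresses differ. That is the false-conflict cost that more signature bits or a better hash function reduces.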

10 10 Goals of this Work
- Implement efficient signatures for TM on FPGAs
- FPGA reconfigurability → better/more-efficient TM
- Evaluate with real system

11 11 Existing Hash Functions
1. Bit Selection
Address bits 0110... give Hash = 0110: a 4-bit hash indexes into 16 signature bits
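
Bit selection can be sketched as extracting a fixed field of the address and using it directly as the signature index. Which address bits are selected is not recoverable from the slide, so the `shift` parameter and the function names below are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Select `nbits` address bits starting at bit position `shift`. */
unsigned bit_select_hash(uint32_t addr, unsigned shift, unsigned nbits)
{
    assert(nbits >= 1 && nbits < 32 && shift < 32);
    return (addr >> shift) & ((UINT32_C(1) << nbits) - 1u);
}

/* Set the bit chosen by a 4-bit hash in a 16-bit signature. */
uint16_t signature_insert(uint16_t sig, uint32_t addr)
{
    unsigned index = bit_select_hash(addr, 0, 4);  /* 0..15 */
    return (uint16_t)(sig | (1u << index));
}
```

For example, an address whose selected bits are 0110 sets signature bit 6. Any two addresses that agree on the selected field alias to the same bit, regardless of their other bits.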

12 12 Existing Hash Functions (continued)
2. H3: XOR random address bits
Example labels: Address bits 100111..., Address bits 001101..., Hash_1 = 11, Hash_2 = 10
Multiple hash functions index different parts of the signature
We use 4 hash functions to improve performance/length
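
One way to read the H3 description: every output bit of a hash is the XOR (parity) of a fixed, randomly chosen subset of address bits, and each of the hash functions sets one bit in its own segment of the signature. The sketch below assumes 4 hash functions with 2 output bits each, giving a 16-bit signature; the mask values used to drive it are arbitrary and every name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define H3_NUM_HASHES 4  /* the slide mentions 4 hash functions */
#define H3_OUT_BITS   2  /* 2-bit hashes, as in the slide's examples */

/* XOR of all bits of x: 1 if an odd number of bits are set. */
unsigned parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Output bit i is the parity of the address bits picked by masks[i]. */
unsigned h3_hash(uint32_t addr, const uint32_t masks[H3_OUT_BITS])
{
    unsigned h = 0;
    for (unsigned i = 0; i < H3_OUT_BITS; i++)
        h |= parity32(addr & masks[i]) << i;
    return h;
}

/* Hash k indexes only into segment k of the signature. */
uint32_t h3_sig_insert(uint32_t sig, uint32_t addr,
                       const uint32_t masks[H3_NUM_HASHES][H3_OUT_BITS])
{
    const unsigned segment_bits = 1u << H3_OUT_BITS;  /* 4 bits per segment */
    assert(H3_NUM_HASHES * segment_bits <= 32);
    for (unsigned k = 0; k < H3_NUM_HASHES; k++)
        sig |= UINT32_C(1) << (k * segment_bits + h3_hash(addr, masks[k]));
    return sig;
}
```

In hardware each output bit is a small XOR tree, which is why the number and width of the masks directly drive area.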

13 13 Existing Hash Functions (continued)
3. PBX: XOR high-order bits with low-order ones
Example labels: Address bits 1101... with Hash_1 = 01 and Hash_2 = 01; Address bits 0010... with Hash_2 = 10
4. LE-PBX: XOR high-order bits with low-order ones, progressively omit low-order bits in hash functions
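
A hedged sketch of the PBX idea as stated on the slide: XOR a field of high-order address bits with a field of low-order bits. The field positions (`low_shift`, `high_shift`) and width are parameters assumed here, and the example values below are not meant to reproduce the hash values shown on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* XOR an nbits-wide high-order field with an nbits-wide low-order field. */
unsigned pbx_hash(uint32_t addr, unsigned low_shift, unsigned high_shift,
                  unsigned nbits)
{
    assert(nbits >= 1 && nbits < 32);
    assert(low_shift < 32 && high_shift < 32);
    uint32_t mask = (UINT32_C(1) << nbits) - 1u;
    uint32_t low  = (addr >> low_shift) & mask;
    uint32_t high = (addr >> high_shift) & mask;
    return high ^ low;
}
```

Under this reading, an LE-PBX-style family could be obtained by raising `low_shift` for each successive hash function so that the lowest bits are progressively left out; that is an interpretation of the slide's wording, not a confirmed definition.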

14 14 Signatures: an Opportunity for FPGAs
ASIC hash functions on FPGA: very area consuming
FPGAs allow customized hash function for each application
Due to locality:
- applications access certain memory locations more frequently
- certain locations will have more conflicts than others
Via app-specific signatures:
- increase tracking resolution of conflicting memory locations
- decrease tracking resolution of others
Application-specific signatures!

15 15 Trie-based Hashing for Signatures
Binary addresses (profiling): 000 011 100 101 110 111
Trie: root; 1xx, 0xx; 11x, 10x, 01x, 00x; leaves 111 110 101 100 011 000
- Leaves are distinct addresses → signature bits
- Trie gives control on the resolution for different memory regions
- Complete trie of all TM accesses is HUGE
Which leaves in the trie can/cannot be merged?

16 16 Trie-Based Conflict Detection
Load/Store address bits: A2 A1 A0
Trie: root xxx; 1xx, 0xx; 11x, 10x, 01x, 00x; leaves 111 110 101 100 011 000
- Simulation feedback: 3 leaves in trie → 3 signature bits encompass all accesses
- Compact trie by only evaluating nodes with remaining branching
- Labels: A2 & A0, A2 & !A0, !A2; A2,A1,A0
Representation is very efficient!
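
The compacted trie on this slide reduces to three boolean expressions over the address bits, one per signature bit. The sketch below evaluates them for a 3-bit address; the assignment of each expression to a particular bit position and the function name are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Return a one-hot 3-bit signature for a 3-bit address (A2 A1 A0). */
unsigned trie_signature(unsigned addr)
{
    assert(addr < 8);
    unsigned a2 = (addr >> 2) & 1u;
    unsigned a0 = addr & 1u;
    unsigned sig = 0;
    sig |= (a2 & a0) << 0;          /* leaf A2 & A0:  addresses 1x1 */
    sig |= (a2 & (a0 ^ 1u)) << 1;   /* leaf A2 & !A0: addresses 1x0 */
    sig |= (a2 ^ 1u) << 2;          /* leaf !A2:      addresses 0xx */
    return sig;
}
```

A1 is never examined, and A0 is ignored whenever A2 is 0, so the six profiled addresses are tracked with three signature bits instead of six. Addresses such as 111 and 101 now share a bit, which is acceptable only if profiling shows they rarely conflict.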

17 17 Trie-based Hash function Evaluation Training packet trace is different from test packet trace

18 18 Multiprocessor System
- NetFPGA: Virtex II Pro 50, 4 GigE + 1 PCI interfaces
- 2 processors @ 125 MHz (limited by FPGA)
- 64 MB DDR2 SDRAM @ 200 MHz
Real system executing real applications
Diagram labels: Instr., Data, Input mem., Output mem.; I$ + processor1 (1-thread); I$ + processor2 (1-thread); Input Buffer; Shared Data Cache; Output Buffer; packet input; packet output; Off-chip DDR; Synch. Unit

19 19 Simulated Ratio of False Conflicts versus Number of Signature Bits
- Trie-based hashing function requires far fewer signature bits
Plot: NAT, percent false conflicts

20 20 Simulated Ratio of False Conflicts versus Number of Signature Bits
- Trie-based hashing function requires far fewer signature bits
Plots: Classifier, UDHCP, NAT, Intruder

21 21 Simulated Packet Rate Normalized to Ideal Conflict Detection vs Trie-Based Signature Length
Signatures are Critical to Performance
Plot reference line: Ideal

22 22 2 Best Implementation Options (Maximum Design @ 125MHz)
Block RAM signatures:
- 2048 signature bits per thread
- Bit-Select hash function
Register signatures:
- ~100 signature bits per thread
- Arbitrary hash function
- We use trie-based signatures: they perform best at that size
Let's Compare!

23 23 Trie-based Hashing Normalized to BitSelection
- Significantly fewer rollbacks → packet rate increase
- At most 5% area overhead
Chart labels: Throughput, Area; +12%, +58%, +9%, +71%

24 24 Conclusions
- Conflict detection significantly impacts performance
- Trie-based hashing reduces required signature bits
- Trie-based hashing can be implemented in LUTs → preserve frequency, 5% area overhead
- Retiming is required to implement in RAMs
- Increased performance (up to 71%) versus other best implementation (RAM-based bit-select)
- Application-specific signatures enable first fully integrated TM processor for FPGA
- We now have an extended version working with 8 threads

25 25 Martin Labrecque Mark Jeffrey Gregory Steffan ECE Dept. University of Toronto martinL/markJ@eecg.utoronto.ca Thank you!

27 27 Transactional Memory: Parallel Programming Made Easy
Reduce conservative synchronization overhead
  Lock(); if (shared_1) array [ i ] = 0; Unlock();
  Only serialized when truly necessary
Alleviate need for fine-grained synchronization
  BEFORE: Bool val = f(shared_1); if(val) { Lock(); if ( f(shared_1) ) shared_1 = 0; Unlock(); }
  AFTER: Lock(); if ( f(shared_1) ) shared_1 = 0; Unlock();

28 28 Our Transactional Approach
- No program change required
- Modify main memory directly
- Detect conflicts prior to corrupting main memory
- Undo changes on transaction abort
Diagram labels: processor, processor, Data Cache, Data, Off-chip DDR (with x marks)

29 29 sigsvn_udhcp/statsout fp rates sigsvn_other/mat other stats

31 31 Transactional Single-Threaded Processor (simplified)
Diagram labels: Instr. Cache, PC, +4, Reg. Array, ALU, Data Cache, Hazard Detection Logic
Hazard detection is too slow: use static hazard detection

32 32 Transactional Single-Threaded Processor (simplified)
Diagram labels: Instr. Cache, +4, ALU, Data Cache, Conflict Detection, Undo Log, Reg. Array (shown twice), PC (shown twice)

33 33 Transactional Packet Processing
Hardware support to revert speculative changes to:
- Register file
- Program counter
- Data memory
To detect failed speculation:
- Record read and write sets of speculative threads
- Compare sets across threads
When does the set comparison take place?

34 34 Conflict Detection with Signatures
Suited for FPGA bitwise operations
- Hash of an address sets bits in a bit vector
- Set comparison is an AND operation
- Clearing sets is done in 1 cycle
- Requires many bits per thread
- Timing constraints allow read and write set tracking for 2 threads
- Made a single-threaded 2-processor implementation
Diagram labels: Signature Thread 0 and Signature Thread 1, each with W and R bit vectors, shown as W 00000000 / R 00000000 and as W 01000000 / R 00000000

35 35 Trie diagram: root; 1xx, 0xx; 11x, 00x; leaves 111 110 000

37 37 A New Meaning for Locks
- Optimistically consider locks
- No program change required
  Lock(); if ( f( ) ) shared_1 = a(); else shared_2 = b(); Unlock();
- Reduce conservative synchronization overhead
- Reduce challenge of fine-grained synchronization
Diagram labels: Thread1 to Thread4 under LOCKS; Thread1 to Thread4 under TRANSACTIONAL (with an x mark)

39 39 * can you list the apps? emphasize that train != test in methodology page

