A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications Hong Wang & Robert A. Walker Computer.

1 A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications Hong Wang & Robert A. Walker Computer Science Department Kent State University Kent, OH 44242 USA

2 Outline of Talk (ASC Processor Group)
- SIMD Associative Computing
    - Associative Search & Associative Processing
    - PE Interconnection Network
    - Multiple Instruction Streams
- ASC Processor (Work Mostly Complete)
    - Pipelined Architecture
    - Reconfigurable PE Interconnection Network
    - Processor and Network Performance
- MASC Architecture (Work in Progress)
    - Implementation of Task Manager and Instruction Stream
    - Sample Code
    - Architecture and Sample Execution
- Sample Application
    - String Matching
- Conclusion

3 Associative SIMD Array
[Figure: an Assoc. Control Unit (CU) above an array of cells, each pairing a Memory with an APE]
Associative Computing = Associative Search + Associative Processing
- Associative Computing: tabular data, with cells referenced by content
- Associative Search: on a successful search, the cell is flagged
- Associative Processing: flagged cells are processed further
SIMD Associative Computing: each memory cell uses an associative processing element (APE), so all cells are searched concurrently

4 Associative Search
Find all “Ford” cars for sale
    Ford       Taurus    $22,000  Kent
    Chevrolet  Malibu    $20,000  Akron
    Ford       Taurus    $18,000  Akron
    Ford       Focus     $14,000  Kent
    Jeep       Wrangler  $25,000  Akron
[Figure: Assoc. Control Unit (CU) above five APEs, one per table row, each with a responder (R) flag]
Associative PEs (APEs) search for a key, and those that find it are flagged as responders

5 Associative Processing
[Figure: the car table from slide 4, with three APEs flagged as responders (R)]
The responders can be processed further: one at a time, all sequentially, or all in parallel, as needed
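
The search-then-process flow on slides 4 and 5 can be modeled in a few lines of software. This is only a minimal Python sketch, not the ASC hardware: each tuple stands in for one memory cell with its APE, and the names `CARS` and `associative_search` are hypothetical. In the real array every APE compares at the same time; the list comprehension here models only the resulting responder flags.

```python
# Software model of associative search (illustrative, not the ASC hardware).
# Each record stands in for one memory cell with its APE.
CARS = [
    ("Ford", "Taurus", 22000, "Kent"),
    ("Chevrolet", "Malibu", 20000, "Akron"),
    ("Ford", "Taurus", 18000, "Akron"),
    ("Ford", "Focus", 14000, "Kent"),
    ("Jeep", "Wrangler", 25000, "Akron"),
]


def associative_search(cells, field, key):
    """Return one responder flag per cell: True where cell[field] == key."""
    return [cell[field] == key for cell in cells]


# Associative search: flag every cell whose make is "Ford".
responders = associative_search(CARS, 0, "Ford")

# Associative processing: act only on the flagged cells.
first = next(c for c, r in zip(CARS, responders) if r)   # process one responder
prices = [c[2] for c, r in zip(CARS, responders) if r]   # process all responders
cheapest = min(prices)
```

Three of the five cells respond, which agrees with the three R flags shown on slide 5.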

6 ASC: Associative SIMD Array w/ PE Network
[Figure labels: Assoc. Control Unit (CU), Memory, APE, PE Interconnection Network]

7 MASC: ASC + Multiple Control Units / Instruction Streams
[Figure labels: three Assoc. Control Units (CUs), CU Interconnection Network, Memory, APE, PE Interconnection Network]

8 Outline of Talk (repeated from slide 2)

9 ASC Processor’s Pipelined Architecture
We have implemented a pipelined SIMD Associative (ASC) Processor using Altera FPGAs.
Five single-clock-cycle pipeline stages are split between the SIMD Control Unit (CU) and the PEs:
- In the Control Unit
    - Instruction Fetch (IF)
    - Part of Instruction Decode (ID)
- In the Scalar PE (SPE) and in each Parallel PE (PPE)
    - Rest of Instruction Decode (ID)
    - Execute (EX)
    - Memory Access (MEM)
    - Data Write Back (WB)
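
To see why five single-cycle stages help, the ideal cycle counts can be worked out directly. The sketch below is textbook pipeline arithmetic, not a measurement of the ASC Processor: it assumes no stalls or hazards (the slides do not discuss them), and the function names are hypothetical.

```python
def pipelined_cycles(n_instructions, stages=5):
    """Cycles to finish n instructions on an ideal pipeline with no stalls."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs `stages` cycles; each later one adds a cycle.
    return stages + n_instructions - 1


def unpipelined_cycles(n_instructions, stages=5):
    """Cycles if each instruction must pass through all stages before the next starts."""
    return stages * max(n_instructions, 0)


def ideal_speedup(n_instructions, stages=5):
    """Ratio of unpipelined to pipelined cycles; tends toward `stages` as n grows."""
    return unpipelined_cycles(n_instructions, stages) / pipelined_cycles(n_instructions, stages)
```

Under these assumptions, 100 instructions take 104 cycles instead of 500, so the speedup approaches, but never reaches, the stage count of 5.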

10 Pipelined ASC Processor with Reconfigurable Interconnection Network
[Figure labels: Control Unit (CU), Sequential PE (SPE), Parallel PE (PPE) Array, Instruction Memory, Decoder, Register File, Data Memory, IF/ID Latch, ID/EX Latch, EX/MEM Latch, MEM/WB Latch, Immediate Data, Broadcast Register Data]

11 Processing Element (PE)
[Figure labels: Register File, Data Switch, Comparator, Mask, MUX, Data Memory, ID/EX Latch, EX/MEM Latch, MEM/WB Latch]
The Comparator implements associative search: it pushes ‘1’ onto the top of the mask stack for responders, ‘0’ otherwise.
A ‘0’ on top of the mask stack disables the ID/EX Latch.

12 Pipelined ASC Processor’s Performance
- Our pipelined ASC Processor has been implemented on an Altera APEX20KC1000 FPGA with 70 8-bit PEs
    - Other 8-bit processor cores implemented on this FPGA / speed grade have clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz
- Our pipelined ASC Processor has a clock speed of 56.4 MHz, comparable with these other processors
    - With the 5-stage pipeline, our ASC Processor can approach a peak performance of 300 MHz

13 Reconfigurable PE Interconnection Network
- Our pipelined ASC Processor also has a reconfigurable PE interconnection network
- The reconfigurable PE network allows arbitrary PEs in the PE Array to be connected, without the restriction of physical adjacency, via
    - Linear array (currently implemented), or
    - 2D mesh (to be implemented soon)
- Each PE in the PE Array can
    - Choose to stay in the PE interconnection network, or
    - Choose to stay out of the PE interconnection network, so that it is bypassed by any inter-PE communication

14 Pipelined ASC Processor with Reconfigurable Interconnection Network
[Figure: block diagram repeated from slide 10]

15 Reconfigurable Network Implementation
[Figure labels: Data Switch, Register File, Register Data (from SPE), Immediate Data (from CU), Left Neighbor, Right Neighbor, Top of Mask Stack, Comparator & ID/EX Latch]
- Data switch
    - Passes register, broadcast, and immediate data to the PE and to its two neighbors
    - Routes data from the PE’s neighbors to its EX stage
- Reconfigurable network: supports Bypass Mode to remove the non-responder PEs from the network
    - Will be needed by the MASC Processor
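
The effect of Bypass Mode on the linear array can be illustrated with a small software model. This is a sketch under assumptions, not the data-switch logic itself: `in_network` is a hypothetical list of per-PE flags (True means the PE stays in the network), and `effective_neighbors` is an invented helper that reports which PEs each participating PE actually talks to once bypassed PEs are skipped.

```python
def effective_neighbors(in_network):
    """Map each participating PE index to its (left, right) participating neighbors.

    in_network[i] is True if PE i stays in the linear network and False if it
    is bypassed. None marks the end of the array.
    """
    active = [i for i, stays in enumerate(in_network) if stays]
    neighbors = {}
    for pos, pe in enumerate(active):
        left = active[pos - 1] if pos > 0 else None
        right = active[pos + 1] if pos + 1 < len(active) else None
        neighbors[pe] = (left, right)
    return neighbors


# PEs 1 and 2 opt out, so PE 0 and PE 3 become logical neighbors.
links = effective_neighbors([True, False, False, True, True])
```

In this model a transfer from PE 0 to PE 3 crosses two bypassed PEs in one step, which hints at why the first-to-last path grows long as the array gets larger.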

16 ASC Processor’s Network Performance
- Performance of the ASC Processor degrades as the number of PEs is increased with Bypass Mode present
    - Due to the long path from the first PE to the last PE in the PE array
- A 4-PE ASC Processor requires 2152 logic elements (LEs) and runs at 56.4 MHz with Bypass Mode present
    - When the number of PEs is increased to 50, the clock frequency drops to 22 MHz
- In the future we hope to reduce this delay using a pipelined or other multi-hop architecture

17 Outline of Talk (repeated from slide 2)

18 [Figure: state diagrams for the Task Manager and the Instruction Stream; state labels: IDLE, Task_Allocation, Wait_For_IS, Join, Call_TM, Task_Execution, IDLE]

19 MASC PE Structure
[Figure labels: PE, IS_TM_Chooser, IS1, IS2, TM1, TM2, ID Register]

20 [Figure: the Task Manager and Instruction Stream state diagrams from slide 18, with TM ID and IS ID added]

21 Assembly Code Example
    101 Parallel_Select_Start Mem(110)
    102 Pcase Condition1 Mem(104)
    103 Pcase Condition2 Mem(107)
    104 Case1
    105 …
    106 Parallel_Case_End
    107 Case2
    108 …
    109 Parallel_Case_End
    110 Parallel_Select_End (note: this does not trigger JOIN; a lack of tasks does)

22 [Figure: Task Managers TM0, TM1, TM2; Instruction Streams IS0, IS1, IS2; PEs PE0 through PE5. The same diagram is used on slides 23 to 31.]

23 Originally, all PEs listen to IS0

24 When a Parallel Select is met, the Task Manager takes over the PEs
    101 Parallel_Select_Start Mem(110)

25 The TM then calls IS0 to perform the 1st task
    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …

26 The TM then calls IS1 to perform the 2nd task
    2nd task:
    103 Pcase Condition2 Mem(107)
    107 Case2
    108 …
    1st task:
    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …

27 The 2nd task finishes and gives control back to the TM
    2nd task:
    107 Case2
    108 …
    109 Parallel_Case_End
    1st task:
    102 Pcase Condition1 Mem(104)
    104 Case1
    105 …

28 The 1st task finishes and gives control back to the TM
    104 Case1
    105 …
    106 Parallel_Case_End

29 Control returns to the last IS to finish, which is IS0
    110 Parallel_Select_End

30 IS1 meets a nested Parallel Select

31 TM1 allocates the two tasks to IS1 and IS2
[Figure labels: A = 2; C = A; B = A; Common Register]
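
The allocation step in the walkthrough above (a Task Manager handing each Pcase task to an instruction stream, with PEs then listening to the stream that runs their case) can be sketched as follows. This is a loose software analogy under stated assumptions, not the MASC implementation: the function `allocate_tasks` is hypothetical, the mapping of PEs to cases is invented for illustration (the slides do not say which PEs satisfy which condition), and leaving unallocated tasks in a pending list is an assumption.

```python
def allocate_tasks(tasks, idle_streams, pe_task):
    """Pair each task with an idle instruction stream, in program order.

    tasks:        task labels in program order, e.g. ["Case1", "Case2"]
    idle_streams: instruction streams currently free to run a task
    pe_task:      for each PE, the task whose condition it satisfies (or None)
    Returns (assignment, pending, listens_to).
    """
    # Assumption: tasks are matched to idle streams one-to-one, first come first served.
    assignment = dict(zip(tasks, idle_streams))
    # Assumption: tasks without a free stream simply wait.
    pending = [t for t in tasks if t not in assignment]
    # Each PE listens to the stream running its task, or to none.
    listens_to = {pe: assignment.get(task) for pe, task in pe_task.items()}
    return assignment, pending, listens_to


# Invented example data: which PE satisfies which Pcase condition.
pe_task = {"PE0": "Case1", "PE1": "Case2", "PE2": "Case1",
           "PE3": "Case2", "PE4": None, "PE5": "Case1"}
assignment, pending, listens_to = allocate_tasks(
    ["Case1", "Case2"], ["IS0", "IS1"], pe_task)
```

With two tasks and two free streams, Case1 goes to IS0 and Case2 to IS1, mirroring slides 25 and 26; join and nested-select handling are left out of the sketch.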

32 Outline of Talk (repeated from slide 2)

33 Sample Application: String Matching
- One of the most fundamental computing operations
- Variable Length Don’t Care (VLDC) SIMD associative algorithm (Mary Esenwein 1997 and Ping Xu 2005) can find all instances of a pattern string within a larger text string
    - Exact-match version shown here
    - Extensions support single-character and variable-length “don’t cares”
- Demonstrates associative search, associative computing (responder processing), and the linear PE interconnection network

34 String Match by Associative Computing
Look for a match of pattern string AB in text string ABAA. Initialize variables as shown below. Note that “$” indicates a parallel variable.
    cell      1  2  3  4  5
    text$     @  A  B  A  A
    counter$  0  0  0  0  0
    match$    0  0  0  0  0
    patt_counter = 0, patt_length = 2, patt_string = AB

35 String Match by Associative Computing
Responders are text$ == patt_string[j] and counter$ == patt_counter;
[Figure: all five APEs shown with responder (R) flags; patt_counter = 0]

36 String Match by Associative Computing
Responders are text$ == patt_string[j] and counter$ == patt_counter;
[Figure: a single APE remains flagged as a responder]

37 String Match by Associative Computing
Responders add 1 to counter$ and send the result to counter$ of the preceding cell via the network; patt_counter++;
    counter$ of the receiving cell: 0 → 1
    patt_counter: 0 → 1

38 String Match by Associative Computing
Responders are text$ == patt_string[j] and counter$ == patt_counter;
[Figure: all five APEs shown with responder (R) flags; patt_counter = 1]

39 String Match by Associative Computing
Responders are text$ == patt_string[j] and counter$ == patt_counter;
[Figure: a single APE remains flagged as a responder]

40 String Match by Associative Computing
Responders add 1 to counter$ and send the result to counter$ of the preceding cell via the network; patt_counter++;
    counter$ of the receiving cell: 0 → 2
    patt_counter: 1 → 2

41 String Match by Associative Computing
Responders are counter$ == patt_length;
[Figure: all five APEs shown with responder (R) flags; patt_counter = 2, patt_length = 2]

42 String Match by Associative Computing
Responders are counter$ == patt_length;
[Figure: a single APE remains flagged as a responder]

43 String Match by Associative Computing
Responders send 1 to match$ of the next cell via the network;
    match$ of the receiving cell: 0 → 1

44 String Match by Associative Computing
Responders are match$ == 1;
Indicates the cell(s) where a match of pattern string AB in text string ABAA begins
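
The exact-match walkthrough on slides 34 to 44 can be simulated sequentially. The following is a minimal Python sketch, not the authors' ASC code: lists stand in for the parallel variables `text$`, `counter$`, and `match$`, and list indexing stands in for the linear PE network. Scanning the pattern from its last character to its first is inferred from the slides (responders send to the preceding cell); keeping the `@` cell out of the search and returning 0-based text positions are assumptions made for the sketch, as is the function name `asc_string_match`.

```python
def asc_string_match(text, pattern, sentinel="@"):
    """Return 0-based positions in `text` where `pattern` begins.

    Sequential model of the exact-match associative algorithm: one list
    element per PE, with a sentinel cell placed before the text.
    """
    if not pattern:
        return []  # assumption: an empty pattern yields no matches
    cells = [sentinel] + list(text)       # text$
    n = len(cells)
    counter = [0] * n                     # counter$
    match = [0] * n                       # match$
    patt_counter = 0

    # Walk the pattern from its last character to its first.
    for j in range(len(pattern) - 1, -1, -1):
        # Associative search: all cells compare "at once" (computed before any update).
        responders = [i for i in range(1, n)
                      if cells[i] == pattern[j] and counter[i] == patt_counter]
        # Responders send counter$ + 1 to the preceding cell over the network.
        for i in responders:
            counter[i - 1] = patt_counter + 1
        patt_counter += 1

    # Cells whose counter$ reached patt_length flag the next cell as a match start.
    for i in range(n - 1):
        if counter[i] == len(pattern):
            match[i + 1] = 1

    # Convert cell indices to text positions (cell 0 is the sentinel).
    return [i - 1 for i in range(1, n) if match[i] == 1]


positions = asc_string_match("ABAA", "AB")
```

For the slide example, the single match is reported at text position 0, which corresponds to cell 2 in the slides' 1-based cell numbering.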

45 Conclusion
- We have implemented a SIMD associative ASC Processor (on an FPGA) that combines the parallelism of SIMD architectures with the search capabilities of associative computing
    - Performance is improved by adding a 5-stage pipeline, split between the Control Unit and the PEs
    - Additional functionality is provided by a reconfigurable PE interconnection network
- Future work will include
    - Support for multiple Control Units (in progress)
    - Performance improvement to support more efficient broadcast to a large number of PEs

