
1 Processor Acceleration Through Automated Instruction Set Customization
Nathan Clark, Hongtao Zhong, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan, Ann Arbor
University of Michigan Electrical Engineering and Computer Science
December 3, 2003

2 Motivation
Cell phones, PDAs, digital cameras, etc. are everywhere
– High-performance yet low-power design point
General core + ASIC solution
– Limited post-programmability
General core + application-specific instructions (CFUs)
[Figure: CPU + ASIC versus CPU + CFU]

3 What is a CFU?
Combine multiple primitive operations into one instruction
– Smaller code size, fewer RF reads
– Increased performance
[Figure: a DFG of &, |, <<, ^, * and + operations in which recurring groups are replaced by CFU 1 (+, ^) and CFU 2 (&, <<, |)]
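To make "smaller code size, fewer RF reads" concrete, here is a minimal Python sketch that counts instructions and register-file reads for a short straight-line sequence before and after collapsing it into a single CFU instruction. The three-address format and the example sequence are hypothetical. The sketch assumes every non-immediate operand is read from the register file and that the intermediate values (t1, t2) are dead after the sequence, so the CFU only reads values produced outside it.

```python
# Hypothetical three-address code: (dest, opcode, operands); ints are immediates.
Instr = tuple[str, str, tuple[object, ...]]

def rf_reads(seq: list[Instr]) -> int:
    """Register-file reads when each primitive op is its own instruction."""
    return sum(1 for _, _, ops in seq for o in ops if isinstance(o, str))

def cfu_rf_reads(seq: list[Instr]) -> int:
    """Reads if the whole sequence becomes one CFU: only values produced
    outside the sequence come from the register file (each read once)."""
    produced = {dest for dest, _, _ in seq}
    external = {o for _, _, ops in seq for o in ops
                if isinstance(o, str) and o not in produced}
    return len(external)

seq: list[Instr] = [
    ("t1", "&", ("a", "b")),
    ("t2", "<<", ("t1", 2)),
    ("t3", "^", ("t2", "c")),
]
print(len(seq), "instructions,", rf_reads(seq), "RF reads")  # 3 instructions, 5 RF reads
print(1, "instruction,", cfu_rf_reads(seq), "RF reads")      # 1 instruction, 3 RF reads
```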

4 Automation is Key
This is ¼ of the DFG for a single basic block of blowfish
[Figure: DFG fragment with node labels such as 159 XOR, 164 SHR, 173 AND]

5 Related Work
Tensilica Xtensa
– Commercial example
– RISC core + manually constructed CFU
Automatic instruction set synthesis is a mature field
– See the paper for a comparison of techniques
Our contributions
– Novel technique for automatic CFU creation
– System to utilize CFUs in multiple applications
– Analysis of how effectively CFUs built for one application apply to other applications in the same domain

6 System Overview
Synthesis
– Subgraph identification: discover candidates for CFUs; weed out what shouldn't be picked
– Selection: determine which candidates to use as CFUs
Compilation
– Subgraph replacement: make use of the CFUs in a range of applications

7 Subgraph Identification
Grow subgraphs from seed nodes
– All nodes are seeds
– Most growth directions don't make sense
How do we decide where to grow?
– Make decisions using the same factors an architect would weigh
– Take 4 factors into consideration: Criticality, Latency, Area, Input/Output
[Figure: example DFG with %, ^, <<, +, *, & and | nodes]

8 Subgraph Identification (continued)
[Figure: the example DFG with a first CFU candidate, & and <<, grown from a seed node]

9 Subgraph Identification (continued)
The sum of the four factors (Criticality, Latency, Area, Input/Output) determines the value of each growth direction
– These values guide growth only; they are NOT used to pick the CFUs
[Figure: CFU candidates & and <<, and & and +]
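Slides 7 to 9 describe the growth loop only as bullet points. Below is a minimal Python sketch of it, assuming an undirected adjacency map for the DFG, a caller-supplied `value(subgraph, node)` function standing in for the summed guide factors, and a simple node-count limit standing in for the external constraints. `grow_candidates` and `value` are hypothetical names, and real growth would also have to keep each subgraph legal (for example, no branch or memory operations).

```python
from typing import Callable

Graph = dict[str, set[str]]  # node -> neighbouring nodes (producers and consumers)

def grow_candidates(dfg: Graph,
                    value: Callable[[frozenset[str], str], float],
                    max_nodes: int) -> set[frozenset[str]]:
    """Grow one subgraph from every seed, always taking the highest-valued
    neighbouring node, and record each intermediate subgraph as a candidate.
    This costs at most seeds * max_nodes growth steps instead of enumerating
    every connected subgraph."""
    candidates: set[frozenset[str]] = set()
    for seed in dfg:
        sub = frozenset([seed])
        candidates.add(sub)
        while len(sub) < max_nodes:  # stand-in for the external constraints
            frontier = set().union(*(dfg[n] for n in sub)) - sub
            if not frontier:
                break
            # sorted() makes ties deterministic: max() keeps the first best.
            best = max(sorted(frontier), key=lambda n: value(sub, n))
            sub = sub | {best}
            candidates.add(sub)
    return candidates

# Usage on a three-node chain a - b - c with a flat (hypothetical) guide value.
path: Graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
cands = grow_candidates(path, lambda sub, n: 1.0, max_nodes=3)
print(len(cands))  # 6
```

Because each seed contributes at most `max_nodes` candidates, the candidate set grows linearly with the DFG size rather than exponentially.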

10 Critical Path
Combining operations on the critical path shrinks the longest dependence chains
– Maximizes the potential performance gain
Wt = 10 / (slack + 1)
– Slack is the number of cycles a node sits off the longest dependence path
[Figure: example DFG; a node with slack 0 scores 10/(0+1) = 10, a node with slack 2 scores 10/(2+1) = 3.33]
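The slack in the criticality weight can be computed with two longest-path passes over the DFG. This is a minimal Python sketch, assuming the DFG is a DAG given as a successor map (sinks included) with per-node latencies in cycles; the function names and the example graph are hypothetical.

```python
def slacks(succs: dict[str, list[str]], lat: dict[str, int]) -> dict[str, int]:
    """Cycles each node sits off the longest dependence path of a DAG."""
    preds: dict[str, list[str]] = {n: [] for n in succs}
    for n, outs in succs.items():
        for s in outs:
            preds[s].append(n)

    head: dict[str, int] = {}  # longest path from any source ending at n
    tail: dict[str, int] = {}  # longest path from n to any sink

    def to_n(n: str) -> int:
        if n not in head:
            head[n] = lat[n] + max((to_n(p) for p in preds[n]), default=0)
        return head[n]

    def from_n(n: str) -> int:
        if n not in tail:
            tail[n] = lat[n] + max((from_n(s) for s in succs[n]), default=0)
        return tail[n]

    longest = max(to_n(n) for n in succs)
    # The longest path through n counts n's own latency only once.
    return {n: longest - (to_n(n) + from_n(n) - lat[n]) for n in succs}

def criticality_weight(slack: int) -> float:
    """Wt = 10 / (slack + 1), as on the slide."""
    return 10 / (slack + 1)

# Chain a -> b -> c -> d plus a side input x -> d, unit latencies.
succs = {"a": ["b"], "b": ["c"], "c": ["d"], "x": ["d"], "d": []}
lat = {n: 1 for n in succs}
s = slacks(succs, lat)
print(s["a"], criticality_weight(s["a"]))            # 0 10.0
print(s["x"], round(criticality_weight(s["x"]), 2))  # 2 3.33
```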

11 Latency
Growing toward low-latency operations allows more nodes to be combined within one cycle
– Maximizes DFG compression
Worked examples: 10 * 0.3 / 0.6 = 5 and 10 * 0.3 / 0.36 = 8.33

Opcode   Area   Cycles
+        1.00   0.30
&        0.12   0.06
>>       0.01   ~0.00
^        0.16   0.09
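The slide's formula did not survive extraction, but its worked numbers fit the form 10 × latency(current) / latency(current + new node) if the current candidate is a single + (0.30 cycles) and the new node is either another + (0.60 total) or an & (0.36 total). That reading is an assumption, as is summing latencies over the candidate's ops. A minimal Python sketch:

```python
# Macrocell latencies in cycles, from the table above.
LATENCY = {"+": 0.30, "&": 0.06, ">>": 0.00, "^": 0.09}

def latency_weight(current: list[str], new_op: str) -> float:
    """Assumed form: 10 * latency(current) / latency(current + new_op).
    Latencies are summed over the ops, treating the candidate as a chain."""
    before = sum(LATENCY[op] for op in current)
    after = before + LATENCY[new_op]
    if after == 0.0:
        return 10.0  # nothing adds delay; assumed to be the best score
    return 10 * before / after

print(round(latency_weight(["+"], "+"), 2))  # 5.0
print(round(latency_weight(["+"], "&"), 2))  # 8.33
```

Under this form, cheap operations such as & or a constant shift keep the ratio close to 10, which is what steers growth toward them.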

12 Area
Want the most benefit for the least area
– Area is the sum of the macrocell areas
Worked examples: 10 * 0.5 / 0.5 = 10 and 10 * 0.5 / 1.5 = 3.33

Opcode   Area   Cycles
+        1.00   0.30
&        0.12   0.06
>>       0.01   ~0.00
^        0.16   0.09
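As with latency, the area formula itself is missing. The worked numbers fit 10 × area(current) / area(current + new node) with a current area of 0.5 and a new node adding roughly 0.0 (a shift) or 1.0 (an adder). This is an assumed reading, sketched in Python; only the summing of macrocell areas comes directly from the slide.

```python
# Macrocell areas (relative units), from the table above.
AREA = {"+": 1.00, "&": 0.12, ">>": 0.01, "^": 0.16}

def cfu_area(ops: list[str]) -> float:
    """Area of a candidate = sum of its macrocell areas."""
    return sum(AREA[op] for op in ops)

def area_weight(current_area: float, added_area: float) -> float:
    """Assumed form: 10 * area(current) / area(current + new node)."""
    return 10 * current_area / (current_area + added_area)

print(round(cfu_area(["+", "&", "^"]), 2))    # 1.28
print(round(area_weight(0.5, 0.0), 2))        # 10.0
print(round(area_weight(0.5, AREA["+"]), 2))  # 3.33
```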

13 Input/Output
Want CFUs to use as few RF ports as possible
– Smaller encoding
– Allows growth of larger candidates
Worked examples: 10 * 2 / (2+1) = 6.67 and 10 * 2 / (4+1) = 4
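Scoring RF port usage first requires counting a candidate's register inputs and outputs. The sketch below does that on a small hypothetical DFG. It then assumes the slide's denominators are inputs + outputs of the grown candidate (2 + 1 and 4 + 1). The numerator 2 in the worked examples is not explained on the slide, so it is kept as a parameter.

```python
# Hypothetical DFG: node -> (opcode, operands). Operands naming another node
# are internal values; anything else (e.g. "r1") is a register input.
DFG = {
    "n1": ("+", ["r1", "r2"]),
    "n2": ("+", ["n1", "r3"]),
    "n3": ("^", ["n2", "r4"]),
}
LIVE_OUT = {"n3"}  # values still needed after this block

def io_count(sub: set[str]) -> tuple[int, int]:
    """(inputs, outputs) a CFU covering `sub` would need from the RF."""
    inputs = {src for n in sub for src in DFG[n][1] if src not in sub}
    outputs = {n for n in sub
               if n in LIVE_OUT
               or any(n in DFG[m][1] for m in DFG if m not in sub)}
    return len(inputs), len(outputs)

def io_weight(ins: int, outs: int, numerator: float = 2.0) -> float:
    """Assumed form matching the worked examples: 10 * numerator / (ins + outs)."""
    return 10 * numerator / (ins + outs)

small, big = io_count({"n1"}), io_count({"n1", "n2", "n3"})
print(small, round(io_weight(*small), 2))  # (2, 1) 6.67
print(big, round(io_weight(*big), 2))      # (4, 1) 4.0
```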

14 Example
[Figure: example DFG with each candidate growth direction labeled by its summed weight: 35, 28.5, 37.5, 30.8, 28.5, 37.5]

15 Example
[Figure: example DFG with updated direction weights: 35, 28.5, 33.5, 30.8, 28.5, 40]

16 Example
[Figure: example DFG with updated direction weights: 35, 28.5, 36, 30.8, 28.5, 36]

17–23 Example (continued)
[Figure sequence: the candidate grows one node at a time through the example DFG, picking up &, ^ and + nodes]

24 Finished – Met External Constraints
[Figure: the fully grown candidate in the example DFG]

25 Set of Candidates
[Figure: the candidate subgraphs recorded during growth, increasing in size, built from ^, &, + and << nodes]

26 Avoids Exponential Explosion
[Figure: speedup chart; y-axis: speedup, 1.00 to 1.50]

27 Greedy Selection Heuristic
Use estimates of performance improvement / cost

Initial:
Subgraph Number   Value   Cost   Ops
1                 20      4      (3,4),(6,8)
2                 6       1      (1,3,7)
…                 …       …      …
N                 9       5      (1,7)

After one selection step:
Subgraph Number   Value   Cost   Ops
1                 10      4      (6,8)
2                 6       1      (1,3,7)
…                 …       …      …
N                 0       5
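The two tables on slide 27 can be reproduced with a small greedy loop. This Python sketch makes three assumptions: a candidate's value is split evenly over its occurrences (20 over two occurrences gives 10 each, matching the table), candidates are ranked by value/cost, and occurrences that overlap ops already covered by a selected CFU are dropped. The `Candidate` fields and the `budget` parameter are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float                       # estimated CFU cost
    gain_per_use: float               # estimated benefit of one replacement
    uses: tuple[frozenset[int], ...]  # DFG op ids covered by each occurrence

def greedy_select(cands: list[Candidate], budget: float):
    """Repeatedly pick the best value/cost candidate that fits the budget.
    After each pick, occurrences overlapping already-covered ops are dropped,
    so the remaining candidates' values shrink."""
    info = {c.name: c for c in cands}
    live = {c.name: list(c.uses) for c in cands}
    covered: set[int] = set()
    chosen: list[str] = []
    spent = 0.0

    def value(name: str) -> float:
        return info[name].gain_per_use * len(live[name])

    while True:
        for name in live:
            live[name] = [u for u in live[name] if not (u & covered)]
        viable = [n for n in live
                  if value(n) > 0 and spent + info[n].cost <= budget]
        if not viable:
            return chosen, {n: value(n) for n in live}
        best = max(viable, key=lambda n: value(n) / info[n].cost)
        chosen.append(best)
        spent += info[best].cost
        for u in live.pop(best):
            covered |= u

table = [
    Candidate("1", cost=4, gain_per_use=10, uses=(frozenset({3, 4}), frozenset({6, 8}))),
    Candidate("2", cost=1, gain_per_use=6, uses=(frozenset({1, 3, 7}),)),
    Candidate("N", cost=5, gain_per_use=9, uses=(frozenset({1, 7}),)),
]
print(greedy_select(table, budget=1))  # (['2'], {'1': 10, 'N': 0})
```

Subgraph 2 wins the first round (6/1 beats 20/4 and 9/5). Its ops (1,3,7) then invalidate one occurrence of subgraph 1 and the only occurrence of N, which reproduces the second table.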

28 Multiple applications can utilize CFUs
VFLib pattern matcher [Cor ’99]
[Figure: Instruction Synthesis produces a CFU Description; the compiler uses it to perform replacement in each application's DFG]
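Slide 28 names VFLib as the pattern matcher used for replacement. VFLib's own API is not reproduced here. Instead, this is a naive backtracking stand-in (not VF2) that finds every opcode- and edge-preserving embedding of a CFU pattern in another application's DFG. The dictionary and edge-set representation and the function name are hypothetical, and a real replacer would also have to check that matched intermediate values are not used outside the match.

```python
def find_matches(pat_ops: dict[str, str], pat_edges: set[tuple[str, str]],
                 dfg_ops: dict[str, str], dfg_edges: set[tuple[str, str]]
                 ) -> list[dict[str, str]]:
    """All injective maps pattern node -> DFG node that keep opcodes and map
    every pattern edge (producer, consumer) onto a DFG edge."""
    order = list(pat_ops)
    found: list[dict[str, str]] = []

    def extend(mapping: dict[str, str]) -> None:
        if len(mapping) == len(order):
            found.append(mapping)
            return
        p = order[len(mapping)]
        for d, op in dfg_ops.items():
            if op != pat_ops[p] or d in mapping.values():
                continue
            trial = {**mapping, p: d}
            # Only edges whose endpoints are both assigned can be checked yet.
            if all((trial[a], trial[b]) in dfg_edges
                   for a, b in pat_edges if a in trial and b in trial):
                extend(trial)

    extend({})
    return found

# Pattern: an & feeding a <<, searched for in a small DFG.
pattern_ops, pattern_edges = {"x": "&", "y": "<<"}, {("x", "y")}
dfg_ops = {"n1": "&", "n2": "<<", "n3": "&", "n4": "<<", "n5": "+"}
dfg_edges = {("n1", "n2"), ("n2", "n5"), ("n3", "n4")}
print(find_matches(pattern_ops, pattern_edges, dfg_ops, dfg_edges))
# [{'x': 'n1', 'y': 'n2'}, {'x': 'n3', 'y': 'n4'}]
```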

29 Experimental Setup
Implemented in the Trimaran toolset
Baseline machine issues 1 integer, 1 floating-point, 1 branch and 1 memory operation per cycle
– CFUs use the integer issue slot
CFU latency/area generated as the sum of each individual macrocell
– Pipeline latches are added if CFU latency > 1 clock cycle
– 300 MHz clock assumed
– No branch or memory instructions in CFUs
Four application domains tested
– Audio, Encryption, Image, Network
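Slide 29 states that a CFU's latency is the sum of its macrocell latencies and that pipeline latches are added when that sum exceeds one clock cycle. Below is a small sketch of that bookkeeping. It assumes the latencies in the slide 11 table are already expressed in 300 MHz cycles, and that a CFU needing k cycles gets k − 1 latch stages. Both are assumptions. Summing every macrocell is also conservative for CFUs whose operations run in parallel.

```python
import math

# Macrocell latencies in cycles (slide 11 table), assumed at a 300 MHz clock.
LATENCY = {"+": 0.30, "&": 0.06, ">>": 0.00, "^": 0.09}

def cfu_cycles(ops: list[str]) -> int:
    """CFU latency = sum of its macrocell latencies, rounded up to whole cycles."""
    total = sum(LATENCY[op] for op in ops)
    return max(1, math.ceil(total - 1e-9))  # tolerance for float round-off

def pipeline_latches(ops: list[str]) -> int:
    """Assumed: one latch stage between consecutive cycles of a multi-cycle CFU."""
    return cfu_cycles(ops) - 1

print(cfu_cycles(["&", "^", "+"]), pipeline_latches(["&", "^", "+"]))  # 1 0
print(cfu_cycles(["+"] * 4), pipeline_latches(["+"] * 4))              # 2 1
```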

30 Native Encryption Results

31 Encryption Cross Compile

32 Generalizing CFUs
Subsumed subgraphs (multiple paths)
Wildcards (multiple nodes)
[Figure: a CFU computing >>, | and + on IN_1, IN_2 and the constants 0x8 and 0xF; a subsumed variant whose constants can also be 0x0, covering multiple paths; a wildcard variant whose nodes accept either | or &, and either + or -]
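One reading of the garbled figure is a base CFU computing ((IN_1 >> 0x8) | 0xF) + IN_2. In the subsumed version, the constants may also be 0x0, which turns the shift and the OR into identities so that one unit covers the shorter paths. In the wildcard version, nodes accept | or & and + or -. The Python sketch below emulates such a generalized unit under that reading. The function, its parameters and the 32-bit width are hypothetical.

```python
import operator

MASK32 = 0xFFFFFFFF
OP1 = {"|": operator.or_, "&": operator.and_}  # wildcard node: | or &
OP2 = {"+": operator.add, "-": operator.sub}   # wildcard node: + or -

def generalized_cfu(in1: int, in2: int, shamt: int = 0x8, const: int = 0xF,
                    op1: str = "|", op2: str = "+") -> int:
    """((in1 >> shamt) op1 const) op2 in2, wrapped to 32 bits.

    shamt=0x0 and const=0x0 (with op1="|") make the shift and the OR
    identities, so the same unit also covers the shorter subsumed paths.
    """
    t = OP1[op1]((in1 & MASK32) >> shamt, const)
    return OP2[op2](t, in2 & MASK32) & MASK32

print(hex(generalized_cfu(0x1234, 5)))              # 0x24
print(generalized_cfu(7, 5, shamt=0x0, const=0x0))  # 12 (plain IN_1 + IN_2)
```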

33 Effects of Generalization
[Figure: speedup (1.0 to 2.0) for blowfish, bfish-rijn, bfish-sha, rijndael, rijn-bfish, rijn-sha, sha, sha-bfish and sha-rijn; series: CFUs, Subsumed Subgraphs]

34 Conclusions
Developed a two-phase instruction set synthesis system
– Guide function removes bad candidates
– Greedy selection heuristic
Substantial speedups can be attained with very little die impact
Subsumed subgraphs and wildcarding increase cross-application effectiveness

Domain         Encryption   Network   Image   Audio
Ave. Speedup   1.61         1.38      1.16    1.66

35 Questions?
http://cccp.eecs.umich.edu

36 Backup slides

37 Individual Factors - Blowfish

38 Individual Factors - Djpeg

39 Selection
Uses estimates of performance improvement
Greedy heuristic used
[Figure: example DFG]

