
MPSoC Design using Application-Specific Architecturally Visible Communication. Theo Kluter, Philip Brisk, Edoardo Charbon, Paolo Ienne.


1 MPSoC Design using Application-Specific Architecturally Visible Communication. Theo Kluter, Philip Brisk, Edoardo Charbon, Paolo Ienne

2 Motivation: Streaming Applications. How can an embedded multiprocessor system-on-chip (MPSoC) be customized automatically to support efficient execution of complex algorithms? (W.J. Dally, et al. 2003)

3 Motivation © Tensilica 2007

4 Motivation: Automatic Parallelization

5 Motivation: Automatic Parallelization, Automatic Customization

6 Motivation (Parallelization): Streaming Applications. Load balancing

7 Motivation (Parallelization): Streaming Applications. Load balancing. Avoiding intra-processor communication

8 Motivation (Parallelization): Streaming Applications. Load balancing. Avoiding intra-processor communication

9 Motivation (Parallelization): Streaming Applications. Load balancing. Avoiding intra-processor communication

10 Motivation (Parallelization): Streaming Applications. Load balancing. Avoiding intra-processor communication. Synchronization. Hardware barrier (pipelined parallelization)

11 Motivation: Automatic Parallelization, Automatic Customization

12 Motivation (Customization): Streaming Applications. Instruction Set Extensions (L. Pozzi, et al. 2006; Tensilica, ARC, NIOS)

13 Motivation (Customization): Streaming Applications. Instruction Set Extensions (L. Pozzi, et al. 2006; Tensilica, ARC, NIOS). Architecturally Visible Storage (P. Biswas, et al. 2007; T. Kluter, et al. 2008)

14 Motivation: Automatic Parallelization, Automatic Customization, ?

15 Motivation: Automatic Parallelization, Automatic Customization, ? Only Load and Store instructions are allowed in the Instruction Set Extension (ISE) identification

16 Motivation: Automatic Parallelization, Automatic Customization, ? Only Load and Store instructions are allowed in the Instruction Set Extension (ISE) identification. Architecturally Visible Storage (AVS) memory is placed between processors to form Architecturally Visible Communication (AVC) buffers
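To make the idea of a buffer placed between two processors concrete, the following is a minimal software model of a small FIFO that a producer stage writes and a consumer stage reads. It is only a sketch: the slides do not describe the buffer's interface, so the names `avc_buffer`, `avc_put`, `avc_get` and the depth `AVC_CAPACITY` are hypothetical, and the model is single-threaded (no hardware synchronization is modeled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define AVC_CAPACITY 8  /* hypothetical buffer depth, in 16-bit words */

/* Hypothetical software model of a buffer sitting between two processors. */
typedef struct {
    short  data[AVC_CAPACITY];
    size_t head;   /* next slot the consumer reads */
    size_t count;  /* number of words currently stored */
} avc_buffer;

void avc_init(avc_buffer *b)
{
    assert(b != NULL);
    b->head = 0;
    b->count = 0;
}

/* Producer side: stores one word, returns false when the buffer is full. */
bool avc_put(avc_buffer *b, short word)
{
    assert(b != NULL);
    if (b->count == AVC_CAPACITY)
        return false;
    b->data[(b->head + b->count) % AVC_CAPACITY] = word;
    b->count++;
    return true;
}

/* Consumer side: fetches the oldest word, returns false when empty. */
bool avc_get(avc_buffer *b, short *word)
{
    assert(b != NULL && word != NULL);
    if (b->count == 0)
        return false;
    *word = b->data[b->head];
    b->head = (b->head + 1) % AVC_CAPACITY;
    b->count--;
    return true;
}
```

In this model the producer would call `avc_put` for each word it generates and the consumer would call `avc_get` until it returns false; in hardware, the equivalent accesses would presumably be issued by the processors' custom instructions rather than by function calls.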

17 Contents: Motivation, Parallelization, Communication, Automation

18 Parallelization (reference): Streaming Applications

19 Parallelization (reference): Streaming Applications (T.R. Halfhill 2000)

20 Parallelization (reference): Streaming Applications (T.R. Halfhill 2000)

21 Parallelization (reference)

22 Parallelization (reference): Reduced energy consumption

23 Parallelization (reference): Reduced energy consumption. Increased performance

24 Parallelization (reference): Reduced energy consumption. Increased performance. Energy of the memory subsystem only! (D. Tarjan, et al. 2006)

25 Parallelization (reference): Reduced energy consumption. Increased performance. Energy of the memory subsystem only! (D. Tarjan, et al. 2006)

26 Parallelization (homogeneous): Macro block data-parallel computation due to algorithmic data dependence. Theoretical speedup of 5x

27 Parallelization (homogeneous): time, data dependence

28 Parallelization (homogeneous)

29 Parallelization (homogeneous): Higher instruction cache pressure due to five distributed copies of the complete algorithm. The system prefers a four-way set-associative cache over a direct-mapped one

30 Parallelization (heterogeneous)

31 Parallelization (heterogeneous): Quantization is the critical execution path; however, it contains easy-to-detect data parallelism (M.I. Gordon, et al. 2006)

32 Parallelization (heterogeneous): Entropy Encoding is the next critical execution path, limiting the speedup to a factor of 4x (based on the execution on a single processor and assuming linear speedup)
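The 4x limit follows from a standard bound on pipelined execution: throughput is set by the slowest stage, so the speedup cannot exceed the single-processor time divided by the time of that stage. The sketch below computes this bound. The function name `pipeline_speedup_bound` and any stage times passed to it are illustrative assumptions; the slides do not give the measured per-stage times.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound on the speedup of a pipelined parallelization:
 * single-processor time (the sum of all stage times) divided by the
 * time of the slowest stage. Assumes one stage per processor and no
 * communication overhead.
 */
double pipeline_speedup_bound(const double *stage_time, size_t n)
{
    assert(stage_time != NULL && n > 0);

    double total = 0.0;
    double slowest = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(stage_time[i] > 0.0);
        total += stage_time[i];
        if (stage_time[i] > slowest)
            slowest = stage_time[i];
    }
    return total / slowest;
}
```

For instance, with hypothetical stage times of 15, 20, 25, 20 and 20 time units on five processors, the bound is 100 / 25 = 4: a stage that accounts for a quarter of the single-processor time caps the speedup at 4x no matter how the remaining work is distributed.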

33 Parallelization (heterogeneous): time, data dependence

34 Parallelization (heterogeneous)

35 Parallelization (heterogeneous): Reduced instruction cache pressure due to the distribution of the complete algorithm over five caches. The system prefers a 2 KB cache over a 4 KB one

36 Parallelization (comparison)

37 Contents: Motivation, Parallelization, Communication, Automation

38 Communication: Homogeneous parallelization: intra-processor communication (10 bytes). Heterogeneous parallelization: intra-processor communication (3 x 128 bytes)

39 Communication (homogeneous)

40 Communication (homogeneous)

41 Communication (homogeneous)

42 Communication (homogeneous)

43 Communication (homogeneous)

44 Communication (homogeneous)

45 Communication (homogeneous)

46 Communication (homogeneous)

47 Communication (homogeneous)

48 Communication (homogeneous): As expected, the communication has little influence on performance, and moving it to AVC buffers reduces energy consumption

49 Communication (heterogeneous)

50 Communication (heterogeneous)

51 Communication (heterogeneous)

52 Communication (heterogeneous)

53 Communication (heterogeneous): As expected, the communication has a high influence on performance, and moving it to AVC buffers significantly reduces energy consumption

54 Communication (summary)

55 Communication (summary)

56 Contents: Motivation, Parallelization, Communication, Automation

57 Automation: void quantisation( short *buffer, short *quant_table ) { register int temp,qval; register int i; for (i = 0 ; i < DCTSIZE2 ; i++)..... } Is this pointer a data structure that can be moved to an AVC buffer?

58 Automation: void quantisation( short *buffer, short *quant_table ) { register int temp,qval; register int i; for (i = 0 ; i < DCTSIZE2 ; i++)..... } A designer can disambiguate all data structures (time consuming) (Tensilica 2007)

59 Automation: void quantisation( short *buffer, short *quant_table ) { register int temp,qval; register int i; for (i = 0 ; i < DCTSIZE2 ; i++)..... } A designer can disambiguate all data structures (time consuming). A compiler might not be able to disambiguate all data structures (fast, but incomplete) (D.M. Gallagher 1995; Tensilica 2007)

60 Automation: void quantisation( short *buffer, short *quant_table ) { register int temp,qval; register int i; for (i = 0 ; i < DCTSIZE2 ; i++)..... } A designer can disambiguate all data structures (time consuming). A compiler might not be able to disambiguate all data structures (fast, but incomplete). Profiling can disambiguate all data structures it sees (fast, “complete”, but not guaranteed) (D.M. Gallagher 1995; S. Rul, et al. 2008; W. Thies, et al. 2007; Tensilica 2007)
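One way to picture profiling-based disambiguation is to record, during a profiled run, the address range touched through each pointer and then check whether the ranges ever overlapped. The sketch below is a hedged illustration of that idea only; it is not the tooling used by the authors or by the cited works, and `access_range`, `range_record` and `ranges_disjoint` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Address range [lo, hi) observed through one pointer while profiling. */
typedef struct {
    uintptr_t lo;
    uintptr_t hi;
    bool      seen;  /* false until the first access is recorded */
} access_range;

/* Called by (hypothetical) instrumentation on every profiled access. */
void range_record(access_range *r, const void *addr, size_t bytes)
{
    assert(r != NULL && addr != NULL && bytes > 0);
    uintptr_t lo = (uintptr_t)addr;
    uintptr_t hi = lo + bytes;
    if (!r->seen) {
        r->lo = lo;
        r->hi = hi;
        r->seen = true;
        return;
    }
    if (lo < r->lo) r->lo = lo;
    if (hi > r->hi) r->hi = hi;
}

/*
 * True when the two pointers never touched overlapping memory in the
 * profiled run. This holds only for the inputs that were profiled,
 * which is why such an analysis is "unsafe": another input could still
 * make the pointers alias.
 */
bool ranges_disjoint(const access_range *a, const access_range *b)
{
    assert(a != NULL && b != NULL);
    if (!a->seen || !b->seen)
        return true;
    return a->hi <= b->lo || b->hi <= a->lo;
}
```

Applied to `quantisation`, instrumentation would call `range_record` for each access through `buffer` and through `quant_table`; if `ranges_disjoint` holds at the end of the run, the two arguments behaved as separate data structures for that input.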

61 Automation (“safe” data structures)

62 Automation (“unsafe” data structures) [1] T. Kluter et al. 2008

63 Automation (flow): 1) Disambiguate all data structures (D.M. Gallagher 1995)

64 Automation (flow): 1) Disambiguate all data structures 2) Select all eligible data structures (D.M. Gallagher 1995; P. Biswas, et al. 2007; L. Benini, et al. 2000)

65 Automation (flow): 1) Disambiguate all data structures 2) Select all eligible data structures 3) Annotate zero communication cost (D.M. Gallagher 1995; P. Biswas, et al. 2007; L. Benini, et al. 2000)

66 Automation (flow): 1) Disambiguate all data structures 2) Select all eligible data structures 3) Annotate zero communication cost 4) Perform “standard” parallelization algorithm(s) (D.M. Gallagher 1995; P. Biswas, et al. 2007; L. Benini, et al. 2000; S. Rul, et al. 2008; W. Thies, et al. 2007)

67 Automation (flow): 1) Disambiguate all data structures 2) Select all eligible data structures 3) Annotate zero communication cost 4) Perform “standard” parallelization algorithm(s) 5) Insert AVC buffers where required (D.M. Gallagher 1995; P. Biswas, et al. 2007; L. Benini, et al. 2000; S. Rul, et al. 2008; W. Thies, et al. 2007; T. Kluter, et al. 2008)
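Steps 3 to 5 of the flow can be sketched on a small task graph: edges whose data structure was found eligible are given zero communication cost, a partitioner then evaluates candidate task-to-processor mappings using those costs, and buffers are inserted on eligible edges that end up crossing processors. The code below is an assumption-laden illustration, not the authors' implementation; the `edge` type and the three function names are hypothetical, and the partitioning algorithm itself (step 4) is left out and represented only by the cost function it would minimize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One producer/consumer relation between two tasks (hypothetical model). */
typedef struct {
    int    src, dst;      /* task indices */
    size_t bytes;         /* data exchanged per iteration */
    bool   avc_eligible;  /* outcome of steps 1 and 2 */
    size_t cost;          /* communication cost seen by the partitioner */
} edge;

/* Step 3: eligible edges are annotated with zero communication cost. */
void annotate_costs(edge *e, size_t n)
{
    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        e[i].cost = e[i].avc_eligible ? 0 : e[i].bytes;
}

/* Step 4 (objective only): total cost of edges that cross processors. */
size_t partition_cost(const edge *e, size_t n, const int *proc_of_task)
{
    assert(proc_of_task != NULL && (e != NULL || n == 0));
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (proc_of_task[e[i].src] != proc_of_task[e[i].dst])
            total += e[i].cost;
    return total;
}

/* Step 5: one buffer per eligible edge that crosses processors. */
size_t avc_buffers_needed(const edge *e, size_t n, const int *proc_of_task)
{
    assert(proc_of_task != NULL && (e != NULL || n == 0));
    size_t buffers = 0;
    for (size_t i = 0; i < n; i++)
        if (e[i].avc_eligible &&
            proc_of_task[e[i].src] != proc_of_task[e[i].dst])
            buffers++;
    return buffers;
}
```

Because eligible edges cost nothing, a mapping that splits tasks across such an edge looks as cheap to the partitioner as keeping them together, which is how the annotation can make additional parallelization solutions attractive.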

68 Conclusion: ● Our results confirm previous findings in automated parallelization ● Application-specific communication buffers improve performance and reduce energy consumption ● Application-specific communication buffers open up new automated parallelization solutions ● Application-specific communication can be used in the presence of “unsafe” analysis methods

