
1 IXP Lab 2012: Part 3 Programming Tips

2 Outline
Memory Independent Techniques
– Instruction Selection
– Task Partition
Memory Dependent Techniques
– Reducing Overhead
  – Reduce the number of memory accesses
  – Reduce average access latency
– Hiding Overhead
NCKU CSIE CIAL Lab

3 Memory Independent Techniques
Instruction Selection
– General Coding Skill
– Use Hardware Instruction
Task Partition
– Multi-Processing
– Context-Pipelining

4 General Coding Skill
Remove loops
Shift Operation
– Avoid using multiply and divide
Inline Function
– __inline & __forceinline
Branch Prediction
– Branch Prediction Penalty
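The "Shift Operation" tip can be illustrated with a minimal sketch in portable C (not Microengine C). The function names are hypothetical and the constants are chosen only for illustration. The rewrites are exact for unsigned integers, with the usual wrap-around on overflow.

```c
#include <stdint.h>

/* Multiply by 8 with a left shift (8 == 2^3). */
static uint32_t times8(uint32_t x) { return x << 3; }

/* Unsigned divide by 16 with a right shift (16 == 2^4). */
static uint32_t div16(uint32_t x) { return x >> 4; }

/* Modulo by a power of two reduces to a bit mask. */
static uint32_t mod32(uint32_t x) { return x & 31u; }

/* Multiply by a constant that is not a power of two: 10*x == 8*x + 2*x. */
static uint32_t times10(uint32_t x) { return (x << 3) + (x << 1); }
```

For example, times10(7) yields 70 using two shifts and one add.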

5 Hardware Instruction
POP_COUNT
FFS
Multiply
CRC
Hashing
CAM

6 POP_COUNT --Brief
Population Count
Reports the number of bits set in a 32-bit register
3 cycles latency
Example:
– pop_count( 0x3121 ) = ?
– 0x3121 = 0011 0001 0010 0001
– Result = 5

7 POP_COUNT --Naïve Implementation
    unsigned int pop_count_for(unsigned int x)
    {
        unsigned int y = 0;      /* running count of set bits */
        unsigned int i;
        for (i = 0; i < 32; i++) {
            if ((x & 1) == 1)    /* test the lowest bit */
                y++;
            x = x >> 1;          /* move the next bit into position */
        }
        return y;
    }

8 POP_COUNT --Faster Implementation
    unsigned int pop_count_agg(unsigned int x)
    {
        x -= ((x >> 1) & 0x55555555);                        /* counts per 2-bit field */
        x = (((x >> 2) & 0x33333333) + (x & 0x33333333));    /* counts per 4-bit field */
        x = (((x >> 4) + x) & 0x0f0f0f0f);                   /* counts per byte */
        x += (x >> 8);
        x += (x >> 16);
        return (x & 0x0000003f);                             /* total is at most 32 */
    }
Reference: http://aggregate.org/MAGIC/

9 POP_COUNT --Hardware Instruction
    unsigned int pop_count_hardware(unsigned int x)
    {
        return pop_count(x);
    }
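The two software versions above can be cross-checked on an ordinary host. The following is a sketch in portable C99 rather than Microengine C; the helper pop_count_versions_agree and its sample values are hypothetical additions for illustration.

```c
#include <stdint.h>

/* Loop version: test the low bit 32 times. */
static uint32_t pop_count_for(uint32_t x) {
    uint32_t y = 0;
    for (uint32_t i = 0; i < 32; i++) {
        y += x & 1u;
        x >>= 1;
    }
    return y;
}

/* Branch-free version: sum bits in 2-, 4-, and 8-bit fields, then fold. */
static uint32_t pop_count_agg(uint32_t x) {
    x -= (x >> 1) & 0x55555555u;
    x = ((x >> 2) & 0x33333333u) + (x & 0x33333333u);
    x = ((x >> 4) + x) & 0x0f0f0f0fu;
    x += x >> 8;
    x += x >> 16;
    return x & 0x3fu;
}

/* Returns 1 if both versions agree on a handful of sample values. */
static int pop_count_versions_agree(void) {
    static const uint32_t samples[] = {0u, 1u, 0x3121u, 0x80000000u, 0xffffffffu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (pop_count_for(samples[i]) != pop_count_agg(samples[i]))
            return 0;
    return 1;
}
```

Both functions return 5 for 0x3121, matching the example on slide 6.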

10 POP_COUNT --Additional Information
Bitmap-RFC (Liu, TECS 2008)

11 FFS
Finds the first bit set in the data and returns its position, counting from the least significant bit (bit 0)
Example:
– ffs ( 0x3121 ) = 0   (0011 0001 0010 0001)
– ffs ( 0x3120 ) = 5   (0011 0001 0010 0000)
– ffs ( 0x3100 ) = 8   (0011 0001 0000 0000)
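For comparison with the hardware instruction, a software equivalent might look like the sketch below (portable C, hypothetical name ffs_soft). The return value of -1 for a zero input is an assumption of this sketch; the slides do not say what the hardware reports in that case.

```c
#include <stdint.h>

/* Returns the position (0 = least significant bit) of the lowest set bit.
 * Returns -1 when x is zero; that convention is an assumption of this sketch. */
static int ffs_soft(uint32_t x) {
    if (x == 0)
        return -1;
    int pos = 0;
    while ((x & 1u) == 0) {   /* shift right until the low bit is set */
        x >>= 1;
        pos++;
    }
    return pos;
}
```

The loop takes up to 31 iterations, which is the cost the single hardware instruction avoids.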

12 Multiply
Specific Multiply Instructions
– Multiply_24x8()
– Multiply_16x16()
– Multiply_32x32_hi()
– Multiply_32x32_lo()
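Judging by their names, the _hi and _lo variants appear to return the upper and lower 32 bits of a 64-bit product. That reading is an assumption, not something the slides state. A portable C sketch of that split, with hypothetical function names, is:

```c
#include <stdint.h>

/* Upper 32 bits of the 64-bit product a * b. */
static uint32_t mul32_hi(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Lower 32 bits of the 64-bit product a * b. */
static uint32_t mul32_lo(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b);
}
```

For example, 0x10000 * 0x10000 equals 2^32, so the high half is 1 and the low half is 0.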

13 CRC
14 cycles latency
Example of CRC operation
    crc_write( 0x42424242 );
    crc_32_be( source_address, bytes_0_3 );
    crc_32_be( dest_address, bytes_0_3 );
    …
    Cache_index = crc_read();

14 Hash
hash_48()
hash_64()
hash_128()
Example:
    SIGNAL sig_hash;
    hash_48(data_out, data_in, count, sig_done, &sig_hash);
    __wait_for_all(&sig_hash);

15 CAM --Brief
Content Addressable Memory
Each ME has 16 32-bit CAM entries
Each ME's CAM is private; other MEs cannot access it
On a lookup, all entries are searched in parallel
On a successful lookup, the index of the matching entry is returned
Otherwise, the index of the entry to be replaced is returned

16 CAM --Structure
cam_lookup_t

17 CAM --Usage
    cam_lookup_t cam_result;
    cam_result = cam_lookup( data );
    if( cam_result.hit == 1 )
    {
        /* Hit: access entry cam_result.entry_num */
        …
    }
    else
    {
        /* Miss: cam_result.entry_num is the entry to be replaced */
        ……
        cam_write( cam_result.entry_num, data, 15 );
    }
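The lookup-then-write pattern above can be tried on a host with a small software model. This is a sketch in portable C, not the hardware CAM: all soft_cam_* names are hypothetical, the state argument of cam_write is omitted, and least-recently-used replacement is an assumption of the model, since the slides only say that a miss returns the entry to be replaced.

```c
#include <stdint.h>

#define CAM_ENTRIES 16   /* 16 entries per ME, as stated on the slides */

typedef struct { unsigned hit; unsigned entry_num; } soft_cam_result_t;

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  valid[CAM_ENTRIES];
    uint32_t last_use[CAM_ENTRIES];  /* timestamp of last hit or write */
    uint32_t clock;
} soft_cam_t;

static void soft_cam_clear(soft_cam_t *cam) {
    for (unsigned i = 0; i < CAM_ENTRIES; i++) {
        cam->tag[i] = 0;
        cam->valid[i] = 0;
        cam->last_use[i] = 0;
    }
    cam->clock = 0;
}

/* Hit: returns the matching entry. Miss: returns the least recently used
 * entry (LRU is an assumption of this model). The hardware compares all
 * entries in parallel; this loop only imitates the result. */
static soft_cam_result_t soft_cam_lookup(soft_cam_t *cam, uint32_t data) {
    soft_cam_result_t r = {0, 0};
    unsigned victim = 0;
    for (unsigned i = 0; i < CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == data) {
            cam->last_use[i] = ++cam->clock;
            r.hit = 1;
            r.entry_num = i;
            return r;
        }
        if (cam->last_use[i] < cam->last_use[victim])
            victim = i;
    }
    r.entry_num = victim;
    return r;
}

static void soft_cam_write(soft_cam_t *cam, unsigned entry, uint32_t data) {
    cam->tag[entry] = data;
    cam->valid[entry] = 1;
    cam->last_use[entry] = ++cam->clock;
}
```

A caller follows the same shape as the slide: look up, use entry_num on a hit, and write the tag into entry_num on a miss.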

18 Task Partition
Multi-Processing
– More computing power
– Easy to implement
Context-Pipelining
– More usable resources
– Hard to balance

19 Memory Dependent Techniques --Reducing Overhead
Reduce the number of memory accesses
– Wide-word Accesses
– Result Caches
Reduce average access latency
– Multi-level Memory Hierarchy
– Data Cache

20 Wide-Word Accesses --Brief
Access the needed data in batches
Reduces the number of accesses required
Useful when the data is stored contiguously
[Figure: eight contiguous 4-byte words at MEM_ADDR+0, +4, +8, +12, +16, +20, +24, +28]

21 Wide-Word Accesses --Usage (One Node per Access)
    __declspec(sram_read_reg) UINT32 A;
    SIGNAL sig_read;
    sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read);
    __wait_for_all( &sig_read );
    /* Access A ...... */
Result: 8 accesses are needed

22 Wide-Word Accesses --Usage (Two Nodes per Access)
    __declspec(sram_read_reg) UINT32 A[2];
    SIGNAL sig_read;
    sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read);
    __wait_for_all( &sig_read );
    /* Access A ...... */
Result: 4 accesses are needed

23 Wide-Word Accesses --Usage (Four Nodes per Access)
    __declspec(sram_read_reg) UINT32 A[4];
    SIGNAL sig_read;
    sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read);
    __wait_for_all( &sig_read );
    /* Access A ...... */
Result: 2 accesses are needed
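The access counts quoted on the three usage slides (8, 4, and 2) follow from dividing the eight nodes by the batch width. The sketch below reproduces that count in portable C with a simulated read. sim_read, wide_search, and access_count are hypothetical names, and a plain array stands in for SRAM.

```c
#include <stdint.h>
#include <string.h>

#define NODES 8   /* eight contiguous longwords, as in the slides */

static unsigned access_count;   /* number of simulated memory accesses */

/* Stand-in for one memory read of `count` contiguous longwords. */
static void sim_read(uint32_t *dst, const uint32_t *mem,
                     unsigned index, unsigned count) {
    memcpy(dst, mem + index, count * sizeof *dst);
    access_count++;
}

/* Linear search over NODES words, fetching `width` words per access.
 * Assumes width divides NODES. Returns the index of key, or -1. */
static int wide_search(const uint32_t *mem, uint32_t key, unsigned width) {
    uint32_t buf[NODES];
    for (unsigned base = 0; base < NODES; base += width) {
        sim_read(buf, mem, base, width);
        for (unsigned j = 0; j < width; j++)
            if (buf[j] == key)
                return (int)(base + j);
    }
    return -1;
}
```

A failed search with width 1, 2, or 4 performs 8, 4, or 2 simulated accesses respectively. A successful search stops early, so a wider batch can also fetch words that are never examined.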

24 Wide-Word Accesses --Experiment
Platform: IXP2800
Total Accesses: 8 LW (8 * 4 Bytes)
Case              Total Cycles   Average Cycles / LW
1 LW * 8 Times    1211           151.38
2 LW * 4 Times    725            90.63
4 LW * 2 Times    460            57.50
8 LW * 1 Time     387            48.38

25 Wide-Word Accesses --Limitation
Data must be contiguous
– Suitable for linear search
– Does not support random access
The number of Transfer Registers is fixed
– Each thread has 16 read / write registers
– The Tx-Regs may be reserved by others

26 Result Cache --Brief
Caches the result computed by the application
If the same fields appear again, the cached result is returned
Memory accesses are reduced on a cache hit
Depends on the temporal locality of the traffic

27 Result Cache --IXP2400
No hardware cache is supported in the IXP2400 ME
A set-associative cache is not easy to implement
The replacement policy is also an overhead

28 Result Cache --Design Consideration
Shared or private cache?
Size of cache?
Works with specific hardware?
Miss penalty handling?
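One way to sidestep the set-associativity and replacement-policy costs mentioned above is a direct-mapped software cache. The sketch below is portable C and is not taken from the slides: RC_SLOTS, rc_lookup, and demo_classify are hypothetical, and demo_classify merely stands in for whatever expensive, memory-heavy lookup the application performs.

```c
#include <stdint.h>

#define RC_SLOTS 64   /* hypothetical size; must be a power of two */

typedef struct { uint32_t key; uint32_t result; uint8_t valid; } rc_entry_t;

typedef struct {
    rc_entry_t slot[RC_SLOTS];
    unsigned hits;
    unsigned misses;
} result_cache_t;

/* Type of the slow lookup whose results are being cached. */
typedef uint32_t (*slow_lookup_fn)(uint32_t key);

/* Placeholder for the real lookup (e.g. a classification); illustration only. */
static uint32_t demo_classify(uint32_t key) { return key % 7u; }

/* Direct-mapped: each key maps to exactly one slot, so a miss simply
 * overwrites that slot and no replacement policy is needed. */
static uint32_t rc_lookup(result_cache_t *rc, uint32_t key, slow_lookup_fn slow) {
    rc_entry_t *e = &rc->slot[key & (RC_SLOTS - 1)];
    if (e->valid && e->key == key) {
        rc->hits++;
        return e->result;
    }
    rc->misses++;
    e->key = key;
    e->result = slow(key);
    e->valid = 1;
    return e->result;
}
```

The cost of this simplicity is conflict misses: two keys that share a slot evict each other even when the rest of the cache is empty.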

29 Result Cache --Example

30 Multi-Level Memory Hierarchy --Brief
Reduces the average access latency
The number of accesses remains unchanged
If the data fits in a faster memory, place it there

31 Multi-Level Memory Hierarchy --Data Placement
Small and read-only
– Hard code
Small but needs updating
– Local Memory
Larger
– Scratchpad
Largest
– SRAM

32 Multi-Level Memory Hierarchy --Packet Data Type
Packet related data
– Temporary data
– Valid with a specific packet
– Local Memory
Flow related data
– Related to a specific flow
– Spatial locality
– Wide-Word Access
Application related data
– Valid with a specific application
– Temporal locality
– Result Cache

33 Split-Cache (Z. Liu, IET-COM 2007)
Two separate hardware caches for application data and flow data

34 Data Cache --Brief
Hardware cache mechanism that caches the data for packet processing
– App-Cache
– Flow-Cache
However, not supported by IXP2400 (needs additional hardware)

35 Data Cache --CAM + Local Memory
CAM working together with Local Memory acts like a hardware cache
However, the number of CAM entries is limited
Each CAM entry may work with several Local Memory cache entries
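The pairing of one CAM tag with several Local Memory entries can be sketched in portable C as follows. This is a host-side model under stated assumptions: lm_cache_t and lm_cache_read are hypothetical names, four words per tag is an arbitrary choice, round-robin replacement is assumed for simplicity, and plain arrays stand in for both the CAM and Local Memory.

```c
#include <stdint.h>

#define TAGS 16           /* one tag per CAM entry, 16 per ME as on the slides */
#define WORDS_PER_TAG 4   /* hypothetical: Local Memory words covered by one tag */

typedef struct {
    uint32_t tag[TAGS];                  /* models the CAM */
    uint8_t  valid[TAGS];
    uint32_t data[TAGS][WORDS_PER_TAG];  /* models Local Memory */
    unsigned next_victim;                /* round-robin replacement (assumption) */
    unsigned fills;                      /* number of slow-memory line fetches */
} lm_cache_t;

/* Reads word index `addr` of `backing` through the cache. */
static uint32_t lm_cache_read(lm_cache_t *c, const uint32_t *backing, uint32_t addr) {
    uint32_t tag = addr / WORDS_PER_TAG;
    uint32_t off = addr % WORDS_PER_TAG;

    /* The hardware CAM compares all tags at once; this loop imitates it. */
    for (unsigned i = 0; i < TAGS; i++)
        if (c->valid[i] && c->tag[i] == tag)
            return c->data[i][off];

    /* Miss: fetch the whole line covered by this tag in one fill. */
    unsigned v = c->next_victim;
    c->next_victim = (v + 1) % TAGS;
    for (unsigned j = 0; j < WORDS_PER_TAG; j++)
        c->data[v][j] = backing[tag * WORDS_PER_TAG + j];
    c->tag[v] = tag;
    c->valid[v] = 1;
    c->fills++;
    return c->data[v][off];
}
```

With four words per tag, the 16 tags cover 64 cached words, which is how a small CAM can front a larger Local Memory region.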

36 Memory Dependent Techniques --Hiding Overhead
Does not really reduce the overhead, but overlaps it with other work
– Hardware Multi-Threading
– Asynchronous Memory

37 Hardware Multi-Threading
A thread swaps itself out and lets another thread execute while it accesses memory
Each thread keeps its own set of registers, so no stack is needed for thread swapping
Round-robin scheduling
No thread preemption

38 Asynchronous Memory --Brief
A thread is not blocked when it issues a memory request
Thus, a thread can issue multiple memory requests at a time

39 Asynchronous Memory --Example (1 Issue)
    Read X
    __wait_for_all ( &sig_x )
    Read Y
    __wait_for_all ( &sig_y )
    // Use X and Y
    …

40 Asynchronous Memory --Example (2 Issues)
    Read X
    Read Y
    __wait_for_all ( &sig_x, &sig_y )
    // Use X and Y
    …
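The benefit of issuing both reads before waiting can be estimated with a toy cycle model. The sketch below is portable C, not a measurement: it assumes every read has the same fixed latency, that issuing a read costs a fixed number of cycles, and that the memory system can overlap all outstanding reads. It ignores queuing and bandwidth limits. Both function names are hypothetical.

```c
/* Cycles to complete n reads when each read is waited on before the
 * next one is issued (the "1 Issue" pattern). */
static unsigned cycles_blocking(unsigned n, unsigned issue, unsigned latency) {
    return n * (issue + latency);
}

/* Cycles when all n reads are issued back to back and waited on together
 * (the "2 Issues" pattern, generalized to n outstanding reads). */
static unsigned cycles_overlapped(unsigned n, unsigned issue, unsigned latency) {
    return n == 0 ? 0 : n * issue + latency;
}
```

With n = 2, an issue cost of 1 cycle, and a latency of 100 cycles, the model gives 202 cycles for the blocking pattern and 102 for the overlapped one. Real hardware will fall short of this ideal once requests contend for the memory controller.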

41 Wide-Word Access + Multiple Issues
[Figure: eight contiguous 4-byte words at MEM_ADDR+0 through MEM_ADDR+28]

42 Wide-Word Access + Multiple Issues (1LW, 2 Issue)
[Figure: eight contiguous 4-byte words at MEM_ADDR+0 through MEM_ADDR+28]

43 Wide-Word Access + Multiple Issues (2LW, 2 Issue)
[Figure: eight contiguous 4-byte words at MEM_ADDR+0 through MEM_ADDR+28]

44 Wide-Word Access + Multiple Issues (4LW, 2 Issue)
[Figure: eight contiguous 4-byte words at MEM_ADDR+0 through MEM_ADDR+28]

45 Wide-Word Access + Multiple Issues (Experiment)
Scheme            Total Cycles   Average Cycles / LW
1 LW * 1 Issue    1211           151.38
2 LW * 1 Issue    725            90.63
4 LW * 1 Issue    460            57.50
8 LW * 1 Issue    387            48.38
1 LW * 2 Issue    716            89.50
2 LW * 2 Issue    445            55.63
4 LW * 2 Issue    364            45.50
1 LW * 4 Issue    396            49.50
2 LW * 4 Issue    320            40.00
1 LW * 8 Issue    318            39.75
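The per-longword averages in the experiment are the total cycle counts divided by the eight longwords read, and the same totals give each scheme's speedup over the 1 LW * 1 Issue baseline. The portable C sketch below encodes the measured totals from the slide; the type and function names are hypothetical.

```c
#define LONGWORDS 8   /* every scheme reads the same 8 longwords */

typedef struct { const char *scheme; unsigned total_cycles; } measurement_t;

/* Total cycles as reported in the experiment slide. */
static const measurement_t results[] = {
    {"1 LW * 1 Issue", 1211}, {"2 LW * 1 Issue", 725},
    {"4 LW * 1 Issue",  460}, {"8 LW * 1 Issue", 387},
    {"1 LW * 2 Issue",  716}, {"2 LW * 2 Issue", 445},
    {"4 LW * 2 Issue",  364}, {"1 LW * 4 Issue", 396},
    {"2 LW * 4 Issue",  320}, {"1 LW * 8 Issue", 318},
};

/* Average cycles per longword for a given total. */
static double avg_cycles_per_lw(unsigned total_cycles) {
    return (double)total_cycles / LONGWORDS;
}

/* Speedup relative to the first row (1 LW * 1 Issue). */
static double speedup_vs_baseline(unsigned total_cycles) {
    return (double)results[0].total_cycles / total_cycles;
}
```

For the baseline, 1211 / 8 = 151.375, which the slide rounds to 151.38. The fastest scheme, 1 LW * 8 Issue at 318 cycles, is roughly 3.8 times faster than the baseline.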

46 Reference (1)
Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, “Overcoming the memory wall in packet processing: hammers or ladders?”, Proc. ANCS 2005.
Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, “High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor”, ACM TECS 2008.

47 Reference (2)
Z. Liu, K. Zheng, B. Liu, “Hybrid cache architecture for high-speed packet processing”, IET-COM 2007.

