Published by Liana Hardja. Modified over 6 years ago.
1
Flow Stats Module
James Moscola
September 6, 2007
2
SPP V1 LC Egress with 1x10Gb/s Tx
[Figure: block diagram of the line card egress path. Blocks shown: Switch, RBUF, MSF, Rx1, Rx2, Key Extract, Lookup, TCAM, Hdr Format, Port Splitter, QM1, QM2, Flow Stats1, Flow Stats2, 1x10G Tx1, 1x10G Tx2, TBUF, RTM, XScale, Stats (1 ME), SRAM1, SRAM2, SRAM3. Links between blocks are labeled NN or SCR (scratch ring), including a NAT Miss scratch ring and a NAT Pkt return ring.]
3
Overview of Flow Stats
Main functions
  - Uniquely identify flows based on 5-tuple
    - Hash header values to get an index into a table of records
  - Maintain packet and byte counts for each flow
    - Compare packet header with header values in record, and increment the counts if they match
    - Otherwise, follow hash chain until correct record is found
  - Send flow information to XScale for archiving
Secondary functions
  - Maintain hash table
    - Identify and remove flows that are no longer active
    - Invalid flows are removed so memory can be reused
4
Design Considerations
- Efficiently maintaining a hash table with chained collisions
  - Efficiently inserting and deleting records
  - Efficiently reading hash table records
- Synchronization issues
  - Multiple threads modifying hash table and chains
5
Flow Record
Total Record Size = 11 32-bit words

  LW0    Source Address (32b) *
  LW1    Destination Address (32b) *
  LW2    SrcPort (16b) * | DestPort (16b) *
  LW3    Reserved (12b) | Slice ID (12b) * | Protocol (8b) *
  LW4    V (1b) | Reserved (14b) | Next Record Number (17b)
  LW5    Packet Counter (32b)
  LW6    Byte Counter (32b)
  LW7    Start Timestamp_high (32b)
  LW8    Start Timestamp_low (32b)
  LW9    End Timestamp_high (32b)
  LW10   End Timestamp_low (32b)
  * = Member of 6-tuple

- V is the valid bit
  - Only needed at head of chain
  - '1' for valid record
  - '0' for invalid record
- Start timestamp (64 bits) is set when record starts counting flow
  - Reset to zero when record is archived
- End timestamp (64 bits) is set each time a packet is seen for the given flow
- Packet and Byte counters are incremented for each packet on the given flow
- Next Record Number is the next record in the hash chain
  - 0x1FFFF if record is tail
  - Address = (next_record_num * record_size) + base_addr
- Can END timestamp be eliminated?
  - Use archive time as the end timestamp. If records are archived every 5 minutes, the actual end of the flow will have occurred within 5 minutes of the archival time.
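The bit fields of LW3 and LW4 and the address formula can be sketched in a few lines of Python. This is only an illustration: it assumes fields are packed most-significant-first in the order the slide lists them (so V lands in bit 31), and the function names are hypothetical, not taken from the microengine code.

```python
RECORD_SIZE_BYTES = 11 * 4   # 11 32-bit words per record (from the slide)
TAIL_MARKER = 0x1FFFF        # Next Record Number value that marks the tail


def pack_lw3(slice_id, protocol):
    """LW3 = Reserved (12b) | Slice ID (12b) | Protocol (8b). Bit order is assumed."""
    assert 0 <= slice_id < (1 << 12) and 0 <= protocol < (1 << 8)
    return (slice_id << 8) | protocol


def pack_lw4(valid, next_record_num):
    """LW4 = V (1b) | Reserved (14b) | Next Record Number (17b). Bit order is assumed."""
    assert 0 <= next_record_num <= TAIL_MARKER
    return (int(valid) << 31) | next_record_num


def unpack_lw4(lw4):
    """Return (valid, next_record_num) from a packed LW4 word."""
    return bool(lw4 >> 31), lw4 & TAIL_MARKER


def record_address(record_num, base_addr):
    """Address = (next_record_num * record_size) + base_addr, as on the slide."""
    return record_num * RECORD_SIZE_BYTES + base_addr
```

A 17-bit record number can address up to 131,071 records plus the tail marker, which comfortably covers the ~95K records discussed on the next slide.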
6
Hash Table Memory
- Allocating 4 MBytes in SRAM Channel 3 for hash table
  - Supports ~95K records
- Memory is divided 75% for the main table and 25% for the collision table
- Memory required = Main_table_size + Collision_table_size + Bit_vector_size
    = .75*(#records * #bytes/record) + .25*(#records * #bytes/record) + .75*(#records/8)
    = ~71K records + ~24K records (plus bit vector)
    = ~3 MBytes + ~1 MByte + ~8 KBytes
- Space for main table and collision table can be adjusted to tune performance
  - Larger main table means fewer collisions, but still need adequate space for collision table
[Figure: memory layout showing Main Table (75%), Collision Table (25%), and Bit Vector]
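The sizing arithmetic above can be checked with a short calculation. The sketch below assumes "4 MBytes" means 4 * 1024 * 1024 bytes and uses the 11-word (44-byte) record from the Flow Record slide; the variable and function names are illustrative only.

```python
TABLE_BYTES = 4 * 1024 * 1024   # assumed meaning of "4 MBytes"
RECORD_BYTES = 11 * 4           # 11 32-bit words per record
MAIN_FRACTION = 0.75            # share of records placed in the main table


def max_records(table_bytes=TABLE_BYTES, record_bytes=RECORD_BYTES,
                main_fraction=MAIN_FRACTION):
    """Solve N*record_bytes + main_fraction*N/8 <= table_bytes for N."""
    bytes_per_record = record_bytes + main_fraction / 8  # record + its share of the bit vector
    return int(table_bytes / bytes_per_record)


total = max_records()
main = int(total * MAIN_FRACTION)        # records in the main table
collision = total - main                 # records in the collision table
bit_vector_bytes = (main + 7) // 8       # one bit per main-table entry

print(f"total={total} main={main} collision={collision}")
print(f"main={main * RECORD_BYTES} B, collision={collision * RECORD_BYTES} B, "
      f"bit vector={bit_vector_bytes} B")
```

Under these assumptions the totals land at roughly 95K, 71K, and 24K records, consistent with the figures on the slide.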
7
Inserting Into Hash Table
- IXP has 3 different hash functions (48-bit, 64-bit, 128-bit), which take 7, 8, and 16 clock cycles
  - Using the 64-bit hash function provides the required functionality and takes half the time of the 128-bit hash
  - Source Addr and Protocol are not included in the hash key: HASH(D.Addr, S.Port, D.Port)
- Result of hash is used to address the main hash table
- Records in the main table represent the head of a chain
  - If slot at head of chain is empty (valid_bit=0), store record there
  - If slot at head of chain is occupied, compare 6-tuple
    - If 6-tuple matches
      - If packet_count == 0 (existing flows have a packet_count of 0 when previous packets on the flow have just been archived)
        - Increment packet_counter for record
        - Add size of current packet to byte_counter
        - Set start and end time stamps
      - If packet_count > 0
        - Update the counters and set end time stamp
    - If the 6-tuple doesn't match, a collision has occurred and the record needs to be stored in the collision table
[Figure: Main Table, Collision Table, Bit Vector]
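The insert path can be modeled in software as follows. This is a minimal sketch, not the microengine implementation: `toy_hash` is a stand-in for the IXP hardware hash, the table sizes are tiny on purpose, records are Python dicts rather than 11-word SRAM entries, and a `deque` plays the role of the SRAM free-list ring. All names are hypothetical.

```python
from collections import deque

TAIL = 0x1FFFF        # Next Record Number value marking the tail of a chain
MAIN_SIZE = 8         # illustrative sizes only
COLLISION_SIZE = 4


def new_slot():
    return {"valid": False, "key": None, "pkts": 0, "bytes": 0,
            "start": 0, "end": 0, "next": TAIL}


# Main table occupies indices [0, MAIN_SIZE); collision table follows it.
table = [new_slot() for _ in range(MAIN_SIZE + COLLISION_SIZE)]
free_slots = deque(range(MAIN_SIZE, MAIN_SIZE + COLLISION_SIZE))


def toy_hash(dst_addr, src_port, dst_port):
    """Stand-in for the 64-bit hardware hash over D.Addr, S.Port, D.Port."""
    return (dst_addr ^ (src_port << 16) ^ dst_port) % MAIN_SIZE


def fill(slot, key, size, now):
    slot.update(valid=True, key=key, pkts=1, bytes=size,
                start=now, end=now, next=TAIL)


def update(key, size, now):
    """key = (src_addr, dst_addr, src_port, dst_port, slice_id, protocol).
    Returns the index of the record that was created or updated."""
    idx = toy_hash(key[1], key[2], key[3])
    if not table[idx]["valid"]:
        fill(table[idx], key, size, now)      # empty head: store record there
        return idx
    while True:
        rec = table[idx]
        if rec["key"] == key:                 # 6-tuple matches
            if rec["pkts"] == 0:              # flow was just archived
                rec["start"] = now
            rec["pkts"] += 1
            rec["bytes"] += size
            rec["end"] = now
            return idx
        if rec["next"] == TAIL:
            break
        idx = rec["next"]                     # follow the hash chain
    new_idx = free_slots.popleft()            # IndexError if collision table is full
    fill(table[new_idx], key, size, now)
    rec["next"] = new_idx                     # append new record at the tail
    return new_idx
```

Two flows that differ only in source address hash to the same head in this model, so the second one is appended to the chain in the collision table.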
8
Hash Collisions
- Hash collisions are chained in a linked list
  - Head of list is in the main table
  - Remainder of list is in the collision table
- SRAM ring maintains list of free slots in collision table
  - When a collision occurs, a pointer to an open slot in the collision table can be retrieved from the SRAM ring
  - When a record is removed from the collision table, a pointer to the invalidated slot is returned to the SRAM ring
[Figure: Main Table, Collision Table, Bit Vector, and the SRAM ring free list]
9
Hash Table Bit Vector
- Hash table can be a sparse data structure
  - Reading the entire table looking for valid records would be very time consuming
- Append bit vector to hash table
  - Each bit represents the head of a chain in the main table
  - Set bits indicate a valid chain starts at that location
  - One bit for each entry in main hash table
- When archiving records, read the bit vector, compute the start addresses of valid chains, and read only valid records from the hash table
[Figure: bit vector drawn as 32-bit words (bit positions 31 down to 0) at word offsets 0 and 1. The example shows that chains at 0, 7, 30, and 32 are valid.]
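The bit-vector scan can be illustrated with the example from the slide (valid chains at 0, 7, 30, and 32). The sketch assumes bit i of the word at offset w corresponds to main-table entry 32*w + i; the function names, and the default base address and record size taken from elsewhere in the deck, are used here for illustration only.

```python
def valid_chain_heads(bit_vector_words):
    """Yield main-table indices whose bit is set.

    Assumes bit i of word w maps to main-table entry 32*w + i.
    """
    for offset, word in enumerate(bit_vector_words):
        while word:
            low = word & -word                    # isolate lowest set bit
            yield 32 * offset + low.bit_length() - 1
            word ^= low                           # clear it and continue


def chain_head_address(index, base_addr=0x100000, record_size=11 * 4):
    """Start address of the chain head stored at main-table entry `index`."""
    return base_addr + index * record_size


# Bit vector matching the slide's example: bits 0, 7, 30 in word 0, bit 0 in word 1.
example = [(1 << 0) | (1 << 7) | (1 << 30), 1 << 0]
heads = list(valid_chain_heads(example))
print(heads)
```

Only set bits cost an SRAM record read, so the archiving thread touches a sparse table in time proportional to the number of valid chains plus the size of the bit vector.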
10
Archiving Hash Table Records
- Send all valid records in hash table to XScale for archiving every 5 minutes
- For each record in the chain
  - If packet count > 0
    - Record is valid
    - Send record to XScale via Scratch ring (maybe change to SRAM ring?)
    - Set packet count to 0
    - Set byte count to 0
    - Leave record in table
  - If packet count == 0
    - Flow has already been archived
    - No packet has arrived on flow in 5 minutes
    - Record is no longer valid
    - Delete record from hash table to free memory

Info sent to XScale for each flow every 5 minutes:
  LW0   Source Address (32b)
  LW1   Destination Address (32b)
  LW2   SrcPort (16b) | DestPort (16b)
  LW3   Reserved (12b) | Slice ID (12b) | Protocol (8b)
  LW4   Packet Counter (32b)
  LW5   Byte Counter (32b)
  LW6   Start Timestamp_high (32b)
  LW7   Start Timestamp_low (32b)
  LW8   End Timestamp_high (32b)
  LW9   End Timestamp_low (32b)
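The archive-then-expire rule can be sketched as a single pass over a chain. This is a simplified model: records are dicts in a Python list rather than linked SRAM entries, `archive_pass` is a hypothetical name, and "sending to the XScale" is represented by appending a snapshot to a list.

```python
def archive_pass(records):
    """One archiving pass over the records of a chain.

    records: list of dicts with 'key', 'pkts', 'bytes', 'start', 'end'.
    Returns (exported, kept): snapshots that would be sent to the XScale,
    and the records that remain in the table afterwards.
    """
    exported, kept = [], []
    for rec in records:
        if rec["pkts"] > 0:
            exported.append(dict(rec))   # snapshot of counts and timestamps
            rec["pkts"] = 0              # counters restart for the next interval
            rec["bytes"] = 0
            rec["start"] = 0             # start timestamp is reset on archive
            kept.append(rec)
        # pkts == 0: already archived and idle for a full interval, so drop it
    return exported, kept
```

A flow that goes quiet is therefore exported once more on the pass after its last packet and removed on the pass after that, so an idle record lingers for at most two archive intervals.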
11
Deleting Records from Hash Table
- While archiving records
  - If packet count is zero, remove record from hash table
    - Record has already been archived, and no packets have arrived in the last five minutes
- To remove a record
  - If record == head
    - If record != tail
      - Replace record with record.next
      - Free slot for the moved record
    - Else if record == tail
      - Valid_bit = 0
  - Else if record != head
    - Set previous record's next pointer to record.next
    - Free slot for the deleted record
[Figure: Main Table, Collision Table, Bit Vector, and the SRAM ring free list]
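The three removal cases can be written out as a small function over an index-linked table. This is a sketch under simplifying assumptions: records are dicts with `valid` and `next` fields, the free list is a `deque` standing in for the SRAM ring, and `remove_record` is a hypothetical name.

```python
from collections import deque

TAIL = 0x1FFFF  # Next Record Number value marking the tail of a chain


def remove_record(table, free_slots, head_idx, victim_idx):
    """Remove table[victim_idx] from the chain that starts at table[head_idx]."""
    if victim_idx == head_idx:
        nxt = table[head_idx]["next"]
        if nxt == TAIL:
            # Head is also the tail: the main-table slot stays, just clear V.
            table[head_idx]["valid"] = False
        else:
            # Copy record.next into the head slot, then free the vacated slot.
            table[head_idx] = dict(table[nxt])
            free_slots.append(nxt)
        return
    # Victim is in the collision table: find its predecessor and unlink it.
    prev = head_idx
    while table[prev]["next"] != victim_idx:
        prev = table[prev]["next"]
        if prev == TAIL:
            raise ValueError("record is not on this chain")
    table[prev]["next"] = table[victim_idx]["next"]
    free_slots.append(victim_idx)
```

Copying the successor into the head slot keeps the head of every chain in the main table, which is what lets the hash result address it directly.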
12
Memory Synchronization Issues
- Multiple threads reading/writing same block of memory
  - Only allow 1 ME to modify structure of hash table
    - Inserting and deleting nodes
  - Use global registers to indicate that the structure of the hash table is being modified
    - One global bitmask register to indicate which threads are modifying the structure
    - Eight global lock registers (1 per thread) to indicate which chain in the hash table is being modified
  - When a thread wants to insert or delete a record from the hash table
    - Store pointer to head of chain in a global lock register
    - Set bitmask to signal other threads to check the global lock registers
    - If another thread is processing a packet that hashed to the same hash chain, it waits for the lock to clear and restarts processing the packet
      - Packet can also just be dropped, if counts don't need to be perfect
      - However, dropping packets would likely be necessary when bursts are occurring on a single flow ... exactly the behavior that flow stats should count and log
    - Otherwise, the thread can continue processing the packet normally
    - Clear pointer from shared register when done with insert/delete
  - Use atomic increments where possible
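The bitmask-plus-lock-register scheme can be modeled as below. This is a single-threaded illustration of the bookkeeping only: it says nothing about atomicity or scheduling on the microengine, and the class and method names are hypothetical.

```python
NUM_THREADS = 8


class ChainLocks:
    """Model of one global bitmask register and eight per-thread lock registers."""

    def __init__(self):
        self.bitmask = 0                        # bit t set: thread t holds a lock
        self.lock_regs = [None] * NUM_THREADS   # chain head locked by each thread

    def acquire(self, thread_id, chain_head):
        """Record the chain being modified, then flag it in the bitmask."""
        self.lock_regs[thread_id] = chain_head
        self.bitmask |= 1 << thread_id

    def release(self, thread_id):
        """Clear the flag and the pointer once the insert/delete is done."""
        self.bitmask &= ~(1 << thread_id)
        self.lock_regs[thread_id] = None

    def must_restart(self, thread_id, chain_head):
        """True if another thread is modifying the chain this packet hashed to."""
        if self.bitmask == 0:
            return False                        # fast path: no locks set
        return any(((self.bitmask >> t) & 1) and self.lock_regs[t] == chain_head
                   for t in range(NUM_THREADS) if t != thread_id)
```

Checking the bitmask first keeps the common case to a single register test; the per-thread registers are only compared when some thread is actually modifying the table.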
13
Flow Stats Execution
ME1
  - Init: Configure hash function
  - 8 threads
    - Read packet header
    - Hash packet header
    - Send header and hash result to ME2 for processing
ME2 (thread numbers may need adjusting)
  - Init: Load SRAM ring with addresses for each slot in the collision table
  - Init: Set TIMESTAMP to 0
  - 7 threads
    - Insert records into hash table
    - Increment counter for records
  - 1 thread
    - Archive and delete hash table records
14
Diagram of Flow Stats Execution (ME1)
  get buffer handle from QM        60 cycles
  read buffer descriptor (SRAM)    150 cycles
  read packet header (DRAM)        300 cycles
  build hash key                   ~5 cycles
  compute hash                     ~8 cycles
  send packet info to ME2          60 cycles
  send buffer handle to TX         60 cycles
  Total                            ~643 cycles
15
Diagram of Flow Stats Execution (ME2)
Incrementing Counters
- Adds records to hash chain, but doesn't remove them
- Best: ~360 cycles
- Worst: ~ x
- Now locking all insert/increment operations. Since 2 counters + timestamps need to be written, it makes more sense to write them in a single burst write as opposed to 1 atomic increment (packet count), 1 atomic add (byte count), and 1 standard write to update the timestamps.
[Figure: flowchart of the ME2 counter path, with a legend distinguishing "iterating through hash chain" from "locking head of chain". Steps: get packet info from ME1 (60 cycles); read hash table record from SRAM (150 cycles); valid?; compare record to header (~10 cycles); match?; tail?; read next record in chain; count == 0?; set register to lock chain; get record slot from freelist; insert new record; write START/END time and new counts; write END time and new counts; clear lock register. SRAM accesses are annotated at 150 cycles each.]
16
Diagram of Flow Stats Execution (ME2)
Archiving Records
- Removes records from hash chain, but doesn't add them
- Archiving runs every five minutes
[Figure: flowchart of the ME2 archiving path, with a legend distinguishing "waiting to archive" from "locking head of chain". Steps: read current time; 5 minutes?; read 8 LWs of bit_vector; compute address of next record; read record from SRAM; count == 0?; send record to XScale; reset packet and byte counters; head of list?; tail of list?; set register to lock chain; read record.next; write next_ptr to previous list item; replace record with record.next; set valid bit to zero; return record slot to freelist; return record.next slot to freelist; clear lock register; done with 8 LWs?; done with all records? SRAM accesses are annotated at 150 cycles each.]
17
Return from Swap
- When returning from each CTX switch, always check the global lock registers
- If any locks are set, determine whether another thread is modifying the same hash chain that the current packet hashed to
  - If the locks are on different chains, just continue processing the packet
  - If any lock is on the same chain, restart processing the current packet
[Figure: flowchart. Check global lock registers; set? If no, continue processing packet. If yes, compare lock values to current chain; equal? If yes, restart processing packet; if no, continue processing packet.]
18
Flow Stats Interfaces
[Figure: QM1 and QM2 feed Flow Stats1 over scratch rings (SCR); Flow Stats1 connects to 1x10G Tx1 (SCR) and to Flow Stats2 (NN); Flow Stats2 uses SRAM3 and sends to the XScale (SCR).]

Buffer handle word (1 word):
  V (1b) | Rsv (3b) | Port (4b) | Buffer Handle (24b)
  V: Valid Bit

Packet info passed from Flow Stats1 to Flow Stats2:
  Source Address (32b)
  Destination Address (32b)
  SrcPort (16b) | DestPort (16b)
  Reserved (12b) | Slice ID (12b) | Protocol (8b)
  Reserved (9b) | Hash Result (17b)

Flow record sent to XScale:
  Source Address (32b)
  Destination Address (32b)
  SrcPort (16b) | DestPort (16b)
  Reserved (12b) | Slice ID (12b) | Protocol (8b)
  Packet Counter (32b)
  Byte Counter (32b)
  Start Timestamp_high (32b)
  Start Timestamp_low (32b)
  End Timestamp_high (32b)
  End Timestamp_low (32b)
19
Flow Statistics Module
Scratch rings
  QM_TO_FS_RING_1: 0x3800 – 0x3BFF // for receiving from QM
  QM_TO_FS_RING_2: 0x3C00 – 0x3FFF // for receiving from QM
  FS_TO_TX_RING_1: 0x x43FF // for sending to TX
  FS_TO_TX_RING_2: 0x x47FF // for sending to TX
  FS_TO_XSCALE_RING: 0x???? – 0x???? // for receiving invalidate info from XScale
SRAM ring
  FS_FREELIST_RING: 0x???? – 0x???? // stores list of open slots in collision table
LC Egress SRAM Channel 3 info for Flow Stats
  SM_RECORD_SIZE = 11 * 4 // 32-bit words/record * 4 bytes/word
  TOTAL_NUM_RECORDS = // MAX with 4 MB table is ~95K records (should be divisible by 32)
  SM_NUM_HASH_TABLE_RECORDS = // <= TOTAL_NUM_RECORDS
  SM_NUM_CHAINED_RECORDS = TOTAL_NUM_RECORDS - SM_NUM_HASH_TABLE_RECORDS
  BIT_VECTOR_SIZE_IN_BYTES = TOTAL_NUM_RECORDS / 8
  BIT_VECTOR_SIZE_IN_WORDS = TOTAL_NUM_RECORDS / 32
  SM_HASH_TABLE_BASE = 0x100000
  SM_HASH_COLLISION_TABLE_BASE = (SM_HASH_TABLE_BASE + (SM_RECORD_SIZE * SM_NUM_HASH_TABLE_RECORDS))
  SM_VALID_RECORD_BIT_VECTOR_BASE = (SM_HASH_COLLISION_TABLE_BASE + (SM_RECORD_SIZE * SM_NUM_CHAINED_RECORDS))
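The base-address definitions above chain together as follows. Because the slide leaves TOTAL_NUM_RECORDS and SM_NUM_HASH_TABLE_RECORDS unspecified, the sketch takes them as parameters, and the usage line plugs in small made-up values purely to show the arithmetic; `layout` is a hypothetical helper name.

```python
SM_RECORD_SIZE = 11 * 4          # 11 32-bit words per record * 4 bytes/word
SM_HASH_TABLE_BASE = 0x100000


def layout(total_num_records, num_hash_table_records):
    """Derive the table bases and bit-vector size from the two record counts."""
    assert total_num_records % 32 == 0, "should be divisible by 32"
    assert num_hash_table_records <= total_num_records
    num_chained = total_num_records - num_hash_table_records
    collision_base = SM_HASH_TABLE_BASE + SM_RECORD_SIZE * num_hash_table_records
    bit_vector_base = collision_base + SM_RECORD_SIZE * num_chained
    return {
        "collision_base": collision_base,
        "bit_vector_base": bit_vector_base,
        "bit_vector_bytes": total_num_records // 8,
        "bit_vector_words": total_num_records // 32,
    }


# Illustrative values only: 64 records total, 48 of them in the main table.
info = layout(64, 48)
print({name: hex(value) for name, value in info.items()})
```

Note that these definitions size the bit vector from TOTAL_NUM_RECORDS, which is slightly larger than the one-bit-per-main-table-entry figure used in the memory estimate earlier in the deck.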
20
End