
XBC - eXtended Block Cache
Lihu Rappoport, Stephan Jourdan, Yoav Almog, Mattan Erez, Adi Yoaz, Ronny Ronen
Intel Corporation


1 XBC - eXtended Block Cache (title slide)

2 The Frontend
Frontend goal: supply instructions to execution
- Predict which instructions to fetch
- Fetch the instructions from cache / memory
- Decode the instructions
- Deliver the decoded instructions to execution
[Figure: the processor, showing Memory, Frontend, and Execution blocks with Instructions and Data labels]

3 Requirements from the Frontend
- High bandwidth
- Low latency

4 The Traditional Solution: Instruction Cache
Basic unit: cache line
- A sequence of consecutive instructions in memory
Deficiencies:
- Low bandwidth
  - Jump into the line
  - Jump out of the line
- High latency
  - Instructions need decoding
[Figure: a cache line with a jmp into the line and a jmp out of the line]

5 Trace Cache
TC goals: high bandwidth & low latency
Basic unit: trace
- A sequence of dynamically executed instructions
Instructions are decoded into uops
- Fixed-length, RISC-like instructions
Traces have a single entry and multiple exits
- Trace tag/index is derived from the starting IP
Trace end condition
[Figure: a trace containing several jmp instructions]

6 Redundancy in the TC
Code: if (cond) A; B
Possible traces: (i) AB (ii) B
[Figure: blocks A and B; B appears in both traces]
Space inefficiency → low hit rate

7 XBC Goals
- High bandwidth
- Low latency
- High hit rate

8 XBC - eXtended Block Cache
Basic unit: XB - eXtended Block
XB features
- Multiple entry, single exit
- Tag / index derived from ending instruction IP
- Instructions are decoded
XB end conditions
- Conditional or indirect branches
- Call/Return
- Quota (16 uops)
[Figure: an XB containing a jmp and ending with a jcc]
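
The end conditions on slide 8 can be illustrated with a small sketch that cuts a dynamic instruction stream into XBs. The `Instr` type, the instruction kind names, and the choice to close an XB rather than split an instruction across the quota are assumptions made for illustration, not details from the slides.

```python
from dataclasses import dataclass

QUOTA = 16  # uop quota per XB, as stated on slide 8

# Hypothetical kind names; these kinds end an XB. A direct unconditional
# "jmp" is deliberately absent, so it does not end an XB.
ENDING_KINDS = {"jcc", "indirect", "call", "ret"}

@dataclass
class Instr:
    ip: int
    kind: str      # "alu", "jmp", "jcc", "indirect", "call" or "ret"
    uops: int = 1  # number of uops the instruction decodes into

def split_into_xbs(stream):
    """Split a dynamic instruction stream into XBs (lists of Instr)."""
    xbs, cur, count = [], [], 0
    for ins in stream:
        # Quota: close the current XB if this instruction would overflow it
        # (assumption: an instruction is never split across two XBs).
        if cur and count + ins.uops > QUOTA:
            xbs.append(cur)
            cur, count = [], 0
        cur.append(ins)
        count += ins.uops
        # Conditional/indirect branches, calls and returns end the XB.
        if ins.kind in ENDING_KINDS:
            xbs.append(cur)
            cur, count = [], 0
    if cur:
        xbs.append(cur)
    return xbs
```

In this sketch each XB would be tagged with the IP of its last instruction, `xb[-1].ip`, matching the "ending instruction IP" rule above.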

9 XBC Fetch Bandwidth
Fetch multiple XBs per cycle
- A conditional branch ends an XB
- Need to predict only 1 branch/XB
- Predicting 2 branches/cycle ⇒ fetch 2 XBs/cycle
Promote ≥99% biased conditional branches*
⇒ Build longer XBs
⇒ Maximize XBC bandwidth for a given #pred/cycle
[Figure: an XB containing several jcc instructions and a jmp, with the ≥99% biased jcc marked]
*[Patel 98]

10 XB Length
Block type   Description              Average length
BB           basic block              7.7
XB           don't break on uncond    8.0
XBp          XB + promotion           10.0
DBL          group 2 XBp              12.7
[Figure: length distribution for BB, XB, XBp, and DBL; x axis 1 to 16, y axis 0% to 50%]

11 XBC Structure
A banked structure which supports
- Variable length XBs (minimize fragmentation)
- Fetching multiple XBs/cycle
[Figure: Bank 0 to Bank 3, each 4 uops wide, feeding a Reorder & Align stage]

12 Support Variable Length XBs
An XB may spread over several banks on the same set
[Figure: banks 0 to 3 feeding Reorder & Align, with the lines of XBs 0 and 1 marked]

13 Support Fetching 2 XBs/cycle
Data may be received from all banks in the same cycle
[Figure: banks 0 to 3 feeding Reorder & Align, with the lines of XBs 0 and 1 marked]

14 Support Fetching 2 XBs/cycle
Actual bandwidth may sometimes be less than 4 banks per cycle
[Figure: banks 0 to 3 feeding Reorder & Align, with the lines of XBs 0 and 1 marked]
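
Slides 13 and 14 can be illustrated with bank masks. The sketch below assumes that each bank delivers at most one line per cycle and that the first XB has priority over the second; both are assumptions made for illustration, and `fetch_two_xbs` is a hypothetical helper, not a name from the slides.

```python
NUM_BANKS = 4

def fetch_two_xbs(mask0, mask1):
    """Return the bank masks actually delivered for two XBs in one cycle.

    mask0 and mask1 are bit vectors (bit b set = the XB has a line in bank b).
    Assumption: a bank delivers one line per cycle, and XB 0 wins conflicts.
    """
    delivered1 = mask1 & ~mask0  # XB 1 loses any bank that XB 0 also needs
    return mask0, delivered1

def banks_delivered(mask0, mask1):
    """Number of banks that deliver useful data this cycle."""
    d0, d1 = fetch_two_xbs(mask0, mask1)
    return bin(d0 | d1).count("1")
```

With `mask0 = 0b0011` and `mask1 = 0b1100` all four banks deliver data. With `mask0 = 0b0011` and `mask1 = 0b0110` the XBs collide on bank 1, so only three banks deliver and the rest of XB 1 has to wait.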

15 Reordering and Aligning Uops
[Figure: banks 0 to 3 feed Mux 1 (reorder banks), which outputs banks i0 to i3; Mux 2 (align uops) then removes the empty uops]
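
The two-stage path on slide 15 can be sketched in software. The data layout here (a list of uop slots per bank, with `None` for an empty slot) and the function name are assumptions for illustration only.

```python
def reorder_and_align(bank_lines, bank_order):
    """Model the two muxes on slide 15.

    bank_lines: one list of uop slots per bank; None marks an empty slot.
    bank_order: bank indices in the order their uops should be delivered.
    """
    # Mux 1: reorder the banks into delivery order.
    reordered = [bank_lines[b] for b in bank_order]
    # Mux 2: align the uops by squeezing out the empty slots.
    return [uop for line in reordered for uop in line if uop is not None]
```

For example, if bank 1 holds `a b c d` and bank 0 holds `e f` plus two empty slots, `bank_order = [1, 0]` yields the six uops `a` to `f` packed together.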

16 XBC Structure
The average XB length is >8 uops
⇒ 16 uops/line is less than 2-XB set associative
[Figure: banks 0 to 3 feeding Reorder & Align; a 16-uop line spans the four banks]

17 XBC Structure
The average XB length is >8 uops
⇒ make each bank set-associative
[Figure: banks 0 to 3 feeding Reorder & Align, each bank holding several ways]

18 The XBTB
The XBTB provides the next XB for each XB
- XBs are indexed according to ending IP
  ⇒ Cannot directly look up the next IP in the XBC
  ⇒ XBC can only be accessed using the XBTB
XBTB provides the info needed to access the next XB
- The IP of the next XB
  - Defines the set in which the XB resides
- A masking vector, indicating the banks in which the XB resides
- The #uops counted backward from the end of the XB
  - Defines where to enter the XB
XBTB provides the next 2 XBs
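
The three fields listed on slide 18 can be written down as a record. The field names, the number of sets, and the modulo index function below are assumptions for illustration; the slides only say that the IP defines the set.

```python
from dataclasses import dataclass

NUM_SETS = 256   # hypothetical XBC geometry
NUM_BANKS = 4

@dataclass(frozen=True)
class XBTBEntry:
    next_ip: int        # ending IP of the next XB (selects the set)
    bank_mask: int      # bit b set = the next XB has a line in bank b
    uops_from_end: int  # entry point, counted backward from the XB end

    def set_index(self):
        # Hypothetical index function: low bits of the ending IP.
        return self.next_ip % NUM_SETS

    def banks(self):
        # Bank numbers selected by the masking vector.
        return [b for b in range(NUM_BANKS) if (self.bank_mask >> b) & 1]
```

Because the slide says the XBTB provides the next 2 XBs, a lookup in this sketch would return two such entries per XB.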

19 XBC Structure: the whole picture
[Figure: block diagram with Memory / Cache, BTB, Decoder, Fill Unit, XBTB, XBQ, Priority Encode, and XBC; paths are marked build mode and delivery mode]

20 XB Build Algorithm
XBTB lookup fails → build a new XB into the fill buffer
End-of-XB condition reached → look up the XBC for the new XB
- No match → store new XB in the XBC, and update XBTB
- Match → there are three cases:
  - XB_new ⊆ XB_exist: update XBTB
  - XB_new ⊇ XB_exist: extend XB_exist, update XBTB
  - XB_new ∩ XB_exist ≠ ∅: complex XB, update XBTB
[Figure: for each case, XB_new and XB_exist drawn ending at the same IP1]
The XBC has NO redundancy
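
The three match cases on slide 20 can be told apart by comparing the two XBs backward from their shared ending instruction. This is a minimal sketch that models an XB as a list of uop identifiers; the function names and return labels are hypothetical.

```python
def common_suffix_len(a, b):
    """Length of the longest common suffix of two uop sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def classify(xb_new, xb_exist):
    """Classify a newly built XB against an existing XB."""
    n = common_suffix_len(xb_new, xb_exist)
    if n == 0:
        return "no_match"   # store the new XB, update the XBTB
    if n == len(xb_new):
        return "contained"  # XB_new is inside XB_exist: update the XBTB only
    if n == len(xb_exist):
        return "extend"     # XB_exist is a suffix of XB_new: extend it
    return "complex"        # same suffix, different prefixes
```

For instance, `classify([3, 4, 5], [1, 2, 3, 4, 5])` returns `"contained"`, and `classify([9, 4, 5], [1, 4, 5])` returns `"complex"`.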

21 Complex XBs (wrong way)
XB_new and XB_exist have the same suffix but different prefixes:
- Possible solution, complying with no-redundancy: keep XB_exist and store Prefix_new separately
- Drawback: we get 2 short XBs instead of a single long XB
[Figure: XB_exist and XB_new ending at IP1; after the split, XB_exist and Prefix_new]

22 Complex XBs (right way)
XB_new and XB_exist have the same suffix but different prefixes:
- Second solution: a single "complex XB"
[Figure: XB_exist and XB_new ending at IP1; the complex XB holds Prefix_cur, Prefix_new, and the shared Suffix]
Complex XBs: no redundancy, but still high bandwidth

23 Extending an Existing XB
An XB can only be extended at its beginning
If we store an XB in the usual way, when the XB is extended, we need to move all its uops
Since the existing uops move, the pointers in the XBTB become stale
[Figure: uops 1 to 9 laid out across banks 0 to 3, before and after a uop 0 is added at the beginning]

24 Storing Uops in Reverse Order
The solution is to store the uops of an XB in reversed order
XB IP is the IP of the ending instruction
⇒ extending the XB does not change the XB IP
⇒ when an XB is extended, no need to move uops
[Figure: uops 9 down to 1 laid out across banks 0 to 3, with uop 0 appended after them]
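
The effect of reverse-order storage can be shown with a small model. The class below is a sketch under the assumption that an XB occupies a flat list of slots; the class and method names are hypothetical.

```python
class ReversedXB:
    """An XB whose uops are stored last-uop-first."""

    def __init__(self, uops):
        # slots[0] holds the ending uop, slots[1] the one before it, etc.
        self.slots = list(reversed(uops))

    def extend_at_beginning(self, prefix):
        # New uops go after the existing slots; nothing already stored moves.
        self.slots.extend(reversed(prefix))

    def fetch(self, uops_from_end):
        # Enter the XB 'uops_from_end' uops before its end, in program order.
        return list(reversed(self.slots[:uops_from_end]))
```

An entry point stored as "2 uops from the end" returns the same two uops before and after `extend_at_beginning` is called, which is why pointers of this form do not go stale when the XB grows.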

25 Set Search
XB is replaced and then placed again
- Not on same set → different XB
- Same set, same banks → no problem
- Same set but not on the same banks → XBTB entries which point to the old location of the XB are erroneous
Solution: Set Search
- On an XBTB hit & XBC miss, try to locate the XB in other banks in the same set
- Calculate new mask according to offset
- Only a small penalty: cycle loss, but no switch to build
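
A simplified sketch of set search follows. It assumes one line per bank in the set and rebuilds the mask by tag match rather than from an offset as the slide describes, so it only illustrates the idea of repairing a stale mask; `set_search` and its arguments are hypothetical.

```python
def set_search(set_tags, xb_ip):
    """Look for an XB in every bank of one set.

    set_tags: one entry per bank, holding the ending IP of the XB whose
              line is stored there, or None if the bank's line is empty.
    Returns the repaired bank mask, or None if the XB is not in the set.
    """
    mask = 0
    for bank, tag in enumerate(set_tags):
        if tag == xb_ip:
            mask |= 1 << bank
    return mask or None
```

If the search finds the XB, the stale XBTB mask can be replaced with the returned one and delivery continues; only a `None` result would force a switch to build mode.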

26 XB Replacement
Use LRU among all the lines in a given set
LRU also makes sure that we never evict a line other than the first line of an XB (a head line)
- There is no point in retaining the head line while evicting another line
  - If we enter the XB in the head line, we will get a miss when we reach the evicted line
  - If a head line is evicted, but we enter the XB in its middle, we may still avoid a miss
A non-head line is always accessed after a head line is accessed → its LRU will be higher → it will not be evicted before the head line
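
The argument on slide 26 can be checked with a toy LRU model. The sketch assumes the lines of an XB are touched in order from the entry line to the last line; the class and line names are hypothetical.

```python
from collections import OrderedDict

class SetLRU:
    """LRU bookkeeping for the lines of one set (oldest first)."""

    def __init__(self):
        self.order = OrderedDict()

    def touch(self, line):
        # Move the line to the most-recently-used position.
        self.order.pop(line, None)
        self.order[line] = True

    def access_xb(self, lines):
        # Lines are touched from the entry line onward.
        for line in lines:
            self.touch(line)

    def victim(self):
        # The least recently used line is evicted first.
        return next(iter(self.order))
```

After `access_xb(["A.head", "A.1"])` the head line is older than `A.1`, so `victim()` picks `A.head` first. Entering the XB in its middle touches only `A.1`, which keeps the head line the older of the two.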

27 XB Placement
Build-mode placement algorithm
- New XB is placed in banks such that it does not have a bank conflict with the previous XB (if possible)
- LRU ordering is maintained by switching the LRU line with the non-conflicting line before the new XB is placed
- Set search repairs the XBTB
Delivery-mode placement algorithm
- Repeating bandwidth losses due to bank conflicts found → conflicting lines are moved to non-conflicting banks
- Each XB is augmented with a counter
  - Incremented when the XB has a bank conflict
  - When the counter reaches a threshold, the conflicting lines are switched with other lines in non-conflicting banks
- A line can be switched with another line only if its LRU is higher, or if both gain from the switch
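
The delivery-mode counter on slide 27 can be sketched as follows. The threshold value, the reset after a switch, and the reading of "its LRU is higher" as the moved line's LRU exceeding the other line's are all assumptions made for illustration.

```python
THRESHOLD = 4  # hypothetical value; the slides do not give one

class ConflictCounter:
    """Per-XB counter of bank conflicts seen in delivery mode."""

    def __init__(self):
        self.count = 0

    def record_fetch(self, had_conflict):
        """Return True when the conflicting lines should be relocated."""
        if had_conflict:
            self.count += 1
        if self.count >= THRESHOLD:
            self.count = 0  # assumption: start over after a switch
            return True
        return False

def may_switch(line_lru, other_lru, both_gain):
    # Slide rule: switch only if the line's LRU is higher, or both gain.
    return line_lru > other_lru or both_gain
```

With this threshold, four conflicting fetches of the same XB trigger one relocation attempt, which `may_switch` can still veto for a given pair of lines.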

28 XBC vs. TC Delivery Bandwidth
[Figure: bar chart of average uops per cycle (y axis 0 to 7) for TC and XBC on Games, SpecINT, and SysmarkNT]

29 Miss Rate as a Function of Size
[Figure: uop miss rate (y axis 0% to 10%) for TC and XBC at sizes of 16K, 32K, and 64K uops; annotations: 29% and >50%]

30 Miss Rate as a Function of Associativity
[Figure: uop miss rate (y axis 0% to 9%) for XBC and TC at associativity 1, 2, and 4]

31 XBC Features Summary
Basic unit: XB
- Ends with a conditional branch
- Multiple entries, single exit
- Indexed according to ending IP
- Branch promotion ⇒ longer XBs
XBC uses a banked structure
- Supports fetching multiple XBs/cycle
- Supports variable length XBs
- Uops within XBs are stored in reverse order

32 Conclusions
Instruction cache has a high hit rate, but ...
- Low bandwidth, high latency
TC has high bandwidth and low latency, but ...
- Low hit rate
XBC combines the best of both worlds
- High bandwidth, low latency, and high hit rate
