AM chip schedule Alberto. Design activities (17/11/2010): adapt JTAG and boundary scan to MPW chip; design new CAM cells, buffer logic; new majority logic.


1 AM chip schedule Alberto

2 Design activities (17/11/2010)
–Adapt JTAG and boundary scan to MPW chip
–Design new CAM cells, buffer logic
–New majority logic
–Opcode interface to new majority
–New logic for kill?
–PADs layout

3 Schedule/milestones
End of March:
–a first version of the boundary scan is available
–preliminary layout of the majority logic cell available
–first version of the CAM-block available
Starting beginning of April: work on a first integration of the chip
End of April, after simulation and debugging:
–a final version of the boundary scan is available
–a complete layout of the majority logic cell
–final design for the CAM-block available
Starting beginning of May: work on the final integration of the chip
End of May: successful integration of all pieces
–during May, test and simulation of the integrated design
During June: based on simulation, decide what adjustments are needed
End of June: all simulations of the chip completed
–the final, integrated chip should be completely simulated and approved
Early July: submission

4 Majority interface
component majority
  port (
    CLK                : in  std_logic;
    INIT               : in  std_logic;
    REQ_LAY0           : in  std_logic;
    FORCE_READ         : in  std_logic;  -- force read regardless of n-miss and layer0
    DISABLE_READ       : in  std_logic;
    MISS0              : in  std_logic;
    MISS1              : in  std_logic;
    MISS2              : in  std_logic;
    -- signals to write the DISABLE_THIS SRAM cell
    WL                 : in  std_logic;
    DISABLE_THIS_SET   : in  std_logic;
    DISABLE_THIS_RESET : in  std_logic;
    LAYER_MATCH        : in  std_logic_vector (7 downto 0);
    READ_FLAG          : in  std_logic;
    PATTERN_MATCH      : out std_logic  -- last port: no trailing semicolon
  );
end component;

5 Majority interface [block diagram of the majority block] Signals: CLK, INIT, layer_match, Read_flag, Pattern_match. Global cfg signals, which change just after the clk rising edge: MISS0, MISS1, MISS2, Require_layer0, Force_read, Disable_read. Signals to write the disable_this pattern bit: WL (write line), Disable_this_SET, Disable_this_RESET. The diagram also shows the Fisher tree and layer_match: layer_match is needed to output the map of fired layers.


7 Where to put the Flip Flops? [Block diagram labels: bitline, clk_N, CAM cells (1 layer), match line, sense amplifier, layer-match SR latch, majority logic & flag logic, clk, pattern match. Timing diagram signals: clk, bitline, match enable, ML reset, ML, MLSA, pattern match; intervals: bit line propagation, ML match, majority propagation.]

8 Where to put the Flip Flops? [Block diagram labels: bitline, clk_N, CAM cells (1 layer), match line, layer-match SR latch (R, S, reset), current source, match enable. Timing diagram signals: bitline, match enable, reset, ML, latch; intervals: bit line propagation, ML match.]

9 Use 16 blocks of 512 patterns? If the die is 3.2 x 3.75 mm, the usable area excluding PADs is 3040 um x 3590 um. Can fit 3 blocks in height: 3*979 um = 2937 um. Can almost fit 6 blocks in width: 6*600 um = 3600 um. Depends on block size! Can fit 3*5 blocks. Can fit 3*6 if the block is narrower than 600 um.
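The fit estimate on this slide is plain integer division. A minimal Python sketch using only the numbers quoted above (usable area 3040 um x 3590 um, block 979 um high and 600 um wide); the function name is illustrative and not part of the design flow:

```python
def blocks_that_fit(usable_um, block_um):
    """Number of whole blocks of size block_um that fit in usable_um."""
    return int(usable_um // block_um)

rows = blocks_that_fit(3040, 979)              # blocks stacked along the 3.2 mm side
cols = blocks_that_fit(3590, 600)              # blocks side by side along the 3.75 mm side
cols_if_narrower = blocks_that_fit(3590, 598)  # hypothetical block just under 600 um wide

print(rows, cols, cols_if_narrower)  # 3 5 6
```

With 600 um blocks the grid is 3 x 5 = 15 blocks, one short of the 16 asked for in the slide title; a slightly narrower block gives 3 x 6 = 18.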


11 Block size April 7th, 201 Full custom block 64x4 (half of 64 patterns) –Width 225.4um * Height 122.4um (68 rows) Pattern block of 64 patterns + buffer –Buffer 1 row ? –8 layers width 450.8um no majority –Height with buffer 124.2um Height of 512 patterns = 993.6um Height of 3x512 patterns = 2980.8um 3.2mm – 160um PADS – 2980.8um = 59.2um or 32 rows Max Pattern block width –(Chip 3.75mm – 160um PADS)/6 = 3590um / 6 = 598um
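The height and width budget on this slide can be re-derived in a few lines. This is only a sketch: it assumes the row height is 122.4 um / 68 rows, that the buffer takes exactly 1 row (the slide itself marks this with a question mark), and that the PADs take 160 um in each direction, as quoted.

```python
ROW_UM = 122.4 / 68             # row height implied by the slide (about 1.8 um)
block_h = 122.4 + 1 * ROW_UM    # 64-pattern block plus an assumed 1-row buffer
bank_h = 8 * block_h            # 512 patterns = 8 blocks of 64 patterns
stack_h = 3 * bank_h            # three 512-pattern banks stacked in height

spare_um = 3200 - 160 - stack_h             # what is left of the 3.2 mm side
spare_rows = int(spare_um / ROW_UM + 1e-9)  # whole rows that still fit
max_block_w = (3750 - 160) / 6              # six blocks across the 3.75 mm side

print(round(block_h, 1), round(bank_h, 1), round(stack_h, 1))
print(round(spare_um, 1), spare_rows, round(max_block_w))
```

The spare height of about 59 um (32 rows) is the only vertical margin left for anything outside the pattern banks, which is why the block width limit of about 598 um matters.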

12 AMchip03 document http://agenda.infn.it/materialDisplay.py?contribId=0&materialId=0&confId=3021


15 What we have now: Standard Cell 180 nm
5000 patterns/chip for 6-layer patterns, 2500 patterns/chip for 12-layer patterns
“A VLSI Processor for Fast Track Finding Based on Content Addressable Memories”, IEEE Transactions on Nuclear Science, Volume 53, Issue 4, Part 2, Aug. 2006, pages 2428-2433
NEXT: NEW VERSION for both L1 & L2
65 nm technology provides a factor 8 → 20000 patterns/chip
Full custom cell provides at least a factor 2 → 40000 patterns/chip
8 layers instead of 12 provides a factor 1,5 → 60000 patterns/chip
1,2 x 1,2 cm^2 2D chip → 80000 patterns/chip
With a 2D chip we gain a factor 30!
1 AMboard: 128 chips → ~10 Mpatterns per board
1 crate: 16 AMboards → ~160 Mpatterns per crate
Current prototype under design: 65 nm TSMC, 12 mm^2 MPW run, 100 MHz running clock, 8000 patterns/chip, 8 layers each
Layer words of 12 bits + 3 ternary bits → variable resolution patterns
A. Annovi - ACES 2011 @ CERN

16 Pattern efficiency [plot: efficiency vs. # of patterns in AMchips (barrel only, 45 degrees in φ); 90% efficiency level marked; axis marks at 65M and 500M; one working point labelled “Want this”] Pattern size: r-φ 24 pixels, 20 SCT strips; z 36 pixels. Pattern size (half size): r-φ 12 pixels, 10 SCT strips; z 36 pixels. Plot labels: = 342k, = 40k.

17 Variable resolution AM [figure labels: finer patterns, coarser pattern, DC] We can use don’t care on the least significant bit when we want to match the pattern layer at coarser resolution, or use all the bits to match it at finer resolution. Patterns with 1 kid are stored at finer precision. Layers without “don’t care” (DC) can ignore the hits in the “wrong” side of the layer. With 2 “don’t care” bits per layer we gain an effective factor of 5 in patterns.
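The don't-care idea on this slide can be sketched as a masked comparison. The function below is a behavioral illustration only, not the CAM cell logic: the name `layer_matches`, the 12-bit example words and the choice to mask the lowest bits are assumptions made for the example.

```python
def layer_matches(stored, hit, dc_bits=0):
    """Compare a stored layer word with an incoming hit word.

    The lowest dc_bits bits are treated as "don't care": they are masked
    out of the comparison, so the layer matches at coarser resolution.
    """
    care_mask = ~((1 << dc_bits) - 1)
    return (stored & care_mask) == (hit & care_mask)

stored = 0b101101100110
hit_in_same_coarse_bin = 0b101101100111  # differs only in the least significant bit

print(layer_matches(stored, hit_in_same_coarse_bin, dc_bits=0))  # False
print(layer_matches(stored, hit_in_same_coarse_bin, dc_bits=1))  # True
```

With `dc_bits=0` the pattern is matched at the finer resolution; setting one don't-care bit makes the same stored word cover both halves of the coarser bin.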

18 Goal: x30 pattern density but lower power consumption. 32 patterns of 8 layers ~ 60 µm x 500 µm ~ 1 or 2 pixels

19 Tasks in Italy New PADs layout (Stabile, Milano) New CAM cells (Matteo, Frascati) –Complete NAND cell –Evaluate advantages of SRAM transistors –Other cells Clean up project scripts (Francesco, Pisa) Transition to 65nm and new tools Place and route (Francesco, Stabile)

20 JTAG work (Germany) JTAG logic to be looked at after Laura left. New features: –Add all MPW pins to boundary scan –Extend JPATT_DATA register to include as bit[0] a disable_pattern bit –Extend register for new busses (8 instead of 6)

21 Majority logic (Fermilab) Current draft from Jim is good and almost complete Features to be added: –Account for individual pattern disable –4 thresholds: disable, 0miss, 1miss, 2miss –Include option require_layer0 Layer0 should be the closest to final AND Design the basic majority logic cell
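The features listed on this slide can be summarised in a small behavioral model. This is a sketch under assumptions, not Jim's draft: the threshold is passed as a number of allowed misses (with `None` standing for the "disable" threshold), layer 0 is taken to be index 0, and the force-read and disable-read inputs of the interface are not modelled.

```python
def pattern_match(layer_match, max_miss, require_layer0=False, disabled=False):
    """Behavioral sketch of the majority decision for one 8-layer pattern.

    layer_match    : list of 8 booleans, one per layer (index 0 = layer 0)
    max_miss       : allowed number of missing layers (0, 1 or 2),
                     or None for the "disable" threshold
    require_layer0 : if True, layer 0 must be among the matched layers
    disabled       : per-pattern disable bit (e.g. for a defective pattern)
    """
    if disabled or max_miss is None:
        return False
    if require_layer0 and not layer_match[0]:
        return False
    misses = len(layer_match) - sum(layer_match)
    return misses <= max_miss

layers = [True] * 8
layers[3] = False                      # one layer did not fire
print(pattern_match(layers, 0))        # False: 0miss threshold
print(pattern_match(layers, 1))        # True: 1miss threshold
print(pattern_match(layers, 1, disabled=True))  # False: pattern disabled
```

Keeping the per-pattern disable and the require_layer0 option as early exits mirrors the note that layer 0 should sit closest to the final AND.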

22 OPCODE work (Germany) Translate OPCODE output for new majority logic. 5 lines out: –Disable_match, 0miss, 1miss, 2miss –require_layer0 Optional if time allows (coding and testing) –change input OPCODE protocol to single word

23 Kill tree (Fermilab) Optional if time allows (coding and testing) –Try a new scheme for kill tree Short description: –Current scheme encodes the highest priority pattern (encoder) –Then decodes it to set one kill FF –1024 (or N patt) kill lines are then distributed to the patterns Alternative: calculate kill along with priority encoder in a tree like fashion
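The alternative mentioned on this slide, computing the kill signal together with the priority decision in a tree, can be sketched recursively. This is an illustration only: it assumes that the lowest index has the highest priority and that the kill output is a one-hot vector, neither of which is stated on the slide.

```python
def kill_tree(flags):
    """Return (any_set, kill) for a list of match flags.

    kill is a one-hot list marking the highest-priority set flag.
    Priority order is an assumption of this sketch: lowest index wins.
    Each recursion level corresponds to one level of a binary tree.
    """
    if not flags:
        return False, []
    if len(flags) == 1:
        return flags[0], [flags[0]]
    mid = len(flags) // 2
    left_any, left_kill = kill_tree(flags[:mid])
    right_any, right_kill = kill_tree(flags[mid:])
    if left_any:
        # The left half wins: suppress any candidate from the right half.
        return True, left_kill + [False] * len(right_kill)
    return right_any, left_kill + right_kill

flags = [False, False, True, False, True, True, False, False]
any_set, kill = kill_tree(flags)
print(any_set, kill.index(True))  # True 2
```

Each tree node only needs the "any" bit of its left child to gate the right child, so no separate encode-then-decode step and no per-pattern kill lines from a central decoder are needed in this scheme.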

24 Draft schedule Aim for March submission (tight) –All dates below are my guess for discussion today NAND cell November NOR cell end of December NOR don’t care cell mid January (?) Match line amplifier beginning of February –This item is late in the schedule New PADs layout end of November –Important to check that all pads fit in a single MPW block Project clean up and first place and route end of December Preliminary version of majority mid December –Final end of January New JTAG logic end of December: needed for 1st place and route New OPCODE and kill TREE end of January (mid Feb at latest)

25 Draft schedule during February Put everything together Final place and route Prepare a detailed model of each CAM cell for mixed simulation of a few patterns. –Doing this in February is late, we should start it earlier, but currently uncovered From now till the end: comparison of the implemented model with the C++ model for debugging. –Ilaria Sacco (Pisa) –This item is under staffed, we will need help here Are we missing any important item? First goal get Jim and Hans up and running –Please ask for needed information to start up

26 Outline
–Pattern matching and the Associative Memory (AM)
–Why the denser the AM, the better
–Associative memory architecture
–How chips are put together: LAMB → AMboard → crate
–The Tree Search Processor & its location

27 The Event... The Pattern Bank TRACKING WITH PATTERN MATCHING

28 The Associative Memory – AM = Bingo [figure: Bingo scorecard]
Dedicated device, maximum parallelism: each pattern with private comparator. Track search during detector readout.
Full custom 700 nm: 0,128 kpat/chip (6L)
FPGA 350 nm: 0,128 kpat/chip (6L)
Standard cell 180 nm: 5,0 kpat/chip (6L)
New for FTK 90 nm: ~60 kpat/chip (8L)
New for FTK 65 nm: ~120 kpat/chip (8L)
2 tiers 65 nm 2,5D: 240 kpat/chip (8L)

29 FF word Layer 1Layer 2 Layer 3Layer 4 HIT Cell 0 Cell 1 Cell 2 Cell 3 Output Bus ONE PATTERN HIT

30 Tracking in 2 steps: find Roads first (Pattern Matching with Associative Memory, AM), then find Tracks inside Road (fit by TF). Track fitting using full resolution of the detector. [Diagram blocks: Data Organizer (DO), Associative Memory (AM), Track Fitter (TF); labels: Hits, Super Strip (SS), Roads, Roads + hits, Full Resolution Hits, Road, Tracks parameters (d, pT, φ, η, z).] Large SS: a lot of fakes + combinatorics inside roads. Hot point at high occupancy.

31 What we have now: Standard Cell 180 nm
5000 patterns/chip for 6-layer patterns, 2500 patterns/chip for 12-layer patterns
“A VLSI Processor for Fast Track Finding Based on Content Addressable Memories”, IEEE Transactions on Nuclear Science, Volume 53, Issue 4, Part 2, Aug. 2006, pages 2428-2433
NEXT: NEW VERSION for both L1 & L2
90 nm technology provides a factor 4 → 10000 patterns/chip
Full custom cell provides at least a factor 2 → 20000 patterns/chip
8 layers instead of 12 provides a factor 1,5 → 30000 patterns/chip
1,5 x 1,5 cm**2 2D chip → 60000 patterns/chip
Going to 65 nm → 120000 patterns/chip
With a 2D chip we gain a factor 50!
1 AMboard: 128 chips → ~15 Mpatterns per board
1 crate: 16 AMboards → ~245 Mpatterns per crate
100 MHz running clock
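The scaling chain on this slide is a product of gain factors. The sketch below reproduces the first three steps exactly and then uses the per-chip figure quoted on the slide for the board and crate totals. The raw area ratio is computed against an assumed 1 cm^2 reference die; the slide rounds that step down to a factor 2.

```python
per_chip = 2500                  # 12-layer patterns/chip, 180 nm standard cell
for factor in (4, 2, 1.5):       # 90 nm, full custom cell, 8 layers instead of 12
    per_chip *= factor
print(per_chip)                  # 30000.0

area_gain = 1.5 * 1.5            # raw area ratio vs. an assumed 1 cm^2 die
per_chip_65nm = 120_000          # figure quoted on the slide after area and 65 nm gains
per_board = 128 * per_chip_65nm  # 128 chips per AMboard
per_crate = 16 * per_board       # 16 AMboards per crate
print(area_gain, per_board / 1e6, per_crate / 1e6)  # 2.25 15.36 245.76
```

The overall gain of 120000 / 2500 = 48 is what the slide rounds to "a factor 50".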

32 Pattern bank Add encoder kill Bus0[17:0] Bus1[17:0] Bus2[17:0] Bus3[17:0] Bus4[17:0] Bus5[17:0]

33 Power consumption
Old chip (180 nm, 1,8 V core, 40 MHz, area 1x1 cm**2): 1,8 Watt
New chip 90 nm, 1 V core: corr. factor 1/(1,8*1,8) → 0,56 Watt
New chip 100 MHz: corr. factor 100/40 → 1,39 Watt
New chip area 4 cm**2: corr. factor 4/1 → 5,56 Watt
New pre-match feature: corr. factor 1/3 (1/2) → 1,85 (2,78) Watt
Per crate, 16 x 128 = 2048 chips: 3,8 (5,7) kW
IF the pre-match feature brings the power down to 1/3, the new 2D chip (1,85 W) ~ old chip (1,8 W)
ANY OTHER IDEA TO GAIN IN POWER INCREASES THE POTENTIAL TO GROW IN THE THIRD DIRECTION
we would like to be 4 funding agencies involved:
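The correction factors on this slide multiply together. A short sketch that re-derives the quoted numbers, assuming power scales with the square of the core voltage and linearly with clock frequency and area, which is how the factors on the slide read; the variable names are illustrative:

```python
old_chip_w = 1.8                      # old chip: 180 nm, 1.8 V core, 40 MHz, 1 cm^2
voltage = (1.0 / 1.8) ** 2            # core voltage 1.8 V -> 1 V
frequency = 100 / 40                  # clock 40 MHz -> 100 MHz
area = 4 / 1                          # area 1 cm^2 -> 4 cm^2

new_chip_w = old_chip_w * voltage * frequency * area
with_prematch_third = new_chip_w / 3  # pre-match factor 1/3
with_prematch_half = new_chip_w / 2   # pre-match factor 1/2
chips_per_crate = 16 * 128

print(round(new_chip_w, 2))                                    # 5.56
print(round(with_prematch_third, 2), round(with_prematch_half, 2))
print(round(with_prematch_third * chips_per_crate / 1000, 1),
      round(with_prematch_half * chips_per_crate / 1000, 1))   # kW per crate
```

The per-crate figures come out at about 3.8 kW and 5.7 kW, matching the table.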

34 LHC Schedule. Concentrate now on 2013-2015 (17-19 pile-up events). Consider evolution up to 2019 (41,5 pile-up events << simulated 75 ev) → Intermediate chip! 2020 comes much later and will profit from a very advanced technology. Sim with 75 pile-up events after 2020! 17,6 pile-up ev. @ 2.6×10^33; 19,0 pile-up ev. @ 10^34. Annovi, 27-09-2010

35 Our Schedule
1. TSMC 65 nm, low power, available as mini@sic (Vcc_core = 1,2 V).
2. 65 nm mini@sic 22,5 k€/block; 90 nm mini@sic 18,6 k€/block.
3. "Variable resolution" gives good results → early production of AM04.
4. We missed the 90 nm September 2010 run.
5. We propose to move directly to a 65 nm prototype.
6. This is a preliminary schedule to produce new LAMBs for 2013:
(1) Mini@sic submission: spring or October 2011
(2) Delivery: ~February 2012
(3) Tested: ~June 2012
(4) MPW submission: from June 2012
(5) Delivery: from November 2012
(6) Tested: from February 2013
(7) MPW production: from February 2013
(8) Delivery: from July 2013
(9) Mounted on new LAMBs: from autumn 2013

36 Costs
2 blocks Mini@sic: paid by Italy
MPW run:
–TSMC 2010: 12 mm^2, 80 kUSD → 6,7 kUSD/mm^2
–UMC 2010: 4 mm x 4 mm, 70 k€ → 4,37 k€/mm^2
12 mm^2 ~ 1/8 of the AMchip03 area in CDF → 7500 patterns/chip → 960 kpatterns/AMBoard
With 2 blocks, 160 kUSD → ~2 Mpatterns/AMBoard
In 2012 it could cost less – Academia Sinica can help on the price. Italy – Germany – USA – Academia Sinica (reduction).
For 2013: small production = 8+2 AMBoards = 1280 chips. How many wafers? How much for a wafer?
We would like to have 4 funding agencies, especially for the final step: whole-wafer mask at the time when a large-area chip is needed:
–UMC 2010 90 nm: 555 kUSD
–TSMC 2010 65 nm: 1300-900 kUSD
–TSMC 2010 65 nm MLM: 650-950 kUSD
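The unit costs and board totals on this slide follow from simple ratios. A sketch using only the quoted numbers, plus the 128 chips per AMBoard figure from the earlier slides; the variable names are illustrative:

```python
tsmc_kusd_per_mm2 = 80 / 12               # 12 mm^2 MPW block for 80 kUSD
umc_keur_per_mm2 = 70 / (4 * 4)           # 4 mm x 4 mm for 70 kEUR
patterns_per_board = 7500 * 128           # 7500 patterns/chip, 128 chips per AMBoard
chips_small_production = (8 + 2) * 128    # 8+2 AMBoards

print(round(tsmc_kusd_per_mm2, 1), umc_keur_per_mm2)   # 6.7 4.375
print(patterns_per_board, chips_small_production)      # 960000 1280
```

Doubling the silicon to 2 blocks doubles the patterns per board to 1.92 M, which the slide rounds to ~2 Mpatterns/AMBoard.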

37 add_in add_out Pipelines of AM chips AMchip Control = GLUE

38 AM INDI AMTOP Bus0 Bus1 Bus3 Bus2 AMBOTTOM Bus0 Bus5 Bus1Bus3 Bus2 Bus4 Bus5 PAT_ADD_IN [17:0] PAT_ADD_OUT [17:0] REV_EN add_in add_out LAMB

39 AM GLUE FIFOS RECEIVERs & DRIVERs (ROAD bus + 6 HIT buses) LAMB CONNECTORs VME INTERFACE ROADCONNECTOR HITCONNECTOR FPGA I/O control PIPELINE REGISTERs INDI HIT [17:0] ADD OUT [30:0] TRACKs 6 bus (108 bits!) Four 8- chips (top- bottom) pipeline

40 LAMB Standard cell chip 40 MHz clock FPGA for Roads FTK AMBoard P3 serial LVDS Control FPGA for SS Input CDF AMBoard with 4 LAMBs Complementary Functions in the AUX board 16 AMBoards per “core” crate → 8 core crates in the system

41 AM0+TSP+DO+TF+HW CPU vme AM1+TSP+DO+TF +HW AM2+TSP+DO+TF +HW AM3+TSP+DO+TF +HW AM4+TSP+DO+TF +HW AM5+TSP+DO+TF +HW AM6+TSP+DO+TF +HW AM7+TSP+DO+TF +HW 11LayFit+HW AM10+….. AM11+….. AM12+….. AM13+…… AM14+….. AM8+….. AM9+…... 11LayFit+ HW final AM15+….. 11LayFit+ HW final 11LayFit+HW LAMB Standard cell chip 40 MHz clock FPGA for Roads AMBoard P3 serial LVDS Control FPGA for SS Input AUX card Connectors for Hits LVDS Cables DO+TF+HW HWTF DO INPUT FIFOs HWTF DO HWTF DO HWTF DO Connectors for tracks output Interface SSMAP Processing Unit

42 The whole system: Data Formatter + 8 core crates

43 6 18-bit buses, hit rate: 40 MHz/bus, input bandwidth of 4 Gbit/s. Divide into φ sectors with overlaps. Pixel barrel, SCT barrel, Pixel disks. 6-12 logical layers: full η coverage. IEEE Trans. Nucl. Sci. 51, 391 (2004). Overlaps require hits in a small region to be sent to two neighboring AMs. Goal, high lum: 8 φ sectors, 8 9U VME crates for the FTK core. [diagram labels: 1/2 φ AM, 1/2 φ AM]

44 Whatever is the power of the AM we can build, we can do better with the TSP

45 Algorithm: NIM A287 (1990) 436-438, http://www.pi.infn.it/~paola/Tree_search_algorithm.pdf. Tree Search Processor: NIM A 287, 431 (1990), http://www.pi.infn.it/~orso/ftk/NIMA287_431.pdf; IEEE Toronto, Canada, November 8-14 1998, http://www.pi.infn.it/~paola/TSP_v14.pdf. [diagram labels: THIN ROAD; FAT ROAD, found by AM (default SS for example); bins 1 2 3 4 and 1234 5678; Depth 0, Depth 1, Depth 2; PATTERN BLOCK; PARENT PATTERN]

46 The AM chip for each found road could provide: 1) The Road IDentifier (address). 2) The Bitmap: one bit per layer, saying which SSs are empty & which are full (11 bits, e.g. 11101111111). 3) 4 more bits for each layer (Sub-SS), saying which of the 4 SS subdivisions are empty and which are full (4 bits × 8 layers). Higher resolution SS (sub-SS) to be stored in AM or in a Mini-DO, & LSB bits should be provided to TSP. Example: 2-level TSP → divide each SS by 4.
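The three pieces of information listed on this slide can be pictured as one record per road. The sketch below is hypothetical: the field names, the dictionary layout and the bit ordering (layer 0 and sub-division 0 as least significant bits) are assumptions for illustration, not the chip's output format.

```python
def sub_ss_bits(hit_subdivisions):
    """4-bit mask for one layer: bit i is set if sub-division i of the SS has a hit."""
    mask = 0
    for sub in hit_subdivisions:   # each value in 0..3
        mask |= 1 << sub
    return mask

def road_record(road_id, hits_per_layer):
    """hits_per_layer: one list of hit sub-division indices (0..3) per layer."""
    sub_masks = [sub_ss_bits(hits) for hits in hits_per_layer]
    bitmap = 0
    for layer, mask in enumerate(sub_masks):
        if mask:                   # the SS is "full" if any sub-division has a hit
            bitmap |= 1 << layer
    return {"road_id": road_id, "bitmap": bitmap, "sub_ss": sub_masks}

# 8 layers; layer 2 is empty, layer 1 has hits in two sub-divisions
record = road_record(42, [[0], [1, 2], [], [3], [0], [0], [2], [1]])
print(format(record["bitmap"], "08b"), record["sub_ss"])
```

For 8 layers the sub-SS part is 4 x 8 = 32 bits on top of the road identifier and the layer bitmap.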

47 Conclusions
–Applications at future instantaneous luminosities will require an extremely performant AM.
–Even if extremely performant, the AM work could be refined by the TSP, which could fit in the same package with the AM chip in a 2.5D technology. This actually is NOT true any more, probably, before 2020.
–The AM could be used for both L1 and L2 applications.
–Any AM pattern capacity increase would be an important advantage for both L1 and L2 tracking systems.

48 BACKUP

49 New AMchip features Alberto Annovi INFN Frascati

50 Outline: Use of patterns; Variable size patterns; New input busses; Disabling patterns –Increase effective production yield

51 Annovi, 27-09-201051 The Event... The Pattern Bank Pattern matching

52 Tracking in 2 steps. 1. Find low resolution track candidates called “roads”. Solve most of the pattern recognition. 2. Then fit tracks inside roads. Thanks to the 1st step it is much easier. Tracking with ~offline quality. Super Bin (SB). Critical parameter: SS size. Affects: - Number of patterns for given efficiency: cost - Number of found roads: workload for next step

53 Pattern efficiency [plot: efficiency vs. # of patterns in AMchips (barrel only, 45 degrees in φ); 90% efficiency level marked; axis marks at 65M and 500M; one working point labelled “Want this”] Pattern size: r-φ 24 pixel, 20 SCT; z 36 pix. Pattern size: r-φ 12 pixel, 10 SCT; z 36 pix. Plot labels: = 342k, = 40k.

54 Efficiency curve [plot: efficiency vs. # of patterns in AMchips (barrel only, 45 degrees in φ)] Need many patterns for little efficiency?? Super Bins are discrete. Edge effects give lots of patterns with little coverage.

55 Annovi, 27-09-201055 TSP simulation & varying-resolution pattern banks Guido Volpi & Roberto Vitillo - Pisa Depth 0 Depth 1 Depth 2 PARENT PATTERN FAT ROAD Thin ROAD AM resolution TSP resolution We do have now a structured “pattern bank”, where each thin road is connected to its parent pattern in FTKsim. Ongoing tests for TSP algo after the RoadFinder (AMsim) in FTKsim; we have studied the bank composition and AM FAKE roads. AM Fake road is a AM matched pattern whose kids do not match the event Low probability to fire AM patterns: few kids (1 or 2): big advantage to match it at TSP resolution! All blank Half-SS can fire @ AM level as fakes while @ TSP level the fake has good probability to be deleted LOW coverage patterns High probability to fire AM patterns (symmetric): many kids (up to 20 or more): no advantage to match it at TSP resolution! More than one kid can fire @ TSP level. Low probability to be a fake AM road HIGH coverage patterns KID PATTERN @Depth 0 PARENT @Depth 1

56 Annovi, 27-09-201056 We can use don’t care on the least significant bit when we want to match the pattern layer @ AM resolution or use all the bits to match it @ TSP resolution Test of AM patterns: 1.all single kid patterns @ TSP resolution 2.For all few kid patterns use don’t care only for layers where both Half-SS are used by kids AM resolution (don’t care ) TSP resolution (care) to exclude the right half in these layers Guido Volpi & Roberto Vitillo - Pisa All AM roads AM roads with at least 1 matched kid Fake AM roads # of kids WH @10 34 How to implement “variable resolution” in the AMchip AM pattern distribution vs Number of kids Majority of patterns with a single Kid AM & TSP Pattern Bank for 23 ev. pileup # of kids

57 AM with care/don’t care [plot vs. # of kids] TSP: 38000; AM@TSP: 28000; AM@DC: 44000; AM: 342000. Care/don’t care is very effective in reducing the number of roads. Area cost on the chip: approx. 1 extra cell for each DC bit. Now 15 cells/layer. With 1 DC bit the area increases by 1/15 ~ 7%. For comparison, going to TSP resolution would require 3x patterns.

58 Number of busses Currently we have 6 input busses New AMchip should handle 8 layers IBL will require 2 busses for higher b/w External SCT layers needs half b/w Current package constraint max 7 input busses 3 options: implement 2 of them to be selected online Annovi, 27-09-201058

59 8 Layers vs 7 buses (option 1) [diagram blocks: input register for 7 busses; demultiplex based on MSB; internal register that feeds 8 busses; 8 internal buses; pattern bank with 8 matching layers. Bus labels: Extra, Pix, Pix, Pix, SCT, SCT, SCT 2 & 3.]

60 IBL: 7 Layers vs 7 buses. Special IBL layer: OR of 2 layers. [diagram blocks: input register for 7 busses; demultiplex based on MSB; internal register that feeds 8 busses. Bus labels: IBL, IBL, Pix, Pix, SCT, SCT, SCT 2 & 3.] IBL at double bandwidth: either double the internal clock, or use special logic. Take the logical OR of 2 layers: both layers store the IBL super bin, 50% of the data is distributed to each layer, and the layer matches if either of the 2 IBL layers matches.

61 IBL: 8 Layers vs 7 buses. IBL with double clock. [diagram blocks: input register for 7 busses; demultiplex based on MSB; internal register that feeds 8 busses. Bus labels: IBL, IBL, Pix, Pix, ???, SCT, SCT 2 & 3.] IBL at double bandwidth: either double the internal clock, or use special logic. Take the logical OR of 2 layers: both layers store the IBL super bin, 50% of the data is distributed to each layer, and the layer matches if either of the 2 IBL layers matches.

62 Amchip 03 yields AMchip03 prototype 2004 –1cm^2 MPW yield 35% AMchip03 production 2005 –1cm^2 pilot run yield 70% Large fraction of failures due to single pattern defect. Add one register to disable bad patterns –Will allow the use of all chips with a single (or few) pattern defects. Area cost small: 1 flip-flop/pattern (not /layer)

63 Changes to AMChip specifications Amchip 03 specs: –http://www-cdf.fnal.gov/publications/cdf7339_amchip03_specs.ps New features –Add 1 or 2 don’t care bits/layer –Increase input busses to 7 with multiplexing & special handling of IBL –Add disable FF for each pattern

64 BACKUP Annovi, 27-09-201064

65 Two possible Approaches to expand into the third direction VIPRAM - Vertically Integrated Pattern Recognition Associative Memory Ted/Jim/Aida/Ray/Gregory/Simon/Silvia/Marcel/Gary/Mel/Bob… FNAL/ANL/UC/Tezzaron/… 1. “Identical Tier” 3D architecture (actually 2.5 D?) 2.“True 3D” Implementation

66 Trying to define a collaboration Italy-USA for DOE application to Generic R&D funds (ATLAS FTK - Fermilab CMS, both interested)

67 All equal tiers: put them in pipeline as done on the board

68 The 3D IO Wrapper must be designed and fabricated around the 2D AMchip to ensure that all tiers act as a single chip as shown in Figure 5. Even for prototyping purposes, it is not possible to simply take an existing, fabricated AMchip and place it inside a rectangular doughnut-shaped 3D IO Wrapper. There are several ways to address this. First, the 2D AMchip could be redesigned in a 3D process like Tezzaron/Chartered, and then the 3D IO Wrapper could be designed around it. This method has no obstacles to its 3D fabrication. However, it does require the redesign of the AMchip. Second, the CMOS UMC process could be used for 3D development even though UMC does not have a 3D process. This method requires no redesign of the AMchip, but it does require UMC to be willing to participate in a “Via Middle” process in which after a certain number of fabrication steps, the wafers are shipped to a “Via Middle company” (e.g. Tezzaron) where the first steps of the Through Silicon Via process are started. Then the wafers are shipped back to UMC where the 2D processing is completed. Finally, UMC ships the completed wafers to the Via Middle Company where 3D processing is completed. Not all companies are willing to participate in a Via Middle process.

69 The True 3D: 1 tier/ Layer + 1 control tier Control Tier Tier 4 Tier 3 Tier 2 Tier 1

70 CAM in 2D

71 Very high density of patterns

72 Advantages 2D chip: ready soon with ~best technology (65 nm today, 40 or better in 2020), 1 single mask, probably enough for LVL2, could allow 2,5D True 3D: less consuming Tiers, much larger banks useful for LVL1? Less latency compared to pipelined Tiers. True 3D: Important if we need much larger banks than provided by 2D. COSTS? Fermilab proposes “True 3D” as a phase I R&D

73 EVEN MORE – Phase II Adding more planes? Could we include DO – TF and HW? All planes that fit well in a 2,5 D scheme All of them well known and testable on FPGA before! AMchip Flexible TSP Logic-FPGA like ? Memories for TSP MINIDO? DO + TF + HW ? Integration of VLSI chips with FPGA and RAMs

74 Conclusions They present 2 phases: “true 3D” first, Integration with FPGA and memories second. We think that in a short time scale it is important to understand the power of 2D design: density of patterns available/needed. For LVL2 seems ok 2D pushed at best technology. Consumption We could try the 2D chip to be used as 2.5 D as Phase I On a longer time scale, try the “True 3D” as Phase II

75 Amchip04 with umc90 std cells. UMC90 FSD0A_A standard cells library. Our custom standard cells: single_layer, search_line. Tools used: Synopsys DC D-2010.03-SP1-1 (synthesis); Cadence SoC Encounter v07.10-s219_1 (placement, routing); Synopsys PT D-2010.03-SP1-1 (timing analysis); custom scripts (manual place)

76 Basic bank structure......... 8x Input Bus Matched patterns Buffer 32x Patterns (row) 32x Majority - Manual placement - Majority row has its own clock tree

77 Basic bank structure......... 8x Input Bus Matched patterns Buffer 32x Patterns (row) 32x Majority - Manual placement - Majority row has its own clock tree A pattern is a row: 8x single_layer cells. Each cell matches a 15-bit bus

78 Basic bank structure......... 8x Input Bus Matched patterns Buffer 32x Patterns (row) 32x Majority - Manual placement - Majority row has its own clock tree Majority logic: if X out of 8 buses match, the pattern is matched. X is programmable via JTAG

79 CLK Match Line Match_reg BL [15:4] BL_N [15:4] XXXXXX DATA XXXXX XXX ZZZZZZ Mlpre_n slpre slpre_t BL [3:0] BL_N [3:0] XXXXXX DATA XXXXX XXX MLSA_res SEN

80 CLK Match Line Match_reg BL [15:4] BL_N [15:4] XXXXXX DATA XXXXX XXX ZZZZZZ Mlpre_n slpre slpre_t BL [3:0] BL_N [3:0] XXXXXX DATA XXXXX XXX MLSA_res SEN All these signals are inputs to the single_layer pattern cell to activate the match. Relative timing is critical! Generated in each Buff module by global “read” signals

81 512 patterns bank 16 x 32 pattern blocks are manually placed to build a 512 patterns bank. Horizontal and vertical gaps are left for power grid.

82 All logic placed The pattern bank occupies most of the area. All the other control logic scales very weakly with the number of patterns. We could try to fill the chip with a bigger column of patterns (~800), but it is not critical for this mini@sic prototype to have a bigger bank.

83 Logic scheme

84 Power grid Power distribution is done by two big horizontal stripes and two thinner vertical stripes. We are waiting for feedback from IMEC about this power grid design.

85 512 patt AMCHIP04 routed First results of routing (wroute, clock tree routed first, no post-routing optimization) are reasonable: - routing is simple and consistent with our plans in the bank area (vertical buses, horizontal output) - no critical congestion in other areas

86 Timing Analysis
We have working skeleton scripts for static timing analysis.
A first look at the timing with PrimeTime showed various setup and hold violations.
No post-route optimization was done; buffer optimization in this step might remove most of the violations.
Global signals running through the whole pattern column have setup violations:
–Force a better routing of the column area
–Manually optimize buffer usage
–Split the column in two shorter columns
Some optimization and re-routing is needed, but no critical flaws are detected.

87 Full Custom Associative Memory Core With respect to standard cell design of the memory chip we want to: Increase memory density Reduce power consumption

88 CAM model Simple schematic of a CAM with 4 words having 3 bits each. The schematic shows individual core cells, differential searchlines, and matchline sense amplifiers (MLSAs). CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells. For simplicity, the figure omits the usual SRAM access transistors and associated bitlines.

89 NAND Type SRAM Cell

90 NAND Type SRAM Cell Layout NAND Cell dimensions: 2.8 micron height 3.8 micron width

91 NOR Type SRAM Cell

92 NOR Type SRAM Cell Layout NOR Cell dimensions: 2.8 micron height 3.62 micron width

93 MatchLine Sense Amplifier (MLSA) Positive feedback differential sense amplifier Matchline discharge transistor Output inverter Amplifier resetting transistors Amplifier resetting transistor

94 MatchLine Sense Amplifier Layout MLSA dimensions: 2.8 micron height 7.3 micron width

95 NOR Type Matchline Model The main feature of the NOR matchline is its high speed of operation. In the slowest case of a one-bit miss in a word, the critical evaluation path is through the two series transistors in the cell that form the pulldown path.

96 NAND Type Matchline Models A feature of the NAND matchline is that a miss stops signal propagation such that there is no consumption of power past the final matching transistor in the serial nMOS chain Two drawbacks of the NAND matchline are: a quadratic delay dependence on the number of cells a low noise margin.

97 Selective Precharge Model

98 Selective Precharge

99 Estimated Power Consumption The Associative Memory core estimated power consumption (at 100MHz clock frequency) with NOR cell match line scheme is about 3 A. The core power supply is 1V. Associative memory core (60000 pattern) running at 100MHz clock frequency with Selective Precharge matchline scheme We have obtained an 80% reduction in power consumption

100 Selective Precharge Timing (all bits match) Matchline precharge MLSA output Search line (Bit line) Precharge MLSA enable Matchline discharge Matchline NOR cell Searchline and Matchline Precharge phase Matchline Evaluation phase Matchline Discharge phase

101 Selective Precharge Timing (NOR bit mismatch) Matchline precharge MLSA output Search line (Bit line) Precharge MLSA enable Matchline discharge Matchline NOR cell Searchline and Matchline Precharge phase Matchline Evaluation phase Matchline Discharge phase

102 Selective Precharge Timing (NAND bit mismatch) Matchline precharge MLSA output Search line (Bit line) Precharge MLSA enable Matchline discharge Matchline NOR cell Searchline and Matchline Precharge phase Matchline Evaluation phase Matchline Discharge phase

103 Layer Layout Width: 67.2 micron Height: 2.8 micron Matchline precharge Transistor NAND cells NOR cells MLSA and Matchline discharge transistor

104 Timing

105 Conclusions I have completed the layout of the full layer. The obtained layout is quite compact. The estimated memory core power consumption is reduced by about 80% with respect to a NOR type matchline model. To do: complete the remaining full custom part (search line precharge of the NOR cell and the MLSA Vref); complete the layer simulation with Monte Carlo analysis; simulate the full associative memory chip.

106 Milestone #9: Specify system size: 1×10^34 and 3×10^33. Concentrate now on 2013-2015 (17-19 pile-up events). 2020 comes much later and will profit from a very advanced technology. Sim with 75 pile-up events after 2020! 17,6 pile-up ev. @ 2.6×10^33; 19,0 pile-up ev. @ 10^34

107 Using the variable resolution in a new AM chip for 10^34
WH events @ 10^34 (# of pile-up events = 23). Bank coverage ~ 95%.
8.0 MPat @ TSP → 2,80 MPat @ AM level (35%) per region (barrel only)
20 MPat @ TSP → 7 MPat @ AM level (35%) per region (all detector)
DATA FLOW (Option A), assuming 16 AMboards in a core crate (numbers are for barrel only; a factor ~2,5 has to be applied for “all detector”): 3600 roads/AMboard, of which 733 have a kid match at TSP level → 80% fakes.
Using TSP resolution in the AM bank for AM patterns with 1, 2, 3 kids: 3600 goes down to 1325 roads/AMboard → gaining a factor ~3!
For a full detector FTK: less than 4000 roads/AMboard @ AM out with a limit of 8000; less than 2000 roads/AMboard @ TSP out with a limit of 4000.
FTK Demonstrator with old chip, barrel only: running now on 17,6 pile-up events to understand DATA FLOW → however we consider it a test.
It is not necessary to have large margins for 2013. Even a small AMchip (12 mm^2) @ 65 nm (MPW 80 k€) with variable resolution implemented could do it, even without the TSP. Very low consumption.
Guido Volpi & Roberto Vitillo - Pisa
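The fake fraction and the gain quoted on this slide follow directly from the road counts. A short sketch using only the barrel-only numbers given above; the variable names are illustrative:

```python
roads_am = 3600            # roads/AMboard at AM resolution
roads_with_kid = 733       # of which have a kid match at TSP level
roads_variable_res = 1325  # roads/AMboard with TSP resolution for 1-, 2-, 3-kid patterns

fake_fraction = 1 - roads_with_kid / roads_am
gain = roads_am / roads_variable_res

print(f"{fake_fraction:.0%} fakes, gain x{gain:.1f}")  # 80% fakes, gain x2.7
```

The computed gain of about 2.7 is what the slide rounds to "a factor ~3".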

108 THE AMCHIP04 PROTOTYPE. 180 nm, 90 nm. 90 nm miniasic. NEXT YEAR, MAYBE MARCH: mini-asic COULD be 90 or 65 nm. Design: L. Sartori (Ferrara), M. Beretta (LNF), E. Bossini, F. Crescioli, I. Sacco (Pisa). Test: A. Lanza (Pavia)

109 The FTK CHALLENGING PART: the NEW AMCHIP & the TSP Where we can stack the TSP? In the AUX board just after the AMBoard? In the AMBoard itself? In the Lamb to reduce early the # of roads? Even better in the AMchip 2.5 D! LAMB Standard cell chip 40 MHz clock FPGA +TSP?

