AM chip schedule Alberto. Design activities (17/11/2010): adapt JTAG and boundary scan to MPW chip; design new CAM cells, buffer logic; new majority logic.

AM chip schedule Alberto

Design activities (17/11/2010)
Adapt JTAG and boundary scan to MPW chip
Design new CAM cells, buffer logic
New majority logic
Opcode interface to new majority
New logic for kill?
PADs layout

Schedule/milestones
End of March:
– a first version of the boundary scan is available
– preliminary layout of the majority logic cell available
– first version of the CAM-block available
Starting beginning of April: work on a first integration of the chip
End of April (after simulation and debugging):
– a final version of the boundary scan is available
– a complete layout of the majority logic cell
– final design for the CAM-block available
Starting beginning of May: work on the final integration of the chip
End of May: successful integration of all pieces
– during May: test and simulation of the integrated design
During June: based on simulation, decide what adjustments are needed
End of June: all simulations of the chip completed
– the final, integrated chip should be completely simulated and approved
Early July: submission

Majority interface
component majority
  port (
    CLK                : in  std_logic;
    INIT               : in  std_logic;
    REQ_LAY0           : in  std_logic;
    FORCE_READ         : in  std_logic;  -- force read regardless of n-miss and layer0
    DISABLE_READ       : in  std_logic;
    MISS0              : in  std_logic;
    MISS1              : in  std_logic;
    MISS2              : in  std_logic;
    -- signals to write the DISABLE_THIS SRAM cell
    WL                 : in  std_logic;
    DISABLE_THIS_SET   : in  std_logic;
    DISABLE_THIS_RESET : in  std_logic;
    LAYER_MATCH        : in  std_logic_vector (7 downto 0);
    READ_FLAG          : in  std_logic;
    PATTERN_MATCH      : out std_logic  -- no ';' after the last port
  );
end component;

Majority interface
(Block diagram of the majority block and its signals.)
Inputs: CLK, INIT, layer_match, Read_flag
Global cfg signals (change just after the clk rising edge): MISS0, MISS1, MISS2, Require_layer0, Force_read, Disable_read
Signals to write the disable_this bit of a pattern: WL (write line), Disable_this_SET, Disable_this_RESET
Outputs: Pattern_match and layer_match (to the Fischer tree)
Layer_match is needed to output the map of fired layers.


Where to put the Flip Flops?
(Block and timing diagram. Blocks: bitline, CAM cells (1 layer), match line (ML), match line sense amplifier (MLSA), layer-match SR latch, majority logic & flag logic, pattern match. Signals: clk, clk_N, match enable, ML reset. Timing steps: bit line propagation, ML match, majority propagation.)

Where to put the Flip Flops?
(Detail of one layer, schematic and timing diagram. Blocks: bitline, CAM cells (1 layer), match line (ML), layer-match SR latch with S and R inputs, current source. Signals: clk_N, match enable, reset. Timing steps: bit line propagation, ML match.)

Use 16 blocks of 512 patterns?
If the die is 3.2 mm x 3.75 mm:
Usable area excluding PADs: 3040 um x 3590 um
Can fit 3 blocks in height: 3*979 um = 2937 um
Can almost fit 6 blocks in width: 6*600 um = 3600 um
Depends on block size!
Can fit 3*5 blocks. Can fit 3*6 if a block is narrower than 600 um
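The fit estimate on this slide can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic: the 160 um PAD allowance per dimension is inferred from the quoted usable area, and the 598 um alternative width is the limit quoted on the block-size slide.

```python
# Sketch: how many pattern blocks fit on the die (all sizes in micrometres).
DIE_W_UM, DIE_H_UM = 3750.0, 3200.0   # die size quoted on the slide
PAD_ALLOWANCE_UM = 160.0              # inferred: 3750 - 3590 = 3200 - 3040 = 160

def blocks_that_fit(usable_um: float, block_um: float) -> int:
    """Number of whole blocks of size block_um that fit in usable_um."""
    return int(usable_um // block_um)

usable_w = DIE_W_UM - PAD_ALLOWANCE_UM   # 3590 um
usable_h = DIE_H_UM - PAD_ALLOWANCE_UM   # 3040 um

rows = blocks_that_fit(usable_h, 979.0)       # blocks in height
cols_600 = blocks_that_fit(usable_w, 600.0)   # blocks in width at 600 um
cols_598 = blocks_that_fit(usable_w, 598.0)   # blocks in width at 598 um

print(rows, cols_600, cols_598)
```

With a 600 um block only five columns fit, which is why the slide says six "almost" fit.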


Block size April 7th, 201
Full custom block 64x4 (half of 64 patterns)
– Width 225.4 um * height 122.4 um (68 rows)
Pattern block of 64 patterns + buffer
– Buffer: 1 row?
– 8 layers: width 450.8 um, no majority
– Height with buffer: 124.2 um
Height of 512 patterns = 993.6 um
Height of 3x512 patterns = 2980.8 um
3.2 mm – 160 um PADs – 2980.8 um = 59.2 um, or 32 rows
Max pattern block width
– (Chip 3.75 mm – 160 um PADs)/6 = 3590 um / 6 = 598 um
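The height budget above can be reproduced as follows. This is a sketch under two assumptions: a 512-pattern bank is eight stacked 64-pattern blocks (consistent with 8 x 124.2 um = 993.6 um), and one row is 122.4 um / 68 = 1.8 um high.

```python
# Sketch of the vertical budget for three 512-pattern banks (micrometres).
ROW_HEIGHT_UM = 122.4 / 68          # 68 rows in a 122.4 um block
BLOCK64_HEIGHT_UM = 124.2           # 64-pattern block including buffer row

height_512 = 8 * BLOCK64_HEIGHT_UM  # assumed: 8 stacked blocks of 64 patterns
height_3x512 = 3 * height_512

spare_um = 3200.0 - 160.0 - height_3x512        # die height minus PADs minus banks
spare_rows = int(spare_um / ROW_HEIGHT_UM)      # whole rows left over

max_block_width_um = (3750.0 - 160.0) / 6       # width limit for 6 columns

print(round(height_3x512, 1), round(spare_um, 1), spare_rows)
```

The leftover of about 59 um corresponds to 32 whole rows, matching the figure on the slide.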

AMchip03 document tribId=0&materialId=0&confId=3021

What we have now: standard cell 180 nm
5000 patterns/chip for 6-layer patterns, 2500 patterns/chip for 12-layer patterns
“A VLSI Processor for Fast Track Finding Based on Content Addressable Memories”, IEEE Transactions on Nuclear Science, Volume 53, Issue 4, Part 2, Aug Page(s):
NEXT: NEW VERSION, for both L1 & L2
65 nm technology provides a factor 8 → patterns/chip
Full custom cell provides at least a factor 2 → patterns/chip
8 layers instead of 12 provides a factor 1.5 → patterns/chip
1.2 x 1.2 cm^2 2D chip → patterns/chip
With a 2D chip we gain a factor 30!
1 AMboard: 128 chips → ~10 Mpatterns per board
1 crate: 16 AMboards → ~160 Mpatterns per crate
Current prototype under design: 65 nm TSMC, 12 mm^2 MPW run, 100 MHz running clock
8000 patterns/chip, 8 layers each
Layer words of 12 bits + 3 ternary bits → variable resolution patterns
A. Annovi - ACES CERN
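As a rough cross-check of the quoted "factor 30", the individual gain factors can simply be multiplied. The sketch below assumes the area gain is taken relative to a 1 cm^2 die (the AMchip03 size quoted elsewhere in this deck); the names are illustrative only.

```python
from math import prod

# Gain factors as quoted on the slide.
factors = {
    "65 nm technology": 8.0,
    "full custom cell": 2.0,
    "8 layers instead of 12": 1.5,
}
# ASSUMPTION: area gain of a 1.2 x 1.2 cm^2 chip relative to a 1 cm^2 die.
area_ratio = 1.2 * 1.2

gain_same_area = prod(factors.values())       # technology + cell + layers
gain_with_area = gain_same_area * area_ratio  # including the larger die

print(gain_same_area, round(gain_with_area, 1))
```

The product comes out somewhat above 30, so the slide's figure reads as a rounded, slightly conservative estimate.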

Pattern efficiency
(Plot: efficiency vs. # of patterns in AMchips, barrel only, 45 degrees; 90% level marked; axis marks at 65M and 500M.)
Pattern size r-φ: 24 pixels, 20 SCT strips; z: 36 pixels
Pattern size (half size) r-φ: 12 pixels, 10 SCT strips; z: 36 pixels
Annotations: = 342k, = 40k; “Want this”

Variable resolution AM
(Diagram: finer patterns vs. coarser pattern; DC = don’t care.)
We can use don’t care on the least significant bit when we want to match the pattern at coarser resolution, or use all the bits to match at finer resolution.
Patterns with 1 kid are stored at finer precision.
Layers without “don’t care” (DC) can ignore the hits in the “wrong” side of the layer.
With 2 “don’t care” bits per layer we gain an effective factor of 5 in patterns.
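The effect of a don’t care bit on a single layer comparison can be sketched in Python. This is a behavioural illustration only, not the CAM cell logic; the function name and the mask encoding (1 = ignore this bit) are assumptions.

```python
def layer_matches(stored: int, hit: int, dont_care_mask: int = 0) -> bool:
    """True if hit equals stored on every bit not flagged as don't care.

    dont_care_mask has a 1 in each bit position that must be ignored
    (hypothetical encoding chosen for this sketch).
    """
    differing_bits = stored ^ hit
    return (differing_bits & ~dont_care_mask) == 0

stored_word = 0b1010

# Finer resolution: all bits are compared, so the neighbouring bin is rejected.
fine_same = layer_matches(stored_word, 0b1010)
fine_neighbour = layer_matches(stored_word, 0b1011)

# Coarser resolution: the least significant bit is don't care,
# so both halves of the coarse bin match.
coarse_neighbour = layer_matches(stored_word, 0b1011, dont_care_mask=0b0001)

print(fine_same, fine_neighbour, coarse_neighbour)
```

One stored word with a don’t care LSB therefore covers the two fine bins that would otherwise need two separate patterns.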

Goal: x30 pattern density but lower power consumption
32 patterns of 8 layers ~ 60 µm x 500 µm ~ 1 or 2 pixels

Tasks in Italy
New PADs layout (Stabile, Milano)
New CAM cells (Matteo, Frascati)
– Complete NAND cell
– Evaluate advantages of SRAM transistors
– Other cells
Clean up project scripts (Francesco, Pisa)
Transition to 65 nm and new tools
Place and route (Francesco, Stabile)

JTAG work (Germany)
JTAG logic to be looked at after Laura left
New features:
– Add all MPW pins to the boundary scan
– Extend JPATT_DATA register to include a disable_pattern bit as bit[0]
– Extend register for new busses (8 instead of 6)

Majority logic (Fermilab)
Current draft from Jim is good and almost complete
Features to be added:
– Account for individual pattern disable
– 4 thresholds: disable, 0miss, 1miss, 2miss
– Include option require_layer0
Layer0 should be the closest to final AND
Design the basic majority logic cell
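The requested features can be summarised in a small behavioural model. This is a sketch, not the Fermilab design: the bit ordering (bit 0 = layer 0), the use of `None` for the "disable" threshold and the function name are assumptions made for illustration.

```python
from typing import Optional

N_LAYERS = 8  # LAYER_MATCH is 8 bits wide in the majority interface

def pattern_match(layer_match: int,
                  max_miss: Optional[int],
                  require_layer0: bool = False,
                  disable_this: bool = False) -> bool:
    """Behavioural sketch of one majority cell (not the real netlist).

    layer_match    : bitmap of fired layers, bit 0 = layer 0 (assumed order)
    max_miss       : 0, 1 or 2 allowed missing layers; None = threshold "disable"
    require_layer0 : if set, layer 0 must have fired
    disable_this   : per-pattern disable bit
    """
    if disable_this or max_miss is None:
        return False
    if require_layer0 and not (layer_match & 1):
        return False
    fired = bin(layer_match & ((1 << N_LAYERS) - 1)).count("1")
    return N_LAYERS - fired <= max_miss

print(pattern_match(0b11111110, max_miss=1))                       # one miss allowed
print(pattern_match(0b11111110, max_miss=1, require_layer0=True))  # layer 0 missing
```

Keeping the per-pattern disable and the global threshold as separate inputs mirrors the split between the disable_this bit and the global configuration signals on the interface slide.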

OPCODE work (Germany)
Translate OPCODE output for the new majority logic. 5 lines out:
– Disable_match, 0miss, 1miss, 2miss
– require_layer0
Optional if time allows (coding and testing)
– Change input OPCODE protocol to single word

Kill tree (Fermilab)
Optional if time allows (coding and testing)
– Try a new scheme for the kill tree
Short description:
– Current scheme encodes the highest priority pattern (encoder)
– Then decodes it to set one kill FF
– 1024 (or N patt) kill lines are then distributed to the patterns
Alternative: calculate kill along with the priority encoder in a tree-like fashion
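One way to picture the alternative is a recursive tree that returns the encoded address and the one-hot kill vector together, so no separate decoder is needed. The sketch below is an illustration under stated assumptions, not the proposed circuit: the lowest index is taken as the highest priority, and the function name is hypothetical.

```python
from typing import List, Tuple

def priority_kill_tree(matches: List[bool]) -> Tuple[bool, int, List[bool]]:
    """Return (found, index, kill) for the highest-priority matched pattern.

    ASSUMPTION: the lowest index has the highest priority.
    kill is one-hot: True only for the pattern selected for readout.
    index is meaningful only when found is True.
    """
    n = len(matches)
    if n == 0:
        raise ValueError("need at least one pattern")
    if n == 1:
        return matches[0], 0, [matches[0]]
    half = n // 2
    l_found, l_idx, l_kill = priority_kill_tree(matches[:half])
    r_found, r_idx, r_kill = priority_kill_tree(matches[half:])
    if l_found:
        # Left subtree wins: nothing in the right subtree is killed.
        return True, l_idx, l_kill + [False] * (n - half)
    # Otherwise forward the right subtree's result, offset by the left size.
    return r_found, half + r_idx, [False] * half + r_kill

found, index, kill = priority_kill_tree([False, True, False, True, True])
print(found, index, kill)
```

Each tree node only combines the results of its two children, so the kill information follows the same structure as the priority encoder instead of being decoded and fanned out over N lines.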

Draft schedule
Aim for March submission (tight)
– All dates below are my guess for discussion today
NAND cell: November
NOR cell: end of December
NOR don’t care cell: mid January (?)
Match line amplifier: beginning of February
– This item is late in the schedule
New PADs layout: end of November
– Important to check that all pads fit in a single MPW block
Project clean up and first place and route: end of December
Preliminary version of majority: mid December
– Final: end of January
New JTAG logic: end of December (needed for 1st place and route)
New OPCODE and kill tree: end of January (mid February at latest)

Draft schedule
During February:
– Put everything together
– Final place and route
Prepare a detailed model of each CAM cell for mixed simulation of a few patterns.
– Doing this in February is late; we should start it earlier, but it is currently uncovered
From now till the end: comparison of the implemented model with the C++ model for debugging.
– Ilaria Sacco (Pisa)
– This item is understaffed; we will need help here
Are we missing any important item?
First goal: get Jim and Hans up and running
– Please ask for the information needed to start up

Outline
The Pattern Matching and the Associative Memory (AM)
Why the denser the AM, the better
Associative memory architecture
How chips are put together: LAMB → AMboard → crate
The Tree Search Processor & its location

The Event... The Pattern Bank TRACKING WITH PATTERN MATCHING

The Associative Memory – AM = Bingo
Bingo scorecard
Dedicated device - maximum parallelism:
Each pattern with private comparator
Track search during detector readout
AM chip generations (technology: kpat/chip, layers per pattern):
Full custom 700 nm: 0.128 kpat/chip, 6 layers
FPGA 350 nm: 0.128 kpat/chip, 6 layers
Standard cell 180 nm: 5.0 kpat/chip, 6 layers
New for FTK, 90 nm: ~60 kpat/chip, 8 layers
New for FTK, 65 nm: ~120 kpat/chip, 8 layers
2 tiers, 65 nm, 2.5D: 240 kpat/chip, 8 layers

(Diagram of ONE PATTERN: four cells, Cell 0 to Cell 3, one per layer, Layer 1 to Layer 4; each cell holds a word and a flip-flop (FF); HIT input; output bus.)

Tracking in 2 steps: find roads first (pattern matching with the Associative Memory, AM), then find tracks inside each road (fit by the TF)
(Data flow diagram: Hits → Data Organizer (DO) → Super Strip (SS) hits → Associative Memory (AM) → Roads → DO → Roads + hits → Track Fitter (TF) → Tracks parameters (d, pT, φ, η, z).)
Track fitting using the full resolution of the detector (full resolution hits)
Large SS: a lot of fakes + combinatorics inside roads
Road Hot occupancy

What we have now: standard cell 180 nm
5000 patterns/chip for 6-layer patterns, 2500 patterns/chip for 12-layer patterns
“A VLSI Processor for Fast Track Finding Based on Content Addressable Memories”, IEEE Transactions on Nuclear Science, Volume 53, Issue 4, Part 2, Aug Page(s):
NEXT: NEW VERSION, for both L1 & L2
90 nm technology provides a factor 4 → patterns/chip
Full custom cell provides at least a factor 2 → patterns/chip
8 layers instead of 12 provides a factor 1.5 → patterns/chip
1.5 x 1.5 cm^2 2D chip → patterns/chip
Going to 65 nm → patterns/chip
With a 2D chip we gain a factor 50!
1 AMboard: 128 chips → ~15 Mpatterns per board
1 crate: 16 AMboards → ~245 Mpatterns per crate
100 MHz running clock

Pattern bank Add encoder kill Bus0[17:0] Bus1[17:0] Bus2[17:0] Bus3[17:0] Bus4[17:0] Bus5[17:0]

Power consumption
Step: correction factor → power per chip
Old chip, 180 nm, 1.8 V core: 1.8 W
New chip, 90 nm, 1 V core: 1/(1.8*1.8) → 0.56 W
Frequency, 40 MHz → new chip 100 MHz: 100/40 → 1.39 W
Area, 1x1 cm^2 → new chip 4 cm^2: 4/1 → 5.56 W
New: pre-match feature: 1/3 (1/2) → 1.85 (2.78) W
Per crate: 16 x 128 = 2048 chips → 3.8 (5.7) kW
IF the pre-match feature brings the consumption down to 1/3 or less, new 2D chip (1.85 W) ~ old chip (1.8 W)
ANY OTHER IDEA TO GAIN IN POWER INCREASES THE POTENTIALITY TO GROW IN THE THIRD DIRECTION
we would like to be 4 funding agencies involved:
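The scaling chain in the power table can be reproduced step by step. The sketch below only replays the slide's own correction factors; the variable names are illustrative.

```python
# Replay of the slide's power scaling, starting from the old chip.
OLD_CHIP_W = 1.8                       # 180 nm, 1.8 V core, 40 MHz, 1 cm^2

voltage_factor = 1.0 / (1.8 * 1.8)     # core voltage 1.8 V -> 1 V
frequency_factor = 100.0 / 40.0        # 40 MHz -> 100 MHz
area_factor = 4.0 / 1.0                # 1 cm^2 -> 4 cm^2

p_voltage = OLD_CHIP_W * voltage_factor
p_frequency = p_voltage * frequency_factor
p_area = p_frequency * area_factor

p_prematch_third = p_area / 3          # pre-match reduces power to 1/3
p_prematch_half = p_area / 2           # more pessimistic case: 1/2

chips_per_crate = 16 * 128
crate_kw_third = p_prematch_third * chips_per_crate / 1000
crate_kw_half = p_prematch_half * chips_per_crate / 1000

for label, value in [("voltage", p_voltage), ("frequency", p_frequency),
                     ("area", p_area), ("pre-match 1/3", p_prematch_third),
                     ("pre-match 1/2", p_prematch_half)]:
    print(f"{label:>14}: {value:.2f} W")
print(f"crate: {crate_kw_third:.1f} kW ({crate_kw_half:.1f} kW)")
```

The intermediate values agree with the table to the quoted precision.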

LHC Schedule
Concentrate now on (17-19 pile-up events)
Consider evolution up to 2019 (41.5 pile-up events << simulated 75 ev) → intermediate chip!
2020 comes much later and will profit from a very advanced technology... Sim with 75 pile-up events after 2020!
(Chart labels: 17,6 pile-up ,0 pile-up)

Our Schedule
1. TSMC 65 nm, low power, available as (Vcc_core = 1.2 V) nm 22.5 k€/block; 90 nm 18.6 k€/block.
3. “Variable resolution” gives good results → early production of AM04
4. We missed the 90 nm 2010 September run
5. We propose to move directly to a 65 nm prototype.
6. This is a preliminary schedule to produce new LAMBs for 2013:
(1) submission: spring or October
(2) delivery: ~February 2012
(3) tested: ~June 2012
(4) MPW submission: from June 2012
(5) delivery: from November 2012
(6) tested: from February 2013
(7) MPW production: from February 2013
(8) delivery: from July 2013
(9) mounted on new LAMBs: from autumn 2013

Costs
2 blocks paid by Italy
MPW run:
TSMC 2010: 12 mm^2, 80 kUSD → 6.7 kUSD/mm^2
UMC 2010: 4 mm x 4 mm, 70 k€ → 4.37 k€/mm^2
12 mm^2 ~ 1/8 AMchip03 area in CDF → 7500 patterns/chip → 960 kpatterns/AMBoard
With 2 blocks, 160 kUSD → ~2 Mpatterns/AMBoard
In 2012 it could cost less: Academia Sinica can help on price. Italy – Germany – USA – Academia Sinica (reduction).
For 2013: small production = 8+2 AMBoards = 1280 chips. How many wafers? How much for a wafer?
We would like to have 4 funding agencies involved, especially for the final step.
Whole wafer, when a large area chip is needed: UMC nm: 555 kUSD; TSMC nm: kUSD; TSMC nm MLM: kUSD
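The unit costs and board capacities quoted above follow from simple ratios, sketched here with the slide's own numbers (128 chips per AMBoard is the figure used elsewhere in this deck; variable names are illustrative).

```python
# Cost per unit area of the two 2010 MPW offers.
tsmc_kusd_per_mm2 = 80.0 / 12.0          # 12 mm^2 block for 80 kUSD
umc_keur_per_mm2 = 70.0 / (4.0 * 4.0)    # 4 mm x 4 mm for 70 kEUR

# Pattern capacity of one AMBoard.
CHIPS_PER_BOARD = 128
patterns_one_block = 7500 * CHIPS_PER_BOARD       # one 12 mm^2 block per chip
patterns_two_blocks = 2 * patterns_one_block      # two blocks per chip

# Chips needed for the small 2013 production of 8 + 2 boards.
chips_small_production = (8 + 2) * CHIPS_PER_BOARD

print(round(tsmc_kusd_per_mm2, 1), umc_keur_per_mm2)
print(patterns_one_block, patterns_two_blocks, chips_small_production)
```

Two blocks give 1.92 million patterns per board, which the slide rounds to ~2 Mpatterns.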

add_in add_out Pipelines of AM chips AMchip Control = GLUE

AM INDI AMTOP Bus0 Bus1 Bus3 Bus2 AMBOTTOM Bus0 Bus5 Bus1Bus3 Bus2 Bus4 Bus5 PAT_ADD_IN [17:0] PAT_ADD_OUT [17:0] REV_EN add_in add_out LAMB

AM GLUE FIFOS RECEIVERs & DRIVERs (ROAD bus + 6 HIT buses) LAMB CONNECTORs VME INTERFACE ROADCONNECTOR HITCONNECTOR FPGA I/O control PIPELINE REGISTERs INDI HIT [17:0] ADD OUT [30:0] TRACKs 6 bus (108 bits!) Four 8- chips (top- bottom) pipeline

LAMB Standard cell chip 40 MHz clock FPGA for Roads FTK AMBoard P3 serial LVDS Control FPGA for SS Input CDF AMBoard with 4 LAMBs Complementary Functions in the AUX board 16 AMBoards per “core” crate → 8 core crates in the system

(System diagram: one VME crate with a CPU and 16 processing units, AM0 to AM15, each labelled AM+TSP+DO+TF+HW, plus 11LayFit+HW final-fit units. Processing Unit = AMBoard (LAMBs with standard cell chips at 40 MHz clock, FPGA for roads, control FPGA for SS input, P3 serial LVDS) + AUX card (connectors for hits via LVDS cables, input FIFOs, SSMAP interface, DO, TF and HW blocks, connectors for tracks output).)

The whole system: Data Formatter + 8 core crates

6 18-bit buses, hit rate: 40 MHz/bus, input bandwidth of 4 Gbit/s
Divide into φ sectors with overlaps
(Diagram: pixel barrel, SCT barrel, pixel disks; two AMs, each labelled 1/2 φ.)
6-12 logical layers: full φ coverage
IEEE Trans. Nucl. Sci. 51, 391 (2004)
Overlaps require hits in a small region to be sent to two neighboring AMs
Goal, high lum: 8 φ sectors, 8 9U VME crates for the FTK core
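The quoted input bandwidth is the product of bus count, bus width and hit rate, as this short check shows (one 18-bit word per bus per clock is assumed).

```python
N_BUSES = 6            # input hit buses
BITS_PER_BUS = 18      # bus width in bits
HIT_RATE_HZ = 40e6     # one word per bus at 40 MHz (assumed: one word per clock)

bandwidth_gbit_s = N_BUSES * BITS_PER_BUS * HIT_RATE_HZ / 1e9
print(f"{bandwidth_gbit_s:.2f} Gbit/s")
```

The exact product is 4.32 Gbit/s, which the slide rounds to 4 Gbit/s.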

Whatever is the power of the AM we can build, we can do better with the TSP

Algorithm: NIM A287 (1990)
Tree Search Processor: NIM A 287, 431 (1990), IEEE Toronto, Canada, November
(Diagram: PARENT PATTERN = FAT ROAD found by the AM (default SS for example) at depth 0; THIN ROADs at depth 1 and depth 2; PATTERN BLOCK.)

The AM chip for each found road could provide:
1) The Road IDentifier (address)
2) The bitmap: one bit per layer, saying which SSs are empty & which are full (e.g. 11 bits)
3) 4 more bits for each layer (sub-SS), saying which of the 4 SS subdivisions are empty and which are full (4 bits × 8 layers)
Higher resolution SS (sub-SS) to be stored in the AM or in a Mini-DO & LSB bits should be provided to the TSP
Example: 2-level TSP → divide each SS by 4

Conclusions
The application at future instantaneous luminosities will require an extremely performing AM.
Even if extremely performing, the AM work could be refined by the TSP, which could fit in the same package with the AM chip in a 2.5D technology. This actually is probably NOT true any more before 2020.
The AM could be used for both L1 and L2 applications.
Any AM pattern capacity increase would be an important advantage for both L1 and L2 tracking systems.

BACKUP

New AMchip features Alberto Annovi INFN Frascati

Outline
Use of patterns
Variable size patterns
New input busses
Disabling patterns
– Increase effective production yield

The Event... The Pattern Bank
Pattern matching

Tracking in 2 steps
1. Find low resolution track candidates called “roads”. Solve most of the pattern recognition.
2. Then fit tracks inside roads. Thanks to the 1st step it is much easier.
Tracking with ~offline quality
Super Bin (SB)
Critical parameter: SS size. Affects:
- Number of patterns for given efficiency: cost
- Number of found roads: workload for next step

Pattern efficiency
(Plot: efficiency in % vs. # of patterns in AMchips, barrel only, 45 degrees; axis marks at 65M and 500M.)
Pattern size r-φ: 24 pixels, 20 SCT; z: 36 pixels
Pattern size r-φ: 12 pixels, 10 SCT; z: 36 pixels
Annotations: = 342k, = 40k; “Want this”

Efficiency curve
(Plot: efficiency vs. # of patterns in AMchips, barrel only, 45 degrees.)
Need many patterns for little efficiency??
Super Bins are discrete
Edge effects give lots of patterns with little coverage

TSP simulation & varying-resolution pattern banks (Guido Volpi & Roberto Vitillo - Pisa)
(Diagram: PARENT PATTERN = FAT ROAD at AM resolution, depth 0; kids = thin roads at TSP resolution, depth 1 and depth 2; KID 0, KID 1.)
We now have a structured “pattern bank”, where each thin road is connected to its parent pattern in FTKsim. Tests of the TSP algorithm after the RoadFinder (AMsim) in FTKsim are ongoing; we have studied the bank composition and AM fake roads. An AM fake road is an AM matched pattern whose kids do not match the event.
LOW coverage patterns: low probability to fire AM patterns; few kids (1 or 2): big advantage to match them at TSP resolution! All blank Half-SS can fire at AM level as fakes; at TSP level the fake has a good probability to be deleted.
HIGH coverage patterns: high probability to fire AM patterns (symmetric); many kids (up to 20 or more): no advantage to match them at TSP resolution! More than one kid can fire at TSP level. Low probability to be a fake AM road.

How to implement “variable resolution” in the AMchip (Guido Volpi & Roberto Vitillo - Pisa)
We can use don’t care on the least significant bit when we want to match the pattern at AM resolution, or use all the bits to match at TSP resolution.
Test of AM patterns:
1. For all single-kid patterns use TSP resolution
2. For all few-kid patterns use don’t care only for layers where both Half-SS are used by kids
(Diagram: AM resolution (don’t care) vs. TSP resolution (care), to exclude the right half in these layers.)
(Plot: AM pattern distribution vs. number of kids, AM & TSP pattern bank for 23 ev. pile-up; curves: all AM roads, AM roads with at least 1 matched kid, fake AM roads.)
Majority of patterns with a single kid

AM with care/don’t care
(Plot vs. # of kids; curves: TSP, AM.)
Care/don’t care is very effective at reducing the number of roads. Area cost on the chip: approx. 1 extra cell for each DC bit. Now 15 cells/layer; with 1 DC bit the area increases by 1/15 ~ 7%. For comparison, going to TSP resolution would require 3x patterns.

Number of busses
Currently we have 6 input busses
New AMchip should handle 8 layers
IBL will require 2 busses for higher b/w
External SCT layers need half b/w
Current package constraint: max 7 input busses
3 options: implement 2 of them, to be selected online

8 Layers vs 7 buses (option 1)
(Diagram: input register for 7 busses → demultiplex based on MSB → internal register that feeds 8 busses → pattern bank with 8 matching layers, 8 internal buses. Bus labels: Extra, Pix, Pix, Pix, SCT, SCT, SCT 2 & 3.)
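The "demultiplex based on MSB" step for the shared bus (SCT 2 & 3 in the diagram) could look like the sketch below. The word width of 15 bits, the use of the top bit as selector and the function name are assumptions for illustration; the slide does not specify the encoding.

```python
from typing import Tuple

def demux_by_msb(word: int, width: int = 15) -> Tuple[int, int]:
    """Split a word from the shared input bus into (select, payload).

    ASSUMPTION: the most significant bit of a width-bit word selects which
    of the two internal buses receives the remaining bits.
    """
    if not 0 <= word < (1 << width):
        raise ValueError("word does not fit in the bus width")
    select = word >> (width - 1)                  # 0 or 1
    payload = word & ((1 << (width - 1)) - 1)     # remaining width-1 bits
    return select, payload

# Two words arriving on the shared bus, routed to different internal buses.
print(demux_by_msb(0x0005))               # MSB clear
print(demux_by_msb((1 << 14) | 0x0005))   # MSB set
```

Spending one bit on routing halves the address space per layer on that bus, which fits the remark that the external SCT layers need only half the bandwidth.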

IBL: 7 Layers vs 7 buses (special IBL layer: OR of 2 layers)
(Diagram: input register for 7 busses → demultiplex based on MSB → internal register that feeds 8 busses. Bus labels: IBL, IBL, Pix, Pix, SCT, SCT, SCT 2 & 3.)
Double bandwidth: either double the internal clock, or use special logic.
Take the logical OR of 2 layers. Both layers store the IBL super bin. Distribute 50% of the data to each layer. The layer matches if any of the 2 IBL layers matches.

IBL: 8 Layers vs 7 buses (IBL with double clock)
(Diagram: input register for 7 busses → demultiplex based on MSB → internal register that feeds 8 busses. Bus labels: IBL, IBL, Pix, Pix, ??????, SCT, SCT 2 & 3.)
Double bandwidth: either double the internal clock, or use special logic.
Take the logical OR of 2 layers. Both layers store the IBL super bin. Distribute 50% of the data to each layer. The layer matches if any of the 2 IBL layers matches.

Amchip 03 yields
AMchip03 prototype 2004
– 1 cm^2 MPW, yield 35%
AMchip03 production 2005
– 1 cm^2 pilot run, yield 70%
Large fraction of failures due to single pattern defect.
Add one register to disable bad patterns
– Will allow using all chips with a single (or few) pattern defects.
Area cost small: 1 flip-flop/pattern (not /layer)

Changes to AMChip specifications
Amchip 03 specs:
– http://www-cdf.fnal.gov/publications/cdf7339_amchip03_specs.ps
New features
– Add 1 or 2 don’t care bits/layer
– Increase input busses to 7 with multiplexing & special handling of IBL
– Add disable FF for each pattern

BACKUP

Two possible approaches to expand into the third direction
VIPRAM - Vertically Integrated Pattern Recognition Associative Memory
Ted/Jim/Aida/Ray/Gregory/Simon/Silvia/Marcel/Gary/Mel/Bob… FNAL/ANL/UC/Tezzaron/…
1. “Identical Tier” 3D architecture (actually 2.5D?)
2. “True 3D” implementation

Trying to define a collaboration Italy-USA for DOE application to Generic R&D funds (ATLAS FTK - Fermilab CMS, both interested)

All equal tiers: put them in pipeline as done on the board

The 3D IO Wrapper must be designed and fabricated around the 2D AMchip to ensure that all tiers act as a single chip as shown in Figure 5. Even for prototyping purposes, it is not possible to simply take an existing, fabricated AMchip and place it inside a rectangular doughnut-shaped 3D IO Wrapper. There are several ways to address this. First, the 2D AMchip could be redesigned in a 3D process like Tezzaron/Chartered, and then the 3D IO Wrapper could be designed around it. This method has no obstacles to its 3D fabrication. However, it does require the redesign of the AMchip. Second, the CMOS UMC process could be used for 3D development even though UMC does not have a 3D process. This method requires no redesign of the AMchip, but it does require UMC to be willing to participate in a “Via Middle” process in which after a certain number of fabrication steps, the wafers are shipped to a “Via Middle company” (e.g. Tezzaron) where the first steps of the Through Silicon Via process are started. Then the wafers are shipped back to UMC where the 2D processing is completed. Finally, UMC ships the completed wafers to the Via Middle Company where 3D processing is completed. Not all companies are willing to participate in a Via Middle process.

The True 3D: 1 tier/ Layer + 1 control tier Control Tier Tier 4 Tier 3 Tier 2 Tier 1

CAM in 2D

Very high density of patterns

Advantages
2D chip: ready soon with ~best technology (65 nm today, 40 nm or better in 2020), 1 single mask, probably enough for LVL2, could allow 2.5D.
True 3D: less consuming tiers, much larger banks (useful for LVL1?), less latency compared to pipelined tiers.
True 3D: important if we need much larger banks than provided by 2D. COSTS?
Fermilab proposes “True 3D” as a phase I R&D

EVEN MORE – Phase II
Adding more planes? Could we include DO – TF and HW?
All planes that fit well in a 2.5D scheme
All of them well known and testable on FPGA before!
(Stack diagram: AMchip; flexible TSP logic, FPGA-like?; memories for TSP, MINIDO?; DO + TF + HW?)
Integration of VLSI chips with FPGA and RAMs

Conclusions
They present 2 phases: “true 3D” first, integration with FPGA and memories second.
We think that on a short time scale it is important to understand the power of the 2D design: density of patterns available/needed (for LVL2, 2D pushed to the best technology seems OK) and consumption.
We could try to use the 2D chip as 2.5D as Phase I.
On a longer time scale, try the “True 3D” as Phase II.

Amchip04 with umc90 std cells
UMC90 FSD0A_A standard cells library
Our custom standard cells: single_layer, search_line
Tools used:
– Synopsys DC D SP1-1 (synthesis)
– Cadence SoC Encounter v07.10-s219_1 (placement, routing)
– Synopsys PT D SP1-1 (timing analysis)
– Custom scripts (manual place)

Basic bank structure
(Diagram: input bus, buffer, 32x patterns (one per row), 32x majority, matched patterns output.)
- Manual placement
- Majority row has its own clock tree

Basic bank structure: a pattern is a row of 8 single_layer cells. Each cell matches a 15-bit bus.

Basic bank structure: majority logic. If X out of 8 buses match, the pattern is matched. X is programmable via JTAG.

(Timing diagram. Signals shown: CLK, Match Line, Match_reg, BL[15:4] and BL_N[15:4], BL[3:0] and BL_N[3:0], Mlpre_n, slpre, slpre_t, MLSA_res, SEN.)

(Same timing diagram: CLK, Match Line, Match_reg, BL[15:4] and BL_N[15:4], BL[3:0] and BL_N[3:0], Mlpre_n, slpre, slpre_t, MLSA_res, SEN.)
All these signals are inputs to the single_layer pattern cell to activate the match. Relative timing is critical! Generated in each Buff module by global “read” signals.

512 patterns bank 16 x 32 pattern blocks are manually placed to build a 512 patterns bank. Horizontal and vertical gaps are left for power grid.

All logic placed
The pattern bank occupies most of the area. All the other control logic scales very weakly with the number of patterns. We could try to fill the chip with a bigger column of patterns (~800), but it is not critical for this prototype to have a bigger

Logic scheme

Power grid
Power distribution is done by two big horizontal stripes and two thinner vertical stripes. We are waiting for feedback from IMEC about this power grid design.

512 patt AMCHIP04 routed
First results of routing (wroute, clock tree routed first, no post-routing optimization) are reasonable:
- routing is simple and consistent with our plans in the bank area (vertical buses, horizontal output)
- no critical congestion in other areas

Timing Analysis
We have working skeleton scripts for static timing analysis.
A first look at the timing with PrimeTime showed various setup and hold violations.
No post-route optimization was done; buffer optimization in this step might remove most of the violations.
Global signals running through the whole pattern column have setup violations. Options:
– Force a better routing of the column area
– Manually optimize buffer usage
– Split the column into two shorter columns
Some optimization and re-routing is needed, but no critical flaws have been detected.

Full Custom Associative Memory Core With respect to the standard cell design of the memory chip we want to: - Increase memory density - Reduce power consumption

CAM model Simple schematic of a CAM with 4 words having 3 bits each. The schematic shows individual core cells, differential searchlines, and matchline sense amplifiers (MLSAs). CAM core cells for (a) 10-T NOR-type CAM and (b) 9-T NAND-type CAM. The cells are shown using SRAM-based data-storage cells. For simplicity, the figure omits the usual SRAM access transistors and associated bitlines.
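The 4-word by 3-bit CAM in the schematic can be mimicked with a small behavioural model. This is a sketch of the search function only, with hypothetical names (`matchline`, `cam_search`); it says nothing about the transistor-level cells.

```python
WIDTH = 3  # bits per word, as in the schematic


def matchline(stored_word, search_key, width=WIDTH):
    """A matchline stays asserted only if every cell matches its searchline bit."""
    return all(
        ((stored_word >> bit) & 1) == ((search_key >> bit) & 1)
        for bit in range(width)
    )


def cam_search(stored_words, search_key):
    """Compare the key with all stored words at once; one match bit per word."""
    return [matchline(word, search_key) for word in stored_words]


# Four stored words of three bits each (example contents are arbitrary).
words = [0b101, 0b011, 0b101, 0b000]
matchlines = cam_search(words, 0b101)
```

Here words 0 and 2 hold the searched value, so their matchlines are the only ones that report a match.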

NAND Type SRAM Cell

NAND Type SRAM Cell Layout NAND Cell dimensions: 2.8 micron height 3.8 micron width

NOR Type SRAM Cell

NOR Type SRAM Cell Layout NOR Cell dimensions: 2.8 micron height 3.62 micron width

MatchLine Sense Amplifier (MLSA) Diagram labels: positive feedback differential sense amplifier, matchline discharge transistor, output inverter, amplifier resetting transistors, amplifier resetting transistor.

MatchLine Sense Amplifier Layout MLSA dimensions: 2.8 micron height 7.3 micron width

NOR Type Matchline Model The main feature of the NOR matchline is its high speed of operation. In the slowest case of a one-bit miss in a word, the critical evaluation path is through the two series transistors in the cell that form the pulldown path.

NAND Type Matchline Models A feature of the NAND matchline is that a miss stops signal propagation, such that there is no consumption of power past the final matching transistor in the serial nMOS chain. Two drawbacks of the NAND matchline are: - a quadratic delay dependence on the number of cells - a low noise margin.
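The quadratic delay dependence can be seen with a first-order Elmore estimate, modelling the serial nMOS chain as a uniform RC ladder. This is a simplification chosen for illustration: the per-transistor resistance and per-node capacitance are normalised to 1 and are not values from this design.

```python
def nand_chain_elmore_delay(n_cells, r_on=1.0, c_node=1.0):
    """Elmore delay at the far end of a uniform RC ladder with n_cells stages.

    Node k is charged through k series transistors, so the total is
    r_on * c_node * (1 + 2 + ... + n) = r_on * c_node * n * (n + 1) / 2.
    """
    return sum(k * r_on * c_node for k in range(1, n_cells + 1))


delay_4 = nand_chain_elmore_delay(4)
delay_8 = nand_chain_elmore_delay(8)
delay_16 = nand_chain_elmore_delay(16)
```

In normalised units the delay goes 10, 36, 136 for 4, 8 and 16 cells: each doubling of the chain length costs close to a factor 4, which is why long words are not evaluated with a NAND chain alone.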

Selective Precharge Model

Selective Precharge

Estimated Power Consumption The estimated consumption of the Associative Memory core (at 100 MHz clock frequency) with the NOR cell matchline scheme is about 3 A at a core power supply of 1 V, i.e. about 3 W. For the associative memory core (60000 patterns) running at 100 MHz clock frequency with the Selective Precharge matchline scheme, we have obtained an 80% reduction in power consumption.
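The figures on this slide can be combined in a few lines. The sketch assumes that the 80% reduction applies to the 3 A NOR estimate and that both estimates refer to the same 60000-pattern core; the slide does not state either explicitly.

```python
SUPPLY_V = 1.0         # core power supply, from the slide
NOR_CURRENT_A = 3.0    # NOR matchline estimate at 100 MHz, from the slide
REDUCTION = 0.80       # reduction quoted for Selective Precharge
N_PATTERNS = 60000     # assumed to apply to both estimates

nor_power_w = SUPPLY_V * NOR_CURRENT_A
selective_power_w = nor_power_w * (1.0 - REDUCTION)

# Per-pattern figures in microwatts.
nor_uw_per_pattern = nor_power_w / N_PATTERNS * 1e6
selective_uw_per_pattern = selective_power_w / N_PATTERNS * 1e6
```

Under these assumptions the core goes from about 3 W to about 0.6 W, i.e. from roughly 50 to roughly 10 microwatts per pattern.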

Selective Precharge Timing (all bits match) Signals: Matchline precharge, MLSA output, Search line (Bit line) Precharge, MLSA enable, Matchline discharge, Matchline NOR cell. Phases: Searchline and Matchline Precharge phase, Matchline Evaluation phase, Matchline Discharge phase.

Selective Precharge Timing (NOR bit mismatch) Signals: Matchline precharge, MLSA output, Search line (Bit line) Precharge, MLSA enable, Matchline discharge, Matchline NOR cell. Phases: Searchline and Matchline Precharge phase, Matchline Evaluation phase, Matchline Discharge phase.

Selective Precharge Timing (NAND bit mismatch) Signals: Matchline precharge, MLSA output, Search line (Bit line) Precharge, MLSA enable, Matchline discharge, Matchline NOR cell. Phases: Searchline and Matchline Precharge phase, Matchline Evaluation phase, Matchline Discharge phase.
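The three timing cases (all bits match, NOR bit mismatch, NAND bit mismatch) can be summarised with a behavioural sketch of selective precharge. It assumes that the four low bits BL[3:0] form the NAND first stage and BL[15:4] the NOR second stage; this split is suggested by the signal names in the timing slides but is not stated, and the function name is hypothetical.

```python
def selective_precharge_search(word, key, nand_bits=4, width=16):
    """Return (matched, nor_stage_active) for one stored word.

    The NOR part of the matchline is only precharged and evaluated when
    the NAND first stage matches; otherwise it stays idle and saves power.
    """
    low_mask = (1 << nand_bits) - 1
    high_mask = ((1 << width) - 1) & ~low_mask
    if (word & low_mask) != (key & low_mask):
        return False, False            # NAND bit mismatch: NOR stage never fires
    return (word & high_mask) == (key & high_mask), True


KEY = 0xABCD
all_match = selective_precharge_search(0xABCD, KEY)
nor_mismatch = selective_precharge_search(0x1BCD, KEY)   # differs in BL[15:12]
nand_mismatch = selective_precharge_search(0xABCE, KEY)  # differs in BL[3:0]
```

Only words that pass the first stage spend energy on the NOR matchline, which is where the power saving of this scheme comes from.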

Layer Layout Width: 67.2 micron. Height: 2.8 micron. Diagram labels: Matchline precharge transistor, NAND cells, NOR cells, MLSA and Matchline discharge transistor.
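The layer width can be cross-checked against the cell dimensions quoted in the earlier layout slides. The cell counts used here (4 NAND cells and 12 NOR cells per 16-bit layer) are an assumption based on the BL[3:0] and BL[15:4] split; the slides do not give them.

```python
NAND_WIDTH_UM = 3.8    # from the NAND cell layout slide
NOR_WIDTH_UM = 3.62    # from the NOR cell layout slide
MLSA_WIDTH_UM = 7.3    # from the MLSA layout slide
LAYER_WIDTH_UM = 67.2  # from this slide
LAYER_HEIGHT_UM = 2.8  # from this slide

N_NAND = 4   # assumed cell count
N_NOR = 12   # assumed cell count

used_um = N_NAND * NAND_WIDTH_UM + N_NOR * NOR_WIDTH_UM + MLSA_WIDTH_UM
remaining_um = LAYER_WIDTH_UM - used_um   # left for precharge transistor, spacing
layer_area_um2 = LAYER_WIDTH_UM * LAYER_HEIGHT_UM
```

With these counts the cells and the MLSA take about 65.94 micron, leaving about 1.26 micron of the 67.2 micron layer, and the layer area is about 188 square micron.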

Timing

Conclusions I have completed the layout of the full layer. The obtained layout is quite compact. The estimated memory core power consumption is reduced by about 80% with respect to a NOR type matchline model. To do: - Complete the remaining full custom part (Search line precharge of the NOR cell and the MLSA Vref) - Complete the layer simulation with Monte Carlo analysis - Simulation of the full associative memory chip

Annovi, Milestone #9: Specify system size. 1×10^34 and 3×10^33. Concentrate now on (17-19 pile-up events). 2020 comes much later and will profit from a very advanced technology... Sim with 75 pile-up events after 2020! 17,6 pile-up ,0 pile-up 10^34

Annovi, Using the variable resolution in a new AM chip for WH (# of pile-up events = 23). Banks coverage ~ 95%. 8.0 → 2,80 AM level (35%) per region (barrel only). 20 TSP → 7 AM level (35%) per region (all detector). Using TSP resolution in the AM bank for AM patterns with 1, 2, 3 kids: 3600 goes down to 1325 roads/AMboard → gaining a factor ~ 3! For a full detector FTK: less than 4000 out with a limit of less than 2000 out with a limit of Guido Volpi & Roberto Vitillo - Pisa. FTK Demonstrator with old chip, barrel only: running now on 17,6 pile-up events to understand DATA FLOW → however we consider it a test. It is not necessary to have large margins for Even a small AMchip (12 mm² in 65 nm, MPW 80 k€) with variable resolution implemented could do it, even without the TSP. Very low consumption. DATA FLOW (Option A) assuming 16 AMboards in a core crate (numbers are for barrel only; a factor ~2,5 has to be applied for “all detector”): 3600 roads/AMboard, of which 733 have a kid match at TSP level → 80% fakes.
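The road counts quoted on this slide can be checked with simple arithmetic. The sketch only reuses the three numbers given for the barrel (3600, 1325 and 733 roads per AMboard); the variable names are illustrative.

```python
ROADS_AM_ONLY = 3600         # roads/AMboard at AM resolution
ROADS_VARIABLE_RES = 1325    # roads/AMboard with TSP resolution in the AM bank
ROADS_WITH_KID_MATCH = 733   # roads with a kid match at TSP level

# Gain from variable resolution ("a factor ~ 3" on the slide).
reduction_factor = ROADS_AM_ONLY / ROADS_VARIABLE_RES

# Fraction of AM-level roads with no kid match at TSP level ("80% fakes").
fake_fraction = 1.0 - ROADS_WITH_KID_MATCH / ROADS_AM_ONLY
```

The exact ratios are about 2.7 for the reduction factor and about 79.6% for the fake fraction, consistent with the rounded values on the slide.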

Annovi, nm 90 nm. NEXT YEAR, MAYBE MARCH. Mini-asic COULD be 90 or 65 nm. THE AMCHIP04 PROTOTYPE Design: L. Sartori (Ferrara), M. Beretta (LNF), E. Bossini, F. Crescioli, I. Sacco (Pisa). Test: A. Lanza (Pavia). 90 nm miniasic.

The FTK CHALLENGING PART: the NEW AMCHIP & the TSP Where can we stack the TSP? In the AUX board just after the AMBoard? In the AMBoard itself? In the LAMB, to reduce the # of roads early? Even better, in the AMchip 2.5D! Diagram labels: LAMB, Standard cell chip, 40 MHz clock, FPGA + TSP?