Specialized Supercomputers
Piero Vicini, INFN (Istituto Nazionale di Fisica Nucleare, Italian National Institute for Nuclear Physics)
Dedicated SuperComputing
WHY
–The Scientific Case
–Custom vs Commodity
–Italian Experience: APE project
HOW TO
–The international scenario
–Petaflops machine: some ideas
TOOLS
–EU funding
–National funding
SuperComputing: the Scientific Case
Large-scale numerical applications
–Astrophysics and Plasma Physics
  Today: 70-100 TF/s, 2009: >500 TF/s
  Dedicated architecture: Grape (Japan/Europe)
–High-Energy Physics (LQCD)
  Today: 10-50 TF/s, several projects
  2009: 500-1000 TF/s
  Dedicated architecture: APE (Europe), QCDOC (USA/UK)
–Weather, Climatology, Earth sciences
  Today: 10-30 TF/s
  2009: several projects for 200-300 TF/s aggregated power
  Dedicated architecture: Earth Simulator (Japan)
–Life Sciences (molecular dynamics, protein folding, in silico drug design,…)
  Today:…., 2009-2010: > N*Petaflops
  Dedicated architecture: IBM Blue/Gene (USA)
–.........
Dedicated vs General Purpose Parallel Machine
Processor level -> very well balanced architecture
–Computing unit designed to be very efficient on the kernels of (several) classes of applications
–Integration of unusual memory interfaces based on large register file, huge multiport,….
–Integration of optimized interconnection network (low latency, high bandwidth)
[Table: QCD benchmarks, Eff. (H ) = 0.56, 0.53, 0.27*, 0.11, 0.42, 0.05; * communication overhead not included]
Dedicated vs General Purpose Parallel Machine (2)
System level: dense, safe and cheap systems
–Very high ratio of Flops/Watt
–Very high ratio of Flops/Volume
–Cost effective systems: 0.5 Euro/Mflops
–Very low cost maintenance
[Table: comparison figures for apeNEXT, headers not recoverable: 3670, 80, 46; 3670, 72, 50972]
The Italian experience: APE project
Our line of Home Made Computers …
Generation: APE (1988) | APE100 (1993) | APEmille (1999) | apeNEXT (2004)
Team: Italian research team | Italian research team | European research team + Industry (QSW, Eurotech) | European research team + Industry (Eurotech)
Architecture: SIMD | SIMD++
Comp. nodes: 16 | 2048 | 4096
Interc. topology: flexible 1D | rigid 3D | flexible 3D
Memory size: 256 MB | 8 GB | 64 GB | 1 TB
Registers (w. size): 64 (x32) | 128 (x32) | 512 (x32) | 512 (x64)
Clock speed: 8 MHz | 25 MHz | 66 MHz | 200 MHz
Peak power: 1 GFlops | 100 GFlops | 1 TFlops | 7 TFlops
apeNEXT architecture
3D mesh of computing nodes
–Custom VLSI processor, 200 MHz (J&T)
–1.6 GFlops per node (complex normal)
–256 MB (1 GB) memory per node
First-neighbor communication network, loosely synchronous
–YZ internal, X on cables
–r = 8/16 => 200 MB/s per channel
Scalable 25 GFlops -> 6 TFlops
–Processing Board: 4 x 2 x 2, ~26 GF
–Crate (16 PB): 4 x 8 x 8, ~0.5 TF
–Rack (32 PB): 8 x 8 x 8, ~1 TF
–Large systems: (8*n) x 8 x 8
Linux PCs as host system
[Figure: processing board with nodes 0-15, each a J&T processor with DDR-MEM; links X+ … Z-, labelled Z+(bp), Y+(bp), X+(cables)]
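The peak figures on this slide follow from simple arithmetic, sketched here in Python. The sketch assumes that the "complex normal" operation is a complex a*b+c counted as 8 floating-point operations (4 multiplies and 4 adds), and that one such operation completes per clock cycle. The function name `peak_gflops` and the choice n = 8 for the large system are illustrative, not part of the APE software.

```python
# Peak-performance arithmetic for apeNEXT (illustrative sketch).
CLOCK_HZ = 200e6      # 200 MHz J&T processor, as stated on the slide
FLOPS_PER_CYCLE = 8   # assumption: one complex a*b+c per cycle = 4 mul + 4 add


def peak_gflops(nodes: int) -> float:
    """Peak GFlops of a system made of `nodes` computing nodes."""
    return nodes * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


# Node counts come from the mesh dimensions listed on the slide.
configs = {
    "single node": 1,
    "processing board (4 x 2 x 2)": 4 * 2 * 2,
    "crate (4 x 8 x 8)": 4 * 8 * 8,
    "rack (8 x 8 x 8)": 8 * 8 * 8,
    "large system, n = 8 (64 x 8 x 8)": 64 * 8 * 8,
}

for name, nodes in configs.items():
    print(f"{name}: {nodes} nodes, {peak_gflops(nodes):.1f} GFlops peak")
```

Under these assumptions the board, crate and rack come out at 25.6, 409.6 and 819.2 GFlops, which the slide rounds to ~26 GF, ~0.5 TF and ~1 TF; 4096 nodes give about 6.5 TFlops.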
Evaluating the success of APE (1)
APEmille (2000):
–Italy 1365 GF
–Germany 650 GF
–UK 65 GF
–France 16 GF
–Total 2 TF
apeNEXT (2005):
–Development costs = 2000 KEuro
  1100 KEuro VLSI NRE
  250 KEuro non-VLSI NRE
  650 KEuro prototype procurement
–Manpower = 20 man-years
–Mass production cost ~ 0.5 Euro/Mflops
–Installations: Italy 10.6 TF, Germany 8.0 TF, France 1.6 TF, Total 20.2 TF
Evaluating the success of APE (2)
Scientific, technological and social impacts:
–APE is the de facto standard in the European LQCD computing area
–Huge number of scientific and technological (HW, SW, Architecture) papers
–Establishment of an international computing facility fully dedicated to scientific numerical computing
  Laboratorio di Calcolo apeNEXT: 12 TF installed, opening on February 8th
–Strategic opportunities to increase national (European) industry capability
  Eurotech-INFN collaboration -> HPC division, market expansion, international visibility
  Finmeccanica/QSW
–Training, dissemination and establishment of spin-off companies
  Atmel/Ipitec Nergal Digital Video Venere
What's next after apeNEXT?: scenario
In the future (2010) the required computing platform for large-scale numerical applications will be of the order of PetaFlops
The international scenario
–Today (www.top500.org):
  IBM Blue/Gene: dedicated architecture (very similar to APE….), N*100 TFlops
  Earth Simulator: N*10 TFlops
  PC Clusters approach: N*10 TFlops
–Future (2010 and beyond):
  USA: IBM, Blue/Gene evolution, N*Petaflops
  Japan: NEC/Hitachi/University, 3 Petaflops for biotech and nanotech, custom silicon, custom interconnect
  Japan: Fujitsu, 3 Petaflops, cluster approach with optical interconnection
  Europe?
Brainstorming
Silicon shrink
–apeNEXT: 0.18 um
–today: 0.13 um
–Next years: 0.09-0.065 um
Die area per FP Node
Worst case: 6 computing nodes per chip (tiled architecture)
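As a rough illustration of the shrink argument, the sketch below assumes ideal scaling, where the area of a fixed design shrinks with the square of the feature size. This rule is a simplification introduced here and is not a figure from the slides. The sketch also reads the "next years" entry as the 90 nm and 65 nm processes (0.09 and 0.065 um), and the function name is hypothetical.

```python
# Ideal area scaling with process shrink (illustrative sketch, not from the slides).
APENEXT_PROCESS_UM = 0.18  # apeNEXT process, as stated on the slide


def copies_in_same_area(old_um: float, new_um: float) -> float:
    """How many copies of a fixed design fit in the original die area,
    assuming area scales with the square of the feature size."""
    return (old_um / new_um) ** 2


for new_um in (0.13, 0.09, 0.065):
    ratio = copies_in_same_area(APENEXT_PROCESS_UM, new_um)
    print(f"0.18 um -> {new_um} um: about {ratio:.1f} copies in the same area")
```

Under this assumption an unchanged node design would fit roughly 4 times (at 0.09 um) to 7.7 times (at 0.065 um) in the same area, which is the same range as the slide's worst case of 6 nodes per chip.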
Brainstorming (2)
Performance scaling
–Clock frequency scales with silicon process
–Power consumption decreases with silicon process (est. 0.3 W/Gflops)
–Architecture: Multi-Tile versus Single-Tile
Brainstorming (3)
Smart memory architecture and new 3D engineering
–Large, hierarchical on-chip memory buffers -> reduction of components per board
–Processing board sandwich (stacked) -> surface distributed network connectors
–512 FP nodes per board
Worst case: factor 100 in 5 years…
[Figure: apeNEXT rack]
PetaFlops class computer proposal
Leverage European leadership in embedded processor technology
European collaboration (research + industry) to design a new computing architecture for scientific and engineering numerical applications
Parameters:
–(Less) dedicated architecture suitable for future grand-challenge applications
–0.5/1 PetaFlops system (factor 50 better than apeNEXT)
–300 W/TeraFlops
–10 KEuro/TeraFlops (factor 50 better than apeNEXT)
–Programming environment to produce parallel code with very high efficiency
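The "factor 50" targets can be cross-checked against the apeNEXT figures quoted earlier in the talk (about 0.5 Euro/Mflops mass-production cost, 20.2 TF installed in total). The sketch below only does unit conversion. Which installed figure the slide takes as its baseline is an assumption made here, and all names are illustrative.

```python
# Cross-check of the proposal targets against apeNEXT figures (illustrative sketch).
APENEXT_EURO_PER_MFLOPS = 0.5     # mass-production cost quoted for apeNEXT
APENEXT_INSTALLED_TFLOPS = 20.2   # assumption: baseline is the total installed apeNEXT power
IMPROVEMENT = 50                  # "factor 50 better than apeNEXT"
TARGET_W_PER_TFLOPS = 300         # power target from the slide


def keuro_per_tflops(euro_per_mflops: float) -> float:
    """Convert Euro/Mflops to KEuro/TFlops (1 TFlops = 1e6 Mflops, 1 KEuro = 1e3 Euro)."""
    return euro_per_mflops * 1e6 / 1e3


apenext_cost = keuro_per_tflops(APENEXT_EURO_PER_MFLOPS)             # 500 KEuro/TFlops
target_cost = apenext_cost / IMPROVEMENT                             # 10 KEuro/TFlops
target_size_pflops = APENEXT_INSTALLED_TFLOPS * IMPROVEMENT / 1000   # about 1 PFlops

# Totals for a 1 PFlops (1000 TFlops) machine at the target ratios.
power_kw = TARGET_W_PER_TFLOPS * 1000 / 1000   # W -> kW
cost_meuro = target_cost * 1000 / 1000         # KEuro -> MEuro

print(f"apeNEXT cost: {apenext_cost:.0f} KEuro/TFlops")
print(f"target cost:  {target_cost:.0f} KEuro/TFlops")
print(f"target size:  {target_size_pflops:.2f} PFlops")
print(f"1 PFlops at target ratios: {power_kw:.0f} kW, {cost_meuro:.0f} MEuro")
```

With these inputs the 10 KEuro/TeraFlops target is exactly a factor 50 below apeNEXT's 500 KEuro/TFlops, and a 1 PFlops system at the target ratios would draw about 300 kW and cost about 10 MEuro in hardware.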
Tools (1)
EU level
–FP6 and beyond
SHAPES (Scalable sw/Hw Architecture Platform for Embedded Systems)
–FP6-2004-IST-4 2.3.4(viii) Advanced Computing Architectures
–Partners: INFN-Roma, ATMEL-ROMA, ST, TIMA(FR), TARGET COMPILER(BE)….
–Target: technology R&D to study the feasibility of a 2 TF board in 4 years (tiled architecture, NoC, off-chip network and 3D engineering of a multi-board system)
HPC Europe Initiative
–Joint action at EU level (France, Germany, UK, Spain + Netherlands, Finland, Italy) to consolidate the European role in supercomputing applications and to ensure the availability of the most advanced supercomputer systems in the EU
–Main target: in 2010, 4 computing centres in Europe equipped with general purpose (-> non-European) supercomputers
–800 MEuro (!) partially funded by EU and national governments
Tools (2)
National level
–PNR: High Performance Computing for scientific and engineering applications: architecture, hardware, development software and selected applications
–Partners: INFN, EUROTECH, CNR(MI), CILEA, SISSA, UNI MI BICOCCA, UNI PADOVA
–Target: Petaflops supercomputer suitable for engineering and scientific applications
–200 W/Teraflops, 10 KEuro/Teraflops
–Development cost (+ prototype procurement): 20 MEuro + 40 man-years
–Project duration: 4 years
apeNET: status and future activities
Successfully tested a cluster of 16 PCs interconnected via apeNET
–Measured performance >800 MB/s (send-receive) per direction
SW/HW/FW optimizations
–RDMA, network driver, LAM/MPI
INFN Roma2 has funded a 128-node cluster (dual Xeon + apeNET)
–Delivery expected in September 2005
Future activities:
–PCI-X -> PCI Express
–Integration of a uP core on FPGA
–Development of QCD and molecular dynamics applications
apeNEXT: the latest generation
Loosely synchronous first-neighbor communication network
Scalable system from 25 GF to 6 TF
–16 processors per board (PB)
–Systems of 8x8x16=1024 nodes or 8x8x64=4096 nodes
Host system built from PCs (Linux)
[Figure: processing board with nodes 0-15, each a J&T processor with DDR-MEM; links X+ … Z-, labelled Z+(bp), Y+(bp), X+(cables)]
3D lattice of 4096 computing nodes (6.5 TF)
–Custom VLSI processor, 200 MHz (J&T)
–1.6 GFlops per node (a*b+c on complex data)
4Q03: !! apeNEXT running !!
APENET
Interconnection network for PC clusters with 3D toroidal topology
–apeLINK: PCI-X (133 MHz) board
  6 LVDS links, bidirectional and full-duplex
  700 MB/s per link per direction (-> 8.4 GByte/s)
  Links based on National Instr. SERDES
–Integrated routing and switching capability
–High bandwidth and low latency thanks to the adoption of a lightweight protocol
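The 3D toroidal topology and the aggregate bandwidth figure can be sketched as follows. The coordinate convention, the function name `torus_neighbors` and the 4 x 4 x 4 example size are illustrative assumptions and are not taken from the apeNET software.

```python
# First neighbours in a 3D torus and aggregate link bandwidth (illustrative sketch).
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six first neighbours of `node` in a 3D torus of size `dims`,
    in the order X+, X-, Y+, Y-, Z+, Z-. Coordinates wrap around at the edges."""
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


LINKS_PER_BOARD = 6            # from the slide: 6 LVDS links
DIRECTIONS_PER_LINK = 2        # bidirectional, full-duplex
MB_PER_S_PER_DIRECTION = 700   # from the slide

aggregate_gb_per_s = (
    LINKS_PER_BOARD * DIRECTIONS_PER_LINK * MB_PER_S_PER_DIRECTION / 1000
)

print(torus_neighbors((0, 0, 0), (4, 4, 4)))
print(f"aggregate bandwidth per board: {aggregate_gb_per_s:.1f} GByte/s")
```

A corner node such as (0, 0, 0) still has six neighbours because each coordinate wraps modulo the torus size. Six full-duplex links at 700 MB/s per direction add up to the 8.4 GByte/s quoted on the slide.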