1
The Cell Broadband Engine Processor
Aerospace & Electronic Systems Society
The Cell Broadband Engine Processor: Hardware, Software, Performance and Applications
John Brickman, Director, Business Manager, Performance Computing Group, Mercury Computer Systems, Inc.
2
Cell Chip Lives in Two Worlds
Game console chip market
- Driven by "game physics" requirements, not just graphics
- Compute intensive: vector processing, floating and fixed point
- New consoles are introduced every 5+ years and last about 10 years; the PS3 was unveiled in May 2005 and will launch in November 2006, about 6 years after the PS2
- New chip architectures are linked to console designs: the architecture is unchanged during the console's lifetime, with process shrinks targeted at lower cost and lower power
High performance processor market
- Evolving architecture with backwards compatibility
- Piggy-backs on the largest-volume processor platform that leads in performance, with affordable architecture increments to address high performance needs
- Previously the desktop PC, now the game console
The Cell roadmap addresses both the game console and high performance markets.
3
Mercury’s Relationship with IBM
In June 2005, Mercury announced a strategic alliance agreement with IBM, offering Mercury special access to IBM expertise, including the broadly publicized Cell technology.
Multicomputer-on-a-chip
4
Cell BE Processor Block Diagram
The Cell BE processor boasts nine processors on a single die:
- 1 Power® processor
- 8 vector processors
Computational performance: at 3.2 GHz, 8 SPEs x 25.6 GFLOPS each gives roughly 205 GFLOPS single precision.
A high-speed data ring connects everything: 205 GB/s maximum sustained bandwidth.
High performance chip interfaces: 25.6 GB/s XDR main memory bandwidth.
In the block diagram, the brown boxes are the single PowerPC core and its L2 cache, the green boxes are the eight SPU cores, and the purple boxes are the Rambus I/O technologies that move data on and off the die.
5
Synergistic Processing Element
Standalone vector processor
- 128-bit SIMD model
- 128 registers, each 128 bits wide (AltiVec/VMX has only 32 registers, SSE3 only eight)
- 256 KB local store; load/store instructions can access only the local store
Memory flow controller
- DMA engine built into each SPE
- The SPE includes DMA instructions for explicitly moving data between local store and main memory (see the sketch after this list)
Performance
- Dual issue
- Two- to sixteen-way SIMD
- 25.6 GFLOPS (single precision), 51 GOPS (8-bit)
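A minimal sketch of that explicit DMA model, using the spu_mfcio.h interface from the Cell SDK; the effective address and the 16 KB block size are illustrative assumptions, not figures from the slides:

    #include <spu_mfcio.h>

    /* Pull one 16 KB block from main memory into local store through the
       SPE's memory flow controller, then wait for it to complete. */
    static char ls_buf[16384] __attribute__((aligned(128)));

    void fetch_block(unsigned long long ea)   /* ea: main-memory address */
    {
        unsigned int tag = 1;                 /* DMA tag group (0..31)   */
        mfc_get(ls_buf, ea, sizeof(ls_buf), tag, 0, 0);  /* async read   */
        mfc_write_tag_mask(1 << tag);         /* select tag to wait on   */
        mfc_read_tag_status_all();            /* block until complete    */
    }

In a real kernel the read would be overlapped with computation rather than waited on immediately; a double-buffering sketch appears later in the deck.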
6
SPE 128 Bit SIMD Engine Operates on 128 bit vector registers
Vector layouts:
- 2 x 64 bits (DP float)
- 4 x 32 bits (SP float or integer)
- 8 x 16 bits (integer)
- 16 x 8 bits (integer)
Example: floating point multiply-add. The 4 x 32-bit instruction fma vr, v1, v2, v3 multiplies each 32-bit lane of v1 and v2, adds the corresponding lane of v3, and writes the four results to vr, completing eight floating point operations (FLOPs) every cycle.
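In C, the same operation is exposed through the SPU intrinsics; a small sketch in which the wrapper function is illustrative:

    #include <spu_intrinsics.h>

    /* vr = v1 * v2 + v3 across all four 32-bit float lanes: four multiplies
       plus four adds, i.e. eight FLOPs from a single instruction. */
    vector float fma4(vector float v1, vector float v2, vector float v3)
    {
        return spu_madd(v1, v2, v3);
    }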
7
Power® Processing Element
- 64-bit Power® core with complete AltiVec™/VMX
- High frequency, low power consumption
- Hardware multi-threading
- 512 KB L2 cache
- Can use any SPE's DMA engine
AltiVec is a registered trademark of Freescale Semiconductor Corp.
8
Why is Cell So Fast?
The SPE is a very fast, very lean core
- An SPE (3.2 GHz) is up to 3 times faster than the fastest Pentium core (3.6 GHz) when computing FFTs; that is 24X better performance chip to chip
Huge internal chip bandwidth
- 205 GB/s sustained ring bandwidth
- 25.6 GB/s main memory bandwidth
High performance DMA
- DMA can be fully overlapped with SPE computation (see the sketch after this list)
- Software-controlled DMAs can bring exactly the right data into local store at exactly the right time
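A minimal double-buffering sketch of that DMA/compute overlap, again using spu_mfcio.h; the block size, block count, and process() routine are illustrative assumptions:

    #include <spu_mfcio.h>

    #define BLK 4096
    static char buf[2][BLK] __attribute__((aligned(128)));

    extern void process(char *data, int n);   /* placeholder for real work */

    /* Stream nblocks of BLK bytes starting at effective address ea.
       While the SPE computes on one buffer, the MFC fills the other. */
    void stream(unsigned long long ea, int nblocks)
    {
        int cur = 0, i;
        mfc_get(buf[cur], ea, BLK, cur, 0, 0);       /* prime first block */
        for (i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nblocks)                     /* prefetch the next */
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * BLK,
                        BLK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);            /* wait for current  */
            mfc_read_tag_status_all();
            process(buf[cur], BLK);
            cur = nxt;
        }
    }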
12
Mercury Cell Hardware Products
13
Mercury Cell Related Roadmap
Blades
- Dual Cell Based Blade: 2 BE, 2 southbridges, 1 GB XDR
- Dual Cell Based Blade 2: single slot, 2 BE, 2 companion chips, 4 GB XDR+DDR2
- Dual Cell Based Blade 3: single slot, 2 BE, 2 companion chips, up to 32 GB DDR2
1U Servers
- Dual Cell Based Server: 2 BE, 2 southbridges, 1 GB XDR
- Dual Cell Based Server 2: 2 BE, 2 companion chips, 4 GB XDR+DDR2
- Dual Cell Based Server 3: 2 BE, 2 companion chips, up to 32 GB DDR2
Embedded
- CAB PCIe add-in card: 1 BE, 1 companion chip, 4 GB DDR2, 1 GB XDR
- ATCA blade concept: 1 BE, 1 companion chip, 4 GB DDR2, 1 GB XDR
- Turismo chassis concept
Rugged
- PowerBlock™ 200 ½ ATR concept: 1 BE, 1 companion chip, 4 GB DDR2, 1 GB XDR
- VITA 46 / 48 concept
- PowerStream™ concept
14
Dual Cell Based Blade
- Flexible blade solution based on the Cell BE processor
- Outstanding performance for HPC applications; designed for distributed processing
- Cell-optimized software available
- About 11 TFLOPS in 5 feet of rack height
- Dual-width BladeCenter™ blade
- Two PCI Express x4 expansion slots; initially supports only Infiniband cards
- Evaluation units available since December 2005; production October 2006
15
Dual Cell Based Blade Block Diagram
Block diagram summary: two 3.2 GHz Cell processors, each with 512 MB of XDR DRAM (25.6 GB/s) and a southbridge (2.5 GB/s each way) providing GbE, a serial port, and a PCI Express x4 Infiniband daughtercard; the two processors are linked at 20 GB/s each way, with power and GbE through the BladeCenter midplane connector.
16
Cell Blade Systems
Complete 19" rack-based systems
- 25U (42.75" high) chassis: up to 14 blades, 5.7 TFLOPS
- 42U (73.5") chassis: up to 28 blades, 11.5 TFLOPS
- Multi-rack systems scalable using Infiniband and GbE
Cell Technology Evaluation System: a complete turn-key Cell HW & SW system
- 25U rack with one Dual Cell Based Blade; all components included to support expansion to a 7-blade system
- MultiCore Plus SDK, with a one-year subscription to production SW
- Monitor and keyboard, serial line concentrator, Xeon-based Linux server, external GbE switch, BladeCenter chassis, power distribution
[Photos: 25U 14-blade system, front and rear]
17
1U Dual-Cell Based Server
Hardware
- Dual Cell processors at 3.2 GHz
- 1 GB of XDR DRAM
- Integrated dual Gigabit Ethernet; serial port
- Dual full-size PCI Express x4 slots; initially supports only Infiniband cards
Software
- Toolchain: native (PPE hosted) and cross (x86 hosted)
- GUI via X Windows over GbE; no direct keyboard/video/mouse support
Production Q1 2007
18
Cell Companion Chip
- Under design by IBM since May 2005, with significant design input from Mercury
- First parts began preliminary testing June 2006; second spin for production in December 2006
- Cell BE interface: 5 GB/s each way; extends the Cell global address space to PCIe, DDR2, etc.; non-coherent (non-cached)
- Low latency, high capacity mailbox
- PCIe 16x interfaces, each configurable as 8x, 4x, 2x, or 1x, and as endpoint or root complex
- Multichannel, striding DMA engine
- Two DDR2 667 MHz controllers: 5 GB/s and up to 4 GB each
- Also on chip: PCI-X, 405 PPC, UART, GPIO, GbE
19
Dual Cell Based Blade 2
- Single slot blade: up to twice the density
- Uses the new companion chip: up to 10x the I/O bandwidth
- DDR2 I/O buffer memory (2-8 GB per companion chip)
- Production available Q3 2007
Block diagram summary: two 3.2 GHz Cell processors, each with 1 GB of XDR DRAM (25.6 GB/s), linked at 20 GB/s each way; each processor attaches to a companion chip at 5 GB/s each way, which provides IB 4x, GbE, 2-8 GB of DDR2, and PCIe 16x to a PCIe x16 / PCI-X daughtercard on the one-slot I/O expansion blade; the BladeCenter H high speed daughtercard carries 2 PCIe 8x.
20
Dual Cell Based Blade 3 Concept
- Improved SPE double precision performance
- Expanded memory: DDR2 replaces XDR (8-16 GB of DDR2 per processor, 1-2 GB of DDR2 per companion chip)
- Production available Q1 2008
Block diagram summary: as in Blade 2, but each 3.2 GHz Cell processor now fronts 8-16 GB of DDR2 main memory (25.6 GB/s), and each companion chip carries 1-2 GB of DDR2 and 2 IB 4x ports alongside PCIe 16x to a PCIe / PCI-X x16 daughtercard on the one-slot I/O expansion blade.
21
1U Dual-Cell Based Server 2
- 1U solution based on the companion chip
- Dual 3.2 GHz Cell processors
- Memory: 2 GB of XDR, 4-16 GB of DDR2
- I/O: daughtercard site options under consideration (PCI-E and PCI-X customer options), dual GigE, dual IB 4x
- Production available Q3 2007
22
1U Dual-Cell Based Server 3 Concept
- 1U solution with enhanced memory capacity
- Dual 3.2 GHz Cell processors
- Memory: 16-32 GB of DDR2; main memory is now DDR2 DIMMs, with 1-2 GB of DDR2 per companion chip for I/O buffering
- I/O: PCIe / PCI-X daughtercards, dual GigE, dual IB 4x
- Production available Q1 2008
23
Cell Accelerator Board
- PCI Express™ accelerator card compatible with high-end workstations
- More than 180 GFLOPS on a desktop
- 1 GB of XDR and 4 GB of DDR2
- Gigabit Ethernet on the end bracket
- Internal prototype boards with an FPGA bridge received July 2006; boards with the prototype bridge silicon received September 2006
- Volume production of boards Q1 2007
24
Cell Accelerator Board Block Diagram
Block diagram summary: a 2.8 GHz Cell processor with 1 GB of XDR DRAM (22 GB/s), attached to a companion chip with 4 GB of DDR2 and an 8 GB/s link.
25
Software is the Key to Harnessing Cell Performance!
Mercury's MultiCore Plus SDK
26
Cell BE Processor Architecture
Resembles a distributed memory multiprocessor with explicit DMA over a fabric.
27
Mercury Multi-DSP Board (1996)
28
Programming Cell: What’s Good and What’s Hard
What's good
- SPE: no second-guessing about the cache replacement algorithm; a very deterministic pipeline; 128 registers mask pipeline latency very well; DMA has negligible impact on SPE local store bandwidth
- Ring and XDR: generous ring bandwidth means topology is seldom an issue
- PPE: a standard Power® core
What's hard
- SPE: the burden is on software to get code and data into the local store; the local store is small compared to ring latency; branch prediction is manual and very restricted (see the hint sketch below); 128-byte alignment is necessary for best performance
- Ring and XDR: XDR bandwidth is a bottleneck; linking Cell chips in coherent mode increases latency
- PPE: performance is modest
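On the manual branch prediction point: SPU branches are statically predicted, so the programmer or compiler must hint the likely path. A small sketch of the common idiom under spu-gcc, which can use __builtin_expect to steer branch hints; the macros and function are illustrative:

    #define LIKELY(x)   __builtin_expect((x), 1)
    #define UNLIKELY(x) __builtin_expect((x), 0)

    /* Sum the non-negative entries of v; negative entries are assumed rare,
       so the branch is hinted toward the common (non-negative) case. */
    int sum_nonnegative(const int *v, int n)
    {
        int i, sum = 0;
        for (i = 0; i < n; i++) {
            if (UNLIKELY(v[i] < 0))
                continue;
            sum += v[i];
        }
        return sum;
    }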
29
How Much Faster Is Cell?
Relative performance of Cell and leading general purpose processors
- Performance relative to a 1 GHz Freescale 744x (i.e., Freescale = 1)
- In all cases, we compare Mercury-optimized Cell algorithm implementations with the best available (Mercury or 3rd party) implementations on other processors
- Did not compare with dual core x86 processors
[Charts: single precision complex FFTs; symmetric image filters]
30
Goals for Programming Cell
- Achieve high performance: the only reason for choosing Cell
- Ease of programming: an important aspect of this is programmer portability
- Code portability: important for large legacy code bases written in C/C++ and Fortran; new code developed for Cell should also be portable to current and anticipated multiprocessor architectures
31
Linux OS
Linux on Cell patches released by the IBM Linux Technology Center
- Kernel and libspe version 1.1, built and tested with the Fedora Core 5 distribution (a libspe usage sketch follows this slide)
- IBM LTC releases packages through the Barcelona Supercomputing Center to the official kernel website
Mercury works closely with the IBM Linux team on performance optimization
- Linux is now able to achieve the maximum hardware performance possible on the Dual Cell Based Blade
- NUMA support, PPE affinity, SPE affinity, 64 KB and 16 MB page support
Mercury uses the Terra Soft Solutions Y-HPC distribution
- Mercury contracted TSS to port Y-HPC to the Dual Cell Based Blade
- Distributions are tested and supported on Mercury hardware
- Mercury assists TSS with driver development: GbE, uDAPL, Infiniband
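For context, the libspe 1.x interface mentioned above drives the function-offload pattern from the PPE side. A hedged sketch; the embedded SPE program handle spe_kernel is a hypothetical name:

    #include <stdio.h>
    #include <libspe.h>

    extern spe_program_handle_t spe_kernel;   /* hypothetical SPE program */

    int main(void)
    {
        int status = 0;
        /* Launch the SPE program on any available SPE, then wait for exit. */
        speid_t id = spe_create_thread(0, &spe_kernel, NULL, NULL, -1, 0);
        if (id == NULL) {
            perror("spe_create_thread");
            return 1;
        }
        spe_wait(id, &status, 0);
        return status;
    }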
32
The MultiCore Plus SDK
- MultiCore Framework (MCF)
- Scientific Algorithm Library (SAL)
- MultiCore Plus IDE
- TATL
- SPEAK
33
Mercury Approach to Programming Cell
Very pragmatic
- Can't wait for tools to mature; we develop our own tools when it makes sense
- Emphasis on explicitly programming the architecture rather than trying to hide it: while the tools are immature, this lets us get maximum performance
- Achieve ease-of-use and portability through the function offload model: run legacy code on the PPE, and offload the compute intensive workload to the SPEs
34
MultiCore Framework
- An API for programming heterogeneous multicores that contain explicit non-cached memory hierarchies
- Provides an abstract view of the hardware oriented toward computation of multidimensional data sets
- First implementation is for the Cell BE processor
35
MCF Abstractions
Function offload model
- Worker Teams: allocate tasks to SPEs
- Plug-ins: dynamically load and unload functions from within worker programs
Data movement
- Distribution Objects: define how n-dimensional data is organized in memory
- Tile Channels: move data between SPEs and main memory
- Re-org Channels: move data among SPEs
- Multibuffering: overlap data movement and computation
Miscellaneous
- Barrier and semaphore synchronization
- DMA-friendly memory allocator
- DMA convenience functions
- Performance profiling
37
MCF Distribution Objects
Frame: one complete data set in main memory.
Distribution Object parameters:
- Number of dimensions
- Frame size
- Tile size and tile overlap
- Array indexing order
- Compound data type organization (e.g., split / interleaved)
- Partitioning policy across workers, including partition overlap
38
MCF Distribution Objects
Tile: the unit of work for an SPE, cut from the frame (one complete data set in main memory) according to the Distribution Object parameters listed on the previous slide.
39
MCF Partition Assignment
The frame is divided into partitions, one per worker (SPE 0, SPE 1, SPE 2, ...), according to the Distribution Object's partitioning policy.
40
MCF Tile Channels
A tile channel connects the partitions of a frame in main memory to the workers (SPE 0, SPE 1, SPE 2, ...), delivering tiles to each worker from its partition.
41
MCF Tile Channels: Data Flow
1. The manager (PPE) generates a data set and injects it into the input tile channel.
2. The input tile channel subdivides the data set into tiles.
3. Each worker (SPE) extracts tiles from the input tile channel, computes on the input tiles to produce output tiles, and inserts them into the output tile channel.
4. The output tile channel automatically puts tiles into the correct location in the output data set.
5. When the output data set is complete, the manager is notified and extracts the data set.
42
MCF Manager Program

    main(int argc, char **argv)
    {
        mcf_m_net_create();
        mcf_m_net_initialize();
        mcf_m_net_add_task();                    /* add worker tasks */
        mcf_m_team_run_task();
        mcf_m_tile_distribution_create_3d("in"); /* specify data organization */
        mcf_m_tile_distribution_set_partition_overlap("in");
        mcf_m_tile_distribution_create_3d("out");
        mcf_m_tile_channel_create("in");         /* create and connect to tile channels */
        mcf_m_tile_channel_create("out");
        mcf_m_tile_channel_connect("in");
        mcf_m_tile_channel_connect("out");
        mcf_m_tile_channel_get_buffer("in");     /* get empty source buffer */
        // fill input data here
        mcf_m_tile_channel_put_buffer("in");     /* send it to the workers */
        mcf_m_tile_channel_get_buffer("out");    /* wait for results from workers */
        // process output data here
    }
43
MCF Worker Program

    mcf_w_main(int n_bytes, void *p_arg_ls)
    {
        mcf_w_tile_channel_create("in");          /* create and connect to tile channels */
        mcf_w_tile_channel_create("out");
        mcf_w_tile_channel_connect("in");
        mcf_w_tile_channel_connect("out");
        while (!mcf_w_tile_channel_is_end_of_channel("in")) {
            mcf_w_tile_channel_get_buffer("in");  /* get full source buffer */
            mcf_w_tile_channel_get_buffer("out"); /* get empty destination buffer */
            // do math here, filling the destination buffer
            mcf_w_tile_channel_put_buffer("in");  /* put back empty source buffer */
            mcf_w_tile_channel_put_buffer("out"); /* put back full destination buffer */
        }
    }
44
MCF Implementation
Consists of
- A PPE library
- An SPE library and tiny executive (12 KB)
Utilizes Cell Linux "libspe" support
- But amortizes expensive system calls, reducing overhead from milliseconds to microseconds
- Provides a faster and smaller-footprint memory allocation library
Based on the Data Reorg standard
Derived from existing Mercury technologies
- Other Mercury RDMA-based middleware
- DSP product experience with small-footprint, non-cached architectures
45
SAL Primary Markets
The SAL, or Scientific Algorithm Library, finds utility in these markets: medical imaging, sonar, radar, semiconductor inspection, signals intelligence, and defense imaging.
46
Scientific Algorithm Library
SAL is a collection of optimized functions
- Baseline: arithmetic, data type conversions, data moves
- DSP: FFTs, convolutions, correlation, filters, etc.
- Linear algebra: linear systems, matrix decomposition, etc.
- Parallel algorithms (future): high level algorithms on multiple cores, invoked from an application running on the PPE and automatically using one or more SPEs; initial work done for 1D and 2D FFTs and fast convolutions
- PIXL, the image processing library: edge detection, fixed point operations and analysis, filtering, manipulation, erosion, dilation, histogram, lookup tables, etc.; work in this area depends on customer demand
PPE SAL is based on the AltiVec optimizations for the G4 and G4A2; a SAL C source code version is also available
SPE SAL is a new implementation optimized for the SPE architecture (an invocation sketch follows)
- Backwards compatible with the existing SAL API except in very rare cases
- Some new APIs are needed to extract the best performance from the SPE
- Static and plug-in component versions of each function
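As a hedged illustration of the invocation model (the application calls the library on the PPE, and the library fans work out to SPEs), a sketch in which the function name sal_fft1d and its signature are hypothetical, invented for illustration rather than taken from the SAL API:

    #include <stdio.h>

    /* Hypothetical SAL-style entry point: 1D complex FFT of n points,
       offloaded to n_spes SPEs; returns 0 on success. */
    extern int sal_fft1d(const float *in, float *out, int n, int n_spes);

    int main(void)
    {
        static float in[2 * 1024], out[2 * 1024];  /* interleaved complex */
        if (sal_fft1d(in, out, 1024, 4) != 0)      /* PPE call, SPE work  */
            fprintf(stderr, "FFT failed\n");
        return 0;
    }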
47
Eclipse Framework
- Provides an open platform for creating an Integrated Development Environment (IDE)
- The Eclipse Consortium manages continuous development of the tool
- Eclipse plug-ins, written in Java, extend the functionality of the framework
- Compilers, debuggers, TATL, help files, etc. are all Eclipse plug-ins
48
Mercury MultiCore Plus IDE
- PPE and SPE cross build support for gcc/g++ and XLC/C++
- Eclipse CDT (C/C++ Development Toolkit): syntax highlighting, code completion, content assistance, makefile generation
- Remote debugging of PPE and SPE applications
- TATL plug-in
49
TATL™ Trace Analysis Tool
- Logs events from PPE & SPE threads across multiple Cell chips
- Synchronized global timestamps
- Minimally intrusive in space and time
- Timeline trace and histogram viewers
- Structured log file for use in other tools
50
SPE Assembly Development Kit (SPE-ADK)
- The SPE architecture encourages "bare metal" programmers: a very deterministic architecture whose performance benefits from hand tuning the pipelines
- SPE-ADK dramatically improves bare metal productivity
- SPE-ADK consists of an assembler preprocessor, an optimizer, and a macro library
- Using SPE-ADK is similar to programming with the SPE C extensions, but with more deterministic control of instruction scheduling and hardware resources
- SPE-ADK is a productized version of the internal development tool used by all Mercury SAL developers
51
SPE-ADK Features
- Alignment of instructions for the even and odd pipelines of the SPU
- Automatic insertion of nops and lnops, or instruction swapping, to maintain dual dispatch
- Alignment of loops to minimize instruction fetching overhead
- Register assignment: automatically finds symbolic register operands, assigns registers to symbols to minimize register usage, and eliminates bugs from inconsistent register assignment
- Mapping of register usage: both active line number extents per symbol and active hardware registers per line
- Analysis of stall cycles due to register dependencies
- Optional C emulation for assembly development, allowing C-like debugging facilities: hardware independence for assembly code, setting breakpoints at source line numbers, displaying source code rather than disassembling the object code, displaying register contents by symbol
- Detection of errors to preclude bugs: inconsistent manual register assignment, write-only variables, uninitialized variables, updated but unused variables
52
Software Summary
- The Cell BE processor can achieve one to two orders of magnitude performance improvement over current general purpose processors
- The lean SPE core saves space and power, and makes it easier for software to approach peak performance
- Cell is a distributed memory multiprocessor on a chip: prior experience on these architectures translates easily to Cell
- But for most programmers, Cell is a new architecture; successful adoption by programmers is Cell's biggest challenge, and the history of other new processor architectures is not encouraging
- We need a range of tools that span the continuum from ease-of-use to high performance
53
Markets for Cell
Aerospace and defense, semiconductor, medical imaging, oil and gas, visualization.
54
Sales & Marketing Progress for Cell
Very active
- Semiconductor inspection: active sales engagements; prototypes sold
- Medical imaging: active sales engagements; prototypes sold
- Semiconductor lithography: active sales engagements; prototypes sold
- Defense signal & image processing: active sales engagements; prototypes sold
- Oil & gas exploration: active sales engagements; prototypes sold
- Video transcoding: active sales engagements
Less active for Mercury
- Financial modeling (IBM)
- Gaming
- Animation & rendering
- Defense simulation for training (specialized gaming)
55
Summary
- Mercury has for many years been developing computing solutions for applications well suited to Cell technology.
- Cell technology represents a significant performance breakthrough, similar to historical programming models.
- Customers can leverage Cell technology through Mercury to achieve: an unbiased assessment of the risks and applicability of deploying Cell-based solutions, and significant improvements in performance and bandwidth for certain applications compared to conventional processors.
56
For More Information
(866) 627-6951 (US)
(978) 967-1401 (International)
Web:
57
Backup Slides
58
Semiconductor DFM Requirements
59
Moore’s Law Irrelevant
The processing requirements of the semiconductor industry are increasing at an even faster rate, driven by:
- Increased feature density
- Increased complexity of processing due to sub-wavelength physics
- Tool specific features
Processing needs outpace mainstream computing as data rates and algorithm complexity increase.
[Chart: over four years, Moore's Law delivers roughly 4X while processing requirements grow roughly 12X.]
60
OPC/RET/DFM – The need for speed
Challenges
- Reduce OPC cycle times from days/weeks to hours
- Simulation models that ensure a mask will work when printed
- Computing goes up by an order of magnitude at every design node (e.g., 65 nm to 45 nm)
- Resolution Enhancement Technologies (RET): Optical Proximity Correction (OPC), Phase Shift Masks (PSM), Off-axis Illumination (OAI)
- Design for Manufacturing (DFM): WYSIWYG no more
Quotes from top chip designers:
- "It takes 8 days with 500 nodes to do OPC on a single chip layer ... and we need it to be 10 to 100 times faster"
- "We have 10,000 blades to do RET"
Source: AMD
61
Cost of Ownership
System sizes to do RET and lithography simulation are expanding to the thousands of 1U servers, and dense racks of servers are expensive to maintain:
- Cost of electricity to power the computers
- Cost of capital infrastructure for electricity delivery
- Cost of electricity to power HVAC systems
- Cost of capital infrastructure for HVAC
- The challenge of managing air flow
62
Cost of Ownership
A single dual processor server
- Consumes ... watts
- Costs $.../year just to power (at $0.05/kWh)
- Comparable amounts for HVAC and capital costs
A rack of 84 such servers
- Costs $10K+ per year to power
- Comparable amounts for HVAC and capital costs
Operators of data centers now see power and cooling costs as more significant than the cost of the computing hardware (see the relation below).
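The arithmetic behind such figures, in a hedged worked form (the 300 W draw is an assumed example, not a number from the slide):

    \text{annual power cost} = \frac{P_{\text{server}}\,[\text{W}]}{1000} \times 8760\ \text{h/yr} \times \$0.05/\text{kWh}

For instance, an assumed 300 W server works out to 0.3 kW x 8760 h x $0.05 ≈ $131 per year, and 84 of them to roughly $11K, consistent with the rack figure above.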
63
Processing Efficiency
The metric of performance per dollar must be expanded to include not just the cost of the hardware but also the lifetime cost of operating the computer system. Performance per watt, which used to be a metric just for the embedded and defense industries, is now important for commercial customers as well.
64
Summary
Cell processor technology provides:
- An order-of-magnitude improvement in computing performance per processor for OPC/RET applications
- A significant improvement in performance per watt
- A significant performance breakthrough for other critical computationally intensive applications
The right software infrastructure is critical for:
- Taking full advantage of specialized processing units
- Partitioning an application among a heterogeneous group of processing cores
- Parallelizing an application among multiple processing nodes
Cell can significantly improve OPC/RET turnaround time.
65
Ray Tracing
Mercury Computer Systems, Visualization and Sciences Group
66
What is Ray Tracing?
- A computer graphics rendering technique that mathematically simulates rays of light
- Capable of producing photo-realistic images
- Used in a variety of markets: automotive, aerospace, and marine virtual prototyping; architecture; industrial design; digital content creation in film and video
67
Basic Technique
For each pixel on the screen, send out a ray of light from the viewpoint and test it against every object in the scene for intersection. If the ray does not intersect any object, set the pixel to the background color; if it does, set the pixel color from the first object it intersects. (A sketch of this loop follows.)
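A minimal sketch of that loop in C; the Ray and Sphere types and the make_primary_ray and intersect helpers are hypothetical stand-ins, not from the slides:

    typedef struct { float ox, oy, oz, dx, dy, dz; } Ray;
    typedef struct { float cx, cy, cz, r; unsigned color; } Sphere;

    extern Ray make_primary_ray(int px, int py);   /* ray from the viewpoint */
    extern int intersect(const Ray *r, const Sphere *s, float *t);

    void render(unsigned img[600][800], const Sphere *scene, int n_objs,
                unsigned background)
    {
        int x, y, i;
        for (y = 0; y < 600; y++)
            for (x = 0; x < 800; x++) {
                Ray ray = make_primary_ray(x, y);
                float best_t = 1e30f, t;
                unsigned color = background;   /* no hit: background color */
                for (i = 0; i < n_objs; i++)
                    if (intersect(&ray, &scene[i], &t) && t < best_t) {
                        best_t = t;            /* keep the nearest hit */
                        color = scene[i].color;
                    }
                img[y][x] = color;
            }
    }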
68
More Advanced Technique
69
Characteristics of Ray Tracing
Simulating the physics of light
- Simulates light transport by following "photons"
- Fully parallel, just as in nature; demand-driven, starting from the camera
- Correctly orders rendering effects (per pixel!)
- Can account for all global effects, and all effects are orthogonal to each other, which makes content design easy and fast
Requires a very large amount of CPU in order to be interactive
- Driven by intersection calculations: every ray is checked against all objects
- Each secondary ray becomes a primary ray in a recursive algorithm
- An 800 x 600 screen, 3 light sources, and 50 opaque objects require 600 billion intersection tests!
70
Challenges Implementing on Cell
In-order instruction issue and SIMD
- Must carefully schedule instructions to avoid stalls
- Must parallelize code to take advantage of SIMD instructions
Memory access
- DMA engines must move data into local store from XDR
- Hiding latency requires overlapped I/O and processing (DMA read latency is a few hundred clock cycles)
- Even more challenging for irregular data access
Mapping to 8 SPEs
- The mapping algorithm is very important with the Cell architecture
71
Linear Speed-up Across SPEs
72
Results
Frames per second, three test scenes (speedup relative to a 2.4 GHz Opteron in parentheses):

    Processor           Scene 1         Scene 2        Scene 3
    2.4 GHz x86         7.2             3.0            2.5
    2.4 GHz SPE         7.4 (+3%)       2.6 (-13%)     1.9 (-24%)
    2.4 GHz Cell        58.1 (8x)       20.0 (6.6x)    16.2 (6.4x)
    2.4 GHz dual Cell   110.9 (15.4x)   37.3 (12.4x)   30.6 (12.2x)
    3.2 GHz Cell        67.8 (9.4x)     23.2 (7.7x)    18.9 (7.5x)
73
What is OpenRTRT from Mercury?
- A highly optimized ray tracing rendering engine enabling high-quality rendering at interactive frame rates
- Supports large model visualization
- Complements GPU OpenGL-based rendering: realism and rendering effects, quality and accuracy, capacity for large models
- Performance scalability with multiple CPUs and clusters
74
OpenRTRT: Real-Time Ray Tracing
- Recognized as outstanding, breakthrough technology: cutting edge research and dramatic optimizations achieved by the University of Saarland and inTrace (cache & data layout optimization, parallelization with SIMD/SSE, multi-threading, distribution, ...)
- Interactive even on a PC, enough for preparation work, for instance
- Scalable performance with multiple CPUs allows fully interactive visualization
- Performance depends linearly on the number of pixels, rays, and processors, and logarithmically on scene size (20 million triangles guaranteed); see the relation below
- Available for Linux on x86, x86-64, and IA64, and for 32-bit Windows
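That scaling claim can be summarized in a hedged form (a proportionality inferred from the bullet above, not a formula from the slides):

    T_{\text{frame}} \propto \frac{N_{\text{pixels}} \cdot N_{\text{rays/pixel}}}{N_{\text{processors}}} \cdot \log N_{\text{triangles}}

Doubling the processors roughly halves frame time, while doubling scene size adds only a logarithmic increment, which is why 20-million-triangle models remain tractable.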
75
Background
- 2000: Start of research at the University of Saarland
- Presentation of the first scientific results; initial projects with the automotive industry; simulation of ray tracing hardware
- Foundation of inTrace GmbH, with Volkswagen AG as first customer (VR lab)
- New visualization center project at Wolfsburg based on ray tracing; first ray tracing hardware prototype
- 2005: Projects with basically all German car manufacturers (VW, Audi, BMW, DaimlerChrysler) plus Airbus, Boeing, ...; first design of a fully programmable chip for ray tracing
- 2005: Exclusive agreement for worldwide distribution with Mercury Computer Systems
76
Final Thoughts: Criteria for Evaluating Cell Software Tools
- Performance: the ability to approach maximum performance, even if substantial effort is required
- Productivity to performance: how quickly a programmer can approach maximum performance
- Productivity to first run: how quickly a programmer can get correct results, ignoring performance
- Acceptance: perceived ease of use, including familiarity and perceived learning curve
- Legacy portability: ease of porting existing programs from prior architectures
- Future portability: ease of porting and optimizing for new architectures
- Compatibility and co-existence: the ability to use multiple tools in the same application, particularly without performance conflicts