SDP Kernels Workshop – The Role of Kernels
1 SDP Kernels Workshop – The Role of Kernels
Cambridge, 29th November 2016
Bojan Nikolic
Principal Research Associate, Cavendish Laboratory, University of Cambridge
Project Engineer & Architecture Lead, SKA Science Data Processor Consortium
"Big Data" undersells itself – it is more like big information. The SKA really is just big data, not so big information!

2 SDP Top-level Components & Key Performance Requirements -- SKA Phase 1
Components: Telescope Manager; Science Data Processor (SDP Local Monitor & Control, Data Processor, Data Preservation, Delivery System); Regional Centres & Astronomers; CSP, feeding the SDP at 1 TeraByte/s.
Key performance requirements:
- Data Processor: high performance (~100 PetaFLOPS); data intensive (~100 PetaBytes/observation, i.e. per job); partially real-time (~10 s response time); partially iterative (~10 iterations/job, ~3 hours)
- Delivery System: data distribution (~100 PetaByte/year from Cape Town & Perth to the rest of the world); data discovery (visualisation of 100k × 100k × 100k voxel cubes)
- Data Preservation: high volume & high growth rate (~100 PetaByte/year); infrequent access (a few times/year at most)

3 SDP Top-level Components & Key Performance Requirements -- SKA Phase 1
Same diagram as the previous slide, with one added point: the goal is to extract information from the data and then discard the data.

4 Role of Kernels

5 “Big Data”: little focus on kernels
PageRank: L. Page, 1999
MapReduce: Dean & Ghemawat, 2004
Spark: Zaharia et al., 2010

6 LINPACK: Almost all the focus on kernels
The TOP500 ranking (top machines as of November 2016): Sunway TaihuLight, Tianhe-2 (MilkyWay-2), Titan.
Kernels == BLAS procedures.

7 SDP: Some key ratios FLOP/IO
R_FLOP / R_IO ≈ 1000 FLOP/Byte
R_FLOP: the achieved average FLOP rate
R_IO: the achieved average read rate from the storage system (the "buffer")
The SDP is roughly balanced in this respect. For example: today one disk delivers ~100 MB/s. If that is sustained continuously, each disk must be matched by an achieved 100 GFLOP/s -> doable (especially with accelerators), but requires careful tuning of many parts!
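The balance arithmetic on this slide can be checked with a few lines. A minimal sketch, using the slide's assumed numbers (1000 FLOP/Byte ratio, ~100 MB/s per commodity disk):

```python
# Balance check for the SDP FLOP/IO ratio (numbers from the slide).
FLOP_PER_BYTE = 1000          # R_FLOP / R_IO, the assumed SDP balance point
disk_read_rate = 100e6        # bytes/s: sustained read rate of one disk (~100 MB/s)

# Compute rate needed to keep up with one disk read continuously.
required_flops = FLOP_PER_BYTE * disk_read_rate

print(required_flops / 1e9)   # -> 100.0 (GFLOP/s per disk, matching the slide)
```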

8 Where Kernels fit into the SDP
SDP design concerns:
- Kernels
- Memory management
- Buffer I/O
- Network reductions
- Network bulk data movement
- Load balancing
- Scheduling
- Data partitioning

9 How do we use outputs of Kernel Studies?
In rough order of importance:
1. Extrapolation for sizing the SDP: how much science fits within the capital & power budgets
2. Architectural impacts:
   - System h/w architecture: networking & storage topologies
   - Software: driving quality-attribute selection, e.g., how important is memory management?
   - Re-partitioning of the data?
   - Programming models
3. Design impacts
4. Compatibility with specific technology selections
5. Source code for use in the final system (least significant)

10 Kernel Architecture

11 Current Kernel Architecture
- "C"-callable (launching accelerator kernels, or running natively)
- Inputs and outputs in pre-allocated memory regions
- No I/O or networking access from within kernels
- For non-uniform memories (e.g., accelerators): inputs & outputs are in the accelerator memory
- Data structures are being investigated; a baseline selection has been made
- Kernels allocate their own working memory. Do we need to reconsider this?
- Key metrics: time-to-completion, energy-to-completion, memory usage
- Alternatives: e.g., multi-node kernels; not C-callable (why?); letting kernels get bits of input and output a piece at a time?
- Feedback very welcome!
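The contract described above can be sketched in host-side pseudocode. This is a hypothetical illustration, not the actual SDP API: the function and parameter names are invented, but the shape matches the slide's rules (caller pre-allocates inputs and outputs, the kernel does no I/O, and time-to-completion is recorded as a key metric):

```python
import time

def run_kernel(kernel_fn, inputs, outputs):
    """Hypothetical host-side wrapper for the kernel contract on this slide:
    inputs and outputs live in pre-allocated buffers supplied by the caller,
    the kernel performs no I/O or networking, and the wrapper records the
    key metric (time-to-completion). Names are illustrative only."""
    start = time.perf_counter()
    kernel_fn(inputs, outputs)   # kernel may allocate its own scratch memory internally
    elapsed = time.perf_counter() - start
    return {"time_to_completion_s": elapsed}

# Example: a trivial 'kernel' that scales its input into a pre-allocated output.
def scale_kernel(inputs, outputs):
    src, = inputs
    dst, = outputs
    for i, x in enumerate(src):
        dst[i] = 2.0 * x

src = [1.0, 2.0, 3.0]
dst = [0.0] * len(src)           # pre-allocated by the caller, not by the kernel
metrics = run_kernel(scale_kernel, [src], [dst])
print(dst)                       # -> [2.0, 4.0, 6.0]
```

Keeping allocation and I/O outside the kernel boundary is what makes the key metrics comparable across kernels and hardware targets.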

12 Example of system sizing & architecture implications
All subject to your feedback!

13 Memory Bandwidth Limit
Finding: SDP kernels are mostly limited by memory bandwidth.
Finding: operational intensity is ~0.5 FLOP/byte.
- Assume the SDP needs to perform 100 PetaFLOPS (cf. the parametric model) => need 200 PetaByte/s of memory bandwidth
- Assume 6 pJ/bit of memory access => ~10 MW just to drive memory
- Assume at most 1/3 of power can go into driving memory => ~30 MW total power requirement
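The chain of assumptions above can be written out explicitly. A minimal sketch using only the slide's numbers (100 PetaFLOPS, 0.5 FLOP/byte, 6 pJ/bit, memory at most 1/3 of total power); the slide rounds the results to 10 MW and 30 MW:

```python
# Memory-bandwidth power sizing (assumptions from the slide).
total_flops = 100e15          # FLOP/s: 100 PetaFLOPS target
intensity = 0.5               # FLOP/byte: measured operational intensity
energy_per_bit = 6e-12        # J/bit: assumed memory access energy (6 pJ/bit)

bandwidth = total_flops / intensity             # bytes/s needed: 2e17 = 200 PB/s
memory_power = bandwidth * 8 * energy_per_bit   # watts: ~9.6 MW (slide rounds to 10 MW)
total_power = memory_power * 3                  # memory <= 1/3 of total: ~29 MW (~30 MW)

print(bandwidth / 1e15)       # -> 200.0 (PetaByte/s)
```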

14 Implication: HBM
(Image credit: C. Spille / pcgameshardware.de, CC BY-SA 4.0)

15 SDP: Some key ratios FLOP/Memory
M_FAST / R_FLOP ≈ 20 Byte·ms/FLOP
M_FAST: size of the high-throughput / energy-efficient memory (e.g., HBM)
- 16 GB minimum working set size (with faceting); very roughly need to achieve around 1 TeraFLOP/s on it
- => 1000-way shared-memory parallelism
- => 1 ms on average between accesses to each bit of fast memory (but some parts are touched much more frequently, others not at all!)
- Opportunity to optimise given the 1 ms timescale and the big dispersion in access frequency
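The ratio on this slide follows directly from the two numbers it quotes. A minimal check, assuming the 16 GB working set and the ~1 TeraFLOP/s target:

```python
# FLOP/memory ratio check (numbers from the slide).
m_fast = 16e9        # bytes: minimum working set size with faceting (16 GB)
r_flop = 1e12        # FLOP/s: rough compute rate needed on that working set

ratio_byte_ms = m_fast * 1e3 / r_flop   # Byte·ms/FLOP

print(ratio_byte_ms)  # -> 16.0, i.e. the ~20 Byte·ms/FLOP quoted on the slide
```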

16 Data Partitioning & Sharing
- What are good structures for the data passed into kernels?
- Do any expensive intermediate results need to be shared between kernels (e.g., the "W" or "A" convolution functions)? Current thinking: probably not; recompute them "on-the-fly".
- What are the practical overheads of using packed data structures?


