OCR on Knights Landing (Xeon-Phi)

OCR on Knights Landing (Xeon-Phi)
31st Mar 2016
Acknowledgment: This material is based upon work supported by the Department of Energy Office of Science under cooperative agreements DE-SC0008717 and DE-SC0014355, and Lawrence Livermore National Labs subcontract B608115.

Knights Landing Overview
Three modes:
  Self-boot processor
  Self-boot w/ integrated fabric
  Co-processor (PCIe add-on card)
MCDRAM: three memory modes
  Flat – entirely addressable
  Cache – direct-mapped cache for DDR
  Hybrid – part cache, part memory
Cluster modes (cache-coherent mesh interconnect)
  All-to-all: addresses uniformly hashed
  Quadrant: software-transparent; address hashed to a directory in the same quadrant as the memory
  Sub-NUMA: exposed as 4 NUMA nodes
Source: KNL presentation at Hot Chips '15
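The "direct-mapped" property of cache mode can be illustrated with a toy model: each physical address has exactly one slot it may occupy, so two addresses that differ by a multiple of the cache size evict each other. The sketch below is only an illustration. The 16 GiB capacity, the 64-byte line size, and the simple modulo indexing are assumptions made for the example, not details taken from the slides or from the hardware documentation.

```python
# Toy model of a direct-mapped memory-side cache (illustrative only).
# CACHE_BYTES, LINE_BYTES and the modulo indexing are assumptions,
# not values or behaviour confirmed by the slides.
CACHE_BYTES = 16 * 2**30   # assumed capacity used as cache
LINE_BYTES = 64            # assumed cache-line size
NUM_LINES = CACHE_BYTES // LINE_BYTES


def cache_slot(phys_addr: int) -> int:
    """Return the single slot a physical address can occupy."""
    return (phys_addr // LINE_BYTES) % NUM_LINES


def conflicts(addr_a: int, addr_b: int) -> bool:
    """True if two different lines compete for the same slot."""
    different_lines = addr_a // LINE_BYTES != addr_b // LINE_BYTES
    return different_lines and cache_slot(addr_a) == cache_slot(addr_b)


if __name__ == "__main__":
    base = 0x1000
    # Addresses one cache-size apart share a slot and evict each other.
    print(conflicts(base, base + CACHE_BYTES))
    # Neighbouring lines land in different slots.
    print(conflicts(base, base + LINE_BYTES))
```

In this model a working set larger than the cache, or one with an unlucky stride, can thrash even when most of the cache is empty, which is one reason flat mode with explicit placement may be preferable for some codes.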

OCR on KNL
1 policy domain with up to 288 workers
MCDRAM in flat mode, with two allocators
  $ numactl -H
  available: 2 nodes (0-1)
  node 0 cpus: 0 255
  node 0 size: 98200 MB
  node 0 free: 90312 MB
  node 1 cpus:
  node 1 size: 16384 MB
  node 1 free: 15519 MB
  node distances:
  node   0   1
    0:  10  31
    1:  31  10
Memory hints to choose allocator on MCDRAM (OCR_HINT_DB_HIGHBW)
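In the listing above, MCDRAM in flat mode shows up as a NUMA node that has memory but no CPUs (node 1, 16384 MB). A minimal sketch of how a script might detect that node from `numactl -H` output follows. The function names `parse_nodes` and `cpuless_nodes` are hypothetical helpers introduced for this example, and the sample text is copied from the slide as-is (its CPU list appears abbreviated and is not expanded here).

```python
import re

# Sample text copied from the slide's `numactl -H` output.
# The CPU list for node 0 is kept exactly as it appears on the slide.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 255
node 0 size: 98200 MB
node 0 free: 90312 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15519 MB
"""


def parse_nodes(text: str) -> dict:
    """Map node id -> {'cpus': [...], 'size_mb': int, 'free_mb': int}."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) (cpus|size|free):\s*(.*)$", line)
        if not m:
            continue  # skip the header and any other lines
        node = nodes.setdefault(int(m.group(1)), {})
        field, value = m.group(2), m.group(3)
        if field == "cpus":
            node["cpus"] = [int(c) for c in value.split()]
        else:
            node[field + "_mb"] = int(value.split()[0])
    return nodes


def cpuless_nodes(nodes: dict) -> list:
    """Nodes with memory but no CPUs; here, the MCDRAM node."""
    return sorted(n for n, info in nodes.items()
                  if not info.get("cpus") and info.get("size_mb", 0) > 0)


if __name__ == "__main__":
    print(cpuless_nodes(parse_nodes(SAMPLE)))
```

Outside of OCR's own allocator hints, a whole process can be bound to such a node with `numactl --membind=<node> <command>`, which is a coarser alternative to per-datablock hints.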

Results – Stencil 2D weak scaling
[Figure: weak-scaling plots, one panel for Xeon and one for KNL]
Preliminary results! Software under optimization
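For readers unfamiliar with the term, weak scaling keeps the work per worker fixed while the worker count grows, so ideal behaviour is a constant runtime. The sketch below shows the usual bookkeeping under that definition. Both helper names are hypothetical, and the square-grid sizing rule is an assumption about how a 2D stencil problem might be scaled, not a description of the benchmark used on the slide.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Efficiency = T(1 worker) / T(N workers), work per worker fixed."""
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return t_base / t_scaled


def grid_side_for_workers(base_side: int, workers: int) -> int:
    """Side of a square grid that keeps cells per worker roughly constant.

    Total cells must grow linearly with the worker count, so the side
    grows with its square root.
    """
    if base_side <= 0 or workers <= 0:
        raise ValueError("base_side and workers must be positive")
    return round(base_side * workers ** 0.5)


if __name__ == "__main__":
    # Made-up timings for illustration only; not measurements from the slides.
    print(weak_scaling_efficiency(t_base=2.0, t_scaled=2.5))
    print(grid_side_for_workers(base_side=1000, workers=4))
```

An efficiency near 1.0 means the runtime stayed flat as workers were added; values well below 1.0 point to runtime overhead or memory-bandwidth limits.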

Results – MCDRAM vs DDR
Stencil 2D with 256 threads
Preliminary results! Software under optimization

Results – Stream
Runtime bottlenecks? Profiling underway
Limited vectorization opportunities?
Preliminary results! Software under optimization
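As background for the Stream results, the sketch below shows how bandwidth is conventionally derived for a STREAM-style triad kernel, `a[i] = b[i] + s * c[i]`: each element involves two reads and one write, so a pass over n doubles moves 3 * n * 8 bytes. This is a generic illustration rather than the code measured on the slide, the function names are hypothetical, and interpreted Python will not come close to hardware bandwidth; only the accounting carries over.

```python
import time
from array import array


def triad_bytes(n: int, word_bytes: int = 8) -> int:
    """Bytes moved by a[i] = b[i] + s * c[i]: two reads and one write."""
    return 3 * n * word_bytes


def run_triad(n: int, scalar: float = 3.0):
    """Run one triad pass; return (result array, elapsed seconds)."""
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    a = array("d", (b[i] + scalar * c[i] for i in range(n)))
    elapsed = time.perf_counter() - start
    return a, elapsed


def bandwidth_gb_per_s(n: int, seconds: float) -> float:
    """Sustained bandwidth in GB/s (10**9 bytes) for one triad pass."""
    return triad_bytes(n) / seconds / 1e9


if __name__ == "__main__":
    n = 1_000_000
    _, elapsed = run_triad(n)
    print(f"{bandwidth_gb_per_s(n, elapsed):.3f} GB/s")
```

If a measured figure falls far below what the memory can deliver, the gap is typically attributed to runtime overhead or to a kernel that does not vectorize, which matches the two questions raised on the slide.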

Next Steps
Root-cause and fix MCDRAM performance
Study all-to-all vs. sub-NUMA modes
Single vs. multiple policy domains
Performance counters & introspection