Presentation transcript:

Database for Data-Analysis
Developer: Ying Chen (JLab)
Computing 3- (or N-)point functions:
- Many correlation functions (quantum numbers), at many momenta, for a fixed configuration
- Data analysis requires a single quantum number over many configurations (called an ensemble quantity)
- There can be 10K to over 100K quantum numbers
Inversion problem:
- Time to retrieve one quantum number can be long
- Analysis jobs can take hours (or days) to run; once the data are cached, the time can be considerably reduced
Development:
- Requires a better storage technique and better analysis code drivers

Database
Requirements:
- For each configuration's worth of data, pay a one-time insertion cost
- Configuration data may be inserted out of order
- Need to be able to insert or delete
Solution:
- The requirements basically imply a balanced tree
- Try a DB using Berkeley DB (Sleepycat)
Preliminary tests:
- 300 directories of binary files holding correlators (~7K files per directory)
- A single "key" of quantum number + configuration number, hashed to a string
- About a 9 GB DB; retrieval takes about 1 second on local disk, about 4 seconds over NFS
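The timings above suggest a single B-tree-backed lookup per key. As a rough illustration, a minimal sketch of such a lookup using the Berkeley DB (Sleepycat) C API follows; the file name corr.db, the key text, and treating the value as raw bytes are assumptions, not details from the talk.

    #include <db.h>      /* Berkeley DB (Sleepycat) C API */
    #include <cstdio>
    #include <cstring>

    /* Open the correlator database read-only and fetch one record. */
    int fetch_correlator(const char *key_str) {
        DB *dbp = NULL;
        if (db_create(&dbp, NULL, 0) != 0) return -1;
        /* DB_BTREE matches the "balanced tree" requirement above */
        if (dbp->open(dbp, NULL, "corr.db", NULL, DB_BTREE, DB_RDONLY, 0) != 0) {
            dbp->close(dbp, 0);
            return -1;
        }
        DBT key, data;
        std::memset(&key, 0, sizeof key);
        std::memset(&data, 0, sizeof data);
        key.data = (void *)key_str;
        key.size = std::strlen(key_str) + 1;
        int ret = dbp->get(dbp, NULL, &key, &data, 0);
        if (ret == 0)
            std::printf("got %u bytes for key %s\n", data.size, key_str);
        dbp->close(dbp, 0);
        return ret;
    }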

Database and Interface
Database "key":
- String = source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath
- Not intending (at the moment) any relational capabilities among sub-keys
Interface function:
- Array<T> read_correlator(const string& key);
Analysis code interface (wrapper):
- struct Arg {Array<T> p_i; Array<T> p_f; int gamma;};
- Getter: Ensemble<Array<T>> operator[](const Arg&); or Array<T> operator[](const Arg&);
- Here, "ensemble" objects have jackknife support, namely operator*(Ensemble<T>, Ensemble<T>);
- CVS package: adat
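The template parameters in the declarations above were lost when the slides were transcribed to text; T stands in for them. To make the "jackknife support" concrete, here is a minimal sketch of what an ensemble type with elementwise arithmetic might look like; all names and the storage layout are assumptions, not the adat package's actual design.

    #include <stdexcept>
    #include <vector>

    // Hypothetical ensemble container: one value of type T per configuration
    // (or per jackknife bin).
    template <typename T>
    class Ensemble {
    public:
        explicit Ensemble(std::size_t ncfg) : data_(ncfg) {}
        T&       operator[](std::size_t i)       { return data_[i]; }
        const T& operator[](std::size_t i) const { return data_[i]; }
        std::size_t size() const { return data_.size(); }
    private:
        std::vector<T> data_;
    };

    // Elementwise product over configurations, matching the slide's
    // operator*(Ensemble, Ensemble); jackknife error propagation would
    // then be done on the resulting ensemble.
    template <typename T>
    Ensemble<T> operator*(const Ensemble<T>& a, const Ensemble<T>& b) {
        if (a.size() != b.size())
            throw std::runtime_error("ensemble size mismatch");
        Ensemble<T> out(a.size());
        for (std::size_t i = 0; i < a.size(); ++i)
            out[i] = a[i] * b[i];
        return out;
    }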

(Clover) Temporal Preconditioning
Consider the Dirac operator D = D_t + D_s/ξ, with ξ the anisotropy.
Temporal preconditioning: det(D) = det(D_t) det(1 + D_t^{-1} D_s/ξ)
Strategy:
- Temporal preconditioning
- 3D even-odd preconditioning
Expectations:
- Improvement can increase with increasing ξ
- According to Mike Peardon, typically factors of 3 improvement in CG iterations
- Improving the condition number lowers the fermionic force
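For reference, both steps follow from the standard Schur-complement determinant identity, written here in the slide's notation; that the 3D even-odd split acts on M = 1 + D_t^{-1} D_s/ξ is an assumption about where the second preconditioning is applied.

    % Temporal preconditioning (as on the slide):
    \det D = \det\!\left(D_t + D_s/\xi\right)
           = \det(D_t)\,\det\!\left(1 + D_t^{-1} D_s/\xi\right)

    % 3D even-odd preconditioning of M = 1 + D_t^{-1} D_s/\xi over spatial
    % even/odd sublattices (standard Schur-complement identity):
    \det M = \det(M_{ee})\,\det\!\left(M_{oo} - M_{oe}\,M_{ee}^{-1}\,M_{eo}\right)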

Multi-Threading on Multi-Core Processors
Jie Chen, Ying Chen, Balint Joo and Chip Watson
Scientific Computing Group, IT Division, Jefferson Lab

Motivation
Next LQCD cluster:
- What type of machine is going to be used for the cluster?
- Intel dual-core or AMD dual-core?
Software performance improvement:
- Multi-threading

Test Environment
Two dual-core Intel Xeon 5150s (Woodcrest):
- 2.66 GHz
- 4 GB memory (FB-DDR2 667 MHz)
Two dual-core AMD Opteron 2220 SEs (Socket F):
- 2.8 GHz
- 4 GB memory (DDR2 667 MHz)
Software:
- SMP kernel (Fedora Core 5), i386 and x86_64
- Intel C/C++ compiler (9.1), gcc 4.1

Multi-Core Architecture
[Block diagrams: Intel Xeon 5100 (Woodcrest) — Core 1 / Core 2, memory controller, ESB2 I/O, PCI Express, FB-DDR2. AMD Opteron (Socket F) — Core 1 / Core 2, DDR2, PCI-E bridge, PCI-E expansion hub, PCI-X bridge.]

Multi-Core Architecture
Intel Woodcrest Xeon:
- L1 cache: 32 KB data, 32 KB instruction; 8-way associativity
- L2 cache: 4 MB shared between the 2 cores; 16-way associativity; 256-bit width; 10.6 GB/s bandwidth to the cores
- FB-DDR2: increased latency; memory disambiguation allows loads ahead of store instructions
- Execution: pipeline length 14; 24-byte fetch width; 96 reorder buffers; max decoding rate 4 + 1; max 4 FP/cycle; 128-bit SSE units, one SSE instruction/cycle
AMD Opteron:
- L1 cache: 64 KB data, 64 KB instruction; 2-way associativity
- L2 cache: 1 MB dedicated per core; 16-way associativity; 128-bit width; 6.4 GB/s bandwidth to the cores
- NUMA (DDR2): increased latency to access the other socket's memory; memory affinity is important
- Execution: pipeline length 12; 16-byte fetch width; 72 reorder buffers; max decoding rate 3; max 3 FP/cycle; 64-bit SSE units, one SSE instruction = two 64-bit instructions
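Since the Opteron numbers above hinge on memory affinity, a small illustration of pinning a thread and its memory to one NUMA node with libnuma follows; this is a generic sketch, not code from the talk.

    #include <numa.h>    // libnuma; link with -lnuma
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        // Run this thread on node 0 and allocate its buffer there, so loads
        // do not cross the HyperTransport link to the other socket.
        numa_run_on_node(0);
        const std::size_t bytes = 64UL * 1024 * 1024;
        double *buf = static_cast<double *>(numa_alloc_onnode(bytes, 0));
        if (!buf) return 1;
        for (std::size_t i = 0; i < bytes / sizeof(double); ++i)
            buf[i] = 0.0;            // first touch also lands on node 0
        numa_free(buf, bytes);
        return 0;
    }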

Memory System Performance

Memory Access Latency
[Chart: access latency in nanoseconds for L1, L2, main memory, and random main-memory access, Intel vs. AMD.]
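Latency plots like this are typically produced with a dependent pointer chase, where each load's address comes from the previous load; a minimal sketch is below (buffer size and iteration count are arbitrary choices, and the talk does not say which tool produced its numbers).

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;   // 16M entries (~128 MB): far past L2
        std::vector<std::size_t> order(n), next(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
        for (std::size_t i = 0; i < n; ++i)          // one random cycle
            next[order[i]] = order[(i + 1) % n];

        // Dependent chase: each load must finish before the next can issue,
        // so elapsed time / loads approximates average access latency.
        std::size_t idx = 0;
        const std::size_t loads = 20000000;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < loads; ++i)
            idx = next[idx];
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.1f ns/load (checksum %zu)\n", ns / loads, idx);
        return 0;
    }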

Performance of Applications: NPB 3.2 (gcc 4.1, x86_64)

LQCD Application (DWF) Performance

Parallel Programming
[Diagram: message passing between Machine 1 and Machine 2 vs. OpenMP/Pthreads within a single machine.]
- Performance improvement on multi-core/SMP machines
- All threads share one address space
- Efficient inter-thread communication (no memory copies)

Multi-Threads Provide Higher Memory Bandwidth to a Process
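A chart like this is usually produced with a STREAM-style kernel run under a varying thread count; a minimal OpenMP triad sketch follows (array size and the GB/s accounting are conventional choices, not details from the talk).

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const long n = 1L << 24;            // 3 arrays of 128 MB each
        std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
        const double scalar = 3.0;

        auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for            // compile with -fopenmp
        for (long i = 0; i < n; ++i)        // STREAM-style triad
            a[i] = b[i] + scalar * c[i];
        auto t1 = std::chrono::steady_clock::now();

        double sec = std::chrono::duration<double>(t1 - t0).count();
        double gb = 3.0 * n * sizeof(double) / 1e9;   // 2 reads + 1 write
        std::printf("%.2f GB/s\n", gb / sec);
        return 0;
    }

Rerunning with different OMP_NUM_THREADS settings traces out a scalability curve of the kind shown on the slide.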

Different Machines Provide Different Scalability for Threaded Applications

OpenMP
- Portable, shared-memory multi-processing API
- Compiler directives and runtime library
- C/C++, Fortran 77/90
- Unix/Linux, Windows
- Intel C/C++, gcc 4.x
- Implemented on top of native threads
Fork-join parallel programming model:
[Diagram: a master thread forks a team of threads, which later join back into the master; time runs left to right.]

OpenMP
Compiler directives (C/C++):

    #pragma omp parallel
    {
        thread_exec (); /* all threads execute the code */
    }
    /* all threads join master thread */

    #pragma omp critical
    #pragma omp section
    #pragma omp barrier
    #pragma omp parallel reduction(+:result)

Run-time library:
- omp_set_num_threads, omp_get_thread_num
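Putting the directives and runtime calls above together, a minimal compilable example might look like this; the work inside the loop is an arbitrary stand-in.

    #include <omp.h>
    #include <cstdio>

    int main() {
        double result = 0.0;
        omp_set_num_threads(4);                 // runtime library call

        #pragma omp parallel reduction(+:result)
        {
            int tid = omp_get_thread_num();     // runtime library call
            #pragma omp critical
            std::printf("hello from thread %d\n", tid);

            #pragma omp for
            for (int i = 0; i < 1000; ++i)      // work shared across the team
                result += 1.0 / (1.0 + i);
            /* implicit barrier; threads then join the master */
        }
        std::printf("result = %f\n", result);
        return 0;
    }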

Posix Threads
- IEEE POSIX 1003.1c standard (1995)
- NPTL (Native POSIX Thread Library) available on Linux since kernel 2.6.x
- Fine-grained parallel algorithms: barrier, pipeline, master-slave, reduction
- Complex; not for the general public
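As a point of comparison with the OpenMP version, the same fork-join shape written directly against the pthread API (thread count and the barrier use are illustrative):

    #include <pthread.h>
    #include <cstdio>

    const int NTHREADS = 4;
    pthread_barrier_t barrier;

    void *worker(void *arg) {
        long tid = (long)arg;
        std::printf("thread %ld working\n", tid);
        pthread_barrier_wait(&barrier);   // synchronize all workers
        return NULL;
    }

    int main() {
        pthread_t tids[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; ++i)        // fork
            pthread_create(&tids[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; ++i)         // join
            pthread_join(tids[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }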

QCD Multi-Threading (QMT)
Provides simple APIs for the fork-join parallel paradigm:

    typedef void (*qmt_user_func_t)(void *arg);
    qmt_pexec (qmt_user_func_t func, void *arg);

- The user "func" will be executed on multiple threads
- Offers efficient mutex lock, barrier and reduction:

    qmt_sync (int tid);
    qmt_spin_lock (&lock);

- Performs better than OpenMP-generated code?
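A sketch of how these calls might be combined, based only on the signatures shown above; the header name and the qmt_thread_id()/qmt_num_threads() accessors are assumptions, not confirmed QMT API.

    #include <cstdio>
    #include <qmt.h>                 /* assumed header name */

    static double partial[64];       // one slot per possible thread

    /* Executed on every thread by qmt_pexec (signature from the slide). */
    void sum_kernel(void *arg) {
        double *data = (double *)arg;
        int tid  = qmt_thread_id();     /* assumed accessor */
        int nthr = qmt_num_threads();   /* assumed accessor */
        double s = 0.0;
        for (int i = tid; i < 1000; i += nthr)   // strided work division
            s += data[i];
        partial[tid] = s;
        qmt_sync(tid);                  // barrier from the slide
    }

    int main() {
        static double data[1000];
        for (int i = 0; i < 1000; ++i) data[i] = 1.0;
        qmt_pexec(sum_kernel, data);    // fork-join entry point
        double total = 0.0;
        for (int i = 0; i < 64; ++i) total += partial[i];
        std::printf("sum = %f\n", total);
        return 0;
    }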

OpenMP Performance from Different Compilers (i386)

Synchronization Overhead for OMP and QMT on Intel Platform (i386)

Synchronization Overhead for OMP and QMT on AMD Platform (i386)

QMT Performance on Intel and AMD (x86_64 and gcc 4.1)

Conclusions
- Intel Woodcrest beats the AMD Opterons at this stage of the game:
  - Intel has the better dual-core micro-architecture
  - AMD has the better system architecture
- A hand-written QMT library can beat OpenMP compiler-generated code