Presentation is loading. Please wait.

Presentation is loading. Please wait.

Lecture 4 MapReduce Software Frameworks and CUDA GPU Architectures

Similar presentations


Presentation on theme: "Lecture 4 MapReduce Software Frameworks and CUDA GPU Architectures"— Presentation transcript:

1 Lecture 4 MapReduce Software Frameworks and CUDA GPU Architectures

2 From MapReduce to Hadoop and Spark
MapReduce is a software framework Designed for bipartite graph computing Built with a master-worker model Supports parallel and distributed computing on large data sets Abstracts the data flow of running a parallel program on a distributed computing system By providing users with two interfaces in the form of two functions, i.e., Map and Reduce Users can override these two functions to interact with and manipulate the data flow of running the programs 1.1 CLOUD COMPUTING IN A NUTSHELL

3 From MapReduce to Hadoop and Spark (cont.)
MapReduce applies dynamic execution, fault tolerance, and easy-to-use APIs Performs Map and Reduce functions in a pipelined fashion MapReduce software framework was first proposed and implemented by Google Google MapReduce paradigm is written in C Evolved from use in a search engine to Google App Engine cloud Initially, Google’s MapReduce was applied only in fast search engines Then MapReduce enabled cloud computing 1.1 CLOUD COMPUTING IN A NUTSHELL

4 From MapReduce to Hadoop and Spark (cont.)
Apache Hadoop has made MapReduce Possible for big data processing on large server clusters or clouds Apache Spark frees up many constraints by MapReduce and Hadoop programming In general-purpose batch or streaming applications The MapReduce framework is only for batch processing of large data sets Deal with a static data set which will not change during execution Streaming data or real-time data cannot be handled well in batch mode 1.1 CLOUD COMPUTING IN A NUTSHELL

5 The MapReduce Compute Engine
The MapReduce framework provides an abstraction layer for data and control flow The logical data flow from the Map to the Reduce The control flow is hidden from users 1.1 CLOUD COMPUTING IN A NUTSHELL

6 The MapReduce Compute Engine (cont.)
The MapReduce library is essentially the controller of the MapReduce pipeline Coordinates the dataflow from the input end to the output end in a synchronous manner The API tools are used to provide an abstraction to hide the MapReduce software framework from intervention by users, randomly The data flow in a MapReduce framework is predefined Data partitioning, mapping and scheduling, synchronization, communication, and output of results Partitioning is controlled in user programs By specifying the partitioning block size and data fetch patterns 1.1 CLOUD COMPUTING IN A NUTSHELL

7 The MapReduce Compute Engine (cont.)
The abstraction layer provides two well-defined interfaces in two functions: Map and Reduce These mapper and reducer functions can be defined by the user to achieve specific objectives The user overrides the Map and Reduce functions Map and Reduce functions take a specification object, called Spec First initialized inside the user’s program The user writes code to fill it with the names of input and output files as well as other tuning parameters Also filled with the names of the Map and the Reduce functions Invokes the provided MapReduce(Spec, &Results) function from the library to start the flow of data 1.1 CLOUD COMPUTING IN A NUTSHELL

8 The MapReduce Compute Engine (cont.)
1.1 CLOUD COMPUTING IN A NUTSHELL

9 The MapReduce Compute Engine (cont.)
The overall structure of a user’s MapReduce program 1.1 CLOUD COMPUTING IN A NUTSHELL

10 Logical Dataflow The input data to both the Map and the Reduce function have a particular structure The same argument goes for the output data too The input data to the Map function is arranged in the form of a (key, value) pair The value is the actual data The key part is only used to control the data flow e.g., The key is the line offset within the input file and the value is the content of the line The output data from the Map function is structured as (key, value) pairs Called intermediate (key, value) pairs 1.1 CLOUD COMPUTING IN A NUTSHELL

11 Logical Dataflow (cont.)
The Map function processes each input (key, value) pair To produce s few intermediate (key, value) pairs The aim is to process all input (key, value) pairs to the Map function in parallel e,g, The map function emits each word w plus an associated count of occurrences Just a 1 is recorded in this pseudo-code The Reduce function receives the intermediate (key, value) pairs 1.1 CLOUD COMPUTING IN A NUTSHELL

12 Logical Dataflow (cont.)
In the form of a group of intermediate values (key, [set of values]) associated with one intermediate key MapReduce framework forms these groups by first sorting the intermediate (key, value) pairs Then grouping values with the same key Sorting the data is done to simplify the grouping process The Reduce function processes each (key, [set of values]) group Produces a set of (key, value) pairs as output e.g., The reduce function merges the word counts by different map workers Into a total count as output 1.1 CLOUD COMPUTING IN A NUTSHELL

13 Logical Dataflow (cont.)
1.1 CLOUD COMPUTING IN A NUTSHELL

14 Logical Dataflow (cont.)
Word Count Using MapReduce over Partitioned Data Set One of the well-known MapReduce problems The word count problem for a simple input file containing only two lines most people ignore most poetry most poetry ignores most people The Map function simultaneously produces a number of intermediate (key, value) pairs for each line content Each word is the intermediate key with 1 as its intermediate value, e.g., (ignore, 1) The MapReduce library collects all the generated intermediate (key, value) pairs 1.1 CLOUD COMPUTING IN A NUTSHELL

15 Logical Dataflow (cont.)
Sorts them to group the 1s for identical words, e.g., (people, [1,1]) Groups are then sent to the Reduce function in parallel It can sum up the 1 values for each word Generate the actual number of occurrences for each word in the file, e.g., (people, 2) 1.1 CLOUD COMPUTING IN A NUTSHELL

16 Logical Dataflow (cont.)
Hadoop Implementation of a MapReduce WebVisCounter Program WebVisCounter counts the number of times that users connect to or visit a given website using a particular operating system The input data is a typical web server log file 1.1 CLOUD COMPUTING IN A NUTSHELL

17 Logical Dataflow (cont.)
Data flow in WebVisCounter program execution The Map function parses each line to extract the type of the used OS as a key and assigns a value 1 to it The Reduce function in turn sums up the number of 1s for each unique key 1.1 CLOUD COMPUTING IN A NUTSHELL

18 Logical Dataflow (cont.)
Each Map server applies the map function to each input data split Many mapper functions run concurrently on hundreds or thousands of machine instances Many intermediate key-value pairs are generated Stored in local disks for subsequent use The original MapReduce is slow on large clusters Due to disk-based handling of intermediate results The Reduce server collates the values using the reduce function The reducer function can be max., min., average, dot product of two vectors, etc 1.1 CLOUD COMPUTING IN A NUTSHELL

19 Formal MapReduce Model
The Map function is applied in parallel to every input (key, value) pair Produces a new set of intermediate (key, value) pairs MapReduce library collects all the produced intermediate pairs from all input pairs Sorts them based on the key part Groups the values of all occurrences of the same key The Reduce function is applied in parallel to each group To produce the collection of values as output 1.1 CLOUD COMPUTING IN A NUTSHELL

20 Formal MapReduce Model (cont.)
After grouping all the intermediate data The values of all occurrences of the same key are sorted and grouped together Each key becomes unique in all intermediate data Finding unique keys is the starting point to solving a typical MapReduce problem The intermediate (key, value) pairs as the output of map function will be automatically produced Examples of how to define keys and values Count the number of occurrences of each word in a collection of documents in the above example 1.1 CLOUD COMPUTING IN A NUTSHELL

21 Formal MapReduce Model (cont.)
Count the number of occurrences of anagrams in a collection of documents Anagrams are words that are formed by rearranging the letters of another word e.g., listen can be reworked into the word silent The unique keys are an alphabetically sorted sequence of letters for each word, e.g., eilnst The intermediate value is the number of occurrences The main responsibility of the MapReduce framework To efficiently run a user’s program on a distributed computing system Carefully handles all partitioning, mapping, synchronization, communication, and scheduling details of such data flows 1.1 CLOUD COMPUTING IN A NUTSHELL

22 Formal MapReduce Model (cont.)
1.1 CLOUD COMPUTING IN A NUTSHELL

23 Formal MapReduce Model (cont.)
Intermediate (key, value) pairs produced are partitioned into R regions R is equal to number of reduce tasks Guarantees that (key, value) pairs with identical keys are stored in the same region Reduce workers may face network congestion Caused by reduction or merging operation performed 1.1 CLOUD COMPUTING IN A NUTSHELL

24 Compute-Data Locality
The MapReduce implementation takes advantage of Google File System (GFS) as the underlying layer MapReduce can perfectly adapt itself to GFS GFS is a distributed file system Files are divided into fixed-size blocks (chunks) Blocks are distributed and stored on cluster nodes MapReduce library splits the input data (files) into fixed-size blocks Ideally performs the Map function in parallel on each block 1.1 CLOUD COMPUTING IN A NUTSHELL

25 Compute-Data Locality (cont.)
GFS has already stored files as a set of blocks MapReduce just needs to send a copy of the user’s program containing the Map function to the nodes already stored as data blocks The notion of sending computation toward data rather than sending data toward computation 1.1 CLOUD COMPUTING IN A NUTSHELL

26 MapReduce for Parallel Matrix Multiplication
In multiplying two n×n matrices A = (aij) and B = (bij) Need to perform n2 dot product operations to produce an output matrix C = (cij) Each dot product produces an output element cij = ai1 × b1j + ai2 × b2j + ∙ ∙ ∙ + ain × bnj Corresponding to the i-th row vector in matrix A multiplied by the j-th column vector in matrix B Mathematically, each dot product takes n multiply-and-add time units to complete The total matrix multiply complexity equals n×n2 since there are n2 output elements In theory, the n2 dot products are totally independent of each other 1.1 CLOUD COMPUTING IN A NUTSHELL

27 MapReduce for Parallel Matrix Multiplication (cont.)
Can be done on n2 servers in n time units When n is very large, say millions or higher Too expensive to build a cluster with n2 servers In practice, only the use of N << n2 servers The ideal speedup is expected to be N MapReduce Multiplication of Two Matrices Apply the MapReduce method to multiply two 2×2 matrices: A = (aij) and B = (bij) With two mappers and one reducer 1.1 CLOUD COMPUTING IN A NUTSHELL

28 MapReduce for Parallel Matrix Multiplication (cont.)
Map the first and second rows row of matrix A and entire matrix B to the first and second Map servers, respectively Four keys are used to identify four blocks of data processed K11, K12, K21, and K22 Simply denoted by the matrix element indices Partition matrix A and matrix BT by rows into two blocks, horizontally BT is the transposed matrix of B Data blocks are read into the two mappers All intermediate computing results are identified by their <key, value> pairs 1.1 CLOUD COMPUTING IN A NUTSHELL

29 MapReduce for Parallel Matrix Multiplication (cont.)
The generation, sorting, and grouping of four <key, value> pairs by each mapper in two stages Each short pair <key, value> holds a single partial-product value identified by its key The long pair holds two partial products identified by each block key The Reducer is used to sum up the output matrix elements using four long <key, value(s)> pairs Consider six mappers and two reducers Each mapper handles n/6 adjacent rows of the input matrix Each reducer generates n/2 of the output matrix C 1.1 CLOUD COMPUTING IN A NUTSHELL

30 MapReduce for Parallel Matrix Multiplication (cont.)
When the matrix order becomes very large The time to multiply very large matrices becomes cost prohibitive A dataflow graph for the above example 1.1 CLOUD COMPUTING IN A NUTSHELL

31 GPU Computing to Exascale and Beyond
Multicore CPUs may increase from the tens of cores to hundreds or more in the future CPU has reached its limit in terms of exploiting massive parallelism due to liming memory speed Triggered the development of many-core GPUs with hundreds or more thin cores x-86 processors have been extended to serve HPC systems in some high-end server processors Many RISC processors have been replaced with multicore x-86 processors and many-core GPUs This trend indicates that x-86 upgrades will dominate in data centers and supercomputers The GPU also has been applied in large clusters to build supercomputers in massively parallel processors (MPPs) 1.1 CLOUD COMPUTING IN A NUTSHELL

32 GPU Computing to Exascale and Beyond (cont.)
In the future, the processor industry will develop asymmetric/heterogeneous chip multiprocessors With both fat CPU cores and thin GPU cores on chip Internal to each node of the cloud Multithreading is practiced with a large number of cores in many-core GPU clusters Four challenges for exascale computing Energy and power & Memory and storage Needs to optimize the storage hierarchy and tailor the memory to the applications Concurrency and locality & System resiliency Needs to promote self-aware OS and runtime support and build locality-aware compilers and auto-tuners Self-aware OS/RT systems have the ability to adapt to the current situation and react to runtime events 1.1 CLOUD COMPUTING IN A NUTSHELL

33 GPU Computing to Exascale and Beyond (cont.)
A graphics processing unit (GPU) is a graphics coprocessor or accelerator Mounted on a computer’s graphics/video card Offloads the CPU from tedious graphics tasks in video editing applications The world’s first GPU, the GeForce 256, was marketed by NVIDIA in 1999 Modern GPU chips can process a minimum of 10 million polygons per second Used in nearly every computer on the market today Some features were also integrated into certain CPUs Traditional CPUs are structured with only a few cores 1.1 CLOUD COMPUTING IN A NUTSHELL

34 GPU Computing to Exascale and Beyond (cont.)
e.g., The Xeon X5670 CPU has six cores Modern GPU chips can be built with hundreds of cores GPUs have a throughput architecture that exploits massive parallelism by executing many concurrent threads slowly Instead of executing a single long thread in a conventional microprocessor very quickly Parallel GPUs or GPU clusters have been adopted Against the use of CPUs with limited parallelism General-purpose computing on GPUs have appeared in the HPC field, known as GPGPUs NVIDIA’s CUDA (Compute Unified Device Architecture) model is for HPC using GPGPUs 1.1 CLOUD COMPUTING IN A NUTSHELL

35 How GPUs Work Early GPUs functioned as coprocessors attached to the CPU Today, the NVIDIA GPU has been upgraded to 128 cores on a single chip Each core can handle eight threads of instructions Translates to having up to 1,024 threads executed concurrently on a single GPU True massive parallelism, compared to only a few threads that can be handled by a conventional CPU Achieves exascale-scale computing, Eflops or 1018 flops Optimized to deliver much higher throughput with explicit management of on-chip memory The CPU is optimized for latency caches 1.1 CLOUD COMPUTING IN A NUTSHELL

36 How GPUs Work (cont.) Modern GPUs are not restricted to accelerated graphics or video coding Also used in HPC systems To power supercomputers with massive parallelism at multicore and multithreading levels Designed to handle large numbers of floating-point operations in parallel In a way, the GPU offloads the CPU from all data-intensive calculations Not just those related to video processing widely used in mobile phones, game consoles, PCs, servers, etc e.g., The NVIDIA CUDA Tesla or Fermi is used in GPU clusters or in HPC systems for parallel processing of massive floating-pointing data 1.1 CLOUD COMPUTING IN A NUTSHELL

37 How GPUs Work (cont.) The interaction between a CPU and GPU
In performing parallel execution of floating-point operations concurrently The CPU is the conventional multicore processor with limited parallelism to exploit The GPU has a many-core architecture Hundreds of simple processing cores organized as multiprocessors Each core can have one or more threads The CPU’s floating-point kernel computation role is largely offloaded to the many-core GPU The CPU instructs the GPU to perform massive data processing 1.1 CLOUD COMPUTING IN A NUTSHELL

38 How GPUs Work (cont.) The NVIDIA Fermi GPU Chip with 512 CUDA Cores
The bandwidth must be matched between the on-board main memory and the on-chip GPU memory This process is carried out in NVIDIA’s CUDA programming using the GeForce 8800 or Tesla and Fermi GPUs The NVIDIA Fermi GPU Chip with 512 CUDA Cores 1.1 CLOUD COMPUTING IN A NUTSHELL

39 How GPUs Work (cont.) In 2010, three of the five fastest supercomputers in the world used large numbers of GPU chips to accelerate floating-point computations i.e., The Tianhe-1a, Nebulae, and Tsubame The architecture of the Fermi GPU A next-generation GPU from NVIDIA A streaming multiprocessor (SM) module Multiple SMs can be built on a single GPU chip The Fermi chip has 16 SMs implemented with 3 billion transistors Each SM comprises up to 512 streaming processors (SPs), known as CUDA cores The Tesla GPUs used in the Tianhe-1a have a similar architecture, with 448 CUDA cores 1.1 CLOUD COMPUTING IN A NUTSHELL

40 How GPUs Work (cont.) The Fermi GPU is a newer generation of GPU
Can be used in desktop workstations to accelerate floating-point calculations Or for building large-scale data centers There are 32 CUDA cores per SM Each CUDA core has a simple pipelined integer ALU and an FPU that can be used in parallel Each SM has 16 load/store units Allowing source and destination addresses to be calculated for 16 threads per clock There are four special function units (SFUs) for executing transcendental instructions These instructions perform trigonometric and logarithmic operations on floating-point operands 1.1 CLOUD COMPUTING IN A NUTSHELL

41 How GPUs Work (cont.) 1.1 CLOUD COMPUTING IN A NUTSHELL

42 How GPUs Work (cont.) All functional units and CUDA cores are interconnected by an NoC (network on chip) to a large number of SRAM banks (L2 caches) Each SM has a 64 KB L1 cache The 768 KB unified L2 cache is shared by all SMs to serve all load and store operations Memory controllers are used to connect to 6 GB of off-chip DRAMs The SM schedules threads in groups of 32 parallel threads called warps In total, 256/512 FMA (fused multiply and add) operations can be done in parallel to produce 32/64-bit floating-point results The 512 CUDA cores in an SM can work in parallel to deliver up to 515 Gflops of double-precision results 1.1 CLOUD COMPUTING IN A NUTSHELL

43 How GPUs Work (cont.) With 16 SMs, a single GPU has a peak speed of 82.4 Tflops Only 12 Fermi GPUs have the potential to reach the Pflops performance In the future, thousand-core GPUs may appear in Exascale systems Reflects a trend toward building future MPPs with hybrid architectures of both types of chips The progress of GPUs along with CPU advances in power efficiency, performance, and programmability All systems using the hybrid CPU/GPU architecture consume much less power 1.1 CLOUD COMPUTING IN A NUTSHELL

44 GPU Clusters for Massive Parallelism
Commodity GPUs have become high-performance accelerators for data-parallel computing Modern GPU chips contain hundreds of processor cores per chip Each GPU chip is capable of achieving up to 1 Tflops for single-precision (SP) arithmetic And more than 80 Gflops for double-precision (DP) calculations Recent HPC-optimized GPUs contain up to 4 GB of on-board memory Capable of sustaining memory bandwidths exceeding 100 GB/second 1.1 CLOUD COMPUTING IN A NUTSHELL

45 GPU Clusters for Massive Parallelism (cont.)
GPU clusters are built with a large number of GPU chips With the capability to achieve Pflops performance in some of the Top 500 systems Most GPU clusters are structured with homogeneous GPUs of the same hardware class, make, and model The software used in a GPU cluster includes the OS, GPU drivers, and clustering API such as an MPI The high performance of a GPU cluster is attributed mainly to The massively parallel multicore architecture, and high throughput in multithreaded floating-point arithmetic Significantly reduced time in massive data movement using large on-chip cache memory 1.1 CLOUD COMPUTING IN A NUTSHELL

46 GPU Clusters for Massive Parallelism (cont.)
GPU clusters already are more cost-effective than traditional CPU clusters Result in a quantum jump in speed performance Highly reduced space, power, and cooling demands Can operate with a reduced number of OS images, compared with CPU-based clusters These reductions in power, environment, and management complexity Make GPU clusters very attractive for use in future HPC applications A GPU cluster is often built as a heterogeneous system Consisting of three major components 1.1 CLOUD COMPUTING IN A NUTSHELL

47 GPU Clusters for Massive Parallelism (cont.)
The CPU host nodes, the GPU nodes and the cluster interconnect between them The GPU nodes are formed with general-purpose GPUs (GPGPUs) to carry out numerical calculations The host node controls program execution The cluster interconnect handles inter-node communications To guarantee the performance Multiple GPUs must be fully supplied with data streams over high-bandwidth network and memory Host memory should be optimized to match with the on-chip cache bandwidths on the GPUs 1.1 CLOUD COMPUTING IN A NUTSHELL

48 Echelon GPU Cluster Architecture
The architecture of a GPU accelerator chip suggested for Exascale computing For use in building a NVIDIA GPU cluster An Echelon GPU chip incorporates 1024 stream cores and 8 latency-optimized CPU-like cores Eight stream cores form a stream multiprocessor (SM) There are 128 SMs in the Echelon GPU chip Each SM is designed with 8 processor cores to yield a 160 Gflops peak speed Totally the chip has a peak speed of Tflops These nodes are interconnected by a NoC to 1,024 SRAM banks (L2 caches) Each cache bank has a 256 KB capacity 1.1 CLOUD COMPUTING IN A NUTSHELL

49 Echelon GPU Cluster Architecture (cont.)
The MCs (memory controllers) are used to connect to off-chip DRAMs The NI (network interface) is to scale the size of the GPU cluster hierarchically The architecture of NVIDIA Echelon GPU system 1.1 CLOUD COMPUTING IN A NUTSHELL

50 Echelon GPU Cluster Architecture (cont.)
The entire system is built with N cabinets Labeled C0, C1, ... , CN Each cabinet is built with 16 compute module Labeled as M0, M1, ... , M15 Each compute module is built with 8 GPU nodes Labeled as N0, N1, ... , N7 Each GPU node is the innermost block labeled as PC A single cabinet can house 128 GPU nodes Each compute module features a performance of 160 Tflops and 12.8 TB/s over 2 TB of memory Each cabinet has the potential to deliver 2.6 Pflops over 32 TB memory and 205 TB/s bandwidth The N cabinets are interconnected by a Dragonfly network with optical fiber 1.1 CLOUD COMPUTING IN A NUTSHELL

51 Echelon GPU Cluster Architecture (cont.)
1.1 CLOUD COMPUTING IN A NUTSHELL

52 Echelon GPU Cluster Architecture (cont.)
To achieve Eflops performance, need to use at least N = 400 cabinets In total, an Exascale system needs 327,680 processor cores in 400 cabinets The Echelon system is supported by a self-aware OS and runtime system Also designed to preserve locality with the support of compiler and autotuner NVIDIA Fermi (GF110) chip has 512 stream processors The Echelon design is about 25 times faster Future applications of CUDA GPUs and Echelon Building the single-chip cloud computer (SCC) through virtualization in many-core architecture 1.1 CLOUD COMPUTING IN A NUTSHELL

53 CUDA Parallel Programming
CUDA offers a parallel computing architecture developed by NVIDIA The computing engine in NVIDIA GPUs This software is accessible to developers through standard programming languages Programmers use CUDA C with NVIDIA extensions and certain restrictions Compiled through a PathScale Open64 C compiler for parallel execution on a large number of GPU cores Parallel SAXPY Execution Using CUDA C Code on GPUs SAXPY is a kernel operation frequently performed in matrix multiplication 1.1 CLOUD COMPUTING IN A NUTSHELL

54 CUDA Parallel Programming (cont.)
Essentially performs repeated multiply/add operations to generate the dot product of two long vectors The saxpy_serial routine in standard C code Only suitable for sequential execution on a single processor core The saxpy_parallel routine in CUDA C code For parallel execution by 256 threads/block on many processing cores on the GPU chip 1.1 CLOUD COMPUTING IN A NUTSHELL

55 CUDA Parallel Programming (cont.)
N blocks are handled by N processing cores N can be on the order of hundreds of blocks Using CUDA C to exploit massive parallelism on a cluster of multicore and multithreaded processors using the NVIDIA GPUs as building blocks CUDA architecture shares a range of computational interfaces Khronos Group’s Open Computing Language Microsoft’s DirectCompute Provides both a low-level API and a higher-level API Third-party wrappers are also available For using Python, Perl, FORTRAN, Java, Ruby, Lua, MATLAB, and IDL 1.1 CLOUD COMPUTING IN A NUTSHELL

56 Trends in CUDA Usage CUDA has been used to accelerate nongraphical applications in computational biology, cryptograph, etc By an order of magnitude or more Works with all NVIDIA GPUs from the G8X series Including the GeForce, Quadro, and Tesla lines Programs developed for the GeForce 8 series will also work without modification on all future NVIDIA video cards due to binary compatibility Tesla and Fermi are two generations of CUDA architecture released by NVIDIA The Fermi has eight times the peak double-precision floating-point performance of the Tesla GPU 1.1 CLOUD COMPUTING IN A NUTSHELL

57 Trends in CUDA Usage (cont.)
The CUDA version 3.2 is used for using a single GPU module in 2010 A newer version 4.0 will allow multiple GPUs to use an unified virtual address space of shared memory The NVIDIA GPUs will be designed to support C++ CUDA leverages the high performance of GPUs to run compute-intensive applications on host OSes In 2016, AWS announced a new cloud service based on GPU architecture Designed to support solving complex AI problems and applications that demand low-precision arithmetic on massive resources for parallellism This GPU cloud differs from EC2 1.1 CLOUD COMPUTING IN A NUTSHELL

58 The vCUDA for Virtualization of General-Purpose GPUs
Difficult to run CUDA applications on hardware-level VMs directly vCUDA virtualizes the CUDA library Can be installed on guest OSs When CUDA applications run on a guest OS and issue a call to the CUDA API vCUDA intercepts the call and redirects it to the CUDA API running on the host OS The basic concept of vCUDA architecture Employs a client-server model to implement CUDA virtualization Consists of three user space components The vCUDA library 1.1 CLOUD COMPUTING IN A NUTSHELL

59 The vCUDA for Virtualization of General-Purpose GPUs (cont.)
A virtual GPU in the guest OS to act as a client The vCUDA stub in the host OS to act as a server The vCUDA library resides in the guest OS as a substitute for the standard CUDA library Responsible for intercepting and redirecting API calls from the client to the stub vCUDA also creates and manages vGPUs to abstract the GPU structure Gives applications a uniform view of hardware When a CUDA application in the guest OS allocates a device’s memory The vGPU can return a local virtual address to the application and notify the remote stub to allocate real device memory 1.1 CLOUD COMPUTING IN A NUTSHELL

60 The vCUDA for Virtualization of General-Purpose GPUs (cont.)
Responsible for storing the CUDA API flow The vCUDA stub receives and interprets remote requests and creates a corresponding execution context for the API calls from the guest OS Returns the results to the guest OS The stub also manages actual physical resource allocation 1.1 CLOUD COMPUTING IN A NUTSHELL

61 TensorFlow Toolkit TensorFlow is a computational framework for building machine learning models The hierarchy of TensorFlow toolkits 1.1 CLOUD COMPUTING IN A NUTSHELL

62 TensorFlow Toolkit (cont.)
TensorFlow provides a variety of different toolkits Allow to construct models at the preferred level of abstraction Can use lower-level APIs to build models by defining a series of mathematical operations Or can use higher-level APIs like tf.estimator to specify predefined architectures such as linear regressors or neural networks TensorFlow can run the ML program on multiple hardware platforms Including CPU, GPU, and TPU Just as the Python interpreter is implemented on multiple hardware platforms to run Python code 1.1 CLOUD COMPUTING IN A NUTSHELL

63 TensorFlow Toolkit (cont.)
One should use the highest level of abstraction to solve the problem The higher levels of abstraction are easier to use But are also less flexible Recommend to start with the highest-level API first and get everything working If need additional flexibility for some special modeling concerns, move one level lower Each level is built using the APIs in lower levels Dropping down the hierarchy should be reasonably straightforward tf.estimator API Using tf.estimator can dramatically lowers the number of lines of code 1.1 CLOUD COMPUTING IN A NUTSHELL

64 TensorFlow Toolkit (cont.)
Compatible with the scikit-learn API Scikit-learn is an extremely popular open-source ML library in Python The pseudocode for a linear classification program implemented in tf.estimator 1.1 CLOUD COMPUTING IN A NUTSHELL


Download ppt "Lecture 4 MapReduce Software Frameworks and CUDA GPU Architectures"

Similar presentations


Ads by Google