
1 Early Experience with Out-of-Core Applications on the Cray XMT
Daniel Chavarría-Miranda §, Andrés Márquez §, Jarek Nieplocha §, Kristyn Maschhoff † and Chad Scherrer §
§ Pacific Northwest National Laboratory (PNNL)  † Cray, Inc.

2 Introduction
- The increasing gap between memory and processor speed is causing many applications to become memory-bound
- Mainstream processors rely on a cache hierarchy, but caches are not effective for highly irregular, data-intensive applications
- Multithreaded architectures provide an alternative: switch computation context to hide memory latency
- Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT use this strategy

3 Cray XMT
- 3rd generation multithreaded system from Cray
- Infrastructure is based on the XT3/4, scalable up to 8192 processors: SeaStar network, torus topology, service and I/O nodes
- Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
- Hybrid execution capabilities: code running on ThreadStorm processors can collaborate with code running on Opteron processors

4 Cray XMT (cont.)
- ThreadStorm processors run at 500 MHz
- 128 hardware thread contexts, each with its own set of 32 registers
- No data cache; a 128 KB, 4-way associative data buffer on the memory side
- Extra bits in each 64-bit memory word: full/empty bits for synchronization
- Memory is hashed at 64-byte granularity, i.e. contiguous logical addresses across a 64-byte boundary may be mapped to non-contiguous physical locations
- Global shared memory

5 Cray XMT (cont.)
- Lightweight User Communication (LUC) library coordinates data transfers and hybrid execution between ThreadStorm and Opteron processors
  - Portals-based on Opterons, Fast I/O API-based on ThreadStorms
  - RPC-style semantics
- Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system
- ThreadStorm processors cannot directly access Lustre
- LUC-based execution and transfers, combined with Lustre access on the SIO nodes, are an attractive, high-performance alternative for processing very large datasets on the XMT system

6 Outline
- Introduction
- Cray XMT
 PDTree
- Multithreaded implementation
- Static & dynamic versions
- Experimental setup and Results
- Conclusions

7 PDTree (or Anomaly Detection for Categorical Data)
- Originates from cyber security analysis: detect anomalies in packet headers to locate and characterize network attacks
- The analysis method is more widely applicable
- Uses ideas from conditional probability and multivariate categorical data analysis: for a combination of variables and instances of values for those variables, find out how many times the pattern has occurred
- The resulting count table, or contingency table, specifies a joint distribution
- Efficient implementation of algorithms using such tables is very important in statistical analysis
- The ADTree data structure (Moore & Lee 1998) can be used to store data counts; it stores all combinations of values for variables

8 PDTree (cont.)
- We use an enhancement to the ADTree data structure called a PDTree, which avoids storing all possible combinations of values
- Only a priori specified combinations are stored

9 Multithreaded Implementation
- PDTree implemented using a multiple-type, recursive tree structure
  - Root node is an array of ValueNodes (counts for different value instances of the root variables)
  - Interior and leaf nodes are linked lists of ValueNodes
- Inserting a record at the top level just increments the counter of the corresponding ValueNode
  - XMT's int_fetch_add() atomic operation is used to increment counters
- Inserting a record at other levels requires traversing a linked list to find the right ValueNode
  - If the ValueNode does not exist, it must be appended to the end of the list
  - Inserting at other levels when the node does not exist is tricky: to ensure safety, the end pointer of the list must be locked
  - readfe() and writeef() MTA operations create critical sections, taking advantage of the full/empty bits on each memory word
- As the data analysis progresses, the probability of conflicts between threads decreases

10 Multithreaded Implementation (cont.)
(Diagram: threads T1 and T2 both try to grab the end pointer of a linked list of ValueNodes (v_i = j, v_i = k). T1 succeeds and inserts a new node (v_i = m); T2 now holds a lock to a pointer that is no longer the end of the list.)

11 Static and Dynamic Versions
(Diagram: an array of RootNodes (e.g. column = a, numCols = 3, count = 5) points to an array of ColumnNodes (column = b, column = c, ...); each ColumnNode's values point either to a linked list of ValueNodes (value, count, numCols, columns, nextVN fields) in the static version, or to a hash table of ValueNodes in the dynamic version.)

12 Outline
- Introduction
- Cray XMT
- PDTree
- Multithreaded Implementation
- Static & dynamic versions
 Experimental setup and Results
- Conclusions

13 Experimental setup and Results
- Large dataset to be analyzed by PDTree: 4 GB resident on disk (64M records, 9-column guide tree)
- Options:
  - Direct file I/O from ThreadStorm processors via NFS (not very efficient)
  - Indirect I/O via a LUC server running on Opteron processors on the SIO nodes; the large input file can reside on the high-performance Lustre file system
- Simulates the use of PDTree for online network traffic analysis
- Needs the dynamic PDTree version with a 128K-element hash table

14 Experimental setup and Results (cont.)
(Diagram: a compute node with 4 ThreadStorm CPUs and DRAM, and a service/login node with an Opteron CPU and DRAM, connected by the SeaStar interconnect. Direct access goes from the ThreadStorms to the Lustre filesystem; indirect access goes through a LUC RPC to the Opteron on the service/login node, which accesses Lustre.)
Note: results obtained on a preproduction XMT with only half of the DIMM slots populated

15 Experimental setup and Results (cont.)
In-core, 1M record execution, static PDTree version:

# of procs.   XMT Insertion   XMT Speedup   MTA Insertion   MTA Speedup
 1            239.26           1.00         200.17           1.00
 2            116.36           2.06          98.25           2.04
 4             56.48           4.24          48.07           4.16
 8             27.53           8.69          23.29           8.59
16             13.97          17.13          11.61          17.24
32              7.13          33.56           5.81          34.45
64              3.68          65.02           N/A            N/A
96              2.60          92.02           N/A            N/A

16 Experimental setup and Results (cont.)
(Chart not captured in transcript)

17 Experimental setup and Results (cont.)
(Chart not captured in transcript)

18 Conclusions
- Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities: indirect access to Lustre through the LUC interface
- The I/O operation implementation needs improvement to take full advantage of Lustre; multiple LUC transfers in parallel should improve performance
- Scalability of the system is very good for the complex, data-dependent irregular accesses in the PDTree application
- Future work includes comparisons against parallel cache-based systems

