Data Intensive Computing at Sandia September 15, 2010 Andy Wilson Senior Member of Technical Staff Data Analysis and Visualization Sandia National Laboratories.


1 Data Intensive Computing at Sandia September 15, 2010 Andy Wilson Senior Member of Technical Staff Data Analysis and Visualization Sandia National Laboratories Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 The Question What is Data-Intensive Computing?

3 My Answer: What is Data-Intensive Computing? Parallel computing where you design your algorithms and your software around efficient access and traversal of a data set; where hardware requirements are dictated by data size as much as by desired run times. Usually: distilling compact results from massive data.
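To illustrate "distilling compact results from massive data", here is a minimal Python sketch that is not from the talk: a single-pass mean and variance using Welford's update. The function name `running_stats` is hypothetical. The point is only that the data set is traversed once and memory use stays constant however large the input is.

```python
from typing import Iterable, Tuple


def running_stats(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) in one pass with O(1) memory."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance
```

Because `values` can be any iterable, the same function works on a generator that reads records from disk one at a time.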

4

5 Outline What is Data-Intensive Computing? Data-Intensive Computing at Sandia –Physics –Informatics –Architectures Into the Future

6 Spaghetti Plot (2)

7 Traditional Visualization Workflow [Diagram: Solver, Disk Storage, Visualization; data passed: Full Mesh]

8 Traditional In-Situ Visualization [Diagram: Solver, Disk Storage, Visualization; data passed: Images. The traditional workflow diagram (data passed: Full Mesh) is repeated for comparison.]

9 Coprocessing [Diagram: Solver, Disk Storage, Features & Statistics, Visualization; data passed: Salient Data. The in-situ diagram (data passed: Images) and the traditional diagram (data passed: Full Mesh) are repeated for comparison.]
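One reading of the coprocessing slide is that features and statistics are computed while the solver runs, so only the salient data needs to be stored. The Python sketch below is an illustration under that assumption, not the speaker's code: the toy 1D diffusion solver and the names `diffuse`, `summarize`, and `run_with_coprocessing` are all hypothetical.

```python
from typing import Dict, List


def diffuse(field: List[float], alpha: float = 0.25) -> List[float]:
    """One explicit step of 1D diffusion with fixed end values (toy solver)."""
    nxt = field[:]
    for i in range(1, len(field) - 1):
        nxt[i] = field[i] + alpha * (field[i - 1] - 2 * field[i] + field[i + 1])
    return nxt


def summarize(field: List[float], threshold: float) -> Dict[str, float]:
    """Reduce a full field to a few numbers (the features and statistics)."""
    return {
        "min": min(field),
        "max": max(field),
        "mean": sum(field) / len(field),
        "cells_above": float(sum(1 for v in field if v > threshold)),
    }


def run_with_coprocessing(
    field: List[float], steps: int, threshold: float = 0.1
) -> List[Dict[str, float]]:
    """Run the solver and keep only per-step summaries, not the full field."""
    summaries = []
    for _ in range(steps):
        field = diffuse(field)
        summaries.append(summarize(field, threshold))
    return summaries
```

The stored output grows with the number of time steps rather than with the mesh size, which is the trade the slide's "Salient Data" label suggests.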

10 Collision Movie

11 Outline What is Data-Intensive Computing? Data-Intensive Computing at Sandia –Physics –Informatics –Architectures Into the Future

12 Community Detection in Networks: Find many small groups of vertices and/or edges –O(n) communities –overlaps may be allowed. Hundreds of papers in physics and computer science. [Lancichinetti, Fortunato, Radicchi 2008]

13 Analysis of Massive Graphs. Finding communities: a kernel of social network analysis. “Dunbar’s number” from sociology: there is a size limit (~150) on stable social group size (from Neolithic farming village to academic sub-discipline). Twitter social network (|V|≈200M) [Akshay Java, 2007]

14 Collapsed Dendrograms and Statistical Confidence: wCNM. The wCNM partitioning is much deeper, resolving smaller communities. The statistically significant variation is visually close, but does not reproduce ground truth as well. The (much better) wCNM solution also has a statistically significant variation. Image credit: Titan
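Assuming that wCNM refers to a weighted variant of the Clauset-Newman-Moore (CNM) greedy method, the quantity such methods try to increase is the modularity of a partition. The Python sketch below computes unweighted modularity for a given assignment of vertices to communities; it is an illustration, not the Titan or MTGL implementation, and the function name `modularity` and its argument layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    community: Dict[Hashable, Hashable],
) -> float:
    """Modularity Q = sum over communities c of e_c/m - (d_c / 2m)^2.

    e_c is the number of edges inside c, d_c the total degree of c,
    and m the number of edges in the undirected graph.
    """
    m = 0
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # sum of vertex degrees per community
    for u, v in edges:
        m += 1
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    if m == 0:
        return 0.0
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

For two triangles joined by a single edge, splitting at the bridge scores higher than putting all six vertices in one community, which matches the intuition of "many small groups".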

15 LSA and LDA from 5 miles up. Image credit: Dave Robinson (LDA)

16 LSA/LDA: Increasing Data Size, Single Processor. Straight Line = Linear Scaling, Lower = Faster.

17 LSA/LDA: Weak Scaling (Bigger Problem, Same Time). Flat Lines = Perfect Scaling.
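The "flat lines = perfect scaling" reading can be made concrete with the usual weak-scaling efficiency, the baseline run time divided by the run time at each processor count while the work per processor is held fixed. This is a small Python sketch; the function name `weak_scaling_efficiency` is hypothetical and the timings in the usage example are invented for illustration, not taken from the slide.

```python
from typing import Dict


def weak_scaling_efficiency(timings: Dict[int, float]) -> Dict[int, float]:
    """Map processor count -> efficiency relative to the smallest run.

    `timings` maps processor count to wall time, with the problem size
    grown in proportion to the processor count. 1.0 means a flat line.
    """
    base_time = timings[min(timings)]
    return {procs: base_time / t for procs, t in sorted(timings.items())}


# Invented timings (seconds) purely to show the calculation.
example = weak_scaling_efficiency({1: 10.0, 2: 10.0, 4: 12.5})
```

An efficiency that stays near 1.0 as processors are added corresponds to the flat line on the plot; values below 1.0 show the run time creeping up.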

18 Outline What is Data-Intensive Computing? Data-Intensive Computing at Sandia –Physics –Informatics –Architectures Into the Future

19 NGC System Diagram. Layers: Architectures, Algorithms, Web Services, Applications (Clients): Titan, browser. Trilinos: Algebraic Methods (Clustering, Ranking, High Dimensional Mapping). MTGL: Graph Methods (Subgraph searches, Connection sg’s, Shortest Path, etc.). Specialized Distributed Data Operations. Titan: Analysis Pipelines, Capability Integration, Data Access, Lightweight analysis. Other diagram labels: Highly optimized; Iterative, flexible; Data. “This project seeks to bring these two strengths – a solid reputation for excellence in computing, and our niche expertise in specific classes of intelligence analysis – to bear on a thorny problem: developing advanced informatics capabilities that are both usable and useful to analysts who are drowning in data.” (NGC project proposal)

20 SQL Service Enables Remote Access to Data Warehouse Appliances (DWA). SQL Service*: –Provides “bridge” between parallel apps and external DWA –Runs on Red Storm network nodes –Titan applications communicate with service through Portals –External resources (Netezza) communicate through standard interfaces (e.g. ODBC over TCP/IP). The SQL service enables an HPC application to access a remote DWA. [Diagram labels: Analyst, HPC System (Red Storm), DWA; Tech Area 1, Anywhere, CSRI; Service Nodes (GUI and Database Services); High-Speed Network (Portals); Compute Nodes (Titan Analysis Code); Netezza, LexisNexis, Other ODBC DWA; TCP/IP; SQL] * Results of SQL access from parallel statistics code presented at CUG’2009. Additional Modifications for Multilingual: –Tokenization support on Netezza (goal is to count unique words) –Developed a custom UTF-8 word splitter for SPU (snippet processing unit) –Allows parallel tokenization and counting at storage device
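The idea of tokenizing and counting at the storage device can be sketched as a per-partition count followed by a merge. The Python below is an illustration only and does not reflect the Netezza SPU code: `count_words` and `merge_counts` are hypothetical names, each `bytes` value stands in for the data held by one storage unit, and the regular expression is a simple stand-in for the custom UTF-8 word splitter (real multilingual text may need proper word segmentation).

```python
import re
from collections import Counter
from typing import Iterable

# Unicode-aware run of word characters; a simplification, see lead-in.
WORD = re.compile(r"\w+")


def count_words(raw: bytes) -> Counter:
    """Tokenize one partition of UTF-8 text and count its words locally."""
    text = raw.decode("utf-8", errors="replace")
    return Counter(word.lower() for word in WORD.findall(text))


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Combine per-partition counts into one global count."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Only the small per-partition counters cross the network, and `len(merge_counts(...))` gives the number of unique words that the slide names as the goal.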

21 Outline What is Data-Intensive Computing? Data-Intensive Computing at Sandia –Physics –Informatics –Architectures Into the Future

22 I don’t care about flops anymore. I care about mops. I want to send more complex requests to the storage system. There is no one perfect architecture.

