D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under.

D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

D4M-2 Nicholas Arcolano Michelle Beard Bob Bond Josh Haines Matthew Schmidt Ben Miller Benjamin O’Gwynn Tamara Yu Bill Arcand Bill Bergeron Acknowledgements David Bestor Chansup Byun Matt Hubbell Pete Michaleas Julie Mullen Andy Prout Albert Reuther Tony Rosa Charles Yee Dylan Hutchinson

D4M-3 Introduction Theory Results Summary Outline

D4M-4 Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets Example Applications of Graph Analytics Cyber Graphs represent communication patterns of computers on a network 1,000,000s – 1,000,000,000s network events GOAL: Detect cyber attacks or malicious software Graphs represent communication patterns of computers on a network 1,000,000s – 1,000,000,000s network events GOAL: Detect cyber attacks or malicious software Social Graphs represent relationships between individuals or documents 10,000s – 10,000,000s individual and interactions GOAL: Identify hidden social networks Graphs represent relationships between individuals or documents 10,000s – 10,000,000s individual and interactions GOAL: Identify hidden social networks Graphs represent entities and relationships detected through multi-INT sources 1,000s – 1,000,000s tracks and locations GOAL: Identify anomalous patterns of life Graphs represent entities and relationships detected through multi-INT sources 1,000s – 1,000,000s tracks and locations GOAL: Identify anomalous patterns of life ISR

D4M-5 Four Ecosystems Dominate Cloud Computing Enterprise Big DataDBMS Big Compute Each ecosystem is at the center of a multi-$B market Pros/cons of each are numerous; diverging hardware/software Some missions can exist wholly in one ecosystem; some can’t Each ecosystem is at the center of a multi-$B market Pros/cons of each are numerous; diverging hardware/software Some missions can exist wholly in one ecosystem; some can’t - Interactive - On-demand - Elastic - High performance - Parallel Languages - Scientific computing - Java - Map/Reduce - Easy admin - Indexing - Search - Security

D4M-6 LLGrid MapReduce provides map/reduce interface in a big compute environment D4M provides an interactive parallel scientific computing environment to databases LLGrid MapReduce provides map/reduce interface in a big compute environment D4M provides an interactive parallel scientific computing environment to databases LLGridEnterprise Big DataDBMS - Interactive - On-demand - Elastic - High performance - Parallel Languages - Scientific computing - Java - Map/Reduce - Easy admin - Indexing - Search - Security Big Compute MapReduce Four Ecosystems Dominate Cloud Computing

D4M-7 Big Data + Big Compute Challenge Database Worldview “It’s the data!” Delivering data is the end Supercomputing Worldview “It’s the computer!” Delivering data is the start Shared Compute Separate DataSeparate Compute Shared Data Database and supercomputing views are fundamentally different Have never coexisted; do not know how to coexist Big Data “Analytics” are forcing them together Current standard practice duplicates hardware and data Database and supercomputing views are fundamentally different Have never coexisted; do not know how to coexist Big Data “Analytics” are forcing them together Current standard practice duplicates hardware and data

D4M-8 Big Data + Big Compute Stack High Level Composable API: D4M (“Databases for Matlab”) Weak Signatures, Noisy Data, Dynamics Novel Analytics for: Text, Cyber, Bio Interactive Super- computing High Performance Computing: LLGrid + Hadoop Distributed Database/ Distributed File System Distributed Database: Accumulo (triple store) A C E B Array Algebra Combining Big Compute and Big Data enables entirely new domains

D4M-9 High Level Language: D4M http://www.mit.edu/~kepner/D4M Distributed Database Query: Alice Bob Cathy David Earl Query: Alice Bob Cathy David Earl Associative Arrays Numerical Computing Environment D4M Dynamic Distributed Dimensional Data Model A C D E B A D4M query returns a sparse matrix or a graph… …for statistical signal processing or graph analysis in MATLAB D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization

D4M-10 Introduction Theory –Associate Arrays –Incidence Matrix Results Summary Outline

D4M-11 What are Spreadsheets and Big Tables? Spreadsheets Big Tables Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?) Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?) Simultaneous diverse data: strings, dates, integers, reals, … Simultaneous diverse uses: matrices, functions, hash tables, databases, … No formal mathematical basis; Zero papers in AMA or SIAM

D4M-12 D4M Key Concept: Associative Arrays Unify Four Abstractions Extends associative arrays to 2D and mixed data types A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0 Key innovation: 2D is 1-to-1 with triple store ('alice ','bob ','cited ') or('alice ','bob ',47.0) xATxATx ATAT  alice bob alice carl bob carl cited

D4M-13 Key innovation: mathematical closure –All associative array operations return associative arrays Enables composable mathematical operations A + B A - B A & B A|B A*B Enables composable query operations via array indexing A('alice bob ',:) A('alice ',:) A('al* ',:) A('alice : bob ',:) A(1:2,:) A == 47.0 Simple to implement in a library (~2000 lines) in programming environments with: 1 st class support of 2D arrays, operator overloading, sparse linear algebra Composable Associative Arrays Complex queries with ~50x less effort than Java/SQL Naturally leads to high performance parallel implementation

D4M-14 Associative Array Definitions High level usage dictated by these definitions Deeper algebraic properties set by the collision function f() Frequent switching between “algebras” (how spreadsheets are used)

D4M-15 Associative arrays can be constructed from a few definitions Similar to linear algebra, but applicable to a wider range of data Key questions –Which linear algebra properties do apply to associative arrays (intuitive) –Which linear algebra properties do not apply to associative arrays (watch out) –Which associative array properties do not apply to linear algebra (new) Theory Questions Linear Algebra watch out Associative Arrays intuitive new

D4M-16 Book: “Graph Algorithms in the Language of Linear Algebra” Editors: Kepner (MIT-LL) and Gilbert (UCSB) Contributors: –Bader (Ga Tech) –Bliss (MIT-LL) –Bond (MIT-LL) –Dunlavy (Sandia) –Faloutsos (CMU) –Fineman (CMU) –Gilbert (USCB) –Heitsch (Ga Tech) –Hendrickson (Sandia) –Kegelmeyer (Sandia) –Kepner (MIT-LL) –Kolda (Sandia) –Leskovec (CMU) –Madduri (Ga Tech) –Mohindra (MIT-LL) –Nguyen (MIT) –Radar (MIT-LL) –Reinhardt (Microsoft) –Robinson (MIT-LL) –Shah (USCB) References

D4M-17 Introduction Theory –Associate Arrays –Incidence Matrix Results Summary Outline

D4M-18 Digraphs are Black & White

D4M-19 The World is Color Artist: Ann Pibal; Painting: “XCRS”

D4M-20 5 Edge Colors Artist: Ann Pibal; Painting: “XCRS” Blue Silver Green Orange Pink

D4M-21 20 Vertices Artist: Ann Pibal; Painting: “XCRS” V2 V1 V3 V4 V5 V6 V7 V8 V9 V10 V11 V13 V14V12 V15 V16 V17 V18 V19 V20

D4M-22 1 Isolated Standard Edge Artist: Ann Pibal; Painting: “XCRS” P4

D4M-23 12 Multi Edges Artist: Ann Pibal; Painting: “XCRS” B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2

D4M-24 18 Hyper Edges Artist: Ann Pibal; Painting: “XCRS” B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 O5 P3 B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 P5 P6 P7 P8

D4M-25 27 Edge Orderings Artist: Ann Pibal; Painting: “XCRS” O5 P3 B1,S1,G1,O1,O2,P1 B2,S2,G2,O3,O4,P2 P5 P6 P7 P8 O5 < P3,P6,P7,P8 O5 < B1,S1,G1,O1,O2,P1 O5 < B2,S2,G2,O3,O4,P2 < P7,P8

D4M-26 52 Standard Multi Edges Artist: Ann Pibal; Painting: “XCRS” O5x5 P3x3 (B1,S1,G1,O1,O2,P1)x2 (B2,S2,G2,O3,O4,P2)x4 P5x2 P6x2 P7x2 P8x2

D4M-27 Summary Observations Artist: Ann Pibal; Painting: “XCRS” Standard edge representation fragments hyper edges –Information is lost Digraph representation compresses multi-edges –Information is lost Matrix representation drops edge labels –Information is lost Standard graph representation drops edge order –Information is lost Need edge representation that preserves information

D4M-28 Solution: Incidence Matrix Artist: Ann Pibal; Painting: “XCRS” EdgeColorOrder V01V02V03V04V05V06V07V08V09V10V11V12V13V14V15V16V17V18V19V20 B1Blue2 111 S1Silver2 111 G1Green2 111 O1Orange2 111 O2Orange2 111 P1Pink2 111 B2Blue2 11111 S2Silver2 11111 G2Green2 11111 O3Orange2 11111 O4Orange2 11111 P2Pink2 11111 O5Orange1 111111 P3Pink2 1111 P4Pink2 11 P5Pink2 111 P6Pink2 111 P7Pink3 111 P8Pink3 111

D4M-29 Introduction Theory Results –Network monitoring example –Bioinformatics example Summary Outline

D4M-30 Graph Construction Using D4M: Explode Schema Distributed Database Raw Data CSV Files Assoc. Arrays log_idsrc_ipserver_ip 001128.0.0.1208.29.69.138 002192.168.1.2157.166.255.18 003128.0.0.174.125.224.72 Dense Table src_ip|128.0.0.1src_ip|192.168.1.2server_ip|157.166.255.18server_ip|208.29.69.138server_ip|74.125.224.72 log_id|00110010 log_id|00201100 log_id|00310001 Exploded Table Use as row indices Create columns for each unique type/value pair

D4M-34 Graph Construction Using D4M: Construct Associative Arrays Distributed Database Raw Data CSV Files Assoc. Arrays D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #2 data = T(Row(keys), :); D4M Query #2 data = T(Row(keys), :); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); (‘src_ip|128.0.0.1’,‘server_ip|208.29.69.138’,1) (‘src_ip|128.0.0.1’,‘server_ip|74.125.224.72’,1) (‘src_ip|192.168.1.2’,‘server_ip|157.166.255.18’,1)... (‘src_ip|128.0.0.1’,‘server_ip|208.29.69.138’,1) (‘src_ip|128.0.0.1’,‘server_ip|74.125.224.72’,1) (‘src_ip|192.168.1.2’,‘server_ip|157.166.255.18’,1)... Distributed Database Raw Data CSV Files Assoc. Arrays

D4M-35 Graphs can be constructed with minimal effort using D4M queries and associative array algebra Graph Construction Using D4M: Construct Associative Arrays Distributed Database Raw Data CSV Files Assoc. Arrays D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #1 keys = T(:,’time_stamp|10/May/2011:00:00:00’,:,... ’time_stamp|13/May/2011:23:59:59’,); D4M Query #2 data = T(Row(keys), :); D4M Query #2 data = T(Row(keys), :); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); Associative Array Algebra G = data(:,’src_ip|*’).’ * data(:,’server_ip|*’); Adj(G);

D4M-36 Accumulo Ingestion Scalability Study LLGrid MapReduce With A Python Application Data #1: 5 GB of 200 files Data #2: 30 GB of 1000 files 4 Mil e/s Accumulo Database: 1 Master + 7 Tablet servers

D4M-37 Introduction Theory Results –Network monitoring example –Bioinformatics example Summary Outline

D4M-38 Relative Cost per DNA Sequence Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012 High Volume Sequencer Big Data Energy Efficient Portable Sequencer

D4M-39 Example Disease Outbreak May-July 2011 - Virulent E. Coli Outbreak Germany Sequencing and crowd source analysis showed promising potential -> Still too slow Conclusions: Identification of E. Coli source too late to have substantial impact on illnesses Publishing sequence data allowed for broad community to fully characterize pathogen DNA Sequence released Outbreak identified Spanish Cucumbers implicated Sprouts Identified Deaths www.rki.de EHEC final report diarrhea kidney

D4M-40 RNA Reference Set Collected Sample sequence word (10mer) reference sequence ID unknown sequence ID A1A1 A2A2 A 1 A 2 ' reference bacteria unknown bacteria sequence word (10mer) reference sequence ID unknown sequence ID Sequence Matching  Graph  Sparse Matrix Multiply in D4M Associative arrays provide a natural framework for sequence matching

D4M-41 Database Automatically Computes Reference 10mer Distribution 50% 5% 0.5%

D4M-42 Leveraging “Big Data” Technologies for High Speed Sequence Matching D4M D4M + Triple Store BLAST 100x faster 100x smaller

D4M-43 Big data is found across a wide range of areas –Document analysis –Computer network analysis –DNA Sequencing Currently there is a gap in big data analysis tools for algorithm developers D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation Summary

D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under.

Similar presentations

Presentation on theme: "D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under.

Similar presentations

Presentation on theme: "D4M-1 Jeremy Kepner MIT Lincoln Laboratory 3 October 2012 Transforming Big Data with D4M This work is sponsored by the Department of the Air Force under."— Presentation transcript:

Similar presentations

About project

Feedback