The Similarity Graph: Analyzing Database Access Patterns Within a Company
PhD Preliminary Examination
Robert Searles
Committee: John Cavazos and Stephan Bohacek
University of Delaware, Computer and Information Sciences
In collaboration with JP Morgan Chase & Co.

What Are Many Employees Doing?
SQL queries!
Some employees access the database
Queries reveal information about employees

Motivation
Workforce efficiency
Optimize database
Redundant work?
New collaborations?

Solution
Find computationally friendly representations of queries
  – Directed graphs (trees)
Graph similarity techniques and visualization
  – Find relationships between users
  – Find relationships between business units

What Do SQL Queries Look Like?
A query is a string of text
SELECT FROM WHERE

SQL Query Example 1
SELECT A, B, C AS c, function1(D) AS d, E AS e, F AS f
FROM table1_name
WHERE table1_name.c1 <= 'STR' AND table1_name.c1 < 'STR' AND (G = 'STR' AND C IN ('STR', 'STR'))

SQL Query Example 2
SELECT A, H, E, C, F, G, I, D
FROM table1_name
WHERE table1_name.c1 <= 'STR' AND table1_name.c1 function2(p1, 'STR') AND I <= function2(p1, 'STR'))

Similarity Between SQL Queries
SELECT A, B, C AS c, function1(D) AS d, E AS e, F AS f
FROM table1_name
WHERE table1_name.c1 <= 'STR' AND table1_name.c1 < 'STR' AND (G = 'STR' AND C IN ('STR', 'STR'))

SELECT A, H, E, C, F, G, I, D
FROM table1_name
WHERE table1_name.c1 <= 'STR' AND table1_name.c1 function2(p1, 'STR') AND I <= function2(p1, 'STR'))

Representing SQL Queries
Represent as graphs (parse trees)
  – No cycles
Graphs are a more powerful representation
  – Similarity calculations
  – Structural relationships
Graphs are visually appealing

SQL Parse Tree Example
SELECT A,B FROM C
(parse tree figure not preserved)
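Since the figure is missing, here is a minimal sketch of how such a parse tree could be held as an ordered tree. The node labels are borrowed from the talk's Global Alphabet slide; the exact shape of the tree, the `Node` class, and the helpers `height` and `to_string` are assumptions made for illustration, not the author's data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of an ordered parse tree: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def height(node: Node) -> int:
    """Height counted in nodes, so a leaf has height 1."""
    return 1 + max((height(child) for child in node.children), default=0)


def to_string(node: Node) -> str:
    """Parenthesized encoding that preserves the order of the children."""
    if not node.children:
        return node.label
    inner = " ".join(to_string(child) for child in node.children)
    return f"{node.label}({inner})"


# Assumed tree shape for the query: SELECT A,B FROM C
tree = Node("T_SELECT", [
    Node("SELECT"),
    Node("T_COLUMN_LIST", [
        Node("T_SELECT_COLUMN", [Node("A")]),
        Node("T_SELECT_COLUMN", [Node("B")]),
    ]),
    Node("T_FROM", [Node("FROM", [Node("C")])]),
])

print(to_string(tree))
print(height(tree))
```

Because the children are kept in a list, two queries that differ only in column order produce different trees, which is what an ordered-tree comparison needs.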

Weisfeiler-Lehman
Modified to work on ordered trees
  – Relabel trees according to global alphabet
  – Look for common labels and/or structures
  – Perform subtree analysis on pairs of trees

Global Alphabet
Label             ID
T_SELECT          1
SELECT            2
T_COLUMN_LIST     3
T_FROM            4
T_SELECT_COLUMN   5, 6
FROM              7
A                 8
B                 9
C                 10
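The talk does not say how the ids are assigned. One reading that happens to reproduce the numbers on this slide is a breadth-first numbering of the nodes of the parse tree for SELECT A,B FROM C. The sketch below is only that reading: the tree shape and the helpers `number_breadth_first` and `ids_by_label` are hypothetical.

```python
from collections import deque

# Tree as nested (label, [children]) tuples; the shape is an assumption.
TREE = ("T_SELECT", [
    ("SELECT", []),
    ("T_COLUMN_LIST", [
        ("T_SELECT_COLUMN", [("A", [])]),
        ("T_SELECT_COLUMN", [("B", [])]),
    ]),
    ("T_FROM", [("FROM", [("C", [])])]),
])


def number_breadth_first(root):
    """Return [(id, label), ...] with ids assigned level by level, left to right."""
    numbered = []
    queue = deque([root])
    while queue:
        label, children = queue.popleft()
        numbered.append((len(numbered) + 1, label))
        queue.extend(children)
    return numbered


def ids_by_label(numbered):
    """Group node ids under their label, keeping first-seen label order."""
    table = {}
    for node_id, label in numbered:
        table.setdefault(label, []).append(node_id)
    return table


alphabet = ids_by_label(number_breadth_first(TREE))
for label, ids in alphabet.items():
    print(label, ids)
```

Under this assumed shape the two T_SELECT_COLUMN nodes receive ids 5 and 6, and the leaves A, B, C receive 8, 9, 10.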

SELECT A,B FROM C

Example
SELECT A, B, C AS c, function1(D) AS d, E AS e, F AS f
FROM table1_name
WHERE table1_name.c1 <= 'STR' AND table1_name.c1 < 'STR' AND (G = 'STR' AND C IN ('STR', 'STR'))

SELECT A, H, E, C, F, G, I, D
FROM table1_name
WHERE table1_name.c1 <= 'STR' AND table1_name.c1 function2(p1, 'STR') AND I <= function2(p1, 'STR'))

Tree Similarity Algorithm
Modified version of Weisfeiler-Lehman graph kernel
Walk over nodes in trees T and T'
  – For each pair of nodes
    Count number of matching subtrees (using labels)
    For a match, multiply by the height of the subtree
Normalize
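The steps above can be sketched as follows. This is an illustration, not the talk's implementation: the encoding of subtrees as strings, the names `subtree_signatures`, `raw_score`, and `tree_similarity`, and especially the cosine-style normalization are assumptions, since the slide only says "Normalize".

```python
import math
from typing import List, Tuple

Tree = Tuple[str, list]  # (label, [children]); children are ordered


def subtree_signatures(root: Tree) -> List[Tuple[str, int]]:
    """Return (canonical encoding, height) for the subtree rooted at every node."""
    out: List[Tuple[str, int]] = []

    def visit(node: Tree) -> Tuple[str, int]:
        label, children = node
        encoded = [visit(child) for child in children]
        signature = label + "(" + ",".join(s for s, _ in encoded) + ")"
        node_height = 1 + max((h for _, h in encoded), default=0)
        out.append((signature, node_height))
        return signature, node_height

    visit(root)
    return out


def raw_score(t1: Tree, t2: Tree) -> int:
    """Sum of subtree heights over all node pairs whose subtrees are identical."""
    sig1 = subtree_signatures(t1)
    sig2 = subtree_signatures(t2)
    return sum(h1 for s1, h1 in sig1 for s2, _ in sig2 if s1 == s2)


def tree_similarity(t1: Tree, t2: Tree) -> float:
    """Assumed normalization: divide by the geometric mean of the self-scores."""
    denom = math.sqrt(raw_score(t1, t1) * raw_score(t2, t2))
    return raw_score(t1, t2) / denom if denom else 0.0


def query_tree(columns):
    """Hypothetical parse tree for SELECT <columns> FROM C."""
    column_nodes = [("T_SELECT_COLUMN", [(c, [])]) for c in columns]
    return ("T_SELECT", [
        ("SELECT", []),
        ("T_COLUMN_LIST", column_nodes),
        ("T_FROM", [("FROM", [("C", [])])]),
    ])


q1 = query_tree(["A", "B"])  # SELECT A,B FROM C
q2 = query_tree(["A"])       # SELECT A FROM C
print(raw_score(q1, q2), round(tree_similarity(q1, q2), 3))
```

Weighting each match by its height means that a shared FROM clause subtree counts for more than a shared leaf, so large common structures dominate the score.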

Tree Similarity Algorithm (step-by-step walkthrough; figures not preserved)
Subtree count over successive frames: 0, 0, 0, 0, 2, 2, 2, 2, 2, 3, 3

User Similarity

User Similarity and Betweenness Centrality
Compare collections of queries from each individual
User Similarity
  – Find strong similarities between pairs of individuals
Betweenness Centrality
  – Find individuals who share commonalities with many other users

User Similarity
Let M, N denote the number of queries submitted by Users A, B respectively.
Note that we calculate (sim(A → B) + sim(B → A))/2 in order to obtain a symmetrical matrix.
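The formula for the directed term did not survive in this transcript. As a sketch under stated assumptions, the code below takes sim(A → B) to be the mean, over A's M queries, of the best match among B's N queries, then symmetrizes it as the slide describes. The directed definition, the function names, and the stand-in `token_overlap` measure (used here in place of the tree similarity) are all hypothetical.

```python
def directed_similarity(queries_a, queries_b, query_sim) -> float:
    """Assumed sim(A -> B): mean over A's queries of the best match in B's queries."""
    if not queries_a or not queries_b:
        return 0.0
    total = sum(max(query_sim(qa, qb) for qb in queries_b) for qa in queries_a)
    return total / len(queries_a)


def user_similarity(queries_a, queries_b, query_sim) -> float:
    """Symmetrized score from the slide: (sim(A -> B) + sim(B -> A)) / 2."""
    forward = directed_similarity(queries_a, queries_b, query_sim)
    backward = directed_similarity(queries_b, queries_a, query_sim)
    return (forward + backward) / 2


def token_overlap(q1: str, q2: str) -> float:
    """Stand-in query similarity (Jaccard on whitespace tokens), not the tree kernel."""
    s1, s2 = set(q1.split()), set(q2.split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


alice = ["SELECT A FROM T", "SELECT B FROM T"]
bob = ["SELECT A FROM T"]
print(user_similarity(alice, bob, token_overlap))
```

Each directed term compares all M × N query pairs, which is consistent with the talk's later remark that user similarity is quadratic in the number of queries.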

User Similarity (figures not preserved)

User Similarity
Create similarity matrix
Symmetrical
Visualize

Betweenness Centrality
How central a node is in a graph/network
Tells us which users are similar to many other users
  – Versatile employee
  – Common underlying element
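To make the measure concrete, here is a small sketch that turns a user similarity matrix into an unweighted graph and computes betweenness centrality with Brandes' algorithm. The thresholding step, the threshold value, and the toy matrix are assumptions for illustration; the talk does not say how its graph was built or which implementation it used.

```python
from collections import deque


def similarity_graph(sim, threshold):
    """Adjacency lists: users i and j are linked when sim[i][j] >= threshold."""
    n = len(sim)
    return {i: [j for j in range(n) if j != i and sim[i][j] >= threshold]
            for i in range(n)}


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                      # vertices in non-decreasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        paths = {v: 0 for v in adj}     # number of shortest paths from source
        dist = {v: -1 for v in adj}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:         # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):       # accumulate from the farthest vertices back
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}


# Toy matrix: user 0 resembles everyone, the others resemble only user 0.
sim = [
    [1.0, 0.9, 0.8, 0.7],
    [0.9, 1.0, 0.1, 0.2],
    [0.8, 0.1, 1.0, 0.1],
    [0.7, 0.2, 0.1, 1.0],
]
centrality = betweenness_centrality(similarity_graph(sim, threshold=0.5))
print(centrality)
```

In this toy case user 0 sits on every shortest path between the other three users, which is the "similar to many other users" situation the slide describes.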

Workflow (figure not preserved)

Experimental Results
2 datasets (provided by JP Morgan Chase & Co.)
Anonymized SQL queries
Visualization
  – User similarity
  – Betweenness centrality
  – Heat map

Dataset 1
49,655 queries (provided by JP Morgan Chase & Co.)
427 users
Organized into 22 business units

Dataset 2
787,732 queries (provided by JP Morgan Chase & Co.)
92 users
Organized into 6 business units

Heat Map
Shows all the similarities uncovered in dataset 2
Users sorted according to business unit
Red = similar, Yellow = not similar

Challenges
Volume of data
  – 49,655 × 49,655 = 10 GB approx.
  – 787,732 × 787,732 = 2.5 TB approx.
Computational cost
  – User similarity is quadratic in the number of queries.
  – Calculating the user similarity matrix is n^4
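The two volume figures can be checked with a few lines of arithmetic. The sketch assumes a dense query-by-query matrix holding one 4-byte value per entry; the slide does not state the element size, but 4 bytes is the value that reproduces both of its estimates.

```python
BYTES_PER_ENTRY = 4  # assumption: one 32-bit float per similarity value


def dense_matrix_bytes(n_queries: int) -> int:
    """Storage for a dense n x n matrix of pairwise query similarities."""
    return n_queries * n_queries * BYTES_PER_ENTRY


for name, n in [("Dataset 1", 49_655), ("Dataset 2", 787_732)]:
    size = dense_matrix_bytes(n)
    print(f"{name}: {n * n:,} entries, {size / 1e9:,.1f} GB")
```

This gives roughly 9.9 GB for dataset 1 and roughly 2,482 GB (about 2.5 TB) for dataset 2, in line with the slide.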

Acceleration
Compute user similarity in parallel
  – No data dependencies
If using accelerator, compute tree similarities in parallel
OpenMP
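The "no data dependencies" point can be illustrated with a short sketch in which every matrix entry is computed by an independent task. The talk used OpenMP; the Python thread pool below is only an analogy for that structure, and the names `pairwise_matrix` and `jaccard`, the toy data, and the diagonal value of 1.0 are assumptions. Python threads will not speed up CPU-bound scoring, so a process pool or native code would be needed for real gains.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pairwise_matrix(items, score, workers=4):
    """Fill a symmetric matrix; each (i, j) entry is computed independently."""
    n = len(items)
    # Assumption: an item is fully similar to itself, so the diagonal is 1.0.
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    pairs = list(combinations(range(n), 2))

    def compute(pair):
        i, j = pair
        return i, j, score(items[i], items[j])

    # No task reads another task's result, so the pairs can run in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, j, value in pool.map(compute, pairs):
            matrix[i][j] = matrix[j][i] = value
    return matrix


def jaccard(a, b):
    """Stand-in similarity between two sets, not the talk's measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


users = [{"q1", "q2"}, {"q2", "q3"}, {"q1", "q2"}]
result = pairwise_matrix(users, jaccard)
print(result)
```

Only the upper triangle is computed and then mirrored, which halves the work for a symmetric measure.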

Experimental Setup
Machine: 16 GB of memory and 2x AMD Opteron 6320 CPUs, 8 cores per CPU clocked at 1.4 GHz
Dataset 1: 49,655 queries
Dataset 2: 787,732 queries

Acceleration (results figure not preserved)

Contributions
Created an algorithm that was used to measure the similarity between employees in a workforce (JP Morgan)
Designed a scalable, encoding-based algorithm to measure the similarity between SQL queries
Used these algorithms to create a visual representation of employee similarity across a workforce

Conclusion
Developed a framework for calculating similarity between users/analysts
Discovered similarities in the data
Found similar users in different parts of the company

Acknowledgement
Special thanks to JP Morgan Chase & Co.
  – Collaboration on a real-world problem
  – Provided anonymized data