Big Data Visual Analytics: A User-Centric Approach. Remco Chang, Assistant Professor, Computer Science, Tufts University.

Presentation transcript:

Remco Chang | Tufts University 1/38 Big Data Visual Analytics: A User-Centric Approach
Remco Chang, Assistant Professor, Computer Science, Tufts University

2/38 Financial Fraud: A Case for Visual Analytics
• Financial institutions such as Bank of America have legal responsibilities to report all suspicious wire transaction activities
  - Money laundering, supporting terrorist activities, etc.
• Data size: approximately 200,000 transactions per day (73 million transactions per year)

3/38 Financial Fraud: A Case Study for Visual Analytics
• Problems:
  - Automated approaches can only detect known patterns
  - Bad guys are smart: patterns are constantly changing
• Previous methods:
  - 10 analysts monitoring and analyzing all transactions
  - Using SQL queries and spreadsheet-like interfaces
  - Limited time scale (2 weeks)

4/38 WireVis: Financial Fraud Analysis
• In collaboration with Bank of America
• Visualizes 7 million transactions over 1 year
• A great problem for visual analytics:
  - Ill-defined problem (how does one define fraud?)
  - Limited or no training data (patterns keep changing)
  - Requires human judgment in the end (involves law enforcement agencies)
R. Chang et al., Scalable and interactive visual analysis of financial wire transactions for fraud detection. Information Visualization, 2008.
R. Chang et al., WireVis: Visualization of categorical, time-varying data from financial transactions. IEEE VAST, 2007.

5/38 WireVis: A Visual Analytics Approach
• Heatmap View (Accounts to Keywords Relationship)
• Multiple Temporal View (Relationships over Time)
• Search by Example (Find Similar Accounts)
• Keyword Network (Keyword Relationships)

6/38 Evaluation
• Challenging: lack of ground truth
• Two types of evaluations:
  - Grounded evaluation: real analysts, real data
    - Find transactions that existing techniques can find
    - Find new transactions that appear suspicious
  - Controlled evaluation: real analysts, synthetic data
    - Find all injected threat scenarios
• Adoption and deployment

7/38 Good Lessons Learned
• Analyst behavior
  - 90% of time on exploratory data analysis (EDA)
  - 10% on confirmation (confirmatory data analysis, CDA)
• Big data analysis == fast hypothesis testing
• High interactivity is key
• Users can wait to find the exact answer

8-12/38 Interactive Visualization Systems
• Political Simulation: agent-based analysis (Jordan Crouser)
  R. Chang et al., Two Visualization Tools for Analysis of Agent-Based Simulations in Political Science. IEEE CG&A, 2012.
• Bridge Maintenance: exploring inspection reports
  R. Chang et al., An Interactive Visual Analytics System for Bridge Management. EuroVis, 2010.
• Biomechanical Motion: interactive motion comparison
  R. Chang et al., Interactive Coordinated Multiple-View Visualization of Biomechanical Motion Data. IEEE Vis (TVCG), 2009.
• Interactive Metric Learning: DisFunction, learn a model from projection (Eli Brown)
  R. Chang et al., Dis-function: Learning Distance Functions Interactively. IEEE VAST, 2012.
• High-D Data Exploration: iPCA, interactive PCA
  R. Chang et al., iPCA: An Interactive System for PCA-based Visual Analytics. EuroVis, 2009.

14/38 “Tough” Lessons Learned
• Careful engineering is not enough. A new paradigm is necessary to support this type of interactive analysis.

15/38 Problem Statement
• Visualization on commodity hardware
• Large data in a data warehouse

16/38 Related Work
• (See the DSIA workshop proceedings)
  - Organized with Carlos Scheidegger (Arizona), Jeff Heer (UW), Danyel Fisher (Microsoft Research)
• Specialized pull-based databases
  - Tableau, Spotfire
• Pre-compiled data cubes
  - Nanocube (Scheidegger), imMens** (Liu, Heer), Map-D** (Mostak)
• Sampling
  - BlinkDB (Agrawal, Berkeley), DICE (Kamat, Nandi), Ordering guarantees (Kim et al.)
• Pre-fetching
  - Xmdv (Doshi, Ward), Time-series (Chan, Hanrahan), Query prediction (Cetintemel, Zdonik)
• Others
  - Streaming (Fisher), Optimization (Wu)

17/38 Two Observations
1. The number of possible actions is finite and the user’s actions are “logical”.
2. Visualization itself is a bottleneck.

18/38 Two Observations
1. The number of possible actions is finite and the user’s actions are “logical”.
2. Visualization itself is a bottleneck.
  - A 1000 x 1000 pixel display has only 1 million pixels, so 7 million data points lead to a 7:1 aggregation
  - The user’s perception and cognition are further limitations
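
The pixel-budget arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the choice to round up are assumptions, not something stated in the talk.

```python
import math


def aggregation_ratio(num_points: int, width_px: int, height_px: int) -> int:
    """Return how many data points must share one pixel, rounded up."""
    pixels = width_px * height_px
    return math.ceil(num_points / pixels)


# 7 million data points on a 1000 x 1000 pixel canvas (figures from the slide).
ratio = aggregation_ratio(7_000_000, 1000, 1000)
print(f"{ratio}:1")  # 7:1
```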

19/38 Problem Statement
• Problem: data is too big to fit into the memory of a personal computer
  - Note: ignoring various database technologies (OLAP, column-store, NoSQL, array-based, etc.)
• Goal: guarantee a result set for a user’s query within X seconds
  - Based on HCI research, the upper bound for X is 10 seconds
  - Ideally, we would like to get it down to 1 second or less
• Method: trade accuracy and storage (caching) to minimize latency (user wait time)
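
One way to read the "within X seconds" goal is as a byte budget: given an assumed delivery rate, the budget fixes how much data can be returned, and anything larger has to be reduced. The sketch below is a hypothetical illustration of that trade-off. The function name and the 100 MB/s throughput are assumptions made for the example, not figures from the talk.

```python
import math


def reduction_factor(est_bytes: int, bytes_per_second: int, budget_seconds: float) -> int:
    """Smallest integer factor by which a result must shrink to fit the time budget."""
    if bytes_per_second <= 0 or budget_seconds <= 0:
        raise ValueError("throughput and budget must be positive")
    max_bytes = bytes_per_second * budget_seconds
    return max(1, math.ceil(est_bytes / max_bytes))


# Assumed numbers: an 8 GB result and a pipeline that delivers 100 MB/s.
print(reduction_factor(8_000_000_000, 100_000_000, 10))  # 8  (10-second upper bound)
print(reduction_factor(8_000_000_000, 100_000_000, 1))   # 80 (1-second target)
```

A factor of 1 means the query can run unchanged. Larger factors quantify how much accuracy has to be given up to stay inside the budget.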

20/38 Our Approach: Predictive Pre-Fetching
• In collaboration with MIT (Leilani Battle, Mike Stonebraker)
• ForeCache: three-tiered architecture
  - Thin client (visualization)
  - Backend (array-based database)
  - Fat middleware
    - Prediction algorithms
    - Storage architecture
    - Cache management (eviction strategies)
R. Chang et al., Dynamic Prefetching of Data Tiles for Interactive Visualization. To appear in SIGMOD 2016.
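
The middleware's role (serve from cache, fall back to the backend, prefetch what a predictor suggests, evict when full) can be sketched roughly as follows. This is not ForeCache's implementation: the class name, the least-recently-used eviction policy, and the callable interfaces are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict
from typing import Any, Callable, Iterable


class TileMiddleware:
    """Hypothetical middle tier: an LRU tile cache with predictive prefetching."""

    def __init__(
        self,
        fetch_from_backend: Callable[[int], Any],
        predict_next: Callable[[int], Iterable[int]],
        capacity: int = 4,
    ) -> None:
        self.fetch = fetch_from_backend   # stands in for the database tier
        self.predict = predict_next       # stands in for the prediction algorithms
        self.capacity = capacity
        self.cache: "OrderedDict[int, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _store(self, tile_id: int, data: Any) -> None:
        self.cache[tile_id] = data
        self.cache.move_to_end(tile_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used tile

    def request(self, tile_id: int) -> Any:
        if tile_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(tile_id)
            data = self.cache[tile_id]
        else:
            self.misses += 1
            data = self.fetch(tile_id)
            self._store(tile_id, data)
        # Prefetch the tiles the predictor expects the user to ask for next.
        for nxt in self.predict(tile_id):
            if nxt not in self.cache:
                self._store(nxt, self.fetch(nxt))
        return data


mw = TileMiddleware(lambda t: f"tile-{t}", lambda t: [t + 1])
mw.request(13)   # miss: fetched from the backend, tile 14 prefetched
mw.request(14)   # hit: already prefetched
print(mw.hits, mw.misses)  # 1 1
```

In a real system the prefetch would run asynchronously so it never delays the response. It is done inline here only to keep the sketch short.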

22/38 Prediction Algorithms
• General idea:
  - Lots of “experts”
    - Represent different prediction algorithms: image based, statistics based, interaction based
    - (See our other publications on this topic)
  - One “manager”
    - Chooses which expert to listen to
  - Iterate
    - Manager builds “trust” in the experts
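
The expert-manager loop can be illustrated with a simple trust-weighting scheme. The slide does not say how trust is computed, so the multiplicative penalty below is an assumption (a standard weighted-experts heuristic), and the two toy experts are hypothetical stand-ins for the image, statistics, and interaction based predictors.

```python
from typing import Callable, Dict

Expert = Callable[[int], int]  # maps the current block id to a predicted next block id


class Manager:
    """Hypothetical manager that trusts experts according to their track record."""

    def __init__(self, experts: Dict[str, Expert], penalty: float = 0.5) -> None:
        self.experts = experts
        self.penalty = penalty
        self.trust = {name: 1.0 for name in experts}  # everyone starts equal

    def choose(self) -> str:
        # Listen to the most trusted expert (ties go to the first one registered).
        return max(self.trust, key=self.trust.get)

    def predict(self, current: int) -> int:
        return self.experts[self.choose()](current)

    def update(self, current: int, actual_next: int) -> None:
        # After seeing the user's real request, penalise every expert that was wrong.
        for name, expert in self.experts.items():
            if expert(current) != actual_next:
                self.trust[name] *= self.penalty


manager = Manager({
    "pan_right": lambda block: block + 1,
    "pan_left": lambda block: block - 1,
})
print(manager.choose())      # pan_right (initial tie)
manager.update(13, 12)       # the user actually moved from block 13 to block 12
print(manager.choose())      # pan_left
print(manager.predict(12))   # 11
```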

23-28/38 Prediction example (animation frames)
• Iteration 0: user requests data block 13
• Iteration 1

29/38 Study Results
• Using a simple Google Maps-like interface
• 18 users explored the NASA MODIS dataset
• Tasks include “find 4 areas in Europe that have a snow coverage index above 0.5”

30/38 Worst Case Scenario: Cache Miss
• User requests data block 52

31/38 Cache Miss
• How to guarantee response time when there’s a cache miss?
• Trick: the ‘EXPLAIN’ command
  - Usage: explain select * from myTable;
  - Returns the query plan and a cost estimate for running the query
R. Chang et al., Dynamic Reduction of Result Sets for Interactive Visualization, IEEE Big Data Workshop on Visualization.
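
As a rough illustration of the EXPLAIN trick, the sketch below parses the cost annotation that PostgreSQL prints on a plan line, `(cost=startup..total rows=N width=W)`, and derives an approximate result size before the query is run. The sample plan line, the byte threshold, and the helper names are illustrative assumptions. A real middleware would obtain the line by sending `EXPLAIN <query>` to the database.

```python
import re
from typing import NamedTuple


class PlanEstimate(NamedTuple):
    startup_cost: float
    total_cost: float
    rows: int
    width: int  # estimated average row width in bytes


_COST_RE = re.compile(
    r"\(cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?) rows=(\d+) width=(\d+)\)"
)


def parse_plan_line(line: str) -> PlanEstimate:
    """Extract the planner's estimates from one PostgreSQL-style EXPLAIN line."""
    match = _COST_RE.search(line)
    if match is None:
        raise ValueError(f"no cost estimate found in: {line!r}")
    return PlanEstimate(float(match[1]), float(match[2]), int(match[3]), int(match[4]))


def too_expensive(estimate: PlanEstimate, max_bytes: int) -> bool:
    """Decide, before running the query, whether the result would be too large."""
    return estimate.rows * estimate.width > max_bytes


# Illustrative plan line (made up for this example, not captured from a real run).
sample = "Seq Scan on mytable  (cost=0.00..155.00 rows=10000 width=4)"
est = parse_plan_line(sample)
print(est.rows * est.width)        # 40000 estimated bytes
print(too_expensive(est, 10_000))  # True
```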

32/38 Example EXPLAIN Output from SciDB
• Example SciDB output for (a query similar to) EXPLAIN SELECT * FROM earthquake:
    [("[pPlan]: schema earthquake <datetime:datetime NULL DEFAULT null, magnitude:double NULL DEFAULT null, latitude:double NULL DEFAULT null, longitude:double NULL DEFAULT null> [x=1:6381,6381,0,y=1:6543,6543,0] bound start {1, 1} end {6381, 6543} density 1 cells chunks 1 est_bytes e+09 ")]
• The schema lists the four attributes of the table ‘earthquake’
• The dimensions of this array (table) are 6381 x 6543
• This query will touch data elements from (1, 1) to (6381, 6543), totaling 41,750,883 cells
• The estimated size of the returned data is e+09 bytes (~8GB)
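
The cell count follows directly from the `bound start {…} end {…}` part of the plan. The sketch below pulls the bounds out of that text with a regular expression and multiplies the inclusive extents. The regular expression and function name are assumptions written against the excerpt shown on the slide, not a SciDB API.

```python
import re

_BOUND_RE = re.compile(r"bound start \{(\d+), (\d+)\} end \{(\d+), (\d+)\}")


def cells_touched(plan_text: str) -> int:
    """Count the cells inside the 2-D bounding box reported in the plan text."""
    match = _BOUND_RE.search(plan_text)
    if match is None:
        raise ValueError("plan text has no 2-D bounding box")
    x0, y0, x1, y1 = (int(g) for g in match.groups())
    # Bounds are inclusive on both ends, hence the +1 on each extent.
    return (x1 - x0 + 1) * (y1 - y0 + 1)


# Excerpt copied from the EXPLAIN output on the slide.
plan = "[x=1:6381,6381,0,y=1:6543,6543,0] bound start {1, 1} end {6381, 6543}"
print(f"{cells_touched(plan):,}")  # 41,750,883
```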

33-35/38 Other Examples
• Oracle 11g Release 1 (11.1)
• MySQL 5.0
• PostgreSQL 7.3.4

36/38 Reduction Strategies
• If the query is estimated to be too expensive to execute, the middleware dynamically “modifies” the query by using:
  - Aggregation: in SciDB, this operation is carried out as regrid (scale_factorX, scale_factorY)
  - Sampling: in SciDB, uniform sampling is carried out as bernoulli (query, percentage, randseed)
  - Filtering: currently, the filtering criterion is user specified as where (clause)
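
To make the three strategies concrete, here is a pure-Python stand-in that applies each one to a small in-memory grid. It mimics what the operators do to the data. It does not call SciDB, and the function names, the use of the mean as the aggregate, and the seeded generator are all assumptions for illustration.

```python
import random
from typing import Callable, List

Grid = List[List[float]]


def regrid_mean(grid: Grid, factor_x: int, factor_y: int) -> Grid:
    """Aggregation: average non-overlapping blocks of factor_y rows by factor_x columns."""
    rows, cols = len(grid), len(grid[0])
    out: Grid = []
    for r in range(0, rows, factor_y):
        out_row = []
        for c in range(0, cols, factor_x):
            block = [
                grid[i][j]
                for i in range(r, min(r + factor_y, rows))
                for j in range(c, min(c + factor_x, cols))
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def bernoulli_sample(cells: List[float], probability: float, seed: int) -> List[float]:
    """Sampling: keep each cell independently with the given probability."""
    rng = random.Random(seed)  # seeded so the same query yields the same sample
    return [cell for cell in cells if rng.random() < probability]


def filter_cells(cells: List[float], predicate: Callable[[float], bool]) -> List[float]:
    """Filtering: keep only the cells that satisfy a user-specified condition."""
    return [cell for cell in cells if predicate(cell)]


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(regrid_mean(grid, 2, 2))   # [[3.5, 5.5], [11.5, 13.5]]
flat = [value for row in grid for value in row]
print(len(bernoulli_sample(flat, 0.5, seed=42)) <= len(flat))  # True
print(filter_cells(flat, lambda v: v > 12))                    # [13, 14, 15, 16]
```

Aggregation keeps every region represented at lower resolution, sampling keeps full-resolution cells but drops most of them, and filtering depends on the user knowing what to keep, which is why the middleware can apply the first two automatically.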

37/38 Summary
• Big data visual analytics requires fast interactive data systems
  - A growing subfield in DB, VIS, and ML
• Our approach:
  1. Predictive pre-fetching
  2. Three-tiered system
  3. Pre-fetching based on an “expert-manager” approach
  4. Use the “explain” trick to handle cache misses
  5. Guarantees response time, but not data quality
• Backbone (invisible) to data analysts

38/38 Questions?
cs.tufts.edu