1 Visualization and Data Mining techniques By- Group number- 14 Chidroop Madhavarapu(105644921) Deepanshu Sandhuria(105595184) Data Mining CSE 634 Prof.

Presentation transcript:

1 Visualization and Data Mining techniques By- Group number- 14 Chidroop Madhavarapu( ) Deepanshu Sandhuria( ) Data Mining CSE 634 Prof. Anita Wasilewska

2 References zSzdeptzSzuserszSzkumarzSzdatavis.pdf/ganesh96visual.pdf IEEE%20Trans%20Vis.pdf users.cs.umn.eduzSzzCz7EctluzSzPaperTalkFilezSzits02.pdf/shekhar02cubeview.pdf users.cs.umn.eduzSzzCz7EpushengzSzpubzSzkdd2001zSzkdd.pdf/shekhar01detecting.pdf

3 Motivation Visualization for Data Mining Huge amounts of information Limited display capacity of output devices Visual Data Mining (VDM) is a new approach for exploring very large data sets, combining traditional mining methods and information visualization techniques.

4 Why Visual Data Mining

5

6 VDM Approach VDM takes advantage of both the power of automatic calculations and the capabilities of human processing. Human perception offers phenomenal abilities to extract structures from pictures.

7 Levels of VDM No or very limited integration: corresponds to the application of either traditional information visualization or automated data mining methods. Loose integration: visualization and automated mining methods are applied sequentially; the result of one step can be used as input for another step. Full integration: automated mining and visualization methods are applied in parallel, and their results are combined.

8 Methods of Data Visualization Different methods are available for visualization of data based on type of data Data can be Univariate Bivariate Multivariate

9 Univariate data Measurement of single quantitative variable Characterize distribution Represented using following methods Histogram Pie Chart

10 Histogram

11 Pie Chart

12 Bivariate Data Consists of paired samples of two quantitative variables. The variables are related. Represented using the following methods: Scatter plots, Line graphs

13 Scatter plots

14 Line graphs

15 Multivariate Data Multi dimensional representation of multivariate data Represented using following methods Icon based methods Pixel based methods Dynamic parallel coordinate system

16 Icon based Methods

17 Pixel Based Methods Approach: Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map). The values of each attribute are presented in separate sub windows. Examples: Dense Pixel Displays

18 Dense Pixel Display Approach: Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map). Different attributes are presented in separate sub windows.
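
The pixel-based approach above can be sketched in a few lines. This is only an illustration, not the implementation behind any particular dense pixel display: the attribute names, the sample values, the 256-level color map and the row-by-row pixel arrangement are all assumptions made for the example.

```python
def to_color_index(values, levels=256):
    """Map raw attribute values onto integer indices 0..levels-1 of a fixed color map."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: every pixel gets the lowest color
        return [0] * len(values)
    scale = (levels - 1) / (hi - lo)
    return [round((v - lo) * scale) for v in values]


def sub_window(values, width):
    """Arrange one attribute's color indices row by row into a sub-window of the given width."""
    colors = to_color_index(values)
    return [colors[i:i + width] for i in range(0, len(colors), width)]


# Hypothetical data set: two attributes measured on the same six records.
records = {
    "volume": [10, 40, 25, 90, 60, 10],
    "occupancy": [0.1, 0.5, 0.3, 0.9, 0.7, 0.2],
}

# One sub-window per attribute; record i sits at the same pixel position in each window.
windows = {name: sub_window(vals, width=3) for name, vals in records.items()}
for name, grid in windows.items():
    print(name, grid)
```

Because every record occupies the same position in each sub-window, a viewer can compare attributes by looking at matching pixel locations across windows.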

19 Visual Data Mining: Framework and Algorithm Development Ganesh, M., Han, E.H., Kumar, V., Shekhar, S., & Srivastava, J. (1996). Working Paper. Twin Cities, MN: University of Minnesota, Twin Cities Campus.

20 References :zSzzSzftp.cs.umn.eduzSzdeptzSzuserszSzkumarzSzdatavis.pdf/ganesh96visual.pdf Visualization%20in%20DM-IEEE%20Trans%20Vis.pdf

21 Abstract VDM refers to the use of visualization techniques in the Data Mining process to evaluate, monitor, and guide that process. This paper provides a framework for VDM via the loose coupling of databases and visualization systems. The paper applies VDM towards designing new algorithms that can learn decision trees by manually refining some of the decisions made by well-known algorithms such as C4.5.

22 Components of VQLBCI The three major components of VQLBCI are Visual Representations, Computations and Events.

23 Visual Development of Algorithms The most interesting use of visual data mining is the development of new insights and algorithms. The figure below shows the ER diagram for learning classification decision trees. This model allows the user to monitor the quality and impact of decisions made by the learning procedure. The learning procedure can be refined interactively via a visual interface.

24 ER diagram for the search space of decision tree learning algorithm

25 General Framework Learning a classification decision tree from a training data set can be regarded as a process of searching for the best decision tree that meets user-provided goal constraints. The problem space of this search process consists of Model Candidates, Model Candidate Generator and Model Constraints. Many existing classification-learning algorithms like C4.5 and CDP fit nicely within this search framework. New learning algorithms that fit user’s requirements can be developed by defining the components of the problem space.

26 General Framework Model Candidate corresponds to the partial classification decision tree. Each node of the decision tree is a Model Atom. The search process is the process of finding a final model candidate that meets the user's goal specifications. Model Candidate Generator transforms the current model candidate into a new model candidate by selecting one model atom to expand from the expandable leaf model atoms. Model Constraints (used by Model Candidate Generator) provide controls and boundaries to the search space.

27 Search Process

28 Acceptability Constraint Model Constraints consist of Acceptability constraints, Expandability constraints and a Data-Entropy calculation function. An Acceptability constraint predicate specifies when a model candidate is acceptable and thus allows the search process to stop. Examples: A1) Total number of expandable leaf model atoms = 0. A2) Overall error rate of the model candidate <= acceptable error rate. A3) Total number of model atoms in the model candidate >= maximal allowable tree size. A1 is used in C4.5 and CDP.

29 Expandability Constraint An Expandability constraint predicate specifies whether a leaf model atom is expandable or not. Example: C4.5 uses E1 and E2; CDP uses E2 and E3.

30 Traversal Strategy Traversal strategy ranks expandable leaf model atoms based on the model atom attributes. EX: Increasing order of depth Decreasing order of depth Orders based on other model atom attributes.
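
The search framework described on the preceding slides (model candidate, model atoms, candidate generator, constraints, traversal strategy) can be sketched as a small loop with pluggable predicates. This is a toy illustration under stated assumptions, not the authors' system: the `ModelAtom` class, the split rule (split on the binary feature indexed by the atom's depth), the expandability predicate and the sample data are all hypothetical, and no data-entropy function is used. Only acceptability constraint A1 and the "increasing order of depth" traversal come from the slides.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ModelAtom:
    """One decision-tree node: the training cases that reach it and its depth."""
    cases: list                      # list of (features, label) pairs
    depth: int = 0
    children: list = field(default_factory=list)

    @property
    def errors(self):
        """Training cases misclassified if this atom predicts its majority label."""
        if not self.cases:
            return 0
        counts = Counter(label for _, label in self.cases)
        return len(self.cases) - counts.most_common(1)[0][1]


def leaves(atom):
    """All leaf model atoms of the model candidate rooted at `atom`."""
    if not atom.children:
        return [atom]
    return [leaf for child in atom.children for leaf in leaves(child)]


def search(root, expandable, acceptable, rank, expand):
    """Expand one leaf at a time until the acceptability predicate holds."""
    while True:
        candidates = [a for a in leaves(root) if expandable(a)]
        if not candidates or acceptable(root, candidates):
            return root
        expand(rank(candidates)[0])  # the traversal strategy picks the next atom


def expand_on_next_feature(atom):
    """Toy generator step (assumption): split on the binary feature at index `depth`."""
    for value in (0, 1):
        subset = [c for c in atom.cases if c[0][atom.depth] == value]
        atom.children.append(ModelAtom(subset, atom.depth + 1))


NUM_FEATURES = 2


def expandable(atom):
    """Hypothetical expandability predicate: impure and with features left to split on."""
    return atom.errors > 0 and atom.depth < NUM_FEATURES


def acceptable(root, candidates):
    """A1 from the slides: no expandable leaf model atoms remain."""
    return len(candidates) == 0


def by_increasing_depth(candidates):
    """Traversal strategy: rank expandable leaves by increasing depth."""
    return sorted(candidates, key=lambda a: a.depth)


data = [((0, 0), "no"), ((0, 1), "yes"), ((1, 0), "yes"), ((1, 1), "yes")]
tree = search(ModelAtom(data), expandable, acceptable,
              by_increasing_depth, expand_on_next_feature)
total_errors = sum(leaf.errors for leaf in leaves(tree))
print(len(leaves(tree)), total_errors)
```

Swapping in a different `acceptable`, `expandable` or ranking function changes the learning algorithm without touching the loop, which is the point the framework makes.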

31 Steps in Visual Algorithm Development No single algorithm is the best all the time; performance is highly data dependent. By changing different predicates of the model constraints, users can construct new classification-learning algorithms. This enables users to find the algorithm that works best on a given data set. Two algorithms are developed: BF, based on the Best-First search idea, and CDP+, which is a modification of CDP.

32 BF This algorithm is based on the Best-First search idea. For Acceptability criteria, it includes A1 and A2 with a user-specified acceptable error rate. The Traversal strategy chosen is T3. In Best-First, expandable leaf model atoms are ranked in decreasing order of the number of misclassified training cases (local error rate * size of subset training data set). The traversal strategy expands the model atom that has the most misclassified training cases, thus reducing the overall error rate the most.
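
As a small worked example of the Best-First ranking, the snippet below orders expandable leaves by local error rate times subset size. The leaf names and numbers are made up for illustration.

```python
def misclassified(local_error_rate, subset_size):
    """Number of misclassified training cases at a leaf model atom."""
    return local_error_rate * subset_size


# Hypothetical expandable leaves: (name, local error rate, size of training subset)
expandable_leaves = [("a", 0.50, 10), ("b", 0.10, 200), ("c", 0.25, 40)]

# Best-First traversal: decreasing order of misclassified training cases.
ranked = sorted(expandable_leaves,
                key=lambda leaf: misclassified(leaf[1], leaf[2]),
                reverse=True)
order = [name for name, _, _ in ranked]
print(order)
```

Leaf "b" has the lowest error rate but the largest subset, so it holds the most misclassified cases (about 20, against 10 for "c" and 5 for "a") and is expanded first.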

33 CDP+ CDP+ is a modification of CDP. CDP has dynamic pruning using expandability constraint E3. Here, the depth is modified according to the size of the training data set of the model atom. In the depth setting, B is the branching factor of the decision tree, t is the size of the training data set belonging to the model atom, and T is the whole training data set.

34 Comparison of different classification learning algorithms

35 Experiment The new BF and CDP+ algorithms are compared with the C4.5 and CDP algorithms. Various metrics are selected to compare the efficiency, accuracy and final decision tree size of the classification algorithms. The generation efficiency of the nodes is measured in terms of the total number of nodes generated. To compare the accuracy of the various algorithms, the mean classification error on the test data sets has been computed.

36 Classification error for 10 data sets

37 Nodes generated for 10 data sets

38 Final decision tree size

39 Results/Conclusion CDP has accuracy comparable to C4.5 while generating considerably fewer nodes. CDP+ has accuracy comparable to C4.5 while generating considerably fewer nodes. CDP+ outperformed CDP in error rate and number of nodes generated. Considering all performance metrics together, CDP+ is the best overall algorithm. Considering classification accuracy alone, C4.5P is the winner.

40 Conclusion Different datasets require different algorithms for best results. Diverse user requirements put different constraints on the final decision tree. The experiment shows that Interactive Visual Data Mining Framework can help find the most suitable algorithm for a given data set and group of user requirements.

41 Data Mining for Selective Visualization of Large Spatial Datasets. Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, Rulin Liu. Computer Science & Engineering Department, University of Minnesota. Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'02), Washington, DC, USA, November 2002.

42 References users.cs.umn.eduzSzzCz7EctluzSzPaperTalkFilezSzits02.pdf/shekhar02cubeview.pdf users.cs.umn.eduzSzzCz7EpushengzSzpubzSzkdd2001zSzkdd.pdf/shekhar01detecting.pdf

43 Basic Terminology Spatial databases: alphanumeric data + geographical coordinates. Spatial mining: mining of spatial databases. Spatial data warehouse: contains geographical data. Spatial outliers: observations that appear to be inconsistent with the remainder of that set of data.

44 Spatial Cluster

45 Contribution Propose and implement the CubeView visualization system General data cube operations Built on the concept of spatial data warehouse to support data mining and data visualization Efficient and scalable spatial outlier detection algorithms

46 Challenges in spatial data mining First, classical data mining deals with numbers and categories, whereas spatial data consists of more complex and extended objects such as points, lines and polygons. Second, classical data mining works with explicit inputs, whereas spatial predicates and attributes are often implicit. Third, classical data mining treats each input independently of other inputs.

47 Application Domain The Traffic Management Center of the Minnesota Department of Transportation (MNDOT) maintains a database that archives sensor network measurements. The sensor network includes about nine hundred stations, each of which contains one to four loop detectors. Measurements: volume and occupancy. Volume is the number of vehicles passing through a station in a 5-minute interval. Occupancy is the percentage of time a station is occupied by vehicles.

48 Basic Concepts Spatial Data Warehouse Spatial Data Mining Spatial Outliers Detection

49 Spatial Data Warehouse Employs data cube structure Outputs - albums of maps. Traffic data warehouse Measures - volume and occupancy Dimensions - time and space.

50 Spatial Data Mining The process of discovering interesting and useful but implicit spatial patterns. The key goal is to partially 'automate' knowledge discovery: search for "nuggets" of information embedded in very large quantities of spatial data.

51 Spatial Outliers Detection Suspiciously deviating observations Local instability Each Station Spatial attributes – time, space Non spatial attributes – volume, occupancy

52 Basic Structure – CubeView

53 CubeView Visualization System Each node in the cube is a visualization style. S: traffic volume of each station at all times. T_TD: time of the day. T_DW: day of the week. S T_TD: daily traffic volume of each station. T_TD T_DW S: traffic volume at each station at different times on different days.
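
To make the cube nodes concrete, the sketch below computes one aggregate per combination of the three dimensions (station S, time of day T_TD, day of week T_DW). It is a minimal illustration, not CubeView's code: the fact rows are invented, and summing volume is an assumed aggregate function.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("S", "T_TD", "T_DW")   # station, time of day, day of week

# Hypothetical fact rows: (station, time-of-day slot, day of week, volume)
rows = [
    ("s1", "08:00", "Mon", 120),
    ("s1", "17:00", "Mon", 150),
    ("s1", "08:00", "Tue", 110),
    ("s2", "08:00", "Mon", 80),
    ("s2", "17:00", "Tue", 95),
]


def aggregate(rows, dims):
    """Total volume grouped by the chosen subset of dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


# One aggregate per lattice node, from the empty grouping up to all three dimensions.
lattice = {
    dims: aggregate(rows, dims)
    for k in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, k)
}

print(lattice[("S",)])          # volume per station over all times
print(lattice[("S", "T_TD")])   # volume per station and time of day
```

Three dimensions give 2^3 = 8 nodes, and each node's table is the data behind one visualization style.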

54 Dimension Lattice

55 CubeView Visualization System

56 CubeView Visualization System

57 CubeView Visualization System

58 Data Mining Algorithms for Visualization Problem Definition: Given a spatial graph G = {S, E}. S: nodes (stations) s1, s2, s3, s4, ... E: edges (neighborhood of stations). f(x): attribute value for a data record x. N(x): fixed-cardinality set of neighbors of x. E_{y∈N(x)}(f(y)): average attribute value of x's neighbors. S(x) = f(x) - E_{y∈N(x)}(f(y)): difference between the attribute value of each data object and the average attribute value of its neighbors.

59 Data Mining Algorithms for Visualization Problem Definition, cont. S(x): difference between the attribute value of each data object and the average attribute value of its neighbors. Test for detecting an outlier, with confidence level threshold θ.

60 Data Mining Algorithms for Visualization A few points: First, the neighborhood can be selected based on a fixed cardinality, a fixed graph distance or a fixed Euclidean distance. Second, the neighborhood aggregate function can be the mean, variance, or auto-correlation. Third, the value used to compare a location with its neighbors can be either a single number or a vector of attribute values. Finally, the statistic for the base distribution can be selected as the normal distribution.
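
The problem definition above can be illustrated with a short sketch that computes S(x) for every station and flags stations whose standardized S(x) exceeds θ. The transcript does not spell out the test itself, so the form |(S(x) - mean) / standard deviation| > θ is an assumption here, chosen because the slides name the mean as the neighborhood aggregate and the normal distribution as the base distribution. The ring of eight stations and their volumes are made up.

```python
from statistics import mean, pstdev


def neighborhood_differences(f, neighbors):
    """S(x) = f(x) minus the mean attribute value over N(x), for every node x."""
    return {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}


def spatial_outliers(f, neighbors, theta):
    """Nodes whose standardized S(x) exceeds theta (assumed form of the test)."""
    s = neighborhood_differences(f, neighbors)
    values = list(s.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:                   # all differences identical: nothing stands out
        return []
    return [x for x in s if abs((s[x] - mu) / sigma) > theta]


# Hypothetical ring of eight stations, each with exactly two neighbors.
names = [f"s{i}" for i in range(1, 9)]
volume = dict(zip(names, [100, 102, 98, 300, 101, 99, 100, 101]))
ring = {names[i]: [names[i - 1], names[(i + 1) % len(names)]]
        for i in range(len(names))}

outliers = spatial_outliers(volume, ring, theta=2.0)
print(outliers)
```

Station s4 reads 300 while its neighbors read about 100, so its S(x) is about 200 and it is the only station flagged. Its neighbors s3 and s5 also get large negative S(x) values because s4 inflates their neighborhood means, but they stay under the threshold.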

61 Data Mining Algorithms for Visualization Algorithms Test Parameters Computation(TPC) Algorithm Route Outlier Detection(ROD) Algorithm

62 Data Mining Algorithms for Visualization

63 Data Mining Algorithms for Visualization

64 Data Mining Algorithms for Visualization

65 Software group/vis/traffic_volumemap2.htm group/vis/DataCube.htm

66 Visualization and Data Mining techniques Thank you!!!!