Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology Vipin Kumar William Norris Professor and Head, Department of Computer.

Presentation transcript:

Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology
Vipin Kumar
William Norris Professor and Head, Department of Computer Science
kumar@cs.umn.edu
www.cs.umn.edu/~kumar

Why Data Mining? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data
    - Yahoo! collects 10 GB/hour
  - Purchases at department/grocery stores
    - Walmart records 20 million transactions per day
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Data Mining? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
    - NASA EOSDIS archives over 1 petabyte of Earth Science data per year
  - Telescopes scanning the skies
    - Sky survey data
  - Gene expression data
  - Scientific simulations
    - Terabytes of data generated in a few hours
- Traditional techniques are infeasible for raw data
- Data mining may help scientists
  - in automated analysis of massive data sets
  - in hypothesis formation

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure labels: Statistics/AI; Machine Learning/Pattern Recognition; Data Mining; Database Systems]

Data Mining Tasks
- Clustering
- Predictive Modeling
- Anomaly Detection
- Association Rules

Data Mining for Biology
- Explosion of various types of biological data in recent years:
  - Protein sequences (SwissProt, MIPS)
  - Genome sequences (TIGR)
  - Gene expression (Stanford MicroArray Database)
  - Metabolic pathways (KEGG, HumanCyc)
- Automated techniques for knowledge discovery are crucial for deriving useful information from these data sets:
  - Identification of all genes on a genome
  - Prediction of protein function and structure from its amino acid sequence
  - Inference of pathways and regulatory networks
  - Drug discovery and identification of putative binding sites in protein structures

How can data mining help biologists?
- Data mining is particularly effective if the pattern/format of the final knowledge is presumed, which is common in biology:
  - Protein complexes (clustering and association patterns)
  - Gene regulatory networks (predictive models)
  - Protein structure/function (predictive models)
  - Motifs (association patterns)
- We will look at two examples:
  - Clustering of ESTs
  - Identifying protein functional modules from protein complexes

Clustering
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
- Inter-cluster distances are maximized
- Intra-cluster distances are minimized
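The two distance criteria on this slide can be made concrete with a small sketch. The points, the cluster labels, and the choice of Euclidean distance below are illustrative assumptions, not data or methods from the presentation; the code only measures an existing grouping and does not perform clustering itself.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points already grouped into two clusters (illustrative data only).
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}

def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points that share a cluster."""
    distances = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(distances) / len(distances)

def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    distances = [
        dist(p, q)
        for points_a, points_b in combinations(clusters.values(), 2)
        for p in points_a
        for q in points_b
    ]
    return sum(distances) / len(distances)

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A good grouping in the sense of the slide is one where the first number is small relative to the second.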

Applications of Cluster Analysis
- Understanding
  - Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization
  - Reduce the size of large data sets
[Figure: Clustering precipitation in Australia]

Clustering of ESTs in Protein Coding Database
Researchers: John Carlis, John Riedl, Ernest Retzel, Elizabeth Shoop
[Figure labels: Laboratory Experiments; New Protein; Functionality of the protein; Similarity Match; Clusters of Short Segments of Protein-Coding Sequences (EST); Known Proteins]

Expressed Sequence Tags (EST)
- Generate short segments of protein-coding sequences (EST).
- Match ESTs against known proteins using similarity matching algorithms.
- Find clusters of ESTs that have the same functionality.
- Match new protein against the EST clusters.
- Experimentally verify only the functionality of the proteins represented by the matching EST clusters.
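The workflow on this slide can be sketched end to end on toy data. Everything below is an assumption made for illustration: the sequences and function labels are invented, all sequences share one alphabet, and the k-mer containment score is a simple stand-in for the real similarity matching algorithms, which the slide does not name.

```python
from collections import defaultdict

K = 3  # k-mer length; an arbitrary illustrative choice

def kmers(seq):
    """Set of all length-K substrings of seq."""
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}

def containment(fragment, sequence):
    """Fraction of the fragment's k-mers that also occur in sequence (toy similarity)."""
    frag = kmers(fragment)
    if not frag:
        return 0.0
    return len(frag & kmers(sequence)) / len(frag)

# Hypothetical known proteins (function label -> sequence) and ESTs (id -> sequence).
known_proteins = {"kinase": "MKTLLVAGGHE", "transporter": "QWERSSPYDNC"}
ests = {"est1": "MKTLLVA", "est2": "LLVAGGHE", "est3": "ERSSPYD", "est4": "PYDNC"}

def cluster_ests(ests, known_proteins, threshold=0.5):
    """Group ESTs by the known protein they match best, if the match is good enough."""
    clusters = defaultdict(list)
    for est_id, est_seq in ests.items():
        best_function, best_score = max(
            ((function, containment(est_seq, protein_seq))
             for function, protein_seq in known_proteins.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            clusters[best_function].append(est_id)
    return dict(clusters)

def candidate_functions(new_protein, clusters, ests, threshold=0.5):
    """Functions whose EST cluster contains at least one EST matching the new protein."""
    return sorted(
        function
        for function, est_ids in clusters.items()
        if any(containment(ests[e], new_protein) >= threshold for e in est_ids)
    )

clusters = cluster_ests(ests, known_proteins)
candidates = candidate_functions("AAMKTLLVAGGW", clusters, ests)
print(clusters)
print(candidates)
```

Only the functions returned by `candidate_functions` would then go on to laboratory verification, which is the saving the slide describes.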

EST Clusters by Hypergraph-Based Scheme
- 662 different items corresponding to ESTs
- 11,986 variables corresponding to known proteins
- Found 39 clusters
  - 12 clean clusters, each corresponding to a single protein family (113 ESTs)
  - 6 clusters with two protein families
  - 7 clusters with three protein families
  - 3 clusters with four protein families
  - 6 clusters with five protein families
- Runtime was less than 5 minutes

Association Analysis
- Association analysis: analyzes relationships among items (attributes) in binary transaction data
  - Example data: market basket data
  - Data can be represented as a binary matrix
- Applications in business and science
- Two types of patterns
  - Itemsets: collection of items. Example: {Milk, Diaper}
  - Association rules: X → Y, where X and Y are itemsets. Example: Milk → Diaper
[Figure labels: Set-Based Representation of Data; Binary Matrix Representation of Data]
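The two representations and the two pattern types can be illustrated with a short sketch. The five transactions are invented toy data, and support and confidence are the standard measures for itemsets and rules; the slide itself does not define them, so treat them as an addition for illustration.

```python
# Set-based representation: one set of items per transaction (toy data).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Binary matrix representation: one row per transaction, one column per item.
items = sorted(set().union(*transactions))
binary_matrix = [[int(item in t) for item in items] for t in transactions]

def support_count(itemset):
    """Number of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support_count(both) / support_count(antecedent)

print(items)
for row in binary_matrix:
    print(row)
print("support({Milk, Diaper}) =", support({"Milk", "Diaper"}))
print("confidence(Milk -> Diaper) =", confidence({"Milk"}, {"Diaper"}))
```

With this data, the itemset {Milk, Diaper} appears in three of five transactions, and the rule Milk → Diaper holds in three of the four transactions that contain Milk.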

- Where are the parts located?
- How many roles can these play?
- How flexible and adaptable are they mechanically?
- What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)?
- What are the common parts -- types of parts (nuts & washers)?
- Which parts interact?
© Mark Gerstein, Yale

Data Mining Book
For further details and sample chapters, see www.cs.umn.edu/~kumar/dmbook