Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology Vipin Kumar William Norris Professor and Head, Department of Computer.


1 Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology
Vipin Kumar, William Norris Professor and Head, Department of Computer Science

2 Why Data Mining? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data
    - Yahoo! collects 10 GB/hour
  - Purchases at department/grocery stores
    - Walmart records 20 million transactions per day
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g., in Customer Relationship Management)

3 Why Data Mining? Scientific Viewpoint
- Data is collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
    - NASA EOSDIS archives over 1 petabyte of Earth Science data per year
  - Telescopes scanning the skies
    - Sky survey data
  - Gene expression data
  - Scientific simulations
    - Terabytes of data generated in a few hours
- Traditional techniques are infeasible for raw data
- Data mining may help scientists
  - in automated analysis of massive data sets
  - in hypothesis formation

4 Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Diagram labels: Statistics/AI; Machine Learning/Pattern Recognition; Data Mining; Database systems]

5 Data Mining Tasks...
[Figure labels: Data; Clustering; Predictive Modeling; Anomaly Detection; Association Rules]

6 Data Mining for Biology
- Explosion of various types of biological data in recent years:
  - Protein sequences (SwissProt, MIPS)
  - Genome sequences (TIGR)
  - Gene expression (Stanford MicroArray Database)
  - Metabolic pathways (KEGG, HumanCyc)
- Automated techniques for knowledge discovery are crucial for deriving useful information from these data sets
  - Identification of all genes on a genome
  - Prediction of protein function and structure from its amino acid sequence
  - Inference of pathways and regulatory networks
  - Drug discovery and identification of putative binding sites in protein structures

7 How can data mining help biologists?
- Data mining is particularly effective if the pattern/format of the final knowledge is presumed, which is common in biology:
  - Protein complexes (clustering and association patterns)
  - Gene regulatory networks (predictive models)
  - Protein structure/function (predictive models)
  - Motifs (association patterns)
- We will look at two examples:
  - Clustering of ESTs
  - Identifying protein functional modules from protein complexes

8 Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
- Inter-cluster distances are maximized
- Intra-cluster distances are minimized
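The two criteria on this slide can be checked numerically. The following is a minimal sketch, assuming 2-D points and Euclidean distance; the points, the cluster labels, and the function names are hypothetical and are not taken from the talk. It only measures how well a given grouping satisfies "small intra-cluster, large inter-cluster distance"; it does not perform the clustering itself.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points with hand-assigned cluster labels (illustrative only).
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points drawn from different clusters."""
    pairs = [
        dist(p, q)
        for points_a, points_b in combinations(clusters.values(), 2)
        for p in points_a
        for q in points_b
    ]
    return sum(pairs) / len(pairs)


intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A good grouping gives an intra-cluster value that is much smaller than the inter-cluster value, as it is for these two well-separated toy groups.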

9 Applications of Cluster Analysis
- Understanding
  - Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization
  - Reduce the size of large data sets
[Figure: Clustering precipitation in Australia]

10 Clustering of ESTs in Protein Coding Database
Researchers: John Carlis, John Riedl, Ernest Retzel, Elizabeth Shoop
[Diagram labels: Laboratory Experiments; New Protein; Functionality of the protein; Similarity Match; Clusters of Short Segments of Protein-Coding Sequences (EST); Known Proteins]

11 Expressed Sequence Tags (EST)
- Generate short segments of protein-coding sequences (ESTs).
- Match ESTs against known proteins using similarity matching algorithms.
- Find clusters of ESTs that have the same functionality.
- Match the new protein against the EST clusters.
- Experimentally verify only the functionality of the proteins represented by the matching EST clusters.
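The grouping and lookup steps of this workflow can be sketched as follows. This is only an illustration under stated assumptions: similarity matching is assumed to have been run already, its results are assumed to be available as a mapping from each EST to the set of known proteins it matched, and all identifiers (`est1`, `protA`, the function names) are hypothetical. The grouping rule used here, joining ESTs that share a matched protein directly or transitively, is a simple stand-in and is not the hypergraph-based scheme used in the talk.

```python
from collections import defaultdict

# Hypothetical similarity-match results: EST id -> ids of known proteins it matched.
est_hits = {
    "est1": {"protA", "protB"},
    "est2": {"protB"},
    "est3": {"protC"},
    "est4": {"protC", "protD"},
    "est5": {"protE"},
}


def cluster_ests(est_hits):
    """Group ESTs linked, directly or transitively, by a shared protein hit."""
    parent = {est: est for est in est_hits}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_est_for_protein = {}
    for est, proteins in est_hits.items():
        for protein in proteins:
            if protein in first_est_for_protein:
                # Merge this EST's group with the group that already hit the protein.
                parent[find(est)] = find(first_est_for_protein[protein])
            else:
                first_est_for_protein[protein] = est

    groups = defaultdict(set)
    for est in est_hits:
        groups[find(est)].add(est)
    return sorted(sorted(group) for group in groups.values())


def clusters_for_new_protein(matched_ests, clusters):
    """Return the clusters containing at least one EST the new protein matched."""
    return [cluster for cluster in clusters if matched_ests & set(cluster)]


clusters = cluster_ests(est_hits)
print(clusters)

# Assumption: the new protein was similarity-matched against the ESTs and hit "est2".
candidates = clusters_for_new_protein({"est2"}, clusters)
print(candidates)
```

Only the proteins represented by the clusters in `candidates` would then go on to experimental verification, which is the saving the workflow is designed to achieve.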

12 EST Clusters by Hypergraph-Based Scheme
- 662 different items corresponding to ESTs
- 11,986 variables corresponding to known proteins
- Found 39 clusters
  - 12 clean clusters, each corresponding to a single protein family (113 ESTs)
  - 6 clusters with two protein families
  - 7 clusters with three protein families
  - 3 clusters with four protein families
  - 6 clusters with five protein families
- Runtime was less than 5 minutes

13 Association Analysis
- Association analysis: analyzes relationships among items (attributes) in binary transaction data
  - Example data: market basket data
  - Data can be represented as a binary matrix
- Applications in business and science
- Two types of patterns
  - Itemsets: collection of items
    - Example: {Milk, Diaper}
  - Association rules: X → Y, where X and Y are itemsets
    - Example: Milk → Diaper
[Figure panels: Set-Based Representation of Data; Binary Matrix Representation of Data]
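A minimal sketch of the two data representations and of how an itemset and a rule can be scored. The transactions below are hypothetical market-basket data. Support and confidence are the standard measures for itemsets and rules; they are added here for illustration and are not named on this slide.

```python
# Hypothetical market-basket transactions (set-based representation).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Binary matrix representation: one row per transaction, one column per item.
items = sorted(set().union(*transactions))
binary_matrix = [[int(item in t) for item in items] for t in transactions]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing X that also contain Y, for X -> Y."""
    covered = count(antecedent, transactions)
    if covered == 0:
        return 0.0  # rule never applies in this data
    both = count(set(antecedent) | set(consequent), transactions)
    return both / covered


print(items)
for row in binary_matrix:
    print(row)
print(f"support({{Milk, Diaper}}) = {support({'Milk', 'Diaper'}, transactions):.2f}")
print(f"confidence(Milk -> Diaper) = {confidence({'Milk'}, {'Diaper'}, transactions):.2f}")
```

In this toy data, three of the five transactions contain both Milk and Diaper, and three of the four transactions containing Milk also contain Diaper.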

17 Where are the parts located?
- How many roles can these play?
- How flexible and adaptable are they mechanically?
- What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)?
- What are the common parts -- types of parts (nuts & washers)?
- Where are the parts located?
- Which parts interact?
© Mark Gerstein, Yale

26 Data Mining Book For further details and sample chapters see

