Information Refining: Improving the Quality of Information Mined from Heterogeneous and High-Dimensional Time Series
Fatih Altiparmak (1), Ozgur Ozturk (1), Selnur Erdal (1), Hakan Ferhatosmanoglu (1), Donald C. Trost (2)
(1) The Ohio State University, Columbus, OH
(2) Pfizer Global Research and Development

Basic Definitions
- Support: the number of clusters that contain all the members of an analyte set.
- Confidence of association rule X → Y: Support(X ∪ Y) / Support(X)
- Lift (correlation) of association rule X → Y: Support(X ∪ Y) / (Support(X) * Support(Y))

To get the strongly related analyte sets of size k:
- generate candidate sets from the sets of size (k-1);
- prune the ones that do not pass the support and confidence tests.

For example: if {1,2}, {1,3}, and {2,3} exist, then {1,2,3} is a candidate set.
IF Support({1,2,3}) > supportLimit
 & Confidence({1,2}, {1,2,3}) > confidenceLimit
 & Confidence({1,3}, {1,2,3}) > confidenceLimit
 & Confidence({2,3}, {1,2,3}) > confidenceLimit
THEN {1,2,3} is a strongly related analyte set.

Our Two Step INFORMATION REFINING Method

Challenges in Mining Heterogeneous, Asynchronous Time Series

Clinical Trial: A clinical trial is a research study that answers specific questions about vaccines or new therapies. Clinical trials are used to determine whether new drugs or treatments are both safe and effective. In these trials, patients are assigned a treatment or a placebo, and measurements of certain analytes (blood components) are taken at intervals. These measurements can be represented as a time series for each analyte.
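The support, confidence, and lift definitions under Basic Definitions, together with the candidate-and-prune step for sets of size k, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: clusters are modeled as frozensets of analyte ids, and the names `support`, `confidence`, `lift`, `strongly_related`, and `toy_clusters` are hypothetical. Lift is computed on raw support counts, exactly as the poster writes it.

```python
from itertools import combinations


def support(itemset, clusters):
    """Number of clusters that contain every member of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for cluster in clusters if items <= cluster)


def confidence(x, y, clusters):
    """Confidence of rule X -> Y: Support(X union Y) / Support(X)."""
    sx = support(x, clusters)
    return support(set(x) | set(y), clusters) / sx if sx else 0.0


def lift(x, y, clusters):
    """Lift on raw counts: Support(X union Y) / (Support(X) * Support(Y))."""
    denom = support(x, clusters) * support(y, clusters)
    return support(set(x) | set(y), clusters) / denom if denom else 0.0


def strongly_related(prev_sets, clusters, support_limit, confidence_limit):
    """Build size-k sets from size-(k-1) sets and prune by both tests."""
    prev = {frozenset(s) for s in prev_sets}
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    # A candidate is the union of two (k-1)-sets that has exactly k members.
    candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    result = set()
    for cand in candidates:
        subsets = [frozenset(c) for c in combinations(cand, k - 1)]
        # Every (k-1)-subset must already exist, as in the {1,2,3} example.
        if not all(s in prev for s in subsets):
            continue
        if support(cand, clusters) <= support_limit:
            continue
        if all(confidence(s, cand, clusters) > confidence_limit for s in subsets):
            result.add(cand)
    return result


# Hypothetical toy data: five clusters of analyte ids.
toy_clusters = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2}),
    frozenset({2, 3}),
    frozenset({1, 3, 4}),
]
pairs = [{1, 2}, {1, 3}, {2, 3}]
triples = strongly_related(pairs, toy_clusters, support_limit=1, confidence_limit=0.6)
```

On the toy data, each pair has support 3 and {1,2,3} has support 2, so every confidence is 2/3 and {1,2,3} passes both tests with the limits chosen above.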
Case Study 1: Pharmaceutical Clinical Trials
- Technology lowers the cost of obtaining data, so data are abundant.
- Opportunity: cross-validating information from different sources.
- Difficulty: data incompatibility.
- Conventional Data Mining (DM) techniques are not fit for heterogeneous and high-dimensional time series.

Challenges Faced both in Clinical Trials and Microarray
- High dimensionality
- Heterogeneity, non-uniformity???
- Insufficient length
- Unequal interval sizes (variable sampling???)
- Different lengths
- Asynchronicity???
- Diverse data sources
- Varying sensitivity with source
- Noise

Brute Force DM compared with our method
- Global mining of the data causes inaccuracies even with extensive preprocessing.
- Results had little meaning.
- The data are heterogeneous and incomplete.
- Such results are difficult to interpret.

First Step: Apply DM over homogeneous subsets of data and gather information.
Second Step: Refine the information by identifying common or distinct patterns over it.

Find significant and clean subsets of data, e.g. the most appropriate analytes and patients to make accurate experiments (26 of 43 analytes and 152 patients).

Information Refining on Clinical Trials
Step 1: Mine the data within clean subsets.
Step 2: Refine information (Detect Related).

Refining the Information

Group Name        Group Analytes
Transporter       Hemoglobin, Hematocrit, RBC count
Acute Infection   WBC Count, Neutrophils, Neutrophils (abs)
Serum Protein     Total Protein, Albumin, Globulin, Calcium
Liver             SGOT (AST), SGPT (ALT), LDH

- A panel of analytes that effectively models human health.
- A subset representing all 43 analytes.
- Decision support to choose representative(s) from each group of analytes.
- An analyte will be a representative of a panel if it is in a global panel.
Alternative Approach that Finds Unrelated
- Run the algorithm on the dual of the support values: total number of patients - support.
- Output: Selected Features: Global Panels

Goals of Pfizer Project

Group Name                 Acute Infection   Transporter   Serum Protein   Liver
Representation frequency   100%              91%           87%             98%

Distance metrics: Correlation Coefficient, Qualitative, DTW-Euc100, DTW-SWC100, Euclidean

- Input: analyte clusters for each patient.
- Find the frequently co-occurring analytes.
- Merge the analyte sets using the Support Test and the Confidence Test.
- Output: strongly related analyte sets (used in redundancy elimination).

- Analytes are clustered for each patient.
- K-Medoid Clustering with 5 different metrics.
- Output: analyte clusters for each patient.

Our Novel Distance Metrics
- Slope Wise Comparison (SWC): trends are matched (increasing or decreasing).
- Qualitative Metric (non-linear correlations):
  - Uses a local distance metric (SWC was used).
  - The local distance metric must be capable of comparing the relationship of two points (a pair) of one series with that of two points of another series.
  - Captures the similarity between patterns of changes of time series, regardless of whether the nature of the dependence between them is linear or non-linear.

Acknowledgements
Pfizer??? Children's Hosp??? BAALC group???

References
- Altiparmak F., Ferhatosmanoglu H., Erdal S., Trost C., "Information Mining over Heterogeneous and High Dimensional Time Series Data in Clinical Trials Databases", IEEE Transactions on Information Technology in Biomedicine (TITB).
- Altiparmak F., Erdal S., Ozturk O., Ferhatosmanoglu H., "Similarity Based Analysis of Microarray Time-Series Data".
(Submitted to TITB)

Preprocessing

Information Refining Depicted on a Hypothetical Run

Findings 1: Strongly Related Analyte Sets

Result of Ensemble Algorithm: Feature Selection: Identifying a Global Panel

Safety Detection
- Early identification of abnormal individuals to detect safety problems.
- Dynamic and multi-dimensional monitoring rules.
- Prediction of biomarkers.
- Classification of changes.
- Current method: simple univariate normal boundaries. We need:
  - Multi-variate signals
  - Trajectories??? (non-random variation over placebo patients)
  - Detection of change in correlation of analytes over time

Modeling of health state given clinical measurements
- Healthy vs. diseased.
- Change in health state.
- Model the state with fewer analytes?
- How to model the analytes?

Feature selection: which analytes are necessary to model a certain health state/disease
- Global panel of analytes that best represents the overall information in the data.
- Clusters of analytes that represent different groups of biological panels.
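The Slope Wise Comparison (SWC) idea listed under Our Novel Distance Metrics, matching increasing or decreasing trends, can be illustrated with a short Python sketch. The poster does not give the formula, so the definition here is an assumption made for illustration: both series are sampled at the same time points, and the distance is the fraction of consecutive intervals whose trend directions disagree. The names `slope_signs` and `swc_distance` are hypothetical.

```python
def slope_signs(series):
    """Trend of each consecutive interval: +1 rising, -1 falling, 0 flat."""
    return [(b > a) - (b < a) for a, b in zip(series, series[1:])]


def swc_distance(s, t):
    """Assumed SWC-style distance: share of intervals whose trends differ.

    Returns 0.0 when every interval moves in the same direction in both
    series and 1.0 when no interval does.
    """
    if len(s) != len(t):
        raise ValueError("series must have the same length")
    if len(s) < 2:
        raise ValueError("need at least two points per series")
    signs_s, signs_t = slope_signs(s), slope_signs(t)
    mismatches = sum(1 for a, b in zip(signs_s, signs_t) if a != b)
    return mismatches / len(signs_s)


# Two series with a non-linear relationship but identical trends.
same_trend = swc_distance([1, 2, 3, 2], [10, 20, 40, 5])
# Two series that move in opposite directions on every interval.
opposite_trend = swc_distance([1, 2, 3], [3, 2, 1])
```

Because only the direction of change is compared, the first pair has distance 0.0 even though the values are not linearly related, which is the property the Qualitative Metric relies on when it uses SWC as its local distance.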

Microarray Technology: A new way of studying how thousands of genes interact with each other and how a cell's regulatory networks control vast batteries of genes simultaneously. The method uses tiny droplets containing functional DNA, arranged in a precise grid on glass slides. Fluorescently labeled DNA probes from the cell being studied are allowed to bind to these complementary DNA strands. The brightness of each fluorescent dot, measured with a scanner, reveals how much of a specific DNA fragment is present, which indicates how active the corresponding gene is.

Microarray Data
- Usually time series data.
- Each series shows the change in the expression level of the corresponding gene.
- Expression is measured as the density of the gene's products in the cell.

Case Study 2: Haemophilus influenzae Microarray Data