Analysis and Management of Microarray Data Dr G. P. S. Raghava.

1 Analysis and Management of Microarray Data Dr G. P. S. Raghava

2 Major Applications
• Identification of differentially expressed genes in diseased tissues (in the presence of a drug)
• Classification of differentially expressed genes, or clustering/grouping of genes that behave similarly under different conditions
• Use of the expression profile of a known disease to diagnose and classify unknown genes

3 Management of Microarray Data
• Magnitude of Data
  – Experiments
    • 50,000 genes in human
    • 320 cell types
    • 2,000 compounds
    • 3 time points
    • 2 concentrations
    • 2 replicates
  – Data Volume
    • 4 × 10^11 data points
    • 10^15 bytes = 1 petabyte of data
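The data-point count on this slide can be checked by multiplying the design factors. The sketch below does that arithmetic; the per-point storage figure is only what the slide's two totals imply together, not a number stated in the deck.

```python
# Experimental design factors taken from the slide.
genes = 50_000
cell_types = 320
compounds = 2_000
time_points = 3
concentrations = 2
replicates = 2

# One data point per gene for every combination of the other factors.
data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)
print(f"{data_points:.2e}")  # 3.84e+11, which the slide rounds to 4 x 10^11

# Assumption: the 1 PB total covers everything stored per data point
# (the slide does not break this down), so this is only the implied average.
PETABYTE = 10**15
bytes_per_point = PETABYTE / data_points
print(round(bytes_per_point))  # roughly 2.6 kB per data point
```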

4 Gene expression database – a conceptual view
[Figure labels: Gene expression matrix (Samples, Genes, Gene expression levels); Sample annotations; Gene annotations]
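One way to picture this conceptual view is as a small data structure. The sketch below is illustrative only: the class name `ExpressionMatrix`, its fields, and the choice of genes as rows and samples as columns are assumptions, not something the slide specifies.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionMatrix:
    """Hypothetical container: values[i][j] is the level of genes[i] in samples[j]."""
    genes: list[str]
    samples: list[str]
    values: list[list[float]]
    # Free-form annotations keyed by gene or sample name (assumed layout).
    gene_annotations: dict[str, dict] = field(default_factory=dict)
    sample_annotations: dict[str, dict] = field(default_factory=dict)

    def level(self, gene: str, sample: str) -> float:
        """Expression level of one gene in one sample."""
        return self.values[self.genes.index(gene)][self.samples.index(sample)]

    def gene_profile(self, gene: str) -> list[float]:
        """Expression of one gene across all samples (one row)."""
        return list(self.values[self.genes.index(gene)])

    def sample_column(self, sample: str) -> list[float]:
        """Expression of all genes in one sample (one column)."""
        j = self.samples.index(sample)
        return [row[j] for row in self.values]
```

Keeping the annotations next to the matrix reflects the three parts named in the figure: expression levels, sample annotations, and gene annotations.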

5 Management of Microarray Data: Major Issues
• Large volume of microarray data in the last few years
  – Storage and efficient access
  – Comparison and integration of data
• Problem of data access and exchange
  – Data scattered around the Internet
  – Supplementary material of publications
  – Difficult for users to access relevant data
• Problems with existing databases
  – Diverse purposes
  – Developed for specific purposes

6 Management of Microarray Data
• Specific Databases
  – Platform (e.g. Stanford MA Database; SMD)
  – Organism (Yeast MA global viewer)
  – Project (Life cycle database of Drosophila)
• Problems with Supplementary Material and MA Databases
  – Lack of direct access
  – Quality not checked
  – No standard format
  – Incomplete data

7
• Comprehensive database server to manage massive amounts of microarray data
  – Biomaterial information
  – Raw data & images
  – Web tools (normalization; data viewing; analysis)
• Running on local servers allows full management of, and permissions for, adding and viewing data
• Minimum Information About a Microarray Experiment (MIAME)
• BASE http://bioinformatics1.uams.edu:8081:/

8 Public Databases
• Gene expression data is an essential aspect of annotating the genome
• Publication and data exchange for microarray experiments
• Data mining/meta-studies
• Common data format: XML
• MIAME (Minimum Information About a Microarray Experiment)

9 GEO at the NCBI

10 Microarray Data Mining Challenges
• Too few records (samples), usually < 100
• Too many columns (genes), usually > 1,000
• Too many columns are likely to lead to false positives
• For exploration, a large set of all relevant genes is desired
• For diagnostics or identification of therapeutic targets, the smallest set of genes is needed
• Model needs to be explainable to biologists
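The false-positive point can be made concrete with a little arithmetic. The sketch below assumes a per-gene significance level of 0.05, independent tests, and no truly changed genes; none of these values come from the slide.

```python
# Assumed values for illustration only.
alpha = 0.05      # per-gene significance threshold
n_genes = 1_000   # number of genes tested, the slide's lower bound

# If no gene is truly differentially expressed, each test still has an
# alpha chance of looking significant.
expected_false_positives = alpha * n_genes
print(expected_false_positives)  # 50.0

# Probability that at least one gene is a false positive,
# assuming the tests are independent.
p_at_least_one = 1 - (1 - alpha) ** n_genes
print(p_at_least_one > 0.999)  # True
```

With fewer than 100 samples and more than 1,000 genes, some genes will separate the classes by chance alone, which is why gene lists from a single experiment need statistical validation.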

11 Analysis of Microarray Data
• Analysis of images
• Preprocessing of gene expression data
• Normalization of data
  – Subtraction of background noise
  – Global/local normalization
  – Housekeeping genes (or same gene)
  – Expression as a ratio (test/reference), in log scale
• Differential gene expression
  – Repeat experiments and calculate significance (t-test)
  – Significance of fold change assessed with a statistical method
• Clustering
  – Supervised/Unsupervised (Hierarchical, K-means, SOM)
• Prediction or Supervised Machine Learning (SVM)
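The "log ratio plus t-test" step might look like the sketch below. It assumes log base 2 and a one-sample t statistic testing whether the mean log ratio of a gene differs from zero; the slide does not say which t-test variant is meant, and the function names are hypothetical.

```python
import math
import statistics


def log2_ratios(test, reference):
    """Per-replicate log2(test / reference) for one gene."""
    return [math.log2(t / r) for t, r in zip(test, reference)]


def one_sample_t(values):
    """t statistic for H0: mean(values) == 0, with n - 1 degrees of freedom.

    Needs at least two replicates that are not all identical,
    otherwise the standard deviation is undefined or zero.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return mean / (sd / math.sqrt(n))


# Made-up intensities for one gene across four replicate arrays.
test_channel = [200.0, 220.0, 180.0, 210.0]
reference_channel = [100.0, 100.0, 100.0, 100.0]

ratios = log2_ratios(test_channel, reference_channel)
t_stat = one_sample_t(ratios)
print(ratios, t_stat)
```

The t statistic is then compared against a t distribution with n - 1 degrees of freedom to obtain a p-value; that lookup is left out here to keep the sketch free of external libraries.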

12 Low Level Analysis or Preprocessing of Gene Expression Data
• Scale transformation
• Normalization and scaling
• Replicate handling
• Missing value handling
• Flat pattern filtering
• Pattern standardization
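Three of these preprocessing steps can be sketched in a few lines. The choices below (row-mean imputation, a range-based flatness cutoff, z-score standardization) are common options picked for illustration; the slide does not say which methods the author used, and the function names and the `min_range` default are hypothetical.

```python
import statistics


def impute_row_mean(row):
    """Missing value handling: replace None with the mean of observed values."""
    observed = [v for v in row if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in row]


def is_flat(row, min_range=1.0):
    """Flat pattern filtering: flag genes whose profile barely changes.

    min_range is an assumed cutoff on (max - min) in log units.
    """
    return max(row) - min(row) < min_range


def standardize(row):
    """Pattern standardization: rescale a profile to mean 0 and unit variance."""
    mu = statistics.mean(row)
    sd = statistics.pstdev(row)  # zero for a constant profile; filter those first
    return [(v - mu) / sd for v in row]


profile = impute_row_mean([1.0, None, 3.0])
print(profile)               # [1.0, 2.0, 3.0]
print(is_flat(profile))      # False, range is 2.0
print(standardize(profile))
```

Filtering flat profiles before standardizing avoids dividing by a standard deviation of zero.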

13 Normalization Techniques
• Global normalization
  – Divide each channel's values by the channel mean
• Control spots
  – Common spots in both channels
  – Housekeeping genes
  – Ratio of the intensity of the same gene in the two channels is used for correction
• Iterative linear regression
• Parametric nonlinear normalization
  – log(Cy3/Cy5) vs log(Cy5)
  – Fitted log ratio – observed log ratio
• General nonlinear normalization
  – LOESS
  – Curve of log(R/G) vs log(sqrt(R·G))
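As a rough illustration, the sketch below implements global normalization and computes the two quantities plotted before a LOESS fit, log(R/G) and log(sqrt(R·G)). Log base 2 and the names `global_normalize` and `ma_values` are assumptions; the LOESS fit itself is not implemented here.

```python
import math
import statistics


def global_normalize(channel):
    """Global normalization: divide every intensity by the channel mean."""
    mean = statistics.mean(channel)
    return [v / mean for v in channel]


def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = log2(sqrt(R*G))."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(math.sqrt(r * g)) for r, g in zip(red, green)]
    return m, a


# Made-up background-subtracted intensities for four spots.
red = [400.0, 1600.0, 250.0, 900.0]
green = [100.0, 1600.0, 500.0, 300.0]

red_n = global_normalize(red)
green_n = global_normalize(green)
M, A = ma_values(red_n, green_n)
print(M)
print(A)
```

An intensity-dependent normalization would then fit a smooth curve of M against A and subtract the fitted value from each observed M.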

14 Classification
• Task: assign objects to classes (groups) on the basis of measurements made on the objects
• Unsupervised: classes are unknown; we want to discover them from the data (cluster analysis)
• Supervised: classes are predefined; we want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

15 Cluster Analysis
• Used to find groups of objects when these are not already known
• “Unsupervised learning”
• Associated with each object is a set of measurements (the feature vector)
• Aim is to identify groups of similar objects on the basis of the observed measurements

16 Unsupervised Learning
• Hierarchical clustering: merge two branches at a time until all variables (genes) are in one tree. [It does not answer the question “how many gene clusters are there?”]
• K-means clustering: assumes there are K clusters. [What if this assumption is incorrect?]
• Self-Organizing Maps (SOM)
  – Split all genes into similar sub-groups
  – Finds its own groups (machine learning)
• Principal component analysis
  – Every gene is a dimension (vector); find a single dimension that best represents the differences in the data
• Model-based clustering: the number of clusters is determined dynamically [could be one of the most promising methods]
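To make the K-means bullet concrete, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm). The starting centroids are passed in explicitly so the result is reproducible; the function name, the fixed iteration count, and the toy profiles are all assumptions for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain K-means: K is fixed by the number of starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Toy expression profiles: each gene measured under two conditions.
profiles = [(0, 0), (0, 1), (5, 5), (5, 6)]
centroids, clusters = kmeans(profiles, centroids=[(0, 0), (5, 5)])
print(centroids)  # [(0.0, 0.5), (5.0, 5.5)]
print(clusters)   # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

The caveat on the slide shows up directly in the signature: the caller must choose K up front, and the algorithm will return K groups whether or not the data contain that many.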

17 [Figure: average linkage hierarchical clustering, melanoma only; panels labeled ‘cluster’ and ‘unclustered’]

18

19 Supervised Analysis
• Fisher’s linear discriminant analysis
• Quadratic discriminant analysis
• Logistic regression (a linear discriminant analysis)
• Neural networks
• Support vector machine
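As one example from this list, Fisher's linear discriminant for two classes can be sketched with two genes as features. The projection direction is w = Sw^-1 (m1 - m0), where Sw is the pooled within-class scatter matrix. The midpoint threshold (which amounts to assuming equal class priors), the function names, and the toy data are illustrative assumptions.

```python
def mean_vec(rows):
    """Mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 scatter matrix: sum of outer products of deviations from m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_lda(class0, class1):
    """Return (w, threshold) for a two-class, two-feature Fisher discriminant."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # must be non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Assumed decision rule: midpoint between the projected class means.
    proj0 = w[0] * m0[0] + w[1] * m0[1]
    proj1 = w[0] * m1[0] + w[1] * m1[1]
    return w, 0.5 * (proj0 + proj1)


def predict(x, w, threshold):
    """Class 1 if the projection of x exceeds the threshold, else class 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0


# Toy training set: expression of two genes in samples of two classes.
normal = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.0), (2.0, 2.5)]
tumor = [(5.0, 6.0), (6.0, 5.0), (5.5, 6.5), (6.5, 5.5)]
w, threshold = fisher_lda(normal, tumor)
print(predict((1.0, 1.0), w, threshold))  # 0
print(predict((6.0, 6.0), w, threshold))  # 1
```

With thousands of genes and fewer than 100 samples, the within-class scatter matrix becomes singular, so in practice the discriminant is applied after selecting a small set of genes.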

20 Example: Tumor Classification
• Reliable and precise classification is essential for successful cancer treatment
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; it is likely that existing classes are heterogeneous
• Characterize molecular variations among tumors by monitoring gene expression (microarray)
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

21 Higher Level Microarray Data Analysis
• Clustering and pattern detection
• Data mining and visualization
• Controls and normalization of results
• Statistical validation
• Linkage between gene expression data and gene sequence/function/metabolic pathway databases
• Discovery of common sequences in co-regulated genes
• Meta-studies using data from multiple experiments

22 Thanks

