1
APPLICATION OF STATISTICS IN BIOINFORMATICS
by Ajit Kumar Roy, Bioinformatics Center, CIFA, Bhubaneswar
National Workshop on Emerging Bioinformatics Tools and Techniques in Agricultural Research to be held at OUAT, Bhubaneswar during March 2008.
2
Definition of Bioinformatics
Bioinformatics is a scientific discipline that encompasses all aspects of biological information (acquisition, processing, storage, distribution, analysis and interpretation) and combines the tools and techniques of mathematics, computer science and biology with the aim of understanding the biological significance of a variety of data.
3
Statistical packages: SAS, SPSS
4
Need of Statistical Analysis in Bioinformatics
In the post-genomic era, large sequencing projects are producing nucleotide sequences at an ever faster rate, and the content of the nucleotide databases doubles every few months. Large-scale data mining has given rise to knowledge discovery and the systematic visualization of results. Biological data are redundant and difficult to handle unless they are stored systematically. Statistical methods help in the storage, access, analysis and interpretation of data.
5
Stages for Generation of Information and Knowledge
All raw data need to pass through a few stages before they yield useful information and knowledge. The process of knowledge discovery via data mining can be divided into four basic activities: selection, pre-processing, data mining, and interpretation.
6
Statistical Data Mining
Data mining is about automating the search for patterns in data. Most data-mining problems are a composition of four underlying subproblems: classification, clustering, pattern search and outlier detection.
7
Statistical Data Mining
Decision Trees
Probability for Data Miners
Gaussians Theory
Probability Density Functions
Maximum Likelihood Estimation
Gaussian Bayes Classifiers
Cross-Validation
Neural Networks
Instance-based learning (aka Case-based or Memory-based or non-parametric)
8
Statistical Data Mining
Regression for Predicting Real-valued Outputs
Bayesian Networks
Game Tree Search Algorithms, including Alpha-Beta Search
Hierarchical Clustering
Bayesian Classifiers
K-means Clustering
Short Overview of Bayes Nets
Gaussian Mixture Models
Hidden Markov Models
9
Statistical Data Mining
PAC (Probably Approximately Correct) Learning
Markov Decision Processes
Reinforcement Learning
Self-Organizing Map
Elementary probability and Naive Bayes classifiers
Spatial Surveillance
Time Series Methods
Zero-Sum Game Theory
Non-zero-sum Game Theory
10
Statistical Data Mining
Time-series-based anomaly detection algorithms
Artificial Intelligence
Correspondence Analysis
Principal Component Analysis
Search Algorithms
A-star Heuristic Search
Genetic Algorithms
11
Decision Tree
A decision tree is a tree-structured plan of a set of attributes to test in order to predict the output. Such trees serve as a descriptive means for calculating conditional probabilities. A decision tree is used to identify the strategy most likely to reach a goal. Decision trees also help to form a balanced picture of the risks and rewards associated with each possible course of action.
12
Gaussian Theory
The Gaussian distribution is also known as the normal distribution or the bell-shaped curve.
13
MAXIMUM LIKELIHOOD ESTIMATION
Maximum likelihood estimation (MLE) is a popular statistical method for fitting a mathematical model to data: it chooses the parameter values under which the observed data are most probable.
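As a minimal sketch of the idea, assuming the data are modelled by a single Gaussian, the maximum likelihood estimates have a closed form: the sample mean and the variance computed with a divisor of n. The function names and the measurements below are hypothetical and serve only as an illustration.

```python
import math


def gaussian_mle(samples):
    """Return the maximum likelihood estimates (mean, variance) of a Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The MLE of the variance divides by n, not by n - 1.
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, variance


def gaussian_log_likelihood(samples, mean, variance):
    """Log-likelihood of the samples under a Gaussian with the given parameters."""
    n = len(samples)
    squared_error = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)


# Hypothetical measurements, used only for illustration.
data = [4.8, 5.1, 5.0, 4.7, 5.4]
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat, gaussian_log_likelihood(data, mu_hat, var_hat))
```

Any other choice of mean or variance gives a lower log-likelihood for the same data, which is what "best fit" means in the maximum likelihood sense.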
14
NEURAL NETWORKS
A Neural Network (NN) is an information processing paradigm inspired by the way biological nervous systems process information. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems.
15
K-MEANS CLUSTERING
The user defines the number of clusters (e.g. k = 5).
k cluster center locations are chosen.
Each data point finds the center it is closest to.
Each center finds the centroid of the points it owns and moves there.
The last two steps are repeated until the centers stop moving.
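The steps on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a production implementation: the function name `kmeans` is hypothetical, and using the first k points as the initial centers is an assumption made here to keep the example deterministic (practical implementations usually pick the initial centers at random, and a poor start can lead to a poor clustering).

```python
def kmeans(points, k, iterations=100):
    """Cluster points (tuples of numbers) into k groups with basic k-means."""
    # Assumption: the first k points serve as the initial centers.
    centers = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the center it is closest to.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: each center moves to the centroid of the points it owns.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep a center that owns no points
        if new_centers == centers:  # stop once the centers no longer move
            break
        centers = new_centers
    return centers, clusters


# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centers, clusters = kmeans(points, k=2)
print(centers)
```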
16
HIDDEN MARKOV MODELS
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. An HMM can be considered the simplest dynamic Bayesian network.
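One common way of recovering hidden information from observations is the Viterbi algorithm, which finds the most probable sequence of hidden states for a given observation sequence. The slide does not name this algorithm; it is shown here only as an illustration. The two states ("AT-rich", "GC-rich") and all probabilities below are hypothetical values chosen for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the best predecessor state for s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Hypothetical model: two hidden compositional states emitting nucleotides.
states = ("AT-rich", "GC-rich")
start_p = {"AT-rich": 0.5, "GC-rich": 0.5}
trans_p = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9},
}
emit_p = {
    "AT-rich": {"A": 0.35, "T": 0.35, "C": 0.15, "G": 0.15},
    "GC-rich": {"A": 0.15, "T": 0.15, "C": 0.35, "G": 0.35},
}

probability, path = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path)
```

For long sequences the products of probabilities become very small, so practical implementations usually work with log-probabilities instead.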
17
HIERARCHICAL CLUSTERING
The hierarchical clustering can be represented as a tree, or a dendrogram. Branch lengths represent the degree of similarity between the genes.
18
SELF-ORGANIZING MAP
This method maps the multidimensional distances of the feature space to two-dimensional distances in the output map.
19
PRINCIPAL COMPONENT ANALYSIS
The basic idea in PCA is to find the components that explain the maximum amount of variance possible by n linearly transformed components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
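For two variables the principal components can be written down in closed form, which makes the idea easy to check by hand. The sketch below is restricted to two dimensions on purpose; the function name `pca_2d` and the data are hypothetical. It computes the covariance matrix, takes its two eigenvalues as the variances explained by the components, and returns the direction of the first component.

```python
import math


def pca_2d(points):
    """Return ((lam1, lam2), pc1) for a list of (x, y) points.

    lam1 >= lam2 are the variances along the first and second principal
    components; pc1 is a unit vector along the first component.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Orientation of the axis of greatest variance.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(angle), math.sin(angle))
    return (lam1, lam2), pc1


# Hypothetical paired measurements, used only for illustration.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7)]
(lam1, lam2), pc1 = pca_2d(data)
explained_by_first = lam1 / (lam1 + lam2)
print(pc1, explained_by_first)
```

With more than two variables the same quantities come from an eigendecomposition or singular value decomposition of the covariance matrix, which is what statistical packages do internally.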
20
CORRESPONDENCE ANALYSIS
Correspondence Analysis is a multivariate exploratory graphical technique used to describe the relationship between the row and column variables of a contingency table.
21
Demonstration of Correspondence Analysis using SPSS – Study of Codon Usage Variation of Labeo rohita
22
Data Source: Nucleotide sequences of L. rohita downloaded from the GenBank of NCBI (National Center for Biotechnology Information) serve as the source of secondary data for the statistical analysis.
Core Ntd seqs = 141
Expr Seq Tags = 24
Complete cds = 7
Partial cds = 7
Complete seqs = 33
Partial seqs = 11
Clone seqs = 76
Other seqs = 8
23
Software used: The Codon Usage program of SMS (Sequence Manipulation Suite) is used to compute the RSCU (Relative Synonymous Codon Usage) values from the raw data (sequences).
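For readers unfamiliar with RSCU, the sketch below follows its usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is not the SMS implementation; the function name `rscu` and the short test sequence are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(coding_sequence):
    """Return {codon: RSCU} for every amino acid that occurs in the sequence."""
    seq = coding_sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by the amino acid they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the sequence
        expected = total / len(family)  # count under equal usage
        for c in family:
            values[c] = counts[c] / expected
    return values


# Hypothetical fragment encoding four alanines.
example = rscu("GCTGCTGCCGCA")
print({codon: example[codon] for codon in ("GCT", "GCC", "GCA", "GCG")})
```

An RSCU of 1 means the codon is used exactly as often as expected under equal usage; values above 1 indicate a preferred codon.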
24
Results for 1162 residue sequence "gi|4836724|gb|AF134200 … Labeo rohita growth hormone precursor, mRNA, complete cds" starting "GGCACGAGTT"
AmAcid Codon Number / RSCU …
Ala GCG
Ala GCA
Ala GCT
Ala GCC
Cys TGT
Cys TGC
A part of the RSCU (Relative Synonymous Codon Usage) output for one of the sequences of L. rohita, obtained using the above software, is displayed.
25
Use of SPSS for CoA: Correspondence Analysis (CoA) using SPSS (Statistical Package for the Social Sciences)
26
Output of SPSS:
Correspondence Table
Table of Row profiles
Table of Column profiles
Summary Table
Overview of Row points
Overview of Column points
Confidence column points
Confidence row points
Plot of Row points for Nucleic Acid
Plot of Column points for Codons
Biplot between row and column points
27
Output of SPSS: Correspondence table with nucleotide sequences as row variable & codon as column variable
28
Output of SPSS: The row profiles are the cell contents divided by their corresponding row totals: M(i,j) = n(i,j) / (total of row i). Example for the cell at first row and second column: M(1,2) = 0.330 / (total of row 1) ≈ 0.016.
29
Output of SPSS: The column profiles are the cell contents divided by their corresponding column totals: M'(i,j) = n(i,j) / (total of column j). Example for the cell at first row and second column: M'(1,2) = 0.330 / (total of column 2) ≈ 0.010.
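The row and column profiles described on these two slides can be reproduced with a few lines of Python. The small table below is hypothetical and stands in for the 100 × 64 correspondence table of the study; the function names are illustrative as well.

```python
def row_profiles(table):
    """Divide every cell by the total of its row."""
    return [[cell / sum(row) for cell in row] for row in table]


def column_profiles(table):
    """Divide every cell by the total of its column."""
    column_totals = [sum(column) for column in zip(*table)]
    return [
        [cell / column_totals[j] for j, cell in enumerate(row)] for row in table
    ]


# Hypothetical table: 2 sequences (rows) by 3 codons (columns).
table = [
    [2.0, 1.0, 1.0],
    [0.5, 1.5, 2.0],
]
print(row_profiles(table))
print(column_profiles(table))
```

Each row of the row-profile table sums to 1, and each column of the column-profile table sums to 1.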
30
Output of SPSS:
Number of dimensions = min(rows, columns) − 1 = min(100, 64) − 1 = 63
Singular value: SV(n) = √Inertia(n), e.g. SV(1) = √0.050 ≈ 0.223
Proportion of inertia: PI(n) = Inertia(n) / Inertia(total), e.g. PI(1) = 0.050 / 0.626 ≈ 0.080
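The arithmetic on this slide can be checked directly. The sketch below assumes a first-dimension inertia of 0.050 and a total inertia of 0.626, the two values that appear in the proportion on the slide; the variable names are illustrative.

```python
import math

n_rows, n_cols = 100, 64       # nucleotide sequences by codons
inertia_1 = 0.050              # assumed inertia of the first dimension
total_inertia = 0.626          # total inertia from the slide

max_dimensions = min(n_rows, n_cols) - 1
singular_value_1 = math.sqrt(inertia_1)
proportion_1 = inertia_1 / total_inertia

print(max_dimensions)               # number of dimensions
print(round(singular_value_1, 3))   # singular value of dimension 1
print(round(proportion_1, 3))       # share of total inertia in dimension 1
```

The square root of the rounded inertia comes out at about 0.224, close to the 0.223 shown on the slide, which SPSS presumably computes from the unrounded inertia. On these figures the first dimension accounts for roughly 8% of the total inertia.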
31
Output of SPSS: Scores in dimensions 1 & 2 = coordinates of the points in the plots
Row points with higher contribution in D1 = 54, 19, 27, 19, 7, etc.
Row points with higher contribution in D2 = 38, 33, 60, 68, etc.
32
Output of SPSS: Scores in dimensions 1 & 2 = coordinates of the points in the plots
Stronger column points in D1 = GAG, GAA, CAT, CAC, GAC, etc.
Stronger column points in D2 = GAC, CAA, GTG, CAG, CAT, etc.
33
Output of SPSS: Biplot (symmetrical normalization) of 64 codons and 100 nucleic acid sequences of L. rohita. Amino acids are plotted as blue triangles; codons are plotted as green circles.
34
Output of SPSS: Biplot between row and column points:
The plot can be divided into four quadrants. Genes and codons that fall in the same quadrant have a higher degree of association.
35
Output of SPSS: Biplot between row and column points:
The associated nucleotide sequences and codons can be identified.
Codons ending in A or T are frequent in the nucleotide sequences on the left side of the plot.
Codons ending in G or C fall on the right side of the plot.
Genes with a higher level of expression fall around the origin or centroid.
36
Output of SPSS: From the biplot, 4 associated groups can be identified.
Association 1 and Association 3
Sequences: 24, 16, 28, 5, 61, 37, 24 / 10, 66, 14, 30, 21, 30, 58
Codons: ACA, GCA / TTG, GCT, TAT, TTA / GCC, ATA, CTT, TAC, TCC, ACA
Association 2 and Association 4
Sequences: 32, 44, 47, 46, 51, 48, 40, 45, 42, 41 / 36, 35, 98, 15
Codons: AAA, TTA, AAT, AGT / GCT, GGA, GAT, TTG
37
THANK YOU