BNFO 602 Lecture 2 Usman Roshan



Bioinformatics problems
Sequence alignment: oldest and still actively studied
Genome-wide association studies: new problem, great potential for personalized medicine and personal genomics
Phylogenetics: understanding evolutionary histories

Pairwise sequence alignment
How to align two sequences?

Pairwise alignment
How to align two sequences?
We use dynamic programming
Treat DNA sequences as strings over the alphabet {A, C, G, T}

Pairwise alignment

Dynamic programming
Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)

Dynamic programming
Time and space complexity is O(mn)
Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)
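
The definition of V(i,j) above can be sketched in code as follows. This is a minimal illustration, not the lecture's own implementation: the recurrence is the standard global alignment one with a linear gap penalty, and the function name and the match, mismatch, and gap values are assumptions chosen for the example.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of s and t.

    V[i][j] holds the best score for aligning s[:i] with t[:j].
    The default scoring values are illustrative assumptions.
    """
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                V[i - 1][j] + gap,      # s[i-1] aligned to a gap
                V[i][j - 1] + gap,      # t[j-1] aligned to a gap
            )
    return V[m][n]


print(global_alignment_score("ACGT", "AGT"))  # 1: three matches and one gap
```

The table has (m+1) x (n+1) cells and each cell takes constant work, which is where the O(mn) time and space bound comes from.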

Dynamic programming
Animation slides by Elizabeth Thomas at Cold Spring Harbor Laboratory (CSHL)

How do we pick gap parameters?

Structural alignments
Recall that proteins have 3-D structure.

Structural alignment - example 1
Alignment of thioredoxins from human and fly, taken from Wikipedia. This protein is found in nearly all organisms and is essential for mammals. PDB IDs are 3TRX and 1XWC.

Structural alignment - example 2
Figure panels: unaligned proteins; computer-generated aligned proteins
2bbm and 1top are proteins from fly and chicken, respectively. Taken from

Structural alignments
We can produce high-quality manual alignments if the structure is available.
These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Benchmark alignments
Protein alignment benchmarks
–BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in studies of protein alignment
–Protein benchmarks are generally large and have been available to the research community for some time
–BAliBASE 3.0

Biologically realistic scoring matrices
PAM and BLOSUM are the most popular
PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations in 71 families of closely related proteins
BLOSUM is more recent and is computed from blocks of sequences with sufficient similarity

PAM
We need to compute the probability transition matrix M, which defines the probability of amino acid i converting to j
Examine a set of closely related sequences that are easy to align (for PAM, 1572 mutations in 71 families)
Compute probabilities of change and background probabilities by simple counting
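
The counting step above can be sketched roughly as below. This is a simplified illustration under stated assumptions, not Dayhoff's full procedure: the function name, the gap-free aligned-pair input format, the symmetric counting, and the base-2 log-odds score are all choices made for this example.

```python
import math
from collections import Counter


def substitution_log_odds(aligned_pairs):
    """Estimate background frequencies, transition probabilities, and
    log-odds scores by simple counting.

    aligned_pairs: list of (seq_a, seq_b) gap-free strings of equal length
    (a hypothetical input format).
    """
    pair_counts = Counter()
    for a, b in aligned_pairs:
        for x, y in zip(a, b):
            # Count both directions so the pair counts are symmetric.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1

    # How often each residue appears across all aligned columns.
    residue_counts = Counter()
    for (x, _), count in pair_counts.items():
        residue_counts[x] += count
    total = sum(residue_counts.values())

    background = {x: c / total for x, c in residue_counts.items()}
    # M[(i, j)]: estimated probability that residue i is aligned to residue j.
    M = {(x, y): c / residue_counts[x] for (x, y), c in pair_counts.items()}
    # Log-odds: observed transition versus the background frequency of j.
    scores = {(x, y): math.log2(p / background[y]) for (x, y), p in M.items()}
    return background, M, scores


# Hypothetical toy alignment of two closely related sequences.
background, M, scores = substitution_log_odds([("ACCA", "ACCC")])
print(M[("A", "C")], scores[("A", "C")])
```

Each row of M sums to 1, and the log-odds scores come out symmetric in i and j because the pair counts are symmetric.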

Genome-wide association studies

Application of SNPs: association with disease
Experimental design to detect cancer-associated SNPs:
–Pick random humans with and without cancer (say breast cancer)
–Perform SNP genotyping
–Look for associated SNPs
–Also called a genome-wide association study

Case-control example
Study of 100 people:
–Case: 50 subjects with cancer
–Control: 50 subjects without cancer
Count the number of alleles and form a contingency table (each subject contributes two alleles, so each row sums to 100)

            #Allele1   #Allele2
Case        10         90
Control     2          98

Odds ratio
Odds of allele 1 in cancer = a/b = e
Odds of allele 1 in healthy = c/d = f
Odds ratio of the recessive allele (allele 1) in cancer vs healthy = e/f

            #Allele1   #Allele2
Cancer      a          b
Healthy     c          d

Risk ratio (Relative risk)
Probability of allele 1 in cancer = a/(a+b) = e
Probability of allele 1 in healthy = c/(c+d) = f
Risk ratio of the recessive allele (allele 1) in cancer vs healthy = e/f

            #Allele1   #Allele2
Cancer      a          b
No cancer   c          d

Odds ratio vs Risk ratio
Risk ratio has a natural interpretation since it is based on probabilities
In a case-control model we cannot calculate the probability of cancer given the recessive allele, because subjects are chosen based on disease status and not allele type
Odds ratio shows up in logistic regression models

Example
Odds of allele 1 in case = 15/35
Odds of allele 1 in control = 2/48
Odds ratio of allele 1 in case vs control = (15/35)/(2/48) = 10.3
Risk of allele 1 in case = 15/50
Risk of allele 1 in control = 2/50
Risk ratio of allele 1 in case vs control = 15/2 = 7.5

            #Allele1   #Allele2
Case        15         35
Control     2          48
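
The two ratios can be checked with a short script. The function names are hypothetical; the arguments a, b, c, d follow the contingency-table layout used in the slides (case row first, allele 1 column first).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of allele 1 for the 2x2 table [[a, b], [c, d]]."""
    return (a / b) / (c / d)


def risk_ratio(a, b, c, d):
    """Risk ratio of allele 1 for the 2x2 table [[a, b], [c, d]]."""
    return (a / (a + b)) / (c / (c + d))


# Allele counts from the example: case = (15, 35), control = (2, 48).
print(round(odds_ratio(15, 35, 2, 48), 1))  # 10.3
print(round(risk_ratio(15, 35, 2, 48), 1))  # 7.5
```

Both functions divide by table entries, so a table with a zero count would need separate handling.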

Odds ratios in genome-wide association studies
Higher odds ratio means stronger association
Therefore SNPs with the highest odds ratios should be used as predictors or risk estimators of disease
The odds ratio is generally higher than the risk ratio when both exceed 1
Both are similar when the allele probabilities are small

Statistical test of association (P-values)
P-value = probability of observing data at least as extreme as the observed data under the null hypothesis
Example:
–Suppose we are given a series of coin tosses
–We suspect that a biased coin produced the tosses
–We can ask the following question: if the coin were fair, what is the probability of tosses at least as extreme as the ones observed?
–If this probability is very small, then a fair coin is unlikely to have produced the observed tosses
–In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin
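
For the coin-toss example, a one-sided p-value can be computed exactly from the binomial distribution. This is a minimal sketch: the function name and the toss counts (9 heads in 10 tosses) are hypothetical, and a one-sided test is assumed.

```python
from math import comb


def binomial_p_value(heads, tosses, p_heads=0.5):
    """One-sided p-value: probability of at least `heads` heads in `tosses`
    tosses when the null hypothesis (heads probability p_heads) is true."""
    return sum(
        comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


# Hypothetical data: 9 heads in 10 tosses under the fair-coin null.
print(binomial_p_value(9, 10))  # 11/1024, about 0.0107
```

A value this small would lead us to doubt the fair-coin null hypothesis in favor of the biased-coin alternative.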

Effect of population structure on genome-wide association studies
Suppose our sample is drawn from a population of two groups, I and II
Assume that group I mostly carries the first allele and group II mostly carries the second allele
Further assume that most case subjects belong to group I and most control subjects to group II
This leads to the false conclusion that the first allele is associated with the disease

Effect of population structure on genome-wide association studies
We can correct this effect if cases and controls are sampled equally from all sub-populations
To do this we need to know the population structure
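
A small calculation with expected allele counts illustrates both the false association and the correction. All numbers here are hypothetical: group I carries allele 1 at frequency 0.8 and group II at 0.2, and within each group cases and controls have the same allele frequency, so there is no true association.

```python
def pooled_odds_ratio(samples):
    """Odds ratio of allele 1 after pooling all groups into one 2x2 table.

    samples: list of (n_case_alleles, n_control_alleles, freq_allele1),
    one tuple per group. Expected counts are used instead of random draws.
    """
    a = b = c = d = 0.0
    for n_case, n_control, freq in samples:
        a += n_case * freq             # allele 1 in cases
        b += n_case * (1 - freq)       # allele 2 in cases
        c += n_control * freq          # allele 1 in controls
        d += n_control * (1 - freq)    # allele 2 in controls
    return (a / b) / (c / d)


# Most cases come from group I and most controls from group II.
unbalanced = [(80, 20, 0.8), (20, 80, 0.2)]
# Cases and controls are sampled equally from both groups.
balanced = [(50, 50, 0.8), (50, 50, 0.2)]

print(pooled_odds_ratio(unbalanced))  # about 4.5: a spurious association
print(pooled_odds_ratio(balanced))    # about 1.0: no association
```

The unbalanced design yields an odds ratio well above 1 even though the allele has no effect within either group.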

Population structure prediction
Treated as an unsupervised learning problem (i.e. clustering)
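
As one illustration of the clustering view, the sketch below runs plain k-means on genotype vectors. The slides do not name a specific method, so k-means is an assumption made for this example, as are the function name, the genotype coding (0, 1, or 2 copies of allele 1 per SNP), and the toy data.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on lists of numbers; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical genotypes: rows are subjects, columns are SNPs.
genotypes = [
    [2, 2, 1, 2],
    [0, 0, 1, 0],
    [2, 1, 2, 2],
    [0, 1, 0, 0],
    [1, 2, 2, 2],
    [0, 0, 0, 1],
]
print(kmeans(genotypes, k=2))  # [0, 1, 0, 1, 0, 1]
```

The recovered labels could then be used to check whether cases and controls are balanced across the inferred groups.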