BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,

BNFO 602 Lecture 2 Usman Roshan

Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem, great potential for personalized medicine and personal genomics Phylogenetics: understanding evolutionary histories

Pairwise sequence alignment How to align two sequences?

Pairwise alignment How to align two sequences? We use dynamic programming Treat DNA sequences as strings over the alphabet {A, C, G, T}

Pairwise alignment

Dynamic programming Define V(i,j) to be the optimal pairwise alignment score between S 1..i and T 1..j (|S|=m, |T|=n)

Dynamic programming Time and space complexity is O(mn) Define V(i,j) to be the optimal pairwise alignment score between S 1..i and T 1..j (|S|=m, |T|=n)

Dynamic programming Animation slides by Elizabeth Thomas in Cold Spring Harbor Labs (CSHL) http://meetings.cshl.org/tgac/tgac/flash/DynamicProgramming.swf

How do we pick gap parameters?

Structural alignments Recall that proteins have 3-D structure.

Structural alignment - example 1 Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.

Structural alignment - example 2 Computer generated aligned proteins Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively. Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

Structural alignments We can produce high quality manual alignments by hand if the structure is available. These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Benchmark alignments Protein alignment benchmarks –BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment. –Proteins benchmarks are generally large and have been in the research community for sometime now. –BAliBASE 3.0BAliBASE 3.0

Biologically realistic scoring matrices PAM and BLOSUM are most popular PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

PAM We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families Compute probabilities of change and background probabilities by simple counting

Genome wide association studies

Application of SNPs: association with disease Experimental design to detect cancer associated SNPs: –Pick random humans with and without cancer (say breast cancer) –Perform SNP genotyping –Look for associated SNPs –Also called genome-wide association study

Case-control example Study of 100 people: –Case: 50 subjects with cancer –Control: 50 subjects without cancer Count number of alleles and form a contingency table #Allele1#Allele2 Case1090 Control298

Odds ratio Odds of allele 1 in cancer = a/b = e Odds of allele 1 in healthy = c/d = f Odds ratio of recessive in cancer vs healthy = e/f #Allele1#Allele2 Cancerab Healthycd

Risk ratio (Relative risk) Probability of allele 1 in cancer = a/(a+b) = e Probability of allele 2 in healthy = c/(c+d) = f Risk ratio of recessive in cancer vs healthy = e/f #Allele1#Allele2 Cancerab No cancercd

Odds ratio vs Risk ratio Risk ratio has a natural interpretation since it is based on probabilities In a case-control model we cannot calculate the probability of cancer given recessive allele. Subjects are chosen based disease status and not allele type Odds ratio shows up in logistic regression models

Example Odds of allele 1 in case = 15/35 Odds of allele 1 in control = 2/48 Odds ratio of allele 1 in case vs control = (15/35)/(2/48) = 10.3 Risk of allele 1 in case = 15/50 Risk of allele 2 in control = 2/50 Risk ratio of allele 1 in case vs control = 15/2 = 7.5 #Allele1#Allele2 Case1535 Control248

Odds ratios in genome-wide association studies Higher odds ratio means stronger association Therefore SNPs with highest odds ratios should be used as predictors or risk estimators of disease Odds ratio generally higher than risk ratio Both are similar when small

Statistical test of association (P-values) P-value = probability of the observed data (or worse) under the null hypothesis Example: –Suppose we are given a series of coin-tosses –We feel that a biased coin produced the tosses –We can ask the following question: what is the probability that a fair coin produced the tosses? –If this probability is very small then we can say there is a small chance that a fair coin produced the observed tosses. –In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin

Effect of population structure on genome-wide association studies Suppose our sample is drawn from a population of two groups, I and II Assume that group I has a majority of allele type I and group II has mostly the second allele. Further assume that most case subjects belong to group I and most control to group II This leads to the false association that the major allele is associated with the disease

Effect of population structure on genome-wide association studies We can correct this effect if case and control are equally sampled from all sub-populations To do this we need to know the population structure

Population structure prediction Treated as an unsupervised learning problem (i.e. clustering)

BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,

Similar presentations

Presentation on theme: "BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,

Similar presentations

Presentation on theme: "BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,"— Presentation transcript:

Similar presentations

About project

Feedback