CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BiC BioCentrum-DTU Technical University of Denmark 1/31 Prediction of significant positions in biological sequences.


1 Prediction of significant positions in biological sequences. Ilka Hoof, Ph.D. student, Immunological Bioinformatics, Center for Biological Sequence Analysis, Danmarks Tekniske Universitet

2 Significant positions? HIV-1 gp120 (PDB: 2NY7)

3 Significant positions? HIV-1 gp120 (PDB: 2NY7). Antibody-binding site?

4 Significant positions? HIV-1 protease (PDB: 2CEN). Catalytic efficiency?

5 Significant positions? “Which sites in HIV-1 protease contribute significantly to the fitness level of an HIV-1 mutant?” “Where is the binding site of a specific antibody located on the antigen?” “Which sites are important for enzymatic activity?” Given: a multiple sequence alignment and a numerical value associated with each sequence → the values imply a ranking of the sequences. What we’re interested in: which positions distinguish high- and low-ranking sequences? E.g. binders vs. non-binders, high vs. low fitness, high vs. low enzymatic activity.

6 The data we have

7 The output we want... how do we get there?

8 SigniSite 1.0 http://www.cbs.dtu.dk/services/SigniSite/

9 SigniSite - method. Rank-based statistical test. Real-valued data: 0.002 0.084 0.128 0.273 0.593 0.892 0.923 0.999. Ranks: 1.0 2.0 3.0 4.0 5.5 7.0 8.0 9.0. Calculate mean rank for each residue type.
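The rank transformation can be sketched in Python as follows. This is a minimal illustration, not SigniSite's own code: the function name `average_ranks` is hypothetical, and the input list repeats 0.593 as an assumption so that the tied rank 5.5 appears.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


# Hypothetical input; the duplicated 0.593 is an assumption made for illustration.
example = [0.002, 0.084, 0.128, 0.273, 0.593, 0.593, 0.892, 0.923, 0.999]
print(average_ranks(example))  # [1.0, 2.0, 3.0, 4.0, 5.5, 5.5, 7.0, 8.0, 9.0]
```

Giving tied values the average of the positions they occupy keeps the sum of all ranks unchanged, which is why the expected mean rank is unaffected by ties.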

10 SigniSite - the method

11 SigniSite - the method. Calculate the mean rank for each residue type.
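For a single alignment column, the mean rank per residue type might be computed like this. The sketch assumes the ranks have already been assigned to the sequences; the column string and the helper name `mean_rank_per_residue` are hypothetical.

```python
from collections import defaultdict


def mean_rank_per_residue(column, ranks):
    """Map each residue type in one alignment column to the mean rank of its sequences."""
    grouped = defaultdict(list)
    for residue, rank in zip(column, ranks):
        grouped[residue].append(rank)
    return {residue: sum(rs) / len(rs) for residue, rs in grouped.items()}


column = "EEKEKK"                        # hypothetical column, one residue per sequence
ranks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # rank of each sequence's associated value
print(mean_rank_per_residue(column, ranks))
```

Here E sits in the sequences ranked 1, 2 and 4, and K in those ranked 3, 5 and 6, so E has the lower mean rank.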

12 SigniSite - the method. What’s the null hypothesis of our statistical test? The observed mean rank of a residue type does not significantly deviate from the expected mean rank. What is expected? We assume random distribution of the amino acids in the column. Given N sequences, the expected mean rank is $(N + 1)/2$.

13 Z score determines significance. Given the shape of the distribution, what’s significant? The Z score can be calculated from the mean and standard deviation: $Z = (R_{\mathrm{obs}} - \mu)/\sigma$, where $R_{\mathrm{obs}}$ is the observed mean rank, $\mu$ the mean and $\sigma$ the standard deviation (sd) of the distribution. A Z score above +1.96 corresponds to p < 0.025.
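As a small sketch of the Z-score step, assuming a normal approximation and a one-sided (upper-tail) test, the p-value can be obtained from the standard library's complementary error function. The function names are hypothetical.

```python
import math


def z_score(observed_mean_rank, expected_mean_rank, sd):
    """Standardize the observed mean rank against its null distribution."""
    return (observed_mean_rank - expected_mean_rank) / sd


def upper_tail_p(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


print(round(upper_tail_p(1.96), 4))  # 0.025
```

For unusually low mean ranks the lower tail applies instead, which by symmetry is `upper_tail_p(-z)`.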

14 Z score determines significance. Observed mean rank for E.

15 Are the random mean ranks normally distributed?

16 Same mean, but different standard deviation. Mean rank distributions for different frequencies: 0.5, 0.25, 0.1, 0.05.

17 How to estimate the standard deviation? Our test is reminiscent of the Wilcoxon rank statistic: given two samples of size $n_1$ and $n_2$, with $n_1 + n_2 = N$, let $R$ be the mean rank of sample 1. The distribution of mean ranks $R$ can be approximated by the normal distribution with mean $\mu = (N + 1)/2$ and standard deviation $\sigma = \sqrt{n_2 (N + 1) / (12\, n_1)}$.
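The following sketch evaluates the normal approximation for the mean rank of a group, using the standard Wilcoxon rank-sum result (rank-sum variance $n_1 n_2 (N+1)/12$, divided by $n_1^2$ for the mean rank) and ignoring ties. The alignment depth of 100 sequences is an assumption chosen only to illustrate how the spread depends on residue frequency.

```python
import math


def expected_mean_rank(n_total):
    """Mean of the ranks 1..n_total."""
    return (n_total + 1) / 2


def mean_rank_sd(n1, n_total):
    """Sd of the mean rank of n1 sequences drawn from n_total (no ties)."""
    n2 = n_total - n1
    return math.sqrt(n2 * (n_total + 1) / (12 * n1))


n_total = 100  # hypothetical number of sequences in the alignment
for frequency in (0.5, 0.25, 0.1, 0.05):
    n1 = round(frequency * n_total)
    print(frequency, expected_mean_rank(n_total), round(mean_rank_sd(n1, n_total), 2))
```

The expected mean rank is the same for every frequency, while the standard deviation grows as the residue becomes rarer.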

18 Coping with ties. Formula as before, but weighted with the tie-correction factor T, where $T = 1 - \sum_{i=1}^{m} (t_i^3 - t_i) / (N^3 - N)$ and t is a vector which contains the counts of ties, i.e. $t = (t_1, \ldots, t_m)$; m denotes the number of distinct values in the data set. Example: all values the same => T = 0; all values different => T = 1.
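A tie-correction factor with the two limiting cases named in the example can be sketched as below. The code assumes the standard Wilcoxon tie correction, $T = 1 - \sum_i (t_i^3 - t_i)/(N^3 - N)$, where $t_i$ is how often the i-th distinct value occurs. Whether SigniSite uses exactly this form is an assumption.

```python
from collections import Counter


def tie_correction(values):
    """Standard Wilcoxon tie-correction factor: 0 if all values are equal, 1 if all differ."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    tie_counts = Counter(values).values()  # t_i: occurrences of each distinct value
    return 1 - sum(t**3 - t for t in tie_counts) / (n**3 - n)


print(tie_correction([0.5, 0.5, 0.5, 0.5]))  # 0.0
print(tie_correction([0.1, 0.2, 0.3, 0.4]))  # 1.0
```

In the standard Wilcoxon result this factor scales the variance, so the standard deviation would be multiplied by the square root of T.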

19 Simple example: category 1, category 2

20 Simple example. Tie correction vs. no tie correction: standard deviation, Z score

21 Multiple testing problem. We perform a significance test for each amino acid type in each column. Problem: the more hypotheses we test, the higher the probability of obtaining at least one false positive. Each test is performed with the same type-I error $\alpha$, e.g. $\alpha = 0.05$. The total significance level $\alpha_{\mathrm{tot}}$ of m significance tests is then given by $\alpha_{\mathrm{tot}} \le 1 - (1 - \alpha)^m$. Examples: 1 test: $\alpha_{\mathrm{tot}} \le 1 - (1 - 0.05)^{1} = 0.05$; 2 tests: $\alpha_{\mathrm{tot}} \le 1 - (1 - 0.05)^{2} = 0.0975$; 100 tests: $\alpha_{\mathrm{tot}} \le 1 - (1 - 0.05)^{100} = 0.99$. Correction for multiple testing necessary!
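The growth of the total error rate can be checked numerically. The sketch assumes the tests are independent, in which case the bound is attained; the function name is hypothetical.

```python
def familywise_error(alpha, m):
    """Probability of at least one false positive among m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 2, 100):
    print(m, round(familywise_error(0.05, m), 4))
# 1 0.05
# 2 0.0975
# 100 0.9941
```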

22 How many statistical tests are performed? One test per amino acid type and column: $m = \sum_i w_i$, where $w_i$ is the number of different amino acids in column i.
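Counting the tests for a given alignment is a short loop over columns. The sketch below assumes the alignment is a list of equal-length strings and treats every distinct character, including a gap symbol if present, as its own type. How SigniSite handles gaps is not stated here.

```python
def count_tests(alignment):
    """Sum, over all columns, the number of distinct characters in that column."""
    return sum(len(set(column)) for column in zip(*alignment))


# Hypothetical three-sequence alignment: the columns hold 2, 1 and 2 distinct residues.
print(count_tests(["ACD", "ACE", "GCE"]))  # 5
```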

23 Correction for multiple testing. Adjusted p-values using Bonferroni’s single-step method: multiply all unadjusted p-values by the number of tests m. Adjusted p-values are given by $\tilde{p}_j = \min(m\, p_j, 1)$ for j = 1, ..., m.
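Bonferroni's single-step adjustment is a one-liner. In this minimal sketch the cap at 1 is the usual convention so that adjusted values remain valid probabilities; the input p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


print(bonferroni([0.01, 0.04, 0.5]))
```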

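24 Correction for multiple testing. Adjusted p-values using Holm’s step-down method: given the observed ordered unadjusted p-values $p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$, the adjusted p-values are given by $\tilde{p}_{(j)} = \max_{k = 1, \ldots, j} \min((m - k + 1)\, p_{(k)}, 1)$ for j = 1, ..., m. So, nothing more than: multiply the smallest p-value by m, the second smallest by m - 1, and so on, never letting an adjusted value fall below the one before it.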
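Holm's step-down adjustment can be sketched as follows, using the standard definition: sort the p-values, multiply the k-th smallest by m - k + 1, cap at 1, and enforce monotonicity with a running maximum. The function returns the adjusted values in the original input order; the example p-values are hypothetical.

```python
def holm(p_values):
    """Holm step-down adjusted p-values, returned in the order of the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, index in enumerate(order):
        # position is 0-based, so the multiplier is m, m-1, ..., 1.
        candidate = min(1.0, (m - position) * p_values[index])
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


print(holm([0.01, 0.04, 0.03]))
```

In this example the smallest p-value is multiplied by 3 and the next by 2. The largest would be multiplied by 1, but it is raised to the previous adjusted value so that the adjusted sequence stays non-decreasing. Holm's adjusted values are never larger than Bonferroni's, which is why the step-down method is the less conservative of the two.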

25 Application of SigniSite

26 Ab-binding affinity to HIV-1 gp120. Alignment length: 569 residues

27 SigniSite web service

28 SigniSite results. 10 significant sites identified (Holm step-down correction, $\alpha = 0.05$). Heatmap

29 SigniSite results. Sequence logos: display Z score for all amino acid types; display Z score only for significant amino acid types; “ordinary” frequency logo

30 SigniSite results

31 SDPpred http://math.genebee.msu.ru/~psn/index.htm Kalinina et al. (2004), Protein Sci 13(2): 443-56

32 SDPpred. Categories instead of continuous values. Mutual information. Amino acids with similar physico-chemical properties are weakly penalized. Statistical test: observed mutual information = expected mutual information?
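To illustrate the quantity involved, the sketch below computes the plain mutual information (in bits) between the residues in one column and the category labels of the sequences. It is not SDPpred's actual statistic: it omits the weaker penalty for physico-chemically similar amino acids, and the function name and example data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(column, groups):
    """Mutual information in bits between residue identity and group label."""
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    mi = 0.0
    for (residue, group), count in joint.items():
        p_joint = count / n
        # Ratio p(residue, group) / (p(residue) * p(group)), written with counts.
        ratio = count * n / (residue_counts[residue] * group_counts[group])
        mi += p_joint * math.log2(ratio)
    return mi


print(mutual_information("AAKK", [1, 1, 2, 2]))  # 1.0: the column separates the groups
print(mutual_information("AKAK", [1, 1, 2, 2]))  # 0.0: the column carries no information
```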

33 SDPpred - Results

34 SDPpred - Results

35 SDPpred - Results

36 Conclusion. You can use SigniSite and SDPpred to find sites of interest in your biological data. Logos are a nice and clear way of displaying sequence information. Whenever you perform statistical tests, remember the multiple testing problem!

