
Nothing in (computational) biology makes sense except in the light of evolution after Theodosius Dobzhansky (1970) Using (and abusing) sequence analysis.


1 Nothing in (computational) biology makes sense except in the light of evolution (after Theodosius Dobzhansky, 1970)
Using (and abusing) sequence analysis to make biological discoveries

2 Only a small fraction of amino acid residues is directly involved in protein function (including enzymatic); the rest of the protein serves largely as structural scaffold.
Significant sequence similarity is evidence of homology.
Conserved sequence motifs are determinants of conserved ancestral functions.

3 The evolving roles of computational analysis in biology
Pre-sequencing era (before 1978-80)
Study biological function -> Clone/sequence gene -> Analyze/interpret sequence
Pre-genomic era (1980-1996)
Sequence genome -> Analyze/interpret sequences of all genes -> Prioritize targets -> Study biological function
Post-genomic era (1996-)


5 Sequence complexity
Measure of the randomness of a sequence.
Random sequence: highest complexity (entropy); typical of globular protein domains.
Homopolymer: lowest complexity (entropy); typical of non-globular structures.
Algorithmic complexity:
QQQQQQQQQQQQQ = (Q)n
KRKRKRKRKRKR = (KR)n
ASDFGHKLCVNM: random sequence; no algorithm to derive it from a simpler one.
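The entropy side of this slide can be illustrated with a minimal Python sketch that computes the Shannon entropy of residue composition for the three example strings. The function names are illustrative, and this is not the algorithm used by any particular low-complexity filter. Compositional entropy also does not capture algorithmic complexity: it ignores residue order, so any shuffled sequence with the same composition as (KR)n scores the same.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(sequence)
    n = len(sequence)
    # Sum of p * log2(1/p) over the residues that occur in the sequence.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def windowed_entropy(sequence, window=12):
    """Entropy of every full-length window; low values flag candidate
    low-complexity stretches. The window size is an arbitrary choice here."""
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    for seq in ("QQQQQQQQQQQQQ", "KRKRKRKRKRKR", "ASDFGHKLCVNM"):
        print(seq, round(shannon_entropy(seq), 3))
```

The homopolymer has zero entropy, the dipeptide repeat has one bit per residue, and the string of twelve distinct residues reaches the maximum for its length, log2(12) bits.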

6 seg BRCA1 45 3.4 3.7 > BRCA1.seg
>gi|728984|sp|P38398|BRC1_HUMAN Breast cancer type 1 susceptibility protein
Lowercase: non-globular regions. Uppercase: globular domains.
1-388 (globular): MDLSALRVEEVQNVINAMQKILECPICLEL IKEPVSTKCDHIFCKFCMLKLLNQKKGPSQ CPLCKNDITKRSLQESTRFSQLVEELLKII CAFQLDTGLEYANSYNFAKKENNSPEHLKD EVSIIQSMGYRNRAKRLLQSEPENPSLQET SLSVQLSNLGTVRTLRTKQRIQPQKTSVYI ELGSDSSEDTVNKATYCSVGDQELLQITPQ GTRDEISLDSAKKAACEFSETDVTNTEHHQ PSNNDLNTTEKRAAERHPEKYQGSSVSNLH VEPCGTNTHASSLQHENSSLLLTKDRMNVE KAEFCNKSKQPGLARSQHNRWAGSKETCND RRTPSTEKKVDLNADPLCERKEWNKQKLPC SENPRDTEDVPWITLNSSIQKVNEWFSR
389-458 (non-globular): sdellgsddshdgesesnakvadvldvlne vdeysgssekidllasdphealickservh sksvesnied
459-526 (globular): KIFGKTYRKKASLPNLSHVTENLIIGAFVT EPQIIQERPLTNKLKRKRRPTSGLHPEDFI KKADLAVQ
527-635 (non-globular): ktpeminqgtnqteqngqvmnitnsghenk tkgdsiqneknpnpieslekesafktkaep isssisnmelelnihnskapkknrlrrkss trhihalelvvsrnlsppn
636-995 (globular): CTELQIDSCSSSEEIKKKKYNQMPVRHSRN LQLMEGKEPATGAKKSNKPNEQTSKRHDSD TFPELKLTNAPGSFTKCSNTSELKEFVNPS LPREEKEEKLETVKVSNNAEDPKDLMLSGE RVLQTERSVESSSISLVPGTDYGTQESISL LEVSTLGKAKTEPNKCVSQCAAFENPKGLI HGCSKDNRNDTEGFKYPLGHEVNHSRETSI EMEESELDAQYLQNTFKVSKRQSFAPFSNP GNAEEECATFSAHSGSLKKQSPKVTFECEQ KEENQGKNESNIKPVQTVNITAGFPVVGQK DKPVDNAKCSIKGGSRFCLSSQFRGNETGL ITPNKHGLLQNPYRIPPLFPIKSFVKTKCK
996-1089 (non-globular): knlleenfeehsmsperemgnenipstvst isrnnirenvfkeasssninevgsstnevg ssineigssdeniqaelgrnrgpklnamlr lgvl
1090-1238 (globular): QPEVYKQSLPGSNCKHPEIKKQEYEEVVQT VNTDFSPYLISDNLEQPMGSSHASQVCSET PDDLLDDGEIKEDTSFAENDIKESSAVFSK SVQKGELSRSPSPFTHTHLAQGYRRGAKKL ESSEENLSSEDEELPCFQHLLFGKVNNIP
1239-1312 (non-globular): sqstrhstvateclsknteenllslknsln dcsnqvilakasqehhlseetkcsaslfss qcseledltantnt
1313-1316 (globular): QDPF

7 1422-1513 (globular): GSQPSNSYPSIISDSSALEDLRNPEQSTSE KAVLTSQKSSEYPISQNPEGLSADKFEVSA DSSTSKNKEPGVERSSPSKCPSLDDRWYMH SC
1514-1616 (non-globular): sgslqnrnypsqeelikvvdveeqqleesg phdltetsylprqdlegtpylesgislfsd dpesdpsedrapesarvgnipsstsalkvp qlkvaesaqspaa
1617-1863 (globular): AHTTDTAGYNAMEESVSREKPELTASTERV NKRMSMVVSGLTPEEFMLVYKFARKHHITL TNLITEETTHVVMKTDAEFVCERTLKYFLG IAGGKWVVSYFWVTQSIKERKMLNEHDFEV RGDVVNGRNHQGPKRARESQDRKIFRGLEI CCYGPFTNMPTDQLEWMVQLCGASVVKELS SFTLGTGVHPIVVVQPDAWTEDNGFHAIGQ MCEAPVVTREWVLDSVALYQCQELDTYLIP QIPHSHY


16 Paradigm shift in database searching
Traditional (PSI-BLAST): Query sequence -> Sequence database -> Set of homologs -> PSSM
New (RPS-BLAST): Query sequence -> PSSM database -> Domain architecture
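To make the role of the PSSM concrete, here is a minimal Python sketch that builds a position-specific scoring matrix from a small ungapped alignment and slides it along a query. Everything in it is an assumption for illustration: the toy alignment, the uniform background frequencies, the flat pseudocount, and the function names. PSI-BLAST and RPS-BLAST use considerably more elaborate weighting and statistics.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes an ungapped alignment of equal-length sequences and a uniform
    background frequency of 1/20 (both simplifications).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum of the column scores for a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Return (score, 0-based start) of the best-scoring window."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    toy_alignment = ["CPIC", "CPLC", "CKIC", "CPVC"]  # hypothetical motif
    pssm = build_pssm(toy_alignment)
    print(best_hit(pssm, "AAACPICAAA"))
```

In the traditional direction the matrix is derived from the homologs found for one query; in the new direction a library of such precomputed matrices is what gets searched.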


25 Domain architecture of selected BRCT proteins (figure)
Proteins shown: BRCA1; BARD1; REV1; DPB11; DNA ligase III; PARP; TdT; RFC1; DNA ligase; BRCA1/BARD homolog.
Taxon labels: yeast; human; vertebrates; eukaryotes; bacteria; plant.
Domain labels: RING; BRCT; CMP-trans; ATP-dep ligase; AZF; HhH; polX; ATP and PCNA-binding; NAD-dep ligase; PHD-l.


41 Use of profile libraries to examine domain representation in individual proteomes
Profile library searched against the yeast (6,200 proteins) and worm (~20,000 proteins) protein sets.
Detect domains using PSI-BLAST, IMPALA.
Compare domain distributions.
Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, Botstein D. 1998. Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282: 2022-2028.

42 Normalized domain counts in worm and yeast
1. Hormone receptor; 2. POZ; 3. EGF; 4. MATH; 5. PTPase; 6. Cation channels; 7. PDZ; 8. SH2; 9. FNIII; 10. Homeodomain; 11. LRR; 12. EF hands; 13. Ankyrin; 14. RING finger; 15. C2H2 finger; 16. small GTPase; 17. RRM; 18. AAA+; 19. C6 finger
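The slide does not say how the counts were normalized. One plausible reading, sketched below in Python, is to divide the number of proteins containing each domain by the proteome size, so that the roughly 6,200-protein yeast set and the roughly 20,000-protein worm set can be compared directly. The per-1,000 scaling, the function name, and the hit counts in the example are assumptions for illustration, not data from the presentation.

```python
# Approximate proteome sizes quoted on the previous slide.
PROTEOME_SIZE = {"yeast": 6200, "worm": 20000}


def normalized_counts(domain_hits, proteome_size, per=1000):
    """Convert raw domain counts to counts per `per` proteins."""
    return {
        domain: per * count / proteome_size
        for domain, count in domain_hits.items()
    }


if __name__ == "__main__":
    # Hypothetical raw counts, for illustration only.
    raw = {
        "yeast": {"RING finger": 31, "Homeodomain": 6},
        "worm": {"RING finger": 100, "Homeodomain": 80},
    }
    for organism, hits in raw.items():
        print(organism, normalized_counts(hits, PROTEOME_SIZE[organism]))
```

Under this normalization a domain with similar values in both organisms scales with proteome size, while a large difference points to lineage-specific expansion or loss.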

43 Searching a domain library is often easier and more informative than searching the entire sequence database. However, the latter yields complementary information and should not be skipped if details are of interest.
Varying the search parameters, e.g. switching composition-based statistics on and off, can make a difference.
Using subsequences, preferably chosen according to objective criteria, e.g. separation from the rest of the protein by a low-complexity linker, may improve search performance.
Trying different queries is a must when analyzing protein (super)families.
Even hits below the threshold of statistical significance are often worth analyzing, albeit with extreme care.
Transferring functional information between homologs on the basis of a database description alone is dangerous. Conservation of domain architectures, active sites and other features needs to be analyzed (hence automated identification of protein families is difficult and automated prediction of functions is extremely error-prone).
Always do a reality check!
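The advice about choosing subsequences by objective criteria can be sketched in Python as follows: given a sequence in which low-complexity regions are in lowercase (the convention in the BRCA1 output earlier in the deck), extract the uppercase stretches as candidate queries with their 1-based coordinates. The function name and the minimum-length cutoff are assumptions for illustration.

```python
import re


def globular_segments(masked_sequence, min_length=30):
    """Return (start, end, subsequence) for each uppercase stretch.

    Coordinates are 1-based and inclusive. Stretches shorter than
    `min_length` are skipped because they rarely make useful queries
    (the cutoff is an arbitrary choice).
    """
    segments = []
    for match in re.finditer(r"[A-Z]+", masked_sequence):
        if match.end() - match.start() >= min_length:
            segments.append((match.start() + 1, match.end(), match.group()))
    return segments


if __name__ == "__main__":
    toy = "ACDEF" + "ssss" + "GHIKLMN"  # toy masked sequence
    for start, end, subseq in globular_segments(toy, min_length=5):
        print(f"{start}-{end} {subseq}")
```

Each returned segment can then be submitted as a separate query, which keeps a low-complexity linker from dominating the search.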

