1 Lessons 5-6 Classifying a protein / Inside the genome.

2 Learning about a protein What does a protein do??  Post-translational modifications – phosphorylation, glycosylation, etc.  Identifying patterns, motifs  Secondary structure  Tertiary/quaternary structure  Protein-protein interactions

3 Domains & Motifs

4 Domains
• An analysis of known 3-D protein structures reveals that, rather than being monolithic, many of them contain multiple folding units.
• Each such folding unit is a domain (typically more than 50 and fewer than 500 amino acids).

5 calcium/calmodulin-dependent protein kinase
SH2 domains interact with phosphorylated tyrosines and are thus part of intracellular signal-transducing proteins. They are characterized by specific sequences and tertiary structure.

6 What is a motif??  A sequence motif = a certain sequence that is widespread and conjectured to have biological significance  Examples: KDEL – ER-lumen retention signal PKKKRKV – an NLS (nuclear localization signal)

7 More loosely defined motifs  KDEL (usually) +  HDEL (rarely) =  [HK]-D-E-L: H or K at the first position  This is called a pattern (in Biology), or a regular expression (in computer science)

8 Syntax of a pattern  Example: W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]

9 Patterns
W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]
• x(9,11): any amino acid, repeated between 9 and 11 times
• [FYV]: F or Y or V
Example sequences:
WOPLASDFGYVWPPPLAWS
ROPLASDFGYVWPPPLAWS
WOPLASDFGYVWPPPLSQQQ

10 Patterns - syntax
• Residues are written in the standard IUPAC one-letter codes.
• 'x': any amino acid.
• '[]': residues allowed at the position.
• '{}': residues forbidden at the position.
• '()': repetition of a pattern element, given in parentheses: x(n) for an exact count, x(n,m) for a range.
• '-': separates pattern elements.
• '<': restricts the pattern to the N-terminus.
• '>': restricts the pattern to the C-terminus.
• '.': the period ends the pattern.
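The syntax rules above map almost one-to-one onto regular expressions. The following is a minimal sketch, assuming only the elements listed on this slide; the function name `prosite_to_regex` and the helper `ELEMENT` are illustrative and not part of any official tool.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repetition such as (3) or (9,11).
ELEMENT = re.compile(
    r"^(?P<core>[A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?$"
)

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")      # the period ends the pattern
    parts = []
    for element in pattern.split("-"):         # '-' separates elements
        at_start = element.startswith("<")     # N-terminal restriction
        at_end = element.endswith(">")         # C-terminal restriction
        element = element.lstrip("<").rstrip(">")
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = m.group("core")
        if core in ("x", "X"):                 # any amino acid
            core = "."
        elif core.startswith("{"):             # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if m.group("lo"):
            hi = m.group("hi")
            core += "{" + m.group("lo") + ("," + hi if hi else "") + "}"
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)

regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]")
for seq in ("WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"):
    print(seq, bool(re.search(regex, seq)))
```

Of the three example sequences from the previous slide, only the first matches: the second lacks the leading W, and the third has no residue from [GSTNE] 6 or 7 positions after the [FYW] position.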

11 Pattern ~ motif ~ signature
• A pattern (similar to a consensus and a profile) is a way to represent a conserved sequence.
• Whereas a profile and a consensus usually relate to the entire sequence, a pattern usually relates to a few tens of amino acids.

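12 Profile-pattern-consensus
Multiple alignment:
GTTCAA
GCTGAA
CTTCAC
Consensus: GTTCAA (most frequent base per column), or NNTNAN if only fully conserved columns are kept
Pattern: [GC]-[TC]-T-[GC]-A-[AC]
Profile (frequency of each base per alignment column):
Column   1      2      3      4      5      6
A        0      0      0      0      1      0.66
T        0      0.66   1      0      0      0
C        0.33   0.33   0      0.66   0      0.33
G        0.66   0      0      0.33   0      0
Information: consensus < pattern < profile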
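To make the relation between the three representations concrete, here is a small sketch that derives a profile, a consensus and a pattern from the three-sequence alignment on this slide. The function names are illustrative, and the alignment is assumed to be gap-free.

```python
from collections import Counter

def column_profile(alignment, alphabet="ACGT"):
    """Frequency of each base in each alignment column."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):             # iterate over columns
        counts = Counter(column)
        profile.append({base: counts[base] / n for base in alphabet})
    return profile

def consensus(profile):
    """Most frequent base per column (ties go to the first base in the alphabet)."""
    return "".join(max(col, key=col.get) for col in profile)

def pattern(profile):
    """All bases observed per column, written in pattern syntax."""
    parts = []
    for col in profile:
        present = [base for base, freq in col.items() if freq > 0]
        parts.append(present[0] if len(present) == 1 else "[" + "".join(present) + "]")
    return "-".join(parts)

alignment = ["GTTCAA", "GCTGAA", "CTTCAC"]
profile = column_profile(alignment)
print(consensus(profile))   # GTTCAA
print(pattern(profile))     # [CG]-[CT]-T-[CG]-A-[AC]
```

The profile keeps the frequencies, the pattern keeps only which bases occur, and the consensus keeps a single base per column, which is why the information content decreases in that order.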

13 Interpro  Interpro: a collection of many protein signature databases (Prosite, Pfam, Prints … ) integrated into a hierarchical classifying system

14 Interpro example

15 PTM – Post-Translational Modification

16 PTM: Post-Translational Modification
• Phosphorylation: Tyr, Ser, Thr
• Glycosylation (addition of sugars): Asn, Ser, Thr
• Addition of fatty acids (e.g. N-myristoylation, S-palmitoylation)

17 So how to predict?
Take into account:
1. Context (motif):
   PKC (a kinase) recognizes X-[ST]-X-[RK]
   N-myristoylation occurs at M-G-X-X-X-[ST]
   Often we don't know the exact motif!
2. Conservation:
   Is the motif found (for instance, in human) also conserved in related organisms (for instance, in chimp)?
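Both criteria on this slide, context and conservation, can be combined in a few lines. The sketch below scans for the PKC motif [ST]-x-[RK] and keeps only sites that also occur at the same position in a related sequence. The two sequences are invented for illustration, and the orthologs are assumed to be already aligned, of equal length and without gaps.

```python
import re

# [ST]-x-[RK]; the lookahead lets overlapping sites be reported.
PKC_SITE = re.compile(r"(?=([ST].[RK]))")

def pkc_sites(seq):
    """Zero-based positions of the S/T residue of every [ST]-x-[RK] match."""
    return [m.start() for m in PKC_SITE.finditer(seq)]

def conserved_sites(seq, ortholog):
    """Sites found at the same position in both (pre-aligned) sequences."""
    other = set(pkc_sites(ortholog))
    return [pos for pos in pkc_sites(seq) if pos in other]

human = "MASAKTPRLLSGR"    # hypothetical sequence
chimp = "MASAKTPKLLAGR"    # hypothetical ortholog
print(pkc_sites(human))               # [2, 5, 10]
print(conserved_sites(human, chimp))  # [2, 5]
```

A motif this short matches by chance very often, which is why the conservation filter (and, where available, structural information) matters.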

18 Prediction problems  Signal for detection is very short  Not enough biological knowledge for characterizing the signal  Tertiary structure

19 Prediction will be more efficient if more information is available

20 Secondary Structure

21 Secondary Structure  Reminder- secondary structure is usually divided into three categories: Alpha helix Beta strand (sheet) Anything else – turn/loop

22 Secondary Structure  An easier question – what is the secondary structure when the 3D structure is known?

23 DSSP
• DSSP (Dictionary of Protein Secondary Structure) assigns secondary structure to proteins whose 3D structure is known
H = alpha helix
B = beta bridge (isolated residue)
E = extended beta strand
G = 3-turn helix
I = 5-turn helix
T = hydrogen-bonded turn
S = bend

24 Predicting secondary structure from primary sequence

25 Chou and Fasman (1974)
The propensity of an amino acid to be part of a certain secondary structure (e.g. proline has a low propensity of being in an alpha helix or a beta sheet: it is a breaker)
Name            P(a)   P(b)   P(turn)
Alanine         142    83     66
Arginine        98     93     95
Aspartic Acid   101    54     146
Asparagine      67     89     156
Cysteine        70     119    119
Glutamic Acid   151    37     74
Glutamine       111    110    98
Glycine         57     75     156
Histidine       100    87     95
Isoleucine      108    160    47
Leucine         121    130    59
Lysine          114    74     101
Methionine      145    105    60
Phenylalanine   113    138    60
Proline         57     55     152
Serine          77     75     143
Threonine       83     119    96
Tryptophan      108    137    96
Tyrosine        69     147    114
Valine          106    170    50

26 Chou-Fasman prediction
• Look for a series of >4 amino acids which all have (for instance) alpha helix values >100
• Extend (…)
• Accept as alpha helix if average alpha score > average beta score
     Ala  Pro  Tyr  Phe  Phe  Lys  Lys  His  Val  Ala  Thr
α    142  57   69   113  113  114  114  100  106  142  83
β    83   55   147  138  138  74   74   87   170  83   119
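The nucleation and acceptance steps can be sketched as follows. This is a simplified illustration of the rule as stated on the slide, not the full Chou-Fasman procedure: the extension step is omitted, and a residue is treated as helix-favourable when P(a) >= 100 (an assumption made here so that His, with exactly 100, does not break the run). The propensities are the P(a) and P(b) columns of the table above, keyed by one-letter code.

```python
# (P(a), P(b)) per residue, copied from the Chou-Fasman table above.
PROPENSITY = {
    "A": (142, 83),  "R": (98, 93),   "D": (101, 54),  "N": (67, 89),
    "C": (70, 119),  "E": (151, 37),  "Q": (111, 110), "G": (57, 75),
    "H": (100, 87),  "I": (108, 160), "L": (121, 130), "K": (114, 74),
    "M": (145, 105), "F": (113, 138), "P": (57, 55),   "S": (77, 75),
    "T": (83, 119),  "W": (108, 137), "Y": (69, 147),  "V": (106, 170),
}

def helix_segments(seq, min_len=5, threshold=100):
    """Runs of >= min_len helix-favourable residues whose mean P(a) beats mean P(b).

    Returns half-open (start, end) index pairs into seq.
    """
    segments = []
    start = None
    for i, aa in enumerate(seq + "-"):          # sentinel closes the last run
        favourable = aa in PROPENSITY and PROPENSITY[aa][0] >= threshold
        if favourable and start is None:
            start = i
        elif not favourable and start is not None:
            run = seq[start:i]
            if len(run) >= min_len:
                mean_a = sum(PROPENSITY[r][0] for r in run) / len(run)
                mean_b = sum(PROPENSITY[r][1] for r in run) / len(run)
                if mean_a > mean_b:
                    segments.append((start, i))
            start = None
    return segments

# Ala Pro Tyr Phe Phe Lys Lys His Val Ala Thr, the example from the slide
print(helix_segments("APYFFKKHVAT"))   # [(3, 10)]
```

For the slide's example the run Phe-Phe-Lys-Lys-His-Val-Ala has a mean alpha score of about 114.6 against a mean beta score of about 109.1, so it is accepted as helix under these simplified rules.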

27 Chou and Fasman (1974)  Success rate of 50%

28 Improvements in the 1980’s  Conservation in MSA  Smarter algorithms (e.g. HMM, neural networks).

29 Accuracy
• Accuracy of prediction seems to hit a ceiling of 70-80%
Accuracy   Method
50%        Chou & Fasman
69%        Adding the MSA
70-80%     MSA + sophisticated computations

30 Gene Ontology

31 GO  Gene Ontology – a project for consistent description of gene products in different databases.  Consistent description - Common key definitions. Example: ‘ protein synthesis ’ or ‘ translation ’

32 GO  GO - GO describes proteins in terms of : biological process cellular component molecular function  GO is not: –A sequence database. –A portal for sequence information

33 GO structure
[Figure: example of the GO hierarchy, from general to specific: cellular component > cell > nucleus > nuclear chromosome]

34 GO example
Links from the Swiss-Prot entry of human protein kinase C alpha

35 Examples for use of GO  Enrichment for a GO category: 1. Do all upregulated genes in a microarray you built belong to the same GO “ molecular function ” category? 2. You have predicted a new transcription factor binding site. Do all genes with this site belong to the same GO biological process?
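In practice, enrichment is usually phrased as a statistical question: are there more genes from one GO category in my list than expected by chance? One common way to answer it is a one-sided hypergeometric test, sketched below. The function name and the numbers in the usage line are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """P(at least `hits` annotated genes in a random sample), hypergeometric.

    population: number of genes on the array (or in the genome)
    annotated:  genes in the population that carry the GO term
    sample:     genes in our list (e.g. upregulated genes)
    hits:       genes in our list that carry the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total

# Hypothetical numbers: 10,000 genes, 200 with the term, 50 upregulated, 8 of them with the term.
print(enrichment_p_value(10_000, 200, 50, 8))
```

A small p-value suggests the category is over-represented in the list; when many GO categories are tested at once, a multiple-testing correction is needed as well.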

36 Evaluation of prediction methods

37 Evaluation of prediction methods
• Comparing our results to experimentally verified sites
                             Our prediction gives:
Is the prediction correct?   Positive (hit)                 Negative
True                         True-positive                  True-negative
False                        False-positive (false alarm)   False-negative (miss)

38 Method evaluation
                             Our prediction gives:
Is the prediction correct?   Positive (hit)                 Negative
True                         True-positive                  True-negative
False                        False-positive (false alarm)   False-negative (miss)
• A good method will be one with a high level of true-positives and true-negatives, and a low level of false-positives and false-negatives

39 Calibrating the method  All methods have a parameter (or a score) that can be calibrated to improve the accuracy of the method.  For example: the E-value cutoff in BLAST

40 Calibrating E-value cutoff  Reminder: the lower the E-value, the more ‘ significant ’ the alignment between the query and the hit.

41 Calibrating the E-value
• What will happen if we raise the E-value cutoff (for instance, work with all hits with an E-value < 10)?
                             Our prediction gives:
Is the prediction correct?   Positive (hit)                 Negative
True                         True-positive                  True-negative
False                        False-positive (false alarm)   False-negative (miss)

42 Calibrating the E-value
• On the other hand: what if we lower the E-value cutoff (look only at hits with E-value < 10^-8)?
                             Our prediction gives:
Is the prediction correct?   Positive (hit)                 Negative
True                         True-positive                  True-negative
False                        False-positive (false alarm)   False-negative (miss)

43 Improving prediction  Trade-off between specificity and sensitivity

44 Sensitivity vs. specificity
• Sensitivity = True positive / (True positive + False negative)
  The denominator represents all the proteins which really are phosphorylated.
  Sensitivity measures how well we hit real phosphorylations.
• Specificity = True negative / (True negative + False positive)
  The denominator represents all the proteins which really are NOT phosphorylated.
  Specificity measures how well we avoid calling real non-phosphorylations.

45
• Raising the E-value cutoff to 10: sensitivity increases, specificity decreases
• Lowering the E-value cutoff to 10^-8: sensitivity decreases, specificity increases
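The trade-off can be checked numerically. The sketch below applies two E-value cutoffs to a small invented list of hits, each labelled with whether it is a real homolog, and computes sensitivity and specificity from the resulting confusion-matrix counts. The data and function names are made up for illustration.

```python
def confusion_counts(hits, cutoff):
    """Count TP, FP, TN, FN when every hit with E-value < cutoff is called positive."""
    tp = fp = tn = fn = 0
    for e_value, is_real in hits:
        predicted_positive = e_value < cutoff
        if predicted_positive and is_real:
            tp += 1
        elif predicted_positive and not is_real:
            fp += 1
        elif not predicted_positive and not is_real:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# (E-value, is it a real homolog?) for ten hypothetical hits
hits = [
    (1e-50, True), (1e-20, True), (1e-9, True), (1e-5, True), (0.5, True),
    (1e-10, False), (1e-3, False), (2.0, False), (8.0, False), (50.0, False),
]

for cutoff in (10, 1e-8):
    tp, fp, tn, fn = confusion_counts(hits, cutoff)
    print(cutoff, sensitivity(tp, fn), specificity(tn, fp))
```

With this toy data the permissive cutoff of 10 gives sensitivity 1.0 and specificity 0.2, while the strict cutoff of 10^-8 gives sensitivity 0.6 and specificity 0.8.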

46 Over-predictions: example
• Many PTM predictors tend to over-predict: a high level of false positives, hence low specificity
WHY?
1. Tertiary structure! (buried/exposed, tertiary motifs)
2. The phosphorylation recognition mechanism is not completely clear!

47 Inside the genome

48 2001: the human genome

49 Neck and neck competition
• Celera Genomics (private company) versus the International Human Genome Sequencing Consortium (publicly funded)

50 The highlights  ~30,000 genes in the human genome (today – estimated at 20-25K)  Oases of genes in empty deserts  Long-range variation in GC content  Repetitive elements rule

51 How many genes in the genome?
• Based on the ratio of genome size to average gene size: 100,000
• Based on ESTs: 35,000-120,000

52 Detecting genes in the human genome Gene finding methods:  Ab initio The challenge: small exons in a sea of introns  Homology-based The problem: will not detect novel genes

53 Genscan (ab initio)  Based on a probabilistic model of a gene structure  Takes into account: - gene composition – exons/introns - GC content - splice signals - promoters  Goes over all 6 reading frames Burge and Karlin, 1997, Prediction of complete gene structure in human genomic DNA, J. Mol. Biol. 268

54 Splicing

55 Splicing Mechanism Note: small exons in an ‘ocean’ of introns typical exon – hundreds bp typical intron – thousands bp

56 Eukaryotic splice sites Poly-pyrimidine tract

57 CpG Islands: another signal  CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base-pairs!) than the rest of the genome  CpG islands often occur near the beginning of genes  maybe related to the binding of the TF Sp1
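A standard way to quantify "a higher frequency of CG dinucleotides" is the observed/expected CpG ratio, combined with GC content. The sketch below uses thresholds that are widely quoted in the literature (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6); these numbers are not from the slide and are passed as parameters so they can be changed.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")                      # CG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: c * g / n
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply length, GC-content and observed/expected thresholds to one window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_ratio

print(cpg_stats("CGCGCGCG"))              # (1.0, 2.0)
print(looks_like_cpg_island("CG" * 100))  # True
print(looks_like_cpg_island("AT" * 100))  # False
```

A genome scan would slide such a window along the sequence and merge adjacent windows that pass the test.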

58 Human genome gene count
1. Ab initio: Genscan
2. Confirmation using:
   • ESTs
   • mRNA
   • Known protein motifs (Pfam) from any organism
3. Known genes: RefSeq, Swiss-Prot, TrEMBL

59 Human genome gene count  31,000 genes  1.5% of the genome: coding  33% - transcribed into genes

60 Comparative proteome analysis Functional categories based on GO, for genes which matched an entry in Interpro

61 Comparative proteome analysis  Humans have more proteins involved in cytoskeleton, immune defense, and transcription

62 Evolutionary conservation of human proteins  Performed BLASTP of each protein against the ‘ nr ’ NCBI database PSI-BLAST: non- vertebrates also

63 Horizontal (lateral) gene transfer   Lateral Gene Transfer (LGT) is any process in which an organism transfers genetic material to another organism that is not its offspring

64 Mechanisms:  Transformation  Transduction (phages/viruses)  Conjugation

65 Bacteria to vertebrate LGT criteria  Homologs in bacteria  Homologs in vertebrates (detected in PSI-BLAST)  No significant homologs in non- vertebrates

66 Bacteria to vertebrate LGT detection
• E-value of the bacterial homolog at least 9 orders of magnitude better than that of the (non-vertebrate) eukaryal homolog
Human query:
Hit             E-value
Frog            4e-180
Mouse           1e-164
E. coli         7e-124
Streptococcus   9e-71
Worm            0.1
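The criteria from the last two slides can be expressed as a simple filter over BLAST hits. This is a sketch under stated assumptions: each hit is already labelled with one of three group names chosen here for illustration ("vertebrate", "bacteria", "non-vertebrate"), and "X9 better" is read as a gap of at least 9 orders of magnitude between the best bacterial and the best non-vertebrate E-value.

```python
import math

def is_lgt_candidate(hits, min_orders=9.0):
    """hits: list of (group, e_value) pairs for one human query protein."""
    def best(group):
        values = [e for g, e in hits if g == group]
        return min(values) if values else None

    best_bacterial = best("bacteria")
    best_vertebrate = best("vertebrate")
    best_nonvertebrate = best("non-vertebrate")

    if best_bacterial is None or best_vertebrate is None:
        return False                      # needs homologs in both groups
    if best_nonvertebrate is None:
        return True                       # no non-vertebrate homolog at all
    floor = 1e-300                        # guards against log10(0) for E-values reported as 0
    gap = math.log10(max(best_nonvertebrate, floor)) - math.log10(max(best_bacterial, floor))
    return gap >= min_orders

# The hit list shown on the slide
slide_hits = [
    ("vertebrate", 4e-180), ("vertebrate", 1e-164),
    ("bacteria", 7e-124), ("bacteria", 9e-71),
    ("non-vertebrate", 0.1),
]
print(is_lgt_candidate(slide_hits))   # True
```

For the slide's example the E. coli hit (7e-124) is roughly 122 orders of magnitude better than the worm hit (0.1), so the query passes the filter.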

67 Bacteria to vertebrate LGT vertebrates Bacteria Non- vertebrates

68 Bacteria to vertebrate LGT  Genes with a role in metabolism of xenobiotics or stress response  Selective advantage for these transfers.  May be highly important immune gene

70 Bacteria to vertebrate LGT??
• Hundreds of sequenced bacterial genomes vs. a handful of eukaryotes
• Gene finding in bacteria is much easier than in eukaryotes
• On the practical side: rigid mechanical barriers to LGT in eukaryotes (nucleus, germ line)

71 Repetitive elements in the human genome

72 The C-value paradox
• Genome size does not correlate with organism complexity
                  Amoeba       Rice          Human         Yeast
Genome size       67 billion   4.3 billion   3 billion     12 million
Number of genes   ?            ~30,000       20-25,000     6,275

73 Repetitive elements  The C-value mystery was partially resolved when it was found that large portions of genomes contain repetitive elements

74 Repeats in the human genome
• ~50% of the human genome (~1% coding):
1. Transposon derived (= interspersed repeats) (45% of the genome)
2. Retrotransposed cellular genes
3. Sequence repeats: (A)n, (CG)n, etc.
4. Segmental duplications

75 DNA Transposons & Retrotransposons
DNA transposons
• Encode a transposase enzyme
• Cut-and-paste mechanism: the transposase binds to the inverted repeats of the transposon and to a target sequence in the DNA
• Replicative transposition
Retrotransposons
• Encode reverse transcriptase and endonuclease
• Transposition via an RNA intermediate

76 Transposable elements in the human genome
[Figure: retrotransposons, divided into non-LTR retrotransposons and LTR retrotransposons]

77 LINEs and SINEs  Highly successful elements in eukaryotes  SINEs are freeriders on the backs of LINEs – encode no proteins

78 Determining the age of transposable elements  For each family, a consensus sequence was built ===> the ancestral sequence  Compute the divergence (%) of each sequence from the ancestor  Convert sequence divergence to actual ages
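The three steps above can be sketched in a few lines. The slide does not say how divergence was converted to age; the sketch assumes a Jukes-Cantor correction for multiple substitutions and a constant substitution rate, and the rate used in the example is a placeholder, not a value from the source.

```python
import math

def p_distance(consensus, copy):
    """Fraction of aligned, non-gap positions at which the copy differs from the consensus."""
    if len(consensus) != len(copy):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(consensus, copy) if a != "-" and b != "-"]
    mismatches = sum(a != b for a, b in compared)
    return mismatches / len(compared)

def jukes_cantor(p):
    """Correct an observed divergence p for multiple hits (valid for p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)

def age_in_years(consensus, copy, rate_per_site_per_year):
    """Divergence from the ancestral (consensus) sequence divided by the substitution rate."""
    return jukes_cantor(p_distance(consensus, copy)) / rate_per_site_per_year

consensus_seq = "ACGTACGTAC"          # toy family consensus
copy_seq = "ACGTACGTAA"               # toy genomic copy, 1 mismatch in 10
ASSUMED_RATE = 2e-9                   # substitutions per site per year (placeholder)
print(p_distance(consensus_seq, copy_seq))                 # 0.1
print(age_in_years(consensus_seq, copy_seq, ASSUMED_RATE))
```

Because each copy is compared to its inferred ancestor rather than to another present-day copy, the divergence is divided by the rate once, not twice.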

79 Age of transposable elements  Most transposable elements date back to the emergence of placental mammals (low disposal rate of transposons)  DNA transposons in the human genome are dead (high divergence from ancestor)!

80 Where are the transposons located?
• LINEs: AT-rich regions (fewer genes)
• SINEs (MIR, Alu): GC-rich areas … ?? … they use the LINE machinery … ??

81 Why are there SINEs in GC-rich regions? 1. SINEs target GC rich regions 2. Evolutionary advantage for SINEs that ‘ land ’ in a GC-rich region  How do we resolve between the two options?

82 Age distribution of Alu’s in GC regions

83 SINEs in GC-rich regions 1. High rate of random loss in AT-rich regions 2. Negative selection against Alu in AT-rich 3. Positive selection (evolutionary advantage) for Alu in GC rich Comparison with LINEs Alus correlate with actively transcribed genes

84 Are Alus functional??  SINEs are transcribed under stress  SINE RNAs may bind a protein kinase  promote translation under stress Need to be in regions which are highly transcribed  Role in alternative splicing

85 Repeats in the human genome
• ~50% of the human genome (~1% coding):
1. Transposon derived (= interspersed repeats) (45% of the genome)
2. Retrotransposed cellular genes
3. Sequence repeats: (A)n, (CG)n, etc.
4. Segmental duplications

86 Segmental duplications
• 1077 segmental duplications detected
• Several genes in the duplicated regions are associated with diseases (may be related to homologous recombination)
• Most are recent duplications (conservation of the entire segment, versus conservation of coding sequences only)

88 Genome-wide studies

89 Sequenced genomes Assembled and annotated eukaryote genomes in Ensembl

91  481 segments > 200 bp absolutely conserved (100% identity) between human, rat and mouse
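Finding such elements amounts to scanning a multiple alignment for long runs of identical columns. Here is a minimal sketch, assuming the sequences are already aligned to equal length with "-" for gaps; the toy sequences and the short `min_len` are for illustration only (the slide uses more than 200 bp).

```python
def conserved_runs(aligned, min_len=200):
    """Half-open (start, end) column ranges where all sequences are identical and gap-free."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    runs = []
    start = None
    for i in range(length + 1):                 # one extra step closes a run at the end
        identical = (
            i < length
            and aligned[0][i] != "-"
            and all(s[i] == aligned[0][i] for s in aligned)
        )
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

# Toy alignment of three species
human = "ACGTACGTTTGACA"
mouse = "ACGTACGTATGACA"
rat   = "ACGTACGTTTGACC"
print(conserved_runs([human, mouse, rat], min_len=5))   # [(0, 8)]
print(conserved_runs([human, mouse, rat], min_len=4))   # [(0, 8), (9, 13)]
```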

92 Comparison with a neutral substitution rate
• Compare with the substitution rate in any 1 Mb region
• Probability of 10^-22 of obtaining 1 ultraconserved element (UE) by chance

93 481 UEs
• 111 UEs overlap a known mRNA: exonic UEs
• 256 show no overlap (non-exonic): 100 intronic, 156 intergenic
• 114 inconclusive

94 Who are the genes? Type 1: exonic Type 2: genes which are near non- exonic UEs

95  Type 1: enrichment for: - RNA binding and splicing regulation - RRM motif (RNA recognition)  Type 2: enrichment for: - Transcription regulation, DNA binding - DNA binding motifs

96 Intergenic UEs  Genes which flank intergenic UEs are enriched for early developmental genes  Are UEs distal enhancers of these genes?

97 Gene enhancer  A short region of DNA, usually quite distant from a gene (due to chromatin complex folding), which binds an activator  An activator recruits transcription factors to the gene

98 Experimental studies of UEs
• Some UEs cluster within regions enriched for genes encoding developmentally important transcription factors
• Within these loci, there is a special pattern of histone methylation (bivalent domains)
• These domains silence the developmental genes when unnecessary
• This suggests that the DNA pattern affects the histone methylation
Cell, Vol 125, 315-326, 21 April 2006

99 Experimental studies of UEs Tested 167 UEs (both mouse-human UEs and fish-human UEs) for enhancer activity: cloned before a reporter gene to test their activity 45% functioned as enhancers

100 A bioinformatic success  Ultraconservation can predict highly important function!

101 BUT …

102 PLoS Biol. 2007 Sep;5(9):e234 Chose 4 UEs which are near specific genes: genes which show a specific phenotype when knocked-out Performed complete deletion of these UEs … the mice were viable and did not show any different phenotype

103 Conclusions…  Ultraconservation can be indicative of important function  …  And sometimes not: - gene redundancy - long-range phenotypes - laboratories cannot mimic life
