Gene Families and Functional Annotation


1 Gene Families and Functional Annotation: Once genes have been identified they need to be functionally annotated. A computational first step is to group genes with other genes, some of which will hopefully have known functions. Once genes are classified, we can begin to examine whether certain genes are missing or overrepresented in the given genome, possibly reflecting the niche of the organism. As with earlier computational analyses, functional annotation based solely on in silico analyses is only a first step.

2 Gene Families and Functional Annotation: Sequence-similarity searches are a first pass in classification. BLAST = Basic Local Alignment Search Tool. BLASTn - nucleotide. BLASTp - protein. BLASTx - translates a nucleotide sequence in all six reading frames and scans these against a protein database. All give an Expectation (E) value to evaluate the significance of the match. In both eukaryotes and prokaryotes, 1/3 to 1/2 of searched genes do not match a protein; these are orphan genes.
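
As a rough illustration of how an E value relates to the strength of a match, the sketch below uses the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The function name and the numbers are illustrative assumptions, not BLAST output; real BLAST reports use effective (edge-corrected) lengths, which this sketch ignores.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Karlin-Altschul estimate of the E value: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against 1e8 database residues.
    for bits in (30, 50, 100):
        print(bits, expected_hits(bits, 300, 100_000_000))
```

The E value falls by half for every extra bit of score, and rises in proportion to the size of the database searched.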

3 Protein Structural Domains: Proteins are made up of combinations of distinct structural units, or domains. Genes can be grouped based on the domains they contain. These groupings depend on structural similarity - sequence similarity alone may be insufficient.

4 Gene clustering by sequence similarity: BLAST searches generally return matches to more than one protein from more than one species. This happens if the query protein is part of a gene (protein) family or contains multiple domains found in other proteins.

5 BLAST output can be interpreted as a match to one or more protein domains. Searches of closely related species often identify genes/proteins with similar domain structure. Domains shuffle over evolutionary time and are often found in different combinations across more distant comparisons. Domains do tend to follow biologically reasonable patterns - DNA binding domains with other DNA binding domains, transmembrane domains with intra- and extracellular domains.

6 Gene clustering by sequence similarity: Genes can be classified by domain content. The Enzyme Commission (EC) hierarchical classification of enzymes - each enzyme is assigned a number that reflects sub-classification of function, e.g. ADH is EC. Other classification schemes are not as obvious - protein function is often context-specific. Pfam - protein database that allows access to biochemical properties of predicted proteins.

7 Gene clustering by sequence similarity: InterPro - classifies individual protein domains.

8 Gene clustering by sequence similarity: Protein functional prediction ≠ assignment of genes to families. Protein function prediction allows general conclusions about protein function and genome content based on protein domains. Classification of gene families involves distinguishing between paralogs and orthologs.

9 Major Classes of Protein Function: Enzymes; Signal transduction (receptors and kinases); Nucleic acid binding (transcription factors, nucleic acid enzymes); Structural (cytoskeletal, extracellular matrix, motor proteins); Channel (voltage and chemically gated); Immunoglobulins; Calcium-binding proteins; Transporters. Subclasses vary, as does their representation within each genome.

10 Gene Clusters: Alignment searches (BLAST) identify genes with sequence similar to the query. If searches identify a single gene, or genes with a single function, then functional assignment to the query sequence is simple - but searches often identify large numbers of sequences with multiple functions. The most similar sequence is not necessarily the sequence with which the query shares a function.

11 Gene Clusters: One approach is to try to define as large a protein family as possible (including many possible functions). PSI-BLAST can be used to identify a large set of potential protein family members. A BLAST search is conducted to create an initial protein sequence alignment, which is then used to initiate a fresh search. The process is iterated until no further matches are identified - this reduces the degree of sequence similarity required for inclusion in the family. A “true” family of genes ought to be bounded by a significance cut-off to limit the proteins included.

12 Gene Clusters: Clusters of orthologous genes (COGs) can be used to classify proteins. COGs are created by identifying the best hit for each gene in complete pairwise comparisons across a set of genomes.

13 Gene Clusters: 185,000 proteins from 66 microbial genomes identified 4,873 COGs - 75% of all predicted microbial proteins. 50% of 110,00 proteins from fly, nematode, human, Arabidopsis, yeasts and a microsporidian form 4,852 COGs. [Figure: COG0837]

14 Gene Clusters: COGs include both orthologs and paralogs. In (a), HuA and HuA’ are paralogs - distinguishing which retains the ancestral function is not as simple as determining which has the most similar sequence.

15 Gene Clusters: HuA and MmA differ in 5 amino acids, none of which affect function. HuA’ and MmA differ in 4 amino acids, but one of these changes the charge of a critical residue. Clustering based on similarity would lead to erroneous functional classification.

16 Gene Phylogenies: Clustering groups genes by sequence similarity. Phylogenetic analyses ascertain how groups of similar genes are related by descent. In the HuA, MmA example, the 2 A’ genes can result from either one (orthologs) or two (paralogs) duplication events. Paralogs are less likely to share a function.

17 Gene Phylogenies: Often gene function can be inferred from phylogenetic analysis. The first step is aligning the sequences. A gene tree is then constructed using some algorithm. Duplications and gene relatedness are then ascertained. In the example on the left, an ancient duplication splits 2 functional groups; on the right, protein 2 likely has the same function as 5 and 6.

18 Gene Ontology: Molecular function alone may not predict/describe biological function (think crystallins). The Gene Ontology (GO) annotates and groups genes using a multi-character approach including cell biological and molecular function and/or subcellular localization. The GO project uses a defined vocabulary and a hierarchical structure to classify genes, and includes links indicating the type of evidence for the classification.

19 GO network: In this example, the gene INNER NO OUTER is at the center with the 3 separate classifications radiating out from it.

20 Gene Ontology: The GO vocabulary includes 7000 terms describing molecular function and 5000 describing biological process; some annotations include as many as 12 levels within the hierarchy of terms. This is too deep for efficient computational searches - other simplified systems are also being developed to allow genes to be computationally screened and classified.

21 Molecular Phylogenetics: Homology = similarity due to common ancestry. The Gpdh gene sequences from two different species are homologous sequences. All comparisons made in molecular evolution (biology) are based on comparing homologous sequences = apples to apples. Sequences must be aligned to allow comparison = homologous bases lined up in columns. The cow and sheep β globin proteins are 2 amino acids shorter than the other sequences, so gaps are added to align the sequences.
Human  MVHLTP
Baboon MVHLTP
Cow    MLTP
Sheep  MLTP
Mouse  MVHLTP
[Figure: the same five sequences after alignment, with gaps added to the cow and sheep rows]

22 Gene Trees: Accumulation of sequence differences through time is the basis of molecular systematics, which analyses them in order to infer evolutionary relationships. A gene tree is a diagram of the inferred ancestral history of a group of sequences. A gene tree is only an estimate of the true pattern of evolutionary relations. UPGMA and Neighbor joining = simple ways to estimate a gene tree. Bootstrapping = sampling with replacement, a common technique for assessing the reliability of a node in a gene tree. Taxon = the source of each sequence.

23 Rooted and Unrooted Trees: Analysis of a set of genes produces an unrooted tree. Trees can be rooted (assigned polarity) by assignment of an outgroup - a sequence that is known to be more distantly related than any within the rest of the analysis (the ingroup). In some tree building methods, branch length denotes the amount of change along that branch. [Figure: 3 distinct unrooted trees]

24 Tree Building methods: The 3 primary methods (algorithms) for building gene trees are: 1. Parsimony - a character-based approach that surveys every possible tree topology. The most parsimonious tree is the topology that requires the minimum number of steps (changes) in a data set. Position 1 of this example - tree 1 requires 1 change, tree 2 requires 2 changes and tree 3 requires 2 changes. When the 4 positions are summed, tree 3 is found to be the best (shortest).
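
Counting the changes a topology requires can be sketched with Fitch's algorithm, one standard way to score a single alignment column on a given tree. The taxa (w, x, y, z), the three-site alignment and the function names below are hypothetical and are not the example from the slide; trees are written as nested pairs, and the score does not depend on where the root is placed.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one column.

    tree: a taxon name (leaf) or a 2-tuple of subtrees.
    states: dict mapping taxon name -> character at this position.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch change counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical data: two sites group w with x, one site groups w with y.
alignment = {"w": "AGA", "x": "AGC", "y": "CTA", "z": "CTC"}
topologies = {
    "tree 1": (("w", "x"), ("y", "z")),
    "tree 2": (("w", "y"), ("x", "z")),
    "tree 3": (("w", "z"), ("x", "y")),
}
if __name__ == "__main__":
    for name, tree in topologies.items():
        print(name, parsimony_length(tree, alignment))
```

With only four taxa all three topologies can be scored directly; for larger data sets the number of topologies grows too fast for an exhaustive survey, and heuristic searches are used instead.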

25 Tree Building methods: The 3 primary methods (algorithms) for building gene trees are: 2. Maximum Likelihood - also a character-based approach; it surveys every possible tree topology and assigns each topology a maximum likelihood estimate (score) based on a model of evolution describing the probability of changes (mutations) through time. The ML tree is the one with the highest likelihood. This method can be accurate, but is computationally expensive.
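
The idea of scoring data under a model of evolution can be shown on the smallest possible case: two aligned sequences and one branch length. The sketch below assumes the Jukes-Cantor (JC69) model, which the slides do not name, and uses hypothetical sequences and function names. It is not a tree search, only the likelihood calculation that such a search repeats many times.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences d substitutions/site apart (JC69)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site keeps its nucleotide
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other nucleotide
    log_l = 0.0
    for a, b in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the nucleotide in seq_a
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


def jc69_ml_distance(seq_a, seq_b):
    """Closed-form ML distance under JC69; requires fewer than 75% differing sites."""
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical sequences differing at 2 of 10 sites.
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTACGTTT"
d_hat = jc69_ml_distance(seq_1, seq_2)
if __name__ == "__main__":
    for d in (d_hat / 2, d_hat, d_hat * 2):
        print(round(d, 4), round(jc69_log_likelihood(seq_1, seq_2, d), 4))
```

The log-likelihood peaks at the closed-form estimate and drops on either side of it. A full ML analysis optimizes every branch length this way for every topology it considers, which is where the computational cost comes from.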

26 Tree Building methods: The 3 primary methods (algorithms) for building gene trees are: 3. Distance Methods - these are not character based; instead they calculate pairwise distances across entire aligned sequences and construct distance matrices. Trees are built by grouping pairs with the shortest distances between them. These methods can also incorporate complex evolutionary models. They are computationally cheap and will always return an answer, but are not always accurate. The simplest distance method, Unweighted Pair Group Method with Arithmetic Mean (UPGMA), simply counts the number of sequence changes in all pairwise comparisons.
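
A minimal sketch of the first step of a distance method: counting differences in every pairwise comparison of an alignment. The sequences and the function name are hypothetical, and a gap character is simply counted as a difference here, which is an assumption rather than a rule from the slides.

```python
from itertools import combinations


def difference_counts(alignment):
    """Number of differing columns for every pair of aligned sequences."""
    return {
        (a, b): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for a, b in combinations(sorted(alignment), 2)
    }


# Hypothetical aligned sequences of equal length.
toy_alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
if __name__ == "__main__":
    for pair, count in difference_counts(toy_alignment).items():
        print(pair, count)
```

Raw counts underestimate divergence when the same site has changed more than once, which is why distance methods often apply a model-based correction before building the tree.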

27 UPGMA Trees

28 UPGMA Tree Construction: [Figure: pairwise distance matrix for Hu, Ba, Co, Sh, Mo, Ha and Ch, with the partial tree alongside] Hu and Ba are joined at 2/2 = 1.0; Co and Sh are joined at 3/2.

29 UPGMA Tree Construction: [Figure: distance matrix after merging HuBa and CoSh] Mo and Ha are joined at 7/2 = 3.5.

30 UPGMA Tree Construction: [Figure: distance matrix after merging MoHa] HuBa and CoSh are joined at 8/2 = 4.

31 UPGMA Tree Construction: [Figure: distance matrix with clusters ((HuBa)(CoSh)), MoHa and Ch] ((HuBa)(CoSh)) is joined to MoHa.

32 UPGMA Tree Construction: [Figure: distance matrix with clusters ((HuBa)(CoSh))(MoHa) and Ch] Ch is joined last.

33 Final UPGMA Tree: [Figure: final UPGMA tree, tips labelled a to g]
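
The construction steps can be sketched in a few lines. The function below is a minimal UPGMA under stated assumptions: the four-taxon distance table is hypothetical (Hu-Ba = 2 and Co-Sh = 3 match the join heights above, while the cross distances of 8 are an assumption chosen so that the two pairs join at 8/2 = 4), and the nested-tuple tree representation is just one convenient choice.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist: dict mapping (a, b) -> pairwise distance, one entry per pair.
    Returns (tree, heights): a nested-tuple tree and the height of every node.
    """
    sizes = {n: 1 for n in names}        # cluster -> number of leaves it holds
    heights = {n: 0.0 for n in names}    # cluster -> height of its node
    d = {frozenset(k): v for k, v in dist.items()}
    while len(sizes) > 1:
        pair = min(d, key=d.get)         # closest pair of clusters
        a, b = sorted(pair, key=str)
        merged = (a, b)
        heights[merged] = d[pair] / 2.0  # new node sits at half the distance
        for c in sizes:
            if c not in pair:
                # size-weighted mean = arithmetic mean over all leaf pairs
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        # discard distances that still refer to the two merged clusters
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)), heights


# Hypothetical distances (see the assumptions in the lead-in).
toy_dist = {
    ("Hu", "Ba"): 2, ("Co", "Sh"): 3,
    ("Hu", "Co"): 8, ("Hu", "Sh"): 8, ("Ba", "Co"): 8, ("Ba", "Sh"): 8,
}
tree, heights = upgma(["Hu", "Ba", "Co", "Sh"], toy_dist)
if __name__ == "__main__":
    print(tree)
    print(heights[tree])
```

Because every node is placed at half the distance between the clusters it joins, UPGMA implicitly assumes a constant rate of change along all lineages.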

34 Phylogenetic Trees: Phylogenetic trees are representations summarizing a reconstructed evolutionary history. A phylogenetic tree is a diagram that proposes a hypothesis for reconstructed evolutionary relationships between a set of objects (taxa or OTUs). Phylogenetic trees can represent relationships between species or genes.

35 Phylogenetic Trees: OTUs are connected by a set of lines - branches or edges. External nodes or leaves are existing OTUs or extinct objects that did not give rise to descendants. Internal nodes represent ancestral states hypothesized to have occurred during evolution.

36 Internal nodes can represent speciation or gene duplication events. A gene tree does not necessarily coincide with a species tree. Gene duplications will cause a gene tree to differ from a species tree. [Figure: tree of Human, Monkey, Rat, Mouse, Sturgeon, Chicken, Zebrafish, Platy, Lamprey and Hagfish]

37 Resolution: Trees may be fully or only partially resolved. Every node in a fully resolved tree is bifurcating or dichotomous. Some nodes in unresolved trees are multifurcating or polytomous. [Figure: two trees of Human, Monkey, Rat, Mouse, Sturgeon, Chicken, Zebrafish, Platy, Lamprey and Hagfish]

38 Rooting: Unrooted trees establish the relationships among taxa, but not the evolutionary pathway. For 4 taxa there are 3 unrooted trees, but 15 rooted trees. [Figure: trees of Human, Monkey, Rat, Mouse and Chicken]

39 Rooting (continued): [Figure: further trees of Human, Monkey, Rat and Mouse, with Chicken]
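
The counts of 3 and 15 follow from the standard formulas for fully resolved trees: (2n - 5)!! unrooted and (2n - 3)!! rooted topologies for n taxa, where !! is the double factorial. A short sketch, with function names chosen for illustration:

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_tree_count(n_taxa):
    """Number of fully resolved unrooted topologies (n_taxa >= 3)."""
    return double_factorial(2 * n_taxa - 5)


def rooted_tree_count(n_taxa):
    """Number of fully resolved rooted topologies (n_taxa >= 2)."""
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (4, 5, 10):
        print(n, unrooted_tree_count(n), rooted_tree_count(n))
```

Each unrooted tree of 4 taxa has 5 branches on which a root can be placed, which is why 3 unrooted trees give 15 rooted ones.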

40 Types of Trees: Cladograms show the genealogy of taxa, but do not include timing or divergence (branch lengths have no meaning). [Figure: two cladograms of Human, Monkey, Rat and Mouse]

41 Types of Trees: Additive trees show the genealogy of taxa, and branch lengths represent divergence between taxa. Comparison of branch lengths gives a meaningful estimate of evolutionary divergence. [Figure: additive tree of Human, Monkey, Rat and Mouse]

42 Types of Trees: Ultrametric trees are similar to additive trees, but assume a constant rate of change in the characters used to build the tree - a molecular clock. Comparison of branch lengths gives a meaningful estimate of evolutionary divergence. Ultrametric trees are always rooted. [Figure: ultrametric tree of Human, Monkey, Rat and Mouse with a time axis]

43 Outgroups: The most accurate way to root a tree is to use an “outgroup”, a taxon or group of taxa more distantly related than any member of the “ingroup”. [Figure: tree of Human, Monkey, Rat and Mouse rooted with Chicken, with a time axis]

44 Representing Phylogenies: Phylogenetic relationships can be represented as graphical trees, tables or parenthetical statements (Newick or New Hampshire format):
((raccoon, bear),((sea_lion, seal), ((monkey,cat), weasel)), dog);
((raccoon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08, ((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);
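
To show how the parenthetical notation maps onto a tree, here is a minimal recursive parser. It is a sketch, not a full Newick implementation: it assumes the string ends with ";", has unquoted names and no comments, and the (name, length, children) tuple layout is an arbitrary choice for illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split())  # whitespace carries no meaning here
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":          # internal node: parse each child
            pos += 1
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                  # skip the closing ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]        # empty for unnamed internal nodes
        length = None
        if text[pos] == ":":          # optional branch length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return node()


def leaf_names(tree):
    """List the tip labels from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


newick = ("((raccoon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08, "
          "((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);")
parsed = parse_newick(newick)
if __name__ == "__main__":
    print(leaf_names(parsed))
```

Each pair of parentheses is one internal node, and the number after a colon is the length of the branch leading to that node or tip.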

45 Bootstrapping: Many tree building algorithms will give a single, fully resolved tree from any data set. Nodes will all be equally represented even if one is supported by many characters and another by very few. How can we quantify support for any given tree? We can’t re-run evolution. We can sample many different genes and we can bootstrap our data. Bootstrapping is sampling a data set, with replacement, to generate a new data set. We then use this new set in a phylogenetic analysis, and repeat this process hundreds or thousands of times. We can then present bootstrap scores at each node: the % of bootstrap trees that contained that specific node.

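46 Bootstrapping: an alignment of 4 sequences, followed by three data sets built by resampling its columns with replacement.
Original alignment:
1- G A D D Y T T K L P
2- G V E D Y T T K - P
3- G A D D Y T T R L P
4- C V E D Y T T R - P
Resampled data set 1:
1- T K L L T P D A D G
2- T K - - T P E V D G
3- T R L L T P D A D G
4- T R - - T P E V D C
Resampled data set 2:
1- G P K D K K T P D P
2- G P K D K K T P E P
3- G P R D R R T P D P
4- C P R D R R T P E P
Resampled data set 3:
1- L P Y D A D D P T G
2- - P Y E V D E P T G
3- L P Y D A D D P T G
4- - P Y E V D E P T C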
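
Resampling columns with replacement takes only a few lines. The sketch below uses the four sequences from the slide; the function name and the fixed random seed are illustrative choices, and a real analysis would rebuild a tree from each replicate.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement; each column stays intact."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


original = {
    "1": "GADDYTTKLP",
    "2": "GVEDYTTK-P",
    "3": "GADDYTTRLP",
    "4": "CVEDYTTR-P",
}
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
replicate = bootstrap_columns(original, rng)
if __name__ == "__main__":
    for name, seq in replicate.items():
        print(name, seq)
```

The same column indices are applied to every sequence, so each resampled column is a whole column of the original alignment; some columns appear several times and others not at all.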

48 Bootstrapping and Condensed Trees: In this example, bear and raccoon form a pair in 50% of the data sets. We can choose to present a tree that condenses branches with less than some threshold bootstrap support - a condensed tree.

49 Consensus Trees: Some tree building methods will produce multiple equally “good” trees. A consensus tree shows the features that are shared by all or some trees. A strict consensus tree only includes features found in all trees. A majority-rule consensus tree includes features found in at least a set % of the trees.
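
One way to make "features shared by trees" concrete is to treat each rooted tree as a set of clades (groups of tips below a node) and count how often each clade appears. The three input trees, the taxon names and the function names below are hypothetical; for unrooted trees, bipartitions would be counted instead of clades.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, set of clades) for a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found


def consensus_clades(trees, min_fraction):
    """Clades present in at least min_fraction of the trees.

    min_fraction = 1.0 gives the strict consensus; a value above 0.5
    gives a majority-rule consensus whose clades cannot conflict.
    """
    counts = Counter(c for t in trees for c in clades(t)[1])
    return {c for c, n in counts.items() if n / len(trees) >= min_fraction}


# Three hypothetical equally good trees for taxa A-D.
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]
strict = consensus_clades(trees, 1.0)
majority = consensus_clades(trees, 0.6)
if __name__ == "__main__":
    print(sorted("".join(sorted(c)) for c in strict))
    print(sorted("".join(sorted(c)) for c in majority))
```

Here only the clade containing all four taxa survives the strict consensus, so that tree is unresolved, while the majority-rule set keeps the A+B and A+B+C groupings found in two of the three trees.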

50 Reconciled Trees: Reconciled trees attempt to combine gene trees and species trees, clearly identifying both speciation and duplication events. [Figure: tree showing duplications; species tree]

51 Reconciled Trees: Reconciled trees attempt to combine gene trees and species trees, clearly identifying both speciation and duplication events. [Figure: species tree indicating locations of duplication events; tree showing information on speciation, duplication and gene loss]

52 Analogous Genes: Not all proteins with similar function have a common evolutionary history. Nonhomologous genes can evolve similar functions through convergent evolution. Sequence and structural similarity, outside of functional sites, are expected to be low - here, the catalytic residues and overall structure of chymotrypsin (yellow) and subtilisin (green) = analogous enzymes.

53 Homoplasy: Sequence similarity not due to homology is homoplasy. Homoplasy can result from convergent evolution, parallel evolution or evolutionary reversal.
1- G A D D Y T T K L P
2- G V E D Y T T K - P
3- G A D D Y T T R L P
4- C V E D Y T T R - P

54 HGT or LGT: Transfer of genes from one species to another, horizontal gene transfer (HGT) or lateral gene transfer (LGT), will confuse phylogenetic analysis - it results in a tangled tree with branches that join. HGT is more common in bacteria and archaea, but is also found in eukaryotes.

55 Xenologous Genes: After transfer, the gene in the donor and recipient species will be very similar - xenologous genes. Phylogenetic analysis of these sequences will indicate that the recipient is more closely related to the donor than it truly is. Here, 80% sequence identity between a eukaryotic gene and its likely bacterial source. The outgroup consists of members of the same gene superfamily; within the ingroup, all sequences are bacterial except Trichomonas vaginalis, a parasitic protozoan.

56 Orthologous genes: The number of orthologous, homologous and unique genes in human, chicken and puffer fish genomes - BLAST analysis. Core orthologs = single copy orthologs in dark blue; genes present in all 3 but duplicated in at least 1 are in lighter blue. Pairwise orthologs = orthologs found in only 2 species.

57 Orthologous genes: The number of orthologous, homologous and unique genes in human, chicken and puffer fish genomes. Homologous genes for which orthology/paralogy cannot be determined are in yellow. Unique genes are in gray.

58 Duplication within genes: Duplication within a gene can result in complex proteins with repeated domains. These may be identifiable on a dot-plot. Here BRCA2 is plotted against itself; repeats are visible with window analysis.
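
A windowed dot-plot can be sketched as follows. The sequence is a made-up peptide containing one internal repeat, not BRCA2, and the function name, window size and match threshold are illustrative assumptions.

```python
def dot_plot(seq_x, seq_y, window=4, min_matches=4):
    """Return (i, j) where windows starting at i and j match at >= min_matches positions."""
    hits = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


# Hypothetical peptide: "MKTLLA" occurs at positions 0 and 9.
peptide = "MKTLLAGGSMKTLLAPP"
hits = dot_plot(peptide, peptide)
off_diagonal = sorted(h for h in hits if h[0] != h[1])
if __name__ == "__main__":
    print(off_diagonal)
```

Plotting a sequence against itself always gives the main diagonal; an internal repeat shows up as a shorter run of dots parallel to it, offset by the distance between the two copies. Requiring several matches within a window suppresses the scattered single-residue matches that would otherwise fill the plot.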

59 Synteny: With complete genome sequences, we can compare entire genomes to identify equivalent regions and orthologous genes - syntenic regions - except that large scale rearrangements are common. Genes are lost and duplicated, and inverted or moved between chromosomes. The local genomic environment tends to be similar between orthologs, but the large-scale structures differ.

60 Comparative Genomics: Synteny is inversely correlated with time since the last common ancestor. Of 500 zebrafish genes, 50-80% occur in conserved homology segments: 2 or more genes in the same order as in humans. Approx. 1/2 of the chromosomes retain ~ complete synteny between cats and humans.

61 Orthologs and Paralogs: Orthologs must be distinguished from paralogs for phylogenetic reconstruction and assignment of possible function. Pseudogenes must be distinguished from both.

62 Orthologs, Paralogs and Gene Loss: Gene loss can eliminate orthologs from two species - this is especially difficult with large (similar) gene families. Gene trees ≠ species trees, but multiple genes may. A, B, C, D are species; α and β are paralogs. [Figure: evolutionary history; incorrect species tree based on gene tree]

63 Clustering of Orthologs and Paralogs: BLAST can be used to identify orthologs and paralogs between 2 genomes. Mask low complexity and commonly occurring domains. All gene sequences from one genome are then scanned against another, noting best-scoring BLAST hits (BeTs) - repeat for all possible pairs of genomes. Paralogous genes resulting from a duplication since the divergence between two species will be each other’s BeTs. Orthologs form groups from different genomes with reciprocal BeTs.
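
The reciprocal best-hit step can be sketched with plain dictionaries. The gene names and scores are hypothetical stand-ins for BLAST results between two genomes, and the function names are chosen for illustration only.

```python
def best_hits(scores):
    """scores: dict (query, subject) -> score. Return query -> best-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = subject
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward, backward = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores for genes a1, a2 (genome A) and b1, b2 (genome B).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b1"): 120, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 120, ("b2", "a1"): 150, ("b2", "a2"): 90}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
if __name__ == "__main__":
    print(pairs)
```

In this toy case a2's best hit is b1, but b1's best hit is a1, so only a1 and b1 are reported as a reciprocal pair; a2 and b2 would need evidence from further genomes before being assigned to a group.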

64 Clustering of Orthologs and Paralogs: Clusters of Orthologous Groups (COG) and euKaryotic Orthologous Groups (KOG) databases have been constructed to identify large numbers of orthologs. Here all 3 genes from 3 different genomes are each other’s BeT in pairwise comparisons between the three genomes. Members of COGs or KOGs are assumed to have related functions. This type of analysis is an alternative to exhaustive phylogenetic trees for large data sets (number of species or genes).

65 Clustering of Orthologs and Paralogs: This method identifies orthologs and paralogs in this case. With a sufficient number of genomes, 2 COGs will form, one associated with the α part and the other with the β part of the tree.

66 Clustering of Orthologs and Paralogs: Gene loss can still be problematic. Comparison of only species A and B would incorrectly group α and β genes.

