Lecture 2: Introduction to Computational Biology


1 Lecture 2: Introduction to Computational Biology
Alexei Drummond

2 Outline
Sequences and sequence databases
Similarity and homology
Sequence alignment
Dot plots
Database searches for similar sequences
CS

3 Sequence
Definition: A sequence S is an ordered set of n characters (s_i) representing nucleotides or amino acids. S = {s_1, s_2, …, s_(n-1), s_n}
DNA is composed of four nucleotides or bases: s_i ∈ {A, C, G, T}
RNA is composed of four nucleotides: s_i ∈ {A, C, G, U} (T is transcribed as U)
Proteins are composed of twenty amino acids

4 Biomolecular sequences
DNA 5’-ACGATCGACTGGTATATCGATGCT-3’
RNA 5’-ACGAUCGACUGGUAUAUCGAUGCU-3’
Protein MFINRWLFSTNHKDIGTLYLLFGAW
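The relationship between the DNA and RNA strings on this slide can be checked with a few lines of Python. This is a minimal sketch: the names `DNA_ALPHABET`, `RNA_ALPHABET`, `is_valid` and `transcribe` are illustrative and do not come from the lecture, and transcription is simplified to replacing T with U on the coding strand.

```python
# Alphabets from the sequence definition (illustrative constant names).
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def is_valid(sequence, alphabet):
    """Return True if every character of the sequence belongs to the alphabet."""
    return all(ch in alphabet for ch in sequence.upper())


def transcribe(dna):
    """Simplified transcription: rewrite the coding strand with U in place of T."""
    if not is_valid(dna, DNA_ALPHABET):
        raise ValueError("not a DNA sequence")
    return dna.upper().replace("T", "U")


dna = "ACGATCGACTGGTATATCGATGCT"
rna = transcribe(dna)
print(rna)
```

Applied to the DNA string on the slide, this reproduces the RNA string shown beneath it.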

5 What is a gene?
[Figure: double-stranded DNA (5’ to 3’ and 3’ to 5’) with intergenic DNA, a start codon, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, a stop codon and splice sites labelled.]
Both the exons and introns are transcribed: primary RNA transcript (5’ to 3’)
The introns are removed: messenger RNA (mRNA)
Translated to protein
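The splicing step, in which introns are removed from the primary transcript, can be sketched as joining the exon intervals. Everything below is a toy: the exon and intron strings and the `splice` helper are invented for illustration, and real splice-site recognition is not modelled.

```python
def splice(primary_transcript, exons):
    """Join the exon intervals (0-based, end-exclusive) and drop everything else."""
    return "".join(primary_transcript[start:end] for start, end in exons)


# Toy primary transcript: exon 1, intron 1, exon 2, intron 2, exon 3 (all hypothetical).
parts = ["AUGGCC", "GUAAGUCCAG", "UUUAAA", "GUACAG", "GGGUGA"]
primary = "".join(parts)

# Work out the exon coordinates from the part lengths (exons are parts 0, 2 and 4).
exons, position = [], 0
for index, part in enumerate(parts):
    if index % 2 == 0:
        exons.append((position, position + len(part)))
    position += len(part)

mrna = splice(primary, exons)
print(mrna)
```

Keeping exons as coordinate pairs mirrors how feature tables describe spliced genes as joined intervals on the underlying sequence.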

6 Eukaryotes versus Prokaryotes
Note: There is no cellular biology in the exam!
Eukaryotes: Plants, animals and fungi. Larger cells, often multicellular. Well defined nucleus, and specialized organelles. Introns. Lots of intergenic DNA. 100 Mb - 100 Gb genomes.
Prokaryotes: Bacteria and Archaea. Small. No nucleus. No introns. Not much intergenic DNA. Typically 1-10 Mb genomes.
Graphics from MIT:

7 Sequence databases
Where do biologists store their data?
Databases: public, private, proprietary; general, specialist
Hard drive: chromatograms/electropherograms; flat file sequence formats (Fasta, Genbank et cetera); flat file alignment formats (Nexus, ClustalX, GCG et cetera)

8 CS

9 NCBI Nucleotide database

10 Searching by accession number

11 Genbank record CS

12 Genbank headers
LOCUS       X00166   711 bp   DNA   linear   PHG   10-FEB-1999
DEFINITION  Bacteriophage lambda cI gene encoding the repressor protein for transcriptional control of tetracycline resistance on plasmid pTR 262.
ACCESSION   X00166
VERSION     X GI:15056
KEYWORDS    repressor; tetracycline resistance.
SOURCE      Enterobacteria phage lambda
  ORGANISM  Enterobacteria phage lambda
            Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae; Lambda-like viruses.
REFERENCE   1 (bases 1 to 711)
  AUTHORS   Nilsson,B., Uhlen,M., Josephson,S., Gatenbeck,S. and Philipson,L.
  TITLE     An improved positive selection plasmid vector constructed by oligonucleotide mediated mutagenesis
  JOURNAL   Nucleic Acids Res. 11 (22), (1983)
  PUBMED

13 Genbank feature table
FEATURES             Location/Qualifiers
     source          1..711
                     /organism="Enterobacteria phage lambda"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10710"
     CDS             >711
                     /note="unnamed protein product; coding sequence cI gene"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="CAA "
                     /db_xref="GI:15057"
                     /db_xref="GOA:P03034"
                     /db_xref="InterPro:IPR001387"
                     /db_xref="InterPro:IPR006198"
                     /db_xref="InterPro:IPR010982"
                     /db_xref="InterPro:IPR011056"
                     /db_xref="PDB:1F39"
                     /db_xref="PDB:1GFX"
                     /db_xref="PDB:1J5G"
                     /db_xref="PDB:1LLI"
                     /db_xref="PDB:1LMB"
                     /db_xref="PDB:1LRP"

14 Genbank sequence
ORIGIN
        1 atgagcacaa aaaagaaacc attaacacaa gagcagcttg aggacgcacg tcgccttaaa
       61 gcaatttatg aaaaaaagaa aaatgaactt ggcttatccc aggaatctgt cgcagacaag
      121 atggggatgg ggcagtcagg cgttggtgct ttatttaatg gcatcaatgc attaaatgct
      181 tataacgccg cattgcttgc aaaaattctc aaagttagcg ttgaagaatt tagcccttca
      241 atcgccagag aaatctacga gatgtatgaa gcggttagta tgcagccgtc acttagaagt
      301 gagtatgagt accctgtttt ttctcatgtt caggcaggga tgttctcacc tgagcttaga
      361 acctttacca aaggtgatgc ggagagatgg gtaagcacaa ccaaaaaagc cagtgattct
      421 gcattctggc ttgaggttga aggtaattcc atgaccgcac caacaggctc caagccaagc
      481 tttcctgacg gaatgttaat tctcgttgac cctgagcagg ctgttgagcc aggtgatttc
      541 tgcatagcca gacttggggg tgatgagttt accttcaaga aactgatcag ggatagcggt
      601 caggtgtttt tacaaccact aaacccacag tacccaatga tcccatgcaa tgagagttgt
      661 tccgttgtgg ggaaagttat cgctagtcag tggcctgaag agacgtttgg c
//
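The sequence block of a Genbank record mixes position numbers, spaces and bases. A minimal sketch of recovering the plain sequence is shown below; the helper name `origin_to_sequence` is hypothetical, and it assumes the text contains one `ORIGIN` keyword followed by a `//` terminator, as in the record above.

```python
import re


def origin_to_sequence(record_text):
    """Extract the bases between ORIGIN and // and return them as one uppercase string."""
    _, _, body = record_text.partition("ORIGIN")
    body, _, _ = body.partition("//")
    # Position numbers and whitespace are not part of the sequence.
    return re.sub(r"[^A-Za-z]", "", body).upper()


# First thirty bases of the record shown on the slide, laid out Genbank-style.
example = """ORIGIN
        1 atgagcacaa aaaagaaacc
       21 attaacacaa
//
"""
print(origin_to_sequence(example))
```

For real work a maintained parser is preferable, since complete records contain many more sections than this sketch handles.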

15 Fasta format
>gi|15056|emb|X | Bacteriophage lambda cI gene encoding the…
ATGAGCACAAAAAAGAAACCATTAACACAAGAGCAGCTTGAGGACGCACGTCGCCTTAAAGCAATTTATG
AAAAAAAGAAAAATGAACTTGGCTTATCCCAGGAATCTGTCGCAGACAAGATGGGGATGGGGCAGTCAGG
CGTTGGTGCTTTATTTAATGGCATCAATGCATTAAATGCTTATAACGCCGCATTGCTTGCAAAAATTCTC
AAAGTTAGCGTTGAAGAATTTAGCCCTTCAATCGCCAGAGAAATCTACGAGATGTATGAAGCGGTTAGTA
TGCAGCCGTCACTTAGAAGTGAGTATGAGTACCCTGTTTTTTCTCATGTTCAGGCAGGGATGTTCTCACC
TGAGCTTAGAACCTTTACCAAAGGTGATGCGGAGAGATGGGTAAGCACAACCAAAAAAGCCAGTGATTCT
GCATTCTGGCTTGAGGTTGAAGGTAATTCCATGACCGCACCAACAGGCTCCAAGCCAAGCTTTCCTGACG
GAATGTTAATTCTCGTTGACCCTGAGCAGGCTGTTGAGCCAGGTGATTTCTGCATAGCCAGACTTGGGGG
TGATGAGTTTACCTTCAAGAAACTGATCAGGGATAGCGGTCAGGTGTTTTTACAACCACTAAACCCACAG
TACCCAATGATCCCATGCAATGAGAGTTGTTCCGTTGTGGGGAAAGTTATCGCTAGTCAGTGGCCTGAAG
AGACGTTTGGC
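Fasta is simple enough to read by hand: a line starting with `>` opens a record, and the following lines are its sequence. The sketch below assumes exactly that layout; `parse_fasta` and the record names in the example are illustrative, not part of the lecture.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are wrapped for display; join them back together.
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Two made-up records; the first reuses the opening bases of the slide's sequence.
example = ">seq1 demo\nATGAGC\nACAAAA\n>seq2\nGGCC\n"
print(parse_fasta(example))
```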

16 Hepatitis C sequence database
Specialist databases usually refer to sequences in the public databases, but have extra information and search criteria specific to the domain. CS

17 Hepatitis C sequence database

18 Problem 1: detecting sequence similarity between two sequences
Biologists often want to detect if two sequences are similar.
How is sequence similarity defined?
What is it used for?
Are there different types of similarity?

19 How is sequence similarity defined?
The number of matching nucleotides (when aligned)?
The amount of shared information?
The “distance” between the two sequences under some metric?
[Figure: an alignment in which 38 out of 60 sites are identical.]

20 How is sequence similarity defined?
A1 is 42 nucleotides long.
A2 is 60 nucleotides long.
So 38/42 = 90% of A1 is “explained” by A2,
whereas 38/60 = 63% of A2 is “explained” by A1.
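The asymmetry in these percentages comes from dividing the same count of identical sites by two different ungapped lengths. A small sketch follows; the function name `identity` is made up, `-` is assumed to be the gap character, and the demonstration uses a short toy alignment because the 60-site alignment from the slide is not reproduced in the text.

```python
def identity(a1, a2, gap="-"):
    """Count identical aligned sites and express them as a fraction of each sequence."""
    if len(a1) != len(a2):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for x, y in zip(a1, a2) if x == y and x != gap)
    length1 = len(a1) - a1.count(gap)  # ungapped length of the first sequence
    length2 = len(a2) - a2.count(gap)  # ungapped length of the second sequence
    return identical, identical / length1, identical / length2


# Toy alignment of "acggt" and "aggctt".
matches, fraction1, fraction2 = identity("acgg-t-", "a-ggctt")
print(matches, round(100 * fraction1), round(100 * fraction2))

# The slide's arithmetic: 38 identical sites against lengths 42 and 60.
print(round(100 * 38 / 42), round(100 * 38 / 60))
```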

21 What is similarity used for?
Detecting homology (shared evolutionary history).
Reconstructing evolutionary history to better understand biology.
Determining the structure and function of new sequences, by matching them with sequences of known structure/function.
Grouping sequences together to increase the statistical power of single-sequence analyses.
Many, many more uses…

22 Are there different types of similarity?
Chance similarity. For example: if you compare two long random sequences of DNA you will almost always find some small region containing the same sequence.
Similarity due to a common origin, followed by divergent/independent evolution (called homology).
Similarity due to convergence, for example bird wings and bat wings, or the lysozyme gut enzyme in cows and colobus monkeys.
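Chance similarity is easy to see in a simulation: two unrelated random DNA strings still share short identical stretches. The sketch below uses a standard dynamic-programming search for the longest common substring; the function name, the sequence length of 200 and the random seed are arbitrary choices for illustration.

```python
import random


def longest_common_substring(s1, s2):
    """Return the longest stretch of characters that occurs contiguously in both strings."""
    previous = [0] * (len(s2) + 1)
    best_length, best_end = 0, 0
    for i, a in enumerate(s1, start=1):
        current = [0] * (len(s2) + 1)
        for j, b in enumerate(s2, start=1):
            if a == b:
                # Extend the common stretch that ended at the previous pair of positions.
                current[j] = previous[j - 1] + 1
                if current[j] > best_length:
                    best_length, best_end = current[j], i
        previous = current
    return s1[best_end - best_length:best_end]


rng = random.Random(0)
random1 = "".join(rng.choice("ACGT") for _ in range(200))
random2 = "".join(rng.choice("ACGT") for _ in range(200))
shared = longest_common_substring(random1, random2)
print(len(shared), shared)
```

The shared stretch found this way says nothing about ancestry, which is why similarity alone is not proof of homology.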

23 Sequence Homology
Homologous protein or DNA sequences share common ancestry.
A statement of homology is therefore an evolutionary hypothesis.
Homology need not imply similar function.
Homology is a binary property: a pair of sequences is either homologous or not homologous. There is no such thing as degree of homology.
Homology is often inferred from sequence similarity.
[Figure: tree diagrams with a time axis t and nodes x, y, a, b, labelled “a, b homologous” and “a, b not homologous”.]
From Wikipedia: In genetics, homology is used in reference to protein or DNA sequences, meaning that the given sequences share ancestry. Sequence homology may also indicate common function. Homology is an all-or-nothing quality; there is no such condition as "degrees of homology." Sequence regions that are homologous may be called conserved, consensus or canonical sequences and represent the most common choice of base or amino acid at each position. Homology among proteins and DNA is often concluded on the basis of sequence similarity, especially in bioinformatics. For example, in general, if two genes have an almost identical DNA sequence, it is likely that they are homologous. However, it may be that the sequence similarity did not arise from their sharing a common ancestor; short sequences may be similar by chance, or sequences may be similar because both were selected to bind to a particular protein, such as a transcription factor. Such sequences are similar but not homologous. Many algorithms exist to cluster protein sequences into sequence families, which are sets of mutually homologous sequences. (See sequence clustering and sequence alignment.)

24 Origin of similar genes
Similar genes in the same genome arise by gene duplication.
Similar genes in different genomes arise from common ancestry.
A copy of a gene might be inserted next to the original.
Two copies mutate independently.
Each can take on separate functions.
All or part can be transferred from one part of the genome to another.
[Figure: gene A undergoes gene duplication to give A and B; speciation then gives A, B in Species I and A’, B’ in Species II.]

25 Orthology and paralogy
"Where the homology is a result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para=in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact). " Fitch WM. Distinguishing homologous from analogous proteins.  Systematic Zoology 1970 Jun;19(2):   CS

26 Orthology and paralogy
From Wikipedia: Homology of sequences can be of two types: orthology or paralogy. Homologous sequences are orthologous if they were separated by a speciation event: if a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous. Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated, then the two copies are paralogous. A pair of sequences that are orthologous to each other are called orthologs; a pair that are paralogous are called paralogs.
Orthologs will typically have the same or similar function. This is not always true for paralogs: due to lack of the original selective pressure upon one copy of the duplicated gene, this copy is free to mutate and acquire new functions.
The genes encoding myoglobin and hemoglobin are considered to be ancient paralogs. Similarly, the four known classes of hemoglobins (hemoglobin A, hemoglobin A2, hemoglobin S, and hemoglobin F) are all paralogs of each other. While each of these genes serves the same basic function of oxygen transport, they have already diverged slightly in function: fetal hemoglobin (hemoglobin F) has a higher affinity for oxygen than adult hemoglobin.
Another example can be found in rodents such as rats and mice. Rodents have a pair of paralogous insulin genes, although it is unclear if any divergence in function has occurred.
Paralogous genes often belong to the same species, but this is not necessary: for example, the hemoglobin gene of humans and the myoglobin gene of chimpanzees are paralogs. This is a common problem in bioinformatics: when the genomes of different species have been sequenced and homologous genes have been found, one cannot immediately conclude that these genes have the same or similar function, as they could be paralogs whose function has diverged.

27 Orthology, paralogy and multigene families
Reproduced from NCBI education website CS

28 Solution 1: Pairwise sequence alignment
Definition: A procedure for optimizing a score function on a pair of sequences S1 and S2 by introducing gap characters into one or both of the sequences so as to construct aligned sequences A1 and A2.
The objective is to find the regions of similarity in the two sequences.
A1 and A2 will be the same length.
Once gap characters are removed, Ai will consist only of a subsequence of Si.

29 Pairwise sequence alignment
Sequences
S1 = a c g g t
S2 = a g g c t t
Alignment
A1 = a c g g - t -
     |   | |   |
A2 = a - g g c t t
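One standard way to find such an alignment is dynamic programming over all prefixes of the two sequences (the Needleman-Wunsch approach). The sketch below is not taken from the lecture: the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and `global_align` is a made-up name.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned1, aligned2)."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of s1 aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of s2 aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


best_score, aligned1, aligned2 = global_align("acggt", "aggctt")
print(best_score)
print(aligned1)
print(aligned2)
```

Under these assumed scores the alignment on the slide (four matches, three gaps) scores 1, which is also the optimum; several alignments can tie, so the traceback may return a different one with the same score.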

30 Global versus Local Alignment
We distinguish:
Global alignment algorithms, which optimize the overall alignment between two sequences.
Local alignment algorithms, which seek only highly similar subsequences. Alignment stops at the ends of regions of strong similarity. This favors finding conserved patterns in otherwise dissimilar sequences.

31 Global vs. Local Alignment
Global
LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA
Local
GKG
|||
GKG
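Local alignment can be sketched with the same dynamic-programming table as global alignment, with two changes: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell (the Smith-Waterman approach). The scoring values below are assumptions for illustration, not the lecture's, and a real protein comparison would use a substitution matrix instead.

```python
def local_align(s1, s2, match=1, mismatch=-1, gap=-1):
    """Local alignment by dynamic programming; returns (score, segment1, segment2)."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            table[i][j] = max(0,
                              table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            if table[i][j] > best:
                best, best_i, best_j = table[i][j], i, j
    # Trace back from the best cell until the score falls to zero.
    a1, a2, i, j = [], [], best_i, best_j
    while table[i][j] > 0:
        pair = match if s1[i - 1] == s2[j - 1] else mismatch
        if table[i][j] == table[i - 1][j - 1] + pair:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif table[i][j] == table[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


# The two protein fragments from the slide, with gap characters removed.
slide_score, segment1, segment2 = local_align("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA")
print(slide_score, segment1, segment2)
```

Because both fragments contain GKG, the local score here is at least 3 under these assumed values, whatever else the two sequences do outside that region.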

32 Solution 2: The dot plot
[Figure: dot plot over the nucleotides G, C, T, A. Window size = 1, matches = 1; each cell scores 0/1 or 1/1.]

33 Filtering the dot plot
[Figure: dot plot over the nucleotides G, C, T, A. Window size = 3, matches = 2; each window scores 0/3, 1/3, 2/3 or 3/3.]
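The filtered dot plot can be sketched as follows: slide a window along both sequences and place a dot wherever the number of matching positions in the window reaches a threshold. The function names are illustrative, and treating the threshold ("matches", or stringency) as a minimum count within the window is an assumption about how the figure was drawn.

```python
def dot_plot(s1, s2, window=1, stringency=1):
    """Return the set of (i, j) window start positions where at least `stringency` sites match."""
    dots = set()
    for i in range(len(s1) - window + 1):
        for j in range(len(s2) - window + 1):
            matches = sum(a == b for a, b in zip(s1[i:i + window], s2[j:j + window]))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(s1, s2, dots):
    """Draw the plot as text, with s1 down the side and s2 across the top."""
    lines = ["  " + " ".join(s2)]
    for i, a in enumerate(s1):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(s2)))
        lines.append(a + " " + row)
    return "\n".join(lines)


unfiltered = dot_plot("GCTA", "GCTA", window=1, stringency=1)
filtered = dot_plot("GCTA", "GCTA", window=3, stringency=2)
print(render("GCTA", "GCTA", unfiltered))
```

Larger windows with a stringency below the window size suppress isolated chance matches while still tolerating a few mismatches along a diagonal.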

34 Dot plots
The dot plot is a graphical method that can be tuned.
[Figure panels labelled 1,1 and 2,2 (window size, stringency).]

35 Dot plots 3,3 5,22 CS

36 Dot matrix analysis with Geneious
Get the phage λ cI and phage P22 c2 repressor sequences from the Genbank Nucleotide database (accessions X00166 and V01153 respectively).
Use Geneious.
Use a window size of 11 and a stringency of 7.
See figure 3.X in Mount.

37 Dot matrix analysis with Geneious

38 Dot matrix analysis with Geneious (2)
Get the human LDL receptor protein sequence from Genbank (accession P01130).
Make a copy, and look at self-similarity.
Use a window size of 1 and a stringency of 1.
Use a window size of 23 and a stringency of 7.

39 Human LDL receptor self similarity
[Figure panels labelled 23,7 and 1,1 (window size, stringency).]

40 Dot plots Two 100 nucleotide fragments of the nef gene
Low complexity repetitive region is visible as dense region of parallel lines CS

41 Which alignment is best?

42 Problem 2: finding similar sequences in a database using query sequence
Biologists often want to find known sequences that are similar to a newly obtained sequence.
How to rapidly compare the new sequence to the hundreds of billions of bases already sequenced?
Pairwise align the new sequence to all the sequences in the database?
Which database to search?

43 Similarity searching
Many heuristic algorithms: BLAST, FASTA
Exact algorithms: pairwise alignment on all database entries; only possible for small databases
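The exact approach, pairwise alignment of the query against every database entry, can be sketched as below. The database contents, the entry names and the scoring values are invented for illustration. The cost grows with query length times total database length, which is why it is only practical for small databases and why heuristics such as BLAST and FASTA exist.

```python
def local_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between two sequences, keeping one table row at a time."""
    previous = [0] * (len(s2) + 1)
    best = 0
    for a in s1:
        current = [0]
        for j, b in enumerate(s2, start=1):
            pair = match if a == b else mismatch
            current.append(max(0, previous[j - 1] + pair, previous[j] + gap, current[j - 1] + gap))
        best = max(best, max(current))
        previous = current
    return best


def search(query, database):
    """Align the query against every entry and return (score, name) pairs, best first."""
    hits = [(local_score(query, sequence), name) for name, sequence in database.items()]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Hypothetical toy database: one entry contains the query, the other barely resembles it.
toy_database = {
    "contains_query": "TTTACGTACGTTT",
    "unrelated": "TTTTTTTT",
}
hits = search("ACGTACG", toy_database)
print(hits)
```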

44 BLAST CS

45 CS

