Computational Biology, Part 1: Introduction. Robert F. Murphy. Copyright © 1996, 2000-2009. All rights reserved.


1 Computational Biology, Part 1: Introduction. Robert F. Murphy. Copyright © 1996, All rights reserved.

2 Course Introduction
- What these courses are about
- What I expect
- What you can expect

3 Course numbers
- = undergraduate course
- = graduate course
  - Difference is an additional research paper for the graduate course

4 What these courses are about
- overview of ways in which computers are used to solve problems in biology
- supervised learning of illustrative or frequently-used algorithms and programs
- supervised learning of programming techniques and algorithms selected from these uses

5 I expect
- students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry) and willingness to learn more
- students have some programming experience and willingness to work to improve
- heterogeneous class - I plan to include refreshers on each new topic
- students will ask questions in class and via

6 You can expect
- Two major course sections
  - Computational Molecular Biology (Sequence & Structure Analysis)
  - Computational Cell Biology (Modeling and Image Analysis)
- Class sessions: lectures/demonstrations
- Recitations: reviews/quizzes/help
- Quizzes on assigned reading/previous lectures (5% of grade)
- Homework assignments (50% for , 40% for )
- Midterm March 5 (20% of grade)
- Final (25% of grade)
- Research Paper (10% for )
- Grades determined by weighted average of components
- Communication on class matters via list

7 Textbooks for first half of course
- For all students
  - Required textbook: An Introduction to Bioinformatics Algorithms
- Recommended additional textbook
  - Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin et al. (ISBN: )

8 Web page
- Lecture Notes (as PowerPoint files)
- Homework Assignments (as PDF files)
- Additional materials as needed

9 Class schedule
- Tuesdays and Thursdays
  - 3:00 to 4:20 lecture
- Fridays
  - 1:30 to 2:20 recitation

10 Information flow
- A major task in computational molecular biology is to “decipher” information contained in biological sequences
- Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers

11 Review of basic biochemistry
- Central Dogma: DNA makes RNA makes protein
- Sequence determines structure determines function

12 Structure
- macromolecular structure divided into
  - primary structure (1D sequence)
  - secondary structure (local 2D & 3D)
  - tertiary structure (global 3D)
- DNA composed of four nucleotides or "bases": A,C,G,T
- RNA composed of four also: A,C,G,U (T transcribed as U)
- proteins are composed of amino acids

13 DNA properties - base composition
- Some properties of long, naturally-occurring DNA molecules can be predicted accurately given only the base composition
- Since double-stranded DNA should have the same number of As as Ts (and of Gs as Cs), DNA base composition is usually expressed as %GC (the percent of all base pairs that are G:C) or χ_GC (the fraction of all base pairs that are G:C)
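The GC fraction described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `gc_fraction` is an assumption, and it computes the fraction from a single strand, which equals the fraction of G:C base pairs in the corresponding duplex.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (counts["G"] + counts["C"]) / total


# Hypothetical toy sequence: 4 of its 6 bases are G or C.
print(gc_fraction("ATGCGC"))
```

For the base counts listed in the GCG example later in these slides (121 A, 167 C, 133 G, 118 T), the same ratio is 300/539, or about 0.557.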

14 DNA properties - melting temperature
- Example of zero order sequence properties
  - T_m, the melting temperature, defined as the temperature at which half of the DNA is single-stranded and half is double-stranded

15 DNA properties - melting temperature
- T_m (°C) = ... χ_GC (for 0.15 M NaCl)
- [Figure: fraction of double-stranded base pairs; fraction of separate strands (dashed line)]

16 DNA structure - restriction maps
- Restriction enzymes cut DNA at specific sequences.
- A restriction map is a graphical description of the order and lengths of fragments that would be produced by the digestion of a DNA molecule with one or more restriction enzymes
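The fragment lengths that a restriction map summarizes can be computed directly from a sequence. The sketch below is illustrative only: `digest_linear` is a hypothetical helper, it handles a single enzyme on one strand of a linear molecule (not the circular plasmid case), and the example uses the EcoRI site GAATTC, which is cut after the first G.

```python
import re


def digest_linear(seq: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths, in order, after cutting a linear sequence.

    site       -- recognition sequence (unambiguous bases only in this sketch)
    cut_offset -- number of bases of the site that lie left of the cut
    """
    seq = seq.upper()
    # A lookahead finds every occurrence, including overlapping ones.
    pattern = re.compile(f"(?={re.escape(site.upper())})")
    cuts = [m.start() + cut_offset for m in pattern.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Toy 18-base sequence with two EcoRI sites (G^AATTC).
print(digest_linear("AAGAATTCCCGAATTCTT", "GAATTC", 1))  # [3, 8, 7]
```

The ordered list of lengths is the information a restriction map draws graphically; a circular molecule with n sites would give n fragments rather than n + 1.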

17 Restriction map for circular plasmid

18

19 Transcription
- transcription is accomplished by RNA polymerase
- RNA polymerase binds to promoters
- promoters have distinct regions "-35" and "-10"
- efficiency of transcription controlled by binding and progression rates
- transcription start and stop affected by tertiary structure
- regulatory sequences can be positive or negative

20 RNA processing
- eukaryotic genes are interrupted by introns
- these are "spliced" out to yield mRNA
- splicing done by spliceosome
- splicing sites are quite degenerate but not all are used
- same transcript can be spliced in multiple ways (“alternative splicing”)

21 RNA splicing

22 Translation
- conversion from RNA to protein is by codon: 3 bases = 1 amino acid
- translation done by ribosome
- translation efficiency controlled by mRNA copy number (turnover) and ribosome binding efficiency
- translation affected by mRNA tertiary structure
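The codon rule (3 bases = 1 amino acid) can be sketched as a table lookup. This is a minimal illustration assuming the standard genetic code and a reading frame that starts at the first base; the names `CODON_TABLE` and `translate` are chosen here for the example and do not come from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate DNA or RNA in frame 0, stopping at the first stop codon."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Five codons taken from the rat obese mRNA shown later, starting at its ATG.
print(translate("ATGTGCTGGAGACCC"))  # MCWRP
```

A real gene finder would also have to choose the reading frame and strand, which this sketch leaves to the caller.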

23 Translation

24 Protein localization
- leader sequences can specify cellular location (e.g., insert across membranes)
- leader sequences usually removed by proteolytic cleavage

25 Protein localization

26 Posttranslational processing
- peptides fold after translation - may be assisted or unassisted
- processing enzymes recognize specific sites (amino acid sequences)
- protein signals can involve secondary and tertiary structure, not just primary structure

27 Representing and Retrieving Sequences

28 Definition
- A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
  - DNA composed of four nucleotides or "bases": A,C,G,T
  - RNA composed of four also: A,C,G,U (T transcribed as U)
  - proteins are composed of amino acids (20)

29 Representation of Sequences
- characters
  - simplest
  - easy to read, edit, etc.
- bit-coding
  - more compact, both on disk and in memory
  - comparisons more efficient
  - more to come on this
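One common way to bit-code unambiguous DNA is to give each base two bits. The sketch below assumes that encoding (A=0, C=1, G=2, T=3); the slides do not specify the scheme used in the course, and `pack` and `unpack` are hypothetical names.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack(seq: str) -> int:
    """Pack an unambiguous DNA string into an integer, two bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover the DNA string; the length is needed because leading As are zeros."""
    bases = []
    for _ in range(length):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


packed = pack("ACGT")
print(packed, unpack(packed, 4))  # 27 ACGT
```

Four bases fit in one byte instead of four characters, and two packed sequences of equal length can be compared with a single integer comparison.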

30 Character representation of sequences
- DNA or RNA
  - use 1-letter codes (e.g., A,C,G,T)
- protein
  - use 1-letter codes
  - can convert to/from 3-letter codes (e.g., A = Ala = Alanine, C = Cys = Cysteine)
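Conversion between 1-letter and 3-letter amino acid codes is a dictionary lookup in each direction. The sketch below covers the 20 standard amino acids; the function names and the hyphen-separated output format are assumptions made for the example.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein: str) -> str:
    """Convert 'MCW' to 'Met-Cys-Trp'."""
    return "-".join(ONE_TO_THREE[aa] for aa in protein.upper())


def to_one_letter(protein: str) -> str:
    """Convert 'Met-Cys-Trp' to 'MCW'."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in protein.split("-"))


print(to_three_letter("MCW"))        # Met-Cys-Trp
print(to_one_letter("Met-Cys-Trp"))  # MCW
```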

31 Representing uncertainty in nucleotide sequences
- It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is “possible” at a given position
  - to express ambiguity during sequencing
  - to express variation at a position in a gene during evolution
  - to express ability of an enzyme to tolerate more than one base at a given position of a recognition site

32 Representing uncertainty in nucleotide sequences
- To do this for nucleotides, we use a set of single character codes that represent all possible combinations of bases
- This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code

33 The I.U.B. Code
- A, C, G, T, U
- R = A, G (puRine)
- Y = C, T (pYrimidine)
- S = G, C (Strong hydrogen bonds)
- W = A, T (Weak hydrogen bonds)
- M = A, C (aMino group)
- K = G, T (Keto group)
- B = C, G, T (not A)
- D = A, G, T (not C)
- H = A, C, T (not G)
- V = A, C, G (not T/U)
- N = A, C, G, T/U (iNdeterminate); X or - are sometimes used
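The code table above maps directly onto a dictionary, and a degenerate recognition site can then be searched for by expanding each code into a character class. This is a sketch under stated assumptions: the helper names are hypothetical, U is treated as T, and the example site GTYRAC is used only to show two ambiguity codes in one pattern.

```python
import re

# Each I.U.B. code and the bases it stands for (U treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iub_to_regex(pattern: str) -> str:
    """Expand an I.U.B. pattern into a regular expression of character classes."""
    return "".join(f"[{IUB[code]}]" for code in pattern.upper())


def find_sites(seq: str, pattern: str) -> list[int]:
    """Return 0-based start positions where the pattern matches, overlaps included."""
    regex = re.compile(f"(?=({iub_to_regex(pattern)}))")
    return [m.start() for m in regex.finditer(seq.upper())]


# GTCGAC and GTTAAC match GTYRAC; GTAGAC does not (A is not a pyrimidine).
print(find_sites("AAGTCGACTTGTTAACGTAGAC", "GTYRAC"))  # [2, 10]
```

This version assumes the sequence being searched is unambiguous; matching an ambiguous sequence against an ambiguous pattern would need set intersection at each position instead.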

34 Representing uncertainty in protein sequences
- Given the size of the amino acid “alphabet”, it is not practical to design a set of codes for ambiguity in protein sequences
- Fortunately, ambiguity is less common in protein sequences than in nucleic acid sequences
- Could use bit-coding as for nucleic acids but rarely done

35 Sequence File Formats

36 Sequence file formats
- Two characteristics of file formats
  - text or binary
  - minimal or annotated
- Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
- Binary files are usually readable only by the program that created them (e.g., MacVector)
- Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)

37 Examples of ASCII sequence file formats
- Fasta

>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTCCTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACCATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGGACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACAGATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTCCTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCCTGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCTGCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
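A Fasta file is simple enough to read with a short loop: a line starting with ">" opens a record, and every following line up to the next ">" is sequence. The sketch below is a minimal reader written for illustration (`parse_fasta` is a hypothetical name, and the example text holds only the first 70 bases of the record above, wrapped over two lines).

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGT
GGCTTTGGTC
"""

for header, sequence in parse_fasta(EXAMPLE):
    identifier = header.split()[0]
    print(identifier, len(sequence))
```

The header is kept whole; splitting it on "|" would recover the database fields, although the exact layout of those fields varies between sources.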

38 Examples of ASCII sequence file formats
- GCG

LOCUS       RATOBESE.G 539 BP SS-RNA ENTERED 09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION   -
KEYWORDS    -
SOURCE      Rattus norvegicus; Norway rat
  ORGANISM  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata;
            Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi;
            Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
  AUTHORS   Murakami, T. & Shima, K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats.
  JOURNAL   Biochem. Biophys. Res. Commun., 209, 3, , (1995)
COMMENT     Database Reference: DDBJ RATOBESE
            Accession: D49653
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone:
            Fax:
[continued]

39 Examples of ASCII sequence file formats
- GCG [continued]

FEATURES    From To/Span Description
  pept      obese
  ????      source; /organism=Rattus norvegicus;
            /strain=OLETF, LETO and Zucker;
            /dev_stage=differentiated; /sequenced_mol=cDNA
            to mRNA; /tissue_type=adipose
BASE COUNT  121 A 167 C 133 G 118 T 0 OTHER
ORIGIN      ?
RATOBESE.G Length: 539 Jan 30, :32 PM Check:
  1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
 61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//

40 Entrez

41 Entrez Databases
- PubMed: The biomedical literature
  - PubMed database contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
- PubMed Central: free, full text journal articles
- Books: online books
- OMIM: Online Mendelian Inheritance in Man
- Nucleotide sequence database (GenBank)
- Protein sequence database
- Genome: complete genome assemblies
- Structure: three-dimensional macromolecular structures

42 Entrez Databases
- Taxonomy: organisms in GenBank
- SNP: single nucleotide polymorphism
- PopSet: population study data sets
- And many more…

43 Entrez essentials
- Semi-automated entry of information into databases
- The links between databases are critical to its usefulness

44 Entrez literature searching
- can find papers on a given subject
- can find papers on a specific gene
- can find papers related to a given paper
- can switch between literature and sequence databases
- PubMed has links to publishers’ websites to view full text of articles
- PubMed Central has free full text copies

45 Entrez sequence searching
- can find sequences for a given gene or protein
- can download copy of sequence

46 Example Entrez Session
- Goal: Find literature and sequences for cystic fibrosis genes
  - Use OMIM with keyword searching.
  - Switch to Nucleotide database to see sequence.
  - Switch to Protein database to see sequence.
  - Change to GenPept format to save sequence.
  - Use links to find related literature in PubMed.
  - Use Related Articles to find similar articles.
  - Search the Nucleotide database by gene name.
  - Set Limits to narrow down the search.

47 Example Entrez Session: home of Entrez

48 Example Entrez Session: search OMIM for ‘cystic fibrosis’

49 Example Entrez Session: first hit is CFTR

50 Example Entrez Session: after clicking Links → Nucleotide

51 Example Entrez Session: after clicking Links → Protein

52 Example Entrez Session: Protein sequence from original cDNA

53 Example Entrez Session: change ‘Send to’ to ‘File’

54 Example Entrez Session: Links → PubMed

55 Example Entrez Session: paper in PubMed that is related

56 Example Entrez Session: Related Articles

57 Computation of related articles
- Similarity between documents is measured by the words they have in common:
  - Which words are considered?
  - What is the weight of each word?
  - How do we calculate a similarity score of two articles?

58 Computation of related articles: words considered
- Remove stopwords: uninformative
- Stem words
- Words from the abstract are “text words”
- Words from the title are put in twice
- Words from the MeSH terms
  - U.S. National Library of Medicine
  - Vocabulary used for indexing articles
  - Consistent way to retrieve information

59 View the MeSH terms: change ‘Display’ to ‘Citation’

60 Computation of related articles: weight of each word
- Global weight:
  - Greater, if the word is less frequent in the whole database
- Local weight:
  - Greater, if the word is more frequent in the document
  - Longer document is not favored

61 Computation of related articles: similarity score of two articles
- Weight of one common word:
  - local wt1 * local wt2 * global wt
- Similarity of two articles: sum of the weights of all common words
- The higher the score, the closer the two articles
- Similarity scores are pre-computed
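The scoring rule on this slide can be written as a sum over the words two articles share. The sketch below takes the local and global weights as given, because the slides do not state how they are computed; the function name and all numeric weights are hypothetical values chosen for illustration.

```python
def similarity(local1: dict[str, float],
               local2: dict[str, float],
               global_wt: dict[str, float]) -> float:
    """Sum local1 * local2 * global weight over the words both articles contain."""
    common = local1.keys() & local2.keys()
    return sum(local1[w] * local2[w] * global_wt[w] for w in common)


# Hypothetical weights for two articles (after stopword removal and stemming).
article1 = {"cftr": 0.8, "chloride": 0.5, "mouse": 0.2}
article2 = {"cftr": 0.6, "chloride": 0.4, "yeast": 0.9}
global_weights = {"cftr": 3.0, "chloride": 2.0, "mouse": 1.0, "yeast": 1.5}

# Only "cftr" and "chloride" are shared: 0.8*0.6*3.0 + 0.5*0.4*2.0
print(similarity(article1, article2, global_weights))
```

Words that appear in only one article contribute nothing, and a rare shared word (high global weight) counts for more than a common one.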

62 Example Entrez Session: search Nucleotide for cftr

63 Example Entrez Session: 1249 hits related to cftr

64 Example Entrez Session: set limits as title and mRNA

65 Example Entrez Session: 46 hits with limits

66 Example Entrez Session: further narrow it down to human

67 Block Diagram for Entrez Literature Searching
[Diagram blocks: Entrez Search Engine; Additional Search Criterion; Desired Output Format; Results of Previous Search; Displayed Item; Selection; Results of Search (List); Item Display]

68 Reading for next class
- Read Chapter 1
- Depending on background, read Chapter 2 and/or Chapter 3
- Read Chapter 4 through section 4.6

