Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)


1 Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)

2 Molecular biology
Nucleic acid
–DNA
–RNA
Central dogma
–Transcription
–Translation
Protein
–Amino acid
–Primary structure
–Secondary structure
–Tertiary structure

3 Nucleic acid
A nucleic acid is a macromolecule composed of chains of monomeric nucleotides. In biochemistry, these molecules carry genetic information or form structures within cells. The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA).


5 Nucleic acid components Sugar 5

6 Nucleic acid components Base Purine –Adenine (A) and guanine (G) Pyrimidine –Thymine (T), cytosine (C) –Uracil (U, only in RNA) 6


9 DNA Chemically, DNA is a long polymer of simple units called nucleotides, with a backbone made of sugars and phosphate groups joined by ester bonds Attached to each sugar is one of four types of molecules called bases It is the sequence of these four bases along the backbone that encodes information 9

10 DNA: Base pairing
Each type of base on one strand forms a bond with just one type of base on the other strand. Purines form hydrogen bonds to pyrimidines, with A bonding only to T, and C bonding only to G. The two types of base pairs form different numbers of hydrogen bonds: AT forms two hydrogen bonds and GC forms three.
Chargaff's rule
–A=T and G=C
DNA sequence
–5’CpGpCpApApTpT 3’TpTpApApCpGpC
–CGCGAATT


12 12 Double helix

13 Hydrogen bond
A hydrogen bond exists between an electronegative atom and a hydrogen atom bonded to another electronegative atom. This type of force always involves a hydrogen atom, hence the name hydrogen bonding. The strongest hydrogen bonds approach the energy of weak covalent bonds (about 155 kJ/mol), but most hydrogen bonds, including those in biological molecules, are considerably weaker.
Biological functions
–DNA/RNA base pairing
–protein secondary/tertiary structure formation
–some properties of the water molecule
–antibody-antigen (and other protein-protein) binding

14 Hydrogen bonds result from differences in electronegativity

15 15 Major and minor grooves

16 DNA structure 16 k5iS1f0&NR=1

17 Any Questions? 17 About DNA

18 Central dogma 18


20 Central dogma
The process by which information is extracted from the nucleotide sequence of a gene and then used to make a protein is essentially the same for all living things on Earth, and is described by the grandly named central dogma of molecular biology. Information in cells passes from DNA to RNA to proteins.

21 RNA
Information stored in DNA is used to make a more transient, single-stranded polynucleotide called RNA (ribonucleic acid). RNA is very similar to DNA, but differs in a few important structural details
–in the cell RNA is usually single stranded, while DNA is usually double stranded
–RNA nucleotides contain ribose while DNA contains deoxyribose (a type of ribose that lacks one oxygen atom)
–in RNA the base uracil substitutes for thymine, which is present in DNA


23 Central dogma Transcription Transcription is the synthesis of RNA under the direction of DNA Both nucleic acid sequences use the same language, and the information is simply transcribed, or copied, from one molecule to the other DNA sequence is enzymatically copied by RNA polymerase to produce a complementary nucleotide RNA strand, called messenger RNA (mRNA) 23

24 DNA transcription 24 Z3DsntU

25 Transcription detail
class.unl.edu/biochem/gp2/m_biology/animation/m_animations/gene2.swf

26 RNA Various types mRNA –messenger RNA (mRNA) is the RNA that carries information from DNA to the ribosome –the coding sequence of the mRNA determines the amino acid sequence in the protein that is produced Non-coding RNA –many RNAs do not code for protein –these non-coding RNA can be encoded by their own genes (RNA genes), but can also derive from mRNA introns –the most prominent examples of non-coding RNAs are transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which are involved in the process of translation –there are also non-coding RNAs involved in gene regulation, RNA processing and other roles 26

27 Central dogma Translation Translation is the second stage of protein biosynthesis Translation occurs in the cytoplasm where the ribosomes are located In translation, mRNA is decoded to produce a specific polypeptide according to the rules specified by the genetic code Many types of transcribed RNA, such as transfer RNA, ribosomal RNA, and small nuclear RNA are not necessarily translated into an amino acid sequence 27
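
The decoding step described on this slide can be sketched in Perl, the language used later in the course. This is only a minimal illustration: the codon table is deliberately partial (a real program would list all 64 codons of the standard genetic code), and the subroutine name translate_mrna is a hypothetical helper, not something from the slides.

```perl
use strict;
use warnings;

# A deliberately partial codon table taken from the standard genetic code.
# Assumption: only the codons needed for this demonstration are listed.
my %codon_table = (
    AUG => 'M',
    UUU => 'F', UUC => 'F',
    GGU => 'G', GGC => 'G', GGA => 'G', GGG => 'G',
    AAA => 'K', AAG => 'K',
    UAA => '*', UAG => '*', UGA => '*',    # stop codons
);

# translate_mrna (hypothetical name) reads codons starting at the first base
# and stops at a stop codon or at a codon that is missing from the table.
sub translate_mrna {
    my ($mrna) = @_;
    my $protein = '';
    for ( my $i = 0; $i + 3 <= length($mrna); $i += 3 ) {
        my $aa = $codon_table{ substr( $mrna, $i, 3 ) };
        last if !defined $aa || $aa eq '*';
        $protein .= $aa;
    }
    return $protein;
}

print translate_mrna('AUGUUUGGCAAAUAA'), "\n";    # prints MFGK
```

A hash keyed by codon keeps the lookup readable; handling reading frames and unknown bases is left out on purpose.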

28 From RNA to protein synthesis 28 gkPEAo

29 Protein translation 29 lonmA0

30 30 Genetic code

31 Any Questions? 31 About central dogma

32 Protein 32

33 Protein Proteins are large organic compounds made of amino acids arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues Proteins can also work together to achieve a particular function, and they often associate to form stable complexes 33

34 Protein: Amino acid
In chemistry, an amino acid is a molecule that contains both amine and carboxyl functional groups. In biochemistry, this term refers to alpha-amino acids with the general formula H2NCHRCOOH, where R is an organic substituent. In the alpha-amino acids, the amino and carboxylate groups are attached to the same carbon, which is called the α-carbon.


36 Amino acid Various side chains The various alpha amino acids differ in which side chain (R group) is attached to their alpha carbon They can vary in size from just a hydrogen atom in glycine through a methyl group in alanine to a large heterocyclic group in tryptophan 36


41 Amino acid: The building blocks of proteins
Amino acids combine in a condensation reaction that releases water; each amino acid incorporated into the chain is then called an “amino acid residue”, and adjacent residues are held together by a peptide bond. Proteins are defined by their unique sequence of amino acid residues; this sequence is the primary structure of the protein. Just as the letters of the alphabet can be combined to form an almost endless variety of words, amino acids can be linked in varying sequences to form a vast variety of proteins.

42 42 Peptide bond


45 Protein After knowing amino acids Amino acids form short polymer chains called peptides or longer chains called either polypeptides or proteins The process of such formation from an mRNA template is known as translation, which is part of protein biosynthesis Twenty amino acids are encoded by the standard genetic code and are called proteinogenic or standard amino acids 45

46 Protein structure hierarchy 46


51 Protein structure hierarchy: Secondary structure
In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA). It does not, however, describe specific atomic positions in three-dimensional space, which are considered to be tertiary structure.


53 Protein structure hierarchy: Tertiary structure
The three-dimensional structure of a protein or any other macromolecule, as defined by the atomic coordinates. It describes the spatial relations among its secondary structures. Tertiary structure is considered to be largely determined by the protein’s primary sequence, that is, the sequence of amino acids of which it is composed. The majority of protein structures known to date have been solved with the experimental technique of X-ray crystallography. A second common way of solving protein structures uses NMR (Nuclear Magnetic Resonance), which
–provides lower-resolution data and is limited to relatively small proteins
–can provide time-dependent information about the motion of a protein in solution


55 Protein structure hierarchy Quaternary structure Many proteins are actually assemblies of more than one polypeptide chain, which in the context of the larger assemblage are known as protein subunits In addition to the tertiary structure of the subunits, multiple-subunit proteins possess a quaternary structure, which is the arrangement into which the subunits assemble 55

56 Protein sub-structure 56

57 Protein sub-structure: Domain
A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain. Each domain forms a compact three-dimensional structure and often can be independently stable and folded. Domains vary in length from about 25 amino acids up to 500 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units.

58 Protein domain Zinc finger 58

59 Protein sub-structure: Motif
Sequence motif
–a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance
–for proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent
Structural motif
–a three-dimensional structural element or fold within the chain, which appears also in a variety of other molecules
–in the context of proteins, the term is sometimes used interchangeably with “structural domain,” although a domain need not be a motif and, if it contains a motif, need not be made up of only one


63 Molecular biology: Reference
Website of Prof. Juang (莊榮輝), National Taiwan University
–http://juang.bst.ntu.edu.tw/BC2008/index.htm
National Chiao Tung University molecular biology website
–http://www.life.nctu.edu.tw/~mb/c htm

64 Any Questions? 64 About molecular biology

65 Sequence alignment
In: a FASTA file
Out: pairwise sequence alignment
Requirement
- output alignment score (identity)
- complexity/teamwork report
- using Perl would be the best
Bonus
- alignment allowing mismatches
- output alignment

66 Deadline /4/27 23:59 Zip your code, step-by-step README, complexity analyses and anything worthy extra credit. to

67 Input
–Download from UniProt
–UniProt is the universal protein resource, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information.
–http://www.uniprot.org/uniprot/?query=Saccharomyces+cerevisiae+transcription+factor+AND+reviewed%3ayes&force=yes&format=fasta
–Example records:
    >sp|P32333|MOT1_YEAST TATA-binding protein-associated fac...
    MTSRVSRLDRQVILIETGSTQVVRNMAADQMGDLAKQHPEDILSLLSRVYPFLLVKKWET...
    TFIKTLR
    >sp|Q00947|STP1_YEAST Transcription factor STP1...
    MPSTTLLFPQKHIRAIPGKIYAFFRELVSGVIISKPDLSHHYSCENATKEEGKDAADEEK...
    >sp|P38830|NDT80_YEAST Meiosis-specific transcription fac...
    MNEMENTDPVLQDDLVSKYERELSTEQEEDTPVILTQLNEDGTTSNYFDKRKLKIAPRST...
Output
    >MOT1_YEAST
    STP1_YEAST 90
    NDT80_YEAST 80
    >STP1_YEAST
    MOT1_YEAST 90
    NDT80_YEAST 70
    >NDT80_YEAST
    MOT1_YEAST 80
    STP1_YEAST 70
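
Reading such a FASTA file could look roughly like the sketch below. It assumes UniProt-style headers of the form >sp|ACCESSION|ENTRY_NAME and keeps only the entry name as the key; parse_fasta is a hypothetical helper, and the two records are made-up placeholders read from an in-memory string so that the example runs without a downloaded file.

```perl
use strict;
use warnings;

# parse_fasta (hypothetical name) returns a hash that maps each entry name
# to its sequence, joining sequences that are split over several lines.
sub parse_fasta {
    my ($fh) = @_;
    my ( %seq, $name );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        if ( $line =~ /^>(\S+)/ ) {
            my @fields = split /\|/, $1;       # sp | accession | entry name
            $name = $fields[-1];
            $seq{$name} = '';
        }
        elsif ( defined $name ) {
            $line =~ s/\s+//g;                 # drop stray whitespace
            $seq{$name} .= $line;
        }
    }
    return %seq;
}

# Made-up records, used here instead of a real UniProt download.
my $fasta = ">sp|X00001|DEMO1_YEAST made-up record one\nMTSRV\nSRLDR\n"
          . ">sp|X00002|DEMO2_YEAST made-up record two\nMPSTTLLF\n";

open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my %seq = parse_fasta($fh);
close $fh;

print "$_\t$seq{$_}\n" for sort keys %seq;
```

For a real file, the in-memory string would be replaced by the downloaded file name in the open call.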

68 Sequence similarity Sequence identity Sequence alignment –dynamic programming –backtracking –substitution matrix 68

69 Sequence similarity: Identity
Which sequence is more similar to DKELIR?
–EPELIR or DKGLIR
A trivial (but useful) concept, identity
    DKELIR
    EPELIR
      ****    identity: 4/6 = 66.7%

    DKELIR
    DKGLIR
    ** ***    identity: 5/6 = 83.3%
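
The identity of two equal-length sequences can be computed position by position. The sketch below is one simple way to do it; the subroutine name identity is an assumption made for illustration.

```perl
use strict;
use warnings;

# identity (hypothetical name) returns the percentage of positions at which
# two sequences of equal length carry the same residue.
sub identity {
    my ( $s1, $s2 ) = @_;
    die "Sequences must have the same length\n"
        unless length($s1) == length($s2);
    return 0 if length($s1) == 0;

    my $matches = 0;
    for my $i ( 0 .. length($s1) - 1 ) {
        $matches++ if substr( $s1, $i, 1 ) eq substr( $s2, $i, 1 );
    }
    return 100 * $matches / length($s1);
}

printf "%.1f%%\n", identity( 'DKELIR', 'EPELIR' );    # 66.7%
printf "%.1f%%\n", identity( 'DKELIR', 'DKGLIR' );    # 83.3%
```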

70 Sequence similarity: Alignment
When two sequences have different lengths
    DKELIR
    MERPEPELIR    identity: 0%
Obviously, we need to shift the first sequence by 4 residues
        DKELIR
    MERPEPELIR
This shifting is called alignment
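
Finding the shift automatically can be sketched as a brute-force scan: slide the shorter sequence along the longer one and keep the offset with the most matching positions. This ungapped search is only an illustration of the idea of shifting, not the dynamic-programming method introduced on the following slides, and best_shift is a hypothetical name. It assumes the first argument is no longer than the second.

```perl
use strict;
use warnings;

# best_shift (hypothetical name) returns the offset into $long at which
# $short matches the most positions, together with that number of matches.
# Assumption: length($short) <= length($long).
sub best_shift {
    my ( $short, $long ) = @_;
    my ( $best_offset, $best_matches ) = ( 0, -1 );
    for my $offset ( 0 .. length($long) - length($short) ) {
        my $matches = 0;
        for my $i ( 0 .. length($short) - 1 ) {
            $matches++
                if substr( $short, $i, 1 ) eq substr( $long, $offset + $i, 1 );
        }
        if ( $matches > $best_matches ) {
            ( $best_offset, $best_matches ) = ( $offset, $matches );
        }
    }
    return ( $best_offset, $best_matches );
}

my ( $offset, $matches ) = best_shift( 'DKELIR', 'MERPEPELIR' );
print "shift by $offset residues, $matches matches\n";
# prints: shift by 4 residues, 4 matches
```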

71 Sequence alignment
A way of arranging the sequences of DNA, RNA or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences
More complex alignments may involve gaps (EDKELIR vs. MERPEPELIR)
     E----DKELIR
    MERPEP--ELIR
     *      ****    identity: 5/11 = 45.5%
And a substitution matrix
     E--DKELIR
    MERPEPELIR
     *  . ****    identity: 6/9 = 66.7%

72 Sequence alignment: Dynamic programming
[Matrix: columns GAATTCAGTTA, rows GGATCGA]

73 A class of solution methods for solving sequential decision problems with a compositional cost structure
In this matrix, each element $S_{i,j}$ holds the best alignment score between the two corresponding sub-sequences
–the key is to find the relationship between the problem $S_{i,j}$ and its sub-problems $S_{\alpha,\beta}$, where $\alpha \le i$ and $\beta \le j$

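74 Dynamic programming: Insertion gap
[Matrix: columns GAATTCAGTTA, rows GGATCGA; row T reads 0 1 2 2 3 3 ?, and rows C, G, A below start with 0]

75 Dynamic programming: Deletion gap
[Matrix: columns GAATTCAGTTA, rows GGATCGA; row C reads 0 1 2 2 ?, and rows G, A below start with 0]

76 Dynamic programming: Match
[Matrix: columns GAATTCAGTTA, rows GGATCGA; row T reads 0 1 2 2 ?, and rows C, G, A below start with 0]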

77 Dynamic programming: Relationship
Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution
–each substructure is optimal
–sub-problems are dependent; otherwise, a divide-and-conquer approach is the choice
Now that we know the three relationships
–insertion gap
–deletion gap
–match
we can easily construct an alignment based on this matrix with the so-called backtracking technique
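
The three relationships can be combined into one recurrence. The sketch below assumes the simplest scoring that is consistent with the partial matrices on the preceding slides: a match adds 1, while gaps and mismatches add nothing. Both subroutine names are hypothetical, and the traceback returns just one of the possibly many best alignments.

```perl
use strict;
use warnings;

# Fill the matrix: $s[$i][$j] is the best score for the first $i letters
# of $rows against the first $j letters of $cols.
# Assumed scoring: match = 1, gap = 0, mismatch = 0.
sub align_score_matrix {
    my ( $cols, $rows ) = @_;
    my @s;
    $s[0][$_] = 0 for 0 .. length($cols);
    $s[$_][0] = 0 for 0 .. length($rows);
    for my $i ( 1 .. length($rows) ) {
        for my $j ( 1 .. length($cols) ) {
            my $match =
                substr( $rows, $i - 1, 1 ) eq substr( $cols, $j - 1, 1 ) ? 1 : 0;
            my $best = $s[ $i - 1 ][ $j - 1 ] + $match;              # diagonal
            $best = $s[ $i - 1 ][$j] if $s[ $i - 1 ][$j] > $best;     # from above
            $best = $s[$i][ $j - 1 ] if $s[$i][ $j - 1 ] > $best;     # from the left
            $s[$i][$j] = $best;
        }
    }
    return \@s;
}

# Walk back from the bottom-right cell and rebuild one best alignment.
sub traceback {
    my ( $s, $cols, $rows ) = @_;
    my ( $i, $j ) = ( length($rows), length($cols) );
    my ( $top, $bottom ) = ( '', '' );
    while ( $i > 0 || $j > 0 ) {
        if (   $i > 0 && $j > 0
            && substr( $rows, $i - 1, 1 ) eq substr( $cols, $j - 1, 1 )
            && $s->[$i][$j] == $s->[ $i - 1 ][ $j - 1 ] + 1 )
        {
            $top    = substr( $cols, --$j, 1 ) . $top;       # match
            $bottom = substr( $rows, --$i, 1 ) . $bottom;
        }
        elsif ( $j > 0 && $s->[$i][$j] == $s->[$i][ $j - 1 ] ) {
            $top    = substr( $cols, --$j, 1 ) . $top;       # gap in $rows
            $bottom = '-' . $bottom;
        }
        else {
            $top    = '-' . $top;                            # gap in $cols
            $bottom = substr( $rows, --$i, 1 ) . $bottom;
        }
    }
    return ( $top, $bottom );
}

my ( $x, $y ) = ( 'GAATTCAGTTA', 'GGATCGA' );
my $matrix = align_score_matrix( $x, $y );
print "best score: $matrix->[ length $y ][ length $x ]\n";    # best score: 6
my ( $ax, $ay ) = traceback( $matrix, $x, $y );
print "$ax\n$ay\n";
```

Filling the matrix takes time proportional to the product of the two sequence lengths, which is the kind of statement the complexity report in the assignment asks for.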

78 Dynamic programming: Backtracking
[Matrix: columns GAATTCAGTTA, rows GGATCGA]

79 Backtracking: Alternative paths
[Matrix: columns GAATTCAGTTA, rows GGATCGA]

80 Backtracking
The backtracking algorithm enumerates a set of partial candidates that, in principle, could be completed in various ways to give all the possible solutions to the given problem. A dynamic programming matrix can produce all possible alignments of the best score from different backtracking paths.
Alternative paths
    G-AATTCAGTTA
    GGA-T-C-G--A
    * * * * *  *    identity: 6/12 = 50%

    G-AATTCAGTTA
    GGA--TC-G--A
    * *  ** *  *    identity: 6/12 = 50%

81 Sequence alignment: Substitution matrix
Some alignments may involve a mismatch relationship
     E----DKELIR
    MERPEP--ELIR
     *      ****    identity: 5/11 = 45.5%

     E--DKELIR
    MERPEPELIR
     *  . ****    identity: 6/9 = 66.7%

82 Sequence alignment with substitution matrix: Initialize
[Matrix: columns GAATTCAGTTA, rows GGATCGA; the first column reads -3, -6, -9, -12, -15, -18 for rows G, G, A, T, C, G]

83 Sequence alignment with substitution matrix: Match
[Matrix as above; the first row G reads -3 ?]

84 Sequence alignment with substitution matrix: Deletion gap
[Matrix as above; the first row G reads -3 8 ?]

85 Sequence alignment with substitution matrix: Insertion gap
[Matrix as above; the second row G reads -6 ?]

86 Sequence alignment with substitution matrix: Mismatch
[Matrix as above; the second row G reads -6 5 ?]

87 Sequence alignment with substitution matrix: Initialize
[Matrix as above; the second row G reads -6 5 3 ?]

88 Sequence alignment with substitution matrix: Complete
[Matrix: columns GAATTCAGTTA, rows GGATCGA]

89 Sequence alignment with substitution matrix: Backtracking
[Matrix: columns GAATTCAGTTA, rows GGATCGA]
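
A scored version of the matrix fill might look like the sketch below. The scores are assumptions: a gap penalty of -3, a match score of 8 and a mismatch score of -5 were chosen because they reproduce the few cell values that remain legible in the slides, not because the slides state them. The substitution matrix is stored as a hash of hashes, and global_alignment_matrix is a hypothetical name.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Assumed scores, inferred from the legible cells of the slides.
my $GAP = -3;
my %SUBST;
for my $p (qw(A C G T)) {
    for my $q (qw(A C G T)) {
        $SUBST{$p}{$q} = $p eq $q ? 8 : -5;    # match : mismatch
    }
}

# Fill the global-alignment matrix; the first row and column hold the
# accumulated gap penalties (-3, -6, -9, ...).
sub global_alignment_matrix {
    my ( $cols, $rows ) = @_;
    my @s;
    $s[0][$_] = $GAP * $_ for 0 .. length($cols);
    $s[$_][0] = $GAP * $_ for 0 .. length($rows);
    for my $i ( 1 .. length($rows) ) {
        for my $j ( 1 .. length($cols) ) {
            my $pair =
                $SUBST{ substr( $rows, $i - 1, 1 ) }{ substr( $cols, $j - 1, 1 ) };
            $s[$i][$j] = max(
                $s[ $i - 1 ][ $j - 1 ] + $pair,    # match or mismatch
                $s[ $i - 1 ][$j] + $GAP,           # gap, coming from above
                $s[$i][ $j - 1 ] + $GAP,           # gap, coming from the left
            );
        }
    }
    return \@s;
}

my $m = global_alignment_matrix( 'GAATTCAGTTA', 'GGATCGA' );
print join( "\t", @{ $m->[$_] } ), "\n" for 0 .. 7;
```

For protein sequences the same structure applies, with %SUBST filled from a published substitution matrix instead of the two-valued DNA scores assumed here.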

90 Beginning Perl for Bioinformatics 90

91 Biology and computer science
Bioinformatics
–biological data is proliferating rapidly
–computer-based tools now play an increasingly critical role in the advancement of biological research
–Bioinformatics, a rapidly evolving discipline, is the application of computational tools and techniques to the management and analysis of biological data
Recently, the new term in silico has become a common reference to biological studies carried out in the computer
–in vivo: in life, that is, in a living organism
–in vitro: in glass, that is, in the test tube
–in silico: in silicon, that is, in an algorithm run on a computer; most computer chips are made primarily of silicon

92 Getting started with Perl
Perl is a popular programming language that's extensively used in areas such as bioinformatics and web programming
The word Perl refers to
–the language in which you will write programs (Perl programs, Perl scripts, Perl code)
–the application on your computer that runs those programs (the Perl interpreter)
A low and long learning curve
–one can get started very quickly, and then learn additional topics as needed; for example, object-oriented programming is also well supported in Perl
Perl's benefits
–ease of programming, rapid prototyping, portability, a healthy community and abundant online resources

93 Perl: Sequences and strings
Example 4-1. Putting DNA into the computer
    #!/usr/bin/perl -w
    # Storing DNA in a variable, and printing it out
    use strict;

    # First we store the DNA in a variable called $DNA
    my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

    # Next, we print the DNA onto the screen
    print $DNA;
Checkpoints
–#: comments in Perl, like // in C
–#!: magic line in Unix, ignored in Windows
–-w: display warnings, a good habit
–use strict: forces programmers to declare variables first, a good habit
–my: declares a variable
–$: scalar variable
–print: a convenient printf()

94 Example 4-2. Concatenating DNA
    #!/usr/bin/perl -w
    # Concatenating DNA

    # Store two DNA fragments into two variables
    my $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    my $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

    # Using "string interpolation"
    my $DNA3 = "$DNA1$DNA2";
    print "$DNA3\n\n";

    # An alternative way using the "dot operator":
    $DNA3 = $DNA1 . $DNA2;
    print "$DNA3\n\n";

    # Print the same thing without using the variable $DNA3
    print $DNA1, $DNA2, "\n\n";
    print "$DNA1$DNA2\n\n";
Checkpoints
–"": string; notice that "$a\n" is different from '$a\n'
–.: concatenation operator
–print: allows multiple arguments
Maybe the Perl slogan should be, “There are more than two ways to do it.”

95 Example 4-3. Transcribing DNA into RNA
    #!/usr/bin/perl -w
    # Transcribing DNA into RNA
    use strict;

    # The DNA
    my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

    # Transcribe the DNA to RNA
    # Substitute all T's with U's
    my $RNA = $DNA;
    $RNA =~ s/T/U/g;

    print "$RNA\n";
Checkpoints
–=~: binding operator
–s///: substitute command, where g is a modifier of s


97 Using the Perl documentation
A Perl programmer’s most important resource is the Perl documentation
–it should be installed on your computer
–it may also be found on the Internet at the Perl site
–just Google ‘perl’
perldoc
–$ perldoc -f printf
–Documentation link → Perl’s Builtin Functions → Alphabetical Listing of Perl’s Functions
The Perl documentation
–checking the examples it gives is usually the quickest way
–it may answer some questions but raise others. E.g., the documentation of print starts out: “Prints a string or a comma-separated list of strings.” But then comes a bunch of gibberish (or is it just the learning curve?): Filehandles? Output streams? List context?
–it also includes several tutorials

98 Example 4-4. Calculating the reverse complement of a strand of DNA
    #!/usr/bin/perl -w
    # Calculating the reverse complement of a strand of DNA
    use strict;

    # The DNA
    my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

    # Calculate the reverse complement
    # Warning: this attempt will fail!
    # First, copy the DNA into new variable $revcom (short for REVerse COMplement)
    # Notice that variable names can use lowercase letters like "revcom"
    # as well as uppercase like "DNA". In fact, lowercase is more common.
    my $revcom = reverse $DNA;

    # Next substitute all bases by their complements,
    # A->T, T->A, G->C, C->G
    $revcom =~ s/A/T/g;
    $revcom =~ s/T/A/g;
    $revcom =~ s/G/C/g;
    $revcom =~ s/C/G/g;

    print "$revcom\n";
This example has a logical bug
Checkpoints
–reverse(): reverses a string; the parentheses can be omitted

99 What is the bug?

100 Example 4-4. Calculating the reverse complement of a strand of DNA
    #!/usr/bin/perl -w
    # Calculating the reverse complement of a strand of DNA
    use strict;

    # The DNA
    my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

    # The problem is that the first two substitute commands above
    # change all the A's to T's (so there are no A's) and then all the
    # T's to A's (so all the original A's and T's are all now A's).
    # Same thing happens to the G's and C's all turning into G's.

    # Make a copy of the DNA
    my $revcom = reverse $DNA;

    # Next substitute all bases by their complements,
    # A->T, T->A, G->C, C->G
    $revcom =~ tr/ACGTacgt/TGCAtgca/;

    print "$revcom\n";
Checkpoints
–tr: transliteration operator


102 Example 4-5. Reading protein sequence data from a file
    #!/usr/bin/perl -w
    # Reading protein sequence data from a file
    use strict;

    # The filename of the file containing the protein sequence data
    my $proteinfilename = 'NM_021964fragment.pep';

    # First we have to "open" the file, and associate a "filehandle" with it.
    open FH, $proteinfilename;

    # Now we do the actual reading of the protein sequence data from the file,
    # by using the angle brackets to get the input from the filehandle.
    # We store the data into our variable $protein.
    my $protein = <FH>;

    # Now that we've got our data, we can close the file.
    close FH;

    print $protein;
Checkpoints
–open(): opens a file, like fopen() in C
–filehandle: an interface between the program and the file system, a very common concept
–<>: the line-input operator in Perl, just memorize it
–close(): closes a file, like fclose() in C

103 Example 4-6. Reading protein sequence data from a file, take 2
    #!/usr/bin/perl -w
    # Reading protein sequence data from a file, take 2
    use strict;

    # Open a file
    open FH, 'NM_021964fragment.pep';

    # Suppose that the file has three lines, and since the read only is returning
    # one line, we'll read a line and print it, three times.

    # First line
    my $protein = <FH>;
    print $protein;

    # Second line
    $protein = <FH>;
    print $protein;

    # Third line
    $protein = <FH>;
    print $protein;

    close FH;
Checkpoints
–notice how clumsy this is: one read and one print for every line

104 Example 4-7. Reading protein sequence data from a file, take 3
    #!/usr/bin/perl -w
    # Reading protein sequence data from a file, take 3
    use strict;

    # Open a file
    open FH, 'NM_021964fragment.pep';

    # Read the protein sequence data from the file, and
    # store it into the array variable @protein
    my @protein = <FH>;

    # Print the protein onto the screen
    print @protein;

    close FH;
Checkpoints
–@: array variable
–$protein[0]: the first element
–print: knows how to print an array, no loop is required

105 Perl facilities for array
    my @bases = ( 'A', 'C', 'G', 'T' );    # A C G T
pop, push (for stack)
    my $base1 = pop @bases;
    print $base1;               # T
    print @bases;               # A C G
    push @bases, $base1;        # A C G T
shift, unshift (for queue)
    $base1 = shift @bases;
    print $base1;               # A
    print @bases;               # C G T
    unshift @bases, $base1;     # A C G T
reverse
    my @rev = reverse @bases;   # T G C A
scalar (length of an array)
    print scalar @bases;        # 4
splice (insert/delete/replace elements at an arbitrary place)
–extremely powerful!
    splice @bases, 2, 0, 'X';   # A C X G T

106 Any Questions? 106 About Perl facilities for array

107 How to
    my @array = ( '1', '2', '3', '4' );
    splice @array, 2, 1, 'a', 'b';    # 1 2 a b 4
answer

108 Example 4-8. Scalar context and list context
    #!/usr/bin/perl -w
    # Demonstration of "scalar context" and "list context"
    use strict;

    my @bases = ( 'A', 'C', 'G', 'T' );

    my $a = @bases;
    print $a;    # 4

    ($a) = @bases;
    print $a;    # A
Checkpoints
–many Perl operations behave differently depending on the context in which they are used
–another example is reverse(), which behaves differently on a string and on an array
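
The reverse() case mentioned in the checkpoints can be demonstrated with a short sketch. The variable names are chosen only for this illustration.

```perl
use strict;
use warnings;

my @bases = ( 'A', 'C', 'G', 'T' );

# List context: reverse returns the elements in the opposite order.
my @reversed = reverse @bases;
print "@reversed\n";                    # T G C A

# Scalar context: reverse concatenates its arguments and reverses the
# resulting string character by character.
my $dna        = 'ACGT';
my $rev_string = reverse $dna;
print "$rev_string\n";                  # TGCA

# A common pitfall: print supplies list context, so the one-element list
# ($dna) is "reversed" and the string itself comes out unchanged.
print reverse($dna), "\n";              # ACGT
print scalar( reverse $dna ), "\n";     # TGCA
```

Wrapping the call in scalar() is the usual way to force the string-reversing behaviour where list context would otherwise apply.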

109 Perl: Flow control
Example 5-1. if-elsif-else
    #!/usr/bin/perl -w
    # if-elsif-else
    use strict;

    my $word = 'MNIDDKL';

    # if-elsif-else conditionals
    if ( 'QSTVSGE' eq $word ) {
        print "QSTVSGE\n";
    }
    elsif ( 'MNIDDKL' eq $word ) {
        print "MNIDDKL--the magic word!\n";
    }
    else {
        print "Is \"$word\" a peptide?\n";
    }
Checkpoints
–elsif: else if
–eq: equal, checks for equality between strings
–ne: not equal, checks for inequality between strings

110 Example 5-2. Reading protein sequence data from a file, take 4
    #!/usr/bin/perl -w
    # Reading protein sequence data from a file, take 4
    use strict;

    # In case the open fails, print an error message and exit.
    my $fn = 'NM_021964fragment.pep';
    unless ( open FH, $fn ) {
        print "Could not open file $fn!\n";
        exit;
    }

    # Read and print line-by-line.
    my $protein;
    while ( $protein = <FH> ) {
        print $protein;
    }
    close FH;
Checkpoints
–unless: the opposite of if, just to read more like English
–open: returns false when it fails
–exit: terminates the program, like exit() in C
–In Perl, an assignment returns the assigned value. If there is another line to read, the assignment occurs, $protein is defined, and the conditional is true.

111 Perl: Code layout
Format A
    while ($a) {
        if ($b) {
            print "ok\n";
        }
    }
Format B
    while ($a)
    {
        if ($b)
        {
            print "ok\n";
        }
    }
Format C
    while ($a) {
    if ($b) {
    print "ok\n";
    }
    }
Format D
    while($a){if($b){print "ok\n";}}
A and B are common ways to lay out code; A is preferred in Perl
DON’T use C or D, ever!
Perl provides a guide for code style.
–$ perldoc perlstyle
–however, they are not rules, and you may use your own judgment

112 Perl: Subroutine
Example 6-1. A subroutine to append ACGT to DNA
    #!/usr/bin/perl -w
    # A program with a subroutine to append ACGT to DNA
    use strict;

    my $dna = 'CGACGTCTTCTCAGGCGA';

    # The argument is $dna; the result is $longer_dna
    my $longer_dna = &addACGT($dna);
    print $longer_dna;

    ####################
    # Subroutines for Example 6-1
    sub addACGT {
        my ($dna) = @_;
        $dna .= 'ACGT';
        return $dna;
    }
Checkpoints
–&: calls a subroutine; it can be omitted but is helpful for syntax highlighting (e.g., in vi)
–sub: subroutine definition
–my $dna: scoping issue
–@_: the argument array, a special variable in Perl
–return: return value

113 Perl: Scoping
Make the variables specific to the subroutine with my
–my is a keyword in Perl that limits variables to the block where they are used (here the block is the subroutine)
Making variables local to a restricted part of a program is called scoping. There are different models of scoping:
–in Perl, using my variables is known as lexical scoping, also known as static scoping
–dynamic scoping is hard to track; use it as little as possible
Manipulating $_[0] directly will change the caller's argument (call by reference)
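
The difference between working on a lexical copy and working on $_[0] can be seen in a small sketch. Both subroutine names are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Works on a lexical copy; the caller's variable is left untouched.
sub append_copy {
    my ($dna) = @_;
    $dna .= 'ACGT';
    return $dna;
}

# Works on $_[0], which is an alias for the caller's first argument.
sub append_in_place {
    $_[0] .= 'ACGT';
    return;
}

my $dna    = 'CGA';
my $longer = append_copy($dna);
print "$dna $longer\n";    # CGA CGAACGT

append_in_place($dna);
print "$dna\n";            # CGAACGT
```

Copying the arguments into my variables first is the safer habit; modifying $_[0] is best reserved for cases where changing the caller's data is the explicit goal.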

114 Any Questions? 114 About Perl subroutine

115 How to return multiple values?
    return ( 'a', 'b' );           # answer #1
    $_[0] = 'a'; $_[1] = 'b';      # answer #2
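
Both answers can be tried out with a small sketch that counts A/T and G/C bases in a DNA string. The subroutine names count_bases and count_bases_into are hypothetical.

```perl
use strict;
use warnings;

# Answer #1: return a list and unpack it at the call site.
sub count_bases {
    my ($dna) = @_;
    my $at = ( $dna =~ tr/AT// );    # tr with an empty replacement only counts
    my $gc = ( $dna =~ tr/GC// );
    return ( $at, $gc );
}

my ( $at, $gc ) = count_bases('AACGT');
print "AT=$at GC=$gc\n";             # AT=3 GC=2

# Answer #2: write the results into the caller's variables through @_.
sub count_bases_into {
    my $dna = $_[0];
    $_[1] = ( $dna =~ tr/AT// );
    $_[2] = ( $dna =~ tr/GC// );
    return;
}

my ( $at2, $gc2 );
count_bases_into( 'AACGT', $at2, $gc2 );
print "AT=$at2 GC=$gc2\n";           # AT=3 GC=2
```

Returning a list is usually easier to read, since the call site shows which variables receive the results.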

