IT & Health 2009 Summary Thomas Nordahl Petersen
Teachers: Thomas Nordahl Petersen, Rasmus Wernersson, Lisbeth Nielsen Fink, Anders Gorm Pedersen, Bent Petersen, Ramneek Gupta, Thomas Blicher
Outline of the course
Topics will cover a general introduction to bioinformatics
–Evolution
–DNA / Protein
–Alignment and scoring matrices: how does it work and what are the numbers
–Visualization of multiple alignments: phylogenetic trees and logo plots
–Commonly used databases: UniProt/GenBank and genome browsers
–Protein 3D-structure
–Artificial neural networks and case stories
–Practical use of bioinformatics tools: preparation for exam
Topics covered - (some of them)
Information flow in biological systems
Amino Acids
Amine and carboxyl groups. The side chain ‘R’ is attached to the C-alpha carbon.
The amino acids found in the proteins of living organisms are L-amino acids.
Amino Acids - peptide bond N-terminal C-terminal
1 and 3-letter codes
1. There are 20 naturally occurring amino acids
2. Normally the one- or three-letter codes are used
Ala - A   Cys - C   Asp - D   Glu - E   Phe - F
Gly - G   His - H   Ile - I   Lys - K   Leu - L
Met - M   Asn - N   Pro - P   Gln - Q   Arg - R
Ser - S   Thr - T   Val - V   Trp - W   Tyr - Y
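The code list above maps directly onto a lookup table. A minimal sketch in Python; the names `THREE_TO_ONE` and `to_one_letter` are made up for this illustration and are not part of the course material:

```python
# Three-letter to one-letter codes, copied from the list above.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter string."""
    # capitalize() lets "ALA", "ala" and "Ala" all match the table keys.
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


print(to_one_letter(["Asn", "Val", "Ile", "Phe", "Glu"]))
```

An unknown code raises a `KeyError`, which is usually what you want when checking input by hand.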
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Theory of evolution Charles Darwin
Phylogenetic tree
Global versus local alignments
Global alignment: align the full length of both sequences (the “Needleman-Wunsch” algorithm).
Local alignment: find the best partial alignment of two sequences (the “Smith-Waterman” algorithm).
[Figure: Seq 1 and Seq 2 shown in a global alignment and in a local alignment]
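To make the local case concrete, here is a small score-only sketch of Smith-Waterman. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example, not values from the course; in practice a substitution matrix would be used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (no traceback)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere; this is what
            # makes the alignment local rather than global.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# The shared core "ACGT" is found even though the flanks differ.
print(smith_waterman_score("XXXACGTXXX", "YYACGTYY"))
```

Because the score is never allowed to drop below zero, unrelated flanking regions do not penalize the matching core.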
Pairwise alignment: the solution is “dynamic programming” (the Needleman-Wunsch algorithm)
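A minimal sketch of the dynamic programming idea for global alignment, with traceback. The scores (match +1, mismatch -1, linear gap -1) are assumptions for illustration only; the function name is likewise made up.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # (mis)match
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

Each cell is filled from its three neighbours, so the whole table costs on the order of len(a) × len(b) steps instead of enumerating every possible alignment.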
Sequence alignment - Blast
Blosum & PAM matrices
Blosum matrices are the most commonly used substitution matrices: Blosum50, Blosum62, Blosum80.
PAM - Point Accepted Mutation (often read as Percent Accepted Mutations)
PAM-0 is the identity matrix.
PAM-1: diagonal entries deviate slightly from 1, off-diagonal entries deviate slightly from 0.
PAM-250 is PAM-1 raised to the 250th power (PAM-1 multiplied by itself 250 times).
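The relation between PAM-0, PAM-1 and PAM-250 can be tried out on a toy example. The 3-letter matrix below is invented purely for illustration (real PAM matrices are 20 × 20 and estimated from data); only the matrix-power relation is taken from the slide.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise m to the n-th power; n = 0 gives the identity (PAM-0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 3-letter "PAM-1": each row sums to 1, the diagonal is
# close to 1 and the off-diagonal entries are close to 0.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(v, 3) for v in row])
```

After 250 steps the diagonal has dropped well below 1 while every row still sums to 1, which is why higher PAM numbers suit more distantly related sequences.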
Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
Log-odds scores
BLOSUM is a log-likelihood matrix.
The likelihood of observing j given that you have i is
–$P(j|i) = P_{ij} / P_i$
The prior likelihood of observing j is
–$Q_j$, which is simply the frequency of j
The log-likelihood score is
–$S_{ij} = 2\log_2\left(P(j|i)/Q_j\right) = 2\log_2\left(P_{ij}/(Q_i Q_j)\right)$, using $P_i = Q_i$
–where $\log_2(x) = \log_n(x)/\log_n(2)$
–S has been normalized to half bits, hence the factor 2
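A small numeric sketch of the log-odds score in half bits. The frequencies used here are invented for the example and are not taken from any real BLOSUM matrix; the function name is also just an illustration.

```python
import math


def log_odds_half_bits(p_ij, q_i, q_j):
    """S_ij = 2 * log2(P_ij / (Q_i * Q_j)), i.e. a score in half bits."""
    return 2 * math.log2(p_ij / (q_i * q_j))


# Made-up numbers: both residues have background frequency 0.05.
# If the pair is seen exactly as often as chance predicts, S is 0.
print(log_odds_half_bits(0.0025, 0.05, 0.05))
# If the pair is seen 4 times more often than chance, S is 2 * log2(4).
print(log_odds_half_bits(0.01, 0.05, 0.05))
```

Positive scores mean the pair is observed more often than expected by chance, negative scores less often.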
BLAST Exercise
Genome browsers - UCSC
Intron-exon structure
Single nucleotide polymorphism - SNP
SNPs
Protein 3D-structure
Protein structure
Primary structure: amino acid sequence
Secondary structure: helix/beta sheet
Tertiary structure: fold, 3D coordinates
Protein structure: α-helix
3₁₀-helix: 3 residues/turn - few, but not uncommon
α-helix: 3.6 residues/turn - by far the most common helix
π-helix: 4.1 residues/turn - very rare
Protein structure strand/sheet
Protein folds
Class: the 4th class is ‘few secondary structures’
Architecture: overall shape of a domain
Topology: shared secondary structure connectivity
Protein 3D-structure
Neural Networks From knowledge to information Protein sequence Biological feature
Use of artificial neural networks
A data-driven method to predict a feature, given a set of training data.
In biology, the input features could be amino acid or nucleotide sequences.
–Secondary structure prediction
–Signal peptide prediction
–Surface accessibility
–Propeptide prediction
[Figure: protein from N-terminus to C-terminus: signal peptide, propeptide, mature/active protein]
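Before a network can use an amino acid sequence as input, the letters must be turned into numbers. One common way is sketched below: a sliding window with one binary input per amino acid. This encoding, the window size and the function names are assumptions for illustration; the slides do not state which encoding the course tools use.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes


def one_hot(residue):
    """20 binary inputs; an unknown or padding letter gives all zeros."""
    return [1 if residue == aa else 0 for aa in ALPHABET]


def encode_windows(sequence, window=7):
    """Return one input vector per residue, built from its neighbours."""
    half = window // 2
    # Pad with "X" so that the first and last residues get a full window.
    padded = "X" * half + sequence + "X" * half
    encoded = []
    for i in range(len(sequence)):
        vec = []
        for residue in padded[i:i + window]:
            vec.extend(one_hot(residue))
        encoded.append(vec)
    return encoded


inputs = encode_windows("NVIFEDEEKSK", window=7)
print(len(inputs), len(inputs[0]))  # residues, inputs per residue
```

Each residue is then predicted from its own window, so the network sees local sequence context rather than a single letter.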
Prediction of biological features
Surface accessibility: predict surface accessibility from the amino acid sequence only.
Logo plots Information content, how is it calculated - what does it mean.
Logo plots - Information Content
A logo plot (sequence logo) is a visualization of a multiple alignment.
Calculate the information content: $I = \sum_a p_a \log_2 p_a + \log_2(4)$. The maximal value is 2 bits.
The total height at a position is the ‘Information Content’ measured in bits.
The height of a letter is proportional to the frequency of that letter.
[Figure: logo positions annotated “~0.5 each” and “Completely conserved”]
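The information content formula can be checked on a few alignment columns. A minimal sketch for DNA (4 letters, hence the log2(4) term); the function names are made up for this example, and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """I = sum_a p_a * log2(p_a) + log2(alphabet_size), in bits."""
    counts = Counter(column)
    total = len(column)
    # Letters that do not occur contribute 0 and are simply skipped.
    entropy_term = sum((c / total) * math.log2(c / total)
                       for c in counts.values())
    return entropy_term + math.log2(alphabet_size)


def letter_heights(column, alphabet_size=4):
    """Height of each letter = its frequency times the column's I."""
    info = information_content(column, alphabet_size)
    total = len(column)
    return {a: (c / total) * info for a, c in Counter(column).items()}


print(information_content("AAAA"))  # completely conserved column
print(information_content("ACGT"))  # all four letters equally frequent
print(letter_heights("AACC"))
```

A completely conserved column reaches the maximum of 2 bits, while a column where all four nucleotides are equally frequent carries no information.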