Bioinformatics The application of computer science to biological data Tony C Smith Department of Computer Science University of Waikato
The essence is prediction … My dog is very littl_ ? We know that letters do not occur in English at random (e.g. ‘t’ is more common than ‘x’). We know that context changes the probability of a letter (e.g. ‘x’ is more common than ‘t’ after the sequence “I eat Weet-Bi_”). Predicting symbols is fundamental to a wide range of important applications (e.g. encryption, compression).
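The idea that context changes the probability of the next letter can be sketched with a small context-counting model. This is only an illustration under stated assumptions: the toy corpus, the function names `train` and `predict`, and the context length are all invented here and are not taken from the talk.

```python
from collections import Counter, defaultdict

def train(text, order=4):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def predict(counts, context, order=4):
    """Return the most frequent follower of the last `order` characters,
    or None if that context never occurred in the training text."""
    followers = counts.get(context[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy training text (an assumption; a real model would use far more data).
corpus = "my dog is very little. the little dog is very little. a little cat."
model = train(corpus, order=4)
print(predict(model, "my dog is very littl", order=4))
```

A longer context gives sharper predictions but needs more training text before each context has been seen often enough to be trusted.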
Prediction in bioinformatics: predicting the location of genes in DNA; predicting gene roles in an organism; predicting errors in a genetic transcription; predicting the function of proteins; predicting diseases from molecular samples. In short, anything that involves “making a judgment”: a yes/no decision about whether some sample datum ‘does’ or ‘does not’ have some property.
Representation: W e e t - B i x … to the computer, everything is binary!
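As a rough illustration of "everything is binary", the sketch below prints the 8-bit ASCII code of each letter, and packs nucleotides into two bits each. The 2-bit assignment in `NUCLEOTIDE_BITS` is a hypothetical choice made for this example; any fixed mapping of the four bases would serve.

```python
def text_to_bits(text):
    """Return the 8-bit ASCII code of each character, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))

# Hypothetical 2-bit code for the four nucleotides (an assumption).
NUCLEOTIDE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

def dna_to_bits(sequence):
    """Encode a DNA string using two bits per nucleotide."""
    return "".join(NUCLEOTIDE_BITS[base] for base in sequence.upper())

print(text_to_bits("Weet-Bix"))
print(dna_to_bits("AACGTCATTCGA"))
```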
A A C G T C A T T C G A T G A T T C G A. Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, such as a genetic sequence.
A genetic prediction problem: ttgcaatcggcgctacgcttcaaaatttattatattcccggc gcggctacgttcatcccagcagcagcgattttaaaattaa cgcatcagactctcgtcgcgttcgtcgcctttattcacgcta atggacgacatcttttactacgacggcgcctacgcatcg cagcatacgacgcccagcatagtattttagaggcgagg acatcatcatatcgcagctacagcgcatcagacgcata cgacgacgactacgacgacactaacgacgatgttgcg cacccacaccagttatatagagacgaactcgcatcagc
A genetic prediction problem: A gene encodes a protein. It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism.
A genetic prediction problem. [Figure labels: encoding region, untranslated region, transcription factor, RNA]
A genetic prediction problem. [Figure label: untranslated region]
A genetic prediction problem. Untranslated region: ttgcaatcggcgctacgcttcaaaatttattatattcccggc
A genetic prediction problem. Sequence: ttgcaatcggcgctacgcttcaaaatttattatattcccggc. What transcription factors bind to this gene? Where is the transcription factor binding site?
A genetic prediction problem. Sequence: ttgcaatcggcgctacgcttcaaaatttattatattcccggc. Clues: A binding site is often a short general pattern, e.g. CCGATNATCGG (where N stands for any nucleotide).
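Searching for a short general pattern such as CCGATNATCGG can be sketched with a regular expression in which N matches any of the four bases. The helper name `find_sites` and the test sequence are assumptions made for illustration; the talk does not say which search method was used.

```python
import re

def find_sites(sequence, pattern):
    """Return (start, matched_text) for every occurrence of `pattern`,
    where the letter N in the pattern matches any nucleotide.
    A lookahead is used so that overlapping occurrences are also found."""
    core = "".join("[ACGT]" if c == "N" else re.escape(c) for c in pattern.upper())
    regex = re.compile(f"(?=({core}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]

# Made-up sequence containing one site (not data from the talk).
example = "ttccgatcatcggaa"
print(find_sites(example, "CCGATNATCGG"))
```

Exact matching like this is the simplest case; real binding sites tolerate mismatches, which is one reason statistical models are often preferred.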
A genetic prediction problem. Sequence: ttgcaatcggcgctacgcttcaaaatttattatattcccggc. Clues: The patterns are often reverse complements, e.g. CCGATNATCGG on one strand pairs with GGCTANTAGCC on the other, and GGCTANTAGCC read backwards is again CCGATNATCGG.
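The reverse-complement property can be checked mechanically. The sketch below uses the standard base pairing (A with T, C with G) and leaves the wildcard N unchanged; the function names are chosen here for illustration.

```python
# Standard base pairing; N (any nucleotide) pairs with N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def complement(sequence):
    """Return the complementary strand, read in the same direction."""
    return sequence.upper().translate(COMPLEMENT)

def reverse_complement(sequence):
    """Return the complementary strand, read in the opposite direction."""
    return complement(sequence)[::-1]

def is_own_reverse_complement(sequence):
    """True if the pattern reads the same on both strands."""
    return reverse_complement(sequence) == sequence.upper()

print(complement("CCGATNATCGG"))
print(reverse_complement("CCGATNATCGG"))
print(is_own_reverse_complement("CCGATNATCGG"))
```

A pattern that equals its own reverse complement only has to be searched for on one strand, since any hit on the other strand appears at the same place.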
A genetic prediction problem. Sequence: ttgcaatcggcgctacgcttcaaaatttattatattcccggc. Clues: Where there is one binding site, there is often another nearby.
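One way to use the "another nearby" clue is to keep only candidate sites that have a neighbour within some distance. This is a minimal sketch: the positions are made up, and the `max_gap` threshold of 50 is an arbitrary assumption, not a figure from the talk.

```python
def nearby_pairs(positions, max_gap=50):
    """Return consecutive pairs of site positions at most `max_gap` apart."""
    ordered = sorted(positions)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a <= max_gap]

# Hypothetical start positions of candidate binding sites.
candidates = [10, 400, 30, 1000]
print(nearby_pairs(candidates))
```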
A genetic prediction problem: All of these properties are the kinds of things that computer science has developed algorithms and data structures to identify quickly and efficiently, so this is exactly the kind of problem computer scientists should be able to solve.
Proteomics: Three consecutive nucleotides in the coding region form a ‘codon’, i.e. encode an amino acid. A string of amino acids makes a protein. 3 nucleotides, 4 possibilities each: $4^3 = 64$ possible codons. But there are only 20 amino acids!
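The codon count can be verified by enumeration, and a coding sequence can be cut into codons by reading it three letters at a time. The names `ALL_CODONS` and `split_codons` are introduced here for illustration only.

```python
from itertools import product

# Every ordered triple of the four nucleotides: 4 * 4 * 4 combinations.
ALL_CODONS = ["".join(triple) for triple in product("ACGT", repeat=3)]

def split_codons(sequence):
    """Cut a sequence into consecutive codons, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

print(len(ALL_CODONS))
print(split_codons("ATGGGATAT"))
```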
Proteomics: Glycine: GGA, GGC, GGG, GGT. Tyrosine: TAT, TAC. Methionine: ATG. There is quite a bit of redundancy in codons.
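The redundancy is easy to see by inverting the codon lists. The table below is deliberately partial: it contains only the three amino acids named on the slide, so any other codon is reported as "?". The function name `translate` is an assumption made for this sketch.

```python
# Partial table: only the three amino acids listed on the slide.
CODONS_BY_AMINO_ACID = {
    "Glycine": ["GGA", "GGC", "GGG", "GGT"],
    "Tyrosine": ["TAT", "TAC"],
    "Methionine": ["ATG"],
}

# Invert the table: several codons map to the same amino acid.
AMINO_ACID_BY_CODON = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons
}

def translate(sequence):
    """Translate consecutive codons; codons not in the partial table give '?'."""
    sequence = sequence.upper()
    return [AMINO_ACID_BY_CODON.get(sequence[i:i + 3], "?")
            for i in range(0, len(sequence) - 2, 3)]

print(translate("ATGGGATAC"))
print(translate("ATGGGTTAT"))
```

Both example sequences differ at the DNA level yet translate to the same three amino acids, which is what codon redundancy means in practice.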
Amino acid. [Figure labels: amino group, carboxyl group, R group]
Amino acid. [Figure labels: glycine, tyrosine]
Signal peptide: A relatively short sequence of amino residues at the N-terminus of the nascent protein (typically residues typically residues), e.g. MAGPRPSPWARLLLAALISVSLSGTLARCKKAPVSKKCETCVGQAALTGL … It is cleaved off as the protein passes through the membrane (operates like a pass key). Knowing the signal peptide helps determine the protein's function in the organism.
Local biases in residues around the cleavage site. Sequence regularities can be exploited by statistical and pattern-based models.
Existing solutions: Partial alignments (Altschul & Gish, 1996); Neural networks (Nielsen et al., 1997); Hidden Markov models (Nielsen et al., 1999); Polypeptide probabilities (Chou, 2001); Maximum entropy (Clote, 2002).
SignalP (Nielsen et al., ): HMMs (or NNs) used to predict the cleavage point.
Existing methods all perform reasonably well and with about the same accuracy (90% eukaryotes, 87% Gram-negative, 85% Gram-positive). They do not offer a transparent explanatory framework for the underlying biology. Many other learning algorithms do! (WEKA data mining tools, Waikato University)
From sequences to text: Primary sequence data has many similarities with text: amino residues (letters); polypeptides (words); secondary structures (phrases/sentences).
From sequences to text: Primary sequence data looks like text: amino residues (letters); polypeptides (words); secondary structures (phrases/sentences); tertiary structure (whole documents). Approach: transform a sequence into a set of pseudo-text documents.
Approach: The problem is stated as two-class: an amino acid is either the first residue of the mature protein or it is not. Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (or desirable).
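The two-class framing can be sketched as turning one protein into one labelled example per residue. Everything concrete here is an assumption for illustration: the toy sequence, the cleavage index, the window radius and the helper names are not from the talk.

```python
def context_window(sequence, index, radius=2, pad="-"):
    """Return the residue at `index` with `radius` neighbours on each side,
    padding with `pad` where the window runs off either end."""
    padded = pad * radius + sequence + pad * radius
    return padded[index:index + 2 * radius + 1]

def label_residues(sequence, first_mature_index, radius=2):
    """Return one (window, label) example per residue. The label is True
    only for the first residue of the mature protein."""
    return [(context_window(sequence, i, radius), i == first_mature_index)
            for i in range(len(sequence))]

# Toy sequence and hypothetical cleavage position (not real annotation).
examples = label_residues("MKALLSAGT", first_mature_index=6)
for window, is_first_mature in examples:
    print(window, is_first_mature)
```

Framed this way, each protein contributes exactly one positive example and many negatives, so the classes are heavily imbalanced.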
Properties of amino acids
Free facts about amino acids
Residue as a document. E.g. Cysteine (Cys, C): aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms , contains sulphur [yes], has a carbon ring [no], ionized [yes], valence , cbeta [no], covalent [yes], h-bond [yes], etc. (whatever else the experimenter wants to include)
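A residue "document" can be sketched as a string of property tokens that a text classifier could consume. The property values below are copied from the cysteine example on the slide; properties whose values are missing in the transcript are left out. The token format (`property=value`) and the function name are assumptions, since the talk does not specify how the documents were written.

```python
# Values as listed on the slide for cysteine; properties whose values are
# missing in the transcript (nitrogen count, valence) are omitted.
CYSTEINE_FACTS = {
    "aliphatic": "yes",
    "aromatic": "no",
    "hydrophobic": "yes",
    "charge": "-",
    "polarized": "yes",
    "small": "no",
    "contains sulphur": "yes",
    "has a carbon ring": "no",
    "ionized": "yes",
    "cbeta": "no",
    "covalent": "yes",
    "h-bond": "yes",
}

def residue_document(name, facts):
    """Join a residue name and its facts into one whitespace-separated
    pseudo-text document (hypothetical token format: property=value)."""
    tokens = [name.lower()]
    for prop, value in facts.items():
        tokens.append(f"{prop.replace(' ', '_')}={value}")
    return " ".join(tokens)

print(residue_document("Cysteine", CYSTEINE_FACTS))
```

Because each fact becomes its own token, adding a new property only means adding a key to the dictionary; the classifier sees it as one more word.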
Concluding remarks: A [pseudo] text classification approach to sequence prediction problems can perform as well as the state-of-the-art stochastic methods. It allows miscellaneous facts (i.e. any textual description of relevant information) to be included. A ranked list of features from the text classifier provides insights into the underlying biology. Features could be used for text generation.
Biotechnology: Biologists know proteins; computer scientists know machine learning. Together, they can find out a lot of hidden information about genes and proteins. Biotechnology is a multi-billion dollar industry. Biotechnology is one of the best funded areas of scientific research.
The University of Waikato: Waikato University is the centre of the universe for machine learning. The Machine Learning Group is a large, globally active, well-funded research group. The WEKA workbench of ML tools is used around the world. Professors at Waikato University literally wrote the book on sequence modeling.
The University of Waikato: If you’re seriously interested in machine learning, in getting involved in bioinformatics research, or indeed any other area along the leading edge of computer science, then university is the only place to be, and Waikato wants you!