CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar
Overview
– “Languages” of Computers and Biology
– Probability Models for NL and Biology
– Maximum Entropy
– Basic ME amino acid model
– The “Whole Protein Model”
– Results in a gene prediction model
Bits and Bytes: The Alphabet of Computers
– Computer electronics are complicated: RAM, processor, etc.
– It all comes down to bits (1s and 0s).
– Bits can be organized into bytes (8 bits each).
– Bytes can represent, among other things, letters (ASCII), which can form sentences.
DNA: Biology’s Alphabet
– Biology is complicated.
– It all comes down to nucleotides (A, C, G, T).
– Nucleotides are grouped into codons (three nucleotides each).
– Codons represent amino acids; amino acids make up the proteins that genes encode.
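The codon-to-amino-acid step can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a reading frame that starts at the first character; the function name `translate` is hypothetical, and the example string is a short stretch of the DNA shown elsewhere in this deck.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate DNA codon by codon, stopping at the first stop codon (*)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGAAACGCATTAGCACCACC"))  # prints MKRISTT
```

Real gene finders must also choose among six reading frames and decide where a gene starts, which is where the probability models in the rest of the talk come in.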
Find the words!
Find the genes!
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC
TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT
TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG
CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC
TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT
CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG
CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC
ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG
CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA
ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC
NL and Biological Modeling
– “Mary went to the ____.”
– MSGTIPSCPTAL ___
Markov Models
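A first-order Markov model predicts the next symbol from the one before it. The sketch below is only an illustration of that idea on amino acid strings, not the model used in this work: the function name `train_bigram`, the add-one smoothing, and the second toy sequence are all assumptions.

```python
from collections import Counter, defaultdict

def train_bigram(sequences, alphabet):
    """Estimate P(next | previous) from symbol pairs, with add-one smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        model[prev] = {s: (counts[prev][s] + 1) / total for s in alphabet}
    return model

# Toy data: the first string appears in this deck, the second is made up.
toy_proteins = ["MSGTIPSCPTAL", "MSGTLPSAPTAL"]
alphabet = sorted(set("".join(toy_proteins)))
model = train_bigram(toy_proteins, alphabet)

# Most likely amino acid to follow "S" under this toy model.
print(max(model["S"], key=model["S"].get))
```

The same construction works for words in a sentence; only the alphabet changes.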
ME, In a Nutshell
– Constrain the model.
– Maximize entropy.
Constraining features
– “is the” occurs with frequency 1/
– Define a feature:
– Require that:
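The formulas from this slide did not survive in the text. In the standard maximum entropy formulation, the feature and its constraint would look roughly like the following, where the notation is an assumption: h is the history, w is the next word, and a tilde marks quantities measured on the training data.

```latex
f_{\text{is the}}(h, w) =
\begin{cases}
1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\
0 & \text{otherwise}
\end{cases}
\qquad
\sum_{h, w} \tilde{p}(h)\, p(w \mid h)\, f_{\text{is the}}(h, w)
  = \tilde{E}\left[ f_{\text{is the}} \right]
```

The constraint says that the model's expected count of the feature must match the count observed in the training data.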
Exponential Solution
– A unique solution exists with maximum entropy:
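The slide's formula is missing from the text. The usual exponential form of a conditional maximum entropy model is given here as a reconstruction under assumed notation: f_i are the features, λ_i their weights, and Z(h) the normalizer.

```latex
p(w \mid h) = \frac{1}{Z(h)} \exp\!\left( \sum_i \lambda_i f_i(h, w) \right),
\qquad
Z(h) = \sum_{w'} \exp\!\left( \sum_i \lambda_i f_i(h, w') \right)
```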
Triggers
– Triggers: words that increase the likelihood of other words.
– Crop → Harvest
– Cuban → Havana
– Iran → Hashemi
– Hate → Hate
Unigram and Bigram Caches
– Caches: frequency tables built from the history.
– Is “supercalifragilisticexpialidocious” a common word?
– Allow for model adaptation.
Applying ME Models in Computational Biology
– Significant improvement for NLP.
– Same for biological models?
– Amino acid (AA) sequences: a simple test case.
Feature Sets
– Unigrams and bigrams
– Self-triggers: frequency of a specific amino acid.
– Class-based self-triggers: frequency of a specific amino acid class.
– Unigram cache: amino acid frequency for this protein.
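The history-based features listed above could be computed along these lines. This is a rough sketch only: the function name `history_features`, the returned keys, and especially the three amino acid classes are assumptions, since the slides do not say which grouping was used.

```python
from collections import Counter

# Hypothetical amino acid classes; the actual grouping is not given in the slides.
AA_CLASSES = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "charged": set("DEKRH"),
}

def history_features(history: str, candidate: str) -> dict:
    """Features of the protein seen so far, used to predict `candidate` next."""
    counts = Counter(history)
    n = max(len(history), 1)  # avoid division by zero on an empty history
    cls = next(name for name, members in AA_CLASSES.items() if candidate in members)
    return {
        # Unigram cache: relative frequency of the candidate in this protein so far.
        "unigram_cache": counts[candidate] / n,
        # Self-trigger: how often this exact amino acid has already occurred.
        "self_trigger": counts[candidate],
        # Class-based self-trigger: how often any member of its class has occurred.
        "class_self_trigger": sum(counts[a] for a in AA_CLASSES[cls]),
    }

print(history_features("MSGTIPSCPTAL", "S"))
```

In an ME model these raw values would typically be turned into binary or binned features, each with its own weight.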
Training and testing data
– Burset et al. set of 571 proteins.
– Homologous proteins eliminated.
– Resulting set of 204 proteins split into 2 groups of 102 each.
Results
– “Long distance” features help.
– Best model gives a 30% reduction in perplexity over the unigram model.
– Our model may improve predictions made by Genscan, a eukaryotic gene-finding algorithm.
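Perplexity, the measure behind the reduction quoted above, can be computed as in the sketch below. The helper names, the add-one smoothing, and the toy sequence are assumptions made for illustration; the numbers it prints say nothing about the reported results.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def perplexity(prob_fn, sequence):
    """Perplexity = 2 ** (mean negative log2 probability per symbol)."""
    log_sum = sum(math.log2(prob_fn(sequence[:i], s)) for i, s in enumerate(sequence))
    return 2 ** (-log_sum / len(sequence))

def uniform_model(history, symbol):
    """Baseline: every amino acid is equally likely."""
    return 1 / len(AMINO_ACIDS)

def make_unigram_model(training_sequences):
    """Unigram model with add-one smoothing, estimated from training data."""
    counts = Counter("".join(training_sequences))
    total = sum(counts.values()) + len(AMINO_ACIDS)
    return lambda history, symbol: (counts[symbol] + 1) / total

def relative_reduction(baseline_ppl, model_ppl):
    """Fractional drop in perplexity relative to a baseline model."""
    return 1 - model_ppl / baseline_ppl

unigram = make_unigram_model(["MSGTIPSCPTAL"])  # toy training data
base = perplexity(uniform_model, "MSGTIPSCPTAL")
ppl = perplexity(unigram, "MSGTIPSCPTAL")
print(base, ppl, relative_reduction(base, ppl))
```

A model that passes the history to `prob_fn`, such as one with triggers or caches, plugs into the same `perplexity` function unchanged.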
Limitations of this model
– Artificial model.
– Cannot represent all global features.
The “Whole Sentence” Model
Secondary Structure
MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITS
VWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISG
YFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSW
VWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILL
CYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGY
AFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSS
VSNSSVSPA
-----HHHHHHHHHHH EEE EEEE--- EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE--- EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE HHHHHEHEEEEEE EHH H E EEEEEEEEE------EHHHH HHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE EEEHHH HHHHHHHHHH EEEEEH-----HHHHHHH EEEEE
“Whole Protein” Results
– 19 features evaluated.
– Two were selected:
  – Mean length of alpha helix region
  – Maximum length of any structural region
– 59% increase in protein likelihood
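The two selected features are properties of a whole predicted secondary-structure string. One way to compute them is sketched below, assuming H marks helix and E marks strand, and assuming that only H and E runs count as "structural regions"; the function name and that interpretation are not taken from the slides.

```python
from itertools import groupby

def structure_features(ss: str) -> dict:
    """Summarize a secondary-structure string (H = helix, E = strand)."""
    # Collapse the string into (state, run length) pairs, e.g. "HHH-" -> [("H", 3), ("-", 1)].
    runs = [(state, sum(1 for _ in group)) for state, group in groupby(ss)]
    helix_lengths = [n for state, n in runs if state == "H"]
    # Assumption: only helix and strand runs count as structural regions.
    structured = [n for state, n in runs if state in "HE"]
    return {
        "mean_helix_length": sum(helix_lengths) / len(helix_lengths) if helix_lengths else 0.0,
        "max_region_length": max(structured, default=0),
    }

print(structure_features("--HHHH--EEEEEE-HH"))
```

Because these features depend on the entire sequence, they cannot be expressed in a model that predicts one amino acid at a time, which is the motivation for a whole-protein model.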
Improved Glimmer Models
– Glimmer uses IMMs (interpolated Markov models) to predict genes in bacteria.
– Will adding amino acid triggers improve these models?
– How much?
H. pylori Genome
– 1562 coding sequences
– Split into:
  – Training (>500 bp): 1154 genes, 1,354,167 bp
  – Testing (<500 bp): 408 genes, 129,045 bp
Glimmer Depth
Lateral Gene Transfer
– Many genes in bacteria come not from their ancestors but from other bacterial species.
– Different bacteria “prefer” to use different codons.
– Analogous to plagiarism detection?
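Codon preference can be made concrete by comparing codon frequency tables, much as one might compare word frequencies between authors. The sketch below is an illustration only, not a method proposed in the talk; the function names and the choice of distance (half the L1 distance) are assumptions.

```python
from collections import Counter

def codon_usage(gene: str) -> dict:
    """Relative frequency of each codon in a gene, read in frame from position 0."""
    codons = [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def usage_distance(u1: dict, u2: dict) -> float:
    """Half the L1 distance between two codon frequency tables (0 = same, 1 = disjoint)."""
    codons = set(u1) | set(u2)
    return 0.5 * sum(abs(u1.get(c, 0.0) - u2.get(c, 0.0)) for c in codons)

# Toy strings: a gene whose usage is far from the genome average would stand out.
print(usage_distance(codon_usage("ATGAAAAAAGGC"), codon_usage("ATGAAGAAGGGT")))
```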
Model Adaptation
– Gene models are trained for every organism.
– Lots of unused information.
– Analogous to cross-domain application of NLP models.
Thanks
– Lyle Ungar
– Roni Rosenfeld
– NIH Grant
N-Gram Features
– Unigram (frequency of individual words)
– Bigram (frequency of pairs of words)
Trigger feature function
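The formula for this slide is missing from the text. A trigger feature for a pair A → B is commonly written as below; this is offered as a likely reconstruction under assumed notation, with h the history and w the word being predicted.

```latex
f_{A \to B}(h, w) =
\begin{cases}
1 & \text{if } A \in h \text{ and } w = B \\
0 & \text{otherwise}
\end{cases}
```

For a self-trigger, A and B are the same symbol.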