CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar
CLUNCH Overview “Languages” of Computers and Biology Probability Models for NL and Biology Maximum Entropy Basic ME amino acid model The “Whole Protein Model” Results in a gene prediction model
CLUNCH Bits and Bytes: The Alphabet of Computers Computer electronics are complicated: RAM, processor, etc. It all comes down to bits (1s and 0s). Bits can be organized into bytes (8 bits each). Bytes can represent, among other things, letters (ASCII), which can form sentences.
DNA: Biology’s Alphabet Biology is complicated. It comes down to nucleotides (A,C,G,T). Nucleotides can be grouped into codons. Codons represent amino acids, amino acids make proteins/genes.
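The nucleotide-to-codon-to-amino-acid mapping described on this slide can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code, not code from the talk; the function name `translate` and the example DNA string are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence into one-letter amino acids.

    Reads the sequence three nucleotides at a time, ignores a trailing
    partial codon, and stops at the first stop codon.
    """
    protein = []
    usable = len(dna) - len(dna) % 3
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Illustrative input only: four codons followed by a stop codon.
print(translate("ATGTCTGGTACTTAA"))
```

The example input encodes the first four residues of the amino acid string shown on the next slide.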
CLUNCH NL and Biological Modeling “Mary went to the ____.” MSGTIPSCPTAL ___
CLUNCH Markov Models
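As a rough sketch of the idea, a first-order Markov (bigram) model predicts the next symbol from the previous one only. The code below is an assumption-laden illustration, not the model from the talk: it uses add-one smoothing, and the two training strings are toy data.

```python
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_bigram(sequences):
    """Count how often each amino acid follows each other amino acid."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts, prev, cur):
    """P(cur | prev) with add-one smoothing (an assumption of this sketch)."""
    total = sum(counts[prev].values())
    return (counts[prev][cur] + 1) / (total + len(AMINO_ACIDS))


def predict_next(counts, prev):
    """Return the most probable next amino acid given the previous one."""
    return max(AMINO_ACIDS, key=lambda aa: bigram_prob(counts, prev, aa))


# Toy training data, for illustration only.
counts = train_bigram(["MSGTIPSCPTAL", "MSGA"])
print(predict_next(counts, "M"))
```

The same construction applies to words in a sentence: replace the amino acid alphabet with a vocabulary and the sequences with tokenized text.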
CLUNCH ME, In a Nutshell Constrain the model. Maximize entropy.
CLUNCH Constraining features “is the” occurs with frequency 1/ Define a feature: Require that:
CLUNCH Exponential Solution A unique solution exists with maximum entropy:
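The equations on the two slides above did not survive in the transcript. The block below gives the standard conditional maximum entropy formulation from the language modeling literature as a stand-in; the notation (h for the history, w for the next word, f_i for features, lambda_i for weights, K_i for the target frequency) is an assumption and may differ from the original slides.

```latex
% Indicator feature for the bigram "is the"
f_{\text{is the}}(h, w) =
\begin{cases}
1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\
0 & \text{otherwise}
\end{cases}

% Constraint: the model's expected value of each feature equals
% its target frequency K_i observed in the training data
\sum_{h, w} \tilde{p}(h)\, p(w \mid h)\, f_i(h, w) = K_i

% Maximum entropy solution: an exponential (log-linear) model
p(w \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_i \lambda_i f_i(h, w) \Big),
\qquad
Z(h) = \sum_{w'} \exp\Big( \sum_i \lambda_i f_i(h, w') \Big)
```

Here \tilde{p}(h) is the empirical distribution over histories, and Z(h) normalizes the distribution over next words for each history.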
CLUNCH Triggers Triggers – Words that increase the likelihood of other words. Crop → Harvest Cuban → Havana Iran → Hashemi Hate → Hate
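A trigger pair can be written as a binary feature over the history and the candidate next word. The sketch below is one plausible encoding, with the hypothetical helper name `trigger_feature`; it is not taken from the talk.

```python
def trigger_feature(trigger, target):
    """Build a binary feature for the trigger pair trigger -> target.

    The feature fires (returns 1) when the candidate word is the target
    and the trigger word has already appeared somewhere in the history.
    """
    def feature(history, word):
        return 1 if word == target and trigger in history else 0
    return feature


crop_harvest = trigger_feature("crop", "harvest")
hate_hate = trigger_feature("hate", "hate")  # a self-trigger

print(crop_harvest(["the", "crop", "was", "ready", "for"], "harvest"))
print(hate_hate(["they", "hate", "to"], "hate"))
```

In a maximum entropy model, each such feature would receive its own weight during training.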
CLUNCH Unigram and Bigram Caches Caches – frequency tables built from the history. Is “supercalifragilisticexpialidocious” a common word? Allow for model adaptation.
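To make the cache idea concrete, the sketch below builds a unigram frequency table from the history and mixes it with a static model. The linear interpolation and the `weight` parameter are assumptions made for this illustration; in a maximum entropy model the cache would typically enter as a feature rather than as a mixture.

```python
from collections import Counter


def cache_probability(word, history, static_prob, weight=0.1):
    """Blend a static unigram probability with a cache built from the history.

    `static_prob` is the word's probability under a fixed, pre-trained model.
    `weight` is the share given to the cache (hypothetical default).
    """
    cache = Counter(history)
    cache_prob = cache[word] / len(history) if history else 0.0
    return (1 - weight) * static_prob + weight * cache_prob


# A word that is rare in general but frequent in this document
# gets a higher probability once it has been seen.
history = ["supercalifragilisticexpialidocious", "is", "a", "word"]
print(cache_probability("supercalifragilisticexpialidocious", history, 1e-9))
```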
CLUNCH Applying ME Models in Computational Biology Significant improvement for NLP. Same for biological models? AA sequences: a simple test case.
CLUNCH Feature Sets Unigrams and Bigrams Self-triggers - frequency of a specific amino acid. Class based self-triggers - frequency of a specific amino acid class. Unigram Cache - Amino acid frequency for this protein.
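The self-trigger and class-based self-trigger features listed above could be computed along the following lines. This is a sketch under stated assumptions: the four-way amino acid grouping in `AA_CLASSES` is a hypothetical illustration, since the transcript does not say which classes were used, and the function names are invented here.

```python
# Hypothetical grouping of the 20 amino acids; the classes used in the
# talk are not given in the transcript.
AA_CLASSES = {
    "hydrophobic": "AVLIMFW",
    "polar": "STNQYCGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_TO_CLASS = {aa: name for name, members in AA_CLASSES.items() for aa in members}


def self_trigger_frequency(history, amino_acid):
    """Fraction of the history made up of this specific amino acid."""
    if not history:
        return 0.0
    return history.count(amino_acid) / len(history)


def class_trigger_frequency(history, amino_acid):
    """Fraction of the history belonging to the same class as this amino acid."""
    if not history:
        return 0.0
    target_class = AA_TO_CLASS[amino_acid]
    same_class = sum(1 for aa in history if AA_TO_CLASS[aa] == target_class)
    return same_class / len(history)


history = "MSGTIPSCPTAL"
print(self_trigger_frequency(history, "S"))
print(class_trigger_frequency(history, "A"))
```

The unigram cache feature in the list is the same per-protein frequency table, computed over the part of the protein seen so far.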
CLUNCH Training and testing data Burset et al. set of 571 proteins. Homologous proteins eliminated. Resulting set of 204 proteins split into 2 groups of 102 each.
Results “Long distance” features help. Best model gives a 30% reduction in perplexity over the unigram model. Our model may improve predictions made by Genscan, a eukaryotic gene-finding algorithm.
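For readers unfamiliar with the metric, perplexity is two raised to the average negative log2 probability that a model assigns to each symbol of the test data. The sketch below shows the calculation and how a relative reduction is computed; the numbers are made up for illustration and are not the talk's results.

```python
import math


def perplexity(probabilities):
    """Perplexity from the per-symbol probabilities a model assigned to test data."""
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))


def relative_reduction(baseline, improved):
    """Fractional reduction in perplexity relative to a baseline model."""
    return (baseline - improved) / baseline


# A model that assigns probability 1/4 to every symbol has perplexity 4.
print(perplexity([0.25] * 8))

# Made-up perplexities: going from 20.0 to 14.0 is a 30% reduction.
print(relative_reduction(20.0, 14.0))
```

Lower perplexity means the model is, on average, less surprised by the test sequences.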
CLUNCH Limitations of this model Artificial model. Cannot represent all global features.
CLUNCH “Whole Protein” Results 19 features evaluated. Two were selected: –Mean length of alpha helix region –Maximum length of any structural region 59% increase in protein likelihood
CLUNCH Improved Glimmer Models Glimmer uses interpolated Markov models (IMMs) to predict genes in bacteria. Will adding amino acid triggers improve these models? How much?
CLUNCH H. Pylori Genome 1562 Coding Sequences Split into: –Training (>500bp) – 1154 genes, 1,354,167 bp –Testing (<500bp) – 408 genes, 129,045 bp
CLUNCH Glimmer Depth
CLUNCH Lateral Gene Transfer Many genes in bacteria come not from their ancestors but from other bacterial species. Different bacteria “prefer” to use different codons. Analogous to plagiarism detection?
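One simple way to make the codon-preference idea concrete is to compare a gene's codon usage with a background profile. The sketch below uses total variation distance between codon frequency tables; this measure is an assumption chosen for illustration, not the method proposed in the talk, and the sequences are toy data.

```python
from collections import Counter


def codon_usage(dna):
    """Relative frequency of each codon in a coding sequence."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(usage_a, usage_b):
    """Total variation distance between two codon usage tables (0 to 1)."""
    codons = set(usage_a) | set(usage_b)
    return 0.5 * sum(
        abs(usage_a.get(c, 0.0) - usage_b.get(c, 0.0)) for c in codons
    )


# Toy example: both genes encode alanine, but with different codons.
host_like = codon_usage("GCTGCTGCTGCC")
foreign_like = codon_usage("GCCGCCGCCGCG")
print(usage_distance(host_like, foreign_like))
```

A gene whose codon usage sits far from the rest of the genome would be a candidate for having been acquired from another species, much as an unusual writing style flags a borrowed passage.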
CLUNCH Model Adaptation Gene models are trained for every organism. Lots of unused information. Analogous to cross-domain application of NLP models.
CLUNCH Thanks Lyle Ungar Roni Rosenfeld NIH Grant
N-Gram Features Unigram (frequency of individual words) Bigram (frequency of pairs of words)