CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar
CLUNCH Overview “Languages” of Computers and Biology Probability Models for NL and Biology Maximum Entropy Basic ME amino acid model The “Whole Protein Model” Results in a gene prediction model
CLUNCH Bits and Bytes: The Alphabet of Computers Computer electronics are complicated: RAM, processor, etc. It all comes down to bits (1s and 0s). Bits can be organized into bytes (8). Bytes can represent, among other things, letters (ASCII), which can form sentences.
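The chain from bits to bytes to letters to sentences can be shown with a minimal sketch. The function names below are illustrative and are not taken from the presentation.

```python
def text_to_bits(text: str) -> list[str]:
    """Encode ASCII text as one 8-bit string per character."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


def bits_to_text(bits: list[str]) -> str:
    """Decode a list of 8-bit strings back into ASCII text."""
    return bytes(int(b, 2) for b in bits).decode("ascii")


encoded = text_to_bits("Hi")
print(encoded)                 # ['01001000', '01101001']
print(bits_to_text(encoded))   # Hi
```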
CLUNCH ME, In a Nutshell Constrain the model. Maximize entropy.
CLUNCH Constraining features “is the” occurs with frequency 1/10000. Define a feature: Require that:
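The equations on this slide did not survive in the transcript. In the usual maximum entropy formulation for language modeling they would look roughly like the following, where h is the word history, w is the next word, and P is the model distribution; this notation is an assumption and may differ from the original slide.

```latex
f_{\text{is the}}(h, w) =
\begin{cases}
1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\
0 & \text{otherwise}
\end{cases}
\qquad
\sum_{h, w} P(h, w)\, f_{\text{is the}}(h, w) = \frac{1}{10000}
```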
CLUNCH Exponential Solution A unique solution exists with maximum entropy:
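The formula on this slide is missing from the transcript. The standard form of the maximum entropy solution is sketched below, assuming feature functions f_i over a history h and next word w, one weight λ_i per feature, and a normalizing constant Z; the symbols are assumptions and may not match the original slide.

```latex
P(h, w) = \frac{1}{Z} \exp\!\Big( \sum_i \lambda_i f_i(h, w) \Big),
\qquad
Z = \sum_{h', w'} \exp\!\Big( \sum_i \lambda_i f_i(h', w') \Big)
```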
CLUNCH Triggers Triggers – Words that increase the likelihood of other words. Crop → Harvest Cuban → Havana Iran → Hashemi Hate → Hate
CLUNCH Unigram and Bigram Caches Caches – frequency tables built from the history. Is “supercalifragilisticexpialidocious” a common word? Allow for model adaptation.
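A unigram cache can be sketched as a frequency table that grows as the history is read. The class and function names are hypothetical, and the linear interpolation shown here is only a simple stand-in: in a maximum entropy model the cache would enter as a feature rather than as a mixture weight.

```python
from collections import Counter


class UnigramCache:
    """Frequency table built from the words seen so far in a document."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.total = 0

    def observe(self, word: str) -> None:
        self.counts[word] += 1
        self.total += 1

    def prob(self, word: str) -> float:
        # Relative frequency in the history; 0.0 before anything is seen.
        return self.counts[word] / self.total if self.total else 0.0


def adapted_prob(word: str, cache: UnigramCache,
                 static_prob: float, weight: float = 0.1) -> float:
    """Mix a fixed corpus-wide probability with the cache estimate."""
    return (1.0 - weight) * static_prob + weight * cache.prob(word)
```

A word that is rare in general but has already appeared several times in the current document gets a higher adapted probability than its static estimate alone would give.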
CLUNCH Applying ME Models in Computational Biology Significant improvement for NLP. Same for biological models? AA sequences: a simple test case.
CLUNCH Feature Sets Unigrams and Bigrams Self-triggers - frequency of a specific amino acid. Class based self-triggers - frequency of a specific amino acid class. Unigram Cache - Amino acid frequency for this protein.
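The history-based features listed above could be computed along the following lines. This is a sketch under stated assumptions: the slides do not say which amino acid classes were used, so the four-way grouping below is hypothetical, as are the function and key names.

```python
from collections import Counter

# Hypothetical coarse classes; the presentation does not specify its grouping.
AA_CLASS = {
    **dict.fromkeys("AVLIMFWC", "hydrophobic"),
    **dict.fromkeys("STNQYGP", "polar"),
    **dict.fromkeys("KRH", "positive"),
    **dict.fromkeys("DE", "negative"),
}


def history_features(protein: str, i: int) -> dict:
    """Summarize residues before position i for predicting residue i."""
    history = protein[:i]
    n = len(history)
    aa_counts = Counter(history)
    class_counts = Counter(AA_CLASS[a] for a in history)
    return {
        # Previous residue, the context a bigram feature would use.
        "prev": history[-1] if history else None,
        # Per-residue frequency so far (self-trigger / unigram cache input).
        "aa_freq": {a: c / n for a, c in aa_counts.items()},
        # Per-class frequency so far (class-based self-trigger input).
        "class_freq": {k: c / n for k, c in class_counts.items()},
    }
```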
CLUNCH Training and testing data Burset et al. set of 571 proteins. Homologous proteins eliminated. Resulting set of 204 proteins split into 2 groups of 102 each.
CLUNCH Lateral Gene Transfer Many genes in bacteria come not from their ancestors but from other bacterial species. Different bacteria “prefer” to use different codons. Analogous to plagiarism detection?
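One simple way to compare a gene's codon preferences against a genome-wide profile is sketched below. This is not the method from the presentation: the KL divergence score, the probability floor, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def codon_freqs(dna: str) -> dict[str, float]:
    """Relative frequency of each codon, reading in frame from position 0."""
    usable = len(dna) - len(dna) % 3          # drop any trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


def kl_divergence(gene: dict[str, float],
                  background: dict[str, float],
                  floor: float = 1e-6) -> float:
    """KL(gene || background); unseen background codons get a small floor."""
    return sum(
        p * math.log(p / max(background.get(codon, 0.0), floor))
        for codon, p in gene.items()
    )
```

A gene whose codon usage diverges strongly from the rest of its genome would score high, which is the kind of signal that might flag a laterally transferred gene.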
CLUNCH Model Adaptation Gene models are trained for every organism. Lots of unused information. Analogous to cross-domain application of NLP models.
CLUNCH Thanks Lyle Ungar Roni Rosenfeld NIH Grant