CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar.


1 CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar

2 CLUNCH Overview “Languages” of Computers and Biology Probability Models for NL and Biology Maximum Entropy Basic ME amino acid model The “Whole Protein Model” Results in a gene prediction model

3 CLUNCH Bits and Bytes: The Alphabet of Computers Computer electronics are complicated: RAM, processor, etc. It all comes down to bits (1s and 0s). Bits can be organized into bytes (8 bits each). Bytes can represent, among other things, letters (ASCII), which can form sentences.

4 CLUNCH

5 DNA: Biology’s Alphabet Biology is complicated. It comes down to nucleotides (A,C,G,T). Nucleotides can be grouped into codons (three nucleotides each). Codons represent amino acids, and amino acids make up the proteins that genes encode.
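The nucleotide-to-codon-to-amino-acid chain can be sketched in a few lines of Python. This is a minimal illustration assuming the standard genetic code; the function name `translate` is illustrative and not taken from the talk.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGAAACGC")` reads the codons ATG, AAA and CGC and returns the three-letter protein string `"MKR"`.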

6 CLUNCH Find the words!

7 CLUNCH Find the genes! AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC
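To give a feel for why finding genes in raw sequence is hard, here is a naive open-reading-frame scan. It is only a baseline sketch, not the gene finder discussed in this talk: it assumes genes start at ATG, end at the first in-frame stop codon, and lie on the forward strand. The name `find_orfs` and the `min_codons` threshold are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ATG...stop stretches."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index is exclusive
                start = None                   # close it and keep scanning
    return orfs
```

A scan like this reports many spurious candidates in real genomes, which is one reason statistical models of coding sequence are used instead.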

8 CLUNCH NL and Biological Modeling “Mary went to the ____.” MSGTIPSCPTAL ___

9 CLUNCH Markov Models
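The slide's figure is not in the transcript. As a rough sketch of the idea, a first-order Markov (bigram) model predicts the next symbol, whether a word or an amino acid, from the one before it. The add-one smoothing and the name `train_bigram` are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sequences, alphabet):
    """Count symbol pairs and return a function giving P(next | prev)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1

    def prob(nxt, prev):
        # Add-one smoothing so unseen pairs keep a nonzero probability.
        total = sum(counts[prev].values())
        return (counts[prev][nxt] + 1) / (total + len(alphabet))

    return prob
```

The same code works for "Mary went to the ____" (sequences of words) and for "MSGTIPSCPTAL ___" (strings of amino acid letters), since it only needs an iterable of symbols.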

10 CLUNCH ME, In a Nutshell Constrain the model. Maximize entropy.

11 CLUNCH Constraining features “is the” occurs with frequency 1/ Define a feature: Require that:
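The equations on this slide did not survive extraction. In the standard conditional maximum entropy formulation, which is assumed here rather than recovered from the slide, the bigram feature and its constraint would read as follows. The symbol K stands for the target frequency (truncated in the transcript) and \tilde{p}(h) for the empirical distribution of histories; both symbols are notation chosen for this sketch.

```latex
f_{\text{is the}}(h, w) =
\begin{cases}
1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\
0 & \text{otherwise}
\end{cases}
\qquad
\sum_{h} \tilde{p}(h) \sum_{w} p(w \mid h)\, f_{\text{is the}}(h, w) = K
```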

12 CLUNCH Exponential Solution A unique solution exists with maximum entropy:
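The formula itself is missing from the transcript. The usual textbook form of the conditional maximum entropy solution, given here as an assumption about what the slide showed, has one weight \lambda_i per feature f_i and a normalizer Z(h) for each history h:

```latex
p(w \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_i \lambda_i f_i(h, w) \Big),
\qquad
Z(h) = \sum_{w'} \exp\Big( \sum_i \lambda_i f_i(h, w') \Big)
```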

13 CLUNCH Triggers Triggers – Words that increase the likelihood of other words. Crop → Harvest Cuban → Havana Iran → Hashemi Hate → Hate
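A trigger pair can be expressed as a binary feature over the history and the candidate word. The sketch below is one plausible encoding; the name `trigger_feature` and the choice to search the whole history are assumptions, not details from the talk.

```python
def trigger_feature(trigger, target):
    """Build a feature that fires when `target` follows a history containing `trigger`."""
    def feature(history, word):
        return 1 if word == target and trigger in history else 0
    return feature

# "Crop -> Harvest" from the slide; a self-trigger uses the same word twice.
crop_harvest = trigger_feature("crop", "harvest")
hate_hate = trigger_feature("hate", "hate")
```

Here `crop_harvest(["the", "crop", "was"], "harvest")` evaluates to 1, while the same call with a history that lacks "crop" evaluates to 0.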

14 CLUNCH Unigram and Bigram Caches Caches – frequency tables built from the history. Is “supercalifragilisticexpialidocious” a common word? Allow for model adaptation.
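One simple way to use a unigram cache is to mix the word's frequency in the history with a static probability. This is a minimal sketch; the linear interpolation and the `weight` parameter are assumptions for illustration and are not necessarily how the cache enters the ME model in the talk.

```python
from collections import Counter

def cache_prob(word, history, static_prob, weight=0.2):
    """Interpolate a static unigram probability with the word's frequency in the history."""
    if not history:
        return static_prob  # nothing seen yet, so fall back to the static model
    cache_freq = Counter(history)[word] / len(history)
    return (1 - weight) * static_prob + weight * cache_freq
```

A word that is rare in general but has already appeared several times in the current document gets a higher probability, which is the adaptation effect the slide describes.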

15 CLUNCH Applying ME Models in Computational Biology Significant improvement for NLP. Same for biological models? AA sequences: a simple test case.

16 CLUNCH Feature Sets Unigrams and Bigrams Self-triggers - frequency of a specific amino acid. Class based self-triggers - frequency of a specific amino acid class. Unigram Cache - Amino acid frequency for this protein.
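The self-trigger and class-based self-trigger features can be illustrated as counts over the amino acids seen so far in a protein. The grouping into three classes below is a hypothetical choice made only for this sketch; the talk does not list the classes it used.

```python
# Hypothetical amino acid classes, for illustration only.
_CLASSES = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "charged": "DEKRH",
}
AA_CLASS = {aa: name for name, members in _CLASSES.items() for aa in members}

def self_trigger_counts(history, candidate):
    """Count earlier occurrences of the candidate and of its class in the history."""
    same_aa = history.count(candidate)
    same_class = sum(
        1 for aa in history if AA_CLASS.get(aa) == AA_CLASS.get(candidate)
    )
    return same_aa, same_class
```

For the fragment `"MSGTIPSCPTAL"` from the earlier slide and the candidate `"S"`, the function returns `(2, 7)`: two earlier serines and seven earlier residues in the same (assumed) polar class.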

17 CLUNCH Training and testing data Burset et al. set of 571 proteins. Homologous proteins eliminated. Resulting set of 204 proteins split into 2 groups of 102 each.

18 CLUNCH

19 Results “Long distance” features help. Best model gives a 30% reduction in perplexity over the unigram model. Our model may improve predictions made by Genscan, a eukaryotic gene finding algorithm.
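For readers unfamiliar with the metric, perplexity is 2 raised to the average negative log2 probability the model assigns to each symbol. The sketch below shows the computation and how a relative reduction is measured; the numbers in the usage note are made up for illustration and are not the talk's data.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence given the model probability of each symbol."""
    log_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_sum / len(probs))

def relative_reduction(baseline, improved):
    """Fractional drop in perplexity from the baseline to the improved model."""
    return (baseline - improved) / baseline
```

A model that assigns probability 0.05 to every amino acid (uniform over 20) has a perplexity of about 20; going from a perplexity of 20 to 14 would be a 30% reduction.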

20 CLUNCH Limitations of this model Artificial model. Cannot represent all global features.

21 CLUNCH The “Whole Sentence” Model
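The slide body is not in the transcript. In the whole-sentence exponential model as it is usually written in the language modeling literature, which is assumed here to be what the slide showed, features f_i are evaluated on the entire sentence s (or, in this setting, the entire protein), p_0 is a baseline model, and Z is a single global normalizer:

```latex
p(s) = \frac{1}{Z}\, p_0(s)\, \exp\Big( \sum_i \lambda_i f_i(s) \Big)
```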

22 CLUNCH Secondary Structure MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITS VWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISG YFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSW VWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILL CYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGY AFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSS VSNSSVSPA -----HHHHHHHHHHH EEE EEEE--- EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE--- EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE HHHHHEHEEEEEE EHH H E EEEEEEEEE------EHHHH HHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE EEEHHH HHHHHHHHHH EEEEEH-----HHHHHHH EEEEE

23 CLUNCH “Whole Protein” Results 19 features evaluated. Two were selected: –Mean length of alpha helix region –Maximum length of any structural region. 59% increase in protein likelihood.
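The two selected features can be computed from a secondary structure string like the one on the previous slide. This is a sketch under stated assumptions: H marks helix, E marks strand, and only H and E runs count as "structural regions" by default. The talk does not say whether coil regions were included, so the `states` parameter is left adjustable.

```python
from itertools import groupby

def region_lengths(structure, states="HE"):
    """Return (state, run_length) for each maximal run of a state in `states`."""
    return [(s, len(list(run))) for s, run in groupby(structure) if s in states]

def mean_helix_length(structure):
    """Mean length of the alpha helix (H) regions, or 0.0 if there are none."""
    lengths = [n for _, n in region_lengths(structure, "H")]
    return sum(lengths) / len(lengths) if lengths else 0.0

def max_region_length(structure, states="HE"):
    """Length of the longest region among the given states."""
    return max((n for _, n in region_lengths(structure, states)), default=0)
```

For the toy string `"--HHHH-EEE-HH"`, the helix runs have lengths 4 and 2, so the mean helix length is 3.0 and the maximum region length is 4.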

24 CLUNCH Improved Glimmer Models Glimmer used IMMs to predict genes in bacteria. Will adding amino acid triggers improve these models? How much?

25 CLUNCH H. pylori Genome 1562 coding sequences, split into: –Training (>500bp) – 1154 genes, 1,354,167 bp –Testing (<500bp) – 408 genes, 129,045 bp

26 CLUNCH Glimmer Depth

27 CLUNCH Lateral Gene Transfer Many genes in bacteria come not from their ancestors but from other bacterial species. Different bacteria “prefer” to use different codons. Analogous to plagiarism detection?
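One way to make the codon preference idea concrete is to compare a gene's codon frequencies against those of the rest of the genome. The sketch below uses overall codon frequencies with a pseudocount and a KL divergence as the distance; both choices, and the function names, are assumptions for illustration rather than the method proposed in the talk.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_freqs(dna, pseudocount=1.0):
    """Smoothed frequency of each of the 64 codons in an in-frame coding sequence."""
    counts = Counter(dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits between two codon distributions."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in CODONS)
```

A gene whose divergence from the genome-wide codon distribution is unusually large would be a candidate for having been acquired from another species, much as an unusual writing style can flag a borrowed passage.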

28 CLUNCH Model Adaptation Gene models are trained for every organism. Lots of unused information. Analogous to cross-domain application of NLP models.

29 CLUNCH Thanks Lyle Ungar Roni Rosenfeld NIH Grant

30 CLUNCH

31 N-Gram Features Unigram (frequency of individual words) Bigram (frequency of pairs of words)

32 CLUNCH

33 Trigger feature function
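The function shown on this slide is missing from the transcript. A common way to write a trigger feature for a pair A → B, offered here as an assumption about its form, is a binary indicator over the history h and the predicted word w:

```latex
f_{A \to B}(h, w) =
\begin{cases}
1 & \text{if } A \in h \text{ and } w = B \\
0 & \text{otherwise}
\end{cases}
```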

