CLUNCH Applying NLP Models to the Biological Domain. Eugen Buehler, Lyle Ungar.

CLUNCH Overview. “Languages” of Computers and Biology. Probability Models for NL and Biology. Maximum Entropy. Basic ME amino acid model. The “Whole Protein Model”. Results in a gene prediction model.

CLUNCH Bits and Bytes: The Alphabet of Computers. Computer electronics are complicated: RAM, processor, etc. It all comes down to bits (1s and 0s). Bits can be organized into bytes (8 bits each). Bytes can represent, among other things, letters (ASCII), which can form sentences.

CLUNCH DNA: Biology’s Alphabet. Biology is complicated. It comes down to nucleotides (A, C, G, T). Nucleotides can be grouped into codons (triplets). Codons represent amino acids; amino acids make up the proteins that genes encode.
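
As an illustration of the nucleotide-to-codon-to-amino-acid step, here is a minimal Python sketch that translates DNA with the standard genetic code. The function name translate and the compact table layout are choices made for this example, not something from the talk; the test string is a short stretch taken from the "Find the genes!" sequence.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons listed in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("ATGAAACGCATTAGCACCACC"))  # MKRISTT
```

The sketch assumes an uppercase or lowercase A/C/G/T string with no ambiguity codes; any other character raises a KeyError.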

CLUNCH Find the words!

CLUNCH Find the genes! AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC
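
To make the difficulty concrete, a naive baseline is to scan for open reading frames (an ATG followed in the same frame by a stop codon). The sketch below is only that baseline, not the method of the talk: the name find_orfs and the min_codons threshold are assumptions for illustration, and it looks at the forward strand only, whereas real genes can lie on either strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 10):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it has enough codons before the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

Many ORFs found this way are not genes, which is why statistical models of coding sequence are needed.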

CLUNCH NL and Biological Modeling. “Mary went to the ____.” MSGTIPSCPTAL ___

CLUNCH Markov Models
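
The content of this slide is not preserved in the transcript. As a rough sketch of the idea, a first-order Markov model predicts the next symbol from the previous one using transition frequencies. The function name train_bigram and the two toy sequences below are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count symbol-to-next-symbol transitions and normalize each row."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }

# Toy training data (hypothetical, for illustration only).
model = train_bigram(["MSGTIPSCPTAL", "MSGTAPSL"])
next_after_s = model["S"]  # distribution over the symbol that follows "S"
```

The same code works on words instead of amino acids if each sequence is a list of tokens.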

CLUNCH ME, in a Nutshell. Constrain the model. Maximize entropy.

CLUNCH Constraining features. “is the” occurs with frequency 1/ [denominator lost in extraction]. Define a feature: [equation lost]. Require that: [equation lost].
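
A common textbook way to write a bigram feature and its constraint in a conditional maximum entropy model is sketched below. This is an assumed reconstruction in standard notation, not the slide's own formulas: h is the history, w the next word, \tilde{p}(h) the empirical distribution of histories, and K stands for the observed frequency of “is the” (its value is truncated in the transcript).

```latex
f_{\text{is},\text{the}}(h, w) =
\begin{cases}
1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\
0 & \text{otherwise}
\end{cases}
\qquad
\sum_{h, w} \tilde{p}(h)\, p(w \mid h)\, f_{\text{is},\text{the}}(h, w) = K
```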

CLUNCH Exponential Solution. A unique solution exists with maximum entropy: [equation lost].
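
In the usual formulation the maximum entropy solution has the exponential form p(w | h) = exp(sum_i lambda_i f_i(h, w)) / Z(h), where Z(h) normalizes over the vocabulary. The sketch below evaluates that form for a toy vocabulary; the function names, the single feature, and the weight log 2 are hypothetical values for illustration, and no training (fitting of the lambdas) is shown.

```python
import math

def me_probabilities(history, vocab, features, weights):
    """p(w | h) = exp(sum_i lambda_i * f_i(h, w)) / Z(h)."""
    scores = {
        w: math.exp(sum(lam * f(history, w) for f, lam in zip(features, weights)))
        for w in vocab
    }
    z = sum(scores.values())  # normalizer Z(h)
    return {w: s / z for w, s in scores.items()}

# Hypothetical bigram feature: fires when the history ends in "is" and w is "the".
def f_is_the(history, word):
    return 1.0 if history and history[-1] == "is" and word == "the" else 0.0

probs = me_probabilities(["this", "is"], ["the", "a", "cat"], [f_is_the], [math.log(2.0)])
```

With this weight, "the" receives about twice the probability of each other word after "is", and all words are equally likely after any other history.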

CLUNCH Triggers. Triggers – words that increase the likelihood of other words. Crop → Harvest; Cuban → Havana; Iran → Hashemi; Hate → Hate.
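
One simple way to turn a trigger pair into a binary feature is sketched here, under the assumption that the feature fires whenever the trigger word appears anywhere in the history and the predicted word is the target. The helper name make_trigger_feature is hypothetical.

```python
def make_trigger_feature(trigger, target):
    """Build f(h, w) that fires when `trigger` is in the history and w == target."""
    def feature(history, word):
        return 1 if word == target and trigger in history else 0
    return feature

f_crop_harvest = make_trigger_feature("crop", "harvest")
f_hate_hate = make_trigger_feature("hate", "hate")  # a self-trigger
```

A pair such as Hate → Hate is a self-trigger: seeing the word once makes it more likely to appear again.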

CLUNCH Unigram and Bigram Caches. Caches – frequency tables built from the history. Is “supercalifragilisticexpialidocious” a common word? Caches allow for model adaptation.
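
A minimal sketch of both caches, assuming the history is simply the list of tokens seen so far in the current document. The function names are hypothetical, and a practical model would combine these estimates with a static model rather than use them alone.

```python
from collections import Counter

def unigram_cache_prob(history, word):
    """Relative frequency of `word` among the tokens seen so far."""
    if not history:
        return 0.0
    return Counter(history)[word] / len(history)

def bigram_cache_prob(history, prev, word):
    """Relative frequency of `word` after `prev` within the history."""
    pairs = Counter(zip(history, history[1:]))
    prev_total = sum(c for (p, _), c in pairs.items() if p == prev)
    return pairs[(prev, word)] / prev_total if prev_total else 0.0
```

A word that is rare in general but has already occurred several times in the current text gets a high cache probability, which is how the cache adapts the model to the document.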

CLUNCH Applying ME Models in Computational Biology. ME models gave a significant improvement for NLP. Same for biological models? Amino acid (AA) sequences: a simple test case.

CLUNCH Feature Sets. Unigrams and bigrams. Self-triggers – frequency of a specific amino acid. Class-based self-triggers – frequency of a specific amino acid class. Unigram cache – amino acid frequency for this protein.
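
The self-trigger and class-based self-trigger features might be computed along the following lines. This is a sketch under assumptions: the transcript does not say which amino acid classes were used, so the four-way grouping below is a common textbook one chosen for illustration, and the function names are hypothetical.

```python
# Hypothetical amino acid classes (a common textbook grouping; the classes
# actually used in the talk are not given in the transcript).
AA_CLASSES = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}
CLASS_OF = {aa: name for name, members in AA_CLASSES.items() for aa in members}

def self_trigger(history: str, aa: str) -> float:
    """Frequency of amino acid `aa` in the protein prefix seen so far."""
    return history.count(aa) / len(history) if history else 0.0

def class_self_trigger(history: str, aa: str) -> float:
    """Frequency of the class of `aa` in the protein prefix seen so far."""
    if not history:
        return 0.0
    return sum(CLASS_OF[h] == CLASS_OF[aa] for h in history) / len(history)
```

Evaluating self_trigger for all 20 amino acids over the same prefix gives the per-protein unigram cache.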

CLUNCH Training and testing data. Burset et al. set of 571 proteins. Homologous proteins eliminated. Resulting set of 204 proteins split into 2 groups of 102 each.

CLUNCH Results. “Long distance” features help. Best model gives a 30% reduction in perplexity over the unigram model. Our model may improve predictions made by Genscan, a eukaryotic gene finding algorithm.
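
For readers unfamiliar with the metric, perplexity is 2 raised to the average negative log2 probability the model assigns to each symbol, so a 30% reduction means the perplexity falls to 70% of the baseline value. The sketch below uses a uniform model over 20 amino acids purely as an illustrative reference point; it is not the unigram baseline of the talk.

```python
import math

def perplexity(probs):
    """Perplexity = 2 ** (average negative log2 probability per symbol)."""
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

uniform = perplexity([1 / 20] * 50)   # uniform model over 20 amino acids
print(round(uniform, 6))               # 20.0
print(round(0.7 * uniform, 6))         # 14.0, i.e. a 30% reduction
```

Lower perplexity means the model is, on average, less surprised by the next symbol.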

CLUNCH Limitations of this model. Artificial model. Cannot represent all global features.

CLUNCH The “Whole Sentence” Model
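
The formula on this slide is not preserved. In the literature on whole-sentence exponential models, the form is usually written as below, where s is the whole sentence (here, a whole protein), p_0 is a baseline model, the f_i are features of the entire sequence, and Z is a global normalizer. Treat this as the standard form and an assumption about what the slide showed.

```latex
p(s) = \frac{1}{Z}\, p_0(s)\, \exp\!\Big( \sum_i \lambda_i f_i(s) \Big)
```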

CLUNCH Secondary Structure MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITS VWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISG YFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSW VWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILL CYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGY AFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSS VSNSSVSPA -----HHHHHHHHHHH EEE EEEE--- EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE--- EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE HHHHHEHEEEEEE EHH H E EEEEEEEEE------EHHHH HHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE EEEHHH HHHHHHHHHH EEEEEH-----HHHHHHH EEEEE

CLUNCH “Whole Protein” Results. 19 features evaluated. Two were selected: – Mean length of alpha helix region – Maximum length of any structural region. 59% increase in protein likelihood.
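
The two selected features can be computed from a secondary-structure string such as the one on the previous slide. The sketch assumes H marks helix, E marks strand, and - marks coil, and it assumes that "structural region" means a run of H or E; neither assumption is stated in the transcript, and the function names are hypothetical.

```python
from itertools import groupby

def structure_runs(ss: str):
    """Split a secondary-structure string into (label, run_length) pairs."""
    return [(label, len(list(group))) for label, group in groupby(ss)]

def mean_helix_length(ss: str) -> float:
    """Mean length of the runs labeled H (alpha helix)."""
    helix = [n for label, n in structure_runs(ss) if label == "H"]
    return sum(helix) / len(helix) if helix else 0.0

def max_region_length(ss: str, labels: str = "HE") -> int:
    """Length of the longest run whose label is one of `labels`."""
    runs = [n for label, n in structure_runs(ss) if label in labels]
    return max(runs, default=0)
```

For the string "--HHHH--EEEEE-HH" the helix runs have lengths 4 and 2, so the mean helix length is 3.0, and the longest structural region is the strand of length 5.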

CLUNCH Improved Glimmer Models. Glimmer uses interpolated Markov models (IMMs) to predict genes in bacteria. Will adding amino acid triggers improve these models? How much?
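
The idea behind an interpolated Markov model is to blend predictions from contexts of several lengths, trusting a longer context more when it has been seen often. The sketch below is a deliberately simplified version: the weight lambda = n / (n + pseudo) and the function names are assumptions for illustration, and Glimmer's actual weighting scheme is more elaborate.

```python
from collections import Counter, defaultdict

def train_counts(sequences, max_order):
    """counts[k][context][symbol] for every context length k = 0..max_order."""
    counts = [defaultdict(Counter) for _ in range(max_order + 1)]
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[k][seq[i - k:i]][sym] += 1
    return counts

def imm_prob(counts, context, sym, alphabet="ACGT", pseudo=10.0):
    """Interpolate from order 0 up to len(context); a context seen n times
    gets weight lambda = n / (n + pseudo) against the shorter-context estimate."""
    prob = 1.0 / len(alphabet)  # uniform fallback below order 0
    for k in range(min(len(context), len(counts) - 1) + 1):
        row = counts[k][context[len(context) - k:]]
        n = sum(row.values())
        if n == 0:
            break  # unseen context: keep the shorter-context estimate
        lam = n / (n + pseudo)
        prob = lam * row[sym] / n + (1 - lam) * prob
    return prob
```

Because every step mixes two distributions over the same alphabet, the result remains a proper probability distribution as long as the training data uses only symbols from the alphabet.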

CLUNCH H. pylori Genome. 1,562 coding sequences, split into: – Training (>500 bp): 1,154 genes, 1,354,167 bp – Testing (<500 bp): 408 genes, 129,045 bp.

CLUNCH Glimmer Depth

CLUNCH Lateral Gene Transfer. Many genes in bacteria come not from their ancestors but from other bacterial species. Different bacteria “prefer” to use different codons. Analogous to plagiarism detection?
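
One way to make the codon-preference idea concrete is to compare a gene's codon usage with the genome-wide usage and flag genes that diverge strongly. The sketch below uses a KL-style divergence with a small floor for unseen codons; the measure, the floor value, and the function names are assumptions for illustration and are not taken from the talk.

```python
import math
from collections import Counter

def codon_usage(dna: str):
    """Relative frequency of each in-frame codon in a coding sequence."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    total = len(codons)
    return {c: n / total for c, n in Counter(codons).items()}

def usage_divergence(gene_usage, genome_usage, floor=1e-6):
    """KL-style divergence of a gene's codon usage from the genome-wide usage."""
    return sum(
        p * math.log2(p / max(genome_usage.get(c, 0.0), floor))
        for c, p in gene_usage.items()
    )
```

A gene whose codon usage matches the host genome scores near zero, while a gene with a very different preference scores higher and becomes a candidate for lateral transfer.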

CLUNCH Model Adaptation. Gene models are trained for every organism. Lots of unused information. Analogous to cross-domain application of NLP models.

CLUNCH Thanks: Lyle Ungar, Roni Rosenfeld, NIH Grant.

CLUNCH N-Gram Features. Unigram (frequency of individual words). Bigram (frequency of pairs of words).

CLUNCH Trigger feature function