Presentation on theme: "Algorithms for Finding Genes Rhys Price Jones Anne R. Haake."— Presentation transcript:
Algorithms for Finding Genes Rhys Price Jones Anne R. Haake
There’s something about Genes The human genome consists of about 3 billion base pairs Some of that can be transcribed and translated to produce proteins Most of it does not What is it that makes some sequences of DNA encode protein?
The Genetic Code Pick a spot in your sequence Group nucleotides from that point on in triplets. Each triplet encodes for an amino acid, or a “stop”, or a “start” TGA, TAG, TAA code for “stop” But there are exceptions!
Complications In the 1960s it was thought that a gene was a linear structure of nucleotides directly linked in triplets to amino acids in the resulting protein This view did not last long with the discovery in the late 1960s of genes within genes and overlapping genes In 1977, split genes were discovered
Introns and Exons Most eukaryotic genes are interrupted by non-coding sections and broken into pieces called exons. The interrupting sequences are called introns Some researchers have used a biological approach and searched for splicing sites at intron-exon junctions Catalogs of splice sites were created in the 1980s But unreliable
Can you tell if it’s a gene? What is it about some sections of DNA that make them encode for proteins? Compare TTCTTCTCCAAGAGCAGGGCTTAATTCTATGCTTCCAGGCGAAAGACTG CATGGCTAACAAAGCAACGCCTAACACATTCCTAAGCAATTGGCTTGCA CC GTCTTCCCGAGGGTGTTTCTCCAATGGAAAGAGGCGTCGCTGGGCACC CGCCGGGAACGGCCGGGTGACCACCCGGTCATTGTGAACGGAAGTTTC GAGA Is there anything that distinguishes the second from the first? We’ll return to this issue
Before we embark on the biological problem We will look at an easier analogy Sometimes called the Crooked Casino Basic idea: sequence of values generated by one or more processes from the values, try to deduce which process produced which parts of the sequence
A useful analogy The loaded die I rolled a die 80 times and got 2363662651611344141666324264633142224 2145353556233136631321454662631116546 622542 Some of the time it was a fair die Some of the time the die produced 6s half the time and 1-5 randomly the rest of the time
There are no computer algorithms to recognize genes reliably Statistical approaches Similarity-based approaches Others
Similarity Methods Searching sequences for regions that have the “look and feel” of a gene (Dan Gusfield) An early step is the investigation of ORFs (in Eukaryotes) look for possible splicing sites and try to assemble exons Combine sequence comparison and database search, seeking similar genes
Statistical Approaches Look for features that appear frequently in genes but not elsewhere As for the loaded die sequence above, that is not 100% reliable But then, what is? This is Biology and it’s had billions of years to figure out how to outfox us! If you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in. She offers her information only in one form; we are not so unhumble as to demand that she change before we pay any attention. (R. Feynman)
Relative Frequency of Stop Codons 3 of the possible 64 nucleotide triplets are stop codons We would thus expect stop codons to form approximately 5% of any stretch of random DNA, giving an average distance between stop codons of about 20 codons Very long stretches of triplets containing no stop codons (long ORFs) might, therefore, indicate gene possibilities
Rather like looking for regions where my die was fair By seeking a shortage of 6s Similarly, there is (organism-specific) codon bias of various kinds, and this can be exploited by gene-finders People and programs have looked for such bias features as: regularity in codon frequencies regularity in bicodon frequencies periodicity homogeneity vs. complexity
CpG Islands CG dinucleotides are often written CpG to avoid confusion with the base pair C-G In the human genome CpG is a rarer bird than would be expected in purely random sequences (there are chemical reasons for this involving methylation) In the start regions of many genes, however, the methylation process is suppressed, and CpG dinucleotides appear more frequently than elsewhere
Rather like looking for stretches where my die was loaded By seeking a plethora of 6s But now we’ll look for high incidences of CpG
How can we detect shortage or plethora of CpGs? Markov Models Assume that the next state depends only on the previous state If the states are x 1 x 2... x n We have a matrix of probabilities p ij p ij is the probability that state x j will follow x i
Transition matrices CpG islands A C G T A 18 27 43 12 C 17 37 27 19 G 16 34 37 13 T 8 35 38 18 elsewhere A C G T A 30 20 28 21 C 32 30 8 30 G 25 25 30 20 T 18 24 29 29 (adapted from Durbin et al, 1998)
How do we get those numbers? empirically lots of hard wet and dry lab work We can think of these numbers as parameters of our model later we’ll pay more attention to model parameters
For our loaded die example we have 0 means non-6 loaded 0 6 0 50 50 6 50 50 unloaded 0 6 0 83 17 6 83 17
Discrimination Given a sequence x we can calculate a log odds score that it was generated by one of two models. S(x) = log (P(x | model A) / P(x | model B)) = sum of log((prob x i-1 to x i in A) / (prob x i-1 to x i in B)) Demonstration (define beta (lambda (x y) (cond ((and (= x 6) (< y 6)) (log (/.83.5))) ((and (= x 6) (= y 6)) (log (/.17.5))) ((and (< x 6) (< y 6)) (log (/.83.5))) ((and (< x 6) (= y 6)) (log (/.17.5)))))) (define s (lambda (l) (cond ((null? (cdr l)) 0) (else (+ (beta (car l) (cadr l)) (s (cdr l)))))))
Similar Analysis for CpG Islands Visit the code (show mary) (show gene) (scorecpg mary) (scorecpg gene) Sliding Window (slidewindow 20 mary) (show (classify mary))
Hidden Markov Models The Markov model we have just seen could be used for locating CpG islands For example you could calculate the log odds score of a moving window of, say 50 nucleotides Hidden Markov Models try to combine the two models in a single model For each state, say x, in the Markov model, we imagine two states x A and x B for the two models A and B Both new states “emit” the same character But we don’t know whether it was C A or C B that “emitted” the nucleotide C
HMMs Hidden Markov Models are so prevalent in bioinformatics, we’ll just refer to them as HMMs Given a sequence of “emitted signals” -- the observed sequence and a matrix of transition probabilities and emission probabilities For a path of hidden states, calculate the conditional probability of the sequence given the path Then try to find an optimal path for the sequence such that the conditional probability is maximized
Must be able to move between models Figure out these transition probabilities G+ C+ T+ T- G- C-A- A+
Add a begin and an end state and figure even more transition probabilities G+ C+ T+ T- G- C-A- A+ B E
Basic Idea Each state emits a nucleotide (except B and E) A+ and A- both emit A C+ and C- both emit C You know the sequence of emissions You don’t know the exact sequences of states Was it A+ C+ C+ G- that produced ACCG Or was it A- C- C+ G- You want to find the most likely sequence
Parameters of the HMM How do we know the transition probabilities between hidden states? empirical, hard lab work as above for the CpG island transition probabilities imperfect data will lead to imperfect model Is there an alternative?
Machine Learning Grew out of other disciplines, including statistical model fitting Tries to automate the process as much as possible use very flexible models lots of parameters let the program figure them out Based on... known data and properties training sets Induction and inference
Bayesian Modeling Understand the hypotheses (model) Assign prior probabilities to the hypotheses Use Bayes’ theorem (and the rest of “probability calculus”) to evaluate posterior probabilities for the hypotheses in light of actual data
Training Neural Networks Need set of inputs and desired outputs Run the network with a given input Compare to desired output Adjust weights to improve Repeat...
Training HMMs Similar idea Begin with guesses for the parameters of the HMM Run it on your training set, and see how well it predicts whatever you designed it for Use the results to adjust the parameters Repeat...
Hope that When you provide brand new inputs The trained network will produce useful outputs
Some gene finder programs GRAIL – neural net discriminant Genie – HMM GENSCAN – “Semi Markov Model” GeneParser – neural net FGENEH etc (Baylor) – Classification by Linear Discrimination Analysis (LDA) a well-known statistical technique GeneID – rule-based