Probabilistic sequence modeling I: frequency and profiles Haixu Tang School of Informatics.

Slides:



Advertisements
Similar presentations
The genetic code.
Advertisements

Computational Biology, Part 8 Protein Coding Regions Robert F. Murphy Copyright  All rights reserved.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
5’ C 3’ OH (free) 1’ C 5’ PO4 (free) DNA is a linear polymer of nucleotide subunits joined together by phosphodiester bonds - covalent bonds between.
RNA Say Hello to DNA’s little friend!. EngageEssential QuestionExplain Describe yourself to long lost uncle. How do the mechanisms of genetics and the.
Transcription & Translation Worksheet
Bioinformatics Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 3 Finding Motifs Aleppo University Faculty of technical engineering.
Sequence analysis June 20, 2006 Learning objectives-Understand sliding window programs. Understand difference between identity, similarity and homology.
Sequence analysis June 18, 2008 Learning objectives-Understand the concept of sliding window programs. Understand difference between identity, similarity.
Sequence analysis June 19, 2007 Learning objectives-Understand the concept of sliding window programs. Understand difference between identity, similarity.
Sequence analysis June 17, 2003 Learning objectives-Review amino acids structures. Understand sliding window programs. Understand difference between identity,
Gene Prediction: Statistical Approaches Lecture 22.
Computational Biology, Part 4 Protein Coding Regions Robert F. Murphy Copyright  All rights reserved.
Principles of Biology By Frank H. Osborne, Ph. D. Molecular Genetics.
Transcription and Translation
Today… Genome 351, 8 April 2013, Lecture 3 The information in DNA is converted to protein through an RNA intermediate (transcription) The information in.
3. Genome Annotation: Gene Prediction. Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions.
Unit 7 RNA, Protein Synthesis & Gene Expression Chapter 10-2, 10-3
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Nature and Action of the Gene
Protein Synthesis. DNA RNA Proteins (Transcription) (Translation) DNA (genetic information stored in genes) RNA (working copies of genes) Proteins (functional.
FEATURES OF GENETIC CODE AND NON SENSE CODONS
CHAPTER 12 PROTEIN SYNTHESIS AND MUTATIONS -RNA -PROTEIN SYNTHESIS -MUTATIONS.
Copyright © 2004 by Limsoon Wong A Biology Review.
More on translation. How DNA codes proteins The primary structure of each protein (the sequence of amino acids in the polypeptide chains that make up.
Starter Read 11.4 Answer concept checks 2-4.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
. Sequence Alignment. Sequences Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of these sequences.
Genome sequencing Haixu Tang School of Informatics.
Lecture 10, CS5671 Neural Network Applications Problems Input transformation Network Architectures Assessing Performance.
Gene Prediction: Statistical Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 20, 2005 ChengXiang Zhai Department of Computer Science.
Secondary structure prediction
Learning Targets “I Can...” -State how many nucleotides make up a codon. -Use a codon chart to find the corresponding amino acid.
Outline 1.What is an amino acid / protein naturally occurring amino acids 3.Codon – triplet coding for an amino acid 1.How are proteins synthesized.
GENOME: an organism’s complete set of genetic material Humans ~3 billion base pairs CHROMOSOME: Part of the genome; structure that holds tightly wound.
Outline What is an amino acid / protein
Protein Secondary Structure Prediction G P S Raghava.
Genome Annotation Haixu Tang School of Informatics.
RNA 2 Translation.
LO: SWBAT explain how protein shape is determined and differentiate between the different types of mutations. DN: h/0 protein synthesis HW: Read pp #
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Gene Structure Prediction (Gene Finding) I519 Introduction to Bioinformatics, 2012.
Chapter 17 How to read a table of codons. These are two forms in which you might see a table of codons.
©1998 Timothy G. Standish From DNA To RNA To Protein Timothy G. Standish, Ph. D.
Parts is parts…. AMINO ACID building block of proteins contain an amino or NH 2 group and a carboxyl (acid) or COOH group PEPTIDE BOND covalent bond link.
Mutations Csaba Bödör, Semmelweis University, 1 st Dept. of Pathology.
GENE EXPRESSION. Transcription 1. RNA polymerase unwinds DNA 2. RNA polymerase adds RNA nucleotides (A ↔ U, G ↔ C) 3. mRNA is formed! DNA reforms a double.
Prepared By: Syed Khaleelulla Hussaini. Outline Proteins DNA RNA Genetics and evolution The Sequence Matching Problem RNA Sequence Matching Complexity.
NUCLEIC ACIDS AND PROTEIN SYNTHESIS. DNA complex molecule contains the complete blueprint for every cell in every living thing Amount of DNA that would.
Translation PROTEIN SYNTHESIS.
Whole process Step by step- from chromosomes to proteins.
DNA, RNA and Protein Synthesis
Protein Synthesis DNA RNA Protein.
Modelling Proteomes.
The blueprint of life; from DNA to Protein
Warm-Up 3/12/13 After transcription, an mRNA molecule with the sequence A U A C G C A G U was created. What was the sequence of the original DNA strand?
What is Transcription and who is involved?
Outline What is an amino acid / protein
PROTEIN SYNTHESIS RELAY
More on translation.
Gene Prediction: Statistical Approaches
Fundamentals of Protein Structure
NOTE SHEET 13 – Protein Synthesis
Central Dogma and the Genetic Code
Station 2 Protein Synethsis.
Protein Synthesis: Translation
DNA to proteins.
Nucleic Acids Review.
Introduction to Bioinformatics II
Presentation transcript:

Probabilistic sequence modeling I: frequency and profiles Haixu Tang School of Informatics

Genome and genes Genome: an organism’s genetic material (Car encyclopedia) Gene: a discrete units of hereditary information located on the chromosomes and consisting of DNA. (Chapters to make components of a car, or to use and drive a car).

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg Gene!

Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions of genes in a genome Gene Prediction: Computational Challenge

A simple model for gene prediction: frequency-based DNA modeling DNA is a double strand molecule –G-C pair  strong –A-T pair  weak Coding regions often have higher GC content than non-coding regions

Frequency-based DNA modeling of coding vs. non-coding regions To predict coding regions in an organism (e.g. human), collect a set of known coding and non-coding DNA sequences from this organism (training set) Compute the frequency distribution of GC and AT pairs in coding and non-coding regions, respectively: F (f GC |c), F (f AT |c)=1- F (f GC |c); F (f GC |nc), F (f AT |nc)=1- F (f GC |nc);

A example from zygomycete Phycomyces blakesleeanus f GC non-coding coding

Model comparison Two models: coding vs. non-coding Given a DNA sequence, its likelihood of being a coding sequence, based on its GC content f GC f GC non-coding coding P(c|f GC )=P(c)(P(f GC |c)/[P(f GC |c)P(c)+ P(f GC |nc)]P(nc)) P(nc|f GC )=P(nc)(P(f GC |nc)/[P(f GC |c)P(c)+ P(f GC |nc)P(nc)]) Assume P(c)=P(nc), P(f GC |c)= F (f<f GC |c), P(f GC |nc)= F (f>f GC |nc), C=log(P(f GC |c)/P(f GC |nc))

Sliding window approach aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcgg ctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccg atgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcat gcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagct gggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggcta tgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcg gctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatga caatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggc tatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgcta agctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgca tgcggctatgctaagctcatgcgg

Sliding window approach c Sequence Coding region

Protein RNA DNA transcription translation CCTGAGCCAACTATTGATGAA PEPTIDEPEPTIDE CCUGAGCCAACUAUUGAUGAA A more complicated model: codon usages

Codon: 3 consecutive nucleotides 4 3 = 64 possible codons Genetic code is degenerative and redundant –Includes start and stop codons –An amino acid may be coded by more than one codon (codon degeneracy) Translating Nucleotides into Amino Acids

In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations Systematically deleted nucleotides from DNA –Single and double deletions dramatically altered protein product –Effects of triple deletions were minor –Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein Codons

UAA, UAG and UGA correspond to 3 Stop codons that (together with Start codon ATG) delineate Open Reading Frames Genetic Code and Stop Codons

Six Frames in a DNA Sequence stop codons – TAA, TAG, TGA start codons - ATG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC

Testing reading frames Create a 64-element hash table and count the frequencies of codons in a reading frame; Amino acids typically have more than one codon, but in nature certain codons are more in use; Uneven use of the codons may characterize a coding region;

Codon Usage in Human Genome

AA codon /1000 frac Ser TCG Ser TCA Ser TCT Ser TCC Ser AGT Ser AGC Pro CCG Pro CCA Pro CCT Pro CCC AA codon /1000 frac Leu CTG Leu CTA Leu CTT Leu CTC Ala GCG Ala GCA Ala GCT Ala GCC Gln CAG Gln CAA Codon Usage in Mouse Genome

Using codon frequency to find correct reading frame Consider sequence x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8 x 9…. where x i is a nucleotide let p 1 = p x1 x2 x3 p x4 x5 x6 …. p 2 = p x2 x3 x4 p x5 x6 x7 …. p 3 = p x3 x4 x5 p x6 x7 x8 …. then probability that ith reading frame is the coding frame is: p i p 1 + p 2 + p 3 slide a window along the sequence and compute P i P i =

Adding the background model: gene finding In the previous model, we assume at least one reading frame is the codon sequence  testing reading frames, not gene finding Adding a background model –p 0 = p x1 p x2 p x3 p x4 …. –Based on the nucleotide frequency in the non-coding sequence –P i = p i / (p 0 +p 1 +p 2 +p 3 ) In practice, this model should be extended to all six reading frames.

Protein secondary structure prediction Amino acid sequence NLKTEWPELVGKSVEE AKKVILQDKPEAQIIVL PVGTIVTMEYRIDRVR LFVDKLDNIAEVPRVG folding

Basic structural units of proteins: Secondary structure α-helix β-sheet Secondary structures, α-helix and β-sheet, have regular hydrogen-bonding patterns.

Secondary Structure Prediction Given a protein sequence, secondary structure prediction aims at predicting the state of each amino acid as being either H (helix), E (extended=strand), or O (other). The quality of secondary structure prediction is measured with a “3-state accuracy” score, or Q 3. Q 3 is the percent of residues that match “reality” (X-ray structure).

Chou and Fasman: a frequency model P(  |S)=  s p(  |s)=  s p(  |f(s)) –p(  |f(s))~p(f(s)|  )/p(f(s)) –Similarly for  and turn structures

Amino Acid  -Helix  -SheetTurn Ala Cys Leu Met Glu Gln His Lys Val Ile Phe Tyr Trp Thr Gly Ser Asp Asn Pro Arg Chou and Fasman: a frequency model Favors  -Helix Favors  -strand Favors turn

Profile model The frequency model does not consider the order of the training sequences –Permuting the training sequences will not change the model In some cases, the order is of important biological meaning, e.g. sequence motifs; Profile model fully constrains the order of the training sequences

Profile / PSSM LTMTRGDIGNYLGLTVETISRLLGRFQKSGML LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL VRMSREEIGNYLGLTLETVSRLFSRFGREGLI LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL LPMSRQDIADYLGLTIETVSRTFTKLERHGAI DNA / proteins Segments of the same length L; Often represented as Positional frequency matrix;

A DNA profile (matrix) TATAAA TATAAT TATAAA TATTAA TTAAAA TAGAAA T C A G T C A G Sparse data  pseudo-counts

Searching profiles: sampling Give a sequence S of length L, compute the likelihood ratio of being generated from this profile vs. from background model: –R(S|P)= –Searching motifs in a sequence: sliding window approach

Testing a motif Equivalent to computing the significance of a sequence motif TATAAA TATAAT TATAAA TATTAA TTAAAA TAGAAA TCGAAT GCATTT ACTTAA CGCTGC AAACCG CGATAC CCAAGT GACCTA T C A G T C A G

Model comparison: relative entropy H(x) =  i  P(x j ) log (P(x j )/P 0 (x j ))= –b j  the random background distribution T C A G T C A G

Probability distribution What is the probability P(H|B) of getting a matrix with a relative entropy H from the background model B = {b j }? p(h)  the probability distribution of relative entropy score for the frequency of a single column (can be pre-calculated) P(H) = –convolution of function p(s), can be calculated only approximately by Fast Fourier Transformation (FFT)

Finding a motif atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Motif finding is Difficult atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa AgAAgAAAGGttGGG cAAtAAAAcGGcGGG..|..|||.|..|||

The Motif Finding Problem Given a set of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc Find the motif in each of the individual sequences

The Motif Finding Problem If starting positions s=(s 1, s 2,… s t ) are given, finding consensus is easy because we can simply construct (and evaluate) the profile to find the motif. But… the starting positions s are usually not given. How can we find the “best” profile matrix? –Gibbs sampling –Expectation-Maximization algorithm

Conclusion Frequency and profile are two basic models for sequence analysis; They represent two extreme models in terms of incorporating order information in the sequences; Model selection should be based on biological ideas;