Gene Prediction in silico Nita Parekh BIRC, IIIT, Hyderabad.

Goal
The ultimate goal of molecular cell biology is to understand the physiology of living cells in terms of the information encoded in the genome of the cell.
How can computational approaches help in achieving this goal?

A gene codes for a protein
DNA (CCTGAGCCAACTATTGATGAA) -> transcription -> mRNA (CCUGAGCCAACUAUUGAUGAA) -> translation -> Protein (PEPTIDE)

What is Computational Gene Finding?
Given an uncharacterized DNA sequence, find:
- Which region codes for a protein?
- Which DNA strand is used to encode the gene?
- Which reading frame is used in that strand?
- Where does the gene start and end?
- Where are the exon-intron boundaries in eukaryotes?
- (optionally) Where are the regulatory sequences for that gene?
Search space: 2-5% of genomic DNA (~100 - 1000 Mbp)

Need for Computational Gene Prediction
- It is the first step towards getting at the function of a protein.
- It also helps accelerate the annotation of genomes.

Deoxyribonucleic acid (DNA) is a blueprint of the cell
- Composed of four basic units called nucleotides
- Each nucleotide contains a sugar, a phosphate and one of the 4 bases: Adenine (A), Thymine (T), Guanine (G), Cytosine (C)

For all computational purposes, a DNA sequence is considered to be a string over a 4-letter alphabet (A, T, G, C), e.g., ACGCTGAATAGC
The aim is to find the grammar and syntax rules of the DNA language based on this 4-letter alphabet, similar to the rules of English grammar used to form meaningful sentences

Biological Sequences
The order of occurrence of bases is not completely random
- Different regions of the genome exhibit different patterns of the four bases A, T, G, C, e.g., protein coding regions, regulatory regions, intron/exon boundaries, repeat regions, etc.
Aim: identify these various patterns to infer their functional roles

Assumption in biological sequence analysis: strings carrying information will be different from random strings
If a hidden pattern can be identified in a string, the string is presumed to carry some functional information

Example
- This is a lecture on bioinformatics
- asjd lkjfl jdjd sjftye nvcrow nzcdjhspu

Frequency of letters (in English text)
A   7.3%    N  7.8%
B   0.9%    O  7.4%
C   3.0%    P  2.7%
D   4.4%    Q  0.3%
E  13.0%    R  7.7%
F   2.8%    S  6.3%
G   1.6%    T  9.3%
H   3.5%    U  2.7%
I   7.4%    V  1.3%
J   0.2%    W  1.6%
K   0.3%    X  0.5%
L   3.5%    Y  1.9%
M   2.5%    Z  0.1%
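
The comparison between the English sentence and the random string can be tried directly. The following is a minimal sketch, assuming only the Python standard library; the function name letter_frequencies is hypothetical and not part of the lecture.

```python
from collections import Counter


def letter_frequencies(text):
    """Return the percentage frequency of each letter in text (case-insensitive)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # An empty input yields an empty dictionary, so no division by zero occurs.
    return {ch: 100.0 * n / total for ch, n in counts.items()}


if __name__ == "__main__":
    meaningful = "This is a lecture on bioinformatics"
    random_like = "asjd lkjfl jdjd sjftye nvcrow nzcdjhspu"
    for label, text in (("meaningful", meaningful), ("random-like", random_like)):
        freqs = letter_frequencies(text)
        top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:5]
        print(label, [(ch, round(f, 1)) for ch, f in top])
```

Two sentences this short give only a rough impression; the table above is based on much larger samples of text.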

Other statistics
- Frequencies of the most common first letter of a word, last letter of a word, doublets, triplets, etc.
- 20 most used words in
  Written English: the of to in and a for was is that on at he with by be it an as his
  Spoken English: the and I to of a you that in it is yes was this but on well he have for

Parallels in DNA language ATGGTGGTCATGGCGCCCCGAACCCTC TTCCTGCTGCTCTCGGGGGCCCTGACC CTGACCGAGACCTGGGCGGGTGAGTG CGGGGTCAGGAGGGAAACAGCCCCTG CGCGGAGGAGGGAGGGGCCGGCCCG GCGGG GTCTCAACCCCTCCTCGCCCCCAGGCT CCCACTCCATGAGGTATTTCAGCGCCG CCGTGTCCCGGCCCGGCCGCGGGGAG CCCCGCTTCATCGCCATGGGCTACGTG GACGACACGCAGTTCGTGCGGTTC

Parallels in DNA language ATG GTG GTC ATG GCG CCC CGA ACC CTC TTC CTG CTG CTC TCG GGG GCC CTG ACC CTG ACC GAG ACC TGG GCG GGT GAG TGC GGG GTC AGG AGG GAA ACA GCC CCT GCG CGG AGG AGG GAG GGG CCG GCC CGG CGG… GTC TCA ACC CCT CCT CGC CCC CAG GCT CCC ACT CCA TGA GGT ATT TCA GCG CCG CCG TGT CCC GGC CCG GCC GCG GGG AGC CCC GCT TCA TCG CCA TGG GCT ACG TGG ACG ACA CGC AGT TCG TGC GGT TC…

This task needs to be automated because of the large genome sizes:
- One of the smallest genomes: Mycoplasma genitalium, ~0.5 x 10^6 bp
- Human genome: 3 x 10^9 bp (not the largest)

Finding genes in Prokaryotes
- each gene is one continuous stretch of bases
- most of the DNA sequence codes for protein (70% of the H. influenzae bacterium genome is coding)

Gene Structure in Prokaryotes
- Starts with the promoter region, which is followed by a transcription start site
- Non-coding region called the 5’ untranslated region (5’ UTR)
- The coding region, CDS, is a continuous region: it starts with a start codon (ATG) and terminates with a stop codon (TAA/TAG/TGA)
- Followed by another non-coding region called the 3’ untranslated region (3’ UTR)
- In the end there is a transcription termination signal (the polyadenylation signal is a feature of eukaryotic genes)

Finding genes in Prokaryotes
Gene prediction in prokaryotes is comparatively simple and involves:
- identifying long open reading frames
- using codon frequencies

Finding genes in Eukaryotes
- the coding region is usually discontinuous
- composed of alternating stretches of exons and introns
- Only 2-3% of the human genome (~3 x 10^9 bp) codes for proteins

Gene Structure in Eukaryotes
- Starts with a promoter region, which is followed by a transcription start site
- Non-coding region called the 5’ untranslated region (5’ UTR)
- The initial exon, which contains the start codon (ATG in the DNA, AUG in the mRNA)
- An alternating sequence of introns and internal exons
- The terminating exon, which contains the stop codon (TAA, TAG, TGA)

Gene Structure in Eukaryotes
- Non-coding region called the 3’ untranslated region (3’ UTR)
- In the end there is a polyadenylation signal
- Exon-intron boundaries, called splice sites, are signaled by certain short sequences:
  - donor sites (GT): at the 5’ end of an intron, i.e., just after the 3’ end of the upstream exon
  - acceptor sites (AG): at the 3’ end of an intron, i.e., just before the 5’ end of the downstream exon
Note: Some eukaryotic genes are single-exon (intron-less) genes

Finding genes in Eukaryotes
The gene finding problem becomes more complicated:
- due to the existence of interweaving exons and introns: stop codons may exist in intronic regions, making it difficult to identify the correct ORF
- a gene region may encode many proteins, due to alternative splicing
- Exon length need not be a multiple of three, resulting in a frameshift between exons
- A gene may be intron-less (single-exon genes)
- Relatively low gene density: only 2-5% of the human genome codes for proteins

Methods for Identifying Coding Regions
- Finding Open Reading Frames (ORFs)
- Homology Search
  - DNA vs. Protein Searches
- Content-based methods:
  - Coding statistics, viz., codon usage bias, periodicity in base occurrence, etc.
- Signal-based methods:
  - CpG islands
  - Start/Stop signals, promoters, poly-A sites, intron/exon boundaries, etc.
- Integration of these methods

Finding Open Reading Frames (ORF)
- Once a gene has been sequenced it is important to determine the correct open reading frame (ORF).
- Every region of DNA has six possible reading frames, three in each direction.
- The reading frame that is used determines which amino acids will be encoded by a gene.
- Typically only one reading frame is used in translating a gene, and this is often the longest open reading frame.

Finding Open Reading Frames (ORF)
- A relatively long sequence devoid of stop codons indicates a possible coding region
- An open reading frame starts with a start codon (atg) in most species and ends with a stop codon (taa, tag or tga)
- Once the open reading frame is known, the DNA sequence can be translated into its corresponding amino acid sequence using the genetic code
The codons are triplets of bases

The Genetic Code
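
Since the genetic code table itself is not reproduced in this transcript, the translation step can be illustrated in code. The following is a minimal sketch, assuming the standard genetic code written in the compact TCAG ordering; the names CODON_TABLE, transcribe and translate are hypothetical helpers, not part of the lecture.

```python
from itertools import product

# Standard genetic code: codons enumerated with bases in the order T, C, A, G
# (first base varies slowest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA with the same sequence as the coding strand (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(seq):
    """Translate a DNA or mRNA sequence codon by codon, starting at position 0."""
    dna = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing unexpected characters are reported as 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"
    print(transcribe(dna))  # mRNA
    print(translate(dna))   # protein, one-letter amino acid codes
```

The sketch reads a single frame from the first base and does not stop at stop codons; choosing the frame and the start is the ORF-finding problem.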

Finding Open Reading Frames (ORF)
Consider the following sequence of DNA:
5´ TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3´ (“Forward Strand”)
Its complementary strand is:
3´ AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5´ (“Reverse Strand”)
The DNA sequence can be read in six reading frames, three in the forward and three in the reverse direction, depending on the start position

Finding Open Reading Frames (ORF)
5´ TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3´
Three reading frames in the forward direction:
1. TCA ATG TAA CGC GCT ACC CGG AGC TCT GGG CCC AAA TTT CAT CCA CT
2. CAA TGT AAC GCG CTA CCC GGA GCT CTG GGC CCA AAT TTC ATC CAC T
3. AAT GTA ACG CGC TAC CCG GAG CTC TGG GCC CAA ATT TCA TCC ACT
Frame 1 contains a start codon (ATG), immediately followed by a stop codon (TAA)

Finding Open Reading Frames (ORF)
3´ AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5´
Three reading frames in the reverse direction (written 3´ to 5´, so each codon is read from right to left):
1. AG TTA CAT TGC GCG ATG GGC CTC GAG ACC CGG GTT TAA AGT AGG TGA
2. A GTT ACA TTG CGC GAT GGG CCT CGA GAC CCG GGT TTA AAG TAG GTG
3. AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT AAA GTA GGT
Frame 3 contains a start codon (GTA as written, i.e., ATG read 5´ to 3´) and a stop codon (AGT as written, i.e., TGA read 5´ to 3´)

Finding Open Reading Frames (ORF)
In this case the longest open reading frame (ORF) is the 3rd reading frame of the complementary strand:
AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT AAA GTA
When read 5´ to 3´, the longest ORF is:
ATG AAA TTT GGG CCC AGA GCT CCG GGT AGC GCG TTA CAT TGA
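
The six-frame scan worked through above can be sketched in a few lines. This is a minimal illustration, not the method used by any particular tool: it assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, and it ignores ORFs that run off the end of the sequence. The names reverse_complement, find_orfs and longest_orf are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, orf) for each ATG...stop ORF in the six frames.

    Frames are numbered 1 to 3 on each strand; the ORF string includes the
    start and stop codons and is always written 5' to 3'.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if len(orf) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, orf))
                    start = None  # look for the next ORF in this frame
    return orfs


def longest_orf(seq):
    """Return the longest ORF found, or None if there is none."""
    orfs = find_orfs(seq)
    return max(orfs, key=lambda o: len(o[2])) if orfs else None


if __name__ == "__main__":
    forward = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT"
    print(longest_orf(forward))
```

On the example sequence the scan reports the same 14-codon ORF on the reverse strand, frame 3, that was identified by hand.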

Finding Long ORFs
- The first step in distinguishing between a coding and a non-coding region is to look at the frequency of stop codons (in a random sequence with equal base frequencies, 3 of the 64 codons are stops, so a stop is expected about once every 21 codons)
- Sequence similarity search (database search)
- When no sequence similarity is found, an ORF can still be considered gene-like according to some statistical features:
  - the three-base periodicity
  - higher G+C content
  - signal sequence patterns

Finding Long ORFs
Once a long ORF, or all ORFs above a certain length threshold, are identified:
- these ORF sequences are called putative coding sequences
- translate each ORF using the universal genetic code to obtain the amino acid sequence
- search against the protein database for homologs

Finding genes in Prokaryotes
Drawbacks:
- The addition or deletion of one or more bases will cause all the codons scanned to be different, i.e., the method is sensitive to frameshift errors
- Fails to identify very small coding regions
- Fails to identify the occurrence of overlapping long ORFs on opposite DNA strands (genes and ‘shadow genes’)

Web-based tools
ORF Finder (NCBI): http://www.ncbi.nih.gov/gorf/gorf.html
EMBOSS: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/getorf.html
- getorf: Finds and extracts open reading frames
- plotorf: Plots potential open reading frames
- sixpack: Displays a DNA sequence with 6-frame translation and ORFs

Homology Search
This involves sequence-based database searching:
- DNA database searching
- Protein database searching

Homology Search
Why search databases? When one obtains a new DNA sequence, one needs to know:
- whether it already exists in the databanks
- whether it has any homologous sequences (i.e., sequences derived from a common ancestor) in the databases
- Given a putative coding ORF, search for homologous proteins: proteins similar in their folding, structure or function.

Homology Search DNA vs. Protein Searches Use protein for database similarity searches whenever possible

Homology Search
Three main search tools used for database search:
- BLAST: algorithm by Karlin & Altschul, http://www.ncbi.nlm.nih.gov/BLAST/
- FastA: algorithm by Pearson & Lipman, http://www.ebi.ac.uk/fasta33/
- Smith-Waterman (SW) algorithm: dynamic programming algorithm
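
To make the dynamic programming idea behind Smith-Waterman concrete, here is a minimal sketch that computes only the best local alignment score. The scoring values (match +2, mismatch -1, linear gap -1) are arbitrary assumptions chosen for illustration; real searches use substitution matrices such as BLOSUM and affine gap penalties. The function name smith_waterman_score is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Each cell holds the best score of an alignment ending at that pair of
    positions; scores are floored at 0 so an alignment can start anywhere.
    Only two rows of the matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    # The shared core ACGT is found despite unrelated flanking bases.
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The running time grows with the product of the two sequence lengths, which is why heuristic tools such as BLAST and FastA are preferred for searching large databases.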

Limitations of Homology Search
- Only a limited number of genes are available in the various databases.
- Currently only 50% of the sequences are found to be similar to previously known sequences.
It should always be kept in mind that similarity-based methods are only as reliable as the databases that are searched, and apparent homology can be misleading at times.

Content-based Methods
At the core of all gene identification programs there exist one or more coding measures.
A coding statistic is a function that computes the likelihood that a sequence is coding for a protein.
A good knowledge of core coding statistics is important to understand how gene identification programs work.

Classification of Coding Measures
Coding statistics measure:
- base compositional bias
- periodicity in base occurrence
- codon usage bias
The main distinction is between:
- measures dependent on a model of coding DNA
- measures independent of such a model

Model dependent coding statistics capture the specific features of coding DNA:
- Unequal usage of codons in the coding regions, a universal feature of genomes
- Dependencies between nucleotide positions
- Base compositional bias between codon positions
These statistics require a representative sample of coding DNA from the species under consideration to estimate the model's parameters
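
As an illustration of the first feature, codon usage can be estimated by simply counting codons in known coding sequences. This is a minimal sketch under the assumption that the input is an in-frame coding sequence; the function name codon_usage is hypothetical, and a real estimate would pool many genes from the species of interest.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon, if any
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    usage = codon_usage("ATGAAATTTGGGCCCAGAGCTCCGGGTAGCGCGTTACATTGA")
    for codon, freq in sorted(usage.items()):
        print(codon, round(freq, 3))
```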

Markov Models
Dependencies between nucleotide positions in coding regions can be explicitly described by means of Markov models
In Markov models, the probability of a nucleotide at a particular codon position depends on the nucleotide(s) preceding it
Probability of a DNA sequence S = s_1 s_2 ... s_L of length L (first-order case): P(S) = P(s_1) \prod_{i=2}^{L} P(s_i \mid s_{i-1}), where the P(s_i \mid s_{i-1}) are the transition probabilities

Markov Models
Table III: Probabilities of the four nucleotides at the different codon positions, conditioned on the nucleotide in the preceding codon position
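
How a Markov model is used as a coding measure can be sketched as a log-odds score between a model trained on coding DNA and one trained on non-coding DNA. The example below is a simplified assumption, not the lecture's model: it uses a single homogeneous first-order transition table per class, whereas the table described above keeps separate probabilities for each codon position. The training strings are toy data and the names train_markov, log_prob and log_odds are hypothetical.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(cur | prev) from sequences."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_prob(seq, trans, initial=0.25):
    """Log-probability of seq: log P(s_1) plus the sum of log transition probabilities."""
    lp = math.log(initial)
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp


def log_odds(seq, coding, noncoding):
    """Positive values mean seq looks more like the 'coding' training set."""
    return log_prob(seq, coding) - log_prob(seq, noncoding)


if __name__ == "__main__":
    # Toy training data (an assumption for illustration only).
    coding_model = train_markov(["GCGCGCGCGC"])
    noncoding_model = train_markov(["ATATATATAT"])
    print(log_odds("GCGCGC", coding_model, noncoding_model))
    print(log_odds("ATATAT", coding_model, noncoding_model))
```

The pseudocount keeps every transition probability above zero, so the logarithm is always defined even for transitions never seen in training.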

Model independent coding statistics capture only the “universal” features of coding DNA:
- Position Asymmetry: how asymmetric the distribution of nucleotides is at the 3 triplet positions
- Periodic Correlation: correlations between nucleotide positions
These statistics do not require a sample of coding DNA

Signal-based Methods
Signal: a string of DNA recognized by the cellular machinery (e.g., the GT and AG dinucleotides at intron boundaries)

Signals for gene identification
There are many signals associated with genes, each of which suggests but does not prove the existence of a gene.
Most of these signals can be modeled using weight matrices.

Signals for gene identification
- CpG Islands: GC-rich stretches often found near the 5’ end of genes; they help locate the ~2% of the genome that codes for proteins
- Start & Stop Codons: signify the start & end of a coding region
- Transcription Start Site: identifies the start of the transcribed region
- Donor & Acceptor Sites: signify the start & end of intronic regions
- Cap Site: found in the 5’ UTR

Signals for gene identification
- Promoters: initiate transcription (found upstream of the transcription start site)
- Enhancers: regulate gene expression (found in 5’ or 3’ UTR regions, intronic regions, or up to a few kb away from the gene)
- Transcription Factor Binding Sites: short DNA sequences where proteins bind to initiate or regulate transcription
- Poly-A Site: identifies the end of the transcript (found in the 3’ UTR region)
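
Of the signals listed, CpG islands are among the easiest to test for computationally. The sketch below applies the commonly used Gardiner-Garden and Frommer criteria (length of at least 200 bp, G+C content above 50%, observed/expected CpG ratio above 0.6); these thresholds are an assumption brought in for illustration and are not stated in the lecture. The function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(window, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three threshold criteria to a single sequence window."""
    return (
        len(window) >= min_len
        and gc_content(window) > min_gc
        and cpg_obs_exp(window) > min_ratio
    )


if __name__ == "__main__":
    print(looks_like_cpg_island("CG" * 100))  # artificial CpG-rich window
    print(looks_like_cpg_island("AT" * 100))  # artificial AT-only window
```

In practice the test is applied to a window sliding along the genome, and overlapping windows that pass are merged into one island.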

Promoter Detection
Not all ORFs are genes
True coding regions have specific sequences upstream of the start site, known as promoters, where the RNA polymerase binds to initiate transcription, e.g., the consensus patterns in E. coli (TTGACA at -35 and TATAAT at -10):
- No two patterns are identical
- Not all genes have these patterns

Promoter Detection
Signals are short and variable
- Pattern search using positional frequencies
- Compute f(b,i), the frequency of nucleotide b at position i
- Probability of sequence S = b_1 ... b_6 being a promoter: P(S \mid promoter) = \prod_{i=1}^{6} f(b_i, i)
- Probability of S not being a promoter: P(S \mid background) = \prod_{i=1}^{6} f(b_i), where f(b) is the expected frequency of b
- Find the odds ratio of S being a promoter to not being a promoter: \prod_{i=1}^{6} f(b_i, i) / f(b_i)

Positional Weight Matrix for TATA box (values in %; each column sums to 100)
Position    1    2    3    4    5    6
A           2   95   26   59   51    1
C           9    2   14   13   20    3
G          10    1   16   15   12    0
T          79    2   44   13   17   96
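
The odds-ratio scoring with a positional weight matrix can be sketched as follows, using the matrix above read as percentages for positions 1 to 6. The uniform background frequency of 0.25 and the pseudocount of 0.5 (needed because G never occurs at position 6) are assumptions made for this illustration; the function names log_odds_score and best_site are hypothetical.

```python
import math

# Percent frequency of each nucleotide at positions 1..6 (from the matrix above).
PWM = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 12, 0],
    "T": [79, 2, 44, 13, 17, 96],
}


def log_odds_score(site, pwm=PWM, background=0.25, pseudo=0.5):
    """Sum over positions of log2( f(b, i) / f(b) ) for a site of matrix width."""
    score = 0.0
    for i, base in enumerate(site.upper()):
        # The pseudocount avoids log(0) for nucleotides never seen at a position.
        freq = (pwm[base][i] + pseudo) / (100 + 4 * pseudo)
        score += math.log2(freq / background)
    return score


def best_site(seq, width=6):
    """Return (score, start index) of the highest-scoring window in seq."""
    return max(
        (log_odds_score(seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    score, position = best_site("GCGCTATAATGCGC")
    print(position, round(score, 2))
```

Working with the sum of logarithms instead of the product of ratios gives the same ranking of sites and avoids very small numbers for longer matrices.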

Complications in Gene Prediction
The problem of gene identification is further complicated in eukaryotes by the vast variation found in the structure of genes.
- On average, a vertebrate gene is 30 kb long.
- Of this, the coding region is only about 1 kb.
- The coding region typically consists of 6 exons, each about 150 bp long.
These are average statistics.

Complications in Gene Prediction
Huge variations from the average are observed:
- The biggest human gene, dystrophin, is 2.4 Mb long.
- The human blood coagulation factor VIII gene is ~186 kb. It has 26 exons with sizes varying from 69 bp to 3106 bp, and its 25 introns range in size from 207 to 32,400 bp.
- An average 5’ UTR is 750 bp long, but it can be longer and span several exons (e.g., in the MAGE family).
- On average, the 3’ UTR is about 450 bp long, but, e.g., in the gene for Kallmann syndrome, the length exceeds 4 kb.

Some facts about human genes
- Comprise about 3% of the genome
- Average gene length: ~8,000 bp
- Average of 5-6 exons/gene
- Average exon length: ~200 bp
- Average intron length: ~2,000 bp
- ~8% of genes have a single exon
Some exons can be as small as 1 or 3 bp

Complications in Gene Prediction
In higher eukaryotes gene finding becomes far more difficult because:
- It is now necessary to combine multiple ORFs to obtain a spliced coding region
- Alternative splicing is not uncommon
- Exons can be very short, and introns can be very long
Given the nature of genomic sequence in humans, where large introns are known to exist, there is definitely a need for highly specific gene finding algorithms.

Gene prediction programs
GENSCAN: http://genes.mit.edu/GENSCAN.html
- Probabilistic model based on a hidden Markov model (HMM). The different states of the model correspond to different functional units of a gene, e.g., promoter region, exon, intron, etc.
- It uses a homogeneous 5th-order Markov model for non-coding regions and a 3-periodic (inhomogeneous) 5th-order Markov model for coding regions
- Signals are modeled by weight matrices, weight arrays and maximal dependence decomposition techniques
- Species: Vertebrates, Maize, Arabidopsis. Trained on human genes; accuracy is lower for non-vertebrates

Gene prediction programs
Fgene: Uses dynamic programming to find the optimal combination of exons, promoters, and polyA sites detected by a pattern recognition algorithm.
- Species: Human, Drosophila, Nematode, Yeast, Plant
- http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
GrailExp: Uses HMM as the underlying computational technique to determine its genomic predictions. Uses a neural network to combine the information from various gene finding signals.
- Species: Human, Mouse
- http://compbio.ornl.gov/grailexp/
