BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.


1 BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8

2 GENE PREDICTION/GENE FINDING
The vast amount of raw sequence data generated by advances in sequencing technology needs biological interpretation. This interpretation is known as ‘annotation’: finding genes and determining their functions. At present, the annotation of most human genes is based on cDNA sequence data.

3 Protein coding genes
Prokaryotic: no introns, simpler regulatory features.
Eukaryotic: exon-intron structure, complex regulatory features.
Many different types of RNA also exist: tRNA, rRNA, miRNA, siRNA, snoRNA, snRNA.

4 Coding sequence
The coding sequence (CDS) is the actual region of DNA that is translated to form proteins. While the ORF may contain introns as well, the CDS refers only to those nucleotides that can be divided into codons which are actually translated into amino acids by the ribosomal translation machinery. In prokaryotes the ORF and the CDS are the same.
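As a rough illustration of "nucleotides divided into codons", the sketch below translates a CDS codon by codon using the standard genetic code. The function name `translate_cds` is hypothetical, and the input is assumed to be an intron-free, in-frame sequence containing only A, C, G and T.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds):
    """Translate an in-frame CDS, stopping at the first stop codon.

    Assumes introns have already been removed and that the sequence
    contains only A, C, G and T (other characters raise KeyError).
    """
    cds = cds.upper()
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate_cds("ATGGCCTAA")` reads ATG, GCC and then the stop codon TAA.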

5 What is gene prediction?
Gene prediction asks the following questions:
Which region codes for a protein?
Which DNA strand is used to encode the gene?
Where does the gene start and end?
Where are the exon-intron boundaries in eukaryotes?
Where (optionally) are the regulatory sequences for that gene?
The characterization of genomic features using computational and experimental methods is called gene prediction or annotation.

6 Computational methods of gene prediction
Computational gene finding is a process of:
Identifying common phenomena in known genes.
Building a computational framework/model that can accurately describe the common phenomena.
Using the model to scan uncharacterized sequence to identify regions that match the model, which become putative genes.
Testing and validating the predictions.

7 Biological overview of ‘gene’
Gene: a segment of DNA (or RNA in some viruses) that contains the necessary information to produce a functional product, usually a protein.
Promoter: controls the activity of a gene. It is the regulatory region of the DNA located upstream (towards the 5’ region) of a gene and provides a control point for regulated gene transcription. It contains specific DNA sequences that are recognized by proteins known as transcription factors. These factors bind to the promoter sequences and recruit RNA polymerase, the enzyme that synthesizes RNA from the coding region of the gene.
Core promoter: the minimal portion of the promoter required to initiate transcription properly.
Proximal promoter: tends to contain the primary regulatory elements; serves as a binding site for specific transcription factors.
Coding sequence: determines what the gene produces.
ORF (open reading frame): starts with ATG (the start codon), though not always, and terminates with TAA, TAG or TGA (the stop codons).
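The ORF definition above (ATG start, then TAA, TAG or TGA as the stop) can be sketched as a simple scan over the three reading frames of one strand. This is only an illustrative sketch: the names `find_orfs` and `reverse_complement` and the `min_codons` cut-off are assumptions, and alternative start codons are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on one strand.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon. min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs


def reverse_complement(seq):
    """Reverse complement, so the opposite strand can be scanned too."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

Calling `find_orfs(seq)` and `find_orfs(reverse_complement(seq))` covers all six reading frames; coordinates from the second call refer to the reverse-complemented sequence.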

8 Structure of eukaryotic gene
In general, at -25 to -30 from the transcription start point, a consensus sequence called the TATA or Hogness box is found. It acts as the first recognition sequence for the assembly of the RNA polymerase complex, without which the enzymes won’t assemble. The start point itself is bracketed by a set of sequences; this region is called the InR box or InR sequence. Upstream of the TATA box, at some distance, there are several sequence boxes, such as a GC box (GGGCGG) at -90, a CAAT box at -75 and an eight-base octamer box. The number of such sequences and their positions relative to the start point or the TATA box vary. Besides the binding of the RNA polymerase complex to the TATA box, other factors binding to specific sequence boxes increase the efficiency of transcription.
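A minimal sketch of how the TATA box position described above might be checked computationally. The consensus TATA(A/T)A(A/T) used here is a commonly quoted form and is an assumption, not something stated on the slide; the function name `find_tata_boxes`, the search window and the convention that offsets are plain index differences from the start point are all illustrative choices.

```python
import re

# Assumed TATA box consensus: TATA, then A or T, then A, then A or T.
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")
MOTIF_LENGTH = 7


def find_tata_boxes(seq, tss, window=(-40, -20)):
    """Return offsets (relative to index tss) of TATA-like motifs.

    Only motifs whose first base lies inside the window upstream of the
    transcription start point are reported.
    """
    seq = seq.upper()
    lo = max(0, tss + window[0])
    hi = max(0, tss + window[1])
    hits = []
    # endpos = hi + MOTIF_LENGTH lets a motif start at index hi at the latest
    for match in TATA_PATTERN.finditer(seq, lo, hi + MOTIF_LENGTH):
        hits.append(match.start() - tss)
    return hits
```

A real promoter scan would allow mismatches, for instance with a weight matrix, rather than requiring an exact match to the consensus.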

9 Methods of gene prediction
Extrinsic/Homology method: look for something that looks like an already known gene (homology).
Intrinsic/Ab initio method: look for something that matches statistical patterns common to all genes (ab initio).
Hybrid method: combine homology and ab initio.

10 Extrinsic/Homology Method
Based on the sequence similarity of a query sequence to annotated genes present in databases. It is known that only approximately half of the genes can be found by homology to other known genes or proteins.
Based on the following principles:
Coding regions evolve more slowly than non-coding regions, i.e. local sequence similarity can be used as a gene finder.
Homologous sequences reflect a common evolutionary origin and possibly a common gene structure.
Standard pair-wise comparison methods can be used (BLAST or Smith-Waterman).
These methods can include gene syntax information (start/stop codons etc.).
Useful to confirm predictions inferred by other methods.
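To make "local sequence similarity" concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match 2, mismatch -1, linear gap -2) and the function name are arbitrary illustrative choices; real tools use substitution matrices and affine gap penalties, and BLAST relies on heuristics rather than this full dynamic programme.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                         # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,             # gap in b
                curr[j - 1] + gap,         # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

A high score between a genomic region and a known gene or protein-coding sequence suggests a conserved, and therefore possibly coding, region.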

11 Intrinsic/Ab initio Method
Predicts genes based on statistical properties of the given DNA sequence: statistical patterns inside and outside of gene regions, as well as typical patterns at their boundaries.
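One simple statistical pattern of this kind is codon usage. The sketch below estimates codon log-odds scores from known coding sequences and uses them to score a candidate region. It is an illustrative assumption rather than the method of any particular tool: the function names, the uniform 1/64 background and the pseudocount are all hypothetical choices.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codon_log_odds(training_cds, pseudocount=1.0):
    """log2(P(codon | coding) / (1/64)) estimated from in-frame CDSs."""
    counts = Counter()
    for cds in training_cds:
        cds = cds.upper()
        for pos in range(0, len(cds) - 2, 3):
            codon = cds[pos:pos + 3]
            if set(codon) <= set(BASES):  # skip ambiguous bases such as N
                counts[codon] += 1
    total = sum(counts.values()) + 64 * pseudocount
    return {
        codon: math.log2(64 * (counts[codon] + pseudocount) / total)
        for codon in ALL_CODONS
    }


def coding_score(seq, table, frame=0):
    """Sum of codon log-odds in one reading frame; higher looks more coding."""
    seq = seq.upper()
    return sum(
        table.get(seq[pos:pos + 3], 0.0)
        for pos in range(frame, len(seq) - 2, 3)
    )
```

Scoring the same window in all three frames and comparing the results gives a crude indication of both coding potential and reading frame.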

12 Features for gene prediction in eukaryotes
Two kinds of features are used: signal sensors and content sensors (extrinsic and intrinsic content sensors).
Signal sensors
Evaluate fixed-length features in DNA.
Signals: splice sites, start/stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal-binding sites, topoisomerase II binding sites, various transcription factor-binding sites etc.
These are measures that try to detect the presence of the functional sites specific to a gene.
The basic signal sensor is a simple consensus sequence, or an expression that describes a consensus sequence along with allowable variations.
Weight matrices are also used.
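A weight matrix generalizes a consensus sequence by giving each base a score at each position of the signal. The following is a minimal sketch, assuming a set of aligned example sites of equal length, a uniform background of 0.25 per base and a pseudocount of 1; the function names and the example sites in the usage note are hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i].upper() for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def best_pwm_hit(seq, pwm):
    """Return (position, score) of the highest-scoring window in seq."""
    seq = seq.upper()
    width = len(pwm)
    best_pos, best_score = None, float("-inf")
    for start in range(len(seq) - width + 1):
        # Bases outside ACGT contribute a neutral score of 0.
        score = sum(pwm[i].get(seq[start + i], 0.0) for i in range(width))
        if score > best_score:
            best_pos, best_score = start, score
    return best_pos, best_score
```

For instance, a matrix built from made-up donor-like sites such as `["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]` can be slid along a sequence with `best_pwm_hit` to locate the best-matching window.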

13 Features for gene prediction in eukaryotes (contd.)
Content sensors
Evaluate variable-length features which extend from one signal to another.
They classify a DNA region into different types, e.g. coding vs non-coding.
Extrinsic content sensor
These sensors perform similarity searching between a genomic sequence region and a protein or DNA sequence present in a database.
Basic similarity-searching tools are needed, e.g. BLAST, FASTA etc.
Intragenomic and intergenomic comparisons:
Intragenomic comparisons provide useful information regarding multigenic families, which represent a huge percentage of the existing genes.
Intergenomic or cross-species comparisons identify orthologous genes without preliminary knowledge about them.

14 Features for gene prediction in eukaryotes (contd.)
Intrinsic content sensor
Based on statistical models of the nucleotide frequencies and dependencies present in codon structure.
Use of Markov models (MM).
CpG islands: regions which often mark the beginning of genes, where the frequency of CG is not as low as it is in the rest of the genome.
Sensors for repetitive DNA (e.g. Alu sequences).
In mammalian genomes, CpG islands are typically 300-3,000 base pairs in length, and have been found in or near approximately 40% of promoters of mammalian genes. About 70% of human promoters have a high CpG content. Across the genome as a whole, CpG dinucleotides are much rarer than would be expected from the frequencies of C and G.
An Alu element/sequence is a short stretch of DNA originally characterized by the action of the Alu (Arthrobacter luteus) restriction endonuclease.
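The idea that CpG is "less depleted" inside islands can be quantified with an observed/expected ratio, (number of CG × length) / (number of C × number of G). The sketch below computes it for a window. The thresholds (length at least 200 bp, GC content above 50%, ratio above 0.6) are commonly quoted criteria and are an assumption here, not taken from the slides; the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G occurred independently: c * g / n
    if c_count and g_count:
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC content and obs/exp thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp
```

In practice the test would be applied to sliding windows along a genomic sequence, and overlapping windows that pass would be merged into candidate islands.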

15 Gene prediction tools
Software based on ab initio methods: GENSCAN, FGENESH, GeneMark.hmm, Glimmer, Genie, GeneID.
Software based on similarity-based methods: GeneWise, SYNCOD, ORFgene2, EbEST.

