BIOLOGY 3020 Fall 2008 Gene Hunting (DNA database searching)


How many transcription factors (TFs) are there in corn?
[Figure: DNA and the p53 transcription factor]

Lecture Outline
- What is a gene?
- DNA databases today: GenBank
- How to find a new gene in GenBank
- How to know that you have a full length (complete) gene
- Storing your work

1: What is a gene?
- A gene is a unit of genetic information.
- Genes are made of DNA (found in the cell nucleus).
- One gene encodes one protein (polypeptide), which is made in the cell cytoplasm.
- A messenger RNA (mRNA) mediates the expression of a gene (via the ribosome).
- An organism is encoded by numerous genes (about 26,000 for humans).

Central Dogma of Molecular Genetics
- DNA: all genes are present in every cell.
- Only some genes are expressed in a given cell.
- The mRNA population represents those genes expressed in a given cell (tissue-specific gene expression).

How a gene is expressed
[Figure: The DNA of a gene (Exon 1, Intron I, Exon 2, Intron II, Exon 3) is transcribed into mRNA. Intron splicing and polyA tailing produce the mature mRNA, which carries an Open Reading Frame (ORF) between the start and stop codons, followed by a polyA tail (AAAAAAA). Translation on the ribosome produces the protein.]

Where can you find a gene?
Book collections can be stored in a library. Collections of genes can also be made and stored, in gene libraries. There are 2 main kinds of gene libraries:
- Genomic libraries are made from DNA and contain entire genes (exons and introns).
- cDNA libraries are made from mRNAs that are converted into DNA (exons only).

cDNA libraries are very useful
A cDNA library is a library of the genes expressed in a given tissue type. To study a tissue (e.g. liver or brain), use a cDNA library from that tissue, because it contains the genes used to make that tissue. cDNA libraries are made from mRNA, which is converted into DNA. One cDNA clone from a cDNA library contains the coding information for one gene (with introns removed).

cDNA is made from mRNA
[Figure: Mature mRNA (start codon, stop codon, polyA tail AAAAAAA) is converted to cDNA in two steps.
Step 1: Add polyT primer (TTTTTTT), nucleotides, and Reverse Transcriptase to obtain a DNA/RNA hybrid.
Step 2: The RNA is removed (by NaOH) and the second strand is synthesized, giving double-stranded complementary DNA (cDNA).]
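The two steps in the figure can be mimicked on paper with a few lines of Python. This is only an illustrative sketch: the function names and the short mRNA sequence are made up for the example, and real reverse transcription is of course an enzymatic reaction, not string manipulation.

```python
# Illustrative model of cDNA synthesis (hypothetical names and sequence).
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first cDNA strand (5'->3') copied from an mRNA (5'->3').

    Reading the mRNA backwards and pairing each base mimics reverse
    transcriptase extending a polyT primer annealed to the polyA tail.
    """
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))


def second_strand_cdna(first_strand):
    """Return the second cDNA strand (5'->3'), complementary to the first."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(first_strand))


# Made-up mature mRNA: start codon, one codon, stop codon, short polyA tail.
mrna = "AUGGCCUAA" + "A" * 6
first = first_strand_cdna(mrna)
second = second_strand_cdna(first)

print(first)   # begins with the polyT stretch that paired with the polyA tail
print(second)  # same as the mRNA, written in DNA letters (T instead of U)
```

Note that the second strand reads like the original mRNA with T in place of U, which is why a cDNA sequence can be translated directly to check its ORF.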

A full length cDNA is hard to find
[Figure: mRNA is degraded from the 5' end, so copies of the same mRNA keep their polyA tail (AAAAAAA) but contain varying amounts of the Open Reading Frame (ORF) between the start and stop codons.]
Most cDNAs are not full-length cDNAs (flcDNAs); their ORF is incomplete (partial).

cDNA (EST) libraries have few flcDNAs
cDNA libraries are made and individual clones are sequenced at random. A sequenced cDNA is called an Expressed Sequence Tag (EST). Millions of ESTs from different tissues of different organisms are stored in GenBank, but only a small fraction are flcDNAs. How do you find the longest ones, and where?

2. DNA Databases today – GenBank
GenBank is housed at NCBI: www.ncbi.nlm.nih.gov

The Entrez Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of bases in these databases continues to grow at an exponential rate. As of April 2006, there were over 130 billion bases in GenBank and RefSeq alone!

A virtual “Jungle” of information
The main information access point is Entrez (click on “All databases”).

3. How to find a new gene in this jungle?
The class project is to clone novel transcription factor (TF) genes from corn. A good starting point is the set of predicted TFs from rice (whose genome has been completed). Visit the GRASSIUS website.

GRASSIUS Website
New NSF supported database: www.grassius.org

GRASSIUS Outreach section

GRASSIUS Helpful Links (on the Links menu)

Maize
- MAGI: Maize Assembled Genomic Island [MAGI].
- MaizeGDB: the community database for information on Zea mays.
- The Maize Full Length Project: this project uses genomics tools to understand a fundamental biological process through identification of genes expressed during maize reproduction and in somatic tissues responding to abiotic perturbations such as heat, cold, salt, UV-B, drought, and lack of light.

Rice
- TIGR Rice Genome: the TIGR Rice Genome Annotation Database and Resource is a National Science Foundation project and provides sequence and annotation data for the rice genome.
- RiceTFDB: RiceTFDB (2.1) is a public database arising from efforts to identify and catalogue all Oryza sativa genes involved in transcriptional control.

Comparative Genomic Resources
- CGGC: Comparative Grass Genomics Center (CGGC).
- AGRIS: the Arabidopsis Gene Regulatory Information Server (AGRIS) is an information resource for Arabidopsis promoter sequences, transcription factors and their target genes.

Grass Transcription Factor Database (GRASSTFDB)
GRASSTFDB provides a comprehensive collection of transcription factors from maize, sugarcane, sorghum and rice. Transcription factors, defined here specifically as proteins containing domains that suggest sequence-specific DNA-binding activities, are classified based on the presence of 50+ conserved domains. Links to resources that provide information on mutants available, map positions or putative functions for these transcription factors are provided. The genes that you clone and study will be added to this database.

Use a known rice TF gene to find a related TF in the maize EST database
2516 TFs in 66 families

Example: the G2-like TF family
These TFs are known to be important in the growth of plants and are found in several other species, but they have not yet been studied in corn.

Each TF gene has a unique Locus number (like a bar code)

Clicking on a locus gives more information on that particular gene
You want to retrieve the sequence (at the bottom of the page; see the next slide).

Domain architecture is information on the protein product
These links give you the actual ORF (Coding Sequence, CDS), the entire gene, or the protein sequence.

The actual ORF of a gene (Coding Sequence, CDS)
The first codon is the start codon ATG, and the last codon is one of the three stop codons: TAA, TAG or TGA. The length of the ORF is a multiple of 3.
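The three rules on this slide are easy to check by computer. The following is a minimal sketch; the function name and the test sequences are invented for illustration, and it adds one extra assumption not stated on the slide, namely that a complete ORF has no stop codon before its last codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_complete_orf(cds):
    """Check a coding sequence against the rules for a complete ORF.

    Rules checked: length is a multiple of 3, first codon is ATG,
    last codon is a stop codon, and (extra assumption) no stop codon
    occurs earlier in the reading frame.
    """
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    starts_with_atg = codons[0] == "ATG"
    ends_with_stop = codons[-1] in STOP_CODONS
    has_internal_stop = any(codon in STOP_CODONS for codon in codons[:-1])
    return starts_with_atg and ends_with_stop and not has_internal_stop


print(looks_like_complete_orf("ATGGCCTAA"))     # start, one codon, stop
print(looks_like_complete_orf("ATGGCCTA"))      # length not a multiple of 3
print(looks_like_complete_orf("GCCATGTAA"))     # does not begin with ATG
print(looks_like_complete_orf("ATGTAAGCCTGA"))  # stop codon in the middle
```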

The ORF is translated into the protein sequence
The start codon ATG always encodes the amino acid methionine (M). The * indicates the stop codon (no amino acid in the protein). Copy and paste this sequence into a new protein molecule in VectorNTI.
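To see how an ORF becomes a protein sequence with M at the start and * at the end, here is a small sketch using the standard genetic code. The helper name `translate` and the example sequence are made up for this illustration; the website and VectorNTI do this step for you.

```python
# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds):
    """Translate a DNA coding sequence codon by codon.

    Translation stops after the first stop codon, written as '*'.
    Codons containing letters other than A, C, G, T become 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


# Made-up ORF: ATG (M), GCC (A), AAA (K), TGA (stop).
print(translate("ATGGCCAAATGA"))
```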

In the Protein Molecules Local Database, make a new subbase for your protein files
Click on “New Protein Molecule” and type in the locus name of the rice TF, e.g. Os01g08160.1

Click on the sequence and choose “Edit sequence” (Features menu)
Paste in the sequence and click “OK” twice.

Using the rice TF as a starting point, let the hunt for the corn TF begin!
Highlight the protein sequence and click on Tools …

Do a BLAST search (like a Google search) to search GenBank
We will use the NCBI BLAST server.

There are 5 different BLAST programs to choose from
BLAST stands for Basic Local Alignment Search Tool. (Like doing a Google search.)

Select the tblastn program and the “est others” database, and then submit
When the search shows “Finished”, click on the file.
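Why tblastn? The five BLAST programs differ in the type of query and the type of database they compare. The lookup below is only a study aid (the dictionary and function names are invented here, and it does not contact the NCBI server): our query is a rice protein and the EST database holds nucleotide sequences, which leaves tblastn as the one matching program.

```python
# The five classic BLAST programs: (query type, database type).
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),   # compared directly as DNA
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide", "protein"),      # query is translated first
    "tblastn": ("protein", "nucleotide"),     # database is translated
    "tblastx": ("nucleotide", "nucleotide"),  # both sides are translated
}


def programs_for(query_type, database_type):
    """Return the BLAST programs that accept this query/database pair."""
    return sorted(
        name
        for name, types in BLAST_PROGRAMS.items()
        if types == (query_type, database_type)
    )


# Rice TF protein sequence searched against nucleotide ESTs:
print(programs_for("protein", "nucleotide"))
```

Because tblastn translates every EST in all reading frames before comparing, it can find a corn EST that encodes a similar protein even when the DNA sequences have diverged.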

BLAST Report has graphic and list windowpanes

Windowpane A gives information on each “Hit” against the database (here there are 500 hits). The first hit is with a corn (Zea mays) mRNA (EST).
In windowpane C the arrows show how the query sequence (Q:1) lines up with the highlighted hit (H:1) (top blue line in windowpane D).

The actual alignment of the sequence Q1 with the 1st hit is shown in windowpane E
Note that amino acid 91 of Q1 aligns with nucleotide 64 of H1 (= amino acid 21), so hit 1 is a partial cDNA (NOT full length). However…

Scrolling down we find that another blue line does overlap with the beginning of the query
Now amino acid 22 overlaps with bp 347 of the corn EST with the GenBank accession EE188556. This one looks like it is a flcDNA. Click on this hit and the GenBank file will open…
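The reasoning behind “partial” versus “looks like a flcDNA” is simple arithmetic: does the EST have enough sequence upstream of the alignment to hold the part of the query protein that comes before it? The sketch below uses the numbers from the two hits above. The function name is made up, and the rule is only a rough heuristic that assumes the corn protein has about as many amino acids before the aligned region as the rice protein does.

```python
def may_contain_orf_start(query_aa_pos, hit_nt_pos):
    """Rough test of whether an EST could reach back to the start codon.

    query_aa_pos: first query amino acid in the alignment (1-based).
    hit_nt_pos:   EST nucleotide aligned to that amino acid (1-based).
    """
    codons_needed = query_aa_pos - 1           # query residues before the alignment
    codons_available = (hit_nt_pos - 1) // 3   # whole codons upstream in the EST
    return codons_available >= codons_needed


# Hit 1: amino acid 91 of the query aligns with nucleotide 64 of the EST.
print(may_contain_orf_start(91, 64))    # only 21 codons upstream, 90 needed

# EE188556: amino acid 22 of the query aligns with bp 347 of the EST.
print(may_contain_orf_start(22, 347))   # 115 codons upstream, 21 needed
```

A positive answer is only a hint; the ORF still has to be confirmed by translating the EST and locating the start codon.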

Now the new gene is in your sights!
GenBank file EE188556 seems to be a flcDNA. By highlighting the sequence and translating it in different frames, and then comparing with the BLAST result, it can be seen that the correct ORF is in frame 2. Extending back from the shared region by about 45 amino acids, we find a Met (ATG start codon).
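“Extending back” means stepping upstream from the aligned region one codon (3 bases) at a time, staying in the same reading frame, until a start codon is found or a stop codon ends the frame. A minimal sketch of that walk follows; the function name and the toy sequence are invented (this is not the EE188556 sequence), and picking the most upstream in-frame ATG is a simplifying assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_upstream_start(est, aligned_nt_start):
    """Walk upstream in frame and return the 1-based position of the ATG.

    est:              EST nucleotide sequence.
    aligned_nt_start: 1-based position of the first base of an aligned codon.
    Returns the most upstream in-frame ATG that is reached before a stop
    codon or the 5' end of the EST, or None if there is no such ATG.
    """
    est = est.upper()
    start = None
    pos = aligned_nt_start - 1  # 0-based index of the aligned codon
    while pos >= 0:
        codon = est[pos:pos + 3]
        if codon in STOP_CODONS:
            break               # the reading frame is closed upstream of here
        if codon == "ATG":
            start = pos + 1     # remember it, but keep looking further back
        pos -= 3
    return start


# Toy EST: C | TGA | ATG | GCC | AAA | GGG | TTT, alignment starting at GGG.
toy_est = "CTGAATGGCCAAAGGGTTT"
atg_position = find_upstream_start(toy_est, 14)
print(atg_position)                 # position of the A of ATG
print((atg_position - 1) % 3 + 1)   # reading frame (1, 2 or 3) of that ATG
```

If the walk reaches the 5' end of the EST without meeting either an ATG or a stop codon, the clone is probably truncated and the ORF is partial.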

Record the GenBank number EE188556
In the comments field, make sure that the clone is available from the Arizona Genomics Institute. The plate location is needed to request the clone.

Save the GenBank file into VectorNTI; you will use it in the second part of the course
Export the file as an archive and email it to jgray5@utnet.utoledo.edu with your group number and GenBank file in the subject line, e.g. Group5 EST-EE188556

In your lab report, include the following in the Results section for this lab:
1: The rice locus number that you started with
2: The protein sequence of the rice gene and a brief description of the TF family to which it belongs
3: The GenBank accession number of the maize EST that appears to be a flcDNA similar to the rice TF
4: The Arizona Genomics Institute plate number

Congratulations! You have now hunted down a new gene, and you will clone it in the 2nd and 3rd parts of the course.