BIOLOGY 3020 Fall 2008 Gene Hunting (DNA database searching)

How many transcription factors (TFs) are in corn?
[Figure: DNA and p53 transcription factor]

Lecture Outline
1: What is a gene?
2: DNA Databases today – GenBank
3: How to find a new gene in GenBank
4: How to know that you have a full length (complete) gene
5: Storing your work

1: What is a gene?
A gene is a unit of genetic information.
Genes are made of DNA (found in the cell nucleus).
One gene encodes one protein (polypeptide), which is made in the cell cytoplasm.
A messenger RNA (mRNA) mediates the expression of a gene (via the ribosome).
An organism is encoded by numerous genes (about 26,000 for humans).

Central Dogma of Molecular Genetics
DNA: all genes are present in every cell.
Only some genes are expressed in a given cell.
mRNA: the mRNA population represents the genes expressed in a given cell (tissue-specific gene expression).

How a gene is expressed
[Figure: DNA (Exon1, IntronI, Exon2, IntronII, Exon3) is transcribed into mRNA. Intron splicing and polyA tailing give the mature mRNA (Start, Open Reading Frame (ORF), Stop, AAAAAAA polyA tail). Translation on the ribosome gives the protein.]
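The splicing and polyA tailing steps can be sketched in a few lines of Python. This is a toy illustration only: the sequence, the exon coordinates and the function names are made up for the example, and DNA letters (T rather than U) are kept throughout for simplicity.

```python
def splice(pre_mrna, exons):
    """Join the exon segments and drop the introns between them.

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


def add_poly_a(mrna, length=7):
    """Append a polyA tail of the given length."""
    return mrna + "A" * length


# Hypothetical gene: three exons (upper case) separated by two introns (lower case).
pre_mrna = "ATGGCC" + "gtaagt" + "AAAGGG" + "gtcag" + "TTTTAA"
exons = [(0, 6), (12, 18), (23, 29)]

spliced = splice(pre_mrna, exons)
mature = add_poly_a(spliced)
print(spliced)  # the ORF only: ATG ... TAA
print(mature)   # the same ORF followed by the polyA tail
```

Only the spliced exons end up in the mature mRNA, which is why a cDNA made from it contains no introns.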

Where can you find a gene?
Book collections can be stored in a library.
Collections of genes can be made and stored in gene libraries!
There are 2 main kinds of gene libraries:
Genomic libraries are made from DNA and contain entire genes (exons and introns).
cDNA libraries are made from mRNAs that are converted into DNA (only exons).

cDNA libraries are very useful
A library of genes expressed in a given tissue type is a cDNA library.
To study a tissue (e.g. liver or brain), use a cDNA library from that tissue: it contains the genes expressed in that tissue.
cDNA libraries are made from mRNA, which is converted into DNA.
One cDNA clone from a cDNA library contains the coding information for one gene (with introns removed).

cDNA is made from mRNA Mature mRNA DNA/RNA Complementary DNA cDNA
[Figure: Mature mRNA (Start, Stop, AAAAAAA tail). A polyT primer (TTTTTTT), nucleotides, and reverse transcriptase are added to make a DNA/RNA hybrid. The RNA is removed (by NaOH) and the second strand is synthesized, giving complementary DNA (cDNA).]
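In sequence terms, the first cDNA strand is the reverse complement of the mRNA, and the second strand is the reverse complement of the first. A minimal Python sketch, using a made-up mRNA and hypothetical function names:

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """First-strand cDNA (written 5'->3') copied from an mRNA (written 5'->3')."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))


def second_strand_cdna(first_strand):
    """Second strand (5'->3'), synthesized after the RNA has been removed."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(first_strand))


# Hypothetical mature mRNA: a tiny ORF followed by a polyA tail.
mrna = "AUGGCCUAA" + "AAAAAAA"

first = first_strand_cdna(mrna)
second = second_strand_cdna(first)
print(first)   # starts with TTTTTTT, the copy of the polyA tail
print(second)  # same sequence as the mRNA, with T in place of U
```

The second strand reads like the original mRNA, so the ORF can be read directly from the cDNA sequence.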

A full length cDNA is hard to find
[Figure: mRNA with Start, Stop, Open Reading Frame (ORF) and AAAAAAA polyA tails.]
mRNA is degraded from the 5’ end.
Most cDNAs are not full length cDNAs (flcDNAs), and their ORF is incomplete (partial).

cDNA (EST) libraries have few flcDNAs
cDNA libraries are made and individual clones are sequenced at random.
A sequenced cDNA is called an Expressed Sequence Tag (EST).
Millions of ESTs from different tissues of different organisms are stored in GenBank, but only a few are flcDNAs!
How do you find the longest ones, and where?

2. DNA Databases today – GenBank
GenBank is housed at NCBI

The Entrez Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of bases in these databases continues to grow at an exponential rate. As of April 2006, there were over 130 billion bases in GenBank and RefSeq alone!

A virtual “Jungle” of information…….
The main information access point is Entrez (click on All databases).

3. How to find a new gene in this jungle?
Class project: clone novel transcription factor (TF) genes from corn.
A good starting point is the set of predicted TFs from rice (whose genome has been completed).
Visit the GRASSIUS website.

New NSF supported database
GRASSIUS Website

GRASSIUS Outreach section

GRASSIUS Helpful Links (On Links menu)
Maize
MAGI: Maize Assembled Genomic Island [MAGI]
MaizeGDB: MaizeGDB is the community database for information on Zea mays
The Maize Full Length Project: This project uses genomics tools to understand a fundamental biological process through identification of genes expressed during maize reproduction and in somatic tissues responding to abiotic perturbations such as heat, cold, salt, UV-B, drought, and lack of light.
Rice
TIGR Rice Genome: The TIGR Rice Genome Annotation Database and Resource is a National Science Foundation project and provides sequence and annotation data for the rice genome
RiceTFDB: RiceTFDB (2.1) is a public database arising from efforts to identify and catalogue all Oryza sativa genes involved in transcriptional control
Comparative Genomic Resources
CGGC: Comparative Grass Genomics Center (CGGC)
AGRIS: The Arabidopsis Gene Regulatory Information Server (AGRIS) is an information resource for Arabidopsis promoter sequences, transcription factors and their target genes

The genes that you clone and study will be added to this database
Grass Transcription Factor Database (GRASSTFDB)
GRASSTFDB provides a comprehensive collection of transcription factors from maize, sugarcane, sorghum and rice. Transcription factors, defined here specifically as proteins containing domains that suggest sequence-specific DNA-binding activities, are classified based on the presence of 50+ conserved domains. Links to resources that provide information on mutants available, map positions or putative functions for these transcription factors are provided.

Use Known Rice TF Gene to find related TF in maize EST database
2516 TFs in 66 families

Example: the G2-like TF family
These TFs are known to be important in the growth of plants and are found in several other species, but they have not yet been studied in corn.

Each TF gene has a unique Locus number (like a bar code)

Clicking on a locus gives more information on that particular gene
You want to retrieve the sequence (at the bottom of the page; see next slide).

Domain architecture is info on the protein product
These links give you the actual ORF (Coding Sequence CDS), entire gene or protein sequence

The actual ORF of a gene (Coding Sequence CDS).
The first codon is the start codon ATG, and the last is one of the three stop codons: TAA, TAG or TGA.
The length of the ORF is a multiple of 3.
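These three rules are easy to check by script before you paste a sequence anywhere. The sketch below is a minimal illustration; the function name `check_orf` and the test sequence are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_orf(cds):
    """Report whether a coding sequence obeys the basic ORF rules."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return {
        "starts_with_ATG": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
        "multiple_of_3": len(cds) % 3 == 0,
        # A stop codon before the last codon would end the protein early.
        "no_internal_stop": not any(c in STOP_CODONS for c in codons[:-1]),
    }


# Hypothetical 4-codon ORF: ATG GCC AAA TAA
report = check_orf("ATGGCCAAATAA")
print(report)
```

If any entry is False, the sequence is probably partial, in the wrong frame, or not a CDS at all.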

The ORF is translated into the protein sequence
The start codon ATG always encodes the amino acid methionine (M).
The * indicates the stop codon (no amino acid in the protein).
Copy and paste this sequence into a new protein molecule in VectorNTI.
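The translation itself is a table lookup, three bases at a time. Here is a minimal Python sketch using the standard genetic code; the function name and the short test sequence are made up for illustration, and unknown codons are written as X by assumption.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop (*)."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3].upper(), "X")  # X = unrecognized codon
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # MAK*
```

The output starts with M and ends with *, matching the protein sequence format shown on the slide.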

In the Protein Molecules Local Database – make a new subbase for your protein files
Click on “New Protein Molecule” and type in the name of the rice TF locus, e.g. Os01g

Click on the sequence and Features menu
“Edit sequence” Paste in the sequence and click “OK” twice

..Let the hunt for the corn TF begin!
Using the rice TF as a starting point: highlight the protein sequence and click on Tools …

Do a BLAST search (like a google search) to search the GenBank
We will use the NCBI BLAST server

There are 5 different BLAST programs to choose from
BLAST stands for Basic Local Alignment Search Tool. (It is like doing a Google search.)

Select the tblastn program and the est_others database, then submit
When the search shows “Finished”, click on the file
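tblastn compares a protein query against a nucleotide database that has been translated in all six reading frames. The toy Python sketch below shows only that idea: it uses exact substring matching instead of BLAST's local alignment scoring, and the function names and the three "ESTs" are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate every complete codon, keeping stop codons as *."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(nt):
    """Translations in frames +1, +2, +3 and -1, -2, -3 (reverse complement)."""
    nt = nt.upper()
    reverse = nt.translate(COMPLEMENT)[::-1]
    frames = {}
    for k in range(3):
        frames[k + 1] = translate(nt[k:])
        frames[-(k + 1)] = translate(reverse[k:])
    return frames


def toy_tblastn(protein_query, nt_database):
    """Return (name, frame) for every sequence whose translation contains the query."""
    hits = []
    for name, nt in nt_database.items():
        for frame, protein in six_frames(nt).items():
            if protein_query in protein:
                hits.append((name, frame))
    return hits


# Hypothetical mini database of ESTs.
database = {
    "est1": "CATGGCCAAATAA",   # query encoded in forward frame 2
    "est2": "TTATTTGGCCATG",   # query encoded on the reverse strand
    "est3": "GGGGGGGGG",       # no match
}
hits = toy_tblastn("MAK", database)
print(hits)
```

The real program also finds inexact matches and ranks them by score, which is what lets a rice protein find a similar but not identical corn EST.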

BLAST Report has graphic and list windowpanes

In windowpane A is info on each “Hit” against the database (here there are 500 hits)
The first is with a corn (Zea mays) mRNA (EST).
In windowpane C the arrows show how the query sequence (Q:1) lines up with the highlighted hit (H:1) (top blue line in windowpane D).

The actual alignment of the sequence Q1 with the 1st hit is shown in windowpane E
Note that amino acid 91 of Q1 aligns with nucleotide 64 of H1 (= amino acid 21), so hit 1 is a partial cDNA (NOT full length). However….

Scrolling down we find that another blue line does overlap with the beginning of the query
Now amino acid 22 overlaps with bp 347 of the corn EST with the GenBank accession EE188556. This one looks like it is a flcDNA. Click on this and the GenBank file will open….
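The reasoning behind "partial" versus "possibly full length" is simple arithmetic: count how many codons the EST has upstream of the aligned position and compare with how many amino acids of the query still lie upstream. The Python sketch below uses the two pairs of numbers from these slides; the function name is made up, and the check assumes the corn protein starts at about the same place as the rice one.

```python
def could_cover_query_start(query_aa_start, hit_nt_start):
    """Can the hit hold the part of the query that lies upstream of the alignment?

    query_aa_start: first aligned amino acid of the query (1-based)
    hit_nt_start:   first aligned nucleotide of the hit (1-based)
    """
    needed_codons = query_aa_start - 1          # query residues before the alignment
    available_codons = (hit_nt_start - 1) // 3  # whole codons before the alignment
    return available_codons >= needed_codons


# First hit: query amino acid 91 aligns with nucleotide 64 of the EST.
first_hit_ok = could_cover_query_start(91, 64)     # 21 codons available, 90 needed
# Later hit: query amino acid 22 aligns with bp 347 of the EST.
second_hit_ok = could_cover_query_start(22, 347)   # 115 codons available, 21 needed
print(first_hit_ok, second_hit_ok)
```

A True result is necessary but not sufficient: you still need to find an in-frame ATG upstream of the aligned region before calling the clone full length.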

Now the new gene is in your sights!
GenBank file EE seems to be a flcDNA.
By highlighting the sequence, translating it in different frames, and comparing with the BLAST result, it can be seen that the correct ORF is in frame 2.
Extending back about 45 amino acids from the shared region, we find a Met (ATG start codon).
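"Extending back" from the shared region can be sketched as a walk upstream, one codon at a time, in the frame of the alignment. The Python below is a toy illustration with a made-up sequence and a hypothetical function name; it returns the most upstream in-frame ATG found before an in-frame stop codon or the 5' end of the sequence is reached.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_start(seq, frame, anchor_nt):
    """Find the most upstream in-frame ATG before an in-frame stop or the 5' end.

    frame:     1, 2 or 3 (forward strand)
    anchor_nt: 1-based position of any nucleotide inside the aligned region
    Returns the 1-based position of the A of the ATG, or None if there is none.
    """
    seq = seq.upper()
    offset = frame - 1
    # 0-based index of the first base of the codon that contains anchor_nt.
    i = offset + ((anchor_nt - 1 - offset) // 3) * 3
    start = None
    while i >= offset:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break                # the ORF cannot extend past an in-frame stop
        if codon == "ATG":
            start = i + 1
        i -= 3
    return start


# Hypothetical EST: C | TAG GCC ATG AAA GGG TTT TGA  (ORF is in frame 2)
est = "CTAGGCCATGAAAGGGTTTTGA"
print(upstream_start(est, frame=2, anchor_nt=17))  # 8
print(upstream_start(est, frame=1, anchor_nt=17))  # None
```

If the walk reaches the 5' end without meeting a stop codon, the ATG found may not be the true start, because the clone could still be truncated.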

Record the GenBank number EE188556
In the comments file, make sure that the clone is available from the Arizona Genomics Institute.
The plate location is needed to request the clone.

Save the GenBank file into VectorNTI; you will use it in the second part of the course
Export the file as an archive and email it with your group number and GenBank file in the subject line, e.g. Group5 EST-EE188556

In your lab report include the following in the Results Section for this lab
1: The rice locus number that you started with
2: The protein sequence of the rice gene and a brief description of the TF family to which it belongs
3: The GenBank accession number of the maize EST that appears to be a flcDNA similar to the rice TF
4: The Arizona Genomics Institute plate number
Congratulations! Now you have hunted down a new gene, and you will clone it in the 2nd and 3rd parts of the course.
