Cloning and Sequencing Explorer Series

2 Cloning and Sequencing Explorer Series
Bioinformatics

3 Instructors
Stan Hitomi: Coordinator – Math & Science; Principal – Alamo School; San Ramon Valley Unified School District, Danville, CA
Kirk Brown: Lead Instructor, Edward Teller Education Center; Science Chair, Tracy High School and Delta College, Tracy, CA
Bio-Rad Curriculum and Training Specialists: Sherri Andrews, Ph.D.; Essy Levy, M.Sc.; Leigh Brown, M.A.

4 Bioinformatics The application of information technology to molecular biology

5 Questions Concerning your Data
Class Data Set: Are our sequences high quality? Are my sequences similar to GAPDH? Are any of my sequences primarily cloning vector?
Individual Clone Sequences: Do my individual sequences align to give me a single long sequence? Are there discrepancies between my reads? Which GAPDH gene did we clone?
Annotation of Clone Sequence: What is the intron-exon structure/mRNA sequence of my clone? What is the protein sequence of my clone?

6 Sequence Data Analysis Tools
Sequence data storage and analysis tools (iFinch and FinchTV)
Sequence comparison algorithm (NCBI BLAST)
Sequence assembly (CAP3)
mRNA sequence prediction (BLAST and manual)
Protein sequence prediction (EMBL-EBI EMBOSS Transeq)

7 Advanced Preparation
Practice with iFinch using the guest account (highly recommended!)
Activate your iFinch account (2-month subscription)
Download FinchTV onto lab computers
Set up project and folder in iFinch
Upload sequence data

8 Guest iFinch Account http://classroom1.bio-rad.ifinch.com/Finch
Username: BR_guest
Password: guest
Example data sets for each stage of the process
No uploading of data

9 Your own iFinch account
Each account has a unique URL.
Instructor’s Username: Platenumber, e.g. A150936
Instructor’s Password: Platenumber, e.g. A150936
Student Username: Platenumber_student, e.g. A150936_student
Student Password: Platenumber, e.g. A150936
Once activated, change your passwords!
Active for 2 months.

10 Download FinchTV

11 Make project & folder and upload data to iFinch: Demo
Walk through making a project (Demo plate # A180255)
Making a folder
Uploading data to iFinch

12 Student Activities
Review data quality and view sequence traces
Use BLAST for a preliminary check on which GAPDH was cloned
Assemble sequences into a contig
Verify which GAPDH gene was cloned
Predict intron-exon boundaries and generate mRNA sequence
Predict protein sequence

13 Sequence Quality

14 Q20 values
The quality value of a “base call” is $Q = -10 \log_{10}(P_{\text{error}})$, where $P_{\text{error}}$ is the probability of an error. Thus if the chance that a base call is incorrect is 1/100, $P_{\text{error}}$ would be 0.01 and the quality value would be 20 (Q = 20).
Convention rates sequences by the number of base calls that have quality values of 20 or higher: a Q20 value.
The quality values of a sequence are calculated automatically by software in iFinch. A common program for this was developed by the University of Washington and is called “Phred”.
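The quality formula and the Q20 count can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how iFinch or Phred is implemented; the function names and the example quality values are hypothetical.

```python
import math

def phred_quality(p_error: float) -> float:
    """Quality value Q = -10 * log10(P_error)."""
    return -10 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Inverse of the formula above: P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def q20_count(qualities, threshold=20):
    """Number of base calls whose quality value is at or above the threshold."""
    return sum(1 for q in qualities if q >= threshold)

# Hypothetical per-base quality values for a very short read.
example_read = [8, 12, 25, 40, 38, 20, 19, 33]

print(phred_quality(0.01))      # quality for a 1-in-100 error chance
print(error_probability(30))    # error chance for Q = 30
print(q20_count(example_read))  # bases with Q >= 20
```

A real read has hundreds of base calls, so Q20 values in the hundreds (as on the next slide) indicate long stretches of reliable sequence.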

15 Sequence Quality Q20= 732 Q20= 161 Q20= 238

16 iFinch and Sequence Quality

17 Screen for poor quality sequence, vector, GAPDH family
Folder “class data set”
FZNN1971.b2_C07.ab1 – shows trimmed regions, vector, GAPDH
Open in FinchTV and show trimmed regions and vector

18 Class Data Set
Find the column labeled Q20. These data correspond to the number of bases in each read that have a quality value of 20 or greater.
Click the column labeled Q20 to sort the data so that the lower-quality reads (low numbers of Q20 bases) appear at the top of the table.
If there are more than 25 rows in your folder, click all to view them on one screen.
Click one of the chromatogram labels where the Q20 value is greater than zero to view the Chromatogram Read report.
Notice the graph of the quality values. The dotted line marks where the quality value is equal to 20. Do you see many quality values above 20? Did many bases need to be trimmed?
At the top of the Chromatogram Read page click Open in FinchTV to view the trace in FinchTV.
Do the same with a read that has a high Q20 value.

19 Sort Class Data into Folders

20 Record Data Information

21 Download sequences for initial screen using BLAST
Open the Guest iFinch account (User: BR_guest, Psw: guest)
Click: Folders
Click: Salvia folder
Look at the data
Go back to the folder report
Click: Download folder data, and save to a new folder on the hard drive
View the FASTA format in MSWord or a text editor
Upload the file back to iFinch
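For readers who have not seen FASTA before: each record is a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal reader written for illustration only; the read names and sequences are made up, and a real downloaded file will have its own headers.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text.

    The name is the first word after '>' on each header line.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Sequence lines belonging to one record are joined into one string.
    return {n: "".join(parts) for n, parts in records.items()}

# Hypothetical two-read file (not real class data).
example = """>A01 hypothetical read
ATGGCT
AAGATC
>C01 hypothetical read
GGTATC
"""

reads = parse_fasta(example)
for read_name, seq in reads.items():
    print(read_name, len(seq), seq)
```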

22 BLAST sequences for initial screen
Click NCBI BLAST on the iFinch homepage
Choose nucleotide search
Browse for the downloaded salvia.fsa file to upload
Choose “Others (nr etc)”
Select “Reference Genomic sequences”
Choose “Plants (taxid)”
Choose “Somewhat similar sequences (blastn)”
Click BLAST

23 BLAST Results
All 4 sequences were analyzed by BLAST; choose each one from the pull-down menu at the top of the page.
Mouse over the top bar.
Scroll down to the list of homologous sequences. The E value represents the number of equally good sequence matches to the query sequence that would be expected in a database of the same size containing random sequences.
Scroll down to the sequence alignments. Query: your sequence. Subject: the matching database sequence.
Instructor notes: In the list of homologous sequences, describe each field (see below). In the sequence alignment, point out the query versus subject sequence and the sequence coordinates. Then have the class look at the four sequences and decide which GAPC gene was cloned. Ask whether the same clone could match different genes, and why.
a. Accession: An accession number is the unique identifier given to a DNA sequence when it is submitted to a database. (It can also refer to a submitted protein sequence.) In this case, since the Arabidopsis genome has been completely sequenced, the top accession number refers to the chromosome.
b. Description: The description refers to the source of the matching sequence. Again, since the entire Arabidopsis genome is in the database, these matches are to a complete chromosome sequence. This would not be the case for many BLAST searches, where the description might be for an organism and a putative gene or DNA fragment.
c. Max score: Each of the colored bars in the BLAST alignment graph (at the top of the BLAST search results page) has been assigned a score based on the extent of the match. The max score comes from the block of aligned sequence that had the highest score. Because the blastn score is about twice the number of matching nucleotides, it is possible to estimate that the maximum score of 232 for the top sequence represents either 116 matching bases or a longer region that contains gaps.
d. Total score: The total score is obtained by adding the scores from every region of the query sequence that matches any region of the database sequence (chromosomal in this case). In this example, the total score is not very helpful because it represents the total from all the matching regions on a single chromosome. Since both chromosomes 1 and 3 of Arabidopsis contain multiple copies of the GAPDH genes, this score is not informative.
e. Query coverage: The query coverage corresponds to the fraction of the entire query sequence that is matched by parts of the subject sequence. In this case, for the top match, 69% of the query sequence matches the subject sequence.
f. E value: The E value can have such a large range that it is reported as a power of 10 (expressed as an exponent; for example, e-2 means $10^{-2}$). For each subject sequence or match in the database, the E value represents the number of equally good sequence matches to the query sequence that would be expected in a database of the same size containing random sequences. When E values are below 1 they can be translated to the probability that two sequences will match to the same extent. This would mean that with an E value of 0.01 there is a 1% chance of finding an equally good match in a database of random sequences. While low E values are good, high E values suggest that it is possible to find an equally good match by random chance. In the top row of this example, the E value is $8 \times 10^{-59}$. This means that there is a 1 in $1.25 \times 10^{58}$ chance of finding this match in a database of random sequences. In other words, a match like this is not likely to occur by random chance. Two additional factors have a strong influence on E values: the length of the sequence, because it is easier to find a perfect match to a shorter sequence than to a longer one, and the size of the database, because it is also easier to find a match in a larger database than in a smaller one.
g. Max ident (maximum identity): This column shows the block of sequence that has the highest percentage of matching bases. In this example the maximum identity of any matching block is 96% with Oryza sativa (rice). However, when you scroll down and examine the matching regions in more detail, you will find that the region where 96% of the bases match is only 28 nucleotides long (see below). This is a good example of how short sequences can give a good match that is not meaningful. In this example, then, max ident is not a very useful statistic.
h. Links: The final column in the blastn alignment table contains links to other databases that are identified in a key above the table on the BLAST results page. In this example, there are no links to other databases, and for this analysis these links will probably not be useful.
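The arithmetic behind two of these fields can be checked with a short Python sketch. It assumes the usual relation between an E value and the probability of at least one chance match, P = 1 − e^(−E), which is approximately E when E is small; the function names are hypothetical, and the "score is about twice the matching bases" rule is the rough estimate quoted in the notes, not an exact formula.

```python
import math

def e_to_probability(e_value: float) -> float:
    """Probability of at least one equally good chance match: P = 1 - exp(-E).

    -expm1(-E) is used instead of 1 - exp(-E) to keep precision for tiny E.
    """
    return -math.expm1(-e_value)

def approx_matching_bases(max_score: float) -> float:
    """Rough estimate from the notes: blastn score is about 2x the matching bases."""
    return max_score / 2

print(e_to_probability(0.01))       # close to 0.01, i.e. about a 1% chance
print(1 / e_to_probability(8e-59))  # "1 in N" odds for the top hit
print(approx_matching_bases(232))   # estimated matching bases for a score of 232
```

For E values well below 1 the probability and the E value are practically the same number, which is why the notes treat them interchangeably.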

24 Which GAPC Gene?
Sequence G01, the poor-quality sequence, does not match anything. Possibly it is just intronic sequence (which we would not expect to match, since only the exons will be homologous between species), or the sequence quality is too poor to match.
Things that help: 1) Whether there is a name for the gene (sometimes there is just the chromosome number). 2) The base pairs of the genome location (see the image on the slide and then the base pairs on the subject sequence). 3) Whether it is on Arabidopsis chromosome 1 (could be GAPC2, GAPCP2, GAPCP1) or chromosome 3 (likely GAPC).

25 Break time!

26 Questions Concerning your Data
Class Data Set: Are our sequences high quality? Are my sequences similar to GAPDH? Are any of my sequences primarily cloning vector?
Individual Clone Sequences: Do my individual sequences align to give me a single long sequence? Are there discrepancies between my reads? Which GAPDH gene did we clone?
Annotation of Clone Sequence: What is the intron-exon structure/mRNA sequence of my clone? What is the protein sequence of my clone?

27 Initial Screen Result
We have cloned the Salvia GAPC gene.
Now we need to put the sequences together to make a contig (contiguous sequence).
Then correct any sequence discrepancies between different reads.
Reasons for discrepancies: sequencing errors, homology of internal GAP SEQ primers. In a single clone these would not be PCR errors, since the clone comes from a single piece of DNA.

28 CAP3 Program (Contig Assembly Program)
On the iFinch home page click “sequence assembly”.
Note: You already have the sequence in FASTA format that was used in the earlier BLAST search, so you just cut and paste the sequence into this page. This is an opportunity to reiterate FASTA format.

29 Assembly Results
Your sequence file (your input)
Single sequences (any sequences that could not be assembled)
Contigs (save in FASTA format as a “.txt” file)
Assembly details (save as a landscape “.txt” file)

30 Salvia Contig A01 I01 C01 G01

31 Check for Discrepancies
Look through the assembly file for sequence discrepancies.
Open the chromatogram files in FinchTV.
Examine the actual chromatograms and use personal judgment to determine which base call is correct.
Correct the FinchTV file and save it back to iFinch (not available in the guest account), noting the changes in the revision history.
If the consensus sequence has changed, download the folder sequences again as before and reassemble with the CAP3 program.
This is an iterative process. Good quality data makes it much easier and much faster.
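Scanning an assembly for discrepancies amounts to checking each aligned column for disagreement between the reads that cover it. The sketch below illustrates that idea on a made-up pair of aligned reads; it does not parse real CAP3 output, and the "." (read does not cover this position) convention is an assumption chosen for the example.

```python
def find_discrepancies(aligned_reads, uncovered="."):
    """Return [(1-based column, {read: base})] where covering reads disagree.

    aligned_reads maps read names to equal-length aligned strings.
    """
    names = list(aligned_reads)
    length = len(aligned_reads[names[0]])
    discrepancies = []
    for i in range(length):
        # Collect the base call from every read that covers this column.
        calls = {n: aligned_reads[n][i] for n in names
                 if aligned_reads[n][i] != uncovered}
        if len(set(calls.values())) > 1:
            discrepancies.append((i + 1, calls))
    return discrepancies

# Hypothetical overlapping reads with one disagreement in the overlap.
alignment = {
    "A01": "ATGGCTAAGA..",
    "I01": "..GGCTTAGATC",
}

for column, calls in find_discrepancies(alignment):
    print(column, calls)
```

Each reported column is a place to open both chromatograms in FinchTV and decide by eye which base call to trust.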

32 BLAST search with contig
Submit the contig FASTA file for a BLAST search (same database as before: the plant reference genomic database).
We did this previously for the individual sequences. Now we are doing it for the entire clone, which should give us more confidence in the result since we are looking for a much longer match.

33 Break time!

34 Questions Concerning your Data
Class Data Set: Are our sequences high quality? Are my sequences similar to GAPDH? Are any of my sequences primarily cloning vector?
Individual Clone Sequences: Do my individual sequences align to give me a single long sequence? Are there discrepancies between my reads? Which GAPDH gene did we clone?
Annotation of Clone Sequence: What is the intron-exon structure/mRNA sequence of my clone? What is the protein sequence of my clone?

35 Determine Gene Structure

36 Workflow
Mention that GenBank has sequences from mRNA, from cDNA libraries, and from complete genomic DNA. Thus by comparing our genomic DNA to a database of mRNA sequences we can see which parts of the genomic sequence are not present in the mRNA and therefore must be intronic.
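Once the BLAST alignment against an mRNA gives the coordinates of each matching block on the contig, the draft mRNA is just those blocks joined in order, and the gaps between them are the introns. A minimal sketch, assuming 1-based inclusive coordinates as reported in BLAST alignments; the toy sequence and coordinates are invented for illustration.

```python
def splice_exons(genomic, exon_ranges):
    """Join exon blocks (1-based, inclusive coordinates) into a draft mRNA."""
    return "".join(genomic[start - 1:end] for start, end in sorted(exon_ranges))

def intron_ranges(exon_ranges):
    """Return the 1-based inclusive gaps between consecutive exon blocks."""
    exons = sorted(exon_ranges)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
            if next_start - prev_end > 1]

# Toy contig: exons in upper case, a made-up intron in lower case.
contig = "ATGGCC" + "gtaagtttcag" + "AAGTAA"
exons = [(1, 6), (18, 23)]  # hypothetical coordinates of the matching blocks

print(splice_exons(contig, exons))  # exons joined into one sequence
print(intron_ranges(exons))         # region absent from the mRNA
```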

37 BLAST Search Against Reference mRNA Database
Blastn search with the contig against the plant Reference mRNA sequences database
Change Algorithm parameters
Look at total scores for BLAST results
Many results are from hypothetical proteins
On 4/18 there were 3 GAPC mRNAs in the top 10 hits: Arabidopsis: NM_111283; Maize: NM_; Wine grapes: XM_ (predicted)

38 Reformat BLAST results
Reformat results in plain text format
Save files to the iFinch folder

39 Save Contig File in MSWord
Delete all paragraph marks using the find and replace command.
Save to the hard drive as an “.rtf” file with a new name.
Color the contig sequence with exons as determined from the BLAST results.
Put the exons together in a first draft of the mRNA sequence and save to the iFinch folder.
Submit the draft mRNA sequence to blastn against the plant reference mRNA database.
A base insertion or deletion can disrupt the codon sequence and therefore give the wrong predicted amino acid sequence.
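The effect of a single-base deletion on the reading frame is easy to see by splitting a sequence into codons before and after the change. The short sequence below is invented for illustration.

```python
def codons(seq):
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

original = "ATGGCTAAGATCTAA"            # hypothetical draft mRNA
deleted = original[:4] + original[5:]   # same sequence with one base removed

print(codons(original))  # reading frame as intended
print(codons(deleted))   # every codon after the deletion has shifted
```

Every codon downstream of the deletion changes, which is why indels in the draft mRNA must be resolved before predicting the protein.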

40 BLAST search with derived mRNA sequence
Correct intron-exon boundaries (use Arabidopsis mRNA as a model)
Resubmit to BLAST
Reiterate if necessary until no indels are evident and you are satisfied with a final mRNA sequence
Save to the iFinch folder

41 Use blastx to Search Protein Database
Blastx converts a nucleic acid sequence to an amino acid sequence and searches a protein database.
What protein sequences are similar?
Are there more or fewer differences in the amino acid sequence vs the DNA sequence?
(This step can also help determine whether you have cloned Arabidopsis GAPC by accident.)

42 Translate mRNA into Protein Sequence

43 Check Protein Sequence with blastp Search
Ensure the translation is in the correct frame
Save to the iFinch folder
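To see why the frame matters, the sketch below translates a sequence in all three forward frames using the standard genetic code. It is a stand-in for illustration only, not the EMBOSS Transeq tool used in the activity, and the input sequence is made up.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq, frame=0):
    """Translate from the given frame offset (0, 1 or 2); '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")[frame:]
    usable = len(seq) - len(seq) % 3
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))

def three_frames(seq):
    """Translations of the three forward reading frames."""
    return {frame: translate(seq, frame) for frame in range(3)}

mrna = "ATGGCTAAGATCTAA"  # hypothetical final mRNA sequence
frames = three_frames(mrna)
for frame, protein in frames.items():
    print(frame, protein)
```

A frame whose translation starts with M and has no stop codon before the end is a reasonable candidate to confirm with the blastp search.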

44 Congratulations!
You have cloned, sequenced and annotated a novel gene. You could now submit this to GenBank.
Data from additional samples would strengthen the data; for example, assemble sequences of the same gene from different student teams.
Download data from iFinch if you wish to keep it for the long term.

45 Webinars
Enzyme Kinetics: A Biofuels Case Study
Real-Time PCR: What You Need To Know and Why You Should Teach It!
Proteins: Where DNA Takes on Form and Function
From plants to sequence: a six week college biology lab course
From singleplex to multiplex: making the most out of your realtime experiments
explorer.bio-rad.com > Support > Webinars

