Working with the Conifer_dbMagic database: A short tutorial on mining conifer assembly data. This tutorial is designed to be used in a “follow along” fashion. You will need to have the Conifer_dbMagic database launched to replicate steps shown in this tutorial. If you do not already have the conifer_dbMagic.jnlp (Java web start) file on your desktop, use the following URL to download and launch the file now: http://ancangio.uga.edu/ng-genediscovery/conifer_dbMagic.jnlp
Upon launching the program, the Assemblies menu will appear. The drop down menu is used to select the species and the assembly that you wish to query.
Each species ID is followed by an extension that identifies the assembler used for de novo transcriptome assembly. For example, the _MIRA, _NGEN, and _NBLR extensions indicate that miraEST, NGen, or Newbler, respectively, was used to assemble the transcript data. The three P. taeda libraries are listed slightly differently, i.e. as PtMIRA, PtNBLR1, and PtNGen2.
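If you keep a list of assembly IDs outside the program, the naming convention described above can be decoded with a small lookup. The following is a minimal Python sketch, not part of conifer_dbMagic: the function name `assembler_for` and the rule of ignoring trailing digits (to cover names such as PtNBLR1) are illustrative assumptions.

```python
# Hypothetical helper: map an assembly ID to the assembler named in this tutorial.
TAG_TO_ASSEMBLER = {
    "MIRA": "miraEST",
    "NGEN": "NGen",
    "NBLR": "Newbler",
}

def assembler_for(assembly_id: str) -> str:
    """Return the assembler implied by an ID such as 'C.atl_MIRA' or 'PtNGen2'."""
    # Assumption: a trailing digit (as in PtNBLR1, PtNGen2) is not part of the tag.
    core = assembly_id.upper().rstrip("0123456789")
    for tag, assembler in TAG_TO_ASSEMBLER.items():
        if core.endswith(tag):
            return assembler
    raise ValueError(f"No known assembler tag in {assembly_id!r}")

print(assembler_for("C.atl_MIRA"))  # miraEST
print(assembler_for("PtNGen2"))     # NGen
```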
We have selected the C.atl_MIRA assembly for our example. Now click on the “Submit” button to open the Assembly Display screen.
This is the main Assembly Display panel. Two tabs are on the upper right: Search UniScript and Blast Annotation. Note that C.atlantica is listed in the box on the right under “Genotypes.” Do not click on it. Click on the Submit button in the center of the window. IMPORTANT: Do not click anywhere within the “Select Contigs containing all these genotypes” box. This feature is not used in conifer_dbMagic, since no genotype information is parsed into the database.
We see that there are 30,658 matches found (total clusters) in this assembly. Each of the four populated columns can now be sorted (either increasing or decreasing) simply by clicking on its column header:
Num - database numeric identifier
UniScript - cluster name
UniScript Length - total consensus length in bases
Total Seq - total number of sequence reads associated with each cluster (note that there are “clusters” of one, i.e. singletons)
*If you want to open a new Assemblies menu to select a different species or a different assembler, you can simply click on “New Display” located at the top left of this window.
*In this and in all other windows, the term “UniScript” in column 2 is a legacy term meaning unique transcript; it is simply the contig name (or isotig name in the case of Newbler assemblies) associated with each cluster in the database.
To search the assembly by UniScript or Sequence Name, or to filter it by UniScript length or by the number of sequence reads in a cluster, use the “UniScript filters” box at the upper left. Two drop-down menus are available. First, select UniScript Length “between x,y” and type in a range of 2000 to 3000 bases. Next, select Number of Sequences >= and type in the number 10. Now click the Submit button.
The result is 768 clusters whose consensus sequences are between 2000 and 3000 bases long and that have at least 10 sequence reads per cluster. Note that all of the column values have also changed to reflect the new query results. Now click twice on the Total Seq column header to sort from highest to lowest values.
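For readers who want to reproduce the same filter and sort on table data copied out of the program, the logic can be sketched in Python. This is only an illustration on made-up rows: the `Cluster` type, the function name, and the assumption that “between x,y” includes both end points are not taken from conifer_dbMagic.

```python
from typing import NamedTuple

class Cluster(NamedTuple):
    num: int          # Num column
    uniscript: str    # UniScript column (cluster name)
    length: int       # UniScript Length column, in bases
    total_seq: int    # Total Seq column (reads in the cluster)

def filter_clusters(clusters, min_len=2000, max_len=3000, min_reads=10):
    """Keep clusters in the length range with enough reads, most reads first."""
    # Assumption: the length range is inclusive at both ends.
    kept = [
        c for c in clusters
        if min_len <= c.length <= max_len and c.total_seq >= min_reads
    ]
    return sorted(kept, key=lambda c: c.total_seq, reverse=True)

# Made-up rows for illustration only; these are not real database records.
rows = [
    Cluster(1, "example_c1", 2500, 40),
    Cluster(2, "example_c2", 1800, 90),   # too short
    Cluster(3, "example_c3", 2999, 9),    # too few reads
    Cluster(4, "example_c4", 2100, 303),
]

print([c.uniscript for c in filter_clusters(rows)])  # ['example_c4', 'example_c1']
```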
After sorting, we will click on the first row to highlight cluster C.atlantica_rep_c103, which has the largest number of total reads (303). Next, click on “View Alignment” at the bottom of the window to see the cluster alignment. *Multiple clusters can be selected here and multiple alignment windows can be opened for viewing or comparing several clusters at once.
A new UniScript Alignment window now appears with the consensus sequence shown at the top, and a pileup view of all aligned sequences listed below. Individual sequence read names are seen on the left. The red blocks indicate inconsistencies among the sequenced reads and the consensus sequence (some of these may be interpreted as possible indel/SNP containing reads). The slider bars located on the bottom and right side of the window are used to scroll through the alignment.
Now, return to the Assembly Display window by clicking on it, and then click on “Blast Annotation” at the bottom of the window.
The view switches to the Blast Annotation tab (one can also go here directly, as will be shown later). The UniScript Name for the cluster we identified in the “Search UniScript” tab has been auto-filled with a database-generated ID. Next, click to highlight a target blast database (NCBI NR) in the Select Target Database(s) panel. Click “Submit” to see the blastx results for the selected contig.
Here we see the blastx results panel, and we have returned 10 records for the C.atlantica_rep_c103 cluster. Just as in the Search UniScript tables, one can sort the blast data table columns by clicking on any column header. Column widths can also be modified by clicking on the dividing line and dragging to the desired width. In any list obtained from the database, e.g. in the Search UniScript or the Blast Annotation tabs, one can highlight contiguous or multiple, separated rows of interest using standard Windows Shift or Ctrl key/mouse click combinations. Use Ctrl+C to copy a highlighted table or individual rows of table data for pasting into text or Excel files.
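Rows pasted into a text file can also be loaded into a script for further processing. The sketch below assumes the pasted text is tab-delimited with one table row per line and no header row; that layout is an assumption and should be checked against your own pasted output. The function name and the sample rows are hypothetical, and the column names are those of the Search UniScript table.

```python
import csv
import io

def parse_copied_rows(text, columns):
    """Turn pasted, tab-delimited table text into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Skip blank lines; pair each cell with the column name supplied by the caller.
    return [dict(zip(columns, row)) for row in reader if row]

# Made-up sample standing in for text pasted from the Search UniScript table.
sample = "1\texample_c1\t2500\t40\n2\texample_c2\t2100\t303\n"
columns = ["Num", "UniScript", "UniScript Length", "Total Seq"]

records = parse_copied_rows(sample, columns)
for record in records:
    print(record["UniScript"], int(record["Total Seq"]))
```

All values arrive as strings, so numeric columns need an explicit conversion such as `int(...)` before sorting or filtering.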
Next, we will click on the “Expect” column and sort the blast data by their expect values. Note that whenever a row is highlighted, the amino acid alignment between the query sequence and the target sequence appears at the bottom of the window, which itself can be scrolled through using the slider bar. Now, click the “Reset” button to clear this query result.
Next, we type the word “actin” in the Annotation box, select < from the drop-down menu next to Expect Val, and type 1e-75 in the Expect Val box. Click to highlight the TAIR_9 database. Click Submit.
We see that 442 records are returned whose TAIR blast description records contain the term “actin,” and that also have expect values < 1e-75. *Note in the highlighted row (Num=9) that any record whose description contains the string “actin” is returned, including one where it occurs inside another word such as “interacting,” whether or not the hit is actually an “actin” gene. Also note that up to five different blast records may be returned for any given cluster.
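The difference between this kind of substring match and a stricter whole-word match can be illustrated in Python. This is a sketch on made-up records, not the query that conifer_dbMagic actually runs; the field names, the case-insensitive comparison, and both function names are assumptions.

```python
import re

def substring_hits(records, term, max_expect):
    """Keep records whose description contains the term anywhere."""
    term = term.lower()
    return [
        r for r in records
        if term in r["description"].lower() and r["expect"] < max_expect
    ]

def whole_word_hits(records, term, max_expect):
    """Keep records whose description contains the term as a separate word."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [
        r for r in records
        if pattern.search(r["description"]) and r["expect"] < max_expect
    ]

# Made-up descriptions and expect values for illustration only.
records = [
    {"description": "actin 1", "expect": 1e-120},
    {"description": "protein-interacting factor", "expect": 1e-90},
    {"description": "actin-related protein", "expect": 1e-20},
]

print(len(substring_hits(records, "actin", 1e-75)))   # 2
print(len(whole_word_hits(records, "actin", 1e-75)))  # 1
```

The third record fails both filters because its expect value is not below 1e-75; the second passes only the substring filter because “actin” occurs inside “interacting.”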
Now, we will sort the blast data by clicking on the “Match Length” column, sorting from highest to lowest values. Next, scroll down and highlight the first entry for ACT1 in the Seq Description column (you will need to increase this column width to see it): record Num=46, cluster C.atlantica_rep_c1017. Now click “Search UniScript” at the bottom of the window.
We are returned to the Search UniScript tab, and the “UniScript Name(s)” box has been auto-filled with a database-generated ID. Click Submit and the information for the ACT1 cluster is returned.
Click to highlight the UniScript row. Now, we can either click to view the alignment of the cluster, as we saw previously, or we can click on “Make Fasta.” After clicking Make Fasta, a dialog box appears for selection of either just the consensus sequence, or the consensus sequence plus all individual sequence reads associated with it. The fasta file can then be downloaded to a local directory of choice.
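Once saved, the fasta file can be read with a few lines of Python, for example to check sequence lengths. The sketch below assumes the download follows the standard FASTA layout (a “>” header line followed by one or more sequence lines); the function name and the demo records are hypothetical.

```python
import io

def read_fasta(handle):
    """Read FASTA records from an open text handle into a name -> sequence dict."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            header = line[1:].split()
            name = header[0] if header else ""
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {n: "".join(parts) for n, parts in sequences.items()}

# Made-up records standing in for a downloaded file.
demo = io.StringIO(">example_consensus\nACGTAC\nGGTA\n>example_read_1 extra text\nACGT\n")
lengths = {n: len(s) for n, s in read_fasta(demo).items()}
print(lengths)  # {'example_consensus': 10, 'example_read_1': 4}
```

For a real file, replace the `io.StringIO` demo with `open(path)` on the file you saved from the Make Fasta dialog.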
This concludes the conifer_dbMagic tutorial. Here are some helpful commands for working in or copying information from Java database tables:
Ctrl+A = select all rows.
Click/Shift/Click = select a contiguous range of rows using the mouse.
Click/Ctrl/Click = select multiple, non-adjacent rows using the mouse.
Ctrl+C = copy the rows that have been highlighted.