Presentation on theme: "Exploring the Human Transcriptome"— Presentation transcript:
1 Exploring the Human Transcriptome. Claudia Neuhauser, University of Minnesota Informatics Institute
2 From DNA to Proteins. Source: Wikipedia (http://en.wikipedia.org/wiki/Alternative_splicing)
3 RNA: Ribonucleic Acid. Types of RNA. Ribosomal RNA (rRNA): catalytic component of ribosomes (about 80-85%). Transfer RNA (tRNA): transfers amino acids to the polypeptide chain at the ribosomal site of protein synthesis (about 15%). Messenger RNA (mRNA): carries information about a protein sequence to the ribosomes (about 5%). Other types: miRNA, siRNA, snRNA, dsRNA, …
5 Transcriptome. The transcriptome is the set of all RNA produced in a cell (or population of cells). The transcriptome of a cell varies over time and with environmental conditions. The mRNA transcripts reflect which genes are actively expressed. Technologies: microarray technology; RNA-seq technology.
6 Exploring Transcriptomes. Both microarray and RNA-seq compare mRNA and provide quantification of gene transcripts. From: Functional Genomics (G. Meroni and F. Petrera). Accessed through INTECH (http://www.intechopen.com/books/functional-genomics/beyond-the-gene-list-exploring-transcriptomics-data-in-search-for-gene-function-trait-mechanisms-and)
7 Comparing Microarray and RNA-Seq. Source: Wang, Zhong, Mark Gerstein, and Michael Snyder. "RNA-Seq: a revolutionary tool for transcriptomics." Nature Reviews Genetics 10.1 (2009):
10 Source: Malone, John H., and Brian Oliver. "Microarrays, deep sequencing and the true measure of the transcriptome." BMC Biology 9.1 (2011): 34.
11 Figure 4: Correlation of gene expression based on RPKM by RNA-Seq and protein abundance by label-free method. (A) MS1-based quantification by msInspect plotted against RPKM, log transformed. (B) Normalized MS2 spectral counts (NSAF) plotted against RPKM, log transformed. Data for mouse mitochondrial genes in brainstem tissue. Protein abundance by msInspect is based on the top 3 normalized peptide area intensities. Source: Ning, Kang, Damian Fermin, and Alexey I. Nesvizhskii. "Comparative analysis of different label-free mass spectrometry based protein abundance estimates and their correlation with RNA-Seq gene expression data." Journal of Proteome Research 11.4 (2012):
12 Resources. ReCount: online resource of RNA-seq gene count datasets from 18 different studies. Ensembl: genome database (automated gene annotation system). RefSeq: NCBI Reference Sequence Database (manually curated). Expression Atlas: information on gene expression patterns under different biological conditions.
13 The Data: ReCount. “ReCount is an online resource consisting of RNA-seq gene count datasets built using the raw data from 18 different studies. […] By taking care of several preprocessing steps and combining many datasets into one easily-accessible website, we make finding and analyzing RNA-seq data considerably more straightforward.”
14 From ReCount to Excel I. Wang, ET, et al. (2008): Count tables can be accessed by clicking on the “link”. Then: press Ctrl-a; press Ctrl-c; open Excel; click on cell A1; press Ctrl-v.
15 From ReCount to Excel II. Click on the Data tab in your spreadsheet and click on Text to Columns in the ribbon under Data Tools. The Convert Text to Columns Wizard will guide you through the next steps. Your original data are separated by spaces. Click on Delimited to choose the original data type, and click Next. Click Space in the Delimiters box. You should see how the data will be displayed in the data preview. If it looks correct, click Finish. Save your file or use the ones uploaded to the site.
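The same space-delimited split can be sketched outside Excel. The Python snippet below is only an illustration, not part of the original workbook: the function name `parse_count_table` and the two-gene excerpt (gene IDs, sample names, counts) are invented for the example, and it assumes the first line is a header and every other field is an integer read count.

```python
def parse_count_table(text):
    """Split a whitespace-delimited count table into a header and rows.

    Assumes the first non-empty line is a header and that every field
    after the gene ID is an integer read count.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        gene_id = fields[0]
        counts = [int(x) for x in fields[1:]]
        rows.append((gene_id, counts))
    return header, rows


# Hypothetical excerpt: IDs and counts are invented for illustration only.
sample = """gene SAMPLE_A SAMPLE_B
GENE_1 12 0
GENE_2 3 7
"""

header, rows = parse_count_table(sample)
print(header)
print(rows)
```

Splitting on any run of whitespace corresponds to choosing Delimited with Space as the delimiter in the wizard.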
16 The Data. Layout of the count table: each row is a gene (Gene ID, an Ensembl identifier beginning with ENSG), each column is a sample (Sample ID: SRX003935, SRX003921, SRX003924, SRX003923), and each cell holds the number of reads.
17 Exercises 1 & 2: The Wang et al. Data. Open the Heart tab. Explore the genes: pick a gene ID and search in your browser for the gene ID; explore the gene on the Ensembl website. Explore the read count distribution: What percentage of genes are expressed? What is the distribution of read counts? Detailed instructions are in the workbook.
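As a rough sketch of the two read-count questions, the Python snippet below computes the percentage of expressed genes and a coarse distribution of counts. It assumes that "expressed" means at least one read, which the slides do not define; the function names, the order-of-magnitude binning, and the example counts are all invented for illustration.

```python
from collections import Counter


def percent_expressed(counts):
    """Percentage of genes with at least one read (assumed definition)."""
    expressed = sum(1 for c in counts if c > 0)
    return 100.0 * expressed / len(counts)


def magnitude_histogram(counts):
    """Tally counts by number of decimal digits; zero gets its own bin 0."""
    return Counter(0 if c == 0 else len(str(c)) for c in counts)


# Invented read counts for eight genes in one sample.
counts = [0, 0, 5, 12, 0, 130, 7, 0]

print(percent_expressed(counts))
print(sorted(magnitude_histogram(counts).items()))
```

Binning by order of magnitude is one convenient choice because read counts typically span several orders of magnitude; a histogram of log-transformed counts in the spreadsheet would serve the same purpose.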
18 From Raw Counts to Interpretation. What affects the magnitude of the number of reads assigned to a specific gene? Exon model; expression level; length of gene; sequencing depth.
19 Normalizing Raw Counts I. Raw data: similar number of reads but different lengths. To compare genes within a sample, divide the raw count by the length of the gene.
20 Normalizing Raw Counts II. Find the total number of reads N. For gene i, calculate (raw count of gene i) / (length of gene i × N). These numbers are very small; the median is around 4 × 10^-10. Multiply by 10^9 = 1,000,000,000. This new quantity is called RPKM (or FPKM): reads per kilobase pair per million mapped reads.
21 Normalizing Raw Counts III. Calculating RPKM. This quantity can be used for within-sample analysis. Note: gene annotation and length come from an ‘exon model’.
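The RPKM calculation described on these slides can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workbook's own implementation: the function name `rpkm` and the example numbers are invented, and gene length is assumed to be given in base pairs.

```python
def rpkm(read_count, gene_length_bp, total_reads):
    """Reads per kilobase pair per million mapped reads.

    Divides the raw count by gene length (in base pairs) and by the
    total number of reads N, then multiplies by 10^9.
    """
    return 1e9 * read_count / (gene_length_bp * total_reads)


# Invented example: 500 reads on a 2,000 bp gene, 10 million reads in total.
value = rpkm(read_count=500, gene_length_bp=2000, total_reads=10_000_000)
print(value)
```

The factor 10^9 is the product of 10^3 (base pairs to kilobase pairs) and 10^6 (reads to millions of reads).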
22 Exercise 3. Heart Length tab. Calculate RPKM. Plot RPKM as a function of length. Find genes that are strongly expressed in the heart and go to the Expression Atlas to confirm. Detailed instructions are in the workbook.
23 Exercise 4. The Heart-Liver tab has RNA-seq read counts for two tissue types, the heart and the liver. We will use this data set to learn about differential expression. How many genes are expressed in both the heart and the liver, in one but not the other, and in neither tissue?
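One way to tally the four categories asked about here is sketched below in Python. It assumes, as a simplification not stated in the slides, that a gene counts as expressed in a tissue when it has at least one read; the function name and the example counts are invented for illustration.

```python
def expression_overlap(heart_counts, liver_counts):
    """Count genes expressed in both tissues, in only one, or in neither."""
    tally = {"both": 0, "heart_only": 0, "liver_only": 0, "neither": 0}
    for heart, liver in zip(heart_counts, liver_counts):
        if heart > 0 and liver > 0:
            tally["both"] += 1
        elif heart > 0:
            tally["heart_only"] += 1
        elif liver > 0:
            tally["liver_only"] += 1
        else:
            tally["neither"] += 1
    return tally


# Invented read counts for five genes, listed in the same gene order.
heart = [5, 0, 3, 0, 9]
liver = [2, 4, 0, 0, 1]

overlap = expression_overlap(heart, liver)
print(overlap)
```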
24 Normalizing Raw Counts IV. To compare across samples, we need to account for sequencing depth. For each sample, find the total number of reads. For gene i in sample k, calculate λik. Sum over all genes i in the sample to obtain the normalizing factor Λk.
25 Normalizing Raw Counts V. For each gene i in sample k, divide λik by Λk. This quantity, called relative abundance, can be used to compare across samples.
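A minimal Python sketch of the relative abundance calculation follows. The transcript does not preserve the formula for λik, so the code assumes it is the read count divided by gene length; any further per-sample constant, such as division by the total number of reads, cancels when λik is divided by Λk. The function name and example values are invented for illustration.

```python
def relative_abundance(counts, lengths):
    """Relative abundance of each gene within one sample.

    Assumption: lambda_ik = count / length. A constant per-sample factor
    would cancel in the ratio lambda_ik / Lambda_k.
    """
    rates = [count / length for count, length in zip(counts, lengths)]
    normalizing_factor = sum(rates)  # Lambda_k for this sample
    return [rate / normalizing_factor for rate in rates]


# Invented example: three genes in one sample, lengths in base pairs.
counts = [100, 300, 600]
lengths = [1000, 1000, 2000]

abundance = relative_abundance(counts, lengths)
print(abundance)
```

In this example the second and third genes end up with the same relative abundance: the third has twice the reads but is also twice as long.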
26 Exercise 5. The Heart Liver Length tab has an additional column (Column C) with the length of each gene. We will compare the relative importance of each gene. Determine the total number of reads N for each tissue. Calculate the relative abundance for each tissue. Graph the cumulative distribution function of the relative abundance as a function of the number of genes. Detailed instructions are in the workbook.
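The cumulative curve in this exercise could be computed along the following lines in Python. The sort order is an assumption: the sketch ranks genes from most to least abundant before accumulating, which the slide does not specify. The function name and example values are invented for illustration.

```python
from itertools import accumulate


def cumulative_abundance(abundances):
    """Running total of relative abundance, most abundant gene first.

    Entry j is the share of the sample accounted for by the j + 1 most
    abundant genes. Sorting in descending order is an assumption.
    """
    ranked = sorted(abundances, reverse=True)
    return list(accumulate(ranked))


# Invented relative abundances for three genes (they sum to 1).
curve = cumulative_abundance([0.1, 0.6, 0.3])
print(curve)
```

Plotting `curve` against the gene rank (1, 2, 3, …) shows how few genes account for most of the reads.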
27 Exercise 6. Calculate the log fold change ‘=ABS(LOG(ratio,2))’. Graph the log fold change as a function of relative abundance for each tissue type.
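For comparison with the spreadsheet formula, a Python equivalent of ‘=ABS(LOG(ratio,2))’ might look like the sketch below. The function name and example values are invented, and returning `None` when either abundance is zero is a choice made here for illustration, since the ratio or its logarithm is undefined in that case.

```python
import math


def abs_log2_fold_change(abundance_a, abundance_b):
    """Absolute base-2 log of the ratio of two relative abundances.

    Returns None when either value is zero, because the ratio or its
    logarithm is then undefined (an illustrative choice).
    """
    if abundance_a <= 0 or abundance_b <= 0:
        return None
    return abs(math.log2(abundance_a / abundance_b))


# Invented relative abundances of one gene in two tissues.
print(abs_log2_fold_change(0.5, 0.125))
print(abs_log2_fold_change(0.125, 0.5))
```

Taking the absolute value makes the result independent of which tissue is in the numerator, so a four-fold difference gives 2 in either direction.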