Walk-thru of CAGE exercise. Also at http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/ together with updated slides, and linked from the web page.
Interlude: a logistics problem. The largest cDNA project so far made 102,000 cDNAs. If you publish, you need to be able to ship these to the people asking for them. This would take >50 kg of dry ice! Expensive, and a logistics nightmare, since you need to keep track of the 102,000 tubes. How can we transfer DNA?
RNA-seq: With a high-throughput tag sequencer, we can also take the brute-force approach: fragment all mRNAs in a cell and sequence the pieces (or parts of the pieces). This is commonly referred to as RNA-seq.
Compared to SAGE and CAGE, RNA-seq: Sequences the whole mRNA, not just the end or the start. Can give connectivity, so that we know which exons are used, and which isoforms. Is actually bad at capturing 5' and 3' edges, due to statistical issues (whiteboard demo).
Typical protocol: Isolate mRNA. Break up mRNAs. Make cDNAs of RNA fragments. Add adapters, amplify and sequence. (Figure labels: AAAAA, TTTTT, AAAAA.)
We sequence 25-35 bp reads…randomly selected from each side of the fragment
Mapping tags. Challenge: what do we get (pros and cons) if we map the tags a) to the genome, or b) to the transcriptome (e.g. all RefSeq transcripts)?
Genome: unbiased, since we could hit any transcript. But it is hard to hit spliced tags, and possibly mRNAs that get modified. Transcriptome: we hit annotated genes, and splice sites are not a problem. On the other hand, we cannot find new things.
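The splice-site point can be shown with a toy example. This is only a sketch: the sequences are invented, and exact substring matching stands in for a real tag mapper, which would tolerate mismatches and use an index.

```python
# Toy sequences, invented for illustration only (not real genes).
EXON1 = "ATGGCCAAGT"
INTRON = "GTAAGTTTTTCCCTAG"
EXON2 = "CTGATCGGAA"

genome = EXON1 + INTRON + EXON2   # genomic locus: intron still present
transcript = EXON1 + EXON2        # spliced mRNA: intron removed


def maps_exactly(tag, reference):
    """Return True if the tag occurs as an exact substring of the reference."""
    return tag in reference


exonic_tag = EXON1[:8]                  # lies entirely within exon 1
junction_tag = EXON1[-4:] + EXON2[:4]   # spans the exon-exon junction

for name, tag in [("exonic", exonic_tag), ("junction", junction_tag)]:
    print(name, tag,
          "genome:", maps_exactly(tag, genome),
          "transcriptome:", maps_exactly(tag, transcript))
```

The exonic tag is found in both references, while the junction tag is found only in the transcript, because the intron separates its two halves in the genome.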
Going from tags to wigs: Showing all tags as blocks in the browser is possible, but dumb: there are potentially thousands in the window of interest, and we go blind. An easy way to summarize is to make nucleotide histograms (whiteboard demo).
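A nucleotide histogram can be sketched as follows: for each position, count how many tags cover it. The function names, the 0-based half-open tag coordinates and the example tags are assumptions made for illustration; the output helper writes the fixedStep flavour of the wiggle format, which uses 1-based start positions.

```python
def coverage_histogram(tags, region_start, region_end):
    """Count, per position in [region_start, region_end), the tags covering it.

    Tags are (start, end) pairs, 0-based with an exclusive end
    (an assumed convention for this sketch).
    """
    depth = [0] * (region_end - region_start)
    for start, end in tags:
        lo = max(start, region_start)   # clip the tag to the region
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def to_fixedstep_wig(depth, chrom, region_start):
    """Format the per-nucleotide counts as fixedStep wiggle text (1-based start)."""
    lines = [f"fixedStep chrom={chrom} start={region_start + 1} step=1"]
    lines.extend(str(d) for d in depth)
    return "\n".join(lines)


# Four hypothetical tags on a 12-nucleotide window.
tags = [(2, 6), (4, 8), (4, 8), (9, 12)]
depth = coverage_histogram(tags, 0, 12)
print(depth)  # [0, 0, 1, 1, 3, 3, 2, 2, 0, 1, 1, 1]
print(to_fixedstep_wig(depth, "chr1", 0))
```

The per-position loop is easy to read but slow for millions of tags; a real pipeline would more likely record +1 at each tag start and -1 at each tag end and take a running sum.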
Looking at RNA-seq data: In the tag_analysis web directory, there is a wig file, mm9_brain.wig, showing tags from an RNA-seq experiment on mouse brains. Upload this to the browser and look at the two genes below. Are they expressed, and how much? Kcnc3, Hoxa5.
Thought challenge: from tags to expression. We have a wig file showing where all the tags match on the genome. We have the UCSC annotation for all known genes. We want something like a microarray, saying "Gene X has an expression of Y". How can we do this? (2 minutes with your neighbor)
“Naïve solution”: For each gene, count the tags that overlap it. Gene X has 45 tags, Gene Y has 4578 tags, etc. Problems with this?
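The naive count can be sketched like this. Gene names, coordinates and tags are invented, intervals are assumed to be 0-based and half-open on a single chromosome and strand, and the function name is hypothetical.

```python
def count_overlapping_tags(genes, tags):
    """For each gene, count the tags that overlap it by at least one nucleotide.

    genes: dict mapping gene name -> (start, end)
    tags:  list of (start, end) pairs
    All intervals are 0-based, half-open (an assumed convention).
    """
    counts = {}
    for name, (g_start, g_end) in genes.items():
        counts[name] = sum(
            1 for t_start, t_end in tags
            if t_start < g_end and t_end > g_start   # standard interval overlap test
        )
    return counts


# Hypothetical annotation and 25 bp tags.
genes = {"geneX": (100, 200), "geneY": (500, 1500)}
tags = [(90, 115), (150, 175), (199, 224), (200, 225),
        (600, 625), (1490, 1515), (1500, 1525)]

print(count_overlapping_tags(genes, tags))  # {'geneX': 3, 'geneY': 2}
```

This checks every tag against every gene, which is fine for a demo but would need sorted intervals or an index at genome scale.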
Length of transcripts will have an effect! A long transcript gives more tags when broken up, and can be captured more easily. So, the number of tags from a transcript depends on both the actual expression (number of RNA molecules) and the length of the RNAs.
Normalizing for length is not that hard: For each gene, count the tags that overlap it, and divide by the gene length. Gene X has 45/(length of X) tags, Gene Y has 4578/(length of Y) tags, etc. What if we want to compare two experiments?
We also need to normalize for sample size, just as in SAGE, CAGE and ESTs. Recap: TPM (tags per million) is a normalization that rescales the tag count into what we would get if we had exactly one million tags, so: 10^6 * (#tags in my gene) / (total tags).
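The TPM formula is a one-liner. In this sketch the function name and the library sizes are made up; only the count of 45 tags comes from the slides.

```python
def tags_per_million(tag_count, total_tags):
    """Rescale a raw tag count to a library of exactly one million tags."""
    return 1_000_000 * tag_count / total_tags


# The same gene in two hypothetical experiments of different sequencing depth.
shallow = tags_per_million(45, 2_000_000)   # 45 tags out of 2 million
deep = tags_per_million(90, 4_000_000)      # 90 tags out of 4 million

print(shallow, deep)  # 22.5 22.5
```

The raw counts differ by a factor of two, but the TPM values agree, which is what makes the two experiments comparable.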
Combining the two: Normalize by gene length AND sample size. Gene X has an expression of Z TPMs / N, where Z is the gene's TPM value and N is the RNA length.
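Both normalizations can be combined in one function. This is a sketch under assumptions: the slide divides by the RNA length N directly, whereas the `per_bases` scaling to "per 1000 nucleotides" is added here only to keep the numbers readable (it resembles the common RPKM convention). The RNA lengths and the library size are invented.

```python
def length_normalized_tpm(tag_count, total_tags, rna_length, per_bases=1000):
    """Tags per million, divided by RNA length (reported per `per_bases` nt).

    With per_bases=1 this is exactly Z TPMs / N as on the slide;
    per_bases=1000 is an assumed convenience scaling.
    """
    tpm = 1_000_000 * tag_count / total_tags
    return tpm * per_bases / rna_length


TOTAL_TAGS = 1_000_000   # hypothetical library size

# Tag counts from the slides, RNA lengths invented for the example.
gene_x = length_normalized_tpm(45, TOTAL_TAGS, rna_length=1_500)
gene_y = length_normalized_tpm(4578, TOTAL_TAGS, rna_length=152_600)

print(gene_x, gene_y)  # 30.0 30.0
```

With these made-up lengths, Gene Y has about a hundred times more tags than Gene X only because it is about a hundred times longer; after normalization the two come out equally expressed.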
Summary of tag technologies
ESTs: old, expensive, long tags. Biased to 5' and 3' ends of genes. Can be used for exploration.
SAGE: 3' end tags. Only gene expression, no functional data. Limited for exploration.
CAGE/5' SAGE: 5' end tags. Promoter expression and location. Can be used for exploration.
RNA-seq: "random" tags over the whole mRNA. Expression and location; can be used for both expression and exploration.