Using genome browsers Visualization and data repositories.


Motivation: Aside from R, genome browsers are arguably the most important tools in computational genomics, but they are not widely used in the experimental community. The browser gives you an immediate edge: you can look at data, form hypotheses, and upload and download data!

In this course 1: How to use the web interface; understanding the data types 2: How to download and upload data to the browser; interaction with R 3: How to make complex analyses between data types; Galaxy and R

Today's teaching: Lectures with genome browser examples Short discussions with your neighbour Exercises

Kick starting with a challenge You are a major sequencing center You have sequenced the killer whale (Orca) genome - you have the whole genome as a stretch of ACGTs How do you make sense of this and show it to others? What value does the data have in itself? 2 minutes with your neighbour

Jim Kent, assembly-guru. Some profound words about the genome sequence “Well, it has a lot of G, C, A and Ts”

Genomes are worthless …without any annotation What type of annotations do we want to put on genomes? 2 minutes with your neighbour

Examples: 'DNA' annotation: –Known genes –Predicted genes –Repeats, transposons, CpG islands –Conservation across species 'Dynamic' annotation: –Known transcripts –Expression data –DNA modifications

How to present this data? Plain text files are useless... for most biologists. Use the genome sequence as a frame, on which we map real data or predictions.

The idea of the browser: Based on the genome, we can –Zoom in and out, and scroll sideways –See the data in different representations –Select WHAT data we want to see (way too much data to look at all at once). Important side-effect: if we map all interesting data, all the data is in one place, which means that we can download what we are interested in to do analysis!

The three browsers. UCSC Genome Browser –http://genome.ucsc.edu –Updated often, simple but powerful interface. Very simple underlying data formats. Ensembl –http://www.ensembl.org –More complex web interface, with multiple zoom levels. Very complex underlying data formats. The Generic Genome Browser (GBrowse) –http://www.gmod.org/GBrowse –Actually more of a software development platform, so that you can build your own. Resembles UCSC more than Ensembl.

In this course… We will only use the UCSC browser due to –Simplicity –Lecturer bias –The Galaxy tool - a very nifty web tool for power-user analysis of UCSC data (more later). If you know this browser, other browsers are easy to understand.

Basic concepts Zooming Data tracks

Data tracks - the problem. Example: The road from Melby to Ølby [Figure: the road between Melby and Ølby with a 5 km / 10 km scale, drawn over and over, one copy per data set]

Data tracks - the problem. Example: The road from Melby to Ølby [Figure: the same road and scale repeated even more times]

Data tracks - the solution [Figure: the road from Melby to Ølby with a km scale, with separate tracks labeled houses, trees, Monday and Sunday]

This is how genome browsers show the data [Figure labels: chromosome position, gene track, mRNA track, exons, introns]

Annotation tracks: A track is often one source of data, from a particular place, that is mapped to the genome. Data can be viewed as “blocks” with a start and an end, expressed as chromosome coordinates. It is important to know what the data is before trying to interpret it. We will first look at the “human mRNA” track.
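The “blocks with a start and an end” idea can be sketched in a few lines of Python. This is only an illustration: the `Block` class, the `blocks_in_window` helper, the coordinates and the mRNA names are all made up, and the 0-based half-open coordinate convention is an assumption chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One annotation item: a named stretch of a chromosome (hypothetical class)."""
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive
    name: str


def blocks_in_window(blocks, chrom, win_start, win_end):
    """Return the blocks on `chrom` that overlap the half-open window [win_start, win_end)."""
    return [
        b for b in blocks
        if b.chrom == chrom and b.start < win_end and b.end > win_start
    ]


# A toy "track": three made-up mRNA mappings.
track = [
    Block("chr1", 100, 250, "mRNA_A"),
    Block("chr1", 400, 900, "mRNA_B"),
    Block("chr2", 100, 250, "mRNA_C"),
]

# What would be drawn if the view were chr1:200-450.
visible = blocks_in_window(track, "chr1", 200, 450)
print([b.name for b in visible])
```

Zooming and scrolling then amount to changing the window and asking the same question again.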

Human mRNA track: What the guys at UCSC did: –Take all the known mRNAs in GenBank, and map these to the human genome using a program called BLAT (similar to BLAST). Everything that hits will be shown in this track. –What are the pros and cons of this approach? What are the limitations? 2 minutes with your neighbour!

Example answers. Pros: simple, and no filtering, leaving me to make the interpretation. Cons: not real annotation, again leaving me to make the interpretation; heavily reliant on the data source quality; limited by the extent of the data.

A short non-interactive tour: We will use the browser extensively from now on. But first, I will guide you through a few key concepts, otherwise confusion ensues when trying the real thing.

What version of the genome do you have? Genome sequences are based on many short sequenced reads, which are then assembled into a single sequence. This is very tricky, and we get slightly updated genomes at regular intervals. A version of the genome is called an assembly. So, whenever you say that you are using a genome sequence to do something, you have to say what assembly you are working on!

More about assemblies: The official naming system is –[species abbreviation][assembly number]. For instance hg17 (human, number 17), or mm8 (Mus musculus, number 8). There is an alternative way: the date of the release. So, hg17 is also called “Human May 2004”.
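As a small sketch of the [species abbreviation][assembly number] scheme, an assembly name can be split into its two parts with a regular expression. The function name `parse_assembly` and the lookup table are hypothetical; the table holds only the two abbreviations named on the slide.

```python
import re

# Letters followed by digits, e.g. "hg17" -> ("hg", 17).
ASSEMBLY_RE = re.compile(r"^([A-Za-z]+)(\d+)$")

# Only the two abbreviations mentioned on the slide; not a complete list.
SPECIES = {"hg": "human", "mm": "Mus musculus"}


def parse_assembly(name):
    """Split an assembly name into (abbreviation, number); raise ValueError if it does not fit."""
    match = ASSEMBLY_RE.match(name)
    if match is None:
        raise ValueError(f"not a [species abbreviation][assembly number] name: {name!r}")
    return match.group(1), int(match.group(2))


for name in ("hg17", "mm8"):
    abbrev, number = parse_assembly(name)
    print(name, "->", SPECIES.get(abbrev, "unknown species"), "assembly", number)
```

Storing the assembly name next to every coordinate you write down is cheap, and it avoids comparing positions from two different genome versions.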

Even more about assemblies. Rules of thumb: The newer an assembly, the “better”. Some older assemblies have more data mapped to them (because they have been around longer). Some genomes are new and unstable: updates come often, with big differences between updates. Some are more mature (like human).

Selecting species & assembly Species Assembly: the genome “version”. Where on the genome

Looking at the genome, with mRNAs Different mRNAs (same gene)‏ Direction of arrows shows strand Chromosome overview

Zooming in (We'll learn how later)‏

Some points: Transcription, in this case, is right to left (transcription on the minus strand), as shown by the arrows. Two of the mRNAs start here; the others start even further upstream, probably from alternative promoters. The fat, two-colored blocks are predicted to be protein-coding parts. Note that there are parts of mRNAs that are not translated, the so-called UTRs. There is one mRNA that is clearly non-coding (it might have a stop codon further upstream).

Zooming even further down, we see the actual DNA and the codons. Clicking on any of these mRNAs takes you to the corresponding GenBank entry.
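For a transcript on the minus strand, the codons are read from the reverse complement of the sequence shown on the plus strand. A minimal sketch, using a made-up nine-base sequence rather than anything from the browser:

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(cds):
    """Split a coding sequence into complete three-base codons, dropping any leftover bases."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


# Made-up plus-strand sequence covering a tiny minus-strand coding region.
plus_strand = "TTACGCCAT"
sense = reverse_complement(plus_strand)
print(sense)          # the mRNA-sense sequence
print(codons(sense))  # its codons, in reading order
```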

Different data representations: Each data track has a selection 'box'. Use this to: –turn tracks on or off –change visualization

examples Full Squished Dense
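One way to think about the dense view is that all items in a track are collapsed onto a single line, so overlapping items merge into one drawn block. The sketch below illustrates that idea with made-up intervals; it is not the browser's actual drawing code, and `collapse_dense` is a hypothetical name.

```python
def collapse_dense(intervals):
    """Merge overlapping or touching (start, end) intervals into the blocks a single line would show."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


# Three made-up items: two overlap, one stands alone.
full_view = [(100, 200), (150, 300), (500, 600)]
print(collapse_dense(full_view))
```

In the full view each of the three items would get its own row; collapsed, only two blocks remain and the information about which item is which is lost.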

Time to try it out... Important: the genome browser shows many tracks by default, some of which are named in a confusing way. Don’t let this throw you. We will walk through them! Go to http://genome.ucsc.edu/ and click 'Genome browser' to the left.

We'll use the default position for now, so just click the 'Submit' button (which is on the right).

Overwhelmed? Many types of data! We will only use some, others you can explore yourselves Below the image, the data tracks are categorized for easier access:

Let’s look only at the Human mRNA track as before Challenge: Turn off all tracks, except “base position” and “human mRNA”! (Expand/collapse the categories, then hide tracks. Use 'refresh' to update the image.)‏

Challenge Using the following buttons, and what we already went through, find out: What is the DNA sequence of the first two codons of mRNA DQ892408? What is the “gene name” of the mRNAs we are looking at? Are the two longest RNAs starting at exactly the same place? What are the neighboring genes?

Before we go any further… What are all these data? What can we use them for?

Fast info on a given track: Click on the actual track name (over the box). What does the “RefSeq genes” track hold? How does it differ from “other RefSeq” or “Genscan genes”? When would you use each track?

It is not realistic to go through all tracks in this course, and not meaningful either, because new tracks are added over time. We will go over the main types of tracks, and the relevant experimental methods for producing the tracks. Understanding what we are looking at is necessary for meaningful interpretation.

Big groups of things, summarized: Sequence features –CpG islands –Repeats. Transcripts or parts of transcripts –mRNA, ESTs. The so-called genes (predicted or experimental). Tiling array expression data. ChIP-chip. Variation within species (SNPs). Conservation and alignments between species –net alignments, phastCons scores. The ENCODE dataset.

Between transcription and translation: the modern RNA world. After transcription, RNAs are immature (precursor mRNAs). Processing gives mature mRNAs, which can enter the cytoplasm and be translated. As usual, we know only a small part of the mechanisms... Processing steps: a 5' cap structure is added; a 3' polyA stretch is added; splicing (not always!); RNA editing (rare?).

Splicing

Problem: We want to know what mRNAs look like... but RNA is unstable and can't be sequenced directly. Solution: Turn them into cDNA first.

Into a plasmid – so, we have a library of plasmids each carrying one cDNA This is a “cDNA library” that later can be sequenced or used for other things

General problems with cDNA sequencing: Reverse transcriptase falls off Hard to sequence long transcripts Many cDNAs are identical –Very expensive if you want to sequence all unique molecules

Solving the problem: Only sequence parts of cDNAs - these are called ESTs (more in a few slides). Semi-recent development: sequencing of full-length cDNAs, using –Cap-trapping –PolyA primers –Subtraction

Subtraction: how to only get RNAs you have not seen yet. Simple concept: For a cDNA sample, we add an excess of abundant RNAs. These will hybridize. Then, we remove everything which hybridized, and sequence the rest.

Discuss with your neighbour (2 min)‏ Say that we have two cDNA libraries - one is subtracted, one is not What are they good for? Expression (how many transcripts of a certain gene)? Annotation and gene discovery?

Visualizing and annotating cDNAs in the genome browser: The genome is actually needed to make sense of cDNAs, especially if they are not protein-coding. A general approach is to map your cDNA to the genome using an alignment algorithm. Here, we will use BLAT and the UCSC browser. Should be straightforward, but... let's try it out: See the course page for 3 mouse sequences in the blat_seqs file; I will do one in real time. Assume these are new sequences, and that you must say whether they are good enough to be part of the genome browser.

Bottom line: Mapping cDNA to the genome is sometimes trivial, but can become very tricky. Bear this in mind when you look at genome mappings - this is the process they are annotated with! cDNAs are often good quality, but always be sceptical unless there are multiple lines of evidence. Biological knowledge helps here - sanity checks become easier.

More on the problem of sequencing cDNAs Hard to sequence full-length cDNAs …and expensive to sequence many If we cannot sequence the whole cDNAs… Only sequence parts of cDNAs - these are called expressed sequence tags: ESTs

Expressed sequence tags (ESTs): Cheaper, and easier to scale up. Problems: many ESTs are simply trash, the result of over-enthusiastic sequencing. For longer genes, no coverage of the middle part.
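The “no coverage of the middle part” problem is simple arithmetic if we assume that an EST is a single read starting from one end of a cDNA clone. The sketch below uses made-up lengths (a 3000-base cDNA, 500-base reads), and `uncovered_middle` is a hypothetical helper.

```python
def uncovered_middle(cdna_length, read_length):
    """Return the (start, end) stretch not reached by one read from each end, or None if the reads meet."""
    left_end = min(read_length, cdna_length)          # where the 5' read stops
    right_start = max(cdna_length - read_length, 0)   # where the 3' read begins
    if right_start <= left_end:
        return None
    return (left_end, right_start)


# Illustrative numbers only.
print(uncovered_middle(3000, 500))  # long cDNA: the middle is never sequenced
print(uncovered_middle(800, 500))   # short cDNA: the two reads overlap
```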

Complementary information to cDNAs Can be used for expression studies (more later)‏ Many MORE of them than full-length cDNAs - higher coverage If you only have ONE cDNA for a given isoform, ESTs can help to “validate it”

So-called “gene” tracks: We have now seen that a “gene” often has many mRNAs, forming a “transcription unit”. If you have many mRNAs, it is good to have summary tracks of genes or transcription units. The UCSC browser has (at least) two of these: –The RefSeq track –The “Known Genes” track

RefSeq: RefSeq is actually a database of high-quality cDNAs, from NCBI. So, a RefSeq sequence always has at least one identical cDNA in GenBank. Good, because some individual cDNAs are trash, and we get a more manageable dataset. Bad, because the criteria used are somewhat arbitrary. For example, “long cDNAs are better than short”.

Known Genes: A track made by the UCSC people, which uses multiple databases (RefSeq, UniProt, etc.). Horrible name, easy to misunderstand: it is NOT all known genes! If you click on individual genes, you get very nice summaries, sometimes with expression information.

Searching by gene name: If you put a gene name or an accession number in the coordinate box, the browser will search the mRNA, RefSeq and Known Genes tracks (and some more) for this name, and give you a list if you get more than one hit. This is usually easy. Here is an example: the Dicer1 gene (an important RNase).

CpG islands: A CpG dinucleotide is simply a C followed by a G. CpGs are uncommon (1%) in vertebrate genomes, because the C in the CG is easily methylated and then deaminated into a T. However, there are CpG-rich stretches, called CpG islands. These are correlated with promoters: around 50% of promoters have a CpG island. Function is unclear! In the UCSC browser, this is simply called the CpG island track.
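To make “CpG-rich” concrete, one common measure compares the observed number of CG dinucleotides with the number expected from the C and G content alone: observed/expected = (number of CG × length) / (number of C × number of G). The sketch below computes this for a made-up ten-base sequence; it does not reproduce the criteria behind the browser's CpG island track.

```python
def cpg_stats(seq):
    """Return (number of CG dinucleotides, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = c_count * g_count / n if n else 0.0
    ratio = cpg_count / expected if expected else 0.0
    return cpg_count, ratio


# Made-up sequences: one CpG-rich, one with no CpG at all.
print(cpg_stats("ACGCGTTCGA"))
print(cpg_stats("ACTGACTGAC"))
```

A ratio well below 1 is what the methylation and deamination process described above produces across most of a vertebrate genome; islands are the stretches where the ratio stays high.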

Repeats: Large portions of genomes are “repeats”, classified into two main types: 1) Tandem repeats: two or more nucleotides are repeated, directly after each other, e.g. ATTCGATTCGATTCG (the number of repeats is used in crime forensics and parentage tests). 2) Interspersed repeats: results of RNA-mediated transposition (not in this course).
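Counting the copies in a tandem repeat, as in the ATTCG example on the slide, can be sketched with a regular expression. The helper name `max_tandem_copies` is hypothetical, and the repeat unit is assumed to be known in advance.

```python
import re


def max_tandem_copies(seq, unit):
    """Return the largest number of back-to-back copies of `unit` found anywhere in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


print(max_tandem_copies("ATTCGATTCGATTCG", "ATTCG"))  # the example from the slide
print(max_tandem_copies("GGATTCGATTCGCC", "ATTCG"))   # the same unit with made-up flanking bases
```

Two individuals carrying different copy numbers at the same locus would give different counts here, which is the property forensic and parentage tests rely on.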

Repeats, cont.: Generally, repeats are considered “uninformative”, and present problems when aligning things to the genome. However, there are clear cases of functional repeats. In the UCSC browser, all repeats can be turned on in the repeat track.

Let's look at these things. 5 minutes with your neighbour: Look at the RPS9 gene, and turn on RefSeq, Known Genes, human mRNAs, ESTs, CpG islands and repeats. How well do RefSeq, ESTs and Known Genes correlate? Are there any CpG islands or repeats, and where are they located? What types of repeats are there?
