Public data and tool repositories Section 2 Genome Browsers

Problems from last section
1. Query Entrez Gene with the following two queries separately, then explain the differences between the two results using a logical NOT operation:
   tyrosine kinase[Gene Ontology] AND human[Organism]
   cd00192[Domain Name] AND human[Organism]
2. Retrieve the APP gene record from NCBI and use the Display dropdown menu to display Conserved Domain Links. Use the ids of the listed domains to query Entrez Gene for records with the same domains.
3. Use the SNP GeneView link at NCBI to identify coding SNPs in the APP gene. Which SNP is missing from this display that was present in the Ensembl APP protein record?
4. Use the HomoloGene link at NCBI to identify possible functional orthologs for human APP. How does this list compare to the Ensembl list of orthologs that we reviewed previously?
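The comparison asked for in the first problem amounts to a set difference. The sketch below is a minimal illustration in Python, assuming the two result lists have already been retrieved as Gene ids. The helper names and the example ids are hypothetical placeholders, not real query results.

```python
def entrez_not(query_a, query_b):
    """Build an Entrez query for records that match query_a but not query_b."""
    return f"({query_a}) NOT ({query_b})"


def only_in_first(ids_a, ids_b):
    """Return the ids in ids_a that are absent from ids_b, keeping ids_a order."""
    exclude = set(ids_b)
    return [gene_id for gene_id in ids_a if gene_id not in exclude]


GO_QUERY = "tyrosine kinase[Gene Ontology] AND human[Organism]"
DOMAIN_QUERY = "cd00192[Domain Name] AND human[Organism]"

# Query strings that could be pasted into the Entrez Gene search box.
print(entrez_not(GO_QUERY, DOMAIN_QUERY))
print(entrez_not(DOMAIN_QUERY, GO_QUERY))

# Hypothetical placeholder ids standing in for the two result lists.
example_go_ids = ["101", "102", "103"]
example_domain_ids = ["102", "103", "104"]
print(only_in_first(example_go_ids, example_domain_ids))
```

Running the NOT query in both directions shows which records are specific to the Gene Ontology annotation and which are specific to the domain annotation.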

Review of last section example: human APP gene
NCBI Entrez databases
  Constructing queries
  Gene, Nucleotide and Protein
  RefSeq
EBI/Ensembl
  Finding genes
  Viewing Genes, Transcripts, Exons, Proteins and SNPs
Common id and data formats

This section
  Genome assembly and genome browsers
  Promoter/enhancer analysis example
  More information

Genome Build Process
  Organism sequence data is assembled into contiguous pieces (contigs)
  Contigs are mapped to genomic features and the coordinate system is assigned
  Unmapped sequence data may be assigned to artificial chromosomes
  Assembly is improved as more sequence data becomes available
  Entrez Genome Project
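Assigning a coordinate system means that every position on a contig gets a position on a chromosome. The sketch below illustrates that conversion under simple assumptions: 1-based inclusive coordinates, ungapped placement, and made-up contig names and offsets. The `ContigPlacement` class is hypothetical and does not correspond to any NCBI file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContigPlacement:
    """Placement of one contig on a chromosome (1-based, inclusive)."""
    name: str
    chrom: str
    chrom_start: int  # chromosome coordinate of the contig's leftmost base
    length: int
    strand: str       # "+" if placed as sequenced, "-" if reverse-complemented


def contig_to_chrom(placement, contig_pos):
    """Convert a 1-based position on a contig to a chromosome coordinate."""
    if not 1 <= contig_pos <= placement.length:
        raise ValueError("position lies outside the contig")
    if placement.strand == "+":
        return placement.chrom_start + contig_pos - 1
    # On the minus strand the contig's first base is its rightmost base.
    return placement.chrom_start + placement.length - contig_pos


# Made-up placements for illustration only.
forward = ContigPlacement("contig_A", "chr21", 1001, 500, "+")
flipped = ContigPlacement("contig_B", "chr21", 2001, 500, "-")
print(contig_to_chrom(forward, 1))
print(contig_to_chrom(flipped, 1))
```

When a new assembly moves or flips a contig, only the placement changes. This is why feature coordinates differ between genome builds.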

Genome Browsers
  Make millions of sequences available through easily accessible, user-friendly interfaces
  Provide genomic sequence, exon structure, mRNA sequence, EST and SNP data via web-based text search interfaces
  Options available for local installs

Commonly Used Browsers
  The Entrez Map Viewer
  The EBI/Ensembl browser
  The UCSC genome browser

NCBI Map Viewer
  Integrates feature identity information with a whole-genome view
  Allows one to view and search an organism's complete genome
  Displays chromosome maps
  User can zoom into progressively greater levels of detail, down to the sequence data for a region of interest
  Focuses more on individual sequences

Worked example: Querying for the APP gene in NCBI Map Viewer
1. Go to the NCBI Map Viewer
2. Select Homo sapiens in the Search drop-down menu
3. Type APP in the "for:" text box
4. Select the Go button
5. Map Viewer will display all hits to “app” on all chromosomes in all human assemblies
6. The number of hits per chromosome is displayed in red under the chromosome number, the location of each hit is highlighted on the chromosome, and the hits are listed for each assembly
7. Type 21 into the "on chromosome" box
8. Select the Find button
9. Select the APP Map Element of type GENE
10. The Map Viewer now displays the chromosome view with APP highlighted
11. Things to note:
    a. NCBI displays its data along the vertical axis (unlike UCSC and Ensembl)
    b. Each feature has a list of links to its right. These link out to detailed feature information and vary with the feature type.
12. The list of links for APP includes:
    a. APP: links to the feature-specific page, in this case the gene entry.
    b. OMIM: Online Mendelian Inheritance in Man, a manually curated catalog of human genes and genetic disorders.
    c. HGNC: HUGO gene record. HUGO is the authority that issues canonical gene names. Because this is fairly recent, many genes have old names and synonyms; these alternate names should be listed in HUGO (e.g., AD1).
    d. sv: detailed nucleotide record for the feature. Roughly analogous to the UCSC view, but for a single feature only.
    e. pr: NCBI protein record (the GenBank file for the protein).
    f. dl: download feature info. This page allows you to download the feature(s) in the range of interest. You can specify formats and other settings.
    g. ev: Evidence Viewer. Supporting data for this feature: mRNA sequences, protein sequences and ESTs.
    h. mm: Model Maker view. Model Maker allows you to view the evidence that was used to build a gene model on assembled genomic sequence, and to create your own version of the model by selecting exons of interest.
    i. hm: homologous genes. Lists related genes in all species.
    j. sts: UniSTS
    k. CCDS: CCDS entries are manually curated gene sets that are considered to be correct. Each splice variant in the CCDS is experimentally validated, and no genes with ambiguous translation make it into this set. It covers about three quarters of the known human (Hs) genes to date.
    l. SNP: dbSNP records in this range
Ex: Looking at the APP gene in the NCBI Map Viewer

EBI/Ensembl Browser
  Provides access to sequence data from ~40 organisms
  Includes the human genome sequence and data from all the commonly used experimental organisms
  Displays the location of genes, variations and other sequence features within genomes
  Greatest strengths:
    browsing of large genomic contigs
    comparative genomic features

Worked example: Querying for the APP gene in the EBI/Ensembl browser
1. Go to the Ensembl home page
2. Select Homo sapiens in the drop-down menu
3. Type APP in the text box
4. Select the Go button
5. Choose the 1st record
6. This page displays the full info for APP: transcript, genomic info, cross links, etc.
7. In the Genomic Location field (5th field down in the report), choose the 1st link
8. The genome browser pops up centered on APP
9. Notice that this browser has a horizontal orientation and 4 sections: Chromosome, Overview, Detailed, and Basepair
10. Expand the base pair view
11. Explore the ‘Detailed’ panel:
    a. It allows you to set which information is displayed
    b. Use the drop-down menus at the top of the box to choose display fields
    c. Choose to display SNPs
    d. Display settings can be set with these menus
    e. Finally, you may also export the image
12. Explore the interactive display
13. The right-hand columns offer links to several more tools: comparative alignments, data export utilities, and link-outs to other browsers
Ex: Looking at the APP gene in the EBI/Ensembl Browser
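The gene lookup in the worked example can also be done programmatically. The sketch below only builds the request URL and formats a reply. It assumes the present-day Ensembl REST service at rest.ensembl.org and its lookup-by-symbol endpoint, neither of which is mentioned in the slides. The helper names are hypothetical, and the record shown uses placeholder coordinates rather than the real APP location.

```python
from urllib.parse import quote, urlencode

ENSEMBL_REST = "https://rest.ensembl.org"  # assumption: not named in the slides


def lookup_symbol_url(symbol, species="homo_sapiens", expand=False):
    """Build a REST URL that looks up a gene by symbol and asks for JSON."""
    params = {"content-type": "application/json"}
    if expand:
        params["expand"] = "1"  # request connected features such as transcripts
    query = urlencode(params, safe="/")
    return f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}?{query}"


def format_location(record):
    """Format a parsed lookup record (a dict) as chromosome:start-end."""
    return f"{record['seq_region_name']}:{record['start']}-{record['end']}"


print(lookup_symbol_url("APP"))

# Hypothetical record with placeholder coordinates, shaped like a lookup reply.
placeholder = {"seq_region_name": "21", "start": 1000, "end": 2000, "strand": -1}
print(format_location(placeholder))
```

Fetching the URL (for example with `urllib.request.urlopen`) and parsing the body with `json.loads` would yield a record like the placeholder. That step is left out so the sketch runs without network access.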

UCSC Genome Browser
  Strength is genome position-based data aggregation: data are positioned on the “best” genome build and organised into “tracks”
  Outside data tracks
    Genome builds
    Genes, known and predicted
    mRNA
    Expression and regulation
    Variations and repeats
  Inside data tracks
    Known Genes
    Comparative genomics
    Custom tracks

Worked example: APP in the UCSC Genome Browser
1. Go to the UCSC Genome Browser home page
2. Click on the Genome Browser link on the left sidebar
3. Verify that genome: Human and the latest assembly (currently, Mar. 2006) are selected
4. Type app into the query box and click the Submit button
5. In the list of results, click on the link for APP, isoform a (isoforms b and c are just as good, since all the positions are the same)
6. Orient the participants to the display:
7. at the top are some browsing and zooming controls
8. on the next line are the position jumping controls
9. next is the chromosome overview
10. then we have the browser window with data tracks displayed as position-registered, layered horizontal lines
11. and at the bottom are the track display controls
12. Let’s make sure we’re all looking at the same tracks
13. Scroll down to the track display controls and set everything to “hide” by clicking on the hide all button
14. Set the following tracks to “dense”: Base Position, Known Genes, RefSeq Genes, Ensembl Genes, Human mRNAs, Spliced ESTs, Human ESTs, UniGene, TFBS Conserved, Conservation, Most Conserved, Repeat Masker, Simple Repeats
15. Click on the refresh button
16. Point out the different marks on the gene tracks: large ticks are exons, small ticks are UTRs, lines are introns, and arrowheads show the direction of transcription
17. Click on Known Genes, RefSeq Genes and Ensembl Genes to expand them to the packed display
18. Known Genes are the UCSC pick of the “best” gene models
19. Ensembl gene predictions are analogous to RefSeq, but they are made by EMBL (more about that later)
20. Human mRNAs, Spliced ESTs and Human ESTs are derived from sequencing records
21. Notice that the UniGene clusters for this sequence are fractured and don’t combine exons. Possible explanations include too few mRNA and EST sequences to form robust clusters, poor EST coverage of exons that prevents good sequence overlaps, or misalignment of cluster sequences with the genome.
22. To check the number of EST sequences, click on Human ESTs. It appears that there is very good EST coverage of this gene, but the ESTs are very fragmented, with numerous 3’ ends. This could confuse the UniGene clustering algorithm, since it weights 3’ ends more highly than 5’ ends.
23. To check how big the exons are, click on one of the RefSeq sequences, and then click on the linked Entrez Gene id
24. In the new browser window, change the display to Gene Table and scan through the exon lengths. No problem here, since the exons seem to be about average size (about 100 nt long).
25. To check for misalignment of cluster sequences with the genome, take note of one of the RefSeq sequence ids (say NM_000484), click on one of the UniGene ids, and then on the Human UniGene id
26. In the UniGene record browser in the new browser window, search for the RefSeq sequence. We find that Hs is a large cluster that defines APP.
27. Close that browser window and search for Hs in the UCSC genome browser. It is not found, and therefore the problem must be misalignment.
28. Press the browser’s back button until you’re showing the APP gene again
29. Click the hide all button
30. Set the following tracks to “dense”: Base Position, Known Genes, RefSeq Genes, Ensembl Genes, TFBS Conserved, Conservation, Most Conserved, SNPs, Repeat Masker, Simple Repeats
31. Most tracks have the following display types: hide, dense, squish, pack, full
32. The TFBS Conserved track is a good one for trying out the different track display types. Have participants try each type on their own. Remember, nothing will happen unless you click the refresh button.
33. Make all the tracks dense again; we’re going to look upstream for possible regulatory regions
34. Click on the TFBS Conserved and Conservation tracks to flip them into pack display
35. The Conservation track shows a histogram of the degree of conservation after comparison to other vertebrate genomes
36. The TFBS Conserved track shows high-scoring transcription factor binding sites in regions of high vertebrate conservation. These are computationally derived.
37. Click on the double right arrow to pan the browser view to the right
38. Identify the TFBS Conserved site upstream of the putative transcription start site for APP
39. This is V$MEF2_01. Click on it. This is the record for the myocyte enhancer factor 2 (MEF2) family of transcription factors.
40. To see if this is known to be related to APP, open a new browser window, go to NCBI and search all databases for MEF2 AND APP
41. You get two hits in PubMed and two hits in Gene. Click on the PubMed hits.
42. Click on the Burton et al. paper to read the abstract and see that APP seems to mediate phosphorylation of MEF2, which regulates neuronal survival. There was differential activity of mutated APP in familial AD. There is also a GeneRIF record in the APP gene record which references this paper.
43. Hypothesis: perhaps APP activates MEF2 and MEF2 negatively regulates APP transcription?
Ex: Looking at the APP gene in the UCSC Genome Browser
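The position box and the zoom and pan controls used in this example all operate on a chromosome window. The sketch below shows that arithmetic under simple assumptions: positions are written as `chrN:start-end` with optional commas, and the window is 1-based and inclusive. The function names, the pan fraction and the coordinates are hypothetical and are not the browser's own implementation or the real APP location.

```python
import re

POSITION_RE = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Parse a string such as 'chr21:1,000-2,000' into (chrom, start, end)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a position string: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def zoom_out(chrom, start, end, factor):
    """Widen the window around its midpoint by an integer factor."""
    new_width = (end - start + 1) * factor
    midpoint = (start + end) // 2
    new_start = max(1, midpoint - new_width // 2)  # never go below base 1
    return chrom, new_start, new_start + new_width - 1


def pan(chrom, start, end, fraction):
    """Shift the window by a fraction of its width (positive moves right)."""
    shift = int((end - start + 1) * fraction)
    new_start = max(1, start + shift)
    return chrom, new_start, new_start + (end - start)


window = parse_position("chr21:10,001-11,000")  # placeholder coordinates
print(zoom_out(*window, factor=3))
print(pan(*window, fraction=0.5))
```

Because APP is on the minus strand, panning right in genome coordinates moves the view upstream of the gene, which is why step 37 pans to the right.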

APP Upstream Region 15kb

Worked example: Extracting and aligning human and mouse APP upstream regions
1. First we will extract DNA sequence upstream of the human and mouse APP genes
2. Go to NCBI
3. Query for APP[Gene Name] AND (human[Organism] OR mouse[Organism])
4. Click on the Gene results
5. Click on the human APP gene record
6. Click on the gene’s chromosome sequence, and select FASTA format
7. Note that we are viewing the sequence on the minus strand of DNA
8. We will extract sequence 10kb upstream of the transcription start site (TSS)
9. Copy the coordinate in the “Range to” text box, which is our working TSS
10. Paste it into the “Range from” text box
11. Add the upstream length (10kb) to the number in the “Range to” box (remember, we are on the minus strand)
12. Click the Refresh button
13. Select File in the “Send to” dropdown menu
14. Confirm that there is a file called “sequences.fasta” on the desktop
15. Click the browser’s go back button until you are at the Gene database results
16. Click on the mouse APP gene record, and repeat steps 6-13
17. Confirm that there is a second file on the desktop called “sequences(2).fasta”
18. Now we will confirm that we extracted the correct sequences
19. Open a new browser window and go to the UCSC Genome Browser
20. Click on the BLAT link on the left sidebar
21. Make sure the genome dropdown menu is set to human
22. Click the Browse… button below the large text area and select the human “sequences.fasta” file on the desktop
23. Click the submit button
24. Click on the browser link on the first hit in the list of results
25. Your query sequence is shown at the top of the genome browser viewer as the black box, and the arrows indicate the direction of your sequence
26. Click on the zoom out 3X button and verify that your sequence is 10kb in length and directly upstream of the APP TSS
27. Click the browser’s back button until you are at the BLAT Search Genome page
28. Select the mouse genome and browse for the mouse “sequences(2).fasta”, and then repeat steps 23-26
29. Now we will do an alignment using some comparative genomics tools
30. Go to DCODE.org
31. Click on the zPicture link on the left sidebar
32. We will enter the human DNA as sequence 1 and the mouse DNA as sequence 2
33. In the Sequence 1 box, click on the Browse… button, go to the Desktop and select “sequences.fasta”
34. In the Sequence 2 box, do the same but select “sequences(2).fasta”
35. Click the SUBMIT button at the bottom of the page, and you’ll get a queue message
36. Click on the “click here” to refresh link, and you’ll eventually see a results page
37. Click on the Dynamic visualization icon
38. The graphic on this page shows a comparison between our human and mouse DNA sequences as percent identity; the red shaded segments are those showing high identity
39. Click on one of the red-shaded sections and you’ll retrieve a regional alignment and the sequence segments
40. Click on the browser’s go back button
41. Click on the Dot-plot icon
42. A dot-plot is the traditional way to view pairwise sequence comparisons, particularly whole genome-genome comparisons, since it is excellent for viewing duplications and inversions. One sequence is on the horizontal axis and the other on the vertical axis. A dot indicates a high degree of sequence identity.
43. Click on the browser’s go back button
44. Click on the rVista icon
45. Make sure that the TRANSFAC professional library is selected and click the SUBMIT button
46. Since we are interested in the MEF2 transcription factor, scroll down to the M section, check the box next to MEF2 and click the SUBMIT button
47. Click on the CHECK button
48. Click on the Dynamic Visualization icon
49. Notice that this is a similar display to what we have already seen, except there is an extra track above the image
50. Check smooth plot in the Picture section
51. Check conserved and all in the Show section
52. Check flip above the SUBMIT button so that we see the sequence as we extracted it from NCBI (zPicture flipped our sequence to the positive strand for some reason)
53. Click SUBMIT
54. The ALL track shows all MEF2 sites detected in the human sequence; the CONSERVED track shows MEF2 sites that were found in both sequences in a conserved region
55. Click on the browser’s go back button
56. Click on the linked Highlight in the Alignment section
57. Click the SUBMIT button
58. Scroll down the alignment of the conserved segments and note the single conserved MEF2 site in blue
Ex: Extracting and aligning human and mouse APP upstream regions
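The coordinate arithmetic in steps 8-11 is easy to get backwards on the minus strand, so here is a minimal sketch of it. It assumes 1-based inclusive genome coordinates and excludes the TSS base itself, whereas the manual procedure above starts the range at the TSS. The function names and the example coordinates are hypothetical placeholders.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def upstream_range(tss: int, strand: str, length: int) -> Tuple[int, int]:
    """Return (start, end) of the `length` bases upstream of the TSS."""
    if strand == "+":
        # Upstream of a plus-strand gene means smaller coordinates.
        return max(1, tss - length), tss - 1
    if strand == "-":
        # Upstream of a minus-strand gene means larger coordinates.
        return tss + 1, tss + length
    raise ValueError("strand must be '+' or '-'")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement, as needed to read a minus-strand region."""
    return sequence.translate(_COMPLEMENT)[::-1]


# Placeholder TSS coordinate, not the real APP position.
print(upstream_range(1_000_000, "-", 10_000))
print(reverse_complement("ATGC"))
```

A sequence extracted from the plus strand over the returned range would need `reverse_complement` before it reads 5' to 3' in the direction of a minus-strand gene. This is the same flip that step 52 undoes in zPicture.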

Promoter/enhancer analysis approaches
Same gene, multiple species
  Assumes evolutionary conservation of non-coding regions
  Can use pairwise or multiple alignment methods
  Examples:
    Precomputed: UCSC conservation tracks
    Dynamic: e.g., rVista
Different genes, same species
  Gene sets are typically the co-expressed clusters output by microarray analysis
  Looking for over-represented, small binding sites
  Much better results if looking for a pattern or clustering of multiple sites
  Motif-finding algorithm, e.g., MEME
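The same-gene, multiple-species approach scores how similar two aligned sequences are along their length. The sketch below computes percent identity in sliding windows over a pairwise alignment that has already been made. It is a simplified stand-in for this kind of plot, not the algorithm used by zPicture or the UCSC Conservation track, and the window size, step and toy sequences are arbitrary assumptions.

```python
def window_identity(aligned_a, aligned_b, window=10, step=5):
    """Return (start_column, percent_identity) for sliding alignment windows.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length; '-' marks a gap and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        row_a = aligned_a[start:start + window].upper()
        row_b = aligned_b[start:start + window].upper()
        matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Toy alignment rows for illustration only.
human_row = "ACGTA-GTACACGTACGTAC"
mouse_row = "ACGTACGAACACGTACGTAC"
for column, identity in window_identity(human_row, mouse_row):
    print(column, identity)
```

Windows that stay above a chosen identity threshold correspond to the shaded high-identity segments in a conservation plot. Binding sites predicted inside such windows are the ones worth following up.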

Tutorials
  NCBI
  EBI
  UCSC
  Bulk Downloads
  Field Guide
Information and tutorials
  Science Primer
  EBI 2Can Tutorials
  UCSC Genome Browser User’s Guide
  Bulk Downloads
  Bulk Downloads Tutorial

IN CLASS EXERCISE
1. Do all three browsers show the same number of transcript variants for APP, EGFR and TP53?
2. How many SNPs appear in the 5’ UTR of APP?
3. What is the lowest conservation score in APP exon 2?

1) Do all three browsers show the same number of transcript variants for APP, EGFR and TP53?
   i) Open the UCSC browser
   ii) Search for the gene name
   iii) Open the link for isoform A of the gene
   iv) View the UCSC Known Genes track
   v) Count the number of UCSC known genes
   vi) Click the Ensembl link (open in a new window)
   vii) Click on the gene name in the detail view
       (a) Open the gene record
       (b) This page lists the number of transcript ids
   viii) Click the NCBI link (open in a new window)
   ix) The Map Viewer should be open to the gene record
   x) Click the gene name. This page will list the transcripts.
   xi) Repeat from step ii for each remaining gene
   Record the counts in a table:
               TP53    EGFR    APP
     UCSC
     Ensembl
     NCBI

2) How many SNPs appear in the 5’ UTR of APP?
   i) Open the UCSC browser
   ii) Search for APP
   iii) Choose the 1st record
   iv) Set the SNP track to full
   v) Refresh the display
   vi) Zoom in on the RIGHT end of the gene (this gene is on the - strand, so the 5' UTR is on the right in the genome view!)
   vii) Count the SNPs

3) What is the lowest conservation score in APP exon 2?
   i) Same as in exercise 2, but zoom in to the 2nd exon from the RIGHT end
   ii) Turn on the conservation track (multiz align)
   iii) Refresh the display
   iv) Zoom into the lowest point of the conservation graph until you are at the single-base level view
   v) Click on the conservation track at this point. This opens the conservation track info page. There should be a link at the top for 'conservation score'.
   vi) Follow the link and get the conservation score from the table
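Exercise 2 depends on knowing which end of a minus-strand gene holds the 5' UTR. The sketch below makes that rule explicit and counts the SNP positions that fall inside the resulting interval. It assumes 1-based inclusive coordinates and ignores introns, so a 5' UTR that spans several exons would need per-exon intervals instead. All names and numbers are hypothetical placeholders, not real APP annotation.

```python
def five_prime_utr(tx_start, tx_end, cds_start, cds_end, strand):
    """Return (start, end) of the 5' UTR in genome coordinates.

    tx_* bound the transcript and cds_* bound the coding region, all given
    left to right on the genome regardless of strand.
    """
    if strand == "+":
        return tx_start, cds_start - 1
    if strand == "-":
        # On the minus strand the 5' end is the right-hand end of the gene.
        return cds_end + 1, tx_end
    raise ValueError("strand must be '+' or '-'")


def count_in_interval(positions, start, end):
    """Count the positions that fall inside [start, end], inclusive."""
    return sum(1 for position in positions if start <= position <= end)


# Placeholder transcript, coding bounds and SNP positions.
utr_start, utr_end = five_prime_utr(100, 1000, 200, 900, "-")
snp_positions = [150, 905, 950, 1001]
print((utr_start, utr_end), count_in_interval(snp_positions, utr_start, utr_end))
```

The same interval check works for exercise 3 if the SNP positions are replaced by per-base conservation scores restricted to the exon of interest.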
