The UCSC Genome Browser Introduction

Osvaldo Graña, CNIO Bioinformatics Unit
Materials prepared by Mary Mangan, Ph.D.
Version 21a_1012

Welcome to the introductory tutorial on the UCSC Genome Browser. The University of California at Santa Cruz Genome Browser resource contains the reference (or official) public DNA sequences and working draft assemblies for human and a large collection of other genomes. A number of tools within this site provide access to the sequences themselves, along with many other useful genome features that add context to the genomic information. Researchers can use this site to find genes and gene predictions, expression information, SNPs and variations, cross-species comparative data, and more.

Our goal in this tutorial is to help you search, retrieve, and display the data that is relevant to your research. In this introduction we will provide an overview of the organization, graphical cues, and basic features of the searches and displays. We will explore the range of tools available for several types of searches.

The materials in this presentation were prepared by Dr. Mary Mangan of OpenHelix, with guidance from the UCSC Genome Bioinformatics team. Separate tutorials about the UCSC resources address more advanced topics such as the Table Browser and Custom Tracks, the Gene Sorter, and a variety of other tools associated with the browser. After you have mastered this introductory material, be sure to check those out.

UCSC Genome Browser Agenda
Introduction; Basic Searches; Understanding Displays; Get Details or Sequences; Sequence Searches (BLAT); Exercises

The agenda for this tutorial is shown here. We will begin with an introduction and credits, and will move on to explore the UCSC Genome Browser resource with some basic searches. We will perform a basic text search, and examine the results and displays in some detail to better understand the organization. We will obtain sequence data, and then use that data to perform a search with the BLAT tool. I will then summarize this material. Finally, you will have the opportunity to explore sample exercises on the UCSC Genome Browser site, to reinforce the concepts developed in this tutorial. Let’s begin with the introduction and credits.

Organization of Genomic Data

Reference genome: sequence, base position number
Annotation Tracks: chromosome band; predicted genes; phenotype and disease; evolutionary conservation; SNPs and structural variation; gap locations; known genes; repeated regions; microarray/expression data; enhancer/promoter data; more…
Links out to more data

A great deal of information has come to us from the official Human Genome Project, and from the official projects for many other species as well. Other data has come from individual laboratories doing traditional benchwork, some has come from the literature, and some has come from large-scale technologies that have arisen in the last few years, such as microarrays and next-generation sequencing. So there are tremendous volumes of data available, and many places to try to find it.

The UCSC Genome Browser is a great resource because it organizes this material in one place. It uses the sequence of the genome (the official “reference” sequence of the Human Genome Project, or the official reference genome of other species) and combines it with all kinds of other useful and important biological information, such as chromosome banding patterns, known genes, gene predictions, phenotype and disease information, enhancer and promoter data, expression data, comparative genomics and evolutionary conservation, SNPs and other variations, and so on.

As I illustrate in this conceptual diagram, the data is organized along the official genomic sequence reference coordinates. The other data types are referred to as “Annotation Tracks” and are aligned on the genomic sequence framework. These tracks provide additional information about any given genomic region of interest. All of this data is aligned in one place so you can quickly find new information, and context, about regions important for your work.

In addition, all the data links out to other databases, web sites, and literature, so you can go as deep as you want into any specific topic that interests you.

A Sample of the UCSC Genome Browser
reference sequence; gene details; Annotation Tracks; comparisons; SNPs

On the previous slide we had a conceptual diagram of the UCSC Genome Browser’s representation of the genome and the annotation data. Briefly, I want to show you a sample of the kind of data we will examine, as it actually looks in the Genome Viewer. Here you see a portion of the genome viewer, with the base positions (the official genome reference sequence) at the top, and the many layers of data (the annotation tracks) organized in that region.

If you click on any of the features in this data, you will be presented with even more detail about those items on new pages. The detail pages themselves link out to more resources, too. Shown here are some examples of Gene Details, multiple species alignment data, and SNPs. So much data, so well organized, is right at your fingertips now, thanks to the UCSC Genome Bioinformatics Group team. You’ll learn a great deal about any genomic region using the graphical representations and the clickable features.

Introduction and Credits; Basic Searches; Understanding Displays; Get Details or Sequences; Sequence Searches (BLAT); Exercises

[end of Introduction and Credits] That completes our introduction and credits.
[beginning of Basic Searches] In this section we explore the basic text searches.

The UCSC Homepage: navigate

General information
Specific information: new features, current status, etc.

Shown here is the homepage for the UCSC Genome Bioinformatics site, taken on a day that happened to have a tribute to Charles Darwin. When you first arrive, you will see a page that is organized like this. At the top there is a section that contains general information about the site. Next, there is a specific section for “News”: new species, new features, software or data changes, and the current state of the data that is available. This information is worth a quick check when you visit the site, in case there have been changes since your last visit.

But the real substance of the site, the data and tools, is accessible in a couple of ways from this page. There are navigation bars at the top and left side which permit you to access all of the available features. You will begin your experience at the UCSC Genome Browser by navigating from these blue areas. Some features are available from both the top and the side. Some are only along the left. We won’t be able to cover all of the great tools and details in this introduction. There are separate tutorials available on our site that explore some of these, including the Table Browser & Custom Tracks, the Gene Sorter, VisiGene, and more.

To actually get in and start performing basic searches in the database, there are several options. You can search by text: gene name, gene symbol, keywords, ID, and so on. To do this we will use the Genomes or the Genome Browser link. Either of these will give us access to the Gateway page where we will begin to search.

Gateway: Start Page for a Basic Search

text/ID searches; helpful search examples; format provided

Shown here is a portion of the Genome Browser Gateway page. By default the search is set to “Human” and the current assembly when you first arrive, but we will see that you can change the species and assembly later. We will begin with the text search feature on this Genome Browser Gateway page. You can do a text search for information such as gene names, chromosome number, chromosome region, your favorite gene or marker identification number (ID), GenBank submitter name, and more. You can use a keyword to find records.

Examples of the kinds of searches you could do are shown on the lower part of this page: see the request items, and the expected responses from the genome browser. You can return to this section for helpful reminders of the correct query format when doing your own searches later on. We are going to go a little deeper into your search options from this gateway. We’ll take each option and explore what you can expect from a given search.

Use this Gateway to search:
Gene names, symbols, IDs
Chromosome number: chr7, or region: chr11:
Keywords: kinase, receptor
See lower part of page for help with format

UCSC Genome Browser Gateway
Here we are going to focus on the options that you have to search a genome using the Gateway page. This screen shot isolates that part of the page so we can focus on the specific items that are available to you.

The first option is clade, and the second is the genome, or species, choice. At one time all of the species were in a single list, but there are so many species now that they have been re-organized into these menus. You will search one species at a time in the Genome Browser. Use the pulldown menus to select and highlight the species name that you want to use in your search.

Next, you have to choose an “assembly”. Assembly refers to the official “reference” genomic sequence that is used to create the framework on which to hang all the other data. The reference sequence comes from the “official” groups who release genome sequence data. In the case of human that is now the GRC, or Genome Reference Consortium. The groups deposit sequence in GenBank, and then UCSC obtains the official assembly and generates the annotation tracks for that genome. The source of that assembly and any version number from the sequencing group are indicated in that species’ gateway information section. You will also see other nicknames for the assembly used in various places. This can be confusing; it would be great if everyone used a standard assembly designation in the literature. The official release date is what we see in the Assembly menu. Often you will want the most current assembly, but sometimes you may want to look back at older data, and you can see that it is still available for a while. Even older data is still available in the UCSC archives if you need it. Archives are accessible from a link on the homepage left navigation menu.

“Position or search term” is the next option. This is where you put the symbol, keyword, or ID for the part of the genome you want to examine. You can put a symbol in the position box, or use the handy gene search box to quickly find the right canonical gene. The gene box assists you with suggested text that appears as you type.

The options described so far will get you to a genomic location. But if you are wondering where to find specific data types or annotation tracks, the “track search” button will help. Clicking it takes you to a new search where you can explore the annotation track descriptions.

The last thing I’ll point out here is the button for configuring tracks and displays. You can make changes here to the display, such as the font sizes and feature appearance, and later I’ll show you a couple of other places where you can access this as well. If you find that the text on the viewer is too small, or the arrowhead features are difficult to see, configure their size and alter other aspects of the viewer here.

Make your Gateway choices:
Select clade + genome = species: search 1 species at a time
Assembly: the official reference DNA sequence
Position: location in the genome to examine, or text search
Track search to find data types of interest (annotation tracks)
Configure: make fonts bigger + other display choices

Sample Search for Human TP53
Sample search: human, February 2009 assembly, tp53; select uc002gij.2

Now that we have examined the search options, let’s perform a sample search of this database. The search that I’ll be demonstrating uses the human genome, the February 2009 assembly. If you are seeing these slides at a time when there is a later assembly, things might look slightly different. For this example, I’m going to use the human TP53 gene, an important and medically relevant gene that has been implicated in some cancers. It is a well characterized gene for our example. I could choose the “gene” suggestion box item for this search to get the canonical gene, but in this case I’d like to explore the full results options, so I will illustrate the plain search box method.

Once you have made the appropriate selections among the options and added your position or search text, you click the “submit” button and wait for your results, some of which we see below. Here I show a part of the results page for the text search for TP53. That text appears in a number of different records within different annotation track sets, so you have to select the one you want from this results page. It depends on your needs. For my example I’ll focus on UCSC genes, which is a collection of several different gene resources that have been gathered by the UCSC team.

Sometimes you will go directly to the browser; that might happen if you use a specific accession number. With text searches, however, you will often have to select from the records. Usually I choose a record that appears to have the correct gene symbol or name. I look at the description text. And if there appear to be multiple entries that are likely to be splice variants, I may select the longest of them (as indicated by the nucleotide range at the end of the link). We simply have to choose one to move to that genomic region. As you will see, the other versions of that gene will be visible on the viewer when we get there.

For my example here, I will choose the link that says uc002gij.2, tumor protein p53, variant 1. Click that link to go to the TP53 position in the genome with those nucleotide coordinates: we will go to chromosome 17, in that nucleotide range, in the browser viewer. We’ll pick up with the viewer in the next section.

Select from the results list; or the search goes straight to a viewer page, if the result is unique
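The same choices made on the Gateway page (an assembly plus a position or search term) can also be written down as a link, which is handy for sharing a view. The following is a minimal sketch, not a description of the official interface: it assumes the browser's track-display page accepts `db` and `position` parameters, and that `hg19` is the database name for the February 2009 human assembly. The coordinate range is illustrative only and is not the exact range of uc002gij.2.

```python
from urllib.parse import urlencode

# Assumed address of the browser's track-display page.
BROWSER_URL = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_link(db, position):
    """Build a link that opens the browser for one assembly and one position or search term."""
    return BROWSER_URL + "?" + urlencode({"db": db, "position": position})


# "hg19" is assumed to name the February 2009 human assembly.
print(browser_link("hg19", "TP53"))
# Illustrative range on chromosome 17; the colon is percent-encoded by urlencode.
print(browser_link("hg19", "chr17:7500000-7600000"))
```

A text term such as TP53 may still lead to a results list to choose from, as described above, while a coordinate range should open the viewer directly.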

Introduction and Credits; Basic Searches; Understanding Displays; Get Details or Sequences; Sequence Searches (BLAT); Summary; Exercises

[end of Basic Searches] That completes our introduction to the basic text search.
[beginning of Understanding Displays] In this section we explore the results of searches as they appear in the Genome Viewer, to understand the layouts, displays, and controls for the viewer.

Overview of the Whole Genome Browser Page (2009 Human Assembly)

Genome viewer
Groups of data (Tracks): Mapping and Sequencing Tracks; Genes and Gene Prediction Tracks (including sno/miRNA data); Phenotype and Disease Tracks; mRNA and EST Tracks; Expression (such as microarray); Regulation (including TFBS); Comparative Genomics (as a group, individual species); Variation and Repeats (including SNPs, copy number variation)
Default settings; tracks can now be dragged in viewer
Track data may be updated

Shown here is an overview of the page that results from clicking the link in our results list. I use this slide to illustrate the major organizational concepts of the Genome Browser. At the top of the page you will see the Genome Viewer section. Here you will see the diagrammatic representation of the genome and annotation track features in this region. Soon we will examine this data and the visual cues in more detail.

At the bottom of the page you will see the controls that you can use to turn the data in the viewer on or off. The data is organized into GROUPS for quickly finding data of interest. When you first arrive, the layout of the graphical viewer at the top corresponds to the order in which the groups appear in the track lists below. These are groups of similar data, such as Mapping and Sequencing Tracks, Genes and Gene Prediction Tracks, and so on. Each GROUP contains the individual TRACKS, or the rows of annotation. Here I illustrate that the data from the Mapping and Sequencing Tracks group is displayed in the uppermost part of the viewer. The Genes and Gene Prediction Tracks are located in the next section down in the viewer. Understanding this Group and Track organization will also help you to understand the Table Browser functions we’ll discuss later.

Other data types in human include Phenotype and Disease tracks, mRNA and EST data, Expression data (such as microarray data sets), Regulation (including data such as Transcription Factor Binding Sites), Comparative Genomics data with many-species comparisons and individual species comparisons, and Variation and Repeats with SNPs, copy number variation, and more. Items at the bottom of the track collection would be expected to be found in the lower section of the viewer.

This is the layout that is in place when you first arrive at the browser. It is possible to drag and re-order the tracks in the viewer to set them to your preferred organization, so the organization may vary if you’ve re-ordered the tracks manually.

This is a Genome Browser page at a mature stage of this assembly. You can see that there are many track and image controls down at the bottom of the page. At the very beginning of a release there is only a “core” set of tracks; not all of the tracks are available. Tracks take time to create, both within UCSC and by other contributors all over the world, so the track options you see will accumulate over time. On the first day of a new release the SNPs may not be there, for example, but they will appear later. Another key point is that some tracks may be mapped to earlier assemblies and may not migrate to the new assembly. If you don’t see a track you are looking for, you may wish to explore earlier assemblies to locate the data.

A further point to make here: the official “reference” sequence that forms the framework for this assembly will remain frozen over the course of time. However, the data in the annotation tracks may change. It may be updated periodically. For example, new data for ESTs and mRNAs is downloaded from GenBank every week, and new releases of dbSNP are added when they become available. New data types may be added, or tracks may be updated, at any time. So although the official reference sequence remains the same, the annotation track data may change.

Different Assemblies, Species, Tracks
Another point to make at this time is that the UCSC Genome Browser has dozens of different species genome browsers. Here are a few of the images of these different species. As you can see from a quick look, the interface and display are very similar for each species, and the way the software works will be similar as well. Although we are focusing on the human genome in our slides today, you should know that all these species share the software functionality that we will be talking about.

However, different species will have different annotation tracks. Just because you see a certain track in the human browser, it does not mean that the same track will be available in other species. Similarly, there may be data in yeast that will not be available in the human genome browser. Some species have less annotation data available.

Assemblies, Species may have different data tracks
Layout, software, functions the same

Sample Genome Viewer Image, TP53 Region

scale; base position; UCSC genes; RefSeq; mRNAs & ESTs; ENCODE; many species compared; single species compared; SNPs; repeats

At this time, let’s focus on the viewer section of the Genome Browser. This is the default view, after our search for TP53. I want to quickly orient you to the things that you are seeing when you look at the default setup of the genome viewer.

One of the first things to notice is that we are in the position of the genome that we expected. We can tell by looking at the label on the side of the UCSC gene track, which indicates the TP53 gene location, and which I have highlighted in RED. All of the TP53 items from our original result are visible here, but we had selected one to supply the nucleotide range. Notice that one of the TP53 symbols is highlighted in black: that is the specific one that we clicked in our results list to arrive here, and the one that supplied the coordinates for our current view.

In the viewer a scale bar helps you to orient to the size of the region, and you can also look at the size in base pairs near the top. Near the top of the image there is a track called “Base Position” in the track list, which I have been calling the genome reference sequence. It represents the position of every single nucleotide of the reference sequence. As you can see, we are on chromosome 17 around base number 7 million something. The viewer displays numbers unless you are zoomed all the way in to “base”, and then you would see the individual nucleotide letters A, T, G and C themselves.

As you look down the viewer, you will see that many different data types are represented: UCSC genes, RefSeq genes, mRNAs, ESTs, ENCODE data, evolutionary relationships compared across many species or as individual species, SNPs, and repeats. This is just the default view, though; other data types are available for you to display. Immediately from the viewer, you can see that you have a lot of information and context about the TP53 region.

Let’s talk a little bit more about the display of the features in the viewer.

Visual Cues on the Genome Browser
Tick marks: a single location (STS, SNP)
Intron and direction of transcription: <<< or >>>
Exon boxes: coding exon, 5' UTR, 3' UTR
Track colors may have meaning. For example, in the UCSC Gene track:
If there is a corresponding PDB entry = black
If there is a corresponding reviewed/validated seq = dark blue
If there is a non-RefSeq seq = lightest blue
Mammal cons.: height of a blue bar is increased likelihood of conservation; red indicates a likelihood of faster-evolving regions
Alignment indications (Conservation pairs: “chain” or “net” style): Alignments = boxes, Gaps = lines

Various data objects will be represented differently in the Genome Browser. Some objects are just single locations, or very short stretches of sequence. For example, STSs (sequence tagged sites) and SNPs (simple nucleotide polymorphisms) are indicated by vertical tick marks. Sometimes if there are several close together they may look like a broader bar, but essentially each one indicates a single small location.

For the UCSC Genes track, there are several cues provided. Coding region exons are the tallest boxes. Half-size boxes indicate exons that comprise the 5’ and 3’ Untranslated Regions, or UTRs. This is based on the information from the source records (such as a GenBank record). Further, you can tell the direction of transcription of this coding unit if you look at the little arrowheads on the intron section, which point to the left or to the right. In the example diagram I have here, the arrowheads point to the left, indicating that this gene is transcribed from the 5’ UTR on the right side to the 3’ UTR on the left. I have chosen to display this orientation to make you aware that genes will be found running in both directions in the viewer. This will affect where you would look for promoter elements, for example.

For some tracks, colors have important meaning. For example, in the UCSC Genes track, the color BLACK indicates that there is a PDB (Protein Data Bank) structure entry for this transcript. Shades of blue indicate its status, which may be reviewed or provisional, for example. You should check the documentation for the specific color codes for different tracks. Another track that has important color codes is the SNP track, where the SNPs can be colored to represent different characteristics of the SNP.

Some data types are represented by a histogram. For example, some of the Comparative Genomics data in the track called Conservation displays a bar of a certain height; tall bars indicate an increased likelihood of an evolutionary relationship in that region. This kind of track is sometimes called a “wiggle” track. “Wig” tracks are becoming a very popular type of display for various data types.

Another visual indication of the sequence relationships can be seen in the single species comparisons. Blocks indicate aligning regions, and horizontal lines indicate gaps. Single lines are simple gaps that represent likely insertions or deletions. Double lines represent more complex situations that could be a range of issues. More details on the possibilities can be found in the description of the “display conventions” in the browser documentation. Zooming and clicking on the display will bring you more information about the specific sequences involved.

The different tracks will have different colors, shapes, and so on. If you have a question about a specific representation you should check the documentation for an explanation of its significance. Understanding these representations will help you to quickly grasp many of the features in any genomic region.
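Two of the cues above reduce to simple coordinate arithmetic: the strand decides which side of a gene is upstream (where you would look for promoter elements), and the coding-region boundaries decide which part of an exon is drawn tall and which part half-size. The sketch below illustrates that reasoning only. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and nothing here is taken from the browser's own code.

```python
def upstream_window(tx_start, tx_end, strand, size=1000):
    """Return (start, end) of the `size` bases upstream of a transcript.

    On the + strand the 5' end is on the left, so upstream is before tx_start.
    On the - strand the 5' end is on the right, so upstream is after tx_end.
    """
    if strand == "+":
        return max(1, tx_start - size), tx_start - 1
    if strand == "-":
        return tx_end + 1, tx_end + size
    raise ValueError("strand must be '+' or '-'")


def split_exon(exon_start, exon_end, cds_start, cds_end):
    """Split one exon into ('utr' | 'coding', start, end) pieces.

    'coding' pieces correspond to the tall boxes, 'utr' pieces to the half-size boxes.
    """
    pieces = []
    if exon_start < cds_start:
        pieces.append(("utr", exon_start, min(exon_end, cds_start - 1)))
    lo, hi = max(exon_start, cds_start), min(exon_end, cds_end)
    if lo <= hi:
        pieces.append(("coding", lo, hi))
    if exon_end > cds_end:
        pieces.append(("utr", max(exon_start, cds_end + 1), exon_end))
    return pieces


# A left-pointing (minus strand) gene: the promoter side is to the right.
print(upstream_window(7000, 9000, "-"))
# An exon that straddles the start of the coding region.
print(split_exon(100, 200, 150, 500))
```

Whether a given UTR piece is the 5’ or the 3’ UTR depends on the strand, which is why the sketch labels both simply as `utr`.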

Options for Changing Images: Upper Section

walk; zoom; right-click items; tweak position or do new search; hold/drag mouse to view section; drag (like Google Maps)

In addition to the view of the genome that you see when you first arrive, you have the option to make lots of changes to the area of your view. Here I show the upper section of the Genome Viewer page, with several controls for adjusting your view of the genome.

You can use the “move” buttons with the arrowhead indicators to walk left or right along the chromosome in this area. You can take big steps (with the triple arrowhead), medium steps, or little steps with the single arrowheads. These can be very handy if you are interested in what’s going on near your search region.

You can magnify the image area using the “zoom in” buttons. As you can see, you can zoom in a little bit, or up to 10-fold. Or you can choose “base” to zoom all the way down to the nucleotide level right away. The nucleotide position numbers will be replaced by the A, T, G, and C bases. Similarly, you can “zoom out” with a different set of buttons.

Alternatively, you can indicate a specific genome coordinate position in the “Position/Search” box. For example, if we wanted to see more of the possible promoter or downstream regions, we could subtract 1000 from the 5’ side, add 1000 to the 3’ side, and get all of that extra sequence in our view. In addition, you can use this box just like the search box on the gateway page: you can use it to search for text items if you enter text and click “jump”.

Other handy features include several mouse options for dragging and zooming. First, you can use the bars on the left to drag tracks up and down the page to re-order them. You could group the tracks you want to see together this way. Another mousing option is the hold and drag method to zoom on a region in the viewer. You can hold your mouse button and drag across a section of the viewer near the top; that will select and zoom on that region. Your browser will reload with that segment in view. A quick way to use many of the navigation features is to right-click (or control-click on a Mac), which opens a window with numerous options for accessing additional details and actions. You can also simply grab a portion of the viewer and drag it like a Google map now.

Those are the controls at the upper part of the page. Mostly they move you along the genome horizontally or change the nucleotide position, affecting the entire viewing area. In the next few slides we’ll talk about controlling the individual annotation tracks further down the Genome Viewer page with the track controls, which alter the types of data displayed in your viewer.

Change your view or location with controls at the top
Use “base” to get right down to the nucleotides
Drag tracks up and down the viewer to re-arrange
Various select and focus options by clicking/dragging mouse
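The position arithmetic described for this part of the page (adding flanking sequence on both sides, or zooming by a factor around the middle of the window) can be sketched in a few lines. This is an illustration under stated assumptions, not the browser's code: positions are assumed to look like `chr17:7,500,000-7,600,000`, the function names are hypothetical, and the example range is illustrative rather than the real TP53 coordinates. The sketch does not know chromosome lengths, so it only clamps at the left end.

```python
import re


def parse_position(text):
    """Parse a 'chrom:start-end' string (commas allowed) into (chrom, start, end)."""
    match = re.fullmatch(r"\s*(\w+):([\d,]+)-([\d,]+)\s*", text)
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {text!r}")
    chrom, start, end = match.groups()
    start, end = int(start.replace(",", "")), int(end.replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def pad(position, flank=1000):
    """Extend the window by `flank` bases on each side (never below base 1)."""
    chrom, start, end = parse_position(position)
    return f"{chrom}:{max(1, start - flank)}-{end + flank}"


def zoom(position, factor):
    """Zoom about the window center: factor > 1 zooms in, factor < 1 zooms out."""
    chrom, start, end = parse_position(position)
    center = (start + end) // 2
    half = max(1, round((end - start + 1) / factor)) // 2
    return f"{chrom}:{max(1, center - half)}-{center + half}"


view = "chr17:7,500,000-7,600,000"  # illustrative range only
print(pad(view))       # 1000 extra bases on each side
print(zoom(view, 10))  # a 10-fold zoom in
```

The resulting string can be pasted into the “Position/Search” box in the same way as a hand-edited coordinate range.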

Annotation Track Display Options
Links to info and/or filters and color key; change track view; enforce menu changes

At the bottom of all the Genome Viewer pages are the controls for the data, the annotation tracks. This slide shows just a part of that section. In this slide I have focused on just one category area: “Mapping and Sequencing Tracks”. However, the pulldown menu definitions are the same for all of the annotation tracks.

The first important point is this: when you arrive at a fresh Genome Browser, some tracks are ON by default, and others are HIDDEN by default. For example, note that the display menu option for “Base Position” says “dense”. And see also that the display menu option for “Chromosome Band” says HIDE and is grey in color. So when you first arrive at the genome browser you are being shown only the default set of items which are already turned on.

Some of the annotation track names are pretty clear: UCSC Genes, or Human ESTs, for example. Other names may be less obvious. If you aren’t sure what type of data the track contains, all you need to do is click the hyperlink above the menu. Those links will present a page of information about the data in that track: the description of the data, the source of the data, any filters that might be available for that data, and possibly publications about the data if they are available. If there are color keys or graphical cues, they will be found on the linked track details page. There are so many data types, and new ones are being added all the time. Yet it is easy to learn about the details of these annotation tracks from these important hyperlinks.

Once you find the data types you want to see or hide, you can use the pulldown menus here to turn any individual annotation track ON or OFF. There are several options for data visibility here, and I’ll define those in the next slide. For my example, I’m going to illustrate what the “Spliced ESTs” track looks like with each of the menu choices.

Quickly note, though: you need to click a “refresh” button after you make any changes to the menus, to enforce those changes and actually see the display change in the viewer.

Some data is ON or OFF by default
Menu links to info about the tracks: content, methods
You change the view with pulldown menus
After making changes, REFRESH to enforce the change

Basic Annotation Track Menus Defined
Hide: removes a track from view
Dense: all items collapsed into a single line
Squish: each item = separate line, but 50% height + packed
Pack: each item separate, but efficiently stacked (full height)
Full: each item on separate line (may need to zoom to fit)

Here I will illustrate the different appearances of the menu selections, using the Human Spliced ESTs (expressed sequence tags) track as an example. I show the same region of human chromosome 17 as our TP53 gene, in the Spliced ESTs section of the viewer, using the different menu options:

Hide: completely removes the data from your image.

Dense: all items are collapsed into a single line; it fuses all the rows of data into one line. In this case it means that you can see where there is EST coverage, but you don’t know anything about individual ESTs in this view.

Squish: each item is on a line, but the graphics are only 50% of their regular height. Here you can see more information about individual ESTs.

Pack: each item is separate, but efficiently stacked like sardines. However, they are full-height diagrams, which makes it different from squish. Here you can see the GenBank accession numbers for the ESTs, which may be useful, and other useful details may be more visible.

Full: each item is on its own separate line, all the way down the browser viewer, up to a certain number of rows. If you have more than a couple of hundred items here the browser can become overloaded, and it will automatically revert to the more efficient “Pack” view. The tip here is to zoom into a smaller segment to load all of the features in Full detail.

To choose any of these options, just highlight it in the pulldown menu. To make the changes appear, you must click a “refresh” button. Let’s return to a few of the other page button options now.
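The difference between these menu choices is easiest to see as three ways of assigning items to rows. The sketch below is a toy model of that idea, not the browser's actual layout code: items are plain `(start, end)` pairs, `pack` uses a simple greedy first-fit stacking, and the `max_rows` limit of 250 is a hypothetical stand-in for the "couple of hundred items" mentioned above. Squish would use the same rows as pack, drawn at half height.

```python
def dense(items):
    """Collapse (start, end) items into merged blocks of coverage on a single line."""
    merged = []
    for start, end in sorted(items):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous block: extend it rather than start a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


def pack(items):
    """Greedy stacking: place each item on the first row where it does not overlap."""
    rows = []  # each row is a list of (start, end) items in left-to-right order
    for start, end in sorted(items):
        for row in rows:
            if start > row[-1][1]:
                row.append((start, end))
                break
        else:
            rows.append([(start, end)])
    return rows


def full(items, max_rows=250):
    """One item per row, falling back to pack() above a hypothetical row limit."""
    if len(items) > max_rows:
        return pack(items)
    return [[item] for item in sorted(items)]


ests = [(1, 10), (5, 20), (15, 30), (40, 50)]
print(dense(ests))      # coverage only, no individual items
print(len(pack(ests)))  # rows needed when items are stacked efficiently
print(len(full(ests)))  # one row per item
```

In this toy example the four items need one line in dense, two rows in pack, and four rows in full, which mirrors why pack is the more economical view for crowded regions.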

Tracks with Additional Options: Filters, more….
In addition to the default displays that you have when you arrive in genomic regions, there may be further alterations that can affect the displays. Here we’ll examine a few types of choices you may have.
Some tracks will have filters. These can enable you to change the features in various ways. In this example we show how to color the ESTs by the tissue field in the GenBank records. Keep in mind, though, that the GenBank records may not be consistently named, and some will not have tissue data at all. SNPs can also be filtered by different characteristics and are a nice place to explore the filter choices.
Some tracks have copious data that is not all displayed by default. One example of this is the Yale Transcription Factor Binding Site (TFBS) data that is part of the ENCODE project and this example is from the assembly. Be sure to check the details page for the various cell line and transcription factor options that you can display.
The ENCODE integrated regulation track combines several sub-tracks into one “super-track”. The super-track collection can be altered, and the underlying tracks can also be tweaked. You may need to click one more layer down to go to the specific details and settings for the individual component tracks within a super-track.
Some tracks have filters (ESTs shown; SNPs other good example). Some tracks may have undisplayed data (Yale TFBS; 2006). Super-tracks may have multiple components, various settings.

Mid-page Options to Change Settings
The final features I want to mention about controlling the Genome Viewer image are illustrated in this slide. This is a screen shot of the area around the middle of the Genome Browser page. First, let me draw your attention to the control buttons. The “track search” button is available here, and lets you search for specific data types and annotation tracks that you may want to locate. The “default tracks” and “default order” buttons return you to the default settings: they are an escape hatch if you made a lot of changes to the image and want to start over. The “hide all” button is useful if you want to build a customized view containing only the annotation tracks you care about. We talk about the custom tracks and track hub buttons in the advanced topics tutorial. “Configure” is a button we have seen before, on the Gateway page: it gives you access to a large web page where you can make all sorts of changes to the viewer. You can change the font and graphic size there, and also the window width (in pixels again). You can make broad changes to all the track menus, which are grouped together on that page for quick access to entire sections. There may be times when you would rather see your region of interest in the opposite orientation. Clicking the “reverse” button will quickly accomplish that. “Resize” will set the viewer window to a new width if you altered your browser size. I hope this provides some guidance on the many ways that you can control the Genome Browser viewer to visualize the data that is important for your research.
The controls shown: fit to browser window size; search for data types; flip display to Genomic 5’3’; resets, back to defaults; start from scratch; configure options page. You control the views with numerous features.

Cookies and Sessions
One thing that is important to know about changes you have made to the viewer: the browser remembers your settings and changes until you clear them. A cookie stored in your browser remembers where you were looking in the genome, and whether you made changes to the menus, filters, or data choices. As we have discussed, there are a number of changes you can make: to the position, the track displays, and even the filter options. These parameters are all saved on the computer you are using. This may be great, because you may always want to look at the data the same way, and as you move from one tool at this site to another, you “carry” your position with you. But it may not be great if you have forgotten that you filtered out something or turned off a track. And if you use a shared computer in the lab or a library, you don’t know whether someone has made changes since you last used the browser.
The UCSC team refers to these settings as being stored in your “cart”. There are a couple of ways to clear out your “cart”: you can choose the “default tracks” or “default order” buttons from the Viewer controls to reset the viewer to default settings. Or, before you even begin a search, you can choose the link that says “Click here to reset” on the Gateway starting page, which wipes out any cart choices. If you ever find that your genome browser isn’t behaving quite as you expect, try clearing your cart and starting again.
Another handy feature is the “Session” option. If you have a configuration that you want to store and return to later, or if there is some region you want to point people to specifically, you can save your view as a “session”. At the top of a viewer page there is a link for “Session” where you can accomplish this. You will need a login to use this, but once you have one you will see how easy it is to save views, segments, track configurations, and so on. You can save multiple sessions, and they can be uniquely named. You can reload a session at a later point, from any other computer. You can get a URL to share with colleagues if you like, or a session can be private. Sessions will not be stored indefinitely, but will be stored for several months. Check the Session help for more details on the session lifespan.
So there are many ways to customize your views to display the data that serves your research needs, and there are ways to save and restore your settings of choice. Understanding how to control these features can help you become more effective and efficient with your UCSC Genome Browser time.
Your browser remembers where you were (cookies). To clear your “cart” or parameters, click default tracks or reset. Save your setup as a “Session” and store or share it. Requires login. Lifespan: 4 months.

[end of Understanding Displays] That completes our examination of the Genome Viewer display features.
[beginning of Get Details or Sequences] In this section we go deeper than the display to find details about the items we see, and to obtain the actual sequences.

Click Any Viewer Object for More Details
We have spent a great deal of time on the Genome Viewer image, which offers a great deal of visual information about the genome data context and annotation tracks. But there is much more data available to you still. Here I’ve shown just the small area of the annotation track image that has been our focus, the upper section in the TP53 region, with our TP53 likely splice variants. You will remember that the one with the black highlight around the gene symbol is the one we selected in our original search. The black color of that line indicates that this entry corresponds to an entry in the PDB, or Protein Data Bank. We want to know more about that item specifically. To learn more, all you need to do is put your mouse somewhere along that line and click the item. When you do so, a new web page will open. Here I show just the upper section of the TP53 gene description page for this item. You will find many important details about the object that you clicked just one page down from the viewer. The point is that one click away, on any item in the Genome Viewer, there is a LOT more information available to you. Let’s look at an entire sample gene description page.
Click the item and a new description web page opens, with many details and links to more data about TP53. Example: click your mouse anywhere on the TP53 line.

Click Annotation Track Item for Description Pages
As I showed on the previous slide, one level down there are description or information pages that contain a great deal of additional information about that gene (or predicted gene, or SNP, or other item) in the viewer. I’m going to show just one sample here: the detailed information on the human TP53 UCSC Gene description page. The other types of data also have lots of additional information one layer down. This page is actually quite huge, and I know that you won’t be able to see all the details right now, but later you should go and see for yourself. There is extensive information about this gene, and links to many other resources as well. Practically one-stop shopping for known genes!
Sections of the description page include: informative description, synonyms, other resource links, microarray data, mRNA secondary structure, links to sequences, protein domains/structure, orthologs in other species, Gene Ontology™ descriptions, mRNA descriptions, pathways, genetic association studies, comparative toxicology, and gene model.
One thing to know: not all genes will have this level of detail, and not every species will have all this information. I have specifically chosen a well-known gene for our example. Some genes won’t have protein structures, some won’t have pathway information, and some won’t have microarray data. But if the data is available, it will be available to you on these detail pages. Other pages will carry different types of data, of course. I have attached here a small part of a SNP page, showing position, sequence, validation status, function, and so on. Different data types will have different description and details pages. You only have to click on any item in the viewer to get to these details pages.
Not all genes have this much detail. Different annotation tracks carry different data. Click a SNP to get SNP details.

Get DNA, with Extended Case/Color Options
So far, we have seen visual cues and lots of text-based data. But one Frequently Asked Question that people have at this point is: “Where is the sequence data?” I want to spend a couple of slides on that topic so that you will know how to get to the sequence-level data from the browser. From the viewer, there are two handy and quick ways to get the sequence information.
First, back on the TP53 viewer section, you could simply click the DNA link in the blue navigation bar at the top of the page. The link will bring you to a new GET DNA in Window web page, shown in the center. As you can see, the position you were looking at in the viewer is carried here, and is specified in the position box. This takes whatever you were examining in your viewer window. On this page you have several options to format the sequence: you can tweak the output by adding some bases upstream or downstream; you can get the sequence in upper or lower case; you can mask repeated, low complexity regions; or you can get the reverse strand. You could just click the “get DNA” button to get the sequence in a new web page; the output will be in FASTA format.
The second button option offers even more ways to customize the output DNA sequence. If you click the “extended case/color options” button, you’ll get a new page that lets you change the case of individual items, change their colors, underline specific features, and so on. The choices that you will see in the list are based on the tracks actively shown in the Genome Viewer window you were looking at. If a track is in “hide” mode, it won’t be in the list. If there’s too much to choose from, go back and turn off some tracks to make it easier to view. This is a distinctive way to look at your sequence of interest, and the output can be copied to text documents for later review. As you can see in a sample output, different features look different by color, case, or underlines.
The two options that I just described deal with getting the whole region of DNA from the span that was present in your genome viewer window. But you have another option: you can get just the sequence you want from an annotation track item. That’s what we’ll look at in the next slide.
Use the View DNA link at the top. Plain or Extended options. Change colors, fonts, underline, etc.
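The formatting choices on the Get DNA page (extra flanking bases, case, reverse strand, FASTA output) can be mimicked on a local sequence. This is a minimal Python sketch under stated assumptions, not the browser’s implementation: the function names, the 0-based half-open coordinates, and the toy sequence are all hypothetical, and padding is applied on the forward strand before any strand flip.

```python
# Translation table for complementing DNA while preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def format_region(chrom_seq: str, start: int, end: int, upstream: int = 0,
                  downstream: int = 0, reverse: bool = False,
                  upper: bool = True, name: str = "region",
                  width: int = 50) -> str:
    """Extract chrom_seq[start:end] plus optional flanks as a FASTA record."""
    lo = max(0, start - upstream)               # do not run off the left end
    hi = min(len(chrom_seq), end + downstream)  # do not run off the right end
    seq = chrom_seq[lo:hi]
    if reverse:
        seq = reverse_complement(seq)
    seq = seq.upper() if upper else seq.lower()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines)


toy = "AAACCCGGGTTT"  # stand-in for a chromosome
print(format_region(toy, 3, 6))                            # CCC
print(format_region(toy, 3, 6, upstream=2, downstream=1))  # AACCCG
print(format_region(toy, 3, 6, upstream=2, downstream=1, reverse=True))  # CGGGTT
```

Repeat masking is left out of the sketch because it depends on annotation data that a bare sequence string does not carry.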

Get Sequence from Description Pages
In this second example of how to get sequence data, I’m showing a screen shot of the TP53 annotation track in the UCSC Genes section. As before, we would click on the item to get to the TP53 details page. From the details pages you can get the specific sequence for that item. Here I’m showing a part of that details page: there is a box for the sequence section called “Sequence and Links to Tools and Databases”. You can scroll down the details page to find the sequence section. Here you will find links to the genomic sequence, the mRNA sequence, and the protein sequence. You can use these links to get the specific sequence, plus additional options if you choose the genomic sequence, which is great for promoter studies, intron studies, and so on.
So the sequence of the items in your viewer is just a couple of clicks away, using either the DNA link at the top to get the whole window, or the links from the information pages to obtain sequence for specific items. And if you have sequence data, you can use that to perform a search. I’m going to copy the mRNA sequence here to use in the next section.
Note: You can also download large lists of multiple sequences or more complex queries from the Table Browser, but that is beyond the scope of our introduction here. Please see the advanced tutorial for more details on that topic.
Click an item, then go to the Sequence section of the description page. Copy the whole mRNA for the next segment.

[end of Get Details or Sequences] That completes our examination of the access to details and sequence information from the Genome Viewer.
[beginning of Sequence Searching] In this section, we will examine the way to search the UCSC Genome Browser starting with sequence data.

Accessing the BLAT Tool
In the UCSC Genome Browser, the tool you will use for sequence searching is called BLAT. Many of you will be familiar with the alignment tool called BLAST®, which stands for “Basic Local Alignment Search Tool”. If you have used the NCBI databases and searched for similar sequences, you have probably used BLAST. But BLAT is different: it is the BLAST-like alignment tool, and it searches the database slightly differently than BLAST does. BLAT uses an index of the sequences in the database, something like the index in the back of a biochemistry textbook. The BLAT index consists of occurrences of 11-oligomer sequences in the genome (or 4-mers for protein sequences). Just as you can quickly scan a book index to find the correct word, BLAT scans the index for matching 11-mers, and then builds the rest of the match out from there. It is a very fast way to search the sequences. BLAST does it the other way: it indexes your query and then runs your smaller index over everything. That’s the essential difference in the algorithm. The outcome will still be sequences that are aligned with each other so you can compare the matches.
BLAT works best with sequences with high identity, and greater than 21 bases long, but you can find more distant matches as well. Directly from the UCSC documentation: “On DNA queries, BLAT is designed to quickly find sequences with 95% or greater similarity of length 25 bases or more. It may miss genomic alignments that are more divergent or shorter than these minimums, although it will find perfect sequence matches of 33 bases and sometimes as few as 22 bases. On protein queries, BLAT rapidly locates genomic sequences with 80% or greater similarity of length 20 amino acids or more. In general, gene family members that arose within the last 350 million years can generally be detected.”
For many people it will be enough to know that there is a means of searching for your region of interest in the database by starting with a sequence. For the more casual BLAT user, check out the “Help” and “Frequently Asked Questions” documentation at the UCSC web site for a little more detail about the way BLAT works. For the more mathematically inclined, you can see the publication by Jim Kent that describes BLAT in more detail.
So now we know a little bit about the BLAT tool. How do we get to it? Start at the UCSC Genome Browser homepage, or use the blue navigation bar at the top of most UCSC pages, and select the link called BLAT to get started.
[not read in recording]
BLAST (original paper): Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol Oct 5;215(3): ct
BLAT (original paper): W. James Kent (2002) BLAT - The BLAST-Like Alignment Tool, Genome Res 12:
BLAT = BLAST-like Alignment Tool. Rapid searches by INDEXING the entire genome. Works best with high similarity matches. See documentation and publication for details: Kent, WJ. Genome Res :656 and “Help”.
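The book-index analogy above can be sketched in a few lines of Python. This is a toy illustration of index-then-extend seeding, not BLAT itself: the function names are hypothetical, the genome index here uses non-overlapping k-mers as a simplifying assumption, the demo uses k = 4 on a made-up sequence rather than 11-mers on a real genome, and the extension of seeds into full alignments is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Record where each k-mer starts, stepping through the genome k bases at a time."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_seeds(index: Dict[str, List[int]], query: str, k: int) -> List[Tuple[int, int]]:
    """Look up every k-mer of the query; return (query_pos, genome_pos) hits."""
    seeds: List[Tuple[int, int]] = []
    for q_pos in range(len(query) - k + 1):
        for g_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, g_pos))
    return seeds


genome = "ACGTACGTTTGACCA"   # made-up sequence
index = build_index(genome, k=4)
print(find_seeds(index, "GTTTGAC", k=4))  # [(2, 8)]
```

The single seed says that query position 2 matches genome position 8; a real aligner would then extend outward from that seed to recover the rest of the match.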

BLAT Tool Interface
Shown here is the interface for BLAT. We will work our way down this page. As you can see along the top, there are a few parameters you can change, and some choices you have to make. First, you must choose one species to search. You search one species at a time with this tool. Then you choose an assembly, which we have seen before in the basic search section. Next, you may let the BLAT tool guess whether you have entered nucleotides or amino acids, or you can tell it which one you are using. The default is “BLAT’s guess”, which has always determined the sequence composition correctly in my experience, but you can specify if you like. Sort output, on the default settings here, will list the results grouped by query item if you use more than one sequence as input, with the best scoring matches first in each group. Output type specifies whether you want the output in the browser form, or in files you can use later. Hyperlink is the default, which displays in the browser, and that’s what I’ll be using for this example. The other type, PSL or “pattern space layout” output styles, are useful for people who want a differently structured, text-based output that can be used for a variety of purposes.
There is a large text box where you can paste your sequence or sequences. You can paste one or more sequences, but there are limits to how much BLAT you can do, as it is a large burden on the servers. You can submit up to 25,000 bases or 10,000 amino acids, up to a total of 25 sequences. If you need to do more BLAT, UCSC asks that you download it and run a local copy. Instructions for this can be found in the documentation. There is also an option to upload your sequence (or sequences), if you keep a file of them. At the bottom of the page there is a link to the in-silico PCR tool. If you are exploring the genome with primer-sized sequences, that may be a better choice.
For this example, we’ll use this BLAT interface and paste in the TP53 mRNA we copied in the previous section. It is in the FASTA format, which you have to use if you are going to use multiple sequences. Finally, you click “submit” to send your query to the database. There is a special button, the “I’m Feeling Lucky” button. If you click that, just like in Google, you will be taken right away to the position of your best match in the Genome Viewer. But I’ll be demonstrating the plain old “submit” button right now.
Make choices. Paste one or more sequences (FASTA for more than one), or upload. Limits apply to DNA bases, protein aa, and 25 total sequences. Submit.
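Before pasting a batch of sequences, it can help to check them against the limits quoted above. The sketch below is a hypothetical pre-flight check in Python, not part of any UCSC tool; it assumes the 25,000-base and 10,000-amino-acid limits apply to each sequence individually, which is one reading of the narration, and the function names are made up for illustration.

```python
from typing import List, Tuple

# Limits as stated in the tutorial narration.
MAX_DNA_BASES = 25_000
MAX_PROTEIN_AA = 10_000
MAX_SEQUENCES = 25


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Split FASTA text into (name, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            if name is None:
                name = "unnamed"  # bare sequence pasted without a header line
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_blat_limits(text: str, is_protein: bool = False) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    limit = MAX_PROTEIN_AA if is_protein else MAX_DNA_BASES
    records = parse_fasta(text)
    problems: List[str] = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences exceeds the limit of {MAX_SEQUENCES}")
    for name, seq in records:
        if len(seq) > limit:  # assumption: the limit is per sequence
            problems.append(f"{name}: length {len(seq)} exceeds the limit of {limit}")
    return problems


print(check_blat_limits(">tp53_fragment\nACGTACGT\n>other\nGGCC\n"))  # []
```

If the check reports problems, the tutorial’s advice applies: run a local copy of BLAT for larger jobs.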

BLAT Results with Hyperlinks
Here we see the results of a BLAT search against the human genome, using the sample mRNA sequence. As you can see, we have sorted the list by the query and score. There is a really high scoring match at the top. After that the matches appear to be of lower quality, and cover quite small regions by the time you get to the bottom of the list. Now, you’ll remember that we asked for hyperlinked results in our setup. You can see that there are two columns of links for us. One says “browser”, one says “details”. The first thing that I will do is demonstrate a click of the “browser” link for the matches. This will link me to the position of this match in the Genome Viewer. I will show a sample of that on the next slide. Later we will click on the “details” link for the best match. That will give us a new page with sequence information, as you’ll see a couple of slides from now.
Results with demo sequences, settings default; sort = Query, Score. Score is a count of matches: the higher the number, the better the match. Click browser to go to the Genome Browser image location (next slide). Click details to see the alignment to genomic sequence (2nd slide).
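The “Query, Score” ordering described above (grouped by query, best score first within each group) can be reproduced on any list of hits. The Python sketch below uses a hypothetical `Hit` record and invented values purely for illustration; it does not reflect the actual columns or numbers of a BLAT results page.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One row of a results list; the fields are illustrative only."""
    query: str
    score: int
    chrom: str
    start: int


def sort_hits(hits: List[Hit]) -> List[Hit]:
    """Group hits by query name, with the highest score first in each group."""
    return sorted(hits, key=lambda h: (h.query, -h.score))


# Invented example hits for two query sequences.
hits = [
    Hit("seqB", 40, "chr2", 5000),
    Hit("seqA", 25, "chr1", 900),
    Hit("seqA", 2500, "chr17", 100),
    Hit("seqB", 310, "chr5", 42),
]
for hit in sort_hits(hits):
    print(hit.query, hit.score, hit.chrom, hit.start)
```

Negating the score inside the sort key keeps the query names in ascending order while ranking scores in descending order in a single pass.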

BLAT Results: Browser Link
When you link from the BLAT results to the browser, a special track appears in the Viewer. Just down from the top there is a new line on the browser that says “Your Sequence from Blat Search”, and the name of my query sequence is listed over on the left. If you look at the UCSC genes, you can see that we have matched the TP53 region, which is what I would have expected from the BLAT query. Your sequence becomes a track that you can adjust like other tracks. In the menu area you have the option to alter its visibility with the menu choices we’ve seen before. So we have used a sequence as a starting point to search the genome, and we see the location of our match directly on the Genome Browser by clicking the “Browser” link from our BLAT results. BLAT is another good place to start searching for your genes of interest in the UCSC Genome Browser tool. There was another link on the BLAT results, called “Details”. Going back to the results list will let you access that link as well.
From the browser click in the BLAT results, a new track line with Your Sequence from BLAT Search appears, labeled with your query name, along with a new menu to adjust it.

BLAT Results, Alignment Details
Here I show the outcome if you click the “Details” link from the BLAT results page. You’ll have a page dedicated to the alignment of your sequence and the genomic reference sequence. It’s not possible for me to show the whole alignment page for this match, so I show the uppermost segment here. The page is divided into several parts, which you can reach by scrolling down or by clicking the links on the left. The top part shows the query sequence you put in (in this case our human mRNA sequence). The next part of the page shows the match of your sequence (in blue capital letters) within the context of the genomic sequence. This gives you a quick look at the possible exon/intron structure if you have used an mRNA sequence as I have: the matched blue segments are the likely exons, and the black text between them the likely introns. The bottom part shows you the actual nucleotide-for-nucleotide matches, which may be more like the BLAST results you are used to seeing. I magnified the top of the side-by-side alignment so you can see where my query sequence on the top (starts with number …001) lines up with the genomic sequence. You can judge the quality of the match yourself in this section. In this case there are several “blocks”, one for each matched segment. Although I have shown nucleotide sequence in the example, you can BLAT with a protein sequence and see where the protein sequence matches in the genomic framework as well. If you had started with a protein sequence, your amino acid sequence would be displayed with the corresponding genomic nucleotide sequence. So you can start to search the UCSC Genome Browser data with a sequence, and view the results in either the Genome Viewer or at the level of alignment detail shown here.
The page shows your query, the genomic match with color cues, and the side-by-side alignment of your sequence and the genomic sequence.

[end of Sequence Searches] That completes our look at sequence searching in the UCSC Genome Browser.
[beginning of Summary] In this section we will summarize this tutorial.

[end of Summary] That completes our summary.
[beginning of Exercises] In this section we will explore exercises that reinforce concepts developed in this tutorial.

Notice: The materials and slides offered are for non-commercial use only. Reproduction, distribution and/or use for commercial purposes is strictly prohibited. Copyright 2012, OpenHelix, LLC.
