Sequence analysis using EMBOSS & wEMBOSS by Martin Sarachu Based on the EMBOSS tutorial, by Nikos Drakos, Val Curwen, David Martin, Gary Williams and many.


1 Sequence analysis using EMBOSS & wEMBOSS by Martin Sarachu Based on the EMBOSS tutorial, by Nikos Drakos, Val Curwen, David Martin, Gary Williams and many more. Find this tutorial at www.emboss.org Throughout this tutorial, we're going to look at members of the rhodopsin family of G-protein coupled receptors. The general principles are, of course, applicable to any sequences you would like to analyse.

2 Retrieving sequences from databases Hands-on: Look at the databases available with showdb (Information>>showdb). The output is a simple table displaying the names, contents and access methods of the databases. ID means that programs can extract a single, explicitly named entry from the database, for example: embl:x13776. Query means that programs can extract a set of entries matching a wildcard entry name, for example: sw:pax*_human. All means that programs can analyse all the entries in the database sequentially, for example: embl:*. Hands-on: Retrieve the sequence with identifier xlrhodop from the embl DB (Edit>>seqret). Hands-on: Copy the sequence to your current project and include it in nucList.
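The three access methods can be illustrated with a small sketch that splits a database:entry address and reports which method it would require. This is only a toy classifier written for this tutorial; the function name classify_usa is hypothetical and is not part of EMBOSS, and only the * wildcard shown on the slide is handled.

```python
def classify_usa(usa: str) -> tuple[str, str, str]:
    """Split 'database:entry' and name the access method the query needs.

    Hypothetical helper for illustration only; EMBOSS itself does this
    internally when it resolves a sequence address.
    """
    database, _, entry = usa.partition(":")
    if entry == "*":
        method = "All"      # every entry in the database, one after another
    elif "*" in entry:
        method = "Query"    # a set of entries matching a wildcard name
    else:
        method = "ID"       # one explicitly named entry
    return database, entry, method


for usa in ("embl:x13776", "sw:pax*_human", "embl:*"):
    print(usa, "->", classify_usa(usa))
```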

3 Getting information about sequences infoseq is a small utility that lists a sequence's USA (Uniform Sequence Address), name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic sequences), and/or description. Hands-on: Run infoseq (Information>>infoseq) with the sequence xlrhodop in your project. This sequence corresponds to a sequence in SwissProt that has the identifier OPSD_XENLA. Hands-on: Retrieve the information about all OPSD sequences in SwissProt (sw DB, use the opsd_* wildcard).
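To make a few of the reported fields concrete, here is a minimal sketch that derives type, length and percentage G+C from a raw sequence string. It is not how infoseq is implemented; the function name summarize and the alphabet-based type guess are assumptions made for illustration, and the guess will misclassify a protein made up only of the letters A, C, G, T, U and N.

```python
def summarize(name: str, seq: str) -> dict:
    """Return an infoseq-like summary (hypothetical helper, not EMBOSS code)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Crude type guess: only nucleotide letters means nucleic, otherwise protein.
    is_nucleic = set(seq) <= set("ACGTUN")
    pgc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if is_nucleic else None
    return {
        "name": name,
        "type": "N" if is_nucleic else "P",
        "length": len(seq),
        "pgc": pgc,  # percentage G+C, only meaningful for nucleic sequences
    }


print(summarize("demo_dna", "ATGCGC"))
print(summarize("demo_pep", "MNGTEGPNFY"))
```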

4 Pairwise sequence alignment An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ. The most intuitive representation of the comparison between two sequences uses dot-plots. One sequence is represented on each axis and significant matching regions are distributed along diagonals in the matrix. Hands-on: Upload sequence xl23808 from your computer to the current project and add it to nucList. Hands-on: Make a dotplot with dottup between xl23808 and xlrhodop (Alignment>>Dot Plots>>dottup).
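The idea behind a word-match dot plot can be sketched in a few lines: index every word of one sequence, then report each position pair where the other sequence contains the same word. The function name word_matches and the word size of 4 are assumptions for this example; it returns coordinates rather than drawing a plot, and runs of points with a constant difference between the two coordinates form the diagonals described above.

```python
from collections import defaultdict


def word_matches(seq_a: str, seq_b: str, wordsize: int = 4) -> list[tuple[int, int]]:
    """Return 0-based (i, j) pairs where seq_a[i:i+wordsize] == seq_b[j:j+wordsize]."""
    # Index every word of seq_b by its starting position.
    index = defaultdict(list)
    for j in range(len(seq_b) - wordsize + 1):
        index[seq_b[j:j + wordsize]].append(j)
    # Look up every word of seq_a in that index.
    hits = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], []):
            hits.append((i, j))
    return hits


print(word_matches("ACGTAC", "TTACGT"))
```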

5 Global alignment A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length. needle is an implementation of the Needleman-Wunsch algorithm for global alignment. The computation is rigorous, and needle can be time-consuming to run if the sequences are long. Hands-on: Do a global alignment between xlrhodop (1-470 region) and xl23808 (1110-1700 region) (Alignment>>Global>>needle). stretcher is another EMBOSS program for global alignment; it is less rigorous and therefore runs more quickly. Useful for DB searching.
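A minimal sketch of the Needleman-Wunsch recurrence may help to show why the computation grows with the product of the two sequence lengths. It is a simplification and not the needle implementation: it uses a single linear gap penalty and match/mismatch scores chosen arbitrarily for this example, rather than a substitution matrix with separate gap opening and extension penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty (illustrative scores only)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the full matrix: every cell looks at its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, line_a, line_b = needleman_wunsch("ACGT", "AGT")
print(total)
print(line_a)
print(line_b)
```

Because every cell of the matrix is filled, time and memory both scale with len(a) × len(b), which is the cost the slide refers to for long sequences.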

6 Local alignment Local alignment methods are very useful for scanning databases or when you do not know that the sequences are similar over their entire lengths. water is a rigorous implementation of the Smith-Waterman algorithm for local alignments. Hands-on: Perform a local alignment between xlrhodop & xl23808 (Alignment>>Local>>water). matcher is an EMBOSS program for local alignment; it is less rigorous and therefore runs more quickly. Useful for DB searching. supermatcher is designed for local alignments of very large sequences and is even less rigorous in its implementation. You can look at its documentation by clicking the “Manual” button on the program’s menu.
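The difference from global alignment can be seen in a short sketch of the Smith-Waterman recurrence: scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the corner. As with any toy version, this is not the water implementation; the linear gap penalty and the score values are assumptions for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment with a linear gap penalty (illustrative scores only)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```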

7 Identifying the ORF We can get a rapid visual overview of the distribution of ORFs in the six frames of our sequence using plotorf. Hands-on: Run plotorf with sequence xlrhodop (Nucleic>>Translation>>plotorf). The longest ORF is in frame 2, from around position 100 to 1200. Hands-on: Identify the exact start and end points for translation with getorf (Nucleic>>Gene finding>>getorf). Look at the output options! Translate your sequence between START and STOP codons. We know from plotorf that our ORF will be in the region 100 to 1200. Identify the actual start and end positions.
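What "ORFs in the six frames" means can be sketched as follows: read each strand in three offsets, and record every stretch that runs from an ATG to the next in-frame stop codon. This is a simplified illustration and not the getorf algorithm; the function name find_orfs, the minimum length and the reported coordinates (1-based on the strand being read, stop codon included) are choices made for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[str, int, int, int]]:
    """Return (strand, frame, start, end) for ATG...STOP regions, longest first."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]   # reverse complement
    orfs = []
    for strand, s in (("+", seq), ("-", reverse)):
        for offset in range(3):
            start = None
            for pos in range(offset, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos                 # first START in this stretch
                elif start is not None and codon in STOPS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, offset + 1, start + 1, pos + 3))
                    start = None                # look for the next START
    return sorted(orfs, key=lambda orf: orf[3] - orf[2], reverse=True)


print(find_orfs("CCATGAAATTTTAGCC"))
```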

8 Translating the sequence Hands-on: You should have found that the region to be translated is from 110 to 1171 in our cDNA sequence. Use transeq to translate that region (Nucleic>>Translation>>transeq). Hands-on: Copy xlrhodop.pep to your project and add it to protList. pepinfo produces information on amino acid properties (size, polarity, aromaticity, charge, etc). Hands-on: Run pepinfo with xlrhodop.pep and examine the information it provides (Protein>>Composition>>pepinfo).

9 Pattern matching In a number of cases, the active site of a protein can be recognized by a specific fingerprint or template, a fairly small set of residues that are unique to a family of proteins. An example is the sequence GXGXXG (where G=glycine and X=any amino acid), which defines a GTP binding site. Searching for a (rather loose) predefined string of characters in a sequence is called Pattern Matching. Hands-on: Use patmatmotifs to search your protein sequence for motifs defined in the PROSITE DB of protein families and domains (Protein>>Motifs>>patmatmotifs). Look at the output options! Specify a full documentation output. In our case we already know that our sequence is a rhodopsin. However, if you had an unknown sequence, we hope you can see that identifying motifs might provide you with information to help you plan further experiments.
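A loose pattern such as GXGXXG maps directly onto a regular expression in which each X becomes "any residue". The sketch below covers only this simple X notation, not the full PROSITE pattern syntax used by patmatmotifs; the function name find_pattern and the example peptide are made up for illustration.

```python
import re


def find_pattern(pattern: str, protein: str) -> list[tuple[int, str]]:
    """Return (1-based position, matched text) for every occurrence, overlaps included."""
    # X means "any amino acid"; every other letter must match exactly.
    body = "".join("." if ch == "X" else re.escape(ch) for ch in pattern.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + body + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein.upper())]


print(find_pattern("GXGXXG", "MKGAGKSGLL"))
```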

10 Protein fingerprints PRINTS is a database that defines functional protein families, identifying each domain by a number of short, particularly well conserved sequences. A full match to one of these "fingerprints" will match all the relevant short sequences in the correct order. A partial match is recorded if some are missing or if they occur in an incorrect order. Hands-on: use pscan with your peptide sequence and examine the matches. (Protein>>Motifs>>pscan)
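The distinction between a full and a partial match can be sketched with a toy classifier. It is heavily simplified: real fingerprints are scored against conserved motif alignments, whereas this example assumes each motif is an exact string and only checks which motifs are present and whether they occur in order. The function name and the example sequence are hypothetical.

```python
def classify_fingerprint(sequence: str, motifs: list[str]) -> str:
    """Return 'full', 'partial' or 'none' for an ordered list of exact motifs."""
    positions = [sequence.find(motif) for motif in motifs]   # -1 means missing
    found = [pos for pos in positions if pos >= 0]
    if not found:
        return "none"
    all_present = len(found) == len(motifs)
    in_order = all(earlier < later for earlier, later in zip(found, found[1:]))
    # Full match: every motif present, in the order the fingerprint defines.
    return "full" if all_present and in_order else "partial"


demo = "AAAWWWCCCYYY"
print(classify_fingerprint(demo, ["AAA", "WWW", "YYY"]))   # all present, in order
print(classify_fingerprint(demo, ["WWW", "AAA"]))          # present, wrong order
print(classify_fingerprint(demo, ["AAA", "QQQ"]))          # one motif missing
```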

11 Multiple Sequence Analysis One of the most popular programs for performing multiple sequence alignments is clustalw. The EMBOSS interface to clustal is emma. pscan has told us that our sequence belongs to the rhodopsin family. We will now retrieve some further members of the family from SwissProt and produce a multiple alignment; we'll then use this multiple alignment to produce a profile of this group of sequences and use that to align them all to our original sequence. Hands-on: Use seqret to retrieve a set of sequences from the SwissProt DB; use the ops2_* wildcard to get all sequences whose identifiers begin with ops2_. Hands-on: Copy the output file to your project, rename it to ops2.fasta and add it to protList.

12 Multiple Sequence Analysis Hands-on: Align these sequences using emma (Alignment>>Multiple>>emma). It will produce an alignment and a dendrogram. We have aligned ops2 sequences from two fruit fly species, two crab species, locust and scallop. Hands-on: Copy the alignment to your project, and view it. The sequences are similar, but there are differences. Add the alignment to your protList. Hands-on: prettyplot will give you a clearer view of differences by aligning the sequences on top of one another (Alignment>>Multiple>>prettyplot). Identical residues are shown in red, and similar residues in green. This type of display can give you a first impression of regions of conservation.

13 Profiles Profile analysis is a sequence comparison method for finding and aligning distantly related sequences. The comparison allows a new sequence to be aligned optimally to a family of similar sequences. Hands-on: prophecy is an EMBOSS program for creating a profile from a set of multiple aligned sequences. Create a profile from ops2 alignment. (Protein>>Profiles>>prophecy) Look at output options! Specify a Gribskov profile type. When prophecy finishes, copy the profile to your current project. Hands-on: use prophet to align xlrhodop.pep to the ops2 profile. (Protein>>Profiles>>prophet) The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. We hope you can see that aligning members of a family can reveal conserved regions that may be important for structure and/or function.
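The idea of turning an alignment into a profile can be sketched with per-column residue counts. This is a much simpler scheme than the Gribskov profile that prophecy builds, which also weights residues by a substitution matrix and handles gaps; the function names and the three short aligned sequences below are invented for illustration.

```python
from collections import Counter


def build_profile(alignment: list[str]) -> list[Counter]:
    """Count the residues seen in each column of an ungapped toy alignment."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]


def consensus(profile: list[Counter]) -> str:
    """Most frequent residue in each column."""
    return "".join(column.most_common(1)[0][0] for column in profile)


def score(profile: list[Counter], seq: str) -> float:
    """Sum, over columns, of the fraction of aligned sequences sharing seq's residue."""
    return sum(column[res] / sum(column.values()) for column, res in zip(profile, seq))


profile = build_profile(["MNGTEG", "MNGSEG", "MDGTEG"])
print(consensus(profile))
print(round(score(profile, "MNGTEG"), 3))
```

A sequence that agrees with the family in the well-conserved columns scores highly even if it differs elsewhere, which is what makes profiles useful for distantly related sequences.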

14 Conclusion We have shown you some of the programs available within EMBOSS, and have introduced you to the way you can run these programs from wEMBOSS. You can search for EMBOSS programs within wEMBOSS from the “Search for programs” frame. You can examine individual program documentation from the program menu. You can get a listing of all EMBOSS programs from wossname (Information>>wossname). EMBOSS site: www.emboss.org. wEMBOSS site: www.ar.embnet.org/wEMBOSS

