HKUHKU Computer Centre Introduction to EMBOSS Christine Ho

HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk

HKUHKU Computer Centre Web page of EMBOSS  The EMBOSS programs are available at http://bioinfo.hku.hk/EMBOSS/  The files required for this lecture are available at http://bioinfo.hku.hk/tutorial/  Users are required to apply for a BIOINFO account to use the tools on the web and off-line, and to download the databases.  Registration for a BIOINFO account is free and open to the public; usage of BIOINFO is restricted to academic and research purposes.  How to apply for a BIOINFO account:  HKU members: submit the HKUESD application form (Cfe-139)  Non-HKU members: submit the application form at http://www.hku.hk/ccoffice/forms/cf139.pdf  Questions and comments: biosupport@bioinfo.hku.hk

HKUHKU Computer Centre What is EMBOSS?  EMBOSS (The European Molecular Biology Open Software Suite) is a free, Open Source software package that provides a comprehensive set of sequence analysis programs specially developed for the needs of the molecular biology user community.  Within EMBOSS you will find around 100 programs (applications).  More information about EMBOSS can be found at http://www.uk.embnet.org/Software/EMBOSS/

HKUHKU Computer Centre Main Programs in EMBOSS  Retrieve sequences from database  Sequence alignment  Nucleic gene finding and translation  Protein secondary structure prediction  Rapid database searching with sequence patterns  Protein motif identification, including domain analysis  Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats.  Codon usage analysis for small genomes  Rapid identification of sequence patterns in large scale sequence sets  Presentation tools for publication

HKUHKU Computer Centre Starting EMBOSS  EMBOSS can be started in the following ways:  Command line after logging in to bioinfo.hku.hk  Web interface (EMBOSS-GUI)

HKUHKU Computer Centre Command line of EMBOSS  Inside HKU campus  telnet bioinfo.hku.hk  Outside HKU campus  Windows machine  Use PuTTY, see FAQ Q13 at http://bioinfo.hku.hk  Linux or UNIX machine  ssh username@bioinfo.hku.hk (replace username with your own account name)

HKUHKU Computer Centre Web interface of EMBOSS  Directly access the web page at http://bioinfo.hku.hk/EMBOSS/  Or browse the BIOSUPPORT Homepage: http://bioinfo.hku.hk/ and select “Tools” Option http://bioinfo.hku.hk/

HKUHKU Computer Centre Web interface of EMBOSS  Click on the link EMBOSS - GUI

HKUHKU Computer Centre Programs in EMBOSS Parameters in EMBOSS  Input can be:  Uniform Sequence Addresses (USAs) path in the format:  database  database:entry_name or database:accession_number (e.g. embl:xlrhodop or embl:L07770)  database:wildcard (sw:opsd_a*)  filename  filename:entry  format::filename  @list  The sequence data to be pasted in the text area.
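The USA forms listed on this slide can be told apart by a few simple string checks. The following is only an illustrative sketch, not how EMBOSS itself parses a USA: the function name `classify_usa` and the labels it returns are made up for this example, and it cannot distinguish a database name from a file name because that depends on which databases are installed.

```python
def classify_usa(usa: str) -> str:
    """Return a rough label for the form of a USA string (illustrative only)."""
    if usa.startswith("@"):
        return "list file"                # @list
    if "::" in usa:
        return "format::filename"         # explicit format prefix
    if ":" in usa:
        _, entry = usa.split(":", 1)
        if "*" in entry or "?" in entry:
            return "name:wildcard"        # e.g. sw:opsd_a*
        return "name:entry"               # e.g. embl:xlrhodop or embl:L07770
    return "database or filename"         # bare name, cannot tell which


# "fasta::myseq.txt" and "myseq.txt" are hypothetical file names.
for example in ["embl:xlrhodop", "sw:opsd_a*", "@list", "fasta::myseq.txt", "myseq.txt"]:
    print(example, "->", classify_usa(example))
```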

HKUHKU Computer Centre Programs in EMBOSS  Output will be:  Textual and/or graphical representation of data.  The output can be saved as text file or in some cases image file in PNG or PS format.

HKUHKU Computer Centre EMBOSS online help  The documentation for EMBOSS is available at http://bioinfo.hku.hk/emboss/

HKUHKU Computer Centre Difference between GCG and EMBOSS
  File formats supported
    GCG: GCG, MSF, RSF, FastA, BLAST (other file formats must be converted using a program, e.g. FromFastA, FromEMBL, FromPIR, etc.)
    EMBOSS: ABI trace file, ACeDB, Clustal ALN (multiple alignment), EMBL, FASTA, GENBANK, NBRF (PIR), PHYLIP interleaved multiple alignment, SWISSPROT, plain text, etc.
  No. of sequences in one file
    GCG: One file can only have one sequence.
    EMBOSS: One file can have multiple sequences.
  3rd party packages included
    GCG: FASTA, BLAST
    EMBOSS: FASTA, BLAST and assembly programs are not included. They must be run separately.
  Upper limit of sequence size
    GCG: 35K
    EMBOSS: 2G

HKUHKU Computer Centre Replacement of GCG programs  Exchanging sequences between packages In GCGIn EMBOSS getseqNewseq Fromfasta, tofasta, fromembl, toembl From…, to… (any program that reads/writes sequences) seqret

HKUHKU Computer Centre Replacement of GCG programs  Sequence editing, manipulation and display In GCGIn EMBOSS fetchSeqret Seqed command delete command insert No complete solution yet cutseq pasteseq lineupNo good solution yet assembleunion shuffleshuffleseq reverseRevseq chopupNot needed as EMBOSS reads ‘any’ format publishShowseq, prettyseq

HKUHKU Computer Centre Replacement of GCG programs  Sequence comparison and alignment In GCGIn EMBOSS compare+dotplot (default (window stringency)) Compare+dotplot (word=n) Dotmatcher dottup GapNeedle, stretcher (for long sequences) bestfitWater, matcher (for long sequences) Pileup, clustalEmma (=CLUSTAL) prettyCons, showalign  Translation In GCGIn EMBOSS translatetranseq

HKUHKU Computer Centre Replacement of GCG programs  Patterns and gene finding In GCGIn EMBOSS FindpatternsFuzznuc, fuzztrans, fuzzprot NB: uses PROSITE syntax (not GCG) to define pattern motifsPatmatmotifs NB: ps_scan searches also PROSITE profiles codonpreferenceSyco, wobble

HKUHKU Computer Centre Replacement of GCG programs  Phylogeny In GCGIn EMBOSS distances+growtreeEdnadist or eprotdist+ eneighbor In GCGIn EMBOSS Map - With option “Find translationally silent potential restriction sites” - With option options 3’ or 5’ overhang Remap, restrict Silent restover Mapsort Mapsort+plasmidmap Restrict Cirdna (only partial solution: input file with Tick positions must be created “manually”  Mapping

HKUHKU Computer Centre Replacement of GCG programs  Protein analysis In GCGIn EMBOSS Pepplot, peptidestructure+plotstructure Garnier, pepinfo, octanol, pepwindow  Primer selection In GCGIn EMBOSS primeEprimer3 (=Primer3) Primepair, melttempNo good solution yet

HKUHKU Computer Centre Replacement of GCG programs  Keyword-based databank searching In GCGIn EMBOSS NamesWhichdb Indexsearch Stringsearch (mode A) Stringsearch (mode B) Textsearch No good solution yet but advantageously replaceable by indexsearch

HKUHKU Computer Centre Running EMBOSS program  EMBOSS programs are run by typing them at the Unix prompt, or by using an interface.  The EMBOSS command syntax follows normal Unix command conventions.  Programname -help  to get some help on the options.  Programname -opt  to make the program prompt you for common options.  tfm programname  to get the full help on a program.

HKUHKU Computer Centre Login bioinfo  Log in to bioinfo with ‘telnet bioinfo.hku.hk’  If you are using the temp account, please create a directory named after your username at hkusua:  bioinfo% mkdir username  E.g. bioinfo% mkdir chantaiman  Change to the directory you created  bioinfo% cd username  E.g. bioinfo% cd chantaiman

HKUHKU Computer Centre wossname  It is easy to forget the name of a program.  To find EMBOSS programs, use wossname  wossname finds programs by looking for keywords in the description or the name of the program.

HKUHKU Computer Centre wossname  Type wossname at the Unix % prompt bioinfo % wossname  Displays one-line description.  Prompts you for information: Finds programs by keywords in their one-line documentation Keyword to search for: restrict SEARCH FOR 'RESTRICT ’ recode Remove restriction sites but maintain the same translation remap Display a sequence with restriction cut sites, translation etc…..

HKUHKU Computer Centre Optional parameters  To get prompted for all the optional parameters, type the following: bioinfo % wossname -opt Finds programs by keywords in their one-line documentation Keyword to search for: protein Output program details to a file [stdout]: myfile Format the output for HTML [N]: String to form the first half of an HTML link: String to form the second half of an HTML link: Output only the group names [N]: Output an alphabetic list of programs [N]: Use the expanded group name [N]:

HKUHKU Computer Centre help bioinfo % wossname -help Mandatory qualifiers: [-search] string Enter a word or words here. Optional qualifiers (* if not always prompted): -outfile outfile this program will write the program names Advanced qualifiers: -[no]emboss bool EMBOSS program documentation will be searched.  Mandatory - required, are often parameters (in ‘[]’)  Optional - use -opt to be prompted for these.  Advanced - things that are not often used!

HKUHKU Computer Centre Writing to the screen  Note that the default output file for wossname was: stdout (Standard output)  Use this whenever prompted for an output file.  This is a ‘magic’ file name.  It displays the output on the screen, not a file.

HKUHKU Computer Centre Working with sequences  EMBOSS reads sequences from files or databases.  It automatically recognizes the input sequence format.  You can easily specify many output formats.

HKUHKU Computer Centre Getting sequences from the databases  Database single entry (ID)  database:entry  For example embl:hsfau  Wildcarded entries (Query)  database:hs*  For example sw:fos_*  All entries  database:*  Most databases will support all 3 methods - some may not.

HKUHKU Computer Centre showdb bioinfo% showdb Displays information on the currently available databases # Name Type ID Qry All Comment # ==== ==== == === === ======= domo P OK OK OK DOMO sequences enspep P OK OK OK ENSEMBL PEP sequences gp P OK OK OK GENPEPT sequences gpnew P OK OK OK New GENPEPT sequences kabatp P OK OK OK KABAT Protein sequences nrl P OK OK OK NRL_3d pdb P OK OK OK PDB sequences pir P OK OK OK PIR using NBRF access for 4 files rem P OK OK OK REMTREMBL sequences

HKUHKU Computer Centre seqret  Reads in a sequence, and writes it out. bioinfo % seqret Reads and writes (returns) a sequence Input sequence: embl:xlrhodop Output sequence [xlrhodop.fasta]: bioinfo % more xlrhodop.fasta >XLRHODOP L07770 Xenopus laevis rhodopsin ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaa gaaac acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaac ggaac.

HKUHKU Computer Centre seqret from the command line  Give seqret all of its data on the command line.  It doesn’t need to prompt for anything else. bioinfo % seqret embl:xlrhodop -outseq xlrhodop.fasta  The ‘-outseq’ can be abbreviated to ‘-out’.  Any abbreviation must be unique.  Even shorter, leave out the qualifier: bioinfo % seqret embl:xlrhodop xlrhodop.fasta

HKUHKU Computer Centre Changing output formats (reformatting)  seqret can reformat sequences by specifying the output format: bioinfo % seqret embl:xlrhodop xlrhodop.gcg -osformat gcg bioinfo % more xlrhodop.gcg !!NA_SEQUENCE 1.0 Xenopus laevis rhodopsin mRNA, complete cds. XLRHODOP Length: 1684 Type: N Check: 9453.. 1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca 51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga.

HKUHKU Computer Centre Multiple sequences, single files  You can use seqret to retrieve multiple sequences into a file: bioinfo% seqret "sw:opsd_a*" opsd_a.seqs  This retrieves all the sequences whose identifiers start with “opsd_a” into a file called opsd_a.seqs.

HKUHKU Computer Centre Multiple sequences, many files  If you wish to write one sequence per file, use: bioinfo % seqret "sw:opsd_a*" -ossingle  The output filenames will be based on the sequence entry names.  The program seqretsplit will split an existing multiple sequence file into many files.

HKUHKU Computer Centre Asterisk on the command line  You can't use an unquoted ‘*’ on the UNIX command line.  UNIX tries to match it to filenames.  Use it quoted, either with quotes or a backslash: "embl:*" embl:\*  For example: bioinfo % seqret "embl:hsf*" hsf.seq
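When EMBOSS commands are assembled in a script rather than typed by hand, the quoting can be generated instead of remembered. The sketch below uses Python's standard `shlex` module to quote a wildcard USA for the shell; the helper name `seqret_command` is made up for this example, and the snippet only builds the command string, it does not run seqret.

```python
import shlex


def seqret_command(usa: str, outfile: str) -> str:
    """Build a seqret command line with shell-unsafe characters quoted."""
    # shlex.join quotes every argument that contains characters the shell
    # would otherwise interpret, such as the '*' wildcard.
    return shlex.join(["seqret", usa, outfile])


print(seqret_command("embl:hsf*", "hsf.seq"))
# prints: seqret 'embl:hsf*' hsf.seq
```

Passing the same argument list directly to `subprocess.run` (without `shell=True`) would avoid the shell altogether, so no quoting would be needed.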

HKUHKU Computer Centre EMBOSS web interface  On the left, you can choose the program to run. You can also see all the program sorted alphabetically instead of sorted by group by clicking on the link.

HKUHKU Computer Centre Getting help in EMBOSS  Help on the program is available by clicking on the question mark.

HKUHKU Computer Centre Input to EMBOSS  If you know the entry_name or accession number, enter the sequence in the Uniform Sequence Addresses (USAs) format  E.g. embl:xlrhodop

HKUHKU Computer Centre Input to EMBOSS  If you have your own sequence file, upload the sequence by clicking the browse button.

HKUHKU Computer Centre Input to EMBOSS  You can also copy and paste your own sequence into the text area.

HKUHKU Computer Centre seqret web interface  E.g. seqret - retrieving single sequence  Input:  USA path embl:xlrhodop  Output file format: GCG 9.x/10.x  Output:  The sequence retrieved in GCG format

HKUHKU Computer Centre seqret

HKUHKU Computer Centre seqret  Seqret – retrieving multiple sequences  Input: sw:ops2_*. Output file format: Pearson FASTA  Output: multiple sequences with the identifier starting with sw:ops2_.  Save the file as ops2.fasta by right clicking on the link

HKUHKU Computer Centre coderet  Extract CDS, mRNA and translations from feature tables. If any sequences are in other entries of that database, they are automatically fetched and incorporated correctly into the final sequence.  Input: embl:X03487

HKUHKU Computer Centre coderet  Output

HKUHKU Computer Centre dottup  dottup – Comparison between 2 sequences using dot-plots.  Input:  1st sequence: embl:xl23808 (Xenopus laevis rhodopsin gene)  Second sequence: embl:xlrhodop (Xenopus laevis rhodopsin cDNA from complement of mRNA)  Output:  A dotplot showing the diagonal lines representing areas where the two sequences align well in PNG format.  The image can be saved into the computer.

HKUHKU Computer Centre dottup

HKUHKU Computer Centre dottup  The 5 diagonal lines represent areas where the two sequences align well.  Since this is aligning genomic and cDNA, the five diagonals represent the five exons of the gene.

HKUHKU Computer Centre Pairwise Sequence Alignment  An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ.  There is no unique, precise, or universally applicable notion of similarity.

HKUHKU Computer Centre Global Alignment  A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.  The alignment maximizes regions of similarity and minimizes gaps using the scoring matrices and gap parameters provided to the program.

HKUHKU Computer Centre needle  Function  Needleman-Wunsch global alignment  Description  This program uses the Needleman- Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length.  The computation is rigorous.  It can be time consuming to run if the sequences are long.

HKUHKU Computer Centre Input sequence for needle

HKUHKU Computer Centre needle  needle - Needleman-Wunsch global alignment  Input:1st sequence: embl:xlrhodop, 2nd sequence: embl:xl23808  Output: Global alignment showing the 5 aligned regions.

HKUHKU Computer Centre Local alignment  Local alignment searches for regions of local similarity and need not include the entire length of the sequences.  Local alignment methods are very useful for scanning databases or other circumstances when you wish to find matches between small regions of sequences, for example, between protein domains.

HKUHKU Computer Centre water  Function  Smith-Waterman local alignment.  Description  Water uses the Smith-Waterman algorithm (modified for speed enhancements) to calculate the local alignment.

HKUHKU Computer Centre water  water - Smith-Waterman local alignment.  Input:1st sequence: embl:xlrhodop, 2nd sequence: embl:xl23808  Output: Local alignment showing the 5 aligned region.

HKUHKU Computer Centre Multiple Sequence Analysis  Multiple sequence alignments are used  To find patterns to characterize protein families.  To detect or demonstrate homology between new sequence and existing families of sequences.  To help predict the secondary and tertiary structures of the new sequences.  As an essential prelude to molecular evolutionary analysis.

HKUHKU Computer Centre emma  Function  Multiple alignment program - interface to ClustalW program  Description  EMMA calculates the multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is an interface to the ClustalW distribution.

HKUHKU Computer Centre Upload file to emma  Input: output from seqret (ops2.fasta) retrieving all swissprot sequences whose identifiers begin with sw:ops2_*  Click on browse button to upload the file ops2.fasta

HKUHKU Computer Centre Input sequence to emma  ops2.fasta

HKUHKU Computer Centre emma  emma – interface to ClustalW program  Output: multiple alignment saved as file ops2.aln.

HKUHKU Computer Centre prettyplot  Prettyplot – displays aligned sequences, with colouring and boxing  Input: output from program emma ops2.aln  Output: graphic display of aligned sequences. Identical residues in red, similar residues in green.

HKUHKU Computer Centre prophecy  Function  Creates matrices/profiles from multiple alignments  Description  This creates a profile matrix file from a nucleic acid or a protein sequence alignment.  The profile matrix file can then be used by program profit or prophet.

HKUHKU Computer Centre prophecy  Input:  Sequence: output from program emma ops2.aln  Select type: Gribskov

HKUHKU Computer Centre prophecy  Output: A profile to be saved as ops2.prophecy. This profile allows a new sequence to be aligned optimally to a family of similar sequences in the program prophet.

HKUHKU Computer Centre prophet  Prophet – Gapped alignment for profiles  Input:  Input sequence: The file xlrhodop.pep, output from transeq of the sequence embl:xlrhodop from 110-1171 region.  Profile or matrix file: ops2.prophecy  Output file: ops2.prophet  Output: The gapped alignment to profile. The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. Aligning members of a family can reveal conserved regions that may be important for structure and/or function.

HKUHKU Computer Centre prophet  Output

HKUHKU Computer Centre plotorf  plotorf – plots potential open reading frames  Input sequence: embl:xlrhodop  Output: graphical output showing the potential open reading frames in all six frames.  The longest protein is in the second frame, which is the correct open reading frame.

HKUHKU Computer Centre getorf  getorf - Finds and extracts open reading frames (ORFs)  Input:  Sequence: embl:xlrhodop  Type of sequence to output: Nucleic sequence between START and STOP codons  Output: Textual information of the region and the sequence of that region.
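What "between START and STOP codons" means can be illustrated with a short scan over the three forward reading frames. This is a sketch under stated assumptions, not getorf's implementation: it only looks at the forward strand, treats ATG as the only start codon, uses the standard stop codons, and reports 1-based coordinates that include the stop codon. The function name `find_orfs` and the `min_codons` parameter are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 1):
    """Return (start, end, frame) tuples, 1-based inclusive, for ATG..stop ORFs on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first ATG opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start + 1, pos + 3, frame + 1))
                start = None                     # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))
# prints: [(3, 11, 3)]
```

The start and end positions found this way are the kind of region information that can then be passed to a translation step.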

HKUHKU Computer Centre transeq  transeq - Translate nucleic acid sequences  Input:  sequence: embl:xlrhodop  regions to translate: 110-1171 (from information of getorf)  Output: Translated sequence of the given region.  Save the file as xlrhodop.pep
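Translating a region such as 110-1171 amounts to cutting out the 1-based, inclusive range and reading it three bases at a time. The sketch below assumes the standard genetic code and translation in the first frame of the selected region; the function name `translate_region` is made up, and codons containing ambiguity symbols are simply reported as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_region(seq: str, begin: int, end: int) -> str:
    """Translate the 1-based inclusive region begin..end, reading from its first base."""
    region = seq[begin - 1:end].upper().replace("U", "T")
    return "".join(CODON_TABLE.get(region[i:i + 3], "X")
                   for i in range(0, len(region) - 2, 3))


print(translate_region("CCATGAAATAGCC", 3, 11))
# prints: MK*
```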

HKUHKU Computer Centre Exercise 1 Q1  Align HER2 _ERB2_HUMAN and UNKNOWN_AAL39899.1 with needle and water. What is the main difference between the two types of alignment in these two cases (the files HER2-fasta.prt and ALL39899_1.prt are at http://bioinfo.hku.hk/tutorial/)?  Repeat the Smith-Waterman alignment of HER2-fasta.prt and ALL39899_1.prt with different parameters. What happens if the gap penalties are changed to 30 and 2 instead of the defaults 10 and 0.5?  BLOSUM62 is the default matrix. What happens to the local alignment (using program water) when using other matrices, e.g. EPAM10?

HKUHKU Computer Centre Exercise 1 Q2  Type gb:A7120FTSZ in the text box and run seqret. Run entret with the same sequence USA and examine the entry. What is the difference between the two entries?

HKUHKU Computer Centre Exercise 1 Q3  With the program infoseq, display information on all sequences whose name starts with ‘10’ in the SwissProt database. (hint: the sequence is sw:10*, choose the information you want to display by changing to ‘yes’)

HKUHKU Computer Centre Exercise 1 answer (A1)  Needle output

HKUHKU Computer Centre Exercise 1 answer (A1)  Water output

HKUHKU Computer Centre Exercise 1 answer (A1)  Water output with gap opening penality of 30 and gap extension penality of 2.

HKUHKU Computer Centre Exercise 1 answer (A1)  Water output with matrix of EPAM10

HKUHKU Computer Centre Exercise 1 answer (A1)  The global alignment (needle) require the whole sequences to be aligned. The % identity and % similarity is much less than local alignment (water).  If the gap penalties are changed to 30 and 2, no gap appears in the alignment  If EPAM10 is used, the score and alignment length drops. Since PAM is derived from global alignment, it gives worser result for the local alignment program water. EPAM10 is more suitable for very similar protein with no more than 10% evolutionary divergent.

HKUHKU Computer Centre Exercise 1 answer (A1) Amino Acid substitution matrices  PAM (percent accepted mutation) – lists the likelihood of change from one amino acid to another in homologous sequences during evolution.  One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed.  some amino acid substitutions occurred more readily than others, probably because they did not have a great effect on the structure and function of a protein.

HKUHKU Computer Centre Exercise 1 answer (A1) Amino Acid substitution matrices (con’t)  BLOSUM – matrix values are based on a large set of ~2000 conserved amino acid patterns called blocks. Blocks come from a database of protein sequences representing more than 500 families of related proteins.  PAM is derived from global alignments of proteins, while BLOSUM comes from alignments of shorter sequences.  The matrix built from blocks with no more than x% of similarity is called BLOSUM X

HKUHKU Computer Centre Exercise 1 answer (A1)  PAM100 ==> Blosum90  PAM120 ==> Blosum80  PAM160 ==> Blosum62  PAM200 ==> Blosum52  PAM250 ==> Blosum45  The Blosum matrices are best for detecting local alignments.  The Blosum62 matrix is the best for detecting the majority of weak protein similarities.  The Blosum45 matrix is the best for detecting long and weak alignments.

HKUHKU Computer Centre Exercise 1 answer (A1)  If the BLOSUM62 matrix is compared to PAM160 then it is found that the BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.

HKUHKU Computer Centre Exercise 1 answer (A2)  seqret output

HKUHKU Computer Centre Exercise 1 answer (A2)  entreq output

HKUHKU Computer Centre Exercise 1 answer (A2)  You will see the sequence for the Anabaena 7120 ftsZ and gsh-III genes.  EMBOSS is also capable of extracting more information than just the sequence from a database entry. The program entret will return the entire entry as a text file.

HKUHKU Computer Centre Exercise 1 answer (A3)  Output

HKUHKU Computer Centre garnier  Garnier - Predicts protein secondary structure using the Garnier-Osguthorpe-Robson (GOR) method  Secondary structure prediction is notoriously difficult to do accurately. The GOR I alogorithm is one of the first semi- successful methods.  The Garnier method is not regarded as the most accurate prediction, but is simple to calculate on most workstations.  Input: translated sequence (xlrhodop.pep) embl:xlrhodop from 110-1171 region with program transeq.  Output: Predicted protein secondary structure

HKUHKU Computer Centre garnier  Output

HKUHKU Computer Centre pepinfo  pepinfo - Plots simple amino acid properties in parallel.  Input sequence: translated sequence (xlrhodop.pep) embl:xlrhodop from 110-1171 region with program transeq.  Output: A textual and graphical representation of amino acid properties (size, polarity, aromaticity, charge, etc). Hydrophobicity profiles useful for locating turns, potential antigenic peptides and transmembrane helices.

HKUHKU Computer Centre pepinfo  Showing the residues distribution

HKUHKU Computer Centre pepinfo  Hydrophobicity profiles are useful for locating turns, potential antigentic peptides and transmembrane helices.  positive score -> a hydrophobic region.  negative score -> hydrophilic region.  show seven highly hydrophobic regions.  use the program tmap to investigate further.

HKUHKU Computer Centre patmatmotifs  Patmatmotifs – search a PROSITE motif database with a protein sequence. It can identify to which known family of protein (if any) the new sequence belongs.  PROSITE currently contains patterns and profiles specific for more than a thousand protein families or domains.  PROSITE patterns (Biologically significant amino acid patterns can be summarized in the form of regular expressions)  PROSITE profile (techniques based on weight matrices allows the detection extreme sequence divergence protein families and functional/structural domains)

HKUHKU Computer Centre patmatmotifs  Input sequence: The file xlrhodop.pep, which is output from transeq of the sequence embl:xlrhodop from 110- 1171 region.  Output: A textual representation showing where the sequence match with a motif.

HKUHKU Computer Centre pscan  Pscan – Scans proteins using PRINTS  PRINTS is a database of diagnostic protein signatures, or fingerprints.  Fingerprints are groups of conserved motifs or elements that together form a diagnostic signature for particular protein families.  An uncharacterised sequence matching all motifs or elements can then be readily diagnosed as a true match to a particular family fingerprint.  Input sequence: The file xlrhodop.pep, which is output from transeq of the sequence embl:xlrhodop from 110-1171 region.

HKUHKU Computer Centre pscan Output: A textual representation showing where the short sequences match with the PRINTS database that defines functional protein families.

HKUHKU Computer Centre fuzznuc  fuzznuc uses PROSITE style patterns to search nucleotide sequences.  Letter code for pattern  [ACG] stands for A or C or G.  {AG} stands for any nucleotides except A and G.  N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N.  [CG](5)TG{A}N(1,5)C  Input:  sequence: embl:hhtetra  Pattern: AAGCTT

HKUHKU Computer Centre fuzznuc  Output

HKUHKU Computer Centre Exercise 2 Q1  Use tmap to display membrane-spanning regions with the input sequence xlrhodop.pep (translated with program transeq from embl:xlrhodop at the 110-1171 region). Does the result agree with pepinfo?

HKUHKU Computer Centre Exercise 2 Q2  Use fuzzpro to search sequence: CREAp_m.txt pattern: CXXXXC (the file CREAp_m.txt is from http://bioinfo.hku.hk/tutorial/)

HKUHKU Computer Centre Exercise 2 Q3  Use patmatmotifs to find patterns in the SwissProt sequences fos_human or fos_rat, and use these patterns with fuzzpro to search the fos genes of other organisms. (Hint: Use sw:fos_human for the input; other organisms: bovin, chick, mouse, sheep.)

HKUHKU Computer Centre Exercise 2 Q4  Sometimes it is better to run the program fuzznuc in command line because more parameters can be given  In the BIOINFO terminal, type the following (you must put the command in one line in the UNIX prompt) : bioinfo% fuzznuc -sequence=embl:hhtetra -pattern=AAGCTT -mismatch=1 -complement -outf=outf.out How is the result different from previous run in web interface?

HKUHKU Computer Centre Exercise 2 answer (A1)  Bars are displayed in the plot above the regions predicted as being most likely to form transmembrane regions.  There may be seven transmembrane helices in this protein.  The result agrees with pepinfo.

HKUHKU Computer Centre Exercise 2 answer (A2)  The symbol ‘x’ is used for a position where any amino acid is accepted.  There, the pattern CXXXXC matches the result patterns of CQFPGC and CMFPGC.

HKUHKU Computer Centre Exercise 2 answer (A2)  Patmatmotifs output using sw:FOS_HUMAN

HKUHKU Computer Centre Exercise 2 answer (A3)  When run with patmatmotifs, the sequences sw:FOS_HUMAN and sw:FOS_RAT returns the same motifs of AMIDATION, LEUCINE_ZIPPER, and BZIP_BASIC.  When run with fuzzpro with one of the pattern, the start and end position agrees with patmatmotifs.

HKUHKU Computer Centre Exercise 2 answer (A3)  Fuzzpro output with pattern “GRAQSIGRRGKVEQ” and sequence sw:fos_human

HKUHKU Computer Centre Exercise 2 answer (A4)  You can add no. of mismatches in input parameters for command line. The result with 1 mismatch can now be shown

HKUHKU Computer Centre cpgplot  CPGPLOT – Plot the CpG rich areas  CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases.  By default, this program defines a CpG island as a region where  over an average of 10 windows, the calculated % composition is over 50%  and the calculated Obs/Exp (i.e. Observed/Expected) ratio is over 0.6  and the conditions hold for a minimum of 200 bases.  These conditions can be modified by setting the values of the appropriate parameters.

HKUHKU Computer Centre cpgplot  The Observed number of CpG patterns in a window is simply the count of the number of times a 'C' is found followed immediately by a 'G'.  The Expected frequency of CpG's in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length.  Expected = (number of C's * number of G's) / window length

HKUHKU Computer Centre cpgplot  Input: embl:rnu68037  Output

HKUHKU Computer Centre cpgplot  Output

HKUHKU Computer Centre cusp  CUSP reads one or more coding sequences (CDS sequence only) and calculates a codon frequency table.  It is important to use a codon frequency table that is appropriate for the species that your protein comes from.  Input:  Seq: embl:paamir  Codon usage table: Default (Ehum.cut)

HKUHKU Computer Centre cusp  Output:  Fract – the faction of all amino acids coded for this codon triplet.  /1000 – the number of codons per 1000 bases

HKUHKU Computer Centre cusp  Running the program in command line allows you to specify the sequence begin and sequence end bioinfo% cusp -sbeg 135 -send 1292 Create a codon usage table Input sequence(s): embl:paamir Output file [paamir.cusp]:

HKUHKU Computer Centre cusp  bioinfo% more paamir.cusp

HKUHKU Computer Centre hmoment  hmoment plots or writes out the hydrophobic moment. Hydrophic moment is the hydrophobicity of a peptide measured for a specified angle of rotation per residue.  Assumption: The angle of rotation (bonds of the backbone and amino acid side- chains) per residue in alpha helices is 100 degrees. The angle of rotation per residue in beta sheets is 160 degrees.  Input:  Sequence:sw:hbb_human  Produce graph: yes  Plot two graph: yes

HKUHKU Computer Centre hmoment  Output:  one for the alpha helix moment and one for the beta sheet moment.

HKUHKU Computer Centre End of lecture Thank you!
