Introduction to the GCG Wisconsin Package The Center for Bioinformatics UNC at Chapel Hill Jianping (JP) Jin Ph.D. Bioinformatics Scientist Phone: (919)843-6105.


What is GCG  An integrated package of over 130 programs (the GCG Wisconsin Package).  For extensive analyses of nucleic acid and protein sequences.  Associated with most major public nucleic acid and protein databases.  Works on UNIX OS.

Why use GCG  Removes the need for end users to keep collecting new software.  Removes the need to learn a new interface each time new software is released.  Provides a flow of analyses within a single interface.  The UNIX environment lets users automate complex, repetitive tasks.  Lets users run jobs on multiple processors to speed them up.  Supports almost all public databases, which can be updated daily, and provides fast local searches.

Flexibility or Automation  1. MEME: find upstream regulatory motifs;  2. MotifSearch: find genes sharing these potential regulatory motifs;  3. PileUp: build a multiple sequence alignment;  4. Distances: extract pairwise distances from the alignment;  5. GrowTree: build a phylogenetic tree.
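Step 4 of the pipeline above (pairwise distances from a multiple alignment) can be illustrated with a minimal Python sketch. It computes the uncorrected fraction of mismatched columns and skips any column where either sequence has a gap. The function names, the gap characters, and the toy alignment are assumptions made for illustration; this is not the GCG Distances program, which also offers corrected distance measures.

```python
from itertools import combinations

GAP_CHARS = "-."  # assumption: both characters are treated as gaps


def p_distance(a, b):
    """Uncorrected distance: mismatched columns / compared columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return mismatches / compared


def distance_table(alignment):
    """Map each pair of sequence names to its uncorrected distance."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Hypothetical toy alignment, not taken from the slides.
toy_alignment = {"s1": "ACGT-A", "s2": "ACGTTA", "s3": "AGGT-C"}
toy_distances = distance_table(toy_alignment)
```

A tree-building step would take a table like `toy_distances` as its input.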

Interfaces  Command line: runs programs from the UNIX system prompt.  SeqLab: a graphical user interface that requires an X Windows display.  SeqWeb: web-based access to a core set of sequence analysis programs.

Limitations with GCG  The GUI does not give users full access to the power of the command line or to the complete set of programs.  Many programs limit the maximum size of the sequences they can handle (350 kb). This limitation will be removed in version 11.

Databases GCG Supports  Nucleic acid databases  GenBank  EMBL (abridged)  Protein databases  NRL_3D  UniProt (SWISS-PROT, PIR, TrEMBL)  PROSITE, Pfam  Restriction Enzymes (REBASE)

Database Update Services  DataServe: automatically updates nucleic acid data daily via FTP.  DataExtended: the most complete set of nucleic acid and protein data. Releases are timed to coincide with the major GenBank releases, every 2-3 months.  DataBasic: similar to DataExtended, but excludes EST and GSS data from GenBank and EMBL.

File Importing and Exporting  Reformat  FromEMBL  FromGenBank  FromPIR, ToPIR  FromStaden, ToStaden  FromIG, ToIG  FromFastA, ToFastA

File Formats with GCG  Single sequence files (in GCG format)  List (a list of files)  MSF (multiple sequence format)  RSF (rich sequence format)

Typical program

Result from MAP analysis

X-Windows server must be running

SeqLab Main Window (List Mode)

SeqLab Editor Mode

Display by Features

SeqLab Editor Mode (cont.)

SeqLab Output Manager

GCG Programs  1. Comparison  2. Database Searching and Retrieval  3. DNA/RNA Secondary Structure  4. Editing and Publication  5. Evolution  6. Fragment Assembly  7. Importing and exporting  8. Mapping  9. Primer Selection  10. Protein Analysis  11. Translation

Create your own sequence

PlasmidMap

FindPatterns

HmmerPfam Analysis

Gene Finding (FRAME)

Restriction Enzyme Map

Consensus Sequence

Phylogenetic Tree (Cladogram)

Peptide Structure

Peptide Structure (2)

Isoelectric Analysis

Transmembrane Domains

Nucleic Acid Secondary Structure

Pairwise Comparison (Gap)  Needleman & Wunsch algorithm.  A global alignment that covers the whole length of both sequences; gaps are inserted so that the two aligned sequences have the same length.  Works well when the two sequences are closely related.
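The global alignment idea can be sketched in a few lines of Python. This is a minimal Needleman-Wunsch implementation under simplifying assumptions: a flat match/mismatch score and a linear gap penalty, with illustrative default values. The Gap program itself uses scoring matrices and separate gap creation and extension penalties, so this is not a reimplementation of it.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
```

Both returned strings always have the same length, which is the "same length with inserted gaps" property described on the slide.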

Pairwise Comparison (BestFit)  Smith and Waterman algorithm.  A local homology alignment that finds the best segment of similarity between two sequences.  The most sensitive sequence comparison method available.
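For comparison with global alignment, here is a minimal Smith-Waterman sketch in Python. It assumes a flat match/mismatch score and a linear gap penalty with illustrative default values; BestFit itself uses scoring matrices and gap creation/extension penalties. The key differences from the global method are that cell scores never drop below zero and that the traceback starts at the highest-scoring cell.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, segment_of_a, segment_of_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a new local alignment
                score[i - 1][j - 1] + sub,  # extend along the diagonal
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    seg_a, seg_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            seg_a.append(a[i - 1])
            seg_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            seg_a.append(a[i - 1])
            seg_b.append("-")
            i -= 1
        else:
            seg_a.append("-")
            seg_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


# Hypothetical sequences that share only a central segment.
sw_score, sw_a, sw_b = smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC")
```

Only the shared segment is reported; the unrelated flanking regions do not lower the score as they would in a global alignment.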

Comparison of two sequences

GapShow

Multiple Comparison (PileUp)  Uses the progressive alignment method of Feng and Doolittle, similar to the method of Higgins & Sharp.  A series of progressive pairwise alignments (up to 500 sequences) generates the final alignment.  An extension of Gap, so it is not ideal for finding the best local region of similarity, such as a shared motif.

Multiple Comparison by Pileup

Dendrogram by Pileup

Database Search  Nearly always employs local alignment algorithms.  Often uses heuristic methods, such as FASTA and BLAST, as a fast screen.  Sequences that pass the screen are given a correct local similarity score, but there is no guarantee that every sequence with a high Smith-Waterman score passes the screen.
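The trade-off in a heuristic screen can be shown with a toy Python sketch. It assumes the simplest possible screen: a database sequence is kept only if it shares at least one exact word of length `k` with the query. Real FASTA and BLAST screens are considerably more elaborate (hashing, neighbourhood words, hit extension), and the function names and toy sequences here are illustrative only.

```python
def words(seq, k):
    """All overlapping words (k-mers) of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen(query, database, k):
    """Names of database sequences sharing at least one exact word with query."""
    query_words = words(query, k)
    return [name for name, seq in database.items() if query_words & words(seq, k)]


# Hypothetical toy data, not taken from the slides.
toy_query = "ACGTACGTAC"
toy_db = {
    "exact": "TTACGTACGTACTT",   # contains the query verbatim
    "diverged": "ACGAACGAAC",    # similar, but mutated at every fourth base
    "unrelated": "GGGGGGGGGG",
}

hits_k4 = screen(toy_query, toy_db, k=4)  # longer words: faster, less sensitive
hits_k3 = screen(toy_query, toy_db, k=3)  # shorter words: slower, more sensitive
```

With `k=4` the diverged sequence shares no exact word with the query and is dropped before any alignment is scored; with `k=3` it passes. This is why a smaller word size makes a search more sensitive and slower.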

BLAST  Accepts a number of sequences as input and lets you specify any number of databases: $ Blast -INfile2=PIR,SWPLUS -INfile=hsp70.msf{*}.  Supports five BLAST programs, but gapped alignment is not available for TBLASTX.  For non-coding nucleotide homology searches, consider either reducing the word size from 11 to 6 or 7, or using FASTA.  The choice of scoring matrices is limited: BLOSUM62, BLOSUM45, BLOSUM80 and PAM70 are available for the -MATRix parameter.

Database Search (SSearch)  A rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type.  The most sensitive method available for similarity search.  Very slow.

HmmerSearch  Uses a profile HMM as a query to search a sequence database.  Profile HMM: a position-specific scoring table, a statistical model of the consensus of a multiple sequence alignment.  The output can be used with any GCG program that accepts a list file.

Profile Hidden Markov Model

HmmerSearch

HmmerSearch (cont.)

HmmerSearch (cont.: histogram of scores)

HmmerSearch (cont.: resulting alignment)

NetBLAST  Sends your query sequences over the internet to a server at NCBI in Bethesda.  NetBLAST has some limitations, e.g. TBLASTX searches against the nr database are prohibited; only Alu, EST, GSS and STS can be searched.  Does not support as many options as BLAST.

NetBLAST

PSIBLAST  Similar to BLAST, except that it uses position-specific scoring matrices during the search.  Uses protein sequence(s) to iteratively search protein database(s).
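A position-specific scoring matrix can be sketched in Python as one score table per alignment column. This minimal version assumes an ungapped alignment block, a uniform background frequency, and a simple additive pseudocount; the DNA alphabet and the toy block are chosen only to keep the example short. PSI-BLAST builds its matrices from protein alignments with more elaborate sequence weighting and pseudocounts, so treat this as an illustration of the concept rather than of its implementation.

```python
import math


def build_pssm(block, alphabet="ACGT", pseudocount=1.0):
    """Log-odds score table per column of an ungapped alignment block."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in the block must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for col in range(width):
        counts = {ch: pseudocount for ch in alphabet}
        for seq in block:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {ch: math.log2((counts[ch] / total) / background) for ch in alphabet}
        )
    return pssm


def score_window(pssm, window):
    """Sum of the column scores for a window as wide as the matrix."""
    return sum(column[ch] for column, ch in zip(pssm, window))


def best_hit(pssm, seq):
    """(score, start) of the best-scoring window in seq."""
    width = len(pssm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (score_window(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


# Hypothetical toy block, not taken from the slides.
toy_block = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
toy_pssm = build_pssm(toy_block)
toy_score, toy_start = best_hit(toy_pssm, "GGCGTATAATGCC")
```

Unlike a single substitution matrix, the same residue can score differently at each position, which is what lets conserved columns dominate the search.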

MEME and MotifSearch  MEME (Multiple EM for Motif Elicitation): a tool for discovering motifs in a group of DNA or protein sequences.  Motif: a sequence pattern that occurs repeatedly in a group of related sequences.  MotifSearch: uses a set of MEME profiles to search a database for new sequences similar to the original family.

MEME PROFILE

MEME (cont.)

GrowTree (Cladogram)

Access to GCG on Campus  1. Onyen and password plus sign up to BioSci service at  2. Computer connected to the Campus network;  3. Postscript printer connected to the campus network;  4. SSH Secure Client;  5. X-Windows Server (optional).

Sign up BioScience

Log onto GCG

Log onto GCG (cont.)

GCG Welcome Page

How to get SeqLab to run  Start the X-Windows server;  Log on to the GCG server, nun.isis.unc.edu, through SSH Secure Shell Client;  At the prompt ($) enter the command “export DISPLAY=yourMachineIP:0.0”;  Enter the command “xterm &” to open an xterm window;  In the GCG main window enter the command “seqlab &” to start the SeqLab GUI.

How to get SeqLab to run (cont.)