Presentation on theme: "Sequence Analysis Tools"— Presentation transcript:
1 Sequence Analysis Tools Erik ArnerOmics Science Center, RIKENYokohama, Japan
2 Aim of lecture Why align sequences? How are sequences aligned to each other?VariantsLimitationsBasic understanding of common tools forSimilarity searchMultiple alignment
3 Outline Sequence analysis Basics of sequence alignment Homology/similarityBasics of sequence alignmentGlobal vs. localComputing/scoring alignmentsSubstitution matricesSimilarity searchBLASTMultiple alignmentClustalWShow of hands: how many have used BLAST? Clustal? How many know how they work?
4 Sequence analysis Sequence analysis Inferring biological properties throughSimilarity with other sequencesProperties intrinsic to the sequence itselfCombinationSequence analysis often (always?) includes sequence alignmentSequence alignment methods fundamental part of bioinformatics
5 Sequence analysis Why aligning sequences? Similarity in sequence → similarity in functionSimilarity in sequence → common ancestryHomology = similarity due to shared ancestrySimilar → importantSelective pressure
9 Sequence analysis Similarity ≠ homology Similarity = factual (% identity)Homology = hypothesis supported by evidence… but in many cases, similarity is the only tool we have accessibleNeed a measure of the significance of the similarity
10 Basics of sequence alignment Global vs. local alignmentGlobalAssumes sequences are similar across entire lengthLocalAllows locally similar sub-regions to be pinpointedIntrons/exonsProtein domains
11 Basics of sequence alignment Which one is correct?
12 Basics of sequence alignment Which one is correct?Both?None?In sequence alignment, you get what you ask for
13 Basics of sequence alignment Other types of alignmentGlocalOverlaps in shotgun sequencingStructuralthioredoxins from human and Drosophila, Wikipedia
14 Basics of sequence alignment Computing alignmentsDynamic programmingNeedleman – Wunsch (global alignment)Smith – Waterman (local alignment)For a given pair of sequences and a scoring scheme, find the optimal alignmentSeveral may exist
17 Basics of sequence alignment In sequence alignment, you get EXACTLY what you ask forHeavily penalized gaps → less gaps in alignmentHeavily penalized mismatches → more gaps in alignment
18 Basics of sequence alignment Substitution matricesDNA scoring mostly straightforwardMore clever scoring for protein sequencesBiochemical propertiesLower penalties for substitutions into amino acids with similar propertiesLow penalty for isoleucine(I) → valine(V) subsitution – both hydrophobicObserved substitution frequenciesMultiple alignments of proteins known to share ancestry and/or functionInsert pic of simple DNA subst matrix
19 Basics of sequence alignment Common substitution matricesPAMBLOSUMBLOSUM62 most widely usedDefault in BLASTRecent paper discovered bug in BLOSUM62……but buggy matrix performs “better”!Insert pic of BLOSUM62Graphic from EBIPaper in Nature Biotechnology
20 Basics of sequence alignment Gap penaltiesGaps generally considered to cause greater disruption of function than mismatchesGap open penaltyGap extension penaltyWhat matrix to use?
21 Similarity search Premise: The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypothesis concerning relatives and function.Abundance of biological sequence data forbids extensive searchesAll nucleotides/amino acids in query sequence cannot be compared to all aa:s/nt:s in databaseFast searches are achieved using methods that trade off sensitivity for speed and specificityNCBI
22 Similarity search General approach: A set of algorithms (e.g. BLAST) are used to compare a query sequence to all the sequences in a specified databaseComparisons are made in a pairwise fashionEach comparison is given a score reflecting the degree of similarity between the query and the sequence being comparedThe higher the score, the greater the degree of similarityAlignments can be global or local (BLAST: local)Discriminating between real and artifactual matches is done using an estimate of probability that the match might occur by chanceSimilarity, by itself, cannot be considered a sufficient indicator of function
23 Similarity search – BLAST A set of sequence comparison algorithms introduced in 1990Breaks the query and database sequences into fragments ("words"), initially seeks matches between fragmentsInitial search is done for a word of length "W" that scores at least "T" when compared to the queryusing a given substitution matrixWord hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S“"W" parameter dictates the speed and sensitivity of the search
25 Similarity search – BLAST ScoringUnitary matrix used for DNAOnly identical nucleotides give positive scoreSubstitution matrices are used for amino acid alignmentsBLOSUM62 is defaultNon-identical amino acids may give positive scoreGapsGap scores are negativeThe presence of a gap is ascribed more significance than the length of the gapA single mutational event may cause the insertion or deletion of more than one residueInitial gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gapNo widely accepted theory for selecting gap costsIt is rarely necessary to change gap values from the default
26 Similarity search – BLAST Significance of hitsP valueGiven the database size, the probability of an alignment occurring with the same score or betterHighly significant P values close to 0Expectation valueThe number of different alignments with equivalent or better scores that are expected to occur in a database search by chanceThe lower the E value, the more significant the scoreHuman judgment
29 Similarity search – BLAST BLAST at NCBIFirst consider your research question:–Are you looking for an orthologin a particular species?•BLAST against the genome of that species.–Are you looking for additional members of a protein family across all species?•BLAST against nr, if you can’t find hits check wgs, htgs, and the trace archives.–Are you looking to annotate genes in your species of interest?•BLAST against known genes (RefSeq) and/or ESTsfrom a closely related species.
31 Multiple alignment Why align multiple sequences? Determine evolutional relationship between sequences → speciesPhylogeneticsIdentify domainsPWM:sPinpoint functional elementsHighly conserved amino acids among more divergent ones → catalytic activity?
32 Multiple alignment Multiple alignment algorithms Finding optimal alignment is very time consumingExponential complexityApproximations and heuristics used for speeding upHeuristics: "rules of thumb", educated guesses, intuitive judgments or simply common sense (from Wikipedia)Progressive alignmentGIGO
33 Multiple alignment – ClustalW Basics of progressive algorithmAll sequences are compared to each other pairwiseA guide tree is constructed, where sequences are grouped according to pairwise similarityThe multiple alignment is iteratively computed, using the guide tree
35 Multiple alignment – ClustalW HeuristicsIndividual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent onesAmino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be alignedResidue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structurePositions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions
36 Summary Know your parameters Defaults are good choices in most cases However, be aware of what they meanYou get what you ask for
37 Sequence analysis tools EMBOSSSuite of tools for various analysis tasksORF finding, alignment, secondary structure prediction...
38 Sequence analysis tools ExPASyComprehensive collection of protein analysis webtools
39 Sequence analysis tools EBI SRSOne-stop shop for sequence searching to analysis