Sequence alignment Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.


1 Sequence alignment Gabor T. Marth Department of Biology, Boston College marth@bc.edu BI420 – Introduction to Bioinformatics

2 Biologically significant alignment http://artedi.ebc.uu.se/programs/pairwise.html hba_human hbb_human

3 Biologically plausible alignment

4 Spurious alignment (BRCA1 variant) Examples from: Biological sequence analysis. Durbin, Eddy, Krogh, Mitchison

5 Alignment types Examples from: BLAST. Korf, Yandell, Bedell
How do we align the words CRANE and FRAME?
    CRANE
     || |
    FRAME
3 matches, 2 mismatches
How do we align words that differ in length?
    COELACANTH        COELACANTH
      || |||            || |||
    P-ELICAN--        -PELICAN--
5 matches, 2 mismatches, 3 gaps
If we assign +1 for each match and -1 for each mismatch or gap, either alignment scores 5 x 1 + 2 x (-1) + 3 x (-1) = 0. This is the alignment score.
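The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `alignment_score` and its parameter names are hypothetical.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two already-aligned strings of equal length; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # a letter aligned to a gap
        elif a == b:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


print(alignment_score("CRANE", "FRAME"))            # 3 matches, 2 mismatches -> 1
print(alignment_score("COELACANTH", "P-ELICAN--"))  # 5 matches, 2 mismatches, 3 gaps -> 0
```

Other penalty schemes can be tried by passing different values for `match`, `mismatch`, and `gap`.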

6 Finding the “best” alignment
S = 0:
    COELACANTH
      || |||
    P-ELICAN--
S = -2:
    COELACANTH
       | |||
    PE-LICAN--
S = -6:
    COELACANTH
      ||
    P-EL-ICAN-
S = -10:
    COELACANTH
    PELICAN---

7 Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Aligning words: SHAKE and SPEARE
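A minimal sketch of the Needleman-Wunsch dynamic programming recurrence follows. It assumes the simple +1 match, -1 mismatch, -1 gap scheme from slide 5 rather than the scoring used in the Higgs and Attwood example, and the function name `needleman_wunsch` is hypothetical. When several alignments tie for the best score, this sketch returns only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("SHAKE", "SPEARE")
print(score)
print(top)
print(bottom)
```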

8 Local alignment – Smith-Waterman Example from: Higgs and Attwood
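Local alignment changes the global recurrence in two ways: a cell score is never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes the same +1/-1/-1 scheme as slide 5 (not the scoring used in the Higgs and Attwood example); the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a fresh local alignment
                          H[i - 1][j - 1] + s,  # extend along the diagonal
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("COELACANTH", "PELICAN"))  # (4, 'ELACAN', 'ELICAN')
```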

9 Visualizing pair-wise alignments

10 Sequence similarity and scoring
Match-mismatch-gap penalties, e.g. Match = 1, Mismatch = -5, Gap = -10
Scoring matrices

11 Multiple alignments clustalW

12 Anchored multiple alignment

13 Similarity searching vs. alignment Alignment Similarity search query database

14 The BLAST algorithms
Program | Database | Query | Typical Uses
BLASTN | Nucleotide | Nucleotide | Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts.
BLASTP | Protein | Protein | Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis.
BLASTX | Protein | Nucleotide | Finding protein-coding genes in genomic DNA.
TBLASTN | Nucleotide | Protein | Identifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA.
TBLASTX | Nucleotide | Nucleotide | Cross-species gene prediction. Searching for genes missed by traditional methods.

15 BLAST report

16 http://www.ncbi.nih.gov/BLAST/ gi|7428631

17 The BLAST algorithm
Sequence alignment takes place in a 2-dimensional space where diagonal lines represent regions of similarity. Gaps in an alignment appear as broken diagonals. The search space is sometimes considered as 2 sequences and sometimes as query x database.
Global alignment vs. local alignment
– BLAST is local
Maximum scoring pair (MSP) vs. high-scoring pair (HSP)
– BLAST finds HSPs (usually the MSP too)
Gapped vs. ungapped
– BLAST can do both

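18 The BLAST algorithm
BLOSUM62 neighborhood of RGD, T=12 (word, score):
RGD 17, KGD 14, QGD 13, RGE 13, EGD 12, HGD 12, NGD 12, RGN 12, AGD 11, MGD 11, RAD 11, RGQ 11, RGS 11, RND 11, RSD 11, SGD 11, TGD 11
Only words scoring at least T count as neighborhood words, so with T=12 the words scoring 11 are excluded.
Speed gained by minimizing search space
Alignments require word hits
Neighborhood words
W and T modulate speed and sensitivity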
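The neighborhood listed on this slide can be reproduced by brute force: score every possible 3-letter word against RGD and keep those at or above T. The sketch below is illustrative only. The three BLOSUM62 rows are transcribed by hand (an assumption worth checking against a reference copy of the matrix), and the names `neighborhood`, `AMINO_ACIDS`, and `BLOSUM62_ROWS` are hypothetical.

```python
from itertools import product

# Column order for the rows below (the usual BLOSUM62 ordering).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Only the rows needed for the word RGD, transcribed from BLOSUM62 (assumption).
BLOSUM62_ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
}


def neighborhood(word, threshold):
    """Return {candidate: score} for every same-length word scoring >= threshold."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(BLOSUM62_ROWS[w][AMINO_ACIDS.index(c)]
                    for w, c in zip(word, letters))
        if score >= threshold:
            hits["".join(letters)] = score
    return hits


# Print the T=12 neighborhood of RGD, highest score first.
for cand, score in sorted(neighborhood("RGD", 12).items(), key=lambda kv: -kv[1]):
    print(cand, score)
```

Lowering the threshold enlarges the neighborhood, which raises sensitivity but also the number of word hits that must be examined.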

19 Word length

20 2-hit seeding Alignments tend to have multiple word hits. Isolated word hits are frequently false leads. Most alignments have large ungapped regions. Requiring 2 word hits on the same diagonal (within 40 aa of each other, for example) greatly increases speed at a slight cost in sensitivity.
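The two-hit idea can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: it uses exact word matches only (no neighborhood words), pairs consecutive hits on a diagonal, and all names (`word_hits`, `two_hit_seeds`, `w`, `window`) are hypothetical.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Yield (i, j) wherever query[i:i+w] equals subject[j:j+w]."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            yield i, j


def two_hit_seeds(query, subject, w=3, window=40):
    """Return (diagonal, first_start, second_start) for pairs of hits that
    lie on the same diagonal, do not overlap, and are at most `window` apart.
    Positions are query coordinates; the diagonal is j - i."""
    by_diagonal = defaultdict(list)
    for i, j in word_hits(query, subject, w):
        by_diagonal[j - i].append(i)
    seeds = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        for first, second in zip(starts, starts[1:]):
            if w <= second - first <= window:
                seeds.append((diagonal, first, second))
    return seeds


# Toy sequences (made up for illustration): two separated hits on diagonal 0.
print(two_hit_seeds("MKVLAAGGWWPQ", "MKVXXXXGWWPQ"))  # [(0, 0, 7)]
```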

21 Extension of the seed alignments
Alignments are extended from seeds in each direction. Extension is terminated when the score drops more than X below the maximum score seen so far.
Text example (match +1, mismatch -1, no gaps):
    The quick brown fox jumps over the lazy dog.
    The quiet brown cat purrs when she sees him.
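The text example can be run through a small X-drop extension routine. The sketch below extends in one direction only, without gaps, and stops once the running score has fallen more than X below the best score seen; the function name `extend_right` and the value X = 5 are assumptions chosen for illustration.

```python
def extend_right(a, b, start_a, start_b, x_drop, match=1, mismatch=-1):
    """Ungapped rightward extension with an X-drop cutoff.

    Returns (best_score, best_length): the highest running score reached
    and the number of aligned positions at which it was first reached.
    """
    score = best = best_len = 0
    k = 0
    while start_a + k < len(a) and start_b + k < len(b):
        score += match if a[start_a + k] == b[start_b + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # fell more than X below the maximum: stop extending
    return best, best_len


top = "The quick brown fox jumps over the lazy dog."
bottom = "The quiet brown cat purrs when she sees him."
best, length = extend_right(top, bottom, 0, 0, x_drop=5)
print(best, repr(top[:length]))  # 12 'The quick brown '
```

The extension is trimmed back to the position of the maximum score, so the reported alignment ends after "brown " even though the scan continued further before giving up.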

22 BLAST statistics
>gi|23098447|ref|NP_691913.1| (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis]
Length = 253
Score = 38.9 bits (89), Expect = 3e-05
Identities = 17/40 (42%), Positives = 26/40 (64%)
Frame = -1
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49
How significant is this similarity?

23 Scoring the alignment
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49
[Figure: per-column scores (4, 4, ...) summed to give S (score)]

24 The Karlin-Altschul equation
Labels on the equations: a minor constant; expected number of alignments; length of query; length of database; search space; raw score; scaling factor; normalized score; the “Expect” or “E-value”; the “P-value”.
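The quantities labeled on this slide are related by the standard Karlin-Altschul formulas, sketched below. The symbol names (k for the minor constant, m and n for the query and database lengths, lam for the scaling factor lambda, S for the raw score) follow common convention and do not appear in the transcript. The values lam = 0.267 and k = 0.041 are assumed typical gapped BLOSUM62 parameters, not values taken from the slides, and the search-space sizes in the example are made up.

```python
import math


def expect_value(raw_score, m, n, lam, k):
    """E = k * m * n * exp(-lambda * S): expected number of chance alignments."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln k) / ln 2, in bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_bits(bits, m, n):
    """Equivalent form of the E-value: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e_value):
    """P = 1 - exp(-E): probability of at least one such chance alignment."""
    return -math.expm1(-e_value)


# Raw score 89 from the report on slide 22, with assumed lambda and k.
bits = bit_score(89, lam=0.267, k=0.041)
print(round(bits, 1))  # 38.9, matching "Score = 38.9 bits (89)"

# Hypothetical search space: query of 1,000 letters, database of 1,000,000.
e = expect_from_bits(bits, m=1_000, n=1_000_000)
print(e, p_value(e))
```

For small E the P-value and the E-value are nearly equal, which is why BLAST reports usually quote only the E-value.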

25 The sum-statistics Sum statistics increase the significance (decrease the E-value) for groups of consistent alignments.

26 The sum-statistics The sum score is not reported by BLAST!

