Published by Shannon Simmons. Modified over 9 years ago.
1
Pattern Matching
Rhys Price Jones, Anne R. Haake
2
What is pattern matching?
Pattern matching is the procedure of scanning a nucleic acid or protein sequence for matches to short sequence patterns (Staden 1990).
3
Why search for patterns?
Usually the sequences of interest (the query sequences) are known to be indicators of some important biological function.
Search for patterns in nucleotide sequence
– DNA or RNA
Search for patterns in amino acid sequence
4
Motif: multiple uses of the word
Def: a pattern; typically used to refer to a short (up to ten bases or residues) repeated or conserved pattern in nucleic acids or proteins
Def: a short conserved sequence in a protein, usually associated with function
– In a broader sense, motif is used for all localized regions of homology, regardless of size
5
Some examples of patterns in DNA sequence:
Restriction sites: recognition sites for the restriction endonucleases
Intron splice sites
Codons specifying ORFs
Promoters
DNA binding sites for regulatory proteins
6
Restriction Sites
Why identify them?
Exact or inexact matches?
Examples: Restriction sites
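Restriction sites are the classic case for exact matching, so a minimal sketch may help. The enzyme table below is an illustrative assumption (three well-known enzymes chosen for the example, not taken from the slides), and `find_sites` is a hypothetical helper name. Only the given strand is scanned.

```python
# Illustrative table (an assumption, not from the slides): three well-known
# enzymes and their recognition sequences.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(dna, sites=RECOGNITION_SITES):
    """Return {enzyme: [0-based start positions]} for every exact match."""
    dna = dna.upper()
    hits = {}
    for enzyme, pattern in sites.items():
        positions = []
        start = dna.find(pattern)
        while start != -1:
            positions.append(start)
            # Step one base forward so overlapping matches are not missed.
            start = dna.find(pattern, start + 1)
        hits[enzyme] = positions
    return hits
```

These three recognition sequences are palindromic (each equals its own reverse complement), so scanning one strand is enough for them; that would not hold for an arbitrary pattern.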
7
Splice Sites
Splice donor and splice acceptor are consensus sequences
– A statistical determination of the pattern; approximates the pattern
(C or A)AG/GT(A or G)AGT: "donor" splice site
(T or C)n N (C or T)AG/G: "acceptor" splice site
("/" marks the splice junction)
Splice site example
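The donor consensus above can be written as a regular expression. This is only a sketch of a strict consensus match: real donor sites often deviate from the consensus, so a strict pattern like this will miss many of them. The name `donor_boundaries` is hypothetical.

```python
import re

# Donor consensus from the slide: (C or A)AG / GT(A or G)AGT.
# The lookahead keeps the match on the exon side, so the end of the match
# is exactly the splice junction.
DONOR = re.compile(r"[CA]AG(?=GT[AG]AGT)")


def donor_boundaries(dna):
    """Return the 0-based index of the first intron base (the G of GT)
    for each strict match to the donor consensus."""
    return [m.end() for m in DONOR.finditer(dna.upper())]
```

For example, `donor_boundaries("TTCAGGTAAGTCC")` reports the junction between `CAG` and `GTAAGT`.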
8
Splice Sites
Remember that they are consensus sequences
Why are splice sites of interest?
– Gene finding
– Mutations in the consensus sequence at the splice junctions are common in many inherited disorders
Ex: thalassemias, muscular dystrophy, Tay-Sachs, neurofibromatosis, Darier’s disease…
One of the thalassemias: mutation at splice acceptor
YYYNCAG| normal
YYYNCGG| mutant
9
Codons Specifying ORFs
ORFs (open reading frames)
Start codon, then 60-100 amino acids' worth of codons with no stop codon
Prokaryotic start codons: usually ATG, GTG or TTG, but this is species specific
Eukaryotic start: ATG
Code table
More on this, too, when we discuss gene finding
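A rough sketch of the ORF rule above: look for a start codon and read in frame until the first stop codon, keeping the stretch only if it is long enough. The function name `find_orfs` and the choice to scan only the three forward frames are assumptions made for illustration. The default threshold of 60 codons is the lower end of the range on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, start_codons=("ATG",), min_codons=60):
    """Return (start, end) pairs for forward-strand ORFs.

    start is the 0-based index of the start codon; end is the index just
    past the stop codon. min_codons counts the codons before the stop,
    including the start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] in start_codons:
                j = i
                # Walk codon by codon until a stop codon or the sequence end.
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                    j += 3
                found_stop = j + 3 <= len(dna)
                if found_stop and (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3))
                # Resume after this stretch so nested starts are not reported.
                i = j + 3
            else:
                i += 3
    return orfs
```

For a prokaryotic sequence, the alternative start codons can be passed in, for example `find_orfs(seq, start_codons=("ATG", "GTG", "TTG"))`. The reverse strand is not examined in this sketch.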
10
Promoters
Prokaryotic promoters: consensus sequences
– TTGACA (-35) ---- 17±1 bp ---- TATAAT (-10)
Eukaryotic promoters
– TATA box at -25 relative to the transcriptional start site; consensus is 5’-TATAWAW-3’ (W = A or T)
– Initiator sequence (Inr); consensus is 5’-YYCARR-3’ (Y is C or T; R is G or A); the +1 nucleotide (start) is usually the A of the Inr sequence
Bind basal transcription factors
– We’ll revisit this when we discuss gene finding
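The prokaryotic consensus (a -35 box, a spacer of 17±1 bases, then a -10 box) maps directly onto a regular expression with a bounded repeat. This is a sketch of a strict match only; real promoters rarely match both hexamers exactly, which is why approximate methods are normally used. The name `find_consensus_promoters` is hypothetical.

```python
import re

# -35 box, a 16-18 base spacer (17 +/- 1), then the -10 box.
PROMOTER = re.compile(r"TTGACA([ACGT]{16,18})TATAAT")


def find_consensus_promoters(dna):
    """Return (start of the -35 box, spacer length) for each strict match."""
    return [(m.start(), len(m.group(1))) for m in PROMOTER.finditer(dna.upper())]
```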
11
Transcription Factor Binding Sites
Regulatory transcription factors are sequence-specific DNA-binding proteins; their binding sites are often found in or near gene promoter regions
The DNA sequence is called the response element
What are the DNA sequences like?
Response elements
12
Some examples of patterns in protein sequences (motifs):
Prediction of secondary and tertiary structure
– e.g. transcription factors: helix-turn-helix, bZIP, zinc-finger
Examples
Presence of active sites of enzymes
Presence of cell localization signals
13
Exact vs Inexact (Approximate) Pattern Matching
Exact Pattern Matching
– Limited use in bioinformatics
– Well-known algorithms (last week)
– A common use of exact pattern matching is to compare a sequence against a large number of known patterns, such as in the identification of restriction sites
Approximate
– Most of the other examples of pattern matching in bioinformatics
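To make the contrast concrete, here is one simple form of approximate matching: slide the pattern along the text and accept any window with at most k mismatches. This is only a sketch under a narrow assumption, namely that differences are substitutions; insertions and deletions are not handled. The name `approximate_matches` is hypothetical.

```python
def approximate_matches(text, pattern, max_mismatches=1):
    """Return (position, mismatches) for each window within the limit.

    Only substitutions are counted; insertions and deletions are not handled.
    """
    hits = []
    m = len(pattern)
    for i in range(len(text) - m + 1):
        # Count positions where the window and the pattern disagree.
        mismatches = sum(a != b for a, b in zip(text[i:i + m], pattern))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits
```

Setting `max_mismatches=0` reduces this to exact matching, which shows how the exact case is a special case of the approximate one.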
14
Other uses of exact pattern matching?
Check PCR primers?
Annotation? (text matching)
15
Why search for patterns?
Pattern matching in sequences is also the basis of searching through a sequence database
– Sequence alignment
16
Pairwise Sequence Alignment
An alignment between two sequences is a pairwise, position-by-position match between them.
Pairwise sequence comparison is the primary means of linking biological function to the genome and of propagating known information from one genome to another (Gibas & Jambeck).
17
Why are inexact pattern matches relevant in sequence alignments?
Sequencing errors
Mutation
– 2 primary types: point mutations (affect a single nucleotide) and segmental mutations (affect a few to hundreds of adjoining nucleotides)
– substitutions (transitions, transversions)
– insertions, deletions
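The substitution types named above can be told apart mechanically: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and a transversion crosses between the two classes. A minimal sketch, assuming two equal-length sequences that are already aligned without gaps; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base change as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def point_differences(seq_a, seq_b):
    """List (position, ref, alt, kind) for equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (i, a, b, classify_substitution(a, b))
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]
```

Insertions and deletions shift the positions of everything after them, so they cannot be found by a position-by-position comparison like this one.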
18
Mutations
Point mutations usually arise from a nucleotide mismatch that becomes “fixed” during the process of replication
– Escapes the DNA repair mechanism
Significant when they occur within a coding region and also cause a change in functionality
– Non-synonymous mutation
– Synonymous mutation: mutated sequence codes for the same amino acid as before mutation
– Allowance for synonymous mutation due to wobble and degeneracy of the code
Code Table
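Whether a codon change is synonymous can be checked by translating the codon before and after the change. The sketch below builds the standard genetic code from a compact 64-letter string (bases ordered T, C, A, G at each position). It assumes DNA codons and the standard code only; `classify_codon_change` is a hypothetical helper name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" is a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(codon_before, codon_after):
    """Return (kind, amino acid before, amino acid after) for a codon change."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    kind = "synonymous" if before == after else "non-synonymous"
    return kind, before, after
```

For instance, GAA to GAG changes only the third ("wobble") position and still codes for glutamate, while GAG to GTG replaces glutamate with valine.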
19
Evolutionary Considerations
Through time, mutations tend to be preserved if they are not deleterious
Functionally important sequences tend to be conserved
Non-functional or non-coding sequences diverge at a high rate
20
Evolutionary Considerations
The tendency of functionally important sequences to remain relatively unchanged over time is the basis for sequence analysis
– Allows us to draw evolutionary connections among genes that are related in sequence