Low-complexity and Repetitive Regions

OraLee Branch, John Wootton
NCBI
branch@ncbi.nlm.nih.gov

Sequence Composition

• DNA sequences
  – What would be the expected number of occurrences of a particular sequence in a genome?
      Size: human genome, $6 \times 10^{9}$ nucleotides considering both strands
      Base frequency: equal
      Sequence length: 20 nucleotides
  – Bernoulli model: $\frac{6 \times 10^{9}}{4^{20}} \approx 0.005$
  – But: (GT)n with n > 10 occurs about $10^{5}$ times
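The Bernoulli estimate above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `expected_occurrences` is an illustrative choice, and the inputs are the slide's own values (6 × 10^9 positions, equal base frequencies, a 20-nucleotide word).

```python
def expected_occurrences(genome_positions: float, word_length: int,
                         base_freq: float = 0.25) -> float:
    """Expected count of one specific word when every base is equally likely
    and positions are treated as independent (Bernoulli model)."""
    return genome_positions * base_freq ** word_length


# Values from the slide: 6e9 positions (both strands), one specific 20-mer.
expected = expected_occurrences(6e9, 20)
print(f"expected occurrences: {expected:.4f}")
```

The result is well below one occurrence per genome, which is why roughly 10^5 observed (GT)n tracts cannot be explained by base composition alone.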

Low-complexity Regions

• Simple Sequence Regions (SSR)
  – Micro- or minisatellites
  – Regions that have significant biases in amino acid (AA) or nucleotide composition: repeats of simple motifs
  – (GT)n, (AAC)n, (P)n, (NANP)n
• Low-Complexity Regions/Segments
  – Complexity can be measured by Shannon's entropy
  – Regarding an amino acid sequence: for each composition of a complexity state, there exists a large number of possible sequences

Low-Complexity Regions

• Locally abundant residues may be
  – continuous or loosely clustered
  – irregular or aperiodic
• >25% of amino acids in currently sequenced genomes are in low-complexity (LC) regions
  – non-globular domains; SSR
• Examples: myosins, pilins, segments in antigens, short subsequences of 10-50 residues with unknown function
  – Beta-pleated sheets
  – Alpha helices
  – Coiled-coils

Detecting Low-Complexity

• SEG and PSEG/NSEG algorithms
  – Wootton and Federhen, Methods in Enzymology 266:33 (1996); Computers and Chemistry 17:149 (1993)
• SEG
  – UNIX executable available on NCBI servers
      seg FASTAfile Window TriggerComplexity Extension
    where TriggerComplexity is K2(1) and Extension is K2(2)
  – Longer window lengths define more sustained regions, but overlook short biased subsequences

    clobber> seg hu.piron.fa 12 2.20 2.50
    >gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)

                                       1-49    MANLGCWMLVVFVATWSDLGLCKKRPKPGG
                                               WNTGGSRYPGQGSPGGNRY
    ppqggggwgqphgggwgqphgggwgqphgg    50-86
                           gwgqggg
                                      87-104   THNQWHKPSKPKTSMKHM
           agaaaagavvgglggymlgsams   105-127
                                     128-179   RPLIHFGNDYEDRYYRENMYRYPNQVYYRP
                                               VDQYSNQNNFVHDCVNITIKQH
                    tvttttkgenftet   180-193
                                     194-228   DVKMMERVVEQMCITQYEKESQAYYQRGSS
                                               MVLFS
                  sppvillisflifliv   229-244
                                     245-245   G

    clobber> seg hu.piron.fa 12 2.20 2.50 -l
    >gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
    ppqggggwgqphgggwgqphgggwgqphgggwgqggg

    >gi|730388|sp|P40250|PRIO_CERAE(105-127) complexity=2.47 (12/2.20/2.50)
    agaaaagavvgglggymlgsams

    >gi|730388|sp|P40250|PRIO_CERAE(180-193) complexity=2.26 (12/2.20/2.50)
    tvttttkgenftet

    >gi|730388|sp|P40250|PRIO_CERAE(229-244) complexity=2.50 (12/2.20/2.50)
    sppvillisflifliv

SEG on the prion protein with different window lengths

• Question-based
  – exploratory tool
  – optimization step

Detecting Low-Complexity

• Intuitive explanation: take a 20-residue-long sequence
    (20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
    (1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)
    (3 3 3 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0 0 0)
• Complexity can be described by Shannon's entropy (K2)
• Regarding an amino acid sequence: for each composition of a complexity state, there exists a large number of possible sequences (K1)
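To make the two ideas above concrete, the sketch below computes the Shannon entropy of a composition vector and the number of distinct sequences that share that composition (a multinomial coefficient). The function names are illustrative. The two extreme vectors are taken from the slide; the intermediate vector is a hypothetical composition chosen here so that its counts sum to 20.

```python
from math import factorial, log2


def shannon_entropy(counts):
    """Shannon entropy, in bits per residue, of a vector of residue counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def sequences_per_composition(counts):
    """Number of distinct sequences with exactly these residue counts: L! / (n1! n2! ...)."""
    n = factorial(sum(counts))
    for c in counts:
        n //= factorial(c)
    return n


states = {
    "one residue type only": [20] + [0] * 19,            # from the slide
    "all 20 types once": [1] * 20,                        # from the slide
    "intermediate (hypothetical)": [4, 3, 3, 2, 2, 2, 1, 1, 1, 1] + [0] * 10,
}
for label, counts in states.items():
    print(label, round(shannon_entropy(counts), 2), sequences_per_composition(counts))
```

The homopolymer has zero entropy and exactly one possible sequence, while the fully mixed state reaches log2(20) bits and 20! possible sequences.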

How SEG works

• seg FASTAfile Window TriggerComplexity Extension, where TriggerComplexity is K2(1) and Extension is K2(2)
• Looks within the window length: if the complexity is below K2(1), the window triggers a segment, which is then extended in both directions for as long as the neighboring windows stay below K2(2)
• Uniform prior probabilities
  – The protein sequence database is a heterogeneous statistical mixture, such that the initially unknown AA frequencies in low-complexity subsets need have no similarity to frequencies in the total database
  – Unbiased view of low-complexity regions
  – Gives equiprobable compositions for any complexity state
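The trigger-and-extend idea can be sketched as follows. This is a simplified illustration, not the SEG program itself: it uses plain Shannon entropy per window, treats the thresholds as inclusive (an assumption), and leaves out any later refinement of segment boundaries. The function names and the test sequence are made up for the example.

```python
from collections import Counter
from math import log2


def entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of a sequence window."""
    n = len(segment)
    return sum(-(c / n) * log2(c / n) for c in Counter(segment).values())


def low_complexity_segments(seq, window=12, trigger=2.2, extension=2.5):
    """Return raw low-complexity segments as 0-based, half-open (start, end) pairs."""
    n_windows = len(seq) - window + 1
    if n_windows < 1:
        return []
    k2 = [entropy(seq[i:i + window]) for i in range(n_windows)]
    segments = []
    i = 0
    while i < n_windows:
        if k2[i] <= trigger:                      # trigger window found
            lo = hi = i
            while lo > 0 and k2[lo - 1] <= extension:
                lo -= 1                           # extend to the left
            while hi + 1 < n_windows and k2[hi + 1] <= extension:
                hi += 1                           # extend to the right
            start, end = lo, hi + window
            if segments and start <= segments[-1][1]:
                # merge with the previous segment if they touch or overlap
                segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
            else:
                segments.append((start, end))
            i = hi + 1
        else:
            i += 1
    return segments


flank = "ACDEFGHIKLMNPQRSTVWY"
print(low_complexity_segments(flank + "G" * 20 + flank))
```

With the defaults (12, 2.2, 2.5) the glycine run in the toy sequence is reported as a single segment, while the flanks alone trigger nothing.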

How SEG works, continued

• How do you correct for the background AA/nucleotide composition bias?
  – After randomly shuffling all the residues, determine the trigger complexity that results in 4% of the database being within low-complexity regions
  – Then use this trigger complexity and subtract 4% from the %AA in low-complexity regions
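The shuffling step can be approximated in a few lines. This is a rough sketch under stated assumptions: residues count as "within low-complexity regions" when they lie in any window whose entropy is at or below the trigger (no extension step), candidate triggers are scanned on a fixed grid, and all function names are hypothetical. It is not a substitute for running the real tools on a real database.

```python
import random
from collections import Counter
from math import log2


def entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of a sequence window."""
    n = len(segment)
    return sum(-(c / n) * log2(c / n) for c in Counter(segment).values())


def shuffle_database(sequences, seed=0):
    """Pool all residues and shuffle them, keeping the overall composition."""
    residues = list("".join(sequences))
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def covered_fraction(window_k2, seq_len, window, trigger):
    """Fraction of residues lying in at least one window with entropy <= trigger."""
    covered = [False] * seq_len
    for i, k2 in enumerate(window_k2):
        if k2 <= trigger:
            covered[i:i + window] = [True] * window
    return sum(covered) / seq_len


def calibrate_trigger(shuffled, window=12, target=0.04, step=0.05):
    """Largest grid value of the trigger that keeps coverage at or below the target."""
    window_k2 = [entropy(shuffled[i:i + window])
                 for i in range(len(shuffled) - window + 1)]
    best = 0.0
    for k in range(int(log2(window) / step) + 2):
        trigger = k * step
        if covered_fraction(window_k2, len(shuffled), window, trigger) > target:
            break
        best = trigger
    return best
```

On real data the calibrated trigger depends on the organism's composition, which is the reason for repeating the calibration per database rather than reusing fixed defaults.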

Detecting Low-complexity with repetitive motif: SSR

• PSEG or NSEG
• Repetition of residue types or k-grams
• Period 3
    (n V E n K N n V D n K D n V N n K S n K)
    (n m i n m i n m i n m i n m i n m i n m)
    (n m E n m N n m D n m D n m N n m S n m)
• Sliding window along the sequence in single-residue steps
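One simple way to see the period-3 signal in the example sequence is to split it into its phases and measure the entropy of each phase separately. This is an illustration of the idea of periodicity only; it is not claimed to be how PSEG or NSEG work internally, and `phase_entropies` is a name invented for this sketch.

```python
from collections import Counter
from math import log2


def entropy(residues) -> float:
    """Shannon entropy (bits per residue) of a string of residues."""
    n = len(residues)
    return sum(-(c / n) * log2(c / n) for c in Counter(residues).values())


def phase_entropies(seq: str, period: int):
    """Entropy of the residues found at each phase (offset) of the given period."""
    return [entropy(seq[phase::period]) for phase in range(period)]


# First row of the period-3 example above, written in upper case.
seq = "NVENKNNVDNKDNVNNKSNK"
for period in (1, 2, 3, 4):
    print(period, [round(h, 2) for h in phase_entropies(seq, period)])
```

At period 3 the first phase contains only N, so its entropy drops to zero, while no phase at period 2 is that uniform.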

Evolutionary Mechanisms

• Evolution of sequences in general
  – Evolution rate of $10^{-5}$ to $10^{-9}$
      Base pair substitution ($10^{-9}$)
      Insertions/deletions
      Recombination
• In SSR and low-complexity regions, mutations are in length
  – with steps typically +/- one repeat unit
  – Evolution rate $10^{-3}$
      Biased nucleotide substitution due to increased recombination in repetitive regions
      Unequal crossing over (recombination)
      Replication slippage
• Alignment of repeats does not imply relationships/ancestry

Low-Complexity and BLAST searches

• Low-complexity regions result in BLAST searches being dominated by low-complexity matches
  – biased AA/nucleotide composition
• BLAST added "mask low-complexity" by default
  – SEG parameters: 12 2.2 2.5
• BLAST now also uses a compositional bias filter on the whole database
  – Masks on composition bias using seg 10 1.8 2.1
• YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching
• YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions

Example: Plasmodium falciparum

• Using whole-genome sequences is important to limit PCR sequencing bias for antigens: hydrophilic proteins
• Considering GC content / AA bias
  – P. falciparum is approximately 28% GC
• Visualization of individual proteins

A helpful tool here and in general

• SEALS: A System for Easy Analysis of Lots of Sequences, R. Walker and E. Koonin, NCBI
• www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html
• Demonstrate getting an appropriate data set
  – Taxnode2gi, gi2fasta
  – Daffy
  – Purge
  – Gref
  – Fanot
• Use cleaned data set of P. falciparum proteins

Protein Analysis

• Setting the trigger complexity:
  – Dbcomp
  – Shuffledb
  – Seg
• Run SEG on P. falciparum MSP1, PfEMP2, Cg2
  – Options
      -p (tree form output)
      -l (only report Low-C segs)
      -h (don't report Low-C segs)
      -x (substitute Low-C with x)
• Run PSEG on P. falciparum MSP1, PfEMP2, Cg2 with different -z (periodicity)

Usefulness of studying Low-Complexity

• Within a protein
  – secondary structure, homology searches, protein location
  – genetic disorders
• Within taxa
  – microsatellite markers
  – polymorphism
  – comparisons between proteins
• Between taxa
  – synteny, orthologs
  – different selection pressures upon different organisms
  – parasites: immunogenicity, rapid evolution of antigens, recombination
