Hash Algorithm and SSAHA Implementations Zemin Ning Production Software Group Informatics.

Outline of the Talk:
- The need for a fast search engine
- SSAHA: Sequence Search and Alignment using Hashing Algorithm
- Hash table
- Sequence search based on the hash table
- Search speed
- Memory requirement
- How to use the package

Algorithms and Software Tools
- Algorithms: dynamic programming; hash method; suffix tree; ...
- Software tools: FASTA; BLAST; Cross_Match; MUMmer; ...
- CPU vs memory

Smith-Waterman Algorithm
- Only works effectively when gap penalties are used
- In the example shown: match = +1; mismatch = -1/3; gap = -(1 + k/3), i.e. gap penalty W_k = 1 + k/3 (k = extent of gap)
- Start with all cell values = 0
- For each cell, look along the subcolumn, along the subrow and at the direct diagonal for the score that is highest once the alignment score or gap penalty is taken into account:

$$H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \right\}$$
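The recurrence can be sketched directly in Python. This is a minimal illustration, not the code used by any of the tools named in the talk: the function name `smith_waterman` is hypothetical, the scores are the ones from the example (match +1, mismatch -1/3, gap penalty assumed to be W_k = 1 + k/3), and exact fractions are used so the thirds do not pick up rounding error. It returns only the best local score and runs in cubic time, which is acceptable for short strings.

```python
from fractions import Fraction


def smith_waterman(a, b,
                   match=Fraction(1),
                   mismatch=Fraction(-1, 3),
                   gap=lambda k: 1 + Fraction(k, 3)):
    """Return the best local alignment score of strings a and b.

    gap(k) is the penalty W_k for a gap of length k (an assumption
    taken from the slide's example values).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # All cells start at 0, including the first row and column.
    H = [[Fraction(0)] * cols for _ in range(rows)]
    best = Fraction(0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + s
            # Best score ending in a gap of length k in the subcolumn.
            up = max(H[i - k][j] - gap(k) for k in range(1, i + 1))
            # Best score ending in a gap of length l in the subrow.
            left = max(H[i][j - l] - gap(l) for l in range(1, j + 1))
            H[i][j] = max(diag, up, left, Fraction(0))
            best = max(best, H[i][j])
    return best
```

For identical strings the score is simply the string length, and strings with no base in common score 0 because every cell is clamped at zero.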

Suffix Tree Example
Mapping the string ababc into a suffix tree.
[Figure: suffix tree of ababc, drawn from the root with edge labels ab, abc, b and c.]

Motivation for sequence indexing
- faster (economy)
- remove reliance on external services and network delays (user independence)
- integrate fully with a database engine (convenience)
- exhaustive instead of heuristic (quality)
- enable different statistics in sequence evaluation (flexibility)

Objectives: With the SSAHA algorithm, we aim to achieve the following objectives:
(i) To develop a sequence search engine to search genomic sequences with a fast speed and acceptable accuracy;
(ii) To explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection;
(iii) To provide possible tools for sequence analysis based on the search engine.

Sequence Representation
Sequence S: (s_1, s_2, ..., s_i, ..., s_m), i = 1, 2, ..., m
K-tuple: (s_i s_{i+1} ... s_{i+k-1})
Using two binary digits for each base, we may have the following representations: "A" = 00; "C" = 01; "G" = 10; "T" = 11
For any of the m/k non-overlapping k-tuples in the sequence, an integer E may be used to represent the k-tuple in a unique way:

$$E = \sum_{i=1}^{2k} \beta_i \, 2^{2k-i}, \qquad 0 \le E \le E_{max} = 2^{2k} - 1$$

where \beta_i = 0 or 1, depending on the value of the sequence base, and E_max is the maximum of the possible E values.
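A minimal sketch of the two-bit encoding follows. The function names `encode_ktuple` and `decode_ktuple` are hypothetical, and the code assumes the first base of the k-tuple occupies the most significant bits, which is the ordering that reproduces the E values in the hash table example (AA = 0, AC = 1, CA = 4, TT = 15).

```python
# Two binary digits per base, as on the slide.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODE_BASE = "ACGT"


def encode_ktuple(kmer):
    """Map a k-tuple of A/C/G/T to an integer E in [0, 4**k - 1]."""
    e = 0
    for base in kmer:
        # Shift earlier bases up by two bits and append the new base.
        e = (e << 2) | BASE_CODE[base]
    return e


def decode_ktuple(e, k):
    """Inverse of encode_ktuple for a known tuple length k."""
    bases = []
    for _ in range(k):
        bases.append(CODE_BASE[e & 0b11])  # lowest two bits = last base
        e >>= 2
    return "".join(reversed(bases))
```

With k = 12 the largest value is `encode_ktuple("T" * 12)`, which equals 4**12 - 1, so a table indexed by E needs 4**k entries.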

Non-overlap Hashing v Overlap Hashing (k = 12)
Sequence (N = 36): ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC
- Non-overlap hashing, W = N/k words: ATGGCGTGCAGT, CCATGTTCGGAT, CATTACGTAAGC
- Overlap hashing, W = N-k+1 words: ATGGCGTGCAGT, TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA, CGTGCAGTCCAT, ...
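Counting the words directly shows which formula belongs to which scheme: stepping by k gives N/k words, stepping by one base gives N-k+1. The sketch below uses the 36-base sequence from the slide; the helper names are hypothetical.

```python
SEQ = "ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC"  # N = 36, from the slide


def non_overlapping_words(seq, k):
    """Words starting at positions 0, k, 2k, ... (W = N // k)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def overlapping_words(seq, k):
    """Words starting at every position (W = N - k + 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


non_overlap = non_overlapping_words(SEQ, 12)
overlap = overlapping_words(SEQ, 12)
print(len(non_overlap), len(overlap))
```

For N = 36 and k = 12 this gives 3 non-overlapping words against 25 overlapping ones, which is why the non-overlapping scheme keeps the subject table small.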

Hash Table: a 2-tuple hash table of S1, S2 and S3

S1 = (GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2 = (GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3 = (GGATCCCCTGTCCTCTCTGTCACATA)

E    k-tuple  N_i  Indices and offsets
0    AA       1    (2,19)
1    AC       3    (1,9) (2,5) (2,11)
2    AG       2    (1,15) (2,35)
3    AT       2    (2,13) (3,3)
4    CA       7    (2,3) (2,9) (2,21) (2,27) (2,33) (3,21) (3,23)
5    CC       4    (1,21) (2,31) (3,5) (3,7)
6    CG       1    (1,5)
7    CT       6    (1,23) (2,39) (2,43) (3,13) (3,15) (3,17)
8    GA       4    (1,3) (1,17) (2,15) (2,25)
9    GC       0
10   GG       5    (1,25) (1,31) (2,17) (2,29) (3,1)
11   GT       6    (1,1) (1,27) (1,29) (2,1) (2,37) (3,19)
12   TA       1    (3,25)
13   TC       6    (1,7) (1,11) (1,19) (2,23) (2,41) (3,11)
14   TG       3    (1,13) (2,7) (3,9)
15   TT       0
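The table can be reproduced with a few lines of Python. This is an illustrative sketch rather than the SSAHA implementation: `build_hash_table` is a hypothetical name, a dictionary keyed by the k-tuple string stands in for an array indexed by E, and indices and offsets are 1-based to match the slide.

```python
def build_hash_table(sequences, k):
    """Hash the non-overlapping k-tuples of each subject sequence.

    Returns a dict mapping each k-tuple to a list of
    (sequence index, offset) pairs, both 1-based.
    """
    table = {}
    for index, seq in enumerate(sequences, start=1):
        for pos in range(0, len(seq) - k + 1, k):
            word = seq[pos:pos + k]
            table.setdefault(word, []).append((index, pos + 1))
    return table


S1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
S2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
S3 = "GGATCCCCTGTCCTCTCTGTCACATA"

table = build_hash_table([S1, S2, S3], k=2)
for word in sorted(table):
    print(word, len(table[word]), table[word])
```

Tuples that never occur (GC and TT here) are simply absent from the dictionary, which corresponds to N_i = 0 in the table.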

Query sequence: S_q = (TGCAACAT), looked up in the 2-tuple hash table of S1, S2 and S3.

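Array of index and offset data
Query sequence: S_q = (TGCAACAT)
Each entry is a pair (sequence index, offset). F(t) is f(t) with the offset shifted by -(t-1); F_s is the sorted list of all F(t) entries.

k-tuple  f(t)     F(t)     -(t-1)  F_s
TG       (1,13)   (1,13)   0       (1,5)
         (2,7)    (2,7)    0       (1,13)
         (3,9)    (3,9)    0       (2,-2)
GC       (no entries)
CA       (2,3)    (2,1)    -2      (2,1)
         (2,9)    (2,7)    -2      (2,1)
         (2,21)   (2,19)   -2      (2,4)
         (2,27)   (2,25)   -2      (2,7)
         (2,33)   (2,31)   -2      (2,7)
         (3,21)   (3,19)   -2      (2,7)
         (3,23)   (3,21)   -2      (2,7)
AA       (2,19)   (2,16)   -3      (2,16)
AC       (1,9)    (1,5)    -4      (2,16)
         (2,5)    (2,1)    -4      (2,19)
         (2,11)   (2,7)    -4      (2,22)
CA       (2,3)    (2,-2)   -5      (2,25)
         (2,9)    (2,4)    -5      (2,28)
         (2,21)   (2,16)   -5      (2,31)
         (2,27)   (2,22)   -5      (3,-3)
         (2,33)   (2,28)   -5      (3,9)
         (3,21)   (3,16)   -5      (3,16)
         (3,23)   (3,18)   -5      (3,18)
AT       (2,13)   (2,7)    -6      (3,19)
         (3,3)    (3,-3)   -6      (3,21)

The entry (2,7) occurs four times in F_s, so the query matches sequence S2 starting at offset 7.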

Sequence Search
Sequence search is carried out using the generated hash table. Suppose we have a query sequence of length n, S_q = (s_1, s_2, s_3, ..., s_n), and we want to find whether this sequence is one of the sequences in the database or a small segment of one of them. Based on S_q, we form an integer array

$$E = (E_1, E_2, \ldots, E_t, \ldots, E_{n+1-k})$$

where E(t) = E_t and t = 1, 2, ..., n+1-k. Note that overlapping k-tuples are allowed in the query sequence when building this array. For each element E(t), the hash table holds two arrays, one of sequence indices and one of offsets, each of length N_t (the number of entry repeats):

$$f_1(t) = \{H_1(E(t),1),\ H_1(E(t),2),\ \ldots,\ H_1(E(t),N_t)\}$$
$$f_2(t) = \{H_2(E(t),1),\ H_2(E(t),2),\ \ldots,\ H_2(E(t),N_t)\}$$

Frequency Array

$$F_1(t) = f_1(t)$$
$$F_2(t) = \{H_2'(E(t),1),\ H_2'(E(t),2),\ \ldots,\ H_2'(E(t),N_t)\}$$

with

$$H_2'(E(t),i) = H_2(E(t),i) - (t-1), \qquad i = 1, 2, \ldots, N_t$$

This offset adjustment should be done for every element in the array.
[Figure: subject and query sequences; each match start is shifted back by t-1 to a common reference point.]

64 Bit Machines
In order to carry out the search quickly and effectively, it is helpful in the computer code to combine these two integer arrays into a single long integer array. We are targeting implementations on 64-bit machines. The long integer array can be expressed as

$$F(t) = \{H(E(t),1),\ H(E(t),2),\ \ldots,\ H(E(t),N_t)\}$$

with

$$H(E(t),i) = 2^{32}\, H_1(E(t),i) + H_2'(E(t),i), \qquad i = 1, 2, \ldots, N_t$$

It is seen from the above equation that the offset value takes the low-order bits while the index takes the high-order bits of the long integer.
[Figure: layout of the long integer, Index (high bits) followed by Offset (low bits).]
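The packing can be illustrated as below. The names `pack` and `unpack` are hypothetical, and the handling of negative shifted offsets is an assumption: the slide only gives the sum 2^32 * index + offset, so the sketch recovers the pair by rounding to the nearest multiple of 2^32, which is valid while the offset stays within [-2^31, 2^31). Python integers are unbounded, so the 64-bit limit is not enforced here.

```python
INDEX_SHIFT = 32  # index in the high bits, offset in the low bits


def pack(index, offset):
    """Combine a sequence index and a shifted offset into one integer."""
    return (index << INDEX_SHIFT) + offset


def unpack(packed):
    """Recover (index, offset); offset may be negative after shifting."""
    half = 1 << (INDEX_SHIFT - 1)
    index = (packed + half) >> INDEX_SHIFT
    offset = packed - (index << INDEX_SHIFT)
    return index, offset


hits = [(3, -3), (2, 7), (1, 13), (2, -2)]
packed_sorted = sorted(pack(i, o) for i, o in hits)
print([unpack(p) for p in packed_sorted])
```

Because the index sits in the high bits, sorting the packed integers orders the hits first by sequence and then by offset, which is the order the array sorting step relies on.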

Array Sorting
For the query sequence there are n+1-k arrays in total, and they must be combined into one single array:

$$F = \{F(1),\ F(2),\ \ldots,\ F(t),\ \ldots,\ F(n+1-k)\}$$

Finally, the array is sorted into ascending order, i.e. F -> F_s with

$$F_{s,1} \le F_{s,2} \le \ldots \le F_{s,i} \le \ldots$$

and the search results are determined by the number of repeated values in the array. If, within a section of the F_s array, the repeat level is higher than a given threshold, there is a match between the query sequence and a sequence in the database.
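The whole search (look up each overlapping query k-tuple, shift the offsets by t-1, merge, sort and count repeats) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, hits are kept as (index, offset) tuples rather than packed integers, and a match is reported only when exactly the same pair repeats at least `threshold` times. The data are the three subject sequences and the query TGCAACAT from the hash table example.

```python
def build_hash_table(sequences, k):
    """Non-overlapping k-tuples -> list of 1-based (index, offset)."""
    table = {}
    for index, seq in enumerate(sequences, start=1):
        for pos in range(0, len(seq) - k + 1, k):
            table.setdefault(seq[pos:pos + k], []).append((index, pos + 1))
    return table


def search(table, query, k, threshold):
    """Return (sorted hit array F_s, list of (index, offset, repeats))."""
    hits = []
    for t in range(1, len(query) - k + 2):        # t = 1 .. n+1-k
        word = query[t - 1:t - 1 + k]             # overlapping k-tuples
        for index, offset in table.get(word, []):
            hits.append((index, offset - (t - 1)))  # shift to reference point
    hits.sort()                                   # F -> F_s
    matches = []
    run_start = 0
    for i in range(1, len(hits) + 1):
        # A run ends at the end of the array or when the value changes.
        if i == len(hits) or hits[i] != hits[run_start]:
            repeats = i - run_start
            if repeats >= threshold:
                matches.append(hits[run_start] + (repeats,))
            run_start = i
    return hits, matches


subjects = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]
table = build_hash_table(subjects, k=2)
sorted_hits, matches = search(table, "TGCAACAT", k=2, threshold=3)
print(matches)
```

On this data the only value repeated three or more times is (2, 7), found four times, which places the query at offset 7 of the second sequence.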

Power Law: CPU time v query length
[Fig. 1: Normalized CPU time plotted against the number of k-tuples in the query (k = 12), using Quicksort.]
Averaged length of frequency array: (formula missing), where \hat{N}_i is the average length of the entry repeats.

Speed and Resolution: Effects of k
Query file: 39,000 reads
Subject file: 1.5 Gbp of human DNA

k    E_max + 1      CPU (get hash table) T_1 (s)   CPU (search only) T_2 (s)*
8    65,536         378                            47702
9    262,144        382                            8225
10   1,048,576      388                            1793
11   4,194,304      408                            387
12   16,777,216     427                            102
13   67,108,864     454                            57
14   268,435,456    477                            49

SSAHA Memory
- Memory for subject: M_s = 4*N_s/k + 4*2^(2k)
- Memory for query: M_q = N_q
- Housekeeping: 10-20% of total
- Total memory: M_total = 1.2*(M_s + M_q)
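As a worked example, the formulas can be evaluated for a subject of the size used in the speed test. The function name `ssaha_memory` is hypothetical, and reading the results as bytes is an assumption (the factor 4 suggests 4-byte integers, but the slide does not state the unit). The 20% housekeeping figure is the upper end of the range given above.

```python
def ssaha_memory(n_subject, n_query, k, housekeeping=0.2):
    """Evaluate the slide's memory formulas.

    n_subject, n_query: number of bases in subject and query.
    Returns (M_s, M_q, M_total); units assumed to be bytes.
    """
    # One 4-unit entry per non-overlapping k-tuple, plus one per table slot.
    m_subject = 4 * n_subject // k + 4 * 2 ** (2 * k)
    m_query = n_query
    m_total = (1 + housekeeping) * (m_subject + m_query)
    return m_subject, m_query, m_total


m_s, m_q, m_total = ssaha_memory(n_subject=1_500_000_000, n_query=0, k=12)
print(m_s, m_total)
```

For 1.5 Gbp and k = 12 the word-position term (500,000,000) dominates the table term (67,108,864). Because the table term grows as 4^k, at k = 14 it alone reaches 1,073,741,824.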

SSAHA Memory: one array combining read index and offset
[Figure: array layout with entries labelled R_i, R_i + j and R_{i+1}.]

Matching Positions Found by SSAHA
[Figure: subject and query sequences; labels: match start, offset t-1, reference point.]

SSAHA2 = SSAHA + Cross_Match
SSAHA for matching seeds, cross_match for sequence alignment.
[Figure: the sequence passed to cross_match spans the SSAHA seeds plus an edge length on each side.]

SSAHA2 Command Line
./ssaha2 query_file subject_file
Options:
- -kmer: length of kmer words; default 12
- -seeds: number of exact kmer words; default 10
- -align: '1' - show full alignment; '0' - no alignment; default '1'
- -sense: '1' - search with higher sensitivity; '0' - normal; default '0'
- -tags: '1' - show a tag of 'ALIGNMENT'; '0' - no tag; default '0'
- -depth: number of reported hits with best alignment; default 50
- -score: minimum Smith-Waterman score; default 30
- -cut: number of word occurrences in the dataset; default 200
- -memory: memory assigned in MB for cross_match; default 2000
- -array: memory assigned in MB for storing frequency arrays; default 4
- -edge: extension of both ends on the subject; default 200
- -best: report the best alignment from the hit list; default '0'
- -start: first read to process from the query file; default 0
- -end: last read to process from the query file; default is the total number of reads in the query file

COOKBOOK
- BAC ends placement (find the best hit in the database): -seeds 14 -kmer 13 -align 0 -tags 1 -depth 5 -score 200 -cut 50000
- EST/cDNA alignment (produce splice on the subject sequence): -seeds 4 -kmer 13 -align 0 -tags 1 -depth 5 -score 20 -edge 20000
- Primer/gene marks alignment (find the matches of short motifs to the database): -seeds 1 -kmer 13 -tags 1 -score 12 -skip 1 -sense 1 -cut 50000
- Search with higher sensitivity: -seeds 2 -kmer 13 -tags 1 -score 20 -sense 1 -cut 50000
- Both query and subject are large (q: 100 Kb < query < 1 Mb; s: no limit): -seeds 50 -kmer 13 -tags 1 -score 2000 -array 40 -memory 10000

Summary:
- Speed: fast enough to perform genomic-scale searches between large genomes
- Memory: linear
- Sensitivity: not as good as BLAST, but applicable in assembly and SNP detection
