Indexing Graphs for Path Queries with Applications in Genome Research

Presented by: Evan Stene Spring 2017

Paper Information
Authors: Jouni Sirén, Niko Välimäki, Veli Mäkinen
Affiliations: Department of Computer Science, University of Chile; Research Programs Unit, University of Helsinki
Published In: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Date of Publication: January 2014

Background – Alignment
Part of the process of converting biological data into computer-readable data (sequencing).
Reading is prone to error.
The sequence being searched (the reference genome) will never match all queries exactly.

Combining Reference Sequences
Fig. 1. Pattern AGCTGTGT matching the multiple alignment when allowing it to change row when necessary.
Mention that small variations can be used with a reference as well.
Also mention that backtracking is the alternative and is slow.

Background – Suffix Arrays
An array of all suffixes of a string, sorted in ascending lexicographic order.
Allows finding a substring s in O(|s|) steps with backward search; plain binary search over the array takes O(|s| log n) time.
A closed-form function exists to find the letter preceding a suffix.
Useful for compressed indices, e.g. using only sampled values.
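A minimal sketch of the idea, assuming a naive construction that sorts the suffixes directly (fine for a slide-sized string, far too slow for a genome). The function names and the example string ABRACADABRA$ are illustrative choices, not taken from the paper.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in ascending
    lexicographic order (naive construction, for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for suffixes that start with pattern.
    Returns the sorted start positions of all occurrences."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "ABRACADABRA$"          # "$" marks the end of the string
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ABRA")
```

Each binary-search step compares up to |s| characters, which is where the O(|s| log n) bound for the plain array comes from.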

Background – Prefix Doubling
Sort suffixes by their prefixes, iteratively.
Iteration i uses prefixes of length 2^i.
Each step effectively groups suffixes that share a prefix.
Suffixes not belonging to any group (unique prefix) are already in sorted order.
Requires up to lg(n) iterations.
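The iteration can be sketched as follows, assuming the textbook formulation in which each round sorts suffixes by a pair of ranks (first half of the prefix, second half of the prefix). The function name is hypothetical, and the sketch re-sorts every suffix each round rather than skipping suffixes whose prefix is already unique.

```python
def suffix_array_prefix_doubling(text):
    """Build a suffix array by prefix doubling: round i orders the
    suffixes by their first 2^i characters."""
    n = len(text)
    if n == 0:
        return []

    # Start: rank every suffix by its first character only.
    rank = [ord(ch) for ch in text]
    sa = list(range(n))
    k = 1  # current ranks describe prefixes of length k

    while True:
        def key(i):
            # Rank of the first k characters, then rank of the next k
            # characters (-1 if the suffix ends before that).
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Suffixes with equal keys share a rank (they form a group).
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank

        if rank[sa[-1]] == n - 1:  # every suffix has a unique rank
            return sa
        k *= 2


doubling_sa = suffix_array_prefix_doubling("ABRACADABRA$")
```

Because the compared prefix length doubles every round, the loop runs at most about lg(n) times before all ranks are unique.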

Background – Prefix Doubling Example
[Table: prefix-doubling example. Columns are string positions 1-12; rows are String (over the symbols $, A, B, R, C, D, #) and SA1 through SA4. Surviving SA1 groups: (1 11), (2 9), (3 10).]
Mention that # and $ are the beginning and end of string markers.
LF = C[T[SA[i] – 1]] + rank(SA[i]), where rank is counted within that letter.
E.g., for finding AB at 8, 9: T[SA[9] – 1] = A, C[A] = 1, rank(SA4[9]) = 1 since it's the first B.
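A small sketch of the LF formula on a plain string, assuming 0-based indices, the example text ABRACADABRA$ (with only an end marker, not the slide's # and $ pair), and a naively built suffix array. All function names are illustrative.

```python
def suffix_array(text):
    """Naive suffix array, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] = T[SA[i] - 1], the letter preceding the i-th smallest suffix.
    Index -1 wraps around to the end marker for the suffix at position 0."""
    return "".join(text[i - 1] for i in sa)


def build_c(bwt):
    """C[ch] = number of letters in the text that are smaller than ch."""
    c, total = {}, 0
    for ch in sorted(set(bwt)):
        c[ch] = total
        total += bwt.count(ch)
    return c


def lf(bwt, c, i):
    """LF(i) = C[ch] + rank of this ch among equal letters before row i,
    where ch = T[SA[i] - 1]. The result is the row of the suffix that
    starts one position earlier in the text."""
    ch = bwt[i]
    return c[ch] + bwt[:i].count(ch)


text = "ABRACADABRA$"
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
c = build_c(bwt)
row_of_previous_suffix = lf(bwt, c, 6)  # row 6 holds the suffix BRA$
```

Here `bwt[:i].count(ch)` stands in for the rank query; a real index would answer it from a compressed structure instead of scanning.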

Motivation
Combine reference sequences to provide greater accuracy when aligning.
Create a suffix-array-like structure using a directed acyclic graph (DAG).
Relate the closed-form transformations of suffix arrays to the DAG.
Compress the sorted graph to fit into local memory.

Building the Graph
G A C G T A – C T G
G A C G T A – – – G
G A T G T A – C T G
G A C – T A C C T G
Use examples G, T, A from the end back, and C at pos 3.
Fig. 4. A reverse deterministic automaton corresponding to the first 10 positions of the multiple alignment in Fig. 1.
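A rough sketch of how the alignment rows could be turned into a graph, assuming one node per (column, letter) pair and an edge between consecutive non-gap letters of each row. This covers only the merging of identical letters within a column; it does not perform the determinization step, and the function name and ASCII "-" gap symbol are illustrative choices.

```python
def alignment_to_graph(rows, gap="-"):
    """Return (nodes, edges) for a multiple alignment.
    A node is (column, letter); rows that agree on a letter in a column
    share that node. Gaps are skipped, so an edge may jump over columns."""
    nodes, edges = set(), set()
    for row in rows:
        previous = None
        for column, letter in enumerate(row):
            if letter == gap:
                continue
            node = (column, letter)
            nodes.add(node)
            if previous is not None:
                edges.add((previous, node))
            previous = node
    return nodes, edges


rows = [
    "GACGTA-CTG",
    "GACGTA---G",
    "GATGTA-CTG",
    "GAC-TACCTG",
]
nodes, edges = alignment_to_graph(rows)

# Successors of the A in column 5 (0-based): one per distinct continuation.
successors_of_a = sorted(target for source, target in edges if source == (5, "A"))
```

Any path through the resulting graph spells a string that can switch between rows wherever they share a node, which is the behaviour illustrated in Fig. 1.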

Prefix Sorting
For each node v in graph A, create the following tuple and store it in an array: (from(v), w, rank(v)), one tuple for each w.
from(v) = first node in the path (forming the prefix)
rank(v) = rank of the prefix starting from v among all other prefixes
w = successor of the last node in the path; w is 0 if there are no successors
Sort the tuples by rank.
For tuples with unique ranks, set rank(v) = (rank(v), 0).
All other tuples are combined as follows: a tuple for u with w = from(v) can be combined with a tuple for v, giving rank(u, v) = (rank(u), rank(v)).
Sort by the newly formed tuples and reassign each rank by its location in the array.
Merge nodes with the same rank and from values.
Use the 1st A node from the previous slide as the example. The 1st part is doubling, the 2nd part is pruning.
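One doubling round over such tuples might look like the sketch below. It is a simplified illustration under stated assumptions, not the paper's implementation: node ids are positive integers, w = 0 means the path cannot be extended, the merge (pruning) step is left out, and the function name and the four-node chain A -> B -> A -> $ are made up for the example.

```python
from collections import defaultdict


def doubling_step(paths):
    """One doubling round. paths is a list of (from_node, w, rank) tuples.
    Returns the new tuples sorted by their reassigned integer ranks."""
    rank_count = defaultdict(int)
    by_start = defaultdict(list)
    for frm, w, rank in paths:
        rank_count[rank] += 1
        by_start[frm].append((w, rank))

    joined = []
    for frm, w, rank in paths:
        if rank_count[rank] == 1 or w == 0:
            # Unique rank (already sorted) or nothing to extend with.
            joined.append((frm, w, (rank, 0)))
        else:
            # Combine with every path that starts where this one continues.
            for next_w, next_rank in by_start[w]:
                joined.append((frm, next_w, (rank, next_rank)))

    # Sort the rank pairs and replace each pair by its position.
    pairs = sorted({pair for _, _, pair in joined})
    new_rank = {pair: position + 1 for position, pair in enumerate(pairs)}
    result = {(frm, w, new_rank[pair]) for frm, w, pair in joined}
    return sorted(result, key=lambda t: (t[2], t[0], t[1]))


# Chain 1 -> 2 -> 3 -> 4 with labels A, B, A, $ and initial ranks $=1, A=2, B=3.
paths = [(1, 2, 2), (2, 3, 3), (3, 4, 2), (4, 0, 1)]
result = doubling_step(paths)
```

After this round the prefixes $, A$, AB, and B are all distinct, so every tuple has a unique rank and no further doubling would be needed for this tiny example.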

Prefix Sorting – Adding the edges
Fig. 5. A prefix-sorted automaton built for the automaton in Fig. 4. The strings above nodes are prefixes p(v).
Use the middle T as the edge example; it belongs to 4 prefixes.

GCSA
The BWT is stored in the list of incoming edges.

Graph as Array
AGC < AGZ -> prefixes for nodes 0 and 1
Mention that the offset is not stored in this example.
Figures by: Daehwan Kim

Searching Example
Figures by: Daehwan Kim

Example Continued…
Figures by: Daehwan Kim

Example Ends
Talk about how this had a unique match; if the range were still >1, the results would be ambiguous.
Also, if the range drops to 0 at any point, the query returns 0 or backtracking is required.
Figures by: Daehwan Kim
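The three outcomes (unique match, ambiguous range, empty range) can be reproduced with a sketch of backward search on a plain string rather than on the GCSA graph. The example text ABRACADABRA$, the naive index construction, and the function names are assumptions made for illustration.

```python
def build_index(text):
    """Return (bwt, c) for text, built naively from a sorted suffix list."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    c, total = {}, 0
    for ch in sorted(set(bwt)):
        c[ch] = total          # number of letters smaller than ch
        total += bwt.count(ch)
    return bwt, c


def backward_search(bwt, c, pattern):
    """Return the half-open range [sp, ep) of sorted suffixes that start
    with pattern, processing the pattern from its last letter to its first.
    Returns (0, 0) as soon as the range becomes empty."""
    sp, ep = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c:
            return 0, 0
        sp = c[ch] + bwt[:sp].count(ch)
        ep = c[ch] + bwt[:ep].count(ch)
        if sp >= ep:
            return 0, 0
    return sp, ep


bwt, c = build_index("ABRACADABRA$")
unique = backward_search(bwt, c, "CAD")      # range of size 1
ambiguous = backward_search(bwt, c, "ABRA")  # range of size 2
missing = backward_search(bwt, c, "ABD")     # empty range
```

The size of the returned range, `ep - sp`, is the number of matches; an aligner would fall back to backtracking (trying substitutions) when it reaches 0.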

Compression
Talk about how to get to a node using the bit vector and how to count the number of outgoing edges.
The compression of the incoming edge letter is only possible on genomic data.
The authors of the paper use a separate bit vector for each letter, with a 1 indicating that the letter appears at that index.
Figures by: Daehwan Kim
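A sketch of both ideas using uncompressed Python lists. The per-letter vectors follow the description above; the out-degree encoding (a 1 for each node's first outgoing edge, then a 0 for each further edge) is an assumption about the layout, and every node is assumed to have at least one outgoing edge. All names are illustrative.

```python
def letter_bitvectors(incoming_labels, alphabet="ACGT"):
    """One bit vector per letter: bit i is 1 if node i has an incoming
    edge labelled with that letter."""
    return {ch: [int(ch in labels) for labels in incoming_labels]
            for ch in alphabet}


def rank1(bits, i):
    """Number of 1s among the first i bits."""
    return sum(bits[:i])


def select1(bits, k):
    """Position of the k-th 1 (k starts at 1)."""
    seen = 0
    for position, bit in enumerate(bits):
        seen += bit
        if bit and seen == k:
            return position
    raise ValueError("fewer than k ones in the bit vector")


def outdegree_bits(outdegrees):
    """Unary-style encoding: 1 followed by (degree - 1) zeros per node,
    plus a closing 1 so the last node's run has an end."""
    bits = []
    for degree in outdegrees:
        bits.extend([1] + [0] * (degree - 1))
    return bits + [1]


def outdegree(bits, node):
    """Out-degree of a node (0-based) = distance between its 1 and the next 1."""
    return select1(bits, node + 2) - select1(bits, node + 1)


vectors = letter_bitvectors(["A", "CG", "T", "A", ""])
a_before_node_4 = rank1(vectors["A"], 4)

m = outdegree_bits([1, 3, 2])
```

In a real index these vectors would be stored compressed with constant-time rank and select support; the linear scans here only show what the queries compute.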

Comparison
The human genome version is about 3.1 billion bp.
Determinization = building the graph
Backbone = main reference sequence

Comparison
0 Errors = exact matching
Some of the errors in GCSA are due to the difficulty of mapping certain highly repetitive regions.
