Genomic Data Clustering on FPGAs for Compression

Presentation on theme: "Genomic Data Clustering on FPGAs for Compression"— Presentation transcript:

1 Genomic Data Clustering on FPGAs for Compression
Andreas Zingg

2 Background - Bioinformatics
Important tool to guide therapeutic intervention.
Improves the knowledge available to researchers interested in evolutionary biology.
-> May lay the foundation for predicting disease susceptibility and drug response

4 Background - Bioinformatics
Genome: entirety of an organism's hereditary information, encoded in DNA
DNA: consists of nitrogenous bases; bases appear in pairs

5 Background - Bioinformatics
Base Pairs

7 Genomic Data
DNA is cut into small sequences

8 Genomic Data
DNA is cut into small sequences
Sequences are read by machine
ACTGATTG GCCTATCGATGAC TGAT TATCGACG

9 The Problem
Generated data is really big
One human genome generates data on the order of 300 GB
This might take a while

10 The Solution
Compress the data!

11 The Solution
Compress the data!
But how?

12 The Solution
Exploit data redundancy
Map the data to the human reference genome
About 90% of genomic sequences share similarities with the human reference genome

13 Mapping to the reference genome
Figure labels: Human Reference Genome; Aligned reads

14 Mapping to the reference genome
About 90% of sequences can be mapped to the reference genome
Mapped sequences are compressed using their location relative to the reference
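
The idea of storing a mapped read by its location relative to the reference can be illustrated with a small sketch. This is not the encoding used in the presented work; the function names `encode_read` and `decode_read`, the tuple layout, and the choice to store mismatches as (offset, base) pairs are all assumptions made for illustration.

```python
def encode_read(read, reference, position):
    """Encode `read` as (position, length, mismatches) relative to `reference`.

    Hypothetical layout: mismatches is a list of (offset, base) pairs for
    the positions where the read differs from the reference window.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit into the reference at this position")
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return position, len(read), mismatches


def decode_read(encoded, reference):
    """Rebuild the original read from its relative encoding."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Toy reference and read, chosen only for this example.
reference = "GCCTATCGATGACTGATTG"
encoded = encode_read("TATCGACG", reference, 3)
restored = decode_read(encoded, reference)
```

A read that differs from the reference in only a few bases shrinks to a position, a length, and a short mismatch list, which is why well-mapped reads compress well.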

15 Mapping to the reference genome
About 90% of sequences can be mapped to the reference genome
Mapped sequences are compressed using their location relative to the reference
What about the remaining 10%?

16 Clustering
What about the remaining 10%?
Find clusters and map sequences to these clusters

17 Clustering
What about the remaining 10%?
Find clusters and map sequences to these clusters
Using what algorithm?

18 Clustering
What about the remaining 10%?
Find clusters and map sequences to these clusters
Using what algorithm?
K-Means?

19 Clustering
What about the remaining 10%?
Find clusters and map sequences to these clusters
Using what algorithm?
K-Means?
What should our K be?

20 No Useful Clustering Algorithm
No useful clustering algorithm exists for compression of genomic data
The exact value of K does not matter
As long as there are highly correlated clusters, compression is possible
Instead of searching for exactly K clusters, find clusters using a small-threshold neighbourhood function
Present a clustering algorithm

21 Matching function
For two sequences s1 and s2, a matching function is defined:
(formula shown on the slide is not preserved in this transcript)
le: sequence size
d: distance between sequences
N: distance threshold

22 Matching function
N = 1, le = 8
Figure labels: Reverse Complement; Match! Match! Match! No Match!
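
The slide's formula is not available here, so the following is only a sketch of what a threshold-based matching function with a reverse-complement check could look like. It assumes that the distance d is the Hamming distance between equal-length sequences and that a reverse-complement match counts as a match; the names `hamming`, `reverse_complement`, and `matches` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def matches(s1, s2, n):
    """True if s2, or its reverse complement, is within distance n of s1.

    Assumption: d is the Hamming distance; the slide's exact definition
    is not preserved in the transcript.
    """
    return hamming(s1, s2) <= n or hamming(s1, reverse_complement(s2)) <= n
```

With N = 1 and le = 8 as on the slide, `matches("ACTGATTG", "ACTGATTC", 1)` holds because of a single mismatch, `matches("ACTGATTG", "CAATCAGT", 1)` holds through the reverse complement, and `matches("ACTGATTG", "TTTTTTTT", 1)` does not hold.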

23 Basic Clustering Idea

24 Basic Clustering Idea
Complexity: O(n^2)

25 Basic Clustering Idea
Complexity: O(n^2)
More than 2 years on an Intel Core i7-4790
Not practical
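
The slide's figure is not preserved, so the sketch below is one plausible reading of the basic idea, not the presented algorithm: each sequence is compared against the reference of every existing cluster and starts a new cluster when nothing matches. It assumes the first member of a cluster serves as its reference and that matching is a Hamming distance of at most `n`; `greedy_cluster` is a hypothetical name.

```python
def hamming(s1, s2):
    """Number of differing positions between two equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def greedy_cluster(sequences, n):
    """Assign each sequence to the first cluster whose reference matches.

    Returns a list of (reference, members) pairs. In the worst case no
    sequence matches any reference, so every sequence is compared with
    every earlier one, which gives the quadratic cost.
    """
    clusters = []
    for seq in sequences:
        for reference, members in clusters:
            if hamming(reference, seq) <= n:
                members.append(seq)
                break
        else:
            # No existing reference matched: seq becomes a new reference.
            clusters.append((seq, [seq]))
    return clusters


clusters = greedy_cluster(["AAAA", "AAAT", "CCCC", "CCCG", "AAAA"], 1)
```

The inner loop over cluster references is the part that a parallel implementation can evaluate for many references at once.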

26 Parallel Clustering
Compare sequences with multiple cluster references at the same time
Use an FPGA board to implement the parallel clustering algorithm
To compare sequences, the FPGA can use 6-bit lookup tables
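
To give a feel for why sequence comparison maps well onto simple logic, here is a software sketch of a bitwise mismatch count. It is not the FPGA design from the presentation: the 2-bit base encoding, the packing order, and the names `pack` and `mismatches` are assumptions chosen for illustration.

```python
# Hypothetical 2-bit encoding of the four bases.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a sequence into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(x, y, length):
    """Count differing bases between two packed sequences of `length` bases."""
    diff = x ^ y
    # A base differs if either of its two bits differs; fold the high bit
    # of each pair onto the low bit and keep only the low bits.
    low_bits = int("01" * length, 2) if length else 0
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")
```

Each output bit depends on only a handful of input bits, which is the kind of function a small lookup table can evaluate, and many such comparisons can run side by side.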

27 Setup
Modular interface to cluster sequences
CPU and FPGA interchangeable
Allows for performance and result comparison

28 FPGA top hierarchy

29 Matching Unit

30 FPGA initialization phase

31 FPGA main phase (multiple possible)

32 Shortcomings
Limited number of parallel clustering units

33 Shortcomings
Limited number of parallel clustering units
Requires phase repetitions

34 Shortcomings
Limited number of parallel clustering units
Requires phase repetitions
Number of executions and memory latency increase
Clustering process is slowed down

35 Shortcomings
Limited number of parallel clustering units
Requires phase repetitions
Number of executions and memory latency increase
Clustering process is slowed down
Worst case: none of the sequences match the references of the current clusters
Cache must be able to store all sequences
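
The effect of a limited number of clustering units can be sketched in software. This is an illustrative model under assumptions, not the presented hardware flow: each phase holds at most `units` cluster references, sequences that match none of them are set aside (standing in for the cache), and another phase is run on the leftovers. The name `cluster_in_phases` and the Hamming-distance match are hypothetical.

```python
def hamming(s1, s2):
    """Number of differing positions between two equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def cluster_in_phases(sequences, n, units):
    """Cluster with at most `units` references active per phase.

    Returns (clusters, phases). Unmatched sequences are carried over to
    the next phase, so fewer units means more phase repetitions.
    """
    if units < 1:
        raise ValueError("at least one clustering unit is required")
    pending = list(sequences)
    clusters = []
    phases = 0
    while pending:
        phases += 1
        active = []    # (reference, members) pairs held by the units
        leftover = []  # sequences that must wait for a later phase
        for seq in pending:
            for reference, members in active:
                if hamming(reference, seq) <= n:
                    members.append(seq)
                    break
            else:
                if len(active) < units:
                    active.append((seq, [seq]))
                else:
                    leftover.append(seq)
        clusters.extend(active)
        pending = leftover
    return clusters, phases


reads = ["AAAA", "CCCC", "GGGG", "AAAT", "CCCG"]
two_units = cluster_in_phases(reads, 1, 2)
three_units = cluster_in_phases(reads, 1, 3)
```

In this toy input, two units need a second phase for the sequence that matched neither reference, while three units finish in one. When nothing matches, every sequence except the new references lands in `leftover`, which mirrors the worst case on the slide where the cache must hold all sequences.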

36 Proposed Workarounds
Cache not big enough: increase memory capacity, or cut the input into smaller pieces that fit in the cache and handle those (parallelizable, but the solution might be sub-optimal)
Phase repetitions slow down the clustering process: use HMC modules; use the maximum number of parallel clustering units
Maximum number of parallel units is limited by FPGA size and by the latency of sequence distribution over the FPGA surface

37 Test Setup
Unmapped paired sequences of 126 bases from a real human sample
FPGA-based version at 125 MHz
Software version on Intel Core i Haswell 4-core at 4 GHz

38 Runtime dependent on input size

39 Times needed to cluster a real case file
Software configuration ( : extrapolated)
FPGA hardware configuration ( : extrapolated)

40 Results
Software solution takes 2.6 years
FPGAs take ~12 hours
Makes the task practical
Speed gain: ~1000x
Energy saved: ~700x

41 Conclusion
Goal achieved
Opens a path for new clustering-based compression algorithms
Proved that even on large datasets, high-complexity algorithms (O(n^2)) can run in a reasonable amount of time when provided with specialized hardware

42 My Take
Well structured
Easy to read and understand
Interesting insight into a new field
Speedup is not explained well

43 Questions

