
1 Computational Challenges in BIG DATA
28/Apr/2012, China-Korea-Japan Workshop
Takeaki Uno, National Institute of Informatics & Graduate School for Advanced Studies

2 Self Introduction, for Better Understanding
Takeaki Uno: National Institute of Informatics (an institute for activating joint research)
Research Area: Algorithms (data mining, genome science)
+ Enumeration, graph algorithms, pattern mining, similarity search (tree algorithms, dynamic programming, NP-hardness, reverse search, …)
+ Implementations of algorithms for data mining, for string similarity search, …
+ Web systems for small-business optimization
+ Problem solver, or implementer, for research colleagues

3 Difficulty of Big Data
One of the main recent topics in informatics is BIG DATA
⇒ I will give my opinions on "what kind of problems we should study" and "what kind of approaches we should take", from the viewpoint of theoretical computer science
The difficulties of big data are, roughly:
modeling: noise, diversity, overly symbolic data, ill-formulated problems
computation: time/memory efficiency, online/dynamic settings, …
Here we do not consider the topics of natural science/engineering: devices, architectures, social implementation, …

4 Profiling Big Data
The features of big data are:
- huge size, of course
- noisy: accuracies differ, and the objectives for collecting the data differ
- individuality: many kinds of records (people)
- very sparse (as a graph), but locally dense (has localities)
We have disadvantages, but at the same time advantages:
- data can be re-used for other objectives (so the cost is low)
- accuracy increases automatically by using many records
Locality/minority is important: we have to classify the data (or extract the necessary parts) according to the objectives of the analysis

5 Objectives for Big Data
Big data is used for many objectives by different users, and the life span of a problem may be short, owing to rapid changes in the world
⇒ as fundamental research, we should not focus on each specific application
To be useful as fundamental tools in application areas with rapid change and diversity:
+ abstract general problems common to many applications
+ methods (algorithms, processes) should be as simple as possible, and it should be possible to explain clearly what the output is
+ models should be determined according to computational efficiency

6 Algorithms as "Natural Science"
For fast computation, algorithms have to fit real-world data
⇒ we have to observe/analyze the behavior of algorithms on real-world data, and design the algorithms according to the results
⇒ this would be a "natural science" approach (there are not many): feasible hypothesis ⇒ verification by experiments
Parallelization is necessary, but… since the best algorithms change, it has to be redone from scratch many times (in optimization, improvements by algorithms > architecture)
⇒ extract simple, good problems and parallelize them

7 A Direction: Find Similar String Pairs
Motivation: similarity search on string data (especially genomes); index construction is heavy, LSH (approximate hashing) is not good, and parallelization is not easy
Modeling: given a set S of strings of fixed length l, find all pairs of strings whose Hamming distance is at most d
⇒ avoids the overhead of executing many queries (big data needs many queries), and Hamming distance is simple/basic
Input: ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, ...
Output: ATGCCGCG & AAGCCGCC, GCCTCTAT & GCTTCTAA, TGTAATGA & GGTAATGG, ...
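As a reference point, here is a minimal brute-force sketch of the problem exactly as modeled above (Python; the function names are mine, not from the talk). It checks all O(n^2) pairs directly, which is the overhead the block-combination method on the next slide is designed to reduce:

    from itertools import combinations

    def hamming(s, t):
        """Number of positions at which two equal-length strings differ."""
        return sum(a != b for a, b in zip(s, t))

    def similar_pairs_naive(strings, d):
        """All pairs of strings at Hamming distance <= d; O(n^2 * l) baseline."""
        return [(s, t) for s, t in combinations(strings, 2)
                if hamming(s, t) <= d]

    # Example with the fragments from the slide:
    S = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
    print(similar_pairs_naive(S, 2))

On these fragments it reports exactly the three pairs shown above, each at Hamming distance 2.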

8 Block Combination
Consider a partition of each string into k (> d) blocks
⇒ if two strings have Hamming distance at most d, they share at least k-d identical blocks (the mismatches fall into at most d blocks)
⇒ for each string, the candidate similar strings are those sharing at least k-d blocks
For all combinations of k-d blocks, we find the records having the same blocks (exact search)
⇒ the search is repeated once per combination, C(k, k-d) times in total, but that is not so many
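A minimal Python sketch of the block-combination scheme, under my own assumptions (even block splitting, hash-based exact search); it illustrates the idea above and is not the author's actual implementation:

    from collections import defaultdict
    from itertools import combinations

    def hamming(s, t):
        return sum(a != b for a, b in zip(s, t))

    def similar_pairs_blocks(strings, d, k):
        """Exact pairs at Hamming distance <= d, via k-block combinations.

        If two strings differ in at most d positions, the mismatches fall
        into at most d blocks, so at least k-d blocks are identical, and
        some choice of k-d block positions matches exactly.
        """
        l = len(strings[0])
        bounds = [i * l // k for i in range(k + 1)]  # block b = [bounds[b], bounds[b+1])
        found = set()
        for blocks in combinations(range(k), k - d):  # C(k, k-d) passes
            buckets = defaultdict(list)               # exact search by hashing
            for idx, s in enumerate(strings):
                key = tuple(s[bounds[b]:bounds[b + 1]] for b in blocks)
                buckets[key].append(idx)
            for group in buckets.values():            # verify candidates only
                for i, j in combinations(group, 2):
                    if hamming(strings[i], strings[j]) <= d:
                        found.add((i, j))
        return [(strings[i], strings[j]) for i, j in sorted(found)]

Only strings that agree on some chosen k-d blocks ever meet in a bucket, and each candidate pair is then verified by a full Hamming-distance check, so the output is exact.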

9 Cost for Mismatches
No index, thus memory efficient (working space = input data size)
Worst-case time complexity is n^2, but it works fine on real-world data
Pairwise comparison can be parallelized easily
⇒ a good subproblem: linear speed-up even in GPU computing
Combination with LSH works for many kinds of data: Euclidean distance, Manhattan distance, …
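To illustrate the "good subproblem" claim, here is a toy parallelization of the previous sketch using Python's multiprocessing (purely illustrative; a GPU version would follow the same decomposition): each choice of k-d blocks is an independent exact-search pass, so the passes can be distributed across workers and their results unioned:

    from collections import defaultdict
    from itertools import combinations
    from multiprocessing import Pool

    def hamming(s, t):
        return sum(a != b for a, b in zip(s, t))

    def one_pass(args):
        """One exact-search pass for a fixed choice of k-d blocks.

        Each pass is independent of the others, which is what makes
        the scheme parallelize with essentially linear speed-up.
        """
        strings, d, bounds, blocks = args
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b]:bounds[b + 1]] for b in blocks)
            buckets[key].append(idx)
        pairs = set()
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if hamming(strings[i], strings[j]) <= d:  # verify candidates
                    pairs.add((i, j))
        return pairs

    def similar_pairs_parallel(strings, d, k, workers=4):
        l = len(strings[0])
        bounds = [i * l // k for i in range(k + 1)]
        tasks = [(strings, d, bounds, blocks)
                 for blocks in combinations(range(k), k - d)]
        with Pool(workers) as pool:
            return set().union(*pool.map(one_pass, tasks))

    if __name__ == "__main__":
        S = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA",
             "TGTAATGA", "GGTAATGG"]
        print(similar_pairs_parallel(S, d=2, k=4))

Each task re-sends the whole input here for simplicity; a real implementation would share the data among workers.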

10 Implementation: Performance
Comparison of the mouse X and human X chromosomes (150MB each, with 10% error): 15 min. on a PC
[Figure: comparison of mouse X chr. vs. human X chr.]
Note: BLASTZ takes 2 weeks; MURASAKI takes 2-3 hours with 1% error

