1
Parallel Analytic Systems
Carlos Ordonez
2
Big Data Analytics research
Input? Big data (large data sets, large files, many documents, many tables, fast growing)
How? Fast external algorithms; memory-efficient data structures at two storage levels
Parallel: multi-threaded or multi-node
Ideal goals: linear time O(n), linear speedup
Hardware? Multicore CPU or parallel cluster
Infrastructure? Distributed RAM, parallel file system; any large files
Challenging: theory + programming in action
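The "linear speedup" goal above can be made concrete with a small calculation. This is a minimal sketch, assuming the usual definitions speedup = T(1) / T(p) and efficiency = speedup / p. The function names and the timings are hypothetical and invented for illustration; they are not measurements from this work.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(p) = T(1) / T(p) for the same input."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("running times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """Parallel efficiency S(p) / p; 1.0 means linear speedup."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return speedup(t_serial, t_parallel) / p


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by number of workers.
    timings = {1: 120.0, 2: 60.0, 4: 30.0, 8: 20.0}
    for p, t in timings.items():
        s = speedup(timings[1], t)
        e = efficiency(timings[1], t, p)
        print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```

In these made-up numbers the speedup is linear up to 4 workers and falls short at 8, which is the kind of gap the "ideal goal" wording allows for.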
3
Systems research today
Transaction processing? Main memory, lock-free
Efficient analysis? Joins, compiled queries, streams, exploit ample RAM, multi-core, leverage R/ScaLAPACK
Compiler versus interpreter?
Massive storage? POSIX vs HDFS
Fast external algorithms? Simple tasks
Parallel computation? Multi-core with threads, shared-nothing (embedded message-passing)
Exploiting new hardware? Interesting, difficult, customized
Analytics: queries, cubes, statistics, machine learning
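The parallel computation item above (threads on one machine, shared-nothing across nodes) often comes down to the same pattern: each worker summarizes its own partition, and the small partial results are merged. The following is a minimal sketch of that pattern, not the group's implementation. All function names are hypothetical, and Python threads are used only to show the structure; in CPython they do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, p):
    """Split values into p round-robin partitions (no data is shared)."""
    return [values[i::p] for i in range(p)]


def local_summary(part):
    """Summary computed independently on one partition: (count, sum)."""
    return (len(part), sum(part))


def merge(summaries):
    """Combine partial summaries; cost depends on p, not on the data size."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    return n, total


def parallel_mean(values, p=4):
    """Mean computed from per-partition summaries merged at the end."""
    parts = partition(values, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        summaries = list(pool.map(local_summary, parts))
    n, total = merge(summaries)
    if n == 0:
        raise ValueError("no input values")
    return total / n


if __name__ == "__main__":
    print(parallel_mean([1, 2, 3, 4, 5, 6, 7, 8], p=4))
```

Because only the (count, sum) pairs cross partition boundaries, the same structure would carry over to message passing between nodes.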
4
DB systems involve core CS research: theory + programming
Theory we use:
  Time complexity (big O) and I/O cost (disk, solid-state memory)
  Many data structures (arrays, trees, hash tables, linked lists)
  Relational model and information retrieval models
  Linear algebra
  Multivariate statistics, machine learning models
  Compilers and programming languages: parsing/compiling/optimizing code; recursion
Programming:
  Languages: C++ and Java combined with R, SQL, Scala
  Systems: parallel DBMSs, Spark
  OS: Unix, but we have a lot of past work on MS Windows
  Libraries: MPI, R
  Systems aspects: threads, binary I/O, parallel file systems, code generation, code optimization, interpreter runtime... a lot of fun.
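The binary I/O and I/O cost items above can be illustrated with a block-wise scan over fixed-width records. This is a minimal sketch under stated assumptions: the record layout (a 4-byte integer id plus an 8-byte float) and every function name are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io
import struct

# Hypothetical record layout: little-endian int32 id + float64 value (12 bytes).
RECORD = struct.Struct("<id")


def write_records(stream, rows):
    """Append (id, value) rows to a binary stream as fixed-width records."""
    for row_id, value in rows:
        stream.write(RECORD.pack(row_id, value))


def read_blocks(stream, records_per_block):
    """Yield lists of records, issuing one read call per block."""
    block_bytes = RECORD.size * records_per_block
    while True:
        buf = stream.read(block_bytes)
        if not buf:
            break
        yield list(RECORD.iter_unpack(buf))


def scan_sum(stream, records_per_block):
    """One sequential pass: returns (sum of values, number of block reads)."""
    total = 0.0
    reads = 0
    for block in read_blocks(stream, records_per_block):
        reads += 1
        total += sum(value for _, value in block)
    return total, reads


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for a binary file opened with mode "rb"/"wb"
    write_records(buf, [(i, float(i)) for i in range(10)])
    buf.seek(0)
    print(scan_sum(buf, records_per_block=4))
```

With n records and B records per block, the scan issues about n/B reads (rounded up), which is the quantity an I/O cost analysis counts.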
5
Typical Problems
Summarization for linear models: vector outer products
Exploration: cubes, lattices, ad-hoc queries
Graph transitive closure (linear recursion), clique enumeration
Bayesian models: MCMC, classification, regression, variable/feature selection
Our research covered the entire spectrum of data mining, from exploratory OLAP analysis up to predictive models.
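Summarization for linear models through vector outer products, mentioned above, can be sketched as a single pass that accumulates a count, a sum of points, and a sum of outer products. This is one common form of such a summary, shown as an assumption rather than the exact method used in this research. The names n, L, Q and both function names are chosen here for illustration.

```python
def summarize(points):
    """One pass over d-dimensional points.

    Returns (n, L, Q): the count, the sum of the points, and the sum of
    the outer products x * x^T (a d x d matrix stored as nested lists).
    """
    n = 0
    L = None
    Q = None
    for x in points:
        d = len(x)
        if L is None:
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for i in range(d):
            L[i] += x[i]
            for j in range(d):
                Q[i][j] += x[i] * x[j]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean and the population covariance from the summary alone."""
    if n == 0:
        raise ValueError("empty summary")
    d = len(L)
    mean = [v / n for v in L]
    cov = [[Q[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov


if __name__ == "__main__":
    n, L, Q = summarize([(1.0, 2.0), (3.0, 4.0)])
    print(mean_and_covariance(n, L, Q))
```

The summary has size O(d^2) regardless of the number of points, so models that depend only on these sums can be computed without a second scan of the data.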
6
Why join the parallel DBMS group?
Balance between theory (mathematics) and programming (C++)
Mature and stable CS research area
Great interaction between industry and academic research
Always looking for motivated students
Visit my web page, DBLP. Google “Ordonez SQL”, or stop by