Published by Arleen Clarke. Modified over 8 years ago.
1
András Benczúr, benczur@sztaki.mta.hu
Head, “Big Data – Momentum” Research Group
Big Data Analytics
http://datamining.sztaki.hu/
Institute for Computer Science and Control, Hungarian Academy of Sciences
in collaboration with Volker Markl, TU Berlin
2
Data Management vs. Data Analytics
Data Management: traditional gathering, access, search
Analytics (Machine Learning): insights, predictions
3
Deep Analytics needs
4
Twitter Example: Meryl Streep – Oscar, 2012
6
Image: http://mirror.co.uk
Twitter Example: Meryl Streep – Oscar, 2012
8
Image: http://bbc.com
Twitter Example: Meryl Streep – Oscar, 2012
10
What was the Analytics Challenge?
One year of tweets: a collection of 1 billion tweets, 100GB
Ad hoc queries (Meryl Streep) may have 100,000+ hits
Fast response needed to support the analyst
Solutions
o In-memory databases (SAP HANA, …) – cost and physical limitations
o Customized approximate data structures (Bloom filters, MinHash fingerprints)
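As a rough illustration of the approximate data structures mentioned on this slide, the following is a minimal Bloom filter sketch over tweet terms. The class name, the bit-array size, the number of hash functions, and the sample tweets are all assumptions made for the example, not details from the talk. MinHash is not shown.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_term_filter(tweets):
    """Index every lowercased term of every tweet in one filter."""
    term_filter = BloomFilter()
    for text in tweets:
        for term in text.lower().split():
            term_filter.add(term)
    return term_filter


# Hypothetical sample data.
tweets = ["Meryl Streep wins the Oscar", "Best actress goes to Meryl Streep"]
term_filter = build_term_filter(tweets)
print("streep" in term_filter)
```

A filter like this could be kept per time window or per data partition, so that an ad hoc query only scans the partitions whose filter reports a possible match. Memory use is fixed by `num_bits` regardless of how many terms are added, at the cost of occasional false positives.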
11
Need for Networked Analytics
12
Information in interconnectivity
Number and influence, impressibility of followers, tweets
Statistical properties of temporal dynamics and the number of users reached by messages
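One of the quantities named here, the number of users reached by a message, can be sketched as a set union over the follower graph. The function name, the graph representation, and the sample users below are hypothetical. The definition of "reached" (followers of the author and of every retweeter) is an assumption for illustration.

```python
def users_reached(followers, author, retweeters):
    """Distinct users who could see a message via the author or a retweeter."""
    spreaders = {author, *retweeters}
    reached = set()
    for user in spreaders:
        reached |= followers.get(user, set())
    # Do not count the spreaders themselves as audience.
    return reached - spreaders


# Hypothetical follower graph: user -> set of that user's followers.
followers = {
    "alice": {"bob", "carol"},
    "bob": {"carol", "dave", "erin"},
    "erin": {"frank"},
}
audience = users_reached(followers, "alice", ["bob"])
print(len(audience))
```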
13
Predictive Claims Processing
Hungarian insurance company cases
Rule generation (incl. social media)
Feature engineering
Machine learning & alert generation
[Figure labels: Days since contract, Known fraud, Normal sample]
14
A distance of 3-4 transactions raises the flag
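One possible reading of this rule is that an entity is flagged when it lies within a few transaction hops of a known fraud case. The sketch below assumes that reading and implements it as a multi-source breadth-first search. The graph, the function names, and the threshold of 4 hops are illustrative assumptions.

```python
from collections import defaultdict, deque


def distance_to_known_fraud(graph, known_fraud, max_hops):
    """Multi-source BFS: hops to the nearest known-fraud node, up to max_hops."""
    dist = {node: 0 for node in known_fraud}
    queue = deque(known_fraud)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def flagged(graph, known_fraud, max_hops=4):
    """Entities within max_hops transactions of a known fraud case."""
    dist = distance_to_known_fraud(graph, known_fraud, max_hops)
    return {node for node, d in dist.items() if 0 < d <= max_hops}


# Hypothetical transaction chain A-B-C-D-E-F, stored as an undirected graph.
graph = defaultdict(set)
for a, b in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]:
    graph[a].add(b)
    graph[b].add(a)

print(sorted(flagged(graph, {"A"})))
```

In this chain, "F" is five transactions away from the known fraud case "A" and so is not flagged.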
15
Need for Real Time Analytics
16
Software AND Human Latencies
17
Deep Analysis of Big Data is Key to Competitiveness
18
Data Science: Deep Analytics + Big Data
19
Data Scientist magic triangle
Corners: Application; Scalable Data Management; Machine Learning, Statistics, Data Analysis
Center: Data Science
Topics (ML, DM): Control Flow, Iterative Algorithms, Error Estimation, Active Sampling, Sketches, Curse of Dimensionality, Decoupling, Convergence, Monte Carlo, Mathematical Programming, Linear Algebra, Stochastic Gradient Descent, Regression, Statistics, Hashing, Parallelization, Query Optimization, Fault Tolerance, Relational Algebra / SQL, Scalability, Data Analysis Language, Compiler, Memory Management, Memory Hierarchy, Data Flow, Hardware Adaptation, Indexing, Resource Management, NF²/XQuery, Data Warehouse/OLAP
Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Real-Time
20
Apache Flink: the emerging European tool
TU Berlin / DFKI (DE), SICS (SE), SZTAKI (HU)
21
Data Scientist Supply Chain
23
Registered teams - affiliation
24
Extra Slide: Fully Distributed Modeling
Needs no central service – suitable for:
o Ad hoc networks
o Privacy requirements
Model delta updates are sent to peers
Results for applicability in:
o Classification
o Recommender Systems
[Figure labels: R, P1, Q1, P2, Q2 + Q, Measurement]
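To make the idea of learning without a central service concrete, here is a minimal gossip-style sketch. It deliberately swaps in a simpler variant than the one on the slide: instead of sending model deltas, each peer sends its whole model to a random peer, which averages it with its own. The 1-D linear model, the data, the learning rate, and all function names are assumptions for illustration only.

```python
import random


def local_sgd_step(w, data, lr):
    """One SGD pass for the 1-D model y ~ w * x on a peer's private data."""
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
    return w


def gossip_train(peer_data, rounds=100, lr=0.05, seed=0):
    """Each peer trains locally, then pushes its model to one random peer."""
    rng = random.Random(seed)
    num_peers = len(peer_data)
    weights = [0.0] * num_peers
    for _ in range(rounds):
        for i, data in enumerate(peer_data):
            weights[i] = local_sgd_step(weights[i], data, lr)
            # Pick a random other peer; no coordinator is involved.
            j = rng.choice([k for k in range(num_peers) if k != i])
            # The receiver merges by averaging; raw data never leaves a peer.
            weights[j] = (weights[j] + weights[i]) / 2
    return weights


# Hypothetical private data sets, all generated from y = 2 * x.
peer_data = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.0, 2.0)],
    [(2.0, 4.0), (3.0, 6.0)],
]
weights = gossip_train(peer_data)
print([round(w, 3) for w in weights])
```

Only model parameters travel between peers in this sketch, which is what makes the approach attractive under the privacy requirements listed above.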
25
Conclusions
Software Latency
o For data streaming solutions, we have to combine
  - Batch pre-computed models updated in real time (lambda architecture)
  - Very low memory data approximation
  - Carefully selected database operations to optimize communication
o Machine learning, prediction, classification made
  - Highly time sensitive, streaming
  - Fully distributed: each element learns by passing model error to peers
Human Latency
o Shortage of Data Scientists worldwide
o Needs training AND systems with reduced learning curve
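The lambda architecture mentioned in the conclusions can be sketched, under strong simplifying assumptions, as a batch view plus a real-time view that are merged at query time. The class below uses term counts as a stand-in for a model. The class name, its methods, and the in-memory event log are hypothetical, and a real batch layer would recompute its view from a durable master data set.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture: batch view + speed view, merged on query."""

    def __init__(self):
        self.batch_view = Counter()  # precomputed by the (slow) batch layer
        self.speed_view = Counter()  # updated per event by the speed layer
        self.pending = []            # events not yet absorbed by a batch run

    def ingest(self, term):
        # Real-time path: the event is visible to queries immediately.
        self.pending.append(term)
        self.speed_view[term] += 1

    def run_batch(self):
        # Batch path: fold pending events into the batch view, then reset
        # the speed view, since the batch view now covers those events.
        self.batch_view.update(self.pending)
        self.pending.clear()
        self.speed_view.clear()

    def query(self, term):
        # Serving layer: merge both views.
        return self.batch_view[term] + self.speed_view[term]


counter = LambdaCounter()
for _ in range(3):
    counter.ingest("streep")
before_batch = counter.query("streep")
counter.run_batch()
counter.ingest("streep")
after_batch = counter.query("streep")
print(before_batch, after_batch)
```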