Benczúr András MTA SZTAKI


1 Benczúr András MTA SZTAKI
„Big data” Benczúr András MTA SZTAKI

2 Big Data – the new hype
“Big data” is when the size of the data itself becomes part of the problem; data that becomes large enough that it cannot be processed using conventional methods.
Google sorts 1PB in 33 minutes
Amazon S3 store contains 499B objects
New Relic: 20B+ application metrics/day
Walmart monitors 100M entities in real time
Source: The Emerging Big Data slide from the Intelligent Information Management DG INFSO/E2 Objective ICT Info day in Luxembourg on 26 September 2011

3 Big Data Planes
[Chart: tools and applications placed along two axes, Speed (batch → real time: “Fast data”) and Size (MegaByte → PetaByte: “Big analytics”). Tools: SPSS, SAS, R, Matlab, Revolution, SciPy, Mahout, GraphLab, MOA, S4, KDB, Esper, InfoBright, MySql, Hadoop, MapR, HBase, Netezza, Vertica, Greenplum, Progress, custom hardware, custom software, Big Data Services. Applications: power station sensors, IT logs, Web content, fraud detection, online reputation, news curation, navigation, mobility, media pricing.]

4 Overview Introduction – Buzzwords Part I: Background
Examples Mobility and navigation traces Sensors, smart city IT logs Wind power Scientific and Business Relevance Part II: Infrastructures NoSQL, Key-value store, Hadoop, H*, Pregel, … Part III: Algorithms Brief History of Algorithms Web processing, PageRank with algorithms Streaming algorithms Entity resolution – detailed comparison (QDB 2011)

5 Navigation and Mobility Traces
Streaming data at mobile base stations
Privacy issues: regulations allow only anonymized data to leave the network operations and billing domain
Do regulation policy makers know about deanonymization attacks?
What your wife/husband will not know – your mobile provider will

6 Sensors – smart home, city, country, …
Road and parking slot sensors Mobile parking traces Public transport, Oyster cards Bike hire schemes Source: Internet of Things Comic Book

7 … even agriculture …

8 … and wind power stations

9 Corporate IT log processing
Traditional methods fail – aggregation into a Data Warehouse
Our experience: GB/day, 3–60M events
Goals: identify bottlenecks, optimize procedures, detect misuse, fraud, attacks

10 Scientific and business relevance
VLDB 2011 (~100 papers): 6 papers on MapReduce/Hadoop, 10 on big data (+ keynote), 11 on NoSQL architectures, 6 on GPS/sensory data; tutorials and demos (Microsoft, SAP, IBM NoSQL tools); sessions: Big Data Analysis, MapReduce, Scalable Infrastructures
EWEA 2011: 28% of papers on wind power raise data size issues
SIGMOD 2011: out of 70 papers, 10 on new architectures and extensions for analytics
Gartner 2011 trend No. 5: Next Generation Analytics - „significant changes to existing operational and business intelligence infrastructures”
The Economist: „Monstrous amounts of data … Information is transforming traditional businesses”; special issue on Big Data this April

11 New challenges in database technologies
Question of research and practice: Applicability to a specific problem? Applicability as a general technique?

12 Overview Part I: Background Part II: Infrastructures
Examples Scientific and Business Relevance Part II: Infrastructures NoSQL Key-value stores Hadoop and tools built on Hadoop Bulk Synchronous Parallel, Pregel Streaming, S4 Part III: Algorithms, Examples

13 Many external slide shows come next …
NoSQL introduction – Key-value stores: BerkeleyDB – not distributed; Voldemort – Cassandra, Dynamo, …; also exists on top of Hadoop (below): HBase
Hadoop – Miki Erdélyi's slides
HBase – Cascading – will not be covered
Mahout – Why is something else needed? What else is needed?
Bulk Synchronous Parallel
GraphLab – Danny Bickson's slides
MOA – Streaming
S4 –

14 Bulk Synchronous Parallel architecture
HAMA: a Pregel clone

15 Use of large matrices
Main step in all distributed algorithms: network based features in classification; partitioning for efficient algorithms; exploring the data, navigation (e.g. ranking to select a nice compact subgraph)
Hadoop apps (e.g. PageRank) move the entire data around in each iteration; baseline C++ code keeps data local
[Chart legend: Hadoop; Hadoop + KeyValue store; Best C++ custom code]

16 BSP vs. MapReduce
MapReduce: data locality is not preserved between Map and Reduce invocations or across MapReduce iterations
BSP: tailored towards processing data with locality; proprietary: Google Pregel; open-source (will be?? … several flaws now): HAMA; our home-developed C++ code base
Both: easy parallelization and distribution
Sidlo et al., Infrastructures and bounds for distributed entity resolution. QDB 2011

17 Overview Part I: Background Part II: Infrastructures
NoSQL, Key-value store, Hadoop, H*, Pregel, … Part III: Algorithms, Examples, Comparison Data and computation intense tasks, architectures History of Algorithms Web processing, PageRank with algorithms Entity resolution – detailed with algorithms Summary, conclusions

18 Types of Big Data problems
Data intense: Web processing, info retrieval, classification; log processing (telco, IT, supermarket, …)
Compute intense: Expectation Maximization, Gaussian mixture decomposition, image retrieval, …; genome matching, phylogenetic trees, …
Data AND compute intense: network (Web, friendship, …) partitioning, finding similarities, centers, hubs, …; singular value decomposition

19 Hardware
Data intense: Map-reduce (Hadoop), cloud, …
Compute intense: shared memory, message passing, processor arrays, … → recently became an affordable choice, as graphics co-processors! Our rack currently: 150TB disk / 500GB RAM but only 200 cores
Data AND compute intense??

20 Big data: Why now? Hardware is just getting better, cheaper?
But data is getting larger, easier to access Bad news for algorithms slower than ~ linear

21 Moore’s Law: doubling in 18 months
But in a key aspect the trend has changed: from clock speed to number of cores

22 „Numbers Everyone Should Know”
Storage sizes: disk 10+ TB; RAM 100+ GB; CPU L2 cache 1+ MB; L1 cache 10+ KB; GPU onboard memory: global GB-scale, block-shared 10+ KB
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock (intra-process communication): 100 ns
Main memory reference: 100 ns
Read 1 MB sequentially from memory: 250,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Jeff Dean, Google

23 Back to Databases, this means …
Read 1 MB sequentially: memory 250,000 ns; network 10,000,000 ns; disk 30,000,000 ns
[Diagram: a single CPU with one shared memory vs. several CPU + memory pairs]
Drawbacks of distribution: cost; security; integrity control more difficult; lack of standards; lack of experience; complexity of management and control; increased storage requirements; increased training cost
[Chart: transactions/second vs. number of CPUs – linear speed-up (ideal): 1000/sec at 5 CPUs, 2000/sec at 10 CPUs; sub-linear speed-up: 1600/sec at 16 CPUs]
Connolly, Begg: Database Systems: A Practical Approach to Design, Implementation, and Management, International Computer Science Series, Pearson Education, 2005

24 “The brief history of Algorithms”
[Timeline sketch: P, NP; PRAM theoretic models; external memory algorithms; SIMD, MIMD, message passing; Thinking Machines: hypercube, …; CM-5: many vector processors; Cray: vector processors; Map-reduce (Google); multi-core; many-core; cloud; flash disk]

25 Earliest history: P, NP
P: graph traversal, spanning tree
NP: Steiner trees
[Figure: example graph with edge weights 1, 25, 15, 5, 2]

26 Why do we care about graphs, trees?
Image segmentation
Entity resolution, e.g. records (name, ID): Mary Smith 50071; Mary Doe 50071; M. Doe 79216; M. Smith 34302

27 History of algs: spanning trees in parallel
Iterative minimum spanning forest: every node is a tree at start; every iteration merges trees
Bentley: A parallel algorithm for constructing minimum spanning trees, 1980
Harish et al.: Fast Minimum Spanning Tree for Large Graphs on the GPU, 2009
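The iterative merging scheme above is essentially Borůvka's algorithm. A minimal sequential sketch (toy edge list and function name assumed; not the cited parallel or GPU implementations):

```python
# Boruvka-style minimum spanning forest: every vertex starts as its
# own tree; each round every tree merges along its cheapest outgoing
# edge. The rounds are what parallel/GPU versions distribute.

def boruvka_msf(n, edges):
    """n vertices 0..n-1, edges given as (weight, u, v) triples."""
    parent = list(range(n))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        cheapest = {}                     # component root -> best edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                  # already in the same tree
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        if not cheapest:
            break                         # no tree can grow: done
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:                  # merge the two trees
                parent[ru] = rv
                forest.append((w, u, v))
    return forest

edges = [(1, 0, 1), (3, 1, 2), (2, 0, 2), (4, 2, 3)]
print(boruvka_msf(4, edges))              # 3 edges, total weight 7
```

Note that each round is embarrassingly parallel over components, which is why this formulation suits GPUs and clusters better than Prim's single growing tree.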

28 Overview Part I: Background Part II: Infrastructures
NoSQL, Key-value store, Hadoop, H*, Pregel, … Part III: Algorithms, Examples, Comparison Data and computation intense tasks, architectures History of Algorithms Web processing, PageRank with algorithms Streaming algorithms Entity resolution – detailed with algorithms Summary, conclusions

29 The Web is about data too
Big Data interpretation: recommenders, personalization, info extraction
Posted by John Klossner on Aug 03, 2009:
WEB 1.0 (browsers) – Users find data
WEB 2.0 (social networks) – Users find each other
WEB 3.0 (semantic Web) – Data find each other
WEB 4.0 – Data create their own Facebook page, restrict friends.
WEB 5.0 – Data decide they can work without humans, create their own language.
WEB 6.0 – Human users realize that they no longer can find data unless invited by data.
WEB 7.0 – Data get cheaper cell phone rates.
WEB 8.0 – Data horde all the good YouTube videos, leaving human users with access to bad '80′s music videos only.
WEB 9.0 – Data create and maintain own blogs, are more popular than human blogs.
WEB 10.0 – All episodes of Battlestar Galactica will now be shown from the Cylons' point of view.


31 Building a Virtual Web Observatory on large temporal data of Internet archives

32 Partner approaches to hardware
Hanzo Archives (UK): Amazon EC2 cloud + S3
Internet Memory Foundation: 50 low-end servers
We: indexing 3TB compressed, 0.5B pages
Open source tools are not yet mature: one week of processing on 50 old dual cores
Hardware worth approx. €10,000; the Amazon price is around €5,000

33 Text REtrieval Conference measurement
Documents stored in HBase tables over the Hadoop file system (HDFS)
Indexing: 200 instances of our own C++ search engine; 40 Lucene instances, then the top 50,000 hits with our own engine
SolR? Katta? Do they really work in real time? Ranking?
Realistic – even spam is important!
Obvious parallelization: each node processes all pages of one host
Link features (e.g. PageRank) cannot be computed in this way
M. Erdélyi, A. Garzó, and A. A. Benczúr: Web spam classification: a few features worth more (WebQuality 2011)

34 Distributed storage: HBase vs WARC files
Many medium-sized files are very inefficient with Hadoop: either a huge block size wastes space, or data locality is lost as blocks may continue at a non-local HDFS node
HBase: data-locality-preserving ranges, in cooperation with Hadoop
Experiments up to 3TB of compressed Web data
WARC to HBase: a one-time expensive step with no data locality; one-by-one inserts fail with very low performance; instead, MapReduce jobs create HFiles (the native HBase format), followed by HFile transfer
HBase insertion rate: 100,000 per hour

35 The Random Surfer Model
Nodes = Web pages, edges = hyperlinks. The surfer starts at a random page and arrives at a quality page u.

36 PageRank: The Random Surfer Model
Chooses a random neighbor of the current page u with probability 1 − ε

37 The Random Surfer Model
Or with probability ε “teleports” to a random page: gets bored and types a new URL

38 The Random Surfer Model
And continues with the random walk …

39 The Random Surfer Model
And continues with the random walk …

40 The Random Surfer Model
Until convergence … ? [Brin, Page 98]

41 PageRank as Quality
A quality page is pointed to by several quality pages
PR(k+1) = PR(k) · ((1 − ε) M + ε · U) = PR(1) · ((1 − ε) M + ε · U)^k
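The iteration above, written out as code on a toy three-node graph (the graph, the ε value, and all variable names are illustrative only):

```python
# Power iteration PR(k+1) = PR(k) ((1 - eps) M + eps U) on a toy graph.
# Instead of materializing M and U, push rank along out-links and add
# the uniform eps / n teleport term.

eps = 0.15
links = {0: [1, 2], 1: [2], 2: [0]}       # node -> out-neighbors
n = len(links)

pr = [1.0 / n] * n                        # PR(1): uniform start
for _ in range(100):                      # enough for convergence here
    nxt = [eps / n] * n                   # teleport part: eps * U
    for u, outs in links.items():
        for v in outs:                    # link part: (1 - eps) * M
            nxt[v] += (1 - eps) * pr[u] / len(outs)
    pr = nxt

print(pr)                                 # scores sum to 1
```

Node 2, pointed to by both other nodes, ends up with the highest score, illustrating the "quality pages point to quality pages" intuition.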

42 Personalized PageRank
Or with probability ε “teleports” to a random page selected from her bookmarks

43 Algorithmics Estimated 10+ billions of Web pages worldwide
PageRank (as floats) fits into 40GB of storage
Personalization just to single pages: 10 billion PageRank scores for each page – storage exceeds several exabytes!
NB single-page personalization is enough
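A back-of-the-envelope check of the two storage figures, assuming 4-byte floats (all numbers are the slide's rounded estimates):

```python
# One PageRank score per page vs. one full vector per personalization page.
PAGES = 10**10                      # "10+ billion" Web pages
BYTES_PER_SCORE = 4                 # a single float

global_pr = PAGES * BYTES_PER_SCORE             # one vector overall
personalized = PAGES * PAGES * BYTES_PER_SCORE  # one vector per page

print(global_pr / 10**9, "GB")      # 40.0 GB: fits on a disk
print(personalized / 10**18, "EB")  # 400.0 exabytes: hopeless to store
```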

44 For certain things are just too big?
For light to reach the other side of the Galaxy … takes rather longer: five hundred thousand years. The record for hitch hiking this distance is just under five years, but you don't get to see much on the way. D. Adams, The Hitchhiker's Guide to the Galaxy, 1979

45 Markov Chain Monte Carlo
Reformulation by simple tricks of linear algebra
From u, simulate N independent random walks
Database of fingerprints: the ending vertices of the walks from all vertices
Query: PPR(u,v) := #(walks u→v) / N
N ≈ 1000 approximates the top 100 well
Fogaras–Rácz: Towards Scaling Fully Personalized PageRank, WAW 2004
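A sketch of the fingerprinting idea (toy graph assumed; the cited system precomputes walks from all vertices, while this only samples from one source u):

```python
# From u, run N independent walks that terminate with probability eps
# at each step; the empirical distribution of the ending vertices
# estimates PPR(u, .).
import random

random.seed(0)
eps = 0.15
links = {0: [1, 2], 1: [2], 2: [0]}   # toy graph, node -> out-neighbors

def fingerprint(u):
    while random.random() > eps:      # continue with probability 1 - eps
        u = random.choice(links[u])
    return u                          # the walk's ending vertex

N = 1000
ends = [fingerprint(0) for _ in range(N)]
ppr = {v: ends.count(v) / N for v in links}
print(ppr)                            # PPR(0, .) estimates, summing to 1
```

Only the N ending vertices per source need storing, which is what makes the exabyte-scale exact table avoidable.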

46 SimRank: similarity in graphs
“Two pages are similar if pointed to by similar pages” [Jeh–Widom KDD 2002]
Same trick: summation over path pairs u = w0, w1, …, wk−1, wk = v1 and u = w′0, w′1, …, w′k−1, w′k = v2 (can be sampled [Fogaras–Rácz WWW 2005])
DB application, e.g.: Yin, Han, Yu: LinkClus: efficient clustering via heterogeneous semantic links, VLDB '06

47 Communication complexity bounding
Bit-vector probing (BVP): Alice has a bit vector x = (x1, x2, …, xm); Bob has a number 1 ≤ k ≤ m and must learn xk. Theorem: B ≥ m bits must be communicated by any protocol.
Reduction from BVP to Exact-PPR-compare: Alice encodes x as a graph G with V vertices, where V² = m, and pre-computes an Exact PPR data structure of size D. Bob maps his 1 ≤ k ≤ m to vertices u, v, w and asks the comparison PPR(u,v) ? PPR(u,w) to recover xk.
Thus D = B ≥ m = V².

48 Theory of Streaming algorithms
Distinct values example – Motwani slides
Sequential, RAM algorithms
External memory algorithms
Sampling: a negative result
The „sketching” technique
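As a taste of the sketching technique for the distinct-values example above, a single-hash Flajolet–Martin-style counter (real sketches average many hash functions; the names here are illustrative):

```python
# Hash every stream item and remember the largest number of trailing
# zero bits seen; 2^r is a crude estimate of the distinct count,
# using O(1) memory regardless of stream length.
import hashlib

def trailing_zeros(x, width=32):
    if x == 0:
        return width
    n = 0
    while x % 2 == 0:
        x //= 2
        n += 1
    return n

def fm_estimate(stream):
    r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        r = max(r, trailing_zeros(h))
    return 2 ** r

stream = [i % 100 for i in range(10_000)]  # 10,000 items, 100 distinct
print(fm_estimate(stream))                 # crude power-of-two estimate
```

The estimate depends only on the set of distinct items, so repeating items ten thousand times changes nothing: exactly the property an exact counter cannot have in constant memory.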

49 Overview Part I: Background Part II: Foundations Illustrated
Examples Scientific and Business Relevance Part II: Foundations Illustrated Data and computation intense tasks, architectures History of Algorithms Web processing, PageRank with algorithms Streaming algorithms Entity resolution – detailed with algorithms Summary, conclusions

50 Distributed Computing Paradigms and Tools
Distributed Key-Value Stores: distributed B-tree index for all attributes – Project Voldemort
MapReduce: map → reduce operations – Apache Hadoop
Bulk Synchronous Parallel: supersteps of computation → communication → barrier sync – Apache Hama

51 Entity resolution
Records with attributes – r1, …, r5:
r1: Mary Major 50071 … (e1)
r2: J. Doe 79216 … (e2)
r3: John Doe 79216 … (e2)
r4: Richard Miles 34302 … (e3)
r5: Richard G. Miles 34302 … (e3)
Entities formed by sets of records – e1, e2, e3
Entity Resolution problem = partition the records into customers, …

52 A communication complexity lower bound
Set intersection cannot be decided by communicating less than Θ(n) bits [Kalyanasundaram, Schnitger 1992]
Implication: if data sits on multiple servers, Θ(n) communication is needed to decide whether one node may have a duplicate with another node; the best we can do is communicate all the data
Related area: Locality Sensitive Hashing – there is no LSH for „minimum”, i.e. for deciding whether two attributes agree; similar to negative results on Donoho's zero „norm” (the number of non-zero coordinates)

53 Wait – how about blocking?
Blocking speeds up shared memory parallel algorithms [many in the literature, e.g. Whang, Menestrina, Koutrika, Theobald, Garcia-Molina: Entity Resolution with Iterative Blocking, 2009]
Even if we could partition the data so that no duplicate is split, the lower bound applies just to testing that we are done
Still, blocking is a good idea – without it we may have to communicate much more than Θ(n) bits

54 Distributed Key-Value Store (KVS)
The record graph can be served from the KVS: the KVS mainly provides random access to a huge graph that would not fit in main memory
Computing nodes plus many indexing nodes implement a graph traversal
Breadth-First Search (BFS) with queues; spanning forest with Union-Find – basic textbook algorithms [Cormen–Leiserson–Rivest, …]
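The Union-Find step in textbook form, applied to the record example of slide 51 (the indices and merge edges are illustrative):

```python
# Records r1..r5 as indices 0..4; each shared attribute value (here
# the matching IDs) merges two components, and the surviving roots
# are the resolved entities.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):                    # with path halving
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

uf = UnionFind(5)
uf.union(1, 2)                            # J. Doe / John Doe: ID 79216
uf.union(3, 4)                            # the two Richard Miles records
entities = {uf.find(r) for r in range(5)}
print(len(entities))                      # 3 resolved entities
```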

55 MapReduce (MR)
Sorting as the prime MR app: for each feature, sort to form a graph of records
MR has no data locality: all data is moved around the network in each iteration
Find connected components of the graph: again, the textbook matrix power method can be implemented in MR, but iterated matrix multiplication is not MR-friendly [Kang, Tsourakakis, Faloutsos: Pegasus framework, 2009]
This step moves huge amounts around: the whole data, even if we have only small components
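The connected-components step can be sketched as min-label propagation, each pass standing in for one MapReduce round (toy graph assumed; this in-memory loop illustrates only the iteration structure, not the shuffle cost):

```python
# Every vertex starts with its own label; in each round a vertex
# adopts the minimum label among itself and its neighbors. In MR,
# every round reshuffles all (vertex, label) pairs over the network.

def cc_labels(n, edges):
    label = list(range(n))
    changed = True
    while changed:                        # one MR iteration per pass
        changed = False
        for u, v in edges:
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
    return label

edges = [(0, 1), (1, 2), (3, 4)]
print(cc_labels(5, edges))                # [0, 0, 0, 3, 3]
```

The number of rounds grows with component diameter, and every round touches all labels: the "whole data moves even for small components" problem noted above.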

56 Bulk Synchronous Parallel (BSP)
One master node + several processing nodes perform a merge sort
Algorithm: resolve local data at each node locally; send attribute values in sorted order; the central server identifies and sends back candidates; find connected components
Connected components is the prominent BSP app, very fast

57 Experiments: scalability
15 older blade servers, 4GB memory, 3GHz CPU each; insurance client dataset (~2 records per entity)

58 Experiments: scalability
15 older blade servers, 4GB memory, 3GHz CPU each insurance client dataset (~ 2 records per entity)

59 Experiments: scalability
HAMA phases vs. Hadoop phases; 15 older blade servers, 4GB memory, 3GHz CPU each; insurance client dataset (~2 records per entity)

60 Conclusions Big Data is founded on several subfields
Architectures – processor arrays and many-core became affordable
Algorithms – design principles from the '90s
Databases – distributed, column oriented, NoSQL
Data mining, Information retrieval, Machine learning, Networks – for the top applications
Hadoop and stream processing are the two main efficient techniques
Limitations remain for data AND compute intense problems; many alternatives are emerging (e.g. BSP)
Selecting the right architecture is often the question

61 Head, Informatics Laboratory
Questions?
András Benczúr
Head, Informatics Laboratory
Institute for Computer Science and Control, Hungarian Academy of Sciences
