„Big Data” – Benczúr András, MTA SZTAKI (Szeged, 23 March 2012)

1 „Big data” Benczúr András MTA SZTAKI

2 Big Data – the new hype
“Big data” is when the size of the data itself becomes part of the problem; data that becomes large enough that it cannot be processed using conventional methods.
Google sorts 1 PB in 33 minutes
Amazon S3 store contains 499B objects
New Relic: 20B+ application metrics/day
Walmart monitors 100M entities in real time
Source: the Emerging Big Data slide from the Intelligent Information Management DG INFSO/E2 Objective ICT Info Day in Luxembourg, 26 September 2011

3 Big Data planes: Fast Data, Big Analytics, Big Data Services

4 Overview
Introduction – buzzwords
Part I: Background – examples (mobility and navigation traces; sensors, smart city; IT logs; wind power); scientific and business relevance
Part II: Infrastructures – NoSQL, key-value stores, Hadoop, H*, Pregel, …
Part III: Algorithms – brief history of algorithms; Web processing, PageRank with algorithms; streaming algorithms; entity resolution – detailed comparison (QDB 2011)

5 Navigation and Mobility Traces
Streaming data at mobile base stations
Privacy issues: regulations allow only anonymized data to leave the operator beyond network operations and billing
Do the policy makers writing these regulations know about deanonymization attacks?
What your wife/husband will not know – your mobile provider will

6 Sensors – smart home, city, country, …
Road and parking slot sensors
Mobile parking traces
Public transport, Oyster cards
Bike hire schemes
Source: Internet of Things Comic Book

7 … even agriculture …

8 … and wind power stations

9 Corporate IT log processing
Our experience: GB/day, 3–60 M events
Aggregation into a data warehouse
Identify bottlenecks, optimize procedures
Detect misuse, fraud, attacks
Traditional methods fail

10 Scientific and business relevance
VLDB 2011 (~100 papers): 6 papers on MapReduce/Hadoop, 10 on big data (+ keynote), 11 on NoSQL architectures, 6 on GPS/sensory data; tutorials and demos (Microsoft, SAP, IBM NoSQL tools); session: Big Data Analysis, MapReduce, Scalable Infrastructures
EWEA 2011: 28% of papers on wind power raise data size issues
SIGMOD 2011: out of 70 papers, 10 on new architectures and extensions for analytics
Gartner 2011 trend No. 5: Next Generation Analytics – “significant changes to existing operational and business intelligence infrastructures”
The Economist: “Monstrous amounts of data … Information is transforming traditional businesses”; special issue on Big Data this April

11 New challenges in database technologies
Questions of research and practice: applicability to a specific problem? Applicability as a general technique?

12 Overview
Part I: Background – examples; scientific and business relevance
Part II: Infrastructures – NoSQL; key-value stores; Hadoop and tools built on Hadoop; Bulk Synchronous Parallel, Pregel; streaming, S4
Part III: Algorithms, examples

13 Next come several external slide shows …
NoSQL introduction – key-value stores
BerkeleyDB – not distributed
Voldemort – behemoth.strlen.net/~alex/voldemort-nosql_live.ppt
Cassandra, Dynamo, …
Also exists on a Hadoop basis (below): HBase
Hadoop – Miki Erdélyi's slides
HBase – datasearch.ruc.edu.cn/course/cloudcomputing20102/slides/Lec07.ppt
Cascading – not covered
Mahout – cwiki.apache.org/MAHOUT/faq.data/Mahout%20Overview.ppt
Why is something else needed? What else is needed?
Bulk Synchronous Parallel
GraphLab – Danny Bickson's slides
MOA – streaming
S4

14 Bulk Synchronous Parallel architecture
HAMA: a Pregel clone

15 Use of large matrices
The main step in all distributed algorithms:
Network-based features in classification
Partitioning for efficient algorithms
Exploring the data, navigation (e.g. ranking to select a nice compact subgraph)
Hadoop apps (e.g. PageRank) move the entire data around in each iteration; baseline C++ code keeps data local
Compared: Hadoop, Hadoop + key-value store, best custom C++ code

16 BSP vs. MapReduce
MapReduce: data locality is not preserved between Map and Reduce invocations or MapReduce iterations.
BSP: tailored towards processing data with locality.
Proprietary: Google Pregel. Open source (will be?? … several flaws now): HAMA. Home-developed C++ code base.
Both: easy parallelization and distribution.
Sidlo et al., Infrastructures and bounds for distributed entity resolution. QDB 2011

17 Overview
Part I: Background
Part II: Infrastructures – NoSQL, key-value stores, Hadoop, H*, Pregel, …
Part III: Algorithms, examples, comparison – data- and computation-intense tasks, architectures; history of algorithms; Web processing, PageRank with algorithms; entity resolution – detailed, with algorithms; summary, conclusions

18 Types of Big Data problems
Data intense: Web processing, information retrieval, classification; log processing (telco, IT, supermarket, …)
Compute intense: expectation maximization, Gaussian mixture decomposition, image retrieval, …; genome matching, phylogenetic trees, …
Data AND compute intense: network (Web, friendship, …) partitioning, finding similarities, centers, hubs, …; singular value decomposition

19 Hardware
Data intense: MapReduce (Hadoop), cloud, …
Compute intense: shared memory; message passing; processor arrays, … → recently became an affordable choice, as graphics co-processors!
Data AND compute intense??

20 Big data: why now?
Hardware is just getting better and cheaper – but data is getting larger and easier to access.
Bad news for algorithms slower than roughly linear.

21 Moore’s Law: doubling in 18 months. But in a key aspect the trend has changed: from clock speed to number of cores.

22 “Numbers Everyone Should Know” (Jeff Dean, Google)
RAM:
L1 cache reference – 0.5 ns
L2 cache reference – 7 ns
Main memory reference – 100 ns
Read 1 MB sequentially from memory – 250,000 ns
Intra-process communication:
Mutex lock/unlock – 100 ns
Network:
Read 1 MB sequentially from network – 10,000,000 ns
Disk:
Disk seek – 10,000,000 ns
Read 1 MB sequentially from disk – 30,000,000 ns
Typical capacities: disk 10+ TB; RAM 100+ GB; CPU L2 1+ MB, L1 10+ KB; GPU on-board memory: global 4–8 GB, block-shared 10+ KB
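As a quick sanity check, the per-megabyte latencies quoted above can be turned into end-to-end scan times. The sketch below uses only the slide's numbers; the medium labels are our own.

```python
# Back-of-envelope scan times from the latency numbers quoted above.
# Per-MB costs come from the slide; labels are illustrative.

NS_PER_MB = {"memory": 250_000, "network": 10_000_000, "disk": 30_000_000}

def scan_seconds(gigabytes, medium):
    """Seconds to read `gigabytes` sequentially from the given medium."""
    return gigabytes * 1024 * NS_PER_MB[medium] / 1e9

for medium, _ in sorted(NS_PER_MB.items(), key=lambda kv: kv[1]):
    print(f"1 GB from {medium}: {scan_seconds(1, medium):.2f} s")
# memory ~0.26 s, network ~10.2 s, disk ~30.7 s: disk is 120x slower than RAM
```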

23 Back to databases, this means …
Ideal: linear speed-up in the number of CPUs. In practice, speed-up is sub-linear: e.g. 1000 transactions/sec on 5 CPUs, where linear scaling would give 2000/sec on 10 CPUs, but reality reaches only 1600/sec even on 16 CPUs.
Costs of going distributed: security; integrity control more difficult; lack of standards; lack of experience; complexity of management and control; increased storage requirements; increased training cost.
Recall: read 1 MB sequentially – memory 250,000 ns, network 10,000,000 ns, disk 30,000,000 ns.
Connolly, Begg: Database Systems: A Practical Approach to Design, Implementation, and Management. International Computer Science Series, Pearson Education, 2005

24 “The brief history of algorithms”
P, NP; PRAM theoretic models
Thinking Machines: hypercube, …; CM-5: many vector processors
Cray: vector processors; SIMD, MIMD, message passing
External memory algorithms; flash disk
MapReduce (Google); multi-core, many-core; cloud

25 Earliest history: P, NP
P: graph traversal, spanning tree
NP: Steiner trees

26 Why do we care about graphs and trees?
Entity resolution: records “Mary Smith”, “M. Doe”, “Mary Doe”, “M. Smith”, matched by name and ID
Image segmentation

27 History of algorithms: spanning trees in parallel
Iterative minimum spanning forest: every node is a tree at the start; every iteration merges trees
Bentley: A parallel algorithm for constructing minimum spanning trees, 1980
Harish et al.: Fast Minimum Spanning Tree for Large Graphs on the GPU
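The merge-trees-each-iteration idea above is Borůvka's algorithm; a minimal sketch on an assumed toy graph:

```python
# Iterative minimum spanning forest (Boruvka): every node starts as its
# own tree; each iteration merges trees along their cheapest outgoing edge.

def boruvka_mst(n, edges):
    """edges: list of (weight, u, v) on vertices 0..n-1. Returns MST weight."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, components = 0, n
    while components > 1:
        cheapest = {}  # tree root -> best (weight, u, v) leaving the tree
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or (w, u, v) < cheapest[r]:
                    cheapest[r] = (w, u, v)
        if not cheapest:  # graph is disconnected: stop with a forest
            break
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # an edge may be cheapest for both of its trees
                parent[ru] = rv
                total += w
                components -= 1
    return total
```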

28 Overview
Part I: Background
Part II: Infrastructures – NoSQL, key-value stores, Hadoop, H*, Pregel, …
Part III: Algorithms, examples, comparison – data- and computation-intense tasks, architectures; history of algorithms; Web processing, PageRank with algorithms; streaming algorithms; entity resolution – detailed, with algorithms; summary, conclusions

29 The Web is about data too (posted by John Klossner, Aug 03, 2009)
WEB 1.0 (browsers) – users find data
WEB 2.0 (social networks) – users find each other
WEB 3.0 (semantic Web) – data find each other
WEB 4.0 – data create their own Facebook pages, restrict friends
WEB 5.0 – data decide they can work without humans, create their own language
WEB 6.0 – human users realize they can no longer find data unless invited by data
WEB 7.0 – data get cheaper cell phone rates
WEB 8.0 – data hoard all the good YouTube videos, leaving human users with access only to bad ’80s music videos
WEB 9.0 – data create and maintain their own blogs, which are more popular than human blogs
WEB 10.0 – all episodes of Battlestar Galactica will now be shown from the Cylons’ point of view
Big Data interpretation: recommenders, personalization, information extraction


31 Longitudinal Analytics of Web Archive Data Building a Virtual Web Observatory on large temporal data of Internet archives

32 Partner approaches to hardware
Hanzo Archives (UK): Amazon EC2 cloud + S3
Internet Memory Foundation: 50 low-end servers
We: indexing 3 TB compressed, 0.5B pages
Open-source tools not yet mature
One week of processing on 50 old dual-cores
Hardware worth approx. €10,000; Amazon price around €5,000

33 Text REtrieval Conference measurement
Documents stored in HBase tables over the Hadoop file system (HDFS)
Indexing: 200 instances of our own C++ search engine; 40 Lucene instances, then the top 50,000 hits with our own engine
SolR? Katta? Do they really work in real time? Ranking?
Realistic – even spam is important!
Obvious parallelization: each node processes all pages of one host; link features (e.g. PageRank) cannot be computed this way
M. Erdélyi, A. Garzó, and A. A. Benczúr: Web spam classification: a few features worth more (WebQuality 2011)

34 Distributed storage: HBase vs. WARC files
o WARC
o Many, many medium-sized files are very inefficient with Hadoop: either a huge block size wastes space, or data locality is lost as blocks may continue at a non-local HDFS node
o HBase
o Data-locality-preserving ranges – cooperation with Hadoop
o Experiments up to 3 TB of compressed Web data
o WARC to HBase
o A one-time expensive step with no data locality; one-by-one inserts fail with very low performance (HBase insertion rate: 100,000 per hour)
o Instead: MapReduce jobs create HFiles, the native HBase format, followed by HFile transfer

35 The Random Surfer Model
Nodes = Web pages, edges = hyperlinks
Starts at a random page – arrives at a quality page

36 PageRank: The Random Surfer Model
Chooses a random neighbor with probability 1 − ε

37 The Random Surfer Model
Or, with probability ε, “teleports” to a random page – gets bored and types a new URL

38 The Random Surfer Model
And continues with the random walk …

39 The Random Surfer Model
And continues with the random walk …

40 The Random Surfer Model
Until convergence … ? [Brin, Page 98]

41 PageRank as quality
A quality page is pointed to by several quality pages:
PR^(k+1) = PR^(k) ((1 − ε) M + ε · U) = PR^(1) ((1 − ε) M + ε · U)^k
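The power iteration PR^(k+1) = PR^(k)((1 − ε)M + ε·U) can be sketched directly; the 3-page graph below is assumed toy data, with ε the teleport probability.

```python
# Minimal power-iteration PageRank matching the slide's formula:
# new score = eps * uniform teleport + (1 - eps) * follow-a-link mass.

def pagerank(links, eps=0.15, iters=100):
    """links[u] = list of pages u points to; returns the PageRank vector."""
    n = len(links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [eps / n] * n                 # teleport part: eps * U
        for u, outs in enumerate(links):
            if not outs:                    # dangling node: spread uniformly
                for v in range(n):
                    nxt[v] += (1 - eps) * pr[u] / n
                continue
            share = (1 - eps) * pr[u] / len(outs)
            for v in outs:                  # link part: (1 - eps) * M
                nxt[v] += share
        pr = nxt
    return pr

ranks = pagerank([[1], [2], [0, 1]])  # 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
```

Page 1, pointed to by both 0 and 2, comes out on top, matching the "quality page is pointed to by quality pages" intuition.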

42 Personalized PageRank
Or, with probability ε, “teleports” to a random page – selected from her bookmarks

43 Algorithmics
An estimated 10+ billion Web pages worldwide; PageRank (as floats) fits into 40 GB of storage
Personalization just to single pages: 10 billion PageRank scores for each page – storage exceeds several exabytes!
NB: single-page personalization is enough – by linearity, personalization to any page set is a combination of single-page personalized PageRank vectors.
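The slide's storage arithmetic, spelled out (4-byte floats assumed):

```python
# Storage needed for global vs. fully personalized PageRank, 4-byte floats.
pages = 10 * 10**9                      # estimated 10+ billion Web pages
pagerank_bytes = pages * 4              # one float score per page
personalized_bytes = pages * pages * 4  # one full vector per personalization page

print(pagerank_bytes / 10**9, "GB")      # 40.0 GB for global PageRank
print(personalized_bytes / 10**18, "EB") # 400.0 EB for full personalization
```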

44 Are certain things just too big?
“For light to reach the other side of the Galaxy … takes rather longer: five hundred thousand years. The record for hitch hiking this distance is just under five years, but you don't get to see much on the way.” – D. Adams, The Hitchhiker's Guide to the Galaxy

45 Reformulation by simple tricks of linear algebra: Markov chain Monte Carlo
From u, simulate N independent random walks
Database of fingerprints: the ending vertices of the walks from all vertices
Query: PPR(u,v) := #(walks u → v) / N
N ≈ 1000 approximates the top 100 well
Fogaras–Rácz: Towards Scaling Fully Personalized PageRank, WAW 2004
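A minimal sketch of the fingerprint idea above: from source u, run N independent ε-terminated random walks and estimate PPR(u, v) as the fraction of walks ending at v. The toy graph and all parameter names are our own.

```python
# Monte Carlo personalized PageRank fingerprints: the walk continues with
# probability 1 - eps and stops (teleports back to u) with probability eps;
# the endpoint distribution of the walks approximates PPR(u, .).
import random

def ppr_fingerprints(links, u, n_walks=1000, eps=0.15, seed=0):
    """links[v] = out-neighbours of v (assumed non-empty for visited nodes)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_walks):
        v = u
        while rng.random() > eps:          # continue the walk w.p. 1 - eps
            v = rng.choice(links[v])
        counts[v] = counts.get(v, 0) + 1   # fingerprint: store the endpoint
    return {v: c / n_walks for v, c in counts.items()}

est = ppr_fingerprints([[1], [2], [0]], u=0)  # 3-cycle toy graph
```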

46 SimRank: similarity in graphs
“Two pages are similar if pointed to by similar pages” [Jeh–Widom KDD 2002]
Same trick: path-pair summation (can be sampled [Fogaras–Rácz WWW 2005]) over pairs of paths u = w_0, w_1, …, w_{k−1}, w_k = v_2 and u = w'_0, w'_1, …, w'_{k−1}, w'_k = v_1
DB application, e.g.: Yin, Han, Yu. LinkClus: efficient clustering via heterogeneous semantic links, VLDB '06
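A direct iterative sketch of the SimRank recursion ("similar if pointed to by similar pages"); toy-sized and O(n²) per iteration, for illustration only. The decay constant C and the in-neighbour representation are assumptions of this sketch.

```python
# Iterative SimRank: s(a, a) = 1, and for a != b
#   s(a, b) = C / (|I(a)| * |I(b)|) * sum over in-neighbour pairs s(i, j).

def simrank(in_nbrs, C=0.8, iters=10):
    """in_nbrs[v] = list of in-neighbours of v; returns similarity matrix."""
    n = len(in_nbrs)
    s = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iters):
        nxt = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
        for a in range(n):
            for b in range(n):
                if a == b or not in_nbrs[a] or not in_nbrs[b]:
                    continue  # nodes with no in-links stay dissimilar
                total = sum(s[i][j] for i in in_nbrs[a] for j in in_nbrs[b])
                nxt[a][b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        s = nxt
    return s

# 1 and 2 are both pointed to by 0, so they come out similar (score C)
sim = simrank([[], [0], [0]])
```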

47 Communication complexity bounding
Bit-vector probing (BVP): Alice has a bit vector x = (x_1, x_2, …, x_m); Bob has a number 1 ≤ k ≤ m and must learn x_k after Alice sends B bits.
Theorem: B ≥ m for any protocol.
Reduction from BVP to Exact-PPR-compare: Alice interprets x as a graph G with V vertices, where V^2 = m, and pre-computes exact PPR data of size D; Bob maps k to vertices u, v, w and asks PPR(u,v) ? PPR(u,w) to recover x_k.
Thus D = B ≥ m = V^2.

48 Theory of streaming algorithms
Distinct values example – Motwani slides
Sequential RAM algorithms
External memory algorithms
Sampling: a negative result
The “sketching” technique
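In the spirit of the distinct-values example referenced above, here is a sketch-style estimator (Flajolet–Martin flavour): hash each item, track the maximum number of trailing zero bits seen, and estimate the distinct count as 2^max. The names and the toy stream are our own illustration; real deployments average many hash functions to tame the variance.

```python
# Streaming distinct-count sketch: constant memory regardless of stream
# length, because only the maximum trailing-zero count is retained.
import hashlib

def trailing_zeros(x, width=32):
    return width if x == 0 else (x & -x).bit_length() - 1

def fm_estimate(stream):
    max_tz = -1
    for item in stream:
        # hash the item to a pseudo-random 32-bit value
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        max_tz = max(max_tz, trailing_zeros(h))
    return 0 if max_tz < 0 else 2 ** max_tz

# duplicates do not change the sketch - only distinct items matter
same = fm_estimate(["a", "b", "a", "b", "a"]) == fm_estimate(["a", "b"])
```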

49 Overview
Part I: Background – examples; scientific and business relevance
Part II: Foundations illustrated – data- and computation-intense tasks, architectures; history of algorithms; Web processing, PageRank with algorithms; streaming algorithms; entity resolution – detailed, with algorithms; summary, conclusions

50 Distributed computing paradigms and tools
Distributed key-value stores: distributed B-tree index for all attributes (Project Voldemort)
MapReduce: map → reduce operations (Apache Hadoop)
Bulk Synchronous Parallel: supersteps of computation → communication → barrier sync (Apache Hama)

51 Entity resolution
Records with attributes A_1, A_2, A_3: r_1 “Mary Major”, r_2 “J. Doe”, r_3 “John Doe”, r_4 “Richard Miles”, r_5 “Richard G. Miles”
Entities formed by sets of records: e_1, e_2, e_3
Entity resolution problem = partition the records into customers, …

52 A communication complexity lower bound
Set intersection cannot be decided by communicating fewer than Θ(n) bits [Kalyanasundaram, Schnitger 1992]
Implication: if the data sits on multiple servers, Θ(n) communication is needed just to decide whether one node may hold a duplicate of another node's record; the best we can do is communicate all the data
Related area: locality-sensitive hashing – no LSH exists for “minimum”, i.e. for deciding whether two attributes agree; similar to negative results on Donoho's zero “norm” (the number of non-zero coordinates)

53 Wait – how about blocking?
Blocking speeds up shared-memory parallel algorithms [many in the literature, e.g. Whang, Menestrina, Koutrika, Theobald, Garcia-Molina: ER with Iterative Blocking, 2009]
Even if we could partition the data with no duplicates split, the lower bound applies just to test that we are done
Still, blocking is a good idea – otherwise we may have to communicate much more than Θ(n) bits

54 Distributed key-value store (KVS)
The record graph can be served from the KVS: the KVS mainly provides random access to a huge graph that would not fit in main memory
Computing nodes plus many indexing nodes implement a graph traversal
Breadth-first search (BFS) with queues; spanning forest with union-find – basic textbook algorithms [Cormen–Leiserson–Rivest, …]
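A miniature of the union-find spanning forest the slide mentions, grouping matched record pairs into entities; the record ids and match pairs are assumed toy data.

```python
# Textbook disjoint-set (union-find) used to partition records into
# entities from pairwise match decisions.

class UnionFind:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[rx] = ry

records = ["r1", "r2", "r3", "r4", "r5"]
matches = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]  # pairwise match edges
uf = UnionFind(records)
for a, b in matches:
    uf.union(a, b)

entities = {}
for r in records:
    entities.setdefault(uf.find(r), []).append(r)  # one entity per root
```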

55 MapReduce (MR)
Sorting is the prime MR application: for each feature, sort to form a graph of records
MR has no data locality: all data is moved around the network in each iteration
Finding connected components of the graph: the textbook matrix power method can again be implemented in MR, but iterated matrix multiplication is not MR-friendly [Kang, Tsourakakis, Faloutsos. Pegasus framework, 2009]
This step moves huge amounts around – the whole data set, even if we have only small components
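A MapReduce-flavoured connected components sketch illustrating the slide's point: every iteration re-emits the current label of every vertex to all its neighbours (all data moves every round), then reduces by minimum. Simulated in-process on an assumed toy graph.

```python
# Label propagation connected components in map/reduce style:
# map emits (neighbour, my_label) for every edge, reduce takes the minimum.

def cc_mapreduce(adj):
    """adj: {vertex: [neighbours]}. Returns {vertex: component label}."""
    label = {v: v for v in adj}              # start with own id as label
    changed = True
    while changed:
        # map phase: every vertex ships its label across every edge
        emitted = [(v, label[v]) for v in adj]
        for v, nbrs in adj.items():
            emitted.extend((w, label[v]) for w in nbrs)
        # reduce phase: new label of v = minimum label emitted for v
        nxt = {}
        for v, lab in emitted:
            nxt[v] = min(lab, nxt.get(v, lab))
        changed = nxt != label
        label = nxt
    return label

comps = cc_mapreduce({1: [2], 2: [1, 3], 3: [2], 4: []})
```

Note that the full edge list is re-emitted in every iteration even when only one label still has to move, which is exactly the inefficiency the slide complains about.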

56 Bulk Synchronous Parallel (BSP)
One master node plus several processing nodes perform a merge sort
Algorithm: resolve local data at each node locally; send attribute values in sorted order; the central server identifies and sends back candidates; find connected components – this is the prominent BSP application, and it is very fast
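A superstep-flavoured sketch of the plan above, with toy data and all names our own: each node resolves locally, ships its attribute values in sorted order, and the master merges the sorted runs, flagging values seen on more than one node as cross-node duplicate candidates.

```python
# Master-side merge of per-node sorted attribute runs; a value occurring
# on two different nodes becomes a candidate for cross-node resolution.
import heapq

node_data = {                      # node id -> locally resolved, sorted values
    "n1": ["doe", "major", "miles"],
    "n2": ["doe", "smith"],
}

def cross_node_candidates(node_data):
    runs = [[(v, n) for v in vals] for n, vals in node_data.items()]
    seen, candidates = {}, set()
    for value, node in heapq.merge(*runs):   # master-side merge sort
        if value in seen and seen[value] != node:
            candidates.add(value)            # same value on two nodes
        seen[value] = node
    return candidates
```

Because each run arrives pre-sorted, the master only does a streaming merge, which is what makes the BSP variant fast.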

57 Experiments: scalability
15 older blade servers, 4 GB memory and a 3 GHz CPU each; insurance client dataset (~2 records per entity)

58 Experiments: scalability

59 Experiments: scalability – Hadoop phases vs. HAMA phases

60 Conclusions
Big Data is founded on several subfields:
Architectures – processor arrays and many-core now affordable
Algorithms – design principles from the 1990s
Databases – distributed, column-oriented, NoSQL
Data mining, information retrieval, machine learning, networks – for the top applications
Hadoop and stream processing are the two main efficient techniques, with limitations for data AND compute intense problems; many alternatives (e.g. BSP) are emerging
Selecting the right architecture is often the question

61 Questions?
András Benczúr
Head, Informatics Laboratory
Institute for Computer Science and Control, Hungarian Academy of Sciences

