Slide 1: Make Sense of Big Data. Researched by JIANG Wen-rui, led by Prof. ZOU.
Slide 2: Three levels of Big Data
- Data Analysis (SaaS)
- Software Infrastructure (PaaS)
- Hardware Infrastructure (IaaS)
Slide 3: Contradiction between the first and second levels
- Data Analysis: Machine Learning, Data Warehouse, Statistics
- Software Infrastructure: MapReduce, Pregel, GraphLab, GraphBuilder, Spark
Slide 4: Evolution of Big Data technology
- Intelligence level: MLBase, Mahout, BC-PDM, graph applications, BDAS, Cloudera, GraphLab, Shark, Spark, Hive, Pig, Pregel, GraphBuilder
- Software architecture level: MapReduce, MapR, HBase, HDFS
Slide 5: The 4 Vs of Big Data: Volume, Variety, Velocity, Value
- Volume: Big Data is just that: data sets so massive that typical software systems are incapable of economically storing, let alone managing and computing, the information. A Big Data platform must capture and readily provide such quantities in a comprehensive and uniform storage framework to enable straightforward management and development.
- Variety: One of the tenets of Big Data is the exponential growth of unstructured data. The vast majority of data now originates from sources with either limited or variable structure, such as social media and telemetry. A Big Data platform must accommodate the full spectrum of data types and forms.
- Velocity: As organizations continue to seek new questions, patterns, and metrics within their data sets, they demand rapid and agile modeling and query capabilities. A Big Data platform should maintain the original format and precision of all ingested data to ensure full latitude for future analysis and processing cycles.
- Value: Driving relevant value, whether as revenue or cost savings, from data is the primary motivator for many organizations. The popularity of long-tail business models has forced companies to examine their data in detail to find the patterns, affiliations, and connections that drive these new opportunities.
Slide 6: Models compared: MapReduce vs. Pregel vs. GraphLab vs. Spark
- Google MapReduce: good at data-independent tasks, not at machine learning or graph processing (data-dependent and iterative tasks); based on acyclic data flow; "think like a key".
- Pregel: good at iterative and data-dependent computations, including graph processing; uses the BSP (Bulk Synchronous Parallel) model; a message-passing abstraction.
- CMU GraphLab: good at iterative and data-dependent computations, especially natural-graph problems; uses an asynchronous distributed shared-memory model; a shared-state abstraction; "think like a vertex".
- UC Berkeley BDAS Spark: good at iterative algorithms, interactive data mining, and OLAP reports; uses the RDD (resilient distributed datasets) abstraction, with in-memory cluster computing and a distributed-memory model.
Slide 17: GraphLab working patterns
- MR (Map-Reduce) functions: map_reduce_vertices, map_reduce_edges, transform_vertices, transform_edges
- GAS (Gather-Apply-Scatter) functions: gather_edges, gather, apply, scatter_edges, scatter (sketched below)
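To make the GAS pattern concrete, here is a minimal, framework-independent sketch in plain Python; the graph representation and helper names are illustrative assumptions, not the actual GraphLab API:

# Minimal sketch of the Gather-Apply-Scatter (GAS) pattern, using PageRank-style
# updates on a toy in-memory graph. Illustrative only; not the GraphLab API.

graph = {
    # vertex id: {"rank": current rank, "in": in-neighbors, "out": out-neighbors}
    1: {"rank": 1.0, "in": [],     "out": [2, 3]},
    2: {"rank": 1.0, "in": [1],    "out": [3]},
    3: {"rank": 1.0, "in": [1, 2], "out": []},
}

def gather(v):
    # Sum weighted rank contributions from the in-neighbors of v.
    return sum(graph[j]["rank"] / max(len(graph[j]["out"]), 1) for j in graph[v]["in"])

def apply(v, total):
    # Write the gathered sum back into the vertex state.
    graph[v]["rank"] = total

def scatter(v):
    # Report which neighbors should be re-scheduled (a real engine would signal them).
    return graph[v]["out"]

# One sweep over all vertices; a real run iterates until the ranks converge.
for v in graph:
    apply(v, gather(v))
    scatter(v)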
Slide 18: Distributed execution of a PowerGraph vertex-program
A high-degree vertex Y is split across machines 1 to 4: the master copy of Y lives on one machine and mirrors live on the others. In the gather phase each machine computes a partial sum (Σ1, Σ2, Σ3, Σ4) over its local edges; the partials are combined and applied at the master, and the updated value Y' is sent back to the mirrors before the scatter phase.
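The partial-sum idea can be sketched roughly as follows (hypothetical numbers and names; not the PowerGraph implementation): each machine gathers over the edges of Y that it owns, the master combines the partial sums and applies the update once, and the new value is replicated back to the mirrors.

# Sketch of vertex-cut execution for one high-degree vertex Y.
# Hypothetical per-machine partitions of Y's in-edges (rank contributions).
partitions = {
    "machine_1": [0.10, 0.05],
    "machine_2": [0.20],
    "machine_3": [0.08, 0.02, 0.15],
    "machine_4": [0.30],
}

# Gather: each machine computes a local partial sum in parallel (the Σk on the slide).
partial_sums = {m: sum(contribs) for m, contribs in partitions.items()}

# Apply: the master combines the partials and updates Y exactly once.
rank_Y = sum(partial_sums.values())

# Scatter: the updated value Y' is replicated to the mirrors, which then
# activate their local out-edges of Y.
mirrors = {m: rank_Y for m in partitions}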
Slide 19: GraphLab vs. Pregel: example
What is the popularity of this user? It depends on the popularity of her followers, which in turn depends on the popularity of their followers.
Slide 20: GraphLab vs. Pregel: PageRank
The rank of user i is a weighted sum of her neighbors' ranks; ranks are updated in parallel, and the update is iterated until convergence.
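In symbols (matching the pseudocode on the following slides, which omits the damping/reset term of full PageRank), the update is

    R_i \leftarrow \sum_{j \in \mathrm{in\_neighbors}(i)} w_{ji} \, R_j

computed in parallel for every vertex i and iterated until the ranks stop changing.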
Slide 21: The Pregel abstraction
Vertex-programs interact by sending messages.

Pregel_PageRank(i, messages):
    // Receive all the messages
    total = 0
    foreach (msg in messages):
        total = total + msg
    // Update the rank of this vertex
    R[i] = total
    // Send new messages to neighbors
    foreach (j in out_neighbors[i]):
        send msg(R[i] * w_ij) to vertex j

Malewicz et al. [PODC'09, SIGMOD'10]
Slide 22: The Pregel abstraction (BSP)
Each superstep consists of a compute phase, a communicate phase, and a global barrier.
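A minimal sketch of that superstep loop in plain Python (not Pregel's actual API; the compute callback and message format are assumptions):

# BSP superstep loop: every vertex computes, messages are exchanged, and a
# global barrier separates one superstep from the next.

def run_supersteps(vertices, compute, max_supersteps=30):
    """vertices: dict mapping vertex id -> mutable state.
    compute(v, state, msgs): returns a list of (destination, message) pairs."""
    inbox = {v: [] for v in vertices}
    for _ in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        # Compute phase: each vertex sees only messages from the previous superstep.
        for v, state in vertices.items():
            for dst, msg in compute(v, state, inbox[v]):
                outbox[dst].append(msg)
        # Communicate + barrier: messages become visible only in the next superstep.
        inbox = outbox
        if all(not msgs for msgs in inbox.values()):
            break  # no messages in flight; the computation has quiesced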
Slide 23: The GraphLab abstraction
Vertex-programs directly read their neighbors' state.

GraphLab_PageRank(i):
    // Compute sum over neighbors
    total = 0
    foreach (j in in_neighbors(i)):
        total = total + R[j] * w_ji
    // Update the PageRank
    R[i] = total
    // Trigger neighbors to run again
    if R[i] not converged then
        foreach (j in out_neighbors(i)):
            signal vertex-program on j

Low et al. [UAI'10, VLDB'12]
Slide 24: GraphLab execution: the scheduler
The scheduler determines the order in which vertices are executed. Vertices are pulled from the scheduler queue and updated on the available CPUs (CPU 1, CPU 2, ...), and an update may re-schedule neighboring vertices. The process repeats until the scheduler is empty.
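A minimal sketch of such a dynamic scheduler (a plain Python work queue with hypothetical names; GraphLab's real schedulers additionally handle parallel workers and consistency):

from collections import deque

def run_scheduler(initial_vertices, update):
    """update(v) runs the vertex-program on v and returns the vertices it signals."""
    queue = deque(initial_vertices)
    queued = set(initial_vertices)
    while queue:                      # the process repeats until the scheduler is empty
        v = queue.popleft()
        queued.discard(v)
        for u in update(v):           # an update may re-schedule its neighbors
            if u not in queued:       # avoid queuing the same vertex twice
                queued.add(u)
                queue.append(u)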
Slide 25: GraphLab vs. Pregel (BSP)
Benchmark figure: multicore PageRank on a graph with 25M vertices and 355M edges; 51% of the vertices are updated only once.
Slide 26: Graph-parallel abstractions (better for ML)
- Pregel: messaging abstraction, synchronous execution.
- GraphLab: shared-state abstraction, asynchronous execution.
Slide 27: Challenges of high-degree vertices
- Sends many messages (Pregel)
- Touches a large fraction of the graph (GraphLab)
- Edge meta-data too large for a single machine
- Edges are processed sequentially
- Asynchronous execution requires heavy locking (GraphLab)
- Synchronous execution is prone to stragglers (Pregel)
Slide 29: Berkeley Data Analytics Stack (BDAS), from top to bottom
- MLBase (on top of MapReduce, MPI, GraphLab, etc.): Value
- BlinkDB (approximate queries): Velocity
- Shark (Spark + Hive, SQL), Spark, shared RDDs (distributed memory): Variety
- Mesos (cluster resource manager), HDFS: Volume
Slide 30: Spark motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage (figure: MapReduce reading input and writing output through an acyclic flow).
Slide 31: Spark target workloads
- Iterative algorithms, including many machine learning algorithms and graph algorithms such as PageRank.
- Interactive data mining, where a user loads data into RAM across a cluster and queries it repeatedly.
- OLAP reports that run multiple aggregation queries on the same data.
Slide 32: Spark
Spark allows iterative computation on the same data, which would form a cycle if the jobs were visualized. Spark offers an abstraction called resilient distributed datasets (RDDs) to support these applications efficiently.
Slide 33: RDDs
- A Resilient Distributed Dataset (RDD) is an abstraction over raw data; parts of the data are kept in memory and cached for later use.
- Spark allows data to be kept in RAM, giving roughly a 20x speedup over disk-based MapReduce; in multi-pass analytics, RDDs let Spark outperform existing models by up to 100x.
- RDDs are immutable and are created through parallel transformations such as map, filter, groupBy and reduce (see the sketch below).
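As a small illustration of these transformations and of caching, assuming a PySpark setup; the file path and record layout below are made up:

# Minimal PySpark sketch of RDD transformations and in-memory caching.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

lines = sc.textFile("hdfs://namenode/logs/events.log")    # RDD of raw text lines
errors = lines.filter(lambda l: "ERROR" in l)              # lazy transformation
errors.cache()                                             # keep this RDD in cluster RAM

# Both actions below reuse the cached RDD instead of re-reading from disk.
print(errors.count())
by_code = (errors.map(lambda l: (l.split()[1], 1))         # assumes field 1 is an error code
                 .reduceByKey(lambda a, b: a + b))
print(by_code.collect())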
Slide 35: Logistic regression performance
Hadoop MapReduce: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).
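The workload behind these numbers is iterative logistic regression; a rough PySpark sketch (path, feature count, step size, and iteration count are assumptions) shows why only the first iteration pays the disk cost: the points RDD is cached and every later pass runs from memory.

# Sketch of iterative logistic regression on a cached RDD (PySpark + NumPy).
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="logreg-sketch")
D = 10  # assumed number of features

def parse(line):
    # Assumed record format: "label x1 x2 ... xD" with label in {-1, +1}.
    v = np.array(line.split(), dtype=float)
    return v[0], v[1:]

points = sc.textFile("hdfs://namenode/data/points.txt").map(parse).cache()
w = np.zeros(D)

for _ in range(10):
    # One full pass over the cached points per iteration.
    grad = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= 0.1 * grad   # assumed fixed step size

print(w)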
Slide 36: MLBase motivation: two gaps
In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques, and they need to tune and compare several suitable algorithms. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. So we design a system that is extensible to novel ML algorithms.
Slide 37: MLBase: four pieces
- MQL: a simple declarative way to specify ML tasks.
- ML-Library: a library of distributed algorithms and a set of high-level operators that enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge.
- ML-Optimizer: a novel optimizer to select and dynamically adapt the choice of learning algorithm.
- ML-Runtime: a new run-time optimized for the data-access patterns of these high-level operators.
Slide 38: MLBase architecture (figure)
Slide 39: MLBase (figure)
Slide 40: Error guide: just a Hadoop framework?
In a sense, the distributed platforms are just a language: we cannot do without them, but we should not rely on them alone. What matters more is:
- Machine learning (reading: Machine Learning: A Probabilistic Perspective)
- Deep learning
Slide 41: FUJITSU: Parallel time series regression
Led by Dr. Yang. Group: LI Zhong-hua, WANG Yun-zhi, JIANG Wen-rui.
Slide 42: Parallel time series regression: properties and performance
- Platform: Hadoop from Apache (an open-source implementation of Google's MapReduce) and GraphLab from Carnegie Mellon University (open source). Both are good at distributed parallel processing: MapReduce is good at acyclic data flow, GraphLab at iterative and data-dependent computations.
- Volume: support for big data. The algorithm has good scalability: when a large amount of data arrives, it can be handled without any modification, just by increasing the number of cluster nodes.
- Velocity: rapid and agile modeling and processing capabilities for big data.
- Interface: an XML file is used to set input parameters, allowing customers to set parameters intuitively (a sketch follows below).
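As a sketch of what reading such an XML parameter file could look like (the tag names below are hypothetical examples, not the product's actual schema):

# Sketch: load algorithm parameters from an XML file. Tag names are hypothetical.
import xml.etree.ElementTree as ET

def load_params(path):
    root = ET.parse(path).getroot()
    return {
        "days": int(root.findtext("days", default="90")),
        "maxFragmentLength": int(root.findtext("maxFragmentLength", default="96")),
        "numClusters": int(root.findtext("numClusters", default="8")),
    }

# Example parameters.xml (hypothetical):
# <params>
#   <days>90</days>
#   <maxFragmentLength>96</maxFragmentLength>
#   <numClusters>8</numClusters>
# </params>
params = load_params("parameters.xml")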
Slide 43: Parallel time series regression: processing pipeline
- Decompose (MapReduce)
- CycLenCalcu (MapReduce)
- Indicative Frag (MapReduce)
- TBSCPro (MapReduce)
- Clustering (GraphLab)
- Choose Cluster (MapReduce)
Slide 44: Design for the parallel indicative fragment
Indicative fragment: identify the best length of the indicative fragment. Assume 90 days and a maximum indicative fragment length of 96. Comparing serial and parallel time complexity: the serial version performs 96 * C(90, 2) comparison operations, while the parallel version generates all 96 * (90 * 89 / 2) operation pairs before the parallel computation, so its time complexity is 1 (a single parallel step).
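As a check on these counts (in LaTeX):

    96 \times \binom{90}{2} = 96 \times \frac{90 \cdot 89}{2} = 96 \times 4005 = 384{,}480

so the serial version performs about 3.8 * 10^5 comparison operations, while under the idealized assumption of one worker per operation pair, the parallel version needs only a single parallel step (the slide's "time complexity: 1").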