Pagerank and Betweenness centrality on Big Taxi Trajectory Graph

Presentation transcript:

Pagerank and Betweenness centrality on Big Taxi Trajectory Graph Chao Ma May 2016

Contents • Abstract • Introduction • Related Work • TrajGraph centrality computation • Conclusion

Abstract We implement parallel computing methods for a new visual analytics paradigm named TrajGraph, a novel, scalable parallel-graph model designed for trajectory data management and for effective analysis of large-scale urban trajectory datasets. It supports fast computation and aggregation over various data queries in distributed environments. Centrality metrics from TrajGraph, such as network PageRank and betweenness, can be used to characterize the time-varying importance of streets using real traffic data.

Introduction

Introduction Overview of Taxi Trajectory Datasets: • Advanced sensing technologies and computing infrastructures have produced a variety of trajectory data of humans and vehicles in urban spaces. • The trajectory data record real-time movement paths sampled as a series of positions over urban networks.

Introduction Why Parallel Computation: • Large-scale urban trajectory data must be computed and queried quickly under geospatial and temporal constraints to support both real-time and historical visual analysis. • To achieve this goal, specific trajectory data management techniques are needed for efficient storage, indexing, update, and retrieval.

Introduction TrajGraph Model: • We create TrajGraph from the trajectory networks by mapping area grids or road segments to graph vertices and creating edges between them according to their linkage. • We implement graph and network algorithms such as PageRank and betweenness to characterize the time-varying importance of streets using real traffic data.

Related Work

Related Work • Large-scale graph processing has recently become a hot research topic. • The well-known MapReduce model has been used to process large datasets in parallel, but it is not always effective for graph data. • Pregel [54], BSP [55], and Apache Spark with the GraphX package [56] are designed as parallel computing systems targeted at large-scale graphs.

Related Work The BSP Computing Model. The BSP model consists of: - A set of processor-memory pairs. - A communications network that delivers messages in a point-to-point manner. - A mechanism for efficient barrier synchronization of all or a subset of the processes. - There are no special combining, replicating, or broadcasting facilities.

Related Work Pregel: A System for Large-Scale Graph Processing This figure shows an example: given a strongly connected graph where each vertex contains a value, the algorithm propagates the largest value to every vertex. In each superstep, any vertex that has learned a larger value from its messages sends it to all its neighbors. When no vertex changes its value in a superstep, the algorithm terminates.
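
The maximum-value propagation described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the superstep idea, not Pregel's API: the function name `propagate_max`, the dictionary-based graph layout, and the example values are all assumptions made for this sketch.

```python
def propagate_max(values, neighbors):
    """Propagate the largest vertex value to every vertex, superstep by superstep.

    values:    dict mapping vertex -> initial value (hypothetical layout)
    neighbors: dict mapping vertex -> list of out-neighbors
    """
    values = dict(values)
    # In the first superstep every vertex is active and sends its value.
    active = set(values)
    while active:
        # Message phase: each active vertex sends its value to all neighbors.
        inbox = {v: [] for v in values}
        for v in active:
            for n in neighbors.get(v, []):
                inbox[n].append(values[v])
        # Barrier: all messages are delivered before any vertex computes.
        active = set()
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                active.add(v)  # only vertices that changed send next time
    return values


# A small strongly connected cycle: a -> b -> c -> d -> a
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
result = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
```

The loop ends when a superstep changes no vertex, which matches the termination rule stated on the slide.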

Related Work • Apache Spark is an open-source cluster-computing framework. Spark's in-memory primitives provide performance up to 100 times faster than disk-based MapReduce for certain applications. • GraphX is a Spark component for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.

TrajGraph centrality computation

TrajGraph centrality computation Road Level TrajGraph Model • TrajGraph is a graph model constructed to represent the road network on which taxis travel. • We define every street segment in a city as a graph vertex. Then we read all trajectories in a given period T. If a taxi travels from road segment A to road segment B, we add an edge AB between them.

TrajGraph centrality computation Road Level TrajGraph Model This figure is an example TrajGraph with vertices A to G. It represents a street network with six junctions (J1 to J6). When taxis travel over the streets, some turns at the junctions are disallowed (shown as red arrows). A TrajGraph edge eij is created from a vertex i to a vertex j if taxis can travel from street i to street j.

TrajGraph centrality computation Region Level graph of Shenzhen by graph partitioning: This figure illustrates two different ways to partition a street-level graph of Shenzhen, where the colors show different regions on the map.

Trajectory Database in Graph Parallel Model TrajGraph Parallel Model: This figure shows that in the parallel TrajGraph, vertices are partitioned and distributed across multiple computing nodes or processes. Based on BSP, graph computation is divided into a sequence of supersteps. In each superstep, computation over each partition of the graph is executed concurrently, and then messages are created. Barrier synchronization at the end of the superstep ensures that all messages have been transmitted.

TrajGraph centrality computation Road Level TrajGraph Generation We pre-processed the raw taxi trajectory data by extracting the crossings from one street segment to another and counting how many times each crossing occurs. The table shows the road segment vertex and edge relations.
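
A minimal sketch of this generation step is shown below. It assumes each trajectory has already been map-matched to an ordered list of road-segment IDs, which the slides do not detail; the function name `build_trajgraph` and the sample data are hypothetical.

```python
from collections import Counter


def build_trajgraph(trajectories):
    """Build vertices and counted directed edges from segment sequences.

    trajectories: iterable of lists of road-segment IDs in travel order
                  (assumed to be map-matched already).
    Returns (vertices, edge_counts), where edge_counts[(a, b)] is the number
    of times a taxi moved directly from segment a to segment b.
    """
    vertices = set()
    edge_counts = Counter()
    for segments in trajectories:
        vertices.update(segments)
        # Pair each segment with the next one in the same trajectory.
        for a, b in zip(segments, segments[1:]):
            if a != b:  # repeated samples on one segment are not a crossing
                edge_counts[(a, b)] += 1
    return vertices, edge_counts


sample = [["A", "B", "C"], ["A", "B", "B", "D"]]
vertices, edge_counts = build_trajgraph(sample)
```

Because edges are only created from observed movements, disallowed turns never appear in the graph as long as no taxi makes them.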

TrajGraph centrality computation PageRank Centrality Algorithm PageRank was originally an algorithm that determines the importance of a web page on the Internet. It works by counting the number and quality of links to a page to produce a rough estimate of how important the website is. In our work, through an iterative process on TrajGraph, the importance of a street is scored according to the concept that links from high-scoring streets increase the score more than links from low-scoring streets. Streets with high PageRank are therefore the hub streets preferred by drivers.
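
The iterative scoring can be illustrated with a short power-iteration sketch. This is a generic sequential PageRank, not the GraphX implementation used in the work; the damping factor of 0.85, the fixed iteration count, the uniform handling of vertices without outgoing edges, and the toy street graph are all assumptions.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph.

    out_edges: dict mapping vertex -> list of vertices it links to.
    Returns a dict of scores that sum to 1.
    """
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread uniformly.
        dangling = sum(rank[v] for v in vertices if not out_edges.get(v))
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex splits its rank evenly among its out-links.
        for v, targets in out_edges.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Streets A and B both lead into C; C leads back to A.
scores = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this toy graph C collects rank from two streets and passes it on to A, so C scores highest and B, which no street leads into, scores lowest.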

TrajGraph centrality computation Betweenness Centrality Algorithm Betweenness centrality is an indicator of a node's centrality in a network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. A node with high betweenness centrality has a large influence on the transfer of items through the network, under the assumption that item transfer follows the shortest paths. In our work, it measures whether a street or region is a backbone of the urban network. That is, if the backbone is broken, serious transportation problems will arise because many drivers need to divert around the bottleneck.
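
For illustration, betweenness on a small unweighted directed graph can be computed with Brandes' algorithm, sketched below. Note that this is a swapped-in sequential technique, not the parallel method of the work, and it uses the common formulation in which each vertex pair contributes the fraction of its shortest paths that pass through a node. The function name, the graph layout, and the assumption of no parallel edges are choices made for this sketch.

```python
from collections import deque


def betweenness(out_edges):
    """Brandes' algorithm for unweighted directed graphs (no parallel edges)."""
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    score = {v: 0.0 for v in vertices}
    for s in vertices:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in vertices}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in vertices}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in out_edges.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back to s.
        delta = {v: 0.0 for v in vertices}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


path_scores = betweenness({"A": ["B"], "B": ["C"]})
diamond_scores = betweenness({"A": ["B", "C"], "B": ["D"], "C": ["D"]})
```

On the path A to B to C, only B lies between another pair of vertices. In the diamond, B and C each carry half of the two shortest paths from A to D.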

TrajGraph centrality computation Urban network centralities shown on a part of Shenzhen, China.

TrajGraph centrality computation Centrality Parallel Computation The core of our method is to compute the centralities of TrajGraph in parallel. To implement parallel graph computing, I parallelized our algorithms using GraphX on Apache Spark. The engine supports in-memory iterative computing while the graph data is stored and accessed through the distributed Hadoop HDFS. The implementation follows the parallel PageRank algorithm and a single-to-many parallel shortest path algorithm.
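
A single-to-many shortest path computation fits the superstep pattern naturally. The sketch below simulates it in one Python process and does not use GraphX or Spark; the function name `sssp_supersteps`, the weighted edge layout, and the example weights are assumptions, and edge weights are assumed to be non-negative.

```python
import math


def sssp_supersteps(edges, source):
    """Superstep-style shortest paths from one source to all vertices.

    edges: dict mapping vertex -> list of (neighbor, weight) pairs,
           with non-negative weights (assumed).
    """
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(n for n, _ in targets)
    dist = {v: math.inf for v in vertices}
    dist[source] = 0.0
    active = {source}
    while active:
        # Message phase: active vertices offer a candidate distance to each
        # neighbor; only the smallest message per receiver is kept.
        inbox = {}
        for v in active:
            for n, w in edges.get(v, []):
                candidate = dist[v] + w
                if candidate < inbox.get(n, math.inf):
                    inbox[n] = candidate
        # Barrier, then compute: a vertex that improves becomes active.
        active = set()
        for v, candidate in inbox.items():
            if candidate < dist[v]:
                dist[v] = candidate
                active.add(v)
    return dist


roads = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": [("D", 1)]}
distances = sssp_supersteps(roads, "A")
```

Running this from every vertex would give the all-pairs shortest paths that betweenness needs, which is why that metric is far more expensive than PageRank.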

TrajGraph centrality computation Performance I tested our graph computing algorithms on three platforms: (P1) non-parallel computing on a desktop computer (Intel Xeon E5520 with 4 cores at 2.27GHz and 16GB memory); (P2) parallel computing on the same desktop computer; (P3) parallel computing on a 4-node cluster where each node is the same as the desktop. All platforms ran 64-bit Linux with the Apache Spark Standalone distributed system.

TrajGraph centrality computation Performance In this test, we partitioned TrajGraph into three sets of region-level nodes to compare performance: 100 partitions, 1000 partitions, and 3000 partitions. The table shows the size of the original street-level TrajGraph and of the region-level TrajGraphs created after partitioning.

TrajGraph centrality computation Performance This table reports the computation time on the original large TrajGraph. It shows that graph generation from trajectories and PageRank computation finished in seconds, while betweenness and closeness (another centrality metric) took multiple hours because they need to compute shortest paths between each pair of vertices, a well-known time-consuming problem in graph computing.

TrajGraph centrality computation Performance This table shows that the computation of centralities could be finished in seconds with P1. The commonly used 100-partition TrajGraph was computed in milliseconds, which gives interactive performance for visualization.

Conclusion 1. We implement parallel computing methods for a new visual analytics paradigm named TrajGraph. 2. Centrality metrics from TrajGraph, such as network PageRank and betweenness, can be used to characterize the time-varying importance of streets using real traffic data. 3. We compute the centralities of TrajGraph in parallel on a distributed system.

Reference
1. TrajGraph: A Graph-Based Visual Analytics Approach to Studying Urban Network Centralities Using Taxi Trajectory Data. IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, January 2016. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7192687
2. Implementing Graph Based Parallel Computation of Big Taxi Trajectory Data. https://etd.ohiolink.edu/!etd.send_file?accession=kent1442683650&disposition=inline

Thanks so much