A Parallel GPU Version of the Traveling Salesman Problem Molly A. O’Neil, Dan Tamir, and Martin Burtscher* Department of Computer Science.

Presentation transcript:


The Traveling Salesman Problem
- Common combinatorial optimization problem
  - Wire routing, logistics, robot arm movement, etc.
- Given n cities, find shortest Hamiltonian tour
  - Must visit all cities exactly once and end in first city
- Usually expressed as a graph problem
  - We use complete, undirected, planar, Euclidean graph
  - Vertices represent cities
  - Edge weights reflect distances
A Parallel GPU Version of the Traveling Salesman Problem, July 2011
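
The problem statement above can be made concrete with a small C sketch. It assumes a precomputed, row-major n×n integer distance matrix; the names `tour_length` and `is_hamiltonian_tour` are hypothetical and not taken from the TSP_GPU source.

```c
#include <assert.h>
#include <stdlib.h>

/* Length of a closed tour: sums the n edges, including the edge that
   returns from the last city to the first one.
   dist is assumed to be an n*n row-major matrix (illustrative layout). */
long tour_length(const int *dist, int n, const int *tour)
{
    assert(n > 0);
    long total = 0;
    for (int k = 0; k < n; k++) {
        int from = tour[k];
        int to = tour[(k + 1) % n];   /* wraps around to close the cycle */
        total += dist[from * n + to];
    }
    return total;
}

/* Returns 1 if tour lists every city 0..n-1 exactly once, else 0. */
int is_hamiltonian_tour(int n, const int *tour)
{
    assert(n > 0);
    unsigned char *seen = calloc((size_t)n, 1);
    if (seen == NULL) return 0;
    int ok = 1;
    for (int k = 0; k < n && ok; k++) {
        int city = tour[k];
        if (city < 0 || city >= n || seen[city]) ok = 0;
        else seen[city] = 1;
    }
    free(seen);
    return ok;
}
```

Because the graph is undirected, the matrix is symmetric, so traversing a tour in either direction gives the same length.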

TSP Algorithm
- Finding an optimal solution is NP-hard
  - Heuristic algorithms are used to approximate the solution
- We use an iterative hill climbing search algorithm
  - Generate k random initial tours (k climbers)
  - Iteratively refine them until a local minimum is reached
- In each iteration, apply the best opt-2 move
  - Find the best pair of edges (a,b) and (c,d) such that replacing them with (a,d) and (b,c) minimizes the tour length
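
A single opt-2 move (more commonly written 2-opt) can be sketched in C as follows. This is a minimal illustration, not the authors' code. It assumes the tour is stored as an array of city indices, that the distance matrix is symmetric and row-major, and that the edge labels map onto tour positions as noted in the comments. The function names are hypothetical.

```c
#include <assert.h>

/* Change in tour length if edges (a,b) and (c,d) are replaced by
   (a,d) and (b,c), where a = tour[i], b = tour[i+1], d = tour[j],
   and c = tour[j+1] (wrapping to tour[0] when j is the last position).
   A negative result means the move shortens the tour. */
int two_opt_delta(const int *dist, int n, const int *tour, int i, int j)
{
    assert(0 <= i && i + 1 < j && j < n);
    int a = tour[i], b = tour[i + 1];
    int d = tour[j], c = tour[(j + 1) % n];
    return dist[a * n + d] + dist[b * n + c]
         - dist[a * n + b] - dist[d * n + c];
}

/* Applies the move by reversing the segment tour[i+1..j] in place.
   With a symmetric matrix the edges inside the segment keep their
   lengths, so only the two boundary edges change. */
void two_opt_apply(int *tour, int i, int j)
{
    for (int lo = i + 1, hi = j; lo < hi; lo++, hi--) {
        int tmp = tour[lo];
        tour[lo] = tour[hi];
        tour[hi] = tmp;
    }
}
```

For four cities on a square with side 10 and diagonal 14, the crossing tour 0, 2, 1, 3 has length 48; the move with i = 0 and j = 2 has a delta of -8 and yields the tour 0, 1, 2, 3 of length 40.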

GPU Requirements
- Lots of data parallelism
  - Need 10,000s of ‘independent’ threads
- Sufficient memory access regularity
  - Sets of 32 threads should have ‘nice’ access patterns
- Sufficient code regularity
  - Sets of 32 threads should follow the same control flow
- Plenty of data reuse
  - At least O(n^2) operations on O(n) data

TSP_GPU Implementation
- Assuming 100-city problems & 100,000 climbers
- Climbers are independent, can be run in parallel
  - Plenty of data parallelism
  - Potential load imbalance
    - Different number of steps required to reach local minimum
- Every step determines best of 4851 opt-2 moves
  - Same control flow (but different data)
  - Coalesced memory access patterns
  - O(n^2) operations on O(n) data
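
The slide does not say how the figure of 4851 moves is obtained. One counting that reproduces it, offered here as an assumption rather than as the authors' derivation, is the number of unordered pairs drawn from n − 1 = 99 candidate positions. The second line assumes that all 100,000 climbers are still active in a given step, which the load-imbalance remark above suggests is not always the case.

```latex
\[
\binom{n-1}{2} = \frac{(n-1)(n-2)}{2} = \frac{99 \cdot 98}{2} = 4851
\qquad (n = 100)
\]
\[
100{,}000 \times 4851 = 485{,}100{,}000
\ \text{move evaluations per step across all climbers}
\]
```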

Code Optimizations
- Key code section: finding best opt-2 move
  - Doubly nested loop
  - Only computes difference in tour length, not absolute length
  - Highly optimized to minimize memory accesses
  - “Caches” rest of data in registers
  - Requires only 6 clock cycles per move on a Xeon CPU core
- Local minimum compared to best solution so far
  - Best solution updated if needed, otherwise tour is discarded
- Other small optimizations (see paper)
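
The doubly nested search described above might look like the following C sketch. It is written under assumptions: the loop bounds, the row-major symmetric distance matrix, and the names `find_best_move` and `climb` are illustrative and may differ from the TSP_GPU source. Values that depend only on the outer index are read once into locals, which a compiler can keep in registers. The bounds used here visit (n−1)(n−2)/2 position pairs, which is 4851 for n = 100.

```c
#include <assert.h>

/* Scans all candidate moves of one tour. Returns the most negative
   change in tour length (0 if no move improves the tour) and stores
   the positions of that move in *best_i and *best_j. */
int find_best_move(const int *dist, int n, const int *tour,
                   int *best_i, int *best_j)
{
    assert(n >= 3);
    int best = 0;
    *best_i = *best_j = -1;
    for (int i = 0; i < n - 2; i++) {
        /* Loaded once per outer iteration and reused in the inner loop. */
        const int *row_a = &dist[tour[i] * n];
        const int *row_b = &dist[tour[i + 1] * n];
        const int d_ab = row_a[tour[i + 1]];
        for (int j = i + 2; j < n; j++) {
            const int d = tour[j];
            const int c = tour[j + 1 == n ? 0 : j + 1];
            /* Only the difference in length is computed. */
            const int delta = row_a[d] + row_b[c] - d_ab - dist[d * n + c];
            if (delta < best) { best = delta; *best_i = i; *best_j = j; }
        }
    }
    return best;
}

/* Applies the best move repeatedly until a local minimum is reached.
   Returns the number of moves applied. */
int climb(const int *dist, int n, int *tour)
{
    int steps = 0, i, j;
    while (find_best_move(dist, n, tour, &i, &j) < 0) {
        for (int lo = i + 1, hi = j; lo < hi; lo++, hi--) {
            int tmp = tour[lo]; tour[lo] = tour[hi]; tour[hi] = tmp;
        }
        steps++;
    }
    return steps;
}
```

With integer distances every applied move shortens the tour by at least 1, so the loop in `climb` terminates. The step count differs from one starting tour to the next, which is the source of the load imbalance mentioned earlier.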

GPU Optimizations
- Random tours generated in parallel on GPU
  - Minimizes data transfer to GPU
  - (CPU only generates distance matrix and prints result)
- 2D distance matrix resident in shared memory
  - Ensures hits in software-controlled fast data cache
- Tours copied to local memory in chunks of 1024
  - Enables accessing them with coalesced loads & stores
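
Generating the initial tours independently per climber is what removes the need to transfer them to the GPU. The C sketch below is a CPU-side analogue only: it assumes each climber derives its own pseudo-random stream from its index and shuffles the identity tour with Fisher-Yates. The seeding scheme, the linear congruential generator, and the name `random_tour` are assumptions made for illustration; the slides do not describe the generator that TSP_GPU uses.

```c
#include <assert.h>
#include <stdint.h>

/* Fills tour[0..n-1] with a pseudo-random permutation of 0..n-1.
   The result depends only on climber_id, so every climber can build
   its tour without communicating with the others. */
void random_tour(int *tour, int n, uint32_t climber_id)
{
    assert(n > 0);
    /* Hypothetical per-climber seed; any distinct seeds would do. */
    uint32_t state = climber_id * 2654435761u + 1u;
    for (int k = 0; k < n; k++) tour[k] = k;
    /* Fisher-Yates shuffle driven by a simple 32-bit LCG. */
    for (int k = n - 1; k > 0; k--) {
        state = state * 1664525u + 1013904223u;
        int r = (int)((state >> 16) % (uint32_t)(k + 1));
        int tmp = tour[k]; tour[k] = tour[r]; tour[r] = tmp;
    }
}
```

The modulo step introduces a slight bias, which is usually acceptable for seeding a local search but would need a better generator if uniform sampling mattered.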

Evaluation Method
- Systems
  - NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs with 32 PEs each)
  - Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons)
- Datasets
  - Five 100-city inputs from TSPLIB
- Implementations
  - CUDA (GPU), Pthreads (CPU), serial C (CPU)
  - Use almost identical code for finding best opt-2 move

Runtime Comparison (kroE100 Input)
- GPU is 7.8x faster than CPU with 8 cores
- One GPU chip is as fast as 16 or 32 CPU chips

Speedup over Serial (kroE100 Input)
- Pthreads code scales well to 32 threads (4 CPUs)
- CPU performance fluctuates (NUMA), GPU stable

Solution Quality
- Optimal tour found in 4 of 5 cases with 100,000 climbers
- 200,000 climbers find best solution in fifth case
- Runtime independent of input and linear in climbers

Summary
- TSP_GPU source code is freely available at
- TSP_GPU algorithm
  - Highly optimized implementation for GPUs
  - Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
  - Produces high-quality results
  - May be better suited for GPU than ACO and GA algorithms
- Acknowledgments
  - NSF TeraGrid (NICS), NVIDIA Corp., and Intel Corp.