Parallelizing GIS Applications for the IBM Cell Broadband Engine and x86 Multicore Platforms
Bharghava R, Jyothish Soman, K S Rajan
International Institute of Information Technology, Hyderabad, India
(This is an ongoing project (Aug 2009 onwards) at IIIT-H, with hardware support from IBM.)

Limitations in VLSI technology have resulted in minimal gains from increasing the compute power of a single processor core. The popular reading of Moore’s Law as a 2X increase in performance every 18 months no longer seems achievable. All leading processor development companies have shifted focus to the design and development of multicore processors, which aim to tackle the three walls described below.

Computer architecture has hit three walls:
Power Wall - Power is expensive, but transistors are “free”: more transistors can be put on a chip than there is power to turn on.
Memory Wall - Loads and stores are slow, but multiplies are fast: about 200 clocks to DRAM, while even FP multiplies take only 4 clocks.
ILP Wall - Diminishing returns on finding more instruction-level parallelism (ILP).
Power Wall + Memory Wall + ILP Wall = Brick Wall

In the single-core era, software developers took the architecture for granted and relied on efficient compilers. The same approach cannot be followed for multicore processors, as compiler technology has not advanced as much as its hardware counterpart. Also, existing parallelizing compilers exploit only the parallelism inherent in the code, which is not sufficient to use the full compute power of parallel systems.

[Figure: Uniprocessor Performance (SPECint). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach.]

Existing Parallel Platforms

Clusters
Made of commercial off-the-shelf (COTS) components. This platform consists of multiple CPUs communicating over a LAN. Supercomputers are also based on clusters. The overhead for installation and maintenance is huge. Widely available and easy to design.

Coprocessors
These provide application-specific acceleration, e.g.
General-Purpose Graphics Processing Units (GPGPUs). Good performance for massively parallel algorithms. Best performance per dollar for specific applications, which are limited in number. Suffers from limited memory.

Multicore Architectures
These consist of multiple instances of simple processing cores, either homogeneous or heterogeneous. They can also be part of the two platforms above. Processors in this category include Intel/AMD multicores, Cisco Metro, IBM Power Series, STI Cell Broadband Engine, and Sun Niagara T1/T2.

Parallelization of GIS Applications
Because compiler-based parallelization is suboptimal, most parallel algorithms are written manually and have to be tailor-made for the host processor architecture. Also, programmers are used to thinking in terms of ILP. Developing parallel algorithms is in itself a daunting task.

Parallelizing frameworks are available for the three categories mentioned above:
MPI for clusters
CUDA SDK for NVIDIA GPUs, and Stream SDK for AMD GPUs
OpenMP for x86-based multicore processors

According to “A Berkeley View: A New Framework & a New Platform for Parallel Research”, there are 14 motifs, or parallelization patterns, in the majority of scientific applications. GIS applications are inherently data parallel and can exploit current parallel architectures. Algorithms pertaining to Geospatial Information Systems (GIS) focus on the following categories:
Network Analysis / Graph Traversal
Nearest Neighbour Searches
Dense Linear Algebra
Sparse Linear Algebra
Structured Grids

As an example, the parallel Terracost [2,3] algorithm is implemented on STI’s Cell Broadband Engine (CBE), and r.cost has been implemented on stock Intel multicore processors. r.mapcalc is used to show the raw performance improvements of embarrassingly parallel applications.

Preliminary Results: We achieved a speedup of ~6 for MapCalc over a single-processor implementation.
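The row-wise decomposition that makes a per-cell map-algebra operation such as r.mapcalc embarrassingly parallel can be sketched as follows. This is a minimal illustration in Python, not the authors' Cell or OpenMP implementation: the function names (`split_rows`, `apply_band`, `parallel_mapcalc`), the example expression, and the worker count are all hypothetical. CPython threads are used only to show the structure; they are not expected to give a real speedup for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(n_rows, n_workers):
    """Return (start, stop) row ranges that cover 0..n_rows without overlap."""
    base, extra = divmod(n_rows, n_workers)
    bands, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        if stop > start:  # skip empty bands when n_workers > n_rows
            bands.append((start, stop))
        start = stop
    return bands


def apply_band(raster_a, raster_b, band, cell_op):
    """Apply cell_op to every cell of one band; no other band is touched."""
    start, stop = band
    return [
        [cell_op(a, b) for a, b in zip(raster_a[r], raster_b[r])]
        for r in range(start, stop)
    ]


def parallel_mapcalc(raster_a, raster_b, cell_op, n_workers=4):
    """Evaluate cell_op over two equally sized rasters, one band per task."""
    bands = split_rows(len(raster_a), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves band order, so results concatenate back in row order.
        parts = pool.map(
            lambda band: apply_band(raster_a, raster_b, band, cell_op), bands
        )
        return [row for part in parts for row in part]


# Hypothetical example: a normalized-difference expression on two small rasters.
def norm_diff(a, b):
    return (a - b) / (a + b) if (a + b) != 0 else 0.0


A = [[float(r * 5 + c + 1) for c in range(5)] for r in range(7)]
B = [[1.0] * 5 for _ in range(7)]
result = parallel_mapcalc(A, B, norm_diff, n_workers=3)
serial = [[norm_diff(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
```

Because every output cell depends only on the input cells at the same position, the bands need no communication with each other. On a multicore or Cell target, each band would presumably be handed to one core, and the remaining cost beyond the arithmetic would be moving bands to and from the workers.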
The absence of linear speedup is due to the communication overhead of the Element Interconnect Bus. This can be masked further by buffering.

Cell Processor – Advantages and Feasibility
Developed jointly by Sony, IBM, and Toshiba.
1 Cell processor consists of:
1 PPE core – control processor, compatible with 64-bit Power
8 compute-intensive SPE cores; the SPEs also have a DMA engine for data movement
Element Interconnect Bus (EIB) – for communication; token ring architecture, 8 concurrent rings

Commercial Availability
Blade Server: 2 Cell cores + 16 GB RAM per blade. High installation and maintenance costs.
PlayStation 3: 1 Cell core, 6 usable SPEs (one is reserved for the Game OS, the other is disabled to increase yield), 256 MB RAM. Low-cost system, but limited memory. Graphics processor not open to programming.
Add-on PCI-e cards: 1 Cell core + 1 GB RAM. Easy installation. High cost.

References
[1] "The landscape of parallel computing research: A view from Berkeley", Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS , December.
[2] Hazel, T., Toma, L., Vahrenhold, J., and Wickremesinghe, R. Terracost: Computing least-cost-path surfaces for massive grid terrains. J. Exp. Algorithmics 12 (Jun. 2008),
[3] Neteler, M. and Mitasova, H. "Open source GIS: a GRASS GIS approach", Kluwer Academic Pub.