Shuai Ding, Jinru He, Hao Yan, Torsten Suel. Using Graphics Processors for High Performance IR Query Processing. April 23, 2009.


The problem? Search engines: thousands of queries per second on billions of pages. Large hardware investment. Graphics processing units (GPUs). Can we build a high-performance IR system (query processing) on GPUs?

Outline: Graphics processing units (GPUs). Query processing on CPUs. Query processing on GPUs. Discussion.

Part I: Graphics processing units (GPUs)

Graphics processing units (GPUs): Special-purpose processors to accelerate applications. Driven by the gaming industry. High degree of parallelism (96-way, 128-way, ...). Programmable via various libraries and SDEs.


Some characteristics (GTS8800): Lower clock speed (500 MHz) but more processors (96). 230 GFlops for the GPU. 60 GB/s memory access to global GPU memory. A few GB/s transfer rate from main memory to the GPU. Transfers can be overlapped with computing. Some startup overhead for starting tasks on the GPU. Consider the GPU as a co-processor for the CPU.

GPU vs. CPU performance (figure released by NVIDIA)

Related work: Scientific computing. GPU terasort, Govindaraju et al., SIGMOD 06. Joins on GPUs, He et al., SIGMOD 08. MapReduce on GPUs, He et al., PACT 08. GPU vendors (NVIDIA, ATI): general-purpose programming environment.

Challenges in GPU programming: Need to program in parallel. SIMD-type programming model. Memory issues: global memory, shared memory, registers (bank conflicts). Synchronization in CUDA.

Part II: Query processing on CPUs

Inverted index and inverted lists: A collection of N documents, each identified by an ID. The inverted index consists of one list for each term T, for example I_armadillo = { [678 2], [2134 3], [3970 1], ... }.
aardvark: 3452, 11437, ...
arm: 4, 19, 29, 98, 143, ...
armada: 145, 457, 789, ...
armadillo: 678, 2134, 3970, ...
armani: 90, 256, 372, 511, ...
...
zebra: 602, 1189, 3209, ...

Inverted list compression: Decreases size and increases overall performance. First take the gaps (differences) between consecutive docids, then encode the resulting smaller numbers. I_armadillo = { [678 2], [2134 3], [3970 1], ... } becomes I_armadillo = { [678 2], [1456 3], [1836 1], ... }.
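The gap step can be sketched in a few lines of C. This is a minimal illustration, not the authors' code; the function names `docids_to_gaps` and `gaps_to_docids` are made up for this example, and the lists hold docids only (frequencies are left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace sorted docids by gaps, in place: the first entry is kept,
 * every later entry becomes the difference to its predecessor. */
void docids_to_gaps(uint32_t *ids, size_t n)
{
    /* Walk backwards so that ids[i - 1] still holds the original docid. */
    for (size_t i = n; i-- > 1; ) {
        assert(ids[i] > ids[i - 1]);   /* docids must be strictly increasing */
        ids[i] -= ids[i - 1];
    }
}

/* Inverse operation: a running sum over the gaps restores the docids. */
void gaps_to_docids(uint32_t *gaps, size_t n)
{
    for (size_t i = 1; i < n; i++)
        gaps[i] += gaps[i - 1];
}
```

Applied to the armadillo list, the docids 678, 2134, 3970 turn into the gaps 678, 1456, 1836, matching the slide.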

Compression techniques: Rice coding. PForDelta coding (Heman et al., ICDE 2006).

Rice coding: Take the gaps and consider their average. (34) (178) (291) (453) ... becomes (34) (144) (113) (162), so the average is g = (34 + 144 + 113 + 162) / 4 = 113.25. Round this down to a power of two: b = 64 (6 bits). Then encode each number x as x/b in unary followed by x mod b in binary (6 bits). The example encodes each gap minus one:
33 = 0*64 + 33
143 = 2*64 + 15
112 = 1*64 + 48
161 = 2*64 + 33
The result is the sequence of these four codewords. Unary length: not fixed. Binary length: fixed.
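A small C sketch of the encoding step may help here. It is only an illustration: the function name `rice_encode` is hypothetical, the code is written as a string of '0'/'1' characters rather than packed bits, and the unary convention (q ones followed by one zero) is an assumption, since the slide's bit strings did not survive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the Rice code of `value` with a k-bit binary part (b = 2^k) into
 * `out` as a NUL-terminated string of '0'/'1' characters.
 * Assumed unary convention: q ones followed by a single zero.
 * Returns the code length in bits, or 0 if `out` is too small. */
size_t rice_encode(uint32_t value, unsigned k, char *out, size_t cap)
{
    assert(k >= 1 && k < 32);
    uint32_t q = value >> k;                 /* value / b   */
    uint32_t r = value & ((1u << k) - 1u);   /* value mod b */
    size_t len = (size_t)q + 1 + k;

    if (len + 1 > cap)
        return 0;

    size_t pos = 0;
    for (uint32_t i = 0; i < q; i++)
        out[pos++] = '1';                    /* unary part */
    out[pos++] = '0';                        /* unary terminator */
    for (unsigned bit = k; bit-- > 0; )      /* binary part, MSB first */
        out[pos++] = ((r >> bit) & 1u) ? '1' : '0';
    out[pos] = '\0';
    return len;
}
```

With k = 6 (b = 64), the value 143 = 2*64 + 15 has a 3-bit unary part and a 6-bit binary part, so the code is 9 bits long; only the unary part varies in length.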

PForDelta (PFD) (Heman et al., ICDE 2006): Idea: compress and decompress many values at a time (e.g., 128). Choose b such that 90% of the values fit in a b-bit slot, and code the other 10% as exceptions. Suppose that among the next 128 numbers, 90% are < 32: choose b = 5. Allocate 128 x 5 bits, plus space for exceptions. Exceptions are stored at the end as ints (using 4 bytes each).
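One way to pick b is to try increasing widths until at least 90% of the block fits. The sketch below assumes exactly that rule; the name `pfd_choose_b` is hypothetical, and real implementations may choose b differently (for example from a histogram).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest bit width b such that at least 90% of the n values are
 * below 2^b. Values that do not fit would be coded as exceptions. */
unsigned pfd_choose_b(const uint32_t *vals, size_t n)
{
    assert(n > 0);
    for (unsigned b = 1; b < 32; b++) {
        size_t fit = 0;
        for (size_t i = 0; i < n; i++)
            if (vals[i] < (1u << b))
                fit++;
        if (fit * 10 >= n * 9)   /* fit / n >= 0.9 without floating point */
            return b;
    }
    return 32;                   /* every uint32_t fits in 32 bits */
}
```

For a block where 9 of 10 values are below 32, this returns 5, in line with the slide's "90% are < 32: choose b = 5".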

PForDelta (PFD) example: b = 5 and sequence 23, 41, 8, 12, 30, 68, 18, 45, 21, 9, ... Exceptions (grey) form a linked list within their locations (e.g., 3 means the next exception is 3 away). One extra slot at the beginning points to the location of the first exception (or store it in a separate array). Figure labels: space for the b-bit numbers; space for exceptions (4 bytes each, back to front); location of 1st exception.

Query processing: BM25. AND queries and OR queries.

Query processing: Document-At-A-Time (DAAT) vs. Term-At-A-Time (TAAT). DAAT: widely used, efficient, supports skipping, but sequential.

Skipping: Polytechnic ... University ... Brooklyn. But it is sequential. How can we adapt skipping to TAAT?

Part III: Query processing on GPUs

Architecture of the query processor: The index is effectively in main memory and is partially cached in GPU global memory. The CPU can decide to execute a query on the CPU or on the GPU.

General steps: Sort the lists from shortest to longest. Decompress the shortest list. Decompress the next list and combine it with the previous result, until no list is left. (How can we use skipping to avoid decompressing the whole list?) Rank the result.

Rice compression: Assign each number to a single thread. Divide the compressed data into sub-groups and assign each sub-group to a different thread. Example: gaps = {34, 144, 113, 162}, b = 64: 33 = 0*64 + 33, 143 = 2*64 + 15, 112 = 1*64 + 48, 161 = 2*64 + 33.

Rice compression, prefix sum (also known as scan): each element of the result list is the sum of the elements of the input list up to and including its index: for (i = 1; i < n; i++) array[i] += array[i-1]; The GPU can do a prefix scan (M. Harris, Parallel prefix scan with CUDA).
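The sequential loop has a dependency from each element to the previous one. A scan can instead be organized in about log2(n) passes in which all elements are independent. The sketch below uses the simple Hillis-Steele pattern, simulated in plain C with two buffers; it is not the work-efficient scheme from the cited CUDA write-up, and the name `scan_inclusive` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Inclusive prefix sum in about log2(n) passes. Within one pass every
 * output element depends only on the previous pass's buffer, so a GPU
 * could compute all of them at once (one thread per element).
 * `tmp` is scratch space for n elements; the result ends up in `data`. */
void scan_inclusive(uint32_t *data, uint32_t *tmp, size_t n)
{
    assert(data != tmp || n < 2);            /* the two buffers must differ */
    uint32_t *src = data, *dst = tmp;
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i < n; i++)       /* independent iterations */
            dst[i] = (i >= stride) ? src[i] + src[i - stride] : src[i];
        uint32_t *t = src; src = dst; dst = t;   /* swap buffers */
    }
    if (src != data)
        memcpy(data, src, n * sizeof *data);
}
```

For the gaps 34, 144, 113, 162 the first pass produces 34, 178, 257, 275 and the second pass 34, 178, 291, 453, which are the docids.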

Rice compression, reduce to prefix scan: docids = {34, 178, 291, 453}, gaps = {34, 144, 113, 162}, so we get b = 64: 33 = 0*64 + 33, 143 = 2*64 + 15, 112 = 1*64 + 48, 161 = 2*64 + 33. Unary parts: 0, 2, 1, 2. Binary parts: 33, 15, 48, 33. Docids: 34, 178, 291, 453.

Rice compression: b-bit prefix sum on the binary part I_b. 1-bit prefix sum on the unary part I_u. Compact the result (prefix sum again). Combine the results.
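The reason two prefix sums suffice can be seen from the arithmetic. If each gap g is stored as g - 1 = q*b + r (as in the example values earlier in the talk), then docid_i = b * (q_0 + ... + q_i) + (r_0 + ... + r_i) + (i + 1). The sketch below checks this in C, assuming the unary and binary parts have already been extracted into arrays; the name `rice_parts_to_docids` is hypothetical, and the two running sums stand in for what would be parallel scans on a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild docids from the unary parts q[] and binary parts r[] of
 * Rice-coded gaps, assuming each gap g was stored as
 * g - 1 = q * b + r with b = 2^k. */
void rice_parts_to_docids(const uint32_t *q, const uint32_t *r,
                          uint32_t *docids, size_t n, unsigned k)
{
    assert(k < 32);
    uint32_t qsum = 0, rsum = 0;
    for (size_t i = 0; i < n; i++) {
        qsum += q[i];     /* prefix sum over the unary parts  */
        rsum += r[i];     /* prefix sum over the binary parts */
        /* the (i + 1) term undoes the "minus one" of each stored gap */
        docids[i] = (qsum << k) + rsum + (uint32_t)(i + 1);
    }
}
```

With q = 0, 2, 1, 2 and r = 33, 15, 48, 33 and k = 6, this yields 34, 178, 291, 453.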

Rice compression, can we do better? Localize the prefix sum. Polytechnic ... University ... Brooklyn. Helpful for skipping.

PForDelta (PFD) compression: The original PFD is not suitable for GPUs, especially the linked-list part. GPU-based PFD: Use the same b for each list. Store the exceptions in two arrays. Recursively compress these two arrays.
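A possible reading of "two arrays" is one array of exception positions and one array of the bits that overflow the b-bit slot. That layout is an assumption made for this sketch, as are the names `pfd_split` and `pfd_merge`; the slide does not spell out the format. The point is that no linked list is needed, so every slot can be handled independently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split vals[] into b-bit slots plus two exception arrays:
 * exc_pos[] holds the positions of the exceptions and exc_high[] the
 * bits above the low b bits. Returns the number of exceptions.
 * All three output arrays need room for n entries. */
size_t pfd_split(const uint32_t *vals, size_t n, unsigned b,
                 uint32_t *low, uint32_t *exc_pos, uint32_t *exc_high)
{
    assert(b >= 1 && b < 32);
    const uint32_t mask = (1u << b) - 1u;
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        low[i] = vals[i] & mask;          /* always fits in b bits */
        if (vals[i] > mask) {             /* value needs more than b bits */
            exc_pos[m] = (uint32_t)i;
            exc_high[m] = vals[i] >> b;
            m++;
        }
    }
    return m;
}

/* Patch the m exceptions back into the decoded b-bit slots. */
void pfd_merge(uint32_t *low, unsigned b,
               const uint32_t *exc_pos, const uint32_t *exc_high, size_t m)
{
    for (size_t j = 0; j < m; j++)        /* independent per exception */
        low[exc_pos[j]] |= exc_high[j] << b;
}
```

For the sequence 23, 41, 8, 12, 30, 68, 18, 45, 21, 9 with b = 5, the exceptions are 41, 68 and 45 at positions 1, 5 and 7. The two exception arrays are themselves lists of small integers, which is what makes compressing them recursively plausible.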

Size for Rice and PFD: After two levels, the size is as small as or even better than before.

Speed for Rice and PFD: Millions of integers per second. Prefix vs. without prefix.

Speed for PForDelta: The CPU performs better for short lists. The GPU has better performance, especially without prefix.

List intersection algorithm: DAAT is sequential by nature, so it is not suitable for GPUs. We try something like TAAT: assign each docid in the shorter list to one thread, then binary search for it in the longer list.
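The per-docid work is an ordinary binary search, sketched below in C. The outer loop stands in for the GPU threads (one per docid of the shorter list); the names `contains` and `intersect` are hypothetical. Appending matches with a counter is a sequential shortcut: a parallel version would more likely write a match flag per docid and compact afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `key` occurs in the sorted array a[0..n-1], else 0. */
static int contains(const uint32_t *a, size_t n, uint32_t key)
{
    size_t lo = 0, hi = n;               /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n && a[lo] == key;
}

/* Intersect two sorted docid lists. Matches are written to `out`, which
 * needs room for ns entries; returns the number of matches. */
size_t intersect(const uint32_t *shorter, size_t ns,
                 const uint32_t *longer, size_t nl, uint32_t *out)
{
    assert(ns <= nl);
    size_t m = 0;
    for (size_t i = 0; i < ns; i++)      /* one GPU thread per i */
        if (contains(longer, nl, shorter[i]))
            out[m++] = shorter[i];
    return m;
}
```

The cost is about ns * log2(nl) comparisons, which is attractive when one list is much shorter than the other.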

List intersection algorithm, can we do better? Recursive intersection! (R. Cole, Parallel merge sort)

Result: It works especially well for long lists. Two levels give the best result.

Skipping? First, merge the last docids of the blocks to decide which blocks need decompressing. Then do the decompression and intersection. Polytechnic ... University ... Brooklyn.
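The block-selection step can be sketched as a merge between the candidate docids and the last docid of each compressed block. This is a sequential illustration under assumed inputs (both arrays sorted ascending); the name `blocks_to_decompress` is hypothetical and the talk's actual procedure may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* last[j] is the largest docid in block j of a compressed list.
 * Sets need[j] = 1 for every block that may contain one of the sorted
 * candidate docids; returns the number of flagged blocks. */
size_t blocks_to_decompress(const uint32_t *cand, size_t nc,
                            const uint32_t *last, size_t nb,
                            unsigned char *need)
{
    size_t j = 0, count = 0;
    for (size_t b = 0; b < nb; b++)
        need[b] = 0;
    for (size_t i = 0; i < nc; i++) {
        assert(i == 0 || cand[i - 1] < cand[i]);   /* candidates sorted */
        while (j < nb && last[j] < cand[i])
            j++;                     /* block j ends before cand[i] */
        if (j == nb)
            break;                   /* candidate lies beyond the list */
        if (!need[j]) {
            need[j] = 1;
            count++;
        }
    }
    return count;
}
```

Only the flagged blocks are then decompressed and intersected; all other blocks are skipped without touching their compressed data.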

Ranked query: Given a list of N results, how do we rank them?

Ranked query: Reduce K times for the top K results, K*N operations.
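The K*N count follows directly from running one max-reduction per result. The C sketch below shows the idea sequentially; each inner loop corresponds to what would be a parallel reduction on a GPU. The name `top_k` is hypothetical, and the code assumes all scores are finite because it marks chosen entries with negative infinity.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Select the k highest scores by running k max-reductions over the n
 * candidates, about k * n comparisons in total. top[j] receives the
 * index of the j-th best result. scores[] is modified: chosen entries
 * are overwritten with -INFINITY. Returns how many were selected. */
size_t top_k(float *scores, size_t n, size_t k, size_t *top)
{
    size_t found = 0;
    while (found < k && found < n) {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)   /* one max-reduction */
            if (scores[i] > scores[best])
                best = i;
        top[found++] = best;
        scores[best] = -INFINITY;        /* exclude it from later rounds */
    }
    return found;
}
```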

Ranked query, can we do better? (trick): Figure labels: reduce; top result; block of size C. N*(K/C+1) operations.

Conjunctive (AND) queries and disjunctive (OR) queries: Up to this point we have only talked about conjunctive queries. What about disjunctive queries? Brute-force TAAT works well on GPUs: process one list at a time. This fits the GPU parallel model.

Experiments on GOV2: 25.2M documents, single core for the CPU. 1000 queries selected at random from the trace. Time in ms. The GPU outperforms the CPU.

Scheduling: One observation: for queries with short lists the CPU outperforms the GPU, and for queries with long lists the GPU outperforms the CPU. Assign queries to the GPU or the CPU. Use both CPU and GPU. Learning the cost: the shortest list length, etc. Three queues, job stealing, etc.
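The simplest form of such a decision is a cutoff on the shortest list length. The sketch below is a deliberately reduced stand-in for the learned cost model mentioned on the slide: the names `device_t` and `choose_device` and the `gpu_threshold` parameter are hypothetical, and no particular threshold value is implied by the talk.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { RUN_ON_CPU, RUN_ON_GPU } device_t;

/* Route a query by the length of its shortest inverted list: short
 * lists go to the CPU, long lists to the GPU. `gpu_threshold` is a
 * tuning parameter that would have to be measured or learned. */
device_t choose_device(const size_t *list_len, size_t nterms,
                       size_t gpu_threshold)
{
    assert(nterms > 0);
    size_t shortest = list_len[0];
    for (size_t i = 1; i < nterms; i++)
        if (list_len[i] < shortest)
            shortest = list_len[i];
    return shortest >= gpu_threshold ? RUN_ON_GPU : RUN_ON_CPU;
}
```

The shortest list is a natural feature because it bounds the size of the intersection and, with skipping, largely determines how much of the longer lists has to be decompressed.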

Scheduling: GPU+CPU serialized outperforms using only one of them. Using GPU+CPU in parallel works best. Using GPU+CPU is better than 2x CPU or 2x GPU.

Part IV: Discussion

Discussion: So, should we build search engines using GPUs? Ranking function and energy consumption. Using GPUs to learn about opportunities for future CPUs (multi-core). Learn about opportunities for future GPUs (energy issues, memory issues).

Thanks for your time