Authors: Yi Wang, Yuan Zu, Ting Zhang, Kunyang Peng, Qunfeng Dong, Bin Liu, Wei Meng, Huichen Dai, Xin Tian, Zhonghu Xu, Hao Wu, Di Yang Publisher: NSDI.

Presentation transcript:

Authors: Yi Wang, Yuan Zu, Ting Zhang, Kunyang Peng, Qunfeng Dong, Bin Liu, Wei Meng, Huichen Dai, Xin Tian, Zhonghu Xu, Hao Wu, Di Yang
Publisher: NSDI 2013
Presenter: Chia-Yi Chu
Date: 2013/07/03

• Introduction
• Algorithms & Data Structures
• The CPU-GPU System: Packet Latency and Stream Pipeline
• Memory Access Performance
• Implementation
• Experimental Evaluation

• Content-Centric Networking (CCN)
  ◦ Uses a content name to identify a piece of data instead of using an IP address to locate a device.
  ◦ Every distinct content/entity is referenced by a unique name.
  ◦ Forwards packets based on the requested content name(s) carried in each packet header, by looking up a forwarding table consisting of content name prefixes.
• CCN name lookup complies with longest prefix matching (LPM), and backbone CCN routers can have large-scale forwarding tables.

• Names and Name Tables
  ◦ Names are hierarchically structured and composed of explicitly delimited name components.
  ◦ Ex. /com/parc/bulletin/NSDI.html
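
Longest prefix matching over component-delimited names can be illustrated with a small sketch. This is only a dictionary-based illustration, not the lookup engine described in the talk; the table contents, port numbers, and function names are hypothetical.

```python
def split_components(name):
    """Split '/com/parc/bulletin' into ['com', 'parc', 'bulletin']."""
    return [c for c in name.split("/") if c]


def longest_prefix_match(fib, name):
    """Return (prefix, port) for the longest table prefix that matches
    `name` on whole components, or None if no prefix matches."""
    comps = split_components(name)
    # Try the longest candidate prefix first, then drop one component at a time.
    for k in range(len(comps), 0, -1):
        prefix = "/" + "/".join(comps[:k])
        if prefix in fib:
            return prefix, fib[prefix]
    return None


# Hypothetical forwarding table: name prefix -> next hop port.
fib = {"/com": 1, "/com/parc": 2, "/com/parc/bulletin": 3}
result = longest_prefix_match(fib, "/com/parc/bulletin/NSDI.html")
```

Because matching works on whole components, `/com/parcel` matches `/com` and not `/com/parc`.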

• Challenges
  1. Content names are far more complex than IP addresses.
  2. CCN name tables could be far larger than today’s IP forwarding tables.
  3. Wire speeds have been relentlessly accelerating.
  4. CCN routers have to handle one new type of FIB update.

• Name table aggregation
  ◦ The hierarchical structure of NDN names and the longest prefix matching property enable us to aggregate NDN name tables into smaller ones.
  ◦ Two entries can be aggregated into one if:
    1. One of them is the shortest prefix of the other in the name table.
    2. They map to the same next hop port(s).
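
A minimal sketch of aggregation follows. It rests on one interpretive assumption: the prefix condition is read as "the nearest prefix of the entry that is present in the table", since that reading leaves longest prefix matching results unchanged. The table contents and function names are hypothetical.

```python
def components(name):
    return [c for c in name.split("/") if c]


def nearest_prefix(table, name):
    """Longest proper prefix of `name` (on component boundaries) present in `table`."""
    comps = components(name)
    for k in range(len(comps) - 1, 0, -1):
        prefix = "/" + "/".join(comps[:k])
        if prefix in table:
            return prefix
    return None


def aggregate(table):
    """Drop every entry whose nearest remaining prefix maps to the same ports.

    Assumption: "prefix of the other in the name table" is interpreted as the
    nearest prefix, so lookups return the same ports before and after.
    """
    out = {}
    # Shorter names first, so a parent is decided before its descendants.
    for name in sorted(table, key=lambda n: len(components(n))):
        parent = nearest_prefix(out, name)
        if parent is not None and out[parent] == table[name]:
            continue  # covered by the parent entry
        out[name] = table[name]
    return out


# Hypothetical name table: prefix -> tuple of next hop ports.
table = {
    "/com": (1,),
    "/com/parc": (1,),
    "/com/parc/x": (1,),
    "/com/parc/bulletin": (2,),
    "/com/parc/bulletin/NSDI.html": (2,),
}
aggregated = aggregate(table)
```

In this example five entries shrink to two, `/com` and `/com/parc/bulletin`.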

• FSM (finite state machine)
  ◦ Represented as a two-dimensional state transition table.
  ◦ Each state has 256 transitions, and each transition corresponds to a distinct input character.
  ◦ In the 3M name table:
    ▪ 20,440,366 states.
    ▪ 4 bytes for encoding a state ID.
    ▪ 1,024 bytes are needed for each row.
    ▪ The entire state transition table takes 19.49 GB of memory space.
  ◦ More than 80% of states have only one valid transition, and more than 13% of states (the accepting states) have no valid transition at all.
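
The table size follows directly from the numbers on this slide. The short calculation below reproduces it; the variable names are illustrative, and GB is taken to mean 2^30 bytes, which is the reading that matches the 19.49GB reported for STT in the memory space comparison.

```python
STATES = 20_440_366          # states in the 3M name table
TRANSITIONS_PER_STATE = 256  # one per possible input byte
BYTES_PER_STATE_ID = 4

row_bytes = TRANSITIONS_PER_STATE * BYTES_PER_STATE_ID  # bytes per table row
table_bytes = STATES * row_bytes                        # whole transition table
table_gib = table_bytes / 2**30                         # size in units of 2^30 bytes
```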

• Aligned transition array (ATA)
  ◦ Store only the valid transitions, in what we call an aligned transition array (ATA).
  ◦ Take the sum of the current state ID and the input character as an index into the transition array.
  ◦ Each state s needs a unique state ID, and each stored transition records its input character so that a lookup can verify the element it indexed.
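
The indexing and verification idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: state IDs are assigned by a simple first-fit search, a Python dict stands in for the flat array, and all names are hypothetical.

```python
def build_ata(transitions):
    """transitions: {state: {byte: next_state}}; every state must be a key.

    Returns (state_ids, ata). `ata` maps index -> (input byte, next state),
    where index = state ID + input byte.
    """
    ata = {}          # sparse stand-in for the flat transition array
    state_ids = {}
    used_ids = set()
    for state, edges in transitions.items():
        sid = 0
        # First-fit (assumption): smallest unused ID whose slots are all vacant.
        while sid in used_ids or any(sid + c in ata for c in edges):
            sid += 1
        state_ids[state] = sid
        used_ids.add(sid)
        for c, nxt in edges.items():
            ata[sid + c] = (c, nxt)
    return state_ids, ata


def step(state_ids, ata, state, c):
    """Follow one transition; None if the slot is empty or fails verification."""
    slot = ata.get(state_ids[state] + c)
    if slot is None or slot[0] != c:
        return None  # stored character differs: slot belongs to another state
    return slot[1]


def walk(state_ids, ata, start, text):
    state = start
    for c in text.encode():
        state = step(state_ids, ata, state, c)
        if state is None:
            return None
    return state


# Tiny character trie for the names "/a" and "/b" (states 2 and 3 accept).
trie = {0: {ord("/"): 1}, 1: {ord("a"): 2, ord("b"): 3}, 2: {}, 3: {}}
state_ids, ata = build_ata(trie)
```

Because state IDs are unique, a slot whose stored character equals the input character can only have been written by the current state, which is why the character check is sufficient here.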

• Multi-striding
  ◦ d characters are processed on each state transition.
  ◦ The component delimiter ‘/’ can only be the last character we read upon each state transition.
  ◦ Upon state transition, we keep reading in d input characters unless ‘/’ is encountered, where we stop.
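
The stride rule above can be sketched as a function that cuts a name into the chunks consumed per transition. This is a hedged illustration; the function name is hypothetical and the stride d = 4 in the example is chosen only because a 4-stride variant is evaluated later in the talk.

```python
def strides(name, d):
    """Cut `name` into chunks of at most d characters.

    A chunk ends early right after a '/', so the delimiter can only be the
    last character read on a transition.
    """
    chunks = []
    i = 0
    while i < len(name):
        chunk = name[i:i + d]
        cut = chunk.find("/")
        if cut != -1:
            chunk = chunk[:cut + 1]  # stop at the first delimiter
        chunks.append(chunk)
        i += len(chunk)
    return chunks


example = strides("/com/parc/bulletin", 4)
```

Here `example` is `["/", "com/", "parc", "/", "bull", "etin"]`: no chunk contains a `/` anywhere but at its end.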

  ◦ Create a number of small ATAs (multi-ATA, MATA), each ATA using one of the prime numbers as its maximum length. For a state x with two valid transitions, on inputs y and z:
    1. Try to store the two valid transitions on y and z into an ATA with prime number L as its maximum length.
    2. If the two valid transitions do not collide with each other but collide with some valid transition(s) previously stored in that ATA, try another ATA with the same maximum length.
    3. If the two valid transitions collide with each other, move on to storing state x into an ATA with a different maximum length, until ATAs with all different maximum lengths have been tried.
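
The three placement steps can be sketched as below. Several details are assumptions made only for illustration and are not stated on the slide: a transition of state x on input c is placed at slot (x + c) mod L, x is an integer given in advance, a fresh ATA of the same length is created when all existing ones collide, and the prime lengths are tiny.

```python
PRIMES = [5, 7, 11]  # hypothetical maximum lengths


def self_collide(x, inputs, length):
    """True if two transitions of the same state map to the same slot."""
    slots = [(x + c) % length for c in inputs]
    return len(set(slots)) != len(slots)


def place_state(pool, x, edges, primes=PRIMES):
    """Store the transitions of state x. pool: {length: [ata, ...]}.

    Each ata is a dict slot -> (x, input, target).
    Returns (length, ata_index), or None if every length was tried.
    """
    for length in primes:
        if self_collide(x, edges, length):
            continue  # step 3: try a different maximum length
        atas = pool.setdefault(length, [])
        for i, ata in enumerate(atas):
            if all((x + c) % length not in ata for c in edges):
                break  # step 1 succeeded in this ATA
        else:
            # Step 2: all existing ATAs of this length collide (or none exist),
            # so use another ATA with the same maximum length.
            atas.append({})
            i = len(atas) - 1
        for c, target in edges.items():
            atas[i][(x + c) % length] = (x, c, target)
        return length, i
    return None


pool = {}
first = place_state(pool, 0, {1: "a", 6: "b"})    # inputs 1 and 6 collide mod 5
second = place_state(pool, 7, {1: "c"})
third = place_state(pool, 14, {1: "d", 2: "e"})
fourth = place_state(pool, 2, {1: "f"})           # slot taken, needs a second ATA
```

The first state skips length 5 because both of its transitions map to slot 1 there; the fourth state fits length 5 but finds its slot occupied, so it goes into a second ATA of that length.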

• Name table update
  ◦ Name deletion
    1. Conduct a lookup of name P in the name table.
    2. Then backtrack towards the root, using the nodes remembered along the path from the root to the leaf node.
    3. Deleting a node is equivalent to deleting its stored valid transition in MATA.
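
Deletion with backtracking can be sketched on a plain character trie. The stopping rule is an assumption not spelled out on the slide: backtracking removes nodes until it reaches one that still has children or carries next hop information. The MATA storage is left out, and the class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}    # character -> Node
        self.next_hop = None  # set on nodes that end a stored name


def insert(root, name, next_hop):
    node = root
    for ch in name:
        node = node.children.setdefault(ch, Node())
    node.next_hop = next_hop


def delete(root, name):
    """Delete `name`; return False if it is not in the table."""
    path = []  # (parent, character) pairs remembered on the way down
    node = root
    for ch in name:
        child = node.children.get(ch)
        if child is None:
            return False
        path.append((node, ch))
        node = child
    if node.next_hop is None:
        return False
    node.next_hop = None
    # Backtrack toward the root, removing nodes nothing else depends on.
    for parent, ch in reversed(path):
        child = parent.children[ch]
        if child.children or child.next_hop is not None:
            break  # assumption: stop at a node that is still needed
        del parent.children[ch]
    return True


root = Node()
insert(root, "/a/b", 1)
insert(root, "/a/c", 2)
deleted = delete(root, "/a/b")
```

After this call the shared path `/a/` survives because `/a/c` still uses it.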

  ◦ Name insertion
    1. Conduct a lookup of name P in the name table, where we traverse the character trie in a top-down manner.
    2. To add an existing node’s new transition on x into MATA, we directly locate the transition array element in which the new transition should be stored.
    3. If that element is vacant, we simply store the new transition into that element.
    4. Otherwise, the node needs to be relocated to resolve the storage collision.

• The GPU achieves high processing throughput by exploiting massive data-level parallelism.
  ◦ A large batch of names is processed by a large number of GPU threads concurrently.
  ◦ Batching can lead to extended per-packet lookup latency.

• Names are processed in 16MB batches.

• Resolve this latency-throughput dilemma by exploiting the multi-stream mechanism featured in NVIDIA’s Fermi GPU architecture.
• A stream is a sequence of operations that execute in issue-order.

• Each stream is composed of a number of lookup threads, each thread consisting of three tasks.
  1. DataFetch: copy input names from host CPU to GPU device (via PCIe bus).
  2. Kernel: perform name lookup inside GPU.
  3. WriteBack: write lookup results back from GPU device to host CPU (via PCIe bus).

• The Kernel task of stream i runs (on the kernel engine) in parallel with the WriteBack task of stream i-1 followed by the DataFetch task of stream i+1 (both running on the copy engine).
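
The overlap described above can be illustrated with a small timing model. This is only a sketch: the task durations are hypothetical unit values, each engine runs one task at a time, and the copy engine is assumed to issue tasks in the order implied by the slide (WriteBack of stream i-1, then DataFetch of stream i+1, while Kernel i runs).

```python
def schedule(n, fetch, kernel, write):
    """Return (fetch_end, kernel_end, write_end) dicts of finish times per stream."""
    # Copy engine order: F0, F1, W0, F2, W1, ..., W(n-1).
    copy_order = [("F", 0)]
    for i in range(n):
        if i >= 1:
            copy_order.append(("W", i - 1))
        if i + 1 < n:
            copy_order.append(("F", i + 1))
    copy_order.append(("W", n - 1))

    fetch_end, kernel_end, write_end = {}, {}, {}
    copy_free = 0.0  # time at which the copy engine becomes idle
    for kind, i in copy_order:
        if kind == "F":
            fetch_end[i] = copy_free = copy_free + fetch
            # Kernel i needs its input and the kernel engine (busy until Kernel i-1 ends).
            start = max(fetch_end[i], kernel_end.get(i - 1, 0.0))
            kernel_end[i] = start + kernel
        else:
            # WriteBack i needs the result of Kernel i and the copy engine.
            start = max(copy_free, kernel_end[i])
            write_end[i] = copy_free = start + write
    return fetch_end, kernel_end, write_end


# Three streams with hypothetical unit-length tasks.
fetch_end, kernel_end, write_end = schedule(3, 1.0, 1.0, 1.0)
```

With these values the last WriteBack finishes at time 6 rather than the 9 time units that nine strictly serial tasks would need, and the results of the first stream are available at time 3 without waiting for the whole batch.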

• Setup: 3M name table with a 16MB batch size, organized into 1 to 512 streams, using 2,048 threads.
• Multi-streaming reduces lookup latency to 101μs while maintaining lookup throughput (using 128 or more streams).

• Goal: reduce the number of slow DRAM accesses by exploiting the GPU’s memory access coalescence mechanism.
• The off-chip DRAM (e.g. global memory) is partitioned into 128-byte memory blocks. When a piece of data is requested, the entire 128-byte block containing the data is fetched (with one memory access).
• When multiple threads simultaneously read data from the same block, their read requests are coalesced into one single memory access (to that block).
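
The effect of coalescing can be estimated by counting the distinct 128-byte blocks touched by a group of simultaneous reads. This is a simplified model for illustration only; real hardware rules depend on the GPU architecture, and the addresses and function name are hypothetical.

```python
BLOCK = 128  # bytes per memory block, as stated on the slide


def memory_accesses(addresses, block=BLOCK):
    """Number of block fetches needed if reads to the same block are coalesced."""
    return len({addr // block for addr in addresses})


# 32 threads reading consecutive 4-byte words: all within one block.
adjacent = memory_accesses([4 * t for t in range(32)])
# 32 threads reading words that are 256 bytes apart: one block each.
scattered = memory_accesses([256 * t for t in range(32)])
```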

• Employ an effective technique for optimizing memory access performance called input interweaving,
  ◦ which stores input names in an interweaved layout.
• Every 32 threads (with consecutive thread IDs) are bundled together as a separate warp, running synchronously in a SIMD manner.
• When the 32 threads of a warp simultaneously read the first piece of data from each of the names they are processing, the reads can result in 32 separate memory accesses; the interweaved layout is intended to let such reads be coalesced.
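
One possible interweaved layout is sketched below. The details are assumptions for illustration: names are padded to equal length, the piece size is 4 bytes (so that 32 pieces fill one 128-byte block), and the function names are hypothetical.

```python
PIECE = 4    # bytes read by a thread at a time (assumption)
BLOCK = 128  # bytes per memory block
WARP = 32    # threads per warp


def interweave(names, piece=PIECE):
    """Store piece 0 of every name, then piece 1 of every name, and so on.

    Assumes all names are byte strings padded to the same length.
    """
    n_pieces = len(names[0]) // piece
    out = bytearray()
    for j in range(n_pieces):
        for name in names:
            out += name[j * piece:(j + 1) * piece]
    return bytes(out)


def blocks_touched(addresses):
    return len({a // BLOCK for a in addresses})


def sequential_addr(i, j, name_len):
    """Address of piece j of name i when names are stored one after another."""
    return i * name_len + j * PIECE


def interweaved_addr(i, j, num_names):
    """Address of piece j of name i in the interweaved layout."""
    return (j * num_names + i) * PIECE


layout = interweave([b"aaaabbbb", b"ccccdddd"])
# One warp reading piece 0 of 32 names, each padded to 128 bytes (hypothetical).
seq_blocks = blocks_touched([sequential_addr(i, 0, 128) for i in range(WARP)])
woven_blocks = blocks_touched([interweaved_addr(i, 0, WARP) for i in range(WARP)])
```

Under these assumptions the sequential layout touches 32 blocks for the first read of a warp, while the interweaved layout touches one.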

• Platform, environment and tools
  ◦ CPU: Linux Operating System version fc15.x86_64
  ◦ GPU: CUDA NVIDIA-Linux operating system version x86_

• System framework

• Name Tables
  ◦ 3M name table
    ▪ 2,763,780 entries
    ▪ Obtain existing domain name information from DMOZ
  ◦ 10M name table
    ▪ 10,000,000 entries
    ▪ Use a web crawler program to collect domain names
    ▪ 3M + 7M

• Name Traces
  ◦ Formed by concatenating name prefixes selected from the name table and randomly generated suffixes.
  ◦ The average workload trace is generated by randomly choosing names from the name table.
  ◦ The heavy workload trace is generated by randomly choosing from the top 10% longest names in the name table.
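
Trace generation along these lines might look like the sketch below. It is not the authors' generator: the suffix length, the alphabet, the use of character count as the measure of "longest", and all names are assumptions.

```python
import random


def make_trace(table, count, heavy=False, suffix_len=8, seed=0):
    """Build `count` lookup names: a table prefix plus a random suffix.

    heavy=True restricts the prefixes to the top 10% longest table names
    (length measured in characters, which is an assumption).
    """
    rng = random.Random(seed)
    names = sorted(table, key=len, reverse=True)
    if heavy:
        names = names[:max(1, len(names) // 10)]
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    trace = []
    for _ in range(count):
        prefix = rng.choice(names)
        suffix = "".join(rng.choice(alphabet) for _ in range(suffix_len))
        trace.append(prefix + "/" + suffix)
    return trace


# Hypothetical 20-entry table with names of increasing length.
table = ["/com/" + "x" * k for k in range(1, 21)]
average_trace = make_trace(table, 50)
heavy_trace = make_trace(table, 50, heavy=True)
```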

• STT
  ◦ The baseline method: two-dimensional state transition table
• ATA
• 4-stride MATA
• MATA-NW
  ◦ Improve MATA with interweaved name

• Memory Space

  Method    3M name table    10M name table
  STT       19.49GB          69.62GB
  ATA       101x             102x
  MATA      130x             142x

  (The ATA and MATA rows give space reduction factors relative to STT.)

• Lookup Performance
  ◦ CPU-GPU System Performance

  ◦ GPU Engine Core Performance

• Scalability

• Name table update