Minimizing Delay in Shared Pipelines Ori Rottenstreich (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel) Yoram Revah, Aviran Kadosh.

Slides:



Advertisements
Similar presentations
Approximation algorithms for geometric intersection graphs.
Advertisements

Compressing Forwarding Tables Ori Rottenstreich (Technion, Israel) Joint work with Marat Radan, Yuval Cassuto, Isaac Keslassy (Technion, Israel) Carmi.
Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.
Fast Algorithms For Hierarchical Range Histogram Constructions
CHAPTER 13 M ODELING C ONSIDERATIONS AND S TATISTICAL I NFORMATION “All models are wrong; some are useful.”  George E. P. Box Organization of chapter.
Bounds on Code Length Theorem: Let l ∗ 1, l ∗ 2,..., l ∗ m be optimal codeword lengths for a source distribution p and a D-ary alphabet, and let L ∗ be.
EE 685 presentation Optimal Control of Wireless Networks with Finite Buffers By Long Bao Le, Eytan Modiano and Ness B. Shroff.
Windows Scheduling Problems for Broadcast System 1 Amotz Bar-Noy, and Richard E. Ladner Presented by Qiaosheng Shi.
June 3, 2015Windows Scheduling Problems for Broadcast System 1 Amotz Bar-Noy, and Richard E. Ladner Presented by Qiaosheng Shi.
Module R R RRR R RRRRR RR R R R R Efficient Link Capacity and QoS Design for Wormhole Network-on-Chip Zvika Guz, Isask ’ har Walter, Evgeny Bolotin, Israel.
Frame-Aggregated Concurrent Matching Switch Bill Lin (University of California, San Diego) Isaac Keslassy (Technion, Israel)
On the Code Length of TCAM Coding Schemes Ori Rottenstreich (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel) 1.
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion Review.
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
Path Protection in MPLS Networks Ashish Gupta Design and Evaluation of Fault Tolerance Algorithms with Performance Constraints.
Worst-Case TCAM Rule Expansion Ori Rottenstreich (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel)
048866: Packet Switch Architectures Dr. Isaac Keslassy Electrical Engineering, Technion MSM.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
Guaranteed Smooth Scheduling in Packet Switches Isaac Keslassy (Stanford University), Murali Kodialam, T.V. Lakshman, Dimitri Stiliadis (Bell-Labs)
October 5th, 2005 Jitter Regulation for Multiple Streams David Hay and Gabriel Scalosub Technion, Israel.
1 Combinatorial Dominance Analysis The Knapsack Problem Keywords: Combinatorial Dominance (CD) Domination number/ratio (domn, domr) Knapsack (KP) Incremental.
© 2005, it - instituto de telecomunicações. Todos os direitos reservados. Gerhard Maierbacher Scalable Coding Solutions for Wireless Sensor Networks IT.
Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.
Introduction to Boosting Aristotelis Tsirigos SCLT seminar - NYU Computer Science.
1 The Designs and Analysis of a Scalable Optical Packet Switching Architecture Speaker: Chia-Wei Tuan Adviser: Prof. Ho-Ting Wu 3/4/2009.
Statistical Approach to NoC Design Itamar Cohen, Ori Rottenstreich and Isaac Keslassy Technion (Israel)
Variable-Length Codes: Huffman Codes
May 7 th, 2006 On the distribution of edges in random regular graphs Sonny Ben-Shimon and Michael Krivelevich.
Optimal Partition of QoS Requirements on Unicast Paths and Multicast Trees Dean H.Lorenz and Ariel Orda Department of Electrical Engineering Technion –
Max-flow/min-cut theorem Theorem: For each network with one source and one sink, the maximum flow from the source to the destination is equal to the minimal.
Improved results for a memory allocation problem Rob van Stee University of Karlsruhe Germany Leah Epstein University of Haifa Israel WADS 2007 WAOA 2007.
1 Distributed Computing Optical networks: switching cost and traffic grooming Shmuel Zaks ©
DEXA 2005 Quality-Aware Replication of Multimedia Data Yicheng Tu, Jingfeng Yan and Sunil Prabhakar Department of Computer Sciences, Purdue University.
The Poisson Process. A stochastic process { N ( t ), t ≥ 0} is said to be a counting process if N ( t ) represents the total number of “events” that occur.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
1 Automatic Refinement and Vacuity Detection for Symbolic Trajectory Evaluation Orna Grumberg Technion Haifa, Israel Joint work with Rachel Tzoref.
Delay-Throughput Tradeoff with Correlated Mobility in Ad-Hoc Networks Shuochao Yao*, Xinbing Wang*, Xiaohua Tian* ‡, Qian Zhang † *Department of Electronic.
Time Series Data Analysis - I Yaji Sripada. Dept. of Computing Science, University of Aberdeen2 In this lecture you learn What are Time Series? How to.
A Clustering Algorithm based on Graph Connectivity Balakrishna Thiagarajan Computer Science and Engineering State University of New York at Buffalo.
Edge-disjoint induced subgraphs with given minimum degree Raphael Yuster 2012.
Universit at Dortmund, LS VIII
Networks of Queues Plan for today (lecture 6): Last time / Questions? Product form preserving blocking Interpretation traffic equations Kelly / Whittle.
Palette: Distributing Tables in Software-Defined Networks Yossi Kanizo (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel) and David Hay.
Interaction of Overlay Networks: Properties and Implications Joe W.J. Jiang Dah-Ming Chiu John C.S. Lui The Chinese University of Hong Kong.
Competitive Queue Policies for Differentiated Services Seminar in Packet Networks1 Competitive Queue Policies for Differentiated Services William.
1 Prune-and-Search Method 2012/10/30. A simple example: Binary search sorted sequence : (search 9) step 1  step 2  step 3  Binary search.
Expectation for multivariate distributions. Definition Let X 1, X 2, …, X n denote n jointly distributed random variable with joint density function f(x.
On Finding an Optimal TCAM Encoding Scheme for Packet Classification Ori Rottenstreich (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel)
The Bloom Paradox Ori Rottenstreich Joint work with Yossi Kanizo and Isaac Keslassy Technion, Israel.
Guaranteed Smooth Scheduling in Packet Switches Isaac Keslassy (Stanford University), Murali Kodialam, T.V. Lakshman, Dimitri Stiliadis (Bell-Labs)
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
Operational Research & ManagementOperations Scheduling Economic Lot Scheduling 1.Summary Machine Scheduling 2.ELSP (one item, multiple items) 3.Arbitrary.
The Bloom Paradox Ori Rottenstreich Joint work with Isaac Keslassy Technion, Israel.
Improved Competitive Ratios for Submodular Secretary Problems ? Moran Feldman Roy SchwartzJoseph (Seffi) Naor Technion – Israel Institute of Technology.
Queueing in switched networks Damon Wischik, UCL thanks to Devavrat Shah, MIT TexPoint fonts used in EMF. Read the TexPoint manual before you delete this.
Compression for Fixed-Width Memories Ori Rottenstriech, Amit Berman, Yuval Cassuto and Isaac Keslassy Technion, Israel.
1 Using Network Coding for Dependent Data Broadcasting in a Mobile Environment Chung-Hua Chu, De-Nian Yang and Ming-Syan Chen IEEE GLOBECOM 2007 Reporter.
1 Power Efficient Monitoring Management in Sensor Networks A.Zelikovsky Georgia State joint work with P. BermanPennstate G. Calinescu Illinois IT C. Shah.
Approximation Algorithms based on linear programming.
Unconstrained Submodular Maximization Moran Feldman The Open University of Israel Based On Maximizing Non-monotone Submodular Functions. Uriel Feige, Vahab.
Random Testing: Theoretical Results and Practical Implications IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 2012 Andrea Arcuri, Member, IEEE, Muhammad.
The minimum cost flow problem
The Variable-Increment Counting Bloom Filter
RE-Tree: An Efficient Index Structure for Regular Expressions
Tapping Into The Unutilized Router Processing Power
Random Testing.
Effective Social Network Quarantine with Minimal Isolation Costs
Depth Estimation via Sampling
Elementary Sorting Algorithms
Worst-Case TCAM Rule Expansion
Presentation transcript:

Minimizing Delay in Shared Pipelines Ori Rottenstreich (Technion, Israel) Joint work with Isaac Keslassy (Technion, Israel) Yoram Revah, Aviran Kadosh (Marvell Israel)

Pipeline Sharing for H.264 video-compression standard 2 9 profiles with 21 features are organized in 9 pipelines in a 9x9 multicore scalability problem: more profiles, more features more cores

Pipeline Sharing for H.264 video-compression standard profiles with 21 features are organized in 9 pipelines in a 9x9 multicore Pipeline Sharing: The 9 profiles are serviced by one of two possible pipelines in a smaller 5x5 multicore

Regular architecture, without pipeline sharing: o 3 pipelines for the k=3 (uniformly distributed) packet types with N=8 cores o Average delay of T=(3+2+3)/3≈2.67 time slots With pipeline sharing: o 2 pipelines for the k=3 packet types with only N=6 cores o Average delay of T=(4+2+4)/3≈3.33 time slots o Tradeoff: Less cores, larger average delay Another Pipeline Sharing Example 4

Traffic: o There are k packet types with known probabilities, each requires to perform tasks among {1, …, r } possible tasks o Example: Type 1 w.p. p 1 =0.6 requires tasks S 1 ={1,2,3} and Type 2 w.p. p 2 =0.4 requires tasks S 2 ={1,4} Pipeline sharing: o A limited number of N homogeneous cores is given. Each core can serve any task o The cores should be divided into pipelines, each serves one or more packet types o The delay of a packet equals the length of its pipeline Optimization Problem: o Divide the N cores into pipelines, such that the average delay is minimized o For a given N, we denote by T OPT (N) the minimal possible average delay Model and Problem Definition 5

Outline  Introduction and Problem Definition  General Properties  Optimal Algorithm for a Special Case  Greedy Algorithm  Experimental Results  Summary 6

Property: (i) At least cores are required (ii) For all the optimal average delay satisfies (iii) for General Properties 7 Number of cores (N) Average delay (T OPT (N)) (i) (ii) (iii)

Property: Let k be the number of packet types. Then, (i) Given an unlimited number of cores, the number of solutions with d pipelines is given by where, is the Stirling number of the second kind of k,d (ii) The total number of solutions is Example: Consider k=3 packet types There is a single solution with 1 pipeline: There are solutions with 2 pipelines: There is solution with 3 pipelines: The total number of solutions is General Properties 8

Property: Assume that packet types can be partitioned into two disjoint sets, i.e. they can be ordered s.t. Then, s.t. an optimal solution given N cores can be obtained as the union of the two sets of pipelines in the optimal solutions for packet types [1,m] with N 0 cores and for packet types [(m+1),k] with (N-N 0 ) cores. Proof Outline: Any pipeline in an optimal solution cannot serve tasks from both sets Otherwise, it could be partitioned into two smaller pipelines to reduce the average delay. Separability 9

Outline  Introduction and Problem Definition  General Properties  Optimal Algorithm for a Special Case  Greedy Algorithm  Experimental Results  Summary 10

We suggest an optimal algorithm for a special case of the sets of required tasks: for Example: k=3, The condition is equivalent to the requirement Assume that are ordered s.t. for Let be the sets of tasks served by the pipelines in an optimal solution Proposition: The pipelines in an optimal solution satisfy, i.e. the pipelines in the solution are among the pipelines in the input Simple Case of the Required Tasks: S i = [1,X i ] 11

Proposition: Assume that are ordered such that Then, (i) The packet types are served by an increasing order of pipelines. In particular, the packet types served by each pipeline form a subset of consecutive packet types from the input. (ii) The last packet type is served by the last pipeline. Simple Case of the Required Tasks: S i = [1,X i ] 12

Let (for ) be the minimal possible average delay obtained in the service of the first i types with at most n cores. Let be the corresponding set of pipelines. Proposition: (i) For (ii) For j that minimizes (i), Algorithm: o In step i (for ), calculate o Return: (optimal delay), (set of pipelines) o Complexity: Simple Case of the Required Tasks: S i = [1,X i ] 13

Outline  Introduction and Problem Definition  General Properties  Optimal Algorithm for a Special Case  Greedy Algorithm  Experimental Results  Summary 14

Idea: Start with a pipeline for each type, merge pairs of pipelines with common cores Intuition: For each pair with common cores, we prefer to merge o More common cores o Less non-common cores o Low probability to be served by a pipeline in the pair Marginal cost of a possible merging operation o x – expected increment in the average delay o y – decreased number of cores o Marginal cost of R=x/y Algorithm: Until the required number of cores is obtained, merge pairs of pipelines with minimal marginal cost R=x/y Not necessarily optimal but very efficient on synthetic and real-life applications Pipeline Merging Algorithm 15 p1p1 p2p2 p3p3

Synthetic Simulation Parameters: o We consider k=8 packet types o Tasks selected among r =10 possible tasks {1,…,r =10} o Each type requires a specific task w.p. 0.5 without any dependency between different types or tasks o Two options for the packet types probabilities:  fixed prob. – all types w.p. 1/k =  variable prob. – geometrically decreasing probabilities of 2 -1, 2 -2, 2 -3, 2 -4, 2 -5, 2 -6, 2 -7 and 2 -7 o Results are based on the average of 10 3 iterations Experimental Results 16

Effectiveness of the number of cores on the average delay in time slots. k=8 packet types with r =10 possible tasks, each required w.p. 0.5 by each type iterations. Experimental Results 17 Less cores  larger delay A lower bound: Average number of tasks per type 10∙0.5=5 Maximal observed total number of tasks

Summary of the synthetic simulations. k=8 packet types with r =10 possible tasks, each required w.p. 0.5 by each type iterations. Experimental Results cores instead of 60, delay of 8.47 / 8.11 time slots instead of ~5 A difference of less than 2%

Possible Tasks: (1) Parsing, (2) Ingress interface attributes, (3) Ingress ACL, (4) L2 bridging, (5) L3 routing, (6) L3 replication, (7) MPLS switching, (8) header modification, (9) L2 replication, (10) Egress interface resolution, (11) Egress ACL Packet-processing Application 19

k=5 packet types w.p. (0.25, 0.15, 0.2, 0.3, 0.1) and r =11 possible tasks. The total number of solutions is G(5) = 52. The greedy algorithm starts with 28 cores, then reduces it to 24, 20, 15 and finally 11. It obtains the optimal delay in 16 out of 18 values of N. Packet-processing Application 20 total number of tasks The greedy algorithm obtains the optimal delay in 16 out of 18 values of N.

H.264 video-compression standard profiles with 21 features are organized in 9 pipelines in a 9x9 multicore Pipeline Sharing: The 9 profiles are serviced by one of two possible pipelines in a smaller 5x5 multicore

k=9 popular (uniformly-distributed) profiles (types) of the H.264 standard with r =21 supported features (tasks). Results are based on the greedy algorithm. H.264 video-compression standard instead of 75 cores, delay larger by 64% 49% less cores, delay larger by only 21% total number of features

Concluding Remarks New approach of sharing pipelines to reduce the number of required cores Analysis of the optimal average delay Optimal solution for Greedy algorithm for the general case 22

Thank You