Future Nets: Beyond IP Networking

Future Nets: Beyond IP Networking
Building large networks (at the edge):
- Large-scale Ethernets and enterprise networks: scaling Ethernet to millions of nodes
- Building networks for the backend of the Internet: networks for cloud computing and data centers
Slides by Prof. Zhi-Li Zhang, UMN Advanced Networking Course CSci5221

Even within a Single Administrative Domain
- Large ISPs and enterprise networks
- Large data centers with thousands or tens of thousands of machines
- Metro Ethernet
- More and more devices are "Internet-capable" and plugged in
- Increasingly rich and diverse network topology and connectivity

Data Center Networks
Data centers:
- The backend of the Internet
- Mid-scale (most enterprises) to mega-scale (Google, Yahoo, MS, etc.)
- E.g., a regional DC of a major on-line service provider consists of 25K servers + 1K switches/routers
To ensure business continuity and to lower operational cost, DCs must:
- Adapt to varying workload ("breathing")
- Avoid/minimize service disruption during maintenance or failures (agility)
- Maximize aggregate throughput (load balancing)

Challenges Posed by These Trends
- Scalability: capability to connect tens of thousands, millions, or more users and devices; routing table size is constrained by router memory and lookup speed
- Mobility: hosts are more mobile; need to separate location ("addressing") from identity ("naming")
- Availability & reliability: must be resilient to failures; need to be proactive instead of reactive and to localize the effect of failures
- Manageability: ease of deployment, "plug-&-play"; need to minimize manual configuration, and to self-configure and self-organize while ensuring security and trust
- ...

Quick Overview of Ethernet
- Dominant wired LAN technology
- Covers the first IP hop in most enterprises/campuses
- First widely used LAN technology
- Simpler and cheaper than token LANs, ATM, and IP
- Kept up with the speed race: from 10 Mbps to 40 Gbps, with 100 Gbps becoming widely available
(Figure: Metcalfe's original Ethernet sketch)

Ethernet Frame Structure
- Addresses: source and destination MAC addresses
  - Flat, globally unique, and permanent 48-bit values
  - The adaptor passes the frame to the network-level protocol if the destination address matches its own or is the broadcast address; otherwise it discards the frame
- Type: indicates the higher-layer protocol (usually IP)
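
To make the frame layout concrete, here is a minimal Python sketch (not from the slides) that parses the 14-byte Ethernet header of a raw frame; the field sizes follow the standard destination/source/EtherType layout, and the example frame is made up.

```python
import struct

def parse_ethernet_header(frame: bytes):
    """Parse the 14-byte Ethernet II header: dst MAC, src MAC, EtherType."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    fmt_mac = lambda b: ":".join(f"{x:02x}" for x in b)
    return {
        "dst": fmt_mac(dst),          # destination MAC (flat 48-bit address)
        "src": fmt_mac(src),          # source MAC
        "ethertype": hex(ethertype),  # e.g. 0x0800 = IPv4, 0x0806 = ARP
        "payload": frame[14:],        # handed to the higher-layer protocol
    }

# Example: a frame sent to the broadcast address carrying a 28-byte ARP payload.
frame = bytes.fromhex("ffffffffffff" "02aabbccddee" "0806") + b"\x00" * 28
print(parse_ethernet_header(frame)["dst"])  # ff:ff:ff:ff:ff:ff
```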

Interaction w/ the Upper Layer (IP)
- Bootstrapping end hosts by automating host configuration (e.g., IP address assignment)
  - DHCP (Dynamic Host Configuration Protocol): broadcast DHCP discovery and request messages
- Bootstrapping each conversation by enabling resolution from IP to MAC address
  - ARP (Address Resolution Protocol): broadcast ARP requests
- Both protocols work via Ethernet-layer broadcasting (i.e., shouting!)
- Ethernet broadcast domain: a group of hosts and switches to which the same broadcast or flooded frame is delivered
- Too large a broadcast domain leads to:
  - Excessive flooding and broadcasting overhead
  - Insufficient security/performance isolation

State of the Practice: A Hybrid Architecture
Enterprise networks are comprised of Ethernet-based IP subnets (broadcast domains, i.e., LANs or VLANs) interconnected by routers.
- Ethernet bridging: flat addressing, self-learning, flooding, forwarding along a tree
- IP routing (e.g., OSPF): hierarchical addressing, subnet configuration, host configuration, forwarding along shortest paths
(Figure: broadcast domains bridged at layer 2 and interconnected by routers, labeled R.)

Ethernet Bridging: "Routing" at L2
- Routing determines the paths to destinations through which traffic is forwarded
- Routing takes place at any layer (including L2) where devices are reachable across multiple hops:
  - IP routing (IP layer)
  - Overlay routing, P2P, or CDN routing (application layer)
  - Ethernet bridging (link layer)

Ethernet (Layer-2) "Routing"
- Self-learning algorithm for dynamically building switch (forwarding) tables:
  - "Eavesdrop" on the source MACs of data packets
  - Associate each source MAC with the port # it arrived on (cached, "soft-state")
- Forwarding algorithm:
  - If the dst MAC is found in the switch table, send to the corresponding port
  - Otherwise, flood to all ports (except the one the frame came from)
- Dealing with "loopy" topologies: periodically run the spanning tree algorithm to convert the topology into a tree (rooted at an "arbitrary" node)
- 802.11 wireless LANs use somewhat similar methods: the same 48-bit MAC addresses but more complex frame structures, and end hosts need to explicitly associate with APs
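
As an illustration (not part of the original slides), here is a minimal Python sketch of the self-learning and flooding behavior described above; the class name, port numbering, and in-memory table are hypothetical, and a real switch would also age out entries and run spanning tree.

```python
class LearningSwitch:
    """Minimal Ethernet learning switch: learn source MACs, forward or flood."""

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.mac_table = {}  # MAC address -> port number (soft state in practice)

    def handle_frame(self, src_mac: str, dst_mac: str, in_port: int):
        # Learning step: remember which port the source is reachable on.
        self.mac_table[src_mac] = in_port

        # Forwarding step: known unicast dst -> one port; unknown or broadcast -> flood.
        if dst_mac in self.mac_table and dst_mac != "ff:ff:ff:ff:ff:ff":
            return [self.mac_table[dst_mac]]
        return [p for p in range(self.num_ports) if p != in_port]

sw = LearningSwitch(num_ports=4)
print(sw.handle_frame("aa:aa", "bb:bb", in_port=0))  # unknown dst -> flood [1, 2, 3]
print(sw.handle_frame("bb:bb", "aa:aa", in_port=2))  # learned dst -> [0]
```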

Layer 2 vs. Layer 3 Again
Neither bridging nor routing is satisfactory. Can't we take only the best of each?
- Ethernet bridging offers ease of configuration and host mobility, but falls short on optimality in addressing, path efficiency, load distribution, convergence speed, and tolerance to loops.
- IP routing offers the latter set of properties, but gives up plug-and-play configuration and easy host mobility.
- SEATTLE aims to combine the strengths of both.

SEATTLE (Scalable Ethernet ArchiTecTure for Large Enterprises)
A plug-and-playable enterprise architecture ensuring both scalability and efficiency.
Objectives:
- Avoiding flooding
- Restraining broadcasting
- Keeping forwarding tables small
- Ensuring path efficiency
SEATTLE architecture design principles:
- Hash-based location management
- Shortest-path forwarding
- Responding to network dynamics (reactive location resolution and caching)
Lesson: trading a little data-plane efficiency for huge control-plane scalability makes a qualitatively different system.

SEATTLE Design
- Flat addressing of end-hosts
  - Switches use hosts' MAC addresses for routing
  - Ensures zero-configuration and backwards-compatibility (Obj #5)
- Automated host discovery at the edge
  - Switches detect the arrival/departure of hosts
  - Obviates flooding and ensures scalability (Obj #1, 5)
- Hash-based on-demand resolution
  - A hash deterministically maps a host to a switch
  - Switches resolve end-hosts' location and address via hashing
  - Ensures scalability (Obj #1, 2, 3)
- Shortest-path forwarding between switches
  - Switches run link-state routing to maintain only the switch-level topology (i.e., they do not disseminate end-host information)
  - Ensures data-plane efficiency (Obj #4)
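
The following Python sketch illustrates the idea behind hash-based location resolution; it is not SEATTLE's actual implementation, and the class and switch names are made up. Each host MAC is deterministically mapped onto one of the switches on a hash ring, and that switch stores and serves the host's <MAC, egress switch> binding, so no switch needs to know about every host.

```python
import hashlib
from bisect import bisect_right

def h(key: str) -> int:
    """Stable hash into a large integer space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ResolverRing:
    """Hash ring mapping host MACs to the switch responsible for their location."""

    def __init__(self, switches):
        self.ring = sorted((h(s), s) for s in switches)
        self.bindings = {}  # resolver switch -> {host MAC: egress switch}

    def resolver_for(self, mac: str) -> str:
        points = [p for p, _ in self.ring]
        i = bisect_right(points, h(mac)) % len(self.ring)
        return self.ring[i][1]

    def publish(self, mac: str, egress_switch: str):
        # The egress switch publishes <mac, itself> to the resolver F(mac).
        self.bindings.setdefault(self.resolver_for(mac), {})[mac] = egress_switch

    def resolve(self, mac: str) -> str:
        # An ingress switch asks F(mac) where the host lives, then tunnels there.
        return self.bindings[self.resolver_for(mac)][mac]

ring = ResolverRing(["A", "B", "C", "D"])
ring.publish("00:11:22:33:44:55", egress_switch="A")
print(ring.resolve("00:11:22:33:44:55"))  # -> "A"
```

Because the mapping is a hash over the switch set, adding or removing a switch only moves the bindings that hashed to it, which is what keeps the control plane scalable.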

(Figure: SEATTLE operation over an entire enterprise forming a large single IP subnet, with a link-state core of switches A through E connecting end-hosts. Host x is discovered/registered at its edge switch A; A computes Hash(F(x) = B) and the binding <x, A> is stored at relay switch B. When switch D has traffic to x, it computes F(x) = B and tunnels the traffic to relay switch B, which tunnels it on to egress node A for delivery to x; B notifies <x, A> to D, so subsequent packets are forwarded directly from D to A (optimized forwarding). Control flow and data flow are shown separately.)

Cloud Computing and Data Centers
- What's cloud computing?
- Data centers and "computing at scale"
- Case studies:
  - Google File System
  - MapReduce programming model
- Optional material: Google Bigtable

Cloud Computing and Data Centers
Why study this?
- They represent part of current and "future" trends: how applications will be serviced, delivered, ...
- What are the important "new" networking problems?
- More importantly, what lessons can we learn for (future) networking design?
- They are closely related to networking, with many similar issues/challenges (availability, reliability, scalability, manageability, ...), though of course networking also has its own unique challenges.

Internet and Web
- Simple client-server model: a number of clients served by a single server
  - Performance is determined by the "peak load"
  - Doesn't scale well when the # of clients suddenly increases (a "flash crowd"), e.g., the server crashes
- From a single server, to blade servers, to a server farm (or data center)

Internet and Web ...
- From the "traditional" web to "web services" (or SOA): no longer simply "file" (or web page) downloads
  - Pages are often dynamically generated, with more complicated "objects" (e.g., Flash videos used in YouTube)
  - HTTP is used simply as a "transfer" protocol; many other "application protocols" are layered on top of HTTP
  - Web services & SOA (service-oriented architecture)
(Figure: a schematic representation of "modern" web services, with a front-end of web rendering, request routing, and aggregators, and a back-end of database, storage, and computing.)

Data Center and Cloud Computing
- Data center: large server farms + data warehouses
  - Not simply for web/web services
  - Managed infrastructure: expensive!
- From web hosting to cloud computing
  - Individual web/content providers must provision for peak load: expensive, and resources are typically under-utilized
  - Web hosting: a third party provides and owns the (server farm) infrastructure, hosting web services for content providers
  - "Server consolidation" via virtualization
(Figure: a virtualized server, with the guest OS and application under client web-service control running on top of a VMM.)

Cloud Computing
- Cloud computing and cloud-based services go beyond web-based "information access" or "information delivery" to computing, storage, ...
- NIST definition: "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
- Models of cloud computing:
  - "Infrastructure as a Service" (IaaS), e.g., Amazon EC2, Rackspace
  - "Platform as a Service" (PaaS), e.g., Microsoft Azure
  - "Software as a Service" (SaaS), e.g., Google

Data Centers: Key Challenges
With thousands of servers within a data center:
- How to write applications (services) for them?
- How to allocate resources and manage them? In particular, how to ensure performance, reliability, availability, ...
- Scale and complexity bring other key challenges:
  - With thousands of machines, failures are the default case!
  - Load-balancing, handling "heterogeneity", ...
- The data center (server cluster) as a "computer": "super-computer" vs. "cluster computer"
  - A single "super-high-performance" and highly reliable computer vs. a "computer" built out of thousands of "cheap & unreliable" PCs
  - Pros and cons?

Case Studies
- Google File System (GFS): a "file system" (or "OS") for a "cluster computer"
  - An "overlay" on top of the "native" OS on individual machines
  - Designed with certain (common) types of applications in mind, and with failures as the default case
- Google MapReduce (cf. Microsoft Dryad): a new "programming paradigm" for certain (common) types of applications, built on top of GFS
- Other examples (optional):
  - BigTable: a (semi-)structured database for efficient key-value queries, etc., built on top of GFS
  - Amazon Dynamo: a distributed <key, value> storage system for which high availability is a key design goal
  - Google's Chubby, Sawzall, etc.
  - Open source systems: Hadoop, ...

Google Scale and Philosophy
- Lots of data: copies of the web, satellite data, user data, email and USENET, Subversion backing store
- Workloads are large and easily parallelizable
- No commercial system is big enough
  - Couldn't afford it if there were one
  - It might not have made appropriate design choices
- But truckloads of low-cost machines: 450,000 machines (NYTimes estimate, June 14th 2006)
- Failures are the norm
  - Even reliable systems fail at Google scale
  - Software must tolerate failures
  - Which machine an application is running on should not matter
- Firm believers in the "end-to-end" argument
- Care about perf/$, not absolute machine perf

(Figure: a typical cluster at Google. A cluster scheduling master, a lock service, and the GFS master coordinate many machines; each machine runs Linux with a scheduler slave, a GFS chunkserver, and user tasks, and some machines also host BigTable servers or the BigTable master.)

Google: System Building Blocks
- Google File System (GFS): raw storage
- (Cluster) scheduler: schedules jobs onto machines
- Lock service: Chubby, a distributed lock manager
  - Can also reliably hold tiny files (100s of bytes) with high availability
  - 5 replicas (need a majority vote)
- Bigtable: a multi-dimensional database
- MapReduce: simplified large-scale data processing

Google File System: Key Design Considerations
- Component failures are the norm: hardware component failures, software bugs, human errors, power supply issues, ...
  - Solution: built-in mechanisms for monitoring, error detection, fault tolerance, and automatic recovery
- Files are huge by traditional standards: multi-GB files are common, billions of objects
- Most writes (modifications or "mutations") are appends
- Two types of reads: a large # of "streaming" (i.e., sequential) reads, with a small # of "random" reads
- High concurrency (multiple "producers/consumers" on a file): atomicity with minimal synchronization
- Sustained bandwidth is more important than latency

GFS Architectural Design
- A GFS cluster: a single master + multiple chunkservers per master, running on commodity Linux machines
- A file: a sequence of fixed-size chunks (64 MB), each labeled with a 64-bit globally unique ID and stored at chunkservers (as "native" Linux files on local disk)
  - Each chunk is mirrored across (default 3) chunkservers
- Master server: maintains all metadata
  - Name space, access control, file-to-chunk mappings, garbage collection, chunk migration
  - Why only a single master (with read-only shadow masters)? It is simple, and it only answers chunk location queries from clients!
- Chunkservers ("slaves" or "workers"): interact directly with clients and perform the reads/writes, ...
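
As a hypothetical illustration of the control/data separation (not Google's actual API; all names here are made up), the sketch below shows how a client might translate a byte offset into a chunk index, ask the master only for the chunk handle and replica locations, and then fetch the data directly from a chunkserver.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

class Master:
    """Holds metadata only: (file, chunk index) -> (chunk handle, replica locations)."""
    def __init__(self):
        self.chunk_table = {}  # (path, chunk_index) -> (handle, [chunkserver names])

    def lookup(self, path: str, chunk_index: int):
        return self.chunk_table[(path, chunk_index)]

def read(master: Master, chunkservers: dict, path: str, offset: int, length: int) -> bytes:
    """Client-side read: control flow goes to the master, data flow to a chunkserver."""
    chunk_index = offset // CHUNK_SIZE
    handle, locations = master.lookup(path, chunk_index)   # control: metadata only
    replica = chunkservers[locations[0]]                    # pick any replica
    start = offset % CHUNK_SIZE
    return replica[handle][start:start + length]            # data: direct from chunkserver

# Toy setup: one chunk of file "/logs/a", replicated on three chunkservers.
master = Master()
master.chunk_table[("/logs/a", 0)] = ("chunk-0001", ["cs1", "cs2", "cs3"])
chunkservers = {cs: {"chunk-0001": b"hello, gfs!"} for cs in ["cs1", "cs2", "cs3"]}
print(read(master, chunkservers, "/logs/a", offset=0, length=5))  # b'hello'
```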

GFS Architecture: Illustration Separation of control and data flows

GFS: Summary
- GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware
- GFS occupies a different point in the design space:
  - Component failures as the norm
  - Optimized for huge files
- Success: used actively by Google to support its search service and other applications
- But performance may not be good for all apps: it assumes a read-once, write-once workload (no client caching!)
- GFS provides fault tolerance: replicating data (via chunk replication), fast and automatic recovery
- GFS has a simple, centralized master that does not become a bottleneck
- Semantics are not transparent to apps (the "end-to-end" principle?): apps must verify file contents to avoid inconsistent regions and repeated appends (at-least-once semantics)

Google MapReduce
The problem:
- Many simple operations at Google: grep over data, computing indexes, computing summaries, etc.
- But the input data is large, really large: the whole Web, billions of pages
- Google has lots of machines (clusters of 10K, etc.) and many computations over VERY large datasets
- The question is: how do you use a large # of machines efficiently?
The idea: reduce the computational model down to two steps
- Map: take one operation and apply it to many, many data tuples
- Reduce: take the results and aggregate them
MapReduce: a generalized interface for massively parallel cluster processing.
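
To make the two-step model concrete, here is a minimal, single-machine Python sketch of the classic word-count example in the MapReduce style. It is an illustration only: the real system shards the map and reduce phases across thousands of workers and reads its input from GFS.

```python
from collections import defaultdict

def map_phase(doc_id: str, text: str):
    """Map: emit an intermediate (key, value) pair per word occurrence."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word: str, counts):
    """Reduce: aggregate all values emitted for the same key."""
    return (word, sum(counts))

def run_mapreduce(documents: dict):
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():           # normally parallel across map workers
        for key, value in map_phase(doc_id, text):
            intermediate[key].append(value)           # "shuffle": group values by key
    return dict(reduce_phase(k, v) for k, v in intermediate.items())  # parallel reducers

docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
print(run_mapreduce(docs))  # {'the': 3, 'quick': 1, ..., 'fox': 2, ...}
```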

Data Center Networking
Major theme: what new networking issues are posed by large-scale data centers?
- Network architecture? Topology design? Addressing?
- Routing? Forwarding?
(CSci5221: Data Center Networking and Large-Scale Enterprise Networks, Part I)

Data Center Interconnection Structure
- Nodes in the system: racks of servers
- How are the nodes (racks) inter-connected? Typically via a hierarchical inter-connection structure
- Today's typical data center structure (the Cisco-recommended structure), from the bottom level up:
  - Rack switches
  - 1-2 layers of (layer-2) aggregation switches
  - Access routers
  - Core routers
- Is such an architecture good enough?

Cisco Recommended DC Structure: Illustration
(Figure: the Internet connects to L3 core routers (CR), which connect to L3 access routers (AR); below them, a layer-2 domain of switches (S) and load balancers (LB) connects racks (A) of 20 servers, each with a top-of-rack switch. Layer 3 runs above the access routers, layer 2 below.)

Data Center Design Requirements
- Data centers typically run two types of applications:
  - Outward facing (e.g., serving web pages to users)
  - Internal computations (e.g., MapReduce for web indexing)
- Workloads are often unpredictable:
  - Multiple services run concurrently within a DC
  - Demand for new services may spike unexpectedly
  - A spike in demand for a new service means success! But this is when success spells trouble (if not prepared)!
- Failures of servers are the norm
  - Recall that GFS, MapReduce, etc. resort to dynamic re-assignment of chunkservers and jobs/tasks (worker servers) to deal with failures; data is often replicated across racks, ...
- The "traffic matrix" between servers is constantly changing

Data Center Costs
Amortized cost* breakdown:

Amortized Cost | Component | Sub-components
~45% | Servers | CPU, memory, disk
~25% | Power infrastructure | UPS, cooling, power distribution
~15% | Power draw | Electrical utility costs
~15% | Network | Switches, links, transit

*3-year amortization for servers, 15 years for infrastructure; 5% cost of money

- Total cost varies, upwards of $1/4 B for a mega data center
- Server costs dominate; network costs are significant
- Long provisioning timescales: new servers are purchased quarterly at best
Source: "The Cost of a Cloud: Research Problems in Data Center Networks." SIGCOMM CCR 2009. Greenberg, Hamilton, Maltz, Patel.
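
As an illustration of how the amortization in the footnote works (the dollar amounts below are made up, not from the cited paper), the sketch converts a one-time purchase into a monthly cost using a standard annuity formula with a 5% annual cost of money.

```python
def monthly_amortized_cost(capex: float, years: int, annual_rate: float = 0.05) -> float:
    """Standard annuity payment: spread capex over `years` at `annual_rate` cost of money."""
    r = annual_rate / 12          # monthly rate
    n = years * 12                # number of monthly payments
    return capex * r / (1 - (1 + r) ** -n)

# Hypothetical example: a $2,000 server amortized over 3 years vs.
# $10M of facility infrastructure amortized over 15 years.
print(round(monthly_amortized_cost(2_000, 3), 2))        # ~ $59.94 per month
print(round(monthly_amortized_cost(10_000_000, 15), 2))  # ~ $79,079 per month
```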

Overall Data Center Design Goal
Agility: any service, any server
- Turn the servers into a single large fungible pool
- Let services "breathe": dynamically expand and contract their footprint as needed
- We have already seen how this is done in Google's GFS, BigTable, and MapReduce
Benefits:
- Increased service developer productivity
- Lower cost
- High performance and reliability
These are the three motivators for most data center infrastructure projects!

Achieving Agility ...
- Workload management: means for rapidly installing a service's code on a server
  - Dynamic cluster scheduling and server assignment, e.g., MapReduce, Bigtable, ...
  - Virtual machines, disk images
- Storage management: means for a server to access persistent data
  - Distributed file systems (e.g., GFS)
- Network management: means for communicating with other servers, regardless of where they are in the data center
- Throughout: achieve high performance and reliability

Networking Objectives
- Uniform high capacity
  - Capacity between servers limited only by their NICs
  - No need to consider topology when adding servers
  - In other words, high capacity between any two servers, no matter which racks they are located in!
- Performance isolation
  - Traffic of one service should be unaffected by others
- Ease of management: "plug-&-play" (layer-2 semantics)
  - Flat addressing, so any server can have any IP address
  - Server configuration is the same as in a LAN
  - Legacy applications depending on broadcast must work

Is Today's DC Architecture Adequate?
- Hierarchical network with 1+1 redundancy
- Equipment higher in the hierarchy handles more traffic, is more expensive, and gets more effort toward availability: a scale-up design
- Servers connect via 1 Gbps UTP to top-of-rack switches; other links are a mix of 1G and 10G, fiber and copper
(Same hierarchical structure as the Cisco illustration above: core routers, access routers, L2 switches, load balancers, and top-of-rack switches.)
Open questions:
- Uniform high capacity?
- Performance isolation? (typically via VLANs)
- Agility in terms of dynamically adding or shrinking servers?
- Agility in terms of adapting to failures and to traffic dynamics?
- Ease of management?

Case Studies
- A Scalable, Commodity Data Center Network Architecture
  - A new fat-tree "inter-connection" structure (topology) to increase "bisection" bandwidth
  - Needs "new" addressing, forwarding/routing
- VL2: A Scalable and Flexible Data Center Network
  - Consolidates layer 2/layer 3 into a "virtual layer 2"
  - Separates "naming" and "addressing"; also deals with dynamic load-balancing issues
- Other approaches:
  - PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
  - BCube: A High-Performance, Server-centric Network Architecture for Modular Data Centers

A Scalable, Commodity Data Center Network Architecture
Main goal: addressing the limitations of today's data center network architecture
- Single point of failure
- Oversubscription of links higher up in the topology: trade-offs between cost and provisioned capacity
Key design considerations/goals:
- Allow host communication at line speed, no matter where the hosts are located!
- Backwards compatible with existing infrastructure: no changes in applications, and support for layer 2 (Ethernet)
- Cost effective: cheap infrastructure, low power consumption and heat emission

Fat-Tree Based DC Architecture
- Inter-connect racks (of servers) using a fat-tree topology
- Fat-tree: a special type of Clos network (after C. Clos)
- A k-ary fat tree has a three-layer topology (edge, aggregation, and core):
  - Each pod consists of (k/2)^2 servers and 2 layers of k/2 k-port switches
  - Each edge switch connects to k/2 servers and k/2 aggregation switches
  - Each aggregation switch connects to k/2 edge and k/2 core switches
  - There are (k/2)^2 core switches, each connecting to k pods
(Figure: fat-tree with k=2.)
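
The following Python sketch (derived from the counts above, not code from the paper) computes the component counts and total host capacity of a k-ary fat tree; with k = 48-port commodity switches this already supports 27,648 hosts.

```python
def fat_tree_sizes(k: int) -> dict:
    """Component counts for a k-ary fat tree built from k-port switches (k even)."""
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "servers_per_pod": (k // 2) ** 2,
        "edge_switches": k * (k // 2),        # k/2 per pod
        "aggregation_switches": k * (k // 2),  # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "total_hosts": (k ** 3) // 4,          # k pods * (k/2)^2 servers
    }

print(fat_tree_sizes(4))   # 16 hosts, 4 core switches
print(fat_tree_sizes(48))  # 27,648 hosts from 48-port switches
```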

Fat-Tree Based Topology ...
Why fat-tree?
- A fat tree has identical bandwidth at any bisection
- Each layer has the same aggregate bandwidth
- It can be built using cheap devices with uniform capacity
  - Each port supports the same speed as an end host
  - All devices can transmit at line speed if packets are distributed uniformly along the available paths
- Great scalability
(Figure: fat-tree network with K = 3 supporting 54 hosts.)

Cost of Maintaining Switches (figure): NetGear ~ 3K; ProCurve ~ 4.5K

Fat-tree Topology Is Great, But ...
- Is using a fat-tree topology to inter-connect racks of servers in itself sufficient?
- What routing protocols should we run on these switches?
  - Layer-2 switch algorithm: data-plane flooding!
  - Layer-3 IP routing: shortest-path IP routing will typically use only one path despite the path diversity in the topology; if equal-cost multi-path routing is used at each switch independently and blindly, packet re-ordering may occur, and load may not necessarily be well balanced
  - Aside: control-plane flooding!

FAT-Tree Modified
- Enforce a special (IP) addressing scheme in the DC: unused.PodNumber.switchnumber.Endhost
  - Allows hosts attached to the same switch to route only through that switch
  - Allows intra-pod traffic to stay within the pod
- Use two-level look-ups to distribute traffic and maintain packet ordering:
  - The first level is a prefix lookup, used to route down the topology to servers
  - The second level is a suffix lookup, used to route up towards the core
  - Packet ordering is maintained by using the same ports for the same server
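
Here is a small Python sketch of the two-level lookup mechanism (an illustration, not the paper's exact tables; the addresses, prefixes, and port numbers are hypothetical): terminating prefixes route down toward directly reachable subnets, while a secondary table matched on the host-ID suffix spreads upward traffic across the uplinks.

```python
import ipaddress

class TwoLevelTable:
    """Two-level route table: prefix entries route down, suffix entries route up."""

    def __init__(self, prefix_routes, suffix_routes):
        # prefix_routes: list of (network prefix, output port)
        self.prefix_routes = [(ipaddress.ip_network(p), port) for p, port in prefix_routes]
        # suffix_routes: host-ID slot -> uplink port
        self.suffix_routes = suffix_routes

    def lookup(self, dst: str) -> int:
        addr = ipaddress.ip_address(dst)
        for net, port in self.prefix_routes:          # level 1: route down on a prefix match
            if addr in net:
                return port
        host_id = int(dst.split(".")[-1])              # level 2: route up by host-ID suffix
        return self.suffix_routes[host_id % len(self.suffix_routes)]

# Hypothetical aggregation switch in pod 2 of a k=4 fat tree (addresses 10.pod.switch.host).
table = TwoLevelTable(
    prefix_routes=[("10.2.0.0/24", 0), ("10.2.1.0/24", 1)],  # down to the pod's edge switches
    suffix_routes={0: 2, 1: 3},                               # up to the two core uplinks
)
print(table.lookup("10.2.0.3"))  # intra-pod destination -> port 0 (down)
print(table.lookup("10.3.1.2"))  # other pod -> spread across uplinks by suffix (port 2)
```

Because the suffix of a destination is fixed, all traffic toward a given server leaves a switch on the same uplink, which is what preserves packet ordering while still spreading different destinations over different paths.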

More on Fat-Tree DC Architecture
Diffusion optimizations:
- Flow classification (eliminates local congestion)
  - Assign traffic to ports on a per-flow basis instead of a per-host basis
- Flow scheduling (eliminates global congestion)
  - Prevent long-lived flows from sharing the same links
  - Assign long-lived flows to different links
What are the potential drawbacks of this architecture?
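
As a sketch of per-flow assignment (illustrative only; the paper's flow classifier also periodically rebalances flows, which is omitted here), the snippet below hashes the 5-tuple of each flow onto one of the available uplinks, so all packets of a flow follow one path (no re-ordering) while different flows spread across paths.

```python
import hashlib

def uplink_for_flow(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Hash the 5-tuple so every packet of a flow uses the same uplink."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return uplinks[digest % len(uplinks)]

uplinks = ["core-0", "core-1"]
print(uplink_for_flow("10.0.1.2", "10.2.0.3", 43512, 80, "tcp", uplinks))
print(uplink_for_flow("10.0.1.2", "10.2.0.3", 43513, 80, "tcp", uplinks))  # may take the other path
```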