Nikolaos Chrysos - Datacenter nodes running at 100G

[Figure: ibm_project]

IBM high-performance clusters (2009-2014): At IBM, we designed and implemented in ASIC a datacenter fabric that carries the Ethernet and PCIe (primarily east-west) traffic in a cluster of 640 processing nodes (servers), arranged in four racks. This design was among the first to offer 100G Ethernet connections to servers. Despite its sheer size and speed, the fabric offers advanced QoS and ultra-low (sub-microsecond) end-to-end latency, thanks to speculative transmissions that dominate at low utilization levels. At high utilization, or when congestion occurs at some nodes, sources need to request permission before they can inject new traffic. This request-grant admission step adds 1-2 microseconds to the end-to-end latency, but guarantees robust operation (and reasonable, predictable latency) under adversarial traffic conditions. Furthermore, deep pipelining and smart implementation techniques made it possible to offer sophisticated service even for 64B Ethernet frames, i.e., when the processing time per frame gets as low as 6.6 nanoseconds.
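To make the two injection modes concrete, the following is a minimal Python sketch (not the ASIC implementation) of an ingress adapter that transmits speculatively when its queue is nearly empty and falls back to request-grant admission otherwise. All names (Fabric, IngressAdapter, SPECULATION_THRESHOLD) are illustrative assumptions, and the speculation threshold is a placeholder policy.

```python
# Illustrative sketch of speculative vs. request-grant injection.
from collections import deque

SPECULATION_THRESHOLD = 1      # speculate only when the VOQ is (almost) empty


class Fabric:
    """Stub standing in for the switching fabric / egress scheduler."""
    def send_speculative(self, frame): print("speculative:", frame)
    def send_request(self, dst, count): print(f"request {count} slot(s) to {dst}")
    def send_granted(self, frame): print("granted:", frame)


class IngressAdapter:
    def __init__(self, fabric):
        self.fabric = fabric
        self.voq = deque()         # frames queued toward one egress port
        self.credits = 0           # grants returned by the egress scheduler

    def enqueue(self, frame, dst):
        self.voq.append((frame, dst))

    def transmit(self):
        if not self.voq:
            return
        frame, dst = self.voq[0]
        if self.credits == 0 and len(self.voq) <= SPECULATION_THRESHOLD:
            # Low utilization: inject immediately; a rejected speculation is
            # simply retried later over the request-grant path.
            self.fabric.send_speculative(frame)
        elif self.credits > 0:
            # Admitted traffic: consume one grant per transmitted frame.
            self.fabric.send_granted(self.voq.popleft()[0])
            self.credits -= 1
        else:
            # Congested / high load: ask for permission first (adds ~1-2 us).
            self.fabric.send_request(dst, len(self.voq))
```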

Server-rack fabric overview: As shown in the figure below, the fabric uses a two-level fat-tree (spine-leaf) topology: every leaf switch is connected to every spine switch. The leaf switches are integrated into the backplane of the processor racks, and each one constitutes the network edge for five (5) servers, providing a 100 Gb/s (bidirectional) link to every server. The fabric supports four racks with 32 leaf switches each, thus providing a total of 640 100G ports. The leaf switches have dedicated input and output buffers per port and per priority level. At its ingress interface, a leaf switch segments the incoming Ethernet frames into variable-size fabric-internal packets (cells), and stores these in 256B buffer units that are linked together to form VOQs. Each packet can use any of the available leaf-to-spine links, as enforced by a packet-level spraying mechanism that overcomes the limitations of flow-level hashing. The original Ethernet frames are reordered and reassembled at their egress leaf switch (output buffers), and forwarded to their destination server. The spines themselves are cell-based, CIOQ switching elements, agnostic of the higher-level protocols. They reside in separate chassis, and each one provides 136 ports of 25 Gb/s, enabling high-density (indirect) connections among the leaf switches. The spines feature small input and output buffers (16 cells per port), which are flow-controlled using hop-by-hop backpressure. We use 32 spines in our four-rack system. With 32 such 25 Gb/s links per leaf switch connecting to the spines, the fabric features an over-provisioning ratio of 8:5 (800 Gb/s of uplink capacity versus 500 Gb/s of server-facing capacity); this speedup accommodates the internal overheads, leaving some headroom to compensate for scheduling inefficiencies.
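The toy sketch below illustrates the ingress-side datapath just described: a frame is segmented into 256B buffer units, each cell is tagged with a sequence number for reassembly at the egress leaf, and cells are sprayed across the 32 uplinks. The fixed-size cells and the round-robin spraying order are simplifying assumptions for illustration, not the exact hardware policy.

```python
# Toy sketch of per-frame segmentation and packet-level spraying.
CELL_PAYLOAD = 256          # bytes per fabric-internal buffer unit
UPLINKS = 32                # 25 Gb/s leaf-to-spine links per leaf switch
# Uplink capacity: 32 * 25 = 800 Gb/s vs. 5 * 100 = 500 Gb/s of server ports,
# i.e. the 8:5 over-provisioning ratio mentioned above.


def segment_and_spray(frame: bytes, start_uplink: int = 0):
    """Yield (uplink, sequence_number, cell_payload) tuples for one frame."""
    cells = [frame[i:i + CELL_PAYLOAD] for i in range(0, len(frame), CELL_PAYLOAD)]
    for seq, cell in enumerate(cells):
        uplink = (start_uplink + seq) % UPLINKS     # spray across all uplinks
        yield uplink, seq, cell


# The egress leaf buffers cells per (source, frame), re-orders them by
# sequence number, reassembles the Ethernet frame, and forwards it.
if __name__ == "__main__":
    frame = bytes(1500)                              # a full-size frame
    for uplink, seq, cell in segment_and_spray(frame):
        print(f"cell {seq:2d} ({len(cell):3d} B) -> uplink {uplink:2d}")
```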

Congestion Management: To deal with congestion, the fabric uses (i) VOQs at the ingress leaf switches, (ii) an end-to-end request-grant (credit) protocol, and (iii) packet- (cell-) level multipathing (spraying). Together, these prevent hotspots at fabric-egress and internal ports. In addition, to prevent buffer hogging inside the ingress leaf switches, the fabric exploits an advantageous variation of Quantized Congestion Notification (QCN), which properly identifies the congested flows and throttles their sending rate inside the upstream Converged Network Adapters (CNAs, a.k.a. network interfaces). A distinctive feature of our QCN variant is that it fairly allocates the bandwidth of congested links, overcoming the statistical sampling errors of standard QCN.
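The sketch below gives a rough, simplified picture of the congestion-point side of such a scheme: the feedback value follows the standard 802.1Qau-style formulation (queue offset plus weighted queue growth), while the flow to be throttled is chosen in proportion to its current buffer occupancy rather than by sampling arrivals, in the spirit of the occupancy-sampling idea cited below. The constants and the selection routine are illustrative assumptions, not the exact mechanism of our switch.

```python
# Simplified sketch of a congestion point with occupancy-based culprit selection.
import random

Q_EQ = 26 * 1024        # assumed target (equilibrium) queue length, bytes
W = 2.0                 # assumed weight of the queue-growth term


def qcn_feedback(q_len: int, q_old: int) -> int:
    """Negative value => congestion; its magnitude is quantized into the CNM."""
    q_off = q_len - Q_EQ          # how far above the operating point we are
    q_delta = q_len - q_old       # how fast the queue is growing
    return -int(q_off + W * q_delta)


def pick_culprit(occupancy_per_flow: dict) -> str:
    """Pick the flow to throttle with probability proportional to its buffer
    occupancy, avoiding the bias of sampling individual arrivals."""
    flows = list(occupancy_per_flow)
    weights = [occupancy_per_flow[f] for f in flows]
    return random.choices(flows, weights=weights, k=1)[0]


# Example: flow "A" hogs the buffer, so it is (almost always) the one throttled.
occ = {"A": 48_000, "B": 2_000, "C": 1_000}
fb = qcn_feedback(q_len=sum(occ.values()), q_old=40_000)
if fb < 0:
    print("send CNM to source of flow", pick_culprit(occ), "with feedback", fb)
```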

Related publications

N. Chrysos, F. Neeser, B. Vanderpool, M. Rudquist, K. Valk, T. Greenfield, C. Basso, "Integration and QoS of Multicast Traffic in a Server-Rack Fabric with 640 100G Ports", IEEE/ACM Symposium on Architectures for Networking and Communications Systems (ANCS), Los Angeles, Oct. 2014
N. Chrysos, F. Neeser, M. Gusat, C. Minkenberg, W. Denzel, C. Basso, M. Rudquist, K. Valk, B. Vanderpool, "Large Switches or Blocking Multi-Stage Networks? An Evaluation of Routing Strategies for Datacenter Fabrics", Computer Networks, Elsevier, to appear
N. Chrysos, F. Neeser, R. Clauberg, D. Crisan, K. Valk, C. Basso, C. Minkenberg, M. Gusat, "Unbiased QCN for Scalable Server-Rack Fabrics", IEEE Micro, to appear
N. Chrysos, F. Neeser, M. Gusat, R. Clauberg, C. Minkenberg, C. Basso, and K. Valk, "Tandem Queue Weighted Fair Smooth Scheduling", Design Automation for Embedded Systems, pp. 1-15, March 2014, DOI: 10.1007/s10617-014-9132-y
N. Chrysos, F. Neeser, M. Gusat, C. Minkenberg, W. Denzel, and C. Basso, "All Routes to Efficient Datacenter Fabrics", International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip (INA-OCMC), ACM, Vienna, 2014
F. Neeser, N. Chrysos, R. Clauberg, D. Crisan, M. Gusat, C. Minkenberg, K. Valk, C. Basso, "Occupancy Sampling For Terabit CEE Switches", IEEE Annual Symposium on High-Performance Interconnects, CA, US, August 2012
N. Chrysos, L. Chen, C. Minkenberg, C. Kachris, M. Katevenis, "End-to-end Congestion Management for Non-Blocking Multi-Stage Switching Fabrics", ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), San Diego, CA, 2010
N. Chrysos, "Congestion Management for Non-Blocking Clos Networks", ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), Orlando, Dec. 2007