Nikolaos Chrysos - Ultra-high-radix crossbar switches

Scalable Clos On-Chip (2011-2014): In 2011, our datacenter network team at IBM decided to scale our server-rack interconnect with a second tier of high-radix spine switches. Our EDA tools had no difficulty placing and routing a 32-port crossbar, but the 136-port crossbar that we were targeting was much tougher to build. We first considered applying bit slicing and manual placement but eventually resorted to a hierarchical design for the 136-port crossbar, which our tools could handle seamlessly. Our final solution (SCOC) is a switch node architecture that can be used as an efficient crossbar replacement. SCOC is a combined-input-output-queued (CIOQ) switch with virtual output queues (VOQs) at the inputs, built around a bufferless, non-blocking Clos network, with a cost that grows as $N \times N / m$, where $N$ is the number of ports. The integer $m$ is a design parameter, in practice between 4 and 16. Whereas buffered fabrics are preferred for off-chip networks, in this paper we show through the example of SCOC that the situation is reversed inside the chip. The lack of internal buffers allows SCOC to use packet-level multipathing, which delivers consistent performance irrespective of the spatial orientation of the workload, without having to cope with out-of-order packet delivery. The catch, of course, is that a bufferless Clos needs global scheduling, which, in practice, performs sub-optimally. SCOC takes advantage of the abundance of wires available on-chip to remedy its scheduling inefficiencies using cheap on-chip speedup. The combination of the two, packet-level multipathing and speedup, gives SCOC remarkable performance.
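To put the N x N / m cost growth in perspective, here is a minimal Python sketch that compares it with the N x N growth of a flat crossbar for the 136-port case. The function names are hypothetical, and the sketch assumes that constant factors can be ignored, so the numbers are relative cost units rather than gate or wire counts.

```python
def flat_crossbar_cost(n: int) -> int:
    """Relative cost of a flat n-port crossbar, which grows as n * n."""
    return n * n


def scoc_cost_estimate(n: int, m: int) -> float:
    """Relative cost n * n / m quoted in the text (constant factors ignored)."""
    return n * n / m


# Compare both growth terms for N = 136 over the stated practical range of m.
N = 136
for m in (4, 8, 16):
    flat = flat_crossbar_cost(N)
    scoc = scoc_cost_estimate(N, m)
    print(f"m={m:2d}: flat={flat}, N*N/m={scoc:.0f}")
```

With these assumptions the growth term shrinks by exactly the factor m, which is why a larger m is attractive as long as the design remains practical.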

From a performance perspective, SCOC is indistinguishable from efficient flat crossbars. Maintaining this service level proved to be an uphill but worthwhile struggle. The gradual contention resolution and the asynchronous operation of SCOC could seriously endanger the determinism expected of correct operation. We drafted a set of microbenchmarks that helped us identify more than five fairness violations, and we came up with a practical solution for each of them.


Implementation: We have implemented a 136-port SCOC switch with 25 Gb/s ports and $m=4$. This number of ports was selected in order to limit the periphery of the chip, which is dictated by the I/O cores. The chip is a 32 nm ASIC designed using standard EDA tools; it operates at 454 MHz and consumes 150 W. Internally, the SCOC chip runs at approximately 9.9 Tb/s, thus over-provisioning the user bandwidth by a factor $s=1.45$, and provides a fall-through packet latency of just 61 ns. The switch comprises two crossbar chips that share chip I/O links: one crossbar is used for data packets and one crossbar is used for end-to-end (ETE) control messages.
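The figures above can be cross-checked with some simple arithmetic. The Python sketch below uses only the numbers quoted in the text; the variable names are hypothetical, and it assumes that "user bandwidth" means the port count times the port rate in one direction.

```python
# Figures quoted in the text.
PORTS = 136
PORT_RATE_GBPS = 25.0
SPEEDUP = 1.45
CLOCK_HZ = 454e6
FALL_THROUGH_LATENCY_S = 61e-9

# Aggregate user bandwidth in one direction (assumption: ports * port rate).
user_bw_tbps = PORTS * PORT_RATE_GBPS / 1000.0

# One-direction bandwidth after applying the speedup factor s.
sped_up_bw_tbps = user_bw_tbps * SPEEDUP

# Fall-through latency expressed in clock cycles at 454 MHz.
latency_cycles = FALL_THROUGH_LATENCY_S * CLOCK_HZ

print(f"user bandwidth:     {user_bw_tbps:.2f} Tb/s")
print(f"with speedup s:     {sped_up_bw_tbps:.2f} Tb/s")
print(f"fall-through delay: {latency_cycles:.1f} cycles")
```

Under this assumption the sped-up one-direction figure is about 4.93 Tb/s, roughly half of the quoted 9.9 Tb/s. The text does not say how the internal figure is counted, so the factor of two is left unexplained here rather than guessed at.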

In retrospect, being able to synthesize the arbiters and crossbars in an automated way enabled us to flexibly delve into the design process, add enhancements, and examine different system-level alternatives. If a semi- or full-custom approach had been used instead, any changes or additions would have incurred a significantly higher design cost. When it was completed (late 2012), SCOC was the highest-throughput switch chip ever built.

Related publications

N. Chrysos, C. Minkenberg, M. Rudquist, C. Basso, B. Vanderpool, "SCOC: High-Radix Crossbars Made Of Bufferless Clos Networks", IEEE Symposium on High Performance Computer Architecture (HPCA), CA, Feb. 2015