Nikolaos Chrysos - Ultra-high-radix crossbar switches
Scalable Clos On-Chip (2011-2014): In 2011, our datacenter network team
at IBM decided to scale our server-rack interconnect with a
second tier of high-radix spine switches. Our EDA tools placed and
routed a 32-port crossbar without difficulty, but the 136-port crossbar
that we were targeting was much tougher to build. We first considered
bit slicing and manual placement but eventually resorted to a
hierarchical design for the 136-port crossbar, which our tools could
handle seamlessly. Our final solution
(SCOC) is a switch node architecture that can be used as an efficient
crossbar replacement. SCOC is a combined-input-output-queued (CIOQ)
switch with virtual output queues (VOQs) at the inputs, built around a
bufferless, non-blocking Clos network whose cost grows as N×N/m, where N
is the number of ports. The integer m is a design parameter, in practice
between 4 and 16. Whereas buffered fabrics are preferred for off-chip networks, in
this paper we show through the example of SCOC that the situation is
reversed inside the chip. The lack of internal buffers allows SCOC to
use packet-level multipathing, which delivers consistent performance
irrespective of the spatial orientation in the workload, without
having to cope with out-of-order packet delivery. The catch, of
course, is that a bufferless Clos needs global scheduling which, in
practice, performs sub-optimally. SCOC takes advantage of the
abundance of wires that are available on-chip to remedy its scheduling
inefficiencies using cheap on-chip speedup. The combination of
the two, packet-level multipathing and speedup, empowers SCOC with
remarkable performance.
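To make the scheduling constraint concrete, here is a minimal sketch of one time slot in a generic bufferless three-stage Clos. It is not SCOC's scheduler: the function name `schedule_slot`, the greedy first-fit policy, and the parameters `ports_per_module` and `num_middle` are assumptions made for illustration. Because no stage can store a packet, a request is granted only if its input, its output, and both links of some middle-stage path are free in the same slot, and the path is chosen per packet.

```python
def schedule_slot(requests, ports_per_module, num_middle):
    """Greedily grant (input, output) requests for one time slot.

    Hypothetical toy model, not the SCOC scheduler: port p belongs to
    edge module p // ports_per_module, and every edge module has one
    link to each of the num_middle middle-stage switches.
    Returns a list of (input, output, middle) grants.
    """
    busy_inputs, busy_outputs = set(), set()
    busy_up = set()    # (input module, middle) links taken this slot
    busy_down = set()  # (middle, output module) links taken this slot
    grants = []
    for inp, out in requests:
        if inp in busy_inputs or out in busy_outputs:
            continue  # port conflict: the request stays queued at the input
        src = inp // ports_per_module
        dst = out // ports_per_module
        for mid in range(num_middle):
            # A path is usable only if both of its links are idle now.
            if (src, mid) not in busy_up and (mid, dst) not in busy_down:
                busy_inputs.add(inp)
                busy_outputs.add(out)
                busy_up.add((src, mid))
                busy_down.add((mid, dst))
                grants.append((inp, out, mid))
                break
    return grants


# Four ports, two per edge module, two middle switches.
print(schedule_slot([(0, 2), (1, 3), (2, 0), (3, 0)], 2, 2))
```

In this example the two packets travelling from module 0 to module 1 are sent through different middle switches, which is what packet-level multipathing means here. A single greedy pass of this kind is not guaranteed to find the largest possible set of grants, which is the sort of inefficiency that internal speedup is meant to offset.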
From a performance perspective, SCOC is indistinguishable from efficient flat crossbars. Maintaining this service level proved to be an uphill but worthwhile struggle. The gradual contention resolution and the asynchronous operation of SCOC could seriously endanger the determinism expected of correct operation. We drafted a set of microbenchmarks that helped us identify more than five violations of fairness, and we devised a practical solution for each of them.
Implementation: We have implemented a 136-port SCOC switch with 25 Gb/s ports and m=4. This number of ports was selected in order to limit the periphery of the chip, which is dictated by the I/O cores. The chip is a 32 nm ASIC designed using standard EDA tools; it operates at 454 MHz and consumes 150 W. Internally, the SCOC chip runs at approximately 9.9 Tb/s, thus over-provisioning the user bandwidth by a factor $s=1.45$, and provides a fall-through packet latency of just 61 ns. The switch comprises two crossbar chips that share chip I/O links: one crossbar is used for data packets and one crossbar is used for end-to-end (ETE) control messages.
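The quoted figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the user bandwidth is counted in both directions (136 inputs plus 136 outputs); this is one reading that reproduces the quoted internal rate, but the text does not state it, and the variable names are illustrative. The cost line uses the N×N/m growth rate quoted earlier, relative to N×N for a flat crossbar, and ignores constant factors.

```python
N_PORTS = 136      # number of ports (from the text)
PORT_GBPS = 25     # per-port rate in Gb/s (from the text)
SPEEDUP = 1.45     # internal speedup s (from the text)
CLOCK_MHZ = 454    # clock frequency in MHz (from the text)
LATENCY_NS = 61    # fall-through latency in ns (from the text)
M = 4              # design parameter m (from the text)

one_way_tbps = N_PORTS * PORT_GBPS / 1000
# Assumption: "user bandwidth" counts ingress plus egress traffic.
user_tbps = 2 * one_way_tbps
internal_tbps = SPEEDUP * user_tbps

# Fall-through latency expressed in clock cycles.
latency_cycles = LATENCY_NS * CLOCK_MHZ / 1000

# Relative cost only; the constant factors are unknown and ignored.
flat_cost = N_PORTS * N_PORTS
scoc_cost = N_PORTS * N_PORTS // M

print(f"one-way user bandwidth: {one_way_tbps:.1f} Tb/s")
print(f"internal bandwidth:     {internal_tbps:.2f} Tb/s")
print(f"fall-through latency:   {latency_cycles:.1f} cycles")
print(f"relative cost:          {scoc_cost} vs {flat_cost} (flat)")
```

Under these assumptions the internal rate comes to about 9.86 Tb/s, in line with the quoted 9.9 Tb/s, and the 61 ns fall-through latency corresponds to roughly 28 clock cycles at 454 MHz.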
In retrospect, being able to synthesize the arbiters and crossbars in an automated way enabled us to flexibly delve into the design process, add enhancements, and examine different system-level alternatives. If a semi- or full-custom approach had been used instead, any changes or additions would have incurred a significantly higher design cost. When it was completed (late 2012), SCOC was the highest-throughput switch chip ever built.
Related publications
N. Chrysos, C. Minkenberg, M. Rudquist, C. Basso, B. Vanderpool, "SCOC:
High-Radix Crossbars Made Of Bufferless Clos Networks", IEEE Symposium
on High Performance Computer Architecture (HPCA), CA, Feb. 2015.