The world’s largest chip scales to new heights. 

By Paul Alcorn

Cerebras, the company that builds the world's largest chip, the Wafer Scale Engine 2 (WSE-2), unveiled its Andromeda supercomputer today. Andromeda combines 16 of the wafer-sized WSE-2 chips into one cluster with 13.5 million AI-optimized cores that the company says delivers up to 1 exaflop of AI computing horsepower, or 120 petaflops of dense 16-bit half-precision performance. 
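As a rough sanity check on those headline figures, the per-system numbers below are simple divisions of the cluster totals quoted above; they are derived arithmetic, not official Cerebras specs:

```python
# Back-of-the-envelope math on Andromeda's headline figures.
# Cluster totals come from the article; per-system values are derived.
systems = 16
cluster_ai_cores = 13_500_000
cluster_dense_fp16_pflops = 120  # 16-bit half-precision

cores_per_system = cluster_ai_cores / systems
dense_pflops_per_system = cluster_dense_fp16_pflops / systems

# Note: 13.5M / 16 = 843,750, slightly below the 850,000 cores the
# article quotes per WSE-2 chip, so 13.5 million is a rounded figure.
print(f"{cores_per_system:,.0f} AI cores per CS-2")
print(f"{dense_pflops_per_system} dense FP16 petaflops per CS-2")
```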

The chips are housed in sixteen CS-2 systems. Each chip delivers up to 12.1 TB/s (96.8 Tb/s) of internal bandwidth to the AI cores, but the data is fed to the CS-2 processors via 100 GbE networking spread across 124 server nodes in 16 racks. In total, those servers are powered by 284 third-gen EPYC Milan processors wielding 64 cores apiece, totaling 18,176 cores. 
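The server-side figures check out with quick arithmetic (this is just a verification of the numbers above, nothing more):

```python
# Verify the host-server core count quoted in the article.
epyc_cpus = 284
cores_per_cpu = 64
total_cores = epyc_cpus * cores_per_cpu
print(total_cores)  # 18176, matching the article

# 12.1 terabytes/s of internal bandwidth expressed in terabits/s:
internal_tb_s = 12.1
internal_tbit_s = internal_tb_s * 8
print(internal_tbit_s)  # 96.8
```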

The entire system consumes 500 kW, far less power than somewhat-comparable GPU-accelerated supercomputers. However, scaling a workload across such massively parallel machines has long been one of the primary inhibitors to performance — at some point, scaling tends to break down, so adding more hardware yields rapidly diminishing returns. 

Cerebras, however, says that its implementation scales nearly linearly with GPT-class large language models, like GPT-3, GPT-J, and GPT-NeoX. Andromeda can also process 2.5-billion- and 25-billion-parameter models that standard GPU clusters simply can't handle due to memory limitations. 

As a reminder, the Cerebras WSE-2 is the world's largest single-chip processor. Each 7nm chip is specifically designed to tackle AI workloads with 850,000 AI-specific cores spread out over 46,225 mm2 of silicon packed with 2.6 trillion transistors. The chip has 40 GB of on-chip SRAM, 20 petabytes per second of memory bandwidth, and 220 petabits per second of aggregate fabric bandwidth. Each WSE-2 consumes 15 kW of power. 
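Those specs imply some striking per-chip figures. The values below are derived from the numbers above for illustration (assuming decimal gigabytes for the SRAM), not official Cerebras numbers:

```python
# Derived per-chip figures from the WSE-2 specs (illustrative math only).
transistors = 2.6e12
area_mm2 = 46_225
cores = 850_000
sram_bytes = 40e9   # 40 GB on-chip SRAM, decimal GB assumed
power_w = 15_000    # 15 kW per WSE-2

density_m_per_mm2 = transistors / area_mm2 / 1e6
sram_kb_per_core = sram_bytes / cores / 1e3
power_w_per_mm2 = power_w / area_mm2

print(f"{density_m_per_mm2:.1f}M transistors per mm^2")  # ~56.2M
print(f"{sram_kb_per_core:.0f} KB of SRAM per core")     # ~47 KB
print(f"{power_w_per_mm2:.2f} W per mm^2")               # ~0.32
```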

Workload scaling is sub-par on most large systems, leading to diminishing returns, often due to code, memory, fabric, and/or networking limitations. However, Cerebras has shown that its CS-2 systems scale nearly linearly via data parallelism with no changes to the underlying code — the company's Andromeda supercomputer began crunching through workloads within ten minutes of being fully connected. 
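"Near-linear" scaling is usually quantified as measured speedup divided by the ideal (linear) speedup. A minimal sketch of that calculation, with hypothetical throughput numbers chosen purely to illustrate the formula:

```python
# Scaling efficiency: actual speedup relative to perfect linear scaling.
def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """1.0 means perfectly linear; lower values mean scaling losses."""
    return (throughput_n / throughput_1) / n

# Hypothetical example (not Cerebras data): 16 systems delivering
# 15.2x the throughput of a single system would be 95% efficient.
eff = scaling_efficiency(throughput_n=15.2, throughput_1=1.0, n=16)
print(f"{eff:.0%}")  # 95%
```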

The sixteen CS-2s use the company's MemoryX memory system and SwarmX interconnect to simplify and orchestrate splitting the model up across the systems. This approach stores model parameters off-chip in a MemoryX cabinet and streams them to the CS-2s, which keep computation on-chip, allowing a single system to compute larger AI models than before and combating the typical latency and memory bandwidth issues that often restrict scalability with groups of processors. Cerebras says this allows the system to scale near-linearly across up to 192 CS-2 systems. 
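Conceptually, the weight-streaming approach sends each layer's parameters from MemoryX to every CS-2 in turn, while each system works on its own data-parallel slice of the batch. A highly simplified sketch — all class and function names here are invented for illustration, not Cerebras APIs:

```python
# Illustrative sketch of layer-by-layer weight streaming (not a real API).
class FakeCS2:
    """Stand-in for a CS-2 system; records which layers it computed."""
    def __init__(self):
        self.layers_seen = []

    def compute_layer(self, weights, batch):
        # A real system would run the layer on-wafer; we just log it.
        self.layers_seen.append(weights["name"])

def forward_pass(memory_x_layers, systems, batches):
    # Stream each layer's weights from (simulated) MemoryX to every CS-2;
    # each CS-2 applies them to its own data-parallel slice of the batch.
    for weights in memory_x_layers:
        for system, batch in zip(systems, batches):
            system.compute_layer(weights, batch)

layers = [{"name": f"layer{i}"} for i in range(3)]
cluster = [FakeCS2() for _ in range(4)]
forward_pass(layers, cluster, batches=[None] * 4)
print(cluster[0].layers_seen)  # ['layer0', 'layer1', 'layer2']
```

Because the weights are streamed rather than resident on each system, the model size is bounded by MemoryX capacity rather than on-wafer memory, which is how a single system can tackle larger models.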

Andromeda is deployed at the Colovore data center in Santa Clara, California. The company has opened Andromeda up to both customers and academic researchers, including the Argonne National Laboratory, which states it has already put the entire COVID-19 genome into a sequence window and run the workload across up to 16 nodes with "near-perfect linear scaling." That project is now a finalist for the prestigious ACM Gordon Bell Special Prize. Other users include JasperAI and the University of Cambridge.