The Silicon Highway: Navigating Interconnects for Banked Shared Caches

Welcome back to the Shared Cache Lab. If you’ve been following our journey, you know our “Silicon Shield” logo isn’t just for show—it represents the literal defense line we build between the cores. Today, we’re moving from the “rules” of the conversation (protocols like MESI) to the actual “highways” where that conversation happens.

In a modern high-performance chip, we don’t just have one giant block of cache. As we discussed in our last post, that would be an energy and latency nightmare. Instead, we “slice” the cache into multiple banks. But how do the cores actually get to those banks? Let’s break it down with a classic scenario: 2 Cores & 4 Cache Banks.
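Before any interconnect question arises, an address has to pick a bank. A common scheme interleaves consecutive cache lines across the banks using low-order line-address bits. Here is a minimal sketch of that mapping; the line size and bit positions are illustrative, not from any particular design:

```python
# Bank selection sketch: consecutive cache lines are interleaved across
# banks by taking the line address modulo the bank count.
# LINE_BYTES and NUM_BANKS are assumed values for illustration.

LINE_BYTES = 64
NUM_BANKS = 4

def bank_of(addr):
    """Return the bank index holding the cache line for this address."""
    line = addr // LINE_BYTES
    return line % NUM_BANKS

# Adjacent lines land in different banks, spreading traffic out.
for addr in (0x0, 0x40, 0x80, 0xC0, 0x100):
    print(hex(addr), "-> bank", bank_of(addr))
```

Interleaving by line keeps streaming accesses from piling onto one bank, which is exactly what the interconnects below have to cope with.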

The Highway Problem

Imagine Core 0 and Core 1 are two high-speed commuters, and they need to pick up “packages” from Banks 0, 1, 2, and 3. The physical layout of the road between them determines everything.

1. The Unified Bus (The Single-Lane Road)

In older or simpler designs, we used a unified bus. Think of this as a single-lane bridge. If Core 0 is fetching a matrix row from Bank 2, the bridge is occupied. Core 1 has to sit at the “toll booth” and wait, even if it just wants a simple variable from Bank 0. This creates a massive bottleneck. In our “Shield” world, we call this a serialization point—it’s easy to verify, but the performance is terrible.
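The single-lane behavior is easy to see in a toy model. In the sketch below, every request grabs the whole bus, so finish times simply stack up regardless of which bank is targeted. All cycle counts are made-up numbers for illustration:

```python
# Toy model of a unified (shared) bus: every access is serialized,
# so each request waits for all earlier ones, whatever bank it wants.
# BUS_TRANSFER_CYCLES is an assumed cost, not from a real design.

BUS_TRANSFER_CYCLES = 4

def bus_latency(requests):
    """requests: list of (core, bank). Returns (core, bank, finish_cycle)."""
    finish = []
    clock = 0
    for core, bank in requests:
        clock += BUS_TRANSFER_CYCLES  # the bus is the serialization point
        finish.append((core, bank, clock))
    return finish

# Core 1's request to Bank 0 finishes late purely because Core 0
# was already on the bridge, even though the banks don't conflict.
print(bus_latency([(0, 2), (1, 0)]))
```

Note that Core 1 pays the full bus penalty even though Bank 0 was sitting idle the whole time.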

2. The Crossbar (The Multi-Level Interchange)

Now, imagine “God Mode” for architects: the Crossbar. This is a grid where every core has a direct, private path to every single bank. Core 0 can be chatting with Bank 3 while Core 1 is simultaneously pulling data from Bank 1. No waiting. No traffic jams. It is the gold standard for performance. But there’s a catch—the wiring complexity grows quadratically: N cores talking to M banks means N × M crosspoints. If you have 16 cores and 16 banks, that’s 256 crosspoints, a forest of wires that eats up precious silicon area.
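Both sides of that trade show up in a few lines of Python. In this sketch, requests to distinct banks overlap completely and only same-bank conflicts queue at the bank; the transfer cost is an assumed number for illustration:

```python
# Crossbar sketch: requests to distinct banks proceed in parallel; only
# same-bank conflicts serialize at that bank's arbiter. The cost side:
# N cores x M banks needs N*M crosspoints, so area grows quadratically.
from collections import defaultdict

def crosspoints(cores, banks):
    return cores * banks

def crossbar_latency(requests, transfer=4):
    """requests: list of (core, bank). Returns (core, bank, finish_cycle)."""
    bank_free = defaultdict(int)  # cycle at which each bank is next free
    finish = []
    for core, bank in requests:
        done = bank_free[bank] + transfer
        bank_free[bank] = done
        finish.append((core, bank, done))
    return finish

# Core 0 -> Bank 3 and Core 1 -> Bank 1 overlap: both finish at cycle 4,
# unlike the unified bus, where the second request would wait until 8.
print(crossbar_latency([(0, 3), (1, 1)]))
print("16 cores x 16 banks:", crosspoints(16, 16), "crosspoints")
```

Compare the crosspoint counts as you scale: 2×2 needs 4, 8×8 needs 64, 16×16 needs 256. Quadratic, not exponential, but still brutal in wire area.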

3. The Mesh (The Scalable Neighborhood)

For the massive AI/ML workloads I verify at Google, we often look toward Mesh or Ring topologies. Here, banks and cores are nodes in a grid. To get from Core 0 to Bank 3, your data might “hop” through Bank 1 and Bank 2. It’s scalable and area-efficient, but it introduces Latency Jitter. The time it takes to get your data depends on how “far” the bank is from the core.
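The jitter falls straight out of hop distance. Here is a minimal sketch assuming a hypothetical one-row tile placement and made-up per-hop and bank-access costs; real meshes are 2-D with more sophisticated routing, but the distance-dependence is the same:

```python
# Mesh/ring sketch: latency depends on how many router hops separate a
# core's tile from the bank's tile. The placement, HOP_CYCLES, and
# BANK_CYCLES below are all hypothetical values for illustration.

HOP_CYCLES = 2   # assumed cost per router hop
BANK_CYCLES = 3  # assumed bank access time

# Hypothetical placement: tiles laid out along one row.
POS = {"core0": 0, "bank0": 1, "bank1": 2, "bank2": 3, "bank3": 4, "core1": 5}

def mesh_latency(src, dst):
    hops = abs(POS[src] - POS[dst])
    return hops * HOP_CYCLES + BANK_CYCLES

print(mesh_latency("core0", "bank0"))  # 1 hop: fast
print(mesh_latency("core0", "bank3"))  # 4 hops: latency jitter in action
print(mesh_latency("core1", "bank3"))  # Core 1 is the near neighbor here
```

The same bank is “close” to one core and “far” from the other, which is exactly the asymmetry the next section is about.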

The Architect’s Dilemma: Uniform vs. Non-Uniform

This leads to the big question of access. In a Uniform system, the interconnect is designed so that every core sees every bank at the same distance. In a Non-Uniform system (NUCA, the cache-level cousin of NUMA), Core 0 might access Bank 0 in 5 cycles but take 15 cycles to reach Bank 3.
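Why would anyone accept that asymmetry? Because a uniform design typically pads every access to the slowest path, while non-uniform access wins on average. A quick back-of-the-envelope using the 5-cycle and 15-cycle figures above (the middle values are interpolated purely for illustration):

```python
# NUCA vs. uniform, back of the envelope. Near/far endpoints (5, 15)
# come from the example in the text; the intermediate bank latencies
# are interpolated guesses for illustration only.
NUCA_CYCLES = {
    (0, 0): 5,  (0, 1): 8,  (0, 2): 12, (0, 3): 15,
    (1, 0): 15, (1, 1): 12, (1, 2): 8,  (1, 3): 5,
}
UNIFORM_CYCLES = 15  # uniform designs pad every access to the worst path

avg = sum(NUCA_CYCLES.values()) / len(NUCA_CYCLES)
print("NUCA average:", avg, "cycles; Uniform:", UNIFORM_CYCLES, "cycles")
```

Under uniformly distributed accesses, NUCA averages 10 cycles against a flat 15. That average is what architects buy with the verification headaches below.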

As a verification lead, this is where the “Shield” gets tested. When you have a non-uniform mesh, you run into Race Conditions. What happens if Core 0 sends an Invalidate signal to Bank 3, but Core 1’s Read request gets there first because it was physically closer? These are the “silent killers” of silicon that we hunt for every day.
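The hazard is just arithmetic on arrival times. In the sketch below, a message launched later still arrives first because its path is shorter; launch cycles and hop counts are made-up numbers, and real protocols close this hole by ordering same-address requests at a serialization point (for example, at the bank or home node):

```python
# Ordering-hazard sketch: two messages to the same bank, launched in one
# order, arrive in the opposite order because their hop counts differ.
# All launch cycles and hop counts are illustrative.

HOP_CYCLES = 2  # assumed cost per router hop

def arrival(launch_cycle, hops):
    return launch_cycle + hops * HOP_CYCLES

inv_arrives = arrival(launch_cycle=10, hops=3)   # Core 0's Invalidate: far
read_arrives = arrival(launch_cycle=12, hops=1)  # Core 1's Read: near

# The later-launched Read reaches Bank 3 first. Without a point of
# serialization at the bank, it can return soon-to-be-stale data.
assert read_arrives < inv_arrives
print("Read arrives:", read_arrives, "Invalidate arrives:", inv_arrives)
```

This is why mesh-based coherence protocols spend so much effort on ordering guarantees rather than trusting the network to deliver in launch order.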

Let’s Hear From You

In my 16 years of experience, I’ve seen Crossbars fail because of arbitration “starvation” and Buses crawl because of congestion.

Pro question for the community: If you were designing a next-gen AI chip with 8 matrix-heavy cores, would you risk the area cost of a full Crossbar for that 900 GB/s bandwidth, or would you settle for a Mesh and hope your software compiler can handle the non-uniform latency?

Drop your thoughts in the comments—I’m curious to see if you value Area or Throughput more in your designs!

–Hardik Makhaniya
