Think one GPU is very much like another? Think again. It turns out that there’s surprising variability in the performance delivered by chips of the same model. That can make getting your money’s worth by renting time on a GPU from a cloud provider a real roll of the dice, according to research from the College of William & Mary, Jefferson Lab, and Silicon Data.
“It’s called the silicon lottery,” says Carmen Li, founder and CEO of Silicon Data, which tracks GPU rental prices and benchmarks cloud-computing performance.
The silicon lottery’s existence has been known since at least 2022, when researchers at the University of Wisconsin tied it to variations in the performance of GPU-dependent supercomputers. Li and her colleagues figured that the effect would be even more pronounced for AI cloud customers.
Performance varies for GPU models in the cloud
So they ran 6,800 instances of Silicon Data’s benchmark test on 3,500 randomly selected GPUs operated by 11 cloud-computing providers. The 3,500 GPUs comprised 11 models of Nvidia GPU, the most advanced being the Nvidia H200 SXM. (The team wasn’t just picking on Nvidia; the GPU giant makes up most of the rental cloud market.)
The benchmark, called SiliconMark, is intended to provide a snapshot of a GPU’s ability to run large language models, or LLMs. It tests 16-bit floating-point computing performance, measured in trillions of operations per second, and a GPU’s internal-memory bandwidth, measured in gigabytes per second. The results showed that the computing performance varied for all models, but for the 259 H100 PCIe GPUs it differed by as much as 34.5 percent, and the memory bandwidth of the 253 H200 SXM GPUs varied by as much as 38 percent.
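To make a figure like “varied by as much as 34.5 percent” concrete, here is a minimal Python sketch of one way to express the spread across a batch of readings from GPUs of the same model. The article does not say how the researchers defined their percentages, so the formula (range relative to the slowest unit), the function name, and the sample numbers are all illustrative assumptions, not SiliconMark’s method or data.

```python
def relative_spread(samples):
    """Return (max - min) / min as a percentage.

    Assumed definition for illustration only; the study may have
    used a different one.
    """
    lo, hi = min(samples), max(samples)
    if lo <= 0:
        raise ValueError("samples must be positive")
    return (hi - lo) / lo * 100.0


# Made-up 16-bit throughput readings (trillions of operations per second)
# for four GPUs of the same model. Not real measurements.
fp16_tops = [400.0, 450.0, 500.0, 520.0]

spread = relative_spread(fp16_tops)
print(f"Spread across units: {spread:.1f} percent")
```

With these invented readings, the fastest unit is 30 percent quicker than the slowest, a gap in the same range as the ones the study reports.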

Differences in how the GPU is cooled, how cloud operators configure their computers, and how much use the chip has seen can all contribute to variations in performance of otherwise identical chips. But Silicon Data’s analysis showed that the real culprit was variations in the chips themselves, likely due to manufacturing issues.
Such randomness has real dollars-and-cents consequences, the researchers argue, because there’s a chance that a pricier, more advanced GPU won’t deliver better performance than an older-model chip.
So what should GPU renters do? “The most practical approach is to benchmark the actual rental they receive,” says Jason Cornick, head of infrastructure at Silicon Data. “Running a benchmark tool [such as SiliconMark] allows them to compare their specific instance’s performance against a broader corpus of data.”
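Cornick’s advice can be sketched in a few lines of Python: take the score your rented instance produced and see where it falls among scores from other GPUs of the same model. This is a rough illustration under stated assumptions. The corpus values, the score, and the `percentile_rank` helper are hypothetical and do not come from SiliconMark or any Silicon Data API.

```python
from bisect import bisect_left


def percentile_rank(corpus, score):
    """Percentage of corpus scores strictly below `score`."""
    if not corpus:
        raise ValueError("corpus must not be empty")
    ordered = sorted(corpus)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical benchmark scores for other GPUs of the same model
# (trillions of operations per second). Not real data.
corpus_scores = [410.0, 455.0, 470.0, 480.0, 495.0,
                 500.0, 505.0, 510.0, 515.0, 520.0]

# Hypothetical score measured on the instance you actually rented.
my_score = 472.0

rank = percentile_rank(corpus_scores, my_score)
print(f"This instance outperforms {rank:.0f} percent of the corpus")
```

A low rank suggests the renter drew a weaker chip than is typical for that model and might do better by requesting a different instance.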