The RiseML blog last week reported benchmarks suggesting that Google’s custom TPUv2 chips and Nvidia V100 GPUs offer roughly comparable performance on select deep learning tasks, but that access to TPUv2 on Google Cloud costs less than access to V100s on AWS. Google began providing public access to TPUv2 in February via its Cloud TPU offering, which includes four TPUv2 chips.
(Update: In an interesting turn of events, Google announced today it was offering access to the V100. See HPCwire article, Google Is Latest ‘Big Three’ Cloud Provider to Offer V100 GPUs.)
Elmar Haußmann, cofounder and CTO of RiseML, wrote in the company blog, “In terms of raw performance on ResNet-50, four TPUv2 chips (one Cloud TPU) and four V100 GPUs are equally fast (within 2% of each other) in our benchmarks. We will likely see further optimizations in software (e.g., TensorFlow or CUDA) that improve performance and change this.
“What often matters most in practice though, is the time and cost it takes to reach a certain accuracy on a certain problem instance. The current pricing of Cloud TPUs coupled with a world-class implementation of ResNet-50 results in an impressive time- and cost-to-accuracy on ImageNet, which allows to train a model to an accuracy of 76.4% for about $73.”
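The two metrics Haußmann describes, raw speed and cost-to-accuracy, come down to simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the hourly price, training time, and throughput figures are placeholders rather than numbers from RiseML's benchmarks.

```python
def cost_to_accuracy(hourly_price_usd: float, hours_to_target: float) -> float:
    """Cost of renting the hardware until the model reaches the target accuracy."""
    if hourly_price_usd < 0 or hours_to_target < 0:
        raise ValueError("price and time must be non-negative")
    return hourly_price_usd * hours_to_target


def relative_gap(throughput_a: float, throughput_b: float) -> float:
    """Difference between two throughputs as a fraction of the faster one."""
    fastest = max(throughput_a, throughput_b)
    if fastest <= 0:
        raise ValueError("at least one throughput must be positive")
    return abs(throughput_a - throughput_b) / fastest


# Placeholder inputs (NOT benchmark results): 5 USD per hour for 10 hours.
example_cost = cost_to_accuracy(hourly_price_usd=5.0, hours_to_target=10.0)

# Placeholder throughputs in images per second (NOT benchmark results).
example_gap = relative_gap(100.0, 98.0)
```

Under this framing, two platforms that are "equally fast" on raw throughput can still differ widely on cost-to-accuracy if their hourly prices, or the time their implementations take to converge, differ.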
The RiseML blog post is brief and best read in full. RiseML compared four TPUv2 chips (which form one Cloud TPU) to four Nvidia V100 GPUs: “Both have a total memory of 64 GB, so the same models can be trained and the same batch sizes can be used. In our experiments, we also train models in the same fashion: the four TPUv2 chips on a Cloud TPU run a form of synchronous data parallel distributed training as do the four V100s.”
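For readers unfamiliar with the term, synchronous data parallel training splits each batch across the devices, has every device compute a gradient on its own shard, averages those gradients, and applies the same update everywhere. The toy Python sketch below illustrates the idea with a one-parameter linear model. It is a conceptual illustration with hypothetical names, not the TensorFlow code RiseML ran, and the replicas are simulated sequentially instead of on real hardware.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def shard_gradient(w: float, shard: List[Example]) -> float:
    """Mean gradient of the loss 0.5 * (w*x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def sync_data_parallel_step(
    w: float, batch: List[Example], num_replicas: int = 4, lr: float = 0.1
) -> float:
    """One synchronous step: shard the batch, average gradients, update once."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by the replica count")
    per_replica = len(batch) // num_replicas
    shards = [
        batch[i * per_replica:(i + 1) * per_replica] for i in range(num_replicas)
    ]
    # On real hardware each replica would compute its gradient in parallel.
    grads = [shard_gradient(w, shard) for shard in shards]
    # Averaging stands in for the all-reduce step that synchronizes replicas.
    avg_grad = sum(grads) / num_replicas
    return w - lr * avg_grad


# Toy data following y = 2x; eight examples split across four replicas.
batch = [(float(i), 2.0 * i) for i in range(1, 9)]
w_after_step = sync_data_parallel_step(0.0, batch)
```

Because the shards are equal in size, the averaged gradient matches what a single device would compute on the whole batch, which is why the effective batch size grows with the number of devices.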
RiseML settled on a benchmark after discussions with Google and Nvidia: “[We chose] to use the ResNet-50 model on ImageNet, a de facto standard and reference point for image classification. Reference implementations of ResNet-50 are publicly available, but there is currently no single implementation that supports both training on a Cloud TPU and multiple GPUs,” wrote Haußmann.
“For the V100s, Nvidia recommended to use MXNet or TensorFlow implementations, both available in Docker images on the Nvidia GPU Cloud. However, we found both implementations didn’t converge well out-of-the-box with multiple GPUs and the resulting large batch sizes. This requires adjustments, in particular, in the learning rate schedule.
“Instead, we used the ResNet-50 implementation from TensorFlow’s benchmark repository and ran it in a Docker image (tensorflow/tensorflow:1.7.0-gpu, CUDA 9.0, CuDNN 7.1.2). It is considerably faster than Nvidia’s recommended TensorFlow implementation and only slightly slower than the MXNet implementation. However, it converged well. This also has the added benefit of comparing two implementations in the same framework at the same version (TensorFlow 1.7.0),” he wrote.
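Haußmann does not say which learning rate adjustments would have been needed for the Nvidia-recommended implementations. One widely used heuristic for large batches is to scale the learning rate linearly with the batch size and ramp up to it over a warmup period. The Python sketch below illustrates that heuristic only; it should not be read as RiseML's method, and the function name and every default value are hypothetical.

```python
def scaled_learning_rate(
    step: int,
    base_lr: float = 0.1,      # hypothetical rate tuned for base_batch
    base_batch: int = 256,     # hypothetical reference batch size
    global_batch: int = 1024,  # hypothetical total batch across all devices
    warmup_steps: int = 500,   # hypothetical warmup length
) -> float:
    """Linear scaling rule with linear warmup to the scaled peak rate."""
    if step < 0 or warmup_steps <= 0 or base_batch <= 0:
        raise ValueError("step must be >= 0; warmup_steps and base_batch > 0")
    peak_lr = base_lr * global_batch / base_batch
    if step < warmup_steps:
        # Ramp from a small rate up to the peak to avoid early divergence.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# With the placeholder defaults, the peak rate is four times the base rate.
first_step_lr = scaled_learning_rate(0)
after_warmup_lr = scaled_learning_rate(500)
```

A real schedule would typically also decay the rate later in training; that part is omitted here to keep the sketch focused on the batch-size adjustment.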
In the future, wrote Haußmann, benchmarks of models from other domains and with different network architectures are needed to provide further insight. “One interesting point to consider as well is how much effort it is to make efficient use of a given hardware platform. For example, mixed-precision computation comes with a great performance increase, but implementation and behaviour on GPUs and TPUs differs,” he wrote.