Today marks the official release of the NVIDIA CUDA Toolkit version 6.5, which had previously been available only in pre-release form. In a company blog post, Mark Harris, NVIDIA’s Chief Technologist for GPU Computing Software, covers the toolkit’s new features and improvements, including “support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.”
The release is part of a larger ecosystem that includes CUDA on ARM, released last year, and the Jetson TK1 developer board, released in March. At last week’s Hot Chips conference, NVIDIA revealed more information about the upcoming Tegra K1 “Project Denver” 64-bit ARM CPU architecture.
“CUDA 6.5 takes the next step,” writes Harris, “enabling CUDA on 64-bit ARM platforms.”
He continues: “The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC.”
Harris paints the marriage of ARM64 and GPGPUs as a best-of-both-worlds scenario: ARM64 provides power efficiency, system configurability, and a large, open ecosystem, while GPUs contribute high-throughput, power-efficient compute performance and a robust HPC ecosystem that includes hundreds of CUDA-accelerated applications. As with other CPU-GPU hybrid systems, the ARM64 CPUs can offload the compute-intensive tasks to the GPUs. In this way, “CUDA and GPUs make ARM64 competitive in HPC from day one,” concludes Harris.
Figure one depicts the performance of three CUDA-accelerated applications on ARM64+GPU systems as being on par with x86+GPU systems. The bigger competitive threat for NVIDIA will come from Intel’s Xeon CPU-MIC architecture.
Currently available CUDA+ARM64 development platforms include Cirrascale’s RM1905D HPC Development Platform and the E4 ARKA EK003. These are equipped with Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPUs, and CUDA 6.5. Eurotech plans to release a similarly outfitted system soon, which it says will enable a peak performance of 1 petaflops in one square meter.
The remainder of the blog post addresses the ways that CUDA 6.5 improves performance and productivity. Highlights include the ability to specify cuFFT device callbacks; improved support for CUDA Fortran in developer tools; and new CUDA occupancy calculator and occupancy-based launch configuration APIs. The latest CUDA release also includes support for Microsoft Visual Studio 2013 for Windows.
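For readers curious about what the occupancy-based launch configuration looks like in practice, the following is a minimal sketch rather than code taken from Harris's post. It assumes the CUDA runtime functions cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxActiveBlocksPerMultiprocessor; the kernel name scale, the array size, and the scaling factor are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel (not from the blog post): scales an array in place.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 20;  // illustrative problem size
    float *d_data = NULL;

    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Ask the runtime for a block size that maximizes theoretical occupancy
    // for this kernel, instead of hard-coding a value such as 256.
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize = 0;    // suggested threads per block
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, scale, 0 /* dynamic shared memory */, 0 /* no block size limit */);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    // Round the grid size up so that every element is covered.
    int gridSize = (n + blockSize - 1) / blockSize;
    scale<<<gridSize, blockSize>>>(d_data, 2.0f, n);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    // Report how many blocks of this size can be resident on one multiprocessor.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, blockSize, 0);
    printf("block size %d, grid size %d, min grid size %d, resident blocks per SM %d\n",
           blockSize, gridSize, minGridSize, blocksPerSM);

    cudaFree(d_data);
    return 0;
}
```

The suggested block size is a heuristic based on theoretical occupancy, so the values printed will vary by GPU, and a hand-tuned configuration may still perform better for a given kernel.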
Optimizations to double-precision functions in the CUDA math library yielded measurable performance gains for some applications. For example, a double-precision n-body gravitational simulation code running on an NVIDIA Tesla K40 GPU ran 15 percent faster with no application code changes.