There’s another silicon startup coming onto the HPC/hyperscale scene with some intriguing and bold claims. Silicon Valley-based Tachyum Inc., which has been emerging from stealth over the last year and a half, is unveiling a processor codenamed “Prodigy,” said to combine features of both CPUs and GPUs in a way that offers a purported 10x performance-per-watt advantage over current technologies. The company is primarily focused on the hyperscale datacenter market, but has aspirations to support brainier applications, noting that “Prodigy will enable a super-computational system for real-time full capacity human brain neural network simulation by 2020.”
Tachyum says that its Prodigy universal processing architecture marries the programmability of CPUs with the power efficiency and performance features of the GPGPU.
“Rather than build separate infrastructures for AI, HPC and conventional compute, the Prodigy chip will deliver all within one unified simplified environment, so for example AI or HPC algorithms can run while a machine is otherwise idle or underutilized,” said Tachyum CEO and Cofounder Radoslav ‘Rado’ Danilak. “Instead of supercomputers with a price tag in the hundreds of millions, Tachyum will make it possible to empower hyperscale datacenters to produce more work in a radically more efficient and powerful format, at a lower cost.”
AI was a focus during the press activities that accompanied Tachyum’s participation at the GlobeSec conference in Bratislava, Slovakia, last week. Danilak indicated the technology is in the running for a prominent brain modeling project, but otherwise downplayed the AI use case when we interviewed him for this story, affirming the hyperscale datacenter as the company’s primary target. He said, “AI is just 3-5 percent of silicon today, and 95 percent is server, so our chip is shooting for that 95 percent of market.”
The CEO further clarified: “We don’t sell into enterprise market – that would not be fruitful. Our market is hyperscalers. Most of the [target] customers have their own application source code and we provide the full compiler toolchain from open source, like GCC and so on, porting Linux and baseline applications. So in our primary market, we provide tools so they can recompile and go, they don’t need to rewrite applications.”
The thrust of Tachyum’s proposition is that hyperscale servers are only 30-40 percent utilized and sit largely idle at night, during off-peak hours. Prodigy chips can be reconfigured in software to run AI workloads at night, enabling “10x more AI for free,” said Danilak.
In a presentation at Flash Memory Summit last year, the CEO discussed the coming datacenter power wall, noting “a new computational mechanism is needed to overcome this plateau.” Further, “ARM A72 not an answer; Intel Atom has similar performance & power; FPGA, GPU, TPU apply only to limited applications versus CPU.”
The Prodigy platform has 64 cores with fully coherent memory, barrier, lock and standard synchronization, including transactional memory. Single-threaded performance will be higher than a conventional core, the CEO said. Each chip will have two 400 Gigabit Ethernet ports.
Power efficiencies are gained by moving out-of-order execution capability to software. “All the register rename, checkpointing, seeking, retiring, which is consuming majority of the power, is basically gone, replaced with simple hardware. All the smartness of out-of-order execution was put to compiler,” the CEO told us.
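To illustrate the general idea of moving instruction scheduling from hardware into the compiler, here is a minimal sketch of greedy list scheduling, a textbook static-scheduling technique. It is not Tachyum's compiler or algorithm, which the company has not detailed. The `Instr` type, the `schedule` function, the latencies and the two-wide issue limit are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    deps: list[str] = field(default_factory=list)  # instructions whose results are needed
    latency: int = 1  # cycles until this instruction's result is available (assumed)


def schedule(instrs: list[Instr], issue_width: int = 2) -> dict[str, int]:
    """Assign each instruction an issue cycle ahead of time (greedy list scheduling).

    Assumes the dependency graph is acyclic and every dependency is in `instrs`.
    """
    by_name = {i.name: i for i in instrs}
    issue_cycle: dict[str, int] = {}
    remaining = list(instrs)
    cycle = 0
    while remaining:
        issued = 0
        for ins in list(remaining):  # walk in program order
            if issued == issue_width:
                break
            # Ready only if every input has been issued and its latency has elapsed.
            ready = all(
                d in issue_cycle and issue_cycle[d] + by_name[d].latency <= cycle
                for d in ins.deps
            )
            if ready:
                issue_cycle[ins.name] = cycle
                remaining.remove(ins)
                issued += 1
        cycle += 1
    return issue_cycle


# Hypothetical five-instruction program, listed in program order.
program = [
    Instr("load_a", latency=3),
    Instr("load_b", latency=3),
    Instr("add", deps=["load_a", "load_b"]),
    Instr("mul_c"),  # independent work that can fill the load delay
    Instr("store", deps=["add"]),
]
plan = schedule(program)
print(plan)
```

In this toy run the independent `mul_c` is placed ahead of `add`, which has to wait for both loads. That is the kind of reordering an out-of-order core would otherwise discover at run time with rename and retire logic.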
“We are kind of a hybrid,” he continued. “[The industry has] in-order-execution machines like low-power Arm, but they have not demonstrated good performance on single thread, then you have big machines like Intel Xeon which have very good performance per thread but they are very power hungry. We are able to get the performance of Xeon per thread but power comparable to low power Arm, so we attack and reduce that cost of scheduling by moving hardware to a very complicated piece of the software.”
Citing a paper by Google’s Urs Hölzle enumerating the failings of wimpy cores, Danilak asserted that Google and other hyperscalers passed on low-power Arm because of its poor single-thread performance. “So from day one we designed our platform to go into the server,” Danilak said. “We built a machine which is fastest on single-threaded but also on parallel applications because if you don’t do that, Amdahl’s law will get you. You need to have the non-vectorized parts of the application be really fast too to get the good scaling.”
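The Amdahl's law argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, a workload that is 90 percent parallelizable running on the 64 cores cited earlier. The 90 percent figure, the 2x serial speedup and the function name are assumptions made for this example, not Tachyum data.

```python
def amdahl_speedup(parallel_fraction: float,
                   parallel_speedup: float,
                   serial_speedup: float = 1.0) -> float:
    """Overall speedup when the serial and parallel parts are accelerated separately."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction / serial_speedup
                  + parallel_fraction / parallel_speedup)


# Illustrative assumption: 90% of the runtime parallelizes across 64 cores.
baseline = amdahl_speedup(0.90, 64)
# Same workload, but the serial 10% also runs twice as fast per thread.
faster_serial = amdahl_speedup(0.90, 64, serial_speedup=2.0)

print(f"64 cores, unchanged serial part: {baseline:.1f}x")
print(f"64 cores, 2x faster serial part: {faster_serial:.1f}x")
```

Under these assumptions, 64 cores yield only about 8.8x overall, and doubling single-thread speed on the serial part lifts that to about 15.6x. That gap is why the serial, non-vectorized portion dominates scaling.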
Danilak claims that by enabling a 4x reduction in datacenter TCO through improved power efficiency and reduced footprint, hyperscalers like Google and Facebook could save billions of dollars by moving to Prodigy. In terms of performance, the CEO said that a 256,000-server configuration based on Prodigy chips would deliver 32 exaflops of TensorFlow performance. That works out to 125 teraflops per Prodigy chip. As a point of reference, Google’s new TPU (v3) chip promises 90 teraflops of unspecified floating-point performance; Volta with NVLink offers 125 mixed-precision Tensor teraflops. The pitch for Prodigy is that it is applicable to a wider range of datacenter applications.
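The per-chip figure follows directly from the numbers Danilak quoted. The short check below assumes one Prodigy chip per server, which the quoted figures imply but the company did not state explicitly; the variable names are ours.

```python
SERVERS = 256_000        # server count quoted by the CEO
TOTAL_FLOPS = 32e18      # 32 exaflops, as quoted
CHIPS_PER_SERVER = 1     # assumption: one Prodigy chip per server

flops_per_chip = TOTAL_FLOPS / (SERVERS * CHIPS_PER_SERVER)
teraflops_per_chip = flops_per_chip / 1e12

print(f"{teraflops_per_chip:.0f} teraflops per chip")
```

With two chips per server, the same totals would imply 62.5 teraflops per chip, so the one-chip assumption matters when comparing against the TPU and Volta numbers.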
The Prodigy architecture is fully compliant with IEEE-standard double-, single- and half-precision floating point, and also supports 8-bit floating point. The programming model includes C, C++, Java, Fortran, and Ada. “We support full staging, memory system, precise exception, and full coherency system so that allows you to run existing applications and simplifies use and deployment of applications,” the CEO said.
Tachyum says it has found a way around the “slow wire” limitations that impede today’s semiconductor devices. It is working with a fab on a semi-custom COT-flow (customer-owned tooling) design, using 7-nm technology, and expects to have prototypes out next year with sampling to follow. Ahead of tape-out, Tachyum will provide early adopters and other partners with FPGA-based emulation systems.
The CEO acknowledged the non-recurring engineering costs are significant, but indicated that the chips will be priced below Xeons and will offer a performance-per-dollar advantage over today’s high-end CPUs and GPUs.
Danilak has an accomplished track record as a technologist and entrepreneur. He founded Skyera, an ultra-dense flash storage company, and SandForce, a supplier of SSD controllers. Skyera was acquired by Western Digital in 2014 and SandForce was sold to LSI in 2011 for $377 million (LSI’s SSD business was later acquired by Seagate in 2014). He was also part of the Wave Computing team that built the 10GHz processing element of its deep learning DPU.
Tachyum’s technology has garnered an endorsement from Christos Kozyrakis, professor of electrical engineering and computer science at Stanford. “Despite efficiency gains from virtualization, cloud computing, and parallelism, there are still critical problems with datacenter resource utilization particularly at a size and scale of hundreds of thousands of servers. Tachyum’s breakthrough processor architecture will deliver unprecedented performance and productivity,” said Kozyrakis, who joined Tachyum as a corporate advisor in January.
Tachyum received venture funding earlier this year from European investment company IPM Growth and says it will do one more round at the end of this year to get the chip to production. In March, Tachyum moved its headquarters to a larger facility in San Jose, Calif., and announced it was looking to expand its team.