AI Chip Start-up Groq to Detail Technology Progress in Fall

By John Russell

August 13, 2020

AI chip startup Groq announced yesterday it had closed its most recent funding round, saying the new investments will help it double in size by the end of this year and double again by the end of next year as it transitions to commercial development. Groq’s new neural network chip – the Tensor Streaming Processor (TSP) – is well along in development and is designed to accelerate inference while consuming relatively little power. Among the new investors are D1 Capital, announced yesterday, and TDK Ventures, announced last week.

The amount of the new investment was not disclosed (see Groq’s release), but it’s no doubt good news for the startup. Groq is one of many newcomers seeking to make waves in the AI market. It was founded around 2017 by former Google employees, including Groq CEO Jonathan Ross, who participated in TPU development at Google. Bill Leszinske, VP of products and marketing, told HPCwire that more details about the technology roadmap and early customers would be discussed at the AI Hardware Summit this fall.

“Jonathan Ross will specifically talk about the products that we’re shipping and have shipped to customers,” Leszinske said. “We have our core piece of silicon and the initial software stack that we’ve been working on. That has shipped as a PCIe add-in card that goes into servers and very high-end workstations.” Groq is currently targeting a few market segments, in particular those with low-power, high-performance requirements such as autonomous vehicles. He also cited HPC more broadly as a target market, as well as power-hungry hyperscaler datacenters. Leszinske said several government labs would be working with the new chip. That wouldn’t be surprising given that many national labs – Argonne, for example – have set up AI chip testbeds to assess new devices.

The TSP is being fabbed on a 14nm process by GlobalFoundries, and a fairly thorough description of its architecture was presented at the 2020 International Symposium on Computer Architecture (ISCA) (link to paper).

Briefly, the idea is to remove some of the overhead (instruction handling) required by general-purpose microprocessors by physically reorganizing the chip’s functional elements (e.g., locating needed memory and support logic nearby). “Instead of creating a small programmable core and replicating it dozens or hundreds of times, the TSP houses a single enormous processor that has hundreds of functional units. This novel architecture greatly reduces instruction-decoding overhead, and handles integer and floating-point data, which makes delivering the best accuracy for inference and training a breeze,” says Groq.

Proximity, fewer instructions required per operation, and efficient data flow all work together to reduce latency and increase performance, says Groq. Here’s a brief portion of the architecture description excerpted from the Groq paper:

“To understand the novelty of our approach, consider the chip organization shown in Figure 1(a). In a conventional chip multiprocessor (CMP) each “tile” is an independent core which is interconnected using the on-chip network to exchange data between cores. Instruction execution is carried out over several stages: 1) instruction fetch (IF), 2) instruction decode (ID), 3) execution on ALUs (EX), 4) memory access (MEM), and 5) writeback (WB) to update the results in the GPRs. In contrast from conventional multicore, where each tile is a heterogeneous collection of functional units but globally homogeneous, the TSP inverts that and we have local functional homogeneity but chip-wide (global) heterogeneity.

“The TSP reorganizes the homogeneous two-dimensional mesh of cores in Figure 1(a) into the functionally sliced microarchitecture shown in Figure 1(b). In this approach, each tile implements a specific function and is stacked vertically into a “slice” in the Y-dimension of the 2D on-chip mesh. We disaggregate the basic elements of a core in Figure 1(a) per their respective functions: instruction control and dispatch (ICU), memory (MEM), integer (INT) arithmetic, floating point (FPU) arithmetic, and network (NET) interface, as shown by the slice labels at the top of Figure 1(b).

“In this organization, each functional slice is independently controlled by a sequence of instructions specific to its on-chip role. For instance, the MEM slices support Read but not Add or Mul, which are only in arithmetic functional slices (the VXM and MXM slices).”
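To make the slicing idea concrete, here’s a toy Python sketch of two cooperating functional slices. The slice names (MEM, VXM) follow the paper’s labels, but the instruction streams and dataflow below are invented for illustration – this is not Groq’s actual ISA.

```python
# Toy model of two functional slices. MEM handles only memory
# instructions; VXM handles only arithmetic. Data streams from one
# slice to the next, loosely mirroring the paper's Figure 1(b).

SRAM = {0: [1.0, 2.0, 3.0], 1: [10.0, 20.0, 30.0]}  # stand-in on-die memory

def mem_slice(instrs):
    """MEM slice: supports Read (Add/Mul never reach this slice)."""
    for op, addr in instrs:
        assert op == "Read"
        yield SRAM[addr]             # operand streams toward the VXM slice

def vxm_slice(operands, instrs):
    """Vector (VXM) slice: supports Add/Mul (no memory access here)."""
    for op in instrs:
        a, b = next(operands), next(operands)
        if op == "Add":
            yield [x + y for x, y in zip(a, b)]
        elif op == "Mul":
            yield [x * y for x, y in zip(a, b)]

# The ICU's role, loosely: hand each slice its own instruction stream.
operands = mem_slice([("Read", 0), ("Read", 1)])
print(list(vxm_slice(operands, ["Add"])))   # [[11.0, 22.0, 33.0]]
```

The point of the sketch is that each slice sees only the instructions relevant to its function – the MEM slice never decodes an Add – which is the source of the reduced instruction-decoding overhead Groq describes.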

The company reports the following specs (a quick back-of-envelope check follows the list):

  • INT8 – 1 PetaOp/s @ 1.25GHz
  • FP16 – 250 TFLOPS @ 1.25GHz
  • Transistors – 26.8B
  • Process – 14nm
  • SRAM – 220MB on-die
  • Memory Bandwidth – 80TB/s on-die
  • Host Interface – PCIe Gen4 x16, 31.5GB/s in each direction
  • C2C Interface – Up to 16 chip-to-chip interconnects
  • Driver – Open source, simple stack
  • ECC – End-to-end chip protection
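Dividing the quoted peak rates by the 1.25GHz clock gives a feel for the per-cycle parallelism those figures imply – a minimal sanity check using the company’s published peaks, not independent measurements:

```python
# Back-of-envelope check: peak rate / clock = operations per cycle.
# All figures are Groq's published peaks, not measured numbers.

clock_hz = 1.25e9                    # 1.25 GHz

int8_ops_per_s = 1e15                # 1 PetaOp/s
fp16_flops_per_s = 250e12            # 250 TFLOPS

print(int8_ops_per_s / clock_hz)     # 800000.0 INT8 ops per cycle
print(fp16_flops_per_s / clock_hz)   # 200000.0 FP16 FLOPs per cycle
```

In other words, the chip must retire on the order of 800,000 INT8 operations per cycle to hit its quoted peak – consistent with one very wide processor rather than a modest number of conventional cores.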

One open question is the programming effort required to take advantage of the architecture; the company is still developing its tools.

Leszinske said, “We have two efforts. One is this low-level API that gives folks high control over implementing exactly what they want and getting the most performance out of the hardware. In addition to that, we are working on a compiler. Given customer interest we had to prioritize the Groq API over the compiler as our first set of deliverables. So as we’re releasing software to customers, which again, we’ll talk about at the end of September publicly, we are using the Groq API as the initial development platform for customers. We’ll talk about the compiler later on in the year, as we start to ramp some early customers.”
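To illustrate the tradeoff Leszinske is describing, here is a purely hypothetical sketch; the stub functions below are invented for this example and are not Groq’s actual API or compiler interface.

```python
# Hypothetical contrast between the two programming paths. Nothing
# here is Groq's real interface; the stubs only illustrate the tradeoff.

def low_level_path():
    # Low-level API: the developer spells out each step explicitly,
    # trading effort for control over placement and scheduling.
    return [
        ("mem.read",   {"tensor": "A"}),
        ("mem.read",   {"tensor": "B"}),
        ("mxm.matmul", {"inputs": ("A", "B"), "output": "C"}),
        ("mem.write",  {"tensor": "C"}),
    ]

def compiler_path(model_graph):
    # Compiler: the developer hands over a whole model and the
    # toolchain derives the instruction streams automatically.
    return [("compiled_op", node) for node in model_graph]

print(low_level_path())
print(compiler_path(["matmul", "relu"]))
```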

Groq has also struck a deal with HPC cloud specialist Nimbix, which presumably would become both a development platform and a deployment platform.

Link to Groq video on its architecture: https://www.youtube.com/watch?time_continue=82&v=pb0PYhLk9r8&feature=emb_logo
