Nvidia is trying to uncomplicate AI with a cloud service that makes working with large AI models less opaque and more conversational.
The NeMo LLM service, which Nvidia calls its first cloud service, adds a layer of intelligence and interactivity that lets users converse naturally with complex AI models in domains such as biotechnology and medicine.
Many AI models in development or research are complicated, and turning them into useful enterprise applications that fit real-world commercial settings takes work, said Ian Buck, vice president and general manager of accelerated computing at Nvidia.
“We need to tailor these large language models to answer questions in certain ways, to give them context, and the domain problem to solve,” said Buck in a press briefing ahead of the company’s fall GPU Technology Conference, which is being held virtually this week.
Large language models are seen as a foundational technology for simplifying user interaction with AI. The more recent DALL-E 2, which has 3.5 billion parameters, can generate images from a natural-language description of just a few words, such as one might use to describe a piece of art.
The NeMo LLM service will make large language models easier to access so enterprises can experiment with, adapt, and deploy these models for their specific use cases.
While DALL-E 2 is a simple example of a generic use of a large language model, Nvidia is tuning the NeMo LLM service to add a conversational element to specialized domains including finance, technology and medicine.
“This service will help bring large language models to all sorts of different use cases – to generate profit summaries, for product reviews, to build technical Q&A, for medical use cases,” Buck said.
The cloud service takes pre-existing, pre-trained models such as NeMo Megatron (530 billion parameters), GPT-3 (5 billion and 20 billion parameter variants) or T5 (3 billion parameter variant) and builds a domain-specific framework around them, helping the models answer questions in the language best suited to a particular domain.
“You don’t need to train the large language model from scratch. We’ve already done that and made it easy for you,” Buck said.
The service is easy to use and doesn't require much coding. A developer enters domain prompts: example questions, the way they should be answered, and sample text or summarizations. The service then trains the model to answer questions in that particular way, as shown in the sketch below. The output is a cloud-based API that users can query directly or call from applications.
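Nvidia hasn't published the API at this level of detail, so the following is only a rough sketch of the workflow Buck describes; the endpoint, field names, and model identifiers are hypothetical, chosen for illustration.

```python
import requests

# Hypothetical endpoint and schema -- illustrative only, not Nvidia's published API.
API_BASE = "https://api.example-nemo-llm.nvidia.com/v1"
HEADERS = {"Authorization": "Bearer <YOUR_API_TOKEN>"}

# Step 1: supply domain prompts -- example questions paired with the
# style of answer the tuned model should produce.
examples = [
    {"prompt": "Summarize this earnings call transcript: ...",
     "completion": "Q3 revenue rose 12% year over year, driven by ..."},
    {"prompt": "Summarize this product review: ...",
     "completion": "Reviewers praise the battery life but note a dim display."},
]

# Step 2: ask the service to tune a pre-trained base model on those examples.
job = requests.post(
    f"{API_BASE}/tuning-jobs",
    headers=HEADERS,
    json={"base_model": "gpt-20b", "examples": examples},
).json()

# Step 3: the result is a cloud-hosted API that applications can query
# like any other web service.
answer = requests.post(
    f"{API_BASE}/models/{job['model_id']}/generate",
    headers=HEADERS,
    json={"prompt": "Summarize this technical support thread: ..."},
).json()
print(answer["text"])
```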
Nvidia is also kicking off the NeMo LLM cloud service with BioNeMo, which gives researchers access to pre-trained chemistry and biology language models. These services will help researchers explore and manipulate protein and chemical data for applications like drug discovery.
“Luckily chemistry and biology have their own languages – SMILES strings for chemistry, amino acids for proteins, and nucleic acids for DNA and RNA,” said Kimberly Powell, vice president and general manager of healthcare at Nvidia.
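For readers unfamiliar with these notations, here is roughly what each "language" looks like. Caffeine's SMILES string and the human insulin B chain are standard textbook examples; the DNA fragment is an arbitrary illustration.

```python
# Chemistry's "language": SMILES encodes a molecule as a character string.
caffeine_smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

# Protein "language": a chain of amino acids, one letter per residue.
insulin_b_chain = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"

# DNA "language": a sequence over the four-letter nucleic acid alphabet.
dna_fragment = "ATGGTGCTGTCTCCTGCCGACAAGACC"
```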
The first of two BioNeMo protein models, ESM-1, encodes important biological features from large protein databases. The model was originally developed by Meta (Facebook's parent company), retrained by Nvidia, and is now being offered as a service. It is designed for downstream use by the research and enterprise communities.
“Users of the service can input an amino acid sequence and the model will infer thousands of representations per second. That can be used to train a task-specific model, like predicting protein stability or solubility,” Powell said.
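Powell's description maps onto a standard embedding workflow: use the hosted model to turn sequences into representations, then fit a small task-specific model on top. A minimal sketch, where `esm1_embed` is a hypothetical stand-in for the hosted ESM-1 call (BioNeMo's actual API may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def esm1_embed(sequence: str) -> np.ndarray:
    """Stand-in for the hosted ESM-1 call: in practice this would send the
    amino acid sequence to the BioNeMo service and return its representation."""
    rng = np.random.default_rng(abs(hash(sequence)) % 2**32)
    return rng.normal(size=1280)  # ESM-1b uses 1280-dimensional representations

# Downstream task: train a small model on the representations to predict
# a property such as solubility (1 = soluble, 0 = not).
sequences = ["MKTAYIAKQR", "MGSSHHHHHH", "MALWMRLLPL"]  # toy amino acid sequences
labels = [1, 0, 1]                                     # toy experimental labels

X = np.stack([esm1_embed(s) for s in sequences])
classifier = LogisticRegression(max_iter=1000).fit(X, labels)

# Score a new sequence without running a new wet-lab experiment.
print(classifier.predict([esm1_embed("MVLSPADKTN")]))
```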
The BioNeMo service also provides a model developed by the OpenFold consortium, which predicts 3D protein structure from an amino acid sequence in just minutes.
“Otherwise, you have to use experiments to determine 3D structures. And they’re very difficult, expensive and can take years,” Powell said.
The OpenFold Consortium, which includes academics, startups and companies in the biotechnology and pharmaceutical sectors, developed the open-source protein language model. Nvidia will serve the model and will continue to iterate on and co-develop it with the consortium. Unlike ESM-1, Nvidia didn't retrain this model.
Users will get early access to BioNeMo next month.
The NeMo LLM cloud service will be deployed in datacenters that Nvidia classifies as “AI factories.” Customers feed raw data into the factory, and the output is a polished end product that is ready to deploy.
The NeMo LLM service is the latest addition to the stable of software deployed in Nvidia's AI factories. Other software products include Riva, for speech AI, and Merlin, a recommender system.
The NeMo LLM will take advantage of the new H100 GPUs based on the latest Hopper architecture, which Nvidia says is now in full production (although the full SXM capability is awaiting the availability of Intel’s Sapphire Rapids CPUs). Nvidia said eight H100 GPUs can match the output of 64 previous-generation A100 GPUs.
Large language models like those in Nvidia's cloud service are based on the transformer architecture, which helps AI understand which parts of a sentence, an image, or other disparate data points are related to each other. That is unlike convolutional neural networks, which look only at immediate neighboring relationships. The sketch below contrasts the two.
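A minimal, simplified illustration of that contrast (learned query/key/value projections and other details are omitted; this is conceptual, not production code):

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention: every position attends to every other,
    so distant tokens can influence each other directly.
    (Learned query/key/value projections are omitted for brevity.)"""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # all-to-all similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x                               # each output mixes the whole sequence

def conv1d_local(x: np.ndarray, width: int = 3) -> np.ndarray:
    """A 1D convolution-style pass, by contrast, mixes only a fixed local window."""
    out = np.zeros_like(x)
    half = width // 2
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out[i] = window.mean(axis=0)                 # sees only immediate neighbors
    return out

tokens = np.random.randn(8, 16)  # 8 token embeddings, 16 dimensions each
print(self_attention(tokens).shape, conv1d_local(tokens).shape)
```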
“Transformers can rein in the more distant relationships and that’s important for a whole class of problems. Natural language processing is important because in order to understand the meaning of a word, you have to look at the whole sentence, and even a paragraph, and the same is the case with a number of other domains,” Paresh Kharya, senior director of product management and marketing at Nvidia, told HPCwire.
Transformers also allowed Nvidia to capture more distant relationships in language and to train on unlabeled datasets.
“It greatly expanded the volume of data. In the case of NLP, it’s all the data on the internet. In the case of genomics and protein sequencing, the known structures and the behaviors and patterns is the data set that we have,” Kharya said.
The Hopper architecture includes a Transformer Engine that works at FP8 precision. Together with Nvidia’s software, Hopper can dynamically tune and adapt the precision needed by different layers in a model, speeding up training without impacting accuracy.
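Nvidia hasn't detailed the Transformer Engine's internals here, but the core idea behind dynamic low-precision training — rescaling each tensor so its values fill FP8's narrow range before quantizing — can be sketched conceptually (an illustration of the principle, not Hopper's actual implementation):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in the FP8 e4m3 format

def to_fp8_dynamic(tensor: np.ndarray):
    """Conceptual dynamic scaling: choose a per-tensor scale so values fill
    FP8's range, quantize, and keep the scale to undo it later."""
    scale = FP8_E4M3_MAX / (np.abs(tensor).max() + 1e-12)
    scaled = np.clip(tensor * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    quantized = np.round(scaled * 8) / 8   # crude stand-in for FP8's coarse rounding
    return quantized, scale

def from_fp8(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized / scale

# Small-magnitude activations would lose everything in FP8 without rescaling;
# with a dynamic per-tensor scale, the reconstruction error stays small.
activations = np.random.randn(4, 4) * 0.01
q, s = to_fp8_dynamic(activations)
print(np.abs(from_fp8(q, s) - activations).max())
```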
The new pretrained models offered by NeMo LLM take advantage of an emerging method called “prompt learning.”
Prompt learning takes a large language model that has already been pre-trained and adds a few examples of the task: the types of questions expected and the types of responses they should produce. At the end of the learning cycle, the main pre-trained model doesn’t change; instead, a prompt token is produced, which provides the context.
“The next time you’re asking a question of a similar type, you provide that question along with that prompt token. And that token gives the model the context it needs to answer that question more accurately,” Kharya said.
The process is called p-tuning, and it takes advantage of the new transformer cores in the Hopper GPU. P-tuning can provide up to a five-times speedup in deploying LLMs compared to previous-generation A100 GPUs, Kharya said.
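In code, the essence of prompt learning is that the base model's weights stay frozen while a small set of "virtual token" embeddings is trained and prepended to the input. A minimal PyTorch sketch of that idea (simplified for illustration; not Nvidia's NeMo implementation):

```python
import torch
import torch.nn as nn

# Pretend this is the large pre-trained model -- its weights stay frozen.
base_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
for param in base_model.parameters():
    param.requires_grad = False  # the pre-trained model doesn't change

# The only trainable parameters: a handful of "virtual prompt" embeddings.
num_prompt_tokens, d_model = 8, 64
prompt = nn.Parameter(torch.randn(num_prompt_tokens, d_model) * 0.02)
optimizer = torch.optim.Adam([prompt], lr=1e-3)

def forward_with_prompt(token_embeddings: torch.Tensor) -> torch.Tensor:
    """Prepend the learned prompt tokens, giving the frozen model context."""
    batch = token_embeddings.shape[0]
    prompts = prompt.unsqueeze(0).expand(batch, -1, -1)
    return base_model(torch.cat([prompts, token_embeddings], dim=1))

# One toy training step: only the prompt embeddings receive gradients.
x = torch.randn(2, 16, d_model)              # stand-in for embedded input text
loss = forward_with_prompt(x).pow(2).mean()  # stand-in for a real task loss
loss.backward()
optimizer.step()
```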
The models can be trained or tuned on GPU types other than Hopper, but performance goes up with the faster bandwidth and connectivity of Hopper’s HBM3 memory and NVLink interconnect.
Nvidia said access to the NeMo LLM service will be offered directly to enterprises starting next month and won’t be available to the general public.