AWS ParallelCluster

By Mark Duffield

November 13, 2018

Orchestration software has played a key role in cluster bring-up and management for decades. Dating back to solutions like SunCluster and PSSP, and community solutions such as CFEngine, the need to launch many resources together to enable large parallel applications continues to be a vital part of the High Performance Computing (HPC) environment. AWS offers many cloud native approaches to running your clustered workloads, but recreating an environment similar or nearly identical to what you currently run in your data center may be a necessary first step in moving workloads to AWS.

What if you could build a familiar cluster environment using AWS cloud native resources?

Today we announce AWS ParallelCluster, an AWS supported, open source cluster management tool that makes it easy for scientists, researchers, and IT administrators to deploy and manage High Performance Computing (HPC) clusters in the AWS cloud. With AWS ParallelCluster, many AWS cloud native products are used to launch a cluster environment that should be familiar to those running HPC workloads. These include AWS CloudFormation, AWS Identity and Access Management (IAM), Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), Amazon Elastic Compute Cloud (Amazon EC2), Amazon EC2 Auto Scaling, Amazon Elastic Block Store (Amazon EBS), Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.

AWS ParallelCluster is released via the Python Package Index (PyPI) and can be installed via pip. It is available at no additional cost, and you only pay for the AWS resources needed to run your applications. ParallelCluster leverages CloudFormation to build out your cluster environment. This is the same CloudFormation that you can use to launch just one instance, or a VPC, or an S3 bucket, but now you’re using it to launch an entire HPC cluster environment.

Many of you will be familiar with CfnCluster. ParallelCluster started from the code base that CfnCluster was built upon, and we then extended it to include new features, functionality, and (of course) improvements and bug fixes. If you are a previous user of CfnCluster, we encourage you to start using ParallelCluster when you can, and going forward to create new clusters using only ParallelCluster. You can use your existing CfnCluster config files with ParallelCluster. (Although you can still use CfnCluster, it will no longer be developed.)

Some key features in the initial release of ParallelCluster that were not in CfnCluster are:

  • AWS Batch integration
  • Multiple EBS volumes
  • Better scaling performance: faster, with Auto Scaling updates applied all at once
  • Support for custom (“bring your own”) AMIs
  • Private cluster using proxy

And we’re not even close to done! We will continue to iterate on ParallelCluster based on customer requests and feedback.

Getting Started

Grab a cup of caffeine, and let’s get to it!

You will need:

Decision time #1. You can use ParallelCluster anywhere you can access the internet, but to launch the necessary resources for your cluster you will need either your AWS API keys or an IAM role that you set up and assign to an instance. For this post, I’ll assume you are using either a Linux or macOS operating system, you have admin access, and you have access to your API keys. Please reach out to an AWS Solutions Architect if you have questions about using an IAM role instead.

Before I install ParallelCluster, I’ll make sure I can access AWS using the AWS CLI. To install the AWS CLI you can follow the steps in Installing the AWS Command Line Interface, or to install in a Python virtual environment you can follow Install the AWS Command Line Interface in a Virtual Environment. I’ll be using a Python virtual environment for everything.

An optional first step for those wanting to use a Python virtual environment:

[duff]$ virtualenv ~/Envs/pcluster-virtenv
[duff]$ source ~/Envs/pcluster-virtenv/bin/activate
(pcluster-virtenv) [duff]$ 

Now let’s install the AWS CLI and verify functionality by creating a bucket:

(pcluster-virtenv) [duff]$ pip install --upgrade awscli
(pcluster-virtenv) [duff@]$ aws configure
AWS Access Key ID []: <aws_access_key>
AWS Secret Access Key []: <aws_secret_access_key>
Default region name []: us-east-1
Default output format []: json
(pcluster-virtenv) [duff]$ aws s3 mb s3://duff-parallelcluster
make_bucket: duff-parallelcluster 

I’ve installed, set up, and verified functionality of the AWS CLI. Let’s install ParallelCluster now.

Decision time #2. The VPC that ParallelCluster will use must have DNS Resolution = yes and DNS Hostnames = yes. It should also have DHCP options with the correct domain-name for the region, as defined in the docs: VPC DHCP Options. The subnet that will be used needs access to the internet, and there are several ways to enable this. For this blog, I will use a Public subnet (a subnet that has an IGW attached and routes to the internet), but you can use a Private subnet as long as the subnet routes to the internet (e.g. through a NAT Gateway or a proxy server).

The VPC settings can be verified in the Console by looking at the VPC configuration. You should see DNS resolution and DNS hostnames both set to yes.

Now I’ll install ParallelCluster using the virtual environment I set up:

(pcluster-virtenv) [duff]$  pip install aws-parallelcluster
... output snipped...
Successfully installed aws-parallelcluster-2.0.0rc1 ...

Before I can launch a cluster I’ll need to configure ParallelCluster. Note that I leave “AWS Access Key ID” and “AWS Secret Access Key ID” blank, as I already configured this with the AWS CLI setup. Also, because we really want to make this easy on you, we’ll display possible values from your account:

(pcluster-virtenv) [duff@]$ pcluster configure
Cluster Template [default]:
AWS Access Key ID []:  <blank>
AWS Secret Access Key ID []: <blank>
Acceptable Values for AWS Region ID:
    ap-south-1
    eu-west-3
    eu-west-2
    eu-west-1
    ap-northeast-2
    ap-northeast-1
    sa-east-1
    ca-central-1
    ap-southeast-1
    ap-southeast-2
    eu-central-1
    us-east-1
    us-east-2
    us-west-1
    us-west-2
AWS Region ID []: us-east-1
VPC Name [public]:
Acceptable Values for Key Name: <blank>
    duff_key_us-east-1
Key Name []: duff_key_us-east-1
Acceptable Values for VPC ID:
    vpc-12345678901234567
    vpc-abcdefghigjlmnopq
VPC ID []: vpc-abcdefghigjlmnopq
Acceptable Values for Master Subnet ID:
    subnet-abcdefghigjlmnop1
    subnet-abcdefghigjlmnop2
    subnet-abcdefghigjlmnop3
    subnet-abcdefghigjlmnop4
    subnet-abcdefghigjlmnop5
    subnet-abcdefghigjlmnop6
Master Subnet ID []: subnet-abcdefghigjlmnop1

Okay, let’s see what that did. It created the file ~/.parallelcluster/config; let’s cat that and have a look.

(pcluster-virtenv) [duff]$ cat ~/.parallelcluster/config
[aws]
aws_region_name = us-east-1

[cluster default]
vpc_settings = public
key_name = duff_key_us-east-1

[vpc public]
master_subnet_id = subnet-abcdefghigjlmnop1
vpc_id = vpc-abcdefghigjlmnopq

[global]
update_check = true
sanity_check = true
cluster_template = default

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
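
The [aliases] entry at the end of the file is a command template. As a rough illustration of how such a template could be filled in, here is a minimal sketch. It assumes Python str.format-style substitution, which is what the {NAME} placeholders suggest; the function name expand_alias is hypothetical and is not part of ParallelCluster.

```python
import shlex

# Template copied from the [aliases] section shown above.
SSH_ALIAS = "ssh {CFN_USER}@{MASTER_IP} {ARGS}"


def expand_alias(template, user, master_ip, extra_args=()):
    """Fill the placeholders and return the command as an argument list.

    Hypothetical helper for illustration only.
    """
    command = template.format(
        CFN_USER=user,
        MASTER_IP=master_ip,
        ARGS=" ".join(shlex.quote(arg) for arg in extra_args),
    )
    # shlex.split ignores the trailing space left when ARGS is empty.
    return shlex.split(command)


print(expand_alias(SSH_ALIAS, "ec2-user", "35.153.251.20", ["-v"]))
```

With the user and address from the cluster created later in this post, this prints ['ssh', 'ec2-user@35.153.251.20', '-v'].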

ParallelCluster uses the file ~/.parallelcluster/config by default for all configuration parameters. You can see an example configuration file at site-packages/aws-parallelcluster/examples/config in the GitHub repo. The config file has several sections (if you’re a Python programmer, we’re using ConfigParser). Each section has a set of parameters that are used to launch the cluster. If I’m not careful and accidentally put a config parameter in the wrong section, it will be silently ignored and I’ll be stuck wondering what happened. Refer to the ParallelCluster Configuration docs for more info. If a parameter is not specified in the config file, the default value is used.
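
Because a misplaced parameter is silently ignored, a quick check of your own can save some head-scratching. The sketch below uses Python's configparser to flag keys that sit in an unexpected section. The EXPECTED table is an assumption built only from the parameters that appear in this post, not the official parameter list, and find_misplaced is a hypothetical helper.

```python
import configparser

# Illustrative subset only (an assumption for this example): which
# parameter names we expect under each kind of section.
EXPECTED = {
    "aws": {"aws_region_name"},
    "global": {"update_check", "sanity_check", "cluster_template"},
    "cluster": {"vpc_settings", "key_name", "compute_instance_type", "scheduler"},
    "vpc": {"vpc_id", "master_subnet_id", "ssh_from"},
    "aliases": {"ssh"},
}

# ssh_from is deliberately placed in the wrong section here.
SAMPLE = """
[aws]
aws_region_name = us-east-1

[cluster default]
vpc_settings = public
key_name = duff_key_us-east-1
ssh_from = 11.22.33.44/32

[vpc public]
master_subnet_id = subnet-abcdefghigjlmnop1
vpc_id = vpc-abcdefghigjlmnopq
"""


def find_misplaced(config_text):
    """Return (section, key) pairs whose key is not expected in that section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    misplaced = []
    for section in parser.sections():
        kind = section.split()[0]  # "cluster default" -> "cluster"
        known = EXPECTED.get(kind)
        if known is None:
            continue  # unknown section type: nothing to compare against
        for key in parser[section]:
            if key not in known:
                misplaced.append((section, key))
    return misplaced


print(find_misplaced(SAMPLE))
```

For the sample text, the only pair reported is ('cluster default', 'ssh_from').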

Currently, ParallelCluster supports three schedulers: sge, torque, and slurm. The default is sge, and that’s what I’ll be using.

For now, the only changes I will make in the config file are to add the SSH source location ssh_from in the VPC section and to change the compute_instance_type in the cluster section.

By default, we will allow SSH inbound from any source IP (0.0.0.0/0), and I want to restrict this to just my IP address. I recommend that you do something similar by adding your IP address or trusted CIDR block (e.g. 10.10.0.0/16). I updated my [vpc public] section:

[vpc public]
master_subnet_id = subnet-abcdefghigjlmnop1
vpc_id = vpc-abcdefghigjlmnopq
ssh_from = 11.22.33.44/32
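
To see what a given ssh_from value admits, Python's ipaddress module is handy. This is a small sketch for checking CIDR values before putting them in the config; ssh_allowed is a hypothetical helper and is not how ParallelCluster enforces the rule (per the text above, the value restricts inbound SSH by source IP).

```python
import ipaddress


def ssh_allowed(ssh_from, client_ip):
    """Return True if client_ip falls inside the ssh_from CIDR block."""
    # strict=True rejects values such as 11.22.33.44/24 that have host bits set.
    network = ipaddress.ip_network(ssh_from, strict=True)
    return ipaddress.ip_address(client_ip) in network


print(ssh_allowed("11.22.33.44/32", "11.22.33.44"))  # the single allowed address
print(ssh_allowed("11.22.33.44/32", "11.22.33.45"))  # a neighbouring address
print(ssh_allowed("10.10.0.0/16", "10.10.200.7"))    # inside a trusted block
print(ipaddress.ip_network("0.0.0.0/0").num_addresses)  # size of the open default
```

The first three calls print True, False, and True; the last line prints 4294967296, which is every IPv4 address and the reason for narrowing the default.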

And I will also update the [cluster default] section, and change the compute instance type to c4.large, rather than using the default instance t2.micro:

[cluster default]
vpc_settings = public
key_name = duff_key_us-east-1
compute_instance_type = c4.large

Now that we understand a bit about the config file and we know how to add configuration parameters, let’s launch our first cluster with the create command:

(pcluster-virtenv) [duff]$ pcluster create hello-cluster1

When cluster creation starts, we’ll see a status update as the resources are being brought up. And because I’m interested to see how long it takes to launch a cluster, I’ll be using time:

(pcluster-virtenv) [duff]$ time pcluster create hello-cluster1
Beginning cluster creation for cluster: hello-cluster1
Creating stack named: parallelcluster-hello-cluster1
Status: parallelcluster-hello-cluster1 - CREATE_IN_PROGRESS

When the cluster creation has completed, I have both the public and private IP addresses and the username for login. And because I used time, I see that it took 8 minutes and 33 seconds to create the cluster:

MasterPublicIP: 35.153.251.20
ClusterUser: ec2-user
MasterPrivateIP: 172.31.0.14

real	8m33.425s
user	0m2.620s
sys	    0m0.353s

Let’s log in with the built-in ssh alias that ParallelCluster provides, pcluster ssh <cluster_name>, and see what cluster resources are already available.

(pcluster-virtenv) [duff@]$ pcluster list
hello-cluster1

(pcluster-virtenv) [duff@]$ pcluster ssh hello-cluster1
The authenticity of host '35.153.251.20 (35.153.251.20)' can't be established.
ECDSA key fingerprint is SHA256:u9+A0i6Y94JcRGYW8eyi5e4N+iiNtpPTPAwPY5PQcWk.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '35.153.251.20' (ECDSA) to the list of known hosts.
Last login: Sun Nov 11 20:12:12 2018

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2018.03-release-notes/

[ec2-user@ip-172-31-0-14 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-172-31-10-95         lx-amd64        2    1    1    2  0.02    3.7G  156.2M     0.0     0.0
ip-172-31-13-199        lx-amd64        2    1    1    2  0.02    3.7G  156.8M     0.0     0.0

From the output above, you can see that I already have a cluster of instances running. By default, we’re going to use t2.micro for the compute instance type, but I configured this cluster to use the c4.large, and because hyper-threading is on, we see two CPUs and one core for each instance.

Let’s submit a simple hostname job using the mpirun command that will show the Auto Scaling feature of ParallelCluster.

[ec2-user@ip-172-31-0-14 ~]$ echo /usr/lib64/openmpi/bin/mpirun hostname | qsub -pe mpi 16
Your job 1 ("STDIN") has been submitted
[ec2-user@ip-172-31-0-14 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.00000 STDIN      ec2-user     qw    11/11/2018 20:25:38                                   16

Now I have a job requesting more instances than I have, which kicks off a scaling action. When I have enough instances (in this case I’ll need 8 total), the job will run. A few minutes later, I have the resources and the job has already run to completion:

[ec2-user@ip-172-31-0-14 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-172-31-0-72          lx-amd64        2    1    1    2  0.11    3.7G  189.0M     0.0     0.0
ip-172-31-10-65         lx-amd64        2    1    1    2  0.29    3.7G  189.2M     0.0     0.0
ip-172-31-14-49         lx-amd64        2    1    1    2  0.11    3.7G  189.1M     0.0     0.0
ip-172-31-2-78          lx-amd64        2    1    1    2  0.06    3.7G  189.4M     0.0     0.0
ip-172-31-3-226         lx-amd64        2    1    1    2  0.11    3.7G  185.5M     0.0     0.0
ip-172-31-4-248         lx-amd64        2    1    1    2  0.11    3.7G  186.2M     0.0     0.0
ip-172-31-5-112         lx-amd64        2    1    1    2  0.08    3.7G  188.9M     0.0     0.0
ip-172-31-5-50          lx-amd64        2    1    1    2  0.08    3.7G  189.0M     0.0     0.0
[ec2-user@ip-172-31-0-14 ~]$ qstat
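
The arithmetic behind the 8 instances can be sketched in a few lines of Python. The example parses the first qhost listing from earlier in this post and assumes one scheduler slot per CPU reported in the NCPU column; that assumption and both function names are illustrative, not taken from ParallelCluster itself.

```python
import math

# The qhost output captured before the job was submitted.
QHOST_BEFORE = """\
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-172-31-10-95         lx-amd64        2    1    1    2  0.02    3.7G  156.2M     0.0     0.0
ip-172-31-13-199        lx-amd64        2    1    1    2  0.02    3.7G  156.8M     0.0     0.0
"""


def cpus_per_host(qhost_text):
    """Map hostname -> NCPU, skipping the header, rule, and 'global' rows."""
    hosts = {}
    for line in qhost_text.splitlines()[2:]:
        fields = line.split()
        # The 'global' row has '-' in the NCPU column, so isdigit() filters it out.
        if len(fields) >= 3 and fields[2].isdigit():
            hosts[fields[0]] = int(fields[2])
    return hosts


def instances_needed(requested_slots, slots_per_instance):
    """Smallest number of instances that provides the requested slots."""
    return math.ceil(requested_slots / slots_per_instance)


hosts = cpus_per_host(QHOST_BEFORE)
available = sum(hosts.values())       # assumption: one slot per CPU
per_instance = max(hosts.values())
print(available)
print(instances_needed(16, per_instance))
```

This prints 4 and then 8: the two initial instances offer 4 slots, and a request for 16 slots at 2 per instance needs 8 instances.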

Now that the job has run and I have these instances just sitting there doing nothing, what happens now? If the instances have been running for more than 10 minutes, but are not running a job, we will terminate those instances for you. So after 10 minutes I look at qhost again:

[ec2-user@ip-172-31-0-14 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
[ec2-user@ip-172-31-0-14 ~]$ qstat

The instances have been terminated, and I’m not being charged for idle instances. The scaling features are configurable.
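
The scale-down rule described above can be modelled in a few lines. This is a simplified sketch of the rule as stated in this post (running for more than 10 minutes and not running a job), not ParallelCluster's actual implementation; the Node class, the function name, and the uptime and job values are all made up for illustration.

```python
from dataclasses import dataclass

IDLE_LIMIT_MINUTES = 10  # the threshold described in the text


@dataclass
class Node:
    name: str
    uptime_minutes: float
    running_jobs: int


def nodes_to_terminate(nodes, idle_limit=IDLE_LIMIT_MINUTES):
    """Pick nodes that are past the limit and have no job running."""
    return [
        node.name
        for node in nodes
        if node.running_jobs == 0 and node.uptime_minutes > idle_limit
    ]


# Illustrative values only: one idle node, one busy node, one young node.
fleet = [
    Node("ip-172-31-0-72", uptime_minutes=14, running_jobs=0),
    Node("ip-172-31-10-65", uptime_minutes=14, running_jobs=1),
    Node("ip-172-31-14-49", uptime_minutes=6, running_jobs=0),
]
print(nodes_to_terminate(fleet))
```

Only the first node is selected: the second is still running a job, and the third has not yet passed the 10-minute mark.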

Okay. I have launched what looks and acts like a traditional HPC environment using many AWS Cloud native resources, including an Auto Scaling cluster that terminates instances when they are not being used. What about using an environment without the scheduler overhead?

Say hello to AWS Batch.

AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.

With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

So now I’ll launch a Batch environment and let ParallelCluster do all of the work for me. When launching an AWS Batch environment, we’ll leverage even more AWS resources. For example, AWS CodeBuild and Amazon Elastic Container Registry (Amazon ECR) will be used, and an NFS server will be brought up on the master instance.

I’ll start by editing my config file, ~/.parallelcluster/config, and adding this section, which reuses some of the parameters from the [cluster default] section.

[cluster awsbatch]
scheduler = awsbatch
key_name = duff_key_us-east-1
vpc_settings = public
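
The sections in the config file point at one another by name: cluster_template = default selects [cluster default], and vpc_settings = public selects [vpc public]. The sketch below follows those references with Python's configparser to show which settings a template ends up with. It mirrors the layout of the config in this post; resolve_template is a hypothetical helper, not a ParallelCluster function.

```python
import configparser

# The config from this post, including the new awsbatch template.
CONFIG_TEXT = """
[global]
cluster_template = default

[cluster default]
vpc_settings = public
key_name = duff_key_us-east-1
compute_instance_type = c4.large

[cluster awsbatch]
scheduler = awsbatch
key_name = duff_key_us-east-1
vpc_settings = public

[vpc public]
master_subnet_id = subnet-abcdefghigjlmnop1
vpc_id = vpc-abcdefghigjlmnopq
ssh_from = 11.22.33.44/32
"""


def resolve_template(config_text, template=None):
    """Merge a [cluster NAME] section with the [vpc NAME] section it points to."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if template is None:
        # Fall back to the template named in [global].
        template = parser["global"]["cluster_template"]
    cluster = dict(parser["cluster " + template])
    vpc = dict(parser["vpc " + cluster["vpc_settings"]])
    return {**cluster, **vpc}


settings = resolve_template(CONFIG_TEXT, "awsbatch")
print(settings["scheduler"], settings["vpc_id"])
```

Both templates share the same [vpc public] section, so the awsbatch cluster lands in the same VPC and subnet as the first cluster.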

Now that I have a separate cluster template defined, I can launch a separate master instance that will be both the NFS server for my Batch jobs, and will also be the submit host for my batch jobs. I’ll create a cluster now, specifying my awsbatch cluster.

(pcluster-virtenv) [duff@]$ pcluster create awsbatch --cluster-template awsbatch
Beginning cluster creation for cluster: awsbatch
Creating stack named: parallelcluster-awsbatch
Status: parallelcluster-awsbatch - CREATE_COMPLETE
MasterPublicIP: 54.158.75.19
ClusterUser: ec2-user
MasterPrivateIP: 172.31.15.217
ResourcesS3Bucket: parallelcluster-awsbatch-6wjsibr8elx9km0r

From the output above, you can see I’ve successfully created an AWS Batch submit host. I’ll log in and see what’s there:

(pcluster-virtenv) [duff@]$ pcluster ssh awsbatch
The authenticity of host '54.158.75.19 (54.158.75.19)' can't be established.
ECDSA key fingerprint is SHA256:/K8LQYyLliS0+Q7+BZtkhe6ChyM9Oz/RZz0aTCKJ3KQ.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.158.75.19' (ECDSA) to the list of known hosts.
Last login: Tue Nov 13 00:46:30 2018

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2018.03-release-notes/
[ec2-user@ip-172-31-15-217 ~]$ awsbhosts
ec2InstanceId        instanceType    privateIpAddress    publicIpAddress      runningJobs
-------------------  --------------  ------------------  -----------------  -------------
i-05af380e4950366d4  c4.xlarge       172.31.4.66         18.209.11.53                   0

I see that I have a c4.xlarge instance ready to run jobs. I’ll test with hello world.

[ec2-user@ip-172-31-15-217 ~]$ awsbsub echo hello world
Job 2387b7f5-14c7-41c1-bbf8-c5e50017580a (echo) has been submitted.

The job is submitted, and should go from RUNNABLE to STARTING to RUNNING, and then to either SUCCEEDED or FAILED.

[ec2-user@ip-172-31-15-217 ~]$ awsbstat
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       RUNNABLE  -            -            -
[ec2-user@ip-172-31-15-217 ~]$ set -o vi
[ec2-user@ip-172-31-15-217 ~]$ awsbstat
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       STARTING  -            -            -

[ec2-user@ip-172-31-15-217 ~]$ awsbstat
jobId                                 jobName    status    startedAt            stoppedAt    exitCode
------------------------------------  ---------  --------  -------------------  -----------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       RUNNING   2018-11-13 00:52:31  -            -

Now I see that my job is running, and I can also check with the awsbout command:

[ec2-user@ip-172-31-15-217 ~]$ awsbout 2387b7f5-14c7-41c1-bbf8-c5e50017580a
2018-11-13 00:52:31: Starting Job 2387b7f5-14c7-41c1-bbf8-c5e50017580a
2018-11-13 00:52:31: hello world

After my job has completed, I can check the status with the awsbstat command:

[ec2-user@ip-172-31-15-217 ~]$ awsbstat -s SUCCEEDED
jobId                                 jobName    status     startedAt            stoppedAt              exitCode
------------------------------------  ---------  ---------  -------------------  -------------------  ----------
2387b7f5-14c7-41c1-bbf8-c5e50017580a  echo       SUCCEEDED  2018-11-13 00:52:31  2018-11-13 00:53:02           0
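
The status progression seen in the awsbstat output can be written down as a small transition table, which is useful if you script around job status. This sketch is simplified to the path described in this post; AWS Batch also has earlier SUBMITTED and PENDING states, and a job can fail before it reaches RUNNING, so treat the table as an assumption rather than the full state machine.

```python
# Simplified transition table covering only the statuses discussed in this post.
TRANSITIONS = {
    "RUNNABLE": {"STARTING"},
    "STARTING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED"},
    "SUCCEEDED": set(),
    "FAILED": set(),
}


def is_valid_path(statuses):
    """Check that each consecutive pair of statuses is an allowed transition."""
    return all(
        later in TRANSITIONS.get(earlier, set())
        for earlier, later in zip(statuses, statuses[1:])
    )


def is_terminal(status):
    """A job in a terminal status will not change state again."""
    return status in TRANSITIONS and not TRANSITIONS[status]


observed = ["RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED"]
print(is_valid_path(observed))
print(is_terminal("RUNNING"), is_terminal("SUCCEEDED"))
```

A polling script could stop checking a job once is_terminal returns True for its status.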

With AWS ParallelCluster you can leverage the benefits of the AWS Cloud while maintaining a familiar cluster environment. We’re excited about ParallelCluster and we look forward to hearing from you!

Cheers,
Mark
