NSF Project Sets Up First Machine Learning Cyberinfrastructure – CHASE-CI

By John Russell

July 25, 2017

Earlier this month, the National Science Foundation issued a $1 million grant to Larry Smarr, director of Calit2, and a group of his colleagues to create a community infrastructure in support of machine learning research. The ambitious plan – Cognitive Hardware and Software Ecosystem, Community Infrastructure (CHASE-CI) – is intended to leverage the high-speed Pacific Research Platform (PRP) and put fast GPU appliances into the hands of researchers to tackle machine learning hardware, software, and architecture issues.

Given the abrupt rise of machine learning and its distinct needs versus traditional FLOPS-dominated HPC, the CHASE-CI effort seems a natural next step in learning how to harness PRP’s high bandwidth for use with big data projects and machine learning. Perhaps not coincidentally Smarr is also principal investigator for PRP. As described in the NSF abstract, CHASE-CI “will build a cloud of hundreds of affordable Graphics Processing Units (GPUs), networked together with a variety of neural network machines to facilitate development of next generation cognitive computing.”

Those are big goals. Last week, Smarr and co-PI Thomas DeFanti spoke with HPCwire about the CHASE-CI project. It has many facets. Hardware, including von Neumann (vN) and non von Neumann (NvN) architectures, software frameworks (e.g., Caffe and TensorFlow), six specific algorithm families (details near the end of the article), and cost containment are all key target areas. In building out PRP, the effort leveraged existing optical networks such as GLIF by building termination devices based on PCs and providing them to research scientists. The new device — dubbed FIONA (Flexible I/O Network Appliances) – was developed by PRP co-PI Philip Papadopoulos and is critical to the new CHASE-CI effort. A little background on PRP may be helpful.

Larry Smarr, director, Calit2

As explained by Smarr, the basic PRP idea was to experiment with a cyberinfrastructure that was appropriate for a broad set of applications using big data that aren’t appropriate for the commodity internet because of the size of the of the datasets. To handle the high speed bandwidth, you need a big bucket at the end of the fiber notes Smarr. FIONAs filled the bill; the devices are stuffed with high performance, high capacity SSDs and high speed NICs but based on the humble and less expensive PC.

“They could take the high data rate without TCP backing up and thereby lowering the overall bandwidth, which traditionally has been a problem if you try to go directly to spinning disk,” says Smarr. Currently, there are on the order of 40 or 50 of these FIONAs deployed across the West Coast. Although 100 gigabit throughput is possible via the fiber, most researchers are getting 10 gigabit, still a big improvement.

DOE tests the PRP performance regularly using the visualization tool MaDDash (Monitoring and Debugging Dashboard). “There are test transfers of 10 gigabytes of data, four times a day, among 25 organizations, so that’s roughly about 300 transfers four times a day. The reason why we picked that number, 10 gigabytes, was because that’s the amount of data you need to get TCP up to full speed,” says Smarr.

Thomas DeFanti, co-PI, CHASE-CI

Networks are currently testing out at 5, 6, 7, 8 and 9 gigabits per second, which is nearly full utilization. “Some of them really nail it at 9.9 gigabits per second. If you go to 40 gigabit networks that we have, we are getting 13 and 14 gigabits per second and that’s because of the [constrained] software we are using. If we go to a different software, which is not what scientists routinely use [except] the high energy physics people, then we can get 30 or 40 or 100 gigabits per second – that’s where we max out with the PC architecture and the disk drives on those high end units,” explains DeFanti.

The PRP has proven to be very successful, say Smarr and DeFanti. PRP v1, basically the network of FIONAs, is complete. PRP v2 is in the works. The latter is intended to investigate advanced software concepts such as software defined networking and security and not intended to replace PRP v1. Now, Smarr wants to soup up FIONAs with FPGAs, hook them into the PRP, and tackle machine learning. And certainly hardware is just a portion of the machine learning challenge being addressed.

Data showing increase in PRP performance over time.

Like PRP before it, CHASE-CI is a response to an NSF call for community computer science infrastructure. Unlike PRP, which is focused on applications (e.g. geoscience, bioscience) and whose architecture was largely defined by guidance from domain scientists, CHASE-CI is being driven by needs of computer scientists trying to support big data and machine learning.

The full principal investigator team is an experienced and accomplished group including: Smarr, (Principal Investigator), Calit2; Tajana Rosing (Co-Principal Investigator), Professor, CSE Department, UCSD; Ilkay Altintas (Co-Principal Investigator), Chief Data Science Officer, San Diego Supercomputer Center; DeFanti (Co-Principal Investigator), Full Research Scientist at the UCSD Qualcomm Institute, a division of Calit2;  and Kenneth Kreutz-Delgado (Co-Principal Investigator), Professor, ECE Department, UCSD.

“What they didn’t ask for [in the PRP grant] was what computer scientists need to support big data and machine learning. So we went back to the campuses and found the computers scientists, faculty and staff that were working on machine learning and ended up with 30 of them that wrote up their research to put into this proposal,” says Smarr. “We asked what was bottlenecking the work and [they responded] it was a lack of access to GPUs to do the compute intensive aspects of machine learning like training data sets on big neural nets.”

Zeroing in on GPUs, particularly GPUs that emphasize lower precision, is perhaps predictable.

“[In traditional] HPC you need 64-bit and error correction and all of that kind of stuff which is provided very nicely by Nvidia’s Tesla line, for instance, but actually because of the noise that is inherent in the data in most machine learning applications it turns out that single precision 32-bit is just fine and that’s much less expensive than the double precision,” says Smarr. For this reason, the project is focusing on less expensive “gaming GPUs” which fit fine into the slots on the FIONAs since they are PCs.

The NSF proposal first called for putting ten GPUs into each FIONA. “But we decided eight is probably optimal, eight of these front line game GPUs and we are deploying 32 of those FIONAs in this new grant across PRP to these researchers, and because they are all connected at 10 gigabits/s we can essentially treat them as a cloud,” says Smarr. There are ten campuses initially participating: UC San Diego, UC Berkeley, UC Irvine, UC Riverside, UC Santa Cruz, UC Merced, Sand Diego State University, Caltech, Stanford, and Montana State University. [A brief summary of researchers and their intended focus by campus is at the end of the article (taken from the grant proposal).]

As shown in the cost comparison below, the premium for high end GPUs such as the Nvidia P100 is dramatic. The CHASE-CI plan is to stick with commodity gaming GPUs, like the Nvidia 1080, since they are used in large volumes which keeps the prices down and the improvements coming. Nevertheless Smarr and DeFanti emphasize they are vendor agnostic and that other vendors have expressed interest in the program.

“Every year Nvidia comes out with a new set of devices and then halfway way through the year they come out with an accelerated version so in some sense you are on a six month cycle. The game cards are around $600 and every year the [cost performance] gets better,” says DeFanti. “We buy what’s available, build, test, and benchmark and then we wait for the next round. [Notably], people in the community do have different needs – some need more memory, some would rather have twice as many $250 GPUs because they are really fast and just have less memory. So it is really kind of custom and there’s some negotiation with users, but they all fit in the same chassis.”

DeFanti argues the practice of simulating networks on CPUs has slowed machine learning’s growth. “Machine learning involves looking at gigabytes of data, millions of pictures, and basically doing that is a brute force calculation that works fine in 32 bit, nobody uses 64 bit. You chew on these things literally for a week even on a GPU, which is much faster than a CPU for these kind of these things. That’s a reason why this field was sort of sluggish just using simulators; it took too much time on desktop CPUs. The first phase of this revolution is getting everybody off simulators.”

That said, getting high performance GPUs into the hands of researchers and students is only half of the machine learning story. Training is hard and compute-intensive and can take weeks-to-months depending upon the problem. But once a network is trained, the computer power required for the inference engine is considerably less. Power consumption becomes the challenge particularly because these trained networks are expected to be widely deployed on mobile platforms.

Here, CHASE-CI is examining a wide range of device types and architectures. Calit2, for example, has been working with IBM’s neuromorphic True North chip for a couple of years. It also had a strong role in helping KnuEdge develop its DSP-based neural net chip. (KnuEdge, of course, was founded by former NASA administrator Daniel Goldin.) FPGAs also show promise.

Says Smarr, “They have got to be very energy efficient. You have this whole new generation of largely non von Neumann architectures that are capable of executing these machine learning algorithms on say streams of video data, radar data, LIDAR data, things like that, that make decisions in real time like approval on credit cards. We are building up a set of these different architectures – von Neumann and non von Neumann – and making those available to these 30 machine learning experts.”

CHASE-CI is also digging into the needed software ecosystem to support machine learning. The grant states “representative algorithms from each of the following six families will be selected to form a standardized core suite of ‘component algorithms’ that have been tuned for optimal performance on the component coprocessors.” Here they are:

  • Deep Neural Network (DNN) and Recurrent Neural Network (RNN) algorithms, including layered networks having fully-connected and convolutional layers (CNNs), variational autoencoders (VAEs), and generalized adversarial networks (GANs). Training will be done using, modern approaches to backpropagation, stochastic sampling, bootstrap sampling, and restricted (and unrestricted) Boltzmann ML. NNs provide powerful classification and detection performance and can automatically extract a hierarchy of features.
  • Reinforcement Learning (RL) algorithms and related approximate Markov decision process (MDP) algorithms. RL and inverse-RL algorithms have critical applications in areas of dynamic decision-making, robotics and human/robotic transfer learning.
  • Variational Autoencoder (VAE) and Markov Chain Monte Carlo (MCMC) stochastic sampling algorithms supporting the training of generative models and metrics for evaluating the quality of generative model algorithms. Stochastic sampling algorithms are appropriate for training generative models on analog and digital spiking neurons. Novel metrics are appropriate.
  • Support Vector Machine (SVM) SVMs can perform an inner product and thresholding in a high-dimensional feature space via use of the “kernel trick”.
  • Sparse Signal Processing (SSP) algorithms for sparse signal processing and compressive sensing, including Sparse Baysian Learning (SBL). Algorithms which exploit source sparsity, possibly using learned over-complete dictionaries, are very important in domains such as medical image processing and brain-computing interfacing.
  • Latent Variable (LVA) Algorithms for source separation algorithms, such as PCA, ICA, and IVA. LV models typically assume that a solution exists in some latent variable sparse within which the components are statistically independent. This class of algorithms includes factor analysis (FA) and non-negative matrix factorization (NMF) algorithms.

Despite such ambitious hardware and software goals, Smarr suggests early results from CHASE-CI should be available sometime in the fall. Don’t get the wrong idea, he cautions. CHASE-CI is a research project for computer science not a production platform.

“We are not trying to be a big production site or anything else. But we are trying to really explore, as this field develops, not just the hardware platforms we’ve talked about but the software and algorithm issues. There’s a whole bunch of different modes of machine learning and statistical analysis that we are trying to match between the algorithms and the architectures, both von Neumann and non von Neumann.

“For 30 years I have been sort of collecting architectures and mapping a wide swath of algorithms on them to port applications. Here we are doing it again but now for machine learning.”

Sample List of CHASE-CI Researchers and Area of Work*

  • UC San Diego: Ravi Ramamoorthi (ML for processing light field imagery), Manmohan Chandraker (3D scene reconstruction), Arun Kumar (deep learning for database systems), Rajesh Gupta (accelerator-centric SOCs), Gary Cottrell (comp. cognitive neuroscience & computer vision), Nuno Vasconcelos (computer vision & ML), Todd Hylton (contextual robotics), Jurgen Schulze (VR), Ken Kreutz-Delgado (ML and NvN), Larry Smarr (ML and microbiome), Tajana Rosing (energy efficiency of running MLs on NvNs), Falko Kuester (ML on NvNs in drones).
  • UC Berkeley: James Demmel (CA algorithms), Trevor Darrell (ML libraries)
  • UC Irvine: Padhraic Smyth (ML for biomedicine and climate science), Jeffrey Krichmar 
(computational neuroscience), Nikkil Dutt (FPGAs), Anima Anandkumar (ML)
  • UC Riverside: Walid Najjar (FPGAs), Amit Roy-Chowdhury (image and video analysis)
  • UC Santa Cruz: Dimitris Achlioptas (linear layers and random bipartite graphs), Lise Getoor (large-scale graph processing), Ramakrishna Akella (multi-modal prediction and retrieval), Shawfeng Dong (Bayesian deep learning and CNN models in astronomy)
  • UC Merced: Yang Quan Chen (agricultural drones)
  • San Diego State: Baris Aksanli (adaptive learning for historical data)
  • Caltech: Yisong Yue (scalable deep learning methods for complex prediction settings)
  • Stanford: Anshul Kundaje (ML & genetics), Ron Dror (structure-based drug design using ML)
  • Montana State: John Sheppard (ML and probablistic methods to solve large systems problems)

*  Excerpted from the CHASE-CI grant proposal

Images courtesy of Larry Smarr, Calit2

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Simulating Car Crashes with Supercomputers – and Lego

October 18, 2019

It’s an experiment many of us have carried out at home: crashing two Lego creations into each other, bricks flying everywhere. But for the researchers at the General German Automobile Club (ADAC) – which is comparabl Read more…

By Oliver Peckham

NASA Uses Deep Learning to Monitor Solar Weather

October 17, 2019

Solar flares may be best-known as sci-fi MacGuffins, but those flares – and other space weather – can have serious impacts on not only spacecraft and satellites, but also on Earth-based systems such as radio communic Read more…

By Oliver Peckham

Federated Learning Applied to Cancer Research

October 17, 2019

The ability to share and analyze data while protecting patient privacy is giving medical researchers a new tool in their efforts to use what one vendor calls “federated learning” to train models based on diverse data Read more…

By George Leopold

Using AI to Solve One of the Most Prevailing Problems in CFD

October 17, 2019

How can artificial intelligence (AI) and high-performance computing (HPC) solve mesh generation, one of the most commonly referenced problems in computational engineering? A new study has set out to answer this question and create an industry-first AI-mesh application... Read more…

By James Sharpe

NSB 2020 S&E Indicators Dig into Workforce and Education

October 16, 2019

Every two years the National Science Board is required by Congress to issue a report on the state of science and engineering in the U.S. This year, in a departure from past practice, the NSB has divided the 2020 S&E Read more…

By John Russell

AWS Solution Channel

Making High Performance Computing Affordable and Accessible for Small and Medium Businesses with HPC on AWS

High performance computing (HPC) brings a powerful set of tools to a broad range of industries, helping to drive innovation and boost revenue in finance, genomics, oil and gas extraction, and other fields. Read more…

HPE Extreme Performance Solutions

Intel FPGAs: More Than Just an Accelerator Card

FPGA (Field Programmable Gate Array) acceleration cards are not new, as they’ve been commercially available since 1984. Typically, the emphasis around FPGAs has centered on the fact that they’re programmable accelerators, and that they can truly offer workload specific hardware acceleration solutions without requiring custom silicon. Read more…

IBM Accelerated Insights

How Do We Power the New Industrial Revolution?

[Attend the IBM LSF, HPC & AI User Group Meeting at SC19 in Denver on November 19!]

Almost everyone is talking about artificial intelligence (AI). Read more…

What’s New in HPC Research: Rabies, Smog, Robots & More

October 14, 2019

In this bimonthly feature, HPCwire highlights newly published research in the high-performance computing community and related domains. From parallel programming to exascale to quantum computing, the details are here. Read more…

By Oliver Peckham

Using AI to Solve One of the Most Prevailing Problems in CFD

October 17, 2019

How can artificial intelligence (AI) and high-performance computing (HPC) solve mesh generation, one of the most commonly referenced problems in computational engineering? A new study has set out to answer this question and create an industry-first AI-mesh application... Read more…

By James Sharpe

NSB 2020 S&E Indicators Dig into Workforce and Education

October 16, 2019

Every two years the National Science Board is required by Congress to issue a report on the state of science and engineering in the U.S. This year, in a departu Read more…

By John Russell

Crystal Ball Gazing: IBM’s Vision for the Future of Computing

October 14, 2019

Dario Gil, IBM’s relatively new director of research, painted a intriguing portrait of the future of computing along with a rough idea of how IBM thinks we’ Read more…

By John Russell

Summit Simulates Braking – on Mars

October 14, 2019

NASA is planning to send humans to Mars by the 2030s – and landing on the surface will be considerably trickier than landing a rover like Curiosity. To solve Read more…

By Staff report

Trovares Drives Memory-Driven, Property Graph Analytics Strategy with HPE

October 10, 2019

Trovares, a high performance property graph analytics company, has partnered with HPE and its Superdome Flex memory-driven servers on a cybersecurity capability the companies say “routinely” runs near-time workloads on 24TB-capacity systems... Read more…

By Doug Black

Intel, Lenovo Join Forces on HPC Cluster for Flatiron

October 9, 2019

An HPC cluster with deep learning techniques will be used to process petabytes of scientific data as part of workload-intensive projects spanning astrophysics to genomics. AI partners Intel and Lenovo said they are providing... Read more…

By George Leopold

Optimizing Offshore Wind Farms with Supercomputer Simulations

October 9, 2019

Offshore wind farms offer a number of benefits; many of the areas with the strongest winds are located offshore, and siting wind farms offshore ameliorates many of the land use concerns associated with onshore wind farms. Some estimates say that, if leveraged, offshore wind power... Read more…

By Oliver Peckham

Harvard Deploys Cannon, New Lenovo Water-Cooled HPC Cluster

October 9, 2019

Harvard's Faculty of Arts & Sciences Research Computing (FASRC) center announced a refresh of their primary HPC resource. The new cluster, called Cannon after the pioneering American astronomer Annie Jump Cannon, is supplied by Lenovo... Read more…

By Tiffany Trader

Supercomputer-Powered AI Tackles a Key Fusion Energy Challenge

August 7, 2019

Fusion energy is the Holy Grail of the energy world: low-radioactivity, low-waste, zero-carbon, high-output nuclear power that can run on hydrogen or lithium. T Read more…

By Oliver Peckham

DARPA Looks to Propel Parallelism

September 4, 2019

As Moore’s law runs out of steam, new programming approaches are being pursued with the goal of greater hardware performance with less coding. The Defense Advanced Projects Research Agency is launching a new programming effort aimed at leveraging the benefits of massive distributed parallelism with less sweat. Read more…

By George Leopold

Cray Wins NNSA-Livermore ‘El Capitan’ Exascale Contract

August 13, 2019

Cray has won the bid to build the first exascale supercomputer for the National Nuclear Security Administration (NNSA) and Lawrence Livermore National Laborator Read more…

By Tiffany Trader

AMD Launches Epyc Rome, First 7nm CPU

August 8, 2019

From a gala event at the Palace of Fine Arts in San Francisco yesterday (Aug. 7), AMD launched its second-generation Epyc Rome x86 chips, based on its 7nm proce Read more…

By Tiffany Trader

Ayar Labs to Demo Photonics Chiplet in FPGA Package at Hot Chips

August 19, 2019

Silicon startup Ayar Labs continues to gain momentum with its DARPA-backed optical chiplet technology that puts advanced electronics and optics on the same chip Read more…

By Tiffany Trader

Using AI to Solve One of the Most Prevailing Problems in CFD

October 17, 2019

How can artificial intelligence (AI) and high-performance computing (HPC) solve mesh generation, one of the most commonly referenced problems in computational engineering? A new study has set out to answer this question and create an industry-first AI-mesh application... Read more…

By James Sharpe

D-Wave’s Path to 5000 Qubits; Google’s Quantum Supremacy Claim

September 24, 2019

On the heels of IBM’s quantum news last week come two more quantum items. D-Wave Systems today announced the name of its forthcoming 5000-qubit system, Advantage (yes the name choice isn’t serendipity), at its user conference being held this week in Newport, RI. Read more…

By John Russell

Chinese Company Sugon Placed on US ‘Entity List’ After Strong Showing at International Supercomputing Conference

June 26, 2019

After more than a decade of advancing its supercomputing prowess, operating the world’s most powerful supercomputer from June 2013 to June 2018, China is keep Read more…

By Tiffany Trader

Leading Solution Providers

ISC 2019 Virtual Booth Video Tour

CRAY
CRAY
DDN
DDN
DELL EMC
DELL EMC
GOOGLE
GOOGLE
ONE STOP SYSTEMS
ONE STOP SYSTEMS
PANASAS
PANASAS
VERNE GLOBAL
VERNE GLOBAL

A Behind-the-Scenes Look at the Hardware That Powered the Black Hole Image

June 24, 2019

Two months ago, the first-ever image of a black hole took the internet by storm. A team of scientists took years to produce and verify the striking image – an Read more…

By Oliver Peckham

Intel Confirms Retreat on Omni-Path

August 1, 2019

Intel Corp.’s plans to make a big splash in the network fabric market for linking HPC and other workloads has apparently belly-flopped. The chipmaker confirmed to us the outlines of an earlier report by the website CRN that it has jettisoned plans for a second-generation version of its Omni-Path interconnect... Read more…

By Staff report

Crystal Ball Gazing: IBM’s Vision for the Future of Computing

October 14, 2019

Dario Gil, IBM’s relatively new director of research, painted a intriguing portrait of the future of computing along with a rough idea of how IBM thinks we’ Read more…

By John Russell

Kubernetes, Containers and HPC

September 19, 2019

Software containers and Kubernetes are important tools for building, deploying, running and managing modern enterprise applications at scale and delivering enterprise software faster and more reliably to the end user — while using resources more efficiently and reducing costs. Read more…

By Daniel Gruber, Burak Yenier and Wolfgang Gentzsch, UberCloud

Intel Debuts Pohoiki Beach, Its 8M Neuron Neuromorphic Development System

July 17, 2019

Neuromorphic computing has received less fanfare of late than quantum computing whose mystery has captured public attention and which seems to have generated mo Read more…

By John Russell

Rise of NIH’s Biowulf Mirrors the Rise of Computational Biology

July 29, 2019

The story of NIH’s supercomputer Biowulf is fascinating, important, and in many ways representative of the transformation of life sciences and biomedical res Read more…

By John Russell

Quantum Bits: Neven’s Law (Who Asked for That), D-Wave’s Steady Push, IBM’s Li-O2- Simulation

July 3, 2019

Quantum computing’s (QC) many-faceted R&D train keeps slogging ahead and recently Japan is taking a leading role. Yesterday D-Wave Systems announced it ha Read more…

By John Russell

With the Help of HPC, Astronomers Prepare to Deflect a Real Asteroid

September 26, 2019

For years, NASA has been running simulations of asteroid impacts to understand the risks (and likelihoods) of asteroids colliding with Earth. Now, NASA and the European Space Agency (ESA) are preparing for the next, crucial step in planetary defense against asteroid impacts: physically deflecting a real asteroid. Read more…

By Oliver Peckham

  • arrow
  • Click Here for More Headlines
  • arrow
Do NOT follow this link or you will be banned from the site!
Share This