Machine learning (ML) has impacted nearly every aspect of our daily lives, from online customer support to search engine result filtering. Because of this, ML has moved so far into the mainstream of society that it is now often regarded simply as artificial intelligence (AI), even though this oversimplifies the complex nature of ML. Properly supported, advanced ML projects can drive some of tomorrow’s most transformative technologies, such as self-driving cars, big data analytics, voice and facial recognition, and augmented reality. However, as this technology progresses, along with the underlying hardware and software tools that enable it, there is a growing expectation that “better” clusters are not just more capable, but also faster.
Supporting Machine Learning and Deep Learning Workloads
As ML and deep learning (DL) models continue to grow in both scale and complexity, they demand solutions with extensive computing power, high-speed and high-capacity storage, and low-latency, high-bandwidth interconnects. Modern AI hardware technologies can provide plenty of performance. However, these systems require a large investment of time, expertise, and funding.
That’s why organizations partner with expert solution designers and consultants with years of experience deploying reliable, high-performance AI environments. These solutions can require investments in the millions but aren’t always ‘ready-to-run’ when they arrive on site.
To train and deploy these different AI models, users rely on various AI frameworks and development tools that support specific types of AI. Sourcing and integrating these software tools is an extra step in the procurement process that takes time, resources, and expertise to execute properly.
That’s why top AI solution providers take the extra step to include pre-configured, pre-tested software stacks, such as the Silicon Mechanics AI Stack or Silicon Mechanics Scientific Computing Stack. End users save time by avoiding the effort required to set up their own stack. There is also the additional value of our engineering team being intimately familiar with the applications required for different workloads. The more we know about what you’ll be doing, the more we can optimize the cluster’s design to support your particular situation.
The Equivalent of a LAMP Stack
The open-source LAMP stack (Linux, Apache, MySQL, PHP) has had a huge impact on the growth of software development, which in turn has led to some amazing AI applications and use cases.
The benefits of a pre-installed, pre-tested software stack mentioned above are potentially so strong that these cluster software stacks may become as ubiquitous as the LAMP stack has become in software development. The major difference is that LAMP is well defined while AI and big data stacks are still emerging, as more organizations get involved in these sectors and as adoption of different.
Today, each engineering team looks at what types of clients and partners it has and then determines what sort of software it can effectively source, test, and integrate. In our case, the team here at Silicon Mechanics created this stack for our customers:
- Ubuntu, an open-source Linux distribution, commonly used for AI and HPC systems
- TensorFlow, an open-source software library focused on developing deep neural networks
- PyTorch, an open-source ML library for natural language processing and computer vision applications
- Keras, an open-source software library that provides a Python interface for artificial neural networks. Keras supports TensorFlow, Microsoft Cognitive Toolkit, Theano, and PlaidML
- cuDNN, a GPU-accelerated library for deep neural networks. cuDNN provides implementations for forward and backward convolution, pooling, normalization, activation layers, and other standard routines.
- NVIDIA CUDA, a parallel computing platform and API that allows software to use NVIDIA GPUs for general-purpose processing, a key component of enabling AI.
- NVIDIA HPC SDK, a comprehensive software development kit for GPU-accelerating HPC modeling and simulation applications. It includes C, C++, and Fortran compilers, libraries, and analysis tools.
- R, a language and environment for statistical computing and graphics that enables data manipulation, calculation and graphical display
- And more…
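As a rough illustration of what "pre-tested" can mean in practice, the sketch below checks whether the Python-facing parts of a stack like the one above can be found in the current environment. It is not part of the Silicon Mechanics stack. The function name `check_stack` and the `STACK_MODULES` mapping are hypothetical and only cover the Python libraries in the list.

```python
import importlib.util

# Import names for the Python libraries in the list above.
# This mapping is an assumption made for illustration; note that
# PyTorch is imported as "torch".
STACK_MODULES = {
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Keras": "keras",
}


def check_stack(modules=STACK_MODULES):
    """Return {component: bool}, True when the module can be located.

    find_spec only locates a top-level module; it does not import it,
    so the check stays fast even for large frameworks.
    """
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, present in check_stack().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

A check like this only confirms that a package is installed. It says nothing about whether the versions are compatible with each other or with the GPU drivers, which is the part that integration testing addresses.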
Integrating Hardware and Software
Beyond the software stack itself, another way we’ve found to boost the performance and speed of clusters is to ensure the hardware is optimized for the type of workloads it will be running. As noted above, engineers who know your workload can optimize the cluster much better for your specific needs. We even went so far as to apply the pre-sourced, pre-integrated, pre-tested concept to the cluster itself, so we don’t have to start from scratch with our designs each time we work with a client.
Instead, we’ve designed a specific reference architecture for AI environments, the Silicon Mechanics Atlas AI Cluster. Using best-of-breed technology (including NVIDIA A100 GPUs for industry-leading GPU performance) in white box servers, the Linux-based Atlas AI Cluster provides performance, reliability, and scalability for AI along with the fast start of an integrated, tested software stack specific to AI.
The Silicon Mechanics Atlas AI Cluster also features low total cost of ownership compared to traditional supercomputers.
To maximize the ROI of your AI platform, we use a building block approach, where computing, storage, and networking components are configured in standardized, reliable sizes that can be scaled incrementally to meet specific performance needs. This lets us push the boundaries of AI clusters and optimize AI models to accommodate a wide variety of use cases such as natural language processing, predictive analytics, cybersecurity, business intelligence, virtual assistants, and robotics, to name a few.
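The arithmetic behind incremental scaling with fixed-size blocks can be sketched as follows. This is a minimal illustration only: the function name `blocks_needed` and the figures in the usage lines are hypothetical and do not describe actual Atlas AI Cluster configurations.

```python
def blocks_needed(target_units: int, units_per_block: int) -> int:
    """Smallest number of fixed-size blocks that meets a target.

    "Units" can stand for whatever a block provides (GPUs, terabytes,
    network ports). Blocks come in one standardized size, so the count
    is rounded up to the next whole block.
    """
    if target_units < 0 or units_per_block <= 0:
        raise ValueError("target must be >= 0 and block size must be > 0")
    # Integer ceiling division avoids floating-point rounding.
    return -(-target_units // units_per_block)


# Hypothetical figures: a target of 20 GPUs with 8 GPUs per block.
print(blocks_needed(20, 8))  # 3
# Growing the target to 30 GPUs later means adding one more block.
print(blocks_needed(30, 8))  # 4
```

Because blocks are added whole, capacity grows in steps. A target that falls between steps leaves some headroom in the last block.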
Organizations looking to leverage ML and DL must find smarter ways to optimize for different AI models. As open software and hardware experts, we pride ourselves on working directly with you to understand your technical and business requirements, and on pairing you with our best-fit computing solutions for your AI needs. We encourage you to learn more about our Atlas AI Cluster and our AI software stack to see why it is the right platform for your AI deployment.
Learn more about Silicon Mechanics’ approach to AI workloads at SiliconMechanics.com