GPU virtualization for machine learning workloads: Part 1

This article is authored by Uday Kurkure <[email protected]>, Lan Vu <[email protected]>, and Hari Sivaraman <[email protected]>.

Many of today’s most successful enterprises are turning to machine learning as a source of competitive advantage. With its ability to rapidly analyze enormous amounts of data from disparate sources and contextualize the data’s relevance to business goals and objectives, ML helps reduce enterprise complexity and speed up business decision-making. As human feedback and project results accumulate, the value of the resulting output continues to increase. At the same time, ML allows enterprises to automate simple or repetitive tasks, freeing up staff to expend their energy in areas where creativity and intuition are critically valuable.

Still, success in the enterprise is contingent on the ability to move quickly and nimbly, in spite of the increasingly immense demands placed on IT infrastructure. ML training can require hundreds of thousands of processing iterations, during which millions of data samples are consumed and analyzed across multiple passes over the data. Similarly, ML inference (determining what multiple, disparate data sources are telling us) requires real-time processing of up to millions of concurrent requests within a production or cloud environment. So, while the benefits of deploying ML can be dramatic from a competitive standpoint, enterprise leaders must consider how best to accelerate computation without significantly increasing overhead.

In this blog, we’ll take a closer look at the types of ML workloads and networks driving the need for increased performance, survey the options available for accelerating processing, and discuss why we believe virtualized GPU (vGPU) acceleration on VMware vSphere® is the right choice for enterprise ML deployments.

ML workload performance demands

The rapid introduction of complex ML models is placing increasingly high demands on IT infrastructure. To further explain, let’s take a brief look at the capabilities of three recently developed training models designed for image recognition, speech-to-text translation, and language translation.

Used primarily for computer vision tasks, ResNet is a convolutional network model capable of training networks more than 150 layers deep, over 8x deeper than the previously employed Visual Geometry Group (VGG) networks. Microsoft's ResNet-50 is 50 layers deep, is trained on more than one million images, and can classify images into 1,000 object categories.
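The defining feature of ResNet is the residual (skip) connection, which is what lets such deep networks train at all. The toy NumPy sketch below illustrates the idea only; it uses dense layers and arbitrary sizes, and is not the actual ResNet-50 implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A toy residual block: output = relu(F(x) + x).

    The identity shortcut (+ x) lets gradients flow through
    150+ layers without vanishing.
    """
    out = relu(x @ w1)    # first stage (dense here for simplicity)
    out = out @ w2        # second stage, no activation yet
    return relu(out + x)  # skip connection, then activation

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))            # one 64-feature input
w1 = rng.standard_normal((64, 64)) * 0.01
w2 = rng.standard_normal((64, 64)) * 0.01
y = residual_block(x, w1, w2)
print(y.shape)  # (1, 64)
```

With near-zero weights, F(x) is close to zero and the block approximately passes its input through unchanged, which is why adding more such layers does not hurt a deep network the way plain layers can.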

Baidu Research's Deep Speech 2 converts speech to text, starting from a normalized spectrogram that is translated into a sequence of text characters. Deep Speech 2 combines convolutional and recurrent network layers and is capable of recognizing English and Mandarin (two vastly different languages) while also accounting for other variables like background noise and challenging accents.

Google Neural Machine Translation (GNMT) employs an example-based machine translation (EBMT) method through which the system “learns from millions of examples.” Using an artificial neural network, GNMT translates between languages by encoding the semantics of a sentence rather than memorizing translated text by phrase. Google introduced GNMT as a way of improving fluency and accuracy of Google Translate.

High performance requirements for machine learning

[Table omitted: performance requirements of the three ML models described above.]

*Source: GitHub/tensorflow

Accelerated processing architectures for ML

While multicore CPU systems form the basis of any ML infrastructure, adding CPUs to address the performance demands of ML adds unnecessarily high overhead. Alternatively, a handful of options exist today to complement existing multicore systems and provide the required compute acceleration, improving ML training by 5-10x.

Field Programmable Gate Array (FPGA)

In the simplest terms, an FPGA is a combination of logic and IP blocks that can be reconfigured at any point, including after deployment or installation. FPGAs are configured for the specific application they're meant to support and deliver high performance per watt. At the same time, FPGA performance is lower for sequential operations (operations that must execute one after another rather than in parallel), and FPGAs can be difficult to program.

Application Specific Integrated Circuit (ASIC)

An integrated circuit (IC) customized for a specific end application (e.g., Google Cloud TPU for TensorFlow ML framework), an ASIC is tailored to deliver an optimal level of performance and power consumption. ASICs can’t be modified without redesigning the silicon, however, contributing to a lengthy development period along with generally higher costs.

Graphics Processing Unit (GPU)

Initially designed for 2D and 3D graphics, GPUs are used today across a broad range of compute-intensive applications like HPC and ML/DL. While an individual GPU core is slower than a CPU core, GPUs feature thousands of identical processor cores capable of parallel processing, delivering significant processing power for ML applications. GPUs consume more power than FPGAs and ASICs, however, so pursuing ways to leverage GPU capabilities as efficiently as possible is important.
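The parallelism a GPU exploits is ordinary data parallelism: the same operation applied independently to many elements at once. A minimal NumPy illustration of the access pattern (NumPy runs on the CPU, but this is exactly the shape of work a GPU kernel spreads across its thousands of cores):

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Sequential: one element per loop iteration (how a single core works).
def scale_loop(a, factor):
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = a[i] * factor
    return out

# Data-parallel: one operation over all elements at once. On a GPU,
# each element would map to its own thread.
def scale_vec(a, factor):
    return a * factor

# Both produce identical results; only the execution model differs.
assert np.array_equal(scale_loop(x[:1000], 2.0), scale_vec(x[:1000], 2.0))
```

Workloads dominated by such elementwise and matrix operations, as ML training and inference are, map naturally onto thousands of GPU cores; workloads with long sequential dependency chains do not.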

Virtualizing GPUs with VMware vSphere

VMware vSphere allows for different methods of virtualizing GPUs. Frequently considered a first step in exposing GPUs to VMs, VMware DirectPath I/O provides direct guest OS access to a GPU while bypassing the ESXi hypervisor. With this “passthrough” method, a VM may consume one or many GPUs to support compute-intensive high-performance computing (HPC) and ML workloads. DirectPath I/O is often employed when a single application will utilize one or more full GPUs.
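As a rough illustration, a passed-through GPU appears in a VM's .vmx configuration as pciPassthru entries. The fragment below is a hypothetical sketch: the device and vendor IDs are placeholders for the actual PCI device, and in practice the device is attached through the vSphere Client rather than by hand-editing the file.

```
# Hypothetical .vmx fragment: a GPU attached via DirectPath I/O.
# Device/vendor IDs below are placeholders for the real PCI device.
pciPassthru0.present = "TRUE"
pciPassthru0.vendorId = "0x10de"   # NVIDIA's PCI vendor ID
pciPassthru0.deviceId = "0x1db4"   # placeholder device ID
pciPassthru.use64bitMMIO = "TRUE"  # GPUs with large BARs need 64-bit MMIO
```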

NVIDIA vGPU enables sharing of vGPUs across multiple virtual machines (VMs). VMware vSphere customers run vGPUs on vSphere for multiple applications, including acceleration of 2D and 3D graphics workloads for VMware Horizon, enabling VMware Blast Extreme protocol (a remote display protocol that relies on the GPU rather than CPU to reduce latency and improve bandwidth), and as a general purpose GPU (GPGPU) for ML and HPC workloads.

NVIDIA vGPU-sharing allows for multiple VMs to be powered by one GPU. Conversely, multiple vGPUs can be used to power a single VM, making highly compute-intensive applications (like ML) possible. Through integration with vSphere, GPU clusters are managed within vCenter to help customers maximize utilization and protection.
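For comparison, a vGPU-backed VM references an NVIDIA vGPU profile rather than a raw PCI device; the profile name encodes the GPU model and the framebuffer slice each VM receives. The fragment below is illustrative, and the profile name is only an example.

```
# Hypothetical .vmx fragment: an NVIDIA vGPU profile.
# "grid_p40-8q" is an example profile: a Tesla P40 sliced into
# 8 GB framebuffer partitions, so several VMs share one GPU.
pciPassthru0.present = "TRUE"
pciPassthru0.vgpu = "grid_p40-8q"
```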

To more fully understand the benefits of virtualizing GPUs through DirectPath I/O or NVIDIA vGPU, consider the following language modeling example and figure:

Named for the way it channels information through the mathematical operations performed at the nodes of the network, a recurrent neural network (RNN) cycles its outputs back through the network as inputs, creating a sort of feedback loop. RNNs are most commonly used for natural language processing and speech recognition. In our language modeling scenario, the model is trained on the Penn Treebank (PTB) dataset of 929,000 training words, 73,000 validation words, 82,000 test words, and a 10,000-word vocabulary.
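The feedback loop described above can be made concrete: at each step the RNN mixes the current input with its previous hidden state, and that hidden state is fed back in at the next step. A toy NumPy version follows; the dimensions and weights are arbitrary and this is not the actual PTB model.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, b):
    """One step of a vanilla RNN: the previous state h_prev
    cycles back in as an input -- the 'feedback loop'."""
    return np.tanh(x_t @ Wxh + h_prev @ Whh + b)

rng = np.random.default_rng(42)
vocab, hidden = 10, 16                 # toy sizes, not PTB's 10,000-word vocabulary
Wxh = rng.standard_normal((vocab, hidden)) * 0.1
Whh = rng.standard_normal((hidden, hidden)) * 0.1
b = np.zeros(hidden)

h = np.zeros(hidden)
for token_id in [3, 1, 4, 1, 5]:       # a tiny "sentence" of token IDs
    x_t = np.eye(vocab)[token_id]      # one-hot encoding of the token
    h = rnn_step(x_t, h, Wxh, Whh, b)  # h carries context forward

print(h.shape)  # (16,)
```

Training unrolls this loop over long word sequences, which is why the workload is dominated by the dense matrix multiplications that GPUs accelerate so well.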

The resulting overhead of NVIDIA vGPU and DirectPath I/O is minimal when compared to native GPU performance.


Looking ahead: increased flexibility and utilization of NVIDIA vGPUs with BitFusion

VMware is constantly exploring ways to help our customers take further advantage of GPU technology to accelerate ML and other compute-intensive workloads.

BitFusion FlexDirect enables NVIDIA vGPUs to be abstracted, partitioned and shared much like traditional compute resources. BitFusion enables VMs to access GPU resources physically installed in disparate ESXi hosts and creates a common pool of infrastructure available to any vSphere environment.

BitFusion and NVIDIA vGPUs in vSphere drive many customer use-cases and benefits, with performance comparable to bare metal, for applications like:

  • ML workloads on Linux for data scientists and ML researchers
  • Virtual desktop infrastructure (VDI) for office workers on Windows
  • 3D CAD workloads on Windows and Linux for scientists
  • Simulations on Linux
  • Time-shifted GPU sharing by end users in different time zones
  • Improved data center resource utilization through vGPU sharing
  • Remote GPU access for users on servers without GPUs

Faster time to value: NVIDIA vGPUs and VMware vSphere

While the benefits of ML workloads to the enterprise are compelling, the ability to move fast and remain agile is increasingly important to staying competitive. VMware vSphere combines the power of GPU technology with the already vast data center management benefits of virtualization. Together, NVIDIA vGPUs and vSphere help enterprise IT decision-makers and staff optimize their infrastructure for the compute-intensive performance demands of ML workloads, while enabling flexible allocation and utilization of GPU resources to effectively manage IT overhead and costs.

More to come…

In a follow-up to this blog, we’ll go into further detail around the deployment benefits of BitFusion and begin a discussion of vGPU vMotion, a feature of vSphere 6.7 that enables workload leveling and server software upgrades without incurring end-user downtime.

The post GPU virtualization for machine learning workloads: Part 1 appeared first on VMware vSphere Blog.
