A history of multi-tenancy
It has long been well understood that different tenants of a system should be isolated from one another to protect their data. This dates back to mainframes: many of the earliest mainframe computers included hypervisors (notably IBM’s System/370 Control Program, which lives on to this day as part of IBM’s PR/SM hypervisor technology) to isolate workloads from one another. However, as we shifted away from mainframe systems, we started to think less about multi-tenancy as an aspect of system design.
The rationale at the time was that multi-tenancy no longer mattered: each tenant of a mainframe-based system would migrate to its own compute infrastructure. Yet as the world has shifted back to centralized infrastructure to capture the economic benefits of cloud computing, multi-tenancy has not returned as a core system design concern. That failure to prioritize multi-tenancy has led to, and will continue to lead to, significant data breaches.
Multi-tenant AI compute: a disaster in the making
We live in an interesting time. NVIDIA, once known mainly for making high-performance GPUs for gaming and visualization workloads, has shifted its product strategy toward advancing heterogeneous computing with its CUDA framework. This has generated unprecedented demand for CUDA-based compute in the HPC and AI industries.
While cloud providers offer tenant isolation, cost means it is rarely used to isolate a customer's own tenants from one another. This creates a scenario in which two cloud provider customers are protected from each other, but each customer's own tenants are not, because economics often override the security of a system. This is where we see massive data breaches: a malicious tenant builds an exploit chain and attacks the infrastructure to escape its compute environment (usually a container or a Kubernetes namespace). The lack of isolation then lets the attacker inspect other tenants' environments and exfiltrate customer data.
This is not a hypothetical scenario. Wiz Research recently published an advisory about a vulnerability in the NVIDIA Container Toolkit that exploits weaknesses in the typical approach many AI providers take to enable GPU-accelerated compute. It allows an attacker to gain access to the underlying compute infrastructure and exfiltrate the data of the AI provider’s customers.
A better way
Multi-tenant isolation is critical to ensure the end-to-end integrity of a tenant’s (and their customers’) data. The Wiz advisory states:
"Additionally, this research highlights, not for the first time, that containers are not a strong security barrier and should not be relied upon as the sole means of isolation. When we design applications, especially multi-tenant applications, we should always “assume a vulnerability” and design to have at least one strong isolation barrier such as virtualization (as explained in the PEACH framework)."
We couldn’t agree more. At Edera, we are building products to enable our customers to build secure-by-design multi-tenancy infrastructure while allowing developers and security teams to stay focused on their core competencies.
Edera Protect runs all workloads in isolated zones, preventing container escapes entirely; mounting arbitrary host resources is not possible. Edera Protect AI adds driver isolation, placing all NVIDIA and GPU driver components inside their own untrusted zone so that a GPU driver or utility vulnerability cannot be used to compromise the entire Kubernetes environment. Our GPU design treats all NVIDIA and GPU technologies as untrusted: a connected GPU has no access to the Kubernetes host operating system at all and is instead quarantined in a zone until Kubernetes requests its use.
Edera Protect AI simplifies securing AI infrastructure and accelerated compute. It works alongside Edera Protect or on its own to isolate workloads, and it automates the provisioning of isolated GPUs to containers through either direct device passthrough or device virtual functions. Edera Protect AI runs without the NVIDIA Container Toolkit and securely manages access to GPUs directly, simplifying workload management significantly.
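To make this concrete, a workload can opt into isolated GPU compute through ordinary Kubernetes configuration. The sketch below uses the standard RuntimeClass and device plugin mechanisms only; the handler name ("edera"), the container image, and any Edera-specific wiring are illustrative assumptions, not documented Edera configuration.

```yaml
# Sketch only: the RuntimeClass name/handler and image are hypothetical placeholders.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: edera            # hypothetical name for an isolated-zone runtime
handler: edera           # hypothetical handler registered with the container runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  runtimeClassName: edera          # opt this workload into the isolated runtime
  containers:
    - name: inference
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # request one GPU via the standard device plugin resource name
```

In a setup like this, the only change a developer sees is the runtime class on the pod; how the GPU is quarantined and handed to the workload stays behind that boundary.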
Edera Services provides a strong professional services team that can help you build your products to be secure by design.
Ready to secure your multi-tenant environment with effortless, secure-by-design isolation? Connect with the experts at Edera today and discover how you can achieve it with just a few lines of YAML.