What Is a Kubernetes Sandbox and Why Does It Matter?
First, let's demystify what a "sandbox" even is. In Kubernetes, a pod sandbox is the environment the container runtime creates for a pod at the kubelet's request, via the Container Runtime Interface (CRI). When a pod starts up, the kubelet asks the CRI to create a sandbox where the pod's containers will run. The sandbox acts as a dedicated workspace that includes network interfaces, storage mounts, and, crucially for our story, cgroup management.
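If you want to see sandboxes with your own eyes, crictl (the standard CRI debugging CLI) can list and inspect them - assuming it's installed and pointed at your runtime's socket:

```sh
# List the pod sandboxes the container runtime knows about
crictl pods

# Inspect a single sandbox for its status and metadata
crictl inspectp <pod-sandbox-id>
```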
Under the hood, one of the main components of a sandbox is cgroups. Before we get there, though, we need to introduce a couple of concepts: systemd slices and systemd scopes. A systemd slice unit provides a way to group processes and manage their resources (read more on systemd.slice). A slice typically contains other systemd units that manage processes. Most people are familiar with service units - containerd and the kubelet run as service units launched during boot. A service unit describes a process that is forked from the main init process and managed by systemd. Container runtimes, however, need to fork processes outside of the standard init process while still having a mechanism to manage their resources. This is where scope units come in. A systemd scope unit groups processes forked outside of systemd and delegates their resource management to systemd via cgroups (read more on systemd.scope).
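On a systemd host you can see both unit types with stock systemctl - nothing Kubernetes-specific here:

```sh
# Slice units group processes for resource management
systemctl list-units --type slice

# Scope units wrap processes that were forked outside of systemd
systemctl list-units --type scope
```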
For Kubernetes specifically, there are dedicated slice units (most notably kubepods.slice). Within kubepods.slice there is a hierarchy of slices that correspond to the QoS classes (besteffort, burstable, and guaranteed). When a pod is created, the kubelet creates a slice within that hierarchy at kubepods.slice/kubepods-<qos>.slice/kubepods-<qos>-pod<pod-uid>.slice. Runc-based container runtimes then launch the container processes and create a systemd scope within that slice to manage them.
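You can see this hierarchy for yourself on a running node. A quick sketch, assuming a cgroupv2 host with the usual slice names (these can vary slightly by distro and cgroup driver):

```sh
# Walk the kubepods cgroup tree with systemd's own tooling
systemd-cgls /kubepods.slice

# Or look at the filesystem directly - pod slices live under the QoS slices
ls /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/
```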
Here's where things get interesting. Cgroups have a concept of controllers - think of them as different knobs you can turn to manage CPU, memory, I/O, and other resources. Parent cgroups define which controllers their child cgroups are allowed to use. The kubelet checks the available controllers on the cgroup slice it created to determine whether a pod "exists" or not. If anything is out of order with those controllers, it considers the pod unhealthy, kills it, and starts a new one.
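On a cgroupv2 host you can inspect both sides of that relationship - which controllers a cgroup has been given, and which ones it passes down to its children (paths assume the unified hierarchy mounted at /sys/fs/cgroup):

```sh
# Controllers available to the kubepods slice (inherited from its parent)
cat /sys/fs/cgroup/kubepods.slice/cgroup.controllers

# Controllers the kubepods slice delegates to its child cgroups
cat /sys/fs/cgroup/kubepods.slice/cgroup.subtree_control
```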
This is ultimately one of the causes of the SandboxChanged error - everything about the pod looks healthy, but the underlying cgroup doesn't have the controllers it needs. The culprit (from my experience) is systemd fighting with your container runtime to manage cgroups.
3 Common Causes of SandboxChanged Errors
Cause #1: Misconfigured containerd and cgroups
The simplest case is just "I forgot to configure containerd properly."
If the container runtime does not allow systemd to manage the cgroups, then it will launch processes and write the cgroup controllers itself, typically writing directly to the filesystem. Systemd will come along and say "I'm not managing this, this is wrong" and make modifications to the cgroup controllers for the pod's slice. This modification by systemd will cause the kubelet to consider the pod non-existent and it will kill and recreate the pod, throwing the SandboxChanged error.
This one's usually easy to spot and fix - just make sure your containerd configuration sets SystemdCgroup = true.
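For reference, here's a minimal sketch of the relevant part of /etc/containerd/config.toml, assuming containerd's version 2 config schema with the runc runtime. Restart containerd after changing it for the setting to take effect.

```toml
# /etc/containerd/config.toml (version 2 schema)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    # Let systemd manage the cgroups instead of runc writing them directly
    SystemdCgroup = true
```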
Cause #2: Race Condition in Amazon Linux 2023 (nodeadm)
The second, less obvious way I encountered this was with Amazon Linux 2023 (AL2023). AL2023 uses a tool called nodeadm to configure the node on the fly by reading the userdata passed to the instance at boot. This is a great approach - I find it extremely helpful when working with AWS EKS.
Seemingly randomly, one day, I booted an EKS cluster running our product on AL2023 and all the pods in the cluster were cycling with SandboxChanged errors. Having some experience with this issue, I first checked the containerd config. It had all the correct settings to let systemd control the cgroups, so that wasn't it, or so I thought.
After a few hours of banging my head against the wall, and with some help from the incredible engineers on our team, I realized the systemd units were starting in a different order than usual. As part of our product installation, we introduce a few systemd units to run the Edera daemons, our CRI, and a few other bits and pieces. Introducing those extra units changed how systemd's dependency solver ordered units at startup.
Squinting at journalctl, I could see that the nodeadm unit and containerd unit were being started at the exact same time. So, containerd was starting with its default configuration (which sets SystemdCgroups = false) and then right after, nodeadm was writing the containerd configuration. By the time the correct containerd configuration was written, containerd was already chugging along with the wrong config, duking it out with systemd over the existence of containers.
Patching the containerd and nodeadm systemd units so that nodeadm writes its configuration files before containerd starts fixed the issue here. Ultimately, though, the issue was the same as before - systemd fighting with the container runtime.
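The fix looks roughly like a systemd drop-in that orders containerd after the nodeadm configuration step. A sketch - the nodeadm unit name here is illustrative, so use whichever unit your image actually runs to write the containerd config:

```ini
# /etc/systemd/system/containerd.service.d/10-wait-for-nodeadm.conf
[Unit]
# "nodeadm-config.service" is a placeholder for the unit that writes
# /etc/containerd/config.toml on your image
After=nodeadm-config.service
Wants=nodeadm-config.service
```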
Cause #3: Edera Runtime and Host cgroup Conflicts
Finally, the most recent and even less obvious issue was with the Edera runtime itself. This again happened on an AL2023 host, but as it turns out it wasn't an AL2023 specific issue. The symptoms: we would spin up a pod running on the Edera runtime and it would just sit there, cycling with SandboxChanged, mocking me to my face.
Instrumenting our CRI showed that kubelet was doing the normal pod startup, and then suddenly out of nowhere and for no apparent reason it would send StopContainer and StopPodSandbox. I checked the kubelet logs (verbosity level 1000), and I even added my own logging to the kubelet. I could see the healthy state of the pod, the containers were all healthy, then for no apparent reason the kubelet would decide to kill the container.
```
kuberuntime_manager.go:1126] "computePodActions got for pod" podActions="KillPod: true, CreateSandbox: true, UpdatePodResources: false, Attempt: 0 ...
kuberuntime_container.go:809] "Killing container with a grace period" ...
```
To make matters worse, this problem was not consistent. It was not an issue on cgroupv1 hosts - seemingly only on cgroupv2 hosts, but not even all of those. One of my dev machines is an Ubuntu host with cgroupv2 and a kubeadm cluster, and I have never seen the issue there. Even on EKS with AL2023, I saw the issue when I deployed just a single Edera pod - but not on a busy cluster with multiple pods.
Remember all the information from above, where I discussed cgroup controllers and how kubelet considers a pod to exist if those controllers are all in place? This is where I learned that detail the hard way. Reading the kubelet code, looking through the cgroupv2 pod manager, I found the line that checked for the existence of a pod. Instrumenting that line told me the cpuset controller was missing. Setting up a quick little script to print out the cgroup.controllers for my pod - there it was. I could see systemd was fighting with MY runtime now. The cpuset controller would come and go, come and go, over and over. This is ultimately what caused the pod to crashloop with SandboxChanged.
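A minimal sketch of that kind of check - substitute your pod's QoS class and UID into the slice path - makes the flapping obvious:

```sh
#!/bin/sh
# Print the controllers on a pod's cgroup slice every half second so you can
# watch them flap. The path below is illustrative - fill in your pod's UID.
POD_SLICE=/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<pod-uid>.slice

while true; do
  echo "$(date +%T) $(cat "$POD_SLICE/cgroup.controllers")"
  sleep 0.5
done
```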
Why Edera Pods Triggered SandboxChanged Errors
Edera runs pods in extremely lightweight virtual machines. We don't believe in shared kernels so we don't use cgroups on the host to sandbox processes. We do use cgroups and namespaces to run containers inside our lightweight virtual machines. As it turns out, this doesn't always jive with the current container infrastructure, so we run into interesting problems like this bug I'm telling you about now.
The kubelet creates a cgroup slice on the host and hands that to the container runtime. Then it proceeds to check for the existence of a pod by checking information within that cgroup slice. The initial iteration of the Edera runtime didn't do anything with that cgroup on the host, instead translating resource management configuration into our own dynamic virtual machine spec. This implementation was ultimately incorrect for running Kubernetes pods because kubelet expects there to be stable controllers within the cgroup slice it created. Systemd expects to manage a process within a slice, and when there is no process it will try to reclaim control, effectively causing the instability with controllers.
So, to get to the meat of the problem: the kubelet creates the cgroup slice, but because nobody tells systemd it's supposed to manage it (which is what a typical container runtime would do), systemd actively fights over it - removing unused controllers and causing the kubelet to consider the pod nonexistent.
Answering the Puzzling Questions About SandboxChanged
This led me to some interesting questions that helped me understand the broader issue:
Why does the error show up less often on a busy cluster?
Systemd wouldn't just fight over the leaf slice's controllers; it would also modify the controller set for the higher-level burstable or besteffort slices. Edera has the ability to run a pod itself or pass it through to a different container runtime (containerd/cri-o/etc). This gives us a nice property: we isolate only the workloads that need it, rather than forcing isolation as an all-or-nothing solution.
Passing a pod through to the typical container runtime means it spins up under the shared kernel in the cgroup slice on the host. Since we had something managing the host cgroup slice properly, it would keep systemd at bay - keeping the controllers steady.
When kubelet spun up a host cgroup for an Edera pod, it would inadvertently get the benefit of containerd properly managing the top level controllers. The inherited controller list would remain stable and Kubernetes would consider the pod to exist, even though there was no process running within the cgroup on the host (the process was running isolated within a virtual machine).
When we ran a single Edera pod, we got none of that accidental help from our delegate runtime. When the cluster was busy, we were far more likely to have a pod running under the delegate runtime in the same slice, and so the error would not show up.
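If you want to check whether your cluster is getting this accidental cover, look at who shares the QoS slice with your pod and what that slice is delegating (paths assume a cgroupv2 host; swap besteffort for your pod's QoS class):

```sh
# Who else lives in the besteffort slice alongside your pod?
systemd-cgls /kubepods.slice/kubepods-besteffort.slice

# Which controllers is that slice currently delegating to its children?
cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/cgroup.subtree_control
```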
Why does it seem to only fail on AL2023?
As I mentioned above, I have two hosts. Both use cgroupv2, but only one shows this issue. Well, as it turns out the answer to the previous question is also the answer here. My dev host running kubeadm never saw this issue. That's not anything special about the host - it actually has to do with the CNI.
I run Cilium on my kubeadm host. When Cilium is installed on a cluster, it takes over some responsibility for kube-proxy. Kube-proxy then runs under the delegate container runtime in the besteffort kubepods slice, right next to my silly little Edera pod. My AL2023 host on EKS is using the AWS CNI, which does not make this change, so the best-effort slice is empty when I go to launch my Edera pod and I get the SandboxChanged errors. If I swap our AWS CNI for Cilium on EKS the issue actually goes away. Well, it doesn't go away - it's just hidden.
How We Solved SandboxChanged Errors (The Edera Fix)
So, what is the solution? Well, for the time being, the solution is to play by Kubernetes' rules. Our CRI still isolates pods into their own lightweight VMs - that’s not going away. We just tell systemd to manage the slice created for us: we launch an extremely lightweight process (similar to the pause container) into the host cgroup and let systemd manage it for us.
We do this by forking off a host process and talking to systemd directly (using systemd-run today, and communicating over dbus in the future) to create a scope that manages the newly forked host process. Simple, straightforward, everyone is happy. There are many improvements we can make here, but as you can see, the core solution is relatively simple.
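To make that concrete, the shape of the trick looks roughly like this - a sketch using systemd-run, where the slice name, unit name, and placeholder process are all illustrative:

```sh
# Fork a lightweight placeholder process into the pod's slice and ask systemd
# to wrap it in a scope, so systemd has something to manage and leaves the
# slice's controllers alone.
systemd-run --scope \
  --slice=kubepods-besteffort-pod<pod-uid>.slice \
  --unit=pod-anchor-<pod-uid> \
  sleep infinity
```

Run from a shell like this, systemd-run stays in the foreground until the placeholder exits; our CRI forks the placeholder itself and registers it with systemd, but the delegation idea is the same.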
Quick Troubleshooting Checklist: SandboxChanged in Kubernetes
If you're seeing SandboxChanged errors, here's where I'd start (there's a quick command-line pass after the list):
- Check your containerd configuration - Make sure SystemdCgroup = true is set
- If it’s dynamic, know how containerd is being configured - Ensure configuration happens before containerd starts
- Check cgroup controllers - Use a script to monitor which controllers are present in your pod's cgroup and whether they’re flapping
- Consider what else is running - Some other pods may be affecting how yours is behaving. You can use a tool like systemd-cgls to see the underlying pod cgroups
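Here's that quick command-line pass - paths and unit names are illustrative, and a cgroupv2 host is assumed:

```sh
grep SystemdCgroup /etc/containerd/config.toml        # 1. is systemd cgroup management on?
systemctl show -p After containerd.service            # 2. what is containerd ordered after?
cat /sys/fs/cgroup/kubepods.slice/cgroup.controllers  # 3. are the controllers present?
systemd-cgls /kubepods.slice                          # 4. what else shares the hierarchy?
```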
My Parting Words of Wisdom
In all of my endeavors to fix these issues I haven’t found much information on what SandboxChanged really means. It’s at the lower levels of Kubernetes so it’s not always obvious what’s going on - especially just looking at the logs. This isn’t an exhaustive list of everything that could cause the error, but it is a detailed account of what I’ve found and how I’ve fixed it. My hope in writing this is to help folks get to a quicker conclusion when they see these errors.
The key insight is that SandboxChanged errors on systemd-based hosts often boil down to inconsistencies within the cgroups. The kubelet expects a pod's sandbox to have a cgroup slice with stable controllers. Systemd expects that cgroup to be anchored by a process it manages via a delegated scope unit. When you see these errors, you're usually witnessing a battle for control over resource management - and Kubernetes is just doing its job by restarting pods when things get out of sync.