What Are Privileged Containers and Why Do They Matter?

Privileged containers have long been considered a security risk because they bypass many of the isolation mechanisms that normally protect host systems. When a container runs in privileged mode (i.e., with the --privileged flag), it shares the same user namespace as the host and gains an extensive set of capability bits that allow it to perform operations typically restricted to the host system.

Many organizations face these common scenarios requiring privileged mode:

  • CI/CD pipelines that need to build and test container images
  • Security tools that require deep system access
  • Network CNIs that manage container and host interfaces
  • Storage management utilities that need to interact with host devices

However, running these workloads in privileged mode traditionally means accepting significant security risks – until now.

Traditional approaches to privileged containers come with significant drawbacks:

  • Shared kernel vulnerabilities: Container escapes can lead to full host compromise
  • Lateral movement opportunities: Attackers can pivot across multiple containers
  • Limited blast radius containment: Security incidents can quickly cascade

Security Alert: Privileged containers in traditional environments have full access to the host system and are considered a primary vector for container escapes.

How Edera Protect Makes Privileged Mode Safe

The next major release of Edera Protect adds support for privileged mode. Containers that require privileged access can run in isolation on the Edera Protect platform, which reduces the risk of running these workloads.

First, let's talk about the caveats and best practices…

Privileged mode on Edera Protect is designed for workloads that require elevated privilege but can otherwise be isolated, such as CI/CD build jobs and other ephemeral workloads. It is not designed to be a drop-in replacement for workloads that modify the host machine – those workloads should continue to be run as standard containers without isolation.

In addition, we generally recommend running privileged workloads in their own isolated zones – our next-gen Kubernetes sandbox – whenever possible, to ensure that other workloads are protected from the privileged workload’s elevated access.

What is privileged mode anyway?

Normally, OCI containers – the most common and widely adopted container format – run in unprivileged mode: they have their own user namespaces, which cannot inherit privilege from the primary user namespace used by the host system. Privileged mode flips the script: when you run a container as a privileged one, it shares the same user namespace as the host, meaning that the containerized environment has full access to the host’s resources.

Besides the lack of a user namespace, privileged mode primarily works by granting a default set of capability bits. If we use a tool like Docker or Podman, we can directly observe how privileged mode differs from unprivileged mode.  First, let's look at an unprivileged container:

$ docker run -it alpine:latest
/ # grep ^Cap /proc/self/status
CapInh:	0000000000000000
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000

What you can see from the output is that in the unprivileged container, the capability sets (CapPrm, CapEff, CapBnd) are limited: 00000000a80425fb is a partial set of Linux capabilities, with only 14 capability bits set.

Now, to compare, we can look at a privileged one. In the privileged container, those sets are 000001ffffffffff, which has every one of the 41 capability bits defined by the kernel set:

$ docker run --privileged -it alpine:latest
/ # grep ^Cap /proc/self/status
CapInh:	0000000000000000
CapPrm:	000001ffffffffff
CapEff:	000001ffffffffff
CapBnd:	000001ffffffffff
CapAmb:	0000000000000000
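To make the two masks above easier to compare, here is a minimal sketch that decodes a capability mask into names. The name table follows the bit numbering in the kernel's linux/capability.h header (bits 0 through 40); the helper name decode_caps is our own and is not part of any tool mentioned in this post.

```python
# Capability names indexed by bit number, following the numbering in
# include/uapi/linux/capability.h (bits 0 through 40).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex: str) -> list[str]:
    """Return the names of the bits set in a /proc/<pid>/status Cap* mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    # The two masks printed by the docker sessions above.
    for label, mask in (("unprivileged", "00000000a80425fb"),
                        ("privileged", "000001ffffffffff")):
        caps = decode_caps(mask)
        print(f"{label}: {len(caps)} capabilities")
        print("  " + ", ".join(caps))
```

Decoding the unprivileged mask this way shows that CAP_SYSLOG is not among Docker's default capabilities, which matters for the dmesg experiment that follows. Where libcap is installed, capsh --decode=<mask> does the same job.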

Some capability bits work within user namespaces and apply only to resources controlled by that specific namespace. Others, like CAP_SYSLOG, which lets you read and clear the kernel log buffer, are only effective in the initial (host) user namespace. To illustrate, we can grant CAP_SYSLOG to a container and compare what happens with and without user namespaces enabled. First, let’s try to read the kernel log buffer in a totally unprivileged container:

$ docker run -it alpine:latest
/ # dmesg | head -n 5
dmesg: klogctl: Operation not permitted

As expected, reading the log buffer is not allowed because the kernel does not allow access to it by default. What happens if we give ourselves CAP_SYSLOG?

$ docker run --cap-add CAP_SYSLOG -it alpine:latest
/ # dmesg | head -n 5
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x610f0000]
[    0.000000] Linux version 6.10.14-linuxkit (root@buildkitsandbox) (gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, GNU ld (GNU Binutils) 2.42) #1 SMP Thu Mar 20 16:32:56 UTC 2025
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000070000000-0x00000000ffffffff]
[    0.000000]   DMA32    empty

It works, but only because Docker does not use user namespaces by default, which allows CAP_SYSLOG to be meaningful. We can verify this by using Edera’s am-i-isolated tool:

$ docker run -it ghcr.io/edera-dev/am-i-isolated:latest | grep namespaces
  🔴 host namespaces test failed: found host namespaces: user

If we restart Docker with user namespace support enabled, we get:

$ docker run -it ghcr.io/edera-dev/am-i-isolated:latest | grep namespaces
  🟢 host namespaces test passed: host namespaces not passed through

Now, if we try to use CAP_SYSLOG, it should be ineffective:

$ docker run --cap-add CAP_SYSLOG -it alpine:latest
/ # dmesg
dmesg: klogctl: Operation not permitted

In contrast to Docker and Kubernetes, the Edera Protect platform uses user namespaces by default as an additional form of hardening. For compatibility, privileged mode on Edera Protect therefore automatically drops the use of user namespaces.
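A rough way to tell from inside a container whether it shares the host's user namespace is to read /proc/self/uid_map. In the initial user namespace this file contains a single identity mapping covering the whole 32-bit UID range. The sketch below applies that heuristic. The function names are hypothetical, and this is not necessarily how am-i-isolated performs its check: a privileged runtime could write the same full-range mapping into a child namespace, so treat a positive result as a hint rather than proof.

```python
import os


def covers_full_uid_range(uid_map_text: str) -> bool:
    """Heuristic: True if uid_map is the single mapping '0 0 4294967295'."""
    rows = [line.split() for line in uid_map_text.splitlines() if line.strip()]
    return rows == [["0", "0", "4294967295"]]


def looks_like_initial_user_ns(path: str = "/proc/self/uid_map") -> bool:
    """Apply the heuristic to the calling process (Linux only)."""
    with open(path) as f:
        return covers_full_uid_range(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/self/uid_map"):
        print("shares host user namespace (heuristic):",
              looks_like_initial_user_ns())
    else:
        print("/proc/self/uid_map not available on this system")
```

With Docker's user namespace remapping enabled, the file typically holds a mapping such as "0 100000 65536" instead, and the heuristic reports False.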

Capability enforcement in the kernel

Why is it that CAP_SYSLOG doesn’t work inside a user namespace?  To get to the bottom of it, we need to look at the kernel.

The kernel privilege model is built around POSIX capabilities: small bits which, if set on a process, allow it to perform privileged operations.  In this privilege model, the root user has a default effective capability set comprising all known capability bits available in the kernel, which is 41 bits as of Linux 5.9, the last time a new capability was added to the kernel.

Shifting superprivilege away from the root user (which retains superprivilege by default) into capability bits unlocks a few things:

  • Applications started with superprivilege can voluntarily lower their capability bits
  • Applications started without superprivilege can have select capability bits enabled, either through a launcher or through filesystem attributes
  • Users who are not root can be granted select capability bits allowing for limited privileged operations without the need for sudo

But how does this work in practice? In the example above, we granted the CAP_SYSLOG capability, but it was ineffective once user namespaces were enabled. To find out why, let's take a look at our CAP_SYSLOG example from the perspective of the kernel. When a user requests the kernel log buffer, they perform a klogctl(2) syscall.

If we browse the source code for the klogctl syscall, we see the following fragment:

static int check_syslog_permissions(int type, int source)
{
        /* Reads from an already-open /proc/kmsg were checked at open time. */
        if (source == SYSLOG_FROM_PROC && type != SYSLOG_ACTION_OPEN)
                goto ok;

        if (syslog_action_restricted(type)) {
                if (capable(CAP_SYSLOG))
                        goto ok;
                return -EPERM;
        }
ok:
        return security_syslog(type);
}

The main function of interest here is capable(), which is used by the kernel to determine if the current process has the appropriate capability in the initial user namespace, &init_user_ns:

bool capable(int cap)
{
        return ns_capable(&init_user_ns, cap);
}

ns_capable() is just a wrapper around ns_capable_common(), which is a wrapper around the LSM capable hook:

static bool ns_capable_common(struct user_namespace *ns,
                              int cap,
                              unsigned int opts)
{
        int capable;

        if (unlikely(!cap_valid(cap))) {
                pr_crit("capable() called with invalid cap=%u\n", cap);
                BUG();
        }

        capable = security_capable(current_cred(), ns, cap, opts);
        if (capable == 0) {
                current->flags |= PF_SUPERPRIV;
                return true;
        }
        return false;
}

We can find the appropriate hook in the POSIX capabilities LSM (with the author’s annotations in the comments):

static inline int cap_capable_helper(const struct cred *cred,
                                     struct user_namespace *target_ns,
                                     const struct user_namespace *cred_ns,
                                     int cap)
{
        struct user_namespace *ns = target_ns;

        /* See if cred has the capability in the target user namespace
         * by examining the target user namespace and all of the target
         * user namespace's parents.
         */
        for (;;) {
                /* Do we have the necessary capabilities? */
                if (likely(ns == cred_ns))
                        return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;

                /*
                 * If we're already at a lower level than we're looking for,
                 * we're done searching.
                 */
                if (ns->level <= cred_ns->level)
                        return -EPERM;

                /*
                 * The owner of the user namespace in the parent of the
                 * user namespace has all caps.
                 */
                if ((ns->parent == cred_ns) && uid_eq(ns->owner, cred->euid))
                        return 0;

                /*
                 * If you have a capability in a parent user ns, then you have
                 * it over all children user namespaces as well.
                 */
                ns = ns->parent;
        }

        /* We never get here */
}

Given this, we can deduce the following rules for capability bits:

  1. Users who create unprivileged user namespaces are treated like the root user inside the namespace they created, but only over resources governed by that namespace.
  2. Capabilities only propagate downward, not back towards the initial user namespace. This is why CAP_SYSLOG is meaningless in a user namespace, for example.
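The two rules above can be checked against a small userspace model of the kernel walk. This is only a sketch: UserNS and Cred are toy stand-ins for struct user_namespace and struct cred, capabilities are plain strings, and LSM hooks other than the POSIX capability check are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNS:
    """Toy stand-in for struct user_namespace."""
    level: int                         # 0 for the initial namespace
    parent: Optional["UserNS"] = None
    owner: int = 0                     # uid that created the namespace


@dataclass(eq=False)
class Cred:
    """Toy stand-in for struct cred."""
    user_ns: UserNS
    euid: int
    cap_effective: frozenset = frozenset()


INIT_USER_NS = UserNS(level=0)


def ns_capable(cred: Cred, target_ns: UserNS, cap: str) -> bool:
    """Mirror the loop in cap_capable_helper(), returning a bool."""
    ns = target_ns
    while True:
        # Same namespace: the effective set decides.
        if ns is cred.user_ns:
            return cap in cred.cap_effective
        # Target is at or above our level: capabilities never flow upward.
        if ns.level <= cred.user_ns.level:
            return False
        # The creator of a direct child namespace holds all caps over it.
        if ns.parent is cred.user_ns and ns.owner == cred.euid:
            return True
        # Otherwise keep walking toward the initial namespace.
        ns = ns.parent


def capable(cred: Cred, cap: str) -> bool:
    """capable() always checks against the initial user namespace."""
    return ns_capable(cred, INIT_USER_NS, cap)


if __name__ == "__main__":
    container_ns = UserNS(level=1, parent=INIT_USER_NS, owner=1000)
    in_container = Cred(container_ns, euid=0,
                        cap_effective=frozenset({"CAP_SYSLOG"}))
    on_host = Cred(INIT_USER_NS, euid=0,
                   cap_effective=frozenset({"CAP_SYSLOG"}))
    creator = Cred(INIT_USER_NS, euid=1000)

    print("container root, capable(CAP_SYSLOG):",
          capable(in_container, "CAP_SYSLOG"))
    print("host root, capable(CAP_SYSLOG):",
          capable(on_host, "CAP_SYSLOG"))
    print("namespace creator over its own namespace:",
          ns_capable(creator, container_ns, "CAP_SYS_ADMIN"))
```

In this model the container's root user holds CAP_SYSLOG in its effective set, yet capable() rejects it because the check targets the initial namespace, one level above the credential's own. The unprivileged creator, by contrast, passes any check aimed at the namespace it owns.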

Coming to Edera Protect 1.1

Privileged mode support will be available in the upcoming Edera Protect 1.1 release. This feature enables organizations to:

  1. Consolidate infrastructure by running privileged workloads on the same clusters as unprivileged ones
  2. Improve security posture by containing privileged operations within isolated zones
  3. Simplify operations with a consistent platform for all workloads
  4. Reduce costs by eliminating the need for separate dedicated infrastructure

To learn more about how Edera Protect's privileged mode can help secure your sensitive workloads, or to test out a beta version, get in touch.

By combining the flexibility of privileged containers with the strong isolation guarantees of Edera Protect, organizations can run sensitive workloads more securely than ever before.