Since DevOps and serverless architecture emerged, cloud vendors have supported container services as well as traditional virtual machine (VM) services. A traditional VM is strongly isolated from a host machine because a Virtual Machine Monitor (VMM), aka hypervisor, splits it with virtualized hardware. In contrast, a container uses kernel-level isolation techniques such as namespace and cgroup. They make containers faster than VMs. However, containers share the host kernel, so attackers can escape from the container with a kernel vulnerability.
Recent containers leverage hypervisor technology to overcome this problem. Kata container uses KVM/QEMU to isolate containers. Amazon's Firecracker makes microVMs that use a KVM-based lightweight hypervisor for isolation. Google's gVisor also uses the lightweight hypervisor with a user-level kernel. These architectures provide strong isolation, but there is still room for improvement. Attackers can still escape from them directly with a KVM vulnerability since KVM runs in the hypervisor privilege (Ring -1). Many researchers have tried to protect the hypervisor by getting System Management Mode (SMM, Ring -2) and monitoring it. However, they needed BIOS/UEFI firmware modification.
In this talk, I present Alcatraz, a new and practical hypervisor sandbox to prevent escapes from the KVM/QEMU and KVM-based microVMs. Alcatraz consists of Hyper-box and a tailored kernel. Hyper-box is a pico hypervisor made from scratch to isolate KVM. Unlike others, it becomes the host hypervisor (Ring -1) and downgrades KVM's privilege to the guest hypervisor (Ring 0). Hyper-box has nested hypervisor functions for sandboxing the KVM and does not need SMM or firmware modification. It also monitors all system calls to prevent escapes and unauthorized privilege escalations. A tailored Linux kernel removes legacy system calls to reduce the attack surface and cooperate with Hyper-box. Alcatraz can be used on laptops, desktops, and servers that run untrusted code in VMs and microVMs.