这件事起源于有小伙伴在某群里问,在 K8s 中,能不能把 volume 挂载直接挂到根目录?我的第一反应是不能。容器会使用 union filesystem 将容器的内容挂到根目录下,这点在正常情况下是无法更改的。但是就止于此吗?发现给不出合理解释的时候,突然感觉自己对于容器的认知只停留在了很表面的阶段。

一、从 runc 源码开始

于是我翻到了 runc 的代码,一起看看他是怎么做的,看看有没有什么切入点。我们首先关注容器的创建这一部分:libcontainer/init_linux.go:78

func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd, logFd int, mountFds []int) (initer, error) {
	var config *initConfig
	if err := json.NewDecoder(pipe).Decode(&config); err != nil {
		return nil, err
	if err := populateProcessEnvironment(config.Env); err != nil {
		return nil, err
	switch t {
	case initSetns:
		// mountFds must be nil in this case. We don't mount while doing runc exec.
		if mountFds != nil {
			return nil, errors.New("mountFds must be nil; can't mount from exec")
		return &linuxSetnsInit{
		}, nil
	case initStandard:
		return &linuxStandardInit{
		}, nil
	return nil, fmt.Errorf("unknown init type %q", t)

这里做的事情比较简单,一个是从 Pipe 拿到初始化配置,解析配置中注入的 env,将其设置到本进程中。容器初始化的方式有两种,其一是 initSetns,启动一个已有的容器。其次是 initStandard,启动一个标准容器。

initStandard 中与 rootfs 最密切相关的就是 err := prepareRootfs(l.pipe, l.config, l.mountFds),在 prepareRootfs 之前,主要进行了网络的初始化,比如 lo 网卡和 route 的初始化。不过我们主要还是关注 rootfs 部分,从注释我们可以看到这里主要做了这几件事情:设备、挂载点、fs的初始化,最后提醒你调用 finalizeRootfs 来完成初始化,我们先以 prepareRootfs 为核心,逐行解析这里面发生了什么:

// prepareRootfs sets up the devices, mount points, and filesystems for use
// inside a new mount namespace. It doesn't set anything as ro. You must call
// finalizeRootfs after this function to finish setting up the rootfs.
func prepareRootfs(pipe io.ReadWriter, iConfig *initConfig, mountFds []int) (err error) {
	config := iConfig.Config
	if err := prepareRoot(config); err != nil {
		return fmt.Errorf("error preparing rootfs: %w", err)
	if mountFds != nil && len(mountFds) != len(config.Mounts) {
		return fmt.Errorf("malformed mountFds slice. Expected size: %v, got: %v. Slice: %v", len(config.Mounts), len(mountFds), mountFds)
	mountConfig := &mountConfig{
		root:            config.Rootfs,
		label:           config.MountLabel,
		cgroup2Path:     iConfig.Cgroup2Path,
		rootlessCgroups: iConfig.RootlessCgroups,
		cgroupns:        config.Namespaces.Contains(configs.NEWCGROUP),
	setupDev := needsSetupDev(config)
	for i, m := range config.Mounts {
		// Just before the loop we checked that if not empty, len(mountFds) == len(config.Mounts).
		// Therefore, we can access mountFds[i] without any concerns.
		if mountFds != nil && mountFds[i] != -1 {
			mountConfig.fd = &mountFds[i]
		} else {
			mountConfig.fd = nil
		if err := mountToRootfs(m, mountConfig); err != nil {
			return fmt.Errorf("error mounting %q to rootfs at %q: %w", m.Source, m.Destination, err)
	if setupDev {
		if err := createDevices(config); err != nil {
			return fmt.Errorf("error creating device nodes: %w", err)
		if err := setupPtmx(config); err != nil {
			return fmt.Errorf("error setting up ptmx: %w", err)
		if err := setupDevSymlinks(config.Rootfs); err != nil {
			return fmt.Errorf("error setting up /dev symlinks: %w", err)
	// Signal the parent to run the pre-start hooks.
	// The hooks are run after the mounts are setup, but before we switch to the new
	// root, so that the old root is still available in the hooks for any mount
	// manipulations.
	// Note that iConfig.Cwd is not guaranteed to exist here.
	if err := syncParentHooks(pipe); err != nil {
		return err
	// The reason these operations are done here rather than in finalizeRootfs
	// is because the console-handling code gets quite sticky if we have to set
	// up the console before doing the pivot_root(2). This is because the
	// Console API has to also work with the ExecIn case, which means that the
	// API must be able to deal with being inside as well as outside the
	// container. It's just cleaner to do this here (at the expense of the
	// operation not being perfectly split).
	if err := unix.Chdir(config.Rootfs); err != nil {
		return &os.PathError{Op: "chdir", Path: config.Rootfs, Err: err}
	s := iConfig.SpecState
	s.Pid = unix.Getpid()
	s.Status = specs.StateCreating
	if err := iConfig.Config.Hooks[configs.CreateContainer].RunHooks(s); err != nil {
		return err
	if config.NoPivotRoot {
		err = msMoveRoot(config.Rootfs)
	} else if config.Namespaces.Contains(configs.NEWNS) {
		err = pivotRoot(config.Rootfs)
	} else {
		err = chroot()
	if err != nil {
		return fmt.Errorf("error jailing process inside rootfs: %w", err)
	if setupDev {
		if err := reOpenDevNull(); err != nil {
			return fmt.Errorf("error reopening /dev/null inside container: %w", err)
	if cwd := iConfig.Cwd; cwd != "" {
		// Note that spec.Process.Cwd can contain unclean value like  "../../../../foo/bar...".
		// However, we are safe to call MkDirAll directly because we are in the jail here.
		if err := os.MkdirAll(cwd, 0o755); err != nil {
			return err
	return nil

1、prepareRoot

1.1 RootPropagation

func prepareRoot(config *configs.Config) error {
   flag := unix.MS_SLAVE | unix.MS_REC
   if config.RootPropagation != 0 {
      flag = config.RootPropagation
   if err := mount("", "/", "", "", uintptr(flag), ""); err != nil {
      return err
   // Make parent mount private to make sure following bind mount does
   // not propagate in other namespaces. Also it will help with kernel
   // check pass in pivot_root. (IS_SHARED(new_mnt->mnt_parent))
   if err := rootfsParentMountPrivate(config.Rootfs); err != nil {
      return err
   return mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "")

prepareRoot 的最一开始,先进行了一次 mount,这次 mount 实际上是一个 propagation 的递归修改(unix.MS_REC)。默认情况下 flag 是 unix.MS_SLAVE。从 linux 小手册上可以得知,这个 flag 表示 mount 点从属挂载下的 mount 事件单向传播,此从节点下的挂载将不会影响到主节点。由于它这里 mount 的是 "/" 目录,而且使用了递归参数,即表示在此 ns 中的任何 mount 操作,都不对外界产生影响,不过反过来(准确的说是 peer group 之间)是产生影响的。

我们这里模拟一下,进行一个 tmpfs 的 mount,并设置传播等级为 shared:

mount -t tmpfs myt /root/dir1 --make-shared

findmnt -o TARGET,PROPAGATION 查看一下传播等级:

|-/var/lib/kubelet/pods/6c4a58a7-557f-4cc8-b95f-4170c6ac2ab8/volume-subpaths/dashboard-manager-secret/customer-dashboard-manager/2 private
|-/root/dir1 shared

我模拟 runc clone 一个 ns,然后同样查看传播等级,发现结果与上面一样。执行 mount --make-rslave /,再次查看传播等级,发现已经变成了 slave,而原先的 private 则保持不变:

|-/var/lib/kubelet/pods/6c4a58a7-557f-4cc8-b95f-4170c6ac2ab8/volume-subpaths/dashboard-manager-secret/customer-dashboard-manager/2 private
|-/root/dir1 private,slave

行为也和 man page 的描述一致,不是 shared 的并不会因为此命令而改变:

MS_SLAVE
              If this is a shared mount that is a member of a peer group
              that contains other members, convert it to a slave mount.
              If this is a shared mount that is a member of a peer group
              that contains no other members, convert it to a private
              mount.  Otherwise, the propagation type of the mount is
              left unchanged.

当然我们可以看到这里留了个口子,可以依据 config.RootPropagation 来改变这个默认行为,docker 的默认是 rprivate,即双向的 mount 都互不产生影响。K8s 的默认也是 private,在 K8s 1.2.1以后,支持对 Volume 进行传播等级配置,比如 HostToContainer,其实就是 MS_SLAVE。还有一种 Bidirectional,则是 MS_SHARED,表示此 ns 下的 mount 与外界共享,这个口子灵活又危险,比如可以在容器里进行 device 的 mount/unmount。

1.2 rootfsParentMountPrivate

这块的注释非常全,其实就是检查一下准备作为 root 的这个目录是不是 shared,如果是 shared,则改为 private。也就是无论如何,容器都要求 rootfs 为 private,即使我们将 RootPropagation 设置为 shared 或者其他。

这块意图也合理,如果 rootfs 如果随意被 propagation 影响,很容易导致容器崩溃。(不过我也不太确定我这个猜想是否正确。)

另外,这里注释提到,把他改成 private 也是避免后续做 bind 操作的时候,将 mount 传播到其他 namespace。以及,pivot_root 也不允许此 mount 为 shared。

1.3 bind

Bind 和硬链看起来有点点像,不过底层实现完全不同。man page 提到 bind 是一种对 fs attach 的操作,而软硬链是借助 inode 来完成的。

mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "")

REC 参数的意图和上面提到的 propagation 时的一致,就是递归。man page 中提到,如果没有 REC 参数,则 bind 只 mount 当前这个目录,而目录底下的 submounts 不会被复制。我们发现它把 rootfs 目录 bind 到 rootfs 目录了,这是为了创建一个 mountpoint。这个 mountpoint 是容器根目录的 mount,比如:

➜  ~ mount
/dev/disk3s3s1 on / (apfs, sealed, local, read-only, journaled)

2、mountToRootfs

在 bind 完 rootfs 这个 mountpoint 后,会根据 config.Mounts 中的配置,去逐个创建对应的 mount,这里就是处理我们挂载的地方:

func mountToRootfs(m *configs.Mount, c *mountConfig) error {
	rootfs := c.root
	mountLabel := c.label
	mountFd := c.fd
	dest, err := securejoin.SecureJoin(rootfs, m.Destination)
	if err != nil {
		return err
	switch m.Device {
		case "proc", "sysfs": ...
		case "mqueue": ...
		case "tmpfs": ...
		case "bind": ...
		case "cgroup": ...
		default: ...
	if err := setRecAttr(m, rootfs); err != nil {
		return err
	return nil

整体的流程不难看懂,不同的类型有不通的 mount 流程,而最后的 setRecAttr 感兴趣的可以看下 mount_setattr(2)

就以 proc/sysfs 为例,就是检查一下 dst,确保是一个目录,并且不能是 symlink。注释这里提到了有意思的 symlink-exchange attacks,感兴趣的可以看看 mounts outside,提到了 symlink 导致的 mount 逃逸,讲的十分详细(其实我也就大略看了一下)。

	case "proc", "sysfs":
		// If the destination already exists and is not a directory, we bail
		// out This is to avoid mounting through a symlink or similar -- which
		// has been a "fun" attack scenario in the past.
		// TODO: This won't be necessary once we switch to libpathrs and we can
		//       stop all of these symlink-exchange attacks.
		if fi, err := os.Lstat(dest); err != nil {
			if !os.IsNotExist(err) {
				return err
		} else if fi.Mode()&os.ModeDir == 0 {
			return fmt.Errorf("filesystem %q must be mounted on ordinary directory", m.Device)
		if err := os.MkdirAll(dest, 0o755); err != nil {
			return err
		// Selinux kernels do not support labeling of /proc or /sys
		return mountPropagate(m, rootfs, "", nil)

底层调用的都是 mountPropagate ,这是 runc 对 mount 的一层安全封装,确保没有一些恶意挂载:

// Do the mount operation followed by additional mounts required to take care
// of propagation flags. This will always be scoped inside the container rootfs.
func mountPropagate(m *configs.Mount, rootfs string, mountLabel string, mountFd *int) error {}

其他类型的挂载实际上大同小异,它们基本都围绕 “安全” 为核心,对挂载做各种检查,并执行。

3、setupDev

	if setupDev {
		if err := createDevices(config); err != nil {
			return fmt.Errorf("error creating device nodes: %w", err)
		if err := setupPtmx(config); err != nil {
			return fmt.Errorf("error setting up ptmx: %w", err)
		if err := setupDevSymlinks(config.Rootfs); err != nil {
			return fmt.Errorf("error setting up /dev symlinks: %w", err)

这块内容略过,主要我也不是很了解比如 mknod 之类的指令。对 linux 有一定了解的小伙伴应该知道 dev 指的是设备,对应 /dev 目录。

我们知道 docker 可以用 --device 来绑定设备,createDevices 本质上也是通过 mount 来完成的,它这里会将 host 的设备通过 bind 或者 mknode 到容器目录中。

setupPtmx 是将 pts/ptmx 软链到了容器中,以便支持 pty。最后部分的 setupDevSymlinks 则是一些小优化,比如它会把标准输入输出的 fd 通过软链放到 /dev 底下。

4、容器初始化时简单的 hook

4.1 syncParentHooks

	// Signal the parent to run the pre-start hooks.
	// The hooks are run after the mounts are setup, but before we switch to the new
	// root, so that the old root is still available in the hooks for any mount
	// manipulations.
	// Note that iConfig.Cwd is not guaranteed to exist here.
	if err := syncParentHooks(pipe); err != nil {
		return err

这块内容与主题无关,不过有点小意思。我们知道 runc 由父进程来创建 namespace,再由子进程来初始化容器,这里就用了 Pipe 来实现 PreStart,这个点正好是还没 chroot/pivot_root 的时候,理论上是可以做一些危险操作的,不过要注意,这个调用是发生在父进程:

// syncParentHooks sends to the given pipe a JSON payload which indicates that
// the parent should execute pre-start hooks. It then waits for the parent to
// indicate that it is cleared to resume.
func syncParentHooks(pipe io.ReadWriter) error {
	// Tell parent.
	if err := writeSync(pipe, procHooks); err != nil {
		return err
	// Wait for parent to give the all-clear.
	return readSync(pipe, procResume)

4.2 createContainerHooks

这个 Hooks 则是发生在当前进程(容器主进程),代码很简单,不多说:

	// The reason these operations are done here rather than in finalizeRootfs
	// is because the console-handling code gets quite sticky if we have to set
	// up the console before doing the pivot_root(2). This is because the
	// Console API has to also work with the ExecIn case, which means that the
	// API must be able to deal with being inside as well as outside the
	// container. It's just cleaner to do this here (at the expense of the
	// operation not being perfectly split).
	if err := unix.Chdir(config.Rootfs); err != nil {
		return &os.PathError{Op: "chdir", Path: config.Rootfs, Err: err}
	s := iConfig.SpecState
	s.Pid = unix.Getpid()
	s.Status = specs.StateCreating
	if err := iConfig.Config.Hooks[configs.CreateContainer].RunHooks(s); err != nil {
		return err

5、msMoveRoot/chroot/pivotRoot

我们知道,进入容器后,只能看到容器内的目录,这实际上就是这上面三个命令的功劳。可能大家最熟悉的就是 chroot,这个 jail 技术已经存在很多年了。

不过在 runc 中,chroot 并不是最优选择,chroot 设计之初就不是为了创建一个安全且隔离的环境,它存在不少限制。其实从 man page 的定义中就可以看出 pivotRoot 和 chRoot 的底层原理是不同的:

chroot - run command or interactive shell with special root directory
pivot_root - change the root mount

chroot 是改变了 cmd/shell 的 root dir,而 pivot_root 是直接改了 root mount,chroot 有一个著名的越狱方案就是在 chroot 中调用 chroot,这里直接贴一下维基百科的说法:

chroot 机制的设计中,并不包括抵抗特权用户(root)的蓄意篡改。在大多数的系统中,chroot环境没有设计出适当的堆栈,所以一个在chroot下执行的程序,可能会透过第二次chroot来获得足够权限,逃出chroot的限制。为了减轻这种安全漏洞所带来的风险,在使用chroot后,在chroot下执行的程序,应该尽快放弃root权限,或是改用其他机制来替代,例如FreeBSD jail。在某些操作系统中,例如FreeBSD,已经采取预防措施,来防止第二次chroot的攻击[1]

  • 在支持设备节点的文件系统中,一个在chroot中的root用户仍然可以创建设备节点和挂载在chroot根目录的文件系统;尽管,chroot机制不是被打算用来阻止低特权用户级访问系统设备。

  • 在启动时,程序都期望能在某些预设位置找到scratch space,配置文件,设备节点共享库。对于一个成功启动的被chroot的程序,在chroot目录必须最低限度配备的这些文件设置。这使得chroot难以作为一般的沙箱来使用。

  • 只有root用户可以执行chroot。这是为了防止用户把一个setuid的程序放入一个特制的chroot监牢(例如一个有着假的/etc/passwd/etc/shadow文件的chroot监牢)由于引起提权攻击。

  • 在chroot的机制本身也不是为限制资源的使用而设计,如I/O,带宽,磁盘空间或CPU时间。大多数Unix系统都没有以完全文件系统为导向,以即给可能通过网络和过程控制,通过系统调用接口来提供一个破坏chroot的程序。

  • msMoveRoot 本质上也是调用了 chroot,是一个 chroot 的安全加强版:

    // Before we move the root and chroot we have to mask all "full" sysfs and
    	// procfs mounts which exist on the host. This is because while the kernel
    	// has protections against mounting procfs if it has masks, when using
    	// chroot(2) the *host* procfs mount is still reachable in the mount
    	// namespace and the kernel permits procfs mounts inside --no-pivot
    	// containers.
    	// Users shouldn't be using --no-pivot except in exceptional circumstances,
    	// but to avoid such a trivial security flaw we apply a best-effort
    	// protection here. The kernel only allows a mount of a pseudo-filesystem
    	// like procfs or sysfs if there is a *full* mount (the root of the
    	// filesystem is mounted) without any other locked mount points covering a
    	// subtree of the mount.
    	// So we try to unmount (or mount tmpfs on top of) any mountpoint which is
    	// a full mount of either sysfs or procfs (since those are the most
    	// concerning filesystems to us).
    	mountinfos, err := mountinfo.GetMounts(func(info *mountinfo.Info) (skip, stop bool) {
    		// Collect every sysfs and procfs filesystem, except for those which
    		// are non-full mounts or are inside the rootfs of the container.
    		if info.Root != "/" ||
    			(info.FSType != "proc" && info.FSType != "sysfs") ||
    			strings.HasPrefix(info.Mountpoint, rootfs) {
    			skip = true
    		return
    	if err != nil {
    		return err
    	for _, info := range mountinfos {
    		p := info.Mountpoint
    		// Be sure umount events are not propagated to the host.
    		if err := mount("", p, "", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
    			if errors.Is(err, unix.ENOENT) {
    				// If the mountpoint doesn't exist that means that we've
    				// already blasted away some parent directory of the mountpoint
    				// and so we don't care about this error.
    				continue
    			return err
    		if err := unmount(p, unix.MNT_DETACH); err != nil {
    			if !errors.Is(err, unix.EINVAL) && !errors.Is(err, unix.EPERM) {
    				return err
    			} else {
    				// If we have not privileges for umounting (e.g. rootless), then
    				// cover the path.
    				if err := mount("tmpfs", p, "", "tmpfs", 0, ""); err != nil {
    					return err
      // Move the rootfs on top of "/" in our mount namespace.
    	if err := mount(rootfs, "/", "", "", unix.MS_MOVE, ""); err != nil {
    		return err
    	return chroot()
    

    代码很长,实际上做的事情不复杂:这里把当前 ns 中的 proc/sysfs,且不属于 rootfs 底下的 mount 过滤出来 umount 掉了。

    最后做了一手 MS_MOVE,把 rootfs 这个 mount 挪到了 /,这里猜测是防止 chdir(../) chroot(/) 这种组合拳,因为把 mount 挪过去,原来的 rootfs mount 就不存在了,最后再执行一下 chroot。不过即使如此,runc 依旧不推荐使用 chroot。

    6、finalizeRootfs

    prepareRootfs 简单的过了一下,其实也就是它注释提到的那几件事情,进行设备、挂载点、fs的初始化,而 finalizeRootfs 是 prepareRootfs 的收尾。

    // finalizeRootfs sets anything to ro if necessary. You must call
    // prepareRootfs first.
    func finalizeRootfs(config *configs.Config) (err error) {
    	// All tmpfs mounts and /dev were previously mounted as rw
    	// by mountPropagate. Remount them read-only as requested.
    	for _, m := range config.Mounts {
    		if m.Flags&unix.MS_RDONLY != unix.MS_RDONLY {
    			continue
    		if m.Device == "tmpfs" || utils.CleanPath(m.Destination) == "/dev" {
    			if err := remountReadonly(m); err != nil {
    				return err
    	// set rootfs ( / ) as readonly
    	if config.Readonlyfs {
    		if err := setReadonly(); err != nil {
    			return fmt.Errorf("error setting rootfs as readonly: %w", err)
    	if config.Umask != nil {
    		unix.Umask(int(*config.Umask))
    	} else {
    		unix.Umask(0o022)
    	return nil
    

    finalizeRootfs 第一段代码实际上是和之前的 mountPropagate(就那个 runc 对 mount 操作的安全封装)交相呼应:

    // Do the mount operation followed by additional mounts required to take care
    // of propagation flags. This will always be scoped inside the container rootfs.
    func mountPropagate(m *configs.Mount, rootfs string, mountLabel string, mountFd *int) error {
    	var (
    		data  = label.FormatMountLabel(m.Data, mountLabel)
    		flags = m.Flags
    	// Delay mounting the filesystem read-only if we need to do further
    	// operations on it. We need to set up files in "/dev", and other tmpfs
    	// mounts may need to be chmod-ed after mounting. These mounts will be
    	// remounted ro later in finalizeRootfs(), if necessary.
    	if m.Device == "tmpfs" || utils.CleanPath(m.Destination) == "/dev" {
    		flags &= ^unix.MS_RDONLY
    

    由于 tmpfs 和 /dev 有可能会在 mount 之后做一些初始化,或者 chmod,所以当初挂的时候即使是 MS_RDONLY,也 ^ 掉了,在最后 finalize 的时候,如果配置了 MS_RDONLY,再 remount 一下,让它真正 mount 成 MS_RDONLY。

    第二段代码也是一样的道理,如果配置中设置了 Readonlyfs,同样也是在最后关头再设置成只读。

    第三段代码做了一个 umask,我们只看这个默认的 umask 022,因为正常的文件/目录如果未经过设置,是 666/777,umask 实际上就是把其他用户组的写权限拿掉,变成 644/755。

    二、 runc 的使用

    分析到这里,我们发现,runc 并没有对 rootfs 这个 mountpoint 是什么挂载去做定义,只是做了下 bind,而且允许我们自由定义 rootfs 如何挂载。至少看到这里,我们认为从外部提供一个 mount 配置挂载到 root 上是可行的。不如体验一下在不同的配置下, runc 是如何为我们生成容器的。

    1、使用 runc 创建并进入容器

    我们先跑下 runc spec && cat config.json,通过这个命令,能提供一个缺省的配置,实际上这坨配置就是 OCI-runtime-spec,描述如下:

    "ociVersion": "1.0.2-dev", "process": { "terminal": true, "user": { "uid": 0, "gid": 0 "args": [ "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "TERM=xterm" "cwd": "/", "capabilities": { "bounding": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" "effective": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" "permitted": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" "ambient": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" "rlimits": [ "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 "noNewPrivileges": true "root": { "path": "rootfs", "readonly": true "hostname": "runc", "mounts": [ "destination": "/proc", "type": "proc", "source": "proc" "destination": "/dev", "type": "tmpfs", "source": "tmpfs", "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" "destination": "/dev/pts", "type": "devpts", "source": "devpts", "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" "destination": "/dev/shm", "type": "tmpfs", "source": "shm", "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=65536k" "destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue", "options": [ "nosuid", "noexec", "nodev" "destination": "/sys", "type": "sysfs", "source": "sysfs", "options": [ "nosuid", "noexec", "nodev", "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "nosuid", "noexec", "nodev", "relatime", "linux": { "resources": { "devices": [ "allow": false, "access": "rwm" "namespaces": [ "type": "pid" "type": "network" "type": "ipc" "type": "uts" "type": "mount" "maskedPaths": [ "/proc/acpi", "/proc/asound", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/sys/firmware", "/proc/scsi" "readonlyPaths": [ "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger"

    不过 rootfs 这个缺省目录下并没有一套根文件系统(现在都不存在这个目录),直接运行肯定是会报错的,如下:

    > runc run config.json
    ERRO[0000] runc run failed: invalid rootfs: stat /root/rootfs: no such file or directory
    

    这里借助 docker export 了一个 busybox 的根文件系统,并放在 /root/rootfs 下,并将刚才的配置 root.path 修改为 /root/rootfs:

    VERSION bin custom dev etc home json lib lib64 proc root tmp usr var /root/rootfs

    执行 runc 命令启动此容器:

    > runc run config.json
    / # ls
    VERSION  bin      custom   dev      etc      home     json     lib      lib64    proc     root     sys      tmp      usr      var
    / # echo $$
    / # ps -ef
    PID   USER     TIME  COMMAND
        1 root      0:00 sh
        8 root      0:00 ps -ef
    / # mount
    /dev/vda1 on / type ext4 (ro,noatime)
    / # echo something > test.log
    sh: can't create test.log: Read-only file system
    

    确实如配置那样,rootfs 被设置为只读,对应我们在第一小节里讲到的 finalizeRootfs 中的第二段操作。

    	"root": {
    		"path": "rootfs",
    		"readonly": true
    

    2、尝试在 chroot 下进行越狱

    我们试试在不安全的 chroot 底下进行越狱,先改一下刚才生成的配置,把 capabilities 的权限打开,不然有些命令比如 chroot 跑不了。另外就是 namespaces 需要去掉 mount(就是 NEWNS),如果打开了 NEWNS,根据我们前面的源码分析,它会自动去进行 pivot_root。最后再去掉 MaskPaths 和 ReadonlyPaths,否则无法通过安全检查:

    "ociVersion": "1.0.2-dev", "process": { "terminal": true, "user": { "uid": 0, "gid": 0 "args": [ "bash" "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "TERM=xterm" "cwd": "/", "capabilities": { "bounding": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_MKNOD", "CAP_SYS_ADMIN" "effective": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_MKNOD", "CAP_SYS_ADMIN" "permitted": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_MKNOD", "CAP_SYS_ADMIN" "ambient": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_MKNOD", "CAP_SYS_ADMIN" "rlimits": [ "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 "noNewPrivileges": true "root": { "path": "/root/ubuntu" "hostname": "runc", "mounts": [ "destination": "/proc", "type": "proc", "source": "proc" "destination": "/dev", "type": "tmpfs", "source": "tmpfs", "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" "destination": "/dev/pts", "type": "devpts", "source": "devpts", "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" "destination": "/dev/shm", "type": "tmpfs", "source": "shm", "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=65536k" "destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue", "options": [ "nosuid", "noexec", "nodev" "destination": "/sys", "type": "sysfs", "source": "sysfs", "options": [ "nosuid", "noexec", "nodev", "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "nosuid", "noexec", "nodev", "relatime", "linux": { "resources": { "devices": [ "allow": false, "access": "rwm" "namespaces": [ "type": "pid" "type": "network" "type": "ipc" "type": "uts"

    进入容器后,我们执行越狱教程中提供的代码,成功 break out:

    // 进入容器
    [root@master ~]# runc run config.json
    root@runc:/# ls -la
    total 104
    drwxr-xr-x  21 root root  4096 Feb 16 13:56 .
    drwxr-xr-x  21 root root  4096 Feb 16 13:56 ..
    -rwxr-xr-x   1 root root     0 Feb 14 03:20 .dockerenv
    lrwxrwxrwx   1 root root     7 Jan 26 02:03 bin -> usr/bin
    drwxr-xr-x   2 root root  4096 Apr 18  2022 boot
    -rwxr-xr-x   1 root root 29160 Feb 16 10:22 break
    drwxr-xr-x   2 root root  4096 Feb 16 14:00 d1r1
    drwxr-xr-x   2 root root  4096 Feb 16 14:00 d1r2
    drwxr-xr-x   2 root root  4096 Feb 16 14:01 d1r3
    drwxr-xr-x   5 root root   360 Feb 16 14:40 dev
    drwxr-xr-x  32 root root  4096 Feb 14 03:20 etc
    drwxr-xr-x   2 root root  4096 Apr 18  2022 home
    lrwxrwxrwx   1 root root     7 Jan 26 02:03 lib -> usr/lib
    lrwxrwxrwx   1 root root     9 Jan 26 02:03 lib32 -> usr/lib32
    lrwxrwxrwx   1 root root     9 Jan 26 02:03 lib64 -> usr/lib64
    lrwxrwxrwx   1 root root    10 Jan 26 02:03 libx32 -> usr/libx32
    drwxr-xr-x   2 root root  4096 Jan 26 02:03 media
    drwxr-xr-x   2 root root  4096 Jan 26 02:03 mnt
    drwxr-xr-x   2 root root  4096 Jan 26 02:03 opt
    dr-xr-xr-x 375 root root     0 Feb 16 14:40 proc
    drwx------   2 root root  4096 Feb 16 10:23 root
    drwxr-xr-x   6 root root  4096 Feb 14 03:20 run
    lrwxrwxrwx   1 root root     8 Jan 26 02:03 sbin -> usr/sbin
    drwxr-xr-x   2 root root  4096 Jan 26 02:03 srv
    dr-xr-xr-x  12 root root     0 Feb 16 14:40 sys
    drwxrwxrwt   2 root root  4096 Jan 26 02:06 tmp
    drwxr-xr-x  14 root root  4096 Jan 26 02:03 usr
    drwxr-xr-x  11 root root  4096 Jan 26 02:06 var
    drwxr-xr-x   2 root root  4096 Feb 16 10:22 waterbuffalo
    // 其实就是 chdir(..) + chroot(.)
    root@runc:/# ./break
    // 越狱成功
    [root@runc /]# ls -la
    total 18920
    dr-xr-xr-x  23 root root     4096 Feb 16 22:39 .
    dr-xr-xr-x  23 root root     4096 Feb 16 22:39 ..
    drwxr-xr-x   3 root root     4096 Jan  8  2021 agent
    drwxr-xr-x   3 root root     4096 Nov 24 16:03 api-helm
    -rw-r--r--   1 root root        0 Oct 30  2020 .autorelabel
    lrwxrwxrwx   1 root root        7 Dec 14  2020 bin -> usr/bin
    dr-xr-xr-x   5 root root     4096 Nov 30  2021 boot
    -rwxr-xr-x   1 root root 19261816 Dec  8 14:30 cloud-agent
    -rw-------   1 root root    12288 Nov 25 11:27 .conf.txt.swp
    drwxr-xr-x  13 root root     4096 Feb 16 22:39 data
    drwxr-xr-x  17 root root    14140 Sep  9 16:20 dev
    drwxr-xr-x 108 root root    12288 Feb 16 10:59 etc
    drwxr-xr-x   2 root root     4096 Dec 14  2020 home
    lrwxrwxrwx   1 root root        7 Dec 14  2020 lib -> usr/lib
    lrwxrwxrwx   1 root root        9 Dec 14  2020 lib64 -> usr/lib64
    drwx------   2 root root    16384 Aug 18  2020 lost+found
    drwxr-xr-x   2 root root     4096 Dec 14  2020 media
    drwxr-xr-x   2 root root     4096 Dec 14  2020 mnt
    drwxr-xr-x   6 root root     4096 Sep  9 16:31 opt
    dr-xr-xr-x 375 root root        0 Sep  9 16:19 proc
    dr-xr-x---  31 root root     4096 Feb 16 22:34 root
    drwxr-xr-x   2 root root     4096 Dec  5 11:40 rot
    drwxr-xr-x  33 root root     1120 Feb 16 14:59 run
    lrwxrwxrwx   1 root root        8 Dec 14  2020 sbin -> usr/sbin
    drwxr-xr-x   2 root root     4096 Dec 14  2020 srv
    dr-xr-xr-x  12 root root        0 Sep  9 16:19 sys
    drwxr-xr-x   2 root root     4096 Dec  5 11:33 test
    drwxrwxrwt   4 root root     4096 Feb 16 22:35 tmp
    drwxr-xr-x  12 root root     4096 Nov 30  2021 usr
    drwxr-xr-x  21 root root     4096 Sep  9 15:53 var
    [root@runc /]#
    

    至此也算是告一段落了,起码我们粗浅地了解了整个容器是如何初始化的,以及我们知道了,容器的 rootfs 是可以随意指定目录的。不过,开头提到的问题还是没能回答。不管是 Docker,还是 K8s,实际上都无法直接进行下列操作(k8s -> containerd 允许这么操作,但是容器创建不出来):

    > docker run -it -d --name xxx -p 8091:8090 -v /xxx:/ ubuntu
    docker: Error response from daemon: invalid volume specification: '/xxx:/': invalid mount config for type "bind": invalid specification: destination can't be '/'.
    See 'docker run --help'.
    

    后续,我们将尝试从 OCI、CRI 的角度再度探讨这个问题。

    文章如有错误,感谢指正。

    Reference

  • 小手册:https://man7.org/linux/man-pages/
  • runc:https://github.com/opencontainers
  • 大佬博客文章 Linux: Mount Shared Subtrees:https://pages.dogdog.run/tech/mount_subtree.html
  •