Intercepting and Emulating System Calls with ptrace

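Background

Google's gVisor sandbox project intercepts system calls in two different ways:

- KVM mode
- ptrace mode

Safe and Secure Subprocess Virtualization in Userspace also describes using ptrace to intercept system calls.

Until now, all I knew about ptrace was that the strace command is built on it; I knew nothing about how it works underneath. This note therefore records how ptrace can be used to intercept system calls.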

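ptrace definition

ptrace(2) ("process trace") is a Linux system call that is typically used to implement process debugging.

With ptrace, a tracer can pause the execution of a tracee, read and modify the tracee's memory and registers, observe its system calls, and even intercept and emulate those system calls.

The function prototype of ptrace: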

long ptrace(int request, pid_t pid, void *addr, void *data);

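request: the ptrace operation to perform. Commonly used values include:
- PTRACE_TRACEME: the calling process asks to be traced by its parent
- PTRACE_SYSCALL: resumes the tracee and stops it again at the next system call entry or, if it is already stopped at an entry, once the system call has finished in the kernel and is about to return, so that the tracer can read the return value
- PTRACE_GETREGS: reads the tracee's current register values
- For the remaining request values, see the ptrace(2) Linux manual page
pid: the process ID of the tracee
addr and data: general-purpose arguments whose meaning depends on request. For example, when reading the tracee's registers, data holds the address of the variable that receives the register contents; for requests that use neither argument, 0 is conventionally passed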
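
To make the roles of pid, addr and data concrete, here is a minimal sketch, assuming Linux with glibc. It uses PTRACE_PEEKDATA, a request documented in ptrace(2) but not otherwise covered in this note, to read one word of the tracee's memory: addr is the address inside the tracee and data is unused. The helper name peek_child_word is made up for illustration.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that asks to be traced and stops itself,
 * then read one word of the child's memory at `addr`. Because the child is a
 * fork of the caller, `addr` is valid in the child and holds the same value.
 * Returns 0 and stores the word in *out on success, -1 on failure. */
static int peek_child_word(const long *addr, long *out)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* PTRACE_TRACEME ignores pid, addr and data */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP); /* stop so that the parent can inspect us */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1; /* the child exited instead of stopping */

    /* PEEKDATA returns the word itself, so -1 is ambiguous: check errno */
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
    int failed = (word == -1 && errno != 0);

    /* The child has served its purpose: kill and reap it */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);

    if (failed)
        return -1;
    *out = word;
    return 0;
}
```

Calling peek_child_word(&some_long, &result) from a process that is allowed to use ptrace should copy the value of some_long, as seen by the child, into result.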

How strace works on top of ptrace

A simplified version of the strace command built on the ptrace system call looks like this:

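#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Print an error message and exit. */
#define FATAL(...) \
    do { \
        fprintf(stderr, "strace: " __VA_ARGS__); \
        fputc('\n', stderr); \
        exit(EXIT_FAILURE); \
    } while (0)

int
main(int argc, char **argv)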
{
    if (argc <= 1)
        FATAL("too few arguments: %d", argc);

    pid_t pid = fork();
    switch (pid) {
        case -1: /* error */
            FATAL("%s", strerror(errno));
        case 0:  /* child */
            ptrace(PTRACE_TRACEME, 0, 0, 0);
            /* Because we're now a tracee, execvp will block until the parent
             * attaches and allows us to continue. */
            execvp(argv[1], argv + 1);
            FATAL("%s", strerror(errno));
    }

    /* parent */
    waitpid(pid, 0, 0); // sync with execvp
    ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);

    for (;;) {
        /* Enter next system call */
        if (ptrace(PTRACE_SYSCALL, pid, 0, 0) == -1)
            FATAL("%s", strerror(errno));
        if (waitpid(pid, 0, 0) == -1)
            FATAL("%s", strerror(errno));

        /* Gather system call arguments */
        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, pid, 0, &regs) == -1)
            FATAL("%s", strerror(errno));
        long syscall = regs.orig_rax;

        /* Print a representation of the system call */
        fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
                syscall,
                (long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
                (long)regs.r10, (long)regs.r8,  (long)regs.r9);

        /* Run system call and stop on exit */
        if (ptrace(PTRACE_SYSCALL, pid, 0, 0) == -1)
            FATAL("%s", strerror(errno));
        if (waitpid(pid, 0, 0) == -1)
            FATAL("%s", strerror(errno));

        /* Get system call result */
        if (ptrace(PTRACE_GETREGS, pid, 0, &regs) == -1) {
            fputs(" = ?\n", stderr);
            if (errno == ESRCH)
                exit(regs.rdi); // system call was _exit(2) or similar
            FATAL("%s", strerror(errno));
        }

        /* Print system call result */
        fprintf(stderr, " = %ld\n", (long)regs.rax);
    }
}
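
The listing above prints rax as a plain number. On Linux, a raw return value between -4095 and -1 is a negated errno code, so a tracer can decode failures in a way similar to strace. The following is only a sketch of that idea; format_syscall_result is a hypothetical helper, not part of the code above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a raw system call return value (for example
 * regs.rax cast to long) into buf. Values in [-4095, -1] are treated as
 * negated errno codes and shown as "-1 <message>"; everything else is
 * printed as a plain decimal number. Returns what snprintf returns. */
static int format_syscall_result(long ret, char *buf, size_t len)
{
    if (ret < 0 && ret >= -4095)
        return snprintf(buf, len, "-1 %s", strerror((int)-ret));
    return snprintf(buf, len, "%ld", ret);
}
```

In the strace loop this could replace the final fprintf, for example by formatting (long)regs.rax into a small stack buffer and printing that buffer.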

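The flow of the code above can be summarized as follows:

fork() creates the child that becomes the tracee; the child calls ptrace(PTRACE_TRACEME, 0, 0, 0) before execvp() so that its parent traces it
The parent calls ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL) so that the tracee is killed when the tracer (the strace process) exits
The parent then enters an infinite loop that repeats the following steps: stop at the tracee's system call entry -> read the system call arguments -> run the system call and stop when it returns -> read the system call's return value
- Stop at the system call entry: ptrace(PTRACE_SYSCALL, pid, 0, 0) resumes the tracee and stops it at the entry of its next system call
- Read the arguments: ptrace(PTRACE_GETREGS, pid, 0, &regs) fetches the tracee's current register values; the system call number is in orig_rax and the arguments are in rdi, rsi, rdx, r10, r8 and r9
- Run the system call and stop on return: a second ptrace(PTRACE_SYSCALL, pid, 0, 0) lets the system call execute and stops the tracee once it has finished
- Read the result: a second ptrace(PTRACE_GETREGS, pid, 0, &regs) fetches the return value, which is in rax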
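
The loop described above ignores the status that waitpid reports, so it only notices that the tracee has exited when a later ptrace call fails. A variation, sketched here under the assumption of Linux with glibc, checks that status on every iteration. It also sets PTRACE_O_TRACESYSGOOD, an option from ptrace(2) that the original code does not use, so that system call stops can be told apart from other stops. The function name count_syscall_stops is hypothetical.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] under ptrace and count its system call
 * stops (entries and exits) until it terminates.
 * Returns the number of stops, or -1 on error. */
static long count_syscall_stops(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1; /* the child exited before exec completed */

    /* TRACESYSGOOD makes system call stops report SIGTRAP | 0x80 */
    long opts = PTRACE_O_EXITKILL | PTRACE_O_TRACESYSGOOD;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *)opts) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    long stops = 0;
    for (;;) {
        /* Simplification: data is NULL, so pending signals are suppressed */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) == -1) {
            kill(pid, SIGKILL);
            waitpid(pid, NULL, 0);
            return -1;
        }
        if (waitpid(pid, &status, 0) == -1)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return stops; /* the tracee is gone: leave the loop cleanly */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
}
```

Passing an argument vector such as {"true", NULL} should yield a positive count, since even a trivial program makes several system calls while it starts up and exits.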

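Intercepting and emulating system calls

As the example above shows, PTRACE_SYSCALL can stop a tracee at a system call and then resume it. To emulate a system call, the tracer additionally has to carry out the call's work itself and keep the kernel from executing the real system call.

A simple example that intercepts and filters system calls: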

for (;;) {
    /* Enter next system call */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    /* Is this system call permitted? */
    int blocked = 0;
    if (is_syscall_blocked(regs.orig_rax)) {
        blocked = 1;
        regs.orig_rax = -1; // set to invalid syscall
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }

    /* Run system call and stop on exit */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    if (blocked) {
        /* errno = EPERM */
        regs.rax = -EPERM; // Operation not permitted
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }
}
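
The filter above calls is_syscall_blocked without defining it. One possible shape is a small deny-list lookup, sketched below. The choice of blocked calls is purely illustrative and is an assumption; the original does not say which calls it blocks.

```c
#include <stddef.h>
#include <sys/syscall.h>

/* Illustrative deny-list (an assumption, not taken from the original):
 * a few calls that modify the file system. */
static const long blocked_syscalls[] = {
    SYS_unlinkat,
    SYS_mkdirat,
    SYS_symlinkat,
};

/* Return 1 if system call number nr is on the deny-list, 0 otherwise. */
static int is_syscall_blocked(long nr)
{
    size_t n = sizeof(blocked_syscalls) / sizeof(blocked_syscalls[0]);
    for (size_t i = 0; i < n; i++) {
        if (blocked_syscalls[i] == nr)
            return 1;
    }
    return 0;
}
```

A linear scan is enough for a handful of entries; a tracer with a long list might prefer a bitmap indexed by system call number.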

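System calls can be intercepted and emulated with the following steps:

Use PTRACE_SYSCALL to stop at the system call entry
Use PTRACE_GETREGS to read which system call is being made (its number and arguments)
Use PTRACE_SETREGS to overwrite the system call number with an invalid one
Finally, use PTRACE_SYSCALL again to resume the tracee. The kernel rejects the invalid number instead of running the real system call, and at the following exit stop the tracer performs the emulation in user space and writes the emulated result into the tracee's registers

However, this way of intercepting and emulating system calls is inefficient: each emulated call costs at least 4 ptrace system calls, and every one of them is a switch between user mode and kernel mode.

Since 2005 (Linux 2.6.14), the kernel has offered a ptrace request type for exactly this purpose, PTRACE_SYSEMU, which reduces the extra context-switch overhead of emulating system calls. With PTRACE_SYSEMU, intercepting and emulating a system call takes these steps: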

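Use PTRACE_SYSEMU to stop at the system call entry
Use PTRACE_GETREGS to read which system call is being made (its number and arguments)
The tracer then emulates the system call and returns the emulated result; the kernel does not execute the call

The key logic of a simple PTRACE_SYSEMU-based interceptor and emulator is shown below: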

for (;;) {
    ptrace(PTRACE_SYSEMU, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    switch (regs.orig_rax) {
        case OS_read:
            /* ... */

        case OS_write:
            /* ... */

        case OS_open:
            /* ... */

        case OS_exit:
            /* ... */

        /* ... and so on ... */
    }
}
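
The skeleton above leaves the emulation itself elided. As a small, concrete illustration of the mechanism, the sketch below emulates a single getpid call and hands the tracee a made-up result. It assumes x86-64 Linux with glibc and a kernel that supports PTRACE_SYSEMU; emulate_getpid_once and fake_pid are hypothetical names. After the one emulated call it resumes the tracee with PTRACE_CONT, so that the tracee's final _exit runs for real.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __x86_64__
#error "this sketch uses x86-64 register names (orig_rax, rax)"
#endif

/* Hypothetical helper: emulate one getpid call in a child, returning
 * fake_pid to it. The child exits with 0 if it saw fake_pid, 1 otherwise.
 * Returns the child's exit status, or -1 on error. */
static int emulate_getpid_once(long fake_pid)
{
    struct user_regs_struct regs;
    int status;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(2);
        kill(getpid(), SIGSTOP);        /* wait for the tracer */
        long got = syscall(SYS_getpid); /* intercepted, never reaches the kernel's getpid */
        _exit(got == fake_pid ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;

    /* Resume; the next system call stops at entry and is not executed */
    if (ptrace(PTRACE_SYSEMU, pid, NULL, NULL) == -1)
        goto fail;
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        goto fail;

    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
        goto fail;
    if ((long)regs.orig_rax == SYS_getpid) {
        regs.rax = (unsigned long long)fake_pid; /* the emulated result */
        if (ptrace(PTRACE_SETREGS, pid, NULL, &regs) == -1)
            goto fail;
    }

    /* Stop emulating: let the remaining system calls run normally */
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
        goto fail;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;

fail:
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return -1;
}
```

A full emulator would instead stay in a PTRACE_SYSEMU loop and handle every call, including process exit, in the tracer, as the switch statement above suggests.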

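Compared with PTRACE_SYSCALL, the PTRACE_SYSEMU approach therefore saves 2 ptrace system calls per emulated system call, which improves the overall performance of system call emulation.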
