bpf check_alu_op vulnerability analysis (CVE-2017-16995)

Around December 2017, CVE-2017-16995 was published:

The check_alu_op function in kernel/bpf/verifier.c in the Linux kernel through 4.14.8 allows local users to cause a denial of service (memory corruption) or possibly have unspecified other impact by leveraging incorrect sign extension.

CVE Database

I came across this kernel exploit while rooting a machine on HackTheBox and used it to gain root privileges. That is what led me to try to understand it.

First, we’ll want to understand: what is BPF, exactly?

Berkeley Packet Filter (BPF) lets you run filters against packets in kernel space. Filters are a set of instructions running inside the BPF virtual machine. Since Linux 3.18, eBPF (extended BPF) has been available, and it allows attaching such programs to sockets, tracepoints (for debugging purposes), and more. For example, tcpdump uses BPF bytecode to filter packets. You can view the BPF instructions if you add the -d argument:

user@ubuntu:~$ sudo tcpdump "port 80" -d
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 10
(...)
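
To make this concrete, here is a minimal sketch (my illustration, not from the original post) of attaching a classic BPF filter to a raw packet socket with setsockopt(SO_ATTACH_FILTER). The single-instruction “accept everything” filter stands in for a real filter like the tcpdump one above:

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
	/* One-instruction classic BPF program: return 0xFFFFFFFF, i.e.
	 * "pass the whole packet". A real filter would hold the
	 * ldh/jeq/... sequence shown in the tcpdump -d output. */
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, 0xFFFFFFFF),
	};
	struct sock_fprog prog = {
		.len    = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	/* Raw packet socket; needs CAP_NET_RAW (run as root). */
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
		perror("setsockopt(SO_ATTACH_FILTER)");
		return 1;
	}
	printf("filter attached\n");
	return 0;
}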

The BPF Verifier

The bpf syscall essentially allows running code in kernel space from user space. That’s a security risk, hence the BPF verifier. Its purpose is to make sure the instructions are safe to run. Here are some of the checks it performs (a minimal program-load sketch follows the list):

  1. Verifies the program isn’t too big
  2. Forbids loops (DAG check)
  3. Forbids unreachable instructions from existing in the code
  4. Forbids reading from uninitialized registers
  5. Forbids exiting the filter without setting a return value first
  6. Verifies the program isn’t accessing invalid memory
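
Here is a minimal sketch of how a program reaches the verifier (my illustration, assuming a 64-bit build): load the smallest acceptable eBPF program via the bpf(2) syscall and ask for the verifier log. The field names come from union bpf_attr in <linux/bpf.h>:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static char log_buf[65536];

int main(void)
{
	/* "r0 = 0; exit;" -- a trivially valid program */
	struct bpf_insn insns[] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
		{ .code = BPF_JMP | BPF_EXIT },
	};

	union bpf_attr attr = {
		.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
		.insn_cnt  = sizeof(insns) / sizeof(insns[0]),
		.insns     = (unsigned long)insns,
		.license   = (unsigned long)"GPL",
		.log_buf   = (unsigned long)log_buf,
		.log_size  = sizeof(log_buf),
		.log_level = 1,
	};

	int fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
	if (fd < 0)
		fprintf(stderr, "verifier rejected the program:\n%s\n", log_buf);
	else
		printf("program accepted, fd=%d\n", fd);
	return 0;
}

If any of the checks above fails (an uninitialized register read, a missing return value, and so on), the load fails and log_buf explains why.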

The Vulnerability

The vulnerable function check_alu_op is responsible for checking the validity of arithmetic instructions. When the static code analyzer wants to MOV an immediate value into a register, the following code runs in the verifier:

/* case: R = imm
 * remember the value we stored into this reg
 */
regs[insn->dst_reg].type = SCALAR_VALUE;
__mark_reg_known(regs + insn->dst_reg, insn->imm);

Let’s look at the bpf_insn struct, and the signature of __mark_reg_known:

struct bpf_insn {
 __u8 code;  /* opcode */
 __u8 dst_reg:4; /* dest register */
 __u8 src_reg:4; /* source register */
 __s16 off;  /* signed offset */
 __s32 imm;  /* signed immediate constant */
};
/* Mark the unknown part of a register (variable offset or scalar value) as
 * known to have the value @imm.
 */
static void __mark_reg_known(struct bpf_reg_state *reg, u64 imm)

By looking at the struct, we can see that the immediate value is a signed 32-bit integer, and the imm parameter in __mark_reg_known is an unsigned 64-bit integer. Therefore, an implicit type conversion occurs.

But what happens when we convert a signed integer into an unsigned one? According to the C standard, the conversion works like this:

while (number < 0) {
    number += MAX_UNSIGNED_INT + 1
}

Let’s say we have this BPF instruction:

mov32 r2, 0xFFFFFFFF

The immediate is a signed 32-bit int, so 0xFFFFFFFF is actually -1 in decimal. We perform the above calculation with the 64-bit MAX_UNSIGNED_INT, 0xFFFFFFFFFFFFFFFF:

-1 += 0xFFFFFFFFFFFFFFFF + 1 // result = 0xFFFFFFFFFFFFFFFF

This is essentially sign extension: the most significant bit was copied into the upper 32 bits of the 64-bit “register”. What should have happened is zero-padding of the upper 32 bits, since this is a MOV32 operation.

This results in 0xFFFFFFFFFFFFFFFF being saved to the hypothetical register state in the static code analyzer. But in reality, when the instructions run, the register is zero-padded: 0x00000000FFFFFFFF. Now let’s try to understand why the verifier differs from the actual code execution.
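
A tiny user-space demonstration of the mismatch (my sketch, not kernel code): the same 32-bit immediate becomes two different 64-bit values depending on whether it is sign-extended, as in the unpatched verifier, or truncated to 32 bits first, as in the interpreter:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int32_t imm = (int32_t)0xFFFFFFFF;       /* insn->imm, i.e. -1 */

	/* What the verifier records: imm is implicitly converted to the
	 * u64 parameter of __mark_reg_known -> sign extension. */
	uint64_t verifier_view = (uint64_t)imm;

	/* What a 32-bit MOV actually does at runtime: truncate to 32 bits,
	 * leaving the upper half zero. */
	uint64_t runtime_value = (uint32_t)imm;

	printf("verifier: %#018llx\n", (unsigned long long)verifier_view); /* 0xffffffffffffffff */
	printf("runtime:  %#018llx\n", (unsigned long long)runtime_value); /* 0x00000000ffffffff */
	return 0;
}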

Let’s take a look at the BPF core code, specifically __bpf_prog_run:

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
{
	...
	static const void *jumptable[256] = {
		[0 ... 255] = &&default_label,
		/* Now overwrite non-defaults ... */
		/* 32 bit ALU operations */
		[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
		[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
		...
		[BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
		[BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
    ...

What we’re interested in is BPF_ALU | BPF_MOV | BPF_K. For reference: BPF_K means the source operand is an immediate value, while BPF_X means the source operand is a register. Let’s look at the operation:

ALU_MOV_K:
    DST = (u32) IMM;
    CONT;

What differs here from the simulated MOV operation in the verifier is that the immediate value is cast to an unsigned 32-bit integer. This prevents the incorrect sign extension, and the result is 0xFFFFFFFF (zero-extended to 0x00000000FFFFFFFF in the 64-bit register). Casting the immediate to u32 before tracking it is also essentially the fix that was deployed for this issue.
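
Roughly, the shape of the fix in check_alu_op (paraphrased, not the verbatim upstream patch) is to keep the full immediate only for 64-bit MOVs and truncate it to 32 bits for 32-bit MOVs, mirroring the interpreter:

regs[insn->dst_reg].type = SCALAR_VALUE;
if (BPF_CLASS(insn->code) == BPF_ALU64) {
	/* case: R = imm (64-bit MOV), keep the sign-extended value */
	__mark_reg_known(regs + insn->dst_reg, insn->imm);
} else {
	/* case: W = imm (32-bit MOV), truncate like the interpreter does */
	__mark_reg_known(regs + insn->dst_reg, (u32) insn->imm);
}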

Exploitation

We learned that the static code analyzer thinks the register holds a different value than it would during “real” code execution. The documentation in linux/kernel/bpf/verifier.c states:

bpf_check() is a static code analyzer that walks eBPF program instruction by instruction and updates register/stack state.
All paths of conditional branches are analyzed until 'bpf_exit' insn.

The common way to exploit this vulnerability is to trick the analyzer into thinking the code always exits with a conditional:

mov32 r2, 0xFFFFFFFF // r2 gets sign-extended in the verifier
jne r2, 0xFFFFFFFF, +2 // if r2 != 0xFFFFFFFF, jump 2 instructions ahead
mov64 r0, 0x0 // set return code to 0
bpf_exit // stop execution - verifier stops here
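
For illustration, here are the same four instructions written as struct bpf_insn initializers, the way a PoC might build them before handing them to bpf(BPF_PROG_LOAD) (my sketch; a real exploit appends its payload after the exit):

struct bpf_insn head[] = {
	/* mov32 r2, 0xFFFFFFFF -- verifier records r2 = 0xFFFFFFFFFFFFFFFF,
	 * runtime leaves r2 = 0x00000000FFFFFFFF */
	{ .code = BPF_ALU   | BPF_MOV | BPF_K, .dst_reg = 2, .imm = -1 },
	/* jne r2, 0xFFFFFFFF, +2 -- the verifier believes the jump is never
	 * taken; at runtime r2 differs from the sign-extended immediate,
	 * so the jump IS taken and the exit below is skipped */
	{ .code = BPF_JMP   | BPF_JNE | BPF_K, .dst_reg = 2, .off = 2, .imm = -1 },
	/* mov64 r0, 0x0 ; bpf_exit -- the only path the verifier analyzes */
	{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = 0, .imm = 0 },
	{ .code = BPF_JMP | BPF_EXIT },
	/* ...unverified instructions would follow here... */
};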

Let’s take a look at a part of the code that handles conditional jumps:

if (BPF_SRC(insn->code) == BPF_K &&
    (opcode == BPF_JEQ || opcode == BPF_JNE) &&
    dst_reg->type == SCALAR_VALUE &&
    tnum_equals_const(dst_reg->var_off, insn->imm)) {
	if (opcode == BPF_JEQ) {
		/* if (imm == imm) goto pc+off;
		 * only follow the goto, ignore fall-through
		 */
		*insn_idx += insn->off;
		return 0;
	} else {
		/* if (imm != imm) goto pc+off;
		 * only follow fall-through branch, since
		 * that's where the program will go
		 */
		return 0;
	}
}

tnum_equals_const is responsible for comparing the register state to the immediate value. We already know that insn->imm is a 32-bit signed integer, but inside the tnum struct, tnum.value is an unsigned 64-bit integer.
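
For reference, the tnum (“tracked number”) representation and the comparison helper look roughly like this (paraphrased from include/linux/tnum.h):

/* value holds the known bits, mask marks the unknown bits */
struct tnum {
	u64 value;
	u64 mask;
};

static inline bool tnum_is_const(struct tnum a)
{
	return !a.mask;
}

/* b is a u64, so passing the s32 insn->imm sign-extends it here too */
static inline bool tnum_equals_const(struct tnum a, u64 b)
{
	return tnum_is_const(a) && a.value == b;
}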

So what happens when we compare a signed and an unsigned integer? Simply put, the compiler converts the signed integer to an unsigned one, repeating the same sign extension we’ve looked at previously. Can you guess the output of this code?

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int a = 0xFFFFFFFF;              /* -1 */
    uint64_t b = 0xFFFFFFFFFFFFFFFF;
    if (a == b)                      /* a is sign-extended to uint64_t before the comparison */
        printf("true");
    else
        printf("false");
}

The output is true. After understanding this, it’s clear that the verifier follows the fall-through branch, reaches the bpf_exit call, and leaves any code after the exit unverified. At runtime, however, r2 holds the zero-extended value, which does not equal the sign-extended immediate, so the jump is taken and execution lands on the unverified instructions past the exit.

Leveraging this issue, an attacker can place malicious code after the exit call and gain arbitrary read/write access to kernel memory.

Extra information: eBPF Instructions Doc, PoC of Privilege Escalation, another PoC

Yanir Tsarimi

Security enthusiast, developer, and blogger (sometimes)
