mirror of
https://github.com/torvalds/linux.git
synced 2024-11-24 21:21:41 +00:00
11933cf1d9
The propagate_mnt() function handles mount propagation when creating mounts and propagates the source mount tree @source_mnt to all applicable nodes of the destination propagation mount tree headed by @dest_mnt. Unfortunately it contains a bug where it fails to terminate at peers of @source_mnt when looking up copies of the source mount that become masters for copies of the source mount tree mounted on top of slaves in the destination propagation tree causing a NULL dereference. Once the mechanics of the bug are understood it's easy to trigger. Because of unprivileged user namespaces it is available to unprivileged users. While fixing this bug we've gotten confused multiple times due to unclear terminology or missing concepts. So let's start this with some clarifications: * The terms "master" or "peer" denote a shared mount. A shared mount belongs to a peer group. * A peer group is a set of shared mounts that propagate to each other. They are identified by a peer group id. The peer group id is available in @shared_mnt->mnt_group_id. Shared mounts within the same peer group have the same peer group id. The peers in a peer group can be reached via @shared_mnt->mnt_share. * The terms "slave mount" or "dependent mount" denote a mount that receives propagation from a peer in a peer group. IOW, shared mounts may have slave mounts and slave mounts have shared mounts as their master. Slave mounts of a given peer in a peer group are listed on that peers slave list available at @shared_mnt->mnt_slave_list. * The term "master mount" denotes a mount in a peer group. IOW, it denotes a shared mount or a peer mount in a peer group. The term "master mount" - or "master" for short - is mostly used when talking in the context of slave mounts that receive propagation from a master mount. A master mount of a slave identifies the closest peer group a slave mount receives propagation from. The master mount of a slave can be identified via @slave_mount->mnt_master. Different slaves may point to different masters in the same peer group. * Multiple peers in a peer group can have non-empty ->mnt_slave_lists. Non-empty ->mnt_slave_lists of peers don't intersect. Consequently, to ensure all slave mounts of a peer group are visited the ->mnt_slave_lists of all peers in a peer group have to be walked. * Slave mounts point to a peer in the closest peer group they receive propagation from via @slave_mnt->mnt_master (see above). Together with these peers they form a propagation group (see below). The closest peer group can thus be identified through the peer group id @slave_mnt->mnt_master->mnt_group_id of the peer/master that a slave mount receives propagation from. * A shared-slave mount is a slave mount to a peer group pg1 while also a peer in another peer group pg2. IOW, a peer group may receive propagation from another peer group. If a peer group pg1 is a slave to another peer group pg2 then all peers in peer group pg1 point to the same peer in peer group pg2 via ->mnt_master. IOW, all peers in peer group pg1 appear on the same ->mnt_slave_list. IOW, they cannot be slaves to different peer groups. * A pure slave mount is a slave mount that is a slave to a peer group but is not a peer in another peer group. * A propagation group denotes the set of mounts consisting of a single peer group pg1 and all slave mounts and shared-slave mounts that point to a peer in that peer group via ->mnt_master. IOW, all slave mounts such that @slave_mnt->mnt_master->mnt_group_id is equal to @shared_mnt->mnt_group_id. The concept of a propagation group makes it easier to talk about a single propagation level in a propagation tree. For example, in propagate_mnt() the immediate peers of @dest_mnt and all slaves of @dest_mnt's peer group form a propagation group propg1. So a shared-slave mount that is a slave in propg1 and that is a peer in another peer group pg2 forms another propagation group propg2 together with all slaves that point to that shared-slave mount in their ->mnt_master. * A propagation tree refers to all mounts that receive propagation starting from a specific shared mount. For example, for propagate_mnt() @dest_mnt is the start of a propagation tree. The propagation tree ecompasses all mounts that receive propagation from @dest_mnt's peer group down to the leafs. With that out of the way let's get to the actual algorithm. We know that @dest_mnt is guaranteed to be a pure shared mount or a shared-slave mount. This is guaranteed by a check in attach_recursive_mnt(). So propagate_mnt() will first propagate the source mount tree to all peers in @dest_mnt's peer group: for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) { ret = propagate_one(n); if (ret) goto out; } Notice, that the peer propagation loop of propagate_mnt() doesn't propagate @dest_mnt itself. @dest_mnt is mounted directly in attach_recursive_mnt() after we propagated to the destination propagation tree. The mount that will be mounted on top of @dest_mnt is @source_mnt. This copy was created earlier even before we entered attach_recursive_mnt() and doesn't concern us a lot here. It's just important to notice that when propagate_mnt() is called @source_mnt will not yet have been mounted on top of @dest_mnt. Thus, @source_mnt->mnt_parent will either still point to @source_mnt or - in the case @source_mnt is moved and thus already attached - still to its former parent. For each peer @m in @dest_mnt's peer group propagate_one() will create a new copy of the source mount tree and mount that copy @child on @m such that @child->mnt_parent points to @m after propagate_one() returns. propagate_one() will stash the last destination propagation node @m in @last_dest and the last copy it created for the source mount tree in @last_source. Hence, if we call into propagate_one() again for the next destination propagation node @m, @last_dest will point to the previous destination propagation node and @last_source will point to the previous copy of the source mount tree and mounted on @last_dest. Each new copy of the source mount tree is created from the previous copy of the source mount tree. This will become important later. The peer loop in propagate_mnt() is straightforward. We iterate through the peers copying and updating @last_source and @last_dest as we go through them and mount each copy of the source mount tree @child on a peer @m in @dest_mnt's peer group. After propagate_mnt() handled the peers in @dest_mnt's peer group propagate_mnt() will propagate the source mount tree down the propagation tree that @dest_mnt's peer group propagates to: for (m = next_group(dest_mnt, dest_mnt); m; m = next_group(m, dest_mnt)) { /* everything in that slave group */ n = m; do { ret = propagate_one(n); if (ret) goto out; n = next_peer(n); } while (n != m); } The next_group() helper will recursively walk the destination propagation tree, descending into each propagation group of the propagation tree. The important part is that it takes care to propagate the source mount tree to all peers in the peer group of a propagation group before it propagates to the slaves to those peers in the propagation group. IOW, it creates and mounts copies of the source mount tree that become masters before it creates and mounts copies of the source mount tree that become slaves to these masters. It is important to remember that propagating the source mount tree to each mount @m in the destination propagation tree simply means that we create and mount new copies @child of the source mount tree on @m such that @child->mnt_parent points to @m. Since we know that each node @m in the destination propagation tree headed by @dest_mnt's peer group will be overmounted with a copy of the source mount tree and since we know that the propagation properties of each copy of the source mount tree we create and mount at @m will mostly mirror the propagation properties of @m. We can use that information to create and mount the copies of the source mount tree that become masters before their slaves. The easy case is always when @m and @last_dest are peers in a peer group of a given propagation group. In that case we know that we can simply copy @last_source without having to figure out what the master for the new copy @child of the source mount tree needs to be as we've done that in a previous call to propagate_one(). The hard case is when we're dealing with a slave mount or a shared-slave mount @m in a destination propagation group that we need to create and mount a copy of the source mount tree on. For each propagation group in the destination propagation tree we propagate the source mount tree to we want to make sure that the copies @child of the source mount tree we create and mount on slaves @m pick an ealier copy of the source mount tree that we mounted on a master @m of the destination propagation group as their master. This is a mouthful but as far as we can tell that's the core of it all. But, if we keep track of the masters in the destination propagation tree @m we can use the information to find the correct master for each copy of the source mount tree we create and mount at the slaves in the destination propagation tree @m. Let's walk through the base case as that's still fairly easy to grasp. If we're dealing with the first slave in the propagation group that @dest_mnt is in then we don't yet have marked any masters in the destination propagation tree. We know the master for the first slave to @dest_mnt's peer group is simple @dest_mnt. So we expect this algorithm to yield a copy of the source mount tree that was mounted on a peer in @dest_mnt's peer group as the master for the copy of the source mount tree we want to mount at the first slave @m: for (n = m; ; n = p) { p = n->mnt_master; if (p == dest_master || IS_MNT_MARKED(p)) break; } For the first slave we walk the destination propagation tree all the way up to a peer in @dest_mnt's peer group. IOW, the propagation hierarchy can be walked by walking up the @mnt->mnt_master hierarchy of the destination propagation tree @m. We will ultimately find a peer in @dest_mnt's peer group and thus ultimately @dest_mnt->mnt_master. Btw, here the assumption we listed at the beginning becomes important. Namely, that peers in a peer group pg1 that are slaves in another peer group pg2 appear on the same ->mnt_slave_list. IOW, all slaves who are peers in peer group pg1 point to the same peer in peer group pg2 via their ->mnt_master. Otherwise the termination condition in the code above would be wrong and next_group() would be broken too. So the first iteration sets: n = m; p = n->mnt_master; such that @p now points to a peer or @dest_mnt itself. We walk up one more level since we don't have any marked mounts. So we end up with: n = dest_mnt; p = dest_mnt->mnt_master; If @dest_mnt's peer group is not slave to another peer group then @p is now NULL. If @dest_mnt's peer group is a slave to another peer group then @p now points to @dest_mnt->mnt_master points which is a master outside the propagation tree we're dealing with. Now we need to figure out the master for the copy of the source mount tree we're about to create and mount on the first slave of @dest_mnt's peer group: do { struct mount *parent = last_source->mnt_parent; if (last_source == first_source) break; done = parent->mnt_master == p; if (done && peers(n, parent)) break; last_source = last_source->mnt_master; } while (!done); We know that @last_source->mnt_parent points to @last_dest and @last_dest is the last peer in @dest_mnt's peer group we propagated to in the peer loop in propagate_mnt(). Consequently, @last_source is the last copy we created and mount on that last peer in @dest_mnt's peer group. So @last_source is the master we want to pick. We know that @last_source->mnt_parent->mnt_master points to @last_dest->mnt_master. We also know that @last_dest->mnt_master is either NULL or points to a master outside of the destination propagation tree and so does @p. Hence: done = parent->mnt_master == p; is trivially true in the base condition. We also know that for the first slave mount of @dest_mnt's peer group that @last_dest either points @dest_mnt itself because it was initialized to: last_dest = dest_mnt; at the beginning of propagate_mnt() or it will point to a peer of @dest_mnt in its peer group. In both cases it is guaranteed that on the first iteration @n and @parent are peers (Please note the check for peers here as that's important.): if (done && peers(n, parent)) break; So, as we expected, we select @last_source, which referes to the last copy of the source mount tree we mounted on the last peer in @dest_mnt's peer group, as the master of the first slave in @dest_mnt's peer group. The rest is taken care of by clone_mnt(last_source, ...). We'll skip over that part otherwise this becomes a blogpost. At the end of propagate_mnt() we now mark @m->mnt_master as the first master in the destination propagation tree that is distinct from @dest_mnt->mnt_master. IOW, we mark @dest_mnt itself as a master. By marking @dest_mnt or one of it's peers we are able to easily find it again when we later lookup masters for other copies of the source mount tree we mount copies of the source mount tree on slaves @m to @dest_mnt's peer group. This, in turn allows us to find the master we selected for the copies of the source mount tree we mounted on master in the destination propagation tree again. The important part is to realize that the code makes use of the fact that the last copy of the source mount tree stashed in @last_source was mounted on top of the previous destination propagation node @last_dest. What this means is that @last_source allows us to walk the destination propagation hierarchy the same way each destination propagation node @m does. If we take @last_source, which is the copy of @source_mnt we have mounted on @last_dest in the previous iteration of propagate_one(), then we know @last_source->mnt_parent points to @last_dest but we also know that as we walk through the destination propagation tree that @last_source->mnt_master will point to an earlier copy of the source mount tree we mounted one an earlier destination propagation node @m. IOW, @last_source->mnt_parent will be our hook into the destination propagation tree and each consecutive @last_source->mnt_master will lead us to an earlier propagation node @m via @last_source->mnt_master->mnt_parent. Hence, by walking up @last_source->mnt_master, each of which is mounted on a node that is a master @m in the destination propagation tree we can also walk up the destination propagation hierarchy. So, for each new destination propagation node @m we use the previous copy of @last_source and the fact it's mounted on the previous propagation node @last_dest via @last_source->mnt_master->mnt_parent to determine what the master of the new copy of @last_source needs to be. The goal is to find the _closest_ master that the new copy of the source mount tree we are about to create and mount on a slave @m in the destination propagation tree needs to pick. IOW, we want to find a suitable master in the propagation group. As the propagation structure of the source mount propagation tree we create mirrors the propagation structure of the destination propagation tree we can find @m's closest master - i.e., a marked master - which is a peer in the closest peer group that @m receives propagation from. We store that closest master of @m in @p as before and record the slave to that master in @n We then search for this master @p via @last_source by walking up the master hierarchy starting from the last copy of the source mount tree stored in @last_source that we created and mounted on the previous destination propagation node @m. We will try to find the master by walking @last_source->mnt_master and by comparing @last_source->mnt_master->mnt_parent->mnt_master to @p. If we find @p then we can figure out what earlier copy of the source mount tree needs to be the master for the new copy of the source mount tree we're about to create and mount at the current destination propagation node @m. If @last_source->mnt_master->mnt_parent and @n are peers then we know that the closest master they receive propagation from is @last_source->mnt_master->mnt_parent->mnt_master. If not then the closest immediate peer group that they receive propagation from must be one level higher up. This builds on the earlier clarification at the beginning that all peers in a peer group which are slaves of other peer groups all point to the same ->mnt_master, i.e., appear on the same ->mnt_slave_list, of the closest peer group that they receive propagation from. However, terminating the walk has corner cases. If the closest marked master for a given destination node @m cannot be found by walking up the master hierarchy via @last_source->mnt_master then we need to terminate the walk when we encounter @source_mnt again. This isn't an arbitrary termination. It simply means that the new copy of the source mount tree we're about to create has a copy of the source mount tree we created and mounted on a peer in @dest_mnt's peer group as its master. IOW, @source_mnt is the peer in the closest peer group that the new copy of the source mount tree receives propagation from. We absolutely have to stop @source_mnt because @last_source->mnt_master either points outside the propagation hierarchy we're dealing with or it is NULL because @source_mnt isn't a shared-slave. So continuing the walk past @source_mnt would cause a NULL dereference via @last_source->mnt_master->mnt_parent. And so we have to stop the walk when we encounter @source_mnt again. One scenario where this can happen is when we first handled a series of slaves of @dest_mnt's peer group and then encounter peers in a new peer group that is a slave to @dest_mnt's peer group. We handle them and then we encounter another slave mount to @dest_mnt that is a pure slave to @dest_mnt's peer group. That pure slave will have a peer in @dest_mnt's peer group as its master. Consequently, the new copy of the source mount tree will need to have @source_mnt as it's master. So we walk the propagation hierarchy all the way up to @source_mnt based on @last_source->mnt_master. So terminate on @source_mnt, easy peasy. Except, that the check misses something that the rest of the algorithm already handles. If @dest_mnt has peers in it's peer group the peer loop in propagate_mnt(): for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) { ret = propagate_one(n); if (ret) goto out; } will consecutively update @last_source with each previous copy of the source mount tree we created and mounted at the previous peer in @dest_mnt's peer group. So after that loop terminates @last_source will point to whatever copy of the source mount tree was created and mounted on the last peer in @dest_mnt's peer group. Furthermore, if there is even a single additional peer in @dest_mnt's peer group then @last_source will __not__ point to @source_mnt anymore. Because, as we mentioned above, @dest_mnt isn't even handled in this loop but directly in attach_recursive_mnt(). So it can't even accidently come last in that peer loop. So the first time we handle a slave mount @m of @dest_mnt's peer group the copy of the source mount tree we create will make the __last copy of the source mount tree we created and mounted on the last peer in @dest_mnt's peer group the master of the new copy of the source mount tree we create and mount on the first slave of @dest_mnt's peer group__. But this means that the termination condition that checks for @source_mnt is wrong. The @source_mnt cannot be found anymore by propagate_one(). Instead it will find the last copy of the source mount tree we created and mounted for the last peer of @dest_mnt's peer group again. And that is a peer of @source_mnt not @source_mnt itself. IOW, we fail to terminate the loop correctly and ultimately dereference @last_source->mnt_master->mnt_parent. When @source_mnt's peer group isn't slave to another peer group then @last_source->mnt_master is NULL causing the splat below. For example, assume @dest_mnt is a pure shared mount and has three peers in its peer group: =================================================================================== mount-id mount-parent-id peer-group-id =================================================================================== (@dest_mnt) mnt_master[216] 309 297 shared:216 \ (@source_mnt) mnt_master[218]: 609 609 shared:218 (1) mnt_master[216]: 607 605 shared:216 \ (P1) mnt_master[218]: 624 607 shared:218 (2) mnt_master[216]: 576 574 shared:216 \ (P2) mnt_master[218]: 625 576 shared:218 (3) mnt_master[216]: 545 543 shared:216 \ (P3) mnt_master[218]: 626 545 shared:218 After this sequence has been processed @last_source will point to (P3), the copy generated for the third peer in @dest_mnt's peer group we handled. So the copy of the source mount tree (P4) we create and mount on the first slave of @dest_mnt's peer group: =================================================================================== mount-id mount-parent-id peer-group-id =================================================================================== mnt_master[216] 309 297 shared:216 / / (S0) mnt_slave 483 481 master:216 \ \ (P3) mnt_master[218] 626 545 shared:218 \ / \/ (P4) mnt_slave 627 483 master:218 will pick the last copy of the source mount tree (P3) as master, not (S0). When walking the propagation hierarchy via @last_source's master hierarchy we encounter (P3) but not (S0), i.e., @source_mnt. We can fix this in multiple ways: (1) By setting @last_source to @source_mnt after we processed the peers in @dest_mnt's peer group right after the peer loop in propagate_mnt(). (2) By changing the termination condition that relies on finding exactly @source_mnt to finding a peer of @source_mnt. (3) By only moving @last_source when we actually venture into a new peer group or some clever variant thereof. The first two options are minimally invasive and what we want as a fix. The third option is more intrusive but something we'd like to explore in the near future. This passes all LTP tests and specifically the mount propagation testsuite part of it. It also holds up against all known reproducers of this issues. Final words. First, this is a clever but __worringly__ underdocumented algorithm. There isn't a single detailed comment to be found in next_group(), propagate_one() or anywhere else in that file for that matter. This has been a giant pain to understand and work through and a bug like this is insanely difficult to fix without a detailed understanding of what's happening. Let's not talk about the amount of time that was sunk into fixing this. Second, all the cool kids with access to unshare --mount --user --map-root --propagation=unchanged are going to have a lot of fun. IOW, triggerable by unprivileged users while namespace_lock() lock is held. [ 115.848393] BUG: kernel NULL pointer dereference, address: 0000000000000010 [ 115.848967] #PF: supervisor read access in kernel mode [ 115.849386] #PF: error_code(0x0000) - not-present page [ 115.849803] PGD 0 P4D 0 [ 115.850012] Oops: 0000 [#1] PREEMPT SMP PTI [ 115.850354] CPU: 0 PID: 15591 Comm: mount Not tainted 6.1.0-rc7 #3 [ 115.850851] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 115.851510] RIP: 0010:propagate_one.part.0+0x7f/0x1a0 [ 115.851924] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10 49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01 00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37 02 4d [ 115.853441] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282 [ 115.853865] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00 [ 115.854458] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780 [ 115.855044] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0 [ 115.855693] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8 [ 115.856304] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000 [ 115.856859] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000) knlGS:0000000000000000 [ 115.857531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 115.858006] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0 [ 115.858598] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 115.859393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 115.860099] Call Trace: [ 115.860358] <TASK> [ 115.860535] propagate_mnt+0x14d/0x190 [ 115.860848] attach_recursive_mnt+0x274/0x3e0 [ 115.861212] path_mount+0x8c8/0xa60 [ 115.861503] __x64_sys_mount+0xf6/0x140 [ 115.861819] do_syscall_64+0x5b/0x80 [ 115.862117] ? do_faccessat+0x123/0x250 [ 115.862435] ? syscall_exit_to_user_mode+0x17/0x40 [ 115.862826] ? do_syscall_64+0x67/0x80 [ 115.863133] ? syscall_exit_to_user_mode+0x17/0x40 [ 115.863527] ? do_syscall_64+0x67/0x80 [ 115.863835] ? do_syscall_64+0x67/0x80 [ 115.864144] ? do_syscall_64+0x67/0x80 [ 115.864452] ? exc_page_fault+0x70/0x170 [ 115.864775] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 115.865187] RIP: 0033:0x7f92c92b0ebe [ 115.865480] Code: 48 8b 0d 75 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 4f 0c 00 f7 d8 64 89 01 48 [ 115.866984] RSP: 002b:00007fff000aa728 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [ 115.867607] RAX: ffffffffffffffda RBX: 000055a77888d6b0 RCX: 00007f92c92b0ebe [ 115.868240] RDX: 000055a77888d8e0 RSI: 000055a77888e6e0 RDI: 000055a77888e620 [ 115.868823] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 115.869403] R10: 0000000000001000 R11: 0000000000000246 R12: 000055a77888e620 [ 115.869994] R13: 000055a77888d8e0 R14: 00000000ffffffff R15: 00007f92c93e4076 [ 115.870581] </TASK> [ 115.870763] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr snd_intel8x0 sunrpc snd_ac97_codec ac97_bus snd_pcm snd_timer intel_rapl_msr intel_rapl_common snd vboxguest intel_powerclamp video rapl joydev soundcore i2c_piix4 wmi fuse zram xfs vmwgfx crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic drm_ttm_helper ttm e1000 ghash_clmulni_intel serio_raw ata_generic pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath [ 115.875288] CR2: 0000000000000010 [ 115.875641] ---[ end trace 0000000000000000 ]--- [ 115.876135] RIP: 0010:propagate_one.part.0+0x7f/0x1a0 [ 115.876551] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10 49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01 00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37 02 4d [ 115.878086] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282 [ 115.878511] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00 [ 115.879128] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780 [ 115.879715] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0 [ 115.880359] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8 [ 115.880962] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000 [ 115.881548] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000) knlGS:0000000000000000 [ 115.882234] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 115.882713] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0 [ 115.883314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 115.883966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes:f2ebb3a921
("smarter propagate_mnt()") Fixes:5ec0811d30
("propogate_mnt: Handle the first propogated copy being a slave") Cc: <stable@vger.kernel.org> Reported-by: Ditang Chen <ditang.c@gmail.com> Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> --- If there are no big objections I'll get this to Linus rather sooner than later.
603 lines
15 KiB
C
603 lines
15 KiB
C
// SPDX-License-Identifier: GPL-2.0-only
|
|
/*
|
|
* linux/fs/pnode.c
|
|
*
|
|
* (C) Copyright IBM Corporation 2005.
|
|
* Author : Ram Pai (linuxram@us.ibm.com)
|
|
*/
|
|
#include <linux/mnt_namespace.h>
|
|
#include <linux/mount.h>
|
|
#include <linux/fs.h>
|
|
#include <linux/nsproxy.h>
|
|
#include <uapi/linux/mount.h>
|
|
#include "internal.h"
|
|
#include "pnode.h"
|
|
|
|
/* return the next shared peer mount of @p */
|
|
static inline struct mount *next_peer(struct mount *p)
|
|
{
|
|
return list_entry(p->mnt_share.next, struct mount, mnt_share);
|
|
}
|
|
|
|
static inline struct mount *first_slave(struct mount *p)
|
|
{
|
|
return list_entry(p->mnt_slave_list.next, struct mount, mnt_slave);
|
|
}
|
|
|
|
static inline struct mount *last_slave(struct mount *p)
|
|
{
|
|
return list_entry(p->mnt_slave_list.prev, struct mount, mnt_slave);
|
|
}
|
|
|
|
static inline struct mount *next_slave(struct mount *p)
|
|
{
|
|
return list_entry(p->mnt_slave.next, struct mount, mnt_slave);
|
|
}
|
|
|
|
static struct mount *get_peer_under_root(struct mount *mnt,
|
|
struct mnt_namespace *ns,
|
|
const struct path *root)
|
|
{
|
|
struct mount *m = mnt;
|
|
|
|
do {
|
|
/* Check the namespace first for optimization */
|
|
if (m->mnt_ns == ns && is_path_reachable(m, m->mnt.mnt_root, root))
|
|
return m;
|
|
|
|
m = next_peer(m);
|
|
} while (m != mnt);
|
|
|
|
return NULL;
|
|
}
|
|
|
|
/*
|
|
* Get ID of closest dominating peer group having a representative
|
|
* under the given root.
|
|
*
|
|
* Caller must hold namespace_sem
|
|
*/
|
|
int get_dominating_id(struct mount *mnt, const struct path *root)
|
|
{
|
|
struct mount *m;
|
|
|
|
for (m = mnt->mnt_master; m != NULL; m = m->mnt_master) {
|
|
struct mount *d = get_peer_under_root(m, mnt->mnt_ns, root);
|
|
if (d)
|
|
return d->mnt_group_id;
|
|
}
|
|
|
|
return 0;
|
|
}
|
|
|
|
static int do_make_slave(struct mount *mnt)
|
|
{
|
|
struct mount *master, *slave_mnt;
|
|
|
|
if (list_empty(&mnt->mnt_share)) {
|
|
if (IS_MNT_SHARED(mnt)) {
|
|
mnt_release_group_id(mnt);
|
|
CLEAR_MNT_SHARED(mnt);
|
|
}
|
|
master = mnt->mnt_master;
|
|
if (!master) {
|
|
struct list_head *p = &mnt->mnt_slave_list;
|
|
while (!list_empty(p)) {
|
|
slave_mnt = list_first_entry(p,
|
|
struct mount, mnt_slave);
|
|
list_del_init(&slave_mnt->mnt_slave);
|
|
slave_mnt->mnt_master = NULL;
|
|
}
|
|
return 0;
|
|
}
|
|
} else {
|
|
struct mount *m;
|
|
/*
|
|
* slave 'mnt' to a peer mount that has the
|
|
* same root dentry. If none is available then
|
|
* slave it to anything that is available.
|
|
*/
|
|
for (m = master = next_peer(mnt); m != mnt; m = next_peer(m)) {
|
|
if (m->mnt.mnt_root == mnt->mnt.mnt_root) {
|
|
master = m;
|
|
break;
|
|
}
|
|
}
|
|
list_del_init(&mnt->mnt_share);
|
|
mnt->mnt_group_id = 0;
|
|
CLEAR_MNT_SHARED(mnt);
|
|
}
|
|
list_for_each_entry(slave_mnt, &mnt->mnt_slave_list, mnt_slave)
|
|
slave_mnt->mnt_master = master;
|
|
list_move(&mnt->mnt_slave, &master->mnt_slave_list);
|
|
list_splice(&mnt->mnt_slave_list, master->mnt_slave_list.prev);
|
|
INIT_LIST_HEAD(&mnt->mnt_slave_list);
|
|
mnt->mnt_master = master;
|
|
return 0;
|
|
}
|
|
|
|
/*
|
|
* vfsmount lock must be held for write
|
|
*/
|
|
void change_mnt_propagation(struct mount *mnt, int type)
|
|
{
|
|
if (type == MS_SHARED) {
|
|
set_mnt_shared(mnt);
|
|
return;
|
|
}
|
|
do_make_slave(mnt);
|
|
if (type != MS_SLAVE) {
|
|
list_del_init(&mnt->mnt_slave);
|
|
mnt->mnt_master = NULL;
|
|
if (type == MS_UNBINDABLE)
|
|
mnt->mnt.mnt_flags |= MNT_UNBINDABLE;
|
|
else
|
|
mnt->mnt.mnt_flags &= ~MNT_UNBINDABLE;
|
|
}
|
|
}
|
|
|
|
/*
|
|
* get the next mount in the propagation tree.
|
|
* @m: the mount seen last
|
|
* @origin: the original mount from where the tree walk initiated
|
|
*
|
|
* Note that peer groups form contiguous segments of slave lists.
|
|
* We rely on that in get_source() to be able to find out if
|
|
* vfsmount found while iterating with propagation_next() is
|
|
* a peer of one we'd found earlier.
|
|
*/
|
|
static struct mount *propagation_next(struct mount *m,
|
|
struct mount *origin)
|
|
{
|
|
/* are there any slaves of this mount? */
|
|
if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
|
|
return first_slave(m);
|
|
|
|
while (1) {
|
|
struct mount *master = m->mnt_master;
|
|
|
|
if (master == origin->mnt_master) {
|
|
struct mount *next = next_peer(m);
|
|
return (next == origin) ? NULL : next;
|
|
} else if (m->mnt_slave.next != &master->mnt_slave_list)
|
|
return next_slave(m);
|
|
|
|
/* back at master */
|
|
m = master;
|
|
}
|
|
}
|
|
|
|
static struct mount *skip_propagation_subtree(struct mount *m,
|
|
struct mount *origin)
|
|
{
|
|
/*
|
|
* Advance m such that propagation_next will not return
|
|
* the slaves of m.
|
|
*/
|
|
if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
|
|
m = last_slave(m);
|
|
|
|
return m;
|
|
}
|
|
|
|
static struct mount *next_group(struct mount *m, struct mount *origin)
|
|
{
|
|
while (1) {
|
|
while (1) {
|
|
struct mount *next;
|
|
if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
|
|
return first_slave(m);
|
|
next = next_peer(m);
|
|
if (m->mnt_group_id == origin->mnt_group_id) {
|
|
if (next == origin)
|
|
return NULL;
|
|
} else if (m->mnt_slave.next != &next->mnt_slave)
|
|
break;
|
|
m = next;
|
|
}
|
|
/* m is the last peer */
|
|
while (1) {
|
|
struct mount *master = m->mnt_master;
|
|
if (m->mnt_slave.next != &master->mnt_slave_list)
|
|
return next_slave(m);
|
|
m = next_peer(master);
|
|
if (master->mnt_group_id == origin->mnt_group_id)
|
|
break;
|
|
if (master->mnt_slave.next == &m->mnt_slave)
|
|
break;
|
|
m = master;
|
|
}
|
|
if (m == origin)
|
|
return NULL;
|
|
}
|
|
}
|
|
|
|
/* all accesses are serialized by namespace_sem */
|
|
static struct mount *last_dest, *first_source, *last_source, *dest_master;
|
|
static struct mountpoint *mp;
|
|
static struct hlist_head *list;
|
|
|
|
static inline bool peers(struct mount *m1, struct mount *m2)
|
|
{
|
|
return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
|
|
}
|
|
|
|
static int propagate_one(struct mount *m)
|
|
{
|
|
struct mount *child;
|
|
int type;
|
|
/* skip ones added by this propagate_mnt() */
|
|
if (IS_MNT_NEW(m))
|
|
return 0;
|
|
/* skip if mountpoint isn't covered by it */
|
|
if (!is_subdir(mp->m_dentry, m->mnt.mnt_root))
|
|
return 0;
|
|
if (peers(m, last_dest)) {
|
|
type = CL_MAKE_SHARED;
|
|
} else {
|
|
struct mount *n, *p;
|
|
bool done;
|
|
for (n = m; ; n = p) {
|
|
p = n->mnt_master;
|
|
if (p == dest_master || IS_MNT_MARKED(p))
|
|
break;
|
|
}
|
|
do {
|
|
struct mount *parent = last_source->mnt_parent;
|
|
if (peers(last_source, first_source))
|
|
break;
|
|
done = parent->mnt_master == p;
|
|
if (done && peers(n, parent))
|
|
break;
|
|
last_source = last_source->mnt_master;
|
|
} while (!done);
|
|
|
|
type = CL_SLAVE;
|
|
/* beginning of peer group among the slaves? */
|
|
if (IS_MNT_SHARED(m))
|
|
type |= CL_MAKE_SHARED;
|
|
}
|
|
|
|
child = copy_tree(last_source, last_source->mnt.mnt_root, type);
|
|
if (IS_ERR(child))
|
|
return PTR_ERR(child);
|
|
read_seqlock_excl(&mount_lock);
|
|
mnt_set_mountpoint(m, mp, child);
|
|
if (m->mnt_master != dest_master)
|
|
SET_MNT_MARK(m->mnt_master);
|
|
read_sequnlock_excl(&mount_lock);
|
|
last_dest = m;
|
|
last_source = child;
|
|
hlist_add_head(&child->mnt_hash, list);
|
|
return count_mounts(m->mnt_ns, child);
|
|
}
|
|
|
|
/*
|
|
* mount 'source_mnt' under the destination 'dest_mnt' at
|
|
* dentry 'dest_dentry'. And propagate that mount to
|
|
* all the peer and slave mounts of 'dest_mnt'.
|
|
* Link all the new mounts into a propagation tree headed at
|
|
* source_mnt. Also link all the new mounts using ->mnt_list
|
|
* headed at source_mnt's ->mnt_list
|
|
*
|
|
* @dest_mnt: destination mount.
|
|
* @dest_dentry: destination dentry.
|
|
* @source_mnt: source mount.
|
|
* @tree_list : list of heads of trees to be attached.
|
|
*/
|
|
int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
|
|
struct mount *source_mnt, struct hlist_head *tree_list)
|
|
{
|
|
struct mount *m, *n;
|
|
int ret = 0;
|
|
|
|
/*
|
|
* we don't want to bother passing tons of arguments to
|
|
* propagate_one(); everything is serialized by namespace_sem,
|
|
* so globals will do just fine.
|
|
*/
|
|
last_dest = dest_mnt;
|
|
first_source = source_mnt;
|
|
last_source = source_mnt;
|
|
mp = dest_mp;
|
|
list = tree_list;
|
|
dest_master = dest_mnt->mnt_master;
|
|
|
|
/* all peers of dest_mnt, except dest_mnt itself */
|
|
for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
|
|
ret = propagate_one(n);
|
|
if (ret)
|
|
goto out;
|
|
}
|
|
|
|
/* all slave groups */
|
|
for (m = next_group(dest_mnt, dest_mnt); m;
|
|
m = next_group(m, dest_mnt)) {
|
|
/* everything in that slave group */
|
|
n = m;
|
|
do {
|
|
ret = propagate_one(n);
|
|
if (ret)
|
|
goto out;
|
|
n = next_peer(n);
|
|
} while (n != m);
|
|
}
|
|
out:
|
|
read_seqlock_excl(&mount_lock);
|
|
hlist_for_each_entry(n, tree_list, mnt_hash) {
|
|
m = n->mnt_parent;
|
|
if (m->mnt_master != dest_mnt->mnt_master)
|
|
CLEAR_MNT_MARK(m->mnt_master);
|
|
}
|
|
read_sequnlock_excl(&mount_lock);
|
|
return ret;
|
|
}
|
|
|
|
static struct mount *find_topper(struct mount *mnt)
|
|
{
|
|
/* If there is exactly one mount covering mnt completely return it. */
|
|
struct mount *child;
|
|
|
|
if (!list_is_singular(&mnt->mnt_mounts))
|
|
return NULL;
|
|
|
|
child = list_first_entry(&mnt->mnt_mounts, struct mount, mnt_child);
|
|
if (child->mnt_mountpoint != mnt->mnt.mnt_root)
|
|
return NULL;
|
|
|
|
return child;
|
|
}
|
|
|
|
/*
|
|
* return true if the refcount is greater than count
|
|
*/
|
|
static inline int do_refcount_check(struct mount *mnt, int count)
|
|
{
|
|
return mnt_get_count(mnt) > count;
|
|
}
|
|
|
|
/*
|
|
* check if the mount 'mnt' can be unmounted successfully.
|
|
* @mnt: the mount to be checked for unmount
|
|
* NOTE: unmounting 'mnt' would naturally propagate to all
|
|
* other mounts its parent propagates to.
|
|
* Check if any of these mounts that **do not have submounts**
|
|
* have more references than 'refcnt'. If so return busy.
|
|
*
|
|
* vfsmount lock must be held for write
|
|
*/
|
|
int propagate_mount_busy(struct mount *mnt, int refcnt)
|
|
{
|
|
struct mount *m, *child, *topper;
|
|
struct mount *parent = mnt->mnt_parent;
|
|
|
|
if (mnt == parent)
|
|
return do_refcount_check(mnt, refcnt);
|
|
|
|
/*
|
|
* quickly check if the current mount can be unmounted.
|
|
* If not, we don't have to go checking for all other
|
|
* mounts
|
|
*/
|
|
if (!list_empty(&mnt->mnt_mounts) || do_refcount_check(mnt, refcnt))
|
|
return 1;
|
|
|
|
for (m = propagation_next(parent, parent); m;
|
|
m = propagation_next(m, parent)) {
|
|
int count = 1;
|
|
child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
|
|
if (!child)
|
|
continue;
|
|
|
|
/* Is there exactly one mount on the child that covers
|
|
* it completely whose reference should be ignored?
|
|
*/
|
|
topper = find_topper(child);
|
|
if (topper)
|
|
count += 1;
|
|
else if (!list_empty(&child->mnt_mounts))
|
|
continue;
|
|
|
|
if (do_refcount_check(child, count))
|
|
return 1;
|
|
}
|
|
return 0;
|
|
}
|
|
|
|
/*
|
|
* Clear MNT_LOCKED when it can be shown to be safe.
|
|
*
|
|
* mount_lock lock must be held for write
|
|
*/
|
|
void propagate_mount_unlock(struct mount *mnt)
|
|
{
|
|
struct mount *parent = mnt->mnt_parent;
|
|
struct mount *m, *child;
|
|
|
|
BUG_ON(parent == mnt);
|
|
|
|
for (m = propagation_next(parent, parent); m;
|
|
m = propagation_next(m, parent)) {
|
|
child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
|
|
if (child)
|
|
child->mnt.mnt_flags &= ~MNT_LOCKED;
|
|
}
|
|
}
|
|
|
|
static void umount_one(struct mount *mnt, struct list_head *to_umount)
|
|
{
|
|
CLEAR_MNT_MARK(mnt);
|
|
mnt->mnt.mnt_flags |= MNT_UMOUNT;
|
|
list_del_init(&mnt->mnt_child);
|
|
list_del_init(&mnt->mnt_umounting);
|
|
list_move_tail(&mnt->mnt_list, to_umount);
|
|
}
|
|
|
|
/*
|
|
* NOTE: unmounting 'mnt' naturally propagates to all other mounts its
|
|
* parent propagates to.
|
|
*/
|
|
static bool __propagate_umount(struct mount *mnt,
|
|
struct list_head *to_umount,
|
|
struct list_head *to_restore)
|
|
{
|
|
bool progress = false;
|
|
struct mount *child;
|
|
|
|
/*
|
|
* The state of the parent won't change if this mount is
|
|
* already unmounted or marked as without children.
|
|
*/
|
|
if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
|
|
goto out;
|
|
|
|
/* Verify topper is the only grandchild that has not been
|
|
* speculatively unmounted.
|
|
*/
|
|
list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
|
|
if (child->mnt_mountpoint == mnt->mnt.mnt_root)
|
|
continue;
|
|
if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
|
|
continue;
|
|
/* Found a mounted child */
|
|
goto children;
|
|
}
|
|
|
|
/* Mark mounts that can be unmounted if not locked */
|
|
SET_MNT_MARK(mnt);
|
|
progress = true;
|
|
|
|
/* If a mount is without children and not locked umount it. */
|
|
if (!IS_MNT_LOCKED(mnt)) {
|
|
umount_one(mnt, to_umount);
|
|
} else {
|
|
children:
|
|
list_move_tail(&mnt->mnt_umounting, to_restore);
|
|
}
|
|
out:
|
|
return progress;
|
|
}
|
|
|
|
static void umount_list(struct list_head *to_umount,
|
|
struct list_head *to_restore)
|
|
{
|
|
struct mount *mnt, *child, *tmp;
|
|
list_for_each_entry(mnt, to_umount, mnt_list) {
|
|
list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
|
|
/* topper? */
|
|
if (child->mnt_mountpoint == mnt->mnt.mnt_root)
|
|
list_move_tail(&child->mnt_umounting, to_restore);
|
|
else
|
|
umount_one(child, to_umount);
|
|
}
|
|
}
|
|
}
|
|
|
|
static void restore_mounts(struct list_head *to_restore)
|
|
{
|
|
/* Restore mounts to a clean working state */
|
|
while (!list_empty(to_restore)) {
|
|
struct mount *mnt, *parent;
|
|
struct mountpoint *mp;
|
|
|
|
mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
|
|
CLEAR_MNT_MARK(mnt);
|
|
list_del_init(&mnt->mnt_umounting);
|
|
|
|
/* Should this mount be reparented? */
|
|
mp = mnt->mnt_mp;
|
|
parent = mnt->mnt_parent;
|
|
while (parent->mnt.mnt_flags & MNT_UMOUNT) {
|
|
mp = parent->mnt_mp;
|
|
parent = parent->mnt_parent;
|
|
}
|
|
if (parent != mnt->mnt_parent)
|
|
mnt_change_mountpoint(parent, mp, mnt);
|
|
}
|
|
}
|
|
|
|
static void cleanup_umount_visitations(struct list_head *visited)
|
|
{
|
|
while (!list_empty(visited)) {
|
|
struct mount *mnt =
|
|
list_first_entry(visited, struct mount, mnt_umounting);
|
|
list_del_init(&mnt->mnt_umounting);
|
|
}
|
|
}
|
|
|
|
/*
|
|
* collect all mounts that receive propagation from the mount in @list,
|
|
* and return these additional mounts in the same list.
|
|
* @list: the list of mounts to be unmounted.
|
|
*
|
|
* vfsmount lock must be held for write
|
|
*/
|
|
int propagate_umount(struct list_head *list)
|
|
{
|
|
struct mount *mnt;
|
|
LIST_HEAD(to_restore);
|
|
LIST_HEAD(to_umount);
|
|
LIST_HEAD(visited);
|
|
|
|
/* Find candidates for unmounting */
|
|
list_for_each_entry_reverse(mnt, list, mnt_list) {
|
|
struct mount *parent = mnt->mnt_parent;
|
|
struct mount *m;
|
|
|
|
/*
|
|
* If this mount has already been visited it is known that it's
|
|
* entire peer group and all of their slaves in the propagation
|
|
* tree for the mountpoint has already been visited and there is
|
|
* no need to visit them again.
|
|
*/
|
|
if (!list_empty(&mnt->mnt_umounting))
|
|
continue;
|
|
|
|
list_add_tail(&mnt->mnt_umounting, &visited);
|
|
for (m = propagation_next(parent, parent); m;
|
|
m = propagation_next(m, parent)) {
|
|
struct mount *child = __lookup_mnt(&m->mnt,
|
|
mnt->mnt_mountpoint);
|
|
if (!child)
|
|
continue;
|
|
|
|
if (!list_empty(&child->mnt_umounting)) {
|
|
/*
|
|
* If the child has already been visited it is
|
|
* know that it's entire peer group and all of
|
|
* their slaves in the propgation tree for the
|
|
* mountpoint has already been visited and there
|
|
* is no need to visit this subtree again.
|
|
*/
|
|
m = skip_propagation_subtree(m, parent);
|
|
continue;
|
|
} else if (child->mnt.mnt_flags & MNT_UMOUNT) {
|
|
/*
|
|
* We have come accross an partially unmounted
|
|
* mount in list that has not been visited yet.
|
|
* Remember it has been visited and continue
|
|
* about our merry way.
|
|
*/
|
|
list_add_tail(&child->mnt_umounting, &visited);
|
|
continue;
|
|
}
|
|
|
|
/* Check the child and parents while progress is made */
|
|
while (__propagate_umount(child,
|
|
&to_umount, &to_restore)) {
|
|
/* Is the parent a umount candidate? */
|
|
child = child->mnt_parent;
|
|
if (list_empty(&child->mnt_umounting))
|
|
break;
|
|
}
|
|
}
|
|
}
|
|
|
|
umount_list(&to_umount, &to_restore);
|
|
restore_mounts(&to_restore);
|
|
cleanup_umount_visitations(&visited);
|
|
list_splice_tail(&to_umount, list);
|
|
|
|
return 0;
|
|
}
|