Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node

From: Stephen Smalley
Date: Fri Dec 15 2017 - 08:56:02 EST


On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
> On 12/15/2017 10:31 PM, yangjihong wrote:
> > On 12/14/2017 12:42 PM, Casey Schaufler wrote:
> > > On 12/14/2017 9:15 AM, Stephen Smalley wrote:
> > > > On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
> > > > > On 12/14/2017 8:42 AM, Stephen Smalley wrote:
> > > > > > On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
> > > > > > > On 12/13/2017 7:18 AM, Stephen Smalley wrote:
> > > > > > > > On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am doing stress testing on the 3.10 kernel (CentOS 7.4),
> > > > > > > > > constantly starting numbers of docker containers with selinux
> > > > > > > > > enabled, and after about 2 days the kernel softlockup panics:
> > > > > > > > > <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
> > > > > > > > > [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
> > > > > > > > > [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
> > > > > > > > > [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
> > > > > > > > > [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
> > > > > > > > > [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
> > > > > > > > > [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
> > > > > > > > > [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
> > > > > > > > > [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
> > > > > > > > > <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
> > > > > > > > > [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
> > > > > > > > > [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
> > > > > > > > > [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
> > > > > > > > > [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
> > > > > > > > > [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
> > > > > > > > > [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
> > > > > > > > > [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
> > > > > > > > > [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
> > > > > > > > > [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
> > > > > > > > >
> > > > > > > > > My opinion:
> > > > > > > > > when a docker container starts, it mounts an overlay
> > > > > > > > > filesystem with a different selinux context, with mount
> > > > > > > > > points such as:
> > > > > > > > > overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
> > > > > > > > > shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
> > > > > > > > > overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
> > > > > > > > > shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
> > > > > > > > >
> > > > > > > > > sidtab_search_context checks whether the context is in the
> > > > > > > > > sidtab list; if it is not found, a new node is generated and
> > > > > > > > > inserted into the list. As the number of containers grows,
> > > > > > > > > so does the number of context nodes; in our testing the
> > > > > > > > > final number of nodes reached 300,000+, and a single
> > > > > > > > > sidtab_context_to_sid call took 100-200ms, which leads to
> > > > > > > > > the system softlockup.
> > > > > > > > >
> > > > > > > > > Is this a selinux bug? When a filesystem is unmounted, why
> > > > > > > > > is its context node not deleted? I cannot find the relevant
> > > > > > > > > function to delete the node in sidtab.c.
> > > > > > > > >
> > > > > > > > > Thanks for reading and looking forward to your reply.
> > > > > > > >
> > > > > > > > So, does docker just keep allocating a unique category set for
> > > > > > > > every new container, never reusing them even if the container is
> > > > > > > > destroyed? That would be a bug in docker IMHO. Or are you
> > > > > > > > creating an unbounded number of containers and never destroying
> > > > > > > > the older ones?
> > > > > > >
> > > > > > > You can't reuse the security context. A process in ContainerA
> > > > > > > sends a labeled packet to MachineB. ContainerA goes away and its
> > > > > > > context is recycled in ContainerC. MachineB responds some time
> > > > > > > later, again with a labeled packet. ContainerC gets information
> > > > > > > intended for ContainerA, and uses the information to take over the
> > > > > > > Elbonian government.
> > > > > >
> > > > > > Docker isn't using labeled networking (nor is anything else by
> > > > > > default; it is only enabled if explicitly configured).
> > > > >
> > > > > If labeled networking weren't an issue we'd have full security
> > > > > module stacking by now. Yes, it's an edge case. If you want to use
> > > > > labeled NFS or a local filesystem that gets mounted in each
> > > > > container (don't tell me that nobody would do that) you've got the
> > > > > same problem.
> > > >
> > > > Even if someone were to configure labeled networking, Docker is not
> > > > presently relying on that or SELinux network enforcement for any
> > > > security properties, so it really doesn't matter.
> > >
> > > True enough. I can imagine a use case, but as you point out, it would
> > > be a very complex configuration and coordination exercise using
> > > SELinux.
> > >
> > > > And if they wanted to do that, they'd have to coordinate category
> > > > assignments across all systems involved, for which no facility
> > > > exists AFAIK. If you have two docker instances running on different
> > > > hosts, I'd wager that they can hand out the same category sets today
> > > > to different containers.
> > > >
> > > > With respect to labeled NFS, that's also not the default for nfs
> > > > mounts, so again it is a custom configuration and Docker isn't
> > > > relying on it for any guarantees today. For local filesystems, they
> > > > would normally be context-mounted or using genfscon rather than
> > > > xattrs in order to be accessible to the container, thus no
> > > > persistent storage of the category sets.
> >
> > Well Kubernetes and OpenShift do set the labels to be the same
> > within a project, and they can manage across nodes. But yes we are
> > not using labeled networking at this point.
> > > I know that is the intended configuration, but I see people do all
> > > sorts of stoopid things for what they believe are good reasons.
> > > Unfortunately, lots of people count on containers to provide
> > > isolation, but create "solutions" for data sharing that defeat it.
> > >
> > > > Certainly docker could provide an option to not reuse category sets,
> > > > but making that the default is not sane and just guarantees
> > > > exhaustion of the SID and context space (just create and tear down
> > > > lots of containers every day or more frequently).
> > >
> > > It seems that Docker might have a similar issue with UIDs, but it
> > > takes longer to run out of UIDs than sidtab entries.
> > >
> > > > > > > > On the selinux userspace side, we'd also like to eliminate the
> > > > > > > > use of /sys/fs/selinux/user (sel_write_user ->
> > > > > > > > security_get_user_sids) entirely, which is what triggered this
> > > > > > > > for you.
> > > > > > > >
> > > > > > > > We cannot currently delete a sidtab node because we have no way
> > > > > > > > of knowing if there are any lingering references to the SID.
> > > > > > > > Fixing that would require reference-counted SIDs, which goes
> > > > > > > > beyond just SELinux since SIDs/secids are returned by LSM hooks
> > > > > > > > and cached in other kernel data structures.
> > > > > > >
> > > > > > > You could delete a sidtab node. The code already deals with
> > > > > > > unfindable SIDs. The issue is that eventually you run out of SIDs.
> > > > > > > Then you are forced to recycle SIDs, which leads to the overthrow
> > > > > > > of the Elbonian government.
> > > > > >
> > > > > > We don't know when we can safely delete a sidtab node since SIDs
> > > > > > aren't reference counted and we can't know whether it is still in
> > > > > > use somewhere in the kernel. Doing so prematurely would lead to
> > > > > > the SID being remapped to the unlabeled context, and then likely to
> > > > > > undesired denials.
> > > > >
> > > > > I would suggest that if you delete a sidtab node and someone comes
> > > > > along later and tries to use it that denial is exactly what you
> > > > > would desire. I don't see any other rational action.
> > > >
> > > > Yes, if we know that the SID wasn't in use at the time we tore it
> > > > down. But if we're just randomly deleting sidtab entries based on
> > > > age or something (since we have no reference count), we'll almost
> > > > certainly encounter situations where a SID hasn't been accessed in a
> > > > long time but is still being legitimately cached somewhere. Just a
> > > > file that hasn't been accessed in a while might have that SID still
> > > > cached in its inode security blob, or anywhere else.
> > > >
> > > > > > > > sidtab_search_context() could no doubt be optimized for the
> > > > > > > > negative case; there was an earlier optimization for the positive
> > > > > > > > case by adding a cache to sidtab_context_to_sid() prior to
> > > > > > > > calling it. It's a reverse lookup in the sidtab.
> > > > > > >
> > > > > > > This seems like a bad idea.
> > > > > >
> > > > > > Not sure what you mean, but it can certainly be changed to at least
> > > > > > use a hash table for these reverse lookups.
> > > > > >
> > > > > >
> > >
> > >
> > >
>
> Thanks for the reply and discussion.
> I think the docker container is only one case. Is it possible that a
> similar pattern, through some means of attack, could trigger a
> constantly growing SID list, eventually leading to a system panic?
>
> I think the issue is that it takes too long to search for a SID node
> when the SID list is too large. If the node's data structure (e.g. a
> tree structure) or the search algorithm could be optimized so that
> traversing all nodes stays fast even with many nodes, that might solve
> the problem.
> Or, sidtab.c could provide a "delete_sidtab_node" interface to delete
> the SID node when a filesystem is unmounted. Because the SID is useless
> once the fs is unmounted, it could be deleted to control the size of
> the SID list.
>
> Thanks for reading and looking forward to your reply.

We cannot safely delete entries in the sidtab without first adding
reference counting of SIDs, which goes beyond just SELinux since they
are cached in other kernel data structures and returned by LSM hooks.
That's a non-trivial undertaking.
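To make the constraint concrete, here is a minimal sketch (hypothetical names, plain userspace C, not actual kernel code) of what reference counting would imply: every place in the kernel that stores a secid would have to take a reference and drop it when done, and a sidtab node could only be freed once its count reaches zero.

#include <stdlib.h>

/* Hypothetical refcounted sidtab entry; illustrative only. */
struct sid_entry {
	unsigned int sid;
	unsigned long refcount;	/* kernel users still holding this SID */
};

/* Taken wherever a secid is cached (inode security blobs, etc.). */
static void sid_get(struct sid_entry *e)
{
	e->refcount++;
}

/* Dropped when the cached secid is released; only then is removal safe. */
static void sid_put(struct sid_entry *e)
{
	if (--e->refcount == 0)
		free(e);
}

The hard part is not the counting itself but converting every existing consumer of secids to these get/put semantics.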

Far more practical in the near term would be to introduce a hash table
or other mechanism for efficient reverse lookups in the sidtab. Are
you offering to implement that or just requesting it?
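For reference, a minimal userspace sketch of the hashed reverse lookup idea (plain C; the struct layout, the string form of a context, and the hash function are illustrative assumptions, not the actual security/selinux/ss/sidtab.c code): contexts are hashed into buckets so a context-to-SID lookup walks only one bucket instead of the entire list, and a miss inserts the new node into its bucket.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIDTAB_HASH_BITS	10
#define SIDTAB_HASH_BUCKETS	(1 << SIDTAB_HASH_BITS)

struct sidtab_node {
	unsigned int sid;
	char *context;			/* e.g. "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873" */
	struct sidtab_node *next;	/* next node in the same hash bucket */
};

struct sidtab {
	struct sidtab_node *buckets[SIDTAB_HASH_BUCKETS];
	unsigned int next_sid;
};

/* Simple string hash (djb2); the kernel would hash the context fields. */
static unsigned int context_hash(const char *ctx)
{
	unsigned int h = 5381;

	while (*ctx)
		h = h * 33 + (unsigned char)*ctx++;
	return h & (SIDTAB_HASH_BUCKETS - 1);
}

/* Reverse lookup: context -> SID, allocating a new SID on a miss. */
static unsigned int sidtab_context_to_sid(struct sidtab *s, const char *ctx)
{
	unsigned int h = context_hash(ctx);
	struct sidtab_node *n;

	for (n = s->buckets[h]; n; n = n->next)
		if (strcmp(n->context, ctx) == 0)
			return n->sid;	/* hit: cost is bucket length, not total node count */

	n = malloc(sizeof(*n));
	n->sid = s->next_sid++;
	n->context = strdup(ctx);
	n->next = s->buckets[h];
	s->buckets[h] = n;
	return n->sid;
}

int main(void)
{
	static struct sidtab s = { .next_sid = 1 };

	printf("%u\n", sidtab_context_to_sid(&s,
	       "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873"));
	printf("%u\n", sidtab_context_to_sid(&s,
	       "system_u:object_r:svirt_sandbox_file_t:s0:c431,c651"));
	printf("%u\n", sidtab_context_to_sid(&s,
	       "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873"));	/* same SID again */
	return 0;
}

With something along these lines, the negative case hit by sel_write_user -> security_get_user_sids would no longer degrade linearly with the total number of SIDs, even at 300,000+ entries.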

Independent of that, docker should support reuse of category sets when
containers are deleted, at least as an option and probably as the
default.