Re: [RFC] syscalls, x86: Add __NR_kcmp syscall

From: Cyrill Gorcunov
Date: Wed Jan 18 2012 - 03:01:10 EST


On Tue, Jan 17, 2012 at 01:35:00PM -0800, Eric W. Biederman wrote:
> "H. Peter Anvin" <hpa@xxxxxxxxx> writes:
>
> > On 01/17/2012 06:44 AM, Cyrill Gorcunov wrote:
> >> On Tue, Jan 17, 2012 at 04:38:14PM +0200, Alexey Dobriyan wrote:
> >>> On 1/17/12, Cyrill Gorcunov <gorcunov@xxxxxxxxx> wrote:
> >>>> +#define KCMP_EQ 0
> >>>> +#define KCMP_LT 1
> >>>> +#define KCMP_GT 2
> >>>
> >>> LT and GT are meaningless.
> >>>
> >>
> >> I found symbolic names better than open-coded values. But sure,
> >> if this is problem it could be dropped.
> >>
> >> Or you mean that in general anything but 'equal' is useless?
> >>
> >
> > Why on Earth would user space need to know which order in memory certain
> > kernel objects are?
>
> For checkpoint restart and for some other kinds of introspection what is
> needed is a comparison function to see if two processes share the same
> object. The most interesting of these objects from a checkpoint restart case
> are file descriptors, and there can be a lot of file descriptors.
>
> The order in memory does not matter. What does matter is that the
> comparison function return some ordering between objects. The algorithm
> for figuring out of N items which of them are duplicates is O(N^2) if
> the comparison function can only return equal or not equal. The
> algorithm for finding duplications is only O(NlogN) if the comparison
> function will return an ordering among the objects.
>
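
To make Eric's complexity argument concrete, here is a minimal user-space
sketch, not code from the patch. It assumes a hypothetical struct obj_ref
whose 'key' field stands in for whatever ordering the kernel would expose;
with a real syscall the comparator would call into the kernel instead.
Once a three-way comparison is available, the references can be sorted
(typically O(NlogN) with qsort()) and duplicates end up adjacent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel object reference (assumption). */
struct obj_ref {
	int fd;            /* descriptor number in the inspected task */
	unsigned long key; /* opaque ordering key, simulated for this demo */
};

/* Three-way comparison: negative, zero or positive, as qsort() expects. */
static int obj_cmp(const void *a, const void *b)
{
	const struct obj_ref *x = a, *y = b;

	return (x->key > y->key) - (x->key < y->key);
}

/*
 * Sort the references, then count the entries equal to their predecessor.
 * After sorting only neighbours need comparing, so the scan itself is O(N).
 */
static size_t count_duplicates(struct obj_ref *refs, size_t n)
{
	size_t i, dups = 0;

	assert(refs != NULL || n == 0);
	if (n < 2)
		return 0;

	qsort(refs, n, sizeof(*refs), obj_cmp);
	for (i = 1; i < n; i++)
		if (obj_cmp(&refs[i - 1], &refs[i]) == 0)
			dups++;
	return dups;
}
```

With an eq/ne-only comparison the sort step is not possible, and every
reference has to be checked against every other one.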

Yes, thanks Eric, I missed this text in the patch description, my bad. And
yes, performance will degrade with a plain eq/ne approach. But as Pavel
stated in another email:

| We can compare e.g. the files' target inodes (ino + dev) and positions, and
| compare each-to-each only those files for which these pairs are equal.
| Looking at the existing large containers with tens of thousands of fds,
| this gives us a maximum of 6 files to compare, and performing 15 syscalls
| for this suits us for now.

> > Keep in mind that this is *exactly* the kind of information which makes
> > rootkits easier.
>
> I would be very surprised if basic in memory ordering information was
> not already available from simple creation ordering.
>

I think Peter means the scenario where, say, we have some bug in the slab/slub
code which triggers on some Nth allocation, and an attacker somehow learns
at least one memory address of a struct file. Then, using such a syscall, the
attacker might inspect a series of fds (and the associated struct files) and
guess the addresses of the remaining struct files. In most cases this won't
help (if the system is under more or less high load and opens/closes files
fast enough, since "struct file" comes from kmem caches), but on a lightly
loaded machine this might do the trick and narrow down the addresses (if, say,
there are only 10 fds allocated from the cache in a row and you somehow know
the address of one associated struct file).

In short -- I don't know whether it's really a serious issue or not
(since from my POV it would require at least a couple of bugs in a row
before an attacker could use this information). OTOH, shit
happens exactly in 'impossible' scenarios ;)

> If using the in memory ordering is a problem in practice there are a lot
> of other possible ways to order the kernel objects. Allocating sequence
> numbers for the kernel objects, passing the pointers through a
> cryptographically secure hash before comparing them, etc.
>

We've been trying this already ;)

> It does look like Cyrill's patch description lacked the important bit of
> information about the algorithm complexity requiring an ordering among
> kernel objects. Cyrill you probably want to describe more prominently
> what is happening now and why in your patch description rather than give
> the history of different approaches.
>

Yeah, I'll write a detailed changelog, gimme some time. Thanks Eric!

Btw, extending this syscall to an lt/gt variant will be easy, so this is
not a problem I think. At the moment we guarantee to return 0/1 on success
and < 0 on error, so if we start returning 2/3 for the sake of ordering,
applications which were using only the 0/1 values won't break (unless they
are badly written).
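
To show what a caller that survives such an extension could look like,
here is a small sketch. It assumes 0 means "equal" (as the KCMP_EQ define
quoted earlier suggests); the enum and function names are hypothetical,
and the 2/3 values are only the possible future extension, nothing that
exists yet. The point is to test for "equal" and "error" explicitly and
treat every other non-negative value as "different".

```c
#include <assert.h>

/* Hypothetical caller-side view of the comparison result (assumption). */
enum cmp_outcome { CMP_SAME, CMP_DIFFERENT, CMP_ERROR };

/* Map the raw return value onto an outcome without depending on "1". */
static enum cmp_outcome interpret_result(long ret)
{
	if (ret < 0)
		return CMP_ERROR;     /* negative value on failure */
	if (ret == 0)
		return CMP_SAME;      /* assumed: 0 means "equal" */
	return CMP_DIFFERENT;         /* 1 today, possibly 2/3 later */
}

/* Usage example: the same caller handles current and extended values. */
static void example_usage(void)
{
	assert(interpret_result(0) == CMP_SAME);
	assert(interpret_result(1) == CMP_DIFFERENT);
	assert(interpret_result(3) == CMP_DIFFERENT);
	assert(interpret_result(-1) == CMP_ERROR);
}
```

A caller that checks "ret == 1" for "different" instead would misread the
extended values, which is the badly written case mentioned above.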

Cyrill