RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

From: Xin, Xiaohui
Date: Sat Sep 11 2010 - 03:41:42 EST


>>Playing with rlimit on data path, transparently to the application in this way
>>looks strange to me, I suspect this has unexpected security implications.
>>Further, applications may have other uses for locked memory
>>besides mpassthru - you should not just take it because it's there.
>>
>>Can we have an ioctl that lets userspace configure how much
>>memory to lock? This ioctl will decrement the rlimit and store
>>the data in the device structure so we can do accounting
>>internally. Put it back on close or on another ioctl.
>Yes, we can decrement the rlimit in ioctl in one time to avoid
>data path.
>
>>Need to be careful for when this operation gets called
>>again with 0 or another small value while we have locked memory -
>>maybe just fail with EBUSY? or wait until it gets unlocked?
>>Maybe 0 can be special-cased and deactivate zero-copy?.
>>

How about we don't use a new ioctl, but just check the rlimit
in one MPASSTHRU_BINDDEV ioctl? If we find mp device
break the rlimit, then we fail the bind ioctl, and thus can't do
zero copy any more.

>In fact, if we choose RLIMIT_MEMLOCK to limit the lock memory,
>the default value is only 16 pages. It's too small to make the device to
>work. So we always to configure it with a large value.
>I think, if rlimit value after decrement is < 0, then deactivate zero-copy
>is better. 0 maybe ok.
>

>>> +
>>> + if (ctor->lock_pages + count > lock_limit && npages) {
>>> + printk(KERN_INFO "exceed the locked memory rlimit.");
>>> + return NULL;
>>> + }
>>> +
>>> + info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
>>
>>You seem to fill in all memory, why zalloc? this is data path ...
>
>Ok, Let me check this.
>
>>
>>> +
>>> + if (!info)
>>> + return NULL;
>>> +
>>> + for (i = j = 0; i < count; i++) {
>>> + base = (unsigned long)iov[i].iov_base;
>>> + len = iov[i].iov_len;
>>> +
>>> + if (!len)
>>> + continue;
>>> + n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
>>> +
>>> + rc = get_user_pages_fast(base, n, npages ? 1 : 0,
>>
>>npages controls whether this is a write? Why?
>
>We use npages as a flag here. In mp_sendmsg(), we called alloc_page_info() with npages =
>0.
>
>>
>>> + &info->pages[j]);
>>> + if (rc != n)
>>> + goto failed;
>>> +
>>> + while (n--) {
>>> + frags[j].offset = base & ~PAGE_MASK;
>>> + frags[j].size = min_t(int, len,
>>> + PAGE_SIZE - frags[j].offset);
>>> + len -= frags[j].size;
>>> + base += frags[j].size;
>>> + j++;
>>> + }
>>> + }
>>> +
>>> +#ifdef CONFIG_HIGHMEM
>>> + if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
>>> + for (i = 0; i < j; i++) {
>>> + if (PageHighMem(info->pages[i]))
>>> + goto failed;
>>> + }
>>> + }
>>> +#endif
>>
>>Are non-highdma devices worth bothering with? If yes -
>>are there other limitations devices might have that we need to handle?
>>E.g. what about non-s/g devices or no checksum offloading?.
>
>Basically I think there is no limitations for both, but let me have a check.
>
>>
>>> + skb_push(skb, ETH_HLEN);
>>> +
>>> + if (skb_is_gso(skb)) {
>>> + hdr.hdr.hdr_len = skb_headlen(skb);
>>> + hdr.hdr.gso_size = shinfo->gso_size;
>>> + if (shinfo->gso_type & SKB_GSO_TCPV4)
>>> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>>> + else if (shinfo->gso_type & SKB_GSO_TCPV6)
>>> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>>> + else if (shinfo->gso_type & SKB_GSO_UDP)
>>> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
>>> + else
>>> + BUG();
>>> + if (shinfo->gso_type & SKB_GSO_TCP_ECN)
>>> + hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>>> +
>>> + } else
>>> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
>>> +
>>> + if (skb->ip_summed == CHECKSUM_PARTIAL) {
>>> + hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>>> + hdr.hdr.csum_start =
>>> + skb->csum_start - skb_headroom(skb);
>>> + hdr.hdr.csum_offset = skb->csum_offset;
>>> + }
>>
>>We have this code in tun, macvtap and packet socket already.
>>Could this be a good time to move these into networking core?
>>I'm not asking you to do this right now, but could this generic
>>virtio-net to skb stuff be encapsulated in functions?
>
>It seems reasonable.
>
>>
>>--
>>MST
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@xxxxxxxxxxxxxxx
>More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/