Re: [PATCH v3] net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu

From: Daniel Borkmann
Date: Thu May 07 2020 - 17:06:10 EST

Next message: Johannes Weiner: "Re: [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo"
Previous message: Rob Herring: "Re: [PATCH v4 01/12] PCI: aardvark: Train link immediately after enabling training"
In reply to: Maciej Åenczykowski: "Re: [PATCH v3] net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 5/7/20 6:46 PM, Maciej Åenczykowski wrote:

(a) not clear why the max is SKB_MAX_ALLOC in the first place (this is
PAGE_SIZE << 2, ie. 16K on x86), while lo mtu is 64k

Agreed, tbh, it's not clear to me either atm. :) The SKB_MAX_ALLOC constant itself
should be replaced with something more appropriate. Also as a small side note,
the !skb->dev check should be removed since in tc ingress/egress the skb->dev
is never NULL. (See also tc_cls_act_convert_ctx_access() on struct __sk_buff,
ifindex conversion.)

(b) hmm, if we're not redirecting, then exceeding the ingress device's
mtu doesn't seem to be a problem.

Indeed AFAIK this can already happen, some devices will round mtu up
when they configure the device mru buffers.
(ie. you configure L3 mtu 1500, they treat that as L2 1536 or 1532 [-4
fcs], simply because 3 * 512 is a nice amount of buffers, or they'll
accept not only 1514 L2, but also 1518 L2 or even 1522 L2 for VLAN and
Q-IN-Q -- even if the packets aren't actually VLAN'ed, so your non
VLAN mru might be 1504 or 1508)

Indeed my corp dell workstation with some standard 1 gigabit
motherboard nic has a standard default mtu of 1500, and I've seen it
receive L3 mtu 1520 packets (apparently due to misconfiguration in our
hardware [cisco? juniper?] ipv4->ipv6 translator which can take 1500
mtu ipv4 packets and convert them to 1520 mtu ipv6 packets without
fragmenting or generating icmp too big errors). While it's obviously
wrong, it does just work (the network paths themselves are also
obviously 1520 clean).

Right, agree on tc ingress side when skb goes further up the stack.

(c) If we are redirecting we'll eventually (after bpf program returns)
hit dev_queue_xmit(), and shouldn't that be what returns an error?

You mean whether the check should be inside __dev_queue_xmit() itself
instead? Maybe. From a cursory glance the MTU checks are spread in upper
layer protos. As mentioned, we would need to have coverage for BPF progs
attached on the tc ingress and egress (sch_handle_{ingress,egress}())
hook where for each case an skb can be just TC_ACT_OK'ed (up to stack or
down to driver), redirected via ____dev_forward_skb() or dev_queue_xmit().
The ____dev_forward_skb() has us covered and for TC_ACT_OK on tc ingress,
we'd only need a generic upper cap. So for the rest of the cases, we'd
need to have some sort of sanity check somewhere.

btw. is_skb_forwardable() actually tests
- device is up && (packet is gso || skb->len < dev->mtu +
dev->hard_header_len + VLAN_HLEN);

which is also wrong and in 2 ways, cause VLAN_HLEN makes no sense on
non ethernet, and the __bpf_skb_max_len function doesn't account for
VLAN... (which possibly has implications if you try to redirect to a
vlan interface)

Yeah, it probably would for QinQ which is probably why noone was running
into it so far. At least the skb_vlan_push() would first store the tag
via __vlan_hwaccel_put_tag() in the skb itself.

I think having an mtu check is useful, but I think the mtu should be
selectable by the bpf program. Because it might not even be device
mtu at all, it might be path mtu which we should be testing against.
It should also be checked for gso frames, since the max post
segmentation size should be enforced.

I agree we should expose dev->mtu (and dev->hard_header_len and hatype)

I'll mull this over a bit more, but I'm not convinced this patch isn't ok as is.
There just is probably more we should do.

Ok, makes sense. Thanks for looking into it!

Next message: Johannes Weiner: "Re: [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo"
Previous message: Rob Herring: "Re: [PATCH v4 01/12] PCI: aardvark: Train link immediately after enabling training"
In reply to: Maciej Åenczykowski: "Re: [PATCH v3] net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]