Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow transmissions

From: Dean Jenkins
Date: Wed May 11 2016 - 18:00:26 EST

Next message: Peter Hurley: "Re: [PATCH v2] locking/rwsem: Add reader-owned state to the owner field"
Previous message: Eric Dumazet: "Re: [RFC PATCH 0/2] net: threadable napi poll loop"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi John,

I have purchased a "uGreen" USB Ethernet Adaptor which was reported as showing the issue:

lsusb shows:
ID 0b95:772b ASIX Electronics Corp. AX88772B

dmesg shows:
[119591.413298] usb 2-1: new high-speed USB device number 12 using ci_hdrc
[119591.576970] usb 2-1: New USB device found, idVendor=0b95, idProduct=772b
[119591.576994] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[119591.577010] usb 2-1: Product: AX88772C
[119591.577025] usb 2-1: Manufacturer: ASIX Elec. Corp.

Strangely the product string says "AX88772C" and lsusb shows "AX88772B"

I used our ARM (32-bit 2 core) board running our highly customised 3.14 kernel and ran a ping test that slowly increments the ping payload size so forcing the Ethernet frames to slowly extend in length and eventually forcing IPv4 fragmentation to occur due to the MTU limit of 1500. In my test the ICMP ping payload lengths ranged from 1 to 5000.

During the test run I saw (only 3 errors):
[27455.113010] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header synchronisation was lost, remaining 23
[27455.113037] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x77767574, offset 4
[27456.113269] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header synchronisation was lost, remaining 27
[27456.113329] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x77767574, offset 4
[27457.113271] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header synchronisation was lost, remaining 30
[27457.113328] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x77767574, offset 4

This meets my expectation of "sync lost" followed immediately by "Bad Header Length". A close look at the timestamps shows gaps of around 20us to 50us which suggests the code is processing the same URB eg. "sync lost" and "Bad Header Length" are written from the same instance of asix_rx_fixup_internal().

My example suggests that the previous URB went missing so data was lost causing a discontinuity in the data stream. This was the intended purpose of the commit to prevent bad Ethernet frames being sent up the IP stack when an URB went missing. A bad Ethernet frame would otherwise be created by having the start of an Ethernet frame appended with data from the current URB causing a corrupted Ethernet frame to be generated and sent up the IP stack.

Also the failure seems to be independent of the ping payload length but longer test periods of specific payload lengths would be needed to allow the 32 bit header word to move around relative to the start of the URB buffer.

In my example, the "Bad Header Length 0x77767574" is reading the ping payload data of 0x74, 0x75, 0x76, 077 which is located at the start of the URB buffer. The remaining values are low at 23 to 30 which suggests the end of the Ethernet frame was in the missing URB. The ICMP ping data of 0x74, 0x75, 0x76, 077 is from the next Ethernet frame meaning the end of the current Ethernet frame is missing and the next frame has a missing start of Ethernet frame.

Note that due to IPv4 fragmentation "consecutive" Ethernet frames will contain payloads of 1500 (MTU size) octets typically followed by a short Ethernet frame. The payloads are fragmented IP packets.

So I've been trying to add some print messages here to better
understand whats going on.

Again, I'm a bit new to this code, so forgive me for my lack of
understanding things. Since the remaining value seems to be key, I
tried to look around and figure out where it was being set. It seems
like its only set in this function, is that right? So this made me
guess something might be happening in a previous iteration that was
causing this to trigger.

I added some debug prints to every time we set the remaining value, or
modify it, as well as to print the value if we enter the fixup
function with a non-zero remaining value.

When we set the remaining value, usually its to 1514, when the skblen is 1518.

1514 is Ethernet header length + payload at MTU size of 1500.
skblen of 1518 is 32-bit header word + maximum Ethernet frame length (for your Network).

However, right before we catch the problem, I see this:

I am guessing where your debug is located in the code so I may have ms-interpreted your information.

[ 84.844337] JDB set remaining to 1514 (skblen: 1518)

This suggests 1 maximum length Ethernet frame of 1514 octets in the URB buffer.

[ 84.844379] JDB set remaining to 1514 (skblen: 1518)
[ 84.844429] JDB set remaining to 1514 (skblen: 1518)
[ 84.844458] JDB set remaining to 1514 (skblen: 1518)
[ 84.844483] JDB set remaining to 1514 (skblen: 1518)
[ 84.844507] JDB set remaining to 1514 (skblen: 1518)
[ 84.844559] JDB set remaining to 1514 (skblen: 2048)

This URB probably has 2 Ethernet frames; 1 complete frame plus the start of the next Ethernet frame.

I think 2048 could be the maximum URB transfer length for the USB bulk transfer. 2048 seems to be a low value so should be investigated.

[ 84.844583] JDB set remaining to 1514 (skblen: 2048)
[ 84.844595] JDB: 1514 > 2048 - 1522
[ 84.844606] JDB: dropping remaining by 526

32-bit header word + Ethernet frame of 1514 is consumed leaving 2048 - (4+1514) = 530 octets

So there is room for another 32-bit header word and the start of a new Ethernet frame within those 530 octets.
Start of Ethernet frame is 530 - 4 = 526 octets of Ethernet frame than spans into the next URB buffer.

The remaining value is 988 which means the total Ethernet frame length was 526 + 988 = 1514 which is a maximum length Ethernet frame

[ 84.844624] asix_rx_fixup_internal() remaining: 988, skb->len: 2048
[ 84.844672] asix 1-5:1.0 eth1: asix_rx_fixup() Data Header
synchronisation was lost, remaining 988

There should be a 32-bit header word located after the Ethernet frame after location 988 but no 32-bit header was there. This is consistent with a missing URB (there could be other reasons).

However, the gap between the timestamps is higher than expected at 273us:
84.844945 - 84.844672 = 0.000273 = 273us

It could be that "Sync lost" and "Bad Header" are not from the same instance of asix_rx_fixup_internal() eg. processing different URBs

Therefore, it could be concluded that synchronisation has been recovered by the "sync lost" detector forcing to look at the start of the URB buffer for the 32-bit header word and a 32-bit header was found as no immediate "Bad Header length" message was indicated. This also suggests a URB was lost.

[ 84.844945] asix 1-5:1.0 eth1: asix_rx_fixup() Bad Header Length
0xdd5f8f9b, offset 4

This time an URB containing the start of Ethernet frame is missing because remaining is 0 (as no "sync lost" message"). This means that there is no 32-bit header word at the start of the URB buffer meaning no start of an Ethernet frame. This is also consistent with a dropped URB which contained the start of the now missing Ethernet frame.

[ 84.845217] asix 1-5:1.0 eth1: asix_rx_fixup() Bad Header Length
0x3ce1ad3c, offset 4
[ 84.845451] JDB set remaining to 1514 (skblen: 1518)
[ 84.845485] JDB set remaining to 1514 (skblen: 1518)
[ 84.845511] JDB set remaining to 1514 (skblen: 1518)
[ 84.851003] JDB set remaining to 1514 (skblen: 1518)

So when the issue happens, it seems to be due to an larger then usual
skb (2048). The first time through the wile loop we set the remaining
to 1514, but offset is fairly small, so we set copy_length to 1514,
and clear remaining. The offset is bumped by a little more then the
copy length and we loop again. Then the second time through we set
remaining to 1514, but since offset is bigger now, the if
(rx->remaining > skb->len - offset) case is true..

This is where it feels a little strange..

It is because you have to wait for the next URB to get the remaining part of the Ethernet frame which occurs in the near future. eg. Ethernet frame spanned across 2 URB buffers.

We calculate the copy_length as the difference between the offset and
the skb->len (so how much is left in the skb, which is 526), then
decrement remaining by that amount. Not really sure what remaining
(now 988) is supposed to represent here. We copy the 526 bytes, and
then exit the loop.

As explained above. The end of the current URB buffer contains the first 526 octets of the Ethernet frame and we have to wait for the next URB to get the next part of the Ethernet frame.

The Ethernet frame length is 1514 so the remaining value is 1514 - 526 = 988. So the next URB should contain the missing 988 octets.

Now the next time we are called, we enter and we have a remaining
value still of 988, which triggers the header synchronization error
path.

The sync is lost because this new URB buffer is not containing the expected 988 octets because the 32-bit header word after the end of the Ethernet frame appears to be missing. Therefore, probably detected a dropped URB which had the 988 octets of the end of the current Ethernet frame.

Now, I'm not sure if the remainder handling logic is sane, or if the
skb->len being 2048 is problematic, or what. The skb->lens can vary
in sizes, usually 1518 during high throughput, but I've seen 1588,
1640, and other larger numbers that don't trigger the same problem.

I expect the USB bulk transfer is limited to a maximum transfer size of 2048 octets eg. max URB buffer size. The ASIX driver code should be checked to see where the 2048 limit comes from. I wonder whether 4096 could be used ?

If the USB to Ethernet adaptor has more data to send than can fit in 2048 byte buffer then the Ethernet frame needs to be split across 2 URBs.

In the case of 1640 buffer, it probably means 2 Ethernet frames are in the buffer:
32-bit header word + 1514 Ethernet frame + 32-bit header word + ((1640 - (4 + 1514 +4)) = 118) Ethernet frame
This could be due to IP fragmentation.

ie:
[ 106.946473] JDB set remaining to 1514 (skblen: 1518)
[ 106.946525] JDB set remaining to 1514 (skblen: 1640)
[ 106.946546] JDB set remaining to 118 (skblen: 1640)

Yes, that 118 confirms my reasoning above.

[ 106.946586] JDB set remaining to 1514 (skblen: 1518)

So yea.. maybe that will help clue things in a bit? I'm still a bit lost. :)

Your observations are consistent with missing URBs from the USB host controller.

Here is a summary of what I think is happening in your case:

Good case:
URB #1: 1514 octets of 1514 Ethernet frame (A)
URB #2: 1514 octets of 1514 Ethernet frame (B) + 526 octets of 1514 Ethernet frame (C)
URB #3: 988 octets of 1514 Ethernet frame (C)
URB #4: 1514 octets of 1514 Ethernet frame (D)

Therefore, Ethernet frame (C) is spanning URBs #2 and #3.

Bad case, URB #3 is lost:
URB #1: 1514 octets of 1514 Ethernet frame (A)
URB #2: 1514 octets of 1514 Ethernet frame (B) + 526 octets of 1514 Ethernet frame (C)
Remaining is 988
URB #4: 1514 octets of 1514 Ethernet frame (D)

But when URB #4 is analysed the 32-bit Header word is not found after 988 octets in the URB buffer so "sync lost".
The end of Ethernet frame (C) is missing so drop the Ethernet frame.
Now look at the start of the URB #4 buffer and find a 32-bit header word so Ethernet frame (D) can be consumed.

So I think the commit is acting as intended and you are suffering from lost URBs.

So perhaps you have a URB memory allocation issue for the USB host controller ? However, this would also mean x86 has the same issue. So perhaps it is a bug in common USB kernel code ?

You might find it helpful to get USB analyser logs or use the software utility usbmon to grab the USB host controller data. You might then be able to compare the data from the USB log with the URB buffer data. This might confirm that URBs are going missing.

On the other hand, it was suggested that removing my commit improved through-put. Is that really true ? Perhaps you can try an experiment where you add test code to add a counter to use my "sync lost" detector for 1000 URBs then by-pass the "sync lost" detector for 1000 URBs and repeat it cyclically. 1000 is a guess, you might need a higher value. Then run some tests to see whether the through-put periodically goes up and down in sympathy with the "sync lost" detector being used or not being used. Also track the RX error counter from ifconfig to see how many bad Ethernet frames got to the IP stack.

Also try using wireshark to see whether you are generating IP fragmented frames. Unfortunately, using iperf with IPv6 with defaults will generate IPv6 fragmentation causing twice as many Ethernet frames to be sent as needed so is inefficient. To fix that, in iperf reduce the data length so that the MTU size of 1500 is not exceeded.

The commit only runs when Ethernet frames span URBs which typically occurs when IP fragmentation happens or when there is a very high rate of Ethernet frames per second.

At least, it is now clear why you see "988 remaining", that is because you are using maximum length Ethernet frames and 2 of those frames are written into a single URB which means 988 octets of the 2nd Ethernet frame goes into the 2nd URB buffer. So that makes sense and is correct.

Regards,
Dean

thanks
-john

--
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.

Next message: Peter Hurley: "Re: [PATCH v2] locking/rwsem: Add reader-owned state to the owner field"
Previous message: Eric Dumazet: "Re: [RFC PATCH 0/2] net: threadable napi poll loop"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]