Re: Memory barrier needed with wake_up_process()?

From: Felipe Balbi
Date: Mon Jan 16 2017 - 06:14:58 EST



Hi,

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> writes:
> On Tue, 20 Sep 2016, Felipe Balbi wrote:
>
>> And here's trace output (complete, scroll to bottom). It seems to me
>> like the thread didn't wake up at all. It didn't even try to
>> execute. I'll add some more traces and try to get better information
>> about what's going on.
>>
>>
>>
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 2865/2865 #P:4
>> #
>> # _-----=> irqs-off
>> # / _----=> need-resched
>> # | / _---=> hardirq/softirq
>> # || / _--=> preempt-depth
>> # ||| / delay
>> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
>> # | | | |||| | |
>
> Skipping to the end...
>
>> irq/17-dwc3-2522 [002] d... 43.504199: bulk_out_complete: 0, 31/31
>> file-storage-2521 [001] .... 43.504202: fsg_main_thread: get_next_command -> 0
>> file-storage-2521 [001] .... 43.504288: fsg_main_thread: do_scsi_command -> 0
>> file-storage-2521 [001] .... 43.504295: fsg_main_thread: finish_reply -> 0
>> file-storage-2521 [001] .... 43.504298: fsg_main_thread: send_status -> 0
>> irq/17-dwc3-2522 [002] d... 43.504347: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504351: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504434: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504438: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504535: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504539: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504618: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504703: bulk_in_complete: 0, 16384/16384
>> irq/17-dwc3-2522 [002] d... 43.504794: bulk_in_complete: 0, 13/13
>> irq/17-dwc3-2522 [002] d... 43.504797: bulk_out_complete: 0, 31/31
>
> Like you say, it appears that the thread didn't get woken up at all.
> But this is inconsistent with your earlier results. On Sep. 9, you
> posted a message that ended with these lines:
>
>> irq/17-dwc3-3579 [003] d..1 21167.729666: bulk_out_complete: compl: bh ffff880111e6aac0 state 1
>> file-storage-3578 [002] .... 21167.729670: fsg_main_thread: next: bh ffff880111e6aac0 state 1
>
> This indicates that in the earlier test, the thread did start running
> and get_next_command should have returned.
>
> The trace you posted after this one doesn't seem to show anything new,
> as far as I can tell.
>
> So I still can't tell what's happening. Maybe the patch below will
> help. It concentrates on the critical area.

Sorry for the long delay, I finally have more information on this. All
this time I was doing something that I never considered to matter: I've
been running host and peripheral on the same machine. Now that I have
tracepoints on xHCI as well, I could see that these 30 seconds of
"nothing" is actuall full of xHCI activity and I can see that for the
duration of these 30 seconds preempt depth on the CPU that (eventually)
queues a request on dwc3, is always > 1 (sometimes 2, most of the time
1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU
and g_mass_storage is spinning for over 30 seconds at which point
storage.ko (host side class driver) dequeues the request.

I'll see if I can capture a fresh trace with both xHCI and dwc3 with
this happening, but probably not today (testing stuff for -rc).

--
balbi

Attachment: signature.asc
Description: PGP signature