RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health reporters for NPA

From: George Cherian
Date: Tue Dec 01 2020 - 00:24:15 EST




> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 10:49 AM
> To: 'Jakub Kicinski' <kuba@xxxxxxxxxx>
> Cc: 'netdev@xxxxxxxxxxxxxxx' <netdev@xxxxxxxxxxxxxxx>; 'linux-
> kernel@xxxxxxxxxxxxxxx' <linux-kernel@xxxxxxxxxxxxxxx>;
> 'davem@xxxxxxxxxxxxx' <davem@xxxxxxxxxxxxx>; Sunil Kovvuri Goutham
> <sgoutham@xxxxxxxxxxx>; Linu Cherian <lcherian@xxxxxxxxxxx>;
> Geethasowjanya Akula <gakula@xxxxxxxxxxx>; 'masahiroy@xxxxxxxxxx'
> <masahiroy@xxxxxxxxxx>; 'willemdebruijn.kernel@xxxxxxxxx'
> <willemdebruijn.kernel@xxxxxxxxx>; 'saeed@xxxxxxxxxx'
> <saeed@xxxxxxxxxx>; 'jiri@xxxxxxxxxxx' <jiri@xxxxxxxxxxx>
> Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
>
> Jakub,
>
> > -----Original Message-----
> > From: George Cherian
> > Sent: Tuesday, December 1, 2020 9:06 AM
> > To: Jakub Kicinski <kuba@xxxxxxxxxx>
> > Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > davem@xxxxxxxxxxxxx; Sunil Kovvuri Goutham
> <sgoutham@xxxxxxxxxxx>;
> > Linu Cherian <lcherian@xxxxxxxxxxx>; Geethasowjanya Akula
> > <gakula@xxxxxxxxxxx>; masahiroy@xxxxxxxxxx;
> > willemdebruijn.kernel@xxxxxxxxx; saeed@xxxxxxxxxx; jiri@xxxxxxxxxxx
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > Hi Jakub,
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <kuba@xxxxxxxxxx>
> > > Sent: Tuesday, December 1, 2020 7:59 AM
> > > To: George Cherian <gcherian@xxxxxxxxxxx>
> > > Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > > davem@xxxxxxxxxxxxx; Sunil Kovvuri Goutham
> > <sgoutham@xxxxxxxxxxx>;
> > > Linu Cherian <lcherian@xxxxxxxxxxx>; Geethasowjanya Akula
> > > <gakula@xxxxxxxxxxx>; masahiroy@xxxxxxxxxx;
> > > willemdebruijn.kernel@xxxxxxxxx; saeed@xxxxxxxxxx; jiri@xxxxxxxxxxx
> > > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > > reporters for NPA
> > >
> > > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > > Add health reporters for RVU NPA block.
> > > > NPA Health reporters handle following HW event groups
> > > > - GENERAL events
> > > > - ERROR events
> > > > - RAS events
> > > > - RVU event
> > > > An event counter per event is maintained in SW.
> > > >
> > > > Output:
> > > > # devlink health
> > > > pci/0002:01:00.0:
> > > > reporter hw_npa
> > > > state healthy error 0 recover 0
> > > > # devlink health dump show pci/0002:01:00.0 reporter hw_npa
> > > > NPA_AF_GENERAL:
> > > > Unmap PF Error: 0
> > > > NIX:
> > > > 0: free disabled RX: 0 free disabled TX: 0
> > > > 1: free disabled RX: 0 free disabled TX: 0
> > > > Free Disabled for SSO: 0
> > > > Free Disabled for TIM: 0
> > > > Free Disabled for DPI: 0
> > > > Free Disabled for AURA: 0
> > > > Alloc Disabled for Resvd: 0
> > > > NPA_AF_ERR:
> > > > Memory Fault on NPA_AQ_INST_S read: 0
> > > > Memory Fault on NPA_AQ_RES_S write: 0
> > > > AQ Doorbell Error: 0
> > > > Poisoned data on NPA_AQ_INST_S read: 0
> > > > Poisoned data on NPA_AQ_RES_S write: 0
> > > > Poisoned data on HW context read: 0
> > > > NPA_AF_RVU:
> > > > Unmap Slot Error: 0
> > >
> > > You seem to have missed the feedback Saeed and I gave you on v2.
> > >
> > > Did you test this with the errors actually triggering? Devlink
> > > should store only
> > Yes, this was tested using the devlink health test interface by
> > injecting errors.
> > The dump gets generated automatically, and the counters do get out of
> > sync in case of continuous errors.
> > That wouldn't be much of an issue, as the user could manually trigger a
> > dump clear and re-dump the counters to get the exact status of the
> > counters at any point in time.
>
> Now that the recover op is added, the devlink error counter and recover
> counter will be correct. The internal counter for each event is needed just
> to understand, within a specific reporter, how many such events occurred.
>
> Following is the log snippet of the devlink health test being done on hw_nix
> reporter.
> # for i in `seq 1 33` ; do devlink health test pci/0002:01:00.0 reporter hw_nix; done
> //Inject 33 errors (16 of NIX_AF_RVU and 17 of NIX_AF_RAS and NIX_AF_GENERAL errors)
> # devlink health
> pci/0002:01:00.0:
> reporter hw_npa
> state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
> reporter hw_nix
> state healthy error 250 recover 250 last_dump_date 1970-01-01 last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
Oops, there was a log copy-paste error above. It is not 250 (that was from a run in which the test was done
with 250 error injections). The correct output is:
# devlink health
pci/0002:01:00.0:
reporter hw_npa
state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
reporter hw_nix
state healthy error 33 recover 33 last_dump_date 1970-01-01 last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true

> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
> Memory Fault on NIX_AQ_INST_S read: 1
> Memory Fault on NIX_AQ_RES_S write: 1
> AQ Doorbell error: 1
> Rx on unmapped PF_FUNC: 1
> Rx multicast replication error: 1
> Memory fault on NIX_RX_MCE_S read: 1
> Memory fault on multicast WQE read: 1
> Memory fault on mirror WQE read: 1
> Memory fault on mirror pkt write: 1
> Memory fault on multicast pkt write: 1
> NIX_AF_RAS:
> Poisoned data on NIX_AQ_INST_S read: 1
> Poisoned data on NIX_AQ_RES_S write: 1
> Poisoned data on HW context read: 1
> Poisoned data on packet read from mirror buffer: 1
> Poisoned data on packet read from mcast buffer: 1
> Poisoned data on WQE read from mirror buffer: 1
> Poisoned data on WQE read from multicast buffer: 1
> Poisoned data on NIX_RX_MCE_S read: 1
> NIX_AF_RVU:
> Unmap Slot Error: 0
> # devlink health dump clear pci/0002:01:00.0 reporter hw_nix
> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
> Memory Fault on NIX_AQ_INST_S read: 17
> Memory Fault on NIX_AQ_RES_S write: 17
> AQ Doorbell error: 17
> Rx on unmapped PF_FUNC: 17
> Rx multicast replication error: 17
> Memory fault on NIX_RX_MCE_S read: 17
> Memory fault on multicast WQE read: 17
> Memory fault on mirror WQE read: 17
> Memory fault on mirror pkt write: 17
> Memory fault on multicast pkt write: 17
> NIX_AF_RAS:
> Poisoned data on NIX_AQ_INST_S read: 17
> Poisoned data on NIX_AQ_RES_S write: 17
> Poisoned data on HW context read: 17
> Poisoned data on packet read from mirror buffer: 17
> Poisoned data on packet read from mcast buffer: 17
> Poisoned data on WQE read from mirror buffer: 17
> Poisoned data on WQE read from multicast buffer: 17
> Poisoned data on NIX_RX_MCE_S read: 17
> NIX_AF_RVU:
> Unmap Slot Error: 16
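For reference, the 1-versus-17 difference between the two dumps above is what one would expect if devlink keeps a single stored dump per reporter: the first error captures it, and later errors leave it untouched until it is cleared. Below is a toy shell model of that behaviour, as a sketch only. The function and variable names are made up for illustration, this is not the kernel code, and a single counter stands in for all the per-event counters.

```shell
#!/bin/sh
# Toy model (hypothetical names, not kernel code): one software event
# counter and at most one stored dump for a reporter.
count=0      # software event counter kept by the driver
stored=""    # the single stored dump; empty means "no dump stored"

report_error() {
    count=$((count + 1))
    # auto_dump: capture a dump only when none is stored yet
    if [ -z "$stored" ]; then
        stored="$count"
    fi
}

dump_show() {
    # show collects a dump only when none is stored, then prints it
    if [ -z "$stored" ]; then
        stored="$count"
    fi
    echo "$stored"
}

dump_clear() {
    stored=""
}

for i in $(seq 1 17); do
    report_error
done
dump_show     # stale value captured at the first error
dump_clear
dump_show     # current counter value
```

In this model the stored value lags the counter for as long as the dump is not cleared, which matches the log: the first show reflects the state at the first error, and the show after a clear reflects all injected errors.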
> >
> > > one dump, are the counters not going to get out of sync unless
> > > something clears the dump every time it triggers?
> Also, note that auto_dump can be turned off by the user:
> # devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false
> The user can then dump whenever required, which will always return the
> correct counter values.
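A minimal sketch of that on-demand workflow, using only the devlink commands already shown in this thread. The DEVLINK, DEV and REPORTER variables and the two helper functions are assumptions made for illustration. The clear before the show is there on the assumption that a dump collected by an earlier show stays stored until it is cleared, so a bare show could otherwise return an older snapshot.

```shell
#!/bin/sh
# Hypothetical helpers; all names here are illustrative only.
DEVLINK="${DEVLINK:-devlink}"      # set DEVLINK=echo for a dry run without hardware
DEV="${DEV:-pci/0002:01:00.0}"
REPORTER="${REPORTER:-hw_nix}"

disable_auto_dump() {
    # Stop devlink from capturing a dump automatically on each reported error.
    $DEVLINK health set "$DEV" reporter "$REPORTER" auto_dump false
}

fresh_dump() {
    # Drop any stored dump first, so that show collects a new one.
    $DEVLINK health dump clear "$DEV" reporter "$REPORTER" &&
    $DEVLINK health dump show "$DEV" reporter "$REPORTER"
}
```

With these defined, calling disable_auto_dump once and then fresh_dump whenever a snapshot is needed should print the counters as they stand at that moment.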
>
> >
> > Regards,
> > -George