Re: [PATCH] netfilter: use per-CPU recursive lock {XV}

From: Mathieu Desnoyers
Date: Sun Apr 26 2009 - 17:40:21 EST


* Eric Dumazet (dada1@xxxxxxxxxxxxx) wrote:
> Mathieu Desnoyers a écrit :
> > * Eric Dumazet (dada1@xxxxxxxxxxxxx) wrote:
> >> From: Stephen Hemminger <shemminger@xxxxxxxxxx>
> >>
> >>> Epilogue due to master Jarek. Lockdep carest not about the locking
> >>> doth bestowed. Therefore no keys are needed.
> >>>
> >>> Signed-off-by: Stephen Hemminger <shemminger@xxxxxxxxxx>
> >> So far, so good, should be ready for inclusion now, nobody complained :)
> >>
> >> I include the final patch, merge of your last two patches.
> >>
> >> David, could you please review it once again and apply it if it's OK ?
> >>
> > [...]
> >> +/*
> >> + * Per-CPU read/write lock associated with per-cpu table entries.
> >> + * This is not a general solution but makes reader locking fast since
> >> + * there is no shared variable to cause cache ping-pong; but adds an
> >> + * additional write-side penalty since update must lock all
> >> + * possible CPU's.
> >> + *
> >> + * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
> >> + * It needs to ensure that the rules are not being changed while packet
> >> + * is being processed. In some cases, the read lock will be acquired
> >> + * twice on the same CPU; this is okay because read locks handle nesting.
> >> + *
> >> + * Write lock is used in two cases:
> >> + * 1. reading counter values
> >> + * all readers need to be stopped and the per-CPU values are summed.
> >> + *
> >> + * 2. replacing tables
> >> + * any readers that are using the old tables have to complete
> >> + * before freeing the old table. This is handled by reading
> >> + * as a side effect of reading counters
> >> + */
> >> +DECLARE_PER_CPU(rwlock_t, xt_info_locks);
> >> +
> >> +static inline void xt_info_rdlock_bh(void)
> >> +{
> >> + /*
> >> + * Note: can not use read_lock_bh(&__get_cpu_var(xt_info_locks))
> >> + * because need to ensure that preemption is disable before
> >> + * acquiring per-cpu-variable, so do it as a two step process
> >> + */
> >> + local_bh_disable();
> >
> > Why do you need to disable bottom halves on the read-side ? You could
> > probably just disable preemption, given this lock is nestable on the
> > read-side anyway. Or I'm missing something obvious ?
>
> It may not be obvious, but subject already raised on this list, so I'll
> try to be as precise as possible (But may be wrong on some points, I'll
> let Patrick correct me if necessary)
>
> ipt_do_table() is not a readonly function returning a verdict.
>
> 1) It handles a stack (check how is used next->comefrom) that seems to
> be stored on rules themselves. (This is how I understand this code)
> This is safe as each cpu has its own copy of rules/counters, and BH protected.
>
> 2) It also updates two 64 bit counters (bytes/packets) on each matched rule.
>
> 3) Some netfilter matches/targets probably rely on the fact their handlers
> are run with BH disabled by their caller (ipt_do_table()/arp/ip6...)
>
> These must be BH protected (and preempt disabled too), or else :
>
> 1) A softirq could interrupt a process in the middle of ipt_do_table()
> and corrupt its "stack".
>
> 2) A softirq could interrupt a process in ipt_do_table() in the middle
> of the ADD_COUNTER(). Some counters could be corrupted.
>
> 3) Some netfiler extensions would break.
>
> Previous linux versions already used a read_lock_bh() here, on a single
> and shared rwlock, there is nothing new on this BH locking AFAIK.
>
> Thank you

Thanks for the explanation. It might help to document the role of bh
disabling for the reader in a supplementary code comment, otherwise one
might think it's been put there to match the bottom half disabling used
on the write-side, which has the supplementary role of making sure bh
will not deadlock (and this precise behavior is not needed usually on
the read-side).

One more point :

* 1. reading counter values
* all readers need to be stopped and the per-CPU values are summed.

Maybe it's just me, but this sentence does not seem to clearly indicate
that we have :

for_each_cpu()
write lock()
read data
write unlock()

One might interpret it as :

for_each_cpu()
write lock()

read data

for_each_cpu()
write unlock()

Or maybe it's just my understanding of English that's not perfect.
Anyhow, rewording this sentence might not hurt. Something along the
lines of :

"reading counter values
all readers are iteratively stopped to have their per-CPU values
summed"

This is an important difference, as this behaves more like a RCU-based
mechanism than a global per-cpu read/write lock where all the write
locks would be taken at once.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/