Re: 2.2.18Pre Lan Performance Rocks!

From: Andi Kleen (ak@suse.de)
Date: Mon Oct 30 2000 - 02:38:27 EST


On Mon, Oct 30, 2000 at 12:16:46AM -0700, Jeff V. Merkey wrote:
> On Mon, Oct 30, 2000 at 08:08:58AM +0100, Andi Kleen wrote:
> > On Sun, Oct 29, 2000 at 11:58:21PM -0700, Jeff V. Merkey wrote:
> > > On Mon, Oct 30, 2000 at 07:47:00AM +0100, Andi Kleen wrote:
> > > > On Sun, Oct 29, 2000 at 06:35:31PM -0700, Jeff V. Merkey wrote:
> > > > > On Mon, Oct 30, 2000 at 12:04:23AM +0000, Alan Cox wrote:
> > > > > > > It's still got some problems with NFS (I am seeing a few RPC timeout
> > > > > > > errors) so I am backreving to 2.2.17 for the Ute-NWFS release next week,
> > > > > > > but it's most impressive.
> > > > > >
> > > > > > Can you send a summary of the NFS reports to nfs-devel@linux.kernel.org
> > > > >
> > > > > Yes. I just went home, so I am emailing from my house. I'll post late
> > > > > tonight or in the morning. NFS on 100Mbit going Linux->Linux is
> > > > > getting about 3% better throughput than IPX NetWare Client ->
> > > > > NetWare 5.x on the same network. When you start loading up a Linux
> > > > > server, throughput drops off sharply while NetWare keeps scaling;
> > > > > however, this does indicate that the LAN code paths are equivalent in
> > > > > latency to MSM/TSM/HSM in NetWare. NetWare does better caching
> > > > > (but we'll fix this in Linux next). I think the ring transitions to
> > > > > user space daemons are what is causing the scaling problems for
> > > > > Linux vs. NetWare.
> > > >
> > > > There are no user space daemons involved in the knfsd fast path, only in slow paths
> > > > like mounting.
> > >
> > > So why is it spawning off nfsd servers in user space? I did notice that all
> >
> > They just provide a process context, the actual work is only done in kernel mode.
> >
> > > the connect logic is down in the kernel with the RPC stuff in xprt.c, and this
> > > looks to be pretty fast. I know about the checksumming problem with TCP/IP.
> > > But cycles are cheap on today's processors, so even with this overhead, it
> > > could still get faster. IPX uses small packets that are less wire efficient,
> > > since the ratio of header size to payload is larger than what NFS in Linux
> > > is doing, plus there are more of them for an equivalent data transfer, even
> > > with packet burst. IPX would tend to be faster if there were multiple
> > > routers involved, since the latency of smaller packets would be lower and
> > > IPX never has to deal with the problem of fragmentation.
> > >
> > > Is there any CR3 reloading or task switching going on for each interrupt
> > > that comes in, that you know of, BTW? EMON stats show the server's bus utilization
> >
> > No, interrupts do not change CR3.
> >
> > One problem in Linux 2.2 is that kernel threads reload their VM on context
> > switch (that would include the nfsd threads); this should be fixed in 2.4
> > with lazy mm.
>
> When you say it reloads its VM, you mean it reloads the CR3 register?

Yes.
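
Roughly, that reload is the %cr3 write in the context switch path. A
simplified sketch of what a 2.2-style switch on i386 boils down to (names
and signature are approximate, not the exact kernel source):

        /* Simplified, illustrative sketch of the page table switch;
         * the real 2.2 code lives in the context switch path and
         * handles more cases. */
        static inline void switch_mm(struct mm_struct *prev,
                                     struct mm_struct *next)
        {
                if (prev != next) {
                        /* Loading %cr3 with the new page directory
                         * implicitly flushes the (non-global) TLB
                         * entries, which is where the cost described
                         * below comes from. */
                        __asm__ __volatile__("movl %0,%%cr3"
                                             : /* no outputs */
                                             : "r" (__pa(next->pgd)));
                }
        }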

> This will cost you about 15 clocks up front, and about 150 clocks over
> time as each TLB entry is reloaded (plus it will do LOCK# assertions
> invisibly underneath for the PTE fetches). One major difference is that in
> NetWare, CR3 never gets changed in any network I/O fast path, and none of
> the TCP or IPX processes exist in user space, so there's never this
> background overhead of page tables being loaded in and out. CR3 only
> gets mucked with when someone allocates memory and pages in an address
> frame (when memory is allocated and freed).
>
> The user space activity is what's causing this, plus the copying. Is there
> an easy way to completely disable multiple PTE/PDE address spaces and just
> map the entire address space linearly (like NetWare) with a compile option,
> so I could do a true apples-to-apples comparison?

No. In 2.4 you could probably use the on-demand lazy vm mechanism Ingo described
for the nfsd processes. In 2.2 it is a bit more tricky; if I remember right,
lazy mm needed quite a few changes.
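
The 2.4 lazy-TLB decision at context switch looks roughly like this (a
simplified sketch of the schedule() logic; field and helper names are
approximate):

        /* Sketch of the 2.4 lazy-TLB choice in schedule(). */
        struct mm_struct *mm = next->mm;
        struct mm_struct *oldmm = prev->active_mm;

        if (!mm) {
                /* Kernel thread: borrow the previous task's address
                 * space instead of touching %cr3 at all. */
                next->active_mm = oldmm;
                atomic_inc(&oldmm->mm_count);
                enter_lazy_tlb(oldmm, next, cpu);
        } else {
                /* Task with its own user address space: really switch
                 * page tables (and pay for the TLB flush). */
                switch_mm(oldmm, mm, next, cpu);
        }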

But before doing too many changes I would first verify whether that is really the problem.

> > Hmm, actually it should only be fixed for true kernel threads that have
> > been started with kernel_thread(); the "pseudo kernel threads" like nfsd
> > uses probably do not get that optimization, because they don't set their
> > MM to init_mm.
> >
> > > to get disproportionately higher in Linux than NetWare 5.x, and when it hits
> > > 60% of total clock cycles, Linux starts dropping off. NetWare 5.x is at 1/8
> >
> > I think that can be explained by the copying.
>
> It's more. I am seeing tons of segment register reloads, which are very
> heavy, and atomic reads of PTE entries.

PTEs are read for aging on memory pressure in vmscan.
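
That aging pass is roughly of this shape (an illustrative sketch, not the
exact vmscan code; the helper names are approximate):

        /* Illustrative sketch of PTE aging during a vmscan pass. */
        pte_t pte = *page_table;        /* the PTE read in question */

        if (pte_young(pte)) {
                /* Referenced since the last scan: clear the hardware
                 * accessed bit and give the page more age. */
                set_pte(page_table, pte_mkold(pte));
                age_page_up(page);
        } else {
                /* Untouched since the last scan: age it down; at age
                 * zero it becomes a swap-out candidate. */
                age_page_down(page);
        }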

Segment register reloads happen on interrupt entry, to load the kernel cs/ds
(are they really that costly?). If they were really costly, you could probably
check on interrupt entry whether you're already running in kernel space and
skip the reload.
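
For reference, the entry path does something like this (an abbreviated sketch
of the register-save on interrupt entry, in the spirit of 2.2's SAVE_ALL in
arch/i386/kernel/entry.S, not the exact macro):

        pushl %es
        pushl %ds
        pushl %eax              # ... the other GP registers too
        movl $(__KERNEL_DS),%edx
        movl %edx,%ds           # the segment reloads in question
        movl %edx,%es
        # The saved CS on the stack says whether we came from user or
        # kernel mode, so in principle the entry code could test it
        # and skip the %ds/%es reload when already in the kernel.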

>
> > > the bus utilization, which would suggest clocks are being used for something
> > > other than pushing packets in and out of the box (the checksumming is done
> > > in the TCP code when it copies the fragments, which is very low overhead).
> >
> > No, unfortunately not. NFS uses UDP, and the UDP defragmenting doesn't do copy-checksum.
>
> Correct. You're right -- it's in the TCP code. I remember going over
> this when I was reviewing Alan's code -- that's what I was thinking of.

2.2 does not use checksum-copy-to-user in TCP RX; the csum is either done
separately or in hardware. It does csum-copy-fragment for TX, as does UDP.
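
The point of csum-copy is to touch the data only once, folding the 16-bit
one's complement sum into the copy loop. Roughly (an illustrative C sketch,
not the kernel's optimized assembly routine):

        /* Illustrative combined copy-and-checksum: one pass over the
         * data computes the running one's complement sum while copying. */
        static unsigned int
        copy_and_csum(unsigned char *dst, const unsigned char *src,
                      int len, unsigned int sum)
        {
                while (len >= 2) {
                        unsigned int w = src[0] | (src[1] << 8);
                        dst[0] = src[0];
                        dst[1] = src[1];
                        sum += w;               /* data touched once */
                        src += 2; dst += 2; len -= 2;
                }
                if (len) {                      /* trailing odd byte */
                        *dst = *src;
                        sum += *src;
                }
                while (sum >> 16)               /* fold carries to 16 bits */
                        sum = (sum & 0xffff) + (sum >> 16);
                return sum;
        }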

-Andi


