Re: [PATCH][CFT] dcache-ac6-D - dcache threading

From: kumon@flab.fujitsu.co.jp
Date: Sat Jun 03 2000 - 15:24:11 EST


Andi Kleen writes:
> Some comments from a networking perspective.

Thank you for taking an interest in my measurements.

> It may also be worth to try the e100 driver from the Intel website.

An instruction-level profile shows that an "inw()" statement in
speedo_interrupt() in eepro100.c accounts for more than half of the
overhead of speedo_interrupt(). Note that this statement is not
compiled into an actual inw instruction but into a movzwl instruction.

near line 1520 in eepro100.c:
        do {
HERE>>> status = inw(ioaddr + SCBStatus);
                /* Acknowledge all of the current interrupt sources ASAP. */
                /* Will change from 0xfc00 to 0xff00 when we start handling
                   FCP and ER interrupts --Dragan */
                outw(status & 0xfc00, ioaddr + SCBStatus);
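
For reference, eepro100.c by default uses memory-mapped access and
redefines the port I/O macros roughly as below (reconstructed from
memory, not a verbatim quote), which is why the profiled inw() ends up
as a plain 16-bit MMIO load that the compiler zero-extends with movzwl:

/* near the top of eepro100.c (approximate): map the port macros onto
   MMIO accessors, since I/O-space accesses are serializing and slow. */
#ifndef USE_IO
#undef inw
#undef outw
#define inw  readw              /* 16-bit load -> movzwl into a register */
#define outw writew
#endif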

> > 22.9 22.2 kmalloc
> > 22.2 20.5 kfree
>
> That requires per CPU slabs to fix. Normally the new per CPU skb cache in 2.4
> should help a bit already, maybe you need to increase
> /proc/sys/net/core/hot_list_len

Part of the kmalloc/kfree overhead comes from do_select(), and it is
easily eliminated by using a small array on the stack, as in the patch
I have already posted (sketched below). IMHO the per-CPU skb cache will
not reduce the kmalloc overhead, since the sk_buff head is allocated
with kmem_cache_alloc() directly.
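
A minimal sketch of the stack-array idea (the name SELECT_STACK_ALLOC
and the function shape are illustrative, not the posted patch):

#include <linux/slab.h>

#define SELECT_STACK_ALLOC 256          /* assumed threshold, in bytes */

static int do_select_sketch(int n)      /* n = highest fd + 1 */
{
        long stack_fds[SELECT_STACK_ALLOC / sizeof(long)];
        void *bits = stack_fds;
        /* six bitmaps: in/out/ex plus their three result sets */
        size_t size = 6 * sizeof(long) *
                      ((n + BITS_PER_LONG - 1) / BITS_PER_LONG);
        int ret = 0;

        if (size > sizeof(stack_fds)) {
                bits = kmalloc(size, GFP_KERNEL);   /* rare slow path */
                if (!bits)
                        return -ENOMEM;
        }

        /* ... build the fd bitmaps in bits and run the poll loop ... */

        if (bits != stack_fds)
                kfree(bits);            /* only when we really allocated */
        return ret;
}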

The statistics show that allocation of the skb data buffer (not the
sk_buff head) is the most frequent kmalloc/kfree client (after the
do_select optimization).
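
For reference, alloc_skb() in 2.4 splits its two allocations roughly
like this (simplified from net/core/skbuff.c, from memory, not
verbatim): the head comes from a slab cache, optionally through the
per-CPU pool, while the data area always goes through kmalloc(). That
is why the per-CPU skb cache never touches the kmalloc() path above.

/* inside alloc_skb(unsigned int size, int gfp_mask), simplified: */
struct sk_buff *skb;
u8 *data;

skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask);    /* head: slab */
if (skb == NULL)
        return NULL;
data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (data == NULL) {                                     /* data: kmalloc */
        kmem_cache_free(skbuff_head_cache, skb);
        return NULL;
}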

Using a per-CPU cache mechanism together with the auto array in
do_select() can cut the kmalloc() overhead to one third. The statistics
also show that the most frequently requested kmalloc sizes are 2 KB and
128 bytes, though this is of course application dependent.
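
To illustrate what such a per-CPU cache buys, here is a toy hot-list
sketch (illustrative only, not the 2.4 skb pool, whose list length is
tuned via /proc/sys/net/core/hot_list_len). The point is that the fast
path touches no shared slab locks; real code must also disable local
interrupts around the list operations.

#include <linux/slab.h>
#include <linux/smp.h>

#define HOT_LIST_LEN 128                 /* assumed list-length cap */

static struct {
        void *head;                      /* objects linked via first word */
        int len;
} hot[NR_CPUS];

static void *hot_alloc(size_t size, int gfp)
{
        int cpu = smp_processor_id();
        void *obj = hot[cpu].head;

        if (obj) {                       /* fast path: CPU-local */
                hot[cpu].head = *(void **)obj;
                hot[cpu].len--;
                return obj;
        }
        return kmalloc(size, gfp);       /* slow path: shared slab */
}

static void hot_free(void *obj)
{
        int cpu = smp_processor_id();

        if (hot[cpu].len < HOT_LIST_LEN) {
                *(void **)obj = hot[cpu].head;   /* park it locally */
                hot[cpu].head = obj;
                hot[cpu].len++;
                return;
        }
        kfree(obj);                      /* list full: back to the slab */
}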

> > 20.3 18.3 speedo_start_xmit
> > 19.3 18.5 tcp_v4_rcv
>
> That is the checksum computation and hash lookup. The e100 driver would fix
> a lot of that (it supports hw checksums on the later eepros)

OK, I'll try and measure it.

> > 9.0 8.2 ip_route_input
>
> Interesting. Looks like the routing cache hash isn't as good as we thought.
> Could you add some statistics to ipv4/route.c:ip_route_input to check
> the average hash chain length or where exactly the cycles go there?

Two lock-prefixed instructions took more than half of the ticks. But we
should be careful in interpreting this: superscalar execution may
distort the results, so some additional experiments are needed for
confirmation. I suspect instruction serialization may be the main
reason.

The instructions are shown below:

int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr,
                   u8 tos, struct net_device *dev)
{
        struct rtable * rth;
        unsigned hash;
        int iif = dev->ifindex;

        tos &= IPTOS_TOS_MASK;
        hash = rt_hash_code(daddr, saddr^(iif<<5), tos);

HERE>> read_lock(&rt_hash_table[hash].lock);

The other is around line 1560:
                        rth->u.dst.lastuse = jiffies;
HERE>> dst_hold(&rth->u.dst);
                        rth->u.dst.__use++;
                        read_unlock(&rt_hash_table[hash].lock);
                        skb->dst = (struct dst_entry*)rth;
                        return 0;
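
As for the statistics you asked for, something along these lines should
give the average chain length (a sketch; the counters rt_lookups and
rt_chain_steps are hypothetical, the loop shape follows route.c):

static atomic_t rt_lookups = ATOMIC_INIT(0);     /* hypothetical */
static atomic_t rt_chain_steps = ATOMIC_INIT(0); /* hypothetical */

/* in ip_route_input(), after read_lock(&rt_hash_table[hash].lock): */
atomic_inc(&rt_lookups);
for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
        atomic_inc(&rt_chain_steps);
        if (rth->key.dst == daddr && rth->key.src == saddr &&
            rth->key.iif == iif && rth->key.tos == tos)
                break;                   /* cache hit */
}
/* average chain length = rt_chain_steps / rt_lookups */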

--
Computer Systems Laboratory, Fujitsu Labs.
kumon@flab.fujitsu.co.jp



