Bonnie++ with 1024k stripe SW/RAID5 causes kernel to go into D-state

From: Justin Piszcz
Date: Sat Sep 29 2007 - 13:09:04 EST


Kernel: 2.6.23-rc8 (older kernels do this as well)

When running the following command:
/usr/bin/time /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:100000:16:64

It hangs unless I first increase various md/raid parameters, such as stripe_cache_size.
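
For reference, here is a quick way to inspect what the array is running with before any tuning (device names as used below; if I read the md code right, the untuned stripe_cache_size default is only 256 entries):

# inspect current md/block-layer tunables
cat /sys/block/md3/md/stripe_cache_size
blockdev --getra /dev/md3
cat /sys/block/sdc/queue/max_sectors_kb /sys/block/sdc/queue/nr_requests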

# ps auxww | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 276 0.0 0.0 0 0 ? D 12:14 0:00 [pdflush]
root 277 0.0 0.0 0 0 ? D 12:14 0:00 [pdflush]
root 1639 0.0 0.0 0 0 ? D< 12:14 0:00 [xfsbufd]
root 1767 0.0 0.0 8100 420 ? Ds 12:14 0:00
root 2895 0.0 0.0 5916 632 ? Ds 12:15 0:00 /sbin/syslogd -r

See the bottom for more details.
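
(Side note for anyone reproducing this: "grep D" also matches unrelated lines; a tighter filter is to match only the STAT column:

ps axo pid,stat,comm | awk '$2 ~ /^D/'
)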

Is this normal? Does md only work without tuning up to a certain stripe size? I use RAID 5 with a 1024k stripe, which works fine once the optimizations below are applied; but if I just boot the system and run bonnie++ without them, it hangs in D-state. When I then run the optimizations, it comes back out of D-state. Pretty weird?

(Again: without this script, bonnie++ hangs in D-state until the script is run.)
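
For scale: my reading of Documentation/md.txt is that stripe_cache_size is counted in pages per array device, so the cache memory works out to roughly stripe_cache_size * PAGE_SIZE * nr_disks. On this 10-disk array with 4 KiB pages:

echo $(( 256 * 4 * 10 ))    # default 256 entries -> 10240 KiB (~10 MiB)
echo $(( 16384 * 4 * 10 ))  # tuned 16384 entries -> 655360 KiB (640 MiB)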

Optimization script:

#!/bin/bash

# source profile
. /etc/profile

# Tell user what's going on.
echo "Optimizing RAID Arrays..."

# Define DISKS.
cd /sys/block
DISKS=$(/bin/ls -1d sd[a-z])

# This step must come first.
# See: http://www.3ware.com/KB/article.aspx?id=11050
echo "Setting max_sectors_kb to 128 KiB"
for i in $DISKS
do
echo "Setting /dev/$i to 128 KiB..."
echo 128 > /sys/block/"$i"/queue/max_sectors_kb
done
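
# Note: max_sectors_kb cannot be raised above the hardware limit reported
# in /sys/block/$i/queue/max_hw_sectors_kb; to confirm the new value took:
#   grep . /sys/block/sd?/queue/max_sectors_kb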

# This step comes next.
echo "Setting nr_requests to 512 KiB"
for i in $DISKS
do
echo "Setting /dev/$i to 512K KiB"
echo 512 > /sys/block/"$i"/queue/nr_requests
done

# Set read-ahead.
echo "Setting read-ahead to 64 MiB for /dev/md3"
blockdev --setra 65536 /dev/md3
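# (--setra counts 512-byte sectors: 65536 * 512 B = 32 MiB of read-ahead.)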

# Set stripe_cache_size for RAID5.
echo "Setting stripe_cache_size to 16384 (pages per device) for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size

# Set minimum and maximum RAID resync speed to 30000 KiB/s (~30 MB/s).
echo "Setting minimum and maximum resync speed to ~30 MB/s..."
echo 30000 > /sys/block/md0/md/sync_speed_min
echo 30000 > /sys/block/md0/md/sync_speed_max
echo 30000 > /sys/block/md1/md/sync_speed_min
echo 30000 > /sys/block/md1/md/sync_speed_max
echo 30000 > /sys/block/md2/md/sync_speed_min
echo 30000 > /sys/block/md2/md/sync_speed_max
echo 30000 > /sys/block/md3/md/sync_speed_min
echo 30000 > /sys/block/md3/md/sync_speed_max

# Disable NCQ on all disks.
echo "Disabling NCQ on all disks..."
for i in $DISKS
do
echo "Disabling NCQ on $i"
echo 1 > /sys/block/"$i"/device/queue_depth
done

--

Once this runs, everything works fine again.

--

# mdadm -D /dev/md3
/dev/md3:
Version : 00.90.03
Creation Time : Wed Aug 22 10:38:53 2007
Raid Level : raid5
Array Size : 1318680576 (1257.59 GiB 1350.33 GB)
Used Dev Size : 146520064 (139.73 GiB 150.04 GB)
Raid Devices : 10
Total Devices : 10
Preferred Minor : 3
Persistence : Superblock is persistent

Update Time : Sat Sep 29 13:05:15 2007
State : active, resyncing
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 1024K

Rebuild Status : 8% complete

UUID : e37a12d1:1b0b989a:083fb634:68e9eb49 (local to host p34.internal.lan)
Events : 0.4211

Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 8 49 1 active sync /dev/sdd1
2 8 65 2 active sync /dev/sde1
3 8 81 3 active sync /dev/sdf1
4 8 97 4 active sync /dev/sdg1
5 8 113 5 active sync /dev/sdh1
6 8 129 6 active sync /dev/sdi1
7 8 145 7 active sync /dev/sdj1
8 8 161 8 active sync /dev/sdk1
9 8 177 9 active sync /dev/sdl1
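
For what it's worth, with a 1024k chunk across 10 devices (9 data + 1 parity), each full stripe holds 9 * 1024 KiB of data; rough math:

echo $(( 9 * 1024 ))   # KiB of data per full stripe -> 9216 (9 MiB)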

--

NOTE: This bug is reproducible every time:

Example:

$ /usr/bin/time /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:100000:16:64
Writing with putc()...

It writes for 4-5 minutes and then... silence + D-state. (This time I was too late to run the optimization script first.) :(

$ ps auxww | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 276 1.2 0.0 0 0 ? D 12:50 0:03 [pdflush]
root 2901 0.0 0.0 5916 632 ? Ds 12:50 0:00 /sbin/syslogd -r
user 4571 48.0 0.0 11644 1084 pts/1 D+ 12:51 1:55 /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:100000:16:64
root 4612 1.0 0.0 0 0 ? D 12:52 0:01 [pdflush]
root 4624 5.0 0.0 40964 7436 ? D 12:55 0:00 /usr/bin/perl -w /app/rrd-cputemp/bin/rrd_cputemp.pl
root 4684 0.0 0.0 31968 1416 ? D 12:55 0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1191084902 -Z u 265975 843609 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-month.png -125000000 -125000000 400 100 1 1 1 7200 0 4 1 %Y-%m-%d %H:%M 0
root 4686 0.0 0.0 4420 932 ? D 12:55 0:00 /usr/sbin/hddtemp -n /dev/sdf
user 4688 0.0 0.0 4232 800 pts/5 S+ 12:55 0:00 grep --color D
$

If you are not already logged in as root, it is sometimes too late to su to root
and run the optimizations:

$ su -
Password: <hang forever>
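
One thing that may help whoever looks at this (a suggestion; I have not captured these traces yet): if a root shell or the console keyboard is still responsive, Magic SysRq can dump the stuck tasks' kernel stacks to the log:

# needs CONFIG_MAGIC_SYSRQ and kernel.sysrq=1
echo t > /proc/sysrq-trigger   # dump all task stacks to dmesg/syslog
# or press Alt+SysRq+T on the local console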

Justin.