Re: Race conditions galore (2.0.33 and possibly 2.1.x)

Stephen R. van den Berg (srb@cuci.nl)
Mon, 22 Dec 1997 21:46:19 +0100


--fL5xGRrasGEqWYE3
Content-Type: text/plain; charset=us-ascii

Linus Torvalds wrote:
>On Mon, 22 Dec 1997, Stephen R. van den Berg wrote:
>> Well, it *does* fix some hung processes on a newsserver of mine.
>> First, several processes hung in buffer.c (very reproducable).
>> I patched it (using the described patch).
>> Now, a process hung in filemap.c (same construct). So I patched that
>> as well.

>Ok, something is definitely going on here. But as you don't even have an
>SMP setup, I don't see exactly what..

Incidentally, it happened on several different programs at "random" times.
I.e. it's not user-land dependent.
It hung a perl program, two dd programs (I ran those as a test for
disk errors), innd, crosspost, e2fsck (several times); all this in several
sessions/reboots.

>Your patch looked ok - there was nothing wrong with it. It just shouldn't
>really have any effect, because the old code should be equally stable.
>Interesting.

>Was that the _only_ change you did?

Yes, the only change. I do plead guilty with respect to not running
an entirely stock 2.0.33 kernel (I have some local mods, nothing in the
mainstream modules though). I don't think the local changes affect any
of the critical code (yeah, sure, how many times have you heard this before
:-); as a reference, I've been carrying some of these changes over from
kernel revision to kernel revision dating back more than three years ago.
Many of the local mods have vanished over time due to them being integrated
in the mainstream kernel (like noatime, serial console, etc.).

But, to get back to the matter at hand:
1. I had a 2.0.33+ kernel, programs where hanging at times; all
in buffer.c. Rebooting did not help.
2. I patched buffer.c with the shown patch. Rebooted and this time
*one* program hung again. This time in filemap.c.
3. Patched filemap.c the same as buffer.c.
4. Rebooted, hang-free for 19 hours up till now and still counting.

No other changes. Nor in the kernel, nor in userland.

BTW, I checked the hung processes with kdebug. I did find that the
buffer they where hanging on was:
1. *NOT* locked anymore. The buffer state was 0x9, the locked mask is 0x2
I believe.
2. We were still in its wait queue (first on the list), but we were not the
only one. If I recall correctly there was at least one other entry,
if not more (I did not know yet at that point that the wait_queues
where circular).

>> > current->state = TASK_UNINTERRUPTIBLE;
>> > if (should_wait_condition) {

>> What if the should_wait_condition changes here?
>> I.e. the buffer is unlocked. Will the current->state be affected?
>> Or will wake_up process avoid touching the current->state because the
>> current process is "still running"?

>The wake_up() will indeed change current->state - even if the process is
>really running (it doesn't care, and that's the whole reason why the
>construct works).

Hmmm... Just thinking aloud here, so bear with me...
Looking at wake_up_process() I see that it indeed writes over p->state
in any case. What's p->next_run? When is that set?
Could it be that p->next_run is set and about to be cleared without
adding the task to the runqueue (in a different part of the kernel)?

Looking at wake_up(), the only reason the wakeup call could be missed
is when p->state has already been changed away from TASK_UNINTERRUPTIBLE
even though the process has not been added to the runqueue.

One more thing... What if schedule() is being invoked with
current->state == TASK_RUNNING ?
What if someone changes current->state while schedule() is running
(that *should* not be possible, considering that most of it runs
with interrupts disabled)?

>I wonder whether gcc might be changing the ordering here. That would
>explain why your patch would make a difference..

Possibly, but not in the parts where I patched. Maybe there are parts
in sched.c which are being reordered?

>In fact, the more I think about it, the more your patch makes sense
>considering what gcc might do to the sources. Do you by any chance use
>egcs or a recent snapshot of gcc?

Sorry, no. gcc version 2.7.2.1 (from a recent Debian distribution).
Compile options:

gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -O2 \
-malign-jumps=2 -pipe -m486 -S -g "$@"

Nothing truly unusual, I think.

>> >There are only two important rules:
>> > - you have to add yourself to the wait queue _before_ testing for the
>> > condition.
>> > - you have to mark yourself asleep _before_ testing for the condition.

>> Maybe we need to change the last rule?

>> - You have to mark yourself asleep _before_ adding yourself to the waitqueue.

>That would be a reasonably ok rule, but in general there are cases where
>you want to add yourself to the wait-queue just once (outside a larger
>loop), and mark yourself sleeping inside the loop.

If that is supposed to work, then the problem must be in sched.c somewhere.

>IF this is a compiler optimization thing, could you test the original code
>but just add a "mb()" to after marking yourself as sleeping. Ie

> current->state = TASK_UNINTERRUPTIBLE;
>+ mb();
> if (condition..)

>The mb() should make sure that gcc cannot move things around, and on some
>architectures it will also do some other coherency stuff for SMP. If that
>also fixes it I'll have to think about this some more..

Not in the code fragment I patched.

Here's the diff with and without mb().

--- t. Mon Dec 22 20:09:07 1997
+++ tc. Mon Dec 22 20:09:03 1997
@@ -117,14 +117,12 @@
.LBE8:
.LBE7:
.Ltext6:
- 121:buffer.c **** current->state = TASK_UNINTERRUPTIBLE; mb();
+ 121:buffer.c **** current->state = TASK_UNINTERRUPTIBLE;
.LM35:
0063 A1000000 movl current_set,%eax
00
0068 C7000200 movl $2,(%eax)
0000
- #APP
- #NO_APP
122:buffer.c **** if (buffer_locked(bh)) {
.LM36:
.LBB11:

It's rather insignificant, as you can see.

>(Your patch essentially did a similar thing to the "mb()" - the wait queue
>manipulation will also act as a barrier for gcc optimizations. So that
>might explain the thing).

I've included the original and my patched version of the function in
assembler as attachments to this mail.

>Anyway, thanks for testing, this looks like a potentially very interesting
>problem that needs to be solved..

Evidently. I wouldn't really sleep all too well, without having certainty
about why my patches make a difference.

-- 
Sincerely,                                                          srb@cuci.nl
           Stephen R. van den Berg (AKA BuGless).

"Father's Day: Nine months before Mother's Day."

--fL5xGRrasGEqWYE3 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=t

113:buffer.c **** void __wait_on_buffer(struct buffer_head * bh) 114:buffer.c **** { 975 .LM1: 976 0000 55 pushl %ebp 977 0001 89E5 movl %esp,%ebp 978 0003 83EC0C subl $12,%esp 979 0006 57 pushl %edi 980 0007 56 pushl %esi 981 0008 53 pushl %ebx 982 0009 8B7508 movl 8(%ebp),%esi 115:buffer.c **** struct wait_queue wait = { current, NULL }; 984 .LM2: 985 .LBB2: 986 000c A1000000 movl current_set,%eax 986 00 987 0011 8945F8 movl %eax,-8(%ebp) 988 0014 C745FC00 movl $0,-4(%ebp) 988 000000 116:buffer.c **** 117:buffer.c **** bh->b_count++; 990 .LM3: 991 001b FF461C incl 28(%esi) 118:buffer.c **** add_wait_queue(&bh->b_wait, &wait); 993 .LM4: 994 .LBB3: 996 .Ltext1: 998 .LM5: 999 .LBB4: 1001 .LM6: 1002 #APP 1003 001e 9C59 pushfl ; popl %ecx 1004 #NO_APP 1006 .LM7: 1007 #APP 1008 0020 FA cli 1009 #NO_APP 1011 .LM8: 1012 .LBB5: 1014 .LM9: 1015 .LBB6: 1016 0021 8B4634 movl 52(%esi),%eax 1018 .LM10: 1019 0024 8D5630 leal 48(%esi),%edx 1021 .LM11: 1022 0027 85C0 testl %eax,%eax 1023 0029 7402 je .L1391 1025 .LM12: 1026 002b 89C2 movl %eax,%edx 1027 .L1391: 1029 .LM13: 1030 002d 8D7DF8 leal -8(%ebp),%edi 1031 0030 897E34 movl %edi,52(%esi) 1033 .LM14: 1034 0033 8955FC movl %edx,-4(%ebp) 1036 .LM15: 1037 .LBE6: 1038 .LBE5: 1040 .LM16: 1041 #APP 1042 0036 519D pushl %ecx ; popfl 1043 #NO_APP 1045 .LM17: 1046 .LBE4: 1047 .LBE3: 1048 .L1392: 1050 .Ltext2: 119:buffer.c **** repeat: 120:buffer.c **** run_task_queue(&tq_disk); 1052 .LM18: 1053 .LBB7: 1055 .Ltext3: 1057 .LM19: 1058 .LBB8: 1060 .LM20: 1061 .LBB9: 1063 .Ltext4: 1065 .LM21: 1066 0038 31C0 xorl %eax,%eax 1068 .LM22: 1070 .LM23: 1071 003a 89C3 movl %eax,%ebx 1072 #APP 1073 003c 871D0000 xchgl %ebx,tq_disk 1073 0000 1074 #NO_APP 1076 .LM24: 1077 .LBE9: 1079 .Ltext5: 1081 .LM25: 1082 0042 85DB testl %ebx,%ebx 1083 0044 741D je .L1415 1084 0046 89F6 .align 4 1085 .L1404: 1087 .LM26: 1088 .LBB10: 1090 .LM27: 1091 0048 8B430C movl 12(%ebx),%eax 1093 .LM28: 1094 004b 8B4B08 movl 8(%ebx),%ecx 1096 .LM29: 1097 004e 89DA movl %ebx,%edx 1099 .LM30: 1100 0050 8B1B movl (%ebx),%ebx 1102 .LM31: 1103 0052 C7420400 movl $0,4(%edx) 1103 000000 1105 .LM32: 1106 0059 50 pushl %eax 1107 005a FFD1 call *%ecx 1109 .LM33: 1110 .LBE10: 1111 005c 83C404 addl $4,%esp 1112 005f 85DB testl %ebx,%ebx 1113 0061 75E5 jne .L1404 1114 .L1415: 1116 .LM34: 1117 .LBE8: 1118 .LBE7: 1120 .Ltext6: 121:buffer.c **** current->state = TASK_UNINTERRUPTIBLE; mb(); 1122 .LM35: 1123 0063 A1000000 movl current_set,%eax 1123 00 1124 0068 C7000200 movl $2,(%eax) 1124 0000 1125 #APP 1126 #NO_APP 122:buffer.c **** if (buffer_locked(bh)) { 1128 .LM36: 1129 .LBB11: 1131 .Ltext7: 1133 .LM37: 1134 .LBB12: 1135 .LBB13: 1137 .Ltext8: 1139 .LM38: 1140 .LBE13: 1142 .Ltext9: 1144 .LM39: 1145 .LBE12: 1146 .LBE11: 1148 .Ltext10: 1150 .LM40: 1151 006e F6461404 testb $4,20(%esi) 1152 0072 7408 je .L1406 123:buffer.c **** schedule(); 1154 .LM41: 1155 0074 E8FCFFFF call schedule 1155 FF 124:buffer.c **** goto repeat; 1157 .LM42: 1158 0079 EBBD jmp .L1392 1159 007b 90 .align 4 1160 .L1406: 125:buffer.c **** } 126:buffer.c **** remove_wait_queue(&bh->b_wait, &wait); 1162 .LM43: 1163 .LBB14: 1165 .Ltext11: 1167 .LM44: 1168 007c 8D5DF8 leal -8(%ebp),%ebx 1170 .LM45: 1171 .LBB15: 1173 .LM46: 1174 #APP 1175 007f 9C8F45F4 pushfl ; popl -12(%ebp) 1176 #NO_APP 1178 .LM47: 1179 #APP 1180 0083 FA cli 1181 #NO_APP 1183 .LM48: 1184 .LBB16: 1186 .LM49: 1187 .LBB17: 1188 0084 8B4DFC movl -4(%ebp),%ecx 1190 .LM50: 1191 0087 89CA movl %ecx,%edx 1192 0089 8D7600 .align 4 1193 .L1411: 1195 .LM51: 1196 .LBB18: 1197 008c 8B4204 movl 4(%edx),%eax 1199 .LM52: 1200 008f 39D8 cmpl %ebx,%eax 1201 0091 7405 je .L1412 1203 .LM53: 1204 0093 89C2 movl %eax,%edx 1206 .LM54: 1207 .LBE18: 1209 .LM55: 1210 0095 EBF5 jmp .L1411 1211 0097 90 .align 4 1212 .L1412: 1214 .LM56: 1215 0098 894A04 movl %ecx,4(%edx) 1217 .LM57: 1218 .LBE17: 1219 .LBE16: 1221 .LM58: 1222 #APP 1223 009b FF75F49D pushl -12(%ebp) ; popfl 1224 #NO_APP 1226 .LM59: 1227 .LBE15: 1228 .LBE14: 1230 .Ltext12: 127:buffer.c **** bh->b_count--; 1232 .LM60: 1233 009f FF4E1C decl 28(%esi) 128:buffer.c **** current->state = TASK_RUNNING; 1235 .LM61: 1236 00a2 A1000000 movl current_set,%eax 1236 00 1237 00a7 C7000000 movl $0,(%eax) 1237 0000 129:buffer.c **** } 1239 .LM62: 1240 .LBE2: 1241 00ad 8D65E8 leal -24(%ebp),%esp 1242 00b0 5B popl %ebx 1243 00b1 5E popl %esi 1244 00b2 5F popl %edi 1245 00b3 89EC movl %ebp,%esp 1246 00b5 5D popl %ebp 1247 00b6 C3 ret 1248 .Lfe1: 1299 00b7 89F68DBC .align 16 1299 27000000 1299 00

--fL5xGRrasGEqWYE3 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=tc

113:buffer.c **** void __wait_on_buffer(struct buffer_head * bh) 114:buffer.c **** { 975 .LM1: 976 0000 55 pushl %ebp 977 0001 89E5 movl %esp,%ebp 978 0003 83EC0C subl $12,%esp 979 0006 57 pushl %edi 980 0007 56 pushl %esi 981 0008 53 pushl %ebx 982 0009 8B7508 movl 8(%ebp),%esi 115:buffer.c **** struct wait_queue wait = { current, NULL }; 984 .LM2: 985 .LBB2: 986 000c A1000000 movl current_set,%eax 986 00 987 0011 8945F8 movl %eax,-8(%ebp) 988 0014 C745FC00 movl $0,-4(%ebp) 988 000000 116:buffer.c **** 117:buffer.c **** bh->b_count++; 990 .LM3: 991 001b FF461C incl 28(%esi) 118:buffer.c **** add_wait_queue(&bh->b_wait, &wait); 993 .LM4: 994 .LBB3: 996 .Ltext1: 998 .LM5: 999 .LBB4: 1001 .LM6: 1002 #APP 1003 001e 9C59 pushfl ; popl %ecx 1004 #NO_APP 1006 .LM7: 1007 #APP 1008 0020 FA cli 1009 #NO_APP 1011 .LM8: 1012 .LBB5: 1014 .LM9: 1015 .LBB6: 1016 0021 8B4634 movl 52(%esi),%eax 1018 .LM10: 1019 0024 8D5630 leal 48(%esi),%edx 1021 .LM11: 1022 0027 85C0 testl %eax,%eax 1023 0029 7402 je .L1391 1025 .LM12: 1026 002b 89C2 movl %eax,%edx 1027 .L1391: 1029 .LM13: 1030 002d 8D7DF8 leal -8(%ebp),%edi 1031 0030 897E34 movl %edi,52(%esi) 1033 .LM14: 1034 0033 8955FC movl %edx,-4(%ebp) 1036 .LM15: 1037 .LBE6: 1038 .LBE5: 1040 .LM16: 1041 #APP 1042 0036 519D pushl %ecx ; popfl 1043 #NO_APP 1045 .LM17: 1046 .LBE4: 1047 .LBE3: 1048 .L1392: 1050 .Ltext2: 119:buffer.c **** repeat: 120:buffer.c **** run_task_queue(&tq_disk); 1052 .LM18: 1053 .LBB7: 1055 .Ltext3: 1057 .LM19: 1058 .LBB8: 1060 .LM20: 1061 .LBB9: 1063 .Ltext4: 1065 .LM21: 1066 0038 31C0 xorl %eax,%eax 1068 .LM22: 1070 .LM23: 1071 003a 89C3 movl %eax,%ebx 1072 #APP 1073 003c 871D0000 xchgl %ebx,tq_disk 1073 0000 1074 #NO_APP 1076 .LM24: 1077 .LBE9: 1079 .Ltext5: 1081 .LM25: 1082 0042 85DB testl %ebx,%ebx 1083 0044 741D je .L1415 1084 0046 89F6 .align 4 1085 .L1404: 1087 .LM26: 1088 .LBB10: 1090 .LM27: 1091 0048 8B430C movl 12(%ebx),%eax 1093 .LM28: 1094 004b 8B4B08 movl 8(%ebx),%ecx 1096 .LM29: 1097 004e 89DA movl %ebx,%edx 1099 .LM30: 1100 0050 8B1B movl (%ebx),%ebx 1102 .LM31: 1103 0052 C7420400 movl $0,4(%edx) 1103 000000 1105 .LM32: 1106 0059 50 pushl %eax 1107 005a FFD1 call *%ecx 1109 .LM33: 1110 .LBE10: 1111 005c 83C404 addl $4,%esp 1112 005f 85DB testl %ebx,%ebx 1113 0061 75E5 jne .L1404 1114 .L1415: 1116 .LM34: 1117 .LBE8: 1118 .LBE7: 1120 .Ltext6: 121:buffer.c **** current->state = TASK_UNINTERRUPTIBLE; 1122 .LM35: 1123 0063 A1000000 movl current_set,%eax 1123 00 1124 0068 C7000200 movl $2,(%eax) 1124 0000 122:buffer.c **** if (buffer_locked(bh)) { 1126 .LM36: 1127 .LBB11: 1129 .Ltext7: 1131 .LM37: 1132 .LBB12: 1133 .LBB13: 1135 .Ltext8: 1137 .LM38: 1138 .LBE13: 1140 .Ltext9: 1142 .LM39: 1143 .LBE12: 1144 .LBE11: 1146 .Ltext10: 1148 .LM40: 1149 006e F6461404 testb $4,20(%esi) 1150 0072 7408 je .L1406 123:buffer.c **** schedule(); 1152 .LM41: 1153 0074 E8FCFFFF call schedule 1153 FF 124:buffer.c **** goto repeat; 1155 .LM42: 1156 0079 EBBD jmp .L1392 1157 007b 90 .align 4 1158 .L1406: 125:buffer.c **** } 126:buffer.c **** remove_wait_queue(&bh->b_wait, &wait); 1160 .LM43: 1161 .LBB14: 1163 .Ltext11: 1165 .LM44: 1166 007c 8D5DF8 leal -8(%ebp),%ebx 1168 .LM45: 1169 .LBB15: 1171 .LM46: 1172 #APP 1173 007f 9C8F45F4 pushfl ; popl -12(%ebp) 1174 #NO_APP 1176 .LM47: 1177 #APP 1178 0083 FA cli 1179 #NO_APP 1181 .LM48: 1182 .LBB16: 1184 .LM49: 1185 .LBB17: 1186 0084 8B4DFC movl -4(%ebp),%ecx 1188 .LM50: 1189 0087 89CA movl %ecx,%edx 1190 0089 8D7600 .align 4 1191 .L1411: 1193 .LM51: 1194 .LBB18: 1195 008c 8B4204 movl 4(%edx),%eax 1197 .LM52: 1198 008f 39D8 cmpl %ebx,%eax 1199 0091 7405 je .L1412 1201 .LM53: 1202 0093 89C2 movl %eax,%edx 1204 .LM54: 1205 .LBE18: 1207 .LM55: 1208 0095 EBF5 jmp .L1411 1209 0097 90 .align 4 1210 .L1412: 1212 .LM56: 1213 0098 894A04 movl %ecx,4(%edx) 1215 .LM57: 1216 .LBE17: 1217 .LBE16: 1219 .LM58: 1220 #APP 1221 009b FF75F49D pushl -12(%ebp) ; popfl 1222 #NO_APP 1224 .LM59: 1225 .LBE15: 1226 .LBE14: 1228 .Ltext12: 127:buffer.c **** bh->b_count--; 1230 .LM60: 1231 009f FF4E1C decl 28(%esi) 128:buffer.c **** current->state = TASK_RUNNING; 1233 .LM61: 1234 00a2 A1000000 movl current_set,%eax 1234 00 1235 00a7 C7000000 movl $0,(%eax) 1235 0000 129:buffer.c **** } 1237 .LM62: 1238 .LBE2: 1239 00ad 8D65E8 leal -24(%ebp),%esp 1240 00b0 5B popl %ebx 1241 00b1 5E popl %esi 1242 00b2 5F popl %edi 1243 00b3 89EC movl %ebp,%esp 1244 00b5 5D popl %ebp 1245 00b6 C3 ret 1246 .Lfe1: 1297 00b7 89F68DBC .align 16 1297 27000000 1297 00

--fL5xGRrasGEqWYE3--