Re: 3.6.11 AMD-Vi: Completion-Wait loop timed out

From: Suravee Suthikulanit
Date: Tue Jan 22 2013 - 18:29:53 EST


On 1/22/2013 10:29 AM, Udo van den Heuvel wrote:

On 2013-01-22 17:12, Boris Ostrovsky wrote:
Your BIOS does not have the required erratum workaround. We will provide
a patch to close that hole but since the problem is not easily
reproducible (and the erratum is also not easy to trigger) it may be
difficult to say whether it really helped with your problem.

Udo,

I sent out a patch (http://marc.info/?l=linux-kernel&m=135889686523524&w=2) which should implement
the workaround for AMD processor family 15h models 10h-1Fh erratum 746 in the IOMMU driver.
In your case, the output from "setpci -s 00:00.02 F4.w" is "0050", which tells me that the BIOS doesn't
implement the workaround. After patching, you should see the following message in "dmesg".

"AMD-Vi: Applying erratum 746 for IOMMU at 0000:00:00.2"
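
To make the "0050" reading concrete, here is a minimal sketch of the check. It assumes the workaround is signalled by bit 2 of the value read back at offset F4h; that bit position is my assumption and is not stated in this message, so please verify it against the patch linked above. The helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: bit 2 of the word read at config offset F4h is the
 * erratum 746 workaround bit. This message does not name the bit.
 */
#define ERRATUM_746_BIT (1u << 2)

/* Hypothetical helper: true if the workaround bit is already set. */
static bool erratum_746_applied(uint16_t f4_value)
{
	return (f4_value & ERRATUM_746_BIT) != 0;
}
```

Under that assumption, 0x0050 has bit 2 clear, so the helper reports that the workaround is not applied, while a value such as 0x0054 would report it as applied.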

Can we think of certain loads/actions/etc that could help trigger the issue?
Then if reproducing is easier we can better say if stuff is actually
fixed after the workaround.

Udo

Looking at the original kernel message, it seems that the kernel timed out while waiting for the IOMMU
to finish executing the "COMPLETION_WAIT" command. In this particular case, it was issued as part of
"__domain_flush_pages()" while trying to send the "INVALIDATE_IOMMU_PAGE" command to the IOMMU, but the command
buffer was getting full and the kernel needed to wait for the command buffer to free up. However, the kernel
message does not tell us exactly what caused the IOMMU to lock up in the first place.
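
The "command buffer is getting full" condition can be sketched as a ring-buffer space check. This is a simplified userspace model, not the driver code: struct cmd_ring, ring_space_left() and ring_is_full() are hypothetical names, and the 16-byte entry size is taken from the four 32-bit data words printed by the debug patch in this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* One command is four 32-bit words (data[0]..data[3]), i.e. 16 bytes. */
#define CMD_ENTRY_SIZE 16u

/* Hypothetical model of the command ring; all values are byte offsets. */
struct cmd_ring {
	uint32_t head;	/* next entry the IOMMU will fetch */
	uint32_t tail;	/* next entry the driver will write */
	uint32_t size;	/* ring size in bytes, must be a power of two */
};

/*
 * Bytes still free after one more command is queued at tail.
 * The unsigned wrap-around is only correct for power-of-two sizes.
 */
static uint32_t ring_space_left(const struct cmd_ring *r)
{
	uint32_t next_tail = (r->tail + CMD_ENTRY_SIZE) % r->size;

	return (r->head - next_tail) % r->size;
}

/* Full means the next write would make tail catch up with head. */
static bool ring_is_full(const struct cmd_ring *r)
{
	return ring_space_left(r) == 0;
}
```

In this model, when ring_is_full() is true the producer has to wait for the consumer to advance head before queuing more commands, which is the point at which a completion wait would be needed.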

From my observation, a workload with high disk traffic should trigger a large number of "INVALIDATE_IOMMU_PAGE" commands.
However, this doesn't automatically issue a "COMPLETION_WAIT" command. The following patch slightly modifies
the code to always issue "COMPLETION_WAIT" after every command. This should help increase the chance of reproducing
the issue.


diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index c1c74e0..d05b1f9 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1016,6 +1016,7 @@ static int iommu_queue_command_sync(struct amd_iommu *iommu,
struct iommu_cmd *cmd,
bool sync)
{
+#if 0
u32 left, tail, head, next_tail;
unsigned long flags;
@@ -1052,6 +1053,40 @@ again:
spin_unlock_irqrestore(&iommu->lock, flags);
+#else
+ u32 tail;
+ unsigned long flags;
+
+ WARN_ON(iommu->cmd_buf_size & CMD_BUFFER_UNINITIALIZED);
+ printk(KERN_DEBUG "AMD-Vi: iommu_queue_command_sync:"
+ " data[0]:%#x data[1]:%#x data[2]:%#x data[3]:%#x\n",
+ cmd->data[0], cmd->data[1], cmd->data[2], cmd->data[3] );
+
+ spin_lock_irqsave(&iommu->lock, flags);
+
+ tail = readl(iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
+ copy_cmd_to_buffer(iommu, cmd, tail);
+
+ spin_unlock_irqrestore(&iommu->lock, flags);
+
+ /* Send a COMPLETION_WAIT command and wait for it to finish */
+ {
+ struct iommu_cmd sync_cmd;
+ volatile u64 sem = 0;
+ int ret;
+
+ spin_lock_irqsave(&iommu->lock, flags);
+
+ tail = readl(iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
+ build_completion_wait(&sync_cmd, (u64)&sem);
+ copy_cmd_to_buffer(iommu, &sync_cmd, tail);
+
+ spin_unlock_irqrestore(&iommu->lock, flags);
+
+ if ((ret = wait_on_sem(&sem)) != 0)
+ return ret;
+ }
+#endif
return 0;
}






