Re: [3/5] 2.6.21-rc4: known regressions (v2)

From: Eric W. Biederman
Date: Sun Mar 25 2007 - 16:40:46 EST


"Rafael J. Wysocki" <rjw@xxxxxxx> writes:

> On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote:
>> "Rafael J. Wysocki" <rjw@xxxxxxx> writes:
>>
>> > Yes, in kernel/power/disk.c:power_down() .
>> >
>> > Please comment out the disable_nonboot_cpus() in there and retest (but
>> > please test the latest Linus' tree).
>>
>> <rant>
>>
>> Why do we even need a disable_nonboot_cpus in that path? machine_shutdown
>> on i386 and x86_64 should take care of that. Further the code that computes
>> the boot cpu is bogus (not all architectures require cpu == 0 to be
>> the boot cpu), and disabling non boot cpus appears to be a strong
>> x86ism, in the first place.
>
> Yes.
>
>> If the only reason for disable_nonboot_cpus there is to avoid the
>> WARN_ON in init_low_mappings() we should seriously consider killing
>> it.
>
> We have considered it, but no one was sure that it was a good idea.

The problem with the current init_low_mapping() is that it hacks the
live page table, which other cpus may be using at the same time. If we
can instead use a different page table the code becomes SMP safe.
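
To make that concrete, here is a rough, hypothetical sketch of what "use a
different page table" means; the scratch_pgd and the helper below are
illustrative assumptions, not the actual patch (which is appended further
down):

/*
 * Conceptual sketch only.  The identity mapping of low memory lives in
 * a private page table that nothing else walks, so no other cpu can be
 * surprised by pgd slot 0 of the live page table suddenly changing.
 */
static pgd_t *scratch_pgd;	/* a private, preallocated page table */

static void run_with_low_identity_map(void (*fn)(void))
{
	unsigned long saved_cr3 = read_cr3();

	/* copy the kernel mappings, then alias low memory 1:1 */
	memcpy(scratch_pgd, init_level4_pgt, PAGE_SIZE);
	scratch_pgd[0] = scratch_pgd[pgd_index(PAGE_OFFSET)];

	write_cr3(__pa(scratch_pgd));	/* only this cpu sees the hack */
	fn();
	write_cr3(saved_cr3);		/* back to the real page table */
}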

I have extracted the patch that addresses this from the relocatable
patchset and appended it below to spark ideas. It goes a little
farther than we need to solve this particular issue, but the basics are there.

>> If we can wait for 2.6.22 the relocatable x86_64 patchset that
>> Andi has queued, has changes that kill the init_low_mapping() hack.
>
> I think we should kill the WARN_ON() right now, perhaps replacing it with
> a FIXME comment.

Reasonable.
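
For reference, the interim change being proposed would look roughly like
this; the body of init_low_mapping() is taken from the removal hunk in the
patch appended below, with the WARN_ON() dropped in favor of a comment:

static void init_low_mapping(void)
{
	pgd_t *slot0 = pgd_offset(current->mm, 0UL);
	low_ptr = *slot0;
	set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
	/* FIXME: not SMP safe -- we are editing the live page table,
	 * so this only works while a single cpu is online. */
	local_flush_tlb();
}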

>> I'm not very comfortable with calling cpu_down in a common code path
>> right now either. I'm fairly certain we still don't have that
>> correct. So if we confine the mess that is cpu_down to #if
>> defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
>> If we start using it everywhere I'm very nervous.
>> migration when bringing a cpu down is strongly racy, and I don't think
>> we actually put cpus to sleep properly either.
>
> I'm interested in all of the details, please. I seriously consider dropping
> cpu_up()/cpu_down() from the suspend code paths.


So I'm not certain that, in a multiple-cpu context, we can avoid all of the
issues with cpu hotplug, but there is a reasonable chance, so I will
explain as best I can.

Yanking the appropriate code out of LinuxBIOS: the way a processor should stop
itself is to send an INIT IPI to itself. This puts the cpu into an optimized
wait-for-startup-IPI state in which it is otherwise disabled. This is the state
any sane BIOS will put the cpus into before control is handed off to the kernel.

> static inline void stop_this_cpu(void)
> {
> 	unsigned apicid;
> 	apicid = lapicid();
>
> 	/* Send an APIC INIT to myself */
> 	lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> 	lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_INT_ASSERT | LAPIC_DM_INIT);
> 	/* Wait for the ipi send to finish */
> 	lapic_wait_icr_idle();
>
> 	/* Deassert the APIC INIT */
> 	lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> 	lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_DM_INIT);
> 	/* Wait for the ipi send to finish */
> 	lapic_wait_icr_idle();
>
> 	/* If I haven't halted spin forever */
> 	for (;;) {
> 		hlt();
> 	}
> }

I'm not certain what to do about the interrupt races, but I will see
if I can explain what I know.

<braindump>

- Most ioapics are buggy.
- Most ioapics do not follow pci ordering rules with respect to
  interrupt message delivery, so ensuring all in-flight irqs have
  arrived somewhere is very hard.
- To avoid those bugs we always limit ourselves to reprogramming the
  ioapics in the interrupt handler, and we do not consider an interrupt
  successfully reprogrammed until we have received an irq at the new
  location.
- On x86 we have two basic interrupt handling modes:
  o logical addressing with lowest priority delivery.
  o physical addressing with delivery to a single cpu.
- With logical addressing, as long as a cpu is not available for
  having an interrupt delivered to it, the interrupt will never be
  delivered to that particular cpu. Ideally we also update the mask in
  the ioapic so it no longer targets that cpu.
- With physical addressing targeting a single cpu, we need to reprogram
  the ioapics not to target that specific cpu. This needs to happen
  in the interrupt handler, and we need to wait for the next interrupt
  before we tear down our data structures for handling the interrupt
  (see the sketch after this list).
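
To make those last two points concrete, here is a hedged sketch of the
handler-context approach; mask_ioapic_irq(), reprogram_ioapic_rte() and the
irq_move structure are illustrative assumptions, not the kernel's actual
interfaces:

/*
 * Illustrative sketch only.  Never touch the ioapic from process
 * context; record the requested move, do the rewrite from the
 * interrupt handler (when this irq is known not to be in flight to
 * us), and only retire the old vector once an irq shows up at the
 * new destination.
 */
struct irq_move {
	int pending;		/* a move has been requested */
	cpumask_t new_dest;	/* where the irq should go next */
};

/* process context (set_affinity, cpu offline): only record the request */
static void request_irq_move(struct irq_move *mv, cpumask_t dest)
{
	mv->new_dest = dest;
	mv->pending = 1;
}

/* called at the top of the interrupt handler for this irq */
static void maybe_move_irq(int irq, struct irq_move *mv)
{
	if (!mv->pending)
		return;

	mask_ioapic_irq(irq);				/* hypothetical */
	reprogram_ioapic_rte(irq, mv->new_dest);	/* hypothetical */
	unmask_ioapic_irq(irq);				/* hypothetical */
	mv->pending = 0;

	/*
	 * Only after the *next* interrupt arrives at new_dest can the
	 * old vector and per-cpu data for this irq be torn down safely.
	 */
}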

The current cpu hotplug code attempts to reprogram the ioapics from
process context, which is just wrong.

Now, as part of suspend/resume, I think we should program the actual
hardware devices not to generate interrupts in the first place, so we
can likely avoid all of the code that reprograms interrupts while they
are active. If we can use things like pci ordering rules to ensure a
device will not fire its interrupt again until it is resumed, we
should be able to disable interrupts synchronously, something we
cannot safely do in the current cpu hotplug scenario. That should
make the problem of doing the work race-free much easier (a sketch of
the ordering trick follows).
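
As a hedged sketch of that ordering trick; the device and its REG_IRQ_ENABLE
register offset are hypothetical, but the pattern (write the disable, then
read it back so the posted write is flushed before we go on) is the general
one:

#include <asm/io.h>

#define REG_IRQ_ENABLE	0x40	/* hypothetical device register offset */

/*
 * Sketch: quiesce a memory-mapped device before suspend.  The read
 * back flushes the posted write; by pci ordering rules, once the read
 * completion returns, the disable has reached the device and any
 * interrupt message it sent earlier has already arrived ahead of the
 * completion, so the device stays interrupt-silent until resume.
 */
static void quiesce_device_irqs(void __iomem *mmio)
{
	writel(0, mmio + REG_IRQ_ENABLE);	/* no more interrupts, please */
	(void)readl(mmio + REG_IRQ_ENABLE);	/* flush / ordering point */
}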

I don't know whether other architectures need to disable cpus before
doing a suspend to ram or a suspend to disk.

I also don't know if we have any code that brings the other cpus back
up after we have been suspended, in either the cpu hotplug paths or
the architecture-specific paths.

The bottom line, after the partial review of the irq handling during
cpu shutdown a little while ago, is that I consider the current cpu
hotplug code to be something that works when you are lucky. There are
too many hardware bugs to support the general case it tries to
implement.

I have not reviewed the entire cpu hotplug path, nor have I even tried
suspend/resume. But I have been all up and down the boot and shutdown
code paths, so I understand the general problem.

The other part I know is that cpu numbers are assigned at the
architecture's discretion. So unless the architecture code or the
early boot code sets a variable remembering which cpu was the boot
cpu, there is no reason to believe we can deduce it. In addition, I
don't know if any other architectures have anything resembling the x86
requirement to disable non-boot cpus. I do know machine_shutdown()
on i386 and x86_64 performs that action (if not perfectly), so at
least currently we should be able to do this in architecture-specific
code. A minimal sketch of the "set a variable" idea follows.
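
As a hedged illustration, assuming the boot processor calls a hook before any
secondary cpu is started; the names here are hypothetical, not existing
kernel interfaces:

#include <linux/init.h>
#include <linux/smp.h>

/*
 * Hypothetical sketch: the architecture/early-boot code remembers the
 * logical id of the processor it booted on, so generic suspend code
 * can ask for it instead of assuming cpu 0.
 */
static int boot_cpu_logical_id = -1;

void __init record_boot_cpu(void)
{
	/* run once, on the boot processor, before other cpus come up */
	boot_cpu_logical_id = smp_processor_id();
}

int suspend_target_cpu(void)
{
	/* fall back to cpu 0 only if the architecture never told us */
	return (boot_cpu_logical_id >= 0) ? boot_cpu_logical_id : 0;
}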

</braindump>

Eric



From: Vivek Goyal <vgoyal@xxxxxxxxxx>
Subject: [PATCH 12/20] x86_64: 64bit ACPI wakeup trampoline

o Moved wakeup_level4_pgt into the wakeup routine so we can
  run the kernel above 4G.

o Now we first go to 64bit mode and continue to run from the trampoline,
  and only then start accessing kernel symbols and restoring processor
  context. This enables us to resume even with a relocatable kernel,
  when the kernel might not be loaded at the physical address it was
  compiled for.

o Removed the need for modifying any existing kernel page table.

o Increased the size of the wakeup routine to 8K. This is required as
  the wakeup page tables are in the trampoline itself and they have to
  be at a 4K boundary, hence one page is not sufficient.
Signed-off-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
Signed-off-by: Vivek Goyal <vgoyal@xxxxxxxxxx>
---

arch/x86_64/kernel/acpi/sleep.c | 22 ++------------
arch/x86_64/kernel/acpi/wakeup.S | 59 ++++++++++++++++++++++++---------------
arch/x86_64/kernel/head.S | 9 -----
3 files changed, 41 insertions(+), 49 deletions(-)

diff -puN arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-ACPI-wakeup-trampoline arch/x86_64/kernel/acpi/sleep.c
--- linux-2.6.21-rc2-reloc/arch/x86_64/kernel/acpi/sleep.c~x86_64-64bit-ACPI-wakeup-trampoline	2007-03-07 01:28:11.000000000 +0530
+++ linux-2.6.21-rc2-reloc-root/arch/x86_64/kernel/acpi/sleep.c	2007-03-07 01:28:11.000000000 +0530
@@ -60,17 +60,6 @@ extern char wakeup_start, wakeup_end;

extern unsigned long acpi_copy_wakeup_routine(unsigned long);

-static pgd_t low_ptr;
-
-static void init_low_mapping(void)
-{
- pgd_t *slot0 = pgd_offset(current->mm, 0UL);
- low_ptr = *slot0;
- set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
- WARN_ON(num_online_cpus() != 1);
- local_flush_tlb();
-}
-
/**
* acpi_save_state_mem - save kernel state
*
@@ -79,8 +68,6 @@ static void init_low_mapping(void)
*/
int acpi_save_state_mem(void)
{
- init_low_mapping();
-
memcpy((void *)acpi_wakeup_address, &wakeup_start,
&wakeup_end - &wakeup_start);
acpi_copy_wakeup_routine(acpi_wakeup_address);
@@ -93,8 +80,6 @@ int acpi_save_state_mem(void)
*/
void acpi_restore_state_mem(void)
{
- set_pgd(pgd_offset(current->mm, 0UL), low_ptr);
- local_flush_tlb();
}

/**
@@ -107,10 +92,11 @@ void acpi_restore_state_mem(void)
*/
void __init acpi_reserve_bootmem(void)
{
- acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE);
- if ((&wakeup_end - &wakeup_start) > PAGE_SIZE)
+ acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE*2);
+ if ((&wakeup_end - &wakeup_start) > (PAGE_SIZE*2))
printk(KERN_CRIT
- "ACPI: Wakeup code way too big, will crash on attempt to suspend\n");
+ "ACPI: Wakeup code way too big, will crash on attempt"
+ " to suspend\n");
}

static int __init acpi_sleep_setup(char *str)
diff -puN arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-ACPI-wakeup-trampoline arch/x86_64/kernel/acpi/wakeup.S
--- linux-2.6.21-rc2-reloc/arch/x86_64/kernel/acpi/wakeup.S~x86_64-64bit-ACPI-wakeup-trampoline	2007-03-07 01:28:11.000000000 +0530
+++ linux-2.6.21-rc2-reloc-root/arch/x86_64/kernel/acpi/wakeup.S	2007-03-07 01:28:11.000000000 +0530
@@ -1,6 +1,7 @@
.text
#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
#include <asm/msr.h>

@@ -62,12 +63,15 @@ wakeup_code:

movb $0xa2, %al ; outb %al, $0x80

- lidt %ds:idt_48a - wakeup_code
- xorl %eax, %eax
- movw %ds, %ax # (Convert %ds:gdt to a linear ptr)
- shll $4, %eax
- addl $(gdta - wakeup_code), %eax
- movl %eax, gdt_48a +2 - wakeup_code
+ mov %ds, %ax # Find 32bit wakeup_code addr
+ movzx %ax, %esi # (Convert %ds:gdt to a liner ptr)
+ shll $4, %esi
+ # Fix up the vectors
+ addl %esi, wakeup_32_vector - wakeup_code
+ addl %esi, wakeup_long64_vector - wakeup_code
+ addl %esi, gdt_48a + 2 - wakeup_code # Fixup the gdt pointer
+
+ lidtl %ds:idt_48a - wakeup_code
lgdtl %ds:gdt_48a - wakeup_code # load gdt with whatever is
# appropriate

@@ -80,7 +84,7 @@ wakeup_code:

.balign 4
wakeup_32_vector:
- .long wakeup_32 - __START_KERNEL_map
+ .long wakeup_32 - wakeup_code
.word __KERNEL32_CS, 0

.code32
@@ -103,10 +107,6 @@ wakeup_32:
movl $__KERNEL_DS, %eax
movl %eax, %ds

- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
-
movw $0x0e00 + 'i', %ds:(0xb8012)
movb $0xa8, %al ; outb %al, $0x80;

@@ -120,7 +120,7 @@ wakeup_32:
movl %eax, %cr4

/* Setup early boot stage 4 level pagetables */
- movl $(wakeup_level4_pgt - __START_KERNEL_map), %eax
+ leal (wakeup_level4_pgt - wakeup_code)(%esi), %eax
movl %eax, %cr3

/* Enable Long Mode */
@@ -159,11 +159,11 @@ wakeup_32:
*/

/* Finally jump in 64bit mode */
- ljmp *(wakeup_long64_vector - __START_KERNEL_map)
+ ljmp *(wakeup_long64_vector - wakeup_code)(%esi)

.balign 4
wakeup_long64_vector:
- .long wakeup_long64 - __START_KERNEL_map
+ .long wakeup_long64 - wakeup_code
.word __KERNEL_CS, 0

.code64
@@ -178,11 +178,16 @@ wakeup_long64:
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt cpu_gdt_descr - __START_KERNEL_map
+ lgdt cpu_gdt_descr

movw $0x0e00 + 'n', %ds:(0xb8014)
movb $0xa9, %al ; outb %al, $0x80

+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic
+
movw $0x0e00 + 'u', %ds:(0xb8016)

nop
@@ -223,20 +228,21 @@ idt_48a:
gdt_48a:
.word 0x800 # gdt limit=2048,
# 256 GDT entries
- .word 0, 0 # gdt base (filled in later)
-
+ .long gdta - wakeup_code # gdt base (relocated in later)

real_magic: .quad 0
video_mode: .quad 0
video_flags: .quad 0

+.code16
bogus_real_magic:
movb $0xba,%al ; outb %al,$0x80
jmp bogus_real_magic

-bogus_32_magic:
+.code64
+bogus_64_magic:
movb $0xb3,%al ; outb %al,$0x80
- jmp bogus_32_magic
+ jmp bogus_64_magic

bogus_cpu:
movb $0xbc,%al ; outb %al,$0x80
@@ -263,6 +269,7 @@ bogus_cpu:
#define VIDEO_FIRST_V7 0x0900

# Setting of user mode (AX=mode ID) => CF=success
+.code16
mode_seta:
movw %ax, %bx
#if 0
@@ -313,6 +320,13 @@ wakeup_stack_begin: # Stack grows down
.org 0xff0
wakeup_stack: # Just below end of page

+.org 0x1000
+ENTRY(wakeup_level4_pgt)
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
+ /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
+
ENTRY(wakeup_end)

##
@@ -338,9 +352,10 @@ ENTRY(acpi_copy_wakeup_routine)
movq $0x123456789abcdef0, %rdx
movq %rdx, saved_magic

- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic

# restore the regs we used
popq %rdx
diff -puN arch/x86_64/kernel/head.S~x86_64-64bit-ACPI-wakeup-trampoline arch/x86_64/kernel/head.S
--- linux-2.6.21-rc2-reloc/arch/x86_64/kernel/head.S~x86_64-64bit-ACPI-wakeup-trampoline	2007-03-07 01:28:11.000000000 +0530
+++ linux-2.6.21-rc2-reloc-root/arch/x86_64/kernel/head.S	2007-03-07 01:28:11.000000000 +0530
@@ -308,15 +308,6 @@ NEXT_PAGE(level2_kernel_pgt)

.data

-#ifdef CONFIG_ACPI_SLEEP
- .align PAGE_SIZE
-ENTRY(wakeup_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 510,8,0
- /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
-#endif
-
#ifndef CONFIG_HOTPLUG_CPU
__INITDATA
#endif
_

