Re: [patch] voluntary-preempt-2.6.8-rc2-J3

From: Andrew Morton
Date: Mon Jul 26 2004 - 16:24:48 EST


Ingo Molnar <mingo@xxxxxxx> wrote:
>
> ...
> guaranteeing latencies on SMP is hard, because the latency of a CPU
> might depend on the latency of a task on another CPU - and that CPU
> isn't notified of the rescheduling request.
>
> one alternative technique to yours would be to notify _all_ CPUs that a
> high-prio RT task is ready to run (via a broadcast need-resched). That
> way the UP latency-break techniques map to SMP in a 1:1 way.

I did exactly that in the 2.4 ll patch. It works, but doesn't seem very nice.
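
(Roughly, the broadcast amounts to something like the sketch below - this is
illustration only, not code from either patch. It assumes kernel/sched.c
scope, where resched_task() and cpu_curr() are visible, and it omits the
runqueue locking a real version would want.)

static void broadcast_need_resched(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		/* set TIF_NEED_RESCHED on the remote CPU's current task and
		 * kick that CPU with a reschedule IPI if it isn't polling */
		resched_task(cpu_curr(cpu));
}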

> non-RT tasks don't get this benefit, which is a difference from the UP
> situation, but i don't think it would be appropriate to use the UP
> behavior, due to the overhead of broadcasting.
>
> a combination of the two techniques could be used too: a global 'break
> locks from now on' flag which gets set if a (RT?) task wants to
> reschedule. Normally this flag would be zero and the cacheline would be
> clean and shared between all CPUs, causing no overhead. Once a task
> wants to reschedule it would increase by 1 until that task has been
> scheduled.

Something like that. The obscure livelock opportunities worry me.
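
(Something like this, I imagine - a sketch only; lock_break_pending,
need_lock_break() and the helpers below are invented names, not existing
kernel symbols.)

/* global 'break locks from now on' counter; kept in its own cacheline
 * so the common all-zero case stays clean and shared between CPUs */
static atomic_t lock_break_pending ____cacheline_aligned = ATOMIC_INIT(0);

#define need_lock_break()	(atomic_read(&lock_break_pending) != 0)

/* a waking high-prio task bumps the counter... */
static inline void request_lock_break(void)
{
	atomic_inc(&lock_break_pending);
}

/* ...and drops it again once it has been scheduled */
static inline void release_lock_break(void)
{
	atomic_dec(&lock_break_pending);
}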

> This would remove the ugliness factor too; your sample code would look
> like this:
>
> > + if (need_lock_break()) {
> > + spin_unlock(&mm->page_table_lock);
> > + cpu_relax();
> > + spin_lock(&mm->page_table_lock);
> > + }
> > +
>
> or a shortcut:
>
> > + cond_lock_break(&mm->page_table_lock);
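
That shortcut would presumably just wrap the open-coded sequence, something
like this (again only a sketch, built on the made-up need_lock_break()
above):

#define cond_lock_break(lock)				\
	do {						\
		if (need_lock_break()) {		\
			spin_unlock(lock);		\
			cpu_relax();			\
			spin_lock(lock);		\
		}					\
	} while (0)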

The usual problem is, of course, long-running loops under locks. These
loops usually have some sort of counter (or you add one). The patch below
changes cond_resched_lock() so that it takes the counter and a tuned upper
loop-count limit. It also does the right thing on UP. It needs
enhancements for the situations where we already have a counter in the
loop.

I dunno if it's sufficient, but I do think that we need to settle on some
infrastructure of this sort before proceeding too far into this thing.
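
For illustration only, a caller of the new form would look something like
this (scan_table(), the page array and the limit of 64 are all invented
for the example):

static void scan_table(spinlock_t *lock, struct page **pages, int nr)
{
	int i;
	int resched_counter = 0;

	spin_lock(lock);
	for (i = 0; i < nr; i++) {
		/* ... per-entry work done under 'lock' ... */

		/* SMP: drop and retake 'lock' every 64 entries.
		 * UP: reschedule if need_resched() is set. */
		cond_resched_lock(lock, resched_counter, 64);
	}
	spin_unlock(lock);
}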

> there are two problems with the unconditional lock-break:
>
> - i'm not sure cpu_relax() guarantees that another CPU can grab this
> lock. It all depends on whether the lock cacheline can bounce over
> to the other CPU faster than cpu_relax() finishes and this CPU
> re-acquires the lock.

precisely.
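
One way to give a waiter a real chance would be to spin until the lock is
observed taken before re-acquiring it - an SMP-only sketch, with an
arbitrary bound and its own livelock questions:

	int timeout = 1000;	/* arbitrary, invented for this sketch */

	spin_unlock(lock);
	while (!spin_is_locked(lock) && --timeout)
		cpu_relax();	/* wait until someone else has grabbed it */
	spin_lock(lock);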


diff -puN include/linux/sched.h~cond_resched_lock-smp-fix include/linux/sched.h
--- 25/include/linux/sched.h~cond_resched_lock-smp-fix 2004-07-13 16:58:42.445155480 -0700
+++ 25-akpm/include/linux/sched.h 2004-07-13 17:02:56.194579704 -0700
@@ -1038,22 +1038,32 @@ static inline void cond_resched(void)
}

/*
- * cond_resched_lock() - if a reschedule is pending, drop the given lock,
- * call schedule, and on return reacquire the lock.
- *
* This works OK both with and without CONFIG_PREEMPT. We do strange low-level
* operations here to prevent schedule() from being called twice (once via
* spin_unlock(), once by hand).
*/
-static inline void cond_resched_lock(spinlock_t * lock)
-{
- if (need_resched()) {
- _raw_spin_unlock(lock);
- preempt_enable_no_resched();
- __cond_resched();
- spin_lock(lock);
- }
-}
+
+#ifdef CONFIG_SMP
+#define cond_resched_lock(lock, counter, limit) \
+ do { \
+ if (++(counter) >= (limit)) { \
+ spin_unlock(lock); \
+ cpu_relax(); \
+ spin_lock(lock); \
+ (counter) = 0; \
+ } \
+ } while (0)
+#else
+#define cond_resched_lock(lock, counter, limit) \
+ do { \
+ if (need_resched()) { \
+ _raw_spin_unlock(lock); \
+ preempt_enable_no_resched(); \
+ __cond_resched(); \
+ spin_lock(lock); \
+ } \
+ } while (0)
+#endif

/* Reevaluate whether the task has signals pending delivery.
This is required every time the blocked sigset_t changes.
diff -puN fs/reiserfs/journal.c~cond_resched_lock-smp-fix fs/reiserfs/journal.c
--- 25/fs/reiserfs/journal.c~cond_resched_lock-smp-fix 2004-07-13 16:58:42.462152896 -0700
+++ 25-akpm/fs/reiserfs/journal.c 2004-07-13 17:03:43.131444216 -0700
@@ -796,6 +796,7 @@ static int write_ordered_buffers(spinloc
struct buffer_chunk chunk;
struct list_head tmp;
+ int resched_counter = 0;
INIT_LIST_HEAD(&tmp);

chunk.nr = 0;
spin_lock(lock);
@@ -827,7 +828,7 @@ static int write_ordered_buffers(spinloc
}
loop_next:
put_bh(bh);
- cond_resched_lock(lock);
+ cond_resched_lock(lock, resched_counter, 64);
}
if (chunk.nr) {
spin_unlock(lock);
@@ -848,7 +849,7 @@ loop_next:
if (!buffer_uptodate(bh))
ret = -EIO;
put_bh(bh);
- cond_resched_lock(lock);
+ cond_resched_lock(lock, resched_counter, 64);
}
spin_unlock(lock);
return ret;
diff -puN mm/shmem.c~cond_resched_lock-smp-fix mm/shmem.c
--- 25/mm/shmem.c~cond_resched_lock-smp-fix 2004-07-13 16:58:42.487149096 -0700
+++ 25-akpm/mm/shmem.c 2004-07-13 17:04:06.310920400 -0700
@@ -424,6 +424,7 @@ static void shmem_truncate(struct inode
swp_entry_t *ptr;
int offset;
int freed;
+ int resched_counter = 0;

inode->i_ctime = inode->i_mtime = CURRENT_TIME;
idx = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -496,7 +497,7 @@ static void shmem_truncate(struct inode
shmem_free_block(inode);
}
empty = subdir;
- cond_resched_lock(&info->lock);
+ cond_resched_lock(&info->lock, resched_counter, 32);
dir = shmem_dir_map(subdir);
}
subdir = *dir;
diff -puN mm/memory.c~cond_resched_lock-smp-fix mm/memory.c
--- 25/mm/memory.c~cond_resched_lock-smp-fix 2004-07-13 16:58:42.504146512 -0700
+++ 25-akpm/mm/memory.c 2004-07-13 17:05:43.943078056 -0700
@@ -224,6 +224,7 @@ int copy_page_range(struct mm_struct *ds
unsigned long address = vma->vm_start;
unsigned long end = vma->vm_end;
unsigned long cow;
+ int resched_counter = 0;

if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst, src, vma);
@@ -341,7 +342,8 @@ cont_copy_pte_range_noset:
pte_unmap_nested(src_pte-1);
pte_unmap(dst_pte-1);
spin_unlock(&src->page_table_lock);
- cond_resched_lock(&dst->page_table_lock);
+ cond_resched_lock(&dst->page_table_lock,
+ resched_counter, 256);
cont_copy_pmd_range:
src_pmd++;
dst_pmd++;
@@ -530,6 +532,7 @@ int unmap_vmas(struct mmu_gather **tlbp,
int tlb_start_valid = 0;
int ret = 0;
int atomic = details && details->atomic;
+ int resched_counter = 0;

for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long start;
@@ -567,10 +570,11 @@ int unmap_vmas(struct mmu_gather **tlbp,
zap_bytes -= block;
if ((long)zap_bytes > 0)
continue;
- if (!atomic && need_resched()) {
+ if (!atomic) {
int fullmm = tlb_is_full_mm(*tlbp);
tlb_finish_mmu(*tlbp, tlb_start, start);
- cond_resched_lock(&mm->page_table_lock);
+ cond_resched_lock(&mm->page_table_lock,
+ resched_counter, 1);
*tlbp = tlb_gather_mmu(mm, fullmm);
tlb_start_valid = 0;
}
diff -puN mm/rmap.c~cond_resched_lock-smp-fix mm/rmap.c
--- 25/mm/rmap.c~cond_resched_lock-smp-fix 2004-07-13 16:58:42.521143928 -0700
+++ 25-akpm/mm/rmap.c 2004-07-13 17:06:59.866535936 -0700
@@ -661,6 +661,7 @@ static inline int try_to_unmap_file(stru
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
unsigned int mapcount;
+ int resched_counter = 0;

if (!spin_trylock(&mapping->i_mmap_lock))
return ret;
@@ -699,7 +700,7 @@ static inline int try_to_unmap_file(stru
*/
mapcount = page->mapcount;
page_map_unlock(page);
- cond_resched_lock(&mapping->i_mmap_lock);
+ cond_resched_lock(&mapping->i_mmap_lock, resched_counter, 1);

max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
if (max_nl_cursor == 0)
@@ -728,7 +729,7 @@ static inline int try_to_unmap_file(stru
(void *) max_nl_cursor;
ret = SWAP_AGAIN;
}
- cond_resched_lock(&mapping->i_mmap_lock);
+ cond_resched_lock(&mapping->i_mmap_lock, resched_counter, 8);
max_nl_cursor += CLUSTER_SIZE;
} while (max_nl_cursor <= max_nl_size);

_
