Commit graph

4851 commits

Author SHA1 Message Date
Linus Torvalds
338b9bb3ad Merge branch 'x86/auditsc' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland
* 'x86/auditsc' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
  i386 syscall audit fast-path
  x86_64 ia32 syscall audit fast-path
  x86_64 syscall audit fast-path
  x86_64: remove bogus optimization in sysret_signal
2008-07-23 20:39:21 -07:00
Linus Torvalds
7f9dce3837 Merge branch 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: hrtick_enabled() should use cpu_active()
  sched, x86: clean up hrtick implementation
  sched: fix build error, provide partition_sched_domains() unconditionally
  sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
  cpu hotplug: Make cpu_active_map synchronization dependency clear
  cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
  sched: rework of "prioritize non-migratable tasks over migratable ones"
  sched: reduce stack size in isolated_cpu_setup()
  Revert parts of "ftrace: do not trace scheduler functions"

Fixed up conflicts in include/asm-x86/thread_info.h (due to the
TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
introduction).
2008-07-23 19:36:53 -07:00
Linus Torvalds
26dcce0fab Merge branch 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
  NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
  cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
  NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
  NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
  NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
  cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
  cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
  cpumask: Provide a generic set of CPUMASK_ALLOC macros
  cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
  cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
  cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
  cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
  cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
  cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
  Revert "cpumask: introduce new APIs"
  cpumask: make for_each_cpu_mask a bit smaller
  net: Pass reference to cpumask variable in net/sunrpc/svc.c
  ...

Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
2008-07-23 18:37:44 -07:00
Linus Torvalds
d7b6de14a0 Merge branch 'core/softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core/softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softlockup: fix invalid proc_handler for softlockup_panic
  softlockup: fix watchdog task wakeup frequency
  softlockup: fix watchdog task wakeup frequency
  softlockup: show irqtrace
  softlockup: print a module list on being stuck
  softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
  softlockup: fix false positives on nohz if CPU is 100% idle for more than 60 seconds
  softlockup: fix softlockup_thresh fix
  softlockup: fix softlockup_thresh unaligned access and disable detection at runtime
  softlockup: allow panic on lockup
2008-07-23 18:34:13 -07:00
Roland McGrath
86a1c34a92 x86_64 syscall audit fast-path
This adds a fast path for 64-bit syscall entry and exit when
TIF_SYSCALL_AUDIT is set, but no other kind of syscall tracing.
This path does not need to save and restore all registers as
the general case of tracing does.  Avoiding the iret return path
when syscall audit is enabled helps performance a lot.

Signed-off-by: Roland McGrath <roland@redhat.com>
2008-07-23 17:47:32 -07:00
Oleg Nesterov
ba661292a2 posix-timers: fix posix_timer_event() vs dequeue_signal() race
The bug was reported and analysed by Mark McLoughlin <markmc@redhat.com>,
the patch is based on his and Roland's suggestions.

posix_timer_event() always rewrites the pre-allocated siginfo before sending
the signal. Most of the written info is the same all the time, but memset(0)
is very wrong. If ->sigq is queued we can race with collect_signal() which
can fail to find this siginfo looking at .si_signo, or copy_siginfo() can
copy the wrong .si_code/si_tid/etc.

In short, sys_timer_settime() can in fact stop the active timer, or the user
can receive the siginfo with the wrong .si_xxx values.

Move "memset(->info, 0)" from posix_timer_event() to alloc_posix_timer(),
change send_sigqueue() to set .si_overrun = 0 when ->sigq is not queued.
It would be nice to move the whole sigq->info initialization from send to
create path, but this is not easy to do without uglifying timer_create()
further.

As Roland rightly pointed out, we need more cleanups/fixes here, see the
"FIXME" comment in the patch. Hopefully this patch makes sense anyway, and
it can mask the most bad implications.

Reported-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Oliver Pinter <oliver.pntr@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

 kernel/posix-timers.c |   17 +++++++++++++----
 kernel/signal.c       |    1 +
 2 files changed, 14 insertions(+), 4 deletions(-)
2008-07-24 01:26:08 +02:00
Oleg Nesterov
54da117492 posix-timers: do_schedule_next_timer: fix the setting of ->si_overrun
do_schedule_next_timer() sets info->si_overrun = timr->it_overrun_last,
this discards the already accumulated overruns.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Oliver Pinter <oliver.pntr@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-07-24 01:26:08 +02:00
Uwe Kleine-König
2db873211b set_irq_wake: fix return code and wake status tracking
Since 15a647eba9 set_irq_wake returned -ENXIO
if another device had it already enabled.  Zero is the right value to
return in this case.  Moreover the change to desc->status was not reverted
if desc->chip->set_wake returned an error.

Signed-off-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-23 09:35:53 -07:00
Ingo Molnar
422037bafd sched: fix hrtick & generic-ipi dependency
Andrew Morton reported this s390 allmodconfig build failure:

 kernel/built-in.o: In function `hrtick_start_fair':
 sched.c:(.text+0x69c6): undefined reference to `__smp_call_function_single'

the reason is that s390 is not a generic-ipi SMP platform yet, while
the hrtick code relies on it. Fix the dependency.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-23 11:18:28 +02:00
Linus Torvalds
6eaaaac974 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  remove CONFIG_KMOD from core kernel code
  remove CONFIG_KMOD from lib
  remove CONFIG_KMOD from sparc64
  rework try_then_request_module to do less in non-modular kernels
  remove mention of CONFIG_KMOD from documentation
  make CONFIG_KMOD invisible
  modules: Take a shortcut for checking if an address is in a module
  module: turn longs into ints for module sizes
  Shrink struct module: CONFIG_UNUSED_SYMBOLS ifdefs
  module: reorder struct module to save space on 64 bit builds
  module: generic each_symbol iterator function
  module: don't use stop_machine for waiting rmmod
2008-07-22 13:17:15 -07:00
Linus Torvalds
53baaaa968 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits)
  arm: bus_id -> dev_name() and dev_set_name() conversions
  sparc64: fix up bus_id changes in sparc core code
  3c59x: handle pci_name() being const
  MTD: handle pci_name() being const
  HP iLO driver
  sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute
  sysdev: Add utility functions for simple int/ulong variable sysdev attributes
  sysdev: Pass the attribute to the low level sysdev show/store function
  driver core: Suppress sysfs warnings for device_rename().
  kobject: Transmit return value of call_usermodehelper() to caller
  sysfs-rules.txt: reword API stability statement
  debugfs: Implement debugfs_remove_recursive()
  HOWTO: change email addresses of James in HOWTO
  always enable FW_LOADER unless EMBEDDED=y
  uio-howto.tmpl: use unique output names
  uio-howto.tmpl: use standard copyright/legal markings
  sysfs: don't call notify_change
  sysdev: fix debugging statements in registration code.
  kobject: should use kobject_put() in kset-example
  kobject: reorder kobject to save space on 64 bit builds
  ...
2008-07-22 13:13:47 -07:00
Atsushi Nemoto
5f17156fc5 Fix build on COMPAT platforms when CONFIG_EPOLL is disabled
Add missing cond_syscall() entry for compat_sys_epoll_pwait.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-22 09:59:41 -07:00
Miao Xie
91cd4d6ef0 cpusets: fix wrong domain attr updates
Fix wrong domain attr updates, or we will always update the first sched
domain attr.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>	[2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-22 09:59:41 -07:00
Johannes Berg
a1ef5adb4c remove CONFIG_KMOD from core kernel code
Always compile request_module when the kernel allows modules.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-07-22 19:24:31 +10:00
Rusty Russell
3a642e99ba modules: Take a shortcut for checking if an address is in a module
This patch keeps track of the boundaries of module allocation, in
order to speed up module_text_address().

Inspired by Arjan's version, which required arch-specific defines:

	Various pieces of the kernel (lockdep, latencytop, etc) tend
	to store backtraces, sometimes at a relatively high
	frequency. In itself this isn't a big performance deal (after
	all you're using diagnostics features), but there have been
	some complaints from people who have over 100 modules loaded
	that this is a tad too slow.

	This is due to the new backtracer code which looks at every
	slot on the stack to see if it's a kernel/module text address,
	so that's 1024 slots.  1024 times 100 modules... that's a lot
	of list walking.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-07-22 19:24:28 +10:00
Denys Vlasenko
2f0f2a334b module: turn longs into ints for module sizes
This shrinks module.o and each *.ko file.

And finally, structure members which hold length of module
code (four such members there) and count of symbols
are converted from longs to ints.

We cannot possibly have a module where 32 bits won't
be enough to hold such counts.

For one, module loading checks module size for sanity
before loading, so such insanely big module will fail
that test first.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-07-22 19:24:27 +10:00
Denys Vlasenko
f7f5b67557 Shrink struct module: CONFIG_UNUSED_SYMBOLS ifdefs
module.c and module.h conatains code for finding
exported symbols which are declared with EXPORT_UNUSED_SYMBOL,
and this code is compiled in even if CONFIG_UNUSED_SYMBOLS is not set
and thus there can be no EXPORT_UNUSED_SYMBOLs in modules anyway
(because EXPORT_UNUSED_SYMBOL(x) are compiled out to nothing then).

This patch adds required #ifdefs.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-07-22 19:24:27 +10:00
Rusty Russell
dafd0940c9 module: generic each_symbol iterator function
Introduce an each_symbol() iterator to avoid duplicating the knowledge
about the 5 different sections containing symbols.  Currently only
used by find_symbol(), but will be used by symbol_put_addr() too.

(Includes NULL ptr deref fix by Jiri Kosina <jkosina@suse.cz>)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jiri Kosina <jkosina@suse.cz>
2008-07-22 19:24:26 +10:00
Rusty Russell
da39ba5e1d module: don't use stop_machine for waiting rmmod
rmmod has a little-used "-w" option, meaning that instead of failing if the
module is in use, it should block until the module becomes unused.

In this case, we don't need to use stop_machine: Max Krasnyansky
indicated that would be useful for SystemTap which loads/unloads new
modules frequently.

Cc: Max Krasnyansky <maxk@qualcomm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-07-22 19:24:25 +10:00
Ingo Molnar
76c3bb15d6 Merge branch 'linus' into x86/x2apic 2008-07-22 09:06:21 +02:00
Andi Kleen
4a0b2b4dbe sysdev: Pass the attribute to the low level sysdev show/store function
This allow to dynamically generate attributes and share show/store
functions between attributes. Right now most attributes are generated
by special macros and lots of duplicated code. With the attribute
passed it's instead possible to attach some data to the attribute
and then use that in shared low level functions to do different things.

I need this for the dynamically generated bank attributes in the x86
machine check code, but it'll allow some further cleanups.

I converted all users in tree to the new show/store prototype. It's a single
huge patch to avoid unbisectable sections.

Runtime tested: x86-32, x86-64
Compiled only: ia64, powerpc
Not compile tested/only grep converted: sh, arm, avr32

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-21 21:55:02 -07:00
Dmitry Baryshkov
b6d4f7e3ef dma-coherent: add documentation to new interfaces
Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-20 21:22:00 +02:00
Ingo Molnar
ba42059fbd sched: hrtick_enabled() should use cpu_active()
Peter pointed out that hrtick_enabled() should use cpu_active().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-20 11:02:06 +02:00
Ingo Molnar
d986434a7d Merge branch 'sched/urgent' into sched/devel 2008-07-20 11:01:29 +02:00
Peter Zijlstra
31656519e1 sched, x86: clean up hrtick implementation
random uvesafb failures were reported against Gentoo:

  http://bugs.gentoo.org/show_bug.cgi?id=222799

and Mihai Moldovan bisected it back to:

> 8f4d37ec07 is first bad commit
> commit 8f4d37ec07
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date:   Fri Jan 25 21:08:29 2008 +0100
>
>    sched: high-res preemption tick

Linus suspected it to be hrtick + vm86 interaction and observed:

> Btw, Peter, Ingo: I think that commit is doing bad things. They aren't
> _incorrect_ per se, but they are definitely bad.
>
> Why?
>
> Using random _TIF_WORK_MASK flags is really impolite for doing
> "scheduling" work. There's a reason that arch/x86/kernel/entry_32.S
> special-cases the _TIF_NEED_RESCHED flag: we don't want to exit out of
> vm86 mode unnecessarily.
>
> See the "work_notifysig_v86" label, and how it does that
> "save_v86_state()" thing etc etc.

Right, I never liked having to fiddle with those TIF flags. Initially I
needed it because the hrtimer base lock could not nest in the rq lock.
That however is fixed these days.

Currently the only reason left to fiddle with the TIF flags is remote
wakeups. We cannot program a remote cpu's hrtimer. I've been thinking
about using the new and improved IPI function call stuff to implement
hrtimer_start_on().

However that does require that smp_call_function_single(.wait=0) works
from interrupt context - /me looks at the latest series from Jens - Yes
that does seem to be supported, good.

Here's a stab at cleaning this stuff up ...

Mihai reported test success as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Mihai Moldovan <ionic@ionic.de>
Cc: Michal Januszewski <spock@gentoo.org>
Cc: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-20 10:37:28 +02:00
Ingo Molnar
a208f37a46 Merge branch 'linus' into x86/x2apic 2008-07-18 22:50:34 +02:00
Mike Travis
c18a41fbbc cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
* Optimize various places where a pointer to the cpumask_of_cpu value
    will result in reducing stack pressure.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 22:02:59 +02:00
Mike Travis
65c0118453 cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
* This patch replaces the dangerous lvalue version of cpumask_of_cpu
    with new cpumask_of_cpu_ptr macros.  These are patterned after the
    node_to_cpumask_ptr macros.

    In general terms, if there is a cpumask_of_cpu_map[] then a pointer to
    the cpumask_of_cpu_map[cpu] entry is used.  The cpumask_of_cpu_map
    is provided when there is a large NR_CPUS count, reducing
    greatly the amount of code generated and stack space used for
    cpumask_of_cpu().  The pointer to the cpumask_t value is needed for
    calling set_cpus_allowed_ptr() to reduce the amount of stack space
    needed to pass the cpumask_t value.

    If there isn't a cpumask_of_cpu_map[], then a temporary variable is
    declared and filled in with value from cpumask_of_cpu(cpu) as well as
    a pointer variable pointing to this temporary variable.  Afterwards,
    the pointer is used to reference the cpumask value.  The compiler
    will optimize out the extra dereference through the pointer as well
    as the stack space used for the pointer, resulting in identical code.

    A good example of the orthogonal usages is in net/sunrpc/svc.c:

	case SVC_POOL_PERCPU:
	{
		unsigned int cpu = m->pool_to[pidx];
		cpumask_of_cpu_ptr(cpumask, cpu);

		*oldmask = current->cpus_allowed;
		set_cpus_allowed_ptr(current, cpumask);
		return 1;
	}
	case SVC_POOL_PERNODE:
	{
		unsigned int node = m->pool_to[pidx];
		node_to_cpumask_ptr(nodecpumask, node);

		*oldmask = current->cpus_allowed;
		set_cpus_allowed_ptr(current, nodecpumask);
		return 1;
	}

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 22:02:57 +02:00
Ingo Molnar
bb2c018b09 Merge branch 'linus' into cpus4096
Conflicts:

	drivers/acpi/processor_throttling.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 22:00:54 +02:00
Dmitry Baryshkov
538c29d43e Generic dma-coherent: fix DMA_MEMORY_EXCLUSIVE
Don't rewrite successfull allocation return values
in case the memory was marked with DMA_MEMORY_EXCLUSIVE.

Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 21:14:01 +02:00
Ingo Molnar
f6dc8ccaab Merge branch 'linus' into core/generic-dma-coherent
Conflicts:

	kernel/Makefile

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 21:13:20 +02:00
Ingo Molnar
9b610fda0d Merge branch 'linus' into timers/nohz 2008-07-18 19:53:16 +02:00
Steven Rostedt
1e01cb0c6f ftrace: only trace preempt off with preempt tracer
When PREEMPT_TRACER and IRQSOFF_TRACER are both configured and irqsoff
tracer is running, the preempt_off sections might also be traced.

Thanks to Andrew Morton for pointing out my mistake of spin_lock disabling
interrupts while he was reviewing ftrace.txt. Seems that my example I used
actually hit this bug.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 18:57:34 +02:00
Mike Travis
8df185a95c kthread: reduce stack pressure in create_kthread and kthreadd
* Replace:

  	set_cpus_allowed(..., CPU_MASK_ALL)

    with:

  	set_cpus_allowed_ptr(..., CPU_MASK_ALL_PTR)

    to remove excessive stack requirements when NR_CPUS=4096.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 18:46:58 +02:00
Hiroshi Shimamoto
4dca10a960 softlockup: fix invalid proc_handler for softlockup_panic
The type of softlockup_panic is int, but the proc_handler is
proc_doulongvec_minmax(). This handler is for unsigned long.

This handler should be proc_dointvec_minmax().

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 18:29:28 +02:00
Thomas Gleixner
b8f8c3cf0a nohz: prevent tick stop outside of the idle loop
Jack Ren and Eric Miao tracked down the following long standing
problem in the NOHZ code:

	scheduler switch to idle task
	enable interrupts

Window starts here

	----> interrupt happens (does not set NEED_RESCHED)
	      	irq_exit() stops the tick

	----> interrupt happens (does set NEED_RESCHED)

	return from schedule()
	
	cpu_idle(): preempt_disable();

Window ends here

The interrupts can happen at any point inside the race window. The
first interrupt stops the tick, the second one causes the scheduler to
rerun and switch away from idle again and we end up with the tick
disabled.

The fact that it needs two interrupts where the first one does not set
NEED_RESCHED and the second one does made the bug obscure and extremly
hard to reproduce and analyse. Kudos to Jack and Eric.

Solution: Limit the NOHZ functionality to the idle loop to make sure
that we can not run into such a situation ever again.

cpu_idle()
{
	preempt_disable();

	while(1) {
		 tick_nohz_stop_sched_tick(1); <- tell NOHZ code that we
		 			          are in the idle loop

		 while (!need_resched())
		       halt();

		 tick_nohz_restart_sched_tick(); <- disables NOHZ mode
		 preempt_enable_no_resched();
		 schedule();
		 preempt_disable();
	}
}

In hindsight we should have done this forever, but ... 

/me grabs a large brown paperbag.

Debugged-by: Jack Ren <jack.ren@marvell.com>, 
Debugged-by: eric miao <eric.y.miao@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-07-18 18:10:28 +02:00
Lai Jiangshan
5127bed588 rcu classic: new algorithm for callbacks-processing(v2)
This is v2, it's a little deference from v1 that I
had send to lkml.
use ACCESS_ONCE
use rcu_batch_after/rcu_batch_before for batch # comparison.

rcutorture test result:
(hotplugs: do cpu-online/offline once per second)

No CONFIG_NO_HZ:           OK, 12hours
No CONFIG_NO_HZ, hotplugs: OK, 12hours
CONFIG_NO_HZ=y:            OK, 24hours
CONFIG_NO_HZ=y, hotplugs:  Failed.
(Failed also without my patch applied, exactly the same bug occurred,
http://lkml.org/lkml/2008/7/3/24)

v1's email thread:
http://lkml.org/lkml/2008/6/2/539

v1's description:

The code/algorithm of the implement of current callbacks-processing
is very efficient and technical. But when I studied it and I found
a disadvantage:

In multi-CPU systems, when a new RCU callback is being
queued(call_rcu[_bh]), this callback will be invoked after the grace
period for the batch with batch number = rcp->cur+2 has completed
very very likely in current implement. Actually, this callback can be
invoked after the grace period for the batch with
batch number = rcp->cur+1 has completed. The delay of invocation means
that latency of synchronize_rcu() is extended. But more important thing
is that the callbacks usually free memory, and these works are delayed
too! it's necessary for reclaimer to free memory as soon as
possible when left memory is few.

A very simple way can solve this problem:
a field(struct rcu_head::batch) is added to record the batch number for
the RCU callback. And when a new RCU callback is being queued, we
determine the batch number for this callback(head->batch = rcp->cur+1)
and we move this callback to rdp->donelist if we find
that head->batch <= rcp->completed when we process callbacks.
This simple way reduces the wait time for invocation a lot. (about
2.5Grace Period -> 1.5Grace Period in average in multi-CPU systems)

This is my algorithm. But I do not add any field for struct rcu_head
in my implement. We just need to memorize the last 2 batches and
their batch number, because these 2 batches include all entries that
for whom the grace period hasn't completed. So we use a special
linked-list rather than add a field.
Please see the comment of struct rcu_data.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 16:07:33 +02:00
Lai Jiangshan
3cac97cbb1 rcu classic: simplify the next pending batch
use a batch number(rcp->pending) instead of a flag(rcp->next_pending)

rcu_start_batch() need to change this flag, so mb()s is needed
for memory-access safe.

but(after this patch applied) rcu_start_batch() do not change
this batch number(rcp->pending), rcp->pending is managed by
__rcu_process_callbacks only, and troublesome mb()s are eliminated.

And codes look simpler and clearer.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 16:07:32 +02:00
David Howells
577b4a58d2 sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
Fix inc_rt_tasks() to not declare variable 'rq' if it's not needed.  It is
declared if CONFIG_SMP or CONFIG_RT_GROUP_SCHED, but only used if CONFIG_SMP.

This is a consequence of patch 1f11eb6a8b plus
patch 1100ac91b6.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 13:56:03 +02:00
Steven Rostedt
e59494f441 ftrace: fix 4d3702b6 (post-v2.6.26): WARNING: at kernel/lockdep.c:2731 check_flags (ftrace)
On Wed, 16 Jul 2008, Vegard Nossum wrote:

> When booting 4d3702b6, I got this huge thing:
>
> Testing tracer wakeup: <4>------------[ cut here ]------------
> WARNING: at kernel/lockdep.c:2731 check_flags+0x123/0x160()
> Modules linked in:
> Pid: 1, comm: swapper Not tainted 2.6.26-crashing-02127-g4d3702b6 #30
>  [<c015c349>] warn_on_slowpath+0x59/0xb0
>  [<c01276c6>] ? ftrace_call+0x5/0x8
>  [<c012d800>] ? native_read_tsc+0x0/0x20
>  [<c0158de2>] ? sub_preempt_count+0x12/0xf0
>  [<c01814eb>] ? trace_hardirqs_off+0xb/0x10
>  [<c0182fbc>] ? __lock_acquire+0x2cc/0x1120
>  [<c01814eb>] ? trace_hardirqs_off+0xb/0x10
>  [<c01276af>] ? mcount_call+0x5/0xa
>  [<c017ff53>] check_flags+0x123/0x160
>  [<c0183e61>] lock_acquire+0x51/0xd0
>  [<c01276c6>] ? ftrace_call+0x5/0x8
>  [<c0613d4f>] _spin_lock_irqsave+0x5f/0xa0
>  [<c01a8d45>] ? ftrace_record_ip+0xf5/0x220
>  [<c02d5413>] ? debug_locks_off+0x3/0x50
>  [<c01a8d45>] ftrace_record_ip+0xf5/0x220
>  [<c01276af>] mcount_call+0x5/0xa
>  [<c02d5418>] ? debug_locks_off+0x8/0x50
>  [<c017ff27>] check_flags+0xf7/0x160
>  [<c0183e61>] lock_acquire+0x51/0xd0
>  [<c01276c6>] ? ftrace_call+0x5/0x8
>  [<c0613d4f>] _spin_lock_irqsave+0x5f/0xa0
>  [<c01affcd>] ? wakeup_tracer_call+0x6d/0xf0
>  [<c01625e2>] ? _local_bh_enable+0x62/0xb0
>  [<c0158ddd>] ? sub_preempt_count+0xd/0xf0
>  [<c01affcd>] wakeup_tracer_call+0x6d/0xf0
>  [<c0162724>] ? __do_softirq+0xf4/0x110
>  [<c01afff1>] ? wakeup_tracer_call+0x91/0xf0
>  [<c01276c6>] ftrace_call+0x5/0x8
>  [<c0162724>] ? __do_softirq+0xf4/0x110
>  [<c0158de2>] ? sub_preempt_count+0x12/0xf0
>  [<c01625e2>] _local_bh_enable+0x62/0xb0
>  [<c0162724>] __do_softirq+0xf4/0x110
>  [<c01627ed>] do_softirq+0xad/0xb0
>  [<c0162a15>] irq_exit+0xa5/0xb0
>  [<c013a506>] smp_apic_timer_interrupt+0x66/0xa0
>  [<c02d3fac>] ? trace_hardirqs_off_thunk+0xc/0x10
>  [<c0127449>] apic_timer_interrupt+0x2d/0x34
>  [<c018007b>] ? find_usage_backwards+0xb/0xf0
>  [<c0613a09>] ? _spin_unlock_irqrestore+0x69/0x80
>  [<c014ef32>] tg_shares_up+0x132/0x1d0
>  [<c014d2a2>] walk_tg_tree+0x62/0xa0
>  [<c014ee00>] ? tg_shares_up+0x0/0x1d0
>  [<c014a860>] ? tg_nop+0x0/0x10
>  [<c015499d>] update_shares+0x5d/0x80
>  [<c0154a2f>] try_to_wake_up+0x6f/0x280
>  [<c01a8b90>] ? __ftrace_modify_code+0x0/0xc0
>  [<c01a8b90>] ? __ftrace_modify_code+0x0/0xc0
>  [<c0154c94>] wake_up_process+0x14/0x20
>  [<c01725f6>] kthread_create+0x66/0xb0
>  [<c0195400>] ? do_stop+0x0/0x200
>  [<c0195320>] ? __stop_machine_run+0x30/0xb0
>  [<c0195340>] __stop_machine_run+0x50/0xb0
>  [<c0195400>] ? do_stop+0x0/0x200
>  [<c01a8b90>] ? __ftrace_modify_code+0x0/0xc0
>  [<c061242d>] ? mutex_unlock+0xd/0x10
>  [<c01953cc>] stop_machine_run+0x2c/0x60
>  [<c01a94d3>] unregister_ftrace_function+0x103/0x180
>  [<c01b0517>] stop_wakeup_tracer+0x17/0x60
>  [<c01b056f>] wakeup_tracer_ctrl_update+0xf/0x30
>  [<c01ab8d5>] trace_selftest_startup_wakeup+0xb5/0x130
>  [<c01ab950>] ? trace_wakeup_test_thread+0x0/0x70
>  [<c01aadf5>] register_tracer+0x135/0x1b0
>  [<c0877d02>] init_wakeup_tracer+0xd/0xf
>  [<c085d437>] kernel_init+0x1a9/0x2ce
>  [<c061397b>] ? _spin_unlock_irq+0x3b/0x60
>  [<c02d3f9c>] ? trace_hardirqs_on_thunk+0xc/0x10
>  [<c0877cf5>] ? init_wakeup_tracer+0x0/0xf
>  [<c0182646>] ? trace_hardirqs_on_caller+0x126/0x180
>  [<c02d3f9c>] ? trace_hardirqs_on_thunk+0xc/0x10
>  [<c01269c8>] ? restore_nocheck_notrace+0x0/0xe
>  [<c085d28e>] ? kernel_init+0x0/0x2ce
>  [<c085d28e>] ? kernel_init+0x0/0x2ce
>  [<c01275fb>] kernel_thread_helper+0x7/0x10
>  =======================
> ---[ end trace a7919e7f17c0a725 ]---
> irq event stamp: 579530
> hardirqs last  enabled at (579528): [<c01826ab>] trace_hardirqs_on+0xb/0x10
> hardirqs last disabled at (579529): [<c01814eb>] trace_hardirqs_off+0xb/0x10
> softirqs last  enabled at (579530): [<c0162724>] __do_softirq+0xf4/0x110
> softirqs last disabled at (579517): [<c01627ed>] do_softirq+0xad/0xb0
> irq event stamp: 579530
> hardirqs last  enabled at (579528): [<c01826ab>] trace_hardirqs_on+0xb/0x10
> hardirqs last disabled at (579529): [<c01814eb>] trace_hardirqs_off+0xb/0x10
> softirqs last  enabled at (579530): [<c0162724>] __do_softirq+0xf4/0x110
> softirqs last disabled at (579517): [<c01627ed>] do_softirq+0xad/0xb0
> PASSED
>
> Incidentally, the kernel also hung while I was typing in this report.

Things get weird between lockdep and ftrace because ftrace can be called
within lockdep internal code (via the mcount pointer) and lockdep can be
called with ftrace (via spin_locks).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 13:47:15 +02:00
Max Krasnyansky
39b0fad712 cpu hotplug: Make cpu_active_map synchronization dependency clear
This goes on top of the cpu_active_map (take 2) patch.

Currently we depend on the stop_machine to provide nescessesary
synchronization for the cpu_active_map updates.
As Dmitry Adamushko pointed this is fragile and is not much clearer
than the previous scheme. In other words we do not want to depend on
the internal stop machine operation here.
So make the synchronization rules clear by doing synchronize_sched()
after clearing out cpu active bit.

Tested on quad-Core2 with:

   while true; do
      for i in 1 2 3; do
        echo 0 > /sys/devices/system/cpu/cpu$i/online
      done
      for i in 1 2 3; do
        echo 1 > /sys/devices/system/cpu/cpu$i/online
      done
   done
and
   stress -c 200

No lockdep, preempt or other complaints.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 13:23:18 +02:00
Max Krasnyansky
e761b77252 cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
This is based on Linus' idea of creating cpu_active_map that prevents
scheduler load balancer from migrating tasks to the cpu that is going
down.

It allows us to simplify domain management code and avoid unecessary
domain rebuilds during cpu hotplug event handling.

Please ignore the cpusets part for now. It needs some more work in order
to avoid crazy lock nesting. Although I did simplfy and unify domain
reinitialization logic. We now simply call partition_sched_domains() in
all the cases. This means that we're using exact same code paths as in
cpusets case and hence the test below cover cpusets too.
Cpuset changes to make rebuild_sched_domains() callable from various
contexts are in the separate patch (right next after this one).

This not only boots but also easily handles
	while true; do make clean; make -j 8; done
and
	while true; do on-off-cpu 1; done
at the same time.
(on-off-cpu 1 simple does echo 0/1 > /sys/.../cpu1/online thing).

Suprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
this on right now in gnome-terminal and things are moving just fine.

Also this is running with most of the debug features enabled (lockdep,
mutex, etc) no BUG_ONs or lockdep complaints so far.

I believe I addressed all of the Dmitry's comments for original Linus'
version. I changed both fair and rt balancer to mask out non-active cpus.
And replaced cpu_is_offline() with !cpu_active() in the main scheduler
code where it made sense (to me).

Signed-off-by: Max Krasnyanskiy <maxk@qualcomm.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Cc: dmitry.adamushko@gmail.com
Cc: pj@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 13:22:25 +02:00
Dmitry Adamushko
7ebefa8cee sched: rework of "prioritize non-migratable tasks over migratable ones"
(1) handle in a generic way all cases when a newly woken-up task is
not migratable (not just a corner case when "rt_se->nr_cpus_allowed ==
1")

(2) if current is to be preempted, then make sure "p" will be picked
up by pick_next_task_rt().
i.e. move task's group at the head of its list as well.

currently, it's not a case for the group-scheduling case as described
here: http://www.ussg.iu.edu/hypermail/linux/kernel/0807.0/0134.html

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 12:55:14 +02:00
Mike Travis
13b40c1e40 sched: reduce stack size in isolated_cpu_setup()
* Remove 16k stack requirements in isolated_cpu_setup when NR_CPUS=4096.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 11:55:42 +02:00
Ingo Molnar
d88c169197 Revert parts of "ftrace: do not trace scheduler functions"
the removal of -mno-spe in the !ftrace case was not intended.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 08:59:24 +02:00
Ingo Molnar
393d81aa02 Merge branch 'linus' into xen-64bit 2008-07-17 23:57:20 +02:00
Linus Torvalds
bdec6cace4 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ftrace: do not trace library functions
  ftrace: do not trace scheduler functions
  ftrace: fix lockup with MAXSMP
  ftrace: fix merge buglet
2008-07-17 10:37:10 -07:00
Jeremy Fitzhardinge
93a0886e23 x86, xen, power: fix up config dependencies on PM
Xen save/restore needs bits of code enabled by PM_SLEEP, and PM_SLEEP
depends on PM.  So make XEN_SAVE_RESTORE depend on PM and PM_SLEEP
depend on XEN_SAVE_RESTORE.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-17 19:25:20 +02:00
Ingo Molnar
c349e0a01c ftrace: do not trace scheduler functions
do not trace scheduler functions - it's still a bit fragile
and can lock up with:

  http://redhat.com/~mingo/misc/config-Thu_Jul_17_13_34_52_CEST_2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-17 17:40:11 +02:00
Roland McGrath
666f164f4f fix dangling zombie when new parent ignores children
This fixes an arcane bug that we think was a regression introduced
by commit b2b2cbc4b2.  When a parent
ignores SIGCHLD (or uses SA_NOCLDWAIT), its children would self-reap
but they don't if it's using ptrace on them.  When the parent thread
later exits and ceases to ptrace a child but leaves other live
threads in the parent's thread group, any zombie children are left
dangling.  The fix makes them self-reap then, as they would have
done earlier if ptrace had not been in use.

Signed-off-by: Roland McGrath <roland@redhat.com>
2008-07-16 18:02:34 -07:00