Commit Graph

25711 Commits

Author SHA1 Message Date
Bjorn Helgaas
a1802c4295 PNP: make resource option structures private to PNP subsystem
Nothing outside the PNP subsystem should need access to a
device's resource options, so this patch moves the option
structure declarations to a private header file.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rene Herman <rene.herman@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-07-16 23:27:06 +02:00
Bjorn Helgaas
08c9f262f2 PNP: define PNP-specific IORESOURCE_IO_* flags alongside IRQ, DMA, MEM
PNP previously defined PNP_PORT_FLAG_16BITADDR and PNP_PORT_FLAG_FIXED
in a private header file, but put those flags in struct resource.flags
fields.  Better to make them IORESOURCE_IO_* flags like the existing
IRQ, DMA, and MEM flags.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rene Herman <rene.herman@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-07-16 23:27:06 +02:00
Bjorn Helgaas
57fd51a8be PNP: add pnp_possible_config() -- can a device be configured this way?
As part of a heuristic to identify modem devices, 8250_pnp.c
checks to see whether a device can be configured at any of the
legacy COM port addresses.

This patch moves the code that traverses the PNP "possible resource
options" from 8250_pnp.c to the PNP subsystem.  This encapsulation
is important because a future patch will change the implementation
of those resource options.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rene Herman <rene.herman@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-07-16 23:27:06 +02:00
Bjorn Helgaas
aee3ad815d PNP: replace pnp_resource_table with dynamically allocated resources
PNP used to have a fixed-size pnp_resource_table for tracking the
resources used by a device.  This table often overflowed, so we've
had to increase the table size, which wastes memory because most
devices have very few resources.

This patch replaces the table with a linked list of resources where
the entries are allocated on demand.

This removes messages like these:

    pnpacpi: exceeded the max number of IO resources
    00:01: too many I/O port resources

References:

    http://bugzilla.kernel.org/show_bug.cgi?id=9535
    http://bugzilla.kernel.org/show_bug.cgi?id=9740
    http://lkml.org/lkml/2007/11/30/110

This patch also changes the way PNP uses the IORESOURCE_UNSET,
IORESOURCE_AUTO, and IORESOURCE_DISABLED flags.

Prior to this patch, the pnp_resource_table entries used the flags
like this:

    IORESOURCE_UNSET
	This table entry is unused and available for use.  When this flag
	is set, we shouldn't look at anything else in the resource structure.
	This flag is set when a resource table entry is initialized.

    IORESOURCE_AUTO
	This resource was assigned automatically by pnp_assign_{io,mem,etc}().

	This flag is set when a resource table entry is initialized and
	cleared whenever we discover a resource setting by reading an ISAPNP
	config register, parsing a PNPBIOS resource data stream, parsing an
	ACPI _CRS list, or interpreting a sysfs "set" command.

	Resources marked IORESOURCE_AUTO are reinitialized and marked as
	IORESOURCE_UNSET by pnp_clean_resource_table() in these cases:

	    - before we attempt to assign resources automatically,
	    - if we fail to assign resources automatically,
	    - after disabling a device

    IORESOURCE_DISABLED
	Set by pnp_assign_{io,mem,etc}() when automatic assignment fails.
	Also set by PNPBIOS and PNPACPI for:

	    - invalid IRQs or GSI registration failures
	    - invalid DMA channels
	    - I/O ports above 0x10000
	    - mem ranges with negative length

After this patch, there is no pnp_resource_table, and the resource list
entries use the flags like this:

    IORESOURCE_UNSET
	This flag is no longer used in PNP.  Instead of keeping
	IORESOURCE_UNSET entries in the resource list, we remove
	entries from the list and free them.

    IORESOURCE_AUTO
	No change in meaning: it still means the resource was assigned
	automatically by pnp_assign_{port,mem,etc}(), but these functions
	now set the bit explicitly.

	We still "clean" a device's resource list in the same places,
	but rather than reinitializing IORESOURCE_AUTO entries, we
	just remove them from the list.

	Note that IORESOURCE_AUTO entries are always at the end of the
	list, so removing them doesn't reorder other list entries.
	This is because non-IORESOURCE_AUTO entries are added by the
	ISAPNP, PNPBIOS, or PNPACPI "get resources" methods and by the
	sysfs "set" command.  In each of these cases, we completely free
	the resource list first.

    IORESOURCE_DISABLED
	In addition to the cases where we used to set this flag, ISAPNP now
	adds an IORESOURCE_DISABLED resource when it reads a configuration
	register with a "disabled" value.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:05 +02:00
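
The list handling described in aee3ad815d above can be sketched in user-space C. This is only an illustration under stated assumptions: the struct names, helper names, and flag values (RES_IO, RES_AUTO) are hypothetical and are not the kernel's definitions. It shows entries allocated on demand and "cleaning" done by unlinking and freeing automatically assigned entries rather than reinitializing them.

```c
#include <stdlib.h>

/* Hypothetical flag values for this sketch; the kernel defines its own. */
#define RES_IO   0x1u
#define RES_AUTO 0x2u

struct res_node {
    unsigned long start, end;
    unsigned int flags;
    struct res_node *next;
};

struct dev_resources {
    struct res_node *head;
    struct res_node *tail;
};

/* Append a resource, allocating the entry on demand. Returns 0 on success. */
static int res_add(struct dev_resources *d, unsigned long start,
                   unsigned long end, unsigned int flags)
{
    struct res_node *n = calloc(1, sizeof(*n));

    if (!n)
        return -1;
    n->start = start;
    n->end = end;
    n->flags = flags;
    if (d->tail)
        d->tail->next = n;
    else
        d->head = n;
    d->tail = n;
    return 0;
}

/* Unlink and free every automatically assigned entry. */
static void res_clean_auto(struct dev_resources *d)
{
    struct res_node **pp = &d->head;
    struct res_node *last = NULL;

    while (*pp) {
        struct res_node *n = *pp;

        if (n->flags & RES_AUTO) {
            *pp = n->next;      /* unlink, keeping the order of the rest */
            free(n);
        } else {
            last = n;
            pp = &n->next;
        }
    }
    d->tail = last;
}

/* Count the entries currently on the list. */
static int res_count(const struct dev_resources *d)
{
    int count = 0;

    for (const struct res_node *n = d->head; n; n = n->next)
        count++;
    return count;
}
```

Adding one fixed and one RES_AUTO entry and then calling res_clean_auto() leaves only the fixed entry on the list, with no unused placeholder slot left behind.
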
Bjorn Helgaas
20bfdbba72 PNP: make pnp_{port,mem,etc}_start(), et al work for invalid resources
Some callers use pnp_port_start() and similar functions without
making sure the resource is valid.  This patch makes us fall
back to returning the initial values if the resource is not
valid or not even present.

This mostly preserves the previous behavior, where we would just
return the initial values set by pnp_init_resource_table().  The
original 2.6.25 code didn't range-check the "bar", so it would
return garbage if the bar exceeded the table size.  This code
returns sensible values instead.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:05 +02:00
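
A rough sketch of the fallback behavior described in 20bfdbba72 above, assuming a simplified array of resources. The struct, the function name res_start_or(), and the caller-supplied fallback value are hypothetical; the commit message does not spell out the actual initial values, so the sketch takes them as a parameter.

```c
#include <stddef.h>

struct res {
    unsigned long start, end;
    int valid;              /* nonzero when the entry holds a real setting */
};

/*
 * Return the start of resource number "bar", or "fallback" when the
 * index is out of range or the entry is not valid. The range check is
 * what keeps an oversized "bar" from reading garbage.
 */
static unsigned long res_start_or(const struct res *table, size_t count,
                                  size_t bar, unsigned long fallback)
{
    if (bar >= count || !table[bar].valid)
        return fallback;
    return table[bar].start;
}
```
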
Zhao Yakui
da5e09a1b3 ACPI: Create "idle=nomwait" bootparam
"idle=nomwait" disables the use of the MWAIT
instruction from both C1 (C1_FFH) and deeper (C2C3_FFH)
C-states.

When MWAIT is unavailable, the BIOS and OS generally
negotiate to use the HALT instruction for C1,
and use IO accesses for deeper C-states.

This option is useful for power and performance
comparisons, and also to work around BIOS bugs
where broken MWAIT support is advertised.

http://bugzilla.kernel.org/show_bug.cgi?id=10807
http://bugzilla.kernel.org/show_bug.cgi?id=10914

Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:05 +02:00
Zhao Yakui
c1e3b377ad ACPI: Create "idle=halt" bootparam
"idle=halt" limits the idle loop to using
the halt instruction.  No MWAIT, no IO accesses,
no C-states deeper than C1.

If something is broken in the idle code,
"idle=halt" is a less severe workaround
than "idle=poll" which disables all power savings.

Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:05 +02:00
Zhang Rui
71b58cbb0c ACPI: Enhance /sys/firmware/interrupts to allow enable/disable/clear from user-space
Allow users to enable/disable/clear a specific & valid GPE/Fixed Event
in user space.

This is useful for debugging, especially for some
interrupt storm issues.

All wakeup GPEs are disabled and they can not be enabled at runtime,
and we mark them as invalid.

All GPEs that don't have a _Lxx/_Exx method are marked as invalid.

All Fixed Events that don't have an event handler are marked as invalid
and they can't be enabled until an event handler is registered.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ling Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:04 +02:00
Bob Moore
9c9f6d052d ACPICA: Update version to 20080609
Update version to 20080609.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:04 +02:00
Bob Moore
71d993e115 ACPICA: Cleanup debug operand dump mechanism
Eliminated unnecessary operands; eliminated use of negative index
in loop.  Operands now displayed in correct order, not backwards.

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:04 +02:00
Bob Moore
75e5b5fb77 ACPICA: Update disassembler for DMAR table changes
Now supports the 2007 Intel Virtualization Technology for Directed
I/O specification.

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:04 +02:00
Bob Moore
19d0cfe9dd ACPICA: Update DMAR and SRAT table definitions
Synchronized tables with current specifications.

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:04 +02:00
Bob Moore
b25d2a470b ACPICA: Update version to 20080514
Update version to 20080514

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:04 +02:00
Bob Moore
4b8ed63167 ACPICA: Add const qualifier for appropriate string constants
Mostly MODULE_NAME and printf format strings.

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:04 +02:00
Bob Moore
67a119f990 ACPICA: Eliminate acpi_native_uint type v2
No longer needed; replaced mostly with u32, but also acpi_size
where a type that changes 32/64 bit on 32/64-bit platforms is
required.

v2: Fix a cast of a 32-bit int to a pointer in ACPI to avoid a compiler warning.
from David Howells

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:03 +02:00
Bob Moore
11f2a61ab4 ACPICA: Fix possible negative array index in acpi_ut_validate_exception
Added NULL fields to the exception string arrays to eliminate
the -1 subtraction on the SubStatus field.

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:03 +02:00
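
The idea in 11f2a61ab4 above can be illustrated with a small table lookup. The array contents and names here are invented for the sketch and are not the ACPICA tables: a NULL placeholder occupies index 0, so the sub-status value indexes the array directly and no "- 1" adjustment (which goes negative for a sub-status of 0) is needed.

```c
#include <stddef.h>

/* Hypothetical name table; index 0 is a placeholder with no name. */
static const char *const exception_names[] = {
    NULL,
    "ERROR_ONE",
    "ERROR_TWO",
};

#define EXCEPTION_NAME_COUNT \
    (sizeof(exception_names) / sizeof(exception_names[0]))

/* Return the name for a sub-status, or NULL when there is none. */
static const char *exception_name(unsigned int sub_status)
{
    if (sub_status >= EXCEPTION_NAME_COUNT)
        return NULL;
    return exception_names[sub_status];   /* no "sub_status - 1" */
}
```
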
Jan Beulich
6719561f9b ACPICA: Update tracking macros to reduce code/data size
Changed ACPI_MODULE_NAME and ACPI_FUNCTION_NAME to use arrays of
strings instead of pointers to static strings. Jan Beulich and
Bob Moore.

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:03 +02:00
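
To show why the change in 6719561f9b above can shrink code and data, here is a minimal comparison of the two declaration styles. The variable names and the string are made up for illustration; the actual macro bodies are not shown in the commit message.

```c
/* Array form: the object is just the string bytes, including the NUL. */
static const char module_name_array[] = "utmisc";

/*
 * Pointer form: the string bytes are stored somewhere, and a separate
 * pointer object (sizeof(char *) bytes, possibly with a relocation)
 * refers to them.
 */
static const char *module_name_pointer = "utmisc";
```

Here sizeof(module_name_array) is 7, while the pointer form needs the same 7 bytes for the string plus the storage for the pointer itself.
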
Bob Moore
c91d924e3a ACPICA: Fix for hang on GPE method invocation
Fixes problem where the new method argument count validation mechanism
will enter an infinite loop when a GPE method is dispatched.
Problem fixed by removing the obsolete code that passes GPE block
information to the notify handler via the control method parameter pointer.

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:03 +02:00
Bob Moore
f3454ae810 ACPICA: Add argument count checking to control method invocation via acpi_evaluate_object
Error if too few arguments, warning if too many. This applies
only to external programmatic control method execution, not
method-to-method calls within the AML.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:03 +02:00
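
The policy stated in f3454ae810 above (error on too few arguments, warning on too many) can be sketched as a small classifier. The enum and function name are hypothetical and only mirror the rule in the commit message, not the ACPICA implementation.

```c
enum arg_check {
    ARGS_OK,
    ARGS_TOO_MANY,   /* caller passed extras: warn, but continue */
    ARGS_TOO_FEW     /* caller passed too few: fail the invocation */
};

/* Compare the count a method expects with the count the caller supplied. */
static enum arg_check check_arg_count(unsigned int expected,
                                      unsigned int given)
{
    if (given < expected)
        return ARGS_TOO_FEW;
    if (given > expected)
        return ARGS_TOO_MANY;
    return ARGS_OK;
}
```
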
Rafael J. Wysocki
ebb12db51f Freezer: Introduce PF_FREEZER_NOSIG
The freezer currently attempts to distinguish kernel threads from
user space tasks by checking if their mm pointer is unset and it
does not send fake signals to kernel threads.  However, there are
kernel threads, mostly related to networking, that behave like
user space tasks and may want to be sent a fake signal to be frozen.

Introduce the new process flag PF_FREEZER_NOSIG that will be set
by default for all kernel threads and make the freezer only send
fake signals to the tasks having PF_FREEZER_NOSIG unset.  Provide
the set_freezable_with_signal() function to be called by the kernel
threads that want to be sent a fake signal for freezing.

This patch should not change the freezer's observable behavior.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-07-16 23:27:03 +02:00
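
A toy model of the flag logic in ebb12db51f above. The flag value, the struct, and the helper names are assumptions made for this sketch; only the behavior (flag set by default for kernel threads, cleared by threads that opt in to fake signals) comes from the commit message.

```c
#define FREEZER_NOSIG 0x1u   /* hypothetical bit value for this sketch */

struct task {
    unsigned int flags;
};

/* Kernel threads start with the flag set; user tasks start without it. */
static void task_init(struct task *t, int is_kernel_thread)
{
    t->flags = is_kernel_thread ? FREEZER_NOSIG : 0u;
}

/* What a kernel thread would do to opt in to fake signals. */
static void task_set_freezable_with_signal(struct task *t)
{
    t->flags &= ~FREEZER_NOSIG;
}

/* The freezer only signals tasks that have the flag unset. */
static int freezer_sends_fake_signal(const struct task *t)
{
    return !(t->flags & FREEZER_NOSIG);
}
```
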
David Brownell
2fe2de5f6c ACPI PM: acpi_pm_device_sleep_state() cleanup
Get rid of a superfluous acpi_pm_device_sleep_state() parameter.  The
only legitimate value of that parameter must be derived from the first
parameter, which is what all the callers already do.  (However, this
does not address the fact that ACPI still doesn't set up those flags.)

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-07-16 23:27:02 +02:00
Vegard Nossum
47c00d2bc2 ACPICA: fix mutex names in debug code.
Reorder the mutex names to match the preceding #defines

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:01 +02:00
Bob Moore
e38e8a0743 Make GPE disable more robust
Implemented another change for the GPE disable. We now perform a
read-change-write of the enable register instead of simply writing out the
cached enable mask. This will prevent inadvertent enabling of GPEs if a rogue
GPE is received during initialization (before GPE handlers are installed.)

http://bugzilla.kernel.org/show_bug.cgi?id=6217

Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:01 +02:00
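
The difference described in e38e8a0743 above can be shown with a simulated 8-bit enable register. Everything here is a stand-in: hw_enable_reg models the hardware register, cached_enable_mask models a software copy that may be stale, and the function names are invented for the sketch.

```c
#include <stdint.h>

static uint8_t hw_enable_reg;       /* stand-in for the hardware register */
static uint8_t cached_enable_mask;  /* software copy, possibly out of date */

/* Old approach in this sketch: update the cache and write it out whole. */
static void gpe_disable_cached(unsigned int bit)
{
    cached_enable_mask &= (uint8_t)~(1u << bit);
    hw_enable_reg = cached_enable_mask;   /* may set bits hardware had off */
}

/* Read-change-write: touch only the one bit in the live register. */
static void gpe_disable_rmw(unsigned int bit)
{
    uint8_t value = hw_enable_reg;        /* read */

    value &= (uint8_t)~(1u << bit);       /* change */
    hw_enable_reg = value;                /* write */
}
```

With the register at 0x01 and a stale cache of 0x05, disabling bit 0 through the cached path writes 0x04 and so turns bit 2 on, while the read-change-write path leaves the register at 0x00.
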
Mike Travis
706546d023 ACPI: change processors from array to per_cpu variable
Change processors from an array sized by NR_CPUS to a per_cpu variable.

Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-07-16 23:27:01 +02:00
Roland McGrath
380fdd7585 x86 ptrace: user-sets-TF nits
This closes some arcane holes in single-step handling that can arise
only when user programs set TF directly (via popf or sigreturn) and
then use vDSO (syscall/sysenter) system call entry.  In those entry
paths, the clear_TF_reenable case hits and we must check TIF_SINGLESTEP
to be sure our bookkeeping stays correct wrt the user's view of TF.

Signed-off-by: Roland McGrath <roland@redhat.com>
2008-07-16 12:15:17 -07:00
Roland McGrath
d4d6715016 x86 ptrace: unify syscall tracing
This unifies and cleans up the syscall tracing code on i386 and x86_64.

Using a single function for entry and exit tracing on 32-bit made the
do_syscall_trace() into some terrible spaghetti.  The logic is clear and
simple using separate syscall_trace_enter() and syscall_trace_leave()
functions as on 64-bit.

The unification adds PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP support
on x86_64, for 32-bit ptrace() callers and for 64-bit ptrace() callers
tracing either 32-bit or 64-bit tasks.  It behaves just like 32-bit.

Changing syscall_trace_enter() to return the syscall number shortens
all the assembly paths, while adding the SYSEMU feature in a simple way.

Signed-off-by: Roland McGrath <roland@redhat.com>
2008-07-16 12:15:17 -07:00
Roland McGrath
64f0973319 x86 ptrace: unify TIF_SINGLESTEP
This unifies the treatment of TIF_SINGLESTEP on i386 and x86_64.
The bit is now excluded from _TIF_WORK_MASK on i386 as it has been
on x86_64.  This means the do_notify_resume() path using it is never
used, so TIF_SINGLESTEP is not cleared on returning to user mode.

Both now leave TIF_SINGLESTEP set when returning to user, so that
it's already set on an int $0x80 system call entry.  This removes
the need for testing TF on the system_call path.  Doing it this way
fixes the regression for PTRACE_SINGLESTEP into a sigreturn syscall,
introduced by commit 1e2e99f0e4.

The clear_TF_reenable case that sets TIF_SINGLESTEP can only happen
on a non-exception kernel entry, i.e. sysenter/syscall instruction.
That will always get to the syscall exit tracing path.

Signed-off-by: Roland McGrath <roland@redhat.com>
2008-07-16 12:15:16 -07:00
Elias Oltmanns
3ef5eb424e IDE: Remove unused code
Remove some code which has been made obsolete and hasn't worked properly
before anyway.  Part of the infrastructure may be reintroduced in a
follow up patch to implement a working command aborting facility.

Signed-off-by: Elias Oltmanns <eo@nebensachen.de>
Cc: "Alan Cox" <alan@lxorguk.ukuu.org.uk>
Cc: "Randy Dunlap" <randy.dunlap@oracle.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:48 +02:00
Elias Oltmanns
79e36a9f54 IDE: Fix HDIO_DRIVE_RESET handling
Currently, the code path executing an HDIO_DRIVE_RESET ioctl is broken
in various ways.  Most importantly, it is treated as an out of band
request in an illegal way which may very likely lead to system lock ups.
Use the drive's request queue to avoid this problem (and fix a locking
issue for free along the way).

Signed-off-by: Elias Oltmanns <eo@nebensachen.de>
Cc: "Alan Cox" <alan@lxorguk.ukuu.org.uk>
Cc: "Randy Dunlap" <randy.dunlap@oracle.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:48 +02:00
Bartlomiej Zolnierkiewicz
e6d95bd149 ide: ->port_init_devs -> ->init_dev
Change ->port_init_devs method to take 'ide_drive_t *' as an argument
instead of 'ide_hwif_t *' and rename it to ->init_dev.

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:42 +02:00
Bartlomiej Zolnierkiewicz
c56c5648a3 ide: set hwif->dev in ide_init_port_hw() (take 2)
* Add 'parent' field to hw_regs_t for optional parent device pointer (needed
  by macio PMAC IDE controllers) and set hwif->dev in ide_init_port_hw().

* Update au1xxx-ide.c, sgiioc4.c, pmac.c and setup-pci.c accordingly.

v2:

* Update scc_pata.c.

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:40 +02:00
Bartlomiej Zolnierkiewicz
63b51c6d1d ide: make ide_hwifs[] static
Move ide_hwifs[] from ide.c to ide-probe.c and make it static.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:40 +02:00
Bartlomiej Zolnierkiewicz
9ad5409375 ide: move PIO blacklist to ide-pio-blacklist.c
Move PIO blacklist to ide-pio-blacklist.c.

While at it:

- fix comment

- fix whitespace damage

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:39 +02:00
Bartlomiej Zolnierkiewicz
3e153cfb5e ide: remove no longer used ide_pio_timings[]
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:39 +02:00
Bartlomiej Zolnierkiewicz
c9d6c1a237 ide: move ide_pio_cycle_time() to ide-timings.c
All ide_pio_cycle_time() users already select CONFIG_IDE_TIMINGS
so move the function from ide-lib.c to ide-timings.c.

While at it:

- convert ide_pio_cycle_time() to use ide_timing_find_mode()

- cleanup ide_pio_cycle_time() a bit

There should be no functional changes caused by this patch.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:39 +02:00
Bartlomiej Zolnierkiewicz
f06ab3402a ide: convert ide-timing.h to ide-timings.c library (take 2)
* Don't include ide-timing.h in cs5535 and sis5513 host drivers
  (they don't need it currently).

* Convert ide-timing.h to ide-timings.c library and add CONFIG_IDE_TIMINGS
  config option to be selected by host drivers using the library.

While at it:

- fix ide_timing_find_mode() placement

v2:
* Add missing EXPORT_SYMBOLs. (Stephen Rothwell <sfr@canb.auug.org.au>)

There should be no functional changes caused by this patch.

Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:37 +02:00
Bartlomiej Zolnierkiewicz
3be53f3f21 ide: move some bits from ide-timing.h to <linux/ide.h>
Move struct ide_timing and IDE_TIMING_* defines to <linux/ide.h>
from drivers/ide/ide-timing.h.

While at it:

- use u8/u16 instead of short for struct ide_timing fields

- use enum for IDE_TIMING_*

There should be no functional changes caused by this patch.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-07-16 20:33:36 +02:00
Ingo Molnar
4bb689eee1 x86: paravirt spinlocks, !CONFIG_SMP build fixes
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:15:53 +02:00
Jeremy Fitzhardinge
2d9e1e2f58 xen: implement Xen-specific spinlocks
The standard ticket spinlocks are very expensive in a virtual
environment, because their performance depends on Xen's scheduler
giving vcpus time in the order that they're supposed to take the
spinlock.

This implements a Xen-specific spinlock, which should be much more
efficient.

The fast-path is essentially the old Linux-x86 locks, using a single
lock byte.  The locker decrements the byte; if the result is 0, then
they have the lock.  If the lock is negative, then the locker must spin
until the lock is positive again.

When there's contention, the locker spins for 2^16[*] iterations waiting
to get the lock.  If it fails to get the lock in that time, it adds
itself to the contention count in the lock and blocks on a per-cpu
event channel.

When unlocking the spinlock, the locker looks to see if there's anyone
blocked waiting for the lock by checking for a non-zero waiter count.
If there's a waiter, it traverses the per-cpu "lock_spinners"
variable, which contains which lock each CPU is waiting on.  It picks
one CPU waiting on the lock and sends it an event to wake it up.

This allows efficient fast-path spinlock operation, while allowing
spinning vcpus to give up their processor time while waiting for a
contended lock.

[*] 2^16 iterations is the threshold at which 98% of locks have been
taken, according to Thomas Friebel's Xen Summit talk "Preventing Guests
from Spinning Around".  Therefore, we'd expect the lock and unlock slow
paths to be entered only 2% of the time.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:15:53 +02:00
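
A much simplified, user-space sketch of the spin-then-block scheme described in 2d9e1e2f58 above, using C11 atomics. It is not the Xen code: it uses a 0/1 lock byte rather than the decrement scheme, the block_fn callback stands in for waiting on a per-cpu event channel, and it ignores the lost-wakeup race (an unlock arriving between the last spin and the block) that a real implementation has to handle. demo_block() exists only so the slow path can be exercised in a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_THRESHOLD (1u << 16)

struct xlock {
    atomic_uchar lock;      /* 0 = free, 1 = held */
    atomic_uint spinners;   /* waiters that gave up spinning */
};

static bool xlock_try(struct xlock *l)
{
    return atomic_exchange(&l->lock, 1) == 0;
}

/* Spin up to the threshold, then register as a waiter and block. */
static void xlock_acquire(struct xlock *l, void (*block_fn)(struct xlock *))
{
    for (;;) {
        for (unsigned int i = 0; i < SPIN_THRESHOLD; i++)
            if (xlock_try(l))
                return;
        atomic_fetch_add(&l->spinners, 1);
        block_fn(l);        /* stand-in for blocking on an event channel */
        atomic_fetch_sub(&l->spinners, 1);
    }
}

/* Release the lock; returns true when a blocked waiter should be woken. */
static bool xlock_release(struct xlock *l)
{
    atomic_store(&l->lock, 0);
    return atomic_load(&l->spinners) != 0;
}

/* Demo stand-in: pretend the owner released the lock while we blocked. */
static unsigned int demo_blocks;
static bool demo_saw_waiter;

static void demo_block(struct xlock *l)
{
    demo_blocks++;
    demo_saw_waiter = xlock_release(l);
}
```
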
Jeremy Fitzhardinge
8efcbab674 paravirt: introduce a "lock-byte" spinlock implementation
Implement a version of the old spinlock algorithm, in which everyone
spins waiting for a lock byte.  In order to be compatible with the
ticket-lock's use of a zero initializer, this uses the convention of
'0' for unlocked and '1' for locked.

This algorithm is much better than ticket locks in a virtual
environment, because it doesn't interact badly with the vcpu scheduler.
If there are multiple vcpus spinning on a lock and the lock is
released, the next vcpu to be scheduled will take the lock, rather
than cycling around until the next ticketed vcpu gets it.

To use this, you must call paravirt_use_bytelocks() very early, before
any spinlocks have been taken.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:15:53 +02:00
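
The lock-byte convention from 8efcbab674 above ('0' for unlocked, '1' for locked) can be sketched with C11 atomics. This is a generic test-and-set lock written for illustration, with hypothetical names; it is not the kernel's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct byte_lock {
    atomic_uchar locked;    /* 0 = unlocked, 1 = locked */
};

/* Try once: succeeds only if the byte was 0 before we set it to 1. */
static bool byte_trylock(struct byte_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Spin until the lock is taken; whoever runs next can grab a free lock. */
static void byte_lock_acquire(struct byte_lock *l)
{
    while (!byte_trylock(l)) {
        /* Wait with plain loads until the byte looks free again. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

static void byte_unlock(struct byte_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Because the unlocked state is all zeroes, a zero-initialized struct byte_lock starts out unlocked, which matches the compatibility point made in the commit message.
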
Jeremy Fitzhardinge
74d4affde8 x86/paravirt: add hooks for spinlock operations
Ticket spinlocks have absolutely ghastly worst-case performance
characteristics in a virtual environment.  If there is any contention
for physical CPUs (ie, there are more runnable vcpus than cpus), then
ticket locks can cause the system to end up spending 90+% of its time
spinning.

The problem is that (v)cpus waiting on a ticket spinlock will be
granted access to the lock in strict order they got their tickets.  If
the hypervisor scheduler doesn't give the vcpus time in that order,
they will burn timeslices waiting for the scheduler to give the right
vcpu some time.  In the worst case it could take O(n^2) vcpu scheduler
timeslices for everyone waiting on the lock to get it, not counting
new cpus trying to take the lock while the log-jam is sorted out.

These hooks allow a paravirt backend to replace the spinlock
implementation.

At the very least, this could revert the implementation back to the
old lock algorithm, which allows the next scheduled vcpu to take the
lock, and has reasonably good performance.

It also allows the spinlocks to take advantage of hypervisor
features to make locks more efficient (spin and block, for example).

The cost to native execution is an extra direct call when using a
spinlock function.  There's no overhead if CONFIG_PARAVIRT is turned
off.

The lock structure is fixed at a single "unsigned int", initialized to
zero, but the spinlock implementation can use it as it wishes.

Thanks to Thomas Friebel's Xen Summit talk "Preventing Guests from
Spinning Around" for pointing out this problem.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:15:52 +02:00
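
The hook mechanism from 74d4affde8 above can be pictured as a table of function pointers that a backend may overwrite. The struct layout and all names below are assumptions for the sketch and do not reproduce the kernel's paravirt structures; the lock itself is a single unsigned int initialized to zero, as the commit message describes.

```c
#include <stdatomic.h>

struct raw_lock {
    atomic_uint slock;      /* one unsigned int, zero when initialized */
};

struct lock_ops {
    int (*trylock)(struct raw_lock *lock);
    void (*unlock)(struct raw_lock *lock);
};

/* Default implementation: a simple test-and-set lock. */
static int default_trylock(struct raw_lock *lock)
{
    return atomic_exchange(&lock->slock, 1u) == 0u;
}

static void default_unlock(struct raw_lock *lock)
{
    atomic_store(&lock->slock, 0u);
}

/* The active hooks; a backend replaces entries early during startup. */
static struct lock_ops active_ops = { default_trylock, default_unlock };

/* Example backend hook: counts unlocks, then defers to the default. */
static unsigned int backend_unlocks;

static void backend_unlock(struct raw_lock *lock)
{
    backend_unlocks++;
    default_unlock(lock);
}

/* Callers go through the table, which costs one indirect call here. */
static int lock_try(struct raw_lock *lock)
{
    return active_ops.trylock(lock);
}

static void lock_release(struct raw_lock *lock)
{
    active_ops.unlock(lock);
}
```
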
Jeremy Fitzhardinge
6a52e4b1cd x86_64: further cleanup of 32-bit compat syscall mechanisms
AMD only supports "syscall" from 32-bit compat usermode.
Intel and Centaur(?) only support "sysenter" from 32-bit compat usermode.

Set the X86 feature bits accordingly, and set up the vdso in
accordance with those bits.  On the off chance we run in a 64-bit
environment which supports neither syscall nor sysenter from 32-bit
mode, then fall back to the int $0x80 vdso.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2008-07-16 11:08:27 +02:00
Ingo Molnar
9c8a442044 xen64: fix !HVC_XEN build dependency
fix:

arch/x86/xen/built-in.o: In function `set_page_prot':
enlighten.c:(.text+0x111d): undefined reference to `xen_raw_printk'
arch/x86/xen/built-in.o: In function `xen_start_kernel':
: undefined reference to `xen_raw_console_write'
arch/x86/xen/built-in.o: In function `xen_start_kernel':
: undefined reference to `xen_raw_console_write'

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:06:48 +02:00
Jeremy Fitzhardinge
c24481e9da xen64: save lots of registers
The Xen hypercall interface is allowed to trash any or all of the
argument registers, so we need to be careful that the kernel state
isn't damaged.  On 32-bit kernels, the hypercall parameter registers
are the same as for a regparm function call, so we've got away without
explicit clobbering so far.  The 64-bit ABI defines lots of caller-save
registers, so save them all for safety.  We can trim this set later by
re-distributing the responsibility for saving all these registers.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:05:23 +02:00
Jeremy Fitzhardinge
c05f1cfaba xen64: implement 64-bit update_descriptor
64-bit hypercall interface can pass a maddr in one argument rather
than splitting it.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:05:09 +02:00
Eduardo Habkost
45eb0d8898 Xen64: HYPERVISOR_set_segment_base() implementation
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:03:31 +02:00
Jeremy Fitzhardinge
88459d4c7e xen64: register callbacks in arch-independent way
Use callback_op hypercall to register callbacks in a 32/64-bit
independent way (64-bit doesn't need a code segment, but that detail
is hidden in XEN_CALLBACK).

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:03:01 +02:00
Jeremy Fitzhardinge
ce803e705f xen64: use arbitrary_virt_to_machine for xen_set_pmd
When building initial pagetables in 64-bit kernel the pud/pmd pointer may
be in ioremap/fixmap space, so we need to walk the pagetable to look up the
physical address.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:01:17 +02:00
Jeremy Fitzhardinge
084a2a4e76 xen64: early mapping setup
Set up the initial pagetables to map the kernel mapping into the
physical mapping space.  This makes __va() usable, since it requires
physical mappings.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 11:00:07 +02:00
Jeremy Fitzhardinge
5b09b2876e x86_64: add workaround for no %gs-based percpu
As a stopgap until Mike Travis's x86-64 gs-based percpu patches are
ready, provide workaround functions for x86_read/write_percpu for
Xen's use.

Specifically, this means that we can't really make use of vcpu
placement, because we can't use a single gs-based memory access to get
to vcpu fields.  So disable all that for now.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-16 10:58:13 +02:00