Avi pointed out a while ago that those MSRs falls into the pentium
PMU range. So the idea here is to add new ones, and after a while,
deprecate the old ones.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In recent stress tests, it was found that pvclock-based systems
could seriously warp in smp systems. Using ingo's time-warp-test.c,
I could trigger a scenario as bad as 1.5mi warps a minute in some systems.
(to be fair, it wasn't that bad in most of them). Investigating further, I
found out that such warps were caused by the very offset-based calculation
pvclock is based on.
This happens even on some machines that report constant_tsc in its tsc flags,
specially on multi-socket ones.
Two reads of the same kernel timestamp at approx the same time, will likely
have tsc timestamped in different occasions too. This means the delta we
calculate is unpredictable at best, and can probably be smaller in a cpu
that is legitimately reading clock in a forward ocasion.
Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.
A while ago, I though about using a shared variable anyway, to hold clock
last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.
We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detected that our
clock timestamp is less than the current global, we know that we need to
return a higher one, even though it is not exactly the one we compared to.
OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new readings. This do causes contention on big
boxes (but big here means *BIG*), but it seems like a good trade off, since
it provide us with a time source guaranteed to be stable wrt time warps.
After this patch is applied, I don't see a single warp in time during 5 days
of execution, in any of the machines I saw them before.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Avi Kivity <avi@redhat.com>
CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch removes one padding byte and transform it into a flags
field. New versions of guests using pvclock will query these flags
upon each read.
Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
are usually devised via HV negotiation.
Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch fixes a bug in the KVM efer-msr write path. If a
guest writes to a reserved efer bit the set_efer function
injects the #GP directly. The architecture dependent wrmsr
function does not see this, assumes success and advances the
rip. This results in a #GP in the guest with the wrong rip.
This patch fixes this by reporting efer write errors back to
the architectural wrmsr function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch disables the possibility for a l2-guest to do a
VMMCALL directly into the host. This would happen if the
l1-hypervisor doesn't intercept VMMCALL and the l2-guest
executes this instruction.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The patch merged recently which allowed to mark an exception
as reinjected has a bug as it always marks the exception as
reinjected. This breaks nested-svm shadow-on-shadow
implementation.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Wallclock writing uses an unprotected global variable to hold the version;
this can cause one guest to interfere with another if both write their
wallclock at the same time.
Acked-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
On svm, kvm_read_pdptr() may require reading guest memory, which can sleep.
Push the spinlock into mmu_alloc_roots(), and only take it after we've read
the pdptr.
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Per document, for feature control MSR:
Bit 1 enables VMXON in SMX operation. If the bit is clear, execution
of VMXON in SMX operation causes a general-protection exception.
Bit 2 enables VMXON outside SMX operation. If the bit is clear, execution
of VMXON outside SMX operation causes a general-protection exception.
This patch is to enable this kind of check with SMX for VMXON in KVM.
Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The recent changes to emulate string instructions without entering guest
mode exposed a bug where pending interrupts are not properly reflected
in ready_for_interrupt_injection.
The result is that userspace overwrites a previously queued interrupt,
when irqchip's are emulated in userspace.
Fix by always updating state before returning to userspace.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When EPT is enabled, we cannot emulate EFER.NX=0 through the shadow page
tables. This causes accesses through ptes with bit 63 set to succeed instead
of failing a reserved bit check.
Signed-off-by: Avi Kivity <avi@redhat.com>
Some guest msr values cannot be used on the host (for example. EFER.NX=0),
so we need to switch them atomically during guest entry or exit.
Add a facility to program the vmx msr autoload registers accordingly.
Signed-off-by: Avi Kivity <avi@redhat.com>
vmx and svm vcpus have different contents and therefore may have different
alignmment requirements. Let each specify its required alignment.
Signed-off-by: Avi Kivity <avi@redhat.com>
Move unsync/sync tracepoints to the proper place, it's good
for us to obtain unsync page live time
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_mmu_remove_one_alloc_mmu_page() assumes kvm_mmu_zap_page() only reclaims
only one sp, but that's not the case. This will cause mmu shrinker returns
a wrong number. This patch fix the counting error.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
For TDP mode, avoid creating multiple page table roots for the single
guest-to-host physical address map by fixing the inputs used for the
shadow page table hash in mmu_alloc_roots().
Signed-off-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
WARN() is used in some places to report firmware or hardware bugs that
are then worked-around. These bugs do not affect the stability of the
kernel and should not set the flag for TAINT_WARN. To allow for this,
add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint number
as argument.
Architectures that implement warnings using trap instructions instead
of calls to warn_slowpath_*() now implement __WARN_TAINT(taint)
instead of __WARN().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Helge Deller <deller@gmx.de>
Tested-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Because SLOB fancies being different, the default minimum alignment is
only "unsigned long" instead of SLAB/SLUB where the default is
"unsigned long long"
The inconsistency makes no sense and is asking for trouble, but define
ARCH_SLAB_MINALIGN to get it right in all cases even after they fix
the inconsistency.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch eliminates the node pointer from struct of_device and the
of_node (or prom_node) pointer from struct dev_archdata since the node
pointer is now part of struct device proper when CONFIG_OF is set, and
all users of the old pointer locations have already been converted over
to use device->of_node.
Also remove dev_archdata_{get,set}_node() as it is no longer used by
anything.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
The following structure elements duplicate the information in
'struct device.of_node' and so are being eliminated. This patch
makes all readers of these elements use device.of_node instead.
(struct of_device *)->node
(struct dev_archdata *)->prom_node (sparc)
(struct dev_archdata *)->of_node (powerpc & microblaze)
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Both ACPI and SFI firmwares will have MCFG space, but the error message
isn't valid on SFI, so don't print the message in that case.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
One entry for this in arch/ia64/Kconfig should be enough.
Reported-by: Christoph Egger <siccegge@cs.fau.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
As explained in commit 1c0fe6e3bd, we want to call the architecture independent
oom killer when getting an unexplained OOM from handle_mm_fault, rather than
simply killing current.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There is a macro called dev_info that prints struct device specific
information. Having variables with the same name can be confusing and
prevents conversion of the macro to a function.
Rename the existing dev_info variables to something else in preparation
to converting the dev_info macro to a function.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pointless to use #ifdef CONFIG_NUMA in code that is
already inside another #ifdef CONFIG_NUMA.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
* 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, pat: Update the page flags for memtype atomically instead of using memtype_lock
x86, pat: In rbt_memtype_check_insert(), update new->type only if valid
x86, pat: Migrate to rbtree only backend for pat memtype management
x86, pat: Preparatory changes in pat.c for bigger rbtree change
rbtree: Add support for augmented rbtrees
* 'core-hweight-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hweight: Use a 32-bit popcnt for __arch_hweight32()
arch, hweight: Fix compilation errors
x86: Add optimized popcnt variants
bitops: Optimize hweight() by making use of compile-time evaluation
* 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, acpi/irq: Define gsi_end when X86_IO_APIC is undefined
x86, irq: Kill io_apic_renumber_irq
x86, acpi/irq: Handle isa irqs that are not identity mapped to gsi's.
x86, ioapic: Simplify probe_nr_irqs_gsi.
x86, ioapic: Optimize pin_2_irq
x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.
x86, ioapic: In mpparse use mp_register_ioapic
x86, ioapic: Teach mp_register_ioapic to compute a global gsi_end
x86, ioapic: Fix the types of gsi values
x86, ioapic: Fix io_apic_redir_entries to return the number of entries.
x86, ioapic: Only export mp_find_ioapic and mp_find_ioapic_pin in io_apic.h
x86, acpi/irq: Generalize mp_config_acpi_legacy_irqs
x86, acpi/irq: Fix acpi_sci_ioapic_setup so it has both bus_irq and gsi
x86, acpi/irq: pci device dev->irq is an isa irq not a gsi
x86, acpi/irq: Teach acpi_get_override_irq to take a gsi not an isa_irq
x86, acpi/irq: Introduce apci_isa_irq_to_gsi
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, fpu: Use static_cpu_has() to implement use_xsave()
x86: Add new static_cpu_has() function using alternatives
x86, fpu: Use the proper asm constraint in use_xsave()
x86, fpu: Unbreak FPU emulation
x86: Introduce 'struct fpu' and related API
x86: Eliminate TS_XSAVE
x86-32: Don't set ignore_fpu_irq in simd exception
x86: Merge kernel_math_error() into math_error()
x86: Merge simd_math_error() into math_error()
x86-32: Rework cache flush denied handler
Fix trivial conflict in arch/x86/kernel/process.c
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hypervisor: add missing <linux/module.h>
Modify the VMware balloon driver for the new x86_hyper API
x86, hypervisor: Export the x86_hyper* symbols
x86: Clean up the hypervisor layer
x86, HyperV: fix up the license to mshyperv.c
x86: Detect running on a Microsoft HyperV system
x86, cpu: Make APERF/MPERF a normal table-driven flag
x86, k8: Fix build error when K8_NB is disabled
x86, cacheinfo: Disable index in all four subcaches
x86, cacheinfo: Make L3 cache info per node
x86, cacheinfo: Reorganize AMD L3 cache structure
x86, cacheinfo: Turn off L3 cache index disable feature in virtualized environments
x86, cacheinfo: Unify AMD L3 cache index disable checking
cpufreq: Unify sysfs attribute definition macros
powernow-k8: Fix frequency reporting
x86, cpufreq: Add APERF/MPERF support for AMD processors
x86: Unify APERF/MPERF support
powernow-k8: Add core performance boost support
x86, cpu: Add AMD core boosting feature flag to /proc/cpuinfo
Fix up trivial conflicts in arch/x86/kernel/cpu/intel_cacheinfo.c and
drivers/cpufreq/cpufreq_ondemand.c