The nested_ept_enabled flag introduced in commit 7ca29de213 was not
computed correctly. We are interested only in L1's EPT state, not the
the combined L0+L1 value.
In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must
be false to make sure that PDPSTRs are loaded based on CR3 as usual,
because the special case described in 26.3.2.4 Loading Page-Directory-
Pointer-Table Entries does not apply.
Fixes: 7ca29de213 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT")
Cc: qemu-stable@nongnu.org
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This can be reproduced by running L2 on L1, and disable VPID on L0
if w/o commit "KVM: nVMX: Fix nested VPID vmx exec control", the L2
crash as below:
KVM: entry failed, hardware error 0x7
EAX=00000000 EBX=00000000 ECX=00000000 EDX=000306c3
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 00000000 0000ffff
IDT= 00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Reference SDM 30.3 INVVPID:
Protected Mode Exceptions
- #UD
- If not in VMX operation.
- If the logical processor does not support VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=0).
- If the logical processor supports VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=1) but does
not support the INVVPID instruction (IA32_VMX_EPT_VPID_CAP[32]=0).
So we should check both VPID enable bit in vmx exec control and INVVPID support bit
in vmx capability MSRs to enable VPID. This patch adds the guarantee to not enable
VPID if either INVVPID or single-context/all-context invalidation is not exposed in
vmx capability MSRs.
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This can be reproduced by running kvm-unit-tests/vmx.flat on L0 w/ vpid disabled.
Test suite: VPID
Unhandled exception 6 #UD at ip 00000000004051a6
error_code=0000 rflags=00010047 cs=00000008
rax=0000000000000000 rcx=0000000000000001 rdx=0000000000000047 rbx=0000000000402f79
rbp=0000000000456240 rsi=0000000000000001 rdi=0000000000000000
r8=000000000000000a r9=00000000000003f8 r10=0000000080010011 r11=0000000000000000
r12=0000000000000003 r13=0000000000000708 r14=0000000000000000 r15=0000000000000000
cr0=0000000080010031 cr2=0000000000000000 cr3=0000000007fff000 cr4=0000000000002020
cr8=0000000000000000
STACK: @4051a6 40523e 400f7f 402059 40028f
We should hide and forbid VPID in L1 if it is disabled on L0. However, nested VPID
enable bit is set unconditionally during setup nested vmx exec controls though VPID
is not exposed through nested VMX capablity. This patch fixes it by don't set nested
VPID enable bit if it is disabled on L0.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 5c614b3583 (KVM: nVMX: nested VPID emulation)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After async pf setup successfully, there is a broadcast wakeup w/ special
token 0xffffffff which tells vCPU that it should wake up all processes
waiting for APFs though there is no real process waiting at the moment.
The async page present tracepoint print prematurely and fails to catch the
special token setup. This patch fixes it by moving the async page present
tracepoint after the special token setup.
Before patch:
qemu-system-x86-8499 [006] ...1 5973.473292: kvm_async_pf_ready: token 0x0 gva 0x0
After patch:
qemu-system-x86-8499 [006] ...1 5973.473292: kvm_async_pf_ready: token 0xffffffff gva 0x0
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Quoting from the Intel SDM, volume 3, section 28.3.3.4: Guidelines for
Use of the INVEPT Instruction:
If EPT was in use on a logical processor at one time with EPTP X, it
is recommended that software use the INVEPT instruction with the
"single-context" INVEPT type and with EPTP X in the INVEPT descriptor
before a VM entry on the same logical processor that enables EPT with
EPTP X and either (a) the "virtualize APIC accesses" VM-execution
control was changed from 0 to 1; or (b) the value of the APIC-access
address was changed.
In the nested case, the burden falls on L1, unless L0 enables EPT in
vmcs02 when L1 doesn't enable EPT in vmcs12.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
We have specific destructors for pic/ioapic, we'd better use them when
destroying the VM as well.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Mostly used for split irqchip mode. In that case, these two things are
not inited at all, so no need to release.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm mmu is reset once successfully loading CR3 as part of emulating vmentry
in nested_vmx_load_cr3(). We should not reset kvm mmu twice.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
If avic is not enabled, avic_vm_init() does nothing and returns early.
However, avic_vm_destroy() still tries to destroy what hasn't been created.
The only bad consequence of this now is that avic_vm_destroy() uses
svm_vm_data_hash_lock that hasn't been initialized (and is not meant
to be used at all if avic is not enabled).
Return early from avic_vm_destroy() if avic is not enabled.
It has nothing to destroy.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Pull more powerpc fixes from Michael Ellerman:
"A couple of minor powerpc fixes for 4.11:
- wire up statx() syscall
- don't print a warning on memory hotplug when HPT resizing isn't
available
Thanks to: David Gibson, Chandan Rajendra"
* tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries: Don't give a warning when HPT resizing isn't available
powerpc: Wire up statx() syscall
Pull parisc fixes from Helge Deller:
- Mikulas Patocka added support for R_PARISC_SECREL32 relocations in
modules with CONFIG_MODVERSIONS.
- Dave Anglin optimized the cache flushing for vmap ranges.
- Arvind Yadav provided a fix for a potential NULL pointer dereference
in the parisc perf code (and some code cleanups).
- I wired up the new statx system call, fixed some compiler warnings
with the access_ok() macro and fixed shutdown code to really halt a
system at shutdown instead of crashing & rebooting.
* 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix system shutdown halt
parisc: perf: Fix potential NULL pointer dereference
parisc: Avoid compiler warnings with access_ok()
parisc: Wire up statx system call
parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
parisc: support R_PARISC_SECREL32 relocation in modules
Pull OpenRISC fixes from Stafford Horne:
"OpenRISC fixes for build issues that were exposed by kbuild robots
after 4.11 merge. All from allmodconfig builds. This includes:
- bug in the handling of 8-byte get_user() calls
- module build failure due to multile missing symbol exports"
* tag 'openrisc-for-linus' of git://github.com/openrisc/linux:
openrisc: Export symbols needed by modules
openrisc: fix issue handling 8 byte get_user calls
openrisc: xchg: fix `computed is not used` warning
On those parisc machines which don't provide a software power off
function, the system currently kills the init process at the end of a
shutdown and unexpectedly restarts insteads of halting.
Fix it by adding a loop which will not return.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # 4.9+
Pull x86 fixes from Thomas Gleixner:
"An assorted pile of fixes along with some hardware enablement:
- a fix for a KASAN / branch profiling related boot failure
- some more fallout of the PUD rework
- a fix for the Always Running Timer which is not initialized when
the TSC frequency is known at boot time (via MSR/CPUID)
- a resource leak fix for the RDT filesystem
- another unwinder corner case fixup
- removal of the warning for duplicate NMI handlers because there are
legitimate cases where more than one handler can be registered at
the last level
- make a function static - found by sparse
- a set of updates for the Intel MID platform which got delayed due
to merge ordering constraints. It's hardware enablement for a non
mainstream platform, so there is no risk"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mpx: Make unnecessarily global function static
x86/intel_rdt: Put group node in rdtgroup_kn_unlock
x86/unwind: Fix last frame check for aligned function stacks
mm, x86: Fix native_pud_clear build error
x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=y
x86/platform/intel-mid: Add power button support for Merrifield
x86/platform/intel-mid: Use common power off sequence
x86/platform: Remove warning message for duplicate NMI handlers
x86/tsc: Fix ART for TSC_KNOWN_FREQ
x86/platform/intel-mid: Correct MSI IRQ line for watchdog device
Pull x86 acpi fixes from Thomas Gleixner:
"This update deals with the fallout of the recent work to make
cpuid/node mappings persistent.
It turned out that the boot time ACPI based mapping tripped over ACPI
inconsistencies and caused regressions. It's partially reverted and
the fragile part replaced by an implementation which makes the mapping
persistent when a CPU goes online for the first time"
* 'x86-acpi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
acpi/processor: Check for duplicate processor ids at hotplug time
acpi/processor: Implement DEVICE operator for processor enumeration
x86/acpi: Restore the order of CPU IDs
Revert"x86/acpi: Enable MADT APIs to return disabled apicids"
Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"
Pull perf fixes from Thomas Gleixner:
"A set of perf related fixes:
- fix a CR4.PCE propagation issue caused by usage of mm instead of
active_mm and therefore propagated the wrong value.
- perf core fixes, which plug a use-after-free issue and make the
event inheritance on fork more robust.
- a tooling fix for symbol handling"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf symbols: Fix symbols__fixup_end heuristic for corner cases
x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
perf/core: Better explain the inherit magic
perf/core: Simplify perf_event_free_task()
perf/core: Fix event inheritance on fork()
perf/core: Fix use-after-free in perf_release()
Pull ARM fix from Russell King:
"Just one change to add the statx syscall this time around"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: wire up statx syscall
As of commit 438cc81a41 ("powerpc/pseries: Automatically resize HPT
for memory hot add/remove"), when running on the pseries platform, we
always attempt to use the PAPR extension to resize the hashed page
table (HPT) when we add or remove memory.
This is fine, but when the extension is not available we'll give a
harmless, but scary warning. Instead check if the firmware supports HPT
resizing before populating the mmu_hash_ops.resize_hpt pointer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge fixes from Andrew Morton:
"6 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
drivers core: remove assert_held_device_hotplug()
mm: add private lock to serialize memory hotplug operations
mm: don't warn when vmalloc() fails due to a fatal signal
mm, x86: fix native_pud_clear build error
kasan: add a prototype of task_struct to avoid warning
z3fold: fix spinlock unlocking in page reclaim
Pull arm64 fixes/cleanups from Catalin Marinas:
"In Will's absence I'm sending the arm64 fixes he queued for 4.11-rc3:
- fix arm64 kernel boot warning when DEBUG_VIRTUAL and KASAN are
enabled
- enable KEYS_COMPAT for keyctl compat support
- use cpus_have_const_cap() for system_uses_ttbr0_pan() (slight
performance improvement)
- update kerneldoc for cpu_suspend() rename
- remove the arm64-specific kprobe_exceptions_notify (weak generic
variant defined)"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: kernel: Update kerneldoc for cpu_suspend() rename
arm64: use const cap for system_uses_ttbr0_pan()
arm64: support keyctl() system call in 32-bit mode
arm64: kasan: avoid bad virt_to_pfn()
arm64: kprobes: remove kprobe_exceptions_notify
Test runs on a ppc64 BE guest succeeded. linux/samples/statx/test-statx
program was executed on the following file types,
1. Regular file
2. Directory
3. device file
4. symlink
5. Named pipe
The test run also included invoking test-statx with the runtime options
provided in the main() function of test-statx.c
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 09b871ffd4 (parisc: Define access_ok() as macro) missed to mark uaddr
as used, which then gives compiler warnings about unused variables.
Fix it by comparing uaddr to uaddr which then gets optimized away by the
compiler.
Signed-off-by: Helge Deller <deller@gmx.de>
Fixes: 09b871ffd4 ("parisc: Define access_ok() as macro")
The previously submitted patch did not resolve the random segmentation
faults observed on the phantom buildd system. There are still
unresolved problems with the Debian 4.8 and 4.9 kernels on C8000.
The attached patch removes the flush of the offset map pages and does a
whole data cache flush for large ranges. No other arch flushes the
offset map in these routines as far as I can tell.
I have not observed any random segmentation faults on rp3440 in two
weeks of testing with 4.10.0 and 4.10.1.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Helge Deller <deller@gmx.de>
The parisc kernel doesn't work with CONFIG_MODVERSIONS since the commit
71810db27c. It can't load modules with the
error: "module unix: Unknown relocation: 41".
The commit changes __kcrctab from 64-bit valus to 32-bit values. The
assembler generates R_PARISC_SECREL32 secrel relocation for them and the
module loader doesn't support this relocation.
This patch adds the R_PARISC_SECREL32 relocation to the module loader.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Helge Deller <deller@gmx.de>
Pull crypto fixes from Herbert Xu:
- self-test failure of crc32c on powerpc
- regressions of ecb(aes) when used with xts/lrw in s5p-sss
- a number of bugs in the omap RNG driver
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: s5p-sss - Fix spinlock recursion on LRW(AES)
hwrng: omap - Do not access INTMASK_REG on EIP76
hwrng: omap - use devm_clk_get() instead of of_clk_get()
hwrng: omap - write registers after enabling the clock
crypto: s5p-sss - Fix completing crypto request in IRQ handler
crypto: powerpc - Fix initialisation of crc32c context
Was getting the following error with allmodconfig:
ERROR: "__get_user_bad" [lib/test_user_copy.ko] undefined!
This was simply a missing break statement, causing an unwanted fall
through.
Signed-off-by: Stafford Horne <shorne@gmail.com>
When building allmodconfig this warning shows.
fs/ocfs2/file.c: In function 'ocfs2_file_write_iter':
./arch/openrisc/include/asm/cmpxchg.h:81:3: warning: value computed is
not used [-Wunused-value]
((typeof(*(ptr)))__xchg((unsigned long)(with), (ptr), sizeof(*(ptr))))
^
Applying the same patch logic that was done to the cmpxchg macro.
Signed-off-by: Stafford Horne <shorne@gmail.com>
The rdtgroup_kn_unlock waits for the last user to release and put its
node. But it's calling kernfs_put on the node which calls the
rdtgroup_kn_unlock, which might not be the group's directory node, but
another group's file node.
This race could be easily reproduced by running 2 instances
of following script:
mount -t resctrl resctrl /sys/fs/resctrl/
pushd /sys/fs/resctrl/
mkdir krava
echo "krava" > krava/schemata
rmdir krava
popd
umount /sys/fs/resctrl
It triggers the slub debug error message with following command
line config: slub_debug=,kernfs_node_cache.
Call kernfs_put on the group's node to fix it.
Fixes: 60cf5e101f ("x86/intel_rdt: Add mkdir to resctrl file system")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Shaohua Li <shli@fb.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1489501253-20248-1-git-send-email-jolsa@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pavel Machek reported the following warning on x86-32:
WARNING: kernel stack frame pointer at f50cdf98 in swapper/2:0 has bad value (null)
The warning is caused by the unwinder not realizing that it reached the
end of the stack, due to an unusual prologue which gcc sometimes
generates for aligned stacks. The prologue is based on a gcc feature
called the Dynamic Realign Argument Pointer (DRAP). It's almost always
enabled for aligned stacks when -maccumulate-outgoing-args isn't set.
This issue is similar to the one fixed by the following commit:
8023e0e2a4 ("x86/unwind: Adjust last frame check for aligned function stacks")
... but that fix was specific to x86-64.
Make the fix more generic to cover x86-32 as well, and also ensure that
the return address referred to by the frame pointer is a copy of the
original return address.
Fixes: acb4608ad1 ("x86/unwind: Create stack frames for saved syscall registers")
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/50d4924db716c264b14f1633037385ec80bf89d2.1489465609.git.jpoimboe@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We still get a build error in random configurations, after this has been
modified a few times:
In file included from include/linux/mm.h:68:0,
from include/linux/suspend.h:8,
from arch/x86/kernel/asm-offsets.c:12:
arch/x86/include/asm/pgtable.h:66:26: error: redefinition of 'native_pud_clear'
#define pud_clear(pud) native_pud_clear(pud)
My interpretation is that the build error comes from a typo in __PAGETABLE_PUD_FOLDED,
so fix that typo now, and remove the incorrect #ifdef around the native_pud_clear
definition.
Fixes: 3e761a42e1 ("mm, x86: fix HIGHMEM64 && PARAVIRT build config for native_pud_clear()")
Fixes: a00cc7d9dd ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dave Jiang <dave.jiang@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Garnier <thgarnie@google.com>
Link: http://lkml.kernel.org/r/20170314121330.182155-1-arnd@arndb.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull some more powerpc fixes from Michael Ellerman:
"The main item is the addition of the Power9 Machine Check handler.
This was delayed to make sure some details were correct, and is as
minimal as possible.
The rest is small fixes, two for the Power9 PMU, two dealing with
obscure toolchain problems, two for the PowerNV IOMMU code (used by
VFIO), and one to fix a crash on 32-bit machines with macio devices
due to missing dma_ops.
Thanks to:
Alexey Kardashevskiy, Cyril Bur, Larry Finger, Madhavan Srinivasan,
Nicholas Piggin"
* tag 'powerpc-4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: POWER9 machine check handler
powerpc/64s: allow machine check handler to set severity and initiator
powerpc/64s: fix handling of non-synchronous machine checks
powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops
powerpc/powernv/ioda2: Update iommu table base on ownership change
powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
selftests/powerpc: Replace stxvx and lxvx with stxvd2x/lxvd2x
powerpc/perf: Handle sdar_mode for marked event in power9
powerpc/perf: Fix perf_get_data_addr() for power9 DD1
powerpc/boot: Fix zImage TOC alignment
Subhransu reported that convert_art_to_tsc() isn't working for him.
The ART to TSC relation is only set up for systems which use the refined
TSC calibration. Systems with known TSC frequency (available via CPUID 15)
are not using the refined calibration and therefor the ART to TSC relation
is never established.
Add the setup to the known frequency init path which skips ART
calibration. The init code needs to be duplicated as for systems which use
refined calibration the ART setup must be delayed until calibration has
been done.
The problem has been there since the ART support was introdduced, but only
detected now because Subhransu tested the first time on hardware which has
TSC frequency enumerated via CPUID 15.
Note for stable: The conditional has changed from TSC_RELIABLE to
TSC_KNOWN_FREQUENCY.
[ tglx: Rewrote changelog and identified the proper 'Fixes' commit ]
Fixes: f9677e0f83 ("x86/tsc: Always Running Timer (ART) correlated clocksource")
Reported-by: "Prusty, Subhransu S" <subhransu.s.prusty@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Cc: christopher.s.hall@intel.com
Cc: kevin.b.stanton@intel.com
Cc: john.stultz@linaro.org
Cc: akataria@vmware.com
Link: http://lkml.kernel.org/r/20170313145712.GI3312@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull s390 fixes from Martin Schwidefsky:
- four patches to get the new cputime code in shape for s390
- add the new statx system call
- a few bug fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: wire up statx system call
KVM: s390: Fix guest migration for huge guests resulting in panic
s390/ipl: always use load normal for CCW-type re-IPL
s390/timex: micro optimization for tod_to_ns
s390/cputime: provide archicture specific cputime_to_nsecs
s390/cputime: reset all accounting fields on fork
s390/cputime: remove last traces of cputime_t
s390: fix in-kernel program checks
s390/crypt: fix missing unlock in ctr_paes_crypt on error path
Pull x86 fixes from Thomas Gleixner:
- a fix for the kexec/purgatory regression which was introduced in the
merge window via an innocent sparse fix. We could have reverted that
commit, but on deeper inspection it turned out that the whole
machinery is neither documented nor robust. So a proper cleanup was
done instead
- the fix for the TLB flush issue which was discovered recently
- a simple typo fix for a reboot quirk
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tlb: Fix tlb flushing when lguest clears PGE
kexec, x86/purgatory: Unbreak it and clean it up
x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk
Pull irq fixes from Thomas Gleixner:
- a workaround for a GIC erratum
- a missing stub function for CONFIG_IRQDOMAIN=n
- fixes for a couple of type inconsistencies
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/crossbar: Fix incorrect type of register size
irqchip/gicv3-its: Add workaround for QDF2400 ITS erratum 0065
irqdomain: Add empty irq_domain_check_msi_remap
irqchip/crossbar: Fix incorrect type of local variables
Fengguang reported random corruptions from various locations on x86-32
after commits d2852a2240 ("arch: add ARCH_HAS_SET_MEMORY config") and
9d876e79df ("bpf: fix unlocking of jited image when module ronx not set")
that uses the former. While x86-32 doesn't have a JIT like x86_64, the
bpf_prog_lock_ro() and bpf_prog_unlock_ro() got enabled due to
ARCH_HAS_SET_MEMORY, whereas Fengguang's test kernel doesn't have module
support built in and therefore never had the DEBUG_SET_MODULE_RONX setting
enabled.
After investigating the crashes further, it turned out that using
set_memory_ro() and set_memory_rw() didn't have the desired effect, for
example, setting the pages as read-only on x86-32 would still let
probe_kernel_write() succeed without error. This behavior would manifest
itself in situations where the vmalloc'ed buffer was accessed prior to
set_memory_*() such as in case of bpf_prog_alloc(). In cases where it
wasn't, the page attribute changes seemed to have taken effect, leading to
the conclusion that a TLB invalidate didn't happen. Moreover, it turned out
that this issue reproduced with qemu in "-cpu kvm64" mode, but not for
"-cpu host". When the issue occurs, change_page_attr_set_clr() did trigger
a TLB flush as expected via __flush_tlb_all() through cpa_flush_range(),
though.
There are 3 variants for issuing a TLB flush: invpcid_flush_all() (depends
on CPU feature bits X86_FEATURE_INVPCID, X86_FEATURE_PGE), cr4 based flush
(depends on X86_FEATURE_PGE), and cr3 based flush. For "-cpu host" case in
my setup, the flush used invpcid_flush_all() variant, whereas for "-cpu
kvm64", the flush was cr4 based. Switching the kvm64 case to cr3 manually
worked fine, and further investigating the cr4 one turned out that
X86_CR4_PGE bit was not set in cr4 register, meaning the
__native_flush_tlb_global_irq_disabled() wrote cr4 twice with the same
value instead of clearing X86_CR4_PGE in the first write to trigger the
flush.
It turned out that X86_CR4_PGE was cleared from cr4 during init from
lguest_arch_host_init() via adjust_pge(). The X86_FEATURE_PGE bit is also
cleared from there due to concerns of using PGE in guest kernel that can
lead to hard to trace bugs (see bff672e630 ("lguest: documentation V:
Host") in init()). The CPU feature bits are cleared in dynamic
boot_cpu_data, but they never propagated to __flush_tlb_all() as it uses
static_cpu_has() instead of boot_cpu_has() for testing which variant of TLB
flushing to use, meaning they still used the old setting of the host
kernel.
Clearing via setup_clear_cpu_cap(X86_FEATURE_PGE) so this would propagate
to static_cpu_has() checks is too late at this point as sections have been
patched already, so for now, it seems reasonable to switch back to
boot_cpu_has(X86_FEATURE_PGE) as it was prior to commit c109bf9599
("x86/cpufeature: Remove cpu_has_pge"). This lets the TLB flush trigger via
cr3 as originally intended, properly makes the new page attributes visible
and thus fixes the crashes seen by Fengguang.
Fixes: c109bf9599 ("x86/cpufeature: Remove cpu_has_pge")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: bp@suse.de
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: lkp@01.org
Cc: Laura Abbott <labbott@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernrl.org/r/20170301125426.l4nf65rx4wahohyl@wfg-t540p.sh.intel.com
Link: http://lkml.kernel.org/r/25c41ad9eca164be4db9ad84f768965b7eb19d9e.1489191673.git.daniel@iogearbox.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>