The only remaining consumer is new_inode, where it showed up in 2001 as
commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1]
with a changelog which does not mention it.
Since then the line got only touched up to keep compiling.
While it may have been of benefit back in the day, it is guaranteed to
at best not get in the way in the multicore setting -- as the code
performs *a lot* of work between the prefetch and actual lock acquire,
any contention means the cacheline is already invalid by the time the
routine calls spin_lock(). It adds spurious traffic, for short.
On top of it prefetch is notoriously tricky to use for single-threaded
purposes, making it questionable from the get go.
As such, remove it.
I admit upfront I did not see value in benchmarking this change, but I
can do it if that is deemed appropriate.
Removal from new_inode and of the entire thing are in the same patch as
requested by Linus, so whatever weird looks can be directed at that guy.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 fixes from Borislav Petkov:
- Do not parse the confidential computing blob on non-AMD hardware as
it leads to an EFI config table ending up unmapped
- Use the correct segment selector in the 32-bit version of getcpu() in
the vDSO
- Make sure vDSO and VVAR regions are placed in the 47-bit VA range
even on 5-level paging systems
- Add models 0x90-0x91 to the range of AMD Zenbleed-affected CPUs
* tag 'x86_urgent_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405
x86/mm: Fix VDSO and VVAR placement on 5-level paging machines
x86/linkage: Fix typo of BUILD_VDSO in asm/linkage.h
x86/vdso: Choose the right GDT_ENTRY_CPUNODE for 32-bit getcpu() on 64-bit kernel
x86/sev: Do not try to parse for the CC blob on non-AMD hardware
Pull x86 mitigation fixes from Borislav Petkov:
"The first set of fallout fixes after the embargo madness. There will
be another set next week too.
- A first series of cleanups/unifications and documentation
improvements to the SRSO and GDS mitigations code which got
postponed to after the embargo date
- Fix the SRSO aliasing addresses assertion so that the LLVM linker
can parse it too"
* tag 'x86_bugs_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
driver core: cpu: Fix the fallback cpu_show_gds() name
x86: Move gds_ucode_mitigated() declaration to header
x86/speculation: Add cpu_show_gds() prototype
driver core: cpu: Make cpu_show_not_affected() static
x86/srso: Fix build breakage with the LLVM linker
Documentation/srso: Document IBPB aspect and fix formatting
driver core: cpu: Unify redundant silly stubs
Documentation/hw-vuln: Unify filename specification in index
Pull ACPI fixes from Rafael Wysocki:
"Rework the handling of interrupt overrides on AMD Zen-based machines
to avoid recently introduced regressions (Hans de Goede).
Note that this is intended as a short-term mitigation for 6.5 and the
long-term approach will be to attempt to use the configuration left by
the BIOS, but it requires more investigation"
* tag 'acpi-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: resource: Add IRQ override quirk for PCSpecialist Elimina Pro 16 M
ACPI: resource: Honor MADT INT_SRC_OVR settings for IRQ1 on AMD Zen
ACPI: resource: Always use MADT override IRQ settings for all legacy non i8042 IRQs
ACPI: resource: revert "Remove "Zen" specific match and quirks"
Under certain circumstances, an integer division by 0 which faults, can
leave stale quotient data from a previous division operation on Zen1
microarchitectures.
Do a dummy division 0/1 before returning from the #DE exception handler
in order to avoid any leaks of potentially sensitive data.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vDSO getcpu() reads CPU ID from the GDT_ENTRY_CPUNODE entry when the RDPID
instruction is not available. And GDT_ENTRY_CPUNODE is defined as 28 on 32-bit
Linux kernel and 15 on 64-bit. But the 32-bit getcpu() on 64-bit Linux kernel
is compiled with 32-bit Linux kernel GDT_ENTRY_CPUNODE, i.e., 28, beyond the
64-bit Linux kernel GDT limit. Thus, it just fails _silently_.
When BUILD_VDSO32_64 is defined, choose the 64-bit Linux kernel GDT definitions
to compile the 32-bit getcpu().
Fixes: 877cff5296 ("x86/vdso: Fake 32bit VDSO build on 64bit compile for vgetcpu")
Reported-by: kernel test robot <yujie.liu@intel.com>
Reported-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230322061758.10639-1-xin3.li@intel.com
Link: https://lore.kernel.org/oe-lkp/202303020903.b01fd1de-yujie.liu@intel.com
Pull x86/gds fixes from Dave Hansen:
"Mitigate Gather Data Sampling issue:
- Add Base GDS mitigation
- Support GDS_NO under KVM
- Fix a documentation typo"
* tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Fix backwards on/off logic about YMM support
KVM: Add GDS_NO support to KVM
x86/speculation: Add Kconfig option for GDS
x86/speculation: Add force option to GDS mitigation
x86/speculation: Add Gather Data Sampling mitigation
Pull x86/srso fixes from Borislav Petkov:
"Add a mitigation for the speculative RAS (Return Address Stack)
overflow vulnerability on AMD processors.
In short, this is yet another issue where userspace poisons a
microarchitectural structure which can then be used to leak privileged
information through a side channel"
* tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/srso: Tie SBPB bit setting to microcode patch detection
x86/srso: Add a forgotten NOENDBR annotation
x86/srso: Fix return thunks in generated code
x86/srso: Add IBPB on VMEXIT
x86/srso: Add IBPB
x86/srso: Add SRSO_NO support
x86/srso: Add IBPB_BRTYPE support
x86/srso: Add a Speculative RAS Overflow mitigation
x86/bugs: Increase the x86 bugs vector size to two u32s
Pull hyperv fixes from Wei Liu:
- Fix a bug in a python script for Hyper-V (Ani Sinha)
- Workaround a bug in Hyper-V when IBT is enabled (Michael Kelley)
- Fix an issue parsing MP table when Linux runs in VTL2 (Saurabh
Sengar)
- Several cleanup patches (Nischala Yelchuri, Kameron Carr, YueHaibing,
ZhiHu)
* tag 'hyperv-fixes-signed-20230804' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: Remove unused extern declaration vmbus_ontimer()
x86/hyperv: add noop functions to x86_init mpparse functions
vmbus_testing: fix wrong python syntax for integer value comparison
x86/hyperv: fix a warning in mshyperv.h
x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
x86/hyperv: Improve code for referencing hyperv_pcpu_input_arg
Drivers: hv: Change hv_free_hyperv_page() to take void * argument
Pull kvm fixes from Paolo Bonzini:
"x86:
- Do not register IRQ bypass consumer if posted interrupts not
supported
- Fix missed device interrupt due to non-atomic update of IRR
- Use GFP_KERNEL_ACCOUNT for pid_table in ipiv
- Make VMREAD error path play nice with noinstr
- x86: Acquire SRCU read lock when handling fastpath MSR writes
- Support linking rseq tests statically against glibc 2.35+
- Fix reference count for stats file descriptors
- Detect userspace setting invalid CR0
Non-KVM:
- Remove coccinelle script that has caused multiple confusion
("debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE()
usage", acked by Greg)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: selftests: Expand x86's sregs test to cover illegal CR0 values
KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guest
KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid
Revert "debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage"
KVM: selftests: Verify stats fd is usable after VM fd has been closed
KVM: selftests: Verify stats fd can be dup()'d and read
KVM: selftests: Verify userspace can create "redundant" binary stats files
KVM: selftests: Explicitly free vcpus array in binary stats test
KVM: selftests: Clean up stats fd in common stats_test() helper
KVM: selftests: Use pread() to read binary stats header
KVM: Grab a reference to KVM for VM and vCPU stats file descriptors
selftests/rseq: Play nice with binaries statically linked against glibc 2.35+
Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"
KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes
KVM: VMX: Use vmread_error() to report VM-Fail in "goto" path
KVM: VMX: Make VMREAD error path play nice with noinstr
KVM: x86/irq: Conditionally register IRQ bypass consumer again
KVM: X86: Use GFP_KERNEL_ACCOUNT for pid_table in ipiv
KVM: x86: check the kvm_cpu_get_interrupt result before using it
KVM: x86: VMX: set irr_pending in kvm_apic_update_irr
...
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid,
e.g. due to setting bits 63:32, illegal combinations, or to a value that
isn't allowed in VMX (non-)root mode. The VMX checks in particular are
"fun" as failure to disallow Real Mode for an L2 that is configured with
unrestricted guest disabled, when KVM itself has unrestricted guest
enabled, will result in KVM forcing VM86 mode to virtual Real Mode for
L2, but then fail to unwind the related metadata when synthesizing a
nested VM-Exit back to L1 (which has unrestricted guest enabled).
Opportunistically fix a benign typo in the prototype for is_valid_cr4().
Cc: stable@vger.kernel.org
Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the option to flush IBPB only on VMEXIT in order to protect from
malicious guests but one otherwise trusts the software that runs on the
hypervisor.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add the option to mitigate using IBPB on a kernel entry. Pull in the
Retbleed alternative so that the IBPB call from there can be used. Also,
if Retbleed mitigation is done using IBPB, the same mitigation can and
must be used here.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add support for the synthetic CPUID flag which "if this bit is 1,
it indicates that MSR 49h (PRED_CMD) bit 0 (IBPB) flushes all branch
type predictions from the CPU branch predictor."
This flag is there so that this capability in guests can be detected
easily (otherwise one would have to track microcode revisions which is
impossible for guests).
It is also needed only for Zen3 and -4. The other two (Zen1 and -2)
always flush branch type predictions by default.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add a mitigation for the speculative return address stack overflow
vulnerability found on AMD processors.
The mitigation works by ensuring all RET instructions speculate to
a controlled location, similar to how speculation is controlled in the
retpoline sequence. To accomplish this, the __x86_return_thunk forces
the CPU to mispredict every function return using a 'safe return'
sequence.
To ensure the safety of this mitigation, the kernel must ensure that the
safe return sequence is itself free from attacker interference. In Zen3
and Zen4, this is accomplished by creating a BTB alias between the
untraining function srso_untrain_ret_alias() and the safe return
function srso_safe_ret_alias() which results in evicting a potentially
poisoned BTB entry and using that safe one for all function returns.
In older Zen1 and Zen2, this is accomplished using a reinterpretation
technique similar to Retbleed one: srso_untrain_ret() and
srso_safe_ret().
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The following checkpatch warning is removed:
WARNING: Use #include <linux/io.h> instead of <asm/io.h>
Signed-off-by: ZhiHu <huzhi001@208suo.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Gather Data Sampling (GDS) is a hardware vulnerability which allows
unprivileged speculative access to data which was previously stored in
vector registers.
Intel processors that support AVX2 and AVX512 have gather instructions
that fetch non-contiguous data elements from memory. On vulnerable
hardware, when a gather instruction is transiently executed and
encounters a fault, stale data from architectural or internal vector
registers may get transiently stored to the destination vector
register allowing an attacker to infer the stale data using typical
side channel techniques like cache timing attacks.
This mitigation is different from many earlier ones for two reasons.
First, it is enabled by default and a bit must be set to *DISABLE* it.
This is the opposite of normal mitigation polarity. This means GDS can
be mitigated simply by updating microcode and leaving the new control
bit alone.
Second, GDS has a "lock" bit. This lock bit is there because the
mitigation affects the hardware security features KeyLocker and SGX.
It needs to be enabled and *STAY* enabled for these features to be
mitigated against GDS.
The mitigation is enabled in the microcode by default. Disable it by
setting gather_data_sampling=off or by disabling all mitigations with
mitigations=off. The mitigation status can be checked by reading:
/sys/devices/system/cpu/vulnerabilities/gather_data_sampling
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Add a fix for the Zen2 VZEROUPPER data corruption bug where under
certain circumstances executing VZEROUPPER can cause register
corruption or leak data.
The optimal fix is through microcode but in the case the proper
microcode revision has not been applied, enable a fallback fix using
a chicken bit.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of hugepage splitting in the
stage-2 fault path.
- Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact
with services that live in the Secure world. pKVM intervenes on
FF-A calls to guarantee the host doesn't misuse memory donated to
the hyp or a pKVM guest.
- Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
- Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set
configuration from userspace, but the intent is to relax this
limitation and allow userspace to select a feature set consistent
with the CPU.
- Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
- Use a separate set of pointer authentication keys for the
hypervisor when running in protected mode, as the host is untrusted
at runtime.
- Ensure timer IRQs are consistently released in the init failure
paths.
- Avoid trapping CTR_EL0 on systems with Enhanced Virtualization
Traps (FEAT_EVT), as it is a register commonly read from userspace.
- Erratum workaround for the upcoming AmpereOne part, which has
broken hardware A/D state management.
RISC-V:
- Redirect AMO load/store misaligned traps to KVM guest
- Trap-n-emulate AIA in-kernel irqchip for KVM guest
- Svnapot support for KVM Guest
s390:
- New uvdevice secret API
- CMM selftest and fixes
- fix racy access to target CPU for diag 9c
x86:
- Fix missing/incorrect #GP checks on ENCLS
- Use standard mmu_notifier hooks for handling APIC access page
- Drop now unnecessary TR/TSS load after VM-Exit on AMD
- Print more descriptive information about the status of SEV and
SEV-ES during module load
- Add a test for splitting and reconstituting hugepages during and
after dirty logging
- Add support for CPU pinning in demand paging test
- Add support for AMD PerfMonV2, with a variety of cleanups and minor
fixes included along the way
- Add a "nx_huge_pages=never" option to effectively avoid creating NX
hugepage recovery threads (because nx_huge_pages=off can be toggled
at runtime)
- Move handling of PAT out of MTRR code and dedup SVM+VMX code
- Fix output of PIC poll command emulation when there's an interrupt
- Add a maintainer's handbook to document KVM x86 processes,
preferred coding style, testing expectations, etc.
- Misc cleanups, fixes and comments
Generic:
- Miscellaneous bugfixes and cleanups
Selftests:
- Generate dependency files so that partial rebuilds work as
expected"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits)
Documentation/process: Add a maintainer handbook for KVM x86
Documentation/process: Add a label for the tip tree handbook's coding style
KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index
RISC-V: KVM: Remove unneeded semicolon
RISC-V: KVM: Allow Svnapot extension for Guest/VM
riscv: kvm: define vcpu_sbi_ext_pmu in header
RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC
RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel emulation of AIA APLIC
RISC-V: KVM: Implement device interface for AIA irqchip
RISC-V: KVM: Skeletal in-kernel AIA irqchip support
RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero
RISC-V: KVM: Add APLIC related defines
RISC-V: KVM: Add IMSIC related defines
RISC-V: KVM: Implement guest external interrupt line management
KVM: x86: Remove PRIx* definitions as they are solely for user space
s390/uv: Update query for secret-UVCs
s390/uv: replace scnprintf with sysfs_emit
s390/uvdevice: Add 'Lock Secret Store' UVC
...
Pull tracing updates from Steven Rostedt:
- Add new feature to have function graph tracer record the return
value. Adds a new option: funcgraph-retval ; when set, will show the
return value of a function in the function graph tracer.
- Also add the option: funcgraph-retval-hex where if it is not set, and
the return value is an error code, then it will return the decimal of
the error code, otherwise it still reports the hex value.
- Add the file /sys/kernel/tracing/osnoise/per_cpu/cpu<cpu>/timerlat_fd
That when a application opens it, it becomes the task that the timer
lat tracer traces. The application can also read this file to find
out how it's being interrupted.
- Add the file /sys/kernel/tracing/available_filter_functions_addrs
that works just the same as available_filter_functions but also shows
the addresses of the functions like kallsyms, except that it gives
the address of where the fentry/mcount jump/nop is. This is used by
BPF to make it easier to attach BPF programs to ftrace hooks.
- Replace strlcpy with strscpy in the tracing boot code.
* tag 'trace-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix warnings when building htmldocs for function graph retval
riscv: ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
tracing/boot: Replace strlcpy with strscpy
tracing/timerlat: Add user-space interface
tracing/osnoise: Skip running osnoise if all instances are off
tracing/osnoise: Switch from PF_NO_SETAFFINITY to migrate_disable
ftrace: Show all functions with addresses in available_filter_functions_addrs
selftests/ftrace: Add funcgraph-retval test case
LoongArch: ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
x86/ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
arm64: ftrace: Enable HAVE_FUNCTION_GRAPH_RETVAL
tracing: Add documentation for funcgraph-retval and funcgraph-retval-hex
function_graph: Support recording and printing the return value of function
fgraph: Add declaration of "struct fgraph_ret_regs"
Pull drm updates from Dave Airlie:
"There is one set of patches to misc for a i915 gsc/mei proxy driver.
Otherwise it's mostly amdgpu/i915/msm, lots of hw enablement and lots
of refactoring.
core:
- replace strlcpy with strscpy
- EDID changes to support further conversion to struct drm_edid
- Move i915 DSC parameter code to common DRM helpers
- Add Colorspace functionality
aperture:
- ignore framebuffers with non-primary devices
fbdev:
- use fbdev i/o helpers
- add Kconfig options for fb_ops helpers
- use new fb io helpers directly in drivers
sysfs:
- export DRM connector ID
scheduler:
- Avoid an infinite loop
ttm:
- store function table in .rodata
- Add query for TTM mem limit
- Add NUMA awareness to pools
- Export ttm_pool_fini()
bridge:
- fsl-ldb: support i.MX6SX
- lt9211, lt9611: remove blanking packets
- tc358768: implement input bus formats, devm cleanups
- ti-snd65dsi86: implement wait_hpd_asserted
- analogix: fix endless probe loop
- samsung-dsim: support swapped clock, fix enabling, support var
clock
- display-connector: Add support for external power supply
- imx: Fix module linking
- tc358762: Support reset GPIO
panel:
- nt36523: Support Lenovo J606F
- st7703: Support Anbernic RG353V-V2
- InnoLux G070ACE-L01 support
- boe-tv101wum-nl6: Improve initialization
- sharp-ls043t1le001: Mode fixes
- simple: BOE EV121WXM-N10-1850, S6D7AA0
- Ampire AM-800480L1TMQW-T00H
- Rocktech RK043FN48H
- Starry himax83102-j02
- Starry ili9882t
amdgpu:
- add new ctx query flag to handle reset better
- add new query/set shadow buffer for rdna3
- DCN 3.2/3.1.x/3.0.x updates
- Enable DC_FP on loongarch
- PCIe fix for RDNA2
- improve DC FAMS/SubVP support for better power management
- partition support for lots of engines
- Take NUMA into account when allocating memory
- Add new DRM_AMDGPU_WERROR config parameter to help with CI
- Initial SMU13 overdrive support
- Add support for new colorspace KMS API
- W=1 fixes
amdkfd:
- Query TTM mem limit rather than hardcoding it
- GC 9.4.3 partition support
- Handle NUMA for partitions
- Add debugger interface for enabling gdb
- Add KFD event age tracking
radeon:
- Fix possible UAF
i915:
- new getparam for PXP support
- GSC/MEI proxy driver
- Meteorlake display enablement
- avoid clearing preallocated framebuffers with TTM
- implement framebuffer mmap support
- Disable sampler indirect state in bindless heap
- Enable fdinfo for GuC backends
- GuC loading and firmware table handling fixes
- Various refactors for multi-tile enablement
- Define MOCS and PAT tables for MTL
- GSC/MEI support for Meteorlake
- PMU multi-tile support
- Large driver kernel doc cleanup
- Allow VRR toggling and arbitrary refresh rates
- Support async flips on linear buffers on display ver 12+
- Expose CRTC CTM property on ILK/SNB/VLV
- New debugfs for display clock frequencies
- Hotplug refactoring
- Display refactoring
- I915_GEM_CREATE_EXT_SET_PAT for Mesa on Meteorlake
- Use large rings for compute contexts
- HuC loading for MTL
- Allow user to set cache at BO creation
- MTL powermanagement enhancements
- Switch to dedicated workqueues to stop using flush_scheduled_work()
- Move display runtime init under display/
- Remove 10bit gamma on desktop gen3 parts, they don't support it
habanalabs:
- uapi: return 0 for user queries if there was a h/w or f/w error
- Add pci health check when we lose connection with the firmware.
This can be used to distinguish between pci link down and firmware
getting stuck.
- Add more info to the error print when TPC interrupt occur.
- Firmware fixes
msm:
- Adreno A660 bindings
- SM8350 MDSS bindings fix
- Added support for DPU on sm6350 and sm6375 platforms
- Implemented tearcheck support to support vsync on SM150 and newer
platforms
- Enabled missing features (DSPP, DSC, split display) on sc8180x,
sc8280xp, sm8450
- Added support for DSI and 28nm DSI PHY on MSM8226 platform
- Added support for DSI on sm6350 and sm6375 platforms
- Added support for display controller on MSM8226 platform
- A690 GPU support
- Move cmdstream dumping out of fence signaling path
- a610 support
- Support for a6xx devices without GMU
nouveau:
- NULL ptr before deref fixes
armada:
- implement fbdev emulation as client
sun4i:
- fix mipi-dsi dotclock
- release clocks
vc4:
- rgb range toggle property
- BT601 / BT2020 HDMI support
vkms:
- convert to drmm helpers
- add reflection and rotation support
- fix rgb565 conversion
gma500:
- fix iomem access
shmobile:
- support renesas soc platform
- enable fbdev
mxsfb:
- Add support for i.MX93 LCDIF
stm:
- dsi: Use devm_ helper
- ltdc: Fix potential invalid pointer deref
renesas:
- Group drivers in renesas subdirectory to prepare for new platform
- Drop deprecated R-Car H3 ES1.x support
meson:
- Add support for MIPI DSI displays
virtio:
- add sync object support
mediatek:
- Add display binding document for MT6795"
* tag 'drm-next-2023-06-29' of git://anongit.freedesktop.org/drm/drm: (1791 commits)
drm/i915: Fix a NULL vs IS_ERR() bug
drm/i915: make i915_drm_client_fdinfo() reference conditional again
drm/i915/huc: Fix missing error code in intel_huc_init()
drm/i915/gsc: take a wakeref for the proxy-init-completion check
drm/msm/a6xx: Add A610 speedbin support
drm/msm/a6xx: Add A619_holi speedbin support
drm/msm/a6xx: Use adreno_is_aXYZ macros in speedbin matching
drm/msm/a6xx: Use "else if" in GPU speedbin rev matching
drm/msm/a6xx: Fix some A619 tunables
drm/msm/a6xx: Add A610 support
drm/msm/a6xx: Add support for A619_holi
drm/msm/adreno: Disable has_cached_coherent in GMU wrapper configurations
drm/msm/a6xx: Introduce GMU wrapper support
drm/msm/a6xx: Move CX GMU power counter enablement to hw_init
drm/msm/a6xx: Extend and explain UBWC config
drm/msm/a6xx: Remove both GBIF and RBBM GBIF halt on hw init
drm/msm/a6xx: Add a helper for software-resetting the GPU
drm/msm/a6xx: Improve a6xx_bus_clear_pending_transactions()
drm/msm/a6xx: Move a6xx_bus_clear_pending_transactions to a6xx_gpu
drm/msm/a6xx: Move force keepalive vote removal to a6xx_gmu_force_off()
...
Pull non-mm updates from Andrew Morton:
- Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level
directories
- Douglas Anderson has added a new "buddy" mode to the hardlockup
detector. It permits the detector to work on architectures which
cannot provide the required interrupts, by having CPUs periodically
perform checks on other CPUs
- Zhen Lei has enhanced kexec's ability to support two crash regions
- Petr Mladek has done a lot of cleanup on the hard lockup detector's
Kconfig entries
- And the usual bunch of singleton patches in various places
* tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
kernel/time/posix-stubs.c: remove duplicated include
ocfs2: remove redundant assignment to variable bit_off
watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY
powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h
devres: show which resource was invalid in __devm_ioremap_resource()
watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH
watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h
watchdog/hardlockup: make the config checks more straightforward
watchdog/hardlockup: sort hardlockup detector related config values a logical way
watchdog/hardlockup: move SMP barriers from common code to buddy code
watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
...
Pull objtool updates from Ingo Molar:
"Build footprint & performance improvements:
- Reduce memory usage with CONFIG_DEBUG_INFO=y
In the worst case of an allyesconfig+CONFIG_DEBUG_INFO=y kernel,
DWARF creates almost 200 million relocations, ballooning objtool's
peak heap usage to 53GB. These patches reduce that to 25GB.
On a distro-type kernel with kernel IBT enabled, they reduce
objtool's peak heap usage from 4.2GB to 2.8GB.
These changes also improve the runtime significantly.
Debuggability improvements:
- Add the unwind_debug command-line option, for more extend unwinding
debugging output
- Limit unreachable warnings to once per function
- Add verbose option for disassembling affected functions
- Include backtrace in verbose mode
- Detect missing __noreturn annotations
- Ignore exc_double_fault() __noreturn warnings
- Remove superfluous global_noreturns entries
- Move noreturn function list to separate file
- Add __kunit_abort() to noreturns
Unwinder improvements:
- Allow stack operations in UNWIND_HINT_UNDEFINED regions
- drm/vmwgfx: Add unwind hints around RBP clobber
Cleanups:
- Move the x86 entry thunk restore code into thunk functions
- x86/unwind/orc: Use swap() instead of open coding it
- Remove unnecessary/unused variables
Fixes for modern stack canary handling"
* tag 'objtool-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
x86/orc: Make the is_callthunk() definition depend on CONFIG_BPF_JIT=y
objtool: Skip reading DWARF section data
objtool: Free insns when done
objtool: Get rid of reloc->rel[a]
objtool: Shrink elf hash nodes
objtool: Shrink reloc->sym_reloc_entry
objtool: Get rid of reloc->jump_table_start
objtool: Get rid of reloc->addend
objtool: Get rid of reloc->type
objtool: Get rid of reloc->offset
objtool: Get rid of reloc->idx
objtool: Get rid of reloc->list
objtool: Allocate relocs in advance for new rela sections
objtool: Add for_each_reloc()
objtool: Don't free memory in elf_close()
objtool: Keep GElf_Rel[a] structs synced
objtool: Add elf_create_section_pair()
objtool: Add mark_sec_changed()
objtool: Fix reloc_hash size
objtool: Consolidate rel/rela handling
...
Pull x86 mm updates from Ingo Molnar:
- Remove Xen-PV leftovers from init_32.c
- Fix __swp_entry_to_pte() warning splat for Xen PV guests, triggered
on CONFIG_DEBUG_VM_PGTABLE=y
* tag 'x86-mm-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove Xen-PV leftovers from init_32.c
x86/mm: Fix __swp_entry_to_pte() for Xen PV guests
Pull perf events updates from Ingo Molnar:
- Rework & fix the event forwarding logic by extending the core
interface.
This fixes AMD PMU events that have to be forwarded from the
core PMU to the IBS PMU.
- Add self-tests to test AMD IBS invocation via core PMU events
- Clean up Intel FixCntrCtl MSR encoding & handling
* tag 'perf-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Re-instate the linear PMU search
perf/x86/intel: Define bit macros for FixCntrCtl MSR
perf test: Add selftest to test IBS invocation via core pmu events
perf/core: Remove pmu linear searching code
perf/ibs: Fix interface via core pmu events
perf/core: Rework forwarding of {task|cpu}-clock events
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
Pull scheduler updates from Ingo Molnar:
"Scheduler SMP load-balancer improvements:
- Avoid unnecessary migrations within SMT domains on hybrid systems.
Problem:
On hybrid CPU systems, (processors with a mixture of
higher-frequency SMT cores and lower-frequency non-SMT cores),
under the old code lower-priority CPUs pulled tasks from the
higher-priority cores if more than one SMT sibling was busy -
resulting in many unnecessary task migrations.
Solution:
The new code improves the load balancer to recognize SMT cores
with more than one busy sibling and allows lower-priority CPUs
to pull tasks, which avoids superfluous migrations and lets
lower-priority cores inspect all SMT siblings for the busiest
queue.
- Implement the 'runnable boosting' feature in the EAS balancer:
consider CPU contention in frequency, EAS max util & load-balance
busiest CPU selection.
This improves CPU utilization for certain workloads, while leaves
other key workloads unchanged.
Scheduler infrastructure improvements:
- Rewrite the scheduler topology setup code by consolidating it into
the build_sched_topology() helper function and building it
dynamically on the fly.
- Resolve the local_clock() vs. noinstr complications by rewriting
the code: provide separate sched_clock_noinstr() and
local_clock_noinstr() functions to be used in instrumentation code,
and make sure it is all instrumentation-safe.
Fixes:
- Fix a kthread_park() race with wait_woken()
- Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
- Fix UP PREEMPT bug by unifying the SMP and UP implementations
- Fix task_struct::saved_state handling
- Fix various rq clock update bugs, unearthed by turning on the rq
clock debugging code.
- Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
by creating enough cgroups, by removing the warnign and restricting
window size triggers to PSI file write-permission or
CAP_SYS_RESOURCE.
- Propagate SMT flags in the topology when removing degenerate domain
- Fix grub_reclaim() calculation bug in the deadline scheduler code
- Avoid resetting the min update period when it is unnecessary, in
psi_trigger_destroy().
- Don't balance a task to its current running CPU in load_balance(),
which was possible on certain NUMA topologies with overlapping
groups.
- Fix the sched-debug printing of rq->nr_uninterruptible
Cleanups:
- Address various -Wmissing-prototype warnings, as a preparation to
(maybe) enable this warning in the future.
- Remove unused code
- Mark more functions __init
- Fix shadow-variable warnings"
* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
sched/core: Fixed missing rq clock update before calling set_rq_offline()
sched/deadline: Update GRUB description in the documentation
sched/deadline: Fix bandwidth reclaim equation in GRUB
sched/wait: Fix a kthread_park race with wait_woken()
sched/topology: Mark set_sched_topology() __init
sched/fair: Rename variable cpu_util eff_util
arm64/arch_timer: Fix MMIO byteswap
sched/fair, cpufreq: Introduce 'runnable boosting'
sched/fair: Refactor CPU utilization functions
cpuidle: Use local_clock_noinstr()
sched/clock: Provide local_clock_noinstr()
x86/tsc: Provide sched_clock_noinstr()
clocksource: hyper-v: Provide noinstr sched_clock()
clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
x86/vdso: Fix gettimeofday masking
math64: Always inline u128 version of mul_u64_u64_shr()
s390/time: Provide sched_clock_noinstr()
loongarch: Provide noinstr sched_clock_read()
...
Pull x86 SEV updates from Borislav Petkov:
- Some SEV and CC platform helpers cleanup and simplifications now that
the usage patterns are becoming apparent
[ I'm sure I'm the only one that has gets confused by all the TLAs, but
in case there are others: here SEV is AMD's "Secure Encrypted
Virtualization" and CC is generic "Confidential Computing".
There's also Intel SGX (Software Guard Extensions) and TDX (Trust
Domain Extensions), along with all the vendor memory encryption
extensions (SME, TSME, TME, and WTF).
And then we have arm64 with RMA and CCA, and I probably forgot another
dozen or so related acronyms - Linus ]
* tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/coco: Get rid of accessor functions
x86/sev: Get rid of special sev_es_enable_key
x86/coco: Mark cc_platform_has() and descendants noinstr
Pull x86 mtrr updates from Borislav Petkov:
"A serious scrubbing of the MTRR code including adding a new map
mechanism in order to look up the memory type of a region easily.
Also address memory range lookup issues like returning an invalid
memory type. Furthermore, this handles the decoupling of PAT from MTRR
more naturally.
All work by Juergen Gross"
* tag 'x86_mtrr_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Set default memory type for PV guests to WB
x86/mtrr: Unify debugging printing
x86/mtrr: Remove unused code
x86/mm: Only check uniform after calling mtrr_type_lookup()
x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
x86/mtrr: Use new cache_map in mtrr_type_lookup()
x86/mtrr: Add mtrr=debug command line option
x86/mtrr: Construct a memory map with cache modes
x86/mtrr: Add get_effective_type() service function
x86/mtrr: Allocate mtrr_value array dynamically
x86/mtrr: Move 32-bit code from mtrr.c to legacy.c
x86/mtrr: Have only one set_mtrr() variant
x86/mtrr: Replace vendor tests in MTRR code
x86/xen: Set MTRR state when running as Xen PV initial domain
x86/hyperv: Set MTRR state when running as SEV-SNP Hyper-V guest
x86/mtrr: Support setting MTRR state for software defined MTRRs
x86/mtrr: Replace size_or_mask and size_and_mask with a much easier concept
x86/mtrr: Remove physical address size calculation
Pull x86 cleanups from Dave Hansen:
"As usual, these are all over the map. The biggest cluster is work from
Arnd to eliminate -Wmissing-prototype warnings:
- Address -Wmissing-prototype warnings
- Remove repeated 'the' in comments
- Remove unused current_untag_mask()
- Document urgent tip branch timing
- Clean up MSR kernel-doc notation
- Clean up paravirt_ops doc
- Update Srivatsa S. Bhat's maintained areas
- Remove unused extern declaration acpi_copy_wakeup_routine()"
* tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
Documentation: virt: Clean up paravirt_ops doc
x86/mm: Remove unused current_untag_mask()
x86/mm: Remove repeated word in comments
x86/lib/msr: Clean up kernel-doc notation
x86/platform: Avoid missing-prototype warnings for OLPC
x86/mm: Add early_memremap_pgprot_adjust() prototype
x86/usercopy: Include arch_wb_cache_pmem() declaration
x86/vdso: Include vdso/processor.h
x86/mce: Add copy_mc_fragile_handle_tail() prototype
x86/fbdev: Include asm/fb.h as needed
x86/hibernate: Declare global functions in suspend.h
x86/entry: Add do_SYSENTER_32() prototype
x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled()
x86/mm: Include asm/numa.h for set_highmem_pages_init()
x86: Avoid missing-prototype warnings for doublefault code
x86/fpu: Include asm/fpu/regset.h
x86: Add dummy prototype for mk_early_pgtbl_32()
x86/pci: Mark local functions as 'static'
x86/ftrace: Move prepare_ftrace_return prototype to header
...
Pull x86 tdx updates from Dave Hansen:
- Fix a race window where load_unaligned_zeropad() could cause a fatal
shutdown during TDX private<=>shared conversion
The race has never been observed in practice but might allow
load_unaligned_zeropad() to catch a TDX page in the middle of its
conversion process which would lead to a fatal and unrecoverable
guest shutdown.
- Annotate sites where VM "exit reasons" are reused as hypercall
numbers.
* tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix enc_status_change_finish_noop()
x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()
x86/mm: Allow guest.enc_status_change_prepare() to fail
x86/tdx: Wrap exit reason with hcall_func()
Pull x86 platform updates from Dave Hansen:
"Allow CPUs in SGX/HPE Ultraviolet to start using Sub-NUMA clustering
(SNC) mode. SNC has been around outside the UV world for a while but
evidently never worked on UV systems.
SNC is rather notorious for breaking bad assumptions of a 1:1
relationship between physical sockets and NUMA nodes. The UV code was
rather prolific with these assumptions and took quite a bit of
refactoring to remove them"
* tag 'x86_platform_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Update UV[23] platform code for SNC
x86/platform/uv: Remove remaining BUG_ON() and BUG() calls
x86/platform/uv: UV support for sub-NUMA clustering
x86/platform/uv: Helper functions for allocating and freeing conversion tables
x86/platform/uv: When searching for minimums, start at INT_MAX not 99999
x86/platform/uv: Fix printed information in calc_mmioh_map
x86/platform/uv: Introduce helper function uv_pnode_to_socket.
x86/platform/uv: Add platform resolving #defines for misc GAM_MMIOH_REDIRECT*
Pull x86 cpu updates from Borislav Petkov:
- Compute the purposeful misalignment of zen_untrain_ret automatically
and assert __x86_return_thunk's alignment so that future changes to
the symbol macros do not accidentally break them.
- Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
pointless
* tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/retbleed: Add __x86_return_thunk alignment checks
x86/cpu: Remove X86_FEATURE_NAMES
x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt
Pull x86 confidential computing update from Borislav Petkov:
- Add support for unaccepted memory as specified in the UEFI spec v2.9.
The gist of it all is that Intel TDX and AMD SEV-SNP confidential
computing guests define the notion of accepting memory before using
it and thus preventing a whole set of attacks against such guests
like memory replay and the like.
There are a couple of strategies of how memory should be accepted -
the current implementation does an on-demand way of accepting.
* tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
virt: sevguest: Add CONFIG_CRYPTO dependency
x86/efi: Safely enable unaccepted memory in UEFI
x86/sev: Add SNP-specific unaccepted memory support
x86/sev: Use large PSC requests if applicable
x86/sev: Allow for use of the early boot GHCB for PSC requests
x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
x86/sev: Fix calculation of end address based on number of pages
x86/tdx: Add unaccepted memory support
x86/tdx: Refactor try_accept_one()
x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
efi: Add unaccepted memory support
x86/boot/compressed: Handle unaccepted memory
efi/libstub: Implement support for unaccepted memory
efi/x86: Get full memory map in allocate_e820()
mm: Add support for unaccepted memory
Pull x86 instruction alternatives updates from Borislav Petkov:
- Up until now the Fast Short Rep Mov optimizations implied the
presence of the ERMS CPUID flag. AMD decoupled them with a BIOS
setting so decouple that dependency in the kernel code too
- Teach the alternatives machinery to handle relocations
- Make debug_alternative accept flags in order to see only that set of
patching done one is interested in
- Other fixes, cleanups and optimizations to the patching code
* tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternative: PAUSE is not a NOP
x86/alternatives: Add cond_resched() to text_poke_bp_batch()
x86/nospec: Shorten RESET_CALL_DEPTH
x86/alternatives: Add longer 64-bit NOPs
x86/alternatives: Fix section mismatch warnings
x86/alternative: Optimize returns patching
x86/alternative: Complicate optimize_nops() some more
x86/alternative: Rewrite optimize_nops() some
x86/lib/memmove: Decouple ERMS from FSRM
x86/alternative: Support relocations in alternatives
x86/alternative: Make debug-alternative selective
Pull x86 core updates from Thomas Gleixner:
"A set of fixes for kexec(), reboot and shutdown issues:
- Ensure that the WBINVD in stop_this_cpu() has been completed before
the control CPU proceedes.
stop_this_cpu() is used for kexec(), reboot and shutdown to park
the APs in a HLT loop.
The control CPU sends an IPI to the APs and waits for their CPU
online bits to be cleared. Once they all are marked "offline" it
proceeds.
But stop_this_cpu() clears the CPU online bit before issuing
WBINVD, which means there is no guarantee that the AP has reached
the HLT loop.
This was reported to cause intermittent reboot/shutdown failures
due to some dubious interaction with the firmware.
This is not only a problem of WBINVD. The code to actually "stop"
the CPU which runs between clearing the online bit and reaching the
HLT loop can cause large enough delays on its own (think
virtualization). That's especially dangerous for kexec() as kexec()
expects that all APs are in a safe state and not executing code
while the boot CPU jumps to the new kernel. There are more issues
vs kexec() which are addressed separately.
Cure this by implementing an explicit synchronization point right
before the AP reaches HLT. This guarantees that the AP has
completed the full stop proceedure.
- Fix the condition for WBINVD in stop_this_cpu().
The WBINVD in stop_this_cpu() is required for ensuring that when
switching to or from memory encryption no dirty data is left in the
cache lines which might cause a write back in the wrong more later.
This checks CPUID directly because the feature bit might have been
cleared due to a command line option.
But that CPUID check accesses leaf 0x8000001f::EAX unconditionally.
Intel CPUs return the content of the highest supported leaf when a
non-existing leaf is read, while AMD CPUs return all zeros for
unsupported leafs.
So the result of the test on Intel CPUs is lottery and on AMD its
just correct by chance.
While harmless it's incorrect and causes the conditional wbinvd()
to be issued where not required, which caused the above issue to be
unearthed.
- Make kexec() robust against AP code execution
Ashok observed triple faults when doing kexec() on a system which
had been booted with "nosmt".
It turned out that the SMT siblings which had been brought up
partially are parked in mwait_play_dead() to enable power savings.
mwait_play_dead() is monitoring the thread flags of the AP's idle
task, which has been chosen as it's unlikely to be written to.
But kexec() can overwrite the previous kernel text and data
including page tables etc. When it overwrites the cache lines
monitored by an AP that AP resumes execution after the MWAIT on
eventually overwritten text, stack and page tables, which obviously
might end up in a triple fault easily.
Make this more robust in several steps:
1) Use an explicit per CPU cache line for monitoring.
2) Write a command to these cache lines to kick APs out of MWAIT
before proceeding with kexec(), shutdown or reboot.
The APs confirm the wakeup by writing status back and then
enter a HLT loop.
3) If the system uses INIT/INIT/STARTUP for AP bringup, park the
APs in INIT state.
HLT is not a guarantee that an AP won't wake up and resume
execution. HLT is woken up by NMI and SMI. SMI puts the CPU
back into HLT (+/- firmware bugs), but NMI is delivered to the
CPU which executes the NMI handler. Same issue as the MWAIT
scenario described above.
Sending an INIT/INIT sequence to the APs puts them into wait
for STARTUP state, which is safe against NMI.
There is still an issue remaining which can't be fixed: #MCE
If the AP sits in HLT and receives a broadcast #MCE it will try to
handle it with the obvious consequences.
INIT/INIT clears CR4.MCE in the AP which will cause a broadcast
#MCE to shut down the machine.
So there is a choice between fire (HLT) and frying pan (INIT).
Frying pan has been chosen as it's at least preventing the NMI
issue.
On systems which are not using INIT/INIT/STARTUP there is not much
which can be done right now, but at least the obvious and easy to
trigger MWAIT issue has been addressed"
* tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/smp: Put CPUs into INIT on shutdown if possible
x86/smp: Split sending INIT IPI out into a helper function
x86/smp: Cure kexec() vs. mwait_play_dead() breakage
x86/smp: Use dedicated cache-line for mwait_play_dead()
x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
x86/smp: Dont access non-existing CPUID leaf
x86/smp: Make stop_other_cpus() more robust
Pull SMP updates from Thomas Gleixner:
"A large update for SMP management:
- Parallel CPU bringup
The reason why people are interested in parallel bringup is to
shorten the (kexec) reboot time of cloud servers to reduce the
downtime of the VM tenants.
The current fully serialized bringup does the following per AP:
1) Prepare callbacks (allocate, intialize, create threads)
2) Kick the AP alive (e.g. INIT/SIPI on x86)
3) Wait for the AP to report alive state
4) Let the AP continue through the atomic bringup
5) Let the AP run the threaded bringup to full online state
There are two significant delays:
#3 The time for an AP to report alive state in start_secondary()
on x86 has been measured in the range between 350us and 3.5ms
depending on vendor and CPU type, BIOS microcode size etc.
#4 The atomic bringup does the microcode update. This has been
measured to take up to ~8ms on the primary threads depending
on the microcode patch size to apply.
On a two socket SKL server with 56 cores (112 threads) the boot CPU
spends on current mainline about 800ms busy waiting for the APs to
come up and apply microcode. That's more than 80% of the actual
onlining procedure.
This can be reduced significantly by splitting the bringup
mechanism into two parts:
1) Run the prepare callbacks and kick the AP alive for each AP
which needs to be brought up.
The APs wake up, do their firmware initialization and run the
low level kernel startup code including microcode loading in
parallel up to the first synchronization point. (#1 and #2
above)
2) Run the rest of the bringup code strictly serialized per CPU
(#3 - #5 above) as it's done today.
Parallelizing that stage of the CPU bringup might be possible
in theory, but it's questionable whether required surgery
would be justified for a pretty small gain.
If the system is large enough the first AP is already waiting at
the first synchronization point when the boot CPU finished the
wake-up of the last AP. That reduces the AP bringup time on that
SKL from ~800ms to ~80ms, i.e. by a factor ~10x.
The actual gain varies wildly depending on the system, CPU,
microcode patch size and other factors. There are some
opportunities to reduce the overhead further, but that needs some
deep surgery in the x86 CPU bringup code.
For now this is only enabled on x86, but the core functionality
obviously works for all SMP capable architectures.
- Enhancements for SMP function call tracing so it is possible to
locate the scheduling and the actual execution points. That allows
to measure IPI delivery time precisely"
* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
trace,smp: Add tracepoints for scheduling remotelly called functions
trace,smp: Add tracepoints around remotelly called functions
MAINTAINERS: Add CPU HOTPLUG entry
x86/smpboot: Fix the parallel bringup decision
x86/realmode: Make stack lock work in trampoline_compat()
x86/smp: Initialize cpu_primary_thread_mask late
cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
x86/smpboot: Support parallel startup of secondary CPUs
x86/smpboot: Implement a bit spinlock to protect the realmode stack
x86/apic: Save the APIC virtual base address
cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
x86/apic: Provide cpu_primary_thread mask
x86/smpboot: Enable split CPU startup
cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
cpu/hotplug: Reset task stack state in _cpu_up()
cpu/hotplug: Remove unused state functions
riscv: Switch to hotplug core state synchronization
parisc: Switch to hotplug core state synchronization
...
Pull x86 boot updates from Thomas Gleixner:
"Initialize FPU late.
Right now FPU is initialized very early during boot. There is no real
requirement to do so. The only requirement is to have it done before
alternatives are patched.
That's done in check_bugs() which does way more than what the function
name suggests.
So first rename check_bugs() to arch_cpu_finalize_init() which makes
it clear what this is about.
Move the invocation of arch_cpu_finalize_init() earlier in
start_kernel() as it has to be done before fork_init() which needs to
know the FPU register buffer size.
With those prerequisites the FPU initialization can be moved into
arch_cpu_finalize_init(), which removes it from the early and fragile
part of the x86 bringup"
* tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build
x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
x86/fpu: Mark init functions __init
x86/fpu: Remove cpuinfo argument from init functions
x86/init: Initialize signal frame size late
init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
init: Invoke arch_cpu_finalize_init() earlier
init: Remove check_bugs() leftovers
um/cpu: Switch to arch_cpu_finalize_init()
sparc/cpu: Switch to arch_cpu_finalize_init()
sh/cpu: Switch to arch_cpu_finalize_init()
mips/cpu: Switch to arch_cpu_finalize_init()
m68k/cpu: Switch to arch_cpu_finalize_init()
loongarch/cpu: Switch to arch_cpu_finalize_init()
ia64/cpu: Switch to arch_cpu_finalize_init()
ARM: cpu: Switch to arch_cpu_finalize_init()
x86/cpu: Switch to arch_cpu_finalize_init()
init: Provide arch_cpu_finalize_init()
Pull misc vfs updates from Christian Brauner:
"Miscellaneous features, cleanups, and fixes for vfs and individual fs
Features:
- Use mode 0600 for file created by cachefilesd so it can be run by
unprivileged users. This aligns them with directories which are
already created with mode 0700 by cachefilesd
- Reorder a few members in struct file to prevent some false sharing
scenarios
- Indicate that an eventfd is used a semaphore in the eventfd's
fdinfo procfs file
- Add a missing uapi header for eventfd exposing relevant uapi
defines
- Let the VFS protect transitions of a superblock from read-only to
read-write in addition to the protection it already provides for
transitions from read-write to read-only. Protecting read-only to
read-write transitions allows filesystems such as ext4 to perform
internal writes, keeping writers away until the transition is
completed
Cleanups:
- Arnd removed the architecture specific arch_report_meminfo()
prototypes and added a generic one into procfs.h. Note, we got a
report about a warning in amdpgpu codepaths that suggested this was
bisectable to this change but we concluded it was a false positive
- Remove unused parameters from split_fs_names()
- Rename put_and_unmap_page() to unmap_and_put_page() to let the name
reflect the order of the cleanup operation that has to unmap before
the actual put
- Unexport buffer_check_dirty_writeback() as it is not used outside
of block device aops
- Stop allocating aio rings from highmem
- Protecting read-{only,write} transitions in the VFS used open-coded
barriers in various places. Replace them with proper little helpers
and document both the helpers and all barrier interactions involved
when transitioning between read-{only,write} states
- Use flexible array members in old readdir codepaths
Fixes:
- Use the correct type __poll_t for epoll and eventfd
- Replace all deprecated strlcpy() invocations, whose return value
isn't checked with an equivalent strscpy() call
- Fix some kernel-doc warnings in fs/open.c
- Reduce the stack usage in jffs2's xattr codepaths finally getting
rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
royally annoying compilation warning
- Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
fmode_t is required to avoid fmode_t to integer degradation
warnings
- Create coredumps with O_WRONLY instead of O_RDWR. There's a long
explanation in that commit how O_RDWR is actually a bug which we
found out with the help of Linus and git archeology
- Fix "no previous prototype" warnings in the pipe codepaths
- Add overflow calculations for remap_verify_area() as a signed
addition overflow could be triggered in xfstests
- Fix a null pointer dereference in sysv
- Use an unsigned variable for length calculations in jfs avoiding
compilation warnings with gcc 13
- Fix a dangling pipe pointer in the watch queue codepath
- The legacy mount option parser provided as a fallback by the VFS
for filesystems not yet converted to the new mount api did prefix
the generated mount option string with a leading ',' causing issues
for some filesystems
- Fix a repeated word in a comment in fs.h
- autofs: Update the ctime when mtime is updated as mandated by
POSIX"
* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
readdir: Replace one-element arrays with flexible-array members
fs: Provide helpers for manipulating sb->s_readonly_remount
fs: Protect reconfiguration of sb read-write from racing writes
eventfd: add a uapi header for eventfd userspace APIs
autofs: set ctime as well when mtime changes on a dir
eventfd: show the EFD_SEMAPHORE flag in fdinfo
fs/aio: Stop allocating aio rings from HIGHMEM
fs: Fix comment typo
fs: unexport buffer_check_dirty_writeback
fs: avoid empty option when generating legacy mount string
watch_queue: prevent dangling pipe pointer
fs.h: Optimize file struct to prevent false sharing
highmem: Rename put_and_unmap_page() to unmap_and_put_page()
cachefiles: Allow the cache to be non-root
init: remove unused names parameter in split_fs_names()
jfs: Use unsigned variable for length calculations
fs/sysv: Null check to prevent null-ptr-deref bug
fs: use UB-safe check for signed addition overflow in remap_verify_area
procfs: consolidate arch_report_meminfo declaration
fs: pipe: reveal missing function protoypes
...
The previous patch ("function_graph: Support recording and printing
the return value of function") has laid the groundwork for the for
the funcgraph-retval, and this modification makes it available on
the x86 platform.
We introduce a new structure called fgraph_ret_regs for the x86
platform to hold return registers and the frame pointer. We then
fill its content in the return_to_handler and pass its address
to the function ftrace_return_to_handler to record the return
value.
Link: https://lkml.kernel.org/r/53a506f0f18ff4b7aeb0feb762f1c9a5e9b83ee9.1680954589.git.pengdonglin@sangfor.com.cn
Signed-off-by: Donglin Peng <pengdonglin@sangfor.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Parking CPUs in a HLT loop is not completely safe vs. kexec() as HLT can
resume execution due to NMI, SMI and MCE, which has the same issue as the
MWAIT loop.
Kicking the secondary CPUs into INIT makes this safe against NMI and SMI.
A broadcast MCE will take the machine down, but a broadcast MCE which makes
HLT resume and execute overwritten text, pagetables or data will end up in
a disaster too.
So chose the lesser of two evils and kick the secondary CPUs into INIT
unless the system has installed special wakeup mechanisms which are not
using INIT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230615193330.608657211@linutronix.de