Commit Graph

240 Commits

Author SHA1 Message Date
Linus Torvalds
d3464152e5 Merge tag 'ras_core_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:

 - Just cleanups and fixes this time around: make threshold_ktype const,
   an objtool fix and use proper size for a bitmap

* tag 'ras_core_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/MCE/AMD: Use an u64 for bank_map
  x86/mce: Always inline old MCA stubs
  x86/MCE/AMD: Make kobj_type structure constant
2023-04-25 09:56:33 -07:00
Muralidhara M K
4c1cdec319 x86/MCE/AMD: Use an u64 for bank_map
Thee maximum number of MCA banks is 64 (MAX_NR_BANKS), see

  a0bc32b3ca ("x86/mce: Increase maximum number of banks to 64").

However, the bank_map which contains a bitfield of which banks to
initialize is of type unsigned int and that overflows when those bit
numbers are >= 32, leading to UBSAN complaining correctly:

  UBSAN: shift-out-of-bounds in arch/x86/kernel/cpu/mce/amd.c:1365:38
  shift exponent 32 is too large for 32-bit type 'int'

Change the bank_map to a u64 and use the proper BIT_ULL() macro when
modifying bits in there.

  [ bp: Rewrite commit message. ]

Fixes: a0bc32b3ca ("x86/mce: Increase maximum number of banks to 64")
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127151601.1068324-1-muralimk@amd.com
2023-03-19 19:07:04 +01:00
Yazen Ghannam
4783b9cb37 x86/mce: Make sure logged MCEs are processed after sysfs update
A recent change introduced a flag to queue up errors found during
boot-time polling. These errors will be processed during late init once
the MCE subsystem is fully set up.

A number of sysfs updates call mce_restart() which goes through a subset
of the CPU init flow. This includes polling MCA banks and logging any
errors found. Since the same function is used as boot-time polling,
errors will be queued. However, the system is now past late init, so the
errors will remain queued until another error is found and the workqueue
is triggered.

Call mce_schedule_work() at the end of mce_restart() so that queued
errors are processed.

Fixes: 3bff147b18 ("x86/mce: Defer processing of early errors")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230301221420.2203184-1-yazen.ghannam@amd.com
2023-03-12 21:12:21 +01:00
Borislav Petkov (AMD)
554eec0b4a x86/mce: Always inline old MCA stubs
The stubs for the ancient MCA support (CONFIG_X86_ANCIENT_MCE) are
normally optimized away on 64-bit builds. However, an allmodconfig one
causes the compiler to add sanitizer calls gunk into them and they exist
as constprop calls. Which objtool then complains about:

  vmlinux.o: warning: objtool: do_machine_check+0xad8: call to \
    pentium_machine_check.constprop.0() leaves .noinstr.text section

due to them missing noinstr. One could tag them "noinstr" but what
should really happen is, they should be forcefully inlined so that all
that gunk gets optimized away and the warning doesn't even have a chance
to fire.

Do so.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230222191054.4701-1-bp@alien8.de
2023-03-08 13:50:07 +01:00
Thomas Weißschuh
7214b32b6f x86/MCE/AMD: Make kobj_type structure constant
Since

  ee6d3dd4ed ("driver core: make kobj_type constant.")

the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definition to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230217-kobj_type-mce-amd-v1-1-40ef94816444@weissschuh.net
2023-03-06 09:57:27 +01:00
Linus Torvalds
0246725d73 Merge tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:

 - Add support for reporting more bits of the physical address on error,
   on newer AMD CPUs

 - Mask out bits which don't belong to the address of the error being
   reported

* tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Mask out non-address bits from machine check bank
  x86/mce: Add support for Extended Physical Address MCA changes
  x86/mce: Define a function to extract ErrorAddr from MCA_ADDR
2023-02-21 08:04:51 -08:00
Tony Luck
8a01ec97dc x86/mce: Mask out non-address bits from machine check bank
Systems that support various memory encryption schemes (MKTME, TDX, SEV)
use high order physical address bits to indicate which key should be
used for a specific memory location.

When a memory error is reported, some systems may report those key
bits in the IA32_MCi_ADDR machine check MSR.

The Intel SDM has a footnote for the contents of the address register
that says: "Useful bits in this field depend on the address methodology
in use when the register state is saved."

AMD Processor Programming Reference has a more explicit description
of the MCA_ADDR register:

 "For physical addresses, the most significant bit is given by
  Core::X86::Cpuid::LongModeInfo[PhysAddrSize]."

Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical
address bits within the machine check bank address register. Use this
mask for recoverable machine check handling and in the EDAC driver to
ignore any key bits that may be present.

  [ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ]

Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reported-by: Fan Du <fan.du@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20230109152936.397862-1-tony.luck@intel.com
2023-01-10 11:47:07 +01:00
Xu Panda
7ddf0050a2 x86/mce/dev-mcelog: use strscpy() to instead of strncpy()
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL terminated strings.

Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/202212031419324523731@zte.com.cn
2023-01-07 11:47:35 +01:00
Smita Koralahalli
fcd343a285 x86/mce: Add support for Extended Physical Address MCA changes
Newer AMD CPUs support more physical address bits.

That is, the MCA_ADDR registers on Scalable MCA systems contain the
ErrorAddr in bits [56:0] instead of [55:0]. Hence, the existing LSB field
from bits [61:56] in MCA_ADDR must be moved around to accommodate the
larger ErrorAddr size.

MCA_CONFIG[McaLsbInStatusSupported] indicates this change. If set, the
LSB field will be found in MCA_STATUS rather than MCA_ADDR.

Each logical CPU has unique MCA bank in hardware and is not shared with
other logical CPUs. Additionally, on SMCA systems, each feature bit may
be different for each bank within same logical CPU.

Check for MCA_CONFIG[McaLsbInStatusSupported] for each MCA bank and for
each CPU.

Additionally, all MCA banks do not support maximum ErrorAddr bits in
MCA_ADDR. Some banks might support fewer bits but the remaining bits are
marked as reserved.

  [ Yazen: Rebased and fixed up formatting.
    bp: Massage comments. ]

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221206173607.1185907-5-yazen.ghannam@amd.com
2022-12-28 22:37:37 +01:00
Smita Koralahalli
2117654e80 x86/mce: Define a function to extract ErrorAddr from MCA_ADDR
Move MCA_ADDR[ErrorAddr] extraction into a separate helper function. This
will be further refactored to support extended ErrorAddr bits in MCA_ADDR
in newer AMD CPUs.

  [ bp: Massage. ]

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/all/20220225193342.215780-3-Smita.KoralahalliChannabasappa@amd.com/
2022-12-28 22:11:48 +01:00
Tony Luck
a51cbd0d86 x86/mce: Use severity table to handle uncorrected errors in kernel
mce_severity_intel() has a special case to promote UC and AR errors
in kernel context to PANIC severity.

The "AR" case is already handled with separate entries in the severity
table for all instruction fetch errors, and those data fetch errors that
are not in a recoverable area of the kernel (i.e. have an extable fixup
entry).

Add an entry to the severity table for UC errors in kernel context that
reports severity = PANIC. Delete the special case code from
mce_severity_intel().

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220922195136.54575-2-tony.luck@intel.com
2022-10-31 17:01:19 +01:00
Yazen Ghannam
bc1b705b0e x86/MCE/AMD: Clear DFR errors found in THR handler
AMD's MCA Thresholding feature counts errors of all severity levels, not
just correctable errors. If a deferred error causes the threshold limit
to be reached (it was the error that caused the overflow), then both a
deferred error interrupt and a thresholding interrupt will be triggered.

The order of the interrupts is not guaranteed. If the threshold
interrupt handler is executed first, then it will clear MCA_STATUS for
the error. It will not check or clear MCA_DESTAT which also holds a copy
of the deferred error. When the deferred error interrupt handler runs it
will not find an error in MCA_STATUS, but it will find the error in
MCA_DESTAT. This will cause two errors to be logged.

Check for deferred errors when handling a threshold interrupt. If a bank
contains a deferred error, then clear the bank's MCA_DESTAT register.

Define a new helper function to do the deferred error check and clearing
of MCA_DESTAT.

  [ bp: Simplify, convert comment to passive voice. ]

Fixes: 37d43acfd7 ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220621155943.33623-1-yazen.ghannam@amd.com
2022-10-27 17:01:25 +02:00
Jane Chu
f9781bb18e x86/mce: Retrieve poison range from hardware
When memory poison consumption machine checks fire, MCE notifier
handlers like nfit_handle_mce() record the impacted physical address
range which is reported by the hardware in the MCi_MISC MSR. The error
information includes data about blast radius, i.e. how many cachelines
did the hardware determine are impacted. A recent change

  7917f9cdb5 ("acpi/nfit: rely on mce->misc to determine poison granularity")

updated nfit_handle_mce() to stop hard coding the blast radius value of
1 cacheline, and instead rely on the blast radius reported in 'struct
mce' which can be up to 4K (64 cachelines).

It turns out that apei_mce_report_mem_error() had a similar problem in
that it hard coded a blast radius of 4K rather than reading the blast
radius from the error information. Fix apei_mce_report_mem_error() to
convey the proper poison granularity.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com
Link: https://lore.kernel.org/r/20220826233851.1319100-1-jane.chu@oracle.com
2022-08-29 09:33:42 +02:00
Smita Koralahalli
891e465a1b x86/mce: Check whether writes to MCA_STATUS are getting ignored
The platform can sometimes - depending on its settings - cause writes
to MCA_STATUS MSRs to get ignored, regardless of HWCR[McStatusWrEn]'s
value.

For further info see

  PPR for AMD Family 19h, Model 01h, Revision B1 Processors, doc ID 55898

at https://bugzilla.kernel.org/show_bug.cgi?id=206537.

Therefore, probe for ignored writes to MCA_STATUS to determine if hardware
error injection is at all possible.

  [ bp: Heavily massage commit message and patch. ]

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220214233640.70510-2-Smita.KoralahalliChannabasappa@amd.com
2022-06-28 12:08:10 +02:00
Linus Torvalds
35cdd8656e Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and DAX updates from Dan Williams:
 "New support for clearing memory errors when a file is in DAX mode,
  alongside with some other fixes and cleanups.

  Previously it was only possible to clear these errors using a truncate
  or hole-punch operation to trigger the filesystem to reallocate the
  block, now, any page aligned write can opportunistically clear errors
  as well.

  This change spans x86/mm, nvdimm, and fs/dax, and has received the
  appropriate sign-offs. Thanks to Jane for her work on this.

  Summary:

   - Add support for clearing memory error via pwrite(2) on DAX

   - Fix 'security overwrite' support in the presence of media errors

   - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)"

* tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  pmem: implement pmem_recovery_write()
  pmem: refactor pmem_clear_poison()
  dax: add .recovery_write dax_operation
  dax: introduce DAX_RECOVERY_WRITE dax access mode
  mce: fix set_mce_nospec to always unmap the whole page
  x86/mce: relocate set{clear}_mce_nospec() functions
  acpi/nfit: rely on mce->misc to determine poison granularity
  testing: nvdimm: asm/mce.h is not needed in nfit.c
  testing: nvdimm: iomap: make __nfit_test_ioremap a macro
  nvdimm: Allow overwrite in the presence of disabled dimms
  tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-27 15:49:30 -07:00
Linus Torvalds
1961b06c91 Merge tag 'acpi-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
 "These update the ACPICA kernel code to upstream revision 20220331,
  improve handling of PCI devices that are in D3cold during system
  initialization, add support for a few features, fix bugs and clean up
  code.

  Specifics:

   - Update ACPICA code in the kernel to upstream revision 20220331
     including the following changes:
       - Add support for the Windows 11 _OSI string (Mario Limonciello)
       - Add the CFMWS subtable to the CEDT table (Lawrence Hileman).
       - iASL: NHLT: Treat Terminator as specific_config (Piotr
         Maziarz).
       - iASL: NHLT: Fix parsing undocumented bytes at the end of
         Endpoint Descriptor (Piotr Maziarz).
       - iASL: NHLT: Rename linux specific strucures to device_info
         (Piotr Maziarz).
       - Add new ACPI 6.4 semantics to Load() and LoadTable() (Bob
         Moore).
       - Clean up double word in comment (Tom Rix).
       - Update copyright notices to the year 2022 (Bob Moore).
       - Remove some tabs and // comments - automated cleanup (Bob
         Moore).
       - Replace zero-length array with flexible-array member (Gustavo
         A. R. Silva).
       - Interpreter: Add units to time variable names (Paul Menzel).
       - Add support for ARM Performance Monitoring Unit Table (Besar
         Wicaksono).
       - Inform users about ACPI spec violation related to sleep length
         (Paul Menzel).
       - iASL/MADT: Add OEM-defined subtable (Bob Moore).
       - Interpreter: Fix some typo mistakes (Selvarasu Ganesan).
       - Updates for revision E.d of IORT (Shameer Kolothum).
       - Use ACPI_FORMAT_UINT64 for 64-bit output (Bob Moore).

   - Improve debug messages in the ACPI device PM code (Rafael Wysocki).

   - Block ASUS B1400CEAE from suspend to idle by default (Mario
     Limonciello).

   - Improve handling of PCI devices that are in D3cold during system
     initialization (Rafael Wysocki).

   - Fix BERT error region memory mapping (Lorenzo Pieralisi).

   - Add support for NVIDIA 16550-compatible port subtype to the SPCR
     parsing code (Jeff Brasen).

   - Use static for BGRT_SHOW kobj_attribute defines (Tom Rix).

   - Fix missing prototype warning for acpi_agdi_init() (Ilkka
     Koskinen).

   - Fix missing ERST record ID in the APEI code (Liu Xinpeng).

   - Make APEI error injection to refuse to inject into the zero page
     (Tony Luck).

   - Correct description of INT3407 / INT3532 DPTF attributes in sysfs
     (Sumeet Pawnikar).

   - Add support for high frequency impedance notification to the DPTF
     driver (Sumeet Pawnikar).

   - Make mp_config_acpi_gsi() a void function (Li kunyu).

   - Unify Package () representation for properties in the ACPI device
     properties documentation (Andy Shevchenko).

   - Include UUID in _DSM evaluation warning (Michael Niewöhner)"

* tag 'acpi-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (41 commits)
  Revert "ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms"
  ACPI: utils: include UUID in _DSM evaluation warning
  ACPI: PM: Block ASUS B1400CEAE from suspend to idle by default
  x86: ACPI: Make mp_config_acpi_gsi() a void function
  ACPI: DPTF: Add support for high frequency impedance notification
  ACPI: AGDI: Fix missing prototype warning for acpi_agdi_init()
  ACPI: bus: Avoid non-ACPI device objects in walks over children
  ACPI: DPTF: Correct description of INT3407 / INT3532 attributes
  ACPI: BGRT: use static for BGRT_SHOW kobj_attribute defines
  ACPI, APEI, EINJ: Refuse to inject into the zero page
  ACPI: PM: Always print final debug message in acpi_device_set_power()
  ACPI: SPCR: Add support for NVIDIA 16550-compatible port subtype
  ACPI: docs: enumeration: Unify Package () for properties (part 2)
  ACPI: APEI: Fix missing ERST record id
  ACPICA: Update version to 20220331
  ACPICA: exsystem.c: Use ACPI_FORMAT_UINT64 for 64-bit output
  ACPICA: IORT: Updates for revision E.d
  ACPICA: executer/exsystem: Fix some typo mistakes
  ACPICA: iASL/MADT: Add OEM-defined subtable
  ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms
  ...
2022-05-24 15:46:55 -07:00
Jane Chu
5898b43af9 mce: fix set_mce_nospec to always unmap the whole page
The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest."
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."

Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let driver handle it when
it comes down to repair.

Please refer to discussions here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Now since poisoned page is marked as not-present, in order to
avoid writing to a not-present page and trigger kernel Oops,
also fix pmem_do_write().

Fixes: 284ce4011b ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16 11:46:44 -07:00
Carlos Bilbao
fa619f5156 x86/mce: Add messages for panic errors in AMD's MCE grading
When a machine error is graded as PANIC by the AMD grading logic, the
MCE handler calls mce_panic(). The notification chain does not come
into effect so the AMD EDAC driver does not decode the errors. In these
cases, the messages displayed to the user are more cryptic and miss
information that might be relevant, like the context in which the error
took place.

Add messages to the grading logic for machine errors so that it is clear
what error it was.

  [ bp: Massage commit message. ]

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220405183212.354606-3-carlos.bilbao@amd.com
2022-04-25 12:40:48 +02:00
Carlos Bilbao
70c459d915 x86/mce: Simplify AMD severity grading logic
The MCE handler needs to understand the severity of the machine errors to
act accordingly. Simplify the AMD grading logic following a logic that
closely resembles the descriptions of the public PPR documents. This will
help include more fine-grained grading of errors in the future.

  [ bp: Touchups. ]

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220405183212.354606-2-carlos.bilbao@amd.com
2022-04-25 12:32:03 +02:00
Liu Xinpeng
a090931524 ACPI: APEI: Fix missing ERST record id
Read a record is cleared by others, but the deleted record cache entry is
still created by erst_get_record_id_next. When next enumerate the records,
get the cached deleted record, then erst_read() return -ENOENT and try to
get next record, loop back to first ID will return 0 in function
__erst_record_id_cache_add_one and then set record_id as
APEI_ERST_INVALID_RECORD_ID, finished this time read operation.
It will result in read the records just in the cache hereafter.

This patch cleared the deleted record cache, fix the issue that
"./erst-inject -p" shows record counts not equal to "./erst-inject -n".

A reproducer of the problem(retry many times):

[root@localhost erst-inject]# ./erst-inject -c 0xaaaaa00011
[root@localhost erst-inject]# ./erst-inject -p
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00012
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00013
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00014
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000006
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000007
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000008
[root@localhost erst-inject]# ./erst-inject -p
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00012
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00013
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00014
[root@localhost erst-inject]# ./erst-inject -n
total error record count: 6

Signed-off-by: Liu Xinpeng <liuxp11@chinatelecom.cn>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-04-13 20:29:24 +02:00
Ammar Faizi
e5f28623ce x86/MCE/AMD: Fix memory leak when threshold_create_bank() fails
In mce_threshold_create_device(), if threshold_create_bank() fails, the
previously allocated threshold banks array @bp will be leaked because
the call to mce_threshold_remove_device() will not free it.

This happens because mce_threshold_remove_device() fetches the pointer
through the threshold_banks per-CPU variable but bp is written there
only after the bank creation is successful, and not before, when
threshold_create_bank() fails.

Add a helper which unwinds all the bank creation work previously done
and pass into it the previously allocated threshold banks array for
freeing.

  [ bp: Massage. ]

Fixes: 6458de97fc ("x86/mce/amd: Straighten CPU hotplug path")
Co-developed-by: Alviro Iskandar Setiawan <alviro.iskandar@gnuweeb.org>
Signed-off-by: Alviro Iskandar Setiawan <alviro.iskandar@gnuweeb.org>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220329104705.65256-3-ammarfaizi2@gnuweeb.org
2022-04-05 21:24:37 +02:00
Smita Koralahalli
9f1b19b977 x86/mce: Avoid unnecessary padding in struct mce_bank
Convert struct mce_bank member "init" from bool to a bitfield to get rid
of unnecessary padding.

$ pahole -C mce_bank arch/x86/kernel/cpu/mce/core.o

before:

  /* size: 16, cachelines: 1, members: 2 */
  /* padding: 7 */
  /* last cacheline: 16 bytes */

after:

  /* size: 16, cachelines: 1, members: 3 */
  /* last cacheline: 16 bytes */

No functional changes.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220225193342.215780-2-Smita.KoralahalliChannabasappa@amd.com
2022-04-05 21:23:34 +02:00
Linus Torvalds
636f64db07 Merge tag 'ras_core_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:

 - More noinstr fixes

 - Add an erratum workaround for Intel CPUs which, in certain
   circumstances, end up consuming an unrelated uncorrectable memory
   error when using fast string copy insns

 - Remove the MCE tolerance level control as it is not really needed or
   used anymore

* tag 'ras_core_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Remove the tolerance level control
  x86/mce: Work around an erratum on fast string copy instructions
  x86/mce: Use arch atomic and bit helpers
2022-03-25 12:34:53 -07:00
Linus Torvalds
3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
luofei
d1fe111fb6 mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
When the hwpoison page meets the filter conditions, it should not be
regarded as successful memory_failure() processing for mce handler, but
should return a distinct value, otherwise mce handler regards the error
page has been identified and isolated, which may lead to calling
set_mce_nospec() to change page attribute, etc.

Here memory_failure() return -EOPNOTSUPP to indicate that the error
event is filtered, mce handler should not take any action for this
situation and hwpoison injector should treat as correct.

Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
Signed-off-by: luofei <luofei@unicloud.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:07 -07:00
Borislav Petkov
7f1b8e0d63 x86/mce: Remove the tolerance level control
This is pretty much unused and not really useful. What is more, all
relevant MCA hardware has recoverable machine checks support so there's
no real need to tweak MCA tolerance levels in order to *maybe* extend
machine lifetime.

So rip it out.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/YcDq8PxvKtTENl/e@zn.tnic
2022-02-23 11:09:25 +01:00
Jue Wang
8ca97812c3 x86/mce: Work around an erratum on fast string copy instructions
A rare kernel panic scenario can happen when the following conditions
are met due to an erratum on fast string copy instructions:

1) An uncorrected error.
2) That error must be in first cache line of a page.
3) Kernel must execute page_copy from the page immediately before that
page.

The fast string copy instructions ("REP; MOVS*") could consume an
uncorrectable memory error in the cache line _right after_ the desired
region to copy and raise an MCE.

Bit 0 of MSR_IA32_MISC_ENABLE can be cleared to disable fast string
copy and will avoid such spurious machine checks. However, that is less
preferable due to the permanent performance impact. Considering memory
poison is rare, it's desirable to keep fast string copy enabled until an
MCE is seen.

Intel has confirmed the following:
1. The CPU erratum of fast string copy only applies to Skylake,
Cascade Lake and Cooper Lake generations.

Directly return from the MCE handler:
2. Will result in complete execution of the "REP; MOVS*" with no data
loss or corruption.
3. Will not result in another MCE firing on the next poisoned cache line
due to "REP; MOVS*".
4. Will resume execution from a correct point in code.
5. Will result in the same instruction that triggered the MCE firing a
second MCE immediately for any other software recoverable data fetch
errors.
6. Is not safe without disabling the fast string copy, as the next fast
string copy of the same buffer on the same CPU would result in a PANIC
MCE.

This should mitigate the erratum completely with the only caveat that
the fast string copy is disabled on the affected hyper thread thus
performance degradation.

This is still better than the OS crashing on MCEs raised on an
irrelevant process due to "REP; MOVS*' accesses in a kernel context,
e.g., copy_page.

Tested:

Injected errors on 1st cache line of 8 anonymous pages of process
'proc1' and observed MCE consumption from 'proc2' with no panic
(directly returned).

Without the fix, the host panicked within a few minutes on a
random 'proc2' process due to kernel access from copy_page.

  [ bp: Fix comment style + touch ups, zap an unlikely(), improve the
    quirk function's readability. ]

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20220218013209.2436006-1-juew@google.com
2022-02-19 14:26:42 +01:00
Borislav Petkov
f11445ba7a x86/mce: Use arch atomic and bit helpers
The arch helpers do not have explicit KASAN instrumentation. Use them in
noinstr code.

Inline a couple more functions with single call sites, while at it:

mce_severity_amd_smca() has a single call-site which is noinstr so force
the inlining and fix:

  vmlinux.o: warning: objtool: mce_severity_amd.constprop.0()+0xca: call to \
	  mce_severity_amd_smca() leaves .noinstr.text section

Always inline mca_msr_reg():

     text    data     bss     dec     hex filename
  16065240        128031326       36405368        180501934       ac23dae vmlinux.before
  16065240        128031294       36405368        180501902       ac23d8e vmlinux.after

and mce_no_way_out() as the latter one is used only once, to fix:

  vmlinux.o: warning: objtool: mce_read_aux()+0x53: call to mca_msr_reg() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_machine_check()+0xc9: call to mce_no_way_out() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20220204083015.17317-4-bp@alien8.de
2022-02-13 22:08:27 +01:00
Tony Luck
822ccfade5 x86/cpu: Read/save PPIN MSR during initialization
Currently, the PPIN (Protected Processor Inventory Number) MSR is read
by every CPU that processes a machine check, CMCI, or just polls machine
check banks from a periodic timer. This is not a "fast" MSR, so this
adds to overhead of processing errors.

Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the
PPIN during initialization. Use this copy in mce_setup() instead of
reading the MSR.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-4-tony.luck@intel.com
2022-02-01 16:29:26 +01:00
Tony Luck
0dcab41d34 x86/cpu: Merge Intel and AMD ppin_init() functions
The code to decide whether a system supports the PPIN (Protected
Processor Inventory Number) MSR was cloned from the Intel
implementation. Apart from the X86_FEATURE bit and the MSR numbers it is
identical.

Merge the two functions into common x86 code, but use x86_match_cpu()
instead of the switch (c->x86_model) that was used by the old Intel
code.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-2-tony.luck@intel.com
2022-02-01 12:56:23 +01:00
Greg Kroah-Hartman
7f99cb5e60 x86/CPU/AMD: Use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the AMD mce sysfs code to use default_groups field which has
been the preferred way since

  aa30f47cf6 ("kobject: Add support for default attribute groups to kobj_type")

so that the obsolete default_attrs field can be removed soon.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220106103537.3663852-1-gregkh@linuxfoundation.org
2022-02-01 12:41:24 +01:00
Tony Luck
e464121f2d x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN
Missed adding the Icelake-D CPU to the list. It uses the same MSRs
to control and read the inventory number as all the other models.

Fixes: dc6b025de9 ("x86/mce: Add Xeon Icelake to list of CPUs that support PPIN")
Reported-by: Ailin Xu <ailin.xu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220121174743.1875294-2-tony.luck@intel.com
2022-01-25 18:40:30 +01:00
Yazen Ghannam
1f52b0aba6 x86/MCE/AMD: Allow thresholding interface updates after init
Changes to the AMD Thresholding sysfs code prevents sysfs writes from
updating the underlying registers once CPU init is completed, i.e.
"threshold_banks" is set.

Allow the registers to be updated if the thresholding interface is
already initialized or if in the init path. Use the "set_lvt_off" value
to indicate if running in the init path, since this value is only set
during init.

Fixes: a037f3ca0e ("x86/mce/amd: Make threshold bank setting hotplug robust")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220117161328.19148-1-yazen.ghannam@amd.com
2022-01-23 20:50:18 +01:00
Zhang Zixun
de768416b2 x86/mce/inject: Avoid out-of-bounds write when setting flags
A contrived zero-length write, for example, by using write(2):

  ...
  ret = write(fd, str, 0);
  ...

to the "flags" file causes:

  BUG: KASAN: stack-out-of-bounds in flags_write
  Write of size 1 at addr ffff888019be7ddf by task writefile/3787

  CPU: 4 PID: 3787 Comm: writefile Not tainted 5.16.0-rc7+ #12
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014

due to accessing buf one char before its start.

Prevent such out-of-bounds access.

  [ bp: Productize into a proper patch. Link below is the next best
    thing because the original mail didn't get archived on lore. ]

Fixes: 0451d14d05 ("EDAC, mce_amd_inj: Modify flags attribute to use string arguments")
Signed-off-by: Zhang Zixun <zhang133010@icloud.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/YcnePfF1OOqoQwrX@zn.tnic/
2021-12-28 11:45:36 +01:00
Yazen Ghannam
91f75eb481 x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumeration
AMD systems currently lay out MCA bank types such that the type of bank
number "i" is either the same across all CPUs or is Reserved/Read-as-Zero.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     CS
    3      SMU    RAZ

Future AMD systems will lay out MCA bank types such that the type of
bank number "i" may be different across CPUs.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     NBIO
    3      SMU    RAZ

Change the structures that cache MCA bank types to be per-CPU and update
smca_get_bank_type() to handle this change.

Move some SMCA-specific structures to amd.c from mce.h, since they no
longer need to be global.

Break out the "count" for bank types from struct smca_hwid, since this
should provide a per-CPU count rather than a system-wide count.

Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The
values in this array should not change at runtime.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com
2021-12-22 17:22:09 +01:00
Yazen Ghannam
5176a93ab2 x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
Add HWID and McaType values for new SMCA bank types, and add their error
descriptions to edac_mce_amd.

The "PHY" bank types all have the same error descriptions, and the NBIF
and SHUB bank types have the same error descriptions. So reuse the same
arrays where appropriate.

  [ bp: Remove useless comments over hwid types. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-2-yazen.ghannam@amd.com
2021-12-22 17:19:18 +01:00
Borislav Petkov
1acd85feba x86/mce: Check regs before accessing it
Commit in Fixes accesses pt_regs before checking whether it is NULL or
not. Make sure the NULL pointer check happens first.

Fixes: 0a5b288e85 ("x86/mce: Prevent severity computation from being instrumented")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20211217102029.GA29708@kili
2021-12-20 11:41:02 +01:00
Borislav Petkov
e3d72e8eee x86/mce: Mark mce_start() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x4ae: call to __const_udelay() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-13-bp@alien8.de
2021-12-13 14:14:05 +01:00
Borislav Petkov
edb3d07e24 x86/mce: Mark mce_timed_out() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x482: call to mce_timed_out() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-12-bp@alien8.de
2021-12-13 14:13:54 +01:00
Borislav Petkov
75581a203e x86/mce: Move the tainting outside of the noinstr region
add_taint() is yet another external facility which the #MC handler
calls. Move that tainting call into the instrumentation-allowed part of
the handler.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x617: call to add_taint() leaves .noinstr.text section

While at it, allow instrumentation around the mce_log() call.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x690: call to mce_log() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-11-bp@alien8.de
2021-12-13 14:13:35 +01:00
Borislav Petkov
db6c996d6c x86/mce: Mark mce_read_aux() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de
2021-12-13 14:13:23 +01:00
Borislav Petkov
b4813539d3 x86/mce: Mark mce_end() noinstr
It is called by the #MC handler which is noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de
2021-12-13 14:13:12 +01:00
Borislav Petkov
3c7ce80a81 x86/mce: Mark mce_panic() noinstr
And allow instrumentation inside it because it does calls to other
facilities which will not be tagged noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de
2021-12-13 14:13:01 +01:00
Borislav Petkov
0a5b288e85 x86/mce: Prevent severity computation from being instrumented
Mark all the MCE severity computation logic noinstr and allow
instrumentation when it "calls out".

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xc5d: call to mce_severity() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-7-bp@alien8.de
2021-12-13 14:12:48 +01:00
Borislav Petkov
4fbce464db x86/mce: Allow instrumentation during task work queueing
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de
2021-12-13 14:12:35 +01:00
Borislav Petkov
487d654db3 x86/mce: Remove noinstr annotation from mce_setup()
Instead, sandwitch around the call which is done in noinstr context and
mark the caller - mce_gather_info() - as noinstr.

Also, document what the whole instrumentation strategy with #MC is going
to be in the future and where it all is supposed to be going to.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-5-bp@alien8.de
2021-12-13 14:12:21 +01:00
Borislav Petkov
88f66a4235 x86/mce: Use mce_rdmsrl() in severity checking code
MCA has its own special MSR accessors. Use them.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-4-bp@alien8.de
2021-12-13 14:12:08 +01:00
Borislav Petkov
ad669ec16a x86/mce: Remove function-local cpus variables
Use num_online_cpus() directly.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-3-bp@alien8.de
2021-12-13 14:11:53 +01:00
Borislav Petkov
cd5e0d1fc9 x86/mce: Do not use memset to clear the banks bitmaps
The bitmap is a single unsigned long so no need for the function call.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-2-bp@alien8.de
2021-12-13 14:11:22 +01:00
Smita Koralahalli
1e56279a49 x86/mce/inject: Set the valid bit in MCA_STATUS before error injection
MCA handlers check the valid bit in each status register
(MCA_STATUS[Val]) and continue processing the error only if the valid
bit is set.

Set the valid bit unconditionally in the corresponding MCA_STATUS
register and correct any Val=0 injections made by the user as such
errors will get ignored and such injections will be largely pointless.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211104215846.254012-3-Smita.KoralahalliChannabasappa@amd.com
2021-12-08 12:01:01 +01:00