Commit Graph

3745 Commits

Author SHA1 Message Date
Linus Torvalds
2a8120d7b4 Merge tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:

 - Switch read and write software bits for PUDs

 - Add missing hardware bits for PUDs and PMDs

 - Generate unwind information for C modules to fix GDB unwind error for
   vDSO functions

 - Create .build-id links for unstripped vDSO files to enable vDSO
   debugging with symbols

 - Use standard stack frame layout for vDSO generated stack frames to
   manually walk stack frames without DWARF information

 - Rework perf_callchain_user() and arch_stack_walk_user() functions to
   reduce code duplication

 - Skip first stack frame when walking user stack

 - Add basic checks to identify invalid instruction pointers when
   walking stack frames

 - Introduce and use struct stack_frame_vdso_wrapper within vDSO user
   wrapper code to automatically generate an asm-offset define. Also use
   STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to document
   that the code works with user space stack

 - Clear the backchain of the extra stack frame added by the vDSO user
   wrapper code. This allows the user stack walker to detect and skip
   the non-standard stack frame. Without this an incorrect instruction
   pointer would be added to stack traces.

 - Rewrite psw_idle() function in C to ease maintenance and further
   enhancements

 - Remove get_vtimer() function and use get_cpu_timer() instead

 - Mark psw variable in __load_psw_mask() as __unitialized to avoid
   superfluous clearing of PSW

 - Remove obsolete and superfluous comment about removed TIF_FPU flag

 - Replace memzero_explicit() and kfree() with kfree_sensitive() to fix
   warnings reported by Coccinelle

 - Wipe sensitive data and all copies of protected- or secure-keys from
   stack when an IOCTL fails

 - Both do_airq_interrupt() and do_io_interrupt() functions set
   CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code

 - Provide iucv_alloc_device() and iucv_release_device() helpers, which
   can be used to deduplicate more or less identical IUCV device
   allocation and release code in four different drivers

 - Make use of iucv_alloc_device() and iucv_release_device() helpers to
   get rid of quite some code and also remove a cast to an incompatible
   function (clang W=1)

 - There is no user of iucv_root outside of the core IUCV code left.
   Therefore remove the EXPORT_SYMBOL

 - __apply_alternatives() contains a runtime check which verifies that
   the size of the to be patched code area is even. Convert this to a
   compile time check

 - Increase size of buffers for sending z/VM CP DIAGNOSE X'008' commands
   from 128 to 240

 - Do not accept z/VM CP DIAGNOSE X'008' commands longer than maximally
   allowed

 - Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead of
   IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe reIPL
   block on 'scp_data' sysfs attribute update

 - Initialize the correct fields of the NVMe dump block, which were
   confused with FCP fields

 - Refactor macros for 'scp_data' (re-)IPL sysfs attribute to reduce
   code duplication

 - Introduce 'scp_data' sysfs attribute for dump IPL to allow tools such
   as dumpconf passing additional kernel command line parameters to a
   stand-alone dumper

 - Rework the CPACF query functions to use the correct RRE or RRF
   instruction formats and set instruction register fields correctly

 - Instead of calling BUG() at runtime force a link error during compile
   when a unsupported opcode is used with __cpacf_query() or
   __cpacf_check_opcode() functions

 - Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask
   or /sys/bus/ap/aqmask sysfs file update with a relative mask value

 - Fix "bindings complete" udev event which should be sent once all AP
   devices have been bound to device drivers and again when unbind/bind
   actions take place and all AP devices are bound again

 - Facility list alt_stfle_fac_list is nowhere used in the decompressor,
   therefore remove it there

 - Remove custom kprobes insn slot allocator in favour of the standard
   module_alloc() one, since kernel image and module areas are located
   within 4GB

 - Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid
   calling memset() with a large byte count and get rid of the sparse
   warning as result

* tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
  s390/zcrypt: Use kvcalloc() instead of kvmalloc_array()
  s390/kprobes: Remove custom insn slot allocator
  s390/boot: Remove alt_stfle_fac_list from decompressor
  s390/ap: Fix bind complete udev event sent after each AP bus scan
  s390/ap: Fix crash in AP internal function modify_bitmap()
  s390/cpacf: Make use of invalid opcode produce a link error
  s390/cpacf: Split and rework cpacf query functions
  s390/ipl: Introduce sysfs attribute 'scp_data' for dump ipl
  s390/ipl: Introduce macros for (re)ipl sysfs attribute 'scp_data'
  s390/ipl: Fix incorrect initialization of nvme dump block
  s390/ipl: Fix incorrect initialization of len fields in nvme reipl block
  s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length
  s390/ipl: Fix size of vmcmd buffers for sending z/VM CP diag X'008' cmds
  s390/alternatives: Convert runtime sanity check into compile time check
  s390/iucv: Unexport iucv_root
  tty: hvc-iucv: Make use of iucv_alloc_device()
  s390/smsgiucv_app: Make use of iucv_alloc_device()
  s390/netiucv: Make use of iucv_alloc_device()
  s390/vmlogrdr: Make use of iucv_alloc_device()
  s390/iucv: Provide iucv_alloc_device() / iucv_release_device()
  ...
2024-05-21 12:09:36 -07:00
Linus Torvalds
61307b7be4 Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page->flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
2024-05-19 09:21:03 -07:00
Linus Torvalds
ff9a79307f Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:

 - Avoid 'constexpr', which is a keyword in C23

 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of
   'dt_binding_check'

 - Fix weak references to avoid GOT entries in position-independent code
   generation

 - Convert the last use of 'optional' property in arch/sh/Kconfig

 - Remove support for the 'optional' property in Kconfig

 - Remove support for Clang's ThinLTO caching, which does not work with
   the .incbin directive

 - Change the semantics of $(src) so it always points to the source
   directory, which fixes Makefile inconsistencies between upstream and
   downstream

 - Fix 'make tar-pkg' for RISC-V to produce a consistent package

 - Provide reasonable default coverage for objtool, sanitizers, and
   profilers

 - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.

 - Remove the last use of tristate choice in drivers/rapidio/Kconfig

 - Various cleanups and fixes in Kconfig

* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
  kconfig: use sym_get_choice_menu() in sym_check_prop()
  rapidio: remove choice for enumeration
  kconfig: lxdialog: remove initialization with A_NORMAL
  kconfig: m/nconf: merge two item_add_str() calls
  kconfig: m/nconf: remove dead code to display value of bool choice
  kconfig: m/nconf: remove dead code to display children of choice members
  kconfig: gconf: show checkbox for choice correctly
  kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
  Makefile: remove redundant tool coverage variables
  kbuild: provide reasonable defaults for tool coverage
  modules: Drop the .export_symbol section from the final modules
  kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
  kconfig: use sym_get_choice_menu() in conf_write_defconfig()
  kconfig: add sym_get_choice_menu() helper
  kconfig: turn defaults and additional prompt for choice members into error
  kconfig: turn missing prompt for choice members into error
  kconfig: turn conf_choice() into void function
  kconfig: use linked list in sym_set_changed()
  kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
  kconfig: gconf: remove debug code
  ...
2024-05-18 12:39:20 -07:00
Linus Torvalds
70a663205d Merge tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:

 - tracing/probes: Add new pseudo-types %pd and %pD support for dumping
   dentry name from 'struct dentry *' and file name from 'struct file *'

 - uprobes performance optimizations:
    - Speed up the BPF uprobe event by delaying the fetching of the
      uprobe event arguments that are not used in BPF
    - Avoid locking by speculatively checking whether uprobe event is
      valid
    - Reduce lock contention by using read/write_lock instead of
      spinlock for uprobe list operation. This improved BPF uprobe
      benchmark result 43% on average

 - rethook: Remove non-fatal warning messages when tracing stack from
   BPF and skip rcu_is_watching() validation in rethook if possible

 - objpool: Optimize objpool (which is used by kretprobes and fprobe as
   rethook backend storage) by inlining functions and avoid caching
   nr_cpu_ids because it is a const value

 - fprobe: Add entry/exit callbacks types (code cleanup)

 - kprobes: Check ftrace was killed in kprobes if it uses ftrace

* tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobe/ftrace: bail out if ftrace was killed
  selftests/ftrace: Fix required features for VFS type test case
  objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
  objpool: enable inlining objpool_push() and objpool_pop() operations
  rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
  ftrace: make extra rcu_is_watching() validation check optional
  uprobes: reduce contention on uprobes_tree access
  rethook: Remove warning messages printed for finding return address of a frame.
  fprobe: Add entry/exit callbacks types
  selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"
  selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
  Documentation: tracing: add new type '%pd' and '%pD' for kprobe
  tracing/probes: support '%pD' type for print struct file's name
  tracing/probes: support '%pd' type for print struct dentry's name
  uprobes: add speculative lockless system-wide uprobe filter check
  uprobes: prepare uprobe args buffer lazily
  uprobes: encapsulate preparation of uprobe args buffer
2024-05-17 18:29:30 -07:00
Heiko Carstens
d890e6af50 s390/kprobes: Remove custom insn slot allocator
Since commit c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address
spaces") the kernel image and module area are within the same 4GB area.

This eliminates the need of a custom insn slot allocator for kprobes within
the kernel image, since standard module_alloc() allocated pages are
sufficient for PC relative instructions with a signed 32 bit offset.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-05-16 10:31:32 +02:00
Sven Schnelle
e7dec0b792 s390/boot: Remove alt_stfle_fac_list from decompressor
It is nowhere used in the decompressor, therefore remove it.

Fixes: 17e89e1340 ("s390/facilities: move stfl information from lowcore to global data")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-05-16 10:17:12 +02:00
Stephen Brennan
1a7d0890dd kprobe/ftrace: bail out if ftrace was killed
If an error happens in ftrace, ftrace_kill() will prevent disarming
kprobes. Eventually, the ftrace_ops associated with the kprobes will be
freed, yet the kprobes will still be active, and when triggered, they
will use the freed memory, likely resulting in a page fault and panic.

This behavior can be reproduced quite easily, by creating a kprobe and
then triggering a ftrace_kill(). For simplicity, we can simulate an
ftrace error with a kernel module like [1]:

[1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer

  sudo perf probe --add commit_creds
  sudo perf trace -e probe:commit_creds
  # In another terminal
  make
  sudo insmod ftrace_killer.ko  # calls ftrace_kill(), simulating bug
  # Back to perf terminal
  # ctrl-c
  sudo perf probe --del commit_creds

After a short period, a page fault and panic would occur as the kprobe
continues to execute and uses the freed ftrace_ops. While ftrace_kill()
is supposed to be used only in extreme circumstances, it is invoked in
FTRACE_WARN_ON() and so there are many places where an unexpected bug
could be triggered, yet the system may continue operating, possibly
without the administrator noticing. If ftrace_kill() does not panic the
system, then we should do everything we can to continue operating,
rather than leave a ticking time bomb.

Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-16 07:23:30 +09:00
Alexander Egorenkov
24bf80f78e s390/ipl: Introduce sysfs attribute 'scp_data' for dump ipl
This is analogous to the reipl's sysfs attribute named equally and enables
tools such as s390-tools' dumpconf to pass additional kernel cmdline
parameters to a stand-alone dumper such as zfcpdump (e.g. to enable
debug output with 'dump_debug' parameter) or ngdump.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:55 +02:00
Alexander Egorenkov
005ca01333 s390/ipl: Introduce macros for (re)ipl sysfs attribute 'scp_data'
This is a refactoring change to reduce code duplication and improve code
reuse.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:55 +02:00
Alexander Egorenkov
7faacaeaf6 s390/ipl: Fix incorrect initialization of nvme dump block
Initialize the correct fields of the nvme dump block.
This bug had not been detected before because first, the fcp and nvme fields
of struct ipl_parameter_block are part of the same union and, therefore,
overlap in memory and second, they are identical in structure and size.

Fixes: d70e38cb1d ("s390: nvme dump support")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:55 +02:00
Alexander Egorenkov
9c922b73ac s390/ipl: Fix incorrect initialization of len fields in nvme reipl block
Use correct symbolic constants IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN
to initialize nvme reipl block when 'scp_data' sysfs attribute is
being updated. This bug had not been detected before because
the corresponding fcp and nvme symbolic constants are equal.

Fixes: 23a457b8d5 ("s390: nvme reipl")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:54 +02:00
Alexander Egorenkov
247576bf62 s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length
The old implementation of vmcmd sysfs string attributes truncated passed
z/VM CP diagnose X'008' commands which were longer than the max allowed
number of characters but the reported number of written characters was
still equal to the entire length of a given string. This can result in
silent failures of some s390-tools (e.g. dumpconf) which can be very hard
to detect. Therefore, this commit makes a write attempt to a vmcmd sysfs
attribute
* fail with E2BIG error if a given string is longer than the maximum
  allowed one
* never destroy the old data in the vmcmd sysfs attribute if the new data
  doesn't fit into it entirely
* return the actual number of written characters if it succeeds

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:54 +02:00
Alexander Egorenkov
72935e3ada s390/ipl: Fix size of vmcmd buffers for sending z/VM CP diag X'008' cmds
z/VM CP diagnose X'008' accepts commands of max 240 characters.
Using a smaller value as a buffer size makes kernel send truncated CP
commands which are longer than the old buffer size. This can result in
invalid CP commands passed to z/VM.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:54 +02:00
Heiko Carstens
207ddb9189 s390/alternatives: Convert runtime sanity check into compile time check
__apply_alternatives() contains a runtime check which verifies that the
size of the to be patched code area is even. Convert this to a compile time
check using a similar ".org" trick, which is already used to verify that
old and new code areas have the same size.

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:21:23 +02:00
Sven Schnelle
1084562ec8 s390/irq: Set CIF_NOHZ_DELAY in do_io_irq()
Both do_airq_interrupt() and do_io_interrupt() set
CIF_NOHZ_DELAY. Move it to do_io_irq() to simplify
the code.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 20:19:46 +02:00
Masahiro Yamada
7f7f6f7ad6 Makefile: remove redundant tool coverage variables
Now Kbuild provides reasonable defaults for objtool, sanitizers, and
profilers.

Remove redundant variables.

Note:

This commit changes the coverage for some objects:

  - include arch/mips/vdso/vdso-image.o into UBSAN, GCOV, KCOV
  - include arch/sparc/vdso/vdso-image-*.o into UBSAN
  - include arch/sparc/vdso/vma.o into UBSAN
  - include arch/x86/entry/vdso/extable.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
  - include arch/x86/entry/vdso/vdso-image-*.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
  - include arch/x86/entry/vdso/vdso32-setup.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
  - include arch/x86/entry/vdso/vma.o into GCOV, KCOV
  - include arch/x86/um/vdso/vma.o into KASAN, GCOV, KCOV

I believe these are positive effects because all of them are kernel
space objects.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Roberto Sassu <roberto.sassu@huawei.com>
2024-05-14 23:35:48 +09:00
Thomas Huth
980fffff14 s390/fpu: Remove comment about TIF_FPU
It has been removed in commit 2c6b96762f ("s390/fpu: remove TIF_FPU"),
so we should not mention TIF_FPU in the comment here anymore. Since the
remaining parts of the comment just document the obvious fact that
save_user_fpu_regs() saves the FPU state, simply remove the comment now
completely.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20240503080648.81461-1-thuth@redhat.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:08 +02:00
Sven Schnelle
095c89e99b s390/vtime: Use get_cpu_timer()
Instead of implementing get_vtimer() use get_cpu_timer()
which does the same.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:07 +02:00
Sven Schnelle
fa2ae4a377 s390/idle: Rewrite psw_idle() in C
To ease maintenance and further enhancements, convert
the psw_idle() function to C.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:07 +02:00
Heiko Carstens
62b672c4ba s390/stackstrace: Detect vdso stack frames
Clear the backchain of the extra stack frame added by the vdso user wrapper
code. This allows the user stack walker to detect and skip the non-standard
stack frame. Without this an incorrect instruction pointer would be added
to stack traces, and stack frame walking would be continued with a more or
less random back chain.

Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:07 +02:00
Heiko Carstens
be72ea09c1 s390/vdso: Introduce and use struct stack_frame_vdso_wrapper
Introduce and use struct stack_frame_vdso_wrapper within vdso user wrapper
code.  With this structure it is possible to automatically generate an
asm-offset define which can be used to save and restore the return address
of the calling function.

Also use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to
document that the code works with user space stack frames with the standard
stack frame layout.

Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:07 +02:00
Heiko Carstens
cd58109283 s390/stacktrace: Improve detection of invalid instruction pointers
Add basic checks to identify invalid instruction pointers when walking
stack frames:

Instruction pointers must

- have even addresses
- be larger than mmap_min_addr
- lower than the asce_limit of the process

Alternatively it would also be possible to walk page tables similar to fast
GUP and verify that the mapping of the corresponding page is executable,
however that seems to be overkill.

Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:06 +02:00
Heiko Carstens
87eceb17a9 s390/stacktrace: Skip first user stack frame
When walking user stack frames the first stack frame (where the stack
pointer points to) should be skipped: the return address of the current
function is saved in the previous stack frame, not the current stack frame,
which is allocated for to be called functions.

Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:06 +02:00
Heiko Carstens
ebd912ff99 s390/stacktrace: Merge perf_callchain_user() and arch_stack_walk_user()
The two functions perf_callchain_user() and arch_stack_walk_user() are
nearly identical. Reduce code duplication and add a common helper which can
be called by both functions.

Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:06 +02:00
Heiko Carstens
185445c7c1 s390/vdso: Use standard stack frame layout
By default user space is compiled with standard stack frame layout and not
with the packed stack layout. The vdso code however inherited the
-mpacked-stack compiler option from the kernel. Remove this option to make
sure the vdso is compiled with standard stack frame layout.

This makes sure that the stack frame backchain location for vdso generated
stack frames is the same like for calling code (if compiled with default
options). This allows to manually walk stack frames without DWARF
information, like the kernel is doing it e.g. with arch_stack_walk_user().

Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:06 +02:00
Jens Remus
10f7052536 s390/vdso: Generate unwind information for C modules
GDB fails to unwind vDSO functions with error message "PC not saved",
for instance when stepping through gettimeofday().

Add -fasynchronous-unwind-tables to CFLAGS to generate .eh_frame
DWARF unwind information for the vDSO C modules.

Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:05 +02:00
Mike Rapoport (IBM)
0cc2dc4902 arch: make execmem setup available regardless of CONFIG_MODULES
execmem does not depend on modules, on the contrary modules use
execmem.

To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:44 -07:00
Mike Rapoport (IBM)
223b5e57d0 mm/execmem, arch: convert remaining overrides of module_alloc to execmem
Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
late initialization of execmem required by arm64.

The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Tested-by: Liviu Dudau <liviu@dudau.co.uk>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Mike Rapoport (IBM)
12af2b83d0 mm: introduce execmem_alloc() and execmem_free()
module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules and
puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures define parameters for ranges suitable for that
subsystem.

No functional changes.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Linus Torvalds
6e5a0c30b6 Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Add cpufreq pressure feedback for the scheduler

 - Rework misfit load-balancing wrt affinity restrictions

 - Clean up and simplify the code around ::overutilized and
   ::overload access.

 - Simplify sched_balance_newidle()

 - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
   handling that changed the output.

 - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()

 - Reorganize, clean up and unify most of the higher level
   scheduler balancing function names around the sched_balance_*()
   prefix

 - Simplify the balancing flag code (sched_balance_running)

 - Miscellaneous cleanups & fixes

* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/pelt: Remove shift of thermal clock
  sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched/cpufreq: Take cpufreq feedback into account
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched/fair: Fix update of rd->sg_overutilized
  sched/vtime: Do not include <asm/vtime.h> header
  s390/irq,nmi: Include <asm/vtime.h> header directly
  s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  sched/vtime: Get rid of generic vtime_task_switch() implementation
  sched/vtime: Remove confusing arch_vtime_task_switch() declaration
  sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
  sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
  sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
  sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
  sched/fair: Rename root_domain::overload to ::overloaded
  sched/fair: Use helper functions to access root_domain::overload
  sched/fair: Check root_domain::overload value before update
  sched/fair: Combine EAS check with root_domain::overutilized access
  sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
  ...
2024-05-13 17:18:51 -07:00
Linus Torvalds
d65e1a0f30 Merge tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:

 - Store AP Query Configuration Information in a static buffer

 - Rework the AP initialization and add missing cleanups to the error
   path

 - Swap IRQ and AP bus/device registration to avoid race conditions

 - Export prot_virt_guest symbol

 - Introduce AP configuration changes notifier interface to facilitate
   modularization of the AP bus

 - Add CONFIG_AP kernel configuration option to allow modularization of
   the AP bus

 - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description
   and dependency and rename it to CONFIG_AP_DEBUG

 - Convert sprintf() and snprintf() to sysfs_emit() in CIO code

 - Adjust indentation of RELOCS command build step

 - Make crypto performance counters upward compatible

 - Convert make_page_secure() and gmap_make_secure() to use folio

 - Rework channel-utilization-block (CUB) handling in preparation of
   introducing additional CUBs

 - Use attribute groups to simplify registration, removal and extension
   of measurement-related channel-path sysfs attributes

 - Add a per-channel-path binary "ext_measurement" sysfs attribute that
   provides access to extended channel-path measurement data

 - Export measurement data for all channel-measurement-groups (CMG), not
   only for a specific ones. This enables support of new CMG data
   formats in userspace without the need for kernel changes

 - Add a per-channel-path sysfs attribute "speed_bps" that provides the
   operating speed in bits per second or 0 if the operating speed is not
   available

 - The CIO tracepoint subchannel-type field "st" is incorrectly set to
   the value of subchannel-enabled SCHIB "ena" field. Fix that

 - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS

 - Consider the maximum physical address available to a DCSS segment
   (512GB) when memory layout is set up

 - Simplify the virtual memory layout setup by reducing the size of
   identity mapping vs vmemmap overlap

 - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
   This will allow to place the kernel image next to kernel modules

 - Move everyting KASLR related from <asm/setup.h> to <asm/page.h>

 - Put virtual memory layout information into a structure to improve
   code generation

 - Currently __kaslr_offset is the kernel offset in both physical and
   virtual memory spaces. Uncouple these offsets to allow uncoupling of
   the addresses spaces

 - Currently the identity mapping base address is implicit and is always
   set to zero. Make it explicit by putting into __identity_base
   persistent boot variable and use it in proper context

 - Introduce .amode31 section start and end macros AMODE31_START and
   AMODE31_END

 - Introduce OS_INFO entries that do not reference any data in memory,
   but rather provide only values

 - Store virtual memory layout in OS_INFO. It is read out by
   makedumpfile, crash and other tools

 - Store virtual memory layout in VMCORE_INFO. It is read out by crash
   and other tools when /proc/kcore device is used

 - Create additional PT_LOAD ELF program header that covers kernel image
   only, so that vmcore tools could locate kernel text and data when
   virtual and physical memory spaces are uncoupled

 - Uncouple physical and virtual address spaces

 - Map kernel at fixed location when KASLR mode is disabled. The
   location is defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration
   value.

 - Rework deployment of kernel image for both compressed and
   uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel
   configuration value

 - Move .vmlinux.relocs section in front of the compressed kernel. The
   interim section rescue step is avoided as result

 - Correct modules thunk offset calculation when branch target is more
   than 2GB away

 - Kernel modules contain their own set of expoline thunks. Now that the
   kernel modules area is less than 4GB away from kernel expoline
   thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN
   the default if the compiler supports it

 - userfaultfd can insert shared zeropages into processes running VMs,
   but that is not allowed for s390. Fallback to allocating a fresh
   zeroed anonymous folio and insert that instead

 - Re-enable shared zeropages for non-PV and non-skeys KVM guests

 - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use

 - Add ap_config sysfs attribute to provide the means for setting or
   displaying adapters, domains and control domains assigned to a
   vfio-ap mediated device in a single operation

 - Make vfio_ap_mdev_link_queue() ignore duplicate link requests

 - Add write support to ap_config sysfs attribute to allow atomic update
   a vfio-ap mediated device state

 - Document ap_config sysfs attribute

 - Function os_info_old_init() is expected to be called only from a
   regular kdump kernel. Enable it to be called from a stand-alone dump
   kernel

 - Address gcc -Warray-bounds warning and fix array size in struct
   os_info

 - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks

 - Use unwinder instead of __builtin_return_address() with ftrace to
   prevent returning of undefined values

 - Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
   kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is
   disabled

 - Compile kernel with -fPIC and link with -no-pie to allow kpatch
   feature always succeed and drop the whole CONFIG_PIE_BUILD
   option-enabled code

 - Add missing virt_to_phys() converter for VSIE facility and crypto
   control blocks

* tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
  Revert "s390: Relocate vmlinux ELF data to virtual address space"
  KVM: s390: vsie: Use virt_to_phys for crypto control block
  s390: Relocate vmlinux ELF data to virtual address space
  s390: Compile kernel with -fPIC and link with -no-pie
  s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD
  s390/ftrace: Use unwinder instead of __builtin_return_address()
  s390/pci: Drop unneeded reference to CONFIG_DMI
  s390/os_info: Fix array size in struct os_info
  s390/os_info: Initialize old os_info in standalone dump kernel
  docs: Update s390 vfio-ap doc for ap_config sysfs attribute
  s390/vfio-ap: Add write support to sysfs attr ap_config
  s390/vfio-ap: Ignore duplicate link requests in vfio_ap_mdev_link_queue
  s390/vfio-ap: Add sysfs attr, ap_config, to export mdev state
  s390/ap: Externalize AP bus specific bitmap reading function
  s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
  mm/userfaultfd: Do not place zeropages when zeropages are disallowed
  s390/expoline: Make modules use kernel expolines
  s390/nospec: Correct modules thunk offset calculation
  s390/boot: Do not rescue .vmlinux.relocs section
  s390/boot: Rework deployment of the kernel image
  ...
2024-05-13 08:33:52 -07:00
Masahiro Yamada
b1992c3772 kbuild: use $(src) instead of $(srctree)/$(src) for source directory
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
checked-in source files. It is merely a convention without any functional
difference. In fact, $(obj) and $(src) are exactly the same, as defined
in scripts/Makefile.build:

    src := $(obj)

When the kernel is built in a separate output directory, $(src) does
not accurately reflect the source directory location. While Kbuild
resolves this discrepancy by specifying VPATH=$(srctree) to search for
source files, it does not cover all cases. For example, when adding a
header search path for local headers, -I$(srctree)/$(src) is typically
passed to the compiler.

This introduces inconsistency between upstream and downstream Makefiles
because $(src) is used instead of $(srctree)/$(src) for the latter.

To address this inconsistency, this commit changes the semantics of
$(src) so that it always points to the directory in the source tree.

Going forward, the variables used in Makefiles will have the following
meanings:

  $(obj)     - directory in the object tree
  $(src)     - directory in the source tree  (changed by this commit)
  $(objtree) - the top of the kernel object tree
  $(srctree) - the top of the kernel source tree

Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced
with $(src).

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-05-10 04:34:52 +09:00
Masahiro Yamada
b957df3b85 arch: use $(obj)/ instead of $(src)/ for preprocessed linker scripts
These are generated files. Prefix them with $(obj)/ instead of $(src)/.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-05-02 20:14:16 +09:00
Sumanth Korikkar
00cda11d3b s390: Compile kernel with -fPIC and link with -no-pie
When the kernel is built with CONFIG_PIE_BUILD option enabled it
uses dynamic symbols, for which the linker does not allow more
than 64K number of entries. This can break features like kpatch.

Hence, whenever possible the kernel is built with CONFIG_PIE_BUILD
option disabled. For that support of unaligned symbols generated by
linker scripts in the compiler is necessary.

However, older compilers might lack such support. In that case the
build process resorts to CONFIG_PIE_BUILD option-enabled build.

Compile object files with -fPIC option and then link the kernel
binary with -no-pie linker option.

As result, the dynamic symbols are not generated and not only kpatch
feature succeeds, but also the whole CONFIG_PIE_BUILD option-enabled
code could be dropped.

[ agordeev: Reworded the commit message ]

Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-29 17:33:30 +02:00
Sumanth Korikkar
5f90003f09 s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD
Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
option is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled.

[ agordeev: Reworded the commit message ]

Fixes: 778666df60 ("s390: compile relocatable kernel without -fPIE")
Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-29 17:33:30 +02:00
Sven Schnelle
cae74ba8c2 s390/ftrace: Use unwinder instead of __builtin_return_address()
Using __builtin_return_address(n) might return undefined values
when used with values of n outside of the stack. This was noticed
when __builtin_return_address() was called in ftrace on top level
functions like the interrupt handlers.

As this behaviour cannot be fixed, use the s390 stack unwinder and
remove the ftrace compilation flags for unwind_bc.c and stacktrace.c
to prevent the unwinding function polluting function traces.

Another advantage is that this also works with clang.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-29 17:33:30 +02:00
Sven Schnelle
fe742c08f3 s390/os_info: Fix array size in struct os_info
gcc's -Warray-bounds warned about an out-of-bounds access to
the entry array contained in struct os_info. This doesn't trigger
a bug right now because there's a large reserved space after the
array. Nevertheless fix this, and also add a BUILD_BUG_ON to make
sure struct os_info is always exactly on page in size.

Fixes: f4cac27dc0 ("s390/crash: Use old os_info to create PT_LOAD headers")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-29 17:33:29 +02:00
Alexander Egorenkov
a2269a66ee s390/os_info: Initialize old os_info in standalone dump kernel
The commit be42660d0c13 ("s390/crash: use old os_info to create PT_LOAD headers")
introduced use of the old os_info into standalone dump kernel.
Before this change os_info_old_init() expected to be called only from
a regular kdump kernel although the function itself is able to work
in standalone dump kernels as well (because copy_oldmem_kernel() is able
to handle both use cases). Therefore, fix the expectation of os_info_old_init()
and enable it to be called from a standalone dump kernel.

Fixes: f4cac27dc0 ("s390/crash: Use old os_info to create PT_LOAD headers")
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-29 17:33:29 +02:00
Jens Remus
b961ec10b9 s390/vdso: Add CFI for RA register to asm macro vdso_func
The return-address (RA) register r14 is specified as volatile in the
s390x ELF ABI [1]. Nevertheless proper CFI directives must be provided
for an unwinder to restore the return address, if the RA register
value is changed from its value at function entry, as it is the case.

[1]: s390x ELF ABI, https://github.com/IBM/s390x-abi/releases

Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-26 16:22:38 +02:00
Kent Overstreet
0069455bcb fix missing vmalloc.h includes
Patch series "Memory allocation profiling", v6.

Overview:
Low overhead [1] per-callsite memory allocation profiling. Not just for
debug kernels, overhead low enough to be deployed in production.

Example output:
  root@moria-kvm:~# sort -rn /proc/allocinfo
   127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
    56373248     4737 mm/slub.c:2259 func:alloc_slab_page
    14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
    14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
    13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
    11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
     9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
     4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
     4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
     3940352      962 mm/memory.c:4214 func:alloc_anon_folio
     2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
     ...

Usage:
kconfig options:
 - CONFIG_MEM_ALLOC_PROFILING
 - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
 - CONFIG_MEM_ALLOC_PROFILING_DEBUG
   adds warnings for allocations that weren't accounted because of a
   missing annotation

sysctl:
  /proc/sys/vm/mem_profiling

Runtime info:
  /proc/allocinfo

Notes:

[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
(6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)  && CONFIG_MEMCG_KMEM=y
(7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
    CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below are results
from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
56 core Intel Xeon:

                        kmalloc                 pgalloc
(1 baseline)            6.764s                  16.902s
(2 default disabled)    6.793s  (+0.43%)        17.007s (+0.62%)
(3 default enabled)     7.197s  (+6.40%)        23.666s (+40.02%)
(4 runtime enabled)     7.405s  (+9.48%)        23.901s (+41.41%)
(5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
(6 def disabled+memcg)  13.332s (+97.10%)       48.105s (+184.61%)
(7 def enabled+memcg)   13.446s (+98.78%)       54.963s (+225.18%)

Memory overhead:
Kernel size:

   text           data        bss         dec         diff
(1) 26515311	      18890222    17018880    62424413
(2) 26524728	      19423818    16740352    62688898    264485
(3) 26524724	      19423818    16740352    62688894    264481
(4) 26524728	      19423818    16740352    62688898    264485
(5) 26541782	      18964374    16957440    62463596    39183

Memory consumption on a 56 core Intel CPU with 125GB of memory:
Code tags:           192 kB
PageExts:         262144 kB (256MB)
SlabExts:           9876 kB (9.6MB)
PcpuExts:            512 kB (0.5MB)

Total overhead is 0.2% of total memory.

Benchmarks:

Hackbench tests run 100 times:
hackbench -s 512 -l 200 -g 15 -f 25 -P
      baseline       disabled profiling           enabled profiling
avg   0.3543         0.3559 (+0.0016)             0.3566 (+0.0023)
stdev 0.0137         0.0188                       0.0077


hackbench -l 10000
      baseline       disabled profiling           enabled profiling
avg   6.4218         6.4306 (+0.0088)             6.5077 (+0.0859)
stdev 0.0933         0.0286                       0.0489

stress-ng tests:
stress-ng --class memory --seq 4 -t 60
stress-ng --class cpu --seq 4 -t 60
Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/

[2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/


This patch (of 37):

The next patch drops vmalloc.h from a system header in order to fix a
circular dependency; this adds it to all the files that were pulling it in
implicitly.

[kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c]
  Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev
[surenb@google.com: fix arch/x86/mm/numa_32.c]
  Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com
[kent.overstreet@linux.dev: a few places were depending on sizes.h]
  Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev
[arnd@arndb.de: fix mm/kasan/hw_tags.c]
  Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org
[surenb@google.com: fix arc build]
  Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:55:49 -07:00
Sven Schnelle
d111855ab7 s390/mm: Fix NULL pointer dereference
The recently added check to figure out if a fault happened on gmap ASCE
dereferences the gmap pointer in lowcore without checking that it is not
NULL. For all non-KVM processes the pointer is NULL, so that some value
from lowcore will be read. With the current layouts of struct gmap and
struct lowcore the read value (aka ASCE) is zero, so that this doesn't lead
to any observable bug; at least currently.

Fix this by adding the missing NULL pointer check.

Fixes: 64c3431808 ("s390/entry: compare gmap asce to determine guest/host fault")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 17:26:34 +02:00
Vasily Gorbik
ea84f14d2a s390/nospec: Correct modules thunk offset calculation
Fix offset calculation when branch target is more then 2Gb away.

Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:02 +02:00
Alexander Gordeev
56b1069c40 s390/boot: Rework deployment of the kernel image
Rework deployment of kernel image for both compressed and
uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED
kernel configuration variable.

In case CONFIG_KERNEL_UNCOMPRESSED is disabled avoid uncompressing
the kernel to a temporary buffer and copying it to the target
address. Instead, uncompress it directly to the target destination.

In case CONFIG_KERNEL_UNCOMPRESSED is enabled avoid moving the
kernel to default 0x100000 location when KASLR is disabled or
failed. Instead, use the uncompressed kernel image directly.

In case KASLR is disabled or failed .amode31 section location in
memory is not randomized and precedes the kernel image. In case
CONFIG_KERNEL_UNCOMPRESSED is disabled that location overlaps the
area used by the decompression algorithm. That is fine, since that
area is not used after the decompression finished and the size of
.amode31 section is not expected to exceed BOOT_HEAP_SIZE ever.

There is no decompression in case CONFIG_KERNEL_UNCOMPRESSED is
enabled. Therefore, rename decompress_kernel() to deploy_kernel(),
which better describes both uncompressed and compressed cases.

Introduce AMODE31_SIZE macro to avoid immediate value of 0x3000
(the size of .amode31 section) in the decompressor linker script.
Modify the vmlinux linker script to force the size of .amode31
section to AMODE31_SIZE (the value of (_eamode31 - _samode31)
could otherwise differ as result of compiler options used).

Introduce __START_KERNEL macro that defines the kernel ELF image
entry point and set it to the currrent value of 0x100000.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:02 +02:00
Alexander Gordeev
f4cac27dc0 s390/crash: Use old os_info to create PT_LOAD headers
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

The vmcore ELF program headers describe virtual memory
regions of a crashed kernel. User level tools use that
information for the kernel text and data analysis (e.g
vmcore-dmesg extracts the kernel log).

Currently the kernel image is covered by program headers
describing the identity mapping regions. But in the future
the kernel image will be mapped into separate region outside
of the identity mapping. Create the additional ELF program
header that covers kernel image only, so that vmcore tools
could locate kernel text and data.

Further, the identity mapping in crashed and capture kernels
will have different base address. Due to that __va() macro
can not be used in the capture kernel. Instead, read crashed
kernel identity mapping base address from os_info and use
it for PT_LOAD type program headers creation.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
Alexander Gordeev
378e32aa81 s390/vmcoreinfo: Store virtual memory layout
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

The virtual memory layout is needed for address translation
by crash tool when /proc/kcore device is used as the memory
image.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
Alexander Gordeev
8572f52518 s390/os_info: Store virtual memory layout
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

The virtual memory layout will be read out by makedumpfile,
crash and other user tools for virtual address translation.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
Alexander Gordeev
88702793c5 s390/os_info: Introduce value entries
Introduce entries that do not reference any data in memory,
but rather provide values. Set the size of such entries to
zero and do not compute checksum for them, since there is no
data which integrity needs to be checked. The integrity of
the value entries itself is still covered by the os_info
checksum.

Reserve the lowest unused entry index OS_INFO_RESERVED for
future use - presumably for the number of entries present.
That could later be used by user level tools. The existing
tools would not notice any difference.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:01 +02:00
Alexander Gordeev
5fb50fa66a s390/boot: Make .amode31 section address range explicit
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Introduce .amode31 section address range AMODE31_START
and AMODE31_END macros for later use.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00
Alexander Gordeev
7de0446f0b s390/boot: Make identity mapping base address explicit
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Currently the identity mapping base address is implicit
and is always set to zero. Make it explicit by putting
into __identity_base persistent boot variable and use it
in proper context - which is the value of PAGE_OFFSET.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00
Alexander Gordeev
236f324b74 s390/mm: Create virtual memory layout structure
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.

Put virtual memory layout information into a structure
to improve code generation when accessing the structure
members, which are currently only ident_map_size and
__kaslr_offset.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:38:00 +02:00