Commit Graph

34714 Commits

Author SHA1 Message Date
David Woodhouse
c4d51a52c6 sched/wait: Add add_wait_queue_priority()
This allows an exclusive wait_queue_entry to be added at the head of the
queue, instead of the tail as normal. Thus, it gets to consume events
first without allowing non-exclusive waiters to be woken at all.

The (first) intended use is for KVM IRQFD, which currently has
inconsistent behaviour depending on whether posted interrupts are
available or not. If they are, KVM will bypass the eventfd completely
and deliver interrupts directly to the appropriate vCPU. If not, events
are delivered through the eventfd and userspace will receive them when
polling on the eventfd.

By using add_wait_queue_priority(), KVM will be able to consistently
consume events within the kernel without accidentally exposing them
to userspace when they're supposed to be bypassed. This, in turn, means
that userspace doesn't have to jump through hoops to avoid listening
on the erroneously noisy eventfd and injecting duplicate interrupts.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027143944.648769-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:09 -05:00
Linus Torvalds
db7c953555 Merge tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Current release - regressions:

   - arm64: dts: fsl-ls1028a-kontron-sl28: specify in-band mode for
     ENETC

  Current release - bugs in new features:

   - mptcp: provide rmem[0] limit offset to fix oops

  Previous release - regressions:

   - IPv6: Set SIT tunnel hard_header_len to zero to fix path MTU
     calculations

   - lan743x: correctly handle chips with internal PHY

   - bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE

   - mlx5e: Fix VXLAN port table synchronization after function reload

  Previous release - always broken:

   - bpf: Zero-fill re-used per-cpu map element

   - fix out-of-order UDP packets when forwarding with UDP GSO fraglists
     turned on:
       - fix UDP header access on Fast/frag0 UDP GRO
       - fix IP header access and skb lookup on Fast/frag0 UDP GRO

   - ethtool: netlink: add missing netdev_features_change() call

   - net: Update window_clamp if SOCK_RCVBUF is set

   - igc: Fix returning wrong statistics

   - ch_ktls: fix multiple leaks and corner cases in Chelsio TLS offload

   - tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies

   - r8169: disable hw csum for short packets on all chip versions

   - vrf: Fix fast path output packet handling with async Netfilter
     rules"

* tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
  lan743x: fix use of uninitialized variable
  net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO
  net: udp: fix UDP header access on Fast/frag0 UDP GRO
  devlink: Avoid overwriting port attributes of registered port
  vrf: Fix fast path output packet handling with async Netfilter rules
  cosa: Add missing kfree in error path of cosa_write
  net: switch to the kernel.org patchwork instance
  ch_ktls: stop the txq if reaches threshold
  ch_ktls: tcb update fails sometimes
  ch_ktls/cxgb4: handle partial tag alone SKBs
  ch_ktls: don't free skb before sending FIN
  ch_ktls: packet handling prior to start marker
  ch_ktls: Correction in middle record handling
  ch_ktls: missing handling of header alone
  ch_ktls: Correction in trimmed_len calculation
  cxgb4/ch_ktls: creating skbs causes panic
  ch_ktls: Update cheksum information
  ch_ktls: Correction in finding correct length
  cxgb4/ch_ktls: decrypted bit is not enough
  net/x25: Fix null-ptr-deref in x25_connect
  ...
2020-11-12 14:02:04 -08:00
Linus Torvalds
fcfb67918c Merge tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "Make the intel_pstate driver behave as expected when it operates in
  the passive mode with HWP enabled and the 'powersave' governor on top
  of it"

* tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Take CPUFREQ_GOV_STRICT_TARGET into account
  cpufreq: Add strict_target to struct cpufreq_policy
  cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGET
  cpufreq: Introduce governor flags
2020-11-12 11:03:38 -08:00
Linus Torvalds
3d5e28bff7 Merge branch 'stable/for-linus-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb fixes from Konrad Rzeszutek Wilk:
 "Two tiny fixes for issues that make drivers under Xen unhappy under
  certain conditions"

* 'stable/for-linus-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
  swiotlb: remove the tbl_dma_addr argument to swiotlb_tbl_map_single
  swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb"
2020-11-11 14:15:06 -08:00
Linus Torvalds
eccc876724 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull core dump fix from Al Viro:
 "Fix for multithreaded coredump playing fast and loose with getting
  registers of secondary threads; if a secondary gets caught in the
  middle of exit(2), the conditition it will be stopped in for dumper to
  examine might be unusual enough for things to go wrong.

  Quite a few architectures are fine with that, but some are not."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  don't dump the threads that had been already exiting when zapped.
2020-11-10 10:33:55 -08:00
Rafael J. Wysocki
9a2a9ebc0a cpufreq: Introduce governor flags
A new cpufreq governor flag will be added subsequently, so replace
the bool dynamic_switching fleid in struct cpufreq_governor with a
flags field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING to set for
the "dynamic switching" governors instead of it.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-11-10 18:31:17 +01:00
Eddy Wu
b4e00444ca fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent
current->group_leader->exit_signal may change during copy_process() if
current->real_parent exits.

Move the assignment inside tasklist_lock to avoid the race.

Signed-off-by: Eddy Wu <eddy_wu@trendmicro.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-08 11:18:39 -08:00
Linus Torvalds
100e38914a Merge tag 'perf-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
 "A single fix for the perf core plugging a memory leak in the address
  filter parser"

* tag 'perf-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix a memory leak in perf_event_parse_addr_filter()
2020-11-08 10:05:10 -08:00
Linus Torvalds
aaaaa7ecdc Merge tag 'locking-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fix from Thomas Gleixner:
 "A single fix for the futex code where an intermediate state in the
  underlying RT mutex was not handled correctly and triggering a BUG()
  instead of treating it as another variant of retry condition"

* tag 'locking-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Handle transient "ownerless" rtmutex state correctly
2020-11-08 09:56:37 -08:00
Linus Torvalds
15a9844458 Merge tag 'irq-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of fixes for interrupt chip drivers:

   - Fix the fallout of the IPI as interrupt conversion in Kconfig and
     the BCM2836 interrupt chip driver

   - Fixes for interrupt affinity setting and the handling of
     hierarchical irq domains in the SiFive PLIC driver

   - Make the unmapped event handling in the TI SCI driver work
     correctly

   - A few minor fixes and cleanups in various chip drivers and Kconfig"

* tag 'irq-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  dt-bindings: irqchip: ti, sci-inta: Fix diagram indentation for unmapped events
  irqchip/ti-sci-inta: Add support for unmapped event handling
  dt-bindings: irqchip: ti, sci-inta: Update for unmapped event handling
  irqchip/renesas-intc-irqpin: Merge irlm_bit and needs_irlm
  irqchip/sifive-plic: Fix chip_data access within a hierarchy
  irqchip/sifive-plic: Fix broken irq_set_affinity() callback
  irqchip/stm32-exti: Add all LP timer exti direct events support
  irqchip/bcm2836: Fix missing __init annotation
  irqchip/mips: Drop selection of IRQ_DOMAIN_HIERARCHY
  irqchip/mst: Make mst_intc_of_init static
  irqchip/mst: MST_IRQ should depend on ARCH_MEDIATEK or ARCH_MSTARV7
  genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY
2020-11-08 09:52:57 -08:00
Linus Torvalds
6a8d0d283d Merge tag 'core-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull entry code fix from Thomas Gleixner:
 "A single fix for the generic entry code to correct the wrong
  assumption that the lockdep interrupt state needs not to be
  established before calling the RCU check"

* tag 'core-urgent-2020-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Fix the incorrect ordering of lockdep and RCU check
2020-11-08 09:51:28 -08:00
Mike Galbraith
9f5d1c336a futex: Handle transient "ownerless" rtmutex state correctly
Gratian managed to trigger the BUG_ON(!newowner) in fixup_pi_state_owner().
This is one possible chain of events leading to this:

Task Prio       Operation
T1   120	lock(F)
T2   120	lock(F)   -> blocks (top waiter)
T3   50 (RT)	lock(F)   -> boosts T1 and blocks (new top waiter)
XX   		timeout/  -> wakes T2
		signal
T1   50		unlock(F) -> wakes T3 (rtmutex->owner == NULL, waiter bit is set)
T2   120	cleanup   -> try_to_take_mutex() fails because T3 is the top waiter
     			     and the lower priority T2 cannot steal the lock.
     			  -> fixup_pi_state_owner() sees newowner == NULL -> BUG_ON()

The comment states that this is invalid and rt_mutex_real_owner() must
return a non NULL owner when the trylock failed, but in case of a queued
and woken up waiter rt_mutex_real_owner() == NULL is a valid transient
state. The higher priority waiter has simply not yet managed to take over
the rtmutex.

The BUG_ON() is therefore wrong and this is just another retry condition in
fixup_pi_state_owner().

Drop the locks, so that T3 can make progress, and then try the fixup again.

Gratian provided a great analysis, traces and a reproducer. The analysis is
to the point, but it confused the hell out of that tglx dude who had to
page in all the futex horrors again. Condensed version is above.

[ tglx: Wrote comment and changelog ]

Fixes: c1e2f0eaf0 ("futex: Avoid violating the 10th rule of futex")
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87a6w6x7bb.fsf@ni.com
Link: https://lore.kernel.org/r/87sg9pkvf7.fsf@nanos.tec.linutronix.de
2020-11-07 22:07:04 +01:00
kiyin(尹亮)
7bdb157cde perf/core: Fix a memory leak in perf_event_parse_addr_filter()
As shown through runtime testing, the "filename" allocation is not
always freed in perf_event_parse_addr_filter().

There are three possible ways that this could happen:

 - It could be allocated twice on subsequent iterations through the loop,
 - or leaked on the success path,
 - or on the failure path.

Clean up the code flow to make it obvious that 'filename' is always
freed in the reallocation path and in the two return paths as well.

We rely on the fact that kfree(NULL) is NOP and filename is initialized
with NULL.

This fixes the leak. No other side effects expected.

[ Dan Carpenter: cleaned up the code flow & added a changelog. ]
[ Ingo Molnar: updated the changelog some more. ]

Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Signed-off-by: "kiyin(尹亮)" <kiyin@tencent.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu>
Cc: Anthony Liguori <aliguori@amazon.com>
--
 kernel/events/core.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
2020-11-07 13:07:26 +01:00
Jakub Kicinski
86bbf01977 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-11-06

1) Pre-allocated per-cpu hashmap needs to zero-fill reused element, from David.

2) Tighten bpf_lsm function check, from KP.

3) Fix bpftool attaching to flow dissector, from Lorenz.

4) Use -fno-gcse for the whole kernel/bpf/core.c instead of function attribute, from Ard.

* git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Update verification logic for LSM programs
  bpf: Zero-fill re-used per-cpu map element
  bpf: BPF_PRELOAD depends on BPF_SYSCALL
  tools/bpftool: Fix attaching flow dissector
  libbpf: Fix possible use after free in xsk_socket__delete
  libbpf: Fix null dereference in xsk_socket__delete
  libbpf, hashmap: Fix undefined behavior in hash_bits
  bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
  tools, bpftool: Remove two unused variables.
  tools, bpftool: Avoid array index warnings.
  xsk: Fix possible memory leak at socket close
  bpf: Add struct bpf_redir_neigh forward declaration to BPF helper defs
  samples/bpf: Set rlimit for memlock to infinity in all samples
  bpf: Fix -Wshadow warnings
  selftest/bpf: Fix profiler test using CO-RE relocation for enums
====================

Link: https://lore.kernel.org/r/20201106221759.24143-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-06 17:49:34 -08:00
KP Singh
6f64e47783 bpf: Update verification logic for LSM programs
The current logic checks if the name of the BTF type passed in
attach_btf_id starts with "bpf_lsm_", this is not sufficient as it also
allows attachment to non-LSM hooks like the very function that performs
this check, i.e. bpf_lsm_verify_prog.

In order to ensure that this verification logic allows attachment to
only LSM hooks, the LSM_HOOK definitions in lsm_hook_defs.h are used to
generate a BTF_ID set. Upon verification, the attach_btf_id of the
program being attached is checked for presence in this set.

Fixes: 9e4e01dfd3 ("bpf: lsm: Implement attach, detach and execution")
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201105230651.2621917-1-kpsingh@chromium.org
2020-11-06 13:15:21 -08:00
David Verbeiren
d3bec0138b bpf: Zero-fill re-used per-cpu map element
Zero-fill element values for all other cpus than current, just as
when not using prealloc. This is the only way the bpf program can
ensure known initial values for all cpus ('onallcpus' cannot be
set when coming from the bpf program).

The scenario is: bpf program inserts some elements in a per-cpu
map, then deletes some (or userspace does). When later adding
new elements using bpf_map_update_elem(), the bpf program can
only set the value of the new elements for the current cpu.
When prealloc is enabled, previously deleted elements are re-used.
Without the fix, values for other cpus remain whatever they were
when the re-used entry was previously freed.

A selftest is added to validate correct operation in above
scenario as well as in case of LRU per-cpu map element re-use.

Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Signed-off-by: David Verbeiren <david.verbeiren@tessares.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201104112332.15191-1-david.verbeiren@tessares.net
2020-11-05 19:55:57 -08:00
Randy Dunlap
7c0afcad75 bpf: BPF_PRELOAD depends on BPF_SYSCALL
Fix build error when BPF_SYSCALL is not set/enabled but BPF_PRELOAD is
by making BPF_PRELOAD depend on BPF_SYSCALL.

ERROR: modpost: "bpf_preload_ops" [kernel/bpf/preload/bpf_preload.ko] undefined!

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201105195109.26232-1-rdunlap@infradead.org
2020-11-05 18:49:29 -08:00
Linus Torvalds
3249fe4563 Merge tag 'trace-v5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Fix off-by-one error in retrieving the context buffer for
   trace_printk()

 - Fix off-by-one error in stack nesting limit

 - Fix recursion to not make all NMI code false positive as recursing

 - Stop losing events in function tracing when transitioning between irq
   context

 - Stop losing events in ring buffer when transitioning between irq
   context

 - Fix return code of error pointer in parse_synth_field() to prevent
   NULL pointer dereference.

 - Fix false positive of NMI recursion in kprobe event handling

* tag 'trace-v5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  kprobes: Tell lockdep about kprobe nesting
  tracing: Make -ENOMEM the default error for parse_synth_field()
  ring-buffer: Fix recursion protection transitions between interrupt context
  tracing: Fix the checking of stackidx in __ftrace_trace_stack
  ftrace: Handle tracing when switching between context
  ftrace: Fix recursion check for NMI test
  tracing: Fix out of bounds write in get_trace_buf
2020-11-05 11:41:38 -08:00
Linus Torvalds
f786dfa374 Merge tag 'pm-5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix the device links support in runtime PM, correct mistakes in
  the cpuidle documentation, fix the handling of policy limits changes
  in the schedutil cpufreq governor, fix assorted issues in the OPP
  (operating performance points) framework and make one janitorial
  change.

  Specifics:

   - Unify the handling of managed and stateless device links in the
     runtime PM framework and prevent runtime PM references to devices
     from being leaked after device link removal (Rafael Wysocki).

   - Fix two mistakes in the cpuidle documentation (Julia Lawall).

   - Prevent the schedutil cpufreq governor from missing policy limits
     updates in some cases (Viresh Kumar).

   - Prevent static OPPs from being dropped by mistake (Viresh Kumar).

   - Prevent helper function in the OPP framework from returning
     prematurely (Viresh Kumar).

   - Prevent opp_table_lock from being held too long during removal of
     OPP tables with no more active references (Viresh Kumar).

   - Drop redundant semicolon from the Intel RAPL power capping driver
     (Tom Rix)"

* tag 'pm-5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: runtime: Resume the device earlier in __device_release_driver()
  PM: runtime: Drop pm_runtime_clean_up_links()
  PM: runtime: Drop runtime PM references to supplier on link removal
  powercap/intel_rapl: remove unneeded semicolon
  Documentation: PM: cpuidle: correct path name
  Documentation: PM: cpuidle: correct typo
  cpufreq: schedutil: Don't skip freq update if need_freq_update is set
  opp: Reduce the size of critical section in _opp_table_kref_release()
  opp: Fix early exit from dev_pm_opp_register_set_opp_helper()
  opp: Don't always remove static OPPs in _of_add_opp_table_v1()
2020-11-05 11:04:29 -08:00
Thomas Gleixner
9d820f68b2 entry: Fix the incorrect ordering of lockdep and RCU check
When an exception/interrupt hits kernel space and the kernel is not
currently in the idle task then RCU must be watching.

irqentry_enter() validates this via rcu_irq_enter_check_tick(), which in
turn invokes lockdep when taking a lock. But at that point lockdep does not
yet know about the fact that interrupts have been disabled by the CPU,
which triggers a lockdep splat complaining about inconsistent state.

Invoking trace_hardirqs_off() before rcu_irq_enter_check_tick() defeats the
point of rcu_irq_enter_check_tick() because trace_hardirqs_off() uses RCU.

So use the same sequence as for the idle case and tell lockdep about the
irq state change first, invoke the RCU check and then do the lockdep and
tracer update.

Fixes: a5497bab5f ("entry: Provide generic interrupt entry/exit code")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87y2jhl19s.fsf@nanos.tec.linutronix.de
2020-11-04 18:06:14 +01:00
Steven Rostedt (VMware)
645f224e7b kprobes: Tell lockdep about kprobe nesting
Since the kprobe handlers have protection that prohibits other handlers from
executing in other contexts (like if an NMI comes in while processing a
kprobe, and executes the same kprobe, it will get fail with a "busy"
return). Lockdep is unaware of this protection. Use lockdep's nesting api to
differentiate between locks taken in INT3 context and other context to
suppress the false warnings.

Link: https://lore.kernel.org/r/20201102160234.fa0ae70915ad9e2b21c08b85@kernel.org

Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-04 09:46:06 -05:00
Steven Rostedt (VMware)
561ca66910 tracing: Make -ENOMEM the default error for parse_synth_field()
parse_synth_field() returns a pointer and requires that errors get
surrounded by ERR_PTR(). The ret variable is initialized to zero, but should
never be used as zero, and if it is, it could cause a false return code and
produce a NULL pointer dereference. It makes no sense to set ret to zero.

Set ret to -ENOMEM (the most common error case), and have any other errors
set it to something else. This removes the need to initialize ret on *every*
error branch.

Fixes: 761a8c58db ("tracing, synthetic events: Replace buggy strcat() with seq_buf operations")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-02 15:58:32 -05:00
Steven Rostedt (VMware)
b02414c8f0 ring-buffer: Fix recursion protection transitions between interrupt context
The recursion protection of the ring buffer depends on preempt_count() to be
correct. But it is possible that the ring buffer gets called after an
interrupt comes in but before it updates the preempt_count(). This will
trigger a false positive in the recursion code.

Use the same trick from the ftrace function callback recursion code which
uses a "transition" bit that gets set, to allow for a single recursion for
to handle transitions between contexts.

Cc: stable@vger.kernel.org
Fixes: 567cd4da54 ("ring-buffer: User context bit recursion checking")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-02 15:58:32 -05:00
Lukas Bulwahn
3b70ae4f5c kernel/hung_task.c: make type annotations consistent
Commit 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
removed various __user annotations from function signatures as part of
its refactoring.

It also removed the __user annotation for proc_dohung_task_timeout_secs()
at its declaration in sched/sysctl.h, but not at its definition in
kernel/hung_task.c.

Hence, sparse complains:

  kernel/hung_task.c:271:5: error: symbol 'proc_dohung_task_timeout_secs' redeclared with different type (incompatible argument 3 (different address spaces))

Adjust the annotation at the definition fitting to that refactoring to make
sparse happy again, which also resolves this warning from sparse:

  kernel/hung_task.c:277:52: warning: incorrect type in argument 3 (different address spaces)
  kernel/hung_task.c:277:52:    expected void *
  kernel/hung_task.c:277:52:    got void [noderef] __user *buffer

No functional change. No change in object code.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ignatov <rdna@fb.com>
Link: https://lkml.kernel.org/r/20201028130541.20320-1-lukas.bulwahn@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-02 12:14:19 -08:00
Zqiang
6993d0fdbe kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled
There is a small race window when a delayed work is being canceled and
the work still might be queued from the timer_fn:

	CPU0						CPU1
kthread_cancel_delayed_work_sync()
   __kthread_cancel_work_sync()
     __kthread_cancel_work()
        work->canceling++;
					      kthread_delayed_work_timer_fn()
						   kthread_insert_work();

BUG: kthread_insert_work() should not get called when work->canceling is
set.

Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-02 12:14:19 -08:00
Oleg Nesterov
7b3c36fc4c ptrace: fix task_join_group_stop() for the case when current is traced
This testcase

	#include <stdio.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>
	#include <pthread.h>
	#include <assert.h>

	void *tf(void *arg)
	{
		return NULL;
	}

	int main(void)
	{
		int pid = fork();
		if (!pid) {
			kill(getpid(), SIGSTOP);

			pthread_t th;
			pthread_create(&th, NULL, tf, NULL);

			return 0;
		}

		waitpid(pid, NULL, WSTOPPED);

		ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
		waitpid(pid, NULL, 0);

		ptrace(PTRACE_CONT, pid, 0,0);
		waitpid(pid, NULL, 0);

		int status;
		int thread = waitpid(-1, &status, 0);
		assert(thread > 0 && thread != pid);
		assert(status == 0x80137f);

		return 0;
	}

fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().

This is because task_join_group_stop() has 2 problems when current is traced:

	1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee
	   can be woken up by debugger and it can clone another thread which
	   should join the group-stop.

	   We need to check group_stop_count || SIGNAL_STOP_STOPPED.

	2. If SIGNAL_STOP_STOPPED is already set, we should not increment
	   sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread
	   should stop without another do_notify_parent_cldstop() report.

To clarify, the problem is very old and we should blame
ptrace_init_task().  But now that we have task_join_group_stop() it makes
more sense to fix this helper to avoid the code duplication.

Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <christian@brauner.io>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-02 12:14:19 -08:00
Viresh Kumar
23a881852f cpufreq: schedutil: Don't skip freq update if need_freq_update is set
The cpufreq policy's frequency limits (min/max) can get changed at any
point of time, while schedutil is trying to update the next frequency.
Though the schedutil governor has necessary locking and support in place
to make sure we don't miss any of those updates, there is a corner case
where the governor will find that the CPU is already running at the
desired frequency and so may skip an update.

For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
and is running at 1 GHz currently. Schedutil tries to update the
frequency to 1.2 GHz, during this time the policy limits get changed as
policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
frequency at various instances, we will eventually set the frequency to
1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.

Now lets say the policy limits get changed back at this time with
policy->min as 1 GHz. The next time schedutil is invoked by the
scheduler, we will reevaluate the next frequency (because
need_freq_update will get set due to limits change event) and lets say
we want to set the frequency to 1.2 GHz again. At this point
sugov_update_next_freq() will find the next_freq == current_freq and
will abort the update, while the CPU actually runs at 1.4 GHz.

Until now need_freq_update was used as a flag to indicate that the
policy's frequency limits have changed, and that we should consider the
new limits while reevaluating the next frequency.

This patch fixes the above mentioned issue by extending the purpose of
the need_freq_update flag. If this flag is set now, the schedutil
governor will not try to abort a frequency change even if next_freq ==
current_freq.

As similar behavior is required in the case of
CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
set to false if that flag is set for the driver.

We also don't need to consider the need_freq_update flag in
sugov_update_single() anymore to handle the special case of busy CPU, as
we won't abort a frequency update anymore.

Reported-by: zhuguangqing <zhuguangqing@xiaomi.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rearrange code to avoid a branch ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-02 17:57:49 +01:00
Qiujun Huang
906695e593 tracing: Fix the checking of stackidx in __ftrace_trace_stack
The array size is FTRACE_KSTACK_NESTING, so the index FTRACE_KSTACK_NESTING
is illegal too. And fix two typos by the way.

Link: https://lkml.kernel.org/r/20201031085714.2147-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-02 11:21:40 -05:00
Christoph Hellwig
fc0021aa34 swiotlb: remove the tbl_dma_addr argument to swiotlb_tbl_map_single
The tbl_dma_addr argument is used to check the DMA boundary for the
allocations, and thus needs to be a dma_addr_t.  swiotlb-xen instead
passed a physical address, which could lead to incorrect results for
strange offsets.  Fix this by removing the parameter entirely and hard
code the DMA address for io_tlb_start instead.

Fixes: 91ffe4ad53 ("swiotlb-xen: introduce phys_to_dma/dma_to_phys translations")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-11-02 10:10:39 -05:00
Stefano Stabellini
e9696d259d swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb"
kernel/dma/swiotlb.c:swiotlb_init gets called first and tries to
allocate a buffer for the swiotlb. It does so by calling

  memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);

If the allocation must fail, no_iotlb_memory is set.

Later during initialization swiotlb-xen comes in
(drivers/xen/swiotlb-xen.c:xen_swiotlb_init) and given that io_tlb_start
is != 0, it thinks the memory is ready to use when actually it is not.

When the swiotlb is actually needed, swiotlb_tbl_map_single gets called
and since no_iotlb_memory is set the kernel panics.

Instead, if swiotlb-xen.c:xen_swiotlb_init knew the swiotlb hadn't been
initialized, it would do the initialization itself, which might still
succeed.

Fix the panic by setting io_tlb_start to 0 on swiotlb initialization
failure, and also by setting no_iotlb_memory to false on swiotlb
initialization success.

Fixes: ac2cbab21f ("x86: Don't panic if can not alloc buffer for swiotlb")

Reported-by: Elliott Mitchell <ehem+xen@m5p.com>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-11-02 10:10:37 -05:00
Steven Rostedt (VMware)
726b3d3f14 ftrace: Handle tracing when switching between context
When an interrupt or NMI comes in and switches the context, there's a delay
from when the preempt_count() shows the update. As the preempt_count() is
used to detect recursion having each context have its own bit get set when
tracing starts, and if that bit is already set, it is considered a recursion
and the function exits. But if this happens in that section where context
has changed but preempt_count() has not been updated, this will be
incorrectly flagged as a recursion.

To handle this case, create another bit call TRANSITION and test it if the
current context bit is already set. Flag the call as a recursion if the
TRANSITION bit is already set, and if not, set it and continue. The
TRANSITION bit will be cleared normally on the return of the function that
set it, or if the current context bit is clear, set it and clear the
TRANSITION bit to allow for another transition between the current context
and an even higher one.

Cc: stable@vger.kernel.org
Fixes: edc15cafcb ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-02 08:52:18 -05:00
Steven Rostedt (VMware)
ee11b93f95 ftrace: Fix recursion check for NMI test
The code that checks recursion will work to only do the recursion check once
if there's nested checks. The top one will do the check, the other nested
checks will see recursion was already checked and return zero for its "bit".
On the return side, nothing will be done if the "bit" is zero.

The problem is that zero is returned for the "good" bit when in NMI context.
This will set the bit for NMIs making it look like *all* NMI tracing is
recursing, and prevent tracing of anything in NMI context!

The simple fix is to return "bit + 1" and subtract that bit on the end to
get the real bit.

Cc: stable@vger.kernel.org
Fixes: edc15cafcb ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-02 08:52:18 -05:00
Qiujun Huang
c1acb4ac1a tracing: Fix out of bounds write in get_trace_buf
The nesting count of trace_printk allows for 4 levels of nesting. The
nesting counter starts at zero and is incremented before being used to
retrieve the current context's buffer. But the index to the buffer uses the
nesting counter after it was incremented, and not its original number,
which in needs to do.

Link: https://lkml.kernel.org/r/20201029161905.4269-1-hqjagain@gmail.com

Cc: stable@vger.kernel.org
Fixes: 3d9622c12c ("tracing: Add barrier to trace_printk() buffer nesting modification")
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-02 08:52:18 -05:00
Linus Torvalds
4312e0e8d3 Merge tag 'timers-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A few fixes for timers/timekeeping:

   - Prevent undefined behaviour in the timespec64_to_ns() conversion
     which is used for converting user supplied time input to
     nanoseconds. It lacked overflow protection.

   - Mark sched_clock_read_begin/retry() to prevent recursion in the
     tracer

   - Remove unused debug functions in the hrtimer and timerlist code"

* tag 'timers-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Prevent undefined behaviour in timespec64_to_ns()
  timers: Remove unused inline funtion debug_timer_free()
  hrtimer: Remove unused inline function debug_hrtimer_free()
  time/sched_clock: Mark sched_clock_read_begin/retry() as notrace
2020-11-01 11:13:45 -08:00
Linus Torvalds
82423b46fc Merge tag 'smp-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fix from Thomas Gleixner:
 "A single fix for stop machine.

  Mark functions no trace to prevent a crash caused by recursion when
  enabling or disabling a tracer on RISC-V (probably all architectures
  which patch through stop machine)"

* tag 'smp-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stop_machine, rcu: Mark functions as notrace
2020-11-01 11:11:38 -08:00
Linus Torvalds
8d99084efc Merge tag 'locking-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "A couple of locking fixes:

   - Fix incorrect failure injection handling in the fuxtex code

   - Prevent a preemption warning in lockdep when tracking
     local_irq_enable() and interrupts are already enabled

   - Remove more raw_cpu_read() usage from lockdep which causes state
     corruption on !X86 architectures.

   - Make the nr_unused_locks accounting in lockdep correct again"

* tag 'locking-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix nr_unused_locks accounting
  locking/lockdep: Remove more raw_cpu_read() usage
  futex: Fix incorrect should_fail_futex() handling
  lockdep: Fix preemption WARN for spurious IRQ-enable
2020-11-01 11:08:17 -08:00
Linus Torvalds
53760f9b74 Merge tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull more flexible-array member conversions from Gustavo A. R. Silva:
 "Replace zero-length arrays with flexible-array members"

* tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  printk: ringbuffer: Replace zero-length array with flexible-array member
  net/smc: Replace zero-length array with flexible-array member
  net/mlx5: Replace zero-length array with flexible-array member
  mei: hw: Replace zero-length array with flexible-array member
  gve: Replace zero-length array with flexible-array member
  Bluetooth: btintel: Replace zero-length array with flexible-array member
  scsi: target: tcmu: Replace zero-length array with flexible-array member
  ima: Replace zero-length array with flexible-array member
  enetc: Replace zero-length array with flexible-array member
  fs: Replace zero-length array with flexible-array member
  Bluetooth: Replace zero-length array with flexible-array member
  params: Replace zero-length array with flexible-array member
  tracepoint: Replace zero-length array with flexible-array member
  platform/chrome: cros_ec_proto: Replace zero-length array with flexible-array member
  platform/chrome: cros_ec_commands: Replace zero-length array with flexible-array member
  mailbox: zynqmp-ipi-message: Replace zero-length array with flexible-array member
  dmaengine: ti-cppi5: Replace zero-length array with flexible-array member
2020-10-31 14:31:28 -07:00
Gustavo A. R. Silva
a38283da05 printk: ringbuffer: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-30 16:57:42 -05:00
Linus Torvalds
8843f40550 Merge tag 'pm-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These fix a few issues related to running intel_pstate in the passive
  mode with HWP enabled, correct the handling of the max_cstate module
  parameter in intel_idle and make a few janitorial changes.

  Specifics:

   - Modify Kconfig to prevent configuring either the "conservative" or
     the "ondemand" governor as the default cpufreq governor if
     intel_pstate is selected, in which case "schedutil" is the default
     choice for the default governor setting (Rafael Wysocki).

   - Modify the cpufreq core, intel_pstate and the schedutil governor to
     avoid missing updates of the HWP max limit when intel_pstate
     operates in the passive mode with HWP enabled (Rafael Wysocki).

   - Fix max_cstate module parameter handling in intel_idle for
     processor models with C-state tables coming from ACPI (Chen Yu).

   - Clean up assorted pieces of power management code (Jackie Zamow,
     Tom Rix, Zhang Qilong)"

* tag 'pm-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is set
  cpufreq: Introduce cpufreq_driver_test_flags()
  cpufreq: speedstep: remove unneeded semicolon
  PM: sleep: fix typo in kernel/power/process.c
  intel_idle: Fix max_cstate for processor models without C-state tables
  cpufreq: intel_pstate: Avoid missing HWP max updates in passive mode
  cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS driver flag
  cpufreq: Avoid configuring old governors as default with intel_pstate
  cpufreq: e_powersaver: remove unreachable break
2020-10-30 12:45:04 -07:00
Peter Zijlstra
1a39340865 lockdep: Fix nr_unused_locks accounting
Chris reported that commit 24d5a3bffef1 ("lockdep: Fix
usage_traceoverflow") breaks the nr_unused_locks validation code
triggered by /proc/lockdep_stats.

By fully splitting LOCK_USED and LOCK_USED_READ it becomes a bad
indicator for accounting nr_unused_locks; simplyfy by using any first
bit.

Fixes: 24d5a3bffef1 ("lockdep: Fix usage_traceoverflow")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://lkml.kernel.org/r/20201027124834.GL2628@hirez.programming.kicks-ass.net
2020-10-30 17:07:18 +01:00
Peter Zijlstra
d48e385003 locking/lockdep: Remove more raw_cpu_read() usage
I initially thought raw_cpu_read() was OK, since if it is !0 we have
IRQs disabled and can't get migrated, so if we get migrated both CPUs
must have 0 and it doesn't matter which 0 we read.

And while that is true; it isn't the whole store, on pretty much all
architectures (except x86) this can result in computing the address for
one CPU, getting migrated, the old CPU continuing execution with another
task (possibly setting recursion) and then the new CPU reading the value
of the old CPU, which is no longer 0.

Similer to:

  baffd723e4 ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201026152256.GB2651@hirez.programming.kicks-ass.net
2020-10-30 17:07:18 +01:00
Rafael J. Wysocki
dea47cf45a Merge branches 'pm-cpuidle' and 'pm-sleep'
* pm-cpuidle:
  intel_idle: Fix max_cstate for processor models without C-state tables

* pm-sleep:
  PM: sleep: fix typo in kernel/power/process.c
2020-10-30 16:30:14 +01:00
Ard Biesheuvel
080b6f4076 bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
Commit 3193c0836 ("bpf: Disable GCC -fgcse optimization for
___bpf_prog_run()") introduced a __no_fgcse macro that expands to a
function scope __attribute__((optimize("-fno-gcse"))), to disable a
GCC specific optimization that was causing trouble on x86 builds, and
was not expected to have any positive effect in the first place.

However, as the GCC manual documents, __attribute__((optimize))
is not for production use, and results in all other optimization
options to be forgotten for the function in question. This can
cause all kinds of trouble, but in one particular reported case,
it causes -fno-asynchronous-unwind-tables to be disregarded,
resulting in .eh_frame info to be emitted for the function.

This reverts commit 3193c0836, and instead, it disables the -fgcse
optimization for the entire source file, but only when building for
X86 using GCC with CONFIG_BPF_JIT_ALWAYS_ON disabled. Note that the
original commit states that CONFIG_RETPOLINE=n triggers the issue,
whereas CONFIG_RETPOLINE=y performs better without the optimization,
so it is kept disabled in both cases.

Fixes: 3193c0836f ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=GFHpzoj_hCoBQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20201028171506.15682-2-ardb@kernel.org
2020-10-29 20:01:46 -07:00
Gustavo A. R. Silva
fa29c9c11d params: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29 17:22:59 -05:00
Gustavo A. R. Silva
9d0a49c702 tracepoint: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29 17:22:59 -05:00
Rafael J. Wysocki
d1e7c2996e cpufreq: schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is set
Because sugov_update_next_freq() may skip a frequency update even if
the need_freq_update flag has been set for the policy at hand, policy
limits updates may not take effect as expected.

For example, if the intel_pstate driver operates in the passive mode
with HWP enabled, it needs to update the HWP min and max limits when
the policy min and max limits change, respectively, but that may not
happen if the target frequency does not change along with the limit
at hand.  In particular, if the policy min is changed first, causing
the target frequency to be adjusted to it, and the policy max limit
is changed later to the same value, the HWP max limit will not be
updated to follow it as expected, because the target frequency is
still equal to the policy min limit and it will not change until
that limit is updated.

To address this issue, modify get_next_freq() to let the driver
callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
is set regardless of whether or not the new frequency to set is
equal to the previous one.

Fixes: f6ebbcf08f ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f4 cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca cpufreq: Introduce cpufreq_driver_test_flags()
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-29 14:12:18 +01:00
Al Viro
77f6ab8b77 don't dump the threads that had been already exiting when zapped.
Coredump logics needs to report not only the registers of the dumping
thread, but (since 2.5.43) those of other threads getting killed.

Doing that might require extra state saved on the stack in asm glue at
kernel entry; signal delivery logics does that (we need to be able to
save sigcontext there, at the very least) and so does seccomp.

That covers all callers of do_coredump().  Secondary threads get hit with
SIGKILL and caught as soon as they reach exit_mm(), which normally happens
in signal delivery, so those are also fine most of the time.  Unfortunately,
it is possible to end up with secondary zapped when it has already entered
exit(2) (or, worse yet, is oopsing).  In those cases we reach exit_mm()
when mm->core_state is already set, but the stack contents is not what
we would have in signal delivery.

At least on two architectures (alpha and m68k) it leads to infoleaks - we
end up with a chunk of kernel stack written into coredump, with the contents
consisting of normal C stack frames of the call chain leading to exit_mm()
instead of the expected copy of userland registers.  In case of alpha we
leak 312 bytes of stack.  Other architectures (including the regset-using
ones) might have similar problems - the normal user of regsets is ptrace
and the state of tracee at the time of such calls is special in the same
way signal delivery is.

Note that had the zapper gotten to the exiting thread slightly later,
it wouldn't have been included into coredump anyway - we skip the threads
that have already cleared their ->mm.  So let's pretend that zapper always
loses the race.  IOW, have exit_mm() only insert into the dumper list if
we'd gotten there from handling a fatal signal[*]

As the result, the callers of do_exit() that have *not* gone through get_signal()
are not seen by coredump logics as secondary threads.  Which excludes voluntary
exit()/oopsen/traps/etc.  The dumper thread itself is unaffected by that,
so seccomp is fine.

[*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed
out that PF_SIGNALED is already doing just what we need.

Cc: stable@vger.kernel.org
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-28 16:39:49 -04:00
Linus Torvalds
23859ae444 Merge tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
 "Fix synthetic event "strcat" overrun

  New synthetic event code used strcat() and miscalculated the ending,
  causing the concatenation to write beyond the allocated memory.

  Instead of using strncat(), the code is switched over to seq_buf which
  has all the mechanisms in place to protect against writing more than
  what is allocated, and cleans up the code a bit"

* tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing, synthetic events: Replace buggy strcat() with seq_buf operations
2020-10-28 12:05:14 -07:00
Mateusz Nosek
921c7ebd13 futex: Fix incorrect should_fail_futex() handling
If should_futex_fail() returns true in futex_wake_pi(), then the 'ret'
variable is set to -EFAULT and then immediately overwritten. So the failure
injection is non-functional.

Fix it by actually leaving the function and returning -EFAULT.

The Fixes tag is kinda blury because the initial commit which introduced
failure injection was already sloppy, but the below mentioned commit broke
it completely.

[ tglx: Massaged changelog ]

Fixes: 6b4f4bc9cb ("locking/futex: Allow low-level atomic operations to return -EAGAIN")
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com
2020-10-28 15:48:51 +01:00
Jackie Zamow
4d4ce8053b PM: sleep: fix typo in kernel/power/process.c
Fix a typo in a comment in freeze_processes().

Signed-off-by: Jackie Zamow <jackie.zamow@gmail.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-27 19:11:44 +01:00