Commit Graph

32291 Commits

Florian Westphal
25b327d4f8 selftests: nft_concat_range: add socat support
There are different flavors of 'nc' around; this script fails on
my test VM because 'nc' is 'nmap-ncat', which isn't 100% compatible.

Add socat support and use it if available.

Signed-off-by: Florian Westphal <fw@strlen.de>
2022-09-07 15:06:26 +02:00
Yonghong Song
ae63c10fc2 selftests/bpf: Add tracing_struct test in DENYLIST.s390x
Add the tracing_struct test to DENYLIST.s390x since s390x does not
support trampolines yet.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152723.2081551-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-06 19:51:53 -07:00
Yonghong Song
a7c2ca3a2f selftests/bpf: Use BPF_PROG2 for some fentry programs without struct arguments
Use BPF_PROG2 instead of BPF_PROG for programs in progs/timer.c
to test BPF_PROG2 for cases without struct arguments.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152718.2081091-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-06 19:51:14 -07:00
Yonghong Song
1642a3945e selftests/bpf: Add struct argument tests with fentry/fexit programs.
Add various struct argument tests with fentry/fexit programs.
Also add one test with a kernel function which does not take any
arguments, to exercise the BPF_PROG2 macro in that situation.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152713.2080039-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-06 19:51:14 -07:00
Yonghong Song
34586d29f8 libbpf: Add new BPF_PROG2 macro
To support struct arguments in trampoline based programs,
the existing BPF_PROG macro doesn't work any more, since
the type size is needed to determine whether a parameter
takes one or two registers. So this patch adds a new
BPF_PROG2 macro to support such trampoline programs.

The idea was suggested by Andrii. For example, suppose the
to-be-traced function has a signature like
  typedef struct {
       void *x;
       int t;
  } sockptr;
  int blah(sockptr x, char y);

With the new BPF_PROG2 macro, the arguments can be
represented as
  __bpf_prog_call(
     ({ union {
          struct { __u64 x, y; } ___z;
          sockptr x;
        } ___tmp = { .___z = { ctx[0], ctx[1] }};
        ___tmp.x;
     }),
     ({ union {
          struct { __u8 x; } ___z;
          char y;
        } ___tmp = { .___z = { ctx[2] }};
        ___tmp.y;
     }));
In the above, the values stored on the stack are reassembled
into values of the actual argument types by using 'union'
magic. Note that the macro also works even if none of the
arguments are struct types.

Note that the new BPF_PROG2 works for both llvm16 and pre-llvm16
compilers: llvm16 supports passing struct values of up to 16 bytes
by value for the bpf target, while pre-llvm16 passes them by
reference, storing the values on the stack. With static functions
taking struct arguments marked always_inline, the compiler is able
to optimize away the additional stack copies of struct values.
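
For illustration, a user-side fentry program for the example signature above might look as follows with the new macro; the attach target, program name, and the locally defined sockptr typedef are only illustrative, not actual kernel or selftest code:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  typedef struct {
          void *x;
          int t;
  } sockptr;

  /* hypothetical traced kernel function: int blah(sockptr x, char y) */
  SEC("fentry/blah")
  int BPF_PROG2(trace_blah, sockptr, x, char, y)
  {
          /* x arrives as a typed struct value, y as a plain char */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";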

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152707.2079473-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-06 19:51:14 -07:00
Yonghong Song
27ed9353ae bpf: Update descriptions for helpers bpf_get_func_arg[_cnt]()
Now, instead of the number of arguments, the number of registers
holding argument values is stored in the trampoline. Update
the descriptions of the bpf_get_func_arg[_cnt]() helpers
accordingly. Previous programs without struct arguments should
continue to work as usual.
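
As a hedged sketch (not part of this patch), a tracing program might consume these helpers like this, keeping in mind that the count is now in units of registers; the attach point and names are illustrative:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  SEC("fentry/bpf_fentry_test1")          /* illustrative attach point */
  int BPF_PROG(dump_args)
  {
          __u64 nregs = bpf_get_func_arg_cnt(ctx); /* registers holding args, not arg count */
          __u64 first = 0;

          if (!bpf_get_func_arg(ctx, 0, &first))
                  bpf_printk("nregs=%llu first=0x%llx", nregs, first);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";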

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152657.2078805-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-06 19:51:14 -07:00
Paolo Abeni
2786bcff28 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-09-05

The following pull-request contains BPF updates for your *net-next* tree.

We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).

There are two small merge conflicts, resolve them as follows:

1) tools/testing/selftests/bpf/DENYLIST.s390x

  Commit 27e23836ce ("selftests/bpf: Add lru_bug to s390x deny list") in
  bpf tree was needed to get BPF CI green on s390x, but it conflicted with
  newly added tests on bpf-next. Resolve by adding both hunks, result:

  [...]
  lru_bug                                  # prog 'printk': failed to auto-attach: -524
  setget_sockopt                           # attach unexpected error: -524                                               (trampoline)
  cb_refs                                  # expected error message unexpected error: -524                               (trampoline)
  cgroup_hierarchical_stats                # JIT does not support calling kernel function                                (kfunc)
  htab_update                              # failed to attach: ERROR: strerror_r(-524)=22                                (trampoline)
  [...]

2) net/core/filter.c

  Commit 1227c1771d ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
  from net tree conflicts with commit 29003875bd ("bpf: Change bpf_setsockopt(SOL_SOCKET)
  to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
  bpf-next tree, result:

  [...]
	if (getopt) {
		if (optname == SO_BINDTODEVICE)
			return -EINVAL;
		return sk_getsockopt(sk, SOL_SOCKET, optname,
				     KERNEL_SOCKPTR(optval),
				     KERNEL_SOCKPTR(optlen));
	}

	return sk_setsockopt(sk, SOL_SOCKET, optname,
			     KERNEL_SOCKPTR(optval), *optlen);
  [...]

The main changes are:

1) Add an any-context BPF-specific memory allocator, which is useful in particular for BPF
   tracing, with the bonus of performance equal to full prealloc, from Alexei Starovoitov.

2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
   to reuse the existing core socket code as much as possible, from Martin KaFai Lau.

3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
   with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.

4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.

5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
   integration with the rstat framework, from Yosry Ahmed.

6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.

7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.

8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
   backend, from James Hilliard.

9) Fix verifier helper permissions and reference state management for synchronous
   callbacks, from Kumar Kartikeya Dwivedi.

10) Add support for BPF selftest's xskxceiver to also be used against real devices that
    support MAC loopback, from Maciej Fijalkowski.

11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.

12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.

13) Various minor misc improvements all over the place.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
  bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
  bpf: Remove usage of kmem_cache from bpf_mem_cache.
  bpf: Remove prealloc-only restriction for sleepable bpf programs.
  bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  bpf: Remove tracing program restriction on map types
  bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  bpf: Add percpu allocation support to bpf_mem_alloc.
  bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  bpf: Adjust low/high watermarks in bpf_mem_cache
  bpf: Optimize call_rcu in non-preallocated hash map.
  bpf: Optimize element count in non-preallocated hash map.
  bpf: Relax the requirement to use preallocated hash maps in tracing progs.
  samples/bpf: Reduce syscall overhead in map_perf_test.
  selftests/bpf: Improve test coverage of test_maps
  bpf: Convert hash map to bpf_mem_alloc.
  bpf: Introduce any context BPF specific memory allocator.
  selftest/bpf: Add test for bpf_getsockopt()
  bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
  bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
  bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
  ...
====================

Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-06 23:21:18 +02:00
Mark Brown
05a5980f7f kselftest/arm64: Count SIGUSR2 deliveries in FP stress tests
Currently the floating point stress tests mostly support testing that the
data they are checking can be disrupted from a signal handler triggered by
SIGUSR1. This is not properly implemented for all the tests, and in testing
it is frequently modified to just handle the signal without corrupting data,
in order to verify that the signal handling itself does not corrupt data.
Directly support this usage by installing a SIGUSR2 handler which simply
counts the signal deliveries.
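
A minimal sketch of what such a counting handler can look like (not the actual test code):

  #include <signal.h>
  #include <string.h>

  static volatile sig_atomic_t sigusr2_count;

  static void handle_sigusr2(int sig)
  {
          sigusr2_count++;        /* just count the delivery, leave FP/vector state alone */
  }

  static int install_sigusr2_handler(void)
  {
          struct sigaction sa;

          memset(&sa, 0, sizeof(sa));
          sa.sa_handler = handle_sigusr2;
          return sigaction(SIGUSR2, &sa, NULL);
  }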

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220829154452.824870-3-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-09-06 18:31:41 +01:00
Mark Brown
9e40e62723 kselftest/arm64: Always encourage preemption for za-test
Since we now have an explicit test for the syscall ABI, there is no need for
za-test to cover getpid(), so just unconditionally do sched_yield() like we
do in fpsimd-test.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220829154452.824870-2-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-09-06 18:31:41 +01:00
Mark Brown
7a9bcaaad5 kselftest/arm64: Add simple hwcap validation
Add some trivial hwcap validation which checks that /proc/cpuinfo and
AT_HWCAP agree with each other and which, for extensions that add new
instructions, can verify that a SIGILL appears or does not appear as
expected. I've added SVE and SME; other capabilities can be added
later if this gets merged.
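
A rough sketch of the AT_HWCAP side of such a check, assuming an arm64 build where HWCAP_SVE comes from <asm/hwcap.h> (the real test additionally parses /proc/cpuinfo and probes for SIGILL):

  #include <stdio.h>
  #include <sys/auxv.h>
  #include <asm/hwcap.h>          /* HWCAP_SVE, HWCAP2_SME on arm64 */

  int main(void)
  {
          unsigned long hwcap = getauxval(AT_HWCAP);

          printf("SVE: %s\n", (hwcap & HWCAP_SVE) ? "present" : "absent");
          return 0;
  }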

This isn't super exciting but on the other hand took very little time to
write and should be handy when verifying that you wired up AT_HWCAP
properly.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220829154602.827275-1-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-09-06 18:30:42 +01:00
Shang XiaoJing
4efa8e3143 perf c2c: Prevent potential memory leak in c2c_he_zalloc()
Free already-allocated resources when zalloc() fails for one of the members
of c2c_he, to prevent a potential memory leak in c2c_he_zalloc().
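
The shape of the fix, shown as a generic sketch rather than the actual perf code (the struct and member names here are hypothetical):

  #include <stdlib.h>

  struct he {                     /* stand-in for c2c_hist_entry */
          unsigned long *cpuset;
          unsigned long *nodeset;
  };

  static struct he *he_zalloc(void)
  {
          struct he *he = calloc(1, sizeof(*he));

          if (!he)
                  return NULL;

          he->cpuset = calloc(1, 128);
          if (!he->cpuset)
                  goto out_free;

          he->nodeset = calloc(1, 128);
          if (!he->nodeset)
                  goto out_free_cpuset;

          return he;

  out_free_cpuset:
          free(he->cpuset);
  out_free:
          free(he);               /* without this unwind, a failed member alloc leaks 'he' */
          return NULL;
  }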

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20220906032906.21395-4-shangxiaojing@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-09-06 09:45:23 -03:00
Zixuan Tan
6ea9da51a5 perf genelf: Switch deprecated openssl MD5_* functions to new EVP API
Switch to the favored EVP API, like in test-libcrypto.c, and remove the
bad gcc #pragma.
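
For reference, the general shape of an EVP-based MD5 computation (genelf's actual buffers and error handling differ; this is only a sketch):

  #include <openssl/evp.h>

  static int md5_digest(const void *buf, size_t len,
                        unsigned char md[EVP_MAX_MD_SIZE], unsigned int *md_len)
  {
          EVP_MD_CTX *ctx = EVP_MD_CTX_new();
          int ok;

          if (!ctx)
                  return -1;
          ok = EVP_DigestInit_ex(ctx, EVP_md5(), NULL) &&
               EVP_DigestUpdate(ctx, buf, len) &&
               EVP_DigestFinal_ex(ctx, md, md_len);
          EVP_MD_CTX_free(ctx);
          return ok ? 0 : -1;
  }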

Inspired-by: 5b245985a6 ("tools build: Switch to new openssl API for test-libcrypto")
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/CABwm_eTnARC1GwMD-JF176k8WXU1Z0+H190mvXn61yr369qt6g@mail.gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-09-06 09:45:23 -03:00
Athira Rajeev
cbd7bfc7fd tools/perf: Fix out of bound access to cpu mask array
The cpu mask init code in the "record__mmap_cpu_mask_init" function accesses
the "bits" array that is part of "struct mmap_cpu_mask".  The size of this
array is the value from cpu__max_cpu().cpu, and it holds the cpumask bit for
each CPU. While setting the bit for each CPU, it calls the "set_bit"
function, which indexes into the "bits" array.

If the -C command line option is given a value greater than the number of
CPUs present in the system, set_bit can access an array member that is
outside the array bounds, because there is currently no boundary check for
the CPU. This results in a segfault:

<<>>
  ./perf record -C 12341234 ls
  Perf can support 2048 CPUs. Consider raising MAX_NR_CPUS
  Segmentation fault (core dumped)
<<>>

Debugging with gdb, points to function flow as below:

<<>>
  set_bit
  record__mmap_cpu_mask_init
  record__init_thread_default_masks
  record__init_thread_masks
  cmd_record
<<>>

Fix this by adding a boundary check for the array.
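
The shape of the guard, shown as a generic sketch rather than the actual perf change (in perf the limit would come from cpu__max_cpu().cpu and the real code uses struct mmap_cpu_mask and set_bit()):

  #include <limits.h>
  #include <stdbool.h>

  struct cpu_mask {
          unsigned long bits[2048 / (sizeof(unsigned long) * CHAR_BIT)];
          int nbits;              /* number of representable CPUs */
  };

  static bool cpu_mask_set(struct cpu_mask *mask, int cpu)
  {
          if (cpu < 0 || cpu >= mask->nbits)
                  return false;   /* reject out-of-range CPUs instead of writing past 'bits' */
          mask->bits[cpu / (sizeof(unsigned long) * CHAR_BIT)] |=
                  1UL << (cpu % (sizeof(unsigned long) * CHAR_BIT));
          return true;
  }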

After the patch:

<<>>
./perf record -C 12341234 ls
  Perf can support 2048 CPUs. Consider raising MAX_NR_CPUS
  Failed to initialize parallel data streaming masks
<<>>

With this fix, if -C is given a non-existent CPU, perf
record will fail with:

<<>>
  ./perf record -C 50 ls
  Failed to initialize parallel data streaming masks
<<>>

Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/r/20220905141929.7171-2-atrajeev@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-09-06 09:45:23 -03:00
Athira Rajeev
72cd652b73 perf affinity: Fix out of bound access to "sched_cpus" mask
The affinity code in the "affinity_set" function accesses an array named
"sched_cpus". The size of this array is allocated in the affinity_setup
function and is simply the value from get_cpu_set_size(). It holds the
cpumask bit for each CPU.

While setting the bit for each CPU, it calls the "set_bit" function, which
indexes into the sched_cpus array.  If the -C command-line option is given a
value greater than the number of CPUs present in the system, set_bit can
access an array member that is outside the array bounds, because there is
currently no boundary check for the CPU.  This results in a segfault:

<<>>
   ./perf stat -C 12323431 ls
  Perf can support 2048 CPUs. Consider raising MAX_NR_CPUS
  Segmentation fault (core dumped)
<<>>

Fix this by adding a boundary check for the array.

After the fix from powerpc system:

<<>>
  ./perf stat -C 12323431 ls 1>out
  Perf can support 2048 CPUs. Consider raising MAX_NR_CPUS

   Performance counter stats for 'CPU(s) 12323431':

     <not supported> msec cpu-clock
     <not supported>      context-switches
     <not supported>      cpu-migrations
     <not supported>      page-faults
     <not supported>      cycles
     <not supported>      instructions
     <not supported>      branches
     <not supported>      branch-misses

         0.001192373 seconds time elapsed
<<>>

Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/r/20220905141929.7171-1-atrajeev@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-09-06 09:45:23 -03:00
Jagath Jog J
835e699ef8 iio: Add new event type gesture and use direction for single and double tap
Add a new event type for tap, called gesture; the direction can be used
to differentiate single and double tap. This may be used by accelerometer
sensors to express single and double tap events. For directional tap,
modifiers like IIO_MOD_(X/Y/Z) can be used along with the singletap and
doubletap directions.

Signed-off-by: Jagath Jog J <jagathjog1996@gmail.com>
Link: https://lore.kernel.org/r/20220831063117.4141-2-jagathjog1996@gmail.com
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2022-09-05 18:08:42 +01:00
Zhou jie
2258954234 tools: hv: kvp: remove unnecessary (void*) conversions
Remove unnecessary void* type casting.
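
A tiny illustration of why such casts are redundant in C (the struct here is only a stand-in, not the real hv_kvp type):

  struct kvp_payload { int key; };

  static void handle_msg(void *buf)
  {
          /* a void * converts implicitly to any object pointer type in C,
           * so no cast is needed (it would only be required in C++) */
          struct kvp_payload *p = buf;    /* was: (struct kvp_payload *)buf */

          (void)p;
  }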

Signed-off-by: Zhou jie <zhoujie@nfschina.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20220823034552.8596-1-zhoujie@nfschina.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-09-05 16:55:20 +00:00
Alexei Starovoitov
0fd7c5d433 bpf: Optimize call_rcu in non-preallocated hash map.
Doing call_rcu() a million times a second becomes a bottleneck.
Convert the non-preallocated hash map from call_rcu to SLAB_TYPESAFE_BY_RCU.
The RCU critical section is no longer observed for each individual htab
element, which makes the non-preallocated hash map behave just like the
preallocated hash map. The map elements are released back to kernel memory
after an RCU grace period is observed.
This improves 'map_perf_test 4' performance from 100k events per second
to 250k events per second.

bpf_mem_alloc + percpu_counter + typesafe_by_rcu provide a 10x performance
boost to the non-preallocated hash map and bring it within a few % of the
preallocated map while consuming a fraction of the memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-8-alexei.starovoitov@gmail.com
2022-09-05 15:33:06 +02:00
Alexei Starovoitov
37521bffdd selftests/bpf: Improve test coverage of test_maps
Make test_maps more stressful with more parallelism in
update/delete/lookup/walk including different value sizes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-4-alexei.starovoitov@gmail.com
2022-09-05 15:33:05 +02:00
Waiman Long
a8c52eba88 kselftest/cgroup: Add cpuset v2 partition root state test
Add a test script test_cpuset_prs.sh with a helper program wait_inotify
for exercising the cpuset v2 partition root state code.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:28 -10:00
Linus Torvalds
685ed983e2 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "s390:

   - PCI interpretation compile fixes

  RISC-V:

   - fix unused variable warnings in vcpu_timer.c

   - move extern sbi_ext declarations to a header

  x86:

   - check validity of argument to KVM_SET_MP_STATE

   - use guest's global_ctrl to completely disable guest PEBS

   - fix a memory leak on memory allocation failure

   - mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES

   - fix build failure with Clang integrated assembler

   - fix MSR interception

   - always flush TLBs when enabling dirty logging"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: check validity of argument to KVM_SET_MP_STATE
  perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
  KVM: x86: fix memoryleak in kvm_arch_vcpu_create()
  KVM: x86: Mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES
  KVM: s390: pci: Hook to access KVM lowlevel from VFIO
  riscv: kvm: move extern sbi_ext declarations to a header
  riscv: kvm: vcpu_timer: fix unused variable warnings
  KVM: selftests: Fix ambiguous mov in KVM_ASM_SAFE()
  KVM: selftests: Fix KVM_EXCEPTION_MAGIC build with Clang
  KVM: VMX: Heed the 'msr' argument in msr_write_intercepted()
  kvm: x86: mmu: Always flush TLBs when enabling dirty logging
  kvm: x86: mmu: Drop the need_remote_flush() function
2022-09-04 11:27:14 -07:00
Michael Ellerman
501fe29982 selftests/powerpc: Skip 4PB test on 4K PAGE_SIZE systems
Systems using the hash MMU with a 4K page size don't support 4PB address
space, so skip the test because the bug it tests for can't be triggered.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220901020215.254097-1-mpe@ellerman.id.au
2022-09-04 22:39:59 +10:00
Rebecca Mckeever
42c3ba8658 memblock_tests: move variable declarations to single block
Move variable declarations to a single block at the beginning of each
testing function.

Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/e61431e73977f305fdd027bca99d1dc119e96d84.1662264355.git.remckee0@gmail.com
2022-09-04 13:50:35 +03:00
Rebecca Mckeever
35e49953c3 memblock tests: remove 'cleared' from comment blocks
The tests in alloc_nid_api can now run either memblock_alloc_try_nid()
or memblock_alloc_try_nid_raw(). The comment blocks for these tests
should not refer to a 'cleared' region since that only applies to
memblock_alloc_try_nid(). Remove 'cleared' from the comment blocks so
that the comments are accurate for either memblock function.

Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/e8be24137e54e9f81a06af969ded82b319114d7a.1662264347.git.remckee0@gmail.com
2022-09-04 13:49:33 +03:00
Shi junming
40083734d9 ACPI: tools: pfrut: Do not initialize ret in main()
The initialization is unnecessary, because ret is always assigned a new
value before it is read.

Signed-off-by: Shi junming <junming@nfschina.com>
[ rjw: Subject edits, new changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-09-03 20:36:02 +02:00
Martin KaFai Lau
f649f992de selftest/bpf: Add test for bpf_getsockopt()
This patch removes __bpf_getsockopt(), which directly reads the sk by
using PTR_TO_BTF_ID.  Instead, the test now directly uses the kernel bpf
helper bpf_getsockopt(), which now supports all the required optnames.

TCP_SAVE[D]_SYN and TCP_MAXSEG are not tested in a loop for all
the hooks and sock_ops's cb.  TCP_SAVE[D]_SYN only works
on a passive connection.  TCP_MAXSEG only works when
it is set via setsockopt() before the connection is established, and
its getsockopt() return value can only be tested after
the connection is established.
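
A hedged sketch of the direction the test takes, calling the helper from a sockops program instead of walking the sk; the constants are defined locally for self-containment and the program is illustrative, not the actual selftest code:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define SOL_SOCKET 1            /* from asm-generic/socket.h, kept local here */
  #define SO_RCVBUF  8

  SEC("sockops")
  int sockops_getsockopt(struct bpf_sock_ops *skops)
  {
          int rcvbuf = 0;

          /* read the option through the kernel helper instead of the raw sk */
          if (!bpf_getsockopt(skops, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)))
                  bpf_printk("rcvbuf=%d", rcvbuf);
          return 1;
  }

  char LICENSE[] SEC("license") = "GPL";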

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002937.2896904-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 20:34:32 -07:00
Jakub Kicinski
05a5474efe Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:

====================
netfilter: bug fixes for net

1. Fix the IP address check in the irc DCC conntrack helper; it should check
   the opposite direction rather than the destination address of the
   packet's direction, from David Leadbeater.

2. bridge netfilter needs to drop dst references, from Harsh Modi.
   This was fine back in the day the code was originally written,
   but nowadays various tunnels can pre-set metadata dsts on packets.

3. Remove the nf_conntrack_helper sysctl and the modparam toggle; users
   need to explicitly assign the helpers to use via nftables or
   iptables.  Conntrack helpers, by design, may be used to add dynamic
   port redirections to internal machines, so it's necessary to restrict
   which hosts/peers are allowed to use them.
   It was discovered that improper checking in the irc DCC helper makes
   it possible to trigger the 'please do dynamic port forward'
   from outside by embedding a 'DCC' in a PING request; if the client
   echoes that back, an expectation/port forward gets added.
   The auto-assign-for-everything mechanism has been in "please don't do this"
   territory since 2012.  From Pablo.

4. Fix a memory leak in the netdev hook error unwind path, also from Pablo.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_conntrack_irc: Fix forged IP logic
  netfilter: nf_tables: clean up hook list when offload flags check fails
  netfilter: br_netfilter: Drop dst references before setting.
  netfilter: remove nf_conntrack_helper sysctl and modparam toggles
====================

Link: https://lore.kernel.org/r/20220901071238.3044-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-02 19:38:25 -07:00
Linus Torvalds
cec53f4c8d Merge tag 'io_uring-6.0-2022-09-02' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:

 - A single fix for over-eager retries for networking (Pavel)

 - Revert the notification slot support for zerocopy sends.

   It turns out that even after more than a year of development and
   testing, there's not full agreement on whether just using plain
   ordered notifications is Good Enough to avoid the complexity of using
   the notification slots. Because of that, we decided that it's best
   left to a future final decision.

   We can always bring back this feature, but we can't really change it
   or remove it once we've released 6.0 with it enabled. The reverts
   leave the usual CQE notifications as the primary interface for
   knowing when data was sent, and when it was acked. (Pavel)

* tag 'io_uring-6.0-2022-09-02' of git://git.kernel.dk/linux-block:
  selftests/net: return back io_uring zc send tests
  io_uring/net: simplify zerocopy send user API
  io_uring/notif: remove notif registration
  Revert "io_uring: rename IORING_OP_FILES_UPDATE"
  Revert "io_uring: add zc notification flush requests"
  selftests/net: temporarily disable io_uring zc test
  io_uring/net: fix overexcessive retries
2022-09-02 16:37:01 -07:00
Linus Torvalds
0c95f02269 Merge tag 'landlock-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux
Pull landlock fix from Mickaël Salaün:
 "This fixes a mis-handling of the LANDLOCK_ACCESS_FS_REFER right when
  multiple rulesets/domains are stacked.

  The expected behaviour was that an additional ruleset can only
  restrict the set of permitted operations, but in this particular case,
  it was potentially possible to re-gain the LANDLOCK_ACCESS_FS_REFER
  right"

* tag 'landlock-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Fix file reparenting without explicit LANDLOCK_ACCESS_FS_REFER
2022-09-02 15:24:08 -07:00
Zhengjun Xing
f0c86a2bae perf stat: Fix L2 Topdown metrics disappear for raw events
In perf/Documentation/perf-stat.txt, for "--td-level" the default "0" means
the max level that the current hardware supports.

So we need to initialize stat_config.topdown_level to TOPDOWN_MAX_LEVEL when
"--td-level=0" is given or the "--td-level" option is omitted. Otherwise, on
hardware whose max level is 2, the 2nd-level metrics disappear for raw
events.

The issue cannot be observed with the perf stat defaults or the "--topdown"
option. This commit fixes the raw events issue and removes the duplicated
code for the perf stat default.

Before:

 # ./perf stat -e "cpu-clock,context-switches,cpu-migrations,page-faults,instructions,cycles,ref-cycles,branches,branch-misses,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound,topdown-heavy-ops,topdown-br-mispredict,topdown-fetch-lat,topdown-mem-bound}" sleep 1

 Performance counter stats for 'sleep 1':

              1.03 msec cpu-clock                        #    0.001 CPUs utilized
                 1      context-switches                 #  966.216 /sec
                 0      cpu-migrations                   #    0.000 /sec
                60      page-faults                      #   57.973 K/sec
         1,132,112      instructions                     #    1.41  insn per cycle
           803,872      cycles                           #    0.777 GHz
         1,909,120      ref-cycles                       #    1.845 G/sec
           236,634      branches                         #  228.640 M/sec
             6,367      branch-misses                    #    2.69% of all branches
         4,823,232      slots                            #    4.660 G/sec
         1,210,536      topdown-retiring                 #     25.1% Retiring
           699,841      topdown-bad-spec                 #     14.5% Bad Speculation
         1,777,975      topdown-fe-bound                 #     36.9% Frontend Bound
         1,134,878      topdown-be-bound                 #     23.5% Backend Bound
           189,146      topdown-heavy-ops                #  182.756 M/sec
           662,012      topdown-br-mispredict            #  639.647 M/sec
         1,097,048      topdown-fetch-lat                #    1.060 G/sec
           416,121      topdown-mem-bound                #  402.063 M/sec

       1.002423690 seconds time elapsed

       0.002494000 seconds user
       0.000000000 seconds sys

After:

 # ./perf stat -e "cpu-clock,context-switches,cpu-migrations,page-faults,instructions,cycles,ref-cycles,branches,branch-misses,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound,topdown-heavy-ops,topdown-br-mispredict,topdown-fetch-lat,topdown-mem-bound}" sleep 1

 Performance counter stats for 'sleep 1':

              1.13 msec cpu-clock                        #    0.001 CPUs utilized
                 1      context-switches                 #  882.128 /sec
                 0      cpu-migrations                   #    0.000 /sec
                61      page-faults                      #   53.810 K/sec
         1,137,612      instructions                     #    1.29  insn per cycle
           881,477      cycles                           #    0.778 GHz
         2,093,496      ref-cycles                       #    1.847 G/sec
           236,356      branches                         #  208.496 M/sec
             7,090      branch-misses                    #    3.00% of all branches
         5,288,862      slots                            #    4.665 G/sec
         1,223,697      topdown-retiring                 #     23.1% Retiring
           767,403      topdown-bad-spec                 #     14.5% Bad Speculation
         2,053,322      topdown-fe-bound                 #     38.8% Frontend Bound
         1,244,438      topdown-be-bound                 #     23.5% Backend Bound
           186,665      topdown-heavy-ops                #      3.5% Heavy Operations       #     19.6% Light Operations
           725,922      topdown-br-mispredict            #     13.7% Branch Mispredict      #      0.8% Machine Clears
         1,327,400      topdown-fetch-lat                #     25.1% Fetch Latency          #     13.7% Fetch Bandwidth
           497,775      topdown-mem-bound                #      9.4% Memory Bound           #     14.1% Core Bound

       1.002701530 seconds time elapsed

       0.002744000 seconds user
       0.000000000 seconds sys

Fixes: 63e39aa6ae ("perf stat: Support L2 Topdown events")
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Xing Zhengjun <zhengjun.xing@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220826140057.3289401-1-zhengjun.xing@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-09-02 13:52:18 -03:00
Ian Rogers
af515a5587 selftests/xsk: Avoid use-after-free on ctx
The put lowers the reference count to 0 and frees ctx; reading it
afterwards is invalid. Move the put after the uses and determine the
last use by the reference count being 1.
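
The shape of the reordering, as a generic sketch rather than the actual xsk test code (the real code deals with the xsk socket ctx and BPF resources, not this toy struct):

  #include <stdlib.h>

  struct ctx {
          int refcount;
  };

  static void ctx_put(struct ctx *ctx)
  {
          if (--ctx->refcount == 0)
                  free(ctx);               /* ctx must not be touched after this */
  }

  static void teardown(struct ctx *ctx)
  {
          int last = (ctx->refcount == 1); /* remember "last user" while ctx is still valid */

          /* ... all remaining uses of ctx happen here ... */

          ctx_put(ctx);                    /* put moved to the end: no use-after-free */
          if (last) {
                  /* destroy shared resources only on the last reference */
          }
  }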

Fixes: 39e940d4ab ("selftests/xsk: Destroy BPF resources only when ctx refcount drops to 0")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901202645.1463552-1-irogers@google.com
2022-09-02 15:57:18 +02:00
Daniel Müller
afef88e655 selftests/bpf: Store BPF object files with .bpf.o extension
BPF object files are, in a way, the final artifact produced as part of
the ahead-of-time compilation process. That makes them somewhat special
compared to "regular" object files, which are a intermediate build
artifacts that can typically be removed safely. As such, it can make
sense to name them differently to make it easier to spot this difference
at a glance.

Among others, libbpf-bootstrap [0] has established the extension .bpf.o
for BPF object files. It seems reasonable to follow this example and
establish the same denomination for selftest build artifacts. To that
end, this change adjusts the corresponding part of the build system and
the test programs loading BPF object files to work with .bpf.o files.

  [0] https://github.com/libbpf/libbpf-bootstrap

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220901222253.1199242-1-deso@posteo.net
2022-09-02 15:55:37 +02:00
Maciej Fijalkowski
fe2ad08e1e selftests/xsk: Add support for zero copy testing
Introduce a new mode to xdpxceiver responsible for testing the AF_XDP zero
copy support of the driver that serves the underlying physical device. When
setting up the test suite, determine whether the driver has ZC support by
trying to bind an XSK ZC socket to the interface. If that succeeds,
interpret it as ZC support being in place and run the softirq and busy poll
tests in zero copy mode.

Note that Rx dropped tests are skipped since ZC path is not touching
rx_dropped stat at all.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-7-maciej.fijalkowski@intel.com
2022-09-02 15:38:08 +02:00
Maciej Fijalkowski
c29fe883de selftests/xsk: Make sure single threaded test terminates
For single threaded poll tests, call pthread_kill() from the main thread so
that we are sure the worker thread has finished its job and it is possible
to proceed with the next test types from the test suite. It was observed
that on some platforms it takes a bit longer for the worker thread to exit,
and the next test case then sees the device as busy.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-6-maciej.fijalkowski@intel.com
2022-09-02 15:38:03 +02:00
Maciej Fijalkowski
a693ff3ed5 selftests/xsk: Add support for executing tests on physical device
Currently, the architecture of xdpxceiver is designed strictly for
conducting veth based tests. A veth pair is created together with a
network namespace and one of the veth interfaces is moved to the
mentioned netns. Then, separate threads for Tx and Rx are spawned which
will utilize the described setup.

The infrastructure described in the paragraph above can not be used for
testing AF_XDP support on physical devices. That testing will be
conducted on a single network interface and the same queue. Xskxceiver
needs to be extended to distinguish between veth tests and physical
interface tests.

Since the same iface/queue id pair will be used by both Tx/Rx threads for
physical device testing, the Tx thread, which happens to run after the Rx
thread, is going to create its XSK socket with the shared umem flag. In
order to track this setting throughout the lifetime of the spawned threads,
introduce a 'shared_umem' boolean variable in struct ifobject and set it to
true when xdpxceiver is run against a physical device. In such a case, the
UMEM size needs to be doubled, so half of it will be used by the Rx thread
and the other half by the Tx thread. For two-step based test types, the
value of the XSKMAP element under key 0 has to be updated as there is now
another socket for the second step. Also, to avoid race conditions when
destroying XSK resources, move this activity to the main thread after the
spawned Rx and Tx threads have finished their jobs. This way it is possible
to gracefully remove the shared umem without introducing synchronization
mechanisms.

To run xsk selftests suite on physical device, append "-i $IFACE" when
invoking test_xsk.sh. For veth based tests, simply skip it. When "-i
$IFACE" is in place, under the hood test_xsk.sh will use $IFACE for both
interfaces supplied to xdpxceiver, which in turn will interpret that
this execution of test suite is for a physical device.

Note that currently this makes it possible only to test SKB and DRV mode
(in case underlying device has native XDP support). ZC testing support
is added in a later patch.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-5-maciej.fijalkowski@intel.com
2022-09-02 15:37:57 +02:00
Maciej Fijalkowski
24037ba7c4 selftests/xsk: Increase chars for interface name to 16
So that "enp240s0f0" or such name can be used against xskxceiver.
While at it, also extend character count for netns name.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-4-maciej.fijalkowski@intel.com
2022-09-02 15:37:53 +02:00
Maciej Fijalkowski
1adef0643b selftests/xsk: Introduce default Rx pkt stream
In order to prepare xdpxceiver for physical device testing, let us
introduce a default Rx pkt stream. The reason for doing this is that
physical device testing will use a UMEM of doubled size, where half of it
will be used by Tx and the other half by Rx. This means that pkt addresses
will differ for the Tx and Rx streams. The Rx thread will initialize the
xsk_umem_info::base_addr that is added here so that pkt_set(), when working
on the Rx UMEM, will add this offset and the second half of the UMEM space
will be used. Note that currently base_addr is 0 on both sides. A future
commit will do the mentioned initialization.

Previously, veth based testing worked on separate UMEMs, so a single
default stream was fine.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-3-maciej.fijalkowski@intel.com
2022-09-02 15:37:45 +02:00
Maciej Fijalkowski
0d68e6fe12 selftests/xsk: Query for native XDP support
Currently, xdpxceiver assumes that the underlying device supports XDP in
native mode - this has been fine so far since tests could only run on a
veth pair. A future commit is going to allow running the test suite against
physical devices, so let us query whether the device is capable of running
XDP programs in native mode. This way xdpxceiver will not try to run
TEST_MODE_DRV if the device being tested does not support it.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-2-maciej.fijalkowski@intel.com
2022-09-02 15:37:37 +02:00
Mickaël Salaün
55e55920bb landlock: Fix file reparenting without explicit LANDLOCK_ACCESS_FS_REFER
This change fixes a mis-handling of the LANDLOCK_ACCESS_FS_REFER right
when multiple rulesets/domains are stacked. The expected behaviour was
that an additional ruleset can only restrict the set of permitted
operations, but in this particular case, it was potentially possible to
re-gain the LANDLOCK_ACCESS_FS_REFER right.

With the introduction of LANDLOCK_ACCESS_FS_REFER, we added the first
globally denied-by-default access right.  Indeed, this lifted an initial
Landlock limitation to rename and link files, which was initially always
denied when the source or the destination were different directories.

This led to an inconsistent backward compatibility behavior which was
only taken into account if no domain layer were using the new
LANDLOCK_ACCESS_FS_REFER right. However, when restricting a thread with
a new ruleset handling LANDLOCK_ACCESS_FS_REFER, all inherited parent
rulesets/layers not explicitly handling LANDLOCK_ACCESS_FS_REFER would
behave as if they were handling this access right and with all their
rules allowing it. This means that renaming and linking files could
become allowed by these parent layers, but all the other required
accesses must also be granted: all layers must allow file removal or
creation, and renaming and linking operations cannot lead to privilege
escalation according to the Landlock policy.  See detailed explanation
in commit b91c3e4ea7 ("landlock: Add support for file reparenting with
LANDLOCK_ACCESS_FS_REFER").

To say it another way, this bug may lift the renaming and linking
limitations of the initial Landlock version, and a same ruleset can
enforce different restrictions depending on previous or next enforced
ruleset (i.e. inconsistent behavior). The LANDLOCK_ACCESS_FS_REFER right
cannot give access to data not already allowed, but this doesn't follow
the contract of the first Landlock ABI. This fix puts back the
limitation for sandboxes that didn't opt-in for this additional right.

For instance, if a first ruleset allows LANDLOCK_ACCESS_FS_MAKE_REG on
/dst and LANDLOCK_ACCESS_FS_REMOVE_FILE on /src, renaming /src/file to
/dst/file is denied. However, without this fix, stacking a new ruleset
which allows LANDLOCK_ACCESS_FS_REFER on / would now permit the
sandboxed thread to rename /src/file to /dst/file .

This change fixes the (absolute) rule access rights, which now always
forbid LANDLOCK_ACCESS_FS_REFER except when it is explicitly allowed
when creating a rule.

Making all domains handle LANDLOCK_ACCESS_FS_REFER was an initial
approach, but there are two downsides:
* it makes the code more complex because we still want to check that a
  rule allowing LANDLOCK_ACCESS_FS_REFER is legitimate according to the
  ruleset's handled access rights (i.e. ABI v1 != ABI v2);
* it would not allow us to identify whether the user created a ruleset
  explicitly handling LANDLOCK_ACCESS_FS_REFER or not, which will be an
  issue when auditing Landlock.

Instead, this change adds an ACCESS_INITIALLY_DENIED list of
denied-by-default rights, which (only) contains
LANDLOCK_ACCESS_FS_REFER.  All domains are treated as if they are also
handling this list, but without modifying their fs_access_masks field.

A side effect is that the errno code returned by rename(2) or link(2)
*may* be changed from EXDEV to EACCES according to the enforced
restrictions.  Indeed, we now have the mechanic to identify if an access
is denied because of a required right (e.g. LANDLOCK_ACCESS_FS_MAKE_REG,
LANDLOCK_ACCESS_FS_REMOVE_FILE) or if it is denied because of missing
LANDLOCK_ACCESS_FS_REFER rights.  This may result in different errno
codes than for the initial Landlock version, but this approach is more
consistent and better for rename/link compatibility reasons, and it
wasn't possible before (hence no backport to ABI v1).  The
layout1.rename_file test reflects this change.

Add 4 layout1.refer_denied_by_default* test suites to check that the
behavior of a ruleset not handling LANDLOCK_ACCESS_FS_REFER (ABI v1) is
unchanged even if another layer handles LANDLOCK_ACCESS_FS_REFER (i.e.
ABI v1 precedence).  Make sure rule's absolute access rights are correct
by testing with and without a matching path.  Add test_rename() and
test_exchange() helpers.

Extend layout1.inval tests to check that a denied-by-default access
right is not necessarily part of a domain's handled access rights.

Test coverage for security/landlock is 95.3% of 599 lines according to
gcc/gcov-11.

Fixes: b91c3e4ea7 ("landlock: Add support for file reparenting with LANDLOCK_ACCESS_FS_REFER")
Reviewed-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20220831203840.1370732-1-mic@digikod.net
Cc: stable@vger.kernel.org
[mic: Constify and slightly simplify test helpers]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
2022-09-02 15:29:08 +02:00
Shmulik Ladkani
8cc61b7a64 selftests/bpf: Amend test_tunnel to exercise BPF_F_TUNINFO_FLAGS
Get the tunnel flags in {ipv6}vxlan_get_tunnel_src and ensure they are
aligned with tunnel params set at {ipv6}vxlan_set_tunnel_dst.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220831144010.174110-2-shmulik.ladkani@gmail.com
2022-09-02 15:21:03 +02:00
Shmulik Ladkani
44c51472be bpf: Support getting tunnel flags
Existing 'bpf_skb_get_tunnel_key' extracts various tunnel parameters
(id, ttl, tos, local and remote) but does not expose ip_tunnel_info's
tun_flags to the BPF program.

It makes sense to expose tun_flags to the BPF program.

Assume for example multiple GRE tunnels maintained on a single GRE
interface in collect_md mode. The program expects origins to initiate
over GRE, however different origins use different GRE characteristics
(e.g. some prefer to use GRE checksum, some do not; some pass a GRE key,
some do not, etc..).

A BPF program getting tun_flags can therefore remember the relevant
flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In
the reply path, the program can use 'bpf_skb_set_tunnel_key' in order
to correctly reply to the remote, using similar characteristics, based
on the stored tunnel flags.

Introduce BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If
specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags.

Decided to use the existing unused 'tunnel_ext' as the storage for the
'tunnel_flags' in order to avoid changing bpf_tunnel_key's layout.

Also, the following has been considered during the design:

  1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy
     and place into the new 'tunnel_flags' field. This has 2 drawbacks:

     - The BPF_F_yyy flags are from *set_tunnel_key* enumeration space,
       e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into
       tunnel_flags from a *get_tunnel_key* call.
     - Not all "interesting" TUNNEL_xxx flags can be mapped to existing
       BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy
       flags just for purposes of the returned tunnel_flags.

  2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only
     "interesting" flags. That's ok, but the drawback is that what's
     "interesting" for my usecase might be limiting for other usecases.

Therefore I decided to expose what's in key.tun_flags *as is*, which seems
most flexible. The BPF user can just choose to ignore bits he's not
interested in. The TUNNEL_xxx are also UAPI, so no harm exposing them
back in the get_tunnel_key call.
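
A hedged usage sketch of the new flag, assuming headers from a tree that carries this change; the program and section names are illustrative, and the TUNNEL_xxx defines come from the UAPI header as noted above:

  #include <linux/bpf.h>
  #include <linux/if_tunnel.h>            /* TUNNEL_CSUM, TUNNEL_SEQ, TUNNEL_KEY, ... */
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int record_tun_flags(struct __sk_buff *skb)
  {
          struct bpf_tunnel_key key = {};

          if (bpf_skb_get_tunnel_key(skb, &key, sizeof(key), BPF_F_TUNINFO_FLAGS))
                  return 0;

          if (key.tunnel_flags & TUNNEL_CSUM)
                  bpf_printk("peer sent GRE checksum");
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";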

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
2022-09-02 15:20:55 +02:00
Vladimir Oltean
1ab3d41757 selftests: net: dsa: symlink the tc_actions.sh test
This has been validated on the Ocelot/Felix switch family (NXP LS1028A)
and should be relevant to any switch driver that offloads the tc-flower
and/or tc-matchall actions trap, drop, accept, mirred, for which DSA has
operations.

TEST: gact drop and ok (skip_hw)                                    [ OK ]
TEST: mirred egress flower redirect (skip_hw)                       [ OK ]
TEST: mirred egress flower mirror (skip_hw)                         [ OK ]
TEST: mirred egress matchall mirror (skip_hw)                       [ OK ]
TEST: mirred_egress_to_ingress (skip_hw)                            [ OK ]
TEST: gact drop and ok (skip_sw)                                    [ OK ]
TEST: mirred egress flower redirect (skip_sw)                       [ OK ]
TEST: mirred egress flower mirror (skip_sw)                         [ OK ]
TEST: mirred egress matchall mirror (skip_sw)                       [ OK ]
TEST: trap (skip_sw)                                                [ OK ]
TEST: mirred_egress_to_ingress (skip_sw)                            [ OK ]

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220831170839.931184-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-01 20:53:44 -07:00
Paolo Bonzini
29250ba51b Merge tag 'kvm-s390-master-6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
PCI interpretation compile fixes
2022-09-01 19:21:27 -04:00
Jakub Kicinski
60ad1100d5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/.gitignore
  sort the net-next version and use it

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-01 12:58:02 -07:00
Hou Tao
73b97bc78b selftests/bpf: Test concurrent updates on bpf_task_storage_busy
Under a fully preemptible kernel, task local storage lookup operations on
the same CPU may update per-cpu bpf_task_storage_busy concurrently. If
the update of bpf_task_storage_busy is not preemption safe, the final
value of bpf_task_storage_busy may become non-zero forever and
bpf_task_storage_trylock() will always fail. So add a test case to
ensure the update of bpf_task_storage_busy is preemption safe.

The test case is skipped when CONFIG_PREEMPT is disabled, and it can only
reproduce the problem probabilistically. By increasing
TASK_STORAGE_MAP_NR_LOOP and running it under an ARM64 VM with 4 CPUs, it
takes about four rounds to reproduce:

> test_maps is modified to only run test_task_storage_map_stress_lookup()
$ export TASK_STORAGE_MAP_NR_THREAD=256
$ export TASK_STORAGE_MAP_NR_LOOP=81920
$ export TASK_STORAGE_MAP_PIN_CPU=1
$ time ./test_maps
test_task_storage_map_stress_lookup(135):FAIL:bad bpf_task_storage_busy got -2

real    0m24.743s
user    0m6.772s
sys     0m17.966s

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-5-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-01 12:16:21 -07:00
Hou Tao
c710136e87 selftests/bpf: Move sys_pidfd_open() into task_local_storage_helpers.h
sys_pidfd_open() is defined in both test_bprm_opts.c and
test_local_storage.c, so move it to a common header file. It will also be
used in map_tests.
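
The helper being consolidated is essentially a thin syscall wrapper; a sketch of its usual shape (the selftests' header may differ in detail):

  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  #ifndef __NR_pidfd_open
  #define __NR_pidfd_open 434     /* asm-generic syscall number */
  #endif

  static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
  {
          return syscall(__NR_pidfd_open, pid, flags);
  }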

Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-4-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-01 12:16:20 -07:00
Linus Torvalds
42e66b1cc3 Merge tag 'net-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth, bpf and wireless.

  Current release - regressions:

   - bpf:
      - fix wrong last sg check in sk_msg_recvmsg()
      - fix kernel BUG in purge_effective_progs()

   - mac80211:
      - fix possible leak in ieee80211_tx_control_port()
      - potential NULL dereference in ieee80211_tx_control_port()

  Current release - new code bugs:

   - nfp: fix the access to management firmware hanging

  Previous releases - regressions:

   - ip: fix triggering of 'icmp redirect'

   - sched: tbf: don't call qdisc_put() while holding tree lock

   - bpf: fix corrupted packets for XDP_SHARED_UMEM

   - bluetooth: hci_sync: fix suspend performance regression

   - micrel: fix probe failure

  Previous releases - always broken:

   - tcp: make global challenge ack rate limitation per net-ns and
     default disabled

   - tg3: fix potential hang-up on system reboot

   - mac802154: fix reception for no-daddr packets

  Misc:

   - r8152: add PID for the lenovo onelink+ dock"

* tag 'net-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
  net/smc: Remove redundant refcount increase
  Revert "sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb"
  tcp: make global challenge ack rate limitation per net-ns and default disabled
  tcp: annotate data-race around challenge_timestamp
  net: dsa: hellcreek: Print warning only once
  ip: fix triggering of 'icmp redirect'
  sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb
  selftests: net: sort .gitignore file
  Documentation: networking: correct possessive "its"
  kcm: fix strp_init() order and cleanup
  mlxbf_gige: compute MDIO period based on i1clk
  ethernet: rocker: fix sleep in atomic context bug in neigh_timer_handler
  net: lan966x: improve error handle in lan966x_fdma_rx_get_frame()
  nfp: fix the access to management firmware hanging
  net: phy: micrel: Make the GPIO to be non-exclusive
  net: virtio_net: fix notification coalescing comments
  net/sched: fix netdevice reference leaks in attach_default_qdiscs()
  net: sched: tbf: don't call qdisc_put() while holding tree lock
  net: Use u64_stats_fetch_begin_irq() for stats fetch.
  net: dsa: xrs700x: Use irqsave variant for u64 stats update
  ...
2022-09-01 09:20:42 -07:00
Pavel Begunkov
916d72c10a selftests/net: return back io_uring zc send tests
Enable the io_uring zerocopy send tests again and fix them up to follow the
new interface.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8e5018c516093bdad0b6e19f2f9847dea17e4d2.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 09:13:33 -06:00
Pavel Begunkov
75847100c3 selftests/net: temporarily disable io_uring zc test
We're going to change the API; to avoid build problems with a couple of the
following commits, disable the io_uring testing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/12b7507223df04fbd12aa05fc0cb544b51d7ed79.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-01 09:13:33 -06:00
Richard Gobert
0e4d354762 net-next: Fix IP_UNICAST_IF option behavior for connected sockets
The IP_UNICAST_IF socket option is used to set the outgoing interface
for outbound packets.

The IP_UNICAST_IF socket option was added as it was needed by the
Wine project, since no other existing option (SO_BINDTODEVICE socket
option, IP_PKTINFO socket option or the bind function) provided the
characteristics needed by the IP_UNICAST_IF socket option. [1]
The IP_UNICAST_IF socket option works well for unconnected sockets,
that is, the interface specified by the IP_UNICAST_IF socket option
is taken into consideration in the route lookup process when a packet
is being sent. However, for connected sockets, the outbound interface
is chosen when connecting the socket, and in the route lookup process
which is done when a packet is being sent, the interface specified by
the IP_UNICAST_IF socket option is being ignored.

This inconsistent behavior was reported and discussed in an issue
opened on systemd's GitHub project [2]. Also, a bug report was
submitted in the kernel's bugzilla [3].

To understand the problem in more detail, we can look at what happens
for UDP packets over IPv4 (The same analysis was done separately in
the referenced systemd issue).
When a UDP packet is sent the udp_sendmsg function gets called and
the following happens:

1. The oif member of the struct ipcm_cookie ipc (which stores the
output interface of the packet) is initialized by the ipcm_init_sk
function to inet->sk.sk_bound_dev_if (the device set by the
SO_BINDTODEVICE socket option).

2. If the IP_PKTINFO socket option was set, the oif member gets
overridden by the call to the ip_cmsg_send function.

3. If no output interface was selected yet, the interface specified
by the IP_UNICAST_IF socket option is used.

4. If the socket is connected and no destination address is
specified in the send function, the struct ipcm_cookie ipc is not
taken into consideration and the cached route, that was calculated in
the connect function is being used.

Thus, for a connected socket, the IP_UNICAST_IF sockopt isn't taken
into consideration.

This patch corrects the behavior of the IP_UNICAST_IF socket option
for connect()ed sockets by taking into consideration the
IP_UNICAST_IF sockopt when connecting the socket.

In order to avoid reconnecting the socket, this option is still
ignored when applied to an already connected socket until connect()
is called again.

Change the __ip4_datagram_connect function, which is called during
socket connection, to take into consideration the interface set by
the IP_UNICAST_IF socket option, in a similar way to what is done in
the udp_sendmsg function.
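
A hedged userspace sketch of the behavior this enables; the interface name, destination handling and error paths are illustrative only:

  #include <arpa/inet.h>
  #include <net/if.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static int connect_via(const char *ifname, const struct sockaddr_in *dst)
  {
          unsigned int ifindex = htonl(if_nametoindex(ifname)); /* ip(7): net-order index */
          int fd = socket(AF_INET, SOCK_DGRAM, 0);

          if (fd < 0)
                  return -1;
          if (setsockopt(fd, IPPROTO_IP, IP_UNICAST_IF, &ifindex, sizeof(ifindex)) ||
              connect(fd, (const struct sockaddr *)dst, sizeof(*dst))) {
                  close(fd);
                  return -1;
          }
          /* with this change, the route cached at connect() honours IP_UNICAST_IF */
          return fd;
  }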

[1] https://lore.kernel.org/netdev/1328685717.4736.4.camel@edumazet-laptop/T/
[2] https://github.com/systemd/systemd/issues/11935#issuecomment-618691018
[3] https://bugzilla.kernel.org/show_bug.cgi?id=210255

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220829111554.GA1771@debian
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-31 19:51:06 -07:00
Hou Tao
1c636b6277 selftests/bpf: Add test cases for htab update
One test demonstrates that a reentrant hash map update on the same bucket
should fail, and another one shows that concurrent updates of the same hash
map bucket should succeed and not fail due to the reentrancy check on the
bucket lock.

There is no trampoline support on s390x, so move htab_update to the
denylist.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220831042629.130006-4-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-08-31 14:10:01 -07:00