Martin mentioned that the verifier cannot assume arguments from
LSM hook sk_alloc_security being trusted since after the hook
is called, the sk ref_count is set to 1. This will overwrite
the ref_count changed by the bpf program and may cause ref_count
underflow later on.
I then further checked some other hooks. For example,
for bpf_lsm_file_alloc() hook in fs/file_table.c,
f->f_cred = get_cred(cred);
error = security_file_alloc(f);
if (unlikely(error)) {
file_free_rcu(&f->f_rcuhead);
return ERR_PTR(error);
}
atomic_long_set(&f->f_count, 1);
The input parameter 'f' to security_file_alloc() cannot be trusted
as well.
Specifically, I investiaged bpf_map/bpf_prog/file/sk/task alloc/free
lsm hooks. Except bpf_map_alloc and task_alloc, arguments for all other
hooks should not be considered as trusted. This may not be a complete
list, but it covers common usage for sk and task.
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203204954.2043348-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that
commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED
so that it can be passed into kfuncs or helpers as an argument.
Martin raised a good question in [1] such that the rcu pointer,
although being able to accessing the object, might have reference
count of 0. This might cause a problem if the rcu pointer is passed
to a kfunc which expects trusted arguments where ref count should
be greater than 0.
This patch makes the following changes related to MEM_RCU pointer:
- MEM_RCU pointer might be NULL (PTR_MAYBE_NULL).
- Introduce KF_RCU so MEM_RCU ptr can be acquired with
a KF_RCU tagged kfunc which assumes ref count of rcu ptr
could be zero.
- For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and
'a' is tagged with __rcu as well. Let us mark 'b' as
MEM_RCU | PTR_MAYBE_NULL.
[1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/
Fixes: 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull perf fix from Borislav Petkov:
- Fix a use-after-free case where the perf pending task callback would
see an already freed event
* tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix perf_pending_task() UaF
When handing the SIGNAL_GROUP_EXIT flag, the info in ksignal isn't cleared.
However, the info acquired by dequeue_synchronous_signal/dequeue_signal is
initialized and can be safely used. Fortunately, the fatal signal process
just uses the si_signo and doesn't use any other member. Even so, the
initialization before use is more safer.
Signed-off-by: haifeng.xu <haifeng.xu@shopee.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221128065606.19570-1-haifeng.xu@shopee.com
Resolve conflicts in drivers/vfio/vfio_main.c by using the iommfd version.
The rc fix was done a different way when iommufd patches reworked this
code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Writing the current state back in hotplug/target calls cpu_down()
which will set cpu dying even when it isn't and then nothing will
ever clear it. A stress test that reads values and writes them back
for all cpu device files in sysfs will trigger the BUG() in
select_fallback_rq once all cpus are marked as dying.
kernel/cpu.c::target_store()
...
if (st->state < target)
ret = cpu_up(dev->id, target);
else
ret = cpu_down(dev->id, target);
cpu_down() -> cpu_set_state()
bool bringup = st->state < target;
...
if (cpu_dying(cpu) != !bringup)
set_cpu_dying(cpu, !bringup);
Fix this by letting state==target fall through in the target_store()
conditional. Also make sure st->target == target in that case.
Fixes: 757c989b99 ("cpu/hotplug: Make target state writeable")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20221117162329.3164999-2-pauld@redhat.com
Commit ca16d5bee5 ("futex: Prevent robust futex exit race") addressed
two cases when tasks waiting on a robust non-PI futex remained blocked
despite the futex not being owned anymore:
* if the owner died after writing zero to the futex word, but before
waking up a waiter
* if a task waiting on the futex was woken up, but died before updating
the futex word (effectively swallowing the notification without acting
on it)
In the second case, the task could be woken up either by the previous
owner (after the futex word was reset to zero) or by the kernel (after
the OWNER_DIED bit was set and the TID part of the futex word was reset
to zero) if the previous owner died without the resetting the futex.
Because the referenced commit wakes up a potential waiter only if the
whole futex word is zero, the latter subcase remains unaddressed.
Fix this by looking only at the TID part of the futex when deciding
whether a wake up is needed.
Fixes: ca16d5bee5 ("futex: Prevent robust futex exit race")
Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221111215439.248185-1-izbyshev@ispras.ru
The console_list_lock provides synchronization for console list and
console->flags updates. All call sites that were using the console_lock
for this synchronization have either switched to use the
console_list_lock or the SRCU list iterator.
Remove console_lock usage for console list updates and console->flags
updates.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-40-john.ogness@linutronix.de
With commit 9e124fe16ff2("xen: Enable console tty by default in domU
if it's not a dummy") a hack was implemented to make sure that the
tty console remains the console behind the /dev/console device. The
main problem with the hack is that, after getting the console pointer
to the tty console, it is assumed the pointer is still valid after
releasing the console_sem. This assumption is incorrect and unsafe.
Make the hack safe by introducing a new function
console_force_preferred_locked() and perform the full operation
under the console_list_lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-33-john.ogness@linutronix.de
Currently it is not possible for drivers to detect if they have
already successfully registered their console. Several drivers
have multiple paths that lead to console registration. To avoid
attempting a 2nd registration (which leads to a WARN), drivers
are implementing their own solution.
Introduce console_is_registered() so drivers can easily identify
if their console is currently registered. A _locked() variant
is also provided if the caller is already holding the
console_list_lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-22-john.ogness@linutronix.de
Use srcu console list iteration for console list traversal. It is
acceptable because the consoles might come and go at any time.
Strict synchronizing with console registration code would not bring
any advantage over srcu.
Document why the console_lock is still necessary. Note that this
is a preparatory change for when console_lock no longer provides
synchronization for the console list.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-21-john.ogness@linutronix.de
Guarantee safe iteration of the console list by using SRCU.
Note that in the case of a handover, the SRCU read lock is also
released. This is documented in the function description and as
comments in the code. It is a bit tricky, but this preserves the
lockdep lock ordering for the context handing over the
console_lock:
console_lock()
| mutex_acquire(&console_lock_dep_map) <-- console lock
|
console_unlock()
| console_flush_all()
| | srcu_read_lock(&console_srcu) <-- srcu lock
| | console_emit_next_record()
| | | console_lock_spinning_disable_and_check()
| | | | srcu_read_unlock(&console_srcu) <-- srcu unlock
| | | | mutex_release(&console_lock_dep_map) <-- console unlock
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-16-john.ogness@linutronix.de
After switching to SRCU for console list iteration, some readers
will begin readings console->flags as a data race. Locklessly
reading console->flags provides a consistent value because there
is at most one CPU modifying console->flags and that CPU is
using only read-modify-write operations.
Introduce a wrapper for SRCU iterators to read console flags.
Introduce a matching wrapper to write to flags of registered
consoles. Writing to flags of registered consoles is synchronized
by the console_list_lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-13-john.ogness@linutronix.de
Currently there exist races in register_console(), where the types
of registered consoles are checked (without holding the console_lock)
and then after acquiring the console_lock, it is assumed that the list
has not changed. Also, some code that performs console_unregister()
make similar assumptions.
It might be possible to fix these races using the console_lock. But
it would require a complex analysis of all console drivers to make
sure that the console_lock is not taken in match() and setup()
callbacks. And we really prefer to split up and reduce the
responsibilities of console_lock rather than expand its complexity.
Therefore, introduce a new console_list_lock to provide full
synchronization for any console list changes.
In addition, also use console_list_lock for synchronization of
console->flags updates. All flags are either static or modified only
during the console registration. There are only two exceptions.
The first exception is CON_ENABLED, which is also modified by
console_start()/console_stop(). Therefore, these functions must
also take the console_list_lock.
The second exception is when the flags are modified by the console
driver init code before the console is registered. These will be
ignored because they are not visible to the rest of the system
via the console_drivers list.
Note that one of the various responsibilities of the console_lock is
also intended to provide console list and console->flags
synchronization. Later changes will update call sites relying on the
console_lock for these purposes. Once all call sites have been
updated, the console_lock will be relieved of synchronizing
console_list and console->flags updates.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87sficwokr.fsf@jogness.linutronix.de
It used to be that all consoles were synchronized with respect to
which message they were printing. After commit a699449bb1 ("printk:
refactor and rework printing logic"), all consoles have their own
@seq for tracking which message they are on. That commit also changed
how the initial sequence number was chosen. Instead of choosing the
next non-printed message, it chose the sequence number of the next
message that will be added to the ringbuffer.
That change created a possibility that a non-boot console taking over
for a boot console might skip messages if the boot console was behind
and did not have a chance to catch up before being unregistered.
Since it is not known which boot console is the same device, flush
all consoles and, if necessary, start with the message of the enabled
boot console that is the furthest behind. If no boot consoles are
enabled, begin with the next message that will be added to the
ringbuffer.
Also, since boot consoles are meant to be used at boot time, handle
them the same as CON_PRINTBUFFER to ensure that no initial messages
are skipped.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-7-john.ogness@linutronix.de
The code to initialize @seq for a new console needs to consider
more factors when choosing an initial value. Move the code into
a helper function console_init_seq() "as is" so this code can
be expanded without causing register_console() to become too
long. A later commit will implement the additional code.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-6-john.ogness@linutronix.de
The elf_check_arch() function is also used to test compatibility of
usermode binaries. Kernel modules may have more specific requirements,
for example powerpc would like to test for ABI version compatibility.
Add a weak module_elf_check_arch() that defaults to true, and call it
from elf_validity_check().
Signed-off-by: Jessica Yu <jeyu@kernel.org>
[np: added changelog, adjust name, rebase]
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221128041539.1742489-2-npiggin@gmail.com
Consider a verifier state with three acquired references, all with
release_on_unlock = true:
idx 0 1 2
state->refs = [2 4 6]
(with 2, 4, and 6 being the ref ids).
When bpf_spin_unlock is called, process_spin_lock will loop through all
acquired_refs and, for each ref, if it's release_on_unlock, calls
release_reference on it. That function in turn calls
release_reference_state, which removes the reference from state->refs by
swapping the reference state with the last reference state in
refs array and decrements acquired_refs count.
process_spin_lock's loop logic, which is essentially:
for (i = 0; i < state->acquired_refs; i++) {
if (!state->refs[i].release_on_unlock)
continue;
release_reference(state->refs[i].id);
}
will fail to release release_on_unlock references which are swapped from
the end. Running this logic on our example demonstrates:
state->refs = [2 4 6] (start of idx=0 iter)
release state->refs[0] by swapping w/ state->refs[2]
state->refs = [6 4] (start of idx=1)
release state->refs[1], no need to swap as it's the last idx
state->refs = [6] (start of idx=2, loop terminates)
ref_id 6 should have been removed but was skipped.
Fix this by looping from back-to-front, which results in refs that are
candidates for removal being swapped with refs which have already been
examined and kept.
If we modify our initial example such that ref 6 is replaced with ref 7,
which is _not_ release_on_unlock, and loop from the back, we'd see:
state->refs = [2 4 7] (start of idx=2)
state->refs = [2 4 7] (start of idx=1)
state->refs = [2 7] (start of idx=0, refs 7 and 4 swapped)
state->refs = [7] (after idx=0, 7 and 2 swapped, loop terminates)
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 534e86bc6c ("bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}")
Link: https://lore.kernel.org/r/20221201183406.1203621-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
1. With CONFIG_JUMP_LABEL=n static_key_slow_inc() doesn't have any
protection against key->enabled refcounter overflow.
2. With CONFIG_JUMP_LABEL=y static_key_slow_inc_cpuslocked()
still may turn the refcounter negative as (v + 1) may overflow.
key->enabled is indeed a ref-counter as it's documented in multiple
places: top comment in jump_label.h, Documentation/staging/static-keys.rst,
etc.
As -1 is reserved for static key that's in process of being enabled,
functions would break with negative key->enabled refcount:
- for CONFIG_JUMP_LABEL=n negative return of static_key_count()
breaks static_key_false(), static_key_true()
- the ref counter may become 0 from negative side by too many
static_key_slow_inc() calls and lead to use-after-free issues.
These flaws result in that some users have to introduce an additional
mutex and prevent the reference counter from overflowing themselves,
see bpf_enable_runtime_stats() checking the counter against INT_MAX / 2.
Prevent the reference counter overflow by checking if (v + 1) > 0.
Change functions API to return whether the increment was successful.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds lockdep checks for illegal use of synchronize_srcu()
within same-type SRCU read-side critical sections and within normal
RCU read-side critical sections. It also makes synchronize_srcu()
be a no-op during early boot.
These changes bring Tiny synchronize_srcu() into line with both Tree
synchronize_srcu() and Tiny synchronize_rcu().
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: John Ogness <john.ogness@linutronix.de>
The system memory map can change over a hibernation-restore cycle due
to a defect in the platform firmware, and some of the page frames used
by the kernel before hibernation may not be available any more during
the subsequent restore which leads to the error below.
[ T357] PM: Image loading progress: 0%
[ T357] PM: Read 2681596 kbytes in 0.03 seconds (89386.53 MB/s)
[ T357] PM: Error -14 resuming
[ T357] PM: Failed to load hibernation image, recovering.
[ T357] PM: Basic memory bitmaps freed
[ T357] OOM killer enabled.
[ T357] Restarting tasks ... done.
[ T357] PM: resume from hibernation failed (-14)
[ T357] PM: Hibernation image not present or could not be loaded.
Add an error message to the unpack() function to allow problematic
page frames to be identified and the source of the problem to be
diagnosed more easily. This can save developers quite a bit of
debugging time.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
[ rjw: New subject, edited changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Many Linux systems are configured to not panic on oops; but allowing an
attacker to oops the system **really** often can make even bugs that look
completely unexploitable exploitable (like NULL dereferences and such) if
each crash elevates a refcount by one or a lock is taken in read mode, and
this causes a counter to eventually overflow.
The most interesting counters for this are 32 bits wide (like open-coded
refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
platforms is just 16 bits, but probably nobody cares about 32-bit platforms
that much nowadays.)
So let's panic the system if the kernel is constantly oopsing.
The speed of oopsing 2^32 times probably depends on several factors, like
how long the stack trace is and which unwinder you're using; an empirically
important one is whether your console is showing a graphical environment or
a text console that oopses will be printed to.
In a quick single-threaded benchmark, it looks like oopsing in a vfork()
child with a very short stack trace only takes ~510 microseconds per run
when a graphical console is active; but switching to a text console that
oopses are printed to slows it down around 87x, to ~45 milliseconds per
run.
(Adding more threads makes this faster, but the actual oops printing
happens under &die_lock on x86, so you can maybe speed this up by a factor
of around 2 and then any further improvement gets eaten up by lock
contention.)
It looks like it would take around 8-12 days to overflow a 32-bit counter
with repeated oopsing on a multi-core X86 system running a graphical
environment; both me (in an X86 VM) and Seth (with a distro kernel on
normal hardware in a standard configuration) got numbers in that ballpark.
12 days aren't *that* short on a desktop system, and you'd likely need much
longer on a typical server system (assuming that people don't run graphical
desktop environments on their servers), and this is a *very* noisy and
violent approach to exploiting the kernel; and it also seems to take orders
of magnitude longer on some machines, probably because stuff like EFI
pstore will slow it down a ton if that's active.
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
Since the introduction of clockevents, i.e., commit d316c57ff6
("clockevents: add core functionality"), there has been a mismatch between
the function and the kernel-doc comment for clockevent_delta2ns().
Hence, ./scripts/kernel-doc -none kernel/time/clockevents.c warns about it.
Adjust the kernel-doc comment for clockevent_delta2ns() for make W=1
happiness.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221102091048.15068-1-lukas.bulwahn@gmail.com
find_timens_vvar_page() is not architecture-specific, as can be seen from
how all five per-architecture versions of it are the same.
(arm64, powerpc and riscv are exactly the same; x86 and s390 have two
characters difference inside a comment, less blank lines, and mark the
!CONFIG_TIME_NS version as inline.)
Refactor the five copies into a central copy in kernel/time/namespace.c.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the
following compilation error will appear:
$ make LLVM=1 LLVM_IAS=1 -j
...
ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
cgroup_storage_map_btf_ids:;
^
make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1
In local_storage.c, we have
BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a9 ("bpf: Implement cgroup storage available to
non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE
definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker
complains with lto build.
Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times
for sk, inode, task and cgrp local storages. Let us define a single global one
with a different name than cgroup_storage_map_btf_ids, which also fixed
the lto compilation error.
Fixes: c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
Following the pattern of io_uring, perf, skb, and bpf, iommfd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.
user->locked_vm is the good accounting to use for ulimit because it is
per-user, and the security sandboxing of locked pages is not supposed to
be per-process. Other places (vfio, vdpa and infiniband) have used
mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this
is only per-process and inconsistent with the new FOLL_LONGTERM users in
the kernel.
Concurrent work is underway to try to put this in a cgroup, so everything
can be consistent and the kernel can provide a FOLL_LONGTERM limit that
actually provides security.
Link: https://lore.kernel.org/r/7-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>