Commit Graph

648280 Commits

Author SHA1 Message Date
Omar Sandoval
950cd7e9ff blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs
These lists are only useful for debugging; they definitely don't belong
in sysfs. Putting them in debugfs also removes the limitation of a
single page of output.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 08:17:44 -07:00
Omar Sandoval
9abb2ad21e blk-mq: add hctx->{state,flags} to debugfs
hctx->state could come in handy for bugs where the hardware queue gets
stuck in the stopped state, and hctx->flags is just useful to know.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 08:17:44 -07:00
Omar Sandoval
07e4fead45 blk-mq: create debugfs directory tree
In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 08:17:44 -07:00
Jens Axboe
b48fda0976 blk-mq-sched: check for successful allocation before assigning tag
We don't trigger this from the normal IO path, since we always use
blocking allocations from there. But Bart saw it testing multipath
dm, since that is a heavy user of atomic request allocations in
the map and clone path.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-26 14:52:20 -07:00
Jens Axboe
5a797e00dc blk-mq: don't lose flags passed in to blk_mq_alloc_request()
If we come in from blk_mq_alloc_requst() with NOWAIT set in flags,
we must ensure that we don't later overwrite that in
blk_mq_sched_get_request(). Initialize alloc_data->flags before
passing it in.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-26 12:22:11 -07:00
Jens Axboe
200e86b337 blk-mq: only apply active queue tag throttling for driver tags
If we have a scheduler attached, we have two sets of tags. We don't
want to apply our active queue throttling for the scheduler side
of tags, that only applies to driver tags since that's the resource
we need to dispatch an IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-25 08:11:38 -07:00
Markus Elfring
1cf4175303 cfq-iosched: Adjust one function call together with a variable assignment
The script "checkpatch.pl" pointed information out like the following.

ERROR: do not use assignment in if condition

Thus fix the affected source code place.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-23 08:32:18 -07:00
Markus Elfring
d609af3a13 blk-throttle: Adjust two function calls together with a variable assignment
The script "checkpatch.pl" pointed information out like the following.

ERROR: do not use assignment in if condition

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-23 08:32:15 -07:00
Alexander Potapenko
4d608baac5 block: Initialize cfqq->ioprio_class in cfq_get_queue()
KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
uninitialized memory in cfq_init_cfqq():

==================================================================
BUG: KMSAN: use of unitialized memory
...
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff8202ac97>] dump_stack+0x157/0x1d0 lib/dump_stack.c:51
 [<ffffffff813e9b65>] kmsan_report+0x205/0x360 ??:?
 [<ffffffff813eabbb>] __msan_warning+0x5b/0xb0 ??:?
 [<     inline     >] cfq_init_cfqq block/cfq-iosched.c:3754
 [<ffffffff8201e110>] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857
...
origin:
 [<ffffffff8103ab37>] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
 [<ffffffff813e836b>] kmsan_internal_poison_shadow+0xab/0x150 ??:?
 [<ffffffff813e88ab>] kmsan_poison_slab+0xbb/0x120 ??:?
 [<     inline     >] allocate_slab mm/slub.c:1627
 [<ffffffff813e533f>] new_slab+0x3af/0x4b0 mm/slub.c:1641
 [<     inline     >] new_slab_objects mm/slub.c:2407
 [<ffffffff813e0ef3>] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564
 [<     inline     >] __slab_alloc mm/slub.c:2606
 [<     inline     >] slab_alloc_node mm/slub.c:2669
 [<ffffffff813dfb42>] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746
 [<ffffffff8201d90d>] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850
...
==================================================================
(the line numbers are relative to 4.8-rc6, but the bug persists
upstream)

The uninitialized struct cfq_queue is created by kmem_cache_alloc_node()
and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class
before it's initialized.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-23 08:30:28 -07:00
Jens Axboe
70f36b6001 blk-mq: allow resize of scheduler requests
Add support for growing the tags associated with a hardware queue, for
the scheduler tags. Currently we only support resizing within the
limits of the original depth, change that so we can grow it as well by
allocating and replacing the existing scheduler tag set.

This is similar to how we could increase the software queue depth with
the legacy IO stack and schedulers.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-20 09:05:53 -07:00
Jens Axboe
7e79dadce2 blk-mq: stop hardware queue in blk_mq_delay_queue()
The run handler we register for the delayed work requires that the
queue be stopped, yet we leave that up to the caller. Let's move
it into blk_mq_delay_queue() itself, so that the API is sane.

This fixes a stall with SCSI, where it calls blk_mq_delay_queue()
without having stopped the queue. Hence the queue is never run.

Reported-by: Hannes Reinecke <hare@suse.com>
Fixes: 70f4db639c ("blk-mq: add blk_mq_delay_queue")
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19 08:01:55 -07:00
Jens Axboe
8cecb07d70 blk-mq-tag: remove redundant check for 'data->hctx' being non-NULL
We used to pass in NULL for hctx for reserved tags, but we don't
do that anymore. Hence the check for whether hctx is NULL or not
is now redundant, kill it.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: a642a158aec6 ("blk-mq-tag: cleanup the normal/reserved tag allocation")
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19 07:43:05 -07:00
Jens Axboe
610d886c0c elevator: fix unnecessary put of elevator in failure case
We already checked that e is NULL, so no point in calling
elevator_put() to free it.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: dc877dbd088f ("blk-mq-sched: add framework for MQ capable IO schedulers")
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19 07:43:05 -07:00
Jens Axboe
38dbb7dd4d blk-cgroup: don't quiesce the queue on policy activate/deactivate
There's no potential harm in quiescing the queue, but it also doesn't
buy us anything. And we can't run the queue async for policy
deactivate, since we could be in the path of tearing the queue down.
If we schedule an async run of the queue at that time, we're racing
with queue teardown AFTER having we've already torn most of it down.

Reported-by: Omar Sandoval <osandov@fb.com>
Fixes: 4d199c6f1c ("blk-cgroup: ensure that we clear the stop bit on quiesced queues")
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18 15:37:27 -07:00
Omar Sandoval
6c0ca7ae29 sbitmap: fix wakeup hang after sbq resize
When we resize a struct sbitmap_queue, we update the wakeup batch size,
but we don't update the wait count in the struct sbq_wait_states. If we
resized down from a size which could use a bigger batch size, these
counts could be too large and cause us to miss necessary wakeups. To fix
this, update the wait counts when we resize (ensuring some careful
memory ordering so that it's safe w.r.t. concurrent clears).

This also fixes a theoretical issue where two threads could end up
bumping the wait count up by the batch size, which could also
potentially lead to hangs.

Reported-by: Martin Raiber <martin@urbackup.org>
Fixes: e3a2b3f931 ("blk-mq: allow changing of queue depth through sysfs")
Fixes: 2971c35f35 ("blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18 13:41:55 -07:00
Omar Sandoval
f66227de59 sbitmap: use smp_mb__after_atomic() in sbq_wake_up()
We always do an atomic clear_bit() right before we call sbq_wake_up(),
so we can use smp_mb__after_atomic(). While we're here, comment the
memory barriers in here a little more.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18 13:41:49 -07:00
Jens Axboe
4d199c6f1c blk-cgroup: ensure that we clear the stop bit on quiesced queues
If we call blk_mq_quiesce_queue() on a queue, we must remember to
pair that with something that clears the stopped by on the
queues later on.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18 07:43:26 -07:00
Jens Axboe
d348499138 blk-mq-sched: allow setting of default IO scheduler
Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't use for normal IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:31 -07:00
Jens Axboe
945ffb60c1 mq-deadline: add blk-mq adaptation of the deadline IO scheduler
This is basically identical to deadline-iosched, except it registers
as a MQ capable scheduler. This is still a single queue design.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:25 -07:00
Jens Axboe
bd166ef183 blk-mq-sched: add framework for MQ capable IO schedulers
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:20 -07:00
Jens Axboe
2af8cbe305 blk-mq: split tag ->rqs[] into two
This is in preparation for having two sets of tags available. For
that we need a static index, and a dynamically assignable one.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:15 -07:00
Jens Axboe
fd2d332677 blk-mq: add support for carrying internal tag information in blk_qc_t
No functional change in this patch, just in preparation for having
two types of tags available to the block layer for a single request.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:09 -07:00
Jens Axboe
cc71a6f438 blk-mq: abstract out helpers for allocating/freeing tag maps
Prep patch for adding an extra tag map for scheduler requests.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:04 -07:00
Jens Axboe
4941115bef blk-mq-tag: cleanup the normal/reserved tag allocation
This is in preparation for having another tag set available. Cleanup
the parameters, and allow passing in of tags for blk_mq_put_tag().

Signed-off-by: Jens Axboe <axboe@fb.com>
[hch: even more cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:59 -07:00
Jens Axboe
2c3ad66790 blk-mq: export some helpers we need to the scheduling framework
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:53 -07:00
Jens Axboe
16a3c2a70c blk-mq: un-export blk_mq_free_hctx_request()
It's only used in blk-mq, kill it from the main exported header
and kill the symbol export as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:48 -07:00
Jens Axboe
c23ecb4260 block: move rq_ioc() to blk.h
We want to use it outside of blk-core.c.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:42 -07:00
Jens Axboe
c51ca6cf54 block: move existing elevator ops to union
Prep patch for adding MQ ops as well, since doing anon unions with
named initializers doesn't work on older compilers.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:33 -07:00
Alden Tondettar
c5082b70ad partitions/efi: Fix integer overflow in GPT size calculation
If a GUID Partition Table claims to have more than 2**25 entries, the
calculation of the partition table size in alloc_read_gpt_entries() will
overflow a 32-bit integer and not enough space will be allocated for the
table.

Nothing seems to get written out of bounds, but later efi_partition() will
read up to 32768 bytes from a 128 byte buffer, possibly OOPSing or exposing
information to /proc/partitions and uevents.

The problem exists on both 64-bit and 32-bit platforms.

Fix the overflow and also print a meaningful debug message if the table
size is too large.

Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-17 09:02:31 -07:00
Josef Bacik
1e668f4e79 MAINTAINERS: Update maintainer entry for NBD
The old maintainers email is bouncing and I've rewritten most of this
driver in the recent months.  Also add linux-block to the mailinglist
and remove the old tree, I will send patches through the linux-block
tree.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-11 20:49:17 -07:00
Jens Axboe
f8a5b12247 blk-mq: make mq_ops a const pointer
We never change it, make that clear.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2017-01-11 20:47:44 -07:00
Ming Lei
729204ef49 block: relax check on sg gap
If the last bvec of the 1st bio and the 1st bvec of the next
bio are physically contigious, and the latter can be merged
to last segment of the 1st bio, we should think they don't
violate sg gap(or virt boundary) limit.

Both Vitaly and Dexuan reported lots of unmergeable small bios
are observed when running mkfs on Hyper-V virtual storage, and
performance becomes quite low. This patch fixes that performance
issue.

The same issue should exist on NVMe, since it sets virt boundary too.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reported-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-11 20:47:08 -07:00
Vlastimil Babka
1661f2e21c floppy: replace wrong kmalloc(GFP_USER) with GFP_KERNEL
The raw_cmd_copyin() function does a kmalloc() with GFP_USER, although the
allocated structure is obviously not mapped to userspace, just copied from/to.
In this case GFP_KERNEL is more appropriate, so let's use it, although in the
current implementation this does not manifest as any error.

Reported-by: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-01-11 20:47:08 -07:00
Linus Torvalds
a121103c92 Linux 4.10-rc3 v4.10-rc3 2017-01-08 14:18:17 -08:00
Linus Torvalds
83280e90ef Merge tag 'usb-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
 "Here are a bunch of USB fixes for 4.10-rc3. Yeah, it's a lot, an
  artifact of the holiday break I think.

  Lots of gadget and the usual XHCI fixups for reported issues (one day
  that driver will calm down...) Also included are a bunch of usb-serial
  driver fixes, and for good measure, a number of much-reported MUSB
  driver issues have finally been resolved.

  All of these have been in linux-next with no reported issues"

* tag 'usb-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (72 commits)
  USB: fix problems with duplicate endpoint addresses
  usb: ohci-at91: use descriptor-based gpio APIs correctly
  usb: storage: unusual_uas: Add JMicron JMS56x to unusual device
  usb: hub: Move hub_port_disable() to fix warning if PM is disabled
  usb: musb: blackfin: add bfin_fifo_offset in bfin_ops
  usb: musb: fix compilation warning on unused function
  usb: musb: Fix trying to free already-free IRQ 4
  usb: musb: dsps: implement clear_ep_rxintr() callback
  usb: musb: core: add clear_ep_rxintr() to musb_platform_ops
  USB: serial: ti_usb_3410_5052: fix NULL-deref at open
  USB: serial: spcp8x5: fix NULL-deref at open
  USB: serial: quatech2: fix sleep-while-atomic in close
  USB: serial: pl2303: fix NULL-deref at open
  USB: serial: oti6858: fix NULL-deref at open
  USB: serial: omninet: fix NULL-derefs at open and disconnect
  USB: serial: mos7840: fix misleading interrupt-URB comment
  USB: serial: mos7840: remove unused write URB
  USB: serial: mos7840: fix NULL-deref at open
  USB: serial: mos7720: remove obsolete port initialisation
  USB: serial: mos7720: fix parallel probe
  ...
2017-01-08 11:42:04 -08:00
Linus Torvalds
cc250e267b Merge tag 'char-misc-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
 "Here are a few small char/misc driver fixes for 4.10-rc3.

  Two MEI driver fixes, and three NVMEM patches for reported issues, and
  a new Hyper-V driver MAINTAINER update. Nothing major at all, all have
  been in linux-next with no reported issues"

* tag 'char-misc-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  hyper-v: Add myself as additional MAINTAINER
  nvmem: fix nvmem_cell_read() return type doc
  nvmem: imx-ocotp: Fix wrong register size
  nvmem: qfprom: Allow single byte accesses for read/write
  mei: move write cb to completion on credentials failures
  mei: bus: fix mei_cldev_enable KDoc
2017-01-08 11:37:44 -08:00
Linus Torvalds
6ea17ed15d Merge tag 'staging-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO fixes from Greg KH:
 "Here are some staging and IIO driver fixes for 4.10-rc3.

  Most of these are minor IIO fixes of reported issues, along with one
  network driver fix to resolve an issue. And a MAINTAINERS update with
  a new mailing list. All of these, except the MAINTAINERS file update,
  have been in linux-next with no reported issues (the MAINTAINERS patch
  happened on Friday...)"

* tag 'staging-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  MAINTAINERS: add greybus subsystem mailing list
  staging: octeon: Call SET_NETDEV_DEV()
  iio: accel: st_accel: fix LIS3LV02 reading and scaling
  iio: common: st_sensors: fix channel data parsing
  iio: max44000: correct value in illuminance_integration_time_available
  iio: adc: TI_AM335X_ADC should depend on HAS_DMA
  iio: bmi160: Fix time needed to sleep after command execution
  iio: 104-quad-8: Fix active level mismatch for the preset enable option
  iio: 104-quad-8: Fix off-by-one errors when addressing IOR
  iio: 104-quad-8: Fix index control configuration
2017-01-08 11:22:00 -08:00
Johannes Weiner
ea07b862ac mm: workingset: fix use-after-free in shadow node shrinker
Several people report seeing warnings about inconsistent radix tree
nodes followed by crashes in the workingset code, which all looked like
use-after-free access from the shadow node shrinker.

Dave Jones managed to reproduce the issue with a debug patch applied,
which confirmed that the radix tree shrinking indeed frees shadow nodes
while they are still linked to the shadow LRU:

  WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
  CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
  Call Trace:
     delete_node+0x1e4/0x200
     __radix_tree_delete_node+0xd/0x10
     shadow_lru_isolate+0xe6/0x220
     __list_lru_walk_one.isra.4+0x9b/0x190
     list_lru_walk_one+0x23/0x30
     scan_shadow_nodes+0x2e/0x40
     shrink_slab.part.44+0x23d/0x5d0
     shrink_node+0x22c/0x330
     kswapd+0x392/0x8f0

This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
inlined radix_tree_shrink().

The problem is with 14b468791f ("mm: workingset: move shadow entry
tracking to radix tree exceptional tracking"), which passes an update
callback into the radix tree to link and unlink shadow leaf nodes when
tree entries change, but forgot to pass the callback when reclaiming a
shadow node.

While the reclaimed shadow node itself is unlinked by the shrinker, its
deletion from the tree can cause the left-most leaf node in the tree to
be shrunk.  If that happens to be a shadow node as well, we don't unlink
it from the LRU as we should.

Consider this tree, where the s are shadow entries:

       root->rnode
            |
       [0       n]
        |       |
     [s    ] [sssss]

Now the shadow node shrinker reclaims the rightmost leaf node through
the shadow node LRU:

       root->rnode
            |
       [0        ]
        |
    [s     ]

Because the parent of the deleted node is the first level below the
root and has only one child in the left-most slot, the intermediate
level is shrunk and the node containing the single shadow is put in
its place:

       root->rnode
            |
       [s        ]

The shrinker again sees a single left-most slot in a first level node
and thus decides to store the shadow in root->rnode directly and free
the node - which is a leaf node on the shadow node LRU.

  root->rnode
       |
       s

Without the update callback, the freed node remains on the shadow LRU,
where it causes later shrinker runs to crash.

Pass the node updater callback into __radix_tree_delete_node() in case
the deletion causes the left-most branch in the tree to collapse too.

Also add warnings when linked nodes are freed right away, rather than
wait for the use-after-free when the list is scanned much later.

Fixes: 14b468791f ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Leech <cleech@redhat.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-07 18:22:40 -08:00
Hugh Dickins
b0b9b3df27 mm: stop leaking PageTables
4.10-rc loadtest (even on x86, and even without THPCache) fails with
"fork: Cannot allocate memory" or some such; and /proc/meminfo shows
PageTables growing.

Commit 953c66c2b2 ("mm: THP page cache support for ppc64") that got
merged in rc1 removed the freeing of an unused preallocated pagetable
after do_fault_around() has called map_pages().

This is usually a good optimization, so that the followup doesn't have
to reallocate one; but it's not sufficient to shift the freeing into
alloc_set_pte(), since there are failure cases (most commonly
VM_FAULT_RETRY) which never reach finish_fault().

Check and free it at the outer level in do_fault(), then we don't need
to worry in alloc_set_pte(), and can restore that to how it was (I
cannot find any reason to pte_free() under lock as it was doing).

And fix a separate pagetable leak, or crash, introduced by the same
change, that could only show up on some ppc64: why does do_set_pmd()'s
failure case attempt to withdraw a pagetable when it never deposited
one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
Residue of an earlier implementation, perhaps? Delete it.

Fixes: 953c66c2b2 ("mm: THP page cache support for ppc64")
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-07 17:49:33 -08:00
Linus Torvalds
87bc610730 Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild fix from Michal Marek:
 "The asm-prototypes.h file added in the last merge window results in
  invalid code with CONFIG_KMEMCHECK=y. The net result is that genksyms
  segfaults.

  This pull request fixes the header, the genksyms fix is in my kbuild
  branch for 4.11"

* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  asm-prototypes: Clear any CPP defines before declaring the functions
2017-01-07 09:47:43 -08:00
Greg Kroah-Hartman
01d0f71586 MAINTAINERS: add greybus subsystem mailing list
The Greybus driver subsystem has a mailing list, so list it in the
MAINTAINERS file so that people know to send patches there as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-07 17:05:42 +01:00
Linus Torvalds
308c470bc4 Merge tag 'sound-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "Nothing particular stands out, only a few small fixes for USB-audio,
  HD-audio and Firewire. The USB-audio fix is the respin of the previous
  race fix after a revert due to the regression"

* tag 'sound-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  Revert "ALSA: firewire-lib: change structure member with proper type"
  ALSA: usb-audio: test EP_FLAG_RUNNING at urb completion
  ALSA: usb-audio: Fix irq/process data synchronization
  ALSA: hda - Apply asus-mode8 fixup to ASUS X71SL
  ALSA: hda - Fix up GPIO for ASUS ROG Ranger
  ALSA: firewire-lib: change structure member with proper type
  ALSA: firewire-tascam: Fix to handle error from initialization of stream data
  ALSA: fireworks: fix asymmetric API call at unit removal
2017-01-06 15:38:39 -08:00
Linus Torvalds
d72f0ded89 Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
 "One fix for a broken driver on Renesas RZ/A1 SoCs with bootloaders
  that don't turn all the clks on and another fix for stm32f4 SoCs where
  we have multiple drivers attaching to the same DT node"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: stm32f4: Use CLK_OF_DECLARE_DRIVER initialization method
  clk: renesas: mstp: Support 8-bit registers for r7s72100
2017-01-06 15:35:27 -08:00
Linus Torvalds
baaf031521 Merge tag 'hwmon-for-linus-v4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
 "Fix temp1_max_alarm attribute in lm90 driver"

* tag 'hwmon-for-linus-v4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (lm90) fix temp1_max_alarm attribute
2017-01-06 15:32:40 -08:00
Linus Torvalds
08289086b0 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
 "MIPS:
   - fix host kernel crashes when receiving a signal with 64-bit
     userspace

   - flush instruction cache on all vcpus after generating entry code

     (both for stable)

  x86:
   - fix NULL dereference in MMU caused by SMM transitions (for stable)

   - correct guest instruction pointer after emulating some VMX errors

   - minor cleanup"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: remove duplicated declaration
  KVM: MIPS: Flush KVM entry code from icache globally
  KVM: MIPS: Don't clobber CP0_Status.UX
  KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
  KVM: nVMX: fix instruction skipping during emulated vm-entry
2017-01-06 15:27:17 -08:00
Linus Torvalds
b1ee51702e Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:

 - re-introduce the arm64 get_current() optimisation

 - KERN_CONT fallout fix in show_pte()

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: restore get_current() optimisation
  arm64: mm: fix show_pte KERN_CONT fallout
2017-01-06 15:18:58 -08:00
Linus Torvalds
5824f92463 Merge tag 'vfio-v4.10-rc3' of git://github.com/awilliam/linux-vfio
Pull VFIO fixes from Alex Williamson:
 - Add mtty sample driver properly into build system (Alex Williamson)
 - Restore type1 mapping performance after mdev (Alex Williamson)
 - Fix mdev device race (Alex Williamson)
 - Cleanups to the mdev ABI used by vendor drivers (Alex Williamson)
 - Build fix for old compilers (Arnd Bergmann)
 - Fix sample driver error path (Dan Carpenter)
 - Handle pci_iomap() error (Arvind Yadav)
 - Fix mdev ioctl return type (Paul Gortmaker)

* tag 'vfio-v4.10-rc3' of git://github.com/awilliam/linux-vfio:
  vfio-mdev: fix non-standard ioctl return val causing i386 build fail
  vfio-pci: Handle error from pci_iomap
  vfio-mdev: fix some error codes in the sample code
  vfio-pci: use 32-bit comparisons for register address for gcc-4.5
  vfio-mdev: Make mdev_device private and abstract interfaces
  vfio-mdev: Make mdev_parent private
  vfio-mdev: de-polute the namespace, rename parent_device & parent_ops
  vfio-mdev: Fix remove race
  vfio/type1: Restore mapping performance with mdev support
  vfio-mdev: Fix mtty sample driver building
2017-01-06 11:19:03 -08:00
Linus Torvalds
2fd8774c79 Merge branch 'stable/for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb fixes from Konrad Rzeszutek Wilk:
 "This has one fix to make i915 work when using Xen SWIOTLB, and a
  feature from Geert to aid in debugging of devices that can't do DMA
  outside the 32-bit address space.

  The feature from Geert is on top of v4.10 merge window commit
  (specifically you pulling my previous branch), as his changes were
  dependent on the Documentation/ movement patches.

  I figured it would just easier than me trying than to cherry-pick the
  Documentation patches to satisfy git.

  The patches have been soaking since 12/20, albeit I updated the last
  patch due to linux-next catching an compiler error and adding an
  Tested-and-Reported-by tag"

* 'stable/for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
  swiotlb: Export swiotlb_max_segment to users
  swiotlb: Add swiotlb=noforce debug option
  swiotlb: Convert swiotlb_force from int to enum
  x86, swiotlb: Simplify pci_swiotlb_detect_override()
2017-01-06 10:53:21 -08:00
Linus Torvalds
65cdc405b3 Merge tag 'iommu-fixes-v4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU fixes from Joerg Roedel:
 "Three fixes queued up:

   - fix an issue with command buffer overflow handling in the AMD IOMMU
     driver

   - add an additional context entry flush to the Intel VT-d driver to
     make sure any old context entry from kdump copying is flushed out
     of the cache

   - correct the encoding of the PASID table size in the Intel VT-d
     driver"

* tag 'iommu-fixes-v4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/amd: Fix the left value check of cmd buffer
  iommu/vt-d: Fix pasid table size encoding
  iommu/vt-d: Flush old iommu caches for kdump when the device gets context mapped
2017-01-06 10:49:36 -08:00
Linus Torvalds
7397e1e838 Merge tag 'acpi-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "These fix a device enumeration problem related to _ADR matching and an
  IOMMU initialization issue related to the DMAR table missing, remove
  an excessive function call from the core ACPI code, update an error
  message in the ACPI WDAT watchdog driver and add a way to work around
  problems with unhandled GPE notifications.

  Specifics:

   - Fix a device enumeration issue leading to incorrect associations
     between ACPI device objects and platform device objects
     representing physical devices if the given device object has both
     _ADR and _HID (Rafael Wysocki).

   - Avoid passing NULL to acpi_put_table() during IOMMU initialization
     which triggers a (rightful) warning from ACPICA (Rafael Wysocki).

   - Drop an excessive call to acpi_dma_deconfigure() from the core code
     that binds ACPI device objects to device objects representing
     physical devices (Lorenzo Pieralisi).

   - Update an error message in the ACPI WDAT watchdog driver to make it
     provide more useful information (Mika Westerberg).

   - Add a mechanism to work around issues with unhandled GPE
     notifications that occur during system initialization and cannot be
     prevented by means of sysfs (Lv Zheng)"

* tag 'acpi-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / DMAR: Avoid passing NULL to acpi_put_table()
  ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
  ACPI / watchdog: Print out error number when device creation fails
  ACPI / sysfs: Provide quirk mechanism to prevent GPE flooding
  ACPI: Drop misplaced acpi_dma_deconfigure() call from acpi_bind_one()
2017-01-06 10:40:17 -08:00