We've only been setting the sg_io_hdr status values on SCSI commands
that require an nvme command to complete the translation. The fields
in the struct are output parameters, so we have to set them, otherwise
user space will see whatever was in memory from before. In the case of
compat SG_IO, this would reveal kernel memory. This fixes the issue by
initializing the sg_io_hdr with successful status.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can return -ENOIOCTLCMD and the ioctl will be handled by
fs/compat_ioctl.c instead. This removes a lot of duplicate code in the
nvme driver.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If an nvme device is removed but user space has an open reference,
the nvme driver would have been holding an invalid reference to its pci
device. You may get a general protection fault on x86 h/w when the driver
uses that reference in dma_map_sg(), as is done in nvme_map_user_pages()
from the IOCTL interface.
This patch fixes the fault by taking a reference on the pci device and
holding it even after device removal until all opens on the nvme device
are closed.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
nvme_submit_io_cmd() uses smp_processor_id() to pick an IO queue index.
This patch fixes the case where there are more cpus from which the ioctl
call can originate than online queues, which can happen when a device
supports or was allocated fewer interrupt vectors than exist cpu cores.
Thanks to Keith Busch for the implementation suggestion.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This changes the order of deleting the gendisks so it happens after the
nvme IO queues are freed. If a device is removed while a filesystem has
associated dirty data, the removal will wait on these to complete before
proceeding from del_gendisk, which could have caused deadlock before.
The implication of this is that an orderly removal of a responsive
device won't necessarily wait for dirty data to be written, but we are
not guaranteed the device is even going to respond at this point either.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rather than relying on call_rcu, this patch directly frees the
nvme_queue's memory after ensuring no readers exist. Some arch specific
dma_free_coherent implementations may not be called from a call_rcu's
soft interrupt context, hence the change.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Matthew Minter <matthew_minter@xyratex.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The current implementation hard-codes the shutdown timeout to 2 seconds.
Some devices take longer than this to complete a normal shutdown.
Changing the shutdown timeout to a module parameter with a default
timeout of 5 seconds.
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rather than skipping shutdown only for devices that have been removed,
skip the orderly shutdown on failed devices to avoid the long timeout
handling that inevitably happens when deleting queues on such a device.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Race conditions are theoretically possible between the NVMe PCI device
removal and the generic PCI bus rescan and device removal that can be
triggered via sysfs.
To avoid those race conditions make the NVMe code use
pci_stop_and_remove_bus_device_locked().
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is a minor refactor for handling devices that are incapable of IO.
The driver previously used special error codes to know that IO queues
are unavailable, but we have an online queue count now.
This also fixes an issue where the driver successfully sets the queue
count, but either is unable to allocate an IO queue or the device can't
create one for some reason.
If the driver can successfully enable the device and get responses to
admin commands, the driver will bring up a character device for managment
but not create block devices.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Change the behavior of nvme_enable_ctrl to set EN.
Clear CC.SH for both nvme_enable_ctrl and nvme_disable_ctrl.
Remove reading of the CC register and manage the state in
dev->ctrl_config.
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
[removed an unwanted write to CC]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Adds support for devices with max page size smaller than the host's.
In the case we encounter such a host/device combination, the driver will
split a page into as many PRP entries as necessary for the device's page
size capabilities. If the device's reported minimum page size is greater
than the host's, the driver will not attempt to enable the device and
return an error instead.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Submits NVMe asynchronous event requests, one event up to the controller
maximum or number of possible different event types (8), whichever is
smaller. Events successfully returned by the controller are logged.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the zeroing function instead of dma_alloc_coherent & memset(,0,)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
else is unnecessary after return 0 in __udp4_lib_rcv()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove __inline__ / inline and let compiler decide what to do
with static functions
Inspired-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
In addtion expand the table to handle other values of sysclk. Instead
of making the table 3D, expand it to a more descriptive struct. The
divisors are specified in Table 19 of the 98090 data sheet version
0p94.
The dmic frequency was previously assumed. Instead make it explicit
and configurable through device tree. This now handles independently
set pclk and dmic frequency.
Based on downstream work by Ralph Birt.
Signed-off-by: Dylan Reid <dgreid@chromium.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
The Baytrail-based chromebooks have a 20MHz mclk, the code was setting
the divisor incorrectly in this case. According to the 98090
datasheet, the divisor should be set to DIV1 for 10 <= mclk <= 20.
Correct this and the surrounding clock ranges as well to match the
datasheet.
Signed-off-by: Dylan Reid <dgreid@chromium.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Pull thermal fixes from Eduardo Valentin:
"Specifics:
- a few code fixes improving the Exynos code base. They remove dead
and unreachable code. No functional changes here
- in Exynos code base, fixes regarding the right usage of features
(TRIMINFO and TRIMRELOAD)
- documentation of RCAR thermal
- fix in the of-thermal, regarding the proper usage of of-APIs
- fixes on thermal-core, removal of unreachable code"
[ Eduardo is sending the thermal fixes on behalf of Rui Zhang this time.
Rui is currently unable to send pull requests due to troubles with his
machine and he's currently in a business trip ]
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal:
Thermal:Remove usless if(!result) before return tz
thermal: exynos: fix IRQ clearing on TMU initialization
thermal: fix multiple disbalanced device node counters
thermal: rcar: Add binding docs for new R-Car Gen2 SoCs
thermal: exynos: Add support for TRIM_RELOAD feature at Exynos3250
thermal: exynos: Add support for many TRIMINFO_CTRL registers
thermal: samsung: Exynos5260 and Exynos5420 should not use TRIM_RELOAD flag
thermal: exynos: remove identical values from exynos*_tmu_registers structures
thermal: exynos: remove redundant pdata checks from exynos_tmu_control()
thermal: exynos: cache non_hw_trigger_levels in pdata
thermal: exynos: simplify temp_to_code() and code_to_temp()
thermal: exynos: remove redundant threshold_code checks from exynos_tmu_initialize()
thermal: exynos: remove redundant pdata checks from exynos_tmu_initialize()
thermal: exynos: remove dead code for HW_MODE calibration
thermal: exynos: remove unused struct exynos_tmu_registers entries
In order to describe a single slave device that has no chip select line
the 'num-chipselects' property has to be <0> and the 'cs-gpios' property
doesn't need to be set.
Signed-off-by: Torsten Fleischer <torfl6749@gmail.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Pull x86 platform drievr updates from Darren Hart:
"A short list of patches applying quirks and new DMI matches. These
pass my basic build tests and have spent 4 days in linux-next"
* tag 'platform-drivers-x86-v3.18-2' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
quirk for Lenovo Yoga 3: no rfkill switch
acer-wmi: Add acpi_backlight=video quirk for the Acer KAV80
samsung-laptop: Add broken-acpi-video quirk for NC210/NC110
asus-nb-wmi: Add wapf4 quirk for the X550VB
toshiba_acpi: Add Toshiba TECRA A50-A to the alt keymap dmi list
Observing that per-CPU data (in the SMP case) is reachable by
exploiting 64-bit address wraparound (building on the default kernel
load address being at 16Mb), the one byte shorter RIP-relative
addressing form can be used for most per-CPU accesses. The one
exception are the "stable" reads, where the use of the "P" operand
modifier prevents the compiler from using RIP-relative addressing, but
is unavoidable due to the use of the "p" constraint (side note: with
gcc 4.9.x the intended effect of this isn't being achieved anymore,
see gcc bug 63637).
With the dependency on the minimum kernel load address, arbitrarily
low values for CONFIG_PHYSICAL_START are now no longer possible. A
link time assertion is being added, directing to the need to increase
that value when it triggers.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/5458A1780200007800044A9D@mail.emea.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes if you please"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc: use device_online/offline() instead of cpu_up/down()
powerpc/powernv: Properly fix LPC debugfs endianness
powerpc: do_notify_resume can be called with bad thread_info flags argument
powerpc/fadump: Fix endianess issues in firmware assisted dump handling
powerpc: Fix section mismatch warning
Both this_cpu_off and cpu_info aren't getting modified post boot, yet
are being accessed on enough code paths that grouping them with other
frequently read items seems desirable. For cpu_info this at the same
time implies removing the cache line alignment (which afaict became
pointless when it got converted to per-CPU data years ago).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/54589BD20200007800044A84@mail.emea.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull ftracetest fix from Steven Rostedt:
"Running the ftracetests on a machine that had the debugfs file system
mounted in two locations caused the ftracetests to fail. This is
because the ftracetests script does a grep of the /proc/mounts file to
find where the debugfs file system is mounted. If it is mounted
twice, then the grep returns two lines instead of just one. This
causes the ftracetests to get confused and fail.
Use "head -1" to only return the first mount point for debugfs"
* tag 'ftracetest-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftracetest: Take the first debugfs mount found
Now that Keystone PCI controller is merged, add pcie related options
by default for keystone architecture so that driver can be enabled in
the build.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
K2E SoC has a second PCI port based on Synopsis Designware PCIe h/w.
Add DT bindings to support PCI controller for port 1 for this SoC.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Add common DT bindings to support PCI controller driver for port 0 on all
of the K2 SoCs that has Synopsis Designware based pcie h/w.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Changes into the AHCI subsystem have introduced a bug by not taking into
account the force_port_map and mask_port_map parameters when using the
ahci_pci_save_initial_config function. This commit fixes it by setting
the internal parameters of the ahci_port_priv structure.
Fixes: 725c7b570f
Reported-and-tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode->i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set. The comment above the barrier doesn't
contain any useful information - memory barriers can't ensure "changes
are seen by all cpus" by itself.
And it sure enough was broken. Please consider the following
scenario.
CPU 0 CPU 1
-------------------------------------------------------------------------------
enters __writeback_single_inode()
grabs inode->i_lock
tests PAGECACHE_TAG_DIRTY which is clear
enters __set_page_dirty()
grabs mapping->tree_lock
sets PAGECACHE_TAG_DIRTY
releases mapping->tree_lock
leaves __set_page_dirty()
enters __mark_inode_dirty()
smp_mb()
sees I_DIRTY_PAGES set
leaves __mark_inode_dirty()
clears I_DIRTY_PAGES
releases inode->i_lock
Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers inbetween, so while the inode
ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
IO list.
The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness. There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode. Filesystem inode writeout path likely has enough stuff which
can behave as full barrier but it's theoretically possible that the
writeout may not see all the updates from ->dirty_inode().
Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
that I_DIRTY_PAGES needs a special treatment as it always needs to be
cleared to be interlocked with the lockless test on
__mark_inode_dirty() side. It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.
Also add comments explaining how and why the barriers are paired.
Lightly tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
On large-memory x86-64 systems of 64GB or more with memory hot-plug
enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces
the number of directories in /sys/devices/system/memory from 512 to 32,
making it more manageable, and reducing the creation time accordingly.
This caveat is that the memory can't be offlined (for hotplug or
otherwise) with the finer default 128MB granularity, but this is
unimportant due to the high memory densities generally used with such
large-memory systems, where eg a single DIMM is the order of 16GB.
Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Cc: Steffen Persvold <sp@numascale.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/1415089784-28779-4-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>