In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.
In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.
Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Those nodes are skipped for efficiency reasons,
there is no real policy reason to do so.
Placement policy on directly connected NUMA systems is not affected.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Smaller NUMA systems tend to have all NUMA nodes directly connected
to each other. This includes the degenerate case of a system with just
one node, ie. a non-NUMA system.
Larger systems can have two kinds of NUMA topology, which affects how
tasks and memory should be placed on the system.
On glueless mesh systems, nodes that are not directly connected to
each other will bounce traffic through intermediary nodes. Task groups
can be run closer to each other by moving tasks from a node to an
intermediary node between it and the task's preferred node.
On NUMA systems with backplane controllers, the intermediary hops
are incapable of running programs. This creates "islands" of nodes
that are at an equal distance to anywhere else in the system.
Each kind of topology requires a slightly different placement
algorithm; this patch provides the mechanism to detect the kind
of NUMA topology of a system.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
[ Changed to use kernel/sched/sched.h ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the more common pr_warn.
Other miscellanea:
o Coalesce formats
o Realign arguments
o typo fixes of Siple to Simple
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
1) switched_to_dl() check is wrong. We reschedule only
if rq->curr is deadline task, and we do not reschedule
if it's a lower priority task. But we must always
preempt a task of other classes.
2) dl_task_timer():
Policy does not change in case of priority inheritance.
rt_mutex_setprio() changes prio, while policy remains old.
So we lose some balancing logic in dl_task_timer() and
switched_to_dl() when we check policy instead of priority. Boosted
task may be rq->curr.
(I didn't change switched_from_dl() because no check is necessary
there at all).
I've looked at this place(switched_to_dl) several times and even fixed
this function, but found just now... I suppose some performance tests
may work better after this.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While offling node by hot removing memory, the following divide error
occurs:
divide error: 0000 [#1] SMP
[...]
Call Trace:
[...] handle_mm_fault
[...] ? try_to_wake_up
[...] ? wake_up_state
[...] __do_page_fault
[...] ? do_futex
[...] ? put_prev_entity
[...] ? __switch_to
[...] do_page_fault
[...] page_fault
[...]
RIP [<ffffffff810a7081>] task_numa_fault
RSP <ffff88084eb2bcb0>
The issue occurs as follows:
1. When page fault occurs and page is allocated from node 1,
task_struct->numa_faults_buffer_memory[] of node 1 is
incremented and p->numa_faults_locality[] is also incremented
as follows:
o numa_faults_buffer_memory[] o numa_faults_locality[]
NR_NUMA_HINT_FAULT_TYPES
| 0 | 1 |
---------------------------------- ----------------------
node 0 | 0 | 0 | remote | 0 |
node 1 | 0 | 1 | locale | 1 |
---------------------------------- ----------------------
2. node 1 is offlined by hot removing memory.
3. When page fault occurs, fault_types[] is calculated by using
p->numa_faults_buffer_memory[] of all online nodes in
task_numa_placement(). But node 1 was offline by step 2. So
the fault_types[] is calculated by using only
p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
are set to 0.
4. The values(0) of fault_types[] pass to update_task_scan_period().
5. numa_faults_locality[1] is set to 1. So the following division is
calculated.
static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private){
...
ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
}
6. But both of private and shared are set to 0. So divide error
occurs here.
The divide error is rare case because the trigger is node offline.
This patch always increments denominator for avoiding divide error.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If curr task is exiting this may be a reason of use-after-free:
task_numa_compare() do_exit()
... current->flags |= PF_EXITING;
... release_task()
... ~~delayed_put_task_struct()~~
... schedule()
rcu_read_lock() ...
cur = ACCESS_ONCE(dst_rq->curr) ...
... rq->curr = next;
... context_switch()
... finish_task_switch()
... put_task_struct()
... __put_task_struct()
... free_task_struct()
task_numa_assign() ...
get_task_struct() ...
As noted by Oleg:
<<The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp>>
The patch provides simple check of PF_EXITING flag. If it's not set,
this guarantees that call_rcu() of delayed_put_task_struct() callback
hasn't happened yet, so we can safely do get_task_struct() in
task_numa_assign().
Locked dst_rq->lock protects from concurrency with the last schedule().
Reusing or unmapping of cur's memory may happen without it.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The race may happen when somebody is changing task_group of a forking task.
Child's cgroup is the same as parent's after dup_task_struct() (there just
memory copying). Also, cfs_rq and rt_rq are the same as parent's.
But if parent changes its task_group before it's called cgroup_post_fork(),
we do not reflect this situation on child. Child's cfs_rq and rt_rq remain
the same, while child's task_group changes in cgroup_post_fork().
To fix this we introduce fork() method, which calls sched_move_task() directly.
This function changes sched_task_group on appropriate (also its logic has
no problem with freshly created tasks, so we shouldn't introduce something
special; we are able just to use it).
Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current GPL only licensing on the DTSI makes it very impractical for other
software components licensed under another license.
In order to make it easier for them to reuse our device trees, relicense our
device trees under a GPL/X11 dual-license.
Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Acked-by: Carlo Caione <carlo@caione.org>
Acked-by: Emilio López <emilio@elopez.com.ar>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Olliver Schinagl <oliver@schinagl.nl>
Acked-by: Roman Byshko <rbyshko@gmail.com>
Change ownership of the hpsa driver from Stephen M. Cameron
(Hewlett-Packard) to Don Brace (PMC-Sierra).
Change ownership of the cciss driver from Mike Miller (Hewlett-Packard)
to Don Brace (PMC-Sierra).
Signed-off-by: Don Brace <don.brace@pmcs.com>
Reviewed-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
libcxgbi was always returning an ipv4 address for ISCSI_HOST_PARAM_IPADDRESS,
return appropriate address based on address family
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: Karen Xie <kxie@chelsio.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fix a memory leak with scsi-mq triggered by commands with large data
transfer length.
Fixes: c53c6d6a68 ("scatterlist: allow chaining to preallocated chunks")
Cc: <stable@vger.kernel.org> # 3.17.x
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This patch removes curly braces for single statement
blocks using following coccinelle script:
@@
expression e1;
@@
- if (e1) {
+ if (e1)
return ...;
- }
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In certain circumstances the PSL (Power Service Layer, which provides
translation services for CXL hardware) can send an interrupt for a
segment miss that the kernel has already handled. This can happen if
multiple translations for the same segment are queued in the PSL before
the kernel has restarted the first translation.
The CXL driver does not expect this situation and does not check if a
segment had already been handled. This could cause a duplicate segment
table entry which in turn caused a PSL error taking down the card.
This patch fixes the issue by checking for existing entries in the
segment table that match the segment we are trying to insert, so as to
avoid inserting duplicate entries.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch makes copro_calculate_slb() mask the ESID by the correct mask
for 1T vs 256M segments.
This has no effect by itself as the extra bits were ignored, but it
makes debugging the segment table entries easier and means that we can
directly compare the ESID values for duplicates without needing to worry
about masking in the comparison.
This will be used to simplify a comparison in the following patch.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the segment table hash calculation from cxl_load_segment()
into find_free_sste() since that is the only place it is actually used.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch simplifies the process of finding a free segment table entry
by disabling the secondary hash. This reduces the number of possible
entries in the segment table for a given address from 16 to 8.
Due to the large segment sizes we use it is extremely unlikely that the
secondary hash would ever have been used in practice, so this should not
have any negative impacts and may even improve performance due to the
reduced number of comparisons that software & hardware need to perform.
This patch clears the SC bit in the hardware's state register
(CXL_PSL_SR_An) to disable the secondary hash in the hardware since we
can no longer fill out entries using it.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add an ftrace_graph_is_dead() check to prepare_ftrace_return() in order to
detect an internal ftrace graph error. This allows to prevent further ftrace
graph handling and hopefully keeps the kernel alive.
This patch is the same like for all other architectures.
For unkown reasons s390 was left out.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The following semantic patch is used to do this
@r exists@ // find anything that might decrement the variable
identifier i;
expression E;
position p;
@@
int i@p;
...
(
&i
|
i--
|
--i
|
i-=E
|
i+=E
)
@x disable decl_init@
identifier r.i;
expression E;
position p1 != r.p;
@@
(
volatile int i = 0;
|
volatile int i;
|
+ unsigned
int i@p1 = 0;
|
+ unsigned
int i@p1;
)
<... when != i = E
(
i = 0
|
i = 1
)
...>
Signed-off-by: Tapasweni Pathak <tapaswenipathak@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
'timeval' is used to print timestamps in seconds and millisecond.
32-bit systems using 'struct timeval' will break in the year 2038,
So we have to replace that code with more appropriate types.
This patch changes the android driver to use timespec64.
Since this is a staging driver and the output is only used for a
debugfs file, it is fine to slightly change the output and
use nanoseconds instead. ktime_to_timespec64, is used which will return
seconds and nanoseconds.
Signed-off-by: Tapasweni Pathak <tapaswenipathak@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch removes all of the debug symbols and the variable dgnc_debug
since they are not used anywhere in the code.
Signed-off-by: Aya Mahfouz <mahfouz.saif.elyazal@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch removes the variable rawreadok since it is only intialized
but not used in any of the driver functions.
Signed-off-by: Aya Mahfouz <mahfouz.saif.elyazal@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch removes the variable trcbuf_size given there is no real
use for it anymore. All of the debug variables in this driver is
basically dead code.
Signed-off-by: Aya Mahfouz <mahfouz.saif.elyazal@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch removes the macros PARM_STR and PARM_ULONG given that they
are not used anywhere in the driver files.
Signed-off-by: Aya Mahfouz <mahfouz.saif.elyazal@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace C99 "//" comments by "/* */", remove comments about
adding code, remove commented code (one left there for
debugging of the function)
Signed-off-by: Veronika Kabatova <veronicca114@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>