this_cpu_sub() is implemented as negation and addition.
This patch casts the adjustment to the counter type before negation to
sign extend the adjustment. This helps in cases where the counter type
is wider than an unsigned adjustment. An alternative to this patch is
to declare such operations unsupported, but it seemed useful to avoid
surprises.
This patch specifically helps the following example:
unsigned int delta = 1
preempt_disable()
this_cpu_write(long_counter, 0)
this_cpu_sub(long_counter, delta)
preempt_enable()
Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff. This is because
this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
which is basically:
long_counter = 0 + 0xffffffff
Also apply the same cast to:
__this_cpu_sub()
__this_cpu_sub_return()
this_cpu_sub_return()
All percpu_test.ko passes, especially the following cases which
previously failed:
l -= ui_one;
__this_cpu_sub(long_counter, ui_one);
CHECK(l, long_counter, -1);
l -= ui_one;
this_cpu_sub(long_counter, ui_one);
CHECK(l, long_counter, -1);
CHECK(l, long_counter, 0xffffffffffffffff);
ul -= ui_one;
__this_cpu_sub(ulong_counter, ui_one);
CHECK(ul, ulong_counter, -1);
CHECK(ul, ulong_counter, 0xffffffffffffffff);
ul = this_cpu_sub_return(ulong_counter, ui_one);
CHECK(ul, ulong_counter, 2);
ul = __this_cpu_sub_return(ulong_counter, ui_one);
CHECK(ul, ulong_counter, 1);
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Kirsher says:
====================
This series contains updates to vxlan, net, ixgbe, ixgbevf, and i40e.
Joseph provides a single patch against vxlan which removes the burden
from the NIC drivers to check if the vxlan driver is enabled in the
kernel and also makes available the vxlan headrooms to the drivers.
Jacob provides majority of the patches, with patches against net, ixgbe
and ixgbevf. His net patch adds might_sleep() call to napi_disable so
that every use of napi_disable during atomic context will be visible.
Then Jacob provides a patch to fix qv_lock_napi call in
ixgbe_napi_disable_all. The other ixgbe patches cleanup
ixgbe_check_minimum_link function to correctly show that there are some
minor loss of encoding, even though we don't calculate it and remove
unnecessary duplication of PCIe bandwidth display. Lastly, Jacob
provides 4 patches against ixgbevf to add ixgbevf_rx_skb in line with
how ixgbe handles the variations on how packets can be received, adds
support in order to track how many packets were cleaned during busy poll
as part of the extended statistics.
Wei Yongjun provides a fix for i40e to return -ENOMEN in the memory
allocation error handling case instead of returning 0, as done
elsewhere in this function.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This work contains a lightweight BPF-based traffic classifier that can
serve as a flexible alternative to ematch-based tree classification, i.e.
now that BPF filter engine can also be JITed in the kernel. Naturally, tc
actions and policies are supported as well with cls_bpf. Multiple BPF
programs/filter can be attached for a class, or they can just as well be
written within a single BPF program, that's really up to the user how he
wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
The notion of a BPF program's return/exit codes is being kept as follows:
0: No match
-1: Select classid given in "tc filter ..." command
else: flowid, overwrite the default one
As a minimal usage example with iproute2, we use a 3 band prio root qdisc
on a router with sfq each as leave, and assign ssh and icmp bpf-based
filters to band 1, http traffic to band 2 and the rest to band 3. For the
first two bands we load the bytecode from a file, in the 2nd we load it
inline as an example:
echo 1 > /proc/sys/net/core/bpf_jit_enable
tc qdisc del dev em1 root
tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tc qdisc add dev em1 parent 1:1 sfq perturb 16
tc qdisc add dev em1 parent 1:2 sfq perturb 16
tc qdisc add dev em1 parent 1:3 sfq perturb 16
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
BPF programs can be easily created and passed to tc, either as inline
'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
compile opcodes, for example:
1) People familiar with tcpdump-like filters:
tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
2) People that want to low-level program their filters or use BPF
extensions that lack support by libpcap's compiler:
bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
ssh.ops example code:
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ldh [20]
jset #0x1fff, drop
ldxb 4 * ([14] & 0xf)
ldh [%x + 14]
jeq #0x16, pass
ldh [%x + 16]
jne #0x16, drop
pass: ret #-1
drop: ret #0
It was chosen to load bytecode into tc, since the reverse operation,
tc filter list dev em1, is then able to show the exact commands again.
Possible follow-up work could also include a small expression compiler
for iproute2. Tested with the help of bmon. This idea came up during
the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
Eric Dumazet!
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
napi_disable uses an msleep() call to wait for outstanding napi work to be
finished after setting the disable bit. It does not always sleep incase there
was no outstanding work. This resulted in a rare bug in ixgbe_down operation
where a napi_disable call took place inside of a local_bh_disable()d context.
In order to enable easier detection of future sleep while atomic BUGs, this
patch adds a might_sleep() call, so that every use of napi_disable during
atomic context will be visible.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Alexander Duyck <alexander.duyck@intel.com>
Cc: Hyong-Youb Kim <hykim@myri.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Dmitry Kravkov <dmitry@broadcom.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch removes the burden from the NIC drivers to check if the
vxlan driver is enabled in the kernel and also makes available
the vxlan headrooms to them.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The code for privacy extentions is very mature, and making it
configurable only gives marginal memory/code savings in exchange
for obfuscation and hard to read code via CPP ifdef'ery.
Signed-off-by: David S. Miller <davem@davemloft.net>
Process the CSA frame according to the procedures define in IEEE Std
802.11-2012 section 10.9.8.4.3 as follow:
* The mesh channel switch parameters element (MCSP) must be availabe.
* If the MCSP's TTL is 1, drop the frame but still process the CSA.
* If the MCSP's precedence value is less than or equal to the current
precedence value, drop the frame and do not process the CSA.
* The CSA frame is forwarded after TTL is decremented by 1 and the
initiator field is set to 0. Transmit restrict field and others
are maintained as is.
* No beacon or probe response frame are handled here.
Also, introduce the debug message used for mesh CSA purpose.
Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@cozybit.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This can be used by a driver to prepare skbs for transmission, which were
obtained via functions such as ieee80211_probereq_get or
ieee80211_nullfunc_get.
This is useful for drivers that want to send those frames directly, but
need rate control information to be prepared first.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers can now use this to parse the regulatory request and
be more verbose when needed.
Signed-off-by: Luis R. Rodriguez <mcgrof@do-not-panic.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
To use DFS in IBSS mode, userspace is required to react to radar events.
It can inform nl80211 that it is capable of doing so by adding a
NL80211_ATTR_HANDLE_DFS attribute when joining the IBSS.
This attribute is supplied to let the kernelspace know that the
userspace application can and will handle radar events, e.g. by
intiating channel switches to a valid channel. DFS channels may
only be used if this attribute is supplied and the driver supports
it. Driver support will be checked even if a channel without DFS
will be initially joined, as a DFS channel may be chosen later.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
[fix attribute name in commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull SCSI target fixes from Nicholas Bellinger:
"Here are the outstanding target pending fixes for v3.12-rc7.
This includes a number of EXTENDED_COPY related fixes as a result of
Thomas and Doug's continuing testing and feedback.
Also included is an important vhost/scsi fix that addresses a long
standing issue where the 'write' parameter for get_user_pages_fast()
was incorrectly set for virtio-scsi WRITEs -> DMA_TO_DEVICE, and not
for virtio-scsi READs -> DMA_FROM_DEVICE.
This resulted in random userspace segfaults and other unpleasantness
on KVM host, and unfortunately has been an issue since the initial
merge of vhost/scsi in v3.6. This patch is CC'ed to stable, along
with two other less critical items"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
vhost/scsi: Fix incorrect usage of get_user_pages_fast write parameter
target/pscsi: fix return value check
target: Fail XCOPY for non matching source + destination block_size
target: Generate failure for XCOPY I/O with non-zero scsi_status
target: Add missing XCOPY I/O operation sense_buffer
iser-target: check device before dereferencing its variable
target: Return an error for WRITE SAME with ANCHOR==1
target: Fix assignment of LUN in tracepoints
target: Reject EXTENDED_COPY when emulate_3pc is disabled
target: Allow non zero ListID in EXTENDED_COPY parameter list
target: Make target_do_xcopy failures return INVALID_PARAMETER_LIST
On receiving a packet too big icmp error we update the expire value by
calling rt6_update_expires. This function uses dst_set_expires which is
implemented that it can only reduce the expiration value of the dst entry.
If we insert new routing non-expiry information into the ipv6 fib where
we already have a matching rt6_info we only clear the RTF_EXPIRES flag
in rt6i_flags and leave the dst.expires value as is.
When new mtu information arrives for that cached dst_entry we again
call dst_set_expires. This time it won't update the dst.expire value
because we left the dst.expire value intact from the last update. So
dst_set_expires won't touch dst.expires.
Fix this by resetting dst.expires when clearing the RTF_EXPIRE flag.
dst_set_expires checks for a zero expiration and updates the
dst.expires.
In the past this (not updating dst.expires) was necessary because
dst.expire was placed in a union with the dst_entry *from reference
and rt6_clean_expires did assign NULL to it. This split happend in
ecd9883724 ("ipv6: fix race condition
regarding dst->expires and dst->from").
Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com>
Reported-by: Valentijn Sessink <valentyn@blub.net>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Valentijn Sessink <valentyn@blub.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now skb->data is passed to rx_hook() even if the skb
has not been linearised and without giving rx_hook() a way
to linearise it.
Change the rx_hook() interface and make it accept the skb
and the offset to the UDP payload as arguments. rx_hook() is
also renamed to rx_skb_hook() to ensure that out of the tree
users notice the API change.
In this way any rx_skb_hook() implementation can perform all
the needed operations to properly (and safely) access the
skb data.
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I initial build non irq safe version of net_get_random_once because I
would liked to have the freedom to defer even the extraction process of
get_random_bytes until the nonblocking pool is fully seeded.
I don't think this is a good idea anymore and thus this patch makes
net_get_random_once irq safe. Now someone using net_get_random_once does
not need to care from where it is called.
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
All fragmentation hash secrets now get initialized by their
corresponding hash function with net_get_random_once. Thus we can
eliminate the initial seeding.
Also provide a comment that hash secret seeding happens at the first
call to the corresponding hashing function.
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer the fragmentation hash secret initialization for IPv6 like the
previous patch did for IPv4.
Because the netfilter logic reuses the hash secret we have to split it
first. Thus introduce a new nf_hash_frag function which takes care to
seed the hash secret.
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
2e685cad57 (tcp_memcontrol: Kill struct tcp_memcontrol) falsly modified
the access to memory_pressure of sk->sk_prot->memory_pressure. The patch
did modify the memory_pressure-field of struct cg_proto, but not the one
of struct proto.
So, the access to sk_prot->memory_pressure should not be changed.
Acked-by: Eric Dumazet <edumazet@google.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull infiniband bugfix from Roland Dreier:
"Disable not-quite-ready userspace ABI for IB flow steering"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/core: Temporarily disable create_flow/destroy_flow uverbs
Pull networking fixes from David Miller:
"Sorry I let so much accumulate, I was in Buffalo and wanted a few
things to cook in my tree for a while before sending to you. Anyways,
it's a lot of little things as usual at this stage in the game"
1) Make bonding MAINTAINERS entry reflect reality, from Andy
Gospodarek.
2) Fix accidental sock_put() on timewait mini sockets, from Eric
Dumazet.
3) Fix crashes in l2tp due to mis-handling of ipv4 mapped ipv6
addresses, from François CACHEREUL.
4) Fix heap overflow in __audit_sockaddr(), from the eagle eyed Dan
Carpenter.
5) tcp_shifted_skb() doesn't take handle FINs properly, from Eric
Dumazet.
6) SFC driver bug fixes from Ben Hutchings.
7) Fix TX packet scheduling wedge after channel change in ath9k driver,
from Felix Fietkau.
8) Fix user after free in BPF JIT code, from Alexei Starovoitov.
9) Source address selection test is reversed in
__ip_route_output_key(), fix from Jiri Benc.
10) VLAN and CAN layer mis-size netlink attributes, from Marc
Kleine-Budde.
11) Fix permission checks in sysctls to use current_euid() instead of
current_uid(). From Eric W Biederman.
12) IPSEC policies can go away while a timer is still pending for them,
add appropriate ref-counting to fix, from Steffen Klassert.
13) Fix mis-programming of FDR and RMCR registers on R8A7740 sh_eth
chips, from Nguyen Hong Ky and Simon Horman.
14) MLX4 forgets to DMA unmap pages on RX, fix from Amir Vadai.
15) IPV6 GRE tunnel MTU upper limit is miscalculated, from Oussama
Ghorbel.
16) Fix typo in fq_change(), we were assigning "initial quantum" to
"quantum". From Eric Dumazet.
17) Set a more appropriate sk_pacing_rate for non-TCP sockets, otherwise
FQ packet scheduler does not pace those flows properly. Also from
Eric Dumazet.
18) rtlwifi miscalculates packet pointers, from Mark Cave-Ayland.
19) l2tp_xmit_skb() can be called from process context, not just softirq
context, so we must always make sure to BH disable around it. From
Eric Dumazet.
20) On qdisc reset, we forget to purge the RB tree of SKBs in netem
packet scheduler. From Stephen Hemminger.
21) Fix info leak in farsync WAN driver ioctl() handler, from Dan
Carpenter and Salva Peiró.
22) Fix PHY reset and other issues in dm9000 driver, from Nikita
Kiryanov and Michael Abbott.
23) When hardware can do SCTP crc32 checksums, we accidently don't
disable the csum offload when IPSEC transformations have been
applied. From Fan Du and Vlad Yasevich.
24) Tail loss probing in TCP leaves the socket in the wrong congestion
avoidance state. From Yuchung Cheng.
25) In CPSW driver, enable NAPI before interrupts are turned on, from
Markus Pargmann.
26) Integer underflow and dual-assignment in YAM hamradio driver, from
Dan Carpenter.
27) If we are going to mangle a packet in tcp_set_skb_tso_segs() we must
unclone it. This fixes various hard to track down crashes in
drivers where the SKBs ->gso_segs was changing right from underneath
the driver during TX queueing. From Eric Dumazet.
28) Fix the handling of VLAN IDs, and in particular the special IDs 0
and 4095, in the bridging layer. From Toshiaki Makita.
29) Another info leak, this time in wanxl WAN driver, from Salva Peiró.
30) Fix race in socket credential passing, from Daniel Borkmann.
31) WHen NETLABEL is disabled, we don't validate CIPSO packets properly,
from Seif Mazareeb.
32) Fix identification of fragmented frames in ipv4/ipv6 UDP
Fragmentation Offload output paths, from Jiri Pirko.
33) Virtual Function fixes in bnx2x driver from Yuval Mintz and Ariel
Elior.
34) When we removed the explicit neighbour pointer from ipv6 routes a
slight regression was introduced for users such as IPVS, xt_TEE, and
raw sockets. We mix up the users requested destination address with
the routes assigned nexthop/gateway. From Julian Anastasov and
Simon Horman.
35) Fix stack overruns in rt6_probe(), the issue is that can end up
doing two full packet xmit paths at the same time when emitting
neighbour discovery messages. From Hannes Frederic Sowa.
36) davinci_emac driver doesn't handle IFF_ALLMULTI correctly, from
Mariusz Ceier.
37) Make sure to set TCP sk_pacing_rate after the first legitimate RTT
sample, from Neal Cardwell.
38) Wrong netlink attribute passed to xfrm_replay_verify_len(), from
Steffen Klassert.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (152 commits)
ax88179_178a: Add VID:DID for Samsung USB Ethernet Adapter
ax88179_178a: Correct the RX error definition in RX header
Revert "bridge: only expire the mdb entry when query is received"
tcp: initialize passive-side sk_pacing_rate after 3WHS
davinci_emac.c: Fix IFF_ALLMULTI setup
mac802154: correct a typo in ieee802154_alloc_device() prototype
ipv6: probe routes asynchronous in rt6_probe
netfilter: nf_conntrack: fix rt6i_gateway checks for H.323 helper
ipv6: fill rt6i_gateway with nexthop address
ipv6: always prefer rt6i_gateway if present
bnx2x: Set NETIF_F_HIGHDMA unconditionally
bnx2x: Don't pretend during register dump
bnx2x: Lock DMAE when used by statistic flow
bnx2x: Prevent null pointer dereference on error flow
bnx2x: Fix config when SR-IOV and iSCSI are enabled
bnx2x: Fix Coalescing configuration
bnx2x: Unlock VF-PF channel on MAC/VLAN config error
bnx2x: Prevent an illegal pointer dereference during panic
bnx2x: Fix Maximum CoS estimation for VFs
drivers: net: cpsw: fix kernel warn during iperf test with interrupt pacing
...
What sk_reset_txq() does is just calls function sk_tx_queue_reset(),
and sk_reset_txq() is used only in sock.h, by dst_negative_advice().
Let dst_negative_advice() calls sk_tx_queue_reset() directly so we
can remove unneeded sk_reset_txq().
Signed-off-by: ZHAO Gang <gamerh2o@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Collect mega flow mask stats. ovs-dpctl show command can be used to
display them for debugging and performance tuning.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Instead of cb->data, use callback dump args only and introduce symbolic
names instead of plain numbers at accessing the argument members.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull drm fixes from Dave Airlie:
"Travelling slowed down getting these out.
Two vmwgfx fixes, a radeon revert to avoid a regression, i915 fixes,
and some ioctl sizing issues fixed with 32 on 64"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon/audio: don't set speaker allocation on DCE4+
drm/radeon: rework audio option
drm/radeon/audio: don't set speaker allocation on DCE3.2
drm/radeon: make missing smc ucode non-fatal (CI)
drm/radeon: make missing smc ucode non-fatal (r7xx-SI)
drm/radeon/uvd: revert lower msg&fb buffer requirements on UVD3
drm/radeon: stop the leaks in cik_ib_test
drm/radeon/atom: workaround vbios bug in transmitter table on rs780
drm/i915: Disable GGTT PTEs on GEN6+ suspend
drm/i915: Make PTE valid encoding optional
drm: Pad drm_mode_get_connector to 64-bit boundary
drm: Prevent overwriting from userspace underallocating core ioctl structs
drm/vmwgfx: Don't kill clients on VT switch
drm/vmwgfx: Don't put resources with invalid id's on lru list
drm/i915: disable LVDS clock gating on CPT v2
Just an lvds clock gating fix and a pte clearing hack for hsw to avoid
memory corruption when hibernating - something doesn't seem to switch off
properly, we're still investigating.
* tag 'drm-intel-fixes-2013-10-21' of git://people.freedesktop.org/~danvet/drm-intel: (96 commits)
drm/i915: Disable GGTT PTEs on GEN6+ suspend
drm/i915: Make PTE valid encoding optional
drm/i915: disable LVDS clock gating on CPT v2
Now ipv6_gso_segment() is stackable, its relatively easy to
implement GSO/TSO support for SIT tunnels
Performance results, when segmentation is done after tunnel
device (as no NIC is yet enabled for TSO SIT support) :
Before patch :
lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 10.00 3168.31 4.81 4.64 2.988 2.877
After patch :
lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 10.00 5525.00 7.76 5.17 2.763 1.840
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the pointers in struct cg_proto with actual data fields and kill
struct tcp_memcontrol as it is not fully redundant.
This removes a confusing, unnecessary layer of abstraction.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code that is implemented is per memory cgroup not per netns, and
having per netns bits is just confusing. Remove the per netns bits to
make it easier to see what is really going on.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code is broken and does not constrain sysctl_tcp_mem as
tcp_update_limit does. With the result that it allows the cgroup tcp
memory limits to be bypassed.
The semantics are broken as the settings are not per netns and are in a
per netns table, and instead looks at current.
Since the code is broken in both design and implementation and does not
implement the functionality for which it was written remove it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure rt6i_gateway contains nexthop information in
all routes returned from lookup or when routes are directly
attached to skb for generated ICMP packets.
The effect of this patch should be a faster version of
rt6_nexthop() and the consideration of local addresses as
nexthop.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In v3.9 6fd6ce2056 ("ipv6: Do not depend on rt->n in
ip6_finish_output2()." changed the behaviour of ip6_finish_output2()
such that the recently introduced rt6_nexthop() is used
instead of an assigned neighbor.
As rt6_nexthop() prefers rt6i_gateway only for gatewayed
routes this causes a problem for users like IPVS, xt_TEE and
RAW(hdrincl) if they want to use different address for routing
compared to the destination address.
Another case is when redirect can create RTF_DYNAMIC
route without RTF_GATEWAY flag, we ignore the rt6i_gateway
in rt6_nexthop().
Fix the above problems by considering the rt6i_gateway if
present, so that traffic routed to address on local subnet is
not wrongly diverted to the destination address.
Thanks to Simon Horman and Phil Oester for spotting the
problematic commit.
Thanks to Hannes Frederic Sowa for his review and help in testing.
Reported-by: Phil Oester <kernel@linuxace.com>
Reported-by: Mark Brooks <mark@loadbalancer.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no access to chan->sk in L2CAP core now. This change marks the
end of the task of splitting L2CAP between Core and Socket, thus sk is now
gone from struct l2cap_chan.
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Adding the channel to the skb private data makes possible to us know which
channel the skb we have came from.
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch fixes the following warning:
In file included from include/linux/skbuff.h:27:0,
from include/linux/netfilter.h:5,
from include/net/netns/netfilter.h:5,
from include/net/net_namespace.h:20,
from include/linux/init_task.h:14,
from init/init_task.c:1:
include/linux/net.h:243:14: warning: 'struct static_key' declared inside parameter list [enabled by default]
struct static_key *done_key);
on x86_64 allnoconfig, um defconfig and ia64 allmodconfig and maybe others as well.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The create_flow/destroy_flow uverbs and the associated extensions to
the user-kernel verbs ABI are under review and are too experimental to
freeze at this point.
So userspace is not exposed to experimental features and an uinstable
ABI, temporarily disable this for v3.12 (with a Kconfig option behind
staging to reenable it if desired).
The feature will be enabled after proper cleanup for v3.13.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1381351016.git.ydroneaud@opteya.com
Link: http://marc.info/?i=cover.1381177342.git.ydroneaud@opteya.com
[ Add a Kconfig option to reenable these verbs. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Changed key initialization of tcp_fastopen cookies to net_get_random_once.
If the user sets a custom key net_get_random_once must be called at
least once to ensure we don't overwrite the user provided key when the
first cookie is generated later on.
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the ehash and ipv6_hash_secrets with net_get_random_once.
Each compilation unit gets its own secret now:
ipv4/inet_hashtables.o
ipv4/udp.o
ipv6/inet6_hashtables.o
ipv6/udp.o
rds/connection.o
The functions still get inlined into the hashing functions. In the fast
path we have at most two (needed in ipv6) if (unlikely(...)).
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits the secret key for syncookies for ipv4 and ipv6 and
initializes them with net_get_random_once. This change was the reason I
did this series. I think the initialization of the syncookie_secret is
way to early.
Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_get_random_once is a new macro which handles the initialization
of secret keys. It is possible to call it in the fast path. Only the
initialization depends on the spinlock and is rather slow. Otherwise
it should get used just before the key is used to delay the entropy
extration as late as possible to get better randomness. It returns true
if the key got initialized.
The usage of static_keys for net_get_random_once is a bit uncommon so
it needs some further explanation why this actually works:
=== In the simple non-HAVE_JUMP_LABEL case we actually have ===
no constrains to use static_key_(true|false) on keys initialized with
STATIC_KEY_INIT_(FALSE|TRUE). So this path just expands in favor of
the likely case that the initialization is already done. The key is
initialized like this:
___done_key = { .enabled = ATOMIC_INIT(0) }
The check
if (!static_key_true(&___done_key)) \
expands into (pseudo code)
if (!likely(___done_key > 0))
, so we take the fast path as soon as ___done_key is increased from the
helper function.
=== If HAVE_JUMP_LABELs are available this depends ===
on patching of jumps into the prepared NOPs, which is done in
jump_label_init at boot-up time (from start_kernel). It is forbidden
and dangerous to use net_get_random_once in functions which are called
before that!
At compilation time NOPs are generated at the call sites of
net_get_random_once. E.g. net/ipv6/inet6_hashtable.c:inet6_ehashfn (we
need to call net_get_random_once two times in inet6_ehashfn, so two NOPs):
71: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
76: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Both will be patched to the actual jumps to the end of the function to
call __net_get_random_once at boot time as explained above.
arch_static_branch is optimized and inlined for false as return value and
actually also returns false in case the NOP is placed in the instruction
stream. So in the fast case we get a "return false". But because we
initialize ___done_key with (enabled != (entries & 1)) this call-site
will get patched up at boot thus returning true. The final check looks
like this:
if (!static_key_true(&___done_key)) \
___ret = __net_get_random_once(buf, \
expands to
if (!!static_key_false(&___done_key)) \
___ret = __net_get_random_once(buf, \
So we get true at boot time and as soon as static_key_slow_inc is called
on the key it will invert the logic and return false for the fast path.
static_key_slow_inc will change the branch because it got initialized
with .enabled == 0. After static_key_slow_inc is called on the key the
branch is replaced with a nop again.
=== Misc: ===
The helper defers the increment into a workqueue so we don't
have problems calling this code from atomic sections. A seperate boolean
(___done) guards the case where we enter net_get_random_once again before
the increment happend.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usage of the static key primitives to toggle a branch must not be used
before jump_label_init() is called from init/main.c. jump_label_init
reorganizes and wires up the jump_entries so usage before that could
have unforeseen consequences.
Following primitives are now checked for correct use:
* static_key_slow_inc
* static_key_slow_dec
* static_key_slow_dec_deferred
* jump_label_rate_limit
The x86 architecture already checks this by testing if the default_nop
was already replaced with an optimal nop or with a branch instruction. It
will panic then. Other architectures don't check for this.
Because we need to relax this check for the x86 arch to allow code to
transition from default_nop to the enabled state and other architectures
did not check for this at all this patch introduces checking on the
static_key primitives in a non-arch dependent manner.
All checked functions are considered slow-path so the additional check
does no harm to performance.
The warnings are best observed with earlyprintk.
Based on a patch from Andi Kleen.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits the inet6_ehashfn into separate ones in
ipv6/inet6_hashtables.o and ipv6/udp.o to ease the introduction of
seperate secrets keys later.
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>