Commit Graph

5560 Commits

Author SHA1 Message Date
David S. Miller
6a06e5e1bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/team/team.c
	drivers/net/usb/qmi_wwan.c
	net/batman-adv/bat_iv_ogm.c
	net/ipv4/fib_frontend.c
	net/ipv4/route.c
	net/l2tp/l2tp_netlink.c

The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.

qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

With help from Antonio Quartulli.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28 14:40:49 -04:00
Eric Dumazet
e2bcabec6e net: remove sk_init() helper
It seems sk_init() has no value today and even does strange things :

# grep . /proc/sys/net/core/?mem_*
/proc/sys/net/core/rmem_default:212992
/proc/sys/net/core/rmem_max:131071
/proc/sys/net/core/wmem_default:212992
/proc/sys/net/core/wmem_max:131071

We can remove it completely.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:42:00 -04:00
stephen hemminger
eccc1bb8d4 tunnel: drop packet if ECN present with not-ECT
Linux tunnels were written before RFC6040 and therefore never
implemented the corner case of ECN getting set in the outer header
and the inner header not being ready for it.

Section 4.2.  Default Tunnel Egress Behaviour.
 o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
      propagate any other ECN codepoint onwards.  This is because the
      inner Not-ECT marking is set by transports that rely on dropped
      packets as an indication of congestion and would not understand or
      respond to any other ECN codepoint [RFC4774].  Specifically:

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         CE, the decapsulator MUST drop the packet.

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
         outgoing packet with the ECN field cleared to Not-ECT.

This patch moves the ECN decap logic out of the individual tunnels
into a common place.

It also adds logging to allow detecting broken systems that
set ECN bits incorrectly when tunneling (or an intermediate
router might be changing the header).

Overloads rx_frame_error to keep track of ECN related error.

Thanks to Chris Wright who caught this while reviewing the new VXLAN
tunnel.

This code was tested by injecting faulty logic in other end GRE
to send incorrectly encapsulated packets.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:12:37 -04:00
Eric Dumazet
5640f76858 net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:31:37 -04:00
David S. Miller
2a6c8c7998 net: Remove unnecessary NULL check in scm_destroy().
All callers provide a non-NULL scm argument.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 15:52:33 -04:00
Neal Cardwell
016818d076 tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS
When taking SYNACK RTT samples for servers using TCP Fast Open, fix
the code to ensure that we only call tcp_valid_rtt_meas() after we
receive the ACK that completes the 3-way handshake.

Previously we were always taking an RTT sample in
tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
to take the RTT sample.

To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
already ensures that we only take a SYNACK RTT sample if snt_synack is
non-zero. To be careful, we only take a snt_synack timestamp when
a SYNACK transmit or retransmit succeeds.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22 15:47:10 -04:00
Neal Cardwell
623df484a7 tcp: extract code to compute SYNACK RTT
In preparation for adding another spot where we compute the SYNACK
RTT, extract this code so that it can be shared.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22 15:47:10 -04:00
Linus Torvalds
abef3bd710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking updates from David Miller:
 "More bug fixes, nothing gets past these guys"

 1) More kernel info leaks found by Mathias Krause, this time in the
    IPSEC configuration layers.

 2) When IPSEC policies change, we do not properly make sure that cached
    routes (which could now be stale) throughout the system will be
    revalidated.  Fix this by generalizing the generation count
    invalidation scheme used by ipv4.  From Nicolas Dichtel.

 3) When repairing TCP sockets, we need to allow to restore not just the
    send window scale, but the receive one too.  Extend the existing
    interface to achieve this in a backwards compatible way.  From
    Andrey Vagin.

 4) A fix for FCOE scatter gather feature validation erroneously caused
    scatter gather to be disabled for things like AOE too.  From Ed L
    Cashin.

 5) Several cases of mishandling of error pointers, from Mathias Krause,
    Wei Yongjun, and Devendra Naga.

 6) Fix gianfar build, from Richard Cochran.

 7) CAP_NET_* failures should return -EPERM not -EACCES, from Zhao
    Hongjiang.

 8) Hardware reset fix in janz-ican3 CAN driver, from Ira W Snyder.

 9) Fix oops during rmmod in ti_hecc CAN driver, from Marc Kleine-Budde.

10) The removal of the conditional compilation of the clk support code
    in the stmmac driver broke things.  This is because the interfaces
    used are the ones that don't also perform the enable/disable of the
    clk.  Fix from Stefan Roese.

11) The QFQ packet scheduler can record out of range virtual start
    times, resulting later in misbehavior and even crashes.  Fix from
    Paolo Valente.

12) If MSG_WAITALL is used with IOAT DMA under TCP, we can wedge the
    receiver when the advertised receive window goes to zero.  Detect
    this case and force the processing of the IOAT DMA queue when it
    happens to avoid getting stuck.  Fix from Michal Kubecek.

13) batman-adv assumes that test_bit() returns only 0 or 1, but this is
    not true for x86 (which returns -1 or 0, via the 'sbb' instruction).
    Fix from Linus Lussing.

14) Fix small packet corruption in e1000, from Tushar Dave.

15) make_blackhole() in the IPSEC policy code can do one read unlock too
    many, fix from Li RongQing.

16) The new tcp_try_coalesce() code introduced a bug in TCP URG
    handling, fix from Eric Dumazet.

17) Fix memory leak in __netif_receive_skb() when doing zerocopy and
    when hit an OOM condition.  From Michael S Tsirkin.

18) netxen blindly deferences pdev->bus->self, which is not guarenteed
    to be non-NULL.  Fix from Nikolay Aleksandrov.

19) Fix a performance regression caused by mistakes in ipv6 checksum
    validation in the bnx2x driver, fix from Michal Schmidt.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (45 commits)
  net/stmmac: Use clk_prepare_enable and clk_disable_unprepare
  net: change return values from -EACCES to -EPERM
  net/irda: sh_sir: fix return value check in sh_sir_set_baudrate()
  stmmac: fix return value check in stmmac_open_ext_timer()
  gianfar: fix phc index build failure
  ipv6: fix return value check in fib6_add()
  bnx2x: remove false warning regarding interrupt number
  can: ti_hecc: fix oops during rmmod
  can: janz-ican3: fix support for older hardware revisions
  net: do not disable sg for packets requiring no checksum
  aoe: assert AoE packets marked as requiring no checksum
  at91ether: return PTR_ERR if call to clk_get fails
  xfrm_user: don't copy esn replay window twice for new states
  xfrm_user: ensure user supplied esn replay window is valid
  xfrm_user: fix info leak in copy_to_user_tmpl()
  xfrm_user: fix info leak in copy_to_user_policy()
  xfrm_user: fix info leak in copy_to_user_state()
  xfrm_user: fix info leak in copy_to_user_auth()
  net: qmi_wwan: adding Huawei E367, ZTE MF683 and Pantech P4200
  tcp: restore rcv_wscale in a repair mode (v2)
  ...
2012-09-21 14:32:55 -07:00
Amerigo Wang
6b102865e7 ipv6: unify fragment thresh handling code
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Amerigo Wang
d4915c087f ipv6: make ip6_frag_nqueues() and ip6_frag_mem() static inline
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Amerigo Wang
b836c99fd6 ipv6: unify conntrack reassembly expire code with standard one
Two years ago, Shan Wei tried to fix this:
http://patchwork.ozlabs.org/patch/43905/

The problem is that RFC2460 requires an ICMP Time
Exceeded -- Fragment Reassembly Time Exceeded message should be
sent to the source of that fragment, if the defragmentation
times out.

"
   If insufficient fragments are received to complete reassembly of a
   packet within 60 seconds of the reception of the first-arriving
   fragment of that packet, reassembly of that packet must be
   abandoned and all the fragments that have been received for that
   packet must be discarded.  If the first fragment (i.e., the one
   with a Fragment Offset of zero) has been received, an ICMP Time
   Exceeded -- Fragment Reassembly Time Exceeded message should be
   sent to the source of that fragment.
"

As Herbert suggested, we could actually use the standard IPv6
reassembly code which follows RFC2460.

With this patch applied, I can see ICMP Time Exceeded sent
from the receiver when the sender sent out 3/4 fragmented
IPv6 UDP packet.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Amerigo Wang
c038a767cd ipv6: add a new namespace for nf_conntrack_reasm
As pointed by Michal, it is necessary to add a new
namespace for nf_conntrack_reasm code, this prepares
for the second patch.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Nicolas Dichtel
6f3118b571 ipv6: use net->rt_genid to check dst validity
IPv6 dst should take care of rt_genid too. When a xfrm policy is inserted or
deleted, all dst should be invalidated.
To force the validation, dst entries should be created with ->obsolete set to
DST_OBSOLETE_FORCE_CHK. This was already the case for all functions calling
ip6_dst_alloc(), except for ip6_rt_copy().

As a consequence, we can remove the specific code in inet6_connection_sock.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 15:57:03 -04:00
Nicolas Dichtel
b42664f898 netns: move net->ipv4.rt_genid to net->rt_genid
This commit prepares the use of rt_genid by both IPv4 and IPv6.
Initialization is left in IPv4 part.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 15:57:03 -04:00
Nicolas Dichtel
bafa6d9d89 ipv4/route: arg delay is useless in rt_cache_flush()
Since route cache deletion (89aef8921b), delay is no
more used. Remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 15:44:34 -04:00
Chuck Lever
35c448a8a3 include/net/sock.h: squelch compiler warning in sk_rmem_schedule()
This warning:

  In file included from linux/include/linux/tcp.h:227:0,
                   from linux/include/linux/ipv6.h:221,
                   from linux/include/net/ipv6.h:16,
                   from linux/include/linux/sunrpc/clnt.h:26,
                   from linux/net/sunrpc/stats.c:22:
  linux/include/net/sock.h: In function `sk_rmem_schedule':
  linux/nfs-2.6/include/net/sock.h:1339:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

is seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) using the
-Wextra option.

Commit c76562b670 ("netvm: prevent a stream-specific deadlock")
accidentally replaced the "size" parameter of sk_rmem_schedule() with an
unsigned int.  This changes the semantics of the comparison in the
return statement.

In sk_wmem_schedule we have syntactically the same comparison, but
"size" is a signed integer.  In addition, __sk_mem_schedule() takes a
signed integer for its "size" parameter, so there is an implicit type
conversion in sk_rmem_schedule() anyway.

Revert the "size" parameter back to a signed integer so that the
semantics of the expressions in both sk_[rw]mem_schedule() are exactly
the same.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
David S. Miller
b4516a288e llc: Remove stray reference to sysctl_llc_station_ack_timeout.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-17 13:13:24 -04:00
David S. Miller
ba01dfe182 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
This is another batch of updates intended for the 3.7 stream.

There are not a lot of large items, but iwlwifi, mwifiex, rt2x00,
ath9k, and brcmfmac all get some attention.  Wei Yongjun also provides
a series of small maintenance fixes.

This also includes a pull of the wireless tree in order to satisfy
some prerequisites for later patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-17 00:57:32 -04:00
David S. Miller
b48b63a1f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/netfilter/nfnetlink_log.c
	net/netfilter/xt_LOG.c

Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.

Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-15 11:43:53 -04:00
John W. Linville
9316f0e3c6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2012-09-14 13:53:49 -04:00
Eric W. Biederman
15e473046c netlink: Rename pid to portid to avoid confusion
It is a frequent mistake to confuse the netlink port identifier with a
process identifier.  Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.

I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.

I have successfully built an allyesconfig kernel with this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10 15:30:41 -04:00
John W. Linville
fac805f8c1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless 2012-09-07 15:07:55 -04:00
John W. Linville
4a3e12fd7a Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next 2012-09-07 14:49:46 -04:00
Nicolas Dichtel
4ccfe6d410 ipv4/route: arg delay is useless in rt_cache_flush()
Since route cache deletion (89aef8921b), delay is no
more used. Remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07 14:44:08 -04:00
Eric W. Biederman
dbe9a4173e scm: Don't use struct ucred in NETLINK_CB and struct scm_cookie.
Passing uids and gids on NETLINK_CB from a process in one user
namespace to a process in another user namespace can result in the
wrong uid or gid being presented to userspace.  Avoid that problem by
passing kuids and kgids instead.

- define struct scm_creds for use in scm_cookie and netlink_skb_parms
  that holds uid and gid information in kuid_t and kgid_t.

- Modify scm_set_cred to fill out scm_creds by heand instead of using
  cred_to_ucred to fill out struct ucred.  This conversion ensures
  userspace does not get incorrect uid or gid values to look at.

- Modify scm_recv to convert from struct scm_creds to struct ucred
  before copying credential values to userspace.

- Modify __scm_send to populate struct scm_creds on in the scm_cookie,
  instead of just copying struct ucred from userspace.

- Modify netlink_sendmsg to copy scm_creds instead of struct ucred
  into the NETLINK_CB.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07 14:42:05 -04:00
John W. Linville
777bf135b7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
John W. Linville says:

====================
Please pull these fixes intended for 3.6.  There are more commits
here than I would like -- I got a bit behind while I was stalking
Steven Rostedt in San Diego last week...  I'll slow it down after this!

There are a couple of pulls here.  One is from Johannes:

"Please pull (according to the below information) to get a few fixes.

 * a fix to properly disconnect in the driver when authentication or
   association fails
 * a fix to prevent invalid information about mesh paths being reported
   to userspace
 * a memory leak fix in an nl80211 error path"

The other comes via Gustavo:

"A few updates for the 3.6 kernel. There are two btusb patches to add
more supported devices through the new USB_VENDOR_AND_INTEFACE_INFO()
macro and another one that add a new device id for a Sony Vaio laptop,
one fix for a user-after-free and, finally, two patches from Vinicius
to fix a issue in SMP pairing."

Along with those...

Arend van Spriel provides a fix for a use-after-free bug in brcmfmac.

Daniel Drake avoids a hang by not trying to touch the libertas hardware
duing suspend if it is already powered-down.

Felix Fietkau provides a batch of ath9k fixes that adress some
potential problems with power settings, as well as a fix to avoid a
potential interrupt storm.

Gertjan van Wingerde provides a register-width fix for rt2x00, and
a rt2x00 fix to prevent incorrectly detecting the rfkill status.
He also provides a device ID patch.

Hante Meuleman gives us three brcmfmac fixes, one that properly
initializes a command structure, one that fixes a race condition that
could lose usb requests, and one that removes some log spam.

Marc Kleine-Budde offers an rt2x00 fix for a voltage setting on some
specific devices.

Mohammed Shafi Shajakhan sent an ath9k fix to avoid a crash related to
using timers that aren't allocated when 2 wire bluetooth coexistence
hardware is in use.

Sergei Poselenov changes rt2800usb to do some validity checking for
received packets, avoiding crashes on an ARM Soc.

Stone Piao gives us an mwifiex fix for an incorrectly set skb length
value for a command buffer.

All of these are localized to their specific drivers, and relatively
small.  The power-related patches from Felix are bigger than I would
like, but I merged them in consideration of their isolation to ath9k
and the sensitive nature of power settings in wireless devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07 14:38:50 -04:00
Nicolas Dichtel
ef2c7d7b59 ipv6: fix handling of blackhole and prohibit routes
When adding a blackhole or a prohibit route, they were handling like classic
routes. Moreover, it was only possible to add this kind of routes by specifying
an interface.

Bug already reported here:
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498

Before the patch:
  $ ip route add blackhole 2001::1/128
  RTNETLINK answers: No such device
  $ ip route add blackhole 2001::1/128 dev eth0
  $ ip -6 route | grep 2001
  2001::1 dev eth0  metric 1024

After:
  $ ip route add blackhole 2001::1/128
  $ ip -6 route | grep 2001
  blackhole 2001::1 dev lo  metric 1024  error -22

v2: wrong patch
v3: add a field fc_type in struct fib6_config to store RTN_* type

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05 17:49:28 -04:00
Robert P. J. Day
c9a0a30252 cfg80211: add kerneldoc entry for "vht_cap"
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-09-05 15:54:07 +02:00
Steffen Klassert
3b59df46a4 xfrm: Workaround incompatibility of ESN and async crypto
ESN for esp is defined in RFC 4303. This RFC assumes that the
sequence number counters are always up to date. However,
this is not true if an async crypto algorithm is employed.

If the sequence number counters are not up to date on sequence
number check, we may incorrectly update the upper 32 bit of
the sequence number. This leads to a DOS.

We workaround this by comparing the upper sequence number,
(used for authentication) with the upper sequence number
computed after the async processing. We drop the packet
if these numbers are different.

To do this, we introduce a recheck function that does this
check in the ESN case.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-04 14:09:45 -04:00
David S. Miller
1e9f0207d3 Merge branch 'master' of git://1984.lsi.us.es/nf-next 2012-09-03 20:26:45 -04:00
Yuchung Cheng
684bad1107 tcp: use PRR to reduce cwin in CWR state
Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state,
in addition to Recovery state. Retire the current rate-halving in CWR.
When losses are detected via ACKs in CWR state, the sender enters Recovery
state but the cwnd reduction continues and does not restart.

Rename and refactor cwnd reduction functions since both CWR and Recovery
use the same algorithm:
tcp_init_cwnd_reduction() is new and initiates reduction state variables.
tcp_cwnd_reduction() is previously tcp_update_cwnd_in_recovery().
tcp_ends_cwnd_reduction() is previously  tcp_complete_cwr().

The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(),
and the cwnd moderation inside tcp_enter_cwr() are removed. The unused
parameter, flag, in tcp_cwnd_reduction() is also removed.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-03 14:34:02 -04:00
Pablo Neira Ayuso
ace1fe1231 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
This merges (3f509c6 netfilter: nf_nat_sip: fix incorrect handling
of EBUSY for RTCP expectation) to Patrick McHardy's IPv6 NAT changes.
2012-09-03 15:34:51 +02:00
Pablo Neira Ayuso
84b5ee939e netfilter: nf_conntrack: add nf_ct_timeout_lookup
This patch adds the new nf_ct_timeout_lookup function to encapsulate
the timeout policy attachment that is called in the nf_conntrack_in
path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-09-03 13:33:03 +02:00
Jerry Chu
8336886f78 tcp: TCP Fast Open Server - support TFO listeners
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
Jerry Chu
1046716368 tcp: TCP Fast Open Server - header & support functions
This patch adds all the necessary data structure and support
functions to implement TFO server side. It also documents a number
of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
extension MIBs.

In addition, it includes the following:

1. a new TCP_FASTOPEN socket option an application must call to
supply a max backlog allowed in order to enable TFO on its listener.

2. A number of key data structures:
"fastopen_rsk" in tcp_sock - for a big socket to access its
request_sock for retransmission and ack processing purpose. It is
non-NULL iff 3WHS not completed.

"fastopenq" in request_sock_queue - points to a per Fast Open
listener data structure "fastopen_queue" to keep track of qlen (# of
outstanding Fast Open requests) and max_qlen, among other things.

"listener" in tcp_request_sock - to point to the original listener
for book-keeping purpose, i.e., to maintain qlen against max_qlen
as part of defense against IP spoofing attack.

3. various data structure and functions, many in tcp_fastopen.c, to
support server side Fast Open cookie operations, including
/proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:18 -04:00
Alex Bergmann
6c9ff979d1 tcp: Increase timeout for SYN segments
Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for
the passive open side") changed the initRTO from 3secs to 1sec in
accordance to RFC6298 (former RFC2988bis). This reduced the time till
the last SYN retransmission packet gets sent from 93secs to 31secs.

RFC1122 is stating that the retransmission should be done for at least 3
minutes, but this seems to be quite high.

  "However, the values of R1 and R2 may be different for SYN
  and data segments.  In particular, R2 for a SYN segment MUST
  be set large enough to provide retransmission of the segment
  for at least 3 minutes.  The application can close the
  connection (i.e., give up on the open attempt) sooner, of
  course."

This patch increases the value of TCP_SYN_RETRIES to the value of 6,
providing a retransmission window of 63secs.

The comments for SYN and SYNACK retries have also been updated to
describe the current settings. The same goes for the documentation file
"Documentation/networking/ip-sysctl.txt".

Signed-off-by: Alexander Bergmann <alex@linlab.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 15:42:10 -04:00
David S. Miller
c32f38619a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge the 'net' tree to get the recent set of netfilter bug fixes in
order to assist with some merge hassles Pablo is going to have to deal
with for upcoming changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 15:14:18 -04:00
David S. Miller
0dcd5052c8 Merge branch 'master' of git://1984.lsi.us.es/nf 2012-08-31 13:06:37 -04:00
Pablo Neira Ayuso
5b423f6a40 netfilter: nf_conntrack: fix racy timer handling with reliable events
Existing code assumes that del_timer returns true for alive conntrack
entries. However, this is not true if reliable events are enabled.
In that case, del_timer may return true for entries that were
just inserted in the dying list. Note that packets / ctnetlink may
hold references to conntrack entries that were just inserted to such
list.

This patch fixes the issue by adding an independent timer for
event delivery. This increases the size of the ecache extension.
Still we can revisit this later and use variable size extensions
to allocate this area on demand.

Tested-by: Oliver Smith <olipro@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-08-31 15:50:28 +02:00
Patrick McHardy
b3f644fc82 netfilter: ip6tables: add MASQUERADE target
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:18 +02:00
Patrick McHardy
58a317f106 netfilter: ipv6: add IPv6 NAT support
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:17 +02:00
Patrick McHardy
2cf545e835 net: core: add function for incremental IPv6 pseudo header checksum updates
Add inet_proto_csum_replace16 for incrementally updating IPv6 pseudo header
checksums for IPv6 NAT.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: David S. Miller <davem@davemloft.net>
2012-08-30 03:00:16 +02:00
Patrick McHardy
c7232c9979 netfilter: add protocol independent NAT core
Convert the IPv4 NAT implementation to a protocol independent core and
address family specific modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:14 +02:00
Patrick McHardy
051966c0c6 netfilter: nf_nat: add protoff argument to packet mangling functions
For mangling IPv6 packets the protocol header offset needs to be known
by the NAT packet mangling functions. Add a so far unused protoff argument
and convert the conntrack and NAT helpers to use it in preparation of
IPv6 NAT.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:13 +02:00
Vinicius Costa Gomes
cc110922da Bluetooth: Change signature of smp_conn_security()
To make it clear that it may be called from contexts that may not have
any knowledge of L2CAP, we change the connection parameter, to receive
a hci_conn.

This also makes it clear that it is checking the security of the link.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
2012-08-27 08:07:18 -07:00
Patrick McHardy
5f2d04f1f9 ipv4: fix path MTU discovery with connection tracking
IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and
(in case of forwarded packets) refragments them at POST_ROUTING
independent of the IP_DF flag. Refragmentation uses the dst_mtu() of
the local route without caring about the original fragment sizes,
thereby breaking PMTUD.

This patch fixes this by keeping track of the largest received fragment
with IP_DF set and generates an ICMP fragmentation required error during
refragmentation if that size exceeds the MTU.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
2012-08-26 19:13:55 +02:00
David S. Miller
e6acb38480 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 18:54:37 -04:00
John W. Linville
f20b6213f1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2012-08-24 12:25:30 -04:00
Rami Rosen
f63c45e0e6 packet: fix broken build.
This patch fixes a broken build due to a missing header:
...
  CC      net/ipv4/proc.o
In file included from include/net/net_namespace.h:15,
                 from net/ipv4/proc.c:35:
include/net/netns/packet.h:11: error: field 'sklist_lock' has incomplete type
...

The lock of netns_packet has been replaced by a recent patch to be a mutex instead of a spinlock,
but we need to replace the header file to be linux/mutex.h instead of linux/spinlock.h as well.

See commit 0fa7fa98db:
packet: Protect packet sk list with mutex (v2) patch,

Signed-off-by: Rami Rosen <rosenr@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-23 09:29:45 -07:00
Pavel Emelyanov
0fa7fa98db packet: Protect packet sk list with mutex (v2)
Change since v1:

* Fixed inuse counters access spotted by Eric

In patch eea68e2f (packet: Report socket mclist info via diag module) I've
introduced a "scheduling in atomic" problem in packet diag module -- the
socket list is traversed under rcu_read_lock() while performed under it sk
mclist access requires rtnl lock (i.e. -- mutex) to be taken.

[152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
[152363.820573] 4 locks held by crtools/12517:
[152363.820581]  #0:  (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
[152363.820613]  #1:  (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
[152363.820644]  #2:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
[152363.820693]  #3:  (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af

Similar thing was then re-introduced by further packet diag patches (fanount
mutex and pgvec mutex for rings) :(

Apart from being terribly sorry for the above, I propose to change the packet
sk list protection from spinlock to mutex. This lock currently protects two
modifications:

* sklist
* prot inuse counters

The sklist modifications can be just reprotected with mutex since they already
occur in a sleeping context. The inuse counters modifications are trickier -- the
__this_cpu_-s are used inside, thus requiring the caller to handle the potential
issues with contexts himself. Since packet sockets' counters are modified in two
places only (packet_create and packet_release) we only need to protect the context
from being preempted. BH disabling is not required in this case.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22 22:58:27 -07:00