linux/include/net
Jakub Sitnicki 91d0b78c51 inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
   and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting if 4-tuple is busy can be either easy (TCP) or hard
      (UDP). In TCP case, the application simply has to check if connect()
      returned an error (EADDRNOTAVAIL). That is assuming that the local
      port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket
      to the same 4-tuple, when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and involves
      querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      the this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection is done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

  For UDP this approach has limited applicability. Setting the
  IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
  port being shared with other connected UDP sockets.

  Hence relying on the network stack to find a free source port, limits the
  number of outgoing UDP flows from a single IP address down to the number
  of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:45:00 -08:00
..
9p net/9p: distinguish zero-copy requests 2022-12-06 07:30:55 +09:00
bluetooth Bluetooth: Add quirk to disable MWS Transport Configuration 2022-12-12 14:19:24 -08:00
caif
iucv
mana v6.2 merge window pull request 2022-12-14 09:27:13 -08:00
netfilter netfilter: nf_tables: avoid retpoline overhead for some ct expression calls 2023-01-18 13:05:25 +01:00
netns net: xsk: Don't include <linux/rculist.h> 2022-12-06 20:04:34 -08:00
nfc
phonet
sctp sctp: delete free member from struct sctp_sched_ops 2022-12-01 20:14:23 -08:00
tc_act net: sched: add helper support in act_ct 2022-11-08 12:15:19 +01:00
6lowpan.h
Space.h
act_api.h net/sched: move struct action_ops definition out of ifdef 2022-12-09 09:18:07 +00:00
addrconf.h
af_ieee802154.h
af_rxrpc.h rxrpc: Tidy up abort generation infrastructure 2023-01-06 09:43:32 +00:00
af_unix.h
af_vsock.h
ah.h
amt.h
arp.h
atmclip.h
ax25.h
ax88796.h
bareudp.h
bond_3ad.h net: bonding: Share lacpdu_mcast_addr definition 2022-09-16 14:34:01 +01:00
bond_alb.h bonding (gcc13): synchronize bond_{a,t}lb_xmit() types 2022-11-02 20:38:13 -07:00
bond_options.h
bonding.h bond: Disable TLS features indication 2022-10-27 12:45:38 +02:00
bpf_sk_storage.h
busy_poll.h
calipso.h
cfg80211-wext.h wifi: cfg80211: Avoid clashing function prototypes 2022-11-16 11:31:47 +02:00
cfg80211.h wifi: cfg80211: Use MLD address to indicate MLD STA disconnection 2023-01-18 17:31:50 +01:00
cfg802154.h ieee802154: Advertize coordinators discovery 2022-11-29 15:34:22 +01:00
checksum.h
cipso_ipv4.h
cls_cgroup.h
codel.h
codel_impl.h
codel_qdisc.h
compat.h
datalink.h
dcbevent.h
dcbnl.h net: dcb: add helper functions to retrieve PCP and DSCP rewrite maps 2023-01-20 09:33:22 +00:00
devlink.h devlink: remove devl*_port_health_reporter_destroy() 2023-01-19 19:08:37 -08:00
dropreason.h net: dropreason: add SKB_DROP_REASON_FRAG_TOO_FAR 2022-10-31 20:14:27 -07:00
dsa.h net: dsa: add plumbing for changing and getting MAC merge layer state 2023-01-23 12:44:18 +00:00
dsfield.h
dst.h net: add atomic_long_t to net_device_stats fields 2022-11-16 12:48:44 +00:00
dst_cache.h
dst_metadata.h xfrm: interface: Add unstable helpers for setting/getting XFRM metadata from TC-BPF 2022-12-05 21:58:27 -08:00
dst_ops.h ipv6: remove max_size check inline with ipv4 2023-01-13 20:59:14 -08:00
erspan.h
esp.h
espintcp.h
ethoc.h
failover.h
fib_notifier.h
fib_rules.h
firewire.h
flow.h net: Remove DECnet leftovers from flow.h. 2022-10-03 12:41:59 +01:00
flow_dissector.h flow_dissector: Add L2TPv3 dissectors 2022-09-20 09:13:38 +02:00
flow_offload.h net: flow_offload: add support for ARP frame matching 2022-11-14 11:24:16 +00:00
fou.h
fq.h
fq_impl.h wifi: mac80211: add support for restricting netdev features per vif 2022-12-01 15:09:10 +01:00
garp.h
gen_stats.h
genetlink.h mptcp: more detailed error reporting on endpoint creation 2022-11-21 13:09:07 +00:00
geneve.h net: geneve: fix array of flexible structures warnings 2022-10-31 10:43:04 +00:00
gre.h
gro.h
gro_cells.h
gtp.h
gue.h
hwbm.h
icmp.h
ieee80211_radiotap.h
ieee802154_netdev.h Merge tag 'ieee802154-for-net-next-2022-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next 2022-10-26 15:24:36 +01:00
if_inet6.h
ife.h
ila.h
inet6_connection_sock.h
inet6_hashtables.h
inet_common.h
inet_connection_sock.h
inet_dscp.h
inet_ecn.h
inet_frag.h net: dropreason: add SKB_DROP_REASON_FRAG_REASM_TIMEOUT 2022-10-31 20:14:27 -07:00
inet_hashtables.h tcp: Add TIME_WAIT sockets in bhash2. 2022-12-30 07:25:52 +00:00
inet_sock.h inet: Add IP_LOCAL_PORT_RANGE socket option 2023-01-25 22:45:00 -08:00
inet_timewait_sock.h tcp: Add TIME_WAIT sockets in bhash2. 2022-12-30 07:25:52 +00:00
inetpeer.h
ioam6.h
ip.h inet: Add IP_LOCAL_PORT_RANGE socket option 2023-01-25 22:45:00 -08:00
ip6_checksum.h
ip6_fib.h
ip6_route.h ipv6: Make ip6_route_output_flags_noref() static. 2023-01-24 18:12:52 -08:00
ip6_tunnel.h
ip_fib.h
ip_tunnels.h net: Add helper function to parse netlink msg of ip_tunnel_parm 2022-10-03 07:59:06 +01:00
ip_vs.h ipvs: run_estimation should control the kthread tasks 2022-12-10 22:44:43 +01:00
ipcomp.h xfrm: ipcomp: add extack to ipcomp{4,6}_init_state 2022-09-29 07:18:00 +02:00
ipconfig.h
ipv6.h IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver 2022-12-12 15:41:44 -08:00
ipv6_frag.h net: dropreason: add SKB_DROP_REASON_FRAG_REASM_TIMEOUT 2022-10-31 20:14:27 -07:00
ipv6_stubs.h
iw_handler.h
kcm.h
l3mdev.h
lag.h
lapb.h
lib80211.h
llc.h
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h
llc_if.h
llc_pdu.h
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
lwtunnel.h
mac80211.h wifi: mac80211: drop extra 'e' from ieeee80211... name 2023-01-19 14:57:51 +01:00
mac802154.h mac802154: Drop IEEE802154_HW_RX_DROP_BAD_CKSUM 2022-10-12 12:57:19 +02:00
macsec.h net: macsec: remove the prepare flag from the MACsec offloading context 2022-09-23 06:56:08 -07:00
mctp.h
mctpdevice.h
mip6.h
mld.h
mpls.h
mpls_iptunnel.h
mptcp.h mptcp: remove MPTCP 'ifdef' in TCP SYN cookies 2022-12-12 13:11:24 -08:00
mrp.h mrp: introduce active flags to prevent UAF when applicant uninit 2022-11-18 12:14:55 +00:00
ncsi.h
ndisc.h
neighbour.h net: neigh: decrement the family specific qlen 2022-11-18 10:29:50 +00:00
net_debug.h
net_failover.h
net_namespace.h net: add a refcount tracker for kernel sockets 2022-10-24 11:04:43 +01:00
net_ratelimit.h
net_trackers.h
netevent.h
netlabel.h
netlink.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-11-03 13:21:54 -07:00
netprio_cgroup.h
netrom.h
nexthop.h
nl802154.h ieee802154: Advertize coordinators discovery 2022-11-29 15:34:22 +01:00
nsh.h
p8022.h
page_pool.h
pie.h
ping.h inet: ping: use hlist_nulls rcu iterator during lookup 2022-12-01 12:42:46 +01:00
pkt_cls.h net: sched: cls_api: introduce tc_cls_bind_class() helper 2022-10-02 16:07:17 +01:00
pkt_sched.h net/sched: taprio: allow user input of per-tc max SDU 2022-09-29 18:52:05 -07:00
pptp.h
protocol.h
psample.h
psnap.h
raw.h
rawv6.h
red.h treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
regulatory.h
request_sock.h
rose.h
route.h
rpl.h
rsi_91x.h
rtnetlink.h rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link 2022-10-31 18:10:21 -07:00
rtnh.h
sch_generic.h net/sched: sch_taprio: fix possible use-after-free 2023-01-16 13:25:34 +00:00
scm.h
secure_seq.h
seg6.h
seg6_hmac.h
seg6_local.h
selftests.h
slhc_vj.h
smc.h net/smc: De-tangle ism and smc device initialization 2023-01-25 09:46:49 +00:00
snmp.h
sock.h net: simplify sk_page_frag 2022-12-19 17:28:50 -08:00
sock_reuseport.h soreuseport: Fix socket selection for SO_INCOMING_CPU. 2022-10-25 11:35:16 +02:00
stp.h
strparser.h
switchdev.h bridge: switchdev: Allow device drivers to install locked FDB entries 2022-11-09 19:06:13 -08:00
tc_wrapper.h net/sched: fix retpoline wrapper compilation on configs without tc filters 2022-12-28 12:11:32 +00:00
tcp.h bpf-next-for-netdev 2022-12-12 11:27:42 -08:00
tcp_states.h
timewait_sock.h
tipc.h
tls.h net/tls: Describe ciphers sizes by const structs 2022-09-22 17:27:41 -07:00
tls_toe.h
transp_v6.h inet6: Remove inet6_destroy_sock(). 2022-10-24 09:40:39 +01:00
tso.h net: tso: inline tso_count_descs() 2022-12-12 15:04:39 -08:00
tun_proto.h
udp.h udp: track the forward memory release threshold in an hot cacheline 2022-10-24 10:52:50 +01:00
udp_tunnel.h net: Change the udp encap_err_rcv to allow use of {ip,ipv6}_icmp_error() 2022-11-08 16:42:28 +00:00
udplite.h tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct(). 2022-10-12 17:50:37 -07:00
vsock_addr.h
vxlan.h
wext.h
x25.h
x25device.h
xdp.h xdp: Adjust xdp_frame layout to avoid using bitfields 2022-09-26 13:28:19 -07:00
xdp_priv.h
xdp_sock.h
xdp_sock_drv.h xsk: Remove unused xsk_buff_discard 2022-09-30 07:55:46 -07:00
xfrm.h bpf-next-for-netdev 2022-12-12 11:27:42 -08:00
xsk_buff_pool.h xsk: Inherit need_wakeup flag for shared sockets 2022-09-22 17:16:22 +02:00