linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-02 17:11:33 +00:00

A mirror of the official Linux kernel repository just in case


Jakub Sitnicki 91d0b78c51 inet: Add IP_LOCAL_PORT_RANGE socket option	2023-01-25 22:45:00 -08:00

Users who want to share a single public IP address for outgoing
connections between several hosts traditionally reach for SNAT. However,
SNAT requires state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available
ephemeral port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination
   IP and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in user space. If the
      chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting whether the 4-tuple is busy can be either easy (TCP) or
      hard (UDP). In the TCP case, the application simply has to check if
      connect() returned an error (EADDRNOTAVAIL). That is assuming that
      local port sharing was enabled (REUSEADDR) by all the sockets.

          # Assume desired local port range is 60_000-60_511
          s = socket(AF_INET, SOCK_STREAM)
          s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
          s.bind(("192.0.2.1", 60_000))
          s.connect(("1.1.1.1", 53))
          # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
          # Application must retry with another local port

      In the case of UDP, the network stack allows binding more than one
      socket to the same 4-tuple when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and
      involves querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search
   and 4-tuple conflict detection is done by the network stack:

       system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

       s = socket(AF_INET, SOCK_STREAM)
       s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
       s.bind(("192.0.2.1", 0))
       s.connect(("1.1.1.1", 53))
       # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

   For UDP this approach has limited applicability. Setting the
   IP_BIND_ADDRESS_NO_PORT socket option does not result in the local
   source port being shared with other connected UDP sockets.

   Hence relying on the network stack to find a free source port limits
   the number of outgoing UDP flows from a single IP address to the number
   of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a
pair of u16 values in host byte order packed into a u32. This avoids
pointer passing.

    PORT_LO = 40_000
    PORT_HI = 40_511

    s = socket(AF_INET, SOCK_STREAM)
    v = struct.pack("I", PORT_HI << 16 | PORT_LO)
    s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
    s.bind(("127.0.0.1", 0))
    s.getsockname()
    # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
    # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
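
The u32 encoding described in the commit message can be made concrete with a small, self-contained sketch. The helper names (pack_port_range, unpack_port_range, bind_in_range) are hypothetical and do not come from the commit. The fallback value 51 for IP_LOCAL_PORT_RANGE is an assumption about the Linux UAPI header, used only when the Python socket module does not expose the constant. The non-zero bounds check is this sketch's own choice, not a statement about what the kernel accepts.

```python
import socket
import struct

# Assumption: 51 is the option number in Linux's <linux/in.h>; prefer the
# socket module's own constant when the running Python provides one.
IP_LOCAL_PORT_RANGE = getattr(socket, "IP_LOCAL_PORT_RANGE", 51)


def pack_port_range(lo, hi):
    """Pack (lo, hi) as two u16 values in one host-byte-order u32."""
    if not (0 < lo <= hi <= 0xFFFF):
        raise ValueError(f"invalid port range: {lo}-{hi}")
    # High bound in the upper 16 bits, low bound in the lower 16 bits,
    # mirroring the struct.pack("I", ...) call in the commit message.
    return struct.pack("I", (hi << 16) | lo)


def unpack_port_range(value):
    """Inverse of pack_port_range(): return (lo, hi)."""
    (raw,) = struct.unpack("I", value)
    return raw & 0xFFFF, raw >> 16


def bind_in_range(addr, lo, hi):
    """Bind a TCP socket to addr with a port picked from lo..hi.

    Returns the bound socket, or None when the kernel rejects the option
    (for example, a kernel without this commit) or no port is free.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # IPPROTO_IP has the same value as SOL_IP on Linux.
        s.setsockopt(socket.IPPROTO_IP, IP_LOCAL_PORT_RANGE,
                     pack_port_range(lo, hi))
        s.bind((addr, 0))
    except OSError:
        s.close()
        return None
    return s
```

A caller might use bind_in_range("127.0.0.1", 40_000, 40_511) and then check getsockname() on the result. Returning None instead of raising keeps the sketch usable on kernels that do not support the option.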
arch	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	2023-01-20 12:28:23 -08:00
block	block-6.2-2023-01-13	2023-01-13 17:41:19 -06:00
certs	certs: make system keyring depend on built-in x509 parser	2022-09-24 04:31:18 +09:00
crypto	wifi: cfg80211: Deduplicate certificate loading	2023-01-19 14:46:45 +01:00
Documentation	netlink: add a proto specification for FOU	2023-01-24 10:58:11 +01:00
drivers	net/smc: De-tangle ism and smc device initialization	2023-01-25 09:46:49 +00:00
fs	net/sock: Introduce trace_sk_data_ready()	2023-01-23 11:26:50 +00:00
include	inet: Add IP_LOCAL_PORT_RANGE socket option	2023-01-25 22:45:00 -08:00
init	21 hotfixes. Thirteen of these address pre-6.1 issues and hence have	2023-01-16 16:36:39 -08:00
io_uring	io_uring: lock overflowing for IOPOLL	2023-01-13 07:32:46 -07:00
ipc	Non-MM patches for 6.2-rc1.	2022-12-12 17:28:58 -08:00
kernel	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	2023-01-20 12:28:23 -08:00
lib	21 hotfixes. Thirteen of these address pre-6.1 issues and hence have	2023-01-16 16:36:39 -08:00
LICENSES	LICENSES: Add the copyleft-next-0.3.1 license	2022-11-08 15:44:01 +01:00
mm	slab fixes for 6.2-rc5	2023-01-19 12:24:39 -08:00
net	inet: Add IP_LOCAL_PORT_RANGE socket option	2023-01-25 22:45:00 -08:00
rust	rust: types: add `Opaque` type	2022-12-04 01:59:16 +01:00
samples	bpf-next-for-netdev	2023-01-04 20:21:25 -08:00
scripts	kernel hardening fixes for v6.2-rc4	2023-01-14 10:04:00 -06:00
security	tomoyo: Update website link	2023-01-13 23:11:38 +09:00
sound	sound fixes for 6.2-rc4	2023-01-13 08:20:29 -06:00
tools	tools: ynl: add a completely generic client	2023-01-24 10:58:11 +01:00
usr	usr/gen_init_cpio.c: remove unnecessary -1 values from int file	2022-10-03 14:21:44 -07:00
virt	KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule	2023-01-11 13:32:21 -05:00
.clang-format	iommufd for 6.2	2022-12-14 09:15:43 -08:00
.cocciconfig
.get_maintainer.ignore	get_maintainer: add Alan to .get_maintainer.ignore	2022-08-20 15:17:44 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	.gitignore: ignore *.rpm	2022-12-30 17:22:14 +09:00
.mailmap	21 hotfixes. Thirteen of these address pre-6.1 issues and hence have	2023-01-16 16:36:39 -08:00
.rustfmt.toml	rust: add `.rustfmt.toml`	2022-09-28 09:02:20 +02:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: zram: zsmalloc: Add an additional co-maintainer	2022-12-15 16:37:49 -08:00
Kbuild	Kbuild updates for v6.1	2022-10-10 12:00:45 -07:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	net: add basic C code generators for Netlink	2023-01-24 10:58:11 +01:00
Makefile	Linux 6.2-rc4	2023-01-15 09:22:43 -06:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.