NAT between identical networks using VRF
Table of Contents
Last week, Oskar Stenberg asked on Unix & Linux if it were possible to configure connectivity between two networks, both using the same address range, without involving network namespaces. That is, given this high level view of the network…
…can we set things up so that hosts on the “inner” network can communicate with hosts on the “outer” network using the range
192.168.3.0/24, and similarly for communication in the other direction?
Setting up a lab⌗
When investigating this sort of networking question, I find it easiest to reproduce the topology in a virtual environment so that it’s easy to test things out. I generally use Mininet for this, which provides a simple Python API for creating virtual nodes and switches and creating links between them.
I created the following network topology for this test:
In the rest of this post, I’ll be referring to these hostnames.
See the bottom of this post for a link to the repository that contains the complete test environment.
VRF in theory⌗
VRF stands for “Virtual Routing and Forwarding”. From the Wikipedia article on the topic:
In IP-based computer networks, virtual routing and forwarding (VRF) is a technology that allows multiple instances of a routing table to co-exist within the same router at the same time. One or more logical or physical interfaces may have a VRF and these VRFs do not share routes therefore the packets are only forwarded between interfaces on the same VRF. VRFs are the TCP/IP layer 3 equivalent of a VLAN. Because the routing instances are independent, the same or overlapping IP addresses can be used without conflicting with each other. Network functionality is improved because network paths can be segmented without requiring multiple routers.1
In Linux, VRF support is implemented as a special type of network device. A VRF device sets up an isolated routing domain; network traffic on devices associated with a VRF will use the routing table associated with that VRF, rather than the main routing table, which permits us to connect multiple networks with overlapping address ranges.
We can create new VRF devices with the
ip link add command:
ip link add vrf-inner type vrf table 100
Running the above command results in the following changes:
It creates a new network device named
It adds a new route policy rule (if it doesn’t already exist) that looks like:
1000: from all lookup [l3mdev-table]
This causes route lookups to use the appropriate route table for interfaces associated with a VRF.
After creating a VRF device, we can add interfaces to it like this:
ip link set eth0 master vrf-inner
This associates the given interface with the VRF device, and it moves all routes associated with the interface out of the
main routing tables and into the VRF-specific routing table.
You can see a list of vrf devices by running
ip vrf show:
# ip vrf show Name Table ----------------------- vrf-inner 100
You can see a list of devices associated with a particular VRF with the
ip link command:
# ip -brief link show master vrf-inner eth0@if448 UP 72:87:af:d3:b5:f9 <BROADCAST,MULTICAST,UP,LOWER_UP>
VRF in practice⌗
We’re going to create two VRF devices on the
middleman host; one associated with the “inner” network and one associated with the “outer” network. In our virtual network topology, the
middleman host has two network interfaces:
middleman-eth0is connected to the “inner” network
middleman-eth1is connected to the “outer” network
Both devices have the same address (
# ip addr show 2: middleman-eth0@if426: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf-inner state UP group default qlen 1000 link/ether 32:9e:01:2e:78:2f brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.1/24 brd 192.168.2.255 scope global middleman-eth0 valid_lft forever preferred_lft forever root@mininet-vm:~/unix-735931# ip addr show middleman-eth1 3: middleman-eth1@if427: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf-outer state UP group default qlen 1000 link/ether 12:be:9a:09:33:93 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.168.2.1/24 brd 192.168.2.255 scope global middleman-eth1 valid_lft forever preferred_lft forever
And the main routing table looks like this:
# ip route show 192.168.2.0/24 dev middleman-eth1 proto kernel scope link src 192.168.2.1 192.168.2.0/24 dev middleman-eth0 proto kernel scope link src 192.168.2.1
If you’re at all familiar with Linux network configuration, that probably looks weird. Right now this isn’t a particularly functional network configuration, but we can fix that!
To create our two VRF devices, we run the following commands:
ip link add vrf-inner type vrf table 100 ip link add vrf-outer type vrf table 200 ip link set vrf-inner up ip link set vrf-outer up
vrf-inner with route table 100, and
vrf-outer with route table 200. At this point, tables 100 and 200 are empty:
# ip route show table 100 Error: ipv4: FIB table does not exist. Dump terminated # ip route show table 200 Error: ipv4: FIB table does not exist. Dump terminated
Next, we add our interfaces to the appropriate VRF devices:
ip link set middleman-eth0 master vrf-inner ip link set middleman-eth1 master vrf-outer
After running these commands, there are no routes left in the main routing table:
# ip route show <no output>
And the routes associated with our two physical interfaces are now contained by the appropriate VRF routing tables. Here’s table 100:
root@mininet-vm:~/unix-735931# ip route show table 100 broadcast 192.168.2.0 dev middleman-eth0 proto kernel scope link src 192.168.2.1 192.168.2.0/24 dev middleman-eth0 proto kernel scope link src 192.168.2.1 local 192.168.2.1 dev middleman-eth0 proto kernel scope host src 192.168.2.1 broadcast 192.168.2.255 dev middleman-eth0 proto kernel scope link src 192.168.2.1
And table 200:
root@mininet-vm:~/unix-735931# ip route show table 200 broadcast 192.168.2.0 dev middleman-eth1 proto kernel scope link src 192.168.2.1 192.168.2.0/24 dev middleman-eth1 proto kernel scope link src 192.168.2.1 local 192.168.2.1 dev middleman-eth1 proto kernel scope host src 192.168.2.1 broadcast 192.168.2.255 dev middleman-eth1 proto kernel scope link src 192.168.2.1
This configuration effectively gives us two isolated networks:
We can verify that nodes in the “inner” and “outer” networks are now able to communicate with
middleman. We can reach
innernode0; in this case, we’re communicating with interface
innernode0# ping -c1 192.168.2.1 PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data. 64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.126 ms --- 192.168.2.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.126/0.126/0.126/0.000 ms
Similarly, we can reach
outernode, but in this case we’re communicating with interface
outernode0# ping -c1 192.168.2.1 PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data. 64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=1.02 ms --- 192.168.2.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 1.020/1.020/1.020/0.000 ms
Configure routing on the nodes⌗
Our goal is to let nodes on one side of the network to use the address range
192.168.3.0/24 to refer to nodes on the other side of the network. Right now, if we were to try to access
innernode0, the attempt would fail with:
innernode0# ping 192.168.3.10 ping: connect: Network is unreachable
The “network is unreachable” message means that
innernode0 has no idea where to send that request. That’s because at the moment, the routing table on all the nodes look like:
innernode0# ip route 192.168.2.0/24 dev outernode0-eth0 proto kernel scope link src 192.168.2.10
There is neither a default gateway nor a network-specific route appropriate for
192.168.3.0/24 addresses. Let’s add a network route that will route that address range through
innernode0# ip route add 192.168.3.0/24 via 192.168.2.1 innernode0# ip route 192.168.2.0/24 dev innernode0-eth0 proto kernel scope link src 192.168.2.10 192.168.3.0/24 via 192.168.2.1 dev innernode0-eth0
This same change needs to be made on all the
With the route in place, attempts to reach
innernode0 will still fail, but now they’re getting rejected by
middleman because it doesn’t have any appropriate routes:
innernode0# ping -c1 192.168.3.10 PING 192.168.3.10 (192.168.3.10) 56(84) bytes of data. From 192.168.2.1 icmp_seq=1 Destination Net Unreachable --- 192.168.3.10 ping statistics --- 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
We need to tell
middleman what to do with these packets.
Configure routing and NAT on middleman⌗
In order to achieve our desired connectivity, we need to:
- Map the
192.168.3.0/24destination address to the equivalent
192.168.2.0/24address before the kernel makes a routing decision.
- Map the
192.168.2.0/24source address to the equivalent
192.168.3.0/24address after the kernel makes a routing decision (so that replies will go back to “other” side).
- Ensure that the kernel uses the routing table for the target network when making routing decisions for these connections.
We can achieve (1) and (2) using the netfilter
NETMAP extension by adding the following two rules:
iptables -t nat -A PREROUTING -d 192.168.3.0/24 -j NETMAP --to 192.168.2.0/24 iptables -t nat -A POSTROUTING -s 192.168.2.0/24 -j NETMAP --to 192.168.3.0/24
For incoming traffic destined for the 192.168.3.0/24 network, this maps the destination address to the matching
192.168.2.0/24 address. For outgoing traffic with a source address on the
192.168.2.0/24 network, this maps the source to the equivalent
192.168.3.0/24 network (so that the recipient see the traffic as coming from “the other side”).
(For those of you wondering, “can we do this using
nftables instead?”, as of this writing
nftables does not appear to have
NETMAP support, so we have to use
iptables for this step.)
With this change in place, re-trying that
ping command on
innernode0 will apparently succeed:
innernode0 ping -c1 192.168.3.10 PING 192.168.3.10 (192.168.3.10) 56(84) bytes of data. 64 bytes from 192.168.3.10: icmp_seq=1 ttl=63 time=0.063 ms --- 192.168.3.10 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.063/0.063/0.063/0.000 ms
middleman will show us that we haven’t yet achieved our goal:
12:59:52.899054 middleman-eth0 In IP 192.168.2.10 > 192.168.3.10: ICMP echo request, id 16520, seq 1, length 64 12:59:52.899077 middleman-eth0 Out IP 192.168.3.10 > 192.168.2.10: ICMP echo request, id 16520, seq 1, length 64 12:59:52.899127 middleman-eth0 In IP 192.168.2.10 > 192.168.3.10: ICMP echo reply, id 16520, seq 1, length 64 12:59:52.899130 middleman-eth0 Out IP 192.168.3.10 > 192.168.2.10: ICMP echo reply, id 16520, seq 1, length 64
You can see that our packet is coming on on
middleman-eth0…and going right back out the same interface. We have thus far achieved a very complicated loopback interface.
The missing piece is some logic to have the kernel use the routing table for the “other side” when making routing decisions for these packets. We’re going to do that by:
- Tagging packets with a mark that indicates the interface on which they were recieved
- Using this mark to select an appropriate routing table
We add the packet mark by adding these rules to the
iptables -t mangle -A PREROUTING -i middleman-eth0 -d 192.168.3.0/24 -j MARK --set-mark 100 iptables -t mangle -A PREROUTING -i middleman-eth1 -d 192.168.3.0/24 -j MARK --set-mark 200
And we utilize that mark in route lookups by adding the following two route policy rules:
ip rule add prio 100 fwmark 100 lookup 200 ip rule add prio 200 fwmark 200 lookup 100
It is critical that these rules come before (aka “have a higher priority than”, aka “have a lower number than”) the
l3mdev rule added when we created the VRF devices.
Validation: Does it actually work?⌗
With that last set of changes in place, if we repeat the
ping test from
outernode0 and run
middleman, we see:
13:05:27.667793 middleman-eth0 In IP 192.168.2.10 > 192.168.3.10: ICMP echo request, id 16556, seq 1, length 64 13:05:27.667816 middleman-eth1 Out IP 192.168.3.10 > 192.168.2.10: ICMP echo request, id 16556, seq 1, length 64 13:05:27.667863 middleman-eth1 In IP 192.168.2.10 > 192.168.3.10: ICMP echo reply, id 16556, seq 1, length 64 13:05:27.667868 middleman-eth0 Out IP 192.168.3.10 > 192.168.2.10: ICMP echo reply, id 16556, seq 1, length 64
Now we finally see the desired behavior: the request from
innernode0 comes in on
eth0, goes out on
eth1 with the addresses appropriately mapped and gets delivered to
outernode0. The reply from
outernode0 goes through the process in reverse, and arrives back at
Connection tracking (or, “One more thing…”)⌗
There is a subtle problem with the configuration we’ve implemented so far: the Linux connection tracking mechanism ("conntrack") by default identifies a connection by the 4-tuple
(source_address, source_port, destination_address, destination_port). To understand why this is a problem, assume that we’re running a web server on port 80 on all the “inner” and “outer” nodes.
To connect from
outernode0, we could use the following command. We’re using the
--local-port option here because we want to control the source port of our connections:
innernode0# curl --local-port 4000 192.168.3.10
To connect from
innernode0, we would use the same command:
outernode0# curl --local-port 4000 192.168.3.10
If we look at the connection tracking table on
middleman, we will see a single connection:
middleman# conntrack -L tcp 6 115 TIME_WAIT src=192.168.2.10 dst=192.168.3.10 sport=4000 dport=80 src=192.168.2.10 dst=192.168.3.10 sport=80 dport=4000 [ASSURED] mark=0 use=1
This happens because the 4-tuple for our two connections is identical. Conflating connections like this can cause traffic to stop flowing if both connections are active at the same time.
We need to provide the connection track subsystem with some additional information to uniquely identify these connections. We can do this by using the netfilter
CT module to assign each connection to a unique conntrack origination “zone”:
iptables -t raw -A PREROUTING -s 192.168.2.0/24 -i middleman-eth0 -j CT --zone-orig 100 iptables -t raw -A PREROUTING -s 192.168.2.0/24 -i middleman-eth1 -j CT --zone-orig 200
What is a “zone”? From the patch adding this feature:
A zone is simply a numerical identifier associated with a network device that is incorporated into the various hashes and used to distinguish entries in addition to the connection tuples.
With these rules in place, if we repeat the test with
curl we will see two distinct connections:
middleman# conntrack -L tcp 6 117 TIME_WAIT src=192.168.2.10 dst=192.168.3.10 sport=4000 dport=80 zone-orig=100 src=192.168.2.10 dst=192.168.3.10 sport=80 dport=26148 [ASSURED] mark=0 use=1 tcp 6 115 TIME_WAIT src=192.168.2.10 dst=192.168.3.10 sport=4000 dport=80 zone-orig=200 src=192.168.2.10 dst=192.168.3.10 sport=80 dport=4000 [ASSURED] mark=0 use=1
Repository and demo⌗
You can find a complete test environment in this repository; that includes the mininet topology I mentioned at the beginning of this post as well as shell scripts to implement all the address, route, and netfilter configurations.
And here’s a video that runs through the steps described in this post: