Bringing GRE to backbone and VPNs

GRE is a very common protocol that allows us to create a tunnel between two endpoints and encapsulate the packets going through that tunnel.

GRE is without doubt one of the first overlay solutions networks have seen.

GRE relies on IP, meaning that “where we have IP, we can have GRE”.

Think of a classic BGP VPN. Normally, between PEs we have an MPLS backbone transporting packets through LSPs (static, RSVP, LDP).

However, it might happen that our backbone cannot provide MPLS connectivity between two PEs. If so, a valid alternative is to replace LSPs with GRE tunnels.

By doing this, we no longer have MPLSoMPLS (external MPLS transport label and internal MPLS service label) but we move to MPLSoGRE (external GRE header and MPLS service label).

Let’s consider this lab topology:

Between PE1 and PE2 we have a GRE tunnel. This tunnel works as an LSP and will be used as the next hop for BGP-signaled VPN routes.

To make things more complex I added another tunnel: a GRE tunnel between the PE1 VRF and branch2. This tunnel directly connects the customer VRF on the SP router (PE1) to the customer branch router.

As a result, the backbone is traversed by GRE-over-MPLS-over-GRE packets. From the innermost header outwards:

  • customer GRE header (PE1 VRF to branch2)
  • MPLS VPN service label
  • backbone GRE header (PE1 to PE2)

Let’s start by configuring the backbone GRE tunnel, the one connecting PE1 to PE2.

On PE1, we enable tunnel services:

set chassis fpc 0 pic 0 tunnel-services bandwidth 1g

Next, we define a GRE tunnel towards PE2:

set interfaces gr-0/0/10 unit 4 tunnel source 1.1.1.1
set interfaces gr-0/0/10 unit 4 tunnel destination 4.4.4.4
set interfaces gr-0/0/10 unit 4 family inet
set interfaces gr-0/0/10 unit 4 family mpls

MPLS (family mpls) must be enabled as PE1 has to push the MPLS service label (VPN label).

Of course, make sure we have reachability to the tunnel endpoint:

root@pe1# run show route table inet.0 4.4.4.4 active-path

inet.0: 27 destinations, 27 routes (27 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

4.4.4.4/32         *[OSPF/10] 01:18:12, metric 2
                    >  to 192.168.13.1 via ge-0/0/2.0

The configuration on PE2 is mirrored (tunnel source 4.4.4.4, destination 1.1.1.1), so we omit it for now; we will look at it later.

In order to use that GRE tunnel for VPN routes, we need to add a static route into inet.3:

set routing-options rib inet.3 static route 4.4.4.4/32 next-hop gr-0/0/10.4

Basically, we tell Junos to use the GRE tunnel for VPN routes whose protocol next hop is 4.4.4.4 (the GRE tunnel endpoint, which is also PE2’s loopback).

As a result, a route to 4.4.4.4 is available within inet.3:

root@pe1# run show route 4.4.4.4

inet.0: 27 destinations, 27 routes (27 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

4.4.4.4/32         *[OSPF/10] 01:31:23, metric 2
                    >  to 192.168.13.1 via ge-0/0/2.0

inet.3: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

4.4.4.4/32         *[Static/5] 05:50:42
                    >  via gr-0/0/10.4

Now, let’s move to the other GRE tunnel whose endpoints are:

  • PE1 VRF
  • branch 2 router

Endpoint addresses are:

  • on PE1 50.50.50.50
  • on branch2 100.100.100.100

Let’s start with PE1.

On PE1 we have a VRF for a L3VPN. That VRF “sees” branch 1 on one side (branch1 is a CE) and branch2 on the other side (through the GRE tunnel).

The GRE tunnel endpoint address is configured on a loopback IFL assigned to the VRF:

set interfaces lo0 unit 0 family inet address 1.1.1.1/32
set interfaces lo0 unit 100 family inet address 50.50.50.50/32
set routing-instances l3vpn instance-type vrf
set routing-instances l3vpn interface ge-0/0/0.100
set routing-instances l3vpn interface lo0.100
set routing-instances l3vpn route-distinguisher 1.1.1.1:101
set routing-instances l3vpn vrf-import l3vpn-import
set routing-instances l3vpn vrf-export l3vpn-export
set routing-instances l3vpn vrf-table-label

Interface ge-0/0/0.100 connects PE1 to branch1 (CE). We have eBGP with branch 1:

set interfaces ge-0/0/0 flexible-vlan-tagging
set interfaces ge-0/0/0 encapsulation flexible-ethernet-services
set interfaces ge-0/0/0 unit 100 vlan-id 100
set interfaces ge-0/0/0 unit 100 family inet address 192.168.100.1/31
set routing-instances l3vpn protocols bgp group ce type external
set routing-instances l3vpn protocols bgp group ce peer-as 65001
set routing-instances l3vpn protocols bgp group ce neighbor 192.168.100.0
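
For completeness, branch1’s side of this eBGP session is not shown in the original lab. A minimal sketch (the exp-lan policy name, the lo0 unit and the exact interface options are my assumptions; the SP AS of 100 is deduced from branch2’s configuration later on) might look like:

set interfaces ge-0/0/0 vlan-tagging
set interfaces ge-0/0/0 unit 100 vlan-id 100
set interfaces ge-0/0/0 unit 100 family inet address 192.168.100.0/31
set interfaces lo0 unit 0 family inet address 10.1.1.1/32
set routing-options autonomous-system 65001
set protocols bgp group pe type external
set protocols bgp group pe export exp-lan
set protocols bgp group pe peer-as 100
set protocols bgp group pe neighbor 192.168.100.1
set policy-options policy-statement exp-lan term ok from route-filter 10.1.1.1/32 exact
set policy-options policy-statement exp-lan term ok then accept
set policy-options policy-statement exp-lan then reject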

Branch1 advertises the address of a user connected to its LAN:

root@pe1# run show route receive-protocol bgp 192.168.100.0 table l3vpn.inet

l3vpn.inet.0: 9 destinations, 10 routes (9 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 10.1.1.1/32             192.168.100.0                           65001 I

PE1 VRF – branch1 is done!

Now the GRE. We define a new GRE IFL and assign it to the VRF:

set interfaces gr-0/0/10 unit 100 tunnel source 50.50.50.50
set interfaces gr-0/0/10 unit 100 tunnel destination 100.100.100.100
set interfaces gr-0/0/10 unit 100 tunnel routing-instance destination l3vpn
set interfaces gr-0/0/10 unit 100 family inet address 100.64.1.1/31
set routing-instances l3vpn interface gr-0/0/10.100

Notice that we tell Junos the GRE endpoint (100.100.100.100) is reachable via the VRF itself (tunnel routing-instance destination l3vpn):

root@pe1# run show route table l3vpn.inet.0 100.100.100.100

l3vpn.inet.0: 9 destinations, 10 routes (9 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

100.100.100.100/32 *[BGP/170] 05:08:36, localpref 100, from 10.10.10.10
                      AS path: 65003 I, validation-state: unverified
                    >  via gr-0/0/10.4, Push 299856

Here we see an MPLSoGRE route. This is what we typically find with a BGP VPN relying on a GRE backbone. That route tells us that traffic towards 100.100.100.100 will first be encapsulated into an MPLS packet (label 299856) and then into a GRE packet (src: 1.1.1.1, dst: 4.4.4.4).

You may have noticed that, this time, the GRE IFL was assigned an IP address as well. This is because we are going to configure eBGP between the tunnel endpoints. On PE1 we configure BGP inside the VRF:

set routing-instances l3vpn protocols bgp group pe type external
set routing-instances l3vpn protocols bgp group pe peer-as 65003
set routing-instances l3vpn protocols bgp group pe neighbor 100.64.1.0

This BGP session is used to advertise branch1 LAN address (our end user) to branch2:

root@pe1# run show route advertising-protocol bgp 100.64.1.0

l3vpn.inet.0: 9 destinations, 10 routes (9 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 10.1.1.1/32             Self                                    65001 I

Here is the full VRF config:

set routing-instances l3vpn instance-type vrf
set routing-instances l3vpn interface ge-0/0/0.100
set routing-instances l3vpn interface gr-0/0/10.100
set routing-instances l3vpn interface lo0.100
set routing-instances l3vpn route-distinguisher 1.1.1.1:101
set routing-instances l3vpn vrf-import l3vpn-import
set routing-instances l3vpn vrf-export l3vpn-export
set routing-instances l3vpn vrf-table-label
set routing-instances l3vpn protocols bgp group ce type external
set routing-instances l3vpn protocols bgp group ce peer-as 65001
set routing-instances l3vpn protocols bgp group ce neighbor 192.168.100.0
set routing-instances l3vpn protocols bgp group pe type external
set routing-instances l3vpn protocols bgp group pe peer-as 65003
set routing-instances l3vpn protocols bgp group pe neighbor 100.64.1.0

Let’s have a look at policies:

set policy-options policy-statement l3vpn-export term ok from interface lo0.100
set policy-options policy-statement l3vpn-export term ok then community set l3vpn
set policy-options policy-statement l3vpn-export term ok then accept
set policy-options policy-statement l3vpn-export then reject
set policy-options policy-statement l3vpn-import term ok from protocol bgp
set policy-options policy-statement l3vpn-import term ok from community l3vpn
set policy-options policy-statement l3vpn-import term ok then accept
set policy-options policy-statement l3vpn-import then reject
set policy-options community l3vpn members target:100:1

We advertise lo0.100 (the PE1-branch2 GRE endpoint). This way we allow PE2 to learn about 50.50.50.50. PE2 needs it as branch2 will send GRE packets destined to 50.50.50.50, and those packets arrive at PE2 first. PE2 will take each GRE packet, push an MPLS service label and encapsulate it into another GRE packet (the PE-PE tunnel).

PE1 should be fine.

It is worth checking PE2 as well. Its VRF configuration is lighter:

set routing-instances l3vpn-gre instance-type vrf
set routing-instances l3vpn-gre interface ge-0/0/0.100
set routing-instances l3vpn-gre route-distinguisher 4.4.4.4:101
set routing-instances l3vpn-gre vrf-import l3vpn-import
set routing-instances l3vpn-gre vrf-export l3vpn-export
set routing-instances l3vpn-gre protocols bgp group ce type external
set routing-instances l3vpn-gre protocols bgp group ce peer-as 65003
set routing-instances l3vpn-gre protocols bgp group ce neighbor 192.168.100.0
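
The l3vpn-import and l3vpn-export policies are not shown for PE2. A plausible counterpart, assuming the same target:100:1 community and that PE2 mainly has to re-export the 100.100.100.100/32 route it learns from branch2 via eBGP, could be:

set policy-options policy-statement l3vpn-export term ok from protocol bgp
set policy-options policy-statement l3vpn-export term ok then community add l3vpn
set policy-options policy-statement l3vpn-export term ok then accept
set policy-options policy-statement l3vpn-export then reject
set policy-options policy-statement l3vpn-import term ok from protocol bgp
set policy-options policy-statement l3vpn-import term ok from community l3vpn
set policy-options policy-statement l3vpn-import term ok then accept
set policy-options policy-statement l3vpn-import then reject
set policy-options community l3vpn members target:100:1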

There is a BGP session with branch2:

root@pe4# run show route receive-protocol bgp 192.168.100.0 table l3vpn-gre.inet

l3vpn-gre.inet.0: 6 destinations, 6 routes (4 active, 0 holddown, 2 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 100.100.100.100/32      192.168.100.0                           65003 I

PE2 receives the GRE endpoint address. This way, the PE2 VRF is able to reach both 50.50.50.50 (via the MPLSoGRE backbone) and 100.100.100.100 (via the CE-facing interface).

As on PE1, the PE-PE GRE tunnel endpoint is reachable via inet.0 and the tunnel is referenced as the next hop of an inet.3 static route:

root@pe4# show interfaces gr-0/0/10 | display set
set interfaces gr-0/0/10 unit 1 tunnel source 4.4.4.4
set interfaces gr-0/0/10 unit 1 tunnel destination 1.1.1.1
set interfaces gr-0/0/10 unit 1 family inet
set interfaces gr-0/0/10 unit 1 family mpls

[edit]
root@pe4# show routing-options | display set
set routing-options rib inet.3 static route 1.1.1.1/32 next-hop gr-0/0/10.1

[edit]
root@pe4# run show route 1.1.1.1

inet.0: 26 destinations, 26 routes (26 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

1.1.1.1/32         *[OSPF/10] 01:52:59, metric 2
                    >  to 192.168.36.0 via ge-0/0/1.0

inet.3: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

1.1.1.1/32         *[Static/5] 06:11:22
                    >  via gr-0/0/10.1

Let’s move to branch2!

Again, we enable GRE tunneling:

set chassis fpc 0 pic 0 tunnel-services bandwidth 1g

And we create the GRE tunnel:

set interfaces gr-0/0/10 unit 0 tunnel source 100.100.100.100
set interfaces gr-0/0/10 unit 0 tunnel destination 50.50.50.50
set interfaces gr-0/0/10 unit 0 tunnel routing-instance destination l3vpn-gre
set interfaces gr-0/0/10 unit 0 family inet address 100.64.1.0/31
set interfaces lo0 unit 100 family inet address 100.100.100.100/32

For lab reasons, on branch2, I isolated this use-case (GRE backbone + PE-CE GRE tunnel) into a virtual router:

set routing-instances l3vpn-gre instance-type virtual-router
set routing-instances l3vpn-gre interface ge-0/0/0.100
set routing-instances l3vpn-gre interface gr-0/0/10.0
set routing-instances l3vpn-gre interface lo0.100
set routing-instances l3vpn-gre protocols bgp group pe type external
set routing-instances l3vpn-gre protocols bgp group pe export l3vpn-exp-bgp
set routing-instances l3vpn-gre protocols bgp group pe peer-as 100
set routing-instances l3vpn-gre protocols bgp group pe neighbor 192.168.100.1
set routing-instances l3vpn-gre protocols bgp group gre type external
set routing-instances l3vpn-gre protocols bgp group gre export exp-gre
set routing-instances l3vpn-gre protocols bgp group gre peer-as 100
set routing-instances l3vpn-gre protocols bgp group gre neighbor 100.64.1.1

Within that VR we have both the IFL towards PE2 (ge-0/0/0.100) and the GRE interface.

Then, we have 2 BGP sessions:

  • one with PE2 to send 100.100.100.100 and receive 50.50.50.50 (to establish the PE-CE GRE tunnel)
  • one with PE1 through the GRE tunnel to send/receive the branch LAN addresses

LAN addresses are:

  • branch1: 10.1.1.1
  • branch2: 10.3.3.3

For lab reasons, the branch2 LAN address is configured on the same loopback interface used as the endpoint for the PE-CE tunnel:

root@ce3# show interfaces lo0
unit 100 {
    family inet {
        address 10.3.3.3/32;
        address 100.100.100.100/32;
    }
}
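
The two export policies referenced in the branch2 BGP groups (l3vpn-exp-bgp towards PE2 and exp-gre towards PE1 over the tunnel) are not shown either; a minimal sketch consistent with the addresses above could be:

set policy-options policy-statement l3vpn-exp-bgp term ok from route-filter 100.100.100.100/32 exact
set policy-options policy-statement l3vpn-exp-bgp term ok then accept
set policy-options policy-statement l3vpn-exp-bgp then reject
set policy-options policy-statement exp-gre term ok from route-filter 10.3.3.3/32 exact
set policy-options policy-statement exp-gre term ok then accept
set policy-options policy-statement exp-gre then reject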

Let’s check that everything is in place:

root@ce3# run show route advertising-protocol bgp 192.168.100.1

l3vpn-gre.inet.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 100.100.100.100/32      Self                                    I

[edit]
root@ce3# run show route receive-protocol bgp 192.168.100.1

inet.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)

l3vpn-gre.inet.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 50.50.50.50/32          192.168.100.1                           100 I

[edit]
root@ce3# run show route advertising-protocol bgp 100.64.1.1

l3vpn-gre.inet.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 10.3.3.3/32             Self                                    I

[edit]
root@ce3# run show route receive-protocol bgp 100.64.1.1

inet.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)

l3vpn-gre.inet.0: 9 destinations, 9 routes (9 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
* 10.1.1.1/32             100.64.1.1                              100 65001 I

All the routes are there!

Last, we verify end-to-end connectivity:

root@ce3# run ping routing-instance l3vpn-gre source 10.3.3.3 10.1.1.1 size 800 count 13 rapid
PING 10.1.1.1 (10.1.1.1): 800 data bytes
!!!!!!!!!!!!!
--- 10.1.1.1 ping statistics ---
13 packets transmitted, 13 packets received, 0% packet loss
round-trip min/avg/max/stddev = 3.886/6.466/26.326/5.850 ms

It works!

Let’s sum up what we have built:

  • end to end IP connectivity
  • PE-PE GRE tunnel instead of MPLS LSPs
  • MPLS BGP L3VPN relying on that GRE backbone tunnel
  • PE-CE GRE tunnel
  • the PE-PE tunnel encapsulates the PE-CE GRE tunnel and adds an MPLS VPN service label

not bad 🙂

How can we be sure all the headers were added correctly? For now, trust me…next time, we’ll see how!

Ciao
IoSonoUmberto

There’s always routing behind an overlay

If you have had the chance to look at SDWAN and its foundations, you have probably heard the word “overlay” repeated over and over.

Vendors try to make their products flexible by offering the capability to provision different overlays: full mesh, hub and spoke, to-hub-only, BYOD and so on…

From a user perspective, those might be just buttons to click while creating a SDWAN network via a wizard GUI.

However, most of the time, behind those names, well-known protocols make them possible. To name one: BGP VPNs!

I’ve experienced the same when I worked on Contrail. SDN controller, network virtualization, service chaining…everything seemed so “futuristic” but, at the end of the day, the key idea behind it was to bring BGP VPNs inside the DC with an IP transport instead of a MPLS one.

Here, it’s the same story! Let’s forget about the data plane; GRE, IPSEC, VXLAN… that’s just how you choose to build your tunnels. Instead, think about creating a full-mesh versus a hub-only topology. You need nothing new to do that. It is, once again, just BGP VPNs 🙂

That said, I’ve tried to imagine how SDWAN products might leverage BGP VPNs to build overlays.

I’ve put myself in the shoes of an enterprise needing different overlays for different use-cases. I’ve thought of four possible ones:

  • branches and HQ need to exchange information about product availability at different locations. Those interactions are not human; it is a machine-to-machine dialogue. Here, we expect a lot of traffic between all the sites and there is no need for any centralized control of traffic. A full mesh topology might be the right pick here
  • then we have employee devices that can talk to each other and access resources on different sites. For security reasons, traffic has to go through a centralized firewall. Here a typical hub and spoke topology could fit
  • next, some systems on branch sites have to access data from systems within the HQ. In this case, there is no communication between branches. Here, we can opt for a to-hub-only topology
  • last, guests might connect to the corporate network from a branch. Those guests might be given internet access but they should not have access to company resources and should not be able to talk to other branches. Here, we will go with what I call a “byod” topology

Let’s start looking at them.

First, this is the lab topology I have built to run my tests:

There are 2 spokes (branches), one hub (HQ), a P router (the network connecting spokes and hub) and a route reflector for routing updates.

The following loopback addresses identify the devices:

  • spoke1: 1.1.1.1
  • spoke2: 1.1.1.2
  • hub: 1.1.1.10
  • rr: 10.10.10.10

Here, I’m not interested in the multiple-WAN-networks aspect of a SDWAN solution. My goal here is to show how overlay topologies are nothing more than classic BGP VPNs.

Before jumping to topologies, let’s see the base configuration of those devices.

Spokes and hub have a similar configuration. Here, I show snippets from spoke1, but you can easily deduce the spoke2/hub configs.

We have a loopback interface and a physical interface:

set interfaces lo0 unit 0 family inet address 1.1.1.1/32
set interfaces ge-0/0/0 unit 0 family inet address 172.30.1.0/31

OSPF is the IGP:

set protocols ospf area 0.0.0.0 interface ge-0/0/0.0 interface-type p2p
set protocols ospf area 0.0.0.0 interface lo0.0 passive

We build an iBGP session to the RR:

set routing-options autonomous-system 100
set protocols bgp group rr type internal
set protocols bgp group rr local-address 1.1.1.1
set protocols bgp group rr family inet-vpn unicast
set protocols bgp group rr neighbor 10.10.10.10

The data plane will be GRE, so we build tunnels towards the other spoke and the hub:

set chassis fpc 0 pic 0 tunnel-services bandwidth 1g
###to spoke2
set interfaces gr-0/0/10 unit 2 tunnel source 1.1.1.1
set interfaces gr-0/0/10 unit 2 tunnel destination 1.1.1.2
set interfaces gr-0/0/10 unit 2 family inet address 100.64.12.0/31
###to hub
set interfaces gr-0/0/10 unit 10 tunnel source 1.1.1.1
set interfaces gr-0/0/10 unit 10 tunnel destination 1.1.1.10
set interfaces gr-0/0/10 unit 10 family inet address 100.64.101.0/31

Next, we add inet.3 routes via the GRE tunnels to resolve inet-vpn BGP next hops:

set routing-options rib inet.3 static route 1.1.1.2/32 next-hop gr-0/0/10.2
set routing-options rib inet.3 static route 1.1.1.10/32 next-hop gr-0/0/10.10

I omit the P config as it is just interface configuration plus OSPF.

The route reflector configuration, instead, is the following:

set interfaces ge-0/0/0 unit 0 family inet address 172.30.6.0/31
set interfaces lo0 unit 0 family inet address 10.10.10.10/32
set routing-options rib inet.3 static route 0.0.0.0/0 discard
set routing-options autonomous-system 100
set protocols bgp group rr type internal
set protocols bgp group rr local-address 10.10.10.10
set protocols bgp group rr family inet-vpn unicast
set protocols bgp group rr cluster 0.0.0.10
set protocols bgp group rr neighbor 1.1.1.1
set protocols bgp group rr neighbor 1.1.1.2
set protocols bgp group rr neighbor 1.1.1.10
set protocols ospf area 0.0.0.0 interface ge-0/0/0.0 interface-type p2p
set protocols ospf area 0.0.0.0 interface lo0.0 passive

Note the inet.3 static discard route on the RR: since the RR has no tunnels of its own, that route lets it consider inet-vpn protocol next hops resolvable and reflect the routes. The underlying infrastructure is ready.

Let’s start with full mesh:

Here, as the name says, we have a full mesh of tunnels.

This is the easiest use-case as we are dealing with a standard BGP VPN.

We simulate systems belonging to a site using loopback IFLs:

  • spoke1: 192.168.1.1
  • spoke2: 192.168.1.2
  • hub: 192.168.1.10

Think of 192.168.1.1 as a user in site 1, connected to the full mesh topology.

On a spoke, we configure a VRF with the appropriate policies:

set policy-options policy-statement full-exp term ok from protocol direct
set policy-options policy-statement full-exp term ok from route-filter 192.168.0.0/16 orlonger
set policy-options policy-statement full-exp term ok then community add full-vpn
set policy-options policy-statement full-exp term ok then accept
set policy-options policy-statement full-exp term ko then reject
set policy-options policy-statement full-imp term ok from protocol bgp
set policy-options policy-statement full-imp term ok from community full-vpn
set policy-options policy-statement full-imp term ok then accept
set policy-options policy-statement full-imp term ko then reject
set policy-options community full-vpn members target:100:1
set routing-instances full instance-type vrf
set routing-instances full interface lo0.1
set routing-instances full route-distinguisher 1.1.1.1:1
set routing-instances full vrf-import full-imp
set routing-instances full vrf-export full-exp
set routing-instances full vrf-table-label
set interfaces lo0 unit 1 family inet address 192.168.1.1/32

Hub configuration is identical, apart from the obviously different parameters (route distinguisher, lo0.1 address, …), as sketched below.
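
For reference, a possible hub-side snippet (the route distinguisher and the lo0.1 address are assumptions based on the topology; policies and community are the same as on the spokes):

set interfaces lo0 unit 1 family inet address 192.168.1.10/32
set routing-instances full instance-type vrf
set routing-instances full interface lo0.1
set routing-instances full route-distinguisher 1.1.1.10:1
set routing-instances full vrf-import full-imp
set routing-instances full vrf-export full-exp
set routing-instances full vrf-table-label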

As a result, a spoke can reach any site through a direct tunnel to that site:

root@s1> show route table full.inet.0

full.inet.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.1.1/32     *[Direct/0] 1d 04:50:05
                    > via lo0.1
192.168.1.2/32     *[BGP/170] 09:35:01, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.2, Push 17
192.168.1.10/32    *[BGP/170] 09:35:22, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.10, Push 16

root@s1> traceroute routing-instance full source 192.168.1.1 192.168.1.2 no-resolve
traceroute to 192.168.1.2 (192.168.1.2) from 192.168.1.1, 30 hops max, 52 byte packets
 1  192.168.1.2  7.909 ms  5.204 ms  3.697 ms

root@s1> traceroute routing-instance full source 192.168.1.1 192.168.1.10 no-resolve
traceroute to 192.168.1.10 (192.168.1.10) from 192.168.1.1, 30 hops max, 52 byte packets
 1  192.168.1.10  78.748 ms  142.098 ms  96.770 ms

Very easy.

Let’s move to a classic hub and spoke topology:

Here, all the communications take place through the hub. If two spoke sites want to talk to each other, they need to go through the hub first.

Again, we use loopbacks to emulate users connected to this topology:

  • spoke1: 192.168.2.1
  • spoke2: 192.168.2.2
  • hub: 192.168.2.10

Let’s start with the hub configuration:

set routing-instances central instance-type vrf
set routing-instances central interface lo0.2
set routing-instances central route-distinguisher 1.1.1.10:2
set routing-instances central vrf-import central-imp
set routing-instances central vrf-export central-exp
set routing-instances central vrf-table-label
set routing-instances central routing-options static route 0.0.0.0/0 discard
set policy-options policy-statement central-exp term def from protocol static
set policy-options policy-statement central-exp term def from route-filter 0.0.0.0/0 exact
set policy-options policy-statement central-exp term def then community add central-vpn
set policy-options policy-statement central-exp term def then accept
set policy-options policy-statement central-exp term ko then reject
set policy-options policy-statement central-imp term ok from protocol bgp
set policy-options policy-statement central-imp term ok from community central-vpn
set policy-options policy-statement central-imp term ok then accept
set policy-options policy-statement central-imp term ko then reject

Simply put:

  • the hub imports spoke routes
  • the hub exports a 0/0 so as to attract all traffic to itself
  • vrf-table-label allows an IP route lookup inside the VRF after the service label is popped

Let’s move to the spoke:

set routing-instances central instance-type vrf
set routing-instances central interface lo0.2
set routing-instances central route-distinguisher 1.1.1.1:2
set routing-instances central vrf-import central-imp
set routing-instances central vrf-export central-exp
set routing-instances central vrf-table-label
set policy-options policy-statement central-exp term ok from protocol direct
set policy-options policy-statement central-exp term ok from route-filter 192.168.0.0/16 orlonger
set policy-options policy-statement central-exp term ok then community add central-vpn
set policy-options policy-statement central-exp term ok then community add central-spoke
set policy-options policy-statement central-exp term ok then accept
set policy-options policy-statement central-exp term ko then reject
set policy-options policy-statement central-imp term rem-spoke from protocol bgp
set policy-options policy-statement central-imp term rem-spoke from community central-spoke
set policy-options policy-statement central-imp term rem-spoke then reject
set policy-options policy-statement central-imp term ok from protocol bgp
set policy-options policy-statement central-imp term ok from community central-vpn
set policy-options policy-statement central-imp term ok then accept
set policy-options policy-statement central-imp term ko then reject

The VRF policies do the trick. Unlike the full mesh scenario, here, spokes append a second community when exporting routes to the RR.

That same community is used to match incoming routes from the RR. If matched, the route is rejected. This way we discard any remote-spoke route.
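
The central-vpn and central-spoke community definitions are not shown in the original; the actual values are arbitrary, for example:

set policy-options community central-vpn members target:100:2
set policy-options community central-spoke members target:100:102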

As a result, the spoke routing table only has a 0/0 towards the hub:

root@s1> show route table central.inet.0

central.inet.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[BGP/170] 09:49:10, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.10, Push 18
192.168.2.1/32     *[Direct/0] 1d 04:48:19
                    > via lo0.2

root@s1> traceroute no-resolve routing-instance central source 192.168.2.1 192.168.2.10
traceroute to 192.168.2.10 (192.168.2.10) from 192.168.2.1, 30 hops max, 52 byte packets
 1  192.168.2.10  268.940 ms  207.031 ms  126.407 ms

root@s1> traceroute no-resolve routing-instance central source 192.168.2.1 192.168.2.2
traceroute to 192.168.2.2 (192.168.2.2) from 192.168.2.1, 30 hops max, 52 byte packets
 1  * * *
 2  192.168.2.2  349.741 ms  273.855 ms  343.618 ms

As you can see, now, to reach a destination on a remote spoke, we have an additional hop…the hub.

Of course, the hub has routes to every spoke.

The third use-case comes almost for free.

Again, we use loopbacks to emulate users:

  • spoke1: 192.168.3.1
  • spoke2: 192.168.3.2
  • hub: 192.168.3.10

To recall the use-case: here we simply want spokes to reach, and only reach, endpoints at the hub location.

On the hub:

set routing-instances hq instance-type vrf
set routing-instances hq interface lo0.3
set routing-instances hq route-distinguisher 1.1.1.10:3
set routing-instances hq vrf-import hq-imp
set routing-instances hq vrf-export hq-exp
set routing-instances hq vrf-table-label
set policy-options policy-statement hq-exp term ok from protocol direct
set policy-options policy-statement hq-exp term ok from route-filter 192.168.0.0/16 orlonger
set policy-options policy-statement hq-exp term ok then community add hq-vpn
set policy-options policy-statement hq-exp term ok then community add hq-hub
set policy-options policy-statement hq-exp term ok then accept
set policy-options policy-statement hq-exp term ko then reject
set policy-options policy-statement hq-imp term ok from protocol bgp
set policy-options policy-statement hq-imp term ok from community hq-vpn
set policy-options policy-statement hq-imp term ok then accept
set policy-options policy-statement hq-imp term ko then reject

The hub appends a second community when exporting its local routes, to say “this is a hub route”.

On spokes:

set routing-instances hq instance-type vrf
set routing-instances hq interface lo0.3
set routing-instances hq route-distinguisher 1.1.1.1:3
set routing-instances hq vrf-import hq-imp
set routing-instances hq vrf-export hq-exp
set routing-instances hq vrf-table-label
set policy-options policy-statement hq-exp term ok from protocol direct
set policy-options policy-statement hq-exp term ok from route-filter 192.168.0.0/16 orlonger
set policy-options policy-statement hq-exp term ok then community add hq-vpn
set policy-options policy-statement hq-exp term ok then accept
set policy-options policy-statement hq-exp term ko then reject
set policy-options policy-statement hq-imp term ok from protocol bgp
set policy-options policy-statement hq-imp term ok from community hq-hub
set policy-options policy-statement hq-imp term ok then accept
set policy-options policy-statement hq-imp term ko then reject

Spokes only import routes carrying the community that says “this is a hub route”.

Alternatively, we might have followed the same approach as the previous use-case: have the spokes append a second community and discard those routes on remote spokes, as sketched below.
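
A sketch of that alternative on the spokes (hq-spoke is a hypothetical extra community; the hq-vpn and hq-hub members are likewise not shown in the original; note the rem-spoke term must be evaluated before the accepting term, hence the final insert command):

set policy-options community hq-spoke members target:100:103
set policy-options policy-statement hq-exp term ok then community add hq-spoke
set policy-options policy-statement hq-imp term rem-spoke from protocol bgp
set policy-options policy-statement hq-imp term rem-spoke from community hq-spoke
set policy-options policy-statement hq-imp term rem-spoke then reject
set policy-options policy-statement hq-imp term ok from community hq-vpn
insert policy-options policy-statement hq-imp term rem-spoke before term ok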

Looking at the routing table, the spoke can only reach hub routes:

root@s1> show route table hq.inet.0

hq.inet.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.3.1/32     *[Direct/0] 1d 01:36:03
                    > via lo0.3
192.168.3.10/32    *[BGP/170] 09:55:50, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.10, Push 19

Of course, the hub has routes to every spoke.

Finally, let’s look at the last use-case. This one is interesting. We want to create an overlay network allowing guests on branch sites to access the internet but NOT company resources. Internet access is available at the hub, so spokes must reach the hub first.

Let’s start from the spoke:

set routing-instances byod instance-type vrf
set routing-instances byod interface lo0.4
set routing-instances byod route-distinguisher 1.1.1.1:4
set routing-instances byod vrf-import byod-imp
set routing-instances byod vrf-export byod-exp
set routing-instances byod vrf-table-label
set policy-options policy-statement byod-exp term ok from protocol direct
set policy-options policy-statement byod-exp term ok from route-filter 192.168.0.0/16 orlonger
set policy-options policy-statement byod-exp term ok then community add byod-vpn
set policy-options policy-statement byod-exp term ok then accept
set policy-options policy-statement byod-exp term ko then reject
set policy-options policy-statement byod-imp term ok from protocol bgp
set policy-options policy-statement byod-imp term ok from community byod-vpn
set policy-options policy-statement byod-imp term ok from route-filter 0.0.0.0/0 exact
set policy-options policy-statement byod-imp term ok then accept
set policy-options policy-statement byod-imp term ko then reject

  • the spoke exports its local routes (192.168.4.1 for spoke1 and 192.168.4.2 for spoke2)
  • the spoke imports the 0/0 from the hub

Now, let’s move to the hub:

set routing-instances byod instance-type vrf
set routing-instances byod interface lt-0/0/10.1
set routing-instances byod route-distinguisher 1.1.1.10:4
set routing-instances byod vrf-import byod-imp
set routing-instances byod vrf-export byod-exp
set routing-instances byod vrf-table-label
set policy-options policy-statement byod-exp term ok from protocol bgp
set policy-options policy-statement byod-exp term ok from route-filter 0.0.0.0/0 exact
set policy-options policy-statement byod-exp term ok then community add byod-vpn
set policy-options policy-statement byod-exp term ok then accept
set policy-options policy-statement byod-exp term ko then reject
set policy-options policy-statement byod-imp term ok from protocol bgp
set policy-options policy-statement byod-imp term ok from community byod-vpn
set policy-options policy-statement byod-imp term ok then accept
set policy-options policy-statement byod-imp term ko then reject

  • the hub imports spoke routes
  • the hub exports a 0/0

Look at how the 0/0 is exported: from protocol bgp. Yes, here I imagined the hub peering with another device providing internet access. This device might be a firewall inspecting traffic and performing source NAT.

In my lab, I simulated the “NAT device” with a virtual router configured on the same hub device. The byod VRF and the “NAT device” VR talk to each other through a logical tunnel (lt) interface pair. I will not show this part in detail.
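
For the curious, a hypothetical sketch of what that logical-tunnel pair and the “NAT device” VR might look like (the instance name internet, the unit numbers, the encapsulation and the exp-default policy name are my assumptions; the addressing and AS numbers match the configuration and outputs below; how the VR actually gets internet access, originates the 0/0 and performs the NAT is out of scope):

set interfaces lt-0/0/10 unit 1 encapsulation ethernet
set interfaces lt-0/0/10 unit 1 peer-unit 2
set interfaces lt-0/0/10 unit 1 family inet address 172.30.100.0/31
set interfaces lt-0/0/10 unit 2 encapsulation ethernet
set interfaces lt-0/0/10 unit 2 peer-unit 1
set interfaces lt-0/0/10 unit 2 family inet address 172.30.100.1/31
set routing-instances internet instance-type virtual-router
set routing-instances internet interface lt-0/0/10.2
set routing-instances internet routing-options autonomous-system 200
set routing-instances internet protocols bgp group byod type external
set routing-instances internet protocols bgp group byod export exp-default
set routing-instances internet protocols bgp group byod peer-as 100
set routing-instances internet protocols bgp group byod neighbor 172.30.100.0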

However, we do need to add a BGP session from the byod VRF towards the “NAT device”:

set routing-instances byod protocols bgp group internet type external
set routing-instances byod protocols bgp group internet export exp-ent-lans
set routing-instances byod protocols bgp group internet peer-as 200
set routing-instances byod protocols bgp group internet neighbor 172.30.100.1
set policy-options policy-statement exp-ent-lans term ok from protocol bgp
set policy-options policy-statement exp-ent-lans term ok from community byod-vpn
set policy-options policy-statement exp-ent-lans term ok then accept
set policy-options policy-statement exp-ent-lans term ko then reject

Through that session, we receive the default route and we advertise spoke routes.

As a result, the byod VRF on the hub can reach both the internet and the spokes:

root@h# run show route table byod.inet.0

byod.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[BGP/170] 23:53:36, localpref 100
                      AS path: 200 I, validation-state: unverified
                    > to 172.30.100.1 via lt-0/0/10.1
172.30.100.0/31    *[Direct/0] 1d 00:21:01
                    > via lt-0/0/10.1
172.30.100.0/32    *[Local/0] 1d 00:21:01
                      Local via lt-0/0/10.1
192.168.4.1/32     *[BGP/170] 10:04:55, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.1, Push 20
192.168.4.2/32     *[BGP/170] 10:04:35, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.2, Push 19

That’s not OK, as this configuration might allow guests on different branches to talk to each other: they follow the 0/0 towards the hub and, from the byod VRF on the hub, they can reach a remote spoke.

To avoid this, we configure a forwarding-table policy discarding traffic towards spoke routes going through the byod VRF:

set policy-options policy-statement byod-discard term no-intra from protocol bgp
set policy-options policy-statement byod-discard term no-intra from rib byod.inet.0
set policy-options policy-statement byod-discard term no-intra from community byod-vpn
set policy-options policy-statement byod-discard term no-intra then next-hop discard
set policy-options policy-statement byod-discard term no-intra then accept
set routing-options forwarding-table export byod-discard

This results in:

byod.inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[BGP/170] 23:57:08, localpref 100
                      AS path: 200 I, validation-state: unverified
                    > to 172.30.100.1 via lt-0/0/10.1
172.30.100.0/31    *[Direct/0] 1d 00:24:33
                    > via lt-0/0/10.1
172.30.100.0/32    *[Local/0] 1d 00:24:33
                      Local via lt-0/0/10.1
192.168.4.1/32     *[BGP/170] 10:08:27, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.1, Push 20
192.168.4.2/32     *[BGP/170] 10:08:07, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.2, Push 19

[edit]
root@h# run show route forwarding-table table byod family inet destination 192.168.4.1/32
Routing table: byod.inet
Internet:
Enabled protocols: Bridging, All VLANs,
Destination        Type RtRef Next hop           Type Index    NhRef Netif
192.168.4.1/32     user     0                    dscd      668     3

[edit]
root@h# run show route forwarding-table table byod family inet destination 192.168.4.2/32
Routing table: byod.inet
Internet:
Enabled protocols: Bridging, All VLANs,
Destination        Type RtRef Next hop           Type Index    NhRef Netif
192.168.4.2/32     user     0                    dscd      668     3

The next hop is a valid one in the RIB but a discard one in the FIB…and the FIB wins!

Is this enough? Nope! Leaving things like this works in the upstream direction but leads to traffic being discarded in the downstream direction (return traffic).

To overcome this, I create a second VRF which only imports spoke routes:

set routing-instances back-internet instance-type vrf
set routing-instances back-internet route-distinguisher 1.1.1.10:1004
set routing-instances back-internet vrf-import byod-imp
set routing-instances back-internet vrf-export null
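
The null export policy is not shown; presumably it simply rejects everything, so that the back-internet VRF advertises nothing:

set policy-options policy-statement null then reject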

root@h# run show route table back-internet.inet.0

back-internet.inet.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.4.1/32     *[BGP/170] 10:11:46, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.1, Push 20
192.168.4.2/32     *[BGP/170] 10:11:26, localpref 100, from 10.10.10.10
                      AS path: I, validation-state: unverified
                    > via gr-0/0/10.2, Push 19

Last, we need to divert traffic coming back from the NAT device towards this VRF. We achieve this with a firewall filter:

set firewall family inet filter return term bgp from source-address 172.30.100.1/32
set firewall family inet filter return term bgp then accept
set firewall family inet filter return term back-internet then count back-internet
set firewall family inet filter return term back-internet then routing-instance back-internet
set interfaces lt-0/0/10 unit 1 family inet filter input return

The filter accepts traffic sourced from the NAT device p2p IP. This is done to preserve the BGP session.

Anything else is sent to the back-internet VRF. Of course, the filter can be improved to manage other types of traffic (not only interface-based eBGP).

Let’s sum up how traffic works.

Upstream:

  • spoke matches 0/0 in byod vrf
  • traffic sent to hub via gre tunnel
  • hub performs a lookup inside byod vrf and sends traffic to NAT device (internet)

If a spoke sends traffic destined to another spoke, there will be a match inside the hub byod VRF but traffic will be discarded at the FIB level.

Downstream:

  • NAT device sends traffic to hub
  • hub has an input firewall filter applied on the interface connecting it to NAT device
  • unless it is eBGP traffic with NAT device, traffic is sent to back-internet vrf
  • hub sends traffic to spoke from back-internet vrf via a gre tunnel

Let’s verify it:

root@s1> ping rapid count 7 routing-instance byod source 192.168.4.1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
!!!!!!!
--- 8.8.8.8 ping statistics ---
7 packets transmitted, 7 packets received, 0% packet loss
round-trip min/avg/max/stddev = 53.037/84.863/104.847/16.420 ms

root@s1> ping rapid count 7 routing-instance byod source 192.168.4.1 192.168.4.2
PING 192.168.4.2 (192.168.4.2): 56 data bytes
.......
--- 192.168.4.2 ping statistics ---
7 packets transmitted, 0 packets received, 100% packet loss

That’s it!

We implemented 4 different potential SDWAN topologies.

Sure, we did not have multiple WAN networks or security features…but, from a topology build-up perspective, that is irrelevant.

What matters here is that to build all those topologies we only needed to play with BGP VPNs 🙂

Personally, I find this great…like I did with Contrail. Building new solutions but relying on proven-to-work protocols! You get something new, you know it will work and it will be easier to integrate it with your current network. A win for everyone, right?

Ciao
IoSonoUmberto

Running iBGP sessions inside IPSEC tunnels with SRXs

To secure traffic between SRX devices we will likely build site-to-site route-based IPSEC VPNs.

Once up and running, an IPSEC tunnel is nothing more than a tube carrying traffic. This means we can have a BGP session running inside the tunnel.

It is useful to look at this use-case as it represents one of the many building blocks of the SD-WAN solution. Here, we do not aim at replicating the same exact scenario we have with SD-WAN. The goal is to provide an example showing how an overlay IPSEC tunnel can transport bgp packets so that the bgp session will be totally transparent to the underlay.

Let’s consider the following image:

We have two vSRXs that will each establish an IPSEC tunnel with a third vSRX called oam. Inside those tunnels we will run the BGP sessions.

From a bgp perspective, vsrx oam will act as route reflector for vpn routes while vsrxs east and west will be the clients.

As you can see, the vSRXs can talk to each other through a service provider network. To us, that network is transparent. The SP network might provide an IP transport, an MPLS transport or something else. However, that is not relevant from a vSRX perspective. The only thing we care about here is that the SP network allows each SRX to reach the interfaces acting as IPSEC tunnel endpoints on the other vSRXs (the ge* interfaces in the image above).

From a BGP perspective, vSRXs east and west are directly connected to vSRX oam via point-to-point links. These p2p links are logical links; actually, they are IPSEC tunnels. This is why we say IPSEC tunnels form the overlay network, built upon the underlay one (the SP network). The SP network provides connectivity and allows the overlay to be built; the overlay brings the services.

All the vsrxs have a 0/0 route pointing to the SP network. From there, to us, it is a blackbox. We assume the SP network grants us all the communications we need.

Let’s start building the baseline configuration.

We look at vSRX east. The configuration for the other vSRXs can be deduced easily.

First, we configure the core-facing interface:

set interfaces ge-0/0/1 unit 0 family inet address 173.30.2.0/31

We place that interface into a zone:

set security zones security-zone bb host-inbound-traffic system-services all
set security zones security-zone bb host-inbound-traffic protocols all
set security zones security-zone bb interfaces ge-0/0/1.0

We also configure the loopback (the iBGP session will be loopback-based):

set interfaces lo0 unit 0 family inet address 1.1.1.1/32

Next, we build IKE proposal and policy:

set security ike proposal ike-prop authentication-method pre-shared-keys
set security ike proposal ike-prop dh-group group14
set security ike proposal ike-prop authentication-algorithm sha-384
set security ike proposal ike-prop encryption-algorithm aes-256-cbc
set security ike policy ike-pol mode main
set security ike policy ike-pol proposals ike-prop
set security ike policy ike-pol pre-shared-key ascii-text "$9$T3CuREyKvLRheW8Xws5QFntOMWxwYohS7V"

There, we simply configured algorithms and keys.

Next, we create the ike gateway:

set security ike gateway oam-gw ike-policy ike-pol
set security ike gateway oam-gw address 173.30.6.0
set security ike gateway oam-gw dead-peer-detection
set security ike gateway oam-gw external-interface ge-0/0/1.0
set security ike gateway oam-gw version v2-only

  • IKEv2 is used
  • DPD will allow us to detect failures to reach the remote peer
  • address is the IP of the oam vSRX (the one the SP network must be able to reach)
  • external-interface is the core-facing one

Next, we move to ipsec configuration.

Proposal and policy:

set security ipsec proposal ips-prop protocol esp
set security ipsec proposal ips-prop authentication-algorithm hmac-sha-256-128
set security ipsec proposal ips-prop encryption-algorithm aes-256-cbc
set security ipsec policy ips-pol proposals ips-prop

Similarly to what we did with IKE, we configure algorithms to be used.

Last, we define the VPN:

set security ipsec vpn oam bind-interface st0.0
set security ipsec vpn oam ike gateway oam-gw
set security ipsec vpn oam ike ipsec-policy ips-pol
set security ipsec vpn oam establish-tunnels immediately

As this is a route-based VPN, we bind it to a secure tunnel (st0) logical unit. We also reference the IKE gateway (so that the VPN knows where to terminate the tunnel) and the IPSEC policy (so that the VPN knows which algorithms to use). Moreover, we tell Junos to establish the tunnels immediately (another option is to bring up the tunnel only when traffic that needs to cross it arrives).

We configure the st0.0 interface:

set interfaces st0 unit 0 family inet mtu 1436
set interfaces st0 unit 0 family inet address 10.0.1.2/30

In this case, st0.0 on oam vsrx is assigned IP 10.0.1.1.

The tunnel interface must belong to a zone. Here, we configure a dedicated zone for st0.0:

set security zones security-zone vpn-oam host-inbound-traffic system-services all
set security zones security-zone vpn-oam host-inbound-traffic protocols bgp
set security zones security-zone vpn-oam interfaces st0.0

We verify the tunnels are up. Here, we check on oam to see the tunnels towards both vSRXs (east and west):

root@oam> show security ike security-associations
Index   State  Initiator cookie  Responder cookie  Mode           Remote Address
5264574 UP     09f2941f0fe04076  54480e7648780965  IKEv2          173.30.5.0
5264575 UP     c167f1caa638c5ff  0c746e442317f3f5  IKEv2          173.30.2.0

root@oam> show security ipsec security-associations
  Total active tunnels: 2     Total Ipsec sas: 2
  ID    Algorithm       SPI      Life:sec/kb  Mon lsys Port  Gateway
  <131073 ESP:aes-cbc-256/sha256 cf36c4ea 998/ unlim - root 500 173.30.2.0
  >131073 ESP:aes-cbc-256/sha256 e5d4d9f7 998/ unlim - root 500 173.30.2.0
  <131074 ESP:aes-cbc-256/sha256 293f5dc7 3374/ unlim - root 500 173.30.5.0
  >131074 ESP:aes-cbc-256/sha256 12a3c7dd 3374/ unlim - root 500 173.30.5.0

Tunnels are up!

Now we want to build the iBGP session traveling inside the tunnel.

Again, we look at vsrx east.

First, we configure the AS (the same on all the vSRXs, as this is iBGP):

set routing-options autonomous-system 100

Next, we tell the vSRX to reach the BGP neighbor through the IPSEC tunnel:

set routing-options static route 3.3.3.3/32 next-hop st0.0

Check that the RIB is updated:

root@east> show route 3.3.3.3

inet.0: 21 destinations, 21 routes (21 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

3.3.3.3/32         *[Static/5] 00:59:00
                    >  via st0.0

We configure the iBGP group:

set protocols bgp group i type internal
set protocols bgp group i local-address 1.1.1.1
set protocols bgp group i family inet-vpn unicast
set protocols bgp group i neighbor 3.3.3.3

On the oam vSRX we have an RR BGP configuration:

set protocols bgp group i type internal
set protocols bgp group i local-address 3.3.3.3
set protocols bgp group i family inet-vpn unicast
set protocols bgp group i cluster 0.0.0.1
set protocols bgp group i neighbor 1.1.1.1
set protocols bgp group i neighbor 2.2.2.2
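
On oam, the client loopbacks must likewise be reachable through the tunnels; assuming st0.0 faces east and st0.1 faces west (consistent with the VPN definitions shown later), something along these lines is needed:

set routing-options static route 1.1.1.1/32 next-hop st0.0
set routing-options static route 2.2.2.2/32 next-hop st0.1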

As the iBGP sessions are loopback-based, the lo0 interface will be involved. That interface must belong to a zone as well. Again, we put lo0 in its own zone:

set security zones security-zone lo0 interfaces lo0.0 host-inbound-traffic system-services all
set security zones security-zone lo0 interfaces lo0.0 host-inbound-traffic protocols all

If we commit now, BGP will not come up. This is because BGP traffic is sourced from lo0 and leaves the router through another interface. As the vSRX is a firewall and two interfaces (and therefore two zones) are involved, we need a security policy to allow this kind of traffic.

The following policy is configured on all the vsrxs (adjust zone names as needed):

set security policies from-zone vpn-oam to-zone lo0 policy allow match source-address any
set security policies from-zone vpn-oam to-zone lo0 policy allow match destination-address any
set security policies from-zone vpn-oam to-zone lo0 policy allow match application junos-bgp
set security policies from-zone vpn-oam to-zone lo0 policy allow then permit

This policy allows BGP traffic coming from the st0.0 interface (zone vpn-oam) to be sent to lo0.0 (zone lo0). As the policy is “from st0.0 to lo0.0”, it will allow BGP packets (destination port 179) coming from the neighbor.

Now everything should be OK and BGP should come up. Again, we check on vSRX oam that the sessions to both east and west are up:

root@oam> show bgp summary
Threading mode: BGP I/O
Groups: 1 Peers: 2 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0
                       0          0          0          0          0          0
bgp.l3vpn.0
                       0          0          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
1.1.1.1                 100        128        126       0       1       56:36 Establ
  bgp.l3vpn.0: 0/0/0/0
2.2.2.2                 100        132        130       0       0       58:17 Establ
  bgp.l3vpn.0: 0/0/0/0

There it is! “BGPoverIPSEC”!

A few words about the oam vSRX. There, we have two tunnels. Let’s briefly see how this is achieved.

First, we need two secure tunnel (st0) logical units:

set interfaces st0 unit 0 family inet mtu 1436
set interfaces st0 unit 0 family inet address 10.0.1.1/30
set interfaces st0 unit 1 family inet mtu 1436
set interfaces st0 unit 1 family inet address 10.0.1.5/30

We use a single zone for both tunnel interfaces:

set security zones security-zone vpn-oam host-inbound-traffic system-services all
set security zones security-zone vpn-oam host-inbound-traffic protocols bgp
set security zones security-zone vpn-oam interfaces st0.0
set security zones security-zone vpn-oam interfaces st0.1

We have 2 IKE gateways, one per VPN/IPSEC tunnel:

set security ike gateway east-gw ike-policy ike-pol
set security ike gateway east-gw address 173.30.2.0
set security ike gateway east-gw dead-peer-detection
set security ike gateway east-gw external-interface ge-0/0/0.0
set security ike gateway east-gw version v2-only
set security ike gateway west-gw ike-policy ike-pol
set security ike gateway west-gw address 173.30.5.0
set security ike gateway west-gw dead-peer-detection
set security ike gateway west-gw external-interface ge-0/0/0.0
set security ike gateway west-gw version v2-only

The two gateways differ in terms of endpoint address. Anyhow, they both rely on the same ike-policy. This is not mandatory; we might have per-ike-gateway ike-policies. Here, it was simply easier to have a single ike-policy (referencing a single ike-proposal) and re-use it multiple times.

Something similar happens when we look at the actual VPN definitions:

set security ipsec vpn east bind-interface st0.0
set security ipsec vpn east ike gateway east-gw
set security ipsec vpn east ike ipsec-policy ips-pol
set security ipsec vpn east establish-tunnels immediately
set security ipsec vpn west bind-interface st0.1
set security ipsec vpn west ike gateway west-gw
set security ipsec vpn west ike ipsec-policy ips-pol
set security ipsec vpn west establish-tunnels immediately

They both use the same ipsec policy but have, obviously, different bind interfaces and ike gateways.

We now have our BGP control plane built over an IPSEC overlay network.

Ciao
IoSonoUmberto

Implementing MPLSoUDP endpoint reachability check to improve North-South convergence

In Contrail environments, we rely on SDN GWs in order to connect virtual workloads to the rest of the network.
With a L3 SDN GW, we normally have MPLSoUDP tunnels established between the SDN GW and the compute nodes.
MP-BGP inet-vpn routes received from the Contrail Controller are resolved using these tunnels.
It becomes essential to understand when a compute node dies, in order to remove the tunnel and invalidate the routes using it as next hop.
To achieve that, we have to first understand how Junos and MPLSoUDP tunnel endpoint resolution works.
Our Contrail control+data network is 192.168.1.0/24. This is the network we have configured as destination-networks when setting up dynamic tunnels:

set routing-options dynamic-tunnels contrail source-address 2.2.2.2
set routing-options dynamic-tunnels contrail udp
set routing-options dynamic-tunnels contrail destination-networks 192.168.1.0/24

That network is reachable via the underlay:

root@spine> show route table inet.0 192.168.1.0/24

inet.0: 16 destinations, 16 routes (16 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.1.0/24     *[BGP/170] 21:17:38, localpref 100
                      AS path: 65501 I, validation-state: unverified
                    >  to 192.168.2.0 via xe-0/0/0.0

The route is received via BGP from the leaf. The leaf sent this route as a result of an export policy matching the IRB interface associated with that network. Remember, our IP Fabric uses an ERB model, meaning the IRB interfaces acting as “VLAN gateways” are configured on the leaves.

set policy-options policy-statement exp-und term lo0 from protocol direct
set policy-options policy-statement exp-und term lo0 from interface lo0.0
set policy-options policy-statement exp-und term lo0 then accept
set policy-options policy-statement exp-und term erb from protocol direct
set policy-options policy-statement exp-und term erb from interface irb.200
set policy-options policy-statement exp-und term erb then accept

Exporting the direct route associated with irb.200 means exporting the Contrail control+data subnet:

root@leaf# run show route protocol direct 192.168.1.0

inet.0: 16 destinations, 17 routes (16 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.1.0/24     *[Direct/0] 21:21:56
                    >  via irb.200

Back to the SDN gateway.
Let’s understand how the “resolution chain” works.
First, we have the bgp inet-vpn route:

root@spine> show route table bgp.l3vpn.0 192.168.123.3/32 extensive | match "Tunnel Type"
                                        Tunnel type: UDP, nhid: 0, Reference-count: 8, tunnel id: 0

{master:0}
root@spine> show route table bgp.l3vpn.0 192.168.123.3/32 extensive | match "Protocol Next Hop"
                Protocol next hop: 192.168.1.3
                        Protocol next hop: 192.168.1.3

The next hop is an MPLSoUDP tunnel with the compute node acting as the remote tunnel endpoint.
Let’s check the tunnel database, specifically an attribute called IngressRoute:

root@spine> show dynamic-tunnels database | match Ingress
      Ingress Route: with 192.168.1.0/24
      Ingress Route: with 192.168.1.0/24
      Ingress Route: with 192.168.1.0/24
      Ingress Route: with 192.168.1.0/24
      Ingress Route: with 192.168.1.0/24

The tunnels currently rely on the /24 BGP route we have in inet.0.
That might seem fine; after all, we have a route to reach the compute node, right?
Yes, until we have a fault 😊
Think of this scenario:
We have 2 compute nodes. From the SDN GW, we use the same /24 route to reach both computes. At the same time, the tunnels to both compute nodes rely on that /24 route in inet.0.
But what if one compute dies? The Contrail Controller will detect it and, upon detection, will send BGP updates to the SDN GW telling the router to remove all routes whose next hop is the failed compute node. All good? Not quite. This detection is currently (Contrail 1912) slow as it depends on XMPP timers: 15 seconds (not configurable). During those 15 seconds traffic is lost.
We need a way to somehow understand at the SDN GW level whether a compute node is alive or not.
How can we achieve it? We might monitor compute node status and verify that processes are functional. Yes…but we are a SDN GW, a router, not a monitoring tool. So what can we rely on that is fast and reliable? Routing; more precisely, once again, BGP!
We said MPLSoUDP tunnels rely on a route lookup in inet.0. No route in inet.0, no tunnel!
Easy to say but how do we implement it? If you remember, right now, the SDN GW sees the entire control+data network subnet (a /24). As already described, this is not enough; we need more granular routes, ideally /32 routes for each compute node.
Well, this basically comes for free! Our IP Fabric uses the ERB model. By design, as soon as the leaf learns a MAC:IP association, it generates a /32 route:

root@leaf> show route table inet.0 protocol evpn

inet.0: 16 destinations, 17 routes (16 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.1.3/32     *[EVPN/7] 1d 01:45:49
                    >  via irb.200

Here it is! The leaf has a route, type EVPN, for our compute node.
As said, that route is generated as soon as the leaf learns the MAC:IP association of the compute node. Conversely, when the compute node fails and the leaf loses its connection to the server, the route is removed.
The next step is to advertise those routes on the underlay so that the spine (our SDN GW) learns them:

set policy-options policy-statement exp-und term lo0 from protocol direct
set policy-options policy-statement exp-und term lo0 from interface lo0.0
set policy-options policy-statement exp-und term lo0 then accept
set policy-options policy-statement exp-und term erb from protocol direct
set policy-options policy-statement exp-und term erb from interface irb.0
set policy-options policy-statement exp-und term erb from interface irb.200
set policy-options policy-statement exp-und term erb from prefix-list-filter erbs exact
set policy-options policy-statement exp-und term erb then accept
set policy-options policy-statement exp-und term evpn from protocol evpn
set policy-options policy-statement exp-und term evpn from route-filter 0.0.0.0/0 prefix-length-range /32-/32
set policy-options policy-statement exp-und term evpn then accept
set policy-options policy-statement exp-und then reject
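
The erbs prefix-list is not shown; presumably it simply lists the IRB subnets allowed to be exported, for example:

set policy-options prefix-list erbs 192.168.1.0/24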

The term evpn is key here!
As a result, the SDN GW gets this route:

root@spine> show route table inet.0 192.168.1.3

inet.0: 17 destinations, 17 routes (17 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.1.3/32     *[BGP/170] 00:00:09, localpref 100
                      AS path: 65501 I, validation-state: unverified
                    >  to 192.168.2.0 via xe-0/0/0.0

We are almost there. We still have to tell Junos to only use /32 routes when performing the tunnel endpoint lookup in inet.0.
This is done by adding this configuration:

set routing-options dynamic-tunnels forwarding-rib inet.0 inet-import tunnelres

This command tells Junos to use inet.0 to check endpoint reachability. Moreover, we specify a policy:

set policy-options policy-statement tunnelres term ok from protocol bgp
set policy-options policy-statement tunnelres term ok from route-filter 0.0.0.0/0 prefix-length-range /32-/32
set policy-options policy-statement tunnelres term ok then accept
set policy-options policy-statement tunnelres then reject

This policy specifies that only BGP /32 routes should be used!
Let’s check the tunnels’ ingress route now:

root@spine> show dynamic-tunnels database | match Ingress
      Ingress Route: with 192.168.1.3/32
      Ingress Route: with 192.168.1.3/32
      Ingress Route: with 192.168.1.3/32
      Ingress Route: with 192.168.1.3/32
      Ingress Route: with 192.168.1.3/32

The tunnels no longer use the /24; they now rely on the /32 route!
Now we remove the /32 route advertisement from the leaf configuration. This way, we simulate a compute node failure.
Remember, when the compute node goes down, the leaf detects it as the interface goes down and reacts by removing the /32 EVPN route.
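One simple way to simulate this in the lab might be to deactivate the evpn term of the leaf export policy (an assumption, depending on how the policy is actually structured):

deactivate policy-options policy-statement exp-und term evpn
commit
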
Now, on the SDN GW, we only have the subnet route:

root@spine> show route table inet.0 192.168.1.3

inet.0: 16 destinations, 16 routes (16 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

192.168.1.0/24     *[BGP/170] 1d 01:54:48, localpref 100
                      AS path: 65501 I, validation-state: unverified
                    >  to 192.168.2.0 via xe-0/0/0.0

Let’s check tunnel database:

root@spine> show dynamic-tunnels database | match Ingress

{master:0}
root@spine>

No tunnels! This happens because the import policy we applied to the dynamic tunnels forwarding RIB only allows BGP /32 routes to be used; as we no longer have a BGP /32 route for 192.168.1.3 in inet.0, the lookup fails and the tunnel is not established.
This is what we wanted to obtain! Routing and BGP are used to perform a sort of next-hop reachability check.
Let’s sum up what happens: as soon as the compute node goes down, the leaf detects it and reacts by removing the /32 EVPN route. Then the underlay BGP spreads this information and, as a result, the SDN GW loses the /32 towards the compute. This route removal impacts the MPLSoUDP forwarding lookup; as there is no longer a valid route, the tunnel is “destroyed” and the routes using it are invalidated. All of this happens very fast, giving us sub-second convergence!
Unfortunately, this solution is not perfect.
Assume the compute fault is not at the hardware level but at the software level. For example, the compute node interfaces are up but the vrouter agent fails. In this case the solution still works but it will be slower. This is because the leaf will realize the MAC:IP pair is “dead” only after the LACP timer expires (3 seconds with LACP fast).
In this case something like a monitoring-tool-like approach would be a better fit. This probably requires extra logic on the Contrail controller.
There is another scenario where this solution might not provide sub second convergence.
Assume we have 2 VMs, on two different compute nodes, advertising the same route (let’s say X) via BGPaaS. One of those VMs announces that route with worse attributes (e.g. a longer AS path). This is an active-backup scenario.
In this case, Contrail Controller receives both routes and performs BGP best path selection, picking the single best route.
That best route will be advertised to the SDN Gateway.
Now, assume the compute node with the active route to X dies. Fabric routing detects it and reacts, leading to the SDN GW removing the tunnel to that compute. This happens on the SDN GW…but what about Contrail?
As mentioned before, right now (Contrail 1912) the Contrail Controller takes at least 15 seconds to detect that a compute node is down (XMPP expiration timer). During that time, it still thinks the current best route is available. As a result, no BGP message is sent to the SDN GW telling it to switch to the backup route.
So we have this situation: the SDN GW was fast at detecting the compute node failure and removing the tunnel and routes to that server, but it does not receive up-to-date routing information from the Contrail controller, which was not as fast at detecting the fault. Result? Traffic loss!
The missing piece is extra logic at the Contrail level to monitor and detect compute node vrouter failures.
Just wait…we will get there 🙂
For now, this is a first step towards faster reaction.
Ciao
IoSonoUmberto