Modern and well-built networks are path redundant. This redundancy not only brings higher fault tolerance but also better traffic distribution, as those redundant paths can be used to share the load across the network. Simply put, we have equal-cost multipath (ECMP).
Moreover, in medium/large networks we will probably have route reflectors to distribute routes within the routing domain.
By default, route reflection and ECMP are not great friends.
Let’s consider this reference topology:
Network 100/8 is reachable through both R1 and R2. That route is advertised, via iBGP, to Route Reflectors that reflect it to R3.
Our final goal is to have ECMP at R3. We want traffic destined to 100/8 to be equally shared among R3-R1 and R3-R2 links.
We configure all the network elements with a minimal basic configuration:
- OSPF among routers (lo0 passive)
- iBGP sessions between clients and RRs
- RRs BGP configuration only includes “cluster” setting (0.0.0.1 and 0.0.0.2)
For simplicity, 100/8 is configured as a “discard static route” on both R1 and R2 and distributed to RRs.
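As a reference, the baseline on R1 might look like the following sketch (the interface name and the export policy name are illustrative; addresses and group name match the topology above):

```
# Hypothetical baseline for R1; interface name and policy name are illustrative
set protocols ospf area 0.0.0.0 interface ge-0/0/0.0
set protocols ospf area 0.0.0.0 interface lo0.0 passive
set protocols bgp group rr type internal
set protocols bgp group rr local-address 1.1.1.1
set protocols bgp group rr neighbor 11.11.11.11
set protocols bgp group rr neighbor 22.22.22.22
set routing-options static route 100.0.0.0/8 discard
set policy-options policy-statement to-rr from protocol static
set policy-options policy-statement to-rr from route-filter 100.0.0.0/8 exact
set policy-options policy-statement to-rr then accept
set protocols bgp group rr export to-rr
```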
RRs receive a copy of 100/8 from both R1 and R2:
root@rr1_re# run show route protocol bgp
inet.0: 24 destinations, 25 routes (24 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
100.0.0.0/8 *[BGP/170] 00:01:23, localpref 100, from 1.1.1.1
AS path: I, validation-state: unverified
> to 192.168.14.0 via ge-0/0/0.0
[BGP/170] 00:01:41, localpref 100, from 2.2.2.2
AS path: I, validation-state: unverified
> to 192.168.24.0 via ge-0/0/1.0
Each of them only chooses one. In this case the first one, as it comes from the lowest peer address (1.1.1.1):
root@rr1_re# run show route advertising-protocol bgp 3.3.3.3 extensive
inet.0: 24 destinations, 25 routes (24 active, 0 holddown, 0 hidden)
* 100.0.0.0/8 (2 entries, 1 announced)
BGP group rr type Internal
Nexthop: 1.1.1.1
Localpref: 100
AS path: [100] I
Cluster ID: 0.0.0.1
Originator ID: 1.1.1.1
root@rr2_re# run show route advertising-protocol bgp 3.3.3.3 extensive
inet.0: 24 destinations, 25 routes (24 active, 0 holddown, 0 hidden)
* 100.0.0.0/8 (2 entries, 1 announced)
BGP group rr type Internal
Nexthop: 1.1.1.1
Localpref: 100
AS path: [100] I
Cluster ID: 0.0.0.2
Originator ID: 1.1.1.1
R3 receives two copies, one from each RR, but they both point to the same next-hop (1.1.1.1):
root@r3_re# run show route protocol bgp
inet.0: 25 destinations, 26 routes (25 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
100.0.0.0/8 *[BGP/170] 00:03:50, localpref 100, from 11.11.11.11
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
[BGP/170] 00:03:31, localpref 100, from 22.22.22.22
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
As a result, we no longer have ECMP; we lost it!
Adding multipath and a load-balancing policy on the RRs does not help. It brings ECMP on the RRs (which is useless, as the RRs are not part of the forwarding path) but does not lead to multiple next-hops being advertised to R3.
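For reference, what we tried on the RRs was along these lines (a sketch; the load-balancing policy name is illustrative):

```
# Illustrative policy name "lb"; this only gives the RRs themselves ECMP
set protocols bgp group rr multipath
set policy-options policy-statement lb then load-balance per-packet
set routing-options forwarding-table export lb
```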
What can we do?
Here, I’m going to show three different approaches.
The first one is to leverage a BGP feature called add-path.
Add Path requires configuration on both peers of a session.
On RRs we add:
root@rr1_re# show | compare rollback 1
[edit protocols bgp group rr]
+ family inet {
+ unicast {
+ add-path {
+ send {
+ path-count 4;
+ }
+ }
+ }
+ }
where we basically tell BGP to advertise up to 4 paths for a given route.
On R1, R2 and R3 we add:
root@r3_re# show | compare rollback 1
[edit protocols bgp group rr]
+ family inet {
+ unicast {
+ add-path {
+ receive;
+ }
+ }
+ }
this tells Junos to accept multiple paths.
Now the RRs announce the route with multiple next-hops:
root@rr1_re# run show route advertising-protocol bgp 3.3.3.3
inet.0: 24 destinations, 25 routes (24 active, 0 holddown, 0 hidden)
Prefix Nexthop MED Lclpref AS path
* 100.0.0.0/8 1.1.1.1 100 I
2.2.2.2 100 I
Let’s check on R3:
root@r3_re# run show route protocol bgp
inet.0: 25 destinations, 27 routes (25 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
100.0.0.0/8 *[BGP/170] 00:02:41, localpref 100, from 11.11.11.11
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
[BGP/170] 00:02:37, localpref 100, from 22.22.22.22
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
[BGP/170] 00:02:41, localpref 100, from 11.11.11.11
AS path: I, validation-state: unverified
> to 192.168.23.0 via ge-0/0/1.0
[edit]
root@r3_re# run show route forwarding-table destination 100.0.0.0/8
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination Type RtRef Next hop Type Index NhRef Netif
100.0.0.0/8 user 0 indr 1048575 2
192.168.13.0 ucst 513 6 ge-0/0/0.0
Not there yet!
On R3, we have load balancing on the forwarding table but we still miss multipath on BGP:
root@r3_re# set protocols bgp group rr multipath
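For completeness, the load balancing on the forwarding table mentioned above is the classic Junos recipe, already in place on R3 (a sketch; the policy name is illustrative):

```
# Illustrative policy name "lb"; exports a per-packet (per-flow) LB policy to the FIB
set policy-options policy-statement lb then load-balance per-packet
set routing-options forwarding-table export lb
```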
Now it works:
root@r3_re# run show route receive-protocol bgp 11.11.11.11
inet.0: 25 destinations, 26 routes (25 active, 0 holddown, 0 hidden)
Prefix Nexthop MED Lclpref AS path
* 100.0.0.0/8 1.1.1.1 100 I
2.2.2.2 100 I
inet6.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
[edit]
root@r3_re# run show route receive-protocol bgp 22.22.22.22
inet.0: 25 destinations, 28 routes (25 active, 0 holddown, 0 hidden)
Prefix Nexthop MED Lclpref AS path
100.0.0.0/8 1.1.1.1 100 I
2.2.2.2 100 I
inet6.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
root@r3_re# run show route table inet.0 protocol bgp
inet.0: 25 destinations, 28 routes (25 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
100.0.0.0/8 *[BGP/170] 00:02:06, localpref 100, from 11.11.11.11
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
to 192.168.23.0 via ge-0/0/1.0
[BGP/170] 00:00:50, localpref 100, from 22.22.22.22
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
[BGP/170] 00:05:13, localpref 100, from 11.11.11.11
AS path: I, validation-state: unverified
> to 192.168.23.0 via ge-0/0/1.0
[BGP/170] 00:00:50, localpref 100, from 22.22.22.22
AS path: I, validation-state: unverified
> to 192.168.23.0 via ge-0/0/1.0
root@r3_re# run show route forwarding-table destination 100.0.0.0/8 table default
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination Type RtRef Next hop Type Index NhRef Netif
100.0.0.0/8 user 0 ulst 1048580 2
indr 1048579 2
192.168.23.0 ucst 514 6 ge-0/0/1.0
indr 1048575 2
192.168.13.0 ucst 513 6 ge-0/0/0.0
That’s it! ECMP via BGP!
For this approach to work, we need both peers to support Add Path.
It might happen that this is not the case. If so, we have to get a bit creative!
The second approach is to use MED so that each RR announces a different next-hop. As a consequence, RR clients will receive multiple BGP routes and will build ECMP locally.
Let’s see how to do this.
On R1, R2 and R3 we configure these policies:
set policy-options policy-statement med1000 then metric 1000
set policy-options policy-statement med1000 then accept
set policy-options policy-statement med2000 then metric 2000
set policy-options policy-statement med2000 then accept
The first policy sets MED to 1000 while the second one sets MED to 2000.
The idea behind this approach is that RR1 sees R1 as the best next-hop while RR2 sees R2 as the best next-hop. This way, they will advertise routes with different next-hops, unlike before, where both RRs pick the same next-hop (lowest peer address, 1.1.1.1).
Next, we configure export policies towards RRs.
On R1:
set protocols bgp group rr neighbor 11.11.11.11 export med1000
set protocols bgp group rr neighbor 22.22.22.22 export med2000
On R2:
set protocols bgp group rr neighbor 11.11.11.11 export med2000
set protocols bgp group rr neighbor 22.22.22.22 export med1000
Please notice:
- MED1000 towards RR1 on R1 and towards RR2 on R2
- MED2000 towards RR2 on R1 and towards RR1 on R2
That “policy inversion” does the trick!
- RR1 chooses 100/8 copy from R1 (MED 1000)
- RR2 chooses 100/8 copy from R2 (MED 1000)
Thanks to this trick, R3 still receives two copies but, this time, the next-hops are different. R3 applies the best-path selection algorithm, understands the two paths are equal cost, and installs an ECMP route.
As said, R3 gets routes with different next-hops:
root@r3_re# run show route receive-protocol bgp 11.11.11.11 100/8
inet.0: 25 destinations, 26 routes (25 active, 0 holddown, 0 hidden)
Prefix Nexthop MED Lclpref AS path
* 100.0.0.0/8 1.1.1.1 1000 100 I
[edit]
root@r3_re# run show route receive-protocol bgp 22.22.22.22 100/8
inet.0: 25 destinations, 26 routes (25 active, 0 holddown, 0 hidden)
Prefix Nexthop MED Lclpref AS path
100.0.0.0/8 2.2.2.2 1000 100 I
And we end up with ECMP without Add Path:
root@r3_re# run show bgp neighbor 11.11.11.11 | match AddPath
Peer does not support Addpath
[edit]
root@r3_re# run show bgp neighbor 22.22.22.22 | match AddPath
Peer does not support Addpath
Summing up:
root@r3_re# run show route protocol bgp
inet.0: 25 destinations, 26 routes (25 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
100.0.0.0/8 *[BGP/170] 00:03:24, MED 1000, localpref 100, from 11.11.11.11
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
to 192.168.23.0 via ge-0/0/1.0
[BGP/170] 00:03:24, MED 1000, localpref 100, from 22.22.22.22
AS path: I, validation-state: unverified
> to 192.168.23.0 via ge-0/0/1.0
root@r3_re# run show route forwarding-table table default destination 100/8
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination Type RtRef Next hop Type Index NhRef Netif
100.0.0.0/8 user 0 ulst 1048580 2
indr 1048579 2
192.168.23.0 ucst 514 6 ge-0/0/1.0
indr 1048575 2
192.168.13.0 ucst 513 6 ge-0/0/0.0
Are we done? Not yet. We still have one more approach.
The last way to achieve ECMP with route reflection is to use an anycast IP.
On R1 and R2, we configure a discard static route that will be used as anycast IP:
set routing-options static route 1.2.3.4/32 discard
We create a virtual router into which we copy the routes that remote routers must reach via ECMP (in our case, 100/8):
set routing-instances anycast-check instance-type virtual-router
set routing-instances anycast-check routing-options instance-import anycast-import
set policy-options policy-statement anycast-import term static from instance master
set policy-options policy-statement anycast-import term static from protocol static
set policy-options policy-statement anycast-import term static from prefix-list-filter to-rr exact
set policy-options policy-statement anycast-import term static then accept
set policy-options policy-statement anycast-import then reject
set policy-options prefix-list to-rr 100.0.0.0/8
Here, we used “from protocol static” as we emulated the “end route” as a local static route. Of course, depending on the specific scenario, we have to adjust the policy accordingly (e.g. from protocol ospf, match a community, etc.).
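For instance, if 100/8 were learned via OSPF instead of being a local static route, the matching term might become something like this (a hypothetical variant, not part of this lab):

```
# Hypothetical variant: import 100/8 when learned via OSPF instead of static
set policy-options policy-statement anycast-import term ospf from instance master
set policy-options policy-statement anycast-import term ospf from protocol ospf
set policy-options policy-statement anycast-import term ospf from prefix-list-filter to-rr exact
set policy-options policy-statement anycast-import term ospf then accept
```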
We export the anycast route into OSPF:
set policy-options policy-statement exp-ospf term anycast from protocol static
set policy-options policy-statement exp-ospf term anycast from route-filter 1.2.3.4/32 exact
set policy-options policy-statement exp-ospf term anycast from condition anycast-check
set policy-options policy-statement exp-ospf term anycast then accept
set policy-options condition anycast-check if-route-exists 100.0.0.0/8 table anycast-check.inet.0
set protocols ospf export exp-ospf
As both R1 and R2 advertise anycast route into OSPF, R3 will have an ECMP route to the anycast route.
Last, we modify the export policy towards the RRs on R1 and R2 so as to set the next-hop to the anycast IP (1.2.3.4):
set policy-options policy-statement to-rr term ok from protocol static
set policy-options policy-statement to-rr term ok from prefix-list-filter to-rr exact
set policy-options policy-statement to-rr term ok then next-hop 1.2.3.4
set policy-options policy-statement to-rr term ok then accept
set policy-options policy-statement to-rr then reject
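Assuming the iBGP group towards the RRs is named “rr” as on the other routers, the policy is then applied as an export policy on R1 and R2:

```
# Apply the modified policy to the iBGP sessions towards the RRs
set protocols bgp group rr export to-rr
```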
On R3, 1.2.3.4 is reachable via 2 ECMP paths (OSPF route):
root@r3_re# run show route 1.2.3.4 exact
inet.0: 26 destinations, 27 routes (26 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
1.2.3.4/32 *[OSPF/150] 00:01:17, metric 0, tag 0
> to 192.168.13.0 via ge-0/0/0.0
to 192.168.23.0 via ge-0/0/1.0
R3 receives route 100/8 from both RRs with next-hop 1.2.3.4:
root@r3_re# run show route protocol bgp extensive | match Proto
Protocol next hop: 1.2.3.4
Protocol next hop: 1.2.3.4 Metric: 0
Protocol next hop: 1.2.3.4
Protocol next hop: 1.2.3.4 Metric: 0
As a result, the BGP route’s next-hop is resolved via the OSPF route to 1.2.3.4. As that OSPF route is an ECMP one, the BGP route leverages the ECMP next-hop as well:
root@r3_re# run show route protocol bgp
inet.0: 26 destinations, 27 routes (26 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
100.0.0.0/8 *[BGP/170] 00:00:22, localpref 100, from 11.11.11.11
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
to 192.168.23.0 via ge-0/0/1.0
[BGP/170] 00:00:22, localpref 100, from 22.22.22.22
AS path: I, validation-state: unverified
> to 192.168.13.0 via ge-0/0/0.0
to 192.168.23.0 via ge-0/0/1.0
[edit]
root@r3_re# run show route forwarding-table table default destination 1.2.3.4
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination Type RtRef Next hop Type Index NhRef Netif
1.2.3.4/32 user 0 ulst 1048575 3
192.168.13.0 ucst 513 6 ge-0/0/0.0
192.168.23.0 ucst 514 6 ge-0/0/1.0
With this third approach, multipath is no longer needed in R3’s BGP configuration (as we no longer receive multiple routes from the RRs; we receive one route whose next-hop is locally resolved to an ECMP next-hop):
root@r3_re# delete protocols bgp group rr multipath
root@r3_re# run show route forwarding-table table default destination 1.2.3.4
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination Type RtRef Next hop Type Index NhRef Netif
1.2.3.4/32 user 0 ulst 1048575 3
192.168.13.0 ucst 513 6 ge-0/0/0.0
192.168.23.0 ucst 514 6 ge-0/0/1.0
[edit]
root@r3_re# run show route forwarding-table table default destination 100.0.0.0
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination Type RtRef Next hop Type Index NhRef Netif
100.0.0.0/8 user 0 indr 1048579 2
ulst 1048575 3
192.168.13.0 ucst 513 6 ge-0/0/0.0
192.168.23.0 ucst 514 6 ge-0/0/1.0
Before calling it a day, let’s spend a few words on the condition we added to the OSPF export policy:
set policy-options policy-statement exp-ospf term anycast from condition anycast-check
set policy-options condition anycast-check if-route-exists 100.0.0.0/8 table anycast-check.inet.0
Without it, 1.2.3.4/32 would always be advertised into OSPF.
However, it might happen that, for example, R1 does not have a route for 100/8. In that case, any traffic destined to 100/8 that R3 sent to R1 (legitimate, as R1 is one of the ECMP next-hops) would get lost.
With the condition, instead, we advertise 1.2.3.4 into OSPF if and only if we have a route for 100/8. As a result, we “advertise ourselves as a potential next-hop” for 100/8 only if we know we can reach 100/8.
And that’s it!
Ciao
IoSonoUmberto