Routing with NSX-T (part 3) – High Availability with ECMP and BGP

Okay, one more thing to do: get redundant! In the topology that I created, I use one edge node within a cluster, so when that edge node fails, all traffic will cease. So we need redundancy.

In this post I am going to write about ECMP and combine it with BGP.

I’m not going to go into the workings of ECMP; there are a lot of excellent blogs about that. For now it is sufficient to know that ECMP stands for Equal-Cost Multi-Path and that it allows routers to use multiple paths to get from A to B. That makes it viable as a means of getting both load balancing (multiple paths mean the load is distributed across them) and redundancy (if one path is lost, the other path(s) can still be used). To get this redundancy it is necessary to use a dynamic routing protocol like BGP, to detect lost neighbors and remove their paths from the forwarding table.

Just to have some idea of what I’m configuring in this blog, here is an overview of the topology:

topology

I am creating the tier-0 and other objects in the red square. The virtual machine in the orange square (10.200.1.11) is used to test communication.

So after this small piece of theory on ECMP, we create two (virtual) edge nodes within one edge node cluster. The creation of the edge nodes is described in a previous post, Routing with NSX-T (part 2):

edge node cluster

The Edge Cluster Profile deserves special consideration. In this profile you can define the way that NSX-T handles BFD (Bidirectional Forwarding Detection). This is a protocol that can be used to detect that a peer is no longer available for forwarding and should be removed from the forwarding table.

edge node cluster profile

So when you are using physical edge nodes, the BFD Probe Interval can be reduced to 300 ms (the minimum value), which, combined with the default BFD Declare Dead Multiple of 3, makes for sub-second failover. The minimum value for BFD Declare Dead Multiple is 2, so you could go even lower, but beware of false positives! You don’t want your routing to flap because a couple of BFD packets were lost due to congestion.

Since we are using virtual machines, the default values (1000 ms and 3) are perfectly fine and will lead to a failover time of a little over 3 seconds.
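The math behind those numbers is simple: the detection time is just the probe interval multiplied by the Declare Dead Multiple:

detection time = BFD Probe Interval × BFD Declare Dead Multiple

virtual edges:  1000 ms × 3 = 3000 ms (the defaults)
physical edges:  300 ms × 3 =  900 ms (sub-second failover)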

So after that, I created a new tier-0 gateway to run on the newly created edge node cluster. First time around, I used Active/Standby as the HA Mode and connected it to the edge cluster I just created (consisting of two edge nodes):

tier-0-rjo-ha

Within the tier-0, I connected a segment with subnet 10.115.1.0/24; two virtual machines are running in this segment, 10.115.1.11 and 10.115.1.12.

I created two uplinks to the rest of the network:

tier-0-uplinks


Please note: in a real-life situation you should use two separate VLANs for the uplinks, but since I am limited there (and don’t have control over the physical layer), I use one and the same VLAN. One of the uplinks has .241 as the last octet, the other has .242.

After the creation of the uplinks, I configured and connected BGP:

BGP neighbors


I used the minimum timers for a fast failover, but in a production environment it is advised to consider these values carefully. Of course, I also configured the neighbors with the correct settings, to make sure they can communicate with each other and have the same timers set.
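To verify that the sessions actually come up, you can check the BGP state from the edge node CLI once you have access to it (more on enabling SSH below). A minimal sketch, assuming you first enter the VRF of the tier-0 service router (the VRF number comes from the get logical-routers listing further down and will differ per environment):

vrf <vrf-id of the tier-0 SR>
get bgp neighbor

Both neighbors should report the state Established.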

After that, it is time to take a look at the routing tables, to see if and how our newly created segment is advertised throughout the network.

In order to get the routing tables, I enabled SSH on the edge nodes. You should be able to use the NSX Manager CLI for this as well, but I didn’t get the right response, so working on the edge nodes directly was easier for me.

To enable SSH (disabled by default and not an option in the GUI), use the (remote) console of the edge node and type:

start service ssh

and to make sure it will start at the next reboot:

set service ssh start-on-boot
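To confirm that the service is actually running (a small extra step, not shown in my screenshots), you can query its status:

get service ssh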

After this, we can use PuTTY (or another tool of your choice) to SSH into the edge nodes. First we look up the logical router we want to query:

logical routers
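For reference, the listing in the screenshot comes from a command along these lines (the UUIDs and VRF numbers in your environment will differ):

get logical-routers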

We want to query the DR-Tier-0-GW Distributed Router (the tier-0 on the left side of the topology drawing), to see what routes it has to reach the virtual machines within the created segment (subnet 10.115.1.0/24).

To query the forwarding table of this DR, we use the following:

forwarding table DR
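In case the screenshot is hard to read, the sequence was roughly as follows: enter the VRF of the DR (the number taken from the get logical-routers output), then dump the forwarding table:

vrf <vrf-id of the DR>
get forwarding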

The table is truncated, but we can see that the subnet 10.115.1.0/24 is reachable through one (and only one) of the interfaces (last octet .242). The reason for this is that we used Active/Standby as the HA Mode. Later on, we’ll look at Active/Active mode, but for now this is good enough.

Try to ping one of the virtual machines in the 10.115.1.0/24 segment from the virtual machine in the orange square, and voilà:

ping from web-1

Now, as the proof of the pudding is in the eating, we’ll power down the active edge node. So, let’s find out which of the edge nodes is active. For that we can use the Advanced Networking part of NSX to see which of the interfaces is active and on which edge node it resides:

active SR

So we can see that the active SR is running on Edge-TN-06, which is in line with the configuration. As you could see in the earlier pictures and in the routing table, we are using XX.XX.XX.242. This IP address was configured on the interface connected to Edge-TN-06, and that is the one that is active here.

So, time to power off Edge-TN-06 and see what happens…

When we look at the ping:

ping-na-failure

We see three missed pings, which is in line with the configured BFD timers. I did a quick test where I changed the probe interval from 1000 ms to 2000 ms (keeping 3 as the Declare Dead Multiple) and changed the BGP timers to 10 and 30 (to make sure that BGP doesn’t mark the path as failed first), and then 6 pings were missed.

So when using BGP and BFD in combination, the lower of the two time-outs determines how quickly the failed paths are taken out of the routing table. When I configured the BGP timers as 1 and 3 and kept BFD at 2000 ms and 3, the time went down to 3 seconds again. Normally the BFD timers would be the lowest, since detecting failures is what this protocol is made for.
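Putting the numbers of these tests side by side makes the rule visible (assuming the first test used the minimum BGP timers of 1 and 3, and taking the second BGP timer as the hold time):

failover time ≈ min(BFD probe interval × multiple, BGP hold time)

test 1: min(1000 ms × 3, 3 s)  = 3 s → 3 missed pings
test 2: min(2000 ms × 3, 30 s) = 6 s → 6 missed pings
test 3: min(2000 ms × 3, 3 s)  = 3 s → 3 missed pings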

When we look at the active interfaces in Advanced Networking:

active SR-after failure

We see that Edge-TN-05 is now active and that the status of 06 is unknown. And when we look at the forwarding table on the edge node:

forwarding table DR - na failure

We can see that the forwarding table has also switched to the other interface.

So after all this, there is one more thing to do: do it all again, but this time create a tier-0 which is configured for Active/Active. I will omit all the steps and suffice with the last two pictures, from an Active/Active setup with all edge nodes running:

forwarding table DR - ecmp

(It is important to have the edge node turned on again, otherwise the route will not appear ;)).

active SR-ecmp

So in both pictures we can see that there are two active paths for the 10.115.1.0/24 subnet.
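If you prefer the CLI over the UI, the same thing is visible in the routing table of the DR. A sketch of what to look for (the exact output format differs per NSX-T version, and the VRF number is again the one from get logical-routers):

vrf <vrf-id of the DR>
get route 10.115.1.0/24

With ECMP active, the prefix shows up with two next hops, one ending in .241 and one in .242.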

The failover time is the same, because it will still take three seconds before the failed path is removed from the routing table:

ping-na-failure2

So after all this routing, I’ll have a look at Load Balancing and NAT within NSX-T. Stay tuned!
