Building a REST API for ExaBGP

Over the last couple of years there has been a trend to extend layer 3 to the top-of-rack (TOR) switch. This gives a more stable and scalable design than the classic layer 2 network design. One major disadvantage of layer 3 to the TOR switch is IP mobility. In the classic L2 design, moving a VM was a simple live migration to a different compute host in a different rack. When L3 is extended to the TOR, IP mobility isn’t that simple anymore. One solution might be to let the VM host advertise a unique service IP for a particular VM when it becomes active on that host. A great tool for this use case is ExaBGP.

ExaBGP does not modify the route table on the host itself; it only announces routes to its neighbours. After ExaBGP starts, the routes it advertises can be influenced by sending messages to STDIN.

Below is the config used by the ExaBGP daemon

Most of this is pretty self-explanatory; the important stuff happens on lines 9-11. These lines start a script, and all output of this script is parsed by ExaBGP.

The script provides a REST API and writes the announce and withdraw commands for ExaBGP to STDOUT.

For testing purposes I created a simple setup within KVM with two hosts: docker1, which runs ExaBGP, and firewall-1, which runs the BIRD BGP daemon. There is an L2 segment between the two hosts over which the BGP peering is established.

The python script is only 75 lines long.

The heavy lifting of the web service is handled by a powerful library for creating a web server in Python. I am a network engineer with very limited experience with Python, but creating the script only took me a couple of hours.
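To give an idea of the approach, here is a minimal sketch. It is not the actual script: it uses only `http.server` from the Python standard library instead of the library mentioned above, and the URL layout (`/announce/<prefix>`, `/withdraw/<prefix>`), the names `build_command`, `RouteHandler` and `serve`, and the choice of `next-hop self` are assumptions made for illustration.

```python
import ipaddress
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_command(action, prefix):
    """Return an ExaBGP text command, or None if the input is not usable."""
    if action not in ("announce", "withdraw"):
        return None
    try:
        # A bare address such as 10.0.0.1 becomes the host route 10.0.0.1/32.
        network = ipaddress.ip_network(prefix, strict=False)
    except ValueError:
        return None
    return f"{action} route {network} next-hop self"


class RouteHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical URL layout: /<action>/<address>[/<prefix length>]
        parts = self.path.strip("/").split("/", 1)
        command = build_command(parts[0], parts[1]) if len(parts) == 2 else None
        if command is None:
            self.send_error(400, "bad request")
            return
        # The command goes to STDOUT, which is what ExaBGP parses.
        # http.server writes its request log to STDERR, so STDOUT stays clean.
        sys.stdout.write(command + "\n")
        sys.stdout.flush()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write((command + "\n").encode())


def serve(port=8080):
    """Start the web service; meant to be called from the script ExaBGP runs."""
    HTTPServer(("", port), RouteHandler).serve_forever()
```

Calling `serve()` at the end of the script would start the listener. Flushing STDOUT after every command matters, because otherwise the command may sit in a buffer instead of reaching ExaBGP.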

The script in action

We start by starting the ExaBGP daemon

By default the service is started on port 8080

The BGP neighbor is also shown as established by BIRD

Adding a route is as simple as running curl on the host where ExaBGP is running

ExaBGP gets the announce message

The BGP daemon on the firewall also knows the route

The REST API also accepts communities and MEDs

This is shown by the BIRD daemon as well

Withdrawing routes can also be done easily with a curl statement

And the route is gone

At the moment there is only limited input validation. The REST API does check whether the IP address entered is valid, but no other checks are implemented yet. I might add them if the need arises.
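As a sketch of what extra checks could look like, the two helpers below validate a MED and a standard BGP community. They are not part of the script; the function names are made up for illustration. The ranges follow from the attribute formats: the MED is a 32-bit unsigned value and a standard community is two 16-bit values written as `ASN:value`.

```python
def parse_med(value):
    """Return the MED as an int, or raise ValueError if it is not usable."""
    med = int(value)  # raises ValueError on non-numeric input
    if not 0 <= med <= 0xFFFFFFFF:
        raise ValueError(f"MED out of range: {value}")
    return med


def parse_community(value):
    """Return a standard community 'ASN:value' as a tuple of two ints."""
    parts = value.split(":")
    if len(parts) != 2:
        raise ValueError(f"community must look like ASN:value: {value}")
    asn, tag = (int(part) for part in parts)
    if not (0 <= asn <= 0xFFFF and 0 <= tag <= 0xFFFF):
        raise ValueError(f"community field out of range: {value}")
    return asn, tag
```

Both helpers raise `ValueError`, so the web handler could catch a single exception type and answer with an HTTP 400.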

The script and configs used in this blog can be found on my GitHub.


ERSPAN on the Nexus 7000

To troubleshoot some performance issues, a SPAN port was required on a Nexus 7000. Of course, the port to span was not located on the same switch as the SPAN destination.

On the Nexus 7000 it is not possible to use an RSPAN VLAN as a SPAN destination. It can only be used as a SPAN source, so this was not an option.

ERSPAN can be used as a SPAN destination, but the N7K where the ERSPAN traffic needed to be decapsulated and sent to the monitoring tool didn’t have the correct software to do this. So again, not a feasible solution.

However, it is possible to give the monitoring tool the IP address of the ERSPAN destination and place it in a segment reachable by the N7K generating the ERSPAN traffic.

The basic configuration looks like this

In the admin VDC the source IP for the ERSPAN traffic needs to be specified

I am not sure why this is needed in the admin VDC.
Give a simple Linux VM the IP address of the ERSPAN destination and capture the data with tcpdump.

ERSPAN uses the GRE protocol to encapsulate the packets and send them to the collector, so we filter on those.
Opening the file in Wireshark shows us the data received. In the red box the ERSPAN traffic can be seen, and in the blue box the actual encapsulated packets.
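To illustrate what such a filter matches on: ERSPAN packets are IP packets with protocol number 47 (GRE), and the GRE header carries protocol type 0x88BE for ERSPAN Type II or 0x22EB for Type III. The sketch below classifies a raw IPv4 packet on that basis. It assumes the packet starts at the IPv4 header (no Ethernet header), and the function name `classify` is made up for this example.

```python
import struct

IPPROTO_GRE = 47                # IP protocol number of GRE
GRE_PROTO_ERSPAN_II = 0x88BE    # GRE protocol type for ERSPAN Type II
GRE_PROTO_ERSPAN_III = 0x22EB   # GRE protocol type for ERSPAN Type III


def classify(packet: bytes) -> str:
    """Classify a raw IPv4 packet as 'erspan', 'gre' or 'other'."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return "other"
    header_length = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if packet[9] != IPPROTO_GRE or len(packet) < header_length + 4:
        return "other"
    # The first four GRE bytes are flags/version followed by the protocol type.
    _flags, protocol = struct.unpack("!HH", packet[header_length:header_length + 4])
    if protocol in (GRE_PROTO_ERSPAN_II, GRE_PROTO_ERSPAN_III):
        return "erspan"
    return "gre"
```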


Recently I have been following the VMware VCP-NV course and have been reading about the VXLAN MP-BGP EVPN control plane. In this post I will give a very brief overview of the layer 2 operation of both solutions.
Multicast no longer required

In previous implementations of VXLAN, BUM (Broadcast, Unknown Unicast, Multicast) traffic was sent via multicast to all VTEPs which might be interested in these packets. Multicast, especially L3 multicast, is rare in a datacenter, and the dependency on multicast was a huge limitation for the adoption of VXLAN in the datacenter.

Both solutions no longer require multicast to handle BUM traffic. Via the control plane each VTEP knows about the other VTEPs interested in traffic for a particular VNI. BUM traffic is replicated by the local VTEP as multiple unicast packets, one to each of the other VTEPs.
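This replication at the local VTEP can be sketched in a few lines. The model below only illustrates the idea and says nothing about how NSX or an EVPN implementation stores this state; the class name, method names and addresses are made up.

```python
from collections import defaultdict


class VtepTable:
    """Per-VNI list of interested VTEPs, as learned from the control plane."""

    def __init__(self):
        self._members = defaultdict(set)

    def join(self, vni, vtep_ip):
        """Record that a VTEP is interested in traffic for this VNI."""
        self._members[vni].add(vtep_ip)

    def replicate(self, vni, local_vtep, frame):
        """Return one (destination VTEP, frame) pair per remote VTEP in the VNI."""
        remotes = sorted(self._members.get(vni, set()) - {local_vtep})
        return [(remote, frame) for remote in remotes]


table = VtepTable()
for vtep in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    table.join(5000, vtep)

# A broadcast frame entering at 10.0.0.1 becomes two unicast copies.
copies = table.replicate(5000, "10.0.0.1", b"arp-request")
```

The cost of this approach is visible in the return value: the local VTEP sends one copy per remote VTEP, so the replication load grows with the number of VTEPs in the VNI.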

NSX also has a hybrid mode. In hybrid mode, BUM traffic destined for VTEPs in the local VTEP segment is sent via a local multicast group. Traffic towards remote VTEP segments is forwarded to the forwarder in the remote segment, which replicates the packet as a multicast packet in its local segment.

Besides learning which VTEPs are interested in a particular VXLAN segment, the control plane is also used to propagate MAC reachability between the VTEPs. The control plane removes the need for a flood-and-learn mechanism for MAC learning.
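Conceptually, the result is a table that is filled by control plane updates instead of by watching flooded traffic. A minimal sketch of such a table follows; the class, its methods and the example addresses are hypothetical and not taken from either product.

```python
class MacTable:
    """Maps (VNI, MAC) to the VTEP behind which that MAC lives."""

    def __init__(self):
        self._entries = {}

    def advertise(self, vni, mac, vtep_ip):
        """Control plane update: this MAC is reachable via this VTEP."""
        self._entries[(vni, mac.lower())] = vtep_ip

    def withdraw(self, vni, mac):
        """Control plane update: this MAC is no longer reachable."""
        self._entries.pop((vni, mac.lower()), None)

    def lookup(self, vni, mac):
        """Return the VTEP for a MAC, or None if it was never advertised."""
        return self._entries.get((vni, mac.lower()))


macs = MacTable()
macs.advertise(5000, "00:50:56:AA:BB:CC", "10.0.0.2")
```

A lookup that returns a VTEP means the frame can be sent as a single unicast packet; only a miss falls back to BUM handling.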

Open vs closed control plane
The control plane used by NSX is a proprietary protocol. The VTEP on the ESX servers can only work with the NSX controllers. At this moment the only hardware switch which can be part of the NSX VXLAN cloud is from Arista. The controllers used by NSX are VMs running on the control cluster. These are dedicated ESXi machines running the various control functions within an NSX deployment. At least three controllers are required, and they should not run on the same ESXi host.
The MP-BGP VXLAN solution is based on open standards. The EVPN address family of BGP is used to propagate all the required information, like VNI and MAC reachability, between the VTEPs. Vendors like Juniper, Huawei, Cisco and Alcatel Lucent already support this. Although it should be possible to create a full mesh of iBGP between the VTEPs, it seems logical to use BGP route reflectors for scalability.
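The scalability argument is simple arithmetic. The sketch below compares the number of iBGP sessions in a full mesh with a design where every VTEP peers only with a few route reflectors. It assumes the route reflectors are separate devices that are fully meshed among themselves; the function names are made up.

```python
def full_mesh_sessions(vteps):
    """iBGP sessions when every VTEP peers with every other VTEP."""
    return vteps * (vteps - 1) // 2


def route_reflector_sessions(vteps, reflectors):
    """Sessions when each VTEP peers with each route reflector only.

    Assumes dedicated route reflectors that are fully meshed with each other.
    """
    return vteps * reflectors + full_mesh_sessions(reflectors)


mesh = full_mesh_sessions(100)                 # 4950 sessions
reflected = route_reflector_sessions(100, 2)   # 201 sessions
```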

EANTC has done some interoperability testing with the vendors above and made the white paper available

In a next post I will describe how routing has been implemented by NSX and the VXLAN MP-BGP solution.

Juniper multicast firewall heartbeat traffic and Nexus switches

Recently we moved the cluster traffic of a Juniper SRX cluster from a Cat3750 to Nexus 5000 FEX ports. At that moment the cluster broke. Apparently the Junipers use multicast for the heartbeat traffic. Adding an IGMP querier to the VLAN did not resolve the problem. Apparently the Junipers do not send an IGMP join for the heartbeat traffic.
It seems Catalyst switches are content to flood the multicast traffic when there is no IGMP querier present, whereas Nexus switches drop all multicast traffic when the querier is absent.

Disabling IGMP snooping on the heartbeat VLAN solved it.
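The difference can be captured in a toy forwarding decision. This is only a simplified model of the behaviour described above, not how either platform is implemented; the function, its parameters and the port names are invented. The `flood_unregistered` flag stands for the Catalyst-like behaviour of flooding groups nobody joined.

```python
def egress_ports(ports, ingress, members, snooping=True, flood_unregistered=False):
    """Return the ports a multicast frame is sent out of.

    ports:   all ports in the VLAN
    ingress: port the frame arrived on
    members: ports from which an IGMP join for the group was seen
    """
    if not snooping or (not members and flood_unregistered):
        # Treat the frame like a broadcast: every port except the ingress port.
        return set(ports) - {ingress}
    # With snooping, only ports with known receivers get the frame.
    return set(members) - {ingress}


vlan_ports = {"eth1", "eth2", "eth3"}
no_joins = set()  # the firewalls never send an IGMP join

nexus_like = egress_ports(vlan_ports, "eth1", no_joins)
catalyst_like = egress_ports(vlan_ports, "eth1", no_joins, flood_unregistered=True)
snooping_off = egress_ports(vlan_ports, "eth1", no_joins, snooping=False)
```

In this model the heartbeat is delivered nowhere in the first case, which matches the broken cluster, and disabling snooping restores flooding.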

Updates from Cisco Technology Summit

Last week the Cisco Technology Summit in The Netherlands took place. I attended a very interesting session about VXLAN and how MP-BGP is used as a control plane. In this new setup BGP is used to signal MAC address reachability between the VTEPs. In the original VXLAN implementations multicast was required for BUM traffic. I will dive deeper into this.

Oh, and finally: in NX-OS 7.2, L3 over VPC will be supported.

VPC-peer gateway and F5

Recently I was involved in a migration project from a pair of Catalyst 6500s to a pair of Nexus 7000s. Part of it was moving the SVIs from the Catalysts to the Nexus switches. As HSRP was used on each SVI to be moved, it was decided to use the following logic:

  1. Move HSRP Active role to Cat6500_1
  2. Disable SVI on Cat6500_2
  3. Enable SVI on N7K_2
  4. Move HSRP Active role to N7K_2
  5. Disable SVI on Cat6500_1
  6. Enable SVI on N7K_1

Things went wrong after step 4. The F5 BigIP lost its network connectivity. After some troubleshooting it appeared that the F5 uses the source-mac of the incoming packets as the destination-mac of the reply packet. F5 calls this feature auto last hop.

I hear some of you screaming that I should have used VPC peer-gateway to solve that. Well, that was already enabled and was the cause of the problem.

The F5 was connected behind the Cat6500 and the Cat6500 was connected with a VPC to the new N7K pair.

As return traffic from the F5 was being hashed by the 6500 over the port-channel members towards either one of the N7Ks, there was a 50% chance it would be switched to N7K_1. Due to the vpc peer-gateway feature, N7K_1 responds to traffic destined for the MAC address of its VPC peer. The problem was that N7K_1 doesn’t have an SVI to route that traffic, which resulted in connectivity issues.
Below is some sample output showing this behaviour. We have a standard HSRP configuration on VLAN 100. HSRP group 1 maps to MAC 0000.0c9f.f001.

The MAC of the SVI of both Nexus devices.

The MAC address tables of both devices show they both respond to the HSRP MAC and the MAC of each other’s SVI. This is indicated by the G flag.

This is all expected behaviour. Now we shut the SVI and look at the MAC table of N7K_1.

As you can see, the HSRP MAC is no longer handled locally, but the MAC of the SVI of the VPC peer is.

I can’t imagine why Cisco decided to handle traffic for the peer SVI MAC address if the SVI is down. If the SVI is down, it no longer responds to the HSRP MAC address of the VPC peer. Therefore clients which do an ARP for the default gateway and send return traffic to the HSRP MAC address work fine.
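The behaviour can be summarised in a small model of N7K_1. This is a simplification of what the MAC tables showed, not a description of how NX-OS works internally. The HSRP MAC is the one from this post; the two SVI MAC addresses and all function names are hypothetical, and whether the switch still handles its own SVI MAC with the SVI shut is an assumption of the model.

```python
HSRP_MAC = "0000.0c9f.f001"       # HSRP group 1, as in this post
LOCAL_SVI_MAC = "0000.0000.0001"  # hypothetical MAC of the SVI on N7K_1
PEER_SVI_MAC = "0000.0000.0002"   # hypothetical MAC of the SVI on N7K_2


def gateway_macs(svi_up):
    """MACs that N7K_1 treats as its own (the G flag in the MAC table)."""
    macs = {PEER_SVI_MAC}  # peer-gateway entry stays, even with the SVI shut
    if svi_up:
        macs |= {HSRP_MAC, LOCAL_SVI_MAC}
    return macs


def handle(dst_mac, svi_up):
    """Return what N7K_1 does with a frame sent to dst_mac."""
    if dst_mac not in gateway_macs(svi_up):
        return "bridged"  # switched at layer 2, for example towards the VPC peer
    return "routed" if svi_up else "dropped"  # no SVI means nothing can route it


hsrp_client = handle(HSRP_MAC, svi_up=False)      # client that ARPs for the gateway
auto_last_hop = handle(PEER_SVI_MAC, svi_up=False)  # F5 replying to the source MAC
```

In this model a client using the HSRP MAC is bridged to the peer and works, while the F5, which replies to the SVI MAC of N7K_2, hits the gateway entry on N7K_1 and is dropped.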

In short, shutting down an SVI on one of the nodes in a VPC cluster with vpc peer-gateway enabled can have unpredictable results.

VPC failure due to single link failure

I have been working with Cisco VPC for quite some time and rarely encounter issues. Failover when one of the VPC peers fails is very fast, so overall I am quite fond of VPC. However, this case is a nice one. I had read about this possible failure once before but stumbled upon it myself today. I built the simple topology below and configured a VPC domain with all the default settings.

The ports between the access device and the Nexus 3000s are 1 Gbit copper interfaces. All ports came up/up, but for some reason the VPC didn’t come up. The physical ports on both Nexus devices were configured as below, so nothing special there.

Only after checking the VPC consistency parameters was the fault shown

Apparently the cable between N3K_1 and the access device was faulty and only capable of running at 100 Mbit instead of 1 Gbit. As the speed is part of the consistency check, the entire VPC failed. So a single link failure causes a complete failure of the VPC. Also, the output for Local Value was confusing: at the time of the command the local port was up and running.
Although it is a rare condition to have such a cable failure it could have very nasty effects.
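The logic of such a consistency check can be sketched as a comparison of two parameter sets. The parameter names and values below are invented for the example and are not actual NX-OS output; only the fact that speed is one of the compared parameters comes from the case above.

```python
def mismatches(local, peer):
    """Return {parameter: (local value, peer value)} for every difference."""
    names = sorted(set(local) | set(peer))
    return {
        name: (local.get(name), peer.get(name))
        for name in names
        if local.get(name) != peer.get(name)
    }


def vpc_can_come_up(local, peer):
    """The VPC only comes up when every compared parameter matches."""
    return not mismatches(local, peer)


# Hypothetical values: the faulty cable negotiated 100 Mbit on one side only.
n3k_1 = {"speed": "100 Mb/s", "duplex": "full", "mode": "trunk"}
n3k_2 = {"speed": "1000 Mb/s", "duplex": "full", "mode": "trunk"}

problems = mismatches(n3k_1, n3k_2)
```

One differing value is enough for `vpc_can_come_up` to return `False`, which is exactly why a single bad cable took down the whole VPC here.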