Building a REST API for ExaBGP

Over the last couple of years there has been a trend to extend layer 3 to the top-of-rack switch (TOR). This gives a more stable and scalable design compared to the classic layer 2 network design. One major disadvantage of extending layer 3 to the TOR switch is IP mobility. In the classic L2 design it was simple to live-migrate a VM to a different compute host in a different rack. When L3 is extended to the TOR, IP mobility isn't that simple anymore. A solution for this might be to let the VM host advertise a unique service IP for a particular VM when it becomes active on that host. A great tool for this use case is ExaBGP.

ExaBGP does not modify the route table on the host itself; it only announces routes to its neighbours. After ExaBGP starts, the routes it advertises can be controlled through its API: ExaBGP parses the commands that a helper process writes to STDOUT.
Below is the config used by the ExaBGP daemon:

group ebgp {
    router-id 172.16.2.11;
    neighbor 172.16.2.252 {
        local-address 172.16.2.11;
        local-as 65001;
        peer-as 65000;
        group-updates;
    }
    process add-routes {
        run /etc/exabgp/exabgp_rest3.py;
    }
}

Most of this is pretty self-explanatory; the important part is the process section (lines 9-11). These lines start a script, and all output of this script is parsed by ExaBGP.

The script exabgp_rest3.py provides a REST API and writes the announce and withdraw commands for ExaBGP to STDOUT.
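
To give an idea of the messages involved: the commands the script writes to STDOUT look like the lines below (the first one matches the ExaBGP log output shown later in this post, the second shows the optional attributes the script supports, and the last one withdraws the route again):

announce route 1.2.3.0/25 next-hop self
announce route 1.2.3.0/25 next-hop self med 200 community [ 100:400 300:600 ]
withdraw route 1.2.3.0/25 next-hop self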

For testing purposes I created a simple setup within KVM with two hosts: docker-1, which runs ExaBGP, and firewall-1, which runs the BIRD BGP daemon. There is an L2 segment between these hosts over which the BGP peering is established.

The Python script is only about 75 lines long.

#!/usr/bin/env python
import web
from sys import stdout
from netaddr import IPNetwork

urls = (
    '/announce/(.*)', 'announce',
    '/withdraw/(.*)', 'withdraw',
)


class MyOutputStream(object):
    def write(self, data):
        pass   # Ignore output


# Replace the stderr used by web.py's HTTP server to suppress its request logging.
web.httpserver.sys.stderr = MyOutputStream()


class bgpPrefix:
    def __init__(self, prefix, action="announce", next_hop="self", attributes=None):
        self.prefix = prefix
        self.action = action
        self.next_hop = next_hop
        self.attributes = attributes if attributes is not None else {}

    def get_exabgp_message(self):
        # Build the plain-text command that ExaBGP reads from this process
        if self.action == 'withdraw':
            exabgp_message = "{0} route {1} next-hop {2}".format(self.action, self.prefix, self.next_hop)
        else:
            attribute_string = ""
            for attribute in self.attributes:
                if attribute == "local-preference":
                    attribute_string += " local-preference {0}".format(self.attributes[attribute])
                elif attribute == "med":
                    attribute_string += " med {0}".format(self.attributes[attribute])
                elif attribute == "community":
                    if len(self.attributes[attribute]) > 0:
                        attribute_string += " community [ "
                        for comm in self.attributes[attribute]:
                            attribute_string += " {0} ".format(comm)
                        attribute_string += " ]"
            exabgp_message = "{0} route {1} next-hop {2}{3}".format(self.action, self.prefix, self.next_hop, attribute_string)
        return exabgp_message


def verifyIp(ip):
    # Treat a bare address as a /32 host route
    if '/' not in ip:
        ip = "{0}/32".format(ip)
    try:
        ip_object = IPNetwork(ip)
    except Exception:
        raise web.badrequest("invalid IP")
    return ip_object


class announce:
    def GET(self, prefix):
        ip_object = verifyIp(prefix)
        # Query string parameters (local-preference, med, community) become BGP attributes
        bgp_prefix = bgpPrefix(str(ip_object), action="announce", attributes=web.input(community=[]))
        stdout.write(bgp_prefix.get_exabgp_message() + '\n')
        stdout.flush()
        return "OK"


class withdraw:
    def GET(self, prefix):
        ip_object = verifyIp(prefix)
        bgp_prefix = bgpPrefix(str(ip_object), action="withdraw")
        stdout.write(bgp_prefix.get_exabgp_message() + '\n')
        stdout.flush()
        return "OK"


app = web.application(urls, globals())

if __name__ == "__main__":
    app.run()

The heavy lifting of the web service is handled by web.py, a powerful library for creating a web server in Python. I am a network engineer with very limited Python experience, but creating the script only took me a couple of hours.
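
As a quick illustration of how a VM host could drive this API from a script of its own, the snippet below announces and withdraws a service IP (a minimal sketch in Python 2, in line with the script above; the service IP and the local-preference value are made up):

import urllib2

# Announce a hypothetical service IP with a local-preference of 300 when the VM
# becomes active on this host...
print(urllib2.urlopen("http://127.0.0.1:8080/announce/192.0.2.10?local-preference=300").read())

# ...and withdraw it again when the VM moves away.
print(urllib2.urlopen("http://127.0.0.1:8080/withdraw/192.0.2.10").read())

Both calls simply return "OK" when the prefix passes validation.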

The script in action

First we start the ExaBGP daemon:

.
.
.
Mon, 29 Aug 2016 21:01:14 | INFO     | 15213  | reactor       | New peer setup: neighbor 172.16.2.252 local-ip 172.16.2.11 local-as 65001 peer-as 65000 router-id 172.16.2.11 family-allowed in-open
Mon, 29 Aug 2016 21:01:14 | WARNING  | 15213  | configuration | Loaded new configuration successfully
Mon, 29 Aug 2016 21:01:14 | INFO     | 15213  | processes     | Forked process add-routes


Mon, 29 Aug 2016 21:01:16 | INFO     | 15213  | network       | Connected to peer neighbor 172.16.2.252 local-ip 172.16.2.11 local-as 65001 peer-as 65000 router-id 172.16.2.11 family-allowed in-open (out)

By default the service listens on port 8080:

root@docker-1:/home/eelcon# netstat -anp | grep 8080
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      15183/python
root@docker-1:/home/eelcon#

The BGP neighbor is also shown as established by BIRD:

bird> show protocols all bgp3
name     proto    table    state  since       info
bgp3     BGP      master   up     15:01:33    Established
  Preference:     100
  Input filter:   ACCEPT
  Output filter:  REJECT
  Routes:         0 imported, 0 exported, 0 preferred
  Route change stats:     received   rejected   filtered    ignored   accepted
    Import updates:              0          0          0          0          0
    Import withdraws:            0          0        ---          0          0
    Export updates:              0          0          0        ---          0
    Export withdraws:            0        ---        ---        ---          0
  BGP state:          Established
    Neighbor address: 172.16.2.11
    Neighbor AS:      65001
    Neighbor ID:      172.16.2.11
    Neighbor caps:    AS4
    Session:          external AS4
    Source address:   172.16.2.252
    Hold timer:       155/180
    Keepalive timer:  51/60

bird>

Adding a route is as simple as a curl call on the host where ExaBGP is running:

nettinkerer@docker-1:~$ curl http://127.0.0.1:8080/announce/1.2.3.0/25
OK
nettinkerer@docker-1:~$

ExaBGP receives the announce message:

Mon, 29 Aug 2016 21:08:18 | INFO     | 15231  | processes     | Command from process add-routes : announce route 1.2.3.0/25 next-hop self
Mon, 29 Aug 2016 21:08:18 | INFO     | 15231  | reactor       | Route added to neighbor 172.16.2.252 local-ip 172.16.2.11 local-as 65001 peer-as 65000 router-id 172.16.2.11 family-allowed in-open : 1.2.3.0/25 next-hop 172.16.2.11
Mon, 29 Aug 2016 21:08:18 | INFO     | 15231  | reactor       | Performing dynamic route update
Mon, 29 Aug 2016 21:08:19 | INFO     | 15231  | reactor       | Updated peers dynamic routes successfully

The BGP daemon on the firewall also knows the route:

bird> show route 1.2.3.0/25 all
1.2.3.0/25         via 172.16.2.11 on ens9 [bgp3 15:08:36] * (100) [AS65001i]
        Type: BGP unicast univ
        BGP.origin: IGP
        BGP.as_path: 65001
        BGP.next_hop: 172.16.2.11
        BGP.local_pref: 100
bird>

The REST API also accepts communities and MEDs:

curl "http://127.0.0.1:8080/announce/1.2.3.0/25?med=200&comnity=100:400&community=300:600"

This is shown by the BIRD daemon as well:

bird> show route 1.2.3.0/25 all
1.2.3.0/25         via 172.16.2.11 on ens9 [bgp3 15:14:01] * (100) [AS65001i]
        Type: BGP unicast univ
        BGP.origin: IGP
        BGP.as_path: 65001
        BGP.next_hop: 172.16.2.11
        BGP.med: 200
        BGP.local_pref: 100
        BGP.community: (100,400) (300,600)
bird>

Withdrawing routes can also be done with a simple curl call:

 curl "http://127.0.0.1:8080/withdraw/1.2.3.0/25"

And the route is gone:

bird> show route 1.2.3.0/25 all
Network not in table
bird>

At the moment there is only limited input validation: the REST API checks whether the IP address entered is valid, but no other checks are implemented. I might add more if the need arises.
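
For example, announcements could be restricted to a known service range by extending verifyIp along these lines (just a sketch, not part of the published script; the 1.2.3.0/24 range is only an example):

ALLOWED_RANGE = IPNetwork("1.2.3.0/24")    # example range for service prefixes

def verifyIp(ip):
    if '/' not in ip:
        ip = "{0}/32".format(ip)
    try:
        ip_object = IPNetwork(ip)
    except Exception:
        raise web.badrequest("invalid IP")
    if ip_object not in ALLOWED_RANGE:     # reject anything outside the service range
        raise web.badrequest("prefix not allowed")
    return ip_object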

The script and configs used in this blog can be found on my GitHub.

 

ERSPAN on the Nexus 7000

To troubleshoot some performance issues a SPAN port was required on a Nexus 7000. Of course the port to span was not located on the same switch as the SPAN destination.

On the Nexus 7000 it is not possible to use an RSPAN VLAN as a SPAN destination; it can only be used as a SPAN source. So this was not an option.

ERSPAN can be used as a SPAN destination, but the N7K where the ERSPAN traffic needed to be decapsulated and sent to the monitoring tool didn't have the correct software to do this. So again not a feasible solution.

However, it is possible to give the monitoring tool the IP address of the ERSPAN destination and place it in a segment reachable by the N7K generating the ERSPAN traffic.

The basic configuration looks like this:

monitor session 10 type erspan-source
  erspan-id 10
  vrf span
  destination ip 10.1.11.40
  source interface port-channel2 both
  no shut

In the admin VDC the source IP for the ERSPAN traffic needs to be specified:

monitor erspan origin ip-address 1.1.1.1 global

I am not sure why this needs to be configured in the admin VDC.
Give a simple Linux VM the IP address 10.1.11.40 and capture the data with tcpdump:

 tcpdump -i eth3 -s 300 -c 10000 proto gre -w GRE.CAP

ERSPAN uses the GRE protocol to encapsulate the packets and send them to the collector, so we filter on GRE.
Opening the file in Wireshark shows us the data received. In the red box the ERSPAN traffic can be seen and in the blue box the actual encapsulated packets.
[Screenshot: ERSPAN traffic in Wireshark]
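
For a quick sanity check of the capture without opening Wireshark, something like the snippet below can be used (a sketch using the dpkt library, which is not part of the original post; GRE.CAP is the file written by the tcpdump command above):

import dpkt

# Count how many packets in the capture are GRE (IP protocol 47), i.e. the ERSPAN traffic.
gre_count = 0
with open("GRE.CAP", "rb") as f:
    for ts, buf in dpkt.pcap.Reader(f):
        eth = dpkt.ethernet.Ethernet(buf)
        if isinstance(eth.data, dpkt.ip.IP) and eth.data.p == dpkt.ip.IP_PROTO_GRE:
            gre_count += 1
print("GRE packets in capture: {0}".format(gre_count))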

NSX vs MP-BGP VXLAN

Recently I have been following the VMware VCP-NV course and have been reading about the VXLAN MP-BGP EVPN control plane. In this post I will give a very brief overview of the layer 2 operation of both solutions.
Multicast no longer required

In previous implementations of VXLAN, BUM (Broadcast, Unknown unicast, Multicast) traffic was sent via multicast to all VTEPs which might be interested in these packets. Multicast, especially L3 multicast, is rare in a datacenter and the dependency on multicast was a huge limitation for the adoption of VXLAN in the datacenter.

Both solutions no longer require multicast to handle BUM traffic. Via the control plane each VTEP knows about the other VTEPs interested in traffic for a particular VNI. BUM traffic is replicated by the local VTEP as multiple unicast packets, one to each of the other VTEPs.
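
As a rough illustration of this head-end replication (a conceptual Python sketch, not how either product actually implements it; all names and addresses are made up):

# The control plane tells each VTEP which remote VTEPs participate in a VNI.
vtep_flood_list = {5001: ["10.0.0.2", "10.0.0.3"]}    # VNI -> remote VTEP IPs

def replicate_bum(vni, frame, send_vxlan_unicast):
    # Instead of one multicast packet, send one unicast VXLAN copy per remote VTEP.
    for remote_vtep in vtep_flood_list.get(vni, []):
        send_vxlan_unicast(remote_vtep, vni, frame)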

NSX also has a hybrid mode. In hybrid mode, BUM traffic destined for VTEPs in the local VTEP segment is sent via a local multicast group. Traffic towards remote VTEP segments is forwarded to the forwarder in the remote segment, which replicates the packet as a multicast packet in its local segment.

Besides learning which VTEPs are interested in a particular VXLAN segment, the control plane is also used to propagate MAC reachability between the VTEPs. The control plane removes the need for a flood-and-learn mechanism for MAC learning.

Open vs closed control plane

The control plane used by NSX is a proprietary protocol. The VTEPs on the ESXi hosts can only work with the NSX controllers. At this moment the only hardware switch which can be part of the NSX VXLAN cloud is from Arista. The controllers used by NSX are VMs running on the control cluster: dedicated ESXi machines running the various control functions within an NSX deployment. At least three controllers are required and they should not run on the same ESXi host.
The MP-BGP VXLAN solution is based on open standards. The EVPN address family of BGP is used to propagate all the required information, such as VNIs and MAC reachability, between the VTEPs. Vendors like Juniper, Huawei, Cisco and Alcatel-Lucent already support this. Although it should be possible to create a full mesh of iBGP sessions between the VTEPs, it seems logical to use BGP route reflectors for scalability.

EANTC has done some interoperability testing with the vendors above and made the white paper available.

In a next post I will describe how routing has been implemented by NSX and by the VXLAN MP-BGP solution.

Juniper multicast firewall heartbeat traffic and Nexus switches

Recently we moved the cluster traffic of a Juniper SRX cluster from Catalyst 3750 ports to Nexus 5000 FEX ports. At that moment the cluster broke. Apparently the Junipers use multicast for the heartbeat traffic. Adding an IGMP querier to the VLAN did not resolve the problem: apparently the Junipers do not send an IGMP join for the heartbeat traffic.
Catalyst switches seem to be content with flooding the multicast traffic when there is no IGMP querier present, whereas Nexus switches drop all multicast traffic when the querier is absent.

Disabling IGMP snooping on the heartbeat VLAN solved it.
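
For reference, on NX-OS this comes down to something like the following (VLAN 100 is just a placeholder for the heartbeat VLAN, and the exact configuration mode differs per platform and release):

vlan configuration 100
  no ip igmp snooping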

Updates from Cisco Technology Summit

Last week the Cisco Technology Summit in the Netherlands took place. I attended a very interesting session about VXLAN and how MP-BGP is used as a control plane. In this new setup BGP is used to signal MAC address reachability between the VTEPs. In the original VXLAN implementations multicast was required for BUM traffic. I will dive deeper into this.

Oh, and finally: in NX-OS 7.2 L3 over VPC will be supported.

VPC-peer gateway and F5

Recently I was involved in a migration project from a pair of Catalyst 6500s towards a pair of Nexus 7000s. Part of it was moving the SVIs from the Catalysts to the Nexus switches. As HSRP was used on each SVI to be moved, it was decided to use the following logic:

  1. Move HSRP Active role to Cat6500_1
  2. Disable SVI on Cat6500_2
  3. Enable SVI on N7K_2
  4. Move HSRP Active role to N7K_2
  5. Disable SVI on Cat6500_1
  6. Enable SVI on N7K_1

Things went wrong after step 4: the F5 BIG-IP lost its network connectivity. After some troubleshooting it appeared that the F5 uses the source MAC of the incoming packets as the destination MAC of the reply packets. F5 calls this feature Auto Last Hop.

I hear some of you screaming that I should have used VPC peer-gateway to solve that. Well, that was already enabled and it was the cause of the problem.

The F5 was connected behind the Cat6500 and the Cat6500 was connected with a VPC to the new N7K pair.

As return traffic from the F5 was hashed by the 6500 over the port-channel members towards either one of the N7Ks, there was a 50% chance it would be switched to N7K_1. Due to the vpc peer-gateway feature, N7K_1 also responds to traffic destined for the MAC address of its VPC peer. The problem was that N7K_1 didn't have an SVI yet to route that traffic, which resulted in connectivity issues.
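
The failure mode can be summarised in a few lines of purely illustrative Python; this is not how NX-OS is implemented, it just captures the decision that bit us:

def handle_routed_frame(dst_mac, vlan, own_gateway_macs, peer_gateway_macs, has_local_svi):
    if dst_mac in own_gateway_macs:
        return "route locally"
    if dst_mac in peer_gateway_macs:
        # vpc peer-gateway: also terminate frames addressed to the peer's gateway MAC,
        # but that only helps if this switch has an SVI in the VLAN (N7K_1 did not yet).
        return "route locally" if has_local_svi(vlan) else "black-holed"
    return "bridge as usual"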
Below is some sample output showing this behaviour. We have a standard HSRP configuration on VLAN 100; HSRP group 1 maps to MAC 0000.0c9f.f001.

N7K_2#  sh hsrp interface vlan 100 brief
                     P indicates configured to preempt.
                     |
Interface   Grp Prio P State    Active addr      Standby addr     Group addr
Vlan100     1   100    Active   local            10.100.25.2      10.100.25.1     (conf)
N7K_1# sh hsrp interface vlan 100 brief
                     P indicates configured to preempt.
                     |
Interface   Grp Prio P State    Active addr      Standby addr     Group addr
Vlan100     1   100    Standby  10.100.25.3      local            10.100.25.1     (conf)

The MAC addresses of the SVIs of both Nexus devices:

N7K_2# sh int vlan 100
Vlan100 is up, line protocol is up
  Hardware is EtherSVI, address is  8478.ac57.1142
  Internet Address is 10.100.25.3/24
 ...

N7K_1# sh int vlan 100
Vlan100 is up, line protocol is up
  Hardware is EtherSVI, address is  8478.ac0a.edc2
  Internet Address is 10.100.25.2/24
 ...

The MAC address tables of both devices show that they both respond to the HSRP MAC and to the MAC of each other's SVI. This is indicated by the G flag.

N7K_1# sh mac address-table vlan 100
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link,
        (T) - True, (F) - False
   VLAN     MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
G 100      0000.0c9f.f001    static       -       F    F  vPC Peer-Link(R)
G 100      8478.ac0a.edc2    static       -       F    F  sup-eth1(R)
G 100      8478.ac57.1142    static       -       F    F  vPC Peer-Link(R)
.
.
N7K_2# sh mac address-table vlan 100
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link,
        (T) - True, (F) - False
   VLAN     MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
G 100      0000.0c9f.f001    static       -       F    F  sup-eth1(R)
G 100      8478.ac0a.edc2    static       -       F    F  vPC Peer-Link(R)
G 100      8478.ac57.1142    static       -       F    F  sup-eth1(R)
N7K_2

This is all expected behaviour. Now we shut the SVI on N7K_1 and look at its MAC address table:

N7K_1# conf t
Enter configuration commands, one per line.  End with CNTL/Z.
N7K_1(config)# int vlan 100
N7K_1(config-if)# shut
N7K_1(config-if)# end
N7K_1# sh mac address-table vlan 100
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link,
        (T) - True, (F) - False
   VLAN     MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 100      0000.0c9f.f001    static       -       F    F  vPC Peer-Link
G 100      8478.ac57.1142    static       -       F    F  vPC Peer-Link(R)
.
.

As you can see, the HSRP MAC is no longer handled locally, but the MAC of the SVI of the VPC peer still is.

I can't imagine why Cisco decided to keep handling traffic for the peer's SVI MAC address when the SVI is down. With the SVI down, the switch no longer responds to the HSRP MAC address of the VPC peer, so clients which ARP for the default gateway and send return traffic to the HSRP MAC address work fine.

In short, shutting down an SVI on one of the nodes in a VPC cluster with vpc peer-gateway enabled can have unpredictable results.

VPC failure due to single link failure

I have been working with Cisco VPC for quite some time and rarely encounter issues; failover when one of the VPC peers fails is very fast, so overall I am quite fond of VPC. However, this case is a nice one. I had read about this possible failure once before but stumbled upon it myself today. I built the simple topology below, with a VPC domain configured with all the default settings.

The ports between the access device and the Nexus 3000s are 1 Gbit copper interfaces. All ports came up/up, but for some reason the VPC didn't come up. The physical ports on both Nexus devices were configured as below, so nothing special there.

interface ethernet 1/48
  channel-group 100 mode active
!
interface port-channel100
  switchport mode trunk
  vpc 100

Only after checking the VPC consistency parameters was the fault shown:

N3K_2# sh vpc consistency-parameters vpc 100

Legend:
Type 1 : vPC will be suspended in case of mismatch

Name                        Type  Local Value            Peer Value
-------------               ----  ---------------------- -----------------------
Shut Lan                    1     No                     No
STP Port Type               1     Default                Default
STP Port Guard              1     None                   None
STP MST Simulate PVST       1     Default                Default
lag-id                      1     -                      [(c8, 0-23-4-ee-be-14,
                                                         8064, 0, 0), (8000,
                                                         0-26-98-1a-b5-c1, 63,
                                                         0, 0)]
mode                        1     -                      active
Speed                       1     -                      100 Mb/s
Duplex                      1     -                      full
Port Mode                   1     -                      trunk
Native Vlan                 1     -                      1
MTU                         1     -                      1500
Admin port mode             1     -
vPC card type               1     Empty                  Empty
Allowed VLANs               -     1-4094                 1-4094
Local suspended VLANs       -     -                      -
N3K_2#

Apparently the cable between N3K_1 and the access device was faulty and only capable of running 100 Mbit instead of 1 Gbit. As the speed is part of the consistency check, the entire VPC failed. So a single link failure caused a complete failure of the VPC. Also, the output for the Local Value was confusing: at the time of the command the local port was up and running.
Although it is a rare condition to have such a cable failure, it can have very nasty effects.