Aviatrix alerting, easy with Microsoft Teams
Welcome to my first post since I joined Aviatrix.
What is Aviatrix, for the first-time reader?
In very simple words:
- Cisco/Juniper = the foundation for On-Prem solutions and traditional Datacenters.
- Aviatrix steps in when it comes to the world of Multicloud.
Connectivity, Visibility, Security in the datapath, Cloud born and IaC powered (Terraform).
Spoiler Alert - Security on top
Log4J Detect & Block with Aviatrix inside your Cloud Environment
Coming back on topic…
Any system you deploy comes with its own alerting capabilities.
Most of them start with the ability to send emails to a NOC or operations list, where the Engineer On Duty usually monitors the events and takes action.
Simple, right?
Not when you’re growing and you have multiple departments and solutions running in your company… each one sending you an email for every event that takes place.
A “normal” day in the life of an engineer can easily turn into this (yes, that is my very own mailbox):
I’ve been through this, especially when coming back after a bank holiday, and it has always been painful to figure out what is still DOWN.
It was also challenging not to miss some alerts while rushing through emails, only to be called later in the day and asked why connectivity between some application components still does not work => Headache moment…
What if there was a different way of doing it?
What about the ChatOps model?
MS Teams, Slack, Webex and the list goes on.
A pop-up you can never easily ignore :)
Scroll down to see how you can both personalise your Aviatrix alerts and have them delivered to an MS Teams channel.
The first problem most people like me run into is: which alarms are important to know about?
The basics will always involve environment factors: CPU, Memory, Hard Disk Usage…
What about pure cloud-related metrics?
When running Aviatrix Gateways in the Cloud it is important to know when you are nearing CSP (AWS, Azure, GCP, etc.) limits.
Each Cloud publishes (more or less) some values BUT sometimes leaves you to discover on your own when the environment needs scaling.
Not really the dream of any Operations Engineer, is it?
To tackle this, Aviatrix has a few metrics available to help you out:
- Bandwidth Ingress Limit Exceeded Rate
- Bandwidth Egress Limit Exceeded Rate
- PPS Limit Exceeded Rate
- Conntrack Limit Exceeded Rate
- Rate of packets dropped while Receiving
- Rate of packets dropped during Transmission
Let’s start configuring things and get a nice pop-up message in MS Teams when such an event occurs.
Set up a Channel in MS Teams
You can also use an existing one.
Add a Channel to your new Team
Add a Webhook Integration
MS Teams will receive Alerts/Events from Aviatrix via HTTP POST requests (callbacks).
For this, Aviatrix needs a URL to send these messages to.
The URL is generated as part of the MS Teams setup, using an existing Microsoft plugin (Incoming Webhook).
The URL gets generated.
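Before pointing CoPilot at the new URL, it can help to check that the webhook accepts a card at all. The following is a minimal sketch, assuming the URL takes a JSON MessageCard over HTTP POST as described above; the function names `build_message_card` and `post_card` are made up for this example, and the URL in the usage comment is a placeholder, not a real endpoint.

```python
import json
import urllib.request


def build_message_card(title: str, text: str, color: str = "0584ED") -> dict:
    """Build a minimal MessageCard payload (illustrative helper, not an Aviatrix API)."""
    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "themeColor": color,
        "summary": title,
        "sections": [
            {"activityTitle": title, "activitySubtitle": text, "markdown": True}
        ],
    }


def post_card(webhook_url: str, card: dict, timeout: float = 10.0) -> int:
    """POST the card as JSON to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(card).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (replace the placeholder with the URL generated in MS Teams):
# status = post_card("https://example.invalid/your-webhook-url",
#                    build_message_card("Test alert", "Hello from a test script"))
```

If the card shows up in the channel, the Teams side is ready and anything that goes wrong later is on the sender side.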
Here you will configure Aviatrix CoPilot to send alerts.
Go to the Aviatrix CoPilot Dashboard
Navigate to Settings, then Notifications
Configure Notification Content & Destination = Teams URL
Content
Alarm Template message.
There are two classes of variables: alert and event.
Each one has different attributes (alert.metric, alert.unit, etc.).
These get replaced by actual values depending on the alarm type (see the preview below under Alarm Destination).
The "@type": "MessageCard" entry comes from Microsoft’s official documentation on the accepted Webhook formats.
MS Documentation for more info
{
    "@type": "MessageCard",
    "@context": "http://schema.org/extensions",
    "themeColor": "0584ED",
    "summary": "The Alarm that keeps you awake at night",
    "sections": [{
        "activityTitle": "**Wake UP and FIX the ISSUE**",
        "activitySubtitle": "Alert {{alert.name}} triggered for {{alert.metric}} at threshold {{alert.threshold}}{{alert.unit}} on {{event.newlyAffectedHosts}}",
        "activityImage": "https://wallpaperaccess.com/full/522616.jpg",
        "markdown": true
    }]
}
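To get a feel for how the alert and event variables end up in the message, here is a small local preview sketch. It is not CoPilot’s own renderer: the `render` and `lookup` helpers are hypothetical, the double-curly-brace placeholder syntax and the comma-joined list formatting are assumptions, and the sample values are taken from the captured payload shown further down in this post.

```python
import re

# Matches placeholders such as {{alert.name}} or {{ event.newlyAffectedHosts }}
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(payload: dict, dotted_path: str):
    """Walk a dotted path like 'alert.name' through nested dictionaries."""
    value = payload
    for key in dotted_path.split("."):
        value = value[key]
    return value


def render(template: str, payload: dict) -> str:
    """Replace every known placeholder; unknown ones are left untouched."""
    def replace(match: re.Match) -> str:
        try:
            value = lookup(payload, match.group(1))
        except (KeyError, TypeError):
            return match.group(0)
        if isinstance(value, list):
            # Assumption: lists are shown comma-separated
            return ", ".join(str(item) for item in value)
        return str(value)

    return PLACEHOLDER.sub(replace, template)


sample = {
    "alert": {"name": "High CPU Usage", "metric": "CPU Utilization",
              "threshold": 80, "unit": "%"},
    "event": {"newlyAffectedHosts": ["spoke1"]},
}
subtitle = ("Alert {{alert.name}} triggered for {{alert.metric}} at threshold "
            "{{alert.threshold}}{{alert.unit}} on {{event.newlyAffectedHosts}}")
print(render(subtitle, sample))
```

The preview inside CoPilot remains the reference; this only helps when drafting a template offline.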
Alarm Destination
Here you will use the URL you generated earlier in MS Teams through the Incoming Webhook integration.
Configure which Notifications (Alerts) to send to the MS Teams destination (Webhook)
CoPilot -> Notifications
Result in MS Teams
Extras
What is the Webhook sending in practice?
Configure the URL of a machine you have access to as the Notification Receiver (a plain, simple Linux box which is Internet-facing).
On that machine, run a simple netcat listener on port 80.
# nc -l -p 80
POST / HTTP/1.1
Content-Type: application/json
host: please.no-ip.biz
accept: application/json
content-length: 518
Connection: close
{"alert":{"closed":false,"metric":"CPU Utilization","name":"High CPU Usage","status":"OPEN","threshold":80,"unit":"%"},"event":{"receiveSeparateAlert":false,"exceededOrDropped":"Exceeded","matchingHosts":["spoke1","spoke1-hagw"],"newlyAffectedHosts":["spoke1"],"recoveredHosts":["spoke2"],"message":"Alert Updated","timestamp":"2021-11-25T07:18:12.230Z"},"webhook":{"name":"fdfsd","secret":"secret","tags":[],"url":"http://please.no-ip.biz"},"custom_message":"Received CPU Utilization at \"2021-11-25T07:18:12
Webhook to Syslog
We send the Webhook to a Python script.
The script converts it into an SNMP Trap and a Syslog message and relays them further.
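The script itself is not shown in this post, so here is a minimal sketch of the Syslog half only, using nothing but the Python standard library. The SNMP Trap part is left out because it needs a third-party library. The listening port, the syslog address and the line format are assumptions chosen for the example, and the field names follow the captured payload above.

```python
import json
import logging
import logging.handlers
from http.server import BaseHTTPRequestHandler, HTTPServer


def to_syslog_line(payload: dict) -> str:
    """Flatten the interesting webhook fields into one key=value line."""
    alert = payload.get("alert", {})
    event = payload.get("event", {})
    hosts = ",".join(event.get("newlyAffectedHosts", [])) or "-"
    return "aviatrix-alert name={!r} metric={!r} status={} threshold={}{} hosts={}".format(
        alert.get("name", ""), alert.get("metric", ""), alert.get("status", ""),
        alert.get("threshold", ""), alert.get("unit", ""), hosts,
    )


def make_syslog_logger(address=("127.0.0.1", 514)) -> logging.Logger:
    """Return a logger that forwards records to a syslog server over UDP."""
    logger = logging.getLogger("aviatrix.webhook")
    logger.setLevel(logging.WARNING)
    logger.addHandler(logging.handlers.SysLogHandler(address=address))
    return logger


class WebhookHandler(BaseHTTPRequestHandler):
    syslog = None  # assigned in serve() before the server starts

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        self.syslog.warning(to_syslog_line(payload))
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080, syslog_address=("127.0.0.1", 514)) -> None:
    """Listen for webhook POSTs and relay each one to syslog (blocks forever)."""
    WebhookHandler.syslog = make_syslog_logger(syslog_address)
    HTTPServer(("", port), WebhookHandler).serve_forever()


# Usage: call serve() on the receiving machine and set its URL as the webhook destination.
```

A single key=value line keeps the message easy to filter on the syslog side.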
Azure Function for alternate Network Path in case of complete failure
Logic Flow:
- Spoke Gateways (both if HA, single if non-HA) go down because of a major outage / corner case scenario.
- Notification is sent to an Azure Function.
- Azure Function calls the Aviatrix Controller API.
- Aviatrix Controller attaches the Spoke VPC with the failed gateways natively to the Aviatrix Transit VPC.
- Connectivity is restored.
When the Spoke Gateway comes back up, the change is undone and the same notification logic is followed.
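The decision part of that flow can be sketched in a few lines. This is an illustration under several assumptions, not the actual Azure Function: it assumes that `event.matchingHosts` lists the gateways currently matching a "gateway down" alert, that the function keeps track of whether the failover is active, and that `call_controller` is a caller-supplied placeholder standing in for the real Aviatrix Controller API request, which is not shown here. The action names are made up for the example.

```python
from typing import Callable, Iterable, Optional


def decide_action(payload: dict, spoke_gateways: Iterable[str],
                  failed_over: bool) -> Optional[str]:
    """Return 'attach_native', 'detach_native' or None (hypothetical action names)."""
    gateways = set(spoke_gateways)
    alert = payload.get("alert", {})
    event = payload.get("event", {})
    down = set(event.get("matchingHosts", []))
    # All gateways of the Spoke are down and the alert is still open
    all_down = bool(gateways) and gateways <= down and not alert.get("closed", False)
    if all_down and not failed_over:
        return "attach_native"   # fail over to the native attachment
    if failed_over and not all_down:
        return "detach_native"   # at least one gateway is back: undo the change
    return None


def handle_notification(payload: dict, spoke_gateways: Iterable[str], failed_over: bool,
                        call_controller: Callable[[str], None]) -> bool:
    """Run the decision, invoke the placeholder controller call, return the new state."""
    action = decide_action(payload, spoke_gateways, failed_over)
    if action is not None:
        call_controller(action)
    if action == "attach_native":
        return True
    if action == "detach_native":
        return False
    return failed_over


# Usage with a stand-in controller call that only prints the chosen action:
outage = {"alert": {"closed": False},
          "event": {"matchingHosts": ["spoke1", "spoke1-hagw"]}}
state = handle_notification(outage, ["spoke1", "spoke1-hagw"], False, print)
```

Keeping the decision separate from the controller call makes the logic easy to test before it is wired to a real environment.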
What can you use this for:
- scaling up a gateway instance when it reaches its maximum capacity
- self healing in case of a disaster (a corner case like the one mentioned above, where all gateways in a Spoke are down)
- any type of scenario requiring an automated response to an event in your infrastructure
The next article will cover this topic in detail.
Extra - ThreatGuard / ThreatIQ - secure your workloads
Spoiler Alert [BotNets, Log4J, Data Exfiltration, Malware]
Blocking Malicious Traffic for Cloud Workloads