vSphere Performance monitoring with Telegraf, InfluxDB and Grafana 7 - Intro

Aug 6, 2020 (Last modified: Aug 6, 2020) · vsphere grafana influx telegraf

Overview

Intro

It's been a while since I blogged about vSphere performance monitoring, and it's been three years since I started this journey, so I thought I'd revisit the topic in this new series.

In my previous posts on this topic I have mainly used my own PowerShell scripts for pulling data from vCenter. I have also used the vSphere plugin for Telegraf a bit, and this is what we will use in this new "blog series within a blog series". We will also use the new Grafana 7, which was released in May, and as the database we will continue with InfluxDB (version 1.x).

When I started planning this new blog series I had a short three- or four-part series in mind, but it will end up being quite a bit bigger than that. Most of the posts should give valid pointers on their own, so feel free to jump in where you'd like. You can find all parts here; that page will be updated as posts are published.

Components

The "bill-of-materials" for this and following posts will be:

A CentOS 7.x VM running Telegraf, InfluxDB (version 1.x) and Grafana 7
vCenter Server (I have 7.0 installed, but 6.5 and 6.7 should work just as well)

Polling data with scripts vs Telegraf

I've previously stated that I'd rather use PowerShell scripts for pulling data than the Telegraf agent, and I still feel that this is valid, but as with everything else, it depends. Let me start by mentioning that when I started my vSphere performance monitoring project the vSphere plugin for Telegraf didn't exist. Had it existed, I'd probably have used it from the get-go.

The agent uses the vSphere SDK and offers the ability to include/exclude specific metrics as well as differentiate polling intervals on metrics.
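To make the include/exclude idea concrete, here is a minimal Python sketch of glob-style metric filtering. It only illustrates the concept and is not how Telegraf implements it; the function name `select_metrics` and the pattern lists are assumptions made up for this example, while the metric names follow the usual vSphere counter naming.

```python
from fnmatch import fnmatchcase


def select_metrics(available, include=None, exclude=None):
    """Keep metrics that match any include pattern and no exclude pattern.

    In this sketch an empty include list means "collect everything".
    """
    include = include or ["*"]
    exclude = exclude or []
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in include)
        and not any(fnmatchcase(name, pattern) for pattern in exclude)
    ]


# Example counters as they could be reported for a VM.
available = [
    "cpu.ready.summation",
    "cpu.usage.average",
    "mem.active.average",
    "net.usage.average",
]

# Hypothetical filter: all CPU and memory counters, except CPU usage.
selected = select_metrics(
    available,
    include=["cpu.*", "mem.*"],
    exclude=["cpu.usage.*"],
)
print(selected)  # ['cpu.ready.summation', 'mem.active.average']
```

Different polling intervals could be handled the same way, by keeping one such filter per interval.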

The scripts I've used also use the vSphere SDK, but through PowerCLI. They pull the same data, they can pull specific metrics, and you control the collection interval. I'm no longer actively using these scripts, but they are available for everyone on GitHub.

In many cases (if not most) and in many environments the Telegraf agent will suit you just fine. Both the agent and its plugins are developed and maintained by InfluxData, and they accept community contributions on GitHub.

Telegraf agent

Without having measured it, I am pretty sure the Telegraf agent is much more efficient than the PowerShell scripts. As it should be, considering who is behind each of them. One of the big issues with using my scripts in the environment I had at the time was scaling.

I had over 4000 VMs to pull data from and had to set up a lot of jobs, each pulling a different part of the environment. I haven't used Telegraf in an environment with those numbers yet, so I don't have any data to compare, but again I suspect that Telegraf will be more efficient. How the two compare in manageability at that scale I'm not sure, as I guess you might need multiple Telegraf instances running, just as with the scripts.

So far we've discussed the pros of Telegraf. Let's look at some of the advantages of the scripts.

Scripts / DIY

I have a few points that could make you consider going down the script route. First, by pulling the data "yourself" you have more control over it.

You can name the measurements and metadata as you'd like. Maybe you pull data from other environments and need to combine them in the same database and table/measurement?
You can do pre-calculations before writing, e.g. CPU Ready, which we often discuss as a percentage value but pull as milliseconds.
It's more likely that you'll pull only the metrics you'll use. You can include/exclude metrics in Telegraf as well, but I suspect most people pull everything by default.
You can make use of tagging to support your own logic. Maybe you are a service provider and want to tag VMs with a customer ID and not rely on naming conventions? Or you want to build in a direct link between a SAN and a VM so you can easily correlate the two.
As of this writing the Telegraf agent doesn't pull the vCenter appliance specific metrics, nor the vSAN specific metrics. For this I've created a different set of scripts. If you want to have the same method pulling both vSphere metrics and VCSA metrics, then the script method is your option.
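As an illustration of the pre-calculation and tagging points above, here is a minimal Python sketch (not taken from my PowerShell scripts). It converts a CPU Ready summation value in milliseconds to a percentage using the commonly cited formula, ready ms / (interval in seconds * 1000) * 100, and formats the result as an InfluxDB line protocol string with a custom tag. The 20-second default is the vCenter real-time sample interval; the measurement name `vm_cpu`, the tag `customer_id`, the VM name and the timestamp are made-up assumptions for the example.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU Ready summation (ms) to a percentage of the interval."""
    return ready_ms * 100 / (interval_s * 1000)


def _escape(value, chars=(",", "=", " ")):
    """Backslash-escape the characters that line protocol treats as separators."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line protocol record. This sketch handles float fields only."""
    tag_str = ",".join(
        f"{_escape(key)}={_escape(val)}" for key, val in sorted(tags.items())
    )
    field_str = ",".join(
        f"{_escape(key)}={float(val)}" for key, val in fields.items()
    )
    # Measurement names only need commas and spaces escaped.
    name = _escape(measurement, chars=(",", " "))
    return f"{name},{tag_str} {field_str} {timestamp_ns}"


# Hypothetical sample: 1000 ms of CPU Ready during one 20-second interval.
ready_pct = cpu_ready_percent(1000)
line = to_line(
    "vm_cpu",
    {"vm": "web01", "customer_id": "c042"},  # custom tag instead of a naming convention
    {"ready_pct": ready_pct},
    1596700800000000000,  # arbitrary nanosecond timestamp
)
print(line)
# vm_cpu,customer_id=c042,vm=web01 ready_pct=5.0 1596700800000000000
```

Note that the summation is the total across all vCPUs of a VM, so for multi-vCPU VMs you may want to divide the result by the vCPU count as well.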

And, not to forget, it's kind of fun getting your hands dirty doing the logic yourself, and it will bring you closer to the metrics.

In the end you will have to consider what's most important for you. In this new blog series we will make use of the Telegraf agent. If you want to check out the scripts, I suggest that you start here.

Other resources

As always, there are other great minds who have done similar work that could also be worth checking out, like this from my fellow vExpert Jorge de la Cruz.

Next part

In the next part we will take a closer look at Telegraf and its configuration options.

Thanks for reading, and if you have any questions or comments please feel free to contact me.

This page was modified on August 6, 2020: Fixed title