vSphere Performance data - Part 3 - Get-Stat
This is Part 3 of my series on vSphere performance data.
Part 1 discussed the project. Part 2 looked at the different methods of retrieving data and ended with me realizing I would use Get-Stat against all (4000) VMs. Part 2 was posted over a month ago; since then I have been busy preparing for the VCP 6.5 DCV exam (which I passed, btw) and upgrading/migrating our vCenter servers. I have still been able to do a lot of work on this project, so there will be some updates in the next couple of days.
Previously I had done some benchmarks on retrieving data from VMs using PowerCLI and the Get-Stat cmdlet, landing on roughly 1 second per VM to retrieve and process the metrics I wanted. As discussed in Part 2, that adds up to 4000 seconds for all VMs. With my goal of retrieving all 20-second metrics within 5 minutes (300 seconds), I would need around 14 scripts running simultaneously (4000 / 300 ≈ 13.3, rounded up).
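The back-of-the-envelope calculation can be written out as below. This is only a sketch of the arithmetic; the variable names are mine, and the 1 second per VM is the rough benchmark figure mentioned above, not a guaranteed number.

```powershell
# Rough sizing: how many scripts must run in parallel to cover all VMs
# within one collection window? (All figures are the estimates from the text.)
$vmCount       = 4000      # number of VMs
$secondsPerVM  = 1         # benchmarked time to retrieve and process one VM
$windowSeconds = 5 * 60    # the script should finish within 5 minutes

$totalSeconds = $vmCount * $secondsPerVM
$workers      = [math]::Ceiling($totalSeconds / $windowSeconds)

"{0} seconds in total, {1} parallel scripts needed" -f $totalSeconds, $workers
# -> 4000 seconds in total, 14 parallel scripts needed
```

Changing $secondsPerVM shows how sensitive the result is: at 0.5 seconds per VM the same calculation gives 7 scripts.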
There are (at least) a couple of things about this that worry me. The first is the management and operation of 14 scripts running every 5 minutes; the second is the potential extra load this will put on vCenter and the environment.
Anyway, I started out building a script.
There are lots of resources on using Get-Stat. One that does a great job of explaining the basics is this one from LucD; even though it's from 2009, it's still valid. Another one that covers a project similar to mine is this one from orchestration.io.
I already had some thoughts on the need for parallelization, but decided to build a script that didn't care about that at first.
(To begin exploring Get-Stat and the other stat cmdlets, you should read the blog post from LucD referred to above.)
First I started out by checking the different stats that are available for a VM:
PS C:\> Get-StatType -Entity VM2012
cpu.usage.average
cpu.usagemhz.average
cpu.ready.summation
mem.usage.average
mem.swapinRate.average
mem.swapoutRate.average
mem.vmmemctl.average
mem.consumed.average
mem.overhead.average
disk.usage.average
disk.maxTotalLatency.latest
disk.numberReadAveraged.average
disk.numberWriteAveraged.average
net.usage.average
sys.uptime.latest
virtualDisk.numberReadAveraged.average
virtualDisk.numberWriteAveraged.average
virtualDisk.totalReadLatency.average
virtualDisk.totalWriteLatency.average
datastore.numberReadAveraged.average
datastore.numberWriteAveraged.average
datastore.totalReadLatency.average
datastore.totalWriteLatency.average
disk.used.latest
disk.provisioned.latest
disk.unshared.latest
As I've described in the previous posts in this series, we already have some performance dashboards, so we had a fairly clear understanding of which metrics we would want to use:
- CPU usage/utilization
- CPU Ready & Latency
- MEM usage/utilization
- Network throughput
- Disk throughput (kBps & IOPS)
- Storage latency
The list of stat types above is missing several of these. It turns out that you need to add the -Realtime switch to get access to the missing ones (and a lot more):
PS C:\> Get-StatType -Entity VM2012 -Realtime | sort
cpu.costop.summation
cpu.demand.average
cpu.demandEntitlementRatio.latest
cpu.entitlement.latest
cpu.idle.summation
cpu.latency.average
cpu.maxlimited.summation
cpu.overlap.summation
cpu.readiness.average
cpu.ready.summation
cpu.run.summation
cpu.swapwait.summation
cpu.system.summation
cpu.usage.average
cpu.usagemhz.average
cpu.used.summation
cpu.wait.summation
datastore.maxTotalLatency.latest
datastore.numberReadAveraged.average
datastore.numberWriteAveraged.average
datastore.read.average
datastore.totalReadLatency.average
datastore.totalWriteLatency.average
datastore.write.average
disk.busResets.summation
disk.commands.summation
disk.commandsAborted.summation
disk.commandsAveraged.average
disk.maxTotalLatency.latest
disk.numberRead.summation
disk.numberReadAveraged.average
disk.numberWrite.summation
disk.numberWriteAveraged.average
disk.read.average
disk.usage.average
disk.write.average
mem.active.average
mem.activewrite.average
mem.compressed.average
mem.compressionRate.average
mem.consumed.average
mem.decompressionRate.average
mem.entitlement.average
mem.granted.average
mem.latency.average
mem.llSwapInRate.average
mem.llSwapOutRate.average
mem.llSwapUsed.average
mem.overhead.average
mem.overheadMax.average
mem.overheadTouched.average
mem.shared.average
mem.swapin.average
mem.swapinRate.average
mem.swapout.average
mem.swapoutRate.average
mem.swapped.average
mem.swaptarget.average
mem.usage.average
mem.vmmemctl.average
mem.vmmemctltarget.average
mem.zero.average
mem.zipped.latest
mem.zipSaved.latest
net.broadcastRx.summation
net.broadcastTx.summation
net.bytesRx.average
net.bytesTx.average
net.droppedRx.summation
net.droppedTx.summation
net.multicastRx.summation
net.multicastTx.summation
net.packetsRx.summation
net.packetsTx.summation
net.pnicBytesRx.average
net.pnicBytesTx.average
net.received.average
net.transmitted.average
net.usage.average
power.energy.summation
power.power.average
rescpu.actav1.latest
rescpu.actav15.latest
rescpu.actav5.latest
rescpu.actpk1.latest
rescpu.actpk15.latest
rescpu.actpk5.latest
rescpu.maxLimited1.latest
rescpu.maxLimited15.latest
rescpu.maxLimited5.latest
rescpu.runav1.latest
rescpu.runav15.latest
rescpu.runav5.latest
rescpu.runpk1.latest
rescpu.runpk15.latest
rescpu.runpk5.latest
rescpu.sampleCount.latest
rescpu.samplePeriod.latest
sys.heartbeat.latest
sys.osUptime.latest
sys.uptime.latest
virtualDisk.largeSeeks.latest
virtualDisk.mediumSeeks.latest
virtualDisk.numberReadAveraged.average
virtualDisk.numberWriteAveraged.average
virtualDisk.read.average
virtualDisk.readIOSize.latest
virtualDisk.readLatencyUS.latest
virtualDisk.readLoadMetric.latest
virtualDisk.readOIO.latest
virtualDisk.smallSeeks.latest
virtualDisk.totalReadLatency.average
virtualDisk.totalWriteLatency.average
virtualDisk.write.average
virtualDisk.writeIOSize.latest
virtualDisk.writeLatencyUS.latest
virtualDisk.writeLoadMetric.latest
virtualDisk.writeOIO.latest
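To see exactly what the -Realtime switch adds, the two listings can be diffed with Compare-Object. The sketch below uses short excerpts copied from the two listings above rather than live output, so that it runs without a vCenter connection; the variable names are my own.

```powershell
# Excerpts from the two listings above (not the complete lists).
$historical = 'cpu.usage.average', 'cpu.ready.summation', 'net.usage.average', 'disk.used.latest'
$realtime   = 'cpu.usage.average', 'cpu.ready.summation', 'cpu.latency.average',
              'net.usage.average', 'net.received.average', 'net.transmitted.average'

# '=>' marks items found only in the realtime list, '<=' only in the historical list.
$diff = Compare-Object -ReferenceObject $historical -DifferenceObject $realtime

$realtimeOnly   = @($diff | Where-Object { $_.SideIndicator -eq '=>' } | ForEach-Object { $_.InputObject })
$historicalOnly = @($diff | Where-Object { $_.SideIndicator -eq '<=' } | ForEach-Object { $_.InputObject })

"Only with -Realtime:    {0}" -f ($realtimeOnly -join ', ')
"Only without -Realtime: {0}" -f ($historicalOnly -join ', ')

# Against a live vCenter the same comparison would look like this:
# Compare-Object (Get-StatType -Entity VM2012) (Get-StatType -Entity VM2012 -Realtime)
```

Note that the difference goes both ways: in the listings above, disk.used.latest only shows up without -Realtime.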
So, with access to all of these, we mapped the desired list of metrics to the corresponding Get-StatType names. With that, retrieving metrics is as easy as:
PS C:\> $metrics = "cpu.ready.summation","cpu.latency.average","cpu.usagemhz.average","cpu.usage.average",
    "mem.active.average","mem.usage.average","net.received.average","net.transmitted.average",
    "disk.maxtotallatency.latest","disk.read.average","disk.write.average",
    "disk.numberReadAveraged.average","disk.numberWriteAveraged.average"
PS C:\> $stats = Get-Stat -Entity VM2012 -Realtime -Stat $metrics
This will retrieve a lot of stats! The cmdlet returns samples for the given metrics at 20-second intervals for the last hour:
PS C:\> $stats.count
7020
I only need the last 5 minutes, since I will run the script at that interval, so I can use the -MaxSamples parameter with the value 15 (3 samples per minute x 5 minutes). That leaves me with a lot fewer stats to work with:
PS C:\> $stats2 = Get-Stat -Entity VM2012 -Realtime -Stat $metrics -MaxSamples 15
PS C:\> $stats2.count
585
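The two counts line up if you think of the result as a number of series (one per metric and instance combination), each with one sample every 20 seconds. The sketch below only redoes that arithmetic; the notion of a "series" is my own shorthand, and it assumes every series returned a full hour of samples.

```powershell
# Sanity check of the two $stats counts shown above.
$intervalSeconds = 20
$samplesPerHour  = 3600 / $intervalSeconds       # 180 samples per series per hour
$seriesCount     = 7020 / $samplesPerHour        # 39 metric/instance combinations

$maxSamples    = (5 * 60) / $intervalSeconds     # 15 samples cover 5 minutes
$expectedCount = $maxSamples * $seriesCount      # 15 x 39

"{0} series, -MaxSamples {1}, {2} rows expected" -f $seriesCount, $maxSamples, $expectedCount
# -> 39 series, -MaxSamples 15, 585 rows expected
```

So -MaxSamples limits the number of samples per series, not the total number of rows returned.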
One thing to be aware of is that many of the stats are reported per instance. Looking at one of the CPU metrics, you'll find several rows with the same timestamp: one per vCPU on the VM, plus one aggregate row, identified by an empty "Instance" value:
PS C:\> get-stat -Entity VM2012 -Realtime -Stat cpu.usagemhz.average -MaxSamples 1
MetricId                Timestamp            Value Unit Instance
--------                ---------            ----- ---- --------
cpu.usagemhz.average    12.07.2017 17.24.20     39 MHz
cpu.usagemhz.average    12.07.2017 17.24.20      4 MHz  1
cpu.usagemhz.average    12.07.2017 17.24.20     23 MHz  0
cpu.usagemhz.average    12.07.2017 17.24.20      3 MHz  2
cpu.usagemhz.average    12.07.2017 17.24.20      3 MHz  3
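Splitting the aggregate row from the per-vCPU rows is a simple filter on the Instance property. The sketch below rebuilds the five rows above as plain objects so it can run without vCenter; with real data you would pipe the output of Get-Stat instead. I am assuming the aggregate row carries an empty (or null) Instance, which is why the check uses IsNullOrEmpty.

```powershell
# Stand-ins for the five Get-Stat rows shown above.
$samples = @(
    [pscustomobject]@{ MetricId = 'cpu.usagemhz.average'; Value = 39; Unit = 'MHz'; Instance = ''  }
    [pscustomobject]@{ MetricId = 'cpu.usagemhz.average'; Value = 4;  Unit = 'MHz'; Instance = '1' }
    [pscustomobject]@{ MetricId = 'cpu.usagemhz.average'; Value = 23; Unit = 'MHz'; Instance = '0' }
    [pscustomobject]@{ MetricId = 'cpu.usagemhz.average'; Value = 3;  Unit = 'MHz'; Instance = '2' }
    [pscustomobject]@{ MetricId = 'cpu.usagemhz.average'; Value = 3;  Unit = 'MHz'; Instance = '3' }
)

# The VM-level aggregate is the row without an Instance value.
$aggregate = $samples | Where-Object { [string]::IsNullOrEmpty($_.Instance) }

# The remaining rows are one per vCPU.
$perCpu    = @($samples | Where-Object { -not [string]::IsNullOrEmpty($_.Instance) })
$perCpuSum = ($perCpu | Measure-Object -Property Value -Sum).Sum

"Aggregate: {0} MHz, sum of {1} per-vCPU rows: {2} MHz" -f $aggregate.Value, $perCpu.Count, $perCpuSum
# -> Aggregate: 39 MHz, sum of 4 per-vCPU rows: 33 MHz
```

In this sample the aggregate (39 MHz) is not the same as the sum of the per-vCPU rows (33 MHz), so the two are not interchangeable.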
Please note that you need to examine each of the metrics you retrieve to understand whether the row without an "Instance" value can be used as the aggregate for the VM. You will also need to understand what the metric actually measures and how to read it. For instance, CPU Ready is reported in milliseconds and might need to be converted to a percentage, and the IOPS counters (disk.number...average) might need to be grouped and summed if you want the total.
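As a rough illustration of both cases: CPU Ready is a summation in milliseconds over the sample interval, so the commonly used conversion divides by the interval length (20 seconds for realtime stats). The IOPS counters can be summed per timestamp with Group-Object. The function name, the division by vCPU count and the sample data (including the instance names) are my own assumptions, not something the cmdlets provide.

```powershell
function ConvertTo-CpuReadyPercent {
    # Converts a cpu.ready.summation value (ms) to a percentage of the interval.
    # Dividing by the vCPU count gives an average per vCPU; this assumes the
    # VM-level value is the sum across all vCPUs.
    param(
        [double]$ReadyMs,
        [int]$IntervalSeconds = 20,
        [int]$VCpuCount = 1
    )
    ($ReadyMs / ($IntervalSeconds * 1000 * $VCpuCount)) * 100
}

# 400 ms of ready time in a 20 second interval on a 4 vCPU VM
$readyPct = ConvertTo-CpuReadyPercent -ReadyMs 400 -VCpuCount 4
"CPU ready: {0} %" -f $readyPct

# Made-up per-instance IOPS rows for a single timestamp.
$t = [datetime]'2017-07-12T17:24:20'
$iops = @(
    [pscustomobject]@{ MetricId = 'disk.numberReadAveraged.average';  Timestamp = $t; Value = 12; Instance = 'disk-a' }
    [pscustomobject]@{ MetricId = 'disk.numberReadAveraged.average';  Timestamp = $t; Value = 3;  Instance = 'disk-b' }
    [pscustomobject]@{ MetricId = 'disk.numberWriteAveraged.average'; Timestamp = $t; Value = 10; Instance = 'disk-a' }
    [pscustomobject]@{ MetricId = 'disk.numberWriteAveraged.average'; Timestamp = $t; Value = 5;  Instance = 'disk-b' }
)

# One total per timestamp: reads + writes across all instances.
$totals = @($iops | Group-Object -Property Timestamp | ForEach-Object {
    [pscustomobject]@{
        Timestamp = $_.Group[0].Timestamp
        TotalIops = ($_.Group | Measure-Object -Property Value -Sum).Sum
    }
})
"Total IOPS: {0}" -f $totals[0].TotalIops
```

Whether to divide CPU Ready by the vCPU count depends on how you want to present it; both variants are in use, so it is worth deciding once and labeling the dashboard accordingly.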
After exploring each of the metrics and trying to understand how to read them, I looked at how to build a script for traversing the VMs. A quick pseudo-coded version would be:
1. Define metrics and how much to pull
2. Connect to vCenter
3. Get VMs
4. Traverse and pull stats for VMs
5. Process and build an output object per timestamp for each VM
6. Output to file or post to an API
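The steps above could be sketched roughly as follows. This is a skeleton under my own assumptions, not the finished script: the function names, the powered-on filter, the default metrics and the CSV output are placeholders, and the processing step only keeps the rows without an Instance value. The second function needs PowerCLI and a vCenter connection; it is only defined here, not called.

```powershell
function ConvertTo-VMStatRecord {
    # Process: build one output object per timestamp, one property per metric.
    # Only VM-level rows (empty Instance) are used in this sketch.
    param(
        [string]$VMName,
        [object[]]$Stats
    )
    if (-not $Stats) { return }
    $Stats |
        Where-Object { [string]::IsNullOrEmpty($_.Instance) } |
        Group-Object -Property Timestamp |
        ForEach-Object {
            $record = [ordered]@{ VM = $VMName; Timestamp = $_.Group[0].Timestamp }
            foreach ($sample in $_.Group) { $record[$sample.MetricId] = $sample.Value }
            [pscustomobject]$record
        }
}

function Export-VMRealtimeStat {
    param(
        [string]$Server,                                                   # vCenter name (placeholder)
        [string]$Path,                                                     # output file (placeholder)
        [string[]]$Metrics = @('cpu.usage.average', 'mem.usage.average'),  # define metrics
        [int]$MaxSamples = 15                                              # ...and how much to pull
    )
    Connect-VIServer -Server $Server | Out-Null                            # connect to vCenter
    $vms = Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }         # get VMs

    $records = foreach ($vm in $vms) {                                     # traverse and pull stats
        $stats = Get-Stat -Entity $vm -Realtime -Stat $Metrics -MaxSamples $MaxSamples -ErrorAction SilentlyContinue
        ConvertTo-VMStatRecord -VMName $vm.Name -Stats $stats              # process
    }

    $records | Export-Csv -Path $Path -NoTypeInformation                   # output to file
}
```

Keeping the processing in its own function means it can be tried out on hand-made sample objects, and the output step can be swapped for an API call later without touching the rest.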
Step 5 (processing and building the output objects) will depend heavily on what I need to do in step 6 (the output), so I decided not to build the entire script before I had looked more at InfluxDB and how I would push the data to the database. This will be the focus of the next part of this series.