This is Part 3 in my series on vSphere performance data.
Part 1 discussed the project. Part 2 was about checking the methods of retrieving data, and ended with me realizing I would use Get-Stat against all (4000) VMs to retrieve data. Part 2 was posted over a month ago, as I have been busy preparing for the VCP 6.5 DCV exam (which I passed, btw) as well as upgrading and migrating our vCenter servers. I have still been able to do a lot of work on this project, so there will be some updates in the next couple of days.
Previously I had done some benchmarks on retrieving data from VMs using PowerCLI and the Get-Stat cmdlet, landing on roughly 1 second per VM to retrieve and process the metrics I wanted. As I discussed in Part 2, that would result in 4000 seconds to retrieve the data I needed. With my goal of retrieving all 20 second metrics within 5 minutes (300 seconds), I would need around 14 scripts running simultaneously (4000 / 300 ≈ 13.3) to achieve this.
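The arithmetic behind that estimate can be written out as a quick sanity check. The variable names are illustrative, and the one-second-per-VM figure is only the rough benchmark mentioned above, not an exact measurement.

```powershell
# Rough estimate of how many collectors must run in parallel.
$vmCount       = 4000     # VMs to poll
$secondsPerVm  = 1        # benchmark: roughly 1 second per VM
$windowSeconds = 5 * 60   # everything must finish within 5 minutes

$totalSeconds   = $vmCount * $secondsPerVm
$parallelNeeded = [math]::Ceiling($totalSeconds / $windowSeconds)

"{0} seconds of work / {1} second window = {2} parallel scripts" -f $totalSeconds, $windowSeconds, $parallelNeeded
```

Rounding 13.3 up gives the 14 scripts mentioned above; a slower per-VM time would push the number up proportionally.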
At least a couple of things about this worry me. The first is the management and operation of 14 scripts running every 5 minutes; the second is the potential extra load this will put on vCenter and the environment.
Anyway, I started out building a script.
There are lots of resources on using Get-Stat. One that does a great job of explaining the basics is this one from LucD; even though it’s from 2009, it’s still valid. Another, which covers a project similar to mine, is this one from orchestration.io.
I already had some thoughts on the need for parallelization, but decided to build a script that didn’t care about that at first.
(To begin exploring Get-Stat and the other stat cmdlets, you should read the blog post from LucD referenced above.)
I started out by checking the different stats that are available for a VM:
    PS C:\> Get-StatType -Entity VM2012
    cpu.usage.average
    cpu.usagemhz.average
    cpu.ready.summation
    mem.usage.average
    mem.swapinRate.average
    mem.swapoutRate.average
    mem.vmmemctl.average
    mem.consumed.average
    mem.overhead.average
    disk.usage.average
    disk.maxTotalLatency.latest
    disk.numberReadAveraged.average
    disk.numberWriteAveraged.average
    net.usage.average
    sys.uptime.latest
    virtualDisk.numberReadAveraged.average
    virtualDisk.numberWriteAveraged.average
    virtualDisk.totalReadLatency.average
    virtualDisk.totalWriteLatency.average
    datastore.numberReadAveraged.average
    datastore.numberWriteAveraged.average
    datastore.totalReadLatency.average
    datastore.totalWriteLatency.average
    disk.used.latest
    disk.provisioned.latest
    disk.unshared.latest
As I’ve described in the previous posts in this series, we already have some performance dashboards, so we had a fairly clear understanding of which metrics we would want to use:
- CPU usage/utilization
- CPU Ready & Latency
- MEM usage/utilization
- Network throughput
- Disk throughput (kBps & IOPS)
- Storage latency
The Get-StatType output above is missing several of these. It turns out that you need to add the -Realtime switch to get access to the missing ones (and a lot more):
    PS C:\> Get-StatType -Entity VM2012 -Realtime | sort
    cpu.costop.summation
    cpu.demand.average
    cpu.demandEntitlementRatio.latest
    cpu.entitlement.latest
    cpu.idle.summation
    cpu.latency.average
    cpu.maxlimited.summation
    cpu.overlap.summation
    cpu.readiness.average
    cpu.ready.summation
    cpu.run.summation
    cpu.swapwait.summation
    cpu.system.summation
    cpu.usage.average
    cpu.usagemhz.average
    cpu.used.summation
    cpu.wait.summation
    datastore.maxTotalLatency.latest
    datastore.numberReadAveraged.average
    datastore.numberWriteAveraged.average
    datastore.read.average
    datastore.totalReadLatency.average
    datastore.totalWriteLatency.average
    datastore.write.average
    disk.busResets.summation
    disk.commands.summation
    disk.commandsAborted.summation
    disk.commandsAveraged.average
    disk.maxTotalLatency.latest
    disk.numberRead.summation
    disk.numberReadAveraged.average
    disk.numberWrite.summation
    disk.numberWriteAveraged.average
    disk.read.average
    disk.usage.average
    disk.write.average
    mem.active.average
    mem.activewrite.average
    mem.compressed.average
    mem.compressionRate.average
    mem.consumed.average
    mem.decompressionRate.average
    mem.entitlement.average
    mem.granted.average
    mem.latency.average
    mem.llSwapInRate.average
    mem.llSwapOutRate.average
    mem.llSwapUsed.average
    mem.overhead.average
    mem.overheadMax.average
    mem.overheadTouched.average
    mem.shared.average
    mem.swapin.average
    mem.swapinRate.average
    mem.swapout.average
    mem.swapoutRate.average
    mem.swapped.average
    mem.swaptarget.average
    mem.usage.average
    mem.vmmemctl.average
    mem.vmmemctltarget.average
    mem.zero.average
    mem.zipped.latest
    mem.zipSaved.latest
    net.broadcastRx.summation
    net.broadcastTx.summation
    net.bytesRx.average
    net.bytesTx.average
    net.droppedRx.summation
    net.droppedTx.summation
    net.multicastRx.summation
    net.multicastTx.summation
    net.packetsRx.summation
    net.packetsTx.summation
    net.pnicBytesRx.average
    net.pnicBytesTx.average
    net.received.average
    net.transmitted.average
    net.usage.average
    power.energy.summation
    power.power.average
    rescpu.actav1.latest
    rescpu.actav15.latest
    rescpu.actav5.latest
    rescpu.actpk1.latest
    rescpu.actpk15.latest
    rescpu.actpk5.latest
    rescpu.maxLimited1.latest
    rescpu.maxLimited15.latest
    rescpu.maxLimited5.latest
    rescpu.runav1.latest
    rescpu.runav15.latest
    rescpu.runav5.latest
    rescpu.runpk1.latest
    rescpu.runpk15.latest
    rescpu.runpk5.latest
    rescpu.sampleCount.latest
    rescpu.samplePeriod.latest
    sys.heartbeat.latest
    sys.osUptime.latest
    sys.uptime.latest
    virtualDisk.largeSeeks.latest
    virtualDisk.mediumSeeks.latest
    virtualDisk.numberReadAveraged.average
    virtualDisk.numberWriteAveraged.average
    virtualDisk.read.average
    virtualDisk.readIOSize.latest
    virtualDisk.readLatencyUS.latest
    virtualDisk.readLoadMetric.latest
    virtualDisk.readOIO.latest
    virtualDisk.smallSeeks.latest
    virtualDisk.totalReadLatency.average
    virtualDisk.totalWriteLatency.average
    virtualDisk.write.average
    virtualDisk.writeIOSize.latest
    virtualDisk.writeLatencyUS.latest
    virtualDisk.writeLoadMetric.latest
    virtualDisk.writeOIO.latest
So, with access to all of these, we mapped the desired list of metrics to the corresponding Get-StatType names. With that, retrieving metrics is as easy as:
    PS C:\> $metrics = "cpu.ready.summation","cpu.latency.average","cpu.usagemhz.average","cpu.usage.average","mem.active.average","mem.usage.average","net.received.average","net.transmitted.average","disk.maxtotallatency.latest","disk.read.average","disk.write.average","disk.numberReadAveraged.average","disk.numberWriteAveraged.average"
    PS C:\> $stats = Get-Stat -Entity VM2012 -Realtime -Stat $metrics
This will retrieve a lot of stats! The cmdlet will retrieve stats for the given metrics at 20 second intervals for the last hour!
    PS C:\> $stats.count
    7020
I only need the last 5 minutes, since that is the interval I would run the script on. The -MaxSamples parameter takes care of that: with a value of 15 (3 samples per minute x 5 minutes) I have far fewer stats to work with:
    PS C:\> $stats2 = Get-Stat -Entity VM2012 -Realtime -Stat $metrics -MaxSamples 15
    PS C:\> $stats2.count
    585
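The two counts above can be sanity-checked with a little arithmetic. Realtime stats arrive every 20 seconds, so an hour holds 180 samples and five minutes holds 15; both counts then work out to the same number of series (metric/instance combinations) for this VM. A small sketch, with illustrative variable names:

```powershell
$intervalSeconds = 20                            # realtime sampling interval
$samplesPerHour  = 3600 / $intervalSeconds       # 180
$samplesPer5Min  = (5 * 60) / $intervalSeconds   # 15, the -MaxSamples value

# Counts taken from the console output above
$seriesFullHour = 7020 / $samplesPerHour         # 39
$seriesLast5Min = 585 / $samplesPer5Min          # 39

"Series per VM: $seriesFullHour (full hour), $seriesLast5Min (last 5 minutes)"
```

With $stats2 still in the session, piping it to Group-Object MetricId, Instance should list those series one by one, which is a handy way to see which metrics come back per instance.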
One thing to be aware of is that many of the stats are reported per instance. Looking at one of the CPU metrics, you’ll find several rows with the same timestamp: one per vCPU of the VM, plus one aggregate row, which is the one without a value in «Instance»:
    PS C:\> get-stat -Entity VM2012 -Realtime -Stat cpu.usagemhz.average -MaxSamples 1

    MetricId                Timestamp             Value Unit Instance
    --------                ---------             ----- ---- --------
    cpu.usagemhz.average    12.07.2017 17.24.20      39 MHz
    cpu.usagemhz.average    12.07.2017 17.24.20       4 MHz  1
    cpu.usagemhz.average    12.07.2017 17.24.20      23 MHz  0
    cpu.usagemhz.average    12.07.2017 17.24.20       3 MHz  2
    cpu.usagemhz.average    12.07.2017 17.24.20       3 MHz  3
Please note that you need to examine each of the metrics you retrieve to understand whether the row without an «Instance» value can be used as the aggregate for the VM. You will also need to understand what the metric actually is and how to read it. For instance, CPU Ready might need to be converted to a percentage value, and the IOPS counters (disk.number…average) might need to be grouped if you want the total.
After exploring each of the metrics and trying to understand how to read them, I looked at how to build a script for traversing VMs. A quick pseudo-coded version would be:
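As an illustration of those two cases, here is a minimal sketch of both calculations. The function names are hypothetical, not PowerCLI cmdlets. The percentage conversion assumes cpu.ready.summation is reported in milliseconds per 20 second realtime sample; for a VM-level row you may also want to divide by the number of vCPUs, so verify this against your own data.

```powershell
# Convert a cpu.ready.summation value (assumed milliseconds) to a
# percentage of the sampling interval.
function ConvertTo-ReadyPercent {
    param(
        [double]$ReadyMs,
        [int]$IntervalSeconds = 20   # realtime sampling interval
    )
    [math]::Round(($ReadyMs / ($IntervalSeconds * 1000)) * 100, 2)
}

# Sum the per-instance rows of one metric into a total per timestamp.
# Rows without an Instance value are skipped to avoid double counting.
function Get-InstanceTotal {
    param([object[]]$Stats, [string]$MetricId)
    $Stats |
        Where-Object { $_.MetricId -eq $MetricId -and -not [string]::IsNullOrEmpty($_.Instance) } |
        Group-Object -Property Timestamp |
        ForEach-Object {
            [pscustomobject]@{
                Timestamp = $_.Group[0].Timestamp
                MetricId  = $MetricId
                Total     = ($_.Group | Measure-Object -Property Value -Sum).Sum
            }
        }
}

ConvertTo-ReadyPercent -ReadyMs 400   # 2, i.e. 400 ms of a 20 s sample is 2 %
```

Get-InstanceTotal takes the objects returned by Get-Stat, for example Get-InstanceTotal -Stats $stats2 -MetricId 'disk.numberReadAveraged.average' for total read IOPS per timestamp.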
1. Define metrics and how much to pull
2. Connect to vCenter
3. Get VMs
4. Traverse and pull stats for VMs
5. Process and build an output object per timestamp for each VM
6. Output to file or post to an API
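The steps above could be sketched roughly as follows. This is only an outline under assumptions, not a finished collector: the function names, the server name and the CSV path are hypothetical, only VM-level rows (empty Instance) are kept, and powered-off VMs are skipped on the assumption that they have no realtime stats. It needs PowerCLI loaded to actually run against vCenter.

```powershell
# Step 5: turn one VM's stats into one object per timestamp.
function ConvertTo-VmRow {
    param([string]$VmName, [object[]]$Stats)
    $Stats |
        Where-Object { [string]::IsNullOrEmpty($_.Instance) } |
        Group-Object -Property Timestamp |
        ForEach-Object {
            $row = [ordered]@{ VM = $VmName; Timestamp = $_.Group[0].Timestamp }
            foreach ($s in $_.Group) { $row[$s.MetricId] = $s.Value }
            [pscustomobject]$row
        }
}

function Get-VmRealtimeStat {
    param(
        [string]$Server,          # vCenter name, supplied by the caller
        [string[]]$Metrics,       # step 1: which metrics to pull
        [int]$MaxSamples = 15     # step 1: 5 minutes of 20 second samples
    )
    Connect-VIServer -Server $Server | Out-Null                       # step 2
    $vms = Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }    # step 3
    foreach ($vm in $vms) {                                           # step 4
        $stats = Get-Stat -Entity $vm -Realtime -Stat $Metrics -MaxSamples $MaxSamples -ErrorAction SilentlyContinue
        if ($stats) { ConvertTo-VmRow -VmName $vm.Name -Stats $stats }
    }
}

# Step 6, writing to a file (hypothetical server name and path):
# Get-VmRealtimeStat -Server 'vcenter.example.local' -Metrics 'cpu.usage.average','mem.usage.average' |
#     Export-Csv -Path .\vmstats.csv -NoTypeInformation
```

Keeping the processing in its own function means the shape of the output object can change later without touching the traversal, which matters because that shape depends on where the data ends up.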
Step 5 above will depend heavily on what I need to do in step 6, so I decided not to build the entire script before I had looked more at InfluxDB and how I would push the data to the database. This will be the focus of the next part of this series.