OneView crash - Error in log file content
This weekend our primary OneView appliance crashed.
This particular OneView appliance handles 10 blade chassis and over 120 blade servers.
As OneView handles only the management side of the hardware, nothing in production was affected by this crash.
TL;DR: Version 3.10.04 has a bug where expired sessions are never deleted. This is fixed in version 3.10.07.
A few troubleshooting steps were taken initially.
- First we restarted the appliance. It took a while, but it stopped while loading its resource managers and threw the same error
- We also gave it more CPUs and more RAM to see if it was a resource issue. After powering on the VM it eventually threw the same error
Unfortunately we were not taking backups of the appliance from OneView itself. We had only been doing VM backups through a 3rd party backup solution. We should of course have done the OV backups, but when OV was configured initially (the appliance was installed on version 1.20) there was no way to do scheduled backups from OV itself, and even though we have since upgraded to newer versions which have scheduled remote backups, this had not been configured.
With not much more to try, we submitted a support case to HPE to get some help.
While we waited (and waited) for support to analyze the case, we tried restoring the appliance VM from the backup taken the day before the issue was reported.
The restored VM failed at the same point and gave the same error screen.
That, and the fact that the error mentions issues with "log file content", led us to think that this issue had been present for some time but only surfaced when the appliance rebooted. More importantly, it gave us hope that Support could fix it (a "log file" issue should be fixable).
The last resort would be a complete reinstall: bring down the hardware and forcefully add it to a new appliance.
Luckily, when our support case was elevated to Level 2, one of the engineers immediately recognized the issue and was pretty confident about how to fix it.
It turns out that in the version we were running, 3.10.04, there is a bug which can fill up the database with "Expired sessions". HPE has published an advisory about it. This advisory mentions 20 000 "expired sessions" as a threshold before things start to stop working. As our situation shows, the problem won't necessarily be noticed until you try to update or reboot the appliance.
To solve the issue, the L2 support engineer logged in to the appliance CLI (only accessible with some timed credentials provided by HPE) and tried to run a cleanup script. The first attempts were unsuccessful because the appliance crashed before the script could run. On the third attempt we got the script started before the appliance crashed, and it reset the expired sessions.
The output of the script shows that we had over 57000 records!
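To put that number in perspective, here is a quick back-of-the-envelope comparison against the threshold from the advisory. This is only arithmetic on the two figures mentioned in this post; the function name is made up for illustration and has nothing to do with HPE's cleanup script.

```python
ADVISORY_THRESHOLD = 20_000  # expired sessions, the threshold mentioned in the HPE advisory
OUR_RECORDS = 57_000         # lower bound: the cleanup script reported "over 57000"


def times_over_threshold(records: int, threshold: int = ADVISORY_THRESHOLD) -> float:
    """Return the record count as a multiple of the threshold (hypothetical helper)."""
    return records / threshold


ratio = times_over_threshold(OUR_RECORDS)
print(f"At least {ratio:.2f}x the advisory threshold")
```

So we were sitting at nearly three times the point where the advisory says things start to break.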
After the cleanup script was run the appliance could start up, and it started refreshing the hardware objects. After letting it do its thing for about an hour all seemed to be fine. Only 4 blade servers needed to have their server profiles reapplied. Pretty minor considering the worst case.
After the refresh I tried to do a backup of the appliance. This didn't work, so the L2 engineer logged on again and cleaned up some old files through the CLI. After this the backup completed.
The release notes for HPE OneView 3.10.07 state that this bug is fixed, so after a successful backup and a VM snapshot I upgraded the appliance to this version. Hopefully we'll never encounter this again!
P.S! Scheduled OneView backups are now configured to a remote location :-)
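For anyone who also wants to kick off an extra backup from a script (for example right before an upgrade), here is a rough sketch of how the requests could be built against the OneView REST API. It is not what we configured; our scheduled backups are set up in OneView itself. The endpoints /rest/login-sessions and /rest/backups and the Auth / X-API-Version headers are as I understand the API, and the API version number, host name and function names are assumptions, so check the REST API reference for your appliance version before relying on it.

```python
import json
import urllib.request

# Assumption: pick the API version your appliance actually supports.
API_VERSION = "500"


def build_login_request(host: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a login request; the response should carry a session ID."""
    body = json.dumps({"userName": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"https://{host}/rest/login-sessions",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Version": API_VERSION},
        method="POST",
    )


def build_backup_request(host: str, session_id: str) -> urllib.request.Request:
    """Build (but do not send) a request that asks the appliance to create a backup."""
    return urllib.request.Request(
        f"https://{host}/rest/backups",
        data=b"",
        headers={"Auth": session_id, "X-API-Version": API_VERSION},
        method="POST",
    )
```

Nothing is sent until you pass the request objects to urllib.request.urlopen, so the builders are safe to experiment with. The backup call should start a task on the appliance that you would then have to poll and download from.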