I've been chewing on this post by Duncan at Yellow Bricks for the past month and a half. It covers some complicated issues that you have to deal with in an enterprise-size environment, and it makes a lot of assumptions about what gets you into this mess in the first place. The best thing to do is scale down and scale up as needed, based on good performance monitoring and bottleneck research. Thankfully I've built good enough relationships with most teams where I work that this has become standard operating procedure, though sometimes we just can't do it. At the end of the day the issue boils down to one simple goal:

"As the VMware environment administrator, how can I make better use of what I have available to me?"

In my environment I run into a variety of political reasons, ranging from:

  • "I am going to need that extra 2 CPUs someday in the future so I can't give them up now."
  • "The vendor docs say I really do need 8 CPUs and 128G of RAM for my 3 users even though 126G is unused."
  • "Someone on your team said I really do need that 8G of RAM so I won't give it up"
  • "Oh come on.. what's another 2G of RAM"
  • "I gave up my budget for a physical to do this as a virtual even though I'm still spending less in the grand scheme. Gimme more resources."

to outright begging:

  • "Pleaseeee. I think it'll help my issues. It might even make me look better to my co-workers."

Two cases from my environment show how hard resource limits can be to use well.

Case #1: The poorly written VBScript

Back in the early Windows 3.1 days when VB was a novel concept, some developers made a groundbreaking app that would pull data from a remote system, massage the data a bit and put it into a centralized Btrieve database. The script they wrote is supposed to sleep for a minute whenever the remote system's queue is empty. Its sleep function works by checking the clock to see whether a minute has passed, and it checks constantly, which consumes 100% of the CPU all the time. This wasn't much of an issue when each of these systems ran on its own old PC. We virtualized them because 16 XP workstations in the datacenter are a management headache. Now that's 16 high-powered cores, several generations newer, running at 100% all day long for no good reason.

We VMware admins found that on the old PCs these systems would easily take 5-10 minutes to work through their queues. As VMs on our newest hardware, they do the same work in under 15 seconds. Then for 60 seconds each one does nothing except check the hardware clock.
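To make the difference concrete, here is a minimal Python sketch (not the original VBScript, which I'm not reproducing here) that compares a clock-polling wait with a real sleep. The function names are made up for illustration; the only point is that one burns CPU time while waiting and the other does not.

```python
import time


def busy_wait(seconds):
    """Wait by polling the clock in a tight loop, the way the script in Case #1 does."""
    deadline = time.monotonic() + seconds
    polls = 0
    while time.monotonic() < deadline:
        polls += 1  # every pass is CPU work that accomplishes nothing
    return polls


def blocking_wait(seconds):
    """Wait by asking the OS to wake us up later; the CPU is free in the meantime."""
    time.sleep(seconds)


def cpu_seconds(wait_fn, seconds):
    """Return the CPU time this process consumed while wait_fn(seconds) ran."""
    start = time.process_time()
    wait_fn(seconds)
    return time.process_time() - start
```

Calling cpu_seconds(busy_wait, 1.0) should report roughly a full second of CPU time, while cpu_seconds(blocking_wait, 1.0) should report close to zero. Multiply the first number by 16 VMs and you have our problem.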

Solution #1: CPU limits good

We put these VBScript VMs into a resource pool with a CPU limit. They still run mega fast compared to where they were a year ago, but now they use no more than 8 cores' worth of CPU at any given time. That's a big improvement until the app developers decide whether to replace all that code with a sleep 60 or recode the entire app.
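For a rough feel of the numbers, here is a small Python sketch of the arithmetic behind a pool like this. Resource pool CPU limits are set in MHz, so the limit depends on the clock speed of the host cores; the 2600 MHz figure below is a made-up example, not our hardware, and the sketch assumes each VM has a single vCPU.

```python
def pool_limit_mhz(limit_cores, core_mhz):
    """CPU limit to set on the pool, in MHz, for a given number of cores' worth."""
    return limit_cores * core_mhz


def avg_core_share(limit_cores, spinning_vms):
    """Average fraction of one core each single-vCPU VM gets when all of them spin."""
    return min(1.0, limit_cores / spinning_vms)


# Hypothetical 2600 MHz cores; 8 cores' worth shared by 16 spinning VMs.
print(pool_limit_mhz(8, 2600))   # 20800
print(avg_core_share(8, 16))     # 0.5
```

Each VM averages half a core when all 16 are spinning, and since the real work finishes in seconds, nobody notices the cap.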

Case #2: vCenter SQL Server Memory Limits

Due to a feature in vCenter 4.0U1 with ESX 3.5 hosts, when I increased the RAM on my dedicated vCenter SQL Server from 4G to 8G, a memory limit of 4G was set on the VM. When I logged on to the SQL Server machine, sqlservr.exe would only be using about 3,600 MB, yet all 8G showed as consumed. To me this screamed of an issue inside the OS instance. After close to 10 days of beating my head against it and not understanding why my brand new vCenter 4.0U1 system was running so poorly, a co-worker with a fresh set of eyes noticed the limit setting on the SQL Server VM.

Solution #2: Memory limits bad

This is obvious. We disabled the limit and SQL Server performance went through the roof instantly. We simply couldn't easily tell that the balloon driver was holding 4G of RAM, because it doesn't show up as a process. Nobody noticed the ballooning happening.
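The check that would have saved us those 10 days is simple: look for any VM whose memory limit is lower than its configured memory. Here is a minimal Python sketch of that check. The inventory list, the VM names, and the use of None for "no limit" are all assumptions for illustration; in real life you would pull these values from vCenter rather than type them in.

```python
def reclaim_needed_mb(configured_mb, limit_mb):
    """Memory the host must take back (balloon or swap) if the guest uses all its RAM."""
    if limit_mb is None:  # None means "no limit" in this sketch
        return 0
    return max(0, configured_mb - limit_mb)


def vms_with_stale_limits(inventory):
    """Return (name, reclaim_mb) for every VM whose limit sits below its configured RAM."""
    flagged = []
    for vm in inventory:
        gap = reclaim_needed_mb(vm["configured_mb"], vm["limit_mb"])
        if gap > 0:
            flagged.append((vm["name"], gap))
    return flagged


# Hypothetical inventory; the first entry mirrors Case #2 (8G configured, 4G limit).
inventory = [
    {"name": "vcenter-sql", "configured_mb": 8192, "limit_mb": 4096},
    {"name": "some-other-vm", "configured_mb": 4096, "limit_mb": None},
]
print(vms_with_stale_limits(inventory))  # [('vcenter-sql', 4096)]
```

The guest believes it has 8G, the host will only back 4G of it, and the difference has to come from somewhere, which in our case was the balloon driver.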

At the end of the day there are pros and cons to having this level of capability. This is why I like ESX and VMware's general approach: give you everything they can in terms of options and configurations, plus enough rope to hang yourself and two of your friends, while trying to automate and hide as much of it as they can. The vendor will never know all the situations we, the people in the field, are going to run into, so they give us all the options they can. Use that rope with caution.