LAS4001: VMworld Labs Automation & Workflow Architecture

Disclaimer: I scribbled notes as fast as I could during the presentation and may have mistyped/miswrote some information. If you have solid information to make this more accurate please let me know.

At VMworld, they offer some amazing Hands on Labs. This year there is 27 labs that could be taken. The goal after last year's success is to hit over 250,000 VMs deployed and destroyed in 5 days of Hands on Labs.

vPod

This is built on an architecture called vPod. In this architecture they have the vLayer0 which is the physical layer where the hardware lays with the initial ESXi. Inside of that is vLayer1 which is the first layer of VMs. This is where the ESXi instances of the labs exist along with the VSAs (Virtual Storage Appliances) and generally the vCenter Instance. Since you have this layer in there you can create SRM labs with unique datacenters for the labs. Under that is the vLayer2 that creates the VMs that actually are seen in most of the Labs as Testing Lab VMs.

Each physical cluster this year is 22-24 physical Hosts. Using vSphere 5, DRS, and vCD 1.5. Previously the Labs environment used Lab Manager. Often in this environment is the newest of the newest beta level and is a fantastic test ground for load testing the space.

Automation

The automation they created is called LabCloud Automation. This is a VM that uses Apache, Django, Adobe Message Framework, with a Postgres backend database. It talks to the vCenter & vCloud Director via the various APIs, nothing special or funky just because they work at VMware. Along with that all the API calls go through a set of "Bridge" python code. These bridges are connected in a 1:1 fashion for each vCD instance. The bridges help throttle communications going to the vCD as there are limits to how much the vCD API layer can process/handle at a given time. There is one vCD per datacenter as the environment has been configured.

Every year there has been some sort of issues. This is to be fully expected considering the scale and nature of what is going on. Some history:

2009

This environment had 100% of the labs onsite. All the compute, storage, everything. Putting a serious datacenter on the show floor introduces its own issues as these convention centers are not configured for that much power draw typically. So 1 power path for each device at best. To make things easier for setup, the racks were built offsite and then shipped directly to the convention center with everything still in it. That worked well enough until one of the racks had a forklift arm driven right through the middle.

2010

Datacenter was split between onsite and in a cloud. The thin client connections via PCoIP worked great and had even more power issues since the size of the labs grew again.

2011

In the running of this year's Labs everything is offsite at 3 different datacenters around the world. Nothing is onsite. They have lost UCS blades, a storage controller, crashed some of the switches, completely crashed vCenter (which was a first). They also managed to toast the View environment. They have found that they lose 15-20% of the thinclients a year. floor. 2011 everything is offsite, no power issues this year. Originally the timers for the Labs started off only being 1 hour long. Found most labs were taking longer than that so increased the timer to 1.5 hours. This dug into the possible time to deliver labs to everyone.

People and Labs

Human dynamics are huge in events like this. First day 200 people came in, sat down and all picked lab 1. In the environment just like in vCD/View they stage up a certain number of labs so there's always some ready to go. They simply didn't have enough of Lab 1 spun up ahead of time so many folks sitting there had to wait 10 mins for these labs to get up and running. Not ideal.

Another challenge is tracking who is where and what that station's condition is. Is it available? Is it getting reimaged? Is the person done and the seat freed up? Is the seat working right or not? So they developed a person seatmap. This tracks every single person and what lab they are assigned to. Used Adobe Flex for the interface. Issues were found around keeping up to date consistently with the back-end database as the conference went on since the people scanning the badges were doing them in parallel. More improvements coming next year.

Troubleshooting and Monitoring

Each of the 8 vCDs generate over 200 MB of logs every 10 mins. Log rotation doesn't cut it to figure out what's going on. Other tools were needed to attempt to alert on issues before they got too big. vCenter Operations was extremely useful in diagnosis for some of the issues seen during the conference.

Predicting Demand

One of the tools written to stay ahead of the demand tracked how many vApps have been deployed to each vCD and what the lab status is. This tool has a minimum and maximum # of pre-deployed labs that it would make available. By monitoring this the lab team was able to adjust to some of the human elements such as "Paul Maritz mentioned Horizon, so we should make sure that we have more of these labs ready to go when the General Session ends." At night getting the environment pre-populated takes about an hour to deploy 600-700 vApps when there is no load.Obviously doing that much load on top of people using the system to take the labs, it can take a bit longer. One of the keys is trying to be ready for the first thing in the morning.

Some of the behind the scenes. They do some Load Balancing across the environment. This is tough and needs to take into account how big the labs are along with the complication that the hardware clusters are not all equal. Some hardware has 96G of RAM per ESXi Host.. some has 24G. So one of the big questions is how to spread big labs across clusters and vCD etc. Some reporting happens to capture how the labs are deployed. This is a very tough item to automate today. It has gotten better since now DRS is enabled through use of vCloud Director. That introduces other issues though since it is a 3 layer deep virtualized space any changes occur over the network, not directly to storage often. Performance labs were especially challenging. "Well I want to cause the storage to go IOP crazy in the troubleshooting lab." "Uh. no. They are in a shared environment and could cause the actual physical environment to implode." Some SIOC and resource limits helped make these labs work well enough this year.

Stats

There is a ton of available stats. One of the things that we in IT don't do enough of is taking these stats and showing them to Tons of stats. Getting the information out to say "Look here. This is cool."

In the First year there was

4,000 labs completed
9 lab options
40,000 VMs.

They quickly realized this is awesome. The HUD up on the screen is huge in terms of selling the awesomeness going on. It really made things happen and LABs got more popular. Last year the HUD was based on car analogy. You know your doing something right when Paul Maritz sits down and asks about the HUD and you mean we are doing that so far?

This year the original plan was to have a Jet fighter HUD that showed people's names as they selected a lab and when they finished to give it a more human touch (along with missiles and explosions). That ended up not getting displayed due to some bugs found at the last minute.

Instead an Aquarium was used that showed each vApp as a fish (larger sized vApps meant a larger lead fish, smaller meant smaller fish). Behind that lead fish was a school of fish that would follow it around. Each fish in that school represented one of the VMs in the vApp/Lab.

Wrap Up

Performing the VMworld Labs and building a space to do this is the ultimate load/qa test. Numerous times it has been found that the engineers/product managers never expected or thought that they'd do something like that or that way. Significant amounts of feedback goes back to the products being used. The VMworld Labs are the single largest deploy of vCD in any single effort.

It is a fantastic capability to be offered at VMworld and shows just what is possible with the product suite.

Another post covering the LAS4001 labs here: http://www.thinkmeta.net/index.php/2011/09/01/las4001-lab-architecture-session-cloud-and-virtualization-architecture/

Updated: 3 Sept 2011@6:31pm CST - Adding Aquarium HUD HoL screen shot and link to another LAS4001 lab blog post.

Its just another layerhow deep does the abstraction go?

Related Posts

Ansible tagging - why? 28 Dec 2017

fsck hfs+ filesystem on ubuntu 25 Sep 2015

Windows 10 GA VMware Template 17 Sep 2015