Disclaimer: I scribbled notes as fast as I could during the presentation and may have mistyped or miswritten some information. If you have solid information to make this more accurate, please let me know.
At VMworld, they offer some amazing Hands on Labs. This year there are 27 labs to choose from. After last year's success, the goal is to deploy and destroy over 250,000 VMs in the 5 days of Hands on Labs.
This is built on an architecture called vPod. At the bottom is vLayer0, the physical layer where the hardware lives with the initial ESXi install. Inside that is vLayer1, the first layer of VMs: the ESXi instances for the labs along with the VSAs (Virtual Storage Appliances) and, generally, the vCenter instance. Because this layer exists, they can create SRM labs with unique datacenters per lab. Inside that again is vLayer2, the VMs that attendees actually see in most of the labs as the lab's test machines.
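The three-layer nesting can be sketched as a tiny model. The layer names and roles come straight from the session; the code itself is purely illustrative:

```python
# Toy model of the vPod nesting; layer names are from the session,
# everything else here is illustrative.
VPOD_LAYERS = [
    ("vLayer0", "physical hosts running the base ESXi install"),
    ("vLayer1", "virtual ESXi hosts, VSAs, and usually the lab's vCenter"),
    ("vLayer2", "the guest VMs attendees actually touch in the labs"),
]

def describe(layers):
    """Render each layer indented under its parent to show the nesting."""
    return ["  " * depth + f"{name}: {role}"
            for depth, (name, role) in enumerate(layers)]
```

The point of the model is simply that everything an attendee touches is at least two hypervisors away from real hardware.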
Each physical cluster this year is 22-24 physical hosts, using vSphere 5, DRS, and vCD 1.5. Previously the Labs environment used Lab Manager. This environment often runs the newest of the new beta-level code and is a fantastic ground for load testing.
The automation they created is called LabCloud Automation. It is a VM running Apache, Django, and the Adobe Message Framework with a Postgres backend database. It talks to vCenter and vCloud Director via the public APIs; nothing special or funky just because they work at VMware. All the API calls go through a set of "bridge" Python processes, connected 1:1 with each vCD instance. The bridges throttle communication to vCD, since there are limits to how much the vCD API layer can process at a given time. The environment is configured with one vCD per datacenter.
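The 1:1 bridge-per-vCD throttle could look something like this minimal sketch. The class and method names are my own, not the actual LabCloud code, which is not public:

```python
import threading

class VcdBridge:
    """All API calls for one vCloud Director instance funnel through a
    bridge, which caps concurrent requests so the vCD API layer is not
    flooded. Hypothetical names; the real bridge code is not public."""

    def __init__(self, vcd_url, max_concurrent=4):
        self.vcd_url = vcd_url
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, api_fn, *args, **kwargs):
        # Blocks here if this vCD already has max_concurrent calls in flight.
        with self._slots:
            return api_fn(*args, **kwargs)

# One bridge per vCD instance, matching the 1:1 pairing from the session.
bridges = {url: VcdBridge(url)
           for url in ("https://vcd-dc1.example", "https://vcd-dc2.example")}
```

The design choice worth noting is that the throttle lives outside vCD: callers queue in the bridge rather than timing out against an overloaded API endpoint.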
Every year there have been issues of some sort. This is fully to be expected given the scale and nature of what is going on. Some history:
This environment had 100% of the labs onsite: all the compute, storage, everything. Putting a serious datacenter on the show floor introduces its own issues, as convention centers are not typically configured for that much power draw. So one power path per device, at best. To make setup easier, the racks were built offsite and shipped directly to the convention center with everything still in them. That worked well enough until one of the racks had a forklift arm driven right through the middle.
The datacenter was split between onsite and a cloud. The thin-client connections via PCoIP worked great, but there were even more power issues since the labs grew again in size.
In this year's running of the Labs everything is offsite at 3 different datacenters around the world; nothing is onsite, so no power issues this year. They have lost UCS blades, a storage controller, crashed some of the switches, and completely crashed vCenter (a first). They also managed to toast the View environment. They have found they lose 15-20% of the thin clients a year. Originally the lab timers were only 1 hour long; most labs were taking longer than that, so the timer was increased to 1.5 hours, which dug into the time available to deliver labs to everyone.
People and Labs
Human dynamics are huge in events like this. On the first day, 200 people came in, sat down, and all picked Lab 1. Just like in vCD/View, the environment stages a certain number of labs so there are always some ready to go. They simply didn't have enough of Lab 1 spun up ahead of time, so many folks sat there waiting 10 minutes for their labs to get up and running. Not ideal.
Another challenge is tracking who is where and what each station's condition is. Is it available? Is it getting reimaged? Is the person done and the seat freed up? Is the seat working right or not? So they developed a seatmap that tracks every single person and the lab they are assigned to, with an Adobe Flex interface. Issues were found keeping it consistently up to date with the back-end database as the conference went on, since the people scanning badges were doing so in parallel. More improvements coming next year.
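The seat-tracking problem boils down to a small state machine per seat. A minimal sketch, with the states and names assumed rather than taken from their Flex app:

```python
from enum import Enum

class SeatState(Enum):
    AVAILABLE = "available"   # seat is free and working
    OCCUPIED = "occupied"     # someone is taking a lab here
    REIMAGING = "reimaging"   # station being reset between attendees
    BROKEN = "broken"         # hardware problem, skip this seat

class Seat:
    """One station on the lab floor; illustrative, not the real schema."""

    def __init__(self, seat_id):
        self.seat_id = seat_id
        self.state = SeatState.AVAILABLE
        self.attendee = None
        self.lab = None

    def assign(self, attendee, lab):
        if self.state is not SeatState.AVAILABLE:
            raise ValueError(f"seat {self.seat_id} is not available")
        self.state, self.attendee, self.lab = SeatState.OCCUPIED, attendee, lab

    def finish(self):
        # Attendee done; the station goes back through a reimage cycle.
        self.state, self.attendee, self.lab = SeatState.REIMAGING, None, None
```

The consistency problem they hit follows directly from this model: parallel badge scanners racing to flip the same seat's state against one back-end database.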
Troubleshooting and Monitoring
Each of the 8 vCDs generates over 200 MB of logs every 10 minutes, roughly 9.6 GB per hour across the environment. Log rotation doesn't cut it for figuring out what's going on; other tools were needed to alert on issues before they got too big. vCenter Operations was extremely useful in diagnosing some of the issues seen during the conference.
One of the tools written to stay ahead of demand tracked how many vApps had been deployed to each vCD and each lab's status. The tool has a minimum and maximum number of pre-deployed labs to keep available. By watching this, the lab team could adjust to human elements such as "Paul Maritz mentioned Horizon, so we should make sure we have more of those labs ready to go when the General Session ends." At night, pre-populating the environment takes about an hour to deploy 600-700 vApps when there is no load. Doing that much deployment on top of people actively taking labs can obviously take a bit longer. One of the keys is being ready for the first rush in the morning.
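The min/max pre-deploy logic amounts to a simple top-up rule per lab. The function and thresholds below are illustrative, not the team's actual tool:

```python
def vapps_to_deploy(ready, min_ready, max_ready):
    """If the staged pool for a lab drops below its minimum, top it back
    up to the maximum; otherwise deploy nothing. The gap between min and
    max avoids constantly deploying one vApp at a time."""
    if ready < min_ready:
        return max_ready - ready
    return 0

# e.g. the keynote just plugged Horizon, so that lab's floor was raised:
assert vapps_to_deploy(ready=3, min_ready=10, max_ready=15) == 12
assert vapps_to_deploy(ready=12, min_ready=10, max_ready=15) == 0
```

Raising `min_ready` for a lab the moment it gets a keynote mention is exactly the kind of human-driven adjustment described above.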
Some of the behind-the-scenes work: they do load balancing across the environment. This is tough; it needs to account for how big the labs are, complicated by the fact that the hardware clusters are not all equal. Some hardware has 96G of RAM per ESXi host, some has 24G. So one of the big questions is how to spread big labs across clusters, vCDs, etc. Some reporting captures how the labs are deployed, but this is a very tough item to automate today. It has gotten better now that DRS is enabled through vCloud Director, though that introduces other issues: in a 3-layer-deep virtualized space, changes often travel over the network rather than directly to storage. Performance labs were especially challenging. "Well, I want to cause the storage to go IOP-crazy in the troubleshooting lab." "Uh, no. You're in a shared environment and could cause the actual physical environment to implode." Some SIOC and resource limits helped make these labs work well enough this year.
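The balancing problem described (unequal clusters, labs of different sizes) can be sketched as a most-free-RAM placement. This is entirely illustrative; the team's real placement logic is not public and certainly weighs more than memory:

```python
def place_lab(lab_ram_gb, clusters):
    """Pick the cluster with the most free RAM that can still fit the lab.
    `clusters` maps cluster name -> (ram_gb_per_host, host_count, used_gb).
    A real placer would also weigh CPU, storage, and per-vCD limits."""
    best_name, best_free = None, -1
    for name, (host_gb, hosts, used_gb) in clusters.items():
        free = host_gb * hosts - used_gb
        if free >= lab_ram_gb and free > best_free:
            best_name, best_free = name, free
    return best_name  # None if nothing fits anywhere

# Hypothetical clusters echoing the 96G vs 24G hosts mentioned above:
clusters = {
    "big":   (96, 24, 1800),   # 2304 GB total, 504 GB free
    "small": (24, 22, 400),    # 528 GB total, 128 GB free
}
```

Even this toy version shows why big labs gravitate to the 96G clusters and why a purely static split across vCDs breaks down.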
There are tons of available stats. One of the things that we in IT don't do enough of is taking these stats and showing them off: getting the information out to say "Look here. This is cool."
In the first year there were:
- 4,000 labs completed
- 9 lab options
- 40,000 VMs.
They quickly realized this is awesome. The HUD up on the screen is huge in terms of selling the awesomeness going on; it really made things happen and the Labs got more popular. Last year the HUD was based on a car analogy. You know you're doing something right when Paul Maritz sits down, asks about the HUD, and says "you mean we are doing that so far?"
This year the original plan was to have a Jet fighter HUD that showed people's names as they selected a lab and when they finished to give it a more human touch (along with missiles and explosions). That ended up not getting displayed due to some bugs found at the last minute.
Instead an Aquarium was used that showed each vApp as a fish (larger sized vApps meant a larger lead fish, smaller meant smaller fish). Behind that lead fish was a school of fish that would follow it around. Each fish in that school represented one of the VMs in the vApp/Lab.
Running the VMworld Labs and building a space to do it is the ultimate load/QA test. Numerous times it turned out the engineers/product managers never expected or thought anyone would do something like that, or that way. Significant amounts of feedback go back into the products being used. The VMworld Labs are the single largest deployment of vCD in any single effort.
It is a fantastic capability to be offered at VMworld and shows just what is possible with the product suite.
Another post covering the LAS4001 labs here: http://www.thinkmeta.net/index.php/2011/09/01/las4001-lab-architecture-session-cloud-and-virtualization-architecture/
Updated: 3 Sept 2011@6:31pm CST - Adding Aquarium HUD HoL screen shot and link to another LAS4001 lab blog post.
Continue reading »
We know that Lab Manager is End of Life. It is just a matter of time. vCloud Director has most of the features and functionality to start migrating over.
Some of the nice new benefits are:
- Everything in vCloud Director follows the standard vSphere approach. No more missing data in vCenter like with Lab Manager.
- VMs/vApps can be exported as OVF and sent elsewhere to start working immediately in a different environment.
- No more SSMOVE needed; just use Storage VMotion.
- No more Lab Manager tools. Straight-up VMware Tools, using APIs instead.
Lab Manager had issues and limitations, such as no more than 8 nodes per cluster. vCloud Director has limits of its own, so be sure to take them into account.
Overall this session stayed at a very high-level view.
Continue reading »
Exchange 2010 can be virtualized and this session covers how they did it.
Some of the design points that need to be covered are:
- DAS vs SAN
- Scale up or Scale out
The choices made here are arbitrary and dependent on how you manage your datacenter and what you like/don't like.
Their layout is:
- 4 datacenters, 2 DCs in US & 2 in Europe
- If they have a DC failure, they can run at around 25% reduced capacity
- 3 Hosts per datacenter
- 2 Hosts are active, 1 failover
- SAN backend with 1TB 7k rpm SATA disks
How did they do it?
- Virtuals are manually balanced across the hosts per role
- DRS set to the most conservative level (level 1): no automatic VMotions
- No Reservations
- Dedicated Farm versus using the general farm
- Exchange, all roles, all support systems etc
The Exchange 2010 Role layout is defined per OS instance, minimal sharing here.
- 2000 mailboxes per server
- 6 vCPU
- 36G of RAM
- 3 NIC (MAPI, Backup & Replication)
- VMDK for OS & Pagefile
- RDM for Log & DB disks
- For the 1TB LUN sizes use the 8MB block size format
- EMC CLARiiON CX4, 1TB 7200rpm SATA disks
- RAID 6
- Datastores formatted with the 8MB block size
- Presented as 500GB and 1TB
- OS, Pagefiles, & Misc Storage are VMDK
- Logfile & Databases are RDM
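The 8MB block-size recommendation follows from VMFS-3 limits: the block size chosen at format time caps the largest single file the datastore can hold. A quick sketch of that mapping (the figures are the standard VMFS-3 limits; the helper function is mine):

```python
# VMFS-3: block size chosen at format time caps the largest file on the
# datastore (the 2048 GB entry is actually 2 TB minus 512 bytes).
VMFS3_MAX_FILE_GB = {1: 256, 2: 512, 4: 1024, 8: 2048}

def min_block_size_mb(file_gb):
    """Smallest VMFS-3 block size (MB) that can hold a file of file_gb GB."""
    for block_mb in sorted(VMFS3_MAX_FILE_GB):
        if file_gb <= VMFS3_MAX_FILE_GB[block_mb]:
            return block_mb
    raise ValueError("file too large for a VMFS-3 datastore")
```

A 1TB disk technically fits at a 4MB block size, so formatting at 8MB as they did leaves headroom for files to grow past 1TB without reformatting.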
LoadGen Physical versus Virtual
They ran LoadGen testing with VMware's assistance, and the performance numbers came in well under the limits Microsoft states are required. In most cases significantly under.
Backups and disk contention started to become an issue as load was added; the symptom was dropped connections. Moving the backups to the passive copies instead addressed most of the concerns.
When doing the migrations, take breaks between each batch to iron out any issues. They found that some pockets of users had unique problems and needed time to work through the gotchas.
Database sizes introduce issues around backup, replication etc. Make sure you can manage them for the demands for your environment.
An interesting discussion point: Hyper-Threading is not supported by Microsoft for production, as it complicates their performance sizing. VMware can do either, so be sure to follow the Microsoft standards at the VM level.
Memory is a big question. Basically set
Storage: the main point is to make sure you have appropriate IOPS capability behind the scenes. The other is that if you are setting up VMDK files, you should eagerzeroedthick the VMDKs. If you check the box enabling FT during creation, the disk is eagerzeroedthick automatically. Otherwise this should be done while the machine is powered off, running vmkfstools from the command line.
16 months later...
- Success doing VMotions and DAG failovers
- Backups are running lights out
- Will add more hosts to expand the environment
- Pain Points:
- Service Desk adoption of new processes
- Integration with legacy tools in house.
After all is said and done this has done quite a bit for the company.
- Datacenter savings
- TCO is down and has been passed on to the business
- much greater flexibility
- Scale out or Scale up very quickly
- Lower Administrative overhead so far
- More options for disaster recovery and scenarios
Exchange 2010 is possible.
Continue reading »
Live Blog and notes taken during the session
Steve Herrod, VMware CTO, presenting for the General Session 2.
One of the great changes is the shift from servers and technology to end devices and services. Universal access to the environment is a critical part of this, and these devices carry high expectations compared to past computing environments. DUH! (Devices, Universal Access, High Expectations).
The direction that addresses DUH: IT needs to simplify, manage, and connect end users to the services they care about. Desktop services will need to be contained and compartmentalized, whether the full desktop (View) or just the applications (ThinApp). If we can separate the desktop, the applications, and the data from each other, powerful user/application/data policies become possible.
View is the direct solution for the Desktop service: the well-known and now-announced View 5.0.
To handle applications, ThinApp is the direction being followed. A new service called ThinApp Factory is part of the application store: it auto-creates ThinApp packages, fully automated and following recipes.
Project Octopus deals with the data service layer. It is built around providing a Dropbox-equivalent service in the enterprise. http://www.vmwareoctopus.com
Horizon has a Mobile piece that lets IT securely control access to the application space, along with a Mobile Virtual Machine component. As part of pushing access and applications, IT can have a controlled, contained environment on the phone.
One of the great things coming is a feature called AppBlast. It can present the application, not a full desktop, to devices like an iThing. It works over HTML5, not a custom protocol. So to edit an Excel spreadsheet on an iPad, just connect to internal IT and work with Excel.
vSphere 5 is the basis for all of this. The hypervisor is still not a commodity: many of these powerful features exist because the base is solid and functional.
vSphere 5 can now handle 32 vCPUs, 1,000 GB of RAM, and 1 million IOPS per host. Melvin the Monster VM has come.
Some of the other great features are around performance guarantees; storage I/O assurance is good. By creating storage pools, you can organize by application, purpose, performance, and other criteria. Once a pool is created, placement can be automated, and DRS can now work on storage using Storage VMotion, fully automatically. Set up the policies and forget about it as the infrastructure deals with it.
vSphere 5 allows the IT admin to set Storage I/O Control to assure performance. Dealing with the noisy neighbor is a significant issue, and the hypervisor is the central place to control it. Network I/O control has arrived as well.
As the future comes to pass, one of the largest challenges is the IP problem. Right now the TCP/IP stack is built with location = identifier: a machine's address is tied to where it sits. If a machine wants to move around, say from an internal cloud to an external cloud, the VM needs a new IP. This can be a very disruptive activity.
VMware now has a solution called VXLAN, Virtual eXtensible LAN. The technology has been developed with Intel, Emulex, Cisco, Arista, and Broadcom, and the spec has been submitted to the IETF. This is a big step for the virtualization story: once this tie is broken, cloud mobility becomes feasible and deliverable.
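The core of VXLAN is a small encapsulation header carrying a 24-bit segment ID (the VNI), so a VM's segment identity rides in the overlay rather than in its IP. A sketch of the header layout from the spec (the field layout follows the draft; the helper functions themselves are mine):

```python
import struct

def vxlan_header(vni):
    """Build the 8-byte VXLAN header: one flags byte with the valid-VNI
    bit set, reserved bytes, and a 24-bit VXLAN Network Identifier.
    The outer UDP/IP encapsulation that carries this is omitted here."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08 << 24          # I flag set; other bits reserved as 0
    return struct.pack("!II", flags_word, vni << 8)

def vxlan_vni(header):
    """Read the VNI back out of an 8-byte VXLAN header."""
    return int.from_bytes(header[4:7], "big")
```

Because the VNI, not the VM's IP, identifies the segment, the VM can keep its address while the outer packet is routed to wherever the VM currently lives.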
Management is the end game after a solid reportable infrastructure is available. From there the approach is
At the end, the focus is that End User Computing is making a major shift soon. This shift comes with massive improvements in the client experience, into something the clients actually want. All of this is possible by simplifying the layers underneath so time can be spent on the new functionality. To do that, management of these spaces needs to become more automated.
Continue reading »
Live blog taken during the keynote.
The keynote started off with a beautiful Tron-3D stylized video talking about computing, virtualization, automation and ultimately the cloud. Your cloud. My cloud.
Hands on Labs started off as a private cloud onsite in 2008. It went to a hybrid cloud solution using some offsite datacenters in 2009. This year the labs are 100% public cloud in 3 datacenters in both US & Europe. The goal is to handle 25,000 labs with over 250,000 VMs created and destroyed in less than 5 days. A truly scalable solution.
Another big point is the VMUG is over 60,000 strong across the entire world. Thanks to all leaders for the great work they do.
Paul Maritz started off with some amazing information. 2009 was the first milestone, when there were more virtual machines than physical machines. This year the milestone broken is that more than 50% of workloads now run on virtuals.
Some amazing achievements, playing with the numbers:
- 1 VM every 6 seconds
- 20 million VMs
- A VMotion every 5.5 seconds
- 800,000 vSphere Admins
- 68,000 VMware Certified Professionals
So what really is the cloud today? Is this something new or something recycled?
A quick history lesson:
- Applications started as glorified bookkeeping for finance.
- Relational databases arrived: a whole new set of applications.
- New programming languages, primarily PC users: Java, HTML, IP.
- ERP, CRM, and non-real-time analytics became possible.
When the applications change, then the IT industry really changes.
In less than 3 years, the primary devices connecting to the networks will be non-Windows OS instances: iPads, Droids, etc.
New data fabrics, new frameworks, HTML5.
Real-time, high-scale analytics and commerce: the Facebook generation.
A fundamental shift: IT needs to move from supporting the client-server and mainframe eras to renewing applications onto cloud frameworks and capabilities. Spending on client-server must become easier and cheaper, and in the meantime IT will bridge from existing modes of end-user access.
Server Virtualization is successful since we can add these functionality improvements without changing the apps.
vSphere 5.0 needs to just work. To accomplish this, VMware's QA was extensive:
- >1 million engineering hours
- >2 million QA hours
- 200 new features
- 2,000 partner certifications
By delivering this heavily QAed solution, less effort is needed to keep the infrastructure up and running, which in turn means less cost to the businesses. They can spend the money on new features, new frameworks, and redeveloped applications instead of back-end support costs.
In the vCloud program, service providers are creating powerful vertical clouds for their industries; the New York Stock Exchange is a good example of a high-end solution gaining large interest from many clients. On the smaller end, "datacenter in a box" small-business solutions are coming with the Storage Virtual Appliance. Another vertical is the SMB space, where VMware GO provides a SaaS solution for the smallest SMBs.
New programs are written by young programmers, not older ones (>35), and they are developing new frameworks and tools to make things easier. With tools like Spring in vFabric 5, along with tc Server, GemFire, and SQLFire, VMware aims to offer a full set of ways to modernize applications and create data fabrics. The newly announced tool is VMware Data Director, which automatically provisions and manages databases. The other third is the PaaS suite to deliver the new applications: Cloud Foundry covers this space, supporting all sorts of frameworks and tools.
Moving up the stack to support end-user computing and existing PC usage: View 5.0 was officially announced, and Horizon is the other piece in this layer. Windows offered great functionality to end clients, and delivery of applications is a huge piece of that. Horizon offers a new way to deliver applications. How do you deliver them to an iPhone, Droid, iPad, etc.?
Paul believes that we are entering a post document world.
Continue reading »