VM Image Sprawl in Real Life

A while back, Geva Perry and I were chatting about virtual machine image sprawl, which is really little more than an extension of the not-so-new physical server sprawl problem. It’s difficult to get hard data on how bad the VM sprawl problem is, since most images exist behind firewalls or in other walled gardens. However, there is one good place to get solid data, and that’s Amazon’s own public image repository.

Real Data
So, around the time we were chatting, I started collecting some information on the number of public Amazon Machine Images (AMIs). This isn’t a perfect sample, but it should be pretty good for most purposes.

Here’s what the data looks like as of today:

[Chart: Amazon Machine Image (AMI) Count]

As you can see, there is quite a steady stream of new images being added to EC2. The data flatlines during the holiday break, but I’m sure we’ll see it pick back up starting next week.

The main takeaway is the 12% month-over-month growth in AMIs from 11/24 to 12/24. That’s pretty amazing when you consider that this covers public images only.
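For anyone who wants to run the same calculation on their own samples, here is a minimal sketch of the month-over-month arithmetic. The function names and the two counts are hypothetical placeholders, not the actual AMI numbers behind the chart; the counts are chosen only so the result matches the 12% figure.

```python
def month_over_month_growth(start_count: int, end_count: int) -> float:
    """Return fractional growth between two counts taken one month apart."""
    if start_count <= 0:
        raise ValueError("start_count must be positive")
    return (end_count - start_count) / start_count


def compounded_growth(monthly_rate: float, months: int = 12) -> float:
    """Compound a monthly growth rate over the given number of months."""
    return (1 + monthly_rate) ** months - 1


# ASSUMPTION: placeholder counts, not the real public AMI totals.
start, end = 1000, 1120
rate = month_over_month_growth(start, end)
print(f"month-over-month: {rate:.1%}")
print(f"if sustained for 12 months: {compounded_growth(rate):.0%}")
```

The second function is only there to show what a 12% monthly rate would mean if it held for a full year, which one month of data can’t tell us.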

Amazon Private Images
Private images might number somewhere between 10x and 100x the public images. I can tell you that CloudScale had approximately 90 private AMIs created over its lifetime (~1 year). CloudScale would probably be considered more of a ‘power user’ though, so it’s more realistic to assume something closer to 10 total private AMIs per average AWS user.

It’s hard to say how that relates to the images in the public repository. I can’t even really hazard a guess, but we’re probably talking on the order of 20,000 AMIs minimum, and quite possibly towards the higher end of the spectrum.
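The back-of-the-envelope range can be sketched like this. The 10x and 100x multipliers come from the estimate above; the public count of 2,000 is a hypothetical placeholder picked so the low end lines up with the 20,000 figure, and the function name is made up for illustration.

```python
def private_image_range(
    public_count: int,
    low_multiplier: int = 10,
    high_multiplier: int = 100,
) -> tuple[int, int]:
    """Return (low, high) estimates of private images from a public count."""
    return public_count * low_multiplier, public_count * high_multiplier


# ASSUMPTION: 2,000 is a placeholder, not a measured public AMI count.
low, high = private_image_range(2000)
print(f"{low:,} to {high:,} private AMIs")
```

Swap in whatever public count you trust; the point is just how wide the range gets between the two multipliers.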

Conclusion
Any way you measure it, while virtualization ‘solved’ the datacenter efficiency and server usage problem, it also uncorked the bottle in terms of sprawl. At least with physical servers you were constrained by hardware, power, and cooling. Now the only constraint is disk space, which is basically no constraint.

This problem is only going to get dramatically worse as we move forward. You’re going to see a lot of folks trying to solve the sprawl problem in a variety of ways. I’m 100% certain, for example, that Amazon’s November pre-announcement of an AWS GUI in ’09 is a result of their need to put a rich UI around their public images. We’ll see ratings, social networking, and possibly versioning baked in when it releases.


  • I can't say RightScale is contributing much to those numbers. The vast majority of our customers have ditched the AMI model altogether, opting to configure their servers with ServerTemplates instead.

    Whereas launching a server from an AMI is analogous to burning a CD from an ISO, launching from a ServerTemplate is analogous to burning a CD from a playlist: the template is essentially just the definition of the composition and configuration of your cloud servers. It's a lot easier to manipulate your components with the granularity of a script-oriented playlist than it is to constantly bundle and "sprawl" images that represent "mostly tweaks". Instead, you can save revisions of your Template and of the underlying scripts themselves. Overall, this model allows for quicker tweaking and repurposing of cloud servers and simplifies their lifecycle management.

    Also, it's portable! You don't have to reconfigure various instance sizes if you want to scale vertically; just replace the base AMI that contains the OS. You can also migrate templates from one cloud to another without having to reconfigure the bulk of the server deployment process. Of course we still have to play by each cloud's own rules, network-level configurations, image types, etc., but you avoid the work of reconfiguring your servers for a new infrastructure provider. This feature is currently live and available in our free Developer Edition for migrations from EC2 US to EC2 EU and back, which are essentially entirely separate clouds.
  • VM proliferation certainly can be a mess as @MattPovey suggests. However, it seems to me that one of the benefits of virtualization is that you can fine tune a specific instance to meet a particular need at low marginal cost ... if you have appropriate management tools.

    I guess a key point is to have deployment processes that have to document purpose and expected life cycle of an instance when it is deployed --- something we never really did in hardware.

    So, if we do with virtual what we did with physical hardware we face sprawl. But would you agree that with new processes and better management tools that does not have to be the case?
  • Andrew,

    That was 90 total AMIs created over the life of the CloudScale project, most of which are still there, mothballed, and could probably be deleted. I'll save that for a weekend project in the near future.

    The vast majority of those images were revisions of the baseline gold master that CloudScale used. There were revisions for 32-bit and 64-bit of CentOS, RHEL4, RHEL5, and Ubuntu. So, basically, this was 2 baselines x 4 OS versions x roughly 10 different attempts and revisions. Mostly tweaks to make sure that Puppet and the CloudScale node agent software did the right thing at boot.

    Obviously, I couldn't agree with you more about the role that Puppet plays in managing this problem. Using virtual machine images as a configuration management system is inherently flawed and completely unscalable.
  • Randy,

    Did you have 90 active images, or was that over the life of CloudScale, with some retired? If you had 90 active, why did you need so many? I know you used Puppet to build some systems, which makes me even more curious.

    Image sprawl is a royal pain in the you-know-what, and the fact that you get more and more of them is only part of the problem. Unless you have a lot of discipline and take really good notes, you have no idea what is on ami XXXXXXX vs. YYYYYYY after they have been around for a while, and even then it is nearly impossible to compare them easily. (That's a not-so-new problem that exists on physical servers as well.)

    As long as we are going to use virtualization, there will be some images, but there are tools and strategies to minimize the sprawl while providing other benefits as well. For example, Puppet can help minimize the number of necessary images and provide meaning to the services you are running. I blogged about it a bit here:
    http://stochasticresonance.wordpress.com/2008/09/01/semantics-matter-or-i-finally-get-it/

    Regards,
    Andrew