Infrastructure-as-a-Service Builder’s Guide v1.0

Just in time for the New Year, we’re releasing a short 12-page whitepaper on building Infrastructure-as-a-Service (IaaS) clouds.  This whitepaper is targeted at folks building public or private clouds who want to understand our general take on clouds, cloud computing, and Infrastructure-as-a-Service.  In particular, we highlight some of the important areas to think about when you are planning and designing your infrastructure cloud.

Of course, we welcome comments and feedback.  They will be incorporated into future revisions.  The paper itself does go into some technical depth in a few areas, but we can provide quite a bit more color in our workshops.

For your reading pleasure, I present our first big technical whitepaper:

Thanks!

The Cloudscaling Team

P.S. We realize the definition of ‘workload’ or ‘cloud workload’ is not as crisp as it could be, and we request your feedback and thinking on better nomenclature or definitions.  Credit will be given as appropriate.



  • Good read. Just to confirm some of the definitions you guys use :

    - a "pod" is basically any arbitrary grouping of VM hosts. Maybe it's based on physical infrastructure boundaries (the VLAN example), maybe it's based on end customer identity (all VMs of customer A have to go on that pod), maybe it's based on workload type (all Apaches have to go there).

    - an "availability zone" is a collection of one or more pods, where you have protection against individual VM host crashes (H/A), but not against "disasters" in the sense of traditional DR. In case of such a disaster you had better have a replica in another zone.

    Are these definitions in line with your thinking? If so, would you agree then that in most cases an availability zone will map onto one physical data center?
  • Yes. That's correct. Mostly. A pod isn't an arbitrary grouping though. It's a grouping based on scale, which is related to architecture decisions made in designing the pod. Google's pods for their infrastructure are 10,000 servers, because they rely on all of the servers in a given pod being on the same switch. (They build their own custom switches for this purpose.) It's both a design decision and a scaling constraint.

    VMware pods will almost certainly be designed around Virtual Center, which has a stated limit of 256 ESX hosts, but most folks I've talked to say realistically it's 50. I've also heard inklings that if you use DRS this number is much, much smaller. So if you decide that DRS is a requirement for a VMware-based IaaS offering, then your pod size might be only 30 ESX hosts (or fewer).

    Another scaling constraint (business, not technical this time) is capex. You might have a design that allows for 1,000+ nodes, but design a pod at a smaller size initially due to the realities of how much you can build out at once.

    I would say that, when well designed, an availability zone == a datacenter, but I'm not certain that is always the case. It's fairly likely that over time folks will have more than one availability zone within a single datacenter, assuming each availability zone is isolated in power, network, and cooling.

    The primary idea here is that availability zone is cribbed directly from Amazon's usage: facility infrastructure is guaranteed to be redundant, but not the facility itself. For a redundant facility you would need A) a different building and B) that additional building to be far enough away to be unaffected by acts of god. That range varies, but my personal number is 250 miles.
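A minimal Python sketch of the pod-sizing reasoning in the reply above: the effective pod size is whichever constraint binds first. The class and field names are hypothetical, the 256 and 30 figures are the stated and anecdotal VMware numbers from the comment, and the capex figure is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodConstraints:
    """Upper bounds on pod size, in hosts (hypothetical model)."""
    stated_limit: int     # vendor-stated management limit (256 ESX hosts per the comment)
    practical_limit: int  # what operators report works in practice (about 30 with DRS)
    capex_limit: int      # hosts you can afford to build out at once (assumed value)


def pod_size(c: PodConstraints) -> int:
    """The pod can be no larger than its tightest constraint."""
    return min(c.stated_limit, c.practical_limit, c.capex_limit)


# VMware-based pod with DRS required; capex_limit=100 is an assumption.
vmware_with_drs = PodConstraints(stated_limit=256, practical_limit=30, capex_limit=100)
print(pod_size(vmware_with_drs))  # 30
```

Here the practical limit binds, but a smaller build-out budget would make capex the binding constraint instead, which is the business-side case described above.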
  • Randy -- this is a nice primer. Question: At what size (measured in VMs, hosts, apps, or whatever metric you like) are today's CCSs likely to sweat, and what are the factors that cause a CCS to hit scalability limits? Put another way, what 'resource' in the cloud management infrastructure (be it technical, people, or process) is likely to be the bottleneck as you grow your IaaS cloud?
  • John, apologies for the delay. It depends on the CCS. It looks like the low end for a pod is about 50 physical servers running ~30 VMs each and the high end is probably more like 500 physical servers running ~30 VMs each. A CCS could conceivably manage a fairly large number of pods without too much trouble. I expect in the thousands. Any CCS that is designed in a loosely-coupled fashion should be able to be horizontally scaled using regular techniques. In the end, most of them are simple batch processing systems.

    I don't know that there is a single resource constraint in scaling an IaaS cloud. The biggest issue is more one of scaling factor. As the margins get thinner, the ability to manage 10, 100, or 1,000 servers per operator will be crucial, but will also reach a point of diminishing returns. The difference in the cost of a single operator spread across 10 vs. 100 servers is big, but between 1,000 and 10,000 servers it is pretty marginal.

    Or to put it slightly differently, IaaS providers need to optimize their cost structures and that will be the primary source of any 'bottleneck' in that it will directly impact scalability. But over-optimization is dangerous. At some point those resources are better spent on sales & marketing.
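The arithmetic behind the pod sizes and the diminishing-returns point in the reply above can be worked through in a few lines of Python. This is only a sketch: the function names are made up for illustration, the ~30 VMs per server is the rough figure quoted in the comment, and the operator cost of 100,000 per year is a hypothetical number, not one from the discussion.

```python
VMS_PER_SERVER = 30       # approximate figure quoted in the comment
OPERATOR_COST = 100_000   # hypothetical annual cost of one operator


def pod_vm_capacity(servers: int, vms_per_server: int = VMS_PER_SERVER) -> int:
    """Total VMs a pod can host at the quoted density."""
    return servers * vms_per_server


def operator_cost_per_server(operator_cost: float, servers_per_operator: int) -> float:
    """One operator's cost spread evenly across the servers they manage."""
    return operator_cost / servers_per_operator


print(pod_vm_capacity(50))    # low-end pod: 1500 VMs
print(pod_vm_capacity(500))   # high-end pod: 15000 VMs

# Per-server operator cost at each management ratio.
for ratio in (10, 100, 1_000, 10_000):
    print(ratio, operator_cost_per_server(OPERATOR_COST, ratio))
```

With these assumed numbers, going from 10 to 100 servers per operator saves 9,000 per server per year, while going from 1,000 to 10,000 saves only 90. Each step is the same tenfold improvement in ratio, but the absolute saving shrinks by a factor of 100 across that range.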