Blog entries

Infrastructure-as-a-Service Builder’s Guide v1.0

Just in time for the New Year, we’re releasing a short 12-page whitepaper on building Infrastructure-as-a-Service (IaaS) clouds.  This whitepaper is targeted at folks building public or private clouds who want to understand our general take on clouds, cloud computing, and Infrastructure-as-a-Service.  In particular, we highlight some of the important areas to think about when you are planning and designing your infrastructure cloud.

Of course, we welcome comments and feedback.  They will be incorporated into future revisions.  The paper itself does go into some technical depth in a few areas, but we can provide quite a bit more color in our workshops.

For your reading pleasure, I present our first big technical whitepaper:

Thanks!

The Cloudscaling Team

P.S. We realize the definition of ‘workload’ or ‘cloud workload’ is not as crisp as it could be, and we request your feedback and thinking on better nomenclature or definitions.  Credit will be given as appropriate.



Virtual Server vs. Real Server Disk Drive Speed

It’s important to understand the potential differences between virtual server disk drives and physical disk drives, so I wanted to post a very brief blog on the topic.  For this article I’ve chosen to compare the performance of an iSCSI SAN on Gigabit Ethernet to a single SATA disk drive.  The reason for this is two-fold.  First, it more starkly highlights the relative performance differences between purchasing, say, a single dedicated server with a single disk in a hosting environment and a virtual machine hosted in a cloud environment.  Second, internal private clouds and a lot of the newer cloud offerings are commonly built using an iSCSI SAN backend.

To be clear, the top three U.S. clouds do not use iSCSI SANs: Amazon’s EC2, Rackspace Cloud, and GoGrid all use local RAID subsystems.  This is common knowledge.  Of the early cloud pioneers, as far as I’m aware, it is mostly the U.K.-based clouds, such as ElasticHosts and FlexiScale, that use iSCSI SANs.  The latest set of new cloud entrants, such as Savvis, Terremark, and Hosting.com, all use either iSCSI or Fibre Channel-based SANs.  This is also commonly known.

Your Mileage May Vary on these performance numbers.  I’m not trying to highlight any ‘right’ way to build a cloud here.  I’m simply trying to show what the difference in performance is between a single SATA disk and a VM disk drive backed by an iSCSI SAN over a single Gigabit Ethernet.

This is not a robust performance and benchmarking analysis.  It’s a simple “run the numbers and compare” blog posting.  These are by no means authoritative performance numbers and that’s not their purpose either.  Their purpose is to highlight how performance differs between a single spindle and many in a RAID configuration, even when that RAID is available via a SAN over Gigabit Ethernet.

Please avoid overly critiquing the testing technique here.  It’s not meant to be robust, so nitpicking it serves no purpose.

Setup & Methodology
This is a very simple test in the Cloudscaling hosting & cloud lab environment.  Both servers running the test are on the latest Ubuntu release, Jaunty Jackalope.  One is a physical server with a single SATA disk and the other is a VMware vSphere VM backed by an iSCSI LUN.  The iSCSI LUN is provided by a ZFS-based SAN product called NexentaStor from Nexenta Systems.  This is an OpenSolaris derivative and a very cost-effective alternative to, say, a NetApp or EqualLogic system.

The iSCSI SAN hardware is a simple Sun x2200 M2 with a Sun J4200 JBOD and 6 15K RPM SAS drives.

The bonnie++ command line was as simple as possible:


bonnie++ -n 512


Note that the simplicity of the bonnie testing method may have caused some weird skewing of numbers.  See below for more.

Basic Numbers
Here is a basic high-level chart showing the numbers.

Figure 1. High level of SATA vs. VM disk

The first thing you will notice, of course, is the two big spikes for sequential and random file reads.  These numbers are artificially inflated as clearly 325,000 IOPS for sequential and 460,000 IOPS for random reads are ridiculous.  This is likely due to caching either in the OS or the controller on the physical box.  bonnie++ is supposed to account for this, but for some reason, in this instance it did not.  So it might be a little easier to evaluate the relative performance on a logarithmic scale:

Figure 2. Logarithmic scale for high-level test results

Much better.  What is easier to notice here is that the VM generally performs better on both standard measures of disk speed: raw throughput and disk operations (I/O per second or IOPS) with the obvious exception of the two aberrant data points.
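To see why the logarithmic scale helps, here is a minimal Python sketch (not part of the original test tooling) that takes a handful of the IOPS figures quoted in this post and computes where each would sit on a log axis.  The labels are my own shorthand and the selection of values is just illustrative.

```python
import math

# IOPS figures quoted in this post; the labels are shorthand, not bonnie++ output names.
iops = {
    "sequential read (inflated)": 325_000,
    "random read (inflated)": 460_000,
    "random seek, SAN": 10_000,
    "sequential delete, SAN": 2_573,
    "sequential delete, SATA": 840,
    "random seek, SATA": 185,
}


def log_positions(values):
    """Map each value to log10(value), i.e. its position on a logarithmic axis."""
    return {name: math.log10(v) for name, v in values.items()}


positions = log_positions(iops)

# On a linear axis the largest bar dwarfs the smallest by this factor.
linear_spread = max(iops.values()) / min(iops.values())
# On a log axis the same range is compressed into a few decades.
log_spread = max(positions.values()) - min(positions.values())

for name, pos in positions.items():
    print(f"{name:28s} {iops[name]:>8,d} IOPS  log10 = {pos:.2f}")
print(f"linear spread: {linear_spread:,.0f}x, log spread: {log_spread:.2f} decades")
```

On a linear chart the smallest value is a tiny fraction of the tallest bar, which is why it barely registers; on a log axis every value lands within a few decades of the others.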

Removing those two data points will give us an even clearer picture:

Figure 3. Normalized test results

Great.  Now this is very clear.  As you can see, the first half of the chart shows raw throughput (Kbytes/second).  When reading blocks from the VM disk we’re nearly saturating the Gigabit Ethernet link, which should top out at a theoretical 125MBps, and we’re hitting 107MBps on average over 10 runs, so this is quite acceptable.  The SATA disk, in comparison, gets just over 60MBps, which is about right, even though the SATA spec and controller are capable of more.  Sustained block reads from SATA disks will typically be 60-80MBps in the real world.
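For anyone who wants to check the arithmetic, here is a small Python sketch of the link-ceiling calculation.  It assumes a raw line rate of 1,000,000,000 bits per second and ignores Ethernet, TCP/IP, and iSCSI protocol overhead, so the real usable ceiling is somewhat lower than the figure it prints.

```python
# Raw Gigabit Ethernet line rate in bits per second (protocol overhead ignored).
GIGABIT_BITS_PER_SEC = 1_000_000_000


def link_ceiling_mbytes_per_sec(bits_per_sec):
    """Convert a line rate in bits/second to megabytes/second (8 bits per byte)."""
    return bits_per_sec / 8 / 1_000_000


ceiling = link_ceiling_mbytes_per_sec(GIGABIT_BITS_PER_SEC)  # 125.0 MB/s

measured_vm_mbps = 107    # VM block reads, average quoted above
measured_sata_mbps = 60   # single SATA disk, "just over 60" quoted above

utilization = measured_vm_mbps / ceiling

print(f"theoretical ceiling: {ceiling:.0f} MB/s")
print(f"VM block reads use {utilization:.1%} of the raw link")
print(f"VM vs. SATA throughput: {measured_vm_mbps / measured_sata_mbps:.2f}x")
```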

Much more interesting is the number of IOPS.  Many real-world disk workloads, like a database, spend the majority of their time ‘seeking’ from one position on the disk to another, meaning lots of random file access.  They will bottleneck waiting for the disk ‘head’ to move from one position to another on a disk drive and read new data.  It’s hard to tell the difference above because the SATA disk is so slow it barely registers on the chart.

If we change to a logarithmic scale again the data becomes much easier to read:

Figure 4. Normalized logarithmic scale test data

Now you can see that the results for random seeks (i.e. moving the head of the disk drive from one location to a new one to read a piece of data) are starkly different.  A single SATA disk gets about 185 IOPS while a set of 6 SAS disks in the SAN is right around 10,000 IOPS.  This is a huge performance difference.  There are several reasons for this.  One, a typical SATA disk has an average latency of 8.5ms and a 15K SAS disk has only 3ms.  Also, with 6 disks in a RAID configuration, I have 6x as many disk heads to read with.
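The two factors above (lower latency per disk, more disks) can be combined into a back-of-the-envelope estimate.  The Python sketch below is a deliberately crude model, not a description of how the SAN actually behaves: it assumes each spindle services one request at a time, that random reads spread evenly over all six disks, and it ignores caching, command queuing, and RAID layout entirely.

```python
def iops_from_latency(latency_ms):
    """Rough ceiling on random IOPS for one spindle handling one request at a time."""
    return 1000.0 / latency_ms


def array_iops(latency_ms, spindles):
    """Idealized scaling: random reads are spread evenly across all spindles."""
    return spindles * iops_from_latency(latency_ms)


sata_estimate = iops_from_latency(8.5)   # single SATA disk, 8.5ms average latency
sas_estimate = iops_from_latency(3.0)    # single 15K SAS disk, 3ms average latency
san_estimate = array_iops(3.0, 6)        # six SAS disks behind the SAN

print(f"single SATA disk: ~{sata_estimate:.0f} IOPS")
print(f"single SAS disk:  ~{sas_estimate:.0f} IOPS")
print(f"six SAS disks:    ~{san_estimate:.0f} IOPS")
print(f"modeled advantage: ~{san_estimate / sata_estimate:.0f}x")
```

The measured figures (about 185 and about 10,000 IOPS) both come out higher than this model predicts, presumably because of the caching and queuing effects it leaves out, but the direction and rough order of the gap are the same.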

It’s still a bit hard to see with this chart, but for most of the rest of the IOPS tests above, the SAN solution is roughly 3x the performance of the single disk.  For example, Sequential File deletion is 2,573 (SAN) vs. 840 (SATA).

Rather than going through the entire set of results, I recommend you download my simple spreadsheet.

Note that for Amazon, Rackspace, or GoGrid, local VM disk results will likely look very similar to the iSCSI SAN results for IOPS, while sequential read/write throughput (first half of the chart) will be much higher.

Amazon’s Elastic Block Store (EBS) would have similar performance characteristics to the iSCSI SAN above, and hence you can see why it can be acceptable for running a database.

Summary
My point here is very simple.  I want to highlight the difference between purchasing a dedicated server with a single SATA disk (or a small number of them) vs. going with a cloud solution that uses a shared iSCSI SAN or local RAID on a single physical node.  Purchasing your own dedicated server solution with a RAID can be extremely costly compared to a similar cloud solution.

More importantly, for those workloads that require random I/O and file access, like database applications, RAID is clearly a winner.  That’s why using a shared RAID (via an iSCSI SAN or a local RAID) on a physical node for your cloud VM can be a clear advantage of the cloud today.


More on Amazon’s SAS70 Type II

Amazon hasn’t been forthcoming since my last post on their controls and control objectives, which is disappointing, but expected.  I still believe that transparency here is more important than security through obscurity.  Hiding the controls and control objectives doesn’t provide much in the way of security benefits, although I’m certain some will argue that it does.  Consider, however, that while the SAS70 controls would tell us what is being audited, that doesn’t necessarily cover all of the controls in place.

Regardless, a bit more light has been shed on Amazon’s controls and measures in their recent security webinar.  You can access it here.

At a high level, CJ Moses, who presents the webinar, speaks to the core areas they covered in the control objectives, which are:

  1. Security organization
  2. Amazon employee lifecycle
  3. Logical security
  4. Physical security
  5. Environmental safeguards
  6. Change management
  7. Data integrity, availability, and redundancy
  8. Incident handling

This looks pretty reasonable at a high level.  Of course, it would be nice to see the actual controls and objectives, but at least they are covering the appropriate areas of security.  I do notice that there isn’t much around perimeter or related security.  I’m guessing they are trying to gloss over the AWS distributed firewall.  It would be nice if someone besides Amazon was vetting the way this was built.  They appear to consider it a piece of core intellectual property despite the fact it would be trivial to reproduce.  I’m not exactly certain why.


Why is Amazon’s SAS70 Audit Bogus?

At first glance it seems like Amazon’s recent announcement of a successful SAS70 audit is grounds for celebration[1]. Certainly it has met with fanfare on Twitter and blogs.

Unfortunately, a SAS70 audit isn’t what most people think it is. Worse yet, Amazon’s reluctance to provide details of the audit provides a false sense of security with no tangible benefits.

Let me explain.

Understanding the SAS70 Audit
The SAS70 is a methodology for performing an audit, not the audit rules themselves. The SAS70 can prove whatever you decide it needs to prove. From taking the garbage out to turning the lights on.

From Wikipedia:


SAS 70 defines the professional standards used by a service auditor
to assess the internal controls of a service organization and issue a service auditor’s report.


Here’s how it works.

For a SAS70, you must specify a series of “controls” and “control objectives”. Like it sounds, you are asserting that a given ‘control’ meets a goal or objective. An example of a control might be the ‘new user creation process’ or a ‘firewall’. An example of a control objective might be the following[2]:


The new user creation process MUST guarantee that a user’s password
is at least 8 characters long and composed of a mix of at least one uppercase,
one lowercase, and one numerical character.
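To show how concrete and testable a control objective like this is, here is a minimal Python sketch of a check an auditor or engineer might run against it.  The function name is hypothetical and this is only one possible reading of the objective as worded above, not language from any real SAS70 report.

```python
def meets_password_objective(password: str) -> bool:
    """Check the example objective: 8+ characters with at least one
    uppercase letter, one lowercase letter, and one numerical character."""
    return (
        len(password) >= 8
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
    )


# Hypothetical sample inputs, purely for illustration.
for candidate in ["Passw0rd", "password1", "Pw0rd", "ALLUPPER99"]:
    verdict = "pass" if meets_password_objective(candidate) else "fail"
    print(f"{candidate!r}: {verdict}")
```

Note that the check says nothing about whether an 8-character minimum is a good policy.  It only verifies that the stated objective is met, which is exactly what the audit does.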


Once all of the control objectives are in place, an outside auditor, like Deloitte & Touche, comes in and verifies that you are compliant with the stated control objectives over a period of time. If it is a Type 1 audit, the period is 3 days. If it is a Type 2, the period is 6 months.

Now here’s the rub: Who decides what the control objectives are? An outside agency? A regulatory body?

None of the above. The company being audited decides and can make the control objectives anything they like. Here’s a SAS70 FAQ response on the topic right from the SAS70.com website.

Again, the SAS70 is just an auditing framework. Why then do so many think it’s useful?

Background on the SAS70 Audit
The SAS70 comes out of the financial industry and is a relatively generic framework for that reason. The financial industry has tons of different regulatory requirements that vary from state to state and country to country. Moreover, within the financial industry these kinds of audits are undertaken all of the time, the parties involved know what they are testing for, and how to negotiate it.

For example, a large bank might outsource work to a secondary institution and want that institution to provide proof that it is following certain guidelines or regulations. A good example is the Bank Secrecy Act. The large bank in this case knows what the BSA requires and how to evaluate the secondary institution’s SAS70. This knowledge allows it to assess the secondary institution’s level of compliance with the BSA. At the same time, the secondary institution is familiar with what its large partners will require and sets up its annual Type 2 to cover the ‘usual suspects’ of controls and control objectives.

So how did we get here?

Hosting Companies and the SAS70
In recent years, as financial institutions began to outsource, they required that various hosting (and other) businesses perform the audit as well. Unlike with their usual partners, it hasn’t been clear what hosters need to be compliant with. Because of this, most folks have simply done these SAS70s as one-off Type 1s. This allowed the hosters to keep their costs down while allowing the bank to outsource and the hosters to generate revenue.

Here’s the problem: Cloud computing is ushering in whole new ways of delivering IT services.

It demands greater transparency than ever, especially when it comes to security. If the average person doesn’t understand the SAS70, and if you don’t publish your control objectives so that others can vet them, then you are creating a false sense of security.

You could have one control objective that simply says: “we must keep the power in the data center on” and successfully pass by fulfilling that over 3 days or 6 months.

The Need For A Cloud Security Standard
There are a couple of security and IT standards that can be used as the basis for a good SAS70 audit. For example, there are COBIT and ISO 27002 (formerly ISO 17799). There are probably others I’m unfamiliar with. Unfortunately, most of these standards really focus on the Enterprise and not on multi-tenant public clouds or hosting companies, which have some issues specific to their particular business models and architectures.

So, even if Amazon used one of these, it’s still not good enough for them to keep their controls and control objectives hidden from public view. How are we to be certain that they are sufficient? [3]

Summary
Until there is a security standard for running a cloud, SAS70 audits with unpublished controls and control objectives, like the recent AMZN announcement, are simply smoke and mirrors. They provide little or no real assurance to the average consumer of the AWS public cloud and serve only to provide a false sense of security.

UPDATE: @wpauley says he has a copy of the AWS controls, but I haven’t seen them yet. When I get a copy I will post them.
UPDATE2: Apparently @wpauley was a special case. AWS is keeping the controls under wraps. If you have a copy send them to me anonymously and I will get them posted.


[1] Even the recent refresh of the Amazon Security Whitepaper (PDF) does not include details on the controls or control objectives
[2] Been a while since I was involved in a SAS70 and there is a specific language they use that I’ve forgotten. Did not find any examples on the net. Appreciate clarifications in comments below if you have them.
[3] I think this raises a broader question: should any public cloud ever be allowed to keep its SAS70 controls and control objectives hidden? There is a very nominal argument for security through obscurity, but the reality is that many people will have to see them anyway, so why not shed some light?


On Second Thought…How Big Is AWS Really?

We are trying something new at Cloudscaling and inviting a few of the more interesting cloud computing bloggers to provide some alternative viewpoints.  We thought we would start with Andy Schroepfer. His additional analysis of AWS revenue is thoughtful and worth consideration.  –Randy Bias, Founder and Cloud Strategist, Cloudscaling

Guest Author Andy Schroepfer is VP of Strategy at Rackspace. You can follow Andy’s content on www.NoMoreServers.com and via Twitter @shrepfur.

Determining the exact revenue size of Amazon’s Web Services (AWS) unit is akin to finding the exact server that an Amazon customer’s code is running on in the cloud. In both cases, only Amazon knows the specifics whereas the rest of us are left to guesstimate. I have previously praised a few blog researchers for their good efforts and wanted to contribute to the discussion with an analysis of a different type. My approach, using just the financial detail in Amazon’s SEC filings, suggests AWS might be smaller than the much more detailed assessments grounded in usage data. To that end, I suggest that all of AWS is below $200 million annually, and perhaps closer to $150 million.

