Virtual Server vs. Real Server Disk Drive Speed

It’s important to understand the potential differences between virtual server disk drives and physical disk drives, so I wanted to post a very brief blog on the topic.  For this article I’ve chosen to compare the performance of an iSCSI SAN on Gigabit Ethernet to a single SATA disk drive.  The reason for this is two-fold.  First, it starkly highlights the relative performance difference between purchasing, say, a dedicated server with a single disk in a hosting environment and a virtual machine hosted in a cloud environment.  Second, internal private clouds and a lot of the newer cloud offerings are commonly built using an iSCSI SAN backend.

To be clear, the top three U.S. clouds do not use iSCSI SANs: Amazon’s EC2, Rackspace Cloud, and GoGrid all use local RAID subsystems.  This is common knowledge.  Of the early cloud pioneers, as far as I’m aware, it is mostly the U.K.-based clouds such as ElasticHosts and FlexiScale that use iSCSI SANs.  The latest set of new cloud entrants, such as Savvis, Terremark, and Hosting.com, all use either iSCSI or Fiber Channel-based SANs.  This is also commonly known.

Your Mileage May Vary on these performance numbers.  I’m not trying to highlight any ‘right’ way to build a cloud here.  I’m simply trying to show what the difference in performance is between a single SATA disk and a VM disk drive backed by an iSCSI SAN over a single Gigabit Ethernet.

This is not a robust performance and benchmarking analysis.  It’s a simple “run the numbers and compare” blog posting.  These are by no means authoritative performance numbers and that’s not their purpose either.  Their purpose is to highlight how performance differs between a single spindle and many in a RAID configuration, even when that RAID is available via a SAN over Gigabit Ethernet.

Please avoid overly critiquing the testing technique here.  It’s not meant to be robust, so nitpicking it serves no purpose.

Setup & Methodology
This is a very simple test in the Cloudscaling hosting & cloud lab environment.  Both servers running the test are on the latest Ubuntu release, Jaunty Jackalope.  One is a physical server with a single SATA disk and the other is a VMware vSphere VM backed by an iSCSI LUN.  The iSCSI LUN is provided by a ZFS-based SAN product called NexentaStor from Nexenta Systems.  This is an OpenSolaris derivative and a very cost-effective alternative to, say, a NetApp or EqualLogic system.

The iSCSI SAN hardware is a simple Sun x2200 M2 with a Sun J4200 JBOD and six 15K RPM SAS drives.

The bonnie++ command line was as simple as possible:


bonnie++ -n 512


Note that the simplicity of the bonnie testing method may have caused some weird skewing of numbers.  See below for more.

Basic Numbers
Here is a basic high-level chart showing the numbers.

Figure 1. High level of SATA vs. VM disk


The first thing you will notice, of course, is the two big spikes for sequential and random file reads.  These numbers are artificially inflated as clearly 325,000 IOPS for sequential and 460,000 IOPS for random reads are ridiculous.  This is likely due to caching either in the OS or the controller on the physical box.  bonnie++ is supposed to account for this, but for some reason, in this instance it did not.  So it might be a little easier to evaluate the relative performance on a logarithmic scale:

Figure 2. Logarithmic Scale for High Level Results


Much better.  What is easier to notice here is that the VM generally performs better on both standard measures of disk speed: raw throughput and disk operations (I/O per second or IOPS) with the obvious exception of the two aberrant data points.

Removing those two data points will give us an even clearer picture:

Figure 3. Normalized test results


Great.  Now this is very clear.  As you can see, the first half of the chart shows raw throughput (Kbytes/second).  When reading blocks from the VM disk we’re nearly saturating the Gigabit Ethernet link, which should top out at a theoretical 125MBps (1Gbps divided by 8 bits per byte), and we’re hitting 107MBps on average over 10 runs, so this is quite acceptable.  The SATA disk, in comparison, gets just over 60MBps, which is about right, even though the SATA spec and controller are capable of more.  Sustained block reads from SATA disks will typically be 60-80MBps in the real world.
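
As a quick sanity check on the link-saturation arithmetic, here is a minimal Python sketch that converts a nominal line rate to megabytes per second and works out how much of that ceiling a measured throughput represents.  The function names are purely illustrative, and the calculation assumes decimal units and ignores Ethernet, TCP/IP, and iSCSI framing overhead, so the real usable ceiling is somewhat lower than the nominal figure.

```python
def line_rate_mbytes_per_sec(gigabits_per_sec: float) -> float:
    """Nominal line rate in MBytes/s, assuming decimal units and 8 bits per byte."""
    return gigabits_per_sec * 1000 / 8


def link_utilization(measured_mbytes_per_sec: float, gigabits_per_sec: float) -> float:
    """Fraction of the nominal line rate consumed by a measured throughput."""
    return measured_mbytes_per_sec / line_rate_mbytes_per_sec(gigabits_per_sec)


if __name__ == "__main__":
    ceiling = line_rate_mbytes_per_sec(1.0)   # single Gigabit Ethernet link
    vm_read = 107.0                           # MBps, VM block reads from the article
    print(f"Nominal GigE ceiling: {ceiling:.0f} MBps")
    print(f"VM block reads use {link_utilization(vm_read, 1.0):.1%} of the link")
```

The same helper makes it easy to see what a faster link would allow: passing 10.0 instead of 1.0 gives the nominal ceiling for 10 Gigabit Ethernet.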

Much more interesting is the number of IOPS.  Many real-world disk workloads, such as databases, spend the majority of their time ‘seeking’ from one position on the disk to another, meaning lots of random file access.  They will bottleneck on waiting for the disk ‘head’ to move from one position to another on a disk drive and read new data.  It’s hard to tell the difference above because the SATA disk is so slow it barely registers on the chart.

If we change to a logarithmic scale again the data becomes much easier to read:

Figure 4. Normalized logarithmic scale test data


Now you can see that the results for random seeks (i.e. moving the head of the disk drive from one location to a new one to read a piece of data) are starkly different.  A single SATA disk gets about 185 IOPS while a set of 6 SAS disks in the SAN is right around 10,000 IOPS.  This is a huge performance difference.  There are several reasons for this.  One, a typical SATA disk has an average latency of 8.5ms while a 15K SAS disk has only 3ms.  Two, with 6 disks in a RAID configuration, I have 6x as many disk heads to read with.
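
To make the two factors concrete, here is a rough back-of-envelope model in Python.  It assumes each random I/O costs one average access latency, so a single spindle manages roughly 1000 / latency_ms operations per second, and that spindles scale perfectly linearly.  The function name is hypothetical, and the model deliberately ignores caching, command queuing, RAID layout, and network latency, so treat it as an illustration of how the factors combine rather than a prediction.

```python
def estimated_random_iops(avg_latency_ms: float, spindles: int = 1) -> float:
    """Naive estimate: one I/O per average access latency, scaled by spindle count.

    Assumes no caching, no queuing, and perfect distribution across spindles.
    """
    return spindles * 1000.0 / avg_latency_ms


if __name__ == "__main__":
    sata = estimated_random_iops(8.5)             # single SATA disk, 8.5ms
    san = estimated_random_iops(3.0, spindles=6)  # six 15K SAS disks, 3ms each
    print(f"Single SATA estimate: {sata:.0f} IOPS")
    print(f"6x 15K SAS estimate:  {san:.0f} IOPS")
    print(f"Estimated advantage:  {san / sata:.0f}x")
```

Both measured figures (about 185 and about 10,000 IOPS) come in above what this naive model predicts, so other effects the model ignores, possibly caching or command queuing, are also at work.  The sketch is only meant to show that lower latency and more spindles multiply together.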

It’s still a bit hard to see with this chart, but for most of the rest of the IOPS tests above, the SAN solution is roughly 3x the performance of the single disk.  For example, Sequential File deletion is 2,573 (SAN) vs. 840 (SATA).

Rather than going through the entire set of results, I recommend you download my simple spreadsheet.

Note that for Amazon, Rackspace, or GoGrid, local VM disk results will likely look very similar to the iSCSI SAN results for IOPS, while sequential read/write throughput (first half of chart) will be much higher.

Amazon’s Elastic Block Storage (EBS) would have similar performance characteristics to the iSCSI SAN above and hence you can see why it can be acceptable for running a database.

Summary
My point here is very simple.  I want to highlight the difference between purchasing a dedicated server with a single SATA disk (or a small number of them) vs. going with a cloud solution that uses a shared iSCSI SAN or local RAID on a single physical node.  Purchasing your own dedicated server solution with a RAID can be extremely costly compared to a similar cloud solution.

More importantly, for those workloads that require random I/O and file access, like database applications, RAID is clearly a winner.  That’s why using a shared RAID (via an iSCSI SAN or a local RAID) on a physical node for your cloud VM can be a clear advantage of the cloud today.


  • Dear Randy,

    Without wishing to get into an argument, both Chris Webb and Richard Davies are personal friends and colleagues from university. They definitely use local storage as I described.

    In terms of other points:

    - we use 2.5" spindles for the reasons you outline either in 8x 1U units or 24x in 2U. We use RAID 6 so actually the storage in use is reduced.
    - both ElasticHosts and ourselves don't oversubscribe x4 the CPU which I would describe as totally excessive. ElasticHosts clearly state on their website a maximum over-subscription of x2. We manage things on a server load basis directly which works as roughly the same end result.
    - we don't see average VM sizes of 1GB RAM, usually nearly 4-5GB so the sort of VM to spindle ratio you outline is not what we are experiencing
    - neither ElasticHosts nor ourselves use Nehalem or Westmere cores but AMD cores of which the latest generation, cost for cost, outperform their Intel equivalents for virtualisation uses under our tests

    I appreciate your analysis but as outlined above, your figures don't correlate with what we are experiencing. I'll outline below our typical figures for a 1U (most of our cloud) unit:
    CPU: 26GHz (12 cores)
    RAM: 64GB
    Storage: 3TB (nett in RAID 6)
    Drives: x8 2.5inch

    On a box like this we typically see 12 to a maximum of 20 VMs. I completely agree with you that disks are the main bottleneck which is why we don't use heavy over-subscription as you describe. We have dynamic assignment algorithms that take into account disk size, existing server load and other factors to optimise performance and distribution of load.

    Finally in terms of 10GigE, the issue here is not bandwidth but latency. Ethernet just has too much latency to be really snappy when using a SAN. This is the second bottleneck after disk performance. 10GigE won't solve this problem. Actually if you want to make a really low latency network you are better using Infiniband which has super low latency compared with Ethernet. This is why it is very popular with huge grid super-computing deployments.

    Appreciate all your feedback!

    Best wishes,

    Robert

    --
    Robert Jenkins
    Co-founder
    CloudSigma
    http://www.cloudsigma.com/
    http://www.twitter.com
  • As a correction, ElasticHosts, in common with our own platform use local persistent storage which offers some of the best performance available in the cloud. Generally SAN setups are more expensive and create large single points of failures as demonstrated by some of the major cloud outages (caused by SAN issues).

    There are significant developments happening that will make local storage more similar to SAN set-ups in terms of removing exposure to the failure of any physical host. In reality, multi-node customers already benefit from complete physical separation and redundancy. You can read more about our approach to Infrastructure-as-a-Service at http://www.cloudsigma.com .
  • @CloudSigma, I believe you are incorrect unless ElasticHosts has changed recently. I spoke directly with Richard Davies about this, admittedly a while ago (18 months now) and they use a combination of local storage and iSCSI SAN. As I understand it, the storage is local on an LVM LUN unless there is a need to migrate the VM and then that LUN is shared to another box in the cluster using iSCSI. It sounded to me like this winds up with quite a bit of cross-mounting of LUNs.

    At Cloudscaling, we have experience with all of the major storage deployment models, local storage, remote storage, iSCSI, FC-SAN, distributed filesystems (GlusterFS, OCFS, GFS, Lustre), etc. In fact, if you read the recent CloudHarmony disk benchmarking post you'll notice a couple of things:

    http://blog.cloudharmony.com/2010/06/disk-io-benchmarking-in-cloud.html

    First, at GoGrid, where I was previously, we have the best scores for local storage. This is because the GoGrid team spent a lot of time performance tuning the storage as well as understanding oversubscription rates. In other words, I understand the benefits of local storage very well.

    Second, you'll notice that several clouds with local storage have benchmarks on par with SAN based providers. Again, this is mostly because of oversubscription rates. (See my post on subscription modeling).

    So, the principal problem with your push back is that you are asserting there is one right way to build storage for a cloud and that is provably false. Clearly there are pluses and minuses with each approach. The Cloudscaling team has deployed into production using all of these techniques and we understand them quite well.

    We all like local storage, but there are some significant tradeoffs in terms of manageability, lack of live migration, etc. Certainly, some distributed filesystems and similar technologies look very promising, but none are currently production worthy.

    OTOH, one huge development in iSCSI's favor is that 10GE is about to get dirt cheap. At $250/port TCO (or less), running SANs (FCoE or iSCSI) over 10GE is very attractive. It also simplifies cabling, allows for better utilization rates, and allows much higher density of VMs per compute node.

    For example, in a modern cloud today CPU utilization rates are 20-40% (see my presentations while at GoGrid and James Hamilton of Amazon/Microsoft's presentations). This generally implies that cores could be further oversubscribed by providers. Local storage actually causes a problem here because you can only put 12 3.5" spindles in a 2U rackmount. Or possibly 24 2.5" spindles. At an oversubscription rate of 4:1, 12 spindles supports only 48 VMs, which is close to the current average. Most clouds are deploying Nehalem or Westmere class Intel cores with 48GB in a box. With VMs averaging around 1GB RAM in size, that's pretty much the max on 12 spindles. Many folks are already moving to 72GB in a box and soon 144GB will be standard. If you use an iSCSI SAN you could ride the increase in RAM per box and increase VM densities over the next few years.

    What this all means is that local storage could be a scaling limitation until SSDs hit the right price point and it's not clear when that will happen. It will most likely be a few years until we hit that crossover point.

    Cloudscaling agrees that local storage with SAN characteristics is ideal, but we aren't there yet and clearly running bleeding edge technology in the storage layer of a cloud isn't a great idea.

    BTW, I did take a look at the website and your product looks very good. I'm guessing you are leveraging the new AES instructions in the Westmere cores to get reasonable storage encryption speeds. This is something we have also been talking to clients about.

    Just as an FYI, Yahoo! does use 'SAN' if you are including 'remote storage' (e.g. NAS) in that description, which you seem to in your argument. In fact, Yahoo! has one of the largest NetApp deployments in the world.

    Thanks for your post.

  • mfarney
    Indeed, a dedicated server can be costly and one should have a serious database and a site that produces some income before getting such a server. Most web masters prefer the shared hosting. It is cheaper and quite reliable because of the big competition on hosting services. Every company that offers such a thing wants to be better and faster than their competitors. Thus, nowadays, shared hosting is better.
    Mathew Farney | UK VPS hosting
  • I can confirm from our internal experience here at Cloud Central that your numbers are right on the money. Our storage system employs iSCSI over multiple gig-e links backed by ZFS running on Sun storage hardware. We employ SSDs for read cache (L2ARC) and write logs (ZIL) as part of our hybrid storage pool; this allows us to leverage the benefits of SSDs (high IOPS / $) and hard disks (high capacity / $), whilst minimizing energy usage. The IOPS numbers we have seen from our storage system are very impressive (several thousand), whilst sequential reads & writes are in the order of 70-80MB/second, which is more than enough for 99% of DB and web workloads.

    The thing that people should keep in mind is that most workloads are dominated by random IO patterns, whilst very few workloads require very high sequential IO performance (with the exception of workloads such as video editing, which I'd suggest are best done in house for the time being). Therefore generally speaking, the IOPS of your storage system is more important than the raw sequential read / write performance for real workloads.

    One additional interesting thing worth noting is the performance of iSCSI over gig-e vs Fiber Channel. Most enterprise users would turn their nose up at iSCSI, but is the performance gain offered by Fiber Channel really worth the additional cost? In my experience, the answer is no.

    Regards,
    Kris
  • I'd just add that the same misunderstandings exist with respect to network I/O as well as storage. In my tests, some of the providers with the best storage I/O had the worst network I/O and vice versa. Also, these things tend to vary a great deal according to instance type, and providers are generally not forthcoming about how I/O resources are apportioned. For example, I found that one major provider was applying fairly draconian network throttles to the smaller instance types, but good luck finding anything about that in their public documentation. On the storage side the problem is the opposite; there's nothing in the storage stack that's anything like the traffic-shaping functionality in the network stack, so if your "neighbor" decides to bang the hell out of the local storage you *will* be affected. SLAs mean nothing if the technical capabilities necessary to enforce them are absent.
  • You're right, it's true about network I/O, but in my experience very few apps are network bound, whereas most web apps (at least the DB portion) have a scaling constraint on the disk. For example, some of the largest folks on the Internet push 40Gbps out of a single datacenter. That's a lot of bandwidth, but that's across thousands of servers, so the average network utilization inside a DC is low per server. The bottlenecks tend to be the network on storage systems, not the individual servers themselves.

    There is some traffic shaping functionality and QoS in some of the SAN providers, but it's minimal at best. 3PAR and Compellent claim some, but as far as I can tell, only Pillar probably has something that works.

    Regardless, none of them were designed to do this 'right'. We need VM backing stores designed for these kinds of environments with proper QoS and the ability to tune by the frontend VM. For bonus points, the QoS, traffic-shaping, etc. would follow a VM when it 'migrates' as well.

    Anyway, that's a very long discussion. I think network I/O is important, but I haven't seen it be as impactful as disk except in those environments where the oversubscription rates are ludicrous, like container-based VPS offerings (e.g. OpenVZ/Virtuozzo on a box at 100:1 or more).
  • Is it local vs. remote (what you just said), or physical vs. virtual (the title)? I'd suggest that, whichever it is, the other should be held constant. It might also be more interesting to show results for like numbers of spindles in each case, since it's not quite a surprise that six spindles will outperform one except in very special circumstances. I realize that you didn't want to get too bogged down in methodological details, but statements like "sustained block reads from SATA disks will typically be 60-80MBps in the real world" lose their relevance if the methodological divergence is too great.

    Similarly, instead of saying that results on AWS etc. will "likely" look similar, why not actually test there? I've actually done that, it cost mere pocket change, and it revealed very significant differences not only between providers but among the same provider's options as well. I think it would be very interesting if we could collaborate on more fully characterizing the differences between the actual options that users have in this area.
  • The primary reason I haven't published any results using AWS is that the EULA can be read in such a way that posting said results might be in violation. So I'm playing it safe there.

    I (we) are happy to collaborate on this. I think this topic is generally misunderstood and underrepresented in the blogosphere. People don't really understand the differences in the storage architectures for the different clouds, yet that is a major area of concern.

    I knew one startup that had designed their architecture heavily around a message bus. I did the math for them and they needed about 10,000 IOPS on a single message hub to keep up with their volume. They did not seem to realize that might be a challenge.
  • Please please *please* don't use Mbps for mega*bytes* per second. If you're talking about network speeds, use megabits with a lowercase b; if you're talking about storage use megabytes with a capital B. I've seen people led astray by exactly this error enough for ten careers already, and it tends to mark the person making it as a novice in one area or the other.

    Other points...

    Using NexentaStor, or any software-based target, won't really give you a true feel for the throughput capability of an initiator, and neither will 512-byte writes or comparing one disk vs. several. Bonnie(++) is universally recognized as a poor and outdated benchmark, including by your friends at Sun. The issue with caching is that guest VMs can't generally reach out and force the host to flush its cache at the "right" points. This is getting better, but in the meantime some providers have resorted to configuring their hypervisors so that *all* guest I/O is done synchronously. That also leads to invalid results/comparisons, apparently including some of those above.
  • Thanks for your input. I've updated the article to be clear that this is not a robust benchmarking test and to fix my typos on MBps.

    If you would like to provide a better methodology, I will be happy to revisit this in the New Year based on your methodology.

    In the meantime, the primary purpose of the article is served, which is to show folks that there is a significant difference in the characteristics of performance between a single local disk and remote RAID even over GigE.