Saturday, December 14, 2013

High-Performance Virtualization: SR-IOV and InfiniBand

About a week ago I posted a brief introduction to SR-IOV and a few early results from comparing the MPI performance of Amazon's new SR-IOV-enabled C3 instances to C3 instances using their older virtualized interconnect.  Although there were no real shockers in what I found (EC2's performance is still pretty lousy for MPI), I subsequently got a few experimental performance tips from the team at Amazon who developed EC2's SR-IOV which should bring EC2 performance closer to the metal.  As eager as I am to test out these tweaks and do some more realistic application benchmarking, apparently Instagram is hogging up all the C3 instances right now.

SR-IOV performance over 10gig is a horse that is rapidly being beaten to death though, and in the arena of supercomputing, it's not even that relevant.  Rather, InfiniBand dominates the high-end market, so it only makes sense to evaluate the performance of MPI and parallel applications on virtualized clusters with an InfiniBand interconnect.  To this end, I'll share some of the performance numbers that my colleagues and I obtained last spring in testing virtualizing QDR InfiniBand with SR-IOV.

For those disinclined to read this whole post, I also presented this data at SC'13 in Denver and Mellanox was kind enough to post a video of my talk online.

Finally, unlike most of my other posts, I actually got to do this work and play with SR-IOV'ed InfiniBand on the clock as a part of the performance evaluation that preceded SDSC's Comet award.  As such, I did not collect all this data myself, and I cannot take the credit for designing these experiments.  All of the involved parties are credited in the official presentation of the following data, and I should say that this blog post has not been vetted or approved by anyone involved in the data collection.

The Testing Platforms

The idea of this experiment was to determine whether or not it would be worthwhile for us to build our own high-performance machine with high-performance virtualization given all the buzz in cloud computing.  As such, we decided to compare the performance of three compute platforms:
  1. the best-available commercial offering of virtualized compute (Amazon EC2),
  2. a virtualized computing platform based on InfiniBand virtualized with SR-IOV, and
  3. a non-virtualized, bare-metal compute platform based on InfiniBand as a frame of reference
At the time we did this study, Amazon's cc2 instances were the top-of-the-line, so we spun up an 8-node EC2 cluster with
  • cc2.8xlarge instances running 
    • Amazon Linux 2013.03 (based on Enterprise Linux 6) 
    • HVM virtualization
  • dual Intel Xeon E5-2670 Sandy Bridge-EP processors (8 cores each, 2.6 GHz)
  • 60.5 GB DDR3 DRAM
  • Amazon's 10GbE interconnect within a common placement group
as our Option #1 (best-available commercial service).  We tested Options #2 and #3 on the same physical hardware, which was an 8-node HP s6500 chassis populated with
  • HP SL230 Gen 8 nodes running
    • Rocks 6.1 (based on Enterprise Linux 6)
    • KVM virtualization
  • dual Intel Xeon E5-2660 Sandy Bridge-EP processors (8 cores each, 2.2 GHz)
  • 64 GB DDR3 DRAM
  • QDR4X InfiniBand based on Mellanox ConnectX-3 (MT27500) HCAs
All three test cases had pretty similar hardware in terms of processor families and RAM.  The big differences were that Amazon used 10gig while our test cluster used QDR4X InfiniBand, and the CPUs in our test machine were clocked about 15% slower than those available in Amazon's cc2 instances.

MPI Latency

As I discussed in my previous post, the big performance gain from SR-IOV arises as a result of the fact that DMA (or RDMA, in the case of InfiniBand) can bypass the hypervisor and the host CPU altogether and write directly to the memory address space of a virtual machine (VM).  This should represent a significant drop in the overall latency of message passing, and in the 10gig case, it did.  How about with InfiniBand?

Increase in MPI latency due to SR-IOV in QDR InfiniBand

The latency difference, as you can see (or perhaps cannot see!), is extremely small here--significantly less pronounced than in the 10gig comparison I posted last week.  For tiny messages (message size M < 128 bytes), the extra latency caused by virtualization is on the order of 30%.  For all messages larger than 128 bytes, though, the latency overhead of virtualizing InfiniBand with SR-IOV is less than 10%.  Just as with SR-IOV and 10gig, the overhead in latency diminishes to nearly zero once the fabric performance moves into the bandwidth-limited regime.
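One way to see why the relative overhead fades for large messages is the textbook latency-bandwidth cost model, T(M) = startup + M / bandwidth.  The sketch below is only an illustration: the startup times and bandwidth are assumptions I picked to mimic a 30% penalty on the fixed per-message cost, not values measured in these tests, and a two-parameter model like this will not reproduce the measured curve in detail.

```python
def transfer_time(msg_bytes, startup_s, bandwidth_bytes_per_s):
    """Latency-bandwidth model: fixed startup cost plus size-dependent streaming cost."""
    return startup_s + msg_bytes / bandwidth_bytes_per_s


def relative_overhead(msg_bytes, native_startup_s, virt_startup_s, bandwidth_bytes_per_s):
    """Fractional increase in transfer time caused by a larger startup cost."""
    native = transfer_time(msg_bytes, native_startup_s, bandwidth_bytes_per_s)
    virt = transfer_time(msg_bytes, virt_startup_s, bandwidth_bytes_per_s)
    return virt / native - 1.0


# Illustrative assumptions only (not measurements from this post):
NATIVE_STARTUP = 1.0e-6   # assumed 1.0 us startup cost on bare metal
VIRT_STARTUP = 1.3e-6     # assumed 30% higher startup cost under SR-IOV
BANDWIDTH = 3.8e9         # assumed 3.8 GB/s for both cases

if __name__ == "__main__":
    for size in (8, 128, 4096, 65536, 4 * 1024 * 1024):
        pct = 100.0 * relative_overhead(size, NATIVE_STARTUP, VIRT_STARTUP, BANDWIDTH)
        print(f"{size:>8d} bytes: {pct:6.3f}% overhead")
```

With these made-up inputs the overhead starts near 30% for tiny messages and shrinks steadily as the M / bandwidth term takes over, which is the same qualitative behavior as the measured data.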

An Interesting Diversion: PCIe Passthrough

An additional datapoint is using regular old PCIe passthrough, which as you may remember from my previous post, uses the same underlying virtualization-aware IOMMU hardware to pass DMA through the hypervisor.  I presented PCIe passthrough data at SC'13 which raised a few eyebrows, so for the sake of completeness, I'll also mention it here:

Difference in MPI latency for different virtualization strategies in QDR InfiniBand

In our tests, PCIe passthrough actually performed significantly worse than SR-IOV, and this is almost definitely a driver issue.  I reason this to be the case because:
  1. PCIe passthrough and SR-IOV use the same IOMMU hardware (Intel VT-d) under the hood; the only difference is that the VM interacts with the HCA's physical function in PCIe passthrough mode and one of the HCA's virtual functions in SR-IOV mode.
  2. Thus, the VM in PCIe passthrough mode is interacting with the physical function via the native hardware driver for the HCA.  In contrast, the VM in SR-IOV mode interacts with the virtual function via a paravirtualized driver that is aware that the virtual function does not have the same physical capabilities as a fully fledged physical function.
  3. The esteemed D.K. Panda and his colleagues reported similar performance degradation for small (M<128) messages when using SR-IOV, and this latency was caused by data being inlined in the work queues on the HCA.  They got significant performance increases by not inlining these small messages when interacting with the physical function's driver; the virtual function's driver automatically disables data inlining.  This explains the big disparity for M < 128 bytes.
We didn't spend a lot of time optimizing the PCIe passthrough performance since SR-IOV was our principal concern, so I never actually tested this hypothesis.

Comparing to a Commercial Cloud

Finally, how does the latency of virtualized InfiniBand compare to our best-available virtualized compute platform (Option #1)?

Difference in MPI latency for virtualized InfiniBand and Amazon EC2's cc2 instances

It should be no surprise that Amazon's 10gig interconnect, which uses software virtualization in addition to TCP and Ethernet, is awful compared to QDR InfiniBand virtualized with SR-IOV.  For small messages, Amazon's latency is 50x greater than SR-IOV'ed InfiniBand, and the latency still remains roughly 10x greater for large messages due to the difference in total available bandwidth.

Although I'd argue that this comparison is valid in that it compares a technology we will be using (SR-IOV'ed InfiniBand) with the best-available commercial alternative (Amazon EC2), it really is an apples-to-oranges comparison in that both the bandwidth and latency characteristics of RDMA and TCP/IP are vastly different even when you remove all the overheads incurred by virtualization.  Both TCP/IP and Ethernet are lossy, which can incur massive drops in throughput as the network experiences congestion.

Overall Comparison of Latency Performance

To provide an overall summary of interconnect latencies and the effects of virtualization, here's one final plot that ties together both my previous post's latency data and the data from the experiment I presented at SC:


Comparing all latency data at once, a few features should be obvious:
  1. 10gig is always worse than QDR4X InfiniBand no matter what.  Even virtualized InfiniBand has almost 10x less latency than non-virtualized 10gig.
  2. SR-IOV does incur some small increase in latency in both interconnects, but the performance loss is on the order of a few percent (not orders of magnitude).
In addition, a few caveats should be made about this comparison of data:
  1. Some of the measurements were not optimized and do not reflect the best possible performance!
    1. The SR-IOV 10gig data did not have the tunings that Amazon's EC2 High Performance Network Virtualization team sent me after my previous post.
    2. The PCIe passthrough data did not have the data-inlining tunings that D.K. Panda's group presented at CCGrid 2013.
  2. I did not try InfiniBand with software virtualization, nor did I try 10gig with PCIe passthrough, so there is no way to compare InfiniBand to 10gig in those cases.
  3. It is difficult to compare the 10gig data from Amazon to the native 10gig performance, since the native 10gig cluster's networking topology was presumably quite different from what Amazon's cluster compute instances use.

MPI Bandwidth

The whole point of SR-IOV is to reduce the latency of DMA, so the bandwidth performance of virtualized InfiniBand should be no different from the bandwidth available to non-virtualized HCAs.

Decrease in MPI bandwidth due to SR-IOV in QDR InfiniBand

Indeed, SR-IOV incurs less than 2% loss of bandwidth across the entire range of message sizes.  The above diagram only shows tiny messages because, quite frankly, the bandwidth curves overlap for larger messages where the bandwidth (as opposed to latency) limits the overall throughput of message passing.  For the large messages, InfiniBand virtualized with SR-IOV is capable of delivering over 95% of the theoretical line speed.  Bandwidth loss due to SR-IOV, at least in these point-to-point tests, is a non-issue in virtualized InfiniBand.

PCIe Passthrough

I presented the data for MPI latency when using InfiniBand virtualized with PCIe passthrough, so I might as well present the bandwidth data as well:

Difference in MPI bandwidth for different virtualization strategies in QDR InfiniBand

Passthrough's bandwidth looks relatively terrible, but be reminded that the big disparities in performance are only present for small messages.  Also recall that the throughput of passing small messages is limited by the latency of the interconnect, not the bandwidth, so these wacky bandwidth measurements are just another manifestation of the inconsistencies in latency I showed in the previous section.  There isn't much to be taken away from this that hasn't already been said, but it's a good case study in why understanding your drivers, whether they be paravirtualized or true hardware drivers, can often be critical to performance.

Comparing to the Commercial Cloud

It should be obvious that 10gig Ethernet will have significantly less bandwidth than QDR4X InfiniBand's 40Gb/s links, and since SR-IOV comes with almost no loss of bandwidth, SR-IOV'ed InfiniBand should blow Amazon out of the water:

Difference in MPI bandwidth for virtualized InfiniBand and Amazon EC2's cc2 instances

As labeled, SR-IOV'ed InfiniBand peaks out at 3.8 GB/s, while Amazon's 10gig saturates at around 400 MB/s.  What's striking about this is the fraction of line speed each interconnect is able to achieve: SR-IOV'ed InfiniBand can demonstrate over 95% of the theoretical max, while Amazon's 10gig is delivering only 400 MB/s, or less than 35% of line speed.  This translates to performance differences of between 9x (large messages) and 25x (small messages) when message passing over virtualized InfiniBand instead of the virtualized 10gig that exists in cloud-based compute services.
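For anyone who wants to check the line-speed arithmetic, here is a small sketch.  It assumes that "theoretical max" for QDR 4X means the 40 Gb/s signaling rate reduced by 8b/10b encoding (32 Gb/s, or 4 GB/s, of data), and it takes 10GbE at its nominal 10 Gb/s data rate while ignoring TCP/IP header overhead; the function name is just something I made up for illustration.

```python
def fraction_of_line_rate(measured_bytes_per_s, signaling_bits_per_s, encoding_efficiency=1.0):
    """Measured throughput as a fraction of the usable data rate of a link."""
    usable_bytes_per_s = signaling_bits_per_s * encoding_efficiency / 8.0
    return measured_bytes_per_s / usable_bytes_per_s


# QDR 4X InfiniBand: 4 lanes x 10 Gb/s signaling, 8b/10b encoding -> 32 Gb/s of data
ib_fraction = fraction_of_line_rate(3.8e9, 40e9, encoding_efficiency=8 / 10)

# 10 Gb Ethernet: nominal 10 Gb/s data rate; TCP/IP header overhead is ignored here
eth_fraction = fraction_of_line_rate(400e6, 10e9)

# Ratio of the two peak bandwidths quoted in the text
large_message_ratio = 3.8e9 / 400e6

if __name__ == "__main__":
    print(f"SR-IOV InfiniBand: {ib_fraction:.0%} of line rate")
    print(f"EC2 10GbE:         {eth_fraction:.0%} of line rate")
    print(f"Large-message bandwidth ratio: {large_message_ratio:.1f}x")
```

Using the rounded peak values quoted above, this works out to roughly 95% for InfiniBand, 32% for Amazon's 10gig, and a 9.5x gap for large messages.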

Overall Comparison of Bandwidth Performance

Pulling in data from a few other tests (Amazon's SR-IOV and non-virtualized 10gig), here is a bigger picture of the bandwidth performance of all the platforms I've evaluated:


The conclusions here are really the same as the conclusions derived from the latency performance in the previous section:  10gig is still always worse than QDR4X InfiniBand no matter what.  This is a trivial finding since a QDR4X link should be able to push data 4x faster than a 10gig link, but here's the data to back it up.

The same caveats I mentioned in the previous MPI latency section apply here as well.  In addition, the kind folks of Amazon's EC2 High Performance Network Virtualization team say they have gotten much better SR-IOV bandwidth than I demonstrated and suggested that there may have been something wrong with the instances I got.  However, my bandwidth data was measured between all of the node-pairs within my four-node C3 cluster, and all measurements were consistent with each other so there must've been something wrong with all four nodes.  Unless someone can bankroll another set of C3 tests for me (my previous blog post cost me $20) though, further testing will have to wait.

Application Benchmarks

Microbenchmarks are all well and good, but they ultimately only test isolated aspects of the overall system performance and don't always follow the same performance trends as users' applications, which actually do the science that makes supercomputers useful.

The devil in evaluating new technology is that it's often difficult to get the people who understand the scientific applications into the same room as the people who understand the new technology to really explore the full range of interplay between application algorithms and compute hardware.  With that being said, there are a few supercomputing organizations that do this and do it extremely well; for example, NERSC has a long track record of doing serious workload and performance analysis, and the HPC Advisory Council maintains an extremely nice list of application benchmarks complete with profiles.

The two application benchmarks we ran to evaluate the performance of InfiniBand virtualized with SR-IOV were derived from the HPC Advisory Council's repository and represent opposite ends of the spectrum in terms of application communication profiles.  Both were built with Intel's compilers and had the full Sandy Bridge ISA exposed to the VMs, meaning the playing field was quite level as far as all of the tested platforms being able to execute 256-bit vector operations via AVX.  We also opted to use OpenMPI across the board for similar reasons; even though MVAPICH2 is my preferred MPI stack when using InfiniBand, OpenMPI has the benefit of supporting both InfiniBand and 10gig via the openib and tcp BTLs.
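To make the BTL point concrete, here is a rough sketch of how one might pin OpenMPI to one transport or the other from a launch script.  The `--mca btl` option and the `openib`, `tcp`, `self`, and `sm` component names are real for OpenMPI 1.x-era releases, but the helper function, the host file name, and the executable name are hypothetical, and I am not claiming these are the exact command lines used in these tests.

```python
import shlex

# BTL lists for Open MPI 1.x-era releases: "self" and "sm" handle loopback and
# intra-node shared memory; the first entry selects the network transport.
BTLS = {
    "infiniband": "openib,self,sm",
    "ethernet": "tcp,self,sm",
}


def mpirun_command(fabric, nprocs, hostfile, executable):
    """Build an mpirun argument list that restricts Open MPI to one transport."""
    return [
        "mpirun",
        "-np", str(nprocs),
        "--hostfile", hostfile,
        "--mca", "btl", BTLS[fabric],
        executable,
    ]


if __name__ == "__main__":
    # "hosts.txt" and "./wrf.exe" are placeholder names for illustration only.
    for fabric in ("infiniband", "ethernet"):
        print(shlex.join(mpirun_command(fabric, 96, "hosts.txt", "./wrf.exe")))
```

Keeping the MPI library fixed and changing only the BTL list means any difference in runtime is attributable to the transport rather than to differences between MPI implementations.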

WRF - Weather Modeling

WRF is a widely used weather modeling application that is run in both research and operational forecasting.  Its communication patterns are marked by an overwhelming amount of nearest-neighbor communications (see slide 12 in the HPCAC profile of WRF) whereby one MPI rank communicates with only a few other MPI ranks, and those communicating partners never change throughout the course of the simulation.  Due to the way MPI ranks typically map to cores in most modern clusters, this means that much of the communication is actually done between MPI ranks that reside on the same physical host node, and the communication that does have to traverse the interconnect does so in highly predictable ways.  Ultimately, this means that congestion becomes less of an issue since communication can be reliably patterned in a very orderly way.

The messages that WRF passes between these neighboring ranks are neither tiny nor massive (4 KB < M < 32 KB), and this means they benefit from both low latency and high bandwidth.  In terms of how gnarly some applications can get, WRF represents a very "nice" problem in that its communications demands mesh very well with the strengths of InfiniBand.
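The claim that much of the nearest-neighbor traffic never leaves the node can be illustrated with a toy model.  The sketch below assumes a 2D domain decomposition with row-major rank numbering and ranks packed onto nodes in consecutive blocks; the process grids are hypothetical choices for 96 ranks at 16 ranks per node, not the decomposition WRF actually picked in these runs.

```python
def neighbor_pairs(px, py):
    """Yield (rank_a, rank_b) for every pair of adjacent subdomains in a px-by-py grid."""
    for y in range(py):
        for x in range(px):
            rank = y * px + x
            if x + 1 < px:
                yield rank, rank + 1    # east neighbor
            if y + 1 < py:
                yield rank, rank + px   # north neighbor


def on_node_fraction(px, py, ranks_per_node):
    """Fraction of neighbor pairs whose two ranks share a host under block mapping."""
    pairs = list(neighbor_pairs(px, py))
    same = sum(1 for a, b in pairs if a // ranks_per_node == b // ranks_per_node)
    return same / len(pairs)


if __name__ == "__main__":
    # Two hypothetical ways to lay out 96 ranks on six 16-core nodes
    for px, py in ((8, 12), (16, 6)):
        frac = on_node_fraction(px, py, ranks_per_node=16)
        print(f"{px} x {py} grid: {frac:.0%} of neighbor pairs are on-node")
```

In this toy model, somewhere between half and three quarters of the neighbor pairs share a host depending on the grid shape, so only the remainder ever touches the (virtualized) interconnect.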

Application runtime for WRF 3-hour 12km CONUS benchmark

When the WRF 12km CONUS benchmark is run with six nodes (96 cores) over QDR4X InfiniBand virtualized with SR-IOV, the overall performance loss is only about 15%.  Although this is not as nice as the 10% overhead in MPI latency and roughly zero overhead in MPI bandwidth I showed in the previous sections, I would argue that a 15% increase in walltime is not a show stopper if it means a user can develop an entire software ecosystem around using WRF within a VM and simply deploy it as a scalable compute appliance on SDSC's upcoming Comet machine.

Surprisingly, Amazon's performance hit wasn't too bad either--a 44% increase in overall runtime is equivalent to what one might expect by turning off compiler optimizations or stepping back a few processor generations.  However, I should point out that both the Native IB and SR-IOV benchmarks were running on SDSC's test cluster with 2.2 GHz E5-2660 processors, while the Amazon instances were using 2.6 GHz E5-2670 processors.  Thus, interconnect performance aside, the Amazon performance should have been measurably better than the SR-IOV test cluster by virtue of the fact that the Amazon test was using faster processors.

Nonetheless, this application test illustrates sort of the "best-case" application performance when running a parallel application over a virtualized interconnect, and this best case is pretty good.

Quantum ESPRESSO - Density Functional Theory

If WRF represents the "best-case" application performance, Quantum ESPRESSO represents something close to the worst-case scenario.  This isn't to suggest that Quantum ESPRESSO is a bad code; rather, the mathematical problem it solves is inherently difficult to parallelize efficiently.  The application itself performs density functional theory (DFT) calculations for condensed matter problems (this is one of the rare cases where my background in materials science actually becomes relevant!) which are typically marked by repeatedly doing two terrible mathematical operations:
  1. matrix diagonalization
  2. 3D Fourier transforms
Matrix diagonalization is done with the conjugate gradient algorithm which, unlike WRF, typically has very irregular communication patterns that put more pressure on the interconnect than WRF's nearest-neighbor communications.   In fact, conjugate gradient algorithms are sufficiently taxing on all aspects of a parallel machine that the famed Jack Dongarra, one of the founders of TOP500 and authors of the LINPACK benchmark, has proposed a conjugate gradient-based benchmark (HPCG) as a modern alternative to LINPACK for assessing the overall capability of supercomputers.

3D Fourier transforms are similarly hard on the interconnect because, like conjugate gradient calculations, they are rich in things like global collective operations as a result of having to perform multiple matrix transpose operations where every MPI rank needs to send a piece of its data to, and receive a piece from, every other MPI rank.  The very nature of these global collectives means they cause interconnect congestion, and efficient 3D FFTs saturate (and are limited by) the bisection bandwidth of the entire fabric connecting all of the compute nodes.
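To put rough numbers on why these transposes hurt, the sketch below counts messages in a naive all-to-all.  It assumes every rank sends one message to every other rank and that ranks are spread evenly across nodes; real MPI libraries use smarter all-to-all algorithms, and the 1 GiB array size is an arbitrary illustrative value, so treat this as a back-of-the-envelope estimate only.

```python
def alltoall_off_node_fraction(nranks, ranks_per_node):
    """Fraction of messages in a naive all-to-all that must cross the interconnect.

    Assumes nranks is a multiple of ranks_per_node.
    """
    total = nranks * (nranks - 1)             # every rank sends to every other rank
    on_node = nranks * (ranks_per_node - 1)   # partners that share the sender's host
    return (total - on_node) / total


def alltoall_message_bytes(total_bytes, nranks):
    """Size of each message when total_bytes of data is transposed across nranks."""
    per_rank = total_bytes / nranks   # data held by each rank
    return per_rank / nranks          # slice sent to each partner (including itself)


if __name__ == "__main__":
    nranks, ranks_per_node = 48, 16   # three 16-core nodes
    frac = alltoall_off_node_fraction(nranks, ranks_per_node)
    size = alltoall_message_bytes(1 << 30, nranks)   # arbitrary 1 GiB array
    print(f"{nranks * (nranks - 1)} messages, {frac:.0%} of them off-node")
    print(f"about {size / 1024:.0f} KiB per message for a 1 GiB array")
```

For 48 ranks on three nodes, about two thirds of the messages are off-node, in contrast to a nearest-neighbor pattern where most partners can share a host, and the per-message size shrinks with the square of the rank count as the job scales out.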

For those who may not be interested in the gory details underlying the parallel algorithms that are required to perform the DFT calculations that Quantum ESPRESSO uses, let it suffice to say that this code stresses the interconnect and is dominated by collectives and irregular communication (see slide 17 in the HPCAC profile of Quantum ESPRESSO).  Here's how that translates into performance over virtualized interconnects:

Application runtime for Quantum ESPRESSO DEISA AUSURF 112 benchmark

Virtualizing the interconnect with SR-IOV contributes a substantially greater degree of overhead to the overall performance when calculating the DEISA AUSURF 112 benchmark with Quantum ESPRESSO.  Runtime increases by 28% when run within a three-VM (48-core) subset of the virtualized cluster despite the problem being largely bandwidth-limited, and this may be the effect of a large number of small messages (M < 16 bytes) that the HPCAC profile reports.  It may also be related to the precipitous performance drop that SR-IOV appears to induce in collective operations, as reported by D.K. Panda's group.  However, the conditions of Panda's experiments were quite different from ours in that we maintained only one VM per physical host.

While a 28% increase in runtime due to SR-IOV is not great, this benchmark performs horrendously on Amazon EC2 and posts a slowdown of over 500%.  This is the result of Quantum ESPRESSO striking at the Achilles heel of Amazon EC2: ESPRESSO needs bisection bandwidth and generates congestion, and Amazon lacks deliverable bandwidth and relies on TCP/IP over Ethernet underneath, which really suffers under heavy network congestion.

The Wrap-Up

The application performance of clusters which utilize SR-IOV to virtualize InfiniBand is quite respectable.  While it'd be disingenuous to say that SR-IOV will usher in a new era of virtualized high-performance computing, I genuinely think it makes the idea of running tightly coupled parallel applications within virtual machines a palatable option.  Overall,
  • MPI latency overhead introduced by SR-IOV is small (< 10%) for all but the smallest of messages in virtualized InfiniBand.  Even then, small message passing performance might be improved by tuning the virtual function drivers.
  • MPI bandwidth overhead introduced by SR-IOV is minuscule (< 2%) across the board, and SR-IOV'ed InfiniBand is able to achieve > 95% line speed from within a VM.
  • Application performance overhead caused by SR-IOV will typically be on the order of 10% to 50% for small-ish problems (three to eight nodes).  The exact performance loss due to the virtual environment depends on the communication patterns of the underlying problem, and applications that stress the interconnect will experience greater overhead from SR-IOV.
If nothing else, the data here shows that one cannot simply "go to Amazon" for all virtualized high-performance computing needs.  Using SR-IOV with InfiniBand delivers raw performance that is orders-of-magnitude better in both latency and bandwidth between VMs, and application performance can be expected to be, at the very minimum, 20% better over virtualized InfiniBand.

SR-IOV's overhead is small but finite.  SDSC has chosen to go with it for the deployment of Comet because of Comet's need to serve an extremely diverse range of user communities, web-based gateways, and other projects that require sophisticated software stacks that would be too difficult to maintain on bare metal.  In that context, losing 15% of peak performance is a fair trade if it makes a project or compute service tractable.  Thus, I think the bottom line is that SR-IOV represents a major step forward in the pursuit of high-performance virtualization for high-performance computing.  It's not the end of the line, but it does open a lot of new doors.

As far as I know, there are a few other sites experimenting with virtualizing InfiniBand using the host IOMMU.  If you are really interested in this emerging technology, here are some links of interest:
Also please feel free to contact me directly.

4 comments:

  1. Can you share the performance tips that you got from the Amazon team?

    1. Sure thing, with the caveat that these are unofficial and should be treated as untested and unsupported!

      1. Disable adaptive throttling for the ixgbevf driver ("options ixgbevf InterruptThrottleRate=0,0,0,0,0,0,0" in /etc/modprobe.d/ixgbevf.conf)

      2. Reduce the ring buffer sizes (e.g., ethtool -G eth0 rx 128 tx 128)

      3. Be sure to bind your communicating processes to the socket closest to the IO device.

      These will improve the results of these benchmarks, but may or may not help actual application performance. For example, #3 definitely isn't going to help you if you are running 16 MPI processes per instance, since 8 will invariably have to communicate over QPI before going to the ixgbe device.

  2. Hi Glenn, thank you for sharing this information. It is extremely useful for those of us using Amazon's AWS services to do analytics and parallel tasks. Great work, and please, keep posting! THANK YOU! This entire blog is a wealth of information!

    1. Thank you for the kind words. I'm always glad to hear when someone can get something useful out of what I post.
