Monday, September 26, 2016

Passing QEMU command line options through libvirt

This one comes from Stefan's blog with some duct tape and baling wire courtesy of Laine.  I've talked previously about using wrapper scripts to launch QEMU, which typically use sed to insert options that libvirt doesn't know about.  That's far better than defining full vfio-pci devices using <qemu:arg> options, as many guides suggest, since the <qemu:arg> approach hides the devices from libvirt and causes all sorts of problems with device permissions, locked memory, etc.  But there's a nice compromise, as Stefan shows in his last example at the link above.  Say we only want to add x-vga=on to one of our hostdev entries in the libvirt domain XML.  We can do something like this:

<qemu:commandline>
  <qemu:arg value='-set'/>
  <qemu:arg value='device.hostdev0.x-vga=on'/>
</qemu:commandline>

The effect is that we add the option x-vga=on to the hostdev0 device, which is defined via a normal <hostdev> section in the XML and gets all the device permission and locked memory love from libvirt.  So which device is hostdev0?  Well, things get a little mushy there.  libvirt invents the names based on the order of the <hostdev> entries in the XML, so you can simply count (starting from zero) to pick the entry for the additional option.  It's a bit cleaner overall than needing to manage a wrapper script separately from the VM.  Also, don't forget that to use <qemu:commandline> you first need to enable the QEMU namespace by updating the first line of the domain XML to:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

Otherwise libvirt will promptly discard the extra options when you save the domain.
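For example, after starting the VM you can confirm the extra option survived by checking libvirt's per-domain log, which records the full QEMU command line (a quick sketch; "vm1" is just a placeholder for your domain name):

grep -e '-set device.hostdev0.x-vga=on' /var/log/libvirt/qemu/vm1.log

If the grep comes up empty, the missing namespace declaration above is the usual culprit.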

"Intel-IOMMU: enabled": It doesn't mean what you think it means

A quick post just because I keep seeing this in practically every how-to guide I come across.  The instructions grep dmesg for "IOMMU" and come up with either "Intel-IOMMU: enabled" or "DMAR: IOMMU enabled". Clearly that means it's enabled, right? Wrong. That line comes from a __setup() function that parses the options for "intel_iommu=". Nothing has been done at that point, not even a check to see if VT-d hardware is present. Pass intel_iommu=on as a boot option to an AMD system and you'll see this line. Yes, this is clearly not a very intuitive message. So for the record, the mouthful that you should be looking for is this line:

DMAR: Intel(R) Virtualization Technology for Directed I/O

or on older kernels the prefix is different:

PCI-DMA: Intel(R) Virtualization Technology for Directed I/O

When you see this, you're pretty much past all the failure points of initializing VT-d. FWIW, the "DMAR" flavors of the above appeared in v4.2, so on a more recent kernel, that's your better option.
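A quick way to check on a running host (a sketch; remember the exact prefix depends on your kernel version as noted above):

dmesg | grep -e DMAR -e IOMMU

If all you see is "Intel-IOMMU: enabled" without the "Virtualization Technology for Directed I/O" line, VT-d most likely hasn't actually initialized.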

Friday, July 15, 2016

Intel Graphics assignment

Hey folks, it feels like it's time to mention that assignment of Intel graphics devices (IGD) is currently available in qemu.git and will be part of the upcoming QEMU 2.7 release.  There's already pretty thorough documentation of the modes available in the source tree; please give it a read.  There are two modes described there, "legacy" and "Universal Passthrough" (UPT), each with its own pros and cons.  Which ones are available to you depends on your hardware.  UPT mode is only available for Broadwell and newer processors, while legacy mode is available all the way back through SandyBridge.  If you have a processor older than SandyBridge, stop now; this is not going to work for you.  If you don't know what any of these strange names mean, head to Wikipedia and Ark to figure it out.

The high level overview is that "legacy" mode is much like our GeForce support: the IGD is meant to be the primary and exclusive graphics in the VM.  Additionally, the IGD address in the VM must be PCI 00:02.0, only SeaBIOS is currently supported, only the 440FX chipset model is supported (no Q35), the IGD device must be the primary host graphics device, and the host needs to be running kernel v4.6 or newer.  Clearly assigning the host primary graphics is a bit of an about-face for our GPU assignment strategy, but we depend on running the IGD video ROM, which depends on VGA and imposes most of the above requirements as well (oh, and add CONFIG_VFIO_PCI_VGA to the requirements list).  I have yet to see an IGD ROM with UEFI support, which is why OVMF is not yet supported, but it seems possible to support with a CSM and some additional code in OVMF.
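A couple of quick host-side checks before going further (a sketch; the config file path assumes a typical distro kernel with /boot/config-* installed):

uname -r
grep CONFIG_VFIO_PCI_VGA /boot/config-$(uname -r)

You want to see a v4.6 or newer kernel and CONFIG_VFIO_PCI_VGA=y.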

Legacy mode should work with both Linux and Windows guests (and hopefully others if you're so inclined).  The i915 driver does suffer from the typical video driver problem that sometimes the whole system explodes (not literally) when unbinding or re-binding the IGD to the driver.  Personally I avoid this by blacklisting the i915 driver.  Of course, as some have found out trying to do this with discrete GPUs, there are plenty of other drivers ready to jump on the device to keep the console working.  The primary ones I've seen are vesafb and efifb; which one is used on your system depends on your host firmware settings, legacy BIOS vs UEFI respectively.  To disable these, simply add video=vesafb:off or video=efifb:off to the kernel command line (not sure which to use?  try both, video=vesafb:off,efifb:off).  The first thing you'll notice when you boot an IGD system with i915 blacklisted and the more basic framebuffer drivers disabled is that you don't get anything on the graphics head after grub.  Plan for this.  I use a serial console, but perhaps you're more comfortable running blind and willing to hope the system boots so you can ssh into it remotely.
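Putting the host pieces together, the setup I'm describing looks something like this (a sketch; file locations and how you regenerate the bootloader config are distro-specific):

# /etc/modprobe.d/blacklist-i915.conf
blacklist i915

# appended to the kernel command line, along with the usual intel_iommu=on
video=vesafb:off,efifb:off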

If you've followed along with this procedure, you should be able to simply create a <hostdev> entry in your libvirt XML, which ought to look something like this:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </hostdev>

Again, assigning the IGD device (which is always 00:02.0 on the host) to address 00:02.0 in the VM is required.  Delete the <video> and <graphics> sections and everything should just magically work.  Caveat emptor: my newest CPU is Broadwell.  I've been told this works with Skylake, but IGD is hardly standardized and each new implementation seems to tweak things just a bit.
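If you want to sanity check what you're handing over before booting the VM, something like this on the host (a sketch; the iommu_group link only appears once the IOMMU is actually enabled):

lspci -nnk -s 00:02.0
readlink /sys/bus/pci/devices/0000:00:02.0/iommu_group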

Some of you are probably also curious why this doesn't work on Q35, which leads into the discussion of UPT mode.  IGD clearly is not a discrete GPU, but "integrated" doesn't just mean that the GPU is embedded in the system; in this case it means that the GPU is kind of smeared across the system.  This is why IGD assignment hasn't "just worked" and why you need a host kernel with support for exposing certain regions through vfio, a BIOS that's aware of IGD, the device at a specific address, etc, etc, etc.  Part of the problem is that the video ROM also cares about a few properties of the device at PCI address 00:1f.0, the ISA/LPC bridge.  Q35 includes its own bridge at that location, and we cannot simply modify the IDs of that bridge for compatibility reasons.  That bridge being an implicit part of Q35 is why IGD assignment doesn't work on Q35.  It also means that PCI address 00:1f.0 is not available for other uses in a 440FX machine.

Ok, so UPT.  Intel has known for a while that the sprawl of IGD has made it difficult to deal with for device assignment.  To combat this, both software and hardware changes have been made that help consolidate IGD to be more assignment-friendly.  Great news, right?  Well, sort of.  First off, in UPT mode the IGD is meant to be a secondary graphics device in the VM; there's no VGA mode support (oh, BTW, x-vga=on is automatically added by QEMU in legacy mode).  In fact, um, there's no output support of any kind by default in UPT mode.  How's this useful, you ask?  Well, between the emulated graphics and the IGD you can set up mirroring so you actually have a remote-capable, hardware accelerated graphics VM.  Plus, if you add the option x-igd-opregion=on to the vfio-pci device, you can get output to a physical display, but there again you're going to need the host running kernel v4.6 or newer and the upcoming QEMU 2.7 support, while no-output UPT has probably actually worked for quite a while.  UPT mode has no requirements for the IGD PCI address, but note that most VM firmware, SeaBIOS or OVMF, will define the primary graphics as the one with the lowest PCI address.  Usually not a problem, but some of you create some crazy configs.  You'll also still need to do all the blacklisting and video disabling above, or just risk binding and unbinding i915 on the host, gambling each time whether it'll explode.

So UPT sounds great, except why is this opregion thing optional?  Well, it turns out that if you want to do that cool mirroring thing I mention above and a physical output is enabled with the opregion, you actually need to have a monitor attached to the device or else your apps don't get any hardware acceleration love.  Whereas if the IGD doesn't know about any outputs, it's happy to apply hardware acceleration regardless of what's physically connected.  Sucks, but readers here should already know how to create wrapper scripts to add this extra option if they want it (similar to x-vga=on).  I don't think Intel really wants to support this hacky hybrid mode either, thus the experimental x- option prefix.
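If you do want the opregion, one way to add it without maintaining a wrapper script is the same <qemu:commandline> -set approach shown at the top of this page (a sketch, assuming the IGD is your first <hostdev> and therefore gets the alias hostdev0; the QEMU XML namespace declaration is required here too):

<qemu:commandline>
  <qemu:arg value='-set'/>
  <qemu:arg value='device.hostdev0.x-igd-opregion=on'/>
</qemu:commandline>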

Oh, one more gotcha for UPT mode: Intel seems to expect otherwise, but I've had zero success trying to run Linux guests with UPT.  Just go ahead and assume this is for your Windows guests only at this point.

What else... laptop displays should work, and I believe switching outputs even works, but working on laptops is rather inconvenient since you're unlikely to have a serial console available.  Also note that while you can use input-linux to attach a laptop keyboard and mouse (not the trackpad, in my experience), I don't know how to make the hotkeys work, so that's a bummer.  Some IGD devices will generate DMAR error spew on the host when assigned, particularly the first time per host boot.  Don't be too alarmed by this, especially if it stops before the display is initialized.  This seems to be caused by resetting the IGD in an IOMMU context where it can't access the page tables set up for it by the BIOS/host.  Unless you have an ongoing spew of these, they can probably be ignored.  If you have something older than SandyBridge that you wish you could use this with and continued reading even after being told to stop, sorry: there was a hardware change at SandyBridge, and I don't have anything older to test with and don't really want to support additional code for such outdated hardware.  Besides, those are pretty old and you need an excuse for an upgrade anyway.

With this support I've switched my desktop system so that the host actually runs from a USB stick and the previous bare-metal Fedora install is virtualized with IGD, running alongside my existing GeForce VM.  Give it a try and good luck.

Sunday, January 3, 2016

Comments on the 7 Gamers, 1 CPU video

In case you've seen this video:

[embedded video: "7 Gamers, 1 CPU"]
And you're thinking to yourself that the R9 Nano they used looks like a great choice for your own GPU assignment build, think again.  This GPU is known to have reset issues, so while it's impressive to see a build with this degree of consolidation, you should be suspicious about the problems Linus alludes to and the limited functionality of the overall system shown in the video.  For instance, does rebooting a VM require a host reboot, or perhaps a manual soft eject of the GPU from the VM?  We see this a lot with newer AMD cards, well beyond the partial workarounds we have for Bonaire and Hawaii based GPUs.

Personally I would have preferred to see an NVIDIA based solution, but due to the scale of this build, and unique slot and power restrictions, the compatibility with GPU assignment was mostly an afterthought.  NVIDIA is not without issues, but for the time being we understand those issues and have workarounds for them, and even a path for supported configurations with Quadro cards.

On the plus side, yes, this is using KVM and VFIO and it's an impressive example of what this technology can do.  However, when you're spec'ing your own build, do your own research and don't rely on videos like this to choose your components.  My 2 cents...

Friday, October 23, 2015

Intel processors with ACS support

If you've been keeping up with this blog then you understand a bit about IOMMU groups and device isolation.  In my howto series I describe the limitations of the Xeon E3 processor that I use in my example system and recommend Xeon E5 or higher processors to provide the best case device isolation for those looking to build a system.  Well, thanks to the vfio-users mailing list, it has come to my attention that there are in fact Core i7 processors with PCIe Access Control Services (ACS) support on the processor root ports.

Intel lists these processors as High End Desktop Processors; they include Socket 2011-v3 Haswell-E processors, Socket 2011 Ivy Bridge-E processors, and Socket 2011 Sandy Bridge-E processors.  The linked datasheets for each family clearly list ACS register capabilities.  Current listings for these processors include:

Haswell-E (LGA2011-v3)
i7-5960X (8-core, 3/3.5GHz)
i7-5930K (6-core, 3.5/3.7GHz)
i7-5820K (6-core, 3.3/3.6GHz)

Ivy Bridge-E (LGA2011)
i7-4960X (6-core, 3.6/4GHz)
i7-4930K (6-core, 3.4/3.9GHz)
i7-4820K (4-core, 3.7/3.9GHz)

Sandy Bridge-E (LGA2011)
i7-3960X (6-core, 3.3/3.9GHz)
i7-3970X (6-core, 3.5/4GHz)
i7-3930K (6-core, 3.2/3.8GHz)
i7-3820 (4-core, 3.6/3.8GHz)

These also appear to be the only Intel Core processors compatible with Socket 2011 and 2011-v3 found on X79 and X99 motherboards, so basing your platform around these chipsets will hopefully lead to success.  My recommendation is based only on published specs, not first-hand experience, though, so your mileage may vary.

Unfortunately, there are not yet any Skylake based "High End Desktop Processors".  From what we've seen on the mailing list, Skylake does not implement ACS on processor root ports, nor do we have quirks to enable isolation on the Z170 PCH root ports (which include a read-only ACS capability, effectively confirming the lack of isolation).  Integrated I/O devices on the motherboard also exhibit poor grouping with system management components (aside from the onboard I219 LOM, which we do have quirked).  This makes the currently available Skylake platforms a really bad choice for device assignment.

Based on this new data, I'll revise my recommendation for Intel platforms to include Xeon E5 and higher processors or Core i7 High End Desktop Processors (as listed by Intel).  Of course there are combinations where regular Core i5, i7, and Xeon E3 processors will work well; we simply need to be aware of their limitations and factor them into our system design.
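Whichever platform you end up with, it's worth verifying the isolation you actually get rather than trusting spec sheets alone (a sketch; the root port address is just an example, find yours with lspci -t):

find /sys/kernel/iommu_groups/ -type l
sudo lspci -vvv -s 00:01.0 | grep -i acs

Everything in your GPU's group needs to be released from host drivers for assignment, so smaller groups are what you're after.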

EDIT (Oct 30 04:18 UTC 2015): A more subtle feature also found in these E series processors is support for IOMMU super pages.  The datasheets for the E5 Xeons and these High End Desktop Processors indicate support for 2MB and 1GB IOMMU pages, while the standard Core i5 and i7 only support 4KB pages.  This means less space wasted on IOMMU page tables, more efficient table walks by the hardware, and less thrashing of the I/O TLB under I/O load, which can otherwise cause I/O stalls.  Will you notice it?  Maybe.  VFIO will take advantage of IOMMU super pages any time we find a sufficiently sized range of contiguous pages.  To help ensure this happens, make use of hugepages in the VM.
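For reference, requesting hugepage backing for the VM in libvirt looks something like this (a minimal sketch; you still need to allocate hugepages on the host first, for example via the hugepages= kernel parameter or /proc/sys/vm/nr_hugepages):

  <memoryBacking>
    <hugepages/>
  </memoryBacking>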