Linux as a UEFI bootloader and kexecing Windows


** DRAFT DRAFT DRAFT **

Abstract

Description: As strange as it first seems, chainloading Windows from Linux might be the more secure way to boot the system. From within a minimal PXE booted runtime, the Linux shell scripts can perform a TPM-rooted remote attestation with the normal tpm2 tools, receive the BitLocker keys from the attestation server using the safeboot scripts and safely pass them to Microsoft’s bootloader in a UEFI ramdisk via a form of kexec. This specialized Linux kernel and initrd also make an ideal OS install and recovery environment since they can use the vendor-provided UEFI device drivers to talk to the hardware, allowing a generic kernel to work on most devices without customization.

What's the problem?

We have a problem.

We have servers, like datacenters full of racks of bare metal machines with local disks, and some of these servers run Windows.

Since this is a Linux-centric conference, you might think that running Windows is the problem... except that is not actually a problem -- there are many reasons that users need to run Windows, so we need to support them.

Data is the problem

One of the big problems that we have is what to do with the disks and the systems when they are being decommissioned or sent out for service or if there are nosy administrators. This isn't a hypothetical problem -- many disks end up on eBay with clear-text data and sensitive company information -- and we don't want to destroy working hardware if there are other options.

Encryption alone is not the solution

You might be thinking, "this sounds like a job for encryption!".

And you would be right -- LUKS for Linux and BitLocker for Windows allow us to encrypt almost the entire disk with good-enough encryption. However, now we've turned our problem into a key management problem…

BitLocker, Microsoft's full disk encryption system, has a few different ways to retrieve the secret when the machine boots. One simple method is that for each boot someone toggles in the password on the front panel (or remote KVM equivalent), except that this requires manual intervention on each boot and we really don’t want to have to type a decent password in each time the server reboots. We also don't want to store the key on the machine, since then it is almost as good as having a cleartext disk.

TPMs are good, but...

Some of you in the audience might be thinking, "why don't you put the key in the TPM?"

And that is one of the main uses for the Trusted Platform Module -- to store secrets and only unseal them if the machine is in a good state. We're big fans of the TPM and use it in several applications as part of the safeboot.dev project.

The problem is that the TPM doesn't know much about the external state of the machine, only the firmware in the boot path, and there have been demonstrations of sniffing the disk encryption key from the bus connecting the CPU to the TPM during an unattended boot. This basically means that if the TPM unseals the secrets automatically, then it might as well be in the clear if the server and disks leave together, either by accident or maliciously.

So what we really need is a way to deliver the encryption key to the server over the network each time it reboots, as long as it is only on our own network. Microsoft does have network unlock for BitLocker, although it has many caveats related to needing DHCP extensions and really just replaces inputting the BitLocker PIN.

Additionally, we really want to ensure that we're only providing the key to machines that are in a good state. This means that the firmware is configured in the state that we expect, that all of the things like SecureBoot keys are correctly configured, etc, so that a local admin can't configure a machine insecurely and then use it to exfiltrate secrets from the system.

Remote Attestation to the rescue

Yes, this does sound like a job for TPM rooted remote attestation!

The safeboot-attest system does this for Linux systems with LUKS encryption. Using a small signed Linux kernel and initrd, we're able to PXE boot a server and generate a signed TPM quote as to the state and identity of the system. The booting system then sends this quote and eventlog to a remote attestation server that can validate the state and ensure that the event log is consistent with our desired configuration, and reply with a TPM sealed secret that is only valid for this single boot. An attacker in the middle can't unseal the secret since they don't have access to the TPM, and a local admin can't configure the system into an insecure mode or boot an insecure loader kernel, since that would not be accepted by the remote attestation server.

Once the secrets are received and unsealed to decrypt the disk, the booting system extends a PCR to prevent further use of the sealed secret, adds the LUKS disk encryption key into the real operating system's initrd, and then calls kexec to boot into the encrypted image on the disk.
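
Why extending a PCR revokes the secret can be shown with a toy model. This is only a sketch and not the safeboot code: `pcr_extend`, `seal` and `unseal` are hypothetical helpers that imitate the TPM in software (a real TPM enforces the policy in hardware and never keeps the secret next to the policy like this). The one real rule it relies on is that a PCR can only be extended, new value = hash(old value || measurement digest), and never set back.

```python
import hashlib

PCR_SIZE = 32  # SHA-256 bank


def pcr_extend(pcr: bytes, data: bytes) -> bytes:
    """Extend rule: new = SHA256(old || SHA256(data))."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def seal(secret: bytes, pcr: bytes) -> dict:
    """Toy stand-in for a PCR policy: remember which PCR value is required."""
    return {"required_pcr": pcr, "secret": secret}


def unseal(blob: dict, current_pcr: bytes) -> bytes:
    if current_pcr != blob["required_pcr"]:
        raise PermissionError("PCR policy mismatch")
    return blob["secret"]


pcr = bytes(PCR_SIZE)                      # PCRs start at all zeros
pcr = pcr_extend(pcr, b"boot kernel")      # the state the server attested
blob = seal(b"disk-key", pcr)

key = unseal(blob, pcr)                    # works in the attested state
pcr = pcr_extend(pcr, b"secret consumed")  # move the PCR on after use
try:
    unseal(blob, pcr)
    reusable = True
except PermissionError:
    reusable = False
```

After the extra extend there is no way to get back to the old PCR value short of a reboot, which is what makes the sealed secret single use.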

What about legacy systems?

As I mentioned, this works wonderfully on Linux systems since kexec can pass secrets and control to a new Linux kernel. But what about legacy systems?

kexec NTOSKRNL.EXE

In an ideal world we would be able to do the same sort of operation: the UEFI firmware in the flash would fetch the Linux boot kernel via PXE, which would perform a remote attestation and then kexec("ntoskrnl.exe").

quibble is an attempt to reverse engineer this handoff, although it is somewhat fragile. The problem here is that the NT Kernel's entry point is undocumented and uses proprietary APIs to hand control from the Windows bootmgfw.efi boot manager to the kernel, so quibble has to depend on attempting to recreate these structures. It also only works on some Windows versions, which makes it hard to support in production.

kexec BOOTMGFW.EFI

Another approach would be to have Linux be able to kexec "bootmgfw.efi", the Windows boot manager, instead. This is a normal EFI executable, so the APIs used for the handoff are well documented (if extensive), and they don't change much between Windows versions, so it should be easier to support in production across a heterogeneous fleet of systems.

The first issue is a simple matter of programming: kexec only supports bzImage and MultiBoot kernels, so it would have to be extended to support PE32 EFI executables. The bigger issue is that once Linux has called the UEFI gBS->ExitBootServices() function early in the kernel startup process, the devices and protocols that the Windows boot manager expects to use to load various DLLs and the next stage boot loaders from the disk are no longer available. Additionally, after exiting boot services Linux remaps the UEFI runtime services with SetVirtualAddressMap(), and this may only be called once, so Linux would have to know where Windows wanted to remap them.

kexec UefiPayloadPkg

What if instead, you might ask, Linux provides an emulation of the various boot services device drivers?

This is a really good idea and has been implemented by Chris Koch and Ofir Weisse. Their success with the UefiPayloadPkg inspired this project -- it is a special build of the open source EDK2 UEFI reference implementation that is designed to be executed by a system that has already completed silicon and platform initialization, which allows it to focus only on the bootloader logic with platform independent drivers. Essentially the UefiPayloadPkg "firmware" only needs to provide a layer of UEFI callbacks to make it appear that it is booting the system.

This works really well in QEMU and on hardware that has device drivers in the open source EDK2 tree, or if you are building for a system where you have source for the device drivers already (such as the custom hardware used by large hyperscalers like Google). Unfortunately trying it on more commodity hardware fails: it does not have the drivers for Dell's PERC controllers or the Matrox video emulated by the ASPEED BMC, so the Windows boot loader is unable to boot on those servers.

Additionally, it only provides a shim of System Management Mode, rather than the actual SMM interface. This means that hardware interfaces provided by vendor SMM modules are not available to the UefiPayloadPkg. On real hardware this includes things like the UEFI NVRAM.

chainloading Windows

Instead of trying to emulate UEFI and trying to find drivers for all of the various devices, what if we instead returned to UEFI from Linux and allowed the real vendor UEFI firmware to then load the next boot option?

This gives us access to:

All of the OEM provided drivers (including Option ROMs)
Real NVRAM
Real SMM

This also comes with a challenge: the Linux boot kernel must not disturb any of the UEFI state or be able to restore all of the state prior to returning from the Linux kernel.

Since I'm giving this talk, it must work, right? So here's a demo of PXE booting a system in QEMU, retrieving secrets via remote attestation, and then chainloading into Windows:

There's lots that happens in that 1 minute video, so let's break down the steps and components that are being used.

PXE

The first thing to notice is that we're booting the system with PXE using the machine's UEFI PXE code. The image that is delivered is a "signed unified EFI image" that contains a PE32 executable, the Linux kernel, the initrd and the command line for the kernel. In order to boot efficiently, we're trying to keep this reproducibly built minimal environment to less than 16 MiB, including the tpm2-tools, LUKS, LVM, etc.

This unified image is also signed with sbsign-tool using the private key that matches the public key stored in the system's PK, KEK or one of the db entries. Since we're using an attested boot, this doesn't provide much more security, although it does prevent some classes of false attestation failures.

EFI wrapper

The safeboot-loader EFI wrapper that we've written is similar to the systemd boot wrapper, which uses named PE32 sections for the Linux kernel, initrd and commandline, although we're doing some extra work to ensure that the Linux kernel can return to the UEFI environment.

The first thing the wrapper does on startup is locate the named sections (linux, initrd, cmdline). It doesn't do any signature checks on them or additional measurements since the entire image is signed at once, and the entire image has already been measured by the UEFI PXE firmware.
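
Locating named sections is just a walk over the PE section table. The sketch below shows the idea on a byte string. It is an illustration under stated assumptions, not the wrapper's actual code (which runs as an EFI executable): `find_sections` and `make_demo_image` are hypothetical names, and the demo image is a header-only blob, not something bootable. The header offsets are the standard PE/COFF ones.

```python
import struct


def find_sections(image: bytes) -> dict:
    """Return {name: (file_offset, size)} for each PE section in image."""
    if image[:2] != b"MZ":
        raise ValueError("missing DOS header")
    (pe_off,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew
    if image[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # COFF header (20 bytes): Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics
    _m, nsections, _t, _s, _n, opt_size, _c = struct.unpack_from(
        "<HHIIIHH", image, pe_off + 4)
    table = pe_off + 4 + 20 + opt_size
    sections = {}
    for i in range(nsections):
        # Each section header is 40 bytes; only the first 24 are needed here.
        name, vsize, _vaddr, _rawsize, rawptr = struct.unpack_from(
            "<8sIIII", image, table + 40 * i)
        sections[name.rstrip(b"\0").decode("ascii")] = (rawptr, vsize)
    return sections


def make_demo_image(payloads: dict) -> bytes:
    """Build a header-only PE blob for the demo (hypothetical helper)."""
    pe_off = 0x40
    dos = b"MZ" + bytes(0x3C - 2) + struct.pack("<I", pe_off)
    coff = struct.pack("<HHIIIHH", 0x8664, len(payloads), 0, 0, 0, 0, 0)
    data_off = pe_off + 4 + 20 + 40 * len(payloads)
    headers, body = b"", b""
    for name, data in payloads.items():
        headers += struct.pack("<8sIIII", name.encode(), len(data), 0,
                               len(data), data_off + len(body)) + bytes(16)
        body += data
    return dos + b"PE\0\0" + coff + headers + body


image = make_demo_image(
    {"linux": b"KERNEL", "initrd": b"RAMDISK", "cmdline": b"quiet"})
sections = find_sections(image)
off, size = sections["cmdline"]
cmdline = image[off:off + size]
```

The sketch uses the VirtualSize field as the payload length, on the assumption that the raw size may be padded up to the file alignment.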

Next the loader allocates a 1 GiB aligned region of memory with the UEFI AllocatePages() function. This is the region that the Linux kernel will use, and the loader adds a memmap=exactmap argument to the kernel command line to tell the kernel that it is not allowed to touch any memory outside of that region. This ensures that when Linux "returns" to UEFI, all of the UEFI data structures are intact and unmodified.

You might notice that there is an additional region at the very bottom of memory that is passed into the exactmap. This is for the SMP trampoline, which we're not using since the boot kernel runs single CPU, although there is code in the kernel that will panic if it is unable to reserve some amount of memory down there.
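
As a rough sketch of what ends up on the command line: `memmap=exactmap` tells the kernel to ignore the firmware memory map, and each following `memmap=size@base` argument declares one usable RAM region. The helper name `exactmap_args` and both addresses below are made up for illustration, and the real loader builds this string in C.

```python
GIB = 1 << 30


def exactmap_args(regions):
    """Build memmap= arguments limiting Linux to the given (base, size) regions."""
    args = ["memmap=exactmap"]
    for base, size in regions:
        args.append(f"memmap={size:#x}@{base:#x}")
    return " ".join(args)


# Hypothetical addresses for illustration only
low_region = (0x0, 0x80000)        # low memory the kernel insists on reserving
linux_region = (4 * GIB, 1 * GIB)  # the 1 GiB aligned block from AllocatePages()
assert linux_region[0] % GIB == 0

cmdline = exactmap_args([low_region, linux_region])
# cmdline is now:
# "memmap=exactmap memmap=0x80000@0x0 memmap=0x40000000@0x100000000"
```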

The loader also stores the CR3 page table register, the GDT, LDT and IDT (the global, local and interrupt descriptor tables), as well as some other x86-specific cruft into a region in the very bottom of physical memory. These will be used both by UEFI device drivers while Linux is running, as well as to return to UEFI later.

Finally the loader hooks the gBS->ExitBootServices() function pointer so that the first call to exit boot services will be ignored. This reduces the number of patches we have to make to the Linux kernel, although it could also be done by having our boot kernel simply not make the one-time call to exit boot services.
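
The hook itself is an ordinary function pointer swap. The toy model below shows the shape of it. `BootServices` is a stand-in class rather than the real gBS table, and passing later calls through to the original function is an assumption of this sketch, not something confirmed for the loader.

```python
class BootServices:
    """Toy stand-in for the UEFI boot services table (gBS)."""

    def __init__(self):
        self.exited = False
        self.ExitBootServices = self._real_exit

    def _real_exit(self, image_handle, map_key):
        self.exited = True
        return 0  # EFI_SUCCESS


def hook_exit_boot_services(gbs):
    """Replace the table entry so that the first call is swallowed."""
    real = gbs.ExitBootServices
    state = {"calls": 0}

    def hooked(image_handle, map_key):
        state["calls"] += 1
        if state["calls"] == 1:
            return 0  # report success but leave boot services running
        return real(image_handle, map_key)  # assumption: later calls pass through

    gbs.ExitBootServices = hooked


gbs = BootServices()
hook_exit_boot_services(gbs)
first = gbs.ExitBootServices(None, 0)  # the kernel's one-time call
still_running = not gbs.exited
```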

There are likely more dusty corners of the x86 that we need to be sure to preserve. The current set works on our real hardware, although not a wide range has been tested.

UEFI Protocols

Since Linux can't disturb any of the UEFI state, it also can't directly interact with any devices since that would potentially confuse the UEFI device driver when it tried to use the device later. Instead, the boot kernel has special device drivers that act like a shim or impedance matching layer between the vendor provided UEFI drivers and the Linux runtime.

The UEFI Forum has an extensive specification for the various UEFI device drivers, which they call "Protocols", and most vendor firmwares have fairly comprehensive support for the devices that are included in their systems. This also allows us to have a much smaller boot kernel since it does not need to carry a range of device drivers; the firmware already has code to talk to the random RAID controller or whatever, so Linux doesn't need to know about it.

There is some irony here, since a few years ago I gave a talk about LinuxBoot and replacing UEFI with Linux drivers, and now I'm giving a talk about using those same drivers. Ideally we would have an actual open firmware like LinuxBoot and coreboot, although until then we're having to support commodity systems with UEFI, so that's what we're doing here...

So, which of the "Protocols" do we support?

TCG2 Protocol and eventlog

The most important protocol for a remote attestation client is the one that lets us talk to the TPM and receive signed quotes that we can deliver to the remote attestation server. We also need support for the event log to be able to show the attestation server what measurements have been done to arrive at the PCR values in the quote. This is implemented in tpm.c.

The EFI_TCG2_PROTOCOL has a few fully featured interfaces for reading PCRs, extending them, etc, although we use the raw SubmitCommand() method that allows the kernel module to take the commands sent to /dev/tpm0 and directly send them to the TPM hardware. Our module does not inspect the commands at all, so it does not need to be aware of what is being sent and this means that it supports everything supported by tpm2-tools (or whichever TPM library you want to use). There is a caveat that the length of the reply from the TPM is embedded in the message, so it does have to peek into it to find the return value.
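
That peek is small. A TPM 2.0 response starts with a 10 byte big-endian header: a 16-bit tag, a 32-bit total size and a 32-bit response code. The sketch below reads the size field out of a reply buffer. It is written in Python for illustration (the kernel module is C), and `response_length` is a hypothetical name.

```python
import struct

TPM_HEADER = struct.Struct(">HII")  # tag, total size, response code


def response_length(buf: bytes) -> int:
    """Return the number of valid bytes in a TPM 2.0 response buffer."""
    if len(buf) < TPM_HEADER.size:
        raise ValueError("short TPM response")
    _tag, size, _rc = TPM_HEADER.unpack_from(buf)
    if size < TPM_HEADER.size or size > len(buf):
        raise ValueError("bad size field")
    return size


# A 4 KiB reply buffer holding a header-only success response
# (tag 0x8001 is TPM_ST_NO_SESSIONS, response code 0 is success).
buf = struct.pack(">HII", 0x8001, 10, 0) + bytes(4096 - 10)
n = response_length(buf)
reply = buf[:n]  # only these bytes go back to the reader of /dev/tpm0
```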

The event log is slightly more complicated due to some weird decisions by the TCG and UEFI. Once ExitBootServices is called, the UEFI provided TPM drivers are no longer available and their memory is freed, so the event log is no longer accessible. As a result, the operating system's TPM driver has to store a copy of the event log before ExitBootServices is called, which means that it can't be a late-loaded kernel module. To make it even worse, ExitBootServices measures things into the eventlog, so the copy that is made prior to the call is not correct. This has caused problems for BitLocker in the past and not all vendors get it right.

However, since our boot kernel does not exit boot services, the event log is still "live" and we can even add new items to it as new measurements are extended into the PCRs. We've patched the bin_log_seqops.seqops.start() method in our Linux TPM object to always call the UEFI EFI_TCG2_PROTOCOL's GetEventLog() method so that it is always reading the current version.
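
A current log matters because the verifier replays it: starting from zeroed PCRs it extends each logged digest in order and checks that the result matches the quote. The sketch below shows why a stale copy fails. It is simplified in ways worth flagging: a real quote signs a digest over the selected PCRs rather than carrying raw values, real log entries carry more than an index and a digest, and the measurements here are invented.

```python
import hashlib


def replay(events, num_pcrs=24):
    """Recompute SHA-256 PCR values from (pcr_index, digest) log entries."""
    pcrs = [bytes(32) for _ in range(num_pcrs)]
    for index, digest in events:
        pcrs[index] = hashlib.sha256(pcrs[index] + digest).digest()
    return pcrs


def log_matches_quote(events, quoted):
    """quoted maps pcr_index -> PCR value vouched for by the signed quote."""
    pcrs = replay(events)
    return all(pcrs[i] == value for i, value in quoted.items())


# Hypothetical measurements
log = [
    (4, hashlib.sha256(b"unified image").digest()),
    (4, hashlib.sha256(b"RamDiskDxe").digest()),  # measured later in the boot
]
quoted = {4: replay(log)[4]}  # what the TPM itself would report

stale_ok = log_matches_quote(log[:1], quoted)  # copy taken too early
fresh_ok = log_matches_quote(log, quoted)      # log read at quote time
```

Here `stale_ok` is False and `fresh_ok` is True: the early copy is missing the last event, so the replay cannot reach the quoted value.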

Simple Network Protocol

The second most important one for a remote attestation client is the NIC, to be able to contact the remote attestation server. UEFI has a full TCP stack layered on top of its SIMPLE_NETWORK_PROTOCOL, although our efinet.c uses the raw UEFI protocol's "Transmit()" and "Receive()" calls to directly put Linux struct sk_buff objects on the wire and read them from the wire.

One caveat, at the time of this talk, is that we do not have a proper callback working, so instead we spin up a kernel thread that polls the UEFI Protocol every so often. This limits the total bandwidth of our boot kernel and is an area that needs improvement. Another limitation is that failures are not really handled; packets are just dropped if they are not sendable for some reason.
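
The shape of that polling design, including the drop-on-failure behaviour, looks roughly like the following. This is a user-space toy: `FakeSnp`, `poll_once`, `xmit` and the per-pass budget are all invented for the sketch and do not come from efinet.c.

```python
from collections import deque


class FakeSnp:
    """Toy stand-in for the firmware's network protocol instance."""

    def __init__(self, incoming, link_up=True):
        self.incoming = deque(incoming)
        self.link_up = link_up
        self.sent = []

    def transmit(self, frame) -> bool:
        if not self.link_up:
            return False
        self.sent.append(frame)
        return True

    def receive(self):
        return self.incoming.popleft() if self.incoming else None


def poll_once(snp, rx_queue, budget=16):
    """One pass of the polling thread: drain up to `budget` received frames."""
    count = 0
    while count < budget:
        frame = snp.receive()
        if frame is None:
            break
        rx_queue.append(frame)
        count += 1
    return count


def xmit(snp, frame, stats):
    """Send a frame; on failure just count it as dropped (no retry)."""
    if not snp.transmit(frame):
        stats["tx_dropped"] += 1


snp = FakeSnp([b"frame1", b"frame2"], link_up=False)
rx, stats = [], {"tx_dropped": 0}
received = poll_once(snp, rx)
xmit(snp, b"hello", stats)
```

Because frames are only picked up when the thread runs, throughput is bounded by the poll interval times the budget, which is the bandwidth limit mentioned above.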

Block IO Protocol

During normal attestations the boot kernel might not need to access the disk, although it may be called upon to install an OS image or otherwise inspect what is already installed. For that to work, we need the EFI_BLOCK_IO_PROTOCOL, which is used by blockio.c to create a /dev/uefi* block device for every UEFI supported storage device. This includes both normal disks, as well as USB flash drives and vendor supported drivers for things like PERC RAID controllers.

The version of the UEFI protocol that we're using is synchronous: Read() and Write() calls return when they are done. This is not very high performance, although again it is not a significant concern in the pre-boot environment. There is a fancier non-blocking version of the protocol and future work might use that once we have support for UEFI callbacks.

Since removable block devices can appear after initialization, our module also registers a UEFI event for when new devices register support for the EFI_BLOCK_IO_PROTOCOL GUID. This is checked in the same periodic timer as the NIC callback.

Ramdisk Protocol

Usually a successful remote attestation will deliver a secret to the machine, such as a disk encryption key. Since the boot kernel doesn't use the disk directly, it needs a way to pass this to the real OS. For a kexec of a Linux kernel, this can be added to the next kernel's initrd. For a Windows server, the BitLocker .BEK protector file can be passed in a UEFI ramdisk provided by ramdisk.c.

A GPT partitioned disk image can be sent into /sys/efi/firmware/ramdisk and that data will be copied into a UEFI allocated region, and then the UEFI_RAMDISK_REGISTER_RAMDISK method will be called on the region to create a new UEFI BLOCK_IO_PROTOCOL device. Since the boot kernel has access to UEFI block devices, the vfat partition of the image can be mounted and the provisioned secrets (or whatever) written into it.

DXE Module Loader

Unfortunately, no vendors seem to ship with the EFI_RAMDISK_PROTOCOL device drivers. Luckily the source code for the RamDiskDxe module is available in the UEFI edk2 reference implementation, so it can be loaded by the boot kernel with the EFI_IMAGE_START boot services function.

This is sort of like an insmod for UEFI and allows new functionality to be added during the boot process. It also measures the new modules into the appropriate PCR and creates TPM event log entries, so the attestation will reveal that these modules have been loaded.

(This could also be used to invoke the Windows bootloader, although there would need to be some additional work done to shut down Linux during the process. Future research is needed...)

Memory management hack

Some of you in the audience might have been thinking "doesn't UEFI run in a linear address map and with its own page tables?" And yes, yes it does. efiwrapper.c is the messiest hack in the current safeboot-loader tree and one for which I would really like to find a better solution.

The memory that the wrapper allocated for Linux is at least 1 GiB and is gigabyte aligned, which means that we can replace all of the entries in the Linux kernel thread's top-level page table with the same values from the UEFI page table that we stored back in the wrapper setup code. This recreates the linear mapping that UEFI used, allowing all of the UEFI code to run unchanged.
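
The arithmetic behind the gigabyte alignment is short. With x86-64 4-level paging, bits 47:39 of an address select the top-level (PML4) entry and bits 38:30 select the 1 GiB entry below it, so an aligned 1 GiB block sits in exactly one slot at each level. The sketch below only works through those indices for a made-up address. How efiwrapper.c actually selects and copies entries is not shown here.

```python
GIB = 1 << 30


def align_up(addr: int, alignment: int = GIB) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def pml4_index(addr: int) -> int:
    return (addr >> 39) & 0x1FF  # bits 47:39, 512 GiB per entry


def pdpt_index(addr: int) -> int:
    return (addr >> 30) & 0x1FF  # bits 38:30, 1 GiB per entry


raw = 0x1_2345_6000   # hypothetical unaligned allocation
base = align_up(raw)  # 0x1_4000_0000, the 5 GiB boundary
last = base + GIB - 1

# First and last byte of the region land in the same pair of slots.
slots = {(pml4_index(a), pdpt_index(a)) for a in (base, last)}
```

For this address `slots` is `{(0, 5)}`: the whole region is covered by one 1 GiB entry under the first top-level entry.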

Like I said, this is an ugly hack and there is probably a better way to do it. Linux's EFI nvram interface uses a work queue with its own page table and stuff, maybe that is a better approach, although it does introduce threading and ordering issues.

safeboot-attest

The safeboot-attest scripts are thin wrappers around the tpm2-tools programs to create an ephemeral attestation key and a signed quote, which then use curl to do a multipart HTTP POST to a remote attestation server. The server replies with an ephemeral transport key that is encrypted with the TPM's endorsement key as well as an asset that is encrypted with the transport key. The tpm2-activatecredential command validates that the attestation key is valid for the endorsement key, and then allows the script to unseal the asset.

BitLocker key in ramdisk

In this demo the remote attestation server provided asset contains the Microsoft BitLocker .BEK protector file, which is installed into a GPT partitioned and FAT formatted UEFI ramdisk image. This image is then passed to the UEFI RamdiskDxe module, which we loaded with the gBS->StartImage() call.

chainload via `kexec_load`

Finally the real disk's EFI System Partition is mounted with the block IO protocol interface and the chainload tool is used to pass control to Microsoft's bootmgfw.efi boot loader.

chainload builds a table with the physical address of the EFI executable, a small trampoline and the low memory, then uses the kexec_load() system call to shut down the Linux kernel and jump into the trampoline. The trampoline restores the UEFI context that was saved so long ago during the boot process and then invokes gBS->LoadImage() and gBS->StartImage() on the executable, passing in the UEFI DevicePath for the boot device.
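
kexec_load() takes a list of segments, each a source buffer plus a physical destination, and the destination address and size must be multiples of the page size. The planning step can be sketched as below. This is not the chainload tool: the `Segment` class, the `check_segments` helper and every address are hypothetical, and the sketch only validates a layout rather than making the system call.

```python
from dataclasses import dataclass

PAGE = 4096


@dataclass
class Segment:
    name: str
    data: bytes
    mem: int  # physical destination address

    @property
    def memsz(self) -> int:
        """Destination size, rounded up to whole pages."""
        return (len(self.data) + PAGE - 1) // PAGE * PAGE


def check_segments(segments):
    """Verify page alignment and that no two destination ranges overlap."""
    ordered = sorted(segments, key=lambda s: s.mem)
    for seg in ordered:
        if seg.mem % PAGE:
            raise ValueError(f"{seg.name}: destination not page aligned")
    for a, b in zip(ordered, ordered[1:]):
        if a.mem + a.memsz > b.mem:
            raise ValueError(f"{a.name} overlaps {b.name}")
    return [(s.name, hex(s.mem), s.memsz) for s in ordered]


# Hypothetical layout
plan = check_segments([
    Segment("efi-image", bytes(10_000), 0x4000_0000),
    Segment("trampoline", bytes(200), 0x4010_0000),
    Segment("context", bytes(4096), 0x1000),
])
```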

Hello, Windows 10

And if all goes well, bootmgfw.efi will end up loading its additional libraries, finding the BitLocker BEK protector in the ramdisk, decrypting the BitLocker volume and passing control to the Windows kernel.
