Linux as a UEFI bootloader and kexecing Windows
|** *DRAFT* *DRAFT* *DRAFT* **|
Description: As strange as it first seems, chainloading Windows from Linux might be the more secure way to boot the system. From within a minimal PXE-booted runtime, the Linux shell scripts can perform a TPM-rooted remote attestation with the normal tpm2 tools, receive the BitLocker keys from the attestation server using the safeboot scripts and safely pass them to Microsoft’s bootloader in a UEFI ramdisk via a form of kexec. This specialized Linux kernel and initrd also make an ideal OS install and recovery environment since they can use the vendor-provided UEFI device drivers to talk to the hardware, allowing a generic kernel to work on most devices without customization.
What's the problem?
We have a problem.
We have servers, like datacenters full of racks of bare metal machines with local disks, and some of these servers run Windows.
Since this is a Linux-centric conference, you might think that running Windows is the problem... except that is not actually a problem -- there are many reasons that users need to run Windows, so we need to support them.
Data is the problem
One of the big problems that we have is what to do with the disks and the systems when they are being decommissioned or sent out for service or if there are nosy administrators. This isn't a hypothetical problem -- many disks end up on eBay with clear-text data and sensitive company information -- and we don't want to destroy working hardware if there are other options.
Encryption alone is not the solution
You might be thinking, "this sounds like a job for encryption!".
And you would be right -- LUKS for Linux and BitLocker for Windows allow us to encrypt almost the entire disk with good-enough encryption. However, now we've turned our problem into a key management problem…
BitLocker, Microsoft's full disk encryption system, has a few different ways to retrieve the secret when the machine boots. One simple method is that for each boot someone toggles in the password on the front panel (or remote KVM equivalent), except that this requires manual intervention on each boot and we really don’t want to have to type a decent password in each time the server reboots. We also don't want to store the key on the machine, since then it is almost as good as having a cleartext disk.
TPMs are good, but...
Some of you in the audience might be thinking, "why don't you put the key in the TPM?"
And that is one of the main uses for the Trusted Platform Module -- to store secrets and only unseal them if the machine is in a good state. We're big fans of the TPM and use it in several applications as part of the safeboot.dev project.
The problem is that the TPM doesn't know much about the external state of the machine, only the firmware in the boot path, and there have been demonstrations of sniffing the disk encryption key from the bus connecting the CPU to the TPM during an unattended boot. This basically means that if the TPM unseals the secrets automatically, then it might as well be in the clear if the server and disks leave together, either by accident or maliciously.
So what we really need is a way to deliver the encryption key to the server over the network each time it reboots, as long as it is only on our own network. Microsoft does have network unlock for BitLocker, although it has many caveats related to needing DHCP extensions, and it really just replaces inputting the BitLocker PIN.
Additionally, we really want to ensure that we're only providing the key to machines that are in a good state. This means that the firmware is configured in the state that we expect, that all of the things like SecureBoot keys are correctly configured, etc, so that a local admin can't configure a machine insecurely and then use it to exfiltrate secrets from the system.
Remote Attestation to the rescue
Yes, this does sound like a job for TPM rooted remote attestation!
The safeboot-attest system does this for Linux systems with LUKS encryption. Using a small signed Linux kernel and initrd, we're able to PXE boot a server and generate a signed TPM quote as to the state and identity of the system. The booting system then sends this quote and eventlog to a remote attestation server that can validate the state and ensure that the event log is consistent with our desired configuration, and reply with a TPM sealed secret that is only valid for this single boot. An attacker in the middle can't unseal the secret since they don't have access to the TPM, and a local admin can't configure the system into an insecure mode or boot an insecure loader kernel, since that would not be accepted by the remote attestation server.
Once the secrets are received and unsealed to decrypt the disk,
the booting system extends a PCR to prevent further use of the sealed secret,
adds the LUKS disk encryption key into the real operating system's initrd,
and then calls
kexec to boot into the encrypted image on the disk.
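Here's a minimal sketch of why extending a PCR blocks any further use of the sealed secret. A TPM 2.0 PCR can never be set directly; it can only be updated as new = H(old || digest), so once the value moves on, a secret bound to the old value stops unsealing. This is a pure-Python simulation with SHA-256. The can_unseal() check, the all-zero starting value and the "secret-consumed" marker are illustrative assumptions, not the safeboot implementation.

```python
import hashlib

def pcr_extend(pcr_value: bytes, data: bytes) -> bytes:
    """Simulate a SHA-256 PCR extend: new = SHA256(old || SHA256(data))."""
    digest = hashlib.sha256(data).digest()
    return hashlib.sha256(pcr_value + digest).digest()

def can_unseal(policy_pcr: bytes, current_pcr: bytes) -> bool:
    """Hypothetical stand-in for a TPM PCR policy check on a sealed object."""
    return policy_pcr == current_pcr

# PCR value the secret was sealed against (all zeros purely for illustration).
pcr_at_seal = bytes(32)

# First use: the PCR still matches, so the unseal is allowed.
unseal_before = can_unseal(pcr_at_seal, pcr_at_seal)

# The boot kernel extends the PCR after unsealing (marker name is made up).
pcr_after = pcr_extend(pcr_at_seal, b"secret-consumed")

# Any later attempt sees a different PCR value and is refused.
unseal_after = can_unseal(pcr_at_seal, pcr_after)
```

Since there is no operation that rolls a PCR back short of a reset, the next OS can't replay the unseal even if it captured the sealed blob.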
What about legacy systems?
As I mentioned, this works wonderfully on Linux systems since kexec
can pass secrets and control to a new Linux kernel. But what about Windows?
In an ideal world we would be able to do the same sort of operation:
the UEFI firmware in the flash would fetch the Linux boot kernel via PXE,
which would perform a remote attestation and then kexec into the Windows kernel.
Quibble is an attempt to reverse engineer this handoff, although it is somewhat
The problem here is that the NT Kernel's entry point is undocumented
and uses proprietary APIs to hand control from the Windows
boot manager to the kernel, so quibble has to depend on attempting to
recreate these structures. It also only works on some Windows versions,
which makes it hard to support in production.
Another approach would be to have Linux be able to kexec
the Windows boot manager instead. This is a normal EFI executable,
so the APIs used for the handoff are well documented (if extensive),
and they don't change much between Windows versions, so it should be
easier to support in production across a heterogeneous fleet of systems.
The first issue is a simple matter of programming: kexec
supports bzImage and MultiBoot kernels, so it would have to be extended
to support PE32 EFI executables. The bigger issue is that once Linux has
called the UEFI
gBS->ExitBootServices() function early in the kernel
startup process, the devices and protocols that the Windows boot manager
expects to use to load various DLLs and the next stage boot loaders
from the disk are no longer available. Additionally, Linux
also remaps the UEFI runtime services, a call that may only be made once,
so Linux would have to know where Windows wanted to remap them.
What if instead, you might ask, Linux provides an emulation of the various boot services device drivers?
This is a really good idea and has been implemented by Chris Koch and Ofir Weisse. Their success with the UefiPayloadPkg inspired this project -- it is a special build of the open source EDK2 UEFI reference implementation that is designed to be executed by a system that has already completed silicon and platform initialization, which allows it to focus only on the bootloader logic with platform independent drivers. Essentially the UefiPayloadPkg "firmware" only needs to provide a layer of UEFI callbacks to make it appear that it is booting the system.
This works really well in QEMU and on hardware that has device drivers in the open source EDK2 tree, or if you are building for a system where you have source for the device drivers already (such as the custom hardware used by large hyperscalers like Google). Unfortunately trying it on more commodity hardware fails: it does not have the drivers for Dell's PERC controllers or the Matrox video emulated by the ASPEED BMC, so the Windows boot loader is unable to boot on those servers.
Additionally, it only provides a shim of System Management Mode, rather than the actual SMM interface. This means that hardware interfaces provided by vendor SMM modules are not available to the UefiPayloadPkg. On real hardware this includes things like the UEFI NVRAM.
Instead of trying to emulate UEFI and trying to find drivers for all of the various devices, what if we instead returned to UEFI from Linux and allowed the real vendor UEFI firmware to then load the next boot option?
This gives us access to:
- All of the OEM provided drivers (including Option ROMs)
- Real NVRAM
- Real SMM
This also comes with a challenge: the Linux boot kernel must either not disturb any of the UEFI state, or be able to restore all of the state prior to returning from the Linux kernel.
Since I'm giving this talk, it must work, right? So here's a demo of PXE booting a system in QEMU, retrieving secrets via remote attestation, and then chainloading into Windows:
There's lots that happens in that 1 minute video, so let's break down the steps and components that are being used.
The first thing to notice is that we're booting the system with PXE
using the machine's UEFI PXE code. The image that is delivered is a
"signed unified EFI image" that contains a PE32 executable, the Linux
kernel, the initrd and the command line for the kernel. In order to
boot efficiently, we're trying to keep this reproducibly built
minimal environment to less than 16 MiB, including the
LUKS, LVM, etc.
This unified image is also signed with
sbsign-tool using the private
key that matches the public key stored in the system's
or one of the
db entries. Since we're using an attested boot, this
doesn't provide much more security, although it does prevent some classes
of false attestation failures.
The safeboot-loader EFI wrapper that we've written is similar to the
systemd boot wrapper,
which uses named PE32 sections for the Linux kernel, initrd and commandline,
although we're doing some extra work to ensure that the Linux kernel
can return to the UEFI environment.
The first thing the wrapper does on startup is locates the named sections
cmdline). It doesn't do any signature checks on
them or additional measurements since the entire image is signed at once,
and the entire image has already been measured by the UEFI PXE firmware.
Next the loader allocates a 1 GiB aligned region of memory with the UEFI
AllocatePages() function. This is the region that the
Linux kernel will use, and the loader adds a
memmap=exactmap argument to the
kernel command line to tell the kernel that it is not allowed to touch any
memory outside of that region. This ensures that when Linux "returns"
to UEFI, all of the UEFI data structures are intact and unmodified.
You might notice that there is an additional region at the very bottom of
memory that is passed into the
exactmap. This is for the SMP trampoline,
which we're not using since the boot kernel runs single CPU, although
there is code in the kernel that will panic if it is unable to reserve
some amount of memory down there.
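To make the command line handling concrete, here's a rough sketch of how such arguments could be assembled. It relies on the documented Linux kernel parameters memmap=exactmap and memmap=nn@ss (which marks a region as usable RAM). The function name, the 32 KiB low-memory size and the example base address are assumptions for illustration, not what the safeboot-loader wrapper actually uses.

```python
GIB = 1 << 30

def exactmap_args(base: int, size: int, low_size: int = 0x8000) -> str:
    """Build kernel arguments restricting Linux to [base, base + size)
    plus a small region at the bottom of memory for the SMP trampoline.

    low_size is a guess; the real loader may reserve a different amount.
    """
    if base % GIB != 0:
        raise ValueError("region base must be 1 GiB aligned")
    if size < GIB:
        raise ValueError("region must be at least 1 GiB")
    return " ".join([
        "memmap=exactmap",              # ignore the firmware memory map
        f"memmap={low_size:#x}@0x0",    # low memory for the trampoline
        f"memmap={size:#x}@{base:#x}",  # the region from AllocatePages()
    ])

# Hypothetical allocation: 1 GiB starting at the 4 GiB boundary.
cmdline_fragment = exactmap_args(4 * GIB, GIB)
```

The kernel parses these sizes with a base-detecting parser, so hex values with a 0x prefix are fine here.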
The loader also stores the
CR3 page table register, the GDT, LDT and
IDT global, local and interrupt descriptor tables, as well as some other
x86-specific cruft into a region in the very bottom of physical memory.
These will be used both by UEFI device drivers while Linux is running,
as well as to return to UEFI later.
Finally the loader hooks the
gBS->ExitBootServices() function pointer so that
the first call to exit boot services will be ignored. This reduces the
number of patches we have to make to the Linux kernel, although it could
also be done by having our boot kernel simply not make the one-time call
to exit boot services.
There are likely more dusty corners of the x86 that we need to be sure to preserve. The current set works on our real hardware, although not a wide range has been tested.
Since Linux can't disturb any of the UEFI state, it also can't directly interact with any devices since that would potentially confuse the UEFI device driver when it tried to use the device later. Instead, the boot kernel has special device drivers that act like a shim or impedance matching layer between the vendor provided UEFI drivers and the Linux runtime.
The UEFI Forum has an extensive specification
for the various UEFI device drivers, which they call "Protocols",
and most vendor firmwares have fairly comprehensive support for the devices
that are included in their systems. This also allows us to have a much
smaller boot kernel since it does not need to carry a range of device
drivers; the firmware already has code to talk to the random RAID controller
or whatever, so Linux doesn't need to know about it.
There is some irony here, since a few years ago I gave a talk about LinuxBoot and replacing UEFI with Linux drivers, and now I'm giving a talk about using those same drivers. Ideally we would have an actual open firmware like LinuxBoot and coreboot, although until then we're having to support commodity systems with UEFI so that's what we're doing here...
So, which of the "Protocols" do we support?
TCG2 Protocol and eventlog
The most important protocol for a remote attestation client is to be able to talk to the TPM and receive signed quotes that we can deliver to the remote attestation server. We also need support for the event log to be able to show the attestation server what measurements have been done to arrive at the PCR values in the quote. tpm.c
EFI_TCG2_PROTOCOL has a few fully featured interfaces for
reading PCRs, extending them, etc, although we use the raw command
method that allows the kernel module to take the commands sent to
the Linux TPM device and directly send them to the TPM hardware. Our module does not inspect
the commands at all, so it does not need to be aware of what is being
sent and this means that it supports everything supported by tpm2-tools
(or whichever TPM library you want to use). There is a caveat that the
length of the reply from the TPM is embedded in the message, so it does
have to peek into the reply to find it.
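For the curious, here's roughly what that peek looks like. Every TPM 2.0 response starts with a 10 byte big-endian header: a 16-bit tag, a 32-bit total response size and a 32-bit response code. This Python sketch only illustrates the header layout; the function name and the padded example buffer are made up, and it is not the code in tpm.c.

```python
import struct

# TPM 2.0 response header: tag (u16), responseSize (u32), responseCode (u32),
# all big-endian with no padding.
TPM_HEADER = struct.Struct(">HII")

def parse_response_header(buf: bytes):
    """Return (tag, size, code) from the start of a TPM response buffer."""
    if len(buf) < TPM_HEADER.size:
        raise ValueError("buffer too short for a TPM header")
    tag, size, code = TPM_HEADER.unpack_from(buf)
    if size < TPM_HEADER.size or size > len(buf):
        raise ValueError("responseSize is inconsistent with the buffer")
    return tag, size, code

# Example: an empty successful reply (tag 0x8001 = TPM_ST_NO_SESSIONS,
# size 10, code 0 = TPM_RC_SUCCESS) sitting in a larger, zero-padded buffer.
reply_buffer = bytes.fromhex("8001" "0000000a" "00000000") + b"\x00" * 6

tag, size, code = parse_response_header(reply_buffer)
reply = reply_buffer[:size]  # only this many bytes go back to the caller
```

Everything past the header stays opaque to the driver, which is what lets it pass arbitrary commands through untouched.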
The event log is slightly more complicated due to some weird decisions
by the TCG and UEFI. Once
ExitBootServices is called, the UEFI provided
TPM drivers are no longer available and their memory is freed,
so the event log is no longer accessible. As a result, the operating system's
TPM driver has to store a copy of the event log before ExitBootServices
is called, which means that it can't be a late-loaded kernel module.
To make it even worse,
ExitBootServices measures things into the eventlog,
so the copy that is made prior to the call is not correct. This has
caused problems for BitLocker in the past and not all vendors get it right.
However, since our boot kernel does not exit boot services, the
event log is still "live" and we can even add new items to it as new
measurements are extended into the PCRs. We've patched the
bin_log_seqops.seqops.start() method in our Linux TPM object to
always call the UEFI protocol
so that it is always reading the current version.
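To show why a stale copy of the event log is a problem, here's a small simulation of what a verifier does with it: replay every logged digest into an emulated PCR bank and compare the result with the PCR values from the quote. It assumes SHA-256 PCRs that all start at zero, and the event contents are invented for the example. If the log is missing the last measurement, the replay no longer matches.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def replay_event_log(events, pcr_count: int = 24):
    """Replay (pcr_index, digest) events into a simulated SHA-256 PCR bank.

    Simplification: every PCR starts as 32 zero bytes.
    """
    pcrs = {i: bytes(32) for i in range(pcr_count)}
    for pcr_index, digest in events:
        pcrs[pcr_index] = sha256(pcrs[pcr_index] + digest)
    return pcrs

# Hypothetical measurements; real logs carry many more events and metadata.
live_log = [
    (0, sha256(b"firmware")),
    (4, sha256(b"boot-kernel")),
    (4, sha256(b"late-measurement")),  # added after an early copy was taken
]
stale_log = live_log[:-1]

quoted_pcrs = replay_event_log(live_log)  # stands in for the TPM's real PCRs
pcrs_live = replay_event_log(live_log)
pcrs_stale = replay_event_log(stale_log)

live_matches = pcrs_live[4] == quoted_pcrs[4]
stale_matches = pcrs_stale[4] == quoted_pcrs[4]
```

Reading the log from UEFI on every access keeps the replay and the quote in agreement.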
Simple Network Protocol
The second most important one for a remote attestation client
is the NIC to be able to contact the remote attestation server...
UEFI firmware also provides a full TCP stack on top of
SIMPLE_NETWORK_PROTOCOL, although our efinet.c uses the raw UEFI protocol's
Transmit() and
Receive() calls to directly put Linux
struct sk_buff objects on the wire and read them from the wire.
One caveat, at the time of this talk, is that we do not have a proper callback working, so instead we spin up a kernel thread that polls the UEFI Protocol every so often. This limits the total bandwidth of our boot kernel and is an area that needs improvement. Another limitation is that failures are not really handled; packets are just dropped if they are not sendable for some reason.
Block IO Protocol
During normal attestations the boot kernel might not need to access
the disk, although it may be called upon to install an OS image or
otherwise inspect what is already installed. For that to work, we support the
EFI_BLOCK_IO_PROTOCOL, which is used by blockio.c
to create a
/dev/uefi* block device for every UEFI supported storage
device. This includes both normal disks, as well as USB flash drives
and vendor supported drivers for things like PERC RAID controllers.
The version of the UEFI protocol that we're using is synchronous: the Read() and
Write() calls return when they are done. This is not
very high performance, although again it is not a significant concern
in the pre-boot environment. There is a fancier non-blocking version
of the protocol and future work might use that once we have support
for UEFI callbacks.
Since removable block devices can appear after initialization,
our module also registers a UEFI event for when new devices register
support for the
EFI_BLOCK_IO_PROTOCOL GUID. This is checked in
the same periodic timer as the NIC callback.
Usually a successful remote attestation will deliver a secret to the
machine, such as a disk encryption key. Since the boot kernel doesn't
use the disk directly, it needs a way to pass this to the real OS.
For a kexec of a Linux kernel, this can be added to the next kernel's
initrd. For a Windows server, the BitLocker
.BEK protector file
can be passed in a UEFI ramdisk provided by ramdisk.c.
A GPT partitioned disk image can be sent into
and that data will be copied into a UEFI allocated region, and then
UEFI_RAMDISK_REGISTER_RAMDISK method will be called on the region
to create a new UEFI
BLOCK_IO_PROTOCOL device. Since the boot kernel
has access to UEFI block devices, the vfat partition of the image can
be mounted and the provisioned secrets (or whatever) written into it.
DXE Module Loader
Unfortunately, no vendor seems to ship with the ramdisk
device drivers. Luckily the source code for the RamDiskDxe
module is available in the UEFI edk2 reference implementation, so it can
be loaded by the boot kernel with the
EFI_IMAGE_START boot services call.
This is sort of like
insmod for UEFI and allows new functionality
to be added during the boot process. It also measures the new modules
into the appropriate PCR and creates TPM event log entries, so the
attestation will reveal that these modules have been loaded.
(This could also be used to invoke the Windows bootloader, although there would need to be some additional work done to shut down Linux during the process. Future research is needed...)
Memory management hack
Some of you in the audience might have been thinking "doesn't UEFI run in a linear address map and with its own page tables?" And yes, yes it does. efiwrapper.c is the messiest hack in the current safeboot-loader tree and one that I would really like to find a better solution for.
The memory that the wrapper allocated for Linux is at least 1 GiB and is gigabyte aligned, which means that we can replace all of the entries in the Linux kernel thread's top-level page table with the same values from the UEFI page table that we stored back in the wrapper setup code. This recreates the linear mapping that UEFI used, allowing all of the UEFI code to run unchanged.
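One way to see why the gigabyte alignment is convenient: with standard x86-64 4-level paging, a virtual address splits into four 9-bit table indices, where each top-level (PML4) entry covers 512 GiB and each next-level (PDPT) entry covers exactly 1 GiB. A 1 GiB aligned region therefore starts exactly on a PDPT entry boundary, with the lower two indices at zero. This is just the arithmetic, assuming 4-level paging and a made-up base address; it is not the code in efiwrapper.c.

```python
GIB = 1 << 30

def page_table_indices(addr: int) -> dict:
    """Split an x86-64 address into its 4-level paging table indices."""
    return {
        "pml4": (addr >> 39) & 0x1FF,  # each entry covers 512 GiB
        "pdpt": (addr >> 30) & 0x1FF,  # each entry covers 1 GiB
        "pd":   (addr >> 21) & 0x1FF,  # each entry covers 2 MiB
        "pt":   (addr >> 12) & 0x1FF,  # each entry covers 4 KiB
    }

def is_gib_aligned(addr: int) -> bool:
    return addr % GIB == 0

# Hypothetical base of the region the wrapper allocated for Linux.
linux_base = 4 * GIB
indices = page_table_indices(linux_base)

# For comparison, an address one 2 MiB step into the region.
inner = page_table_indices(linux_base + (2 << 20))
```

Because the whole region lines up with the upper-level entries, swapping the top-level entries between the UEFI and Linux views never splits the Linux region across a partially shared table.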
Like I said, this is an ugly hack and there is probably a better way to do it. Linux's EFI nvram interface uses a work queue with its own page table and stuff, maybe that is a better approach, although it does introduce threading and ordering issues.
The safeboot-attest scripts
are thin wrappers around the
tpm2-tools programs to create an ephemeral
attestation key and a signed quote, which then use
curl to do a multipart
POST to a remote attestation server. The server replies with an
ephemeral transport key that is encrypted with the TPM's endorsement key
as well as an asset that is encrypted with the transport key.
The tpm2-activatecredential command validates that the attestation key
is valid for the endorsement key, and then allows the script to unseal the
Bitlocker key in ramdisk
In this demo the remote attestation server provided asset contains
the Microsoft Bitlocker
.BEK protector file, which is installed into
a GPT partitioned and FAT formatted UEFI ramdisk image. This image is
then passed to the UEFI
RamdiskDxe module, which we loaded with the
Finally the real disk's EFI System Partition is mounted with the
block IO protocol interface and the
chainload tool is used to
pass control to Microsoft's
bootmgfw.efi boot loader.
chainload builds a table with the physical address of the EFI executable,
a small trampoline and the low memory, then uses the kexec
system call to shut down the Linux kernel and jump into the trampoline.
The trampoline restores the UEFI context that was saved so long ago during
the boot process and then invokes
on the executable, passing in the UEFI DevicePath for the boot device.
Hello, Windows 10
And if all goes well,
bootmgfw.efi will end up loading its additional
libraries, finding the BitLocker
BEK protector in the ramdisk,
decrypting the BitLocker volume and passing control to the Windows kernel.