Competition between IT hardware manufacturers is fierce. Decimal-point differences in performance specs translate into millions of dollars won or lost with every chip release. Manufacturers are very creative at finding ways to gain an edge over their competition, and sometimes that creativity works against them. This appears to be the case with Intel’s CPUs, and in the worst case it affects anyone who relies on Intel chips for virtualization: most companies, as well as cloud providers like Microsoft Azure, Amazon EC2, and Google Compute Engine. It is up to operating system vendors to fix the problem, and the fix will hurt performance.
Details of the security vulnerability are under embargo from Intel to give developers time to come up with a fix, so much of the reporting on the bug is extrapolated from online discussions and from dissecting the Linux patches that were quickly rolled out in December.
It is suspected that the flaw is in the way an Intel CPU manages memory between “kernel mode” and “user mode.” Think of all the programs running on a computer at the same time. For security and stability reasons, we want to be sure that one program cannot negatively impact another. For example, if your browser crashes, you don’t want it to take down the entire computer by crashing the OS.
In a virtualized cloud environment, you don’t want someone else’s program to be able to see the details of what you are running in your portion of the cloud. To accomplish this isolation, individual programs are run in their own “user space.” However, these programs still share hardware like network connections and hard drives, so another layer is required. Kernel mode coordinates requests for shared hardware while still maintaining isolation between the various user mode programs. When microseconds can affect your performance metrics, the “cost” of loading kernel mode to execute the request, unloading kernel mode, and returning to user mode is “expensive.” As described in The Register article, a shortcut is used to reduce that cost: “To make the transition from user mode to kernel mode and back to user mode as fast and efficient as possible, the kernel is present in all processes’ virtual memory address spaces, although it is invisible to these programs. When the kernel is needed, the program makes a system call, the processor switches to kernel mode and enters the kernel. When it is done, the CPU is told to switch back to user mode, and re-enter the process.”
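The cost of crossing into kernel mode can be seen with a rough timing sketch. The Python below compares a loop that stays in user mode with a loop that makes one real system call per iteration (an unbuffered read from the null device). The function names are hypothetical, and the numbers include interpreter overhead, so treat the output as a relative illustration rather than a benchmark.

```python
import os
import time


def time_user_mode(iterations):
    """Time a loop that never leaves user mode (pure arithmetic)."""
    total = 0
    start = time.perf_counter()
    for i in range(iterations):
        total += i
    return time.perf_counter() - start


def time_system_calls(iterations):
    """Time a loop that enters the kernel once per iteration."""
    fd = os.open(os.devnull, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.read(fd, 1)  # each unbuffered read is one system call
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    n = 200_000
    user = time_user_mode(n)
    kernel = time_system_calls(n)
    print(f"user-mode loop:   {user / n * 1e9:8.1f} ns per iteration")
    print(f"system-call loop: {kernel / n * 1e9:8.1f} ns per iteration")
```

Running the same script before and after a kernel update is one informal way to see whether the per-call cost of entering the kernel has changed on a given machine.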
Although memory for each user process is well isolated, it is believed that the Intel flaw allows for these user processes to exploit kernel memory space to violate the intended isolation.
Many operating systems use a security control called Kernel Address Space Layout Randomization (KASLR), which is supposed to reduce the risk of a user process gaining access to kernel memory space. (Daniel López Azaña has a good summary of ASLR, KASLR and KARL here.) However, in October 2017 the Linux core kernel developers released the KAISER patch series, which hinted at the current Intel CPU issue and is detailed in the LWN article “KAISER: hiding the kernel from user space.” Then in December, a number of Linux distributions released kernel updates that included Kernel Page-Table Isolation (PTI), which significantly restricts the memory space available to running processes. On December 26, 2017, Intel’s competitor AMD sent this email to the Linux kernel mailing list:
"AMD processors are not subject to the types of attacks that the kernel page table isolation feature protects against. The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault."
All of this activity seems to point squarely at a problem in the way that Intel CPUs isolate, or fail to isolate, kernel memory from user processes. But while the embargo is in place, it is all educated guessing.
Major Linux distributions have released kernel updates to address the issue, and Microsoft is expected to release corresponding patches in January’s patch bundle. There are rumors that Microsoft Azure and Amazon Web Services customers have been notified directly of impending maintenance outages this month, which might be associated with patches for this Intel bug. Since the kernel mode shortcut was intended to improve CPU performance, you should expect that the fix will negatively impact current performance. We will have to wait for the Intel information embargo to be lifted, and for the Linux and Windows patches to be applied, to truly understand the risks and performance impacts.
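On a Linux machine, one informal way to see whether a patched kernel has page table isolation switched on is to look at the CPU flags the kernel reports. The sketch below is only an illustration: the function names are hypothetical, and the flag names it looks for ("pti", and "kaiser" on some backported kernels) are an assumption about how a given kernel labels the feature. A missing flag does not prove that a system is unpatched.

```python
from pathlib import Path

# Assumed flag names: "pti" on mainline kernels, "kaiser" on some backports.
ISOLATION_FLAGS = {"pti", "kaiser"}


def isolation_flag_present(cpuinfo_text):
    """Return True if the first 'flags' line lists an isolation flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return bool(ISOLATION_FLAGS & set(value.split()))
    return False


def check_running_kernel(path="/proc/cpuinfo"):
    """Check the live system; return None if the file cannot be read."""
    try:
        return isolation_flag_present(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    result = check_running_kernel()
    if result is None:
        print("Could not read CPU information on this system.")
    elif result:
        print("Kernel reports page table isolation as active.")
    else:
        print("No page table isolation flag reported.")
```

The parsing is kept separate from the file read so the check can be run against saved output from other machines.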