Intel TDX protects guest memory from VMM access. Any memory that is
required for communication with the VMM must be explicitly shared.
It is a two-step process: the guest sets the shared bit in the page
table entry and notifies VMM about the change. The notification happens
using MapGPA hypercall.
Conversion back to private memory requires clearing the shared bit,
notifying VMM with MapGPA hypercall following with accepting the memory
with AcceptPage hypercall.
Provide a TDX version of x86_platform.guest.* callbacks. It makes
__set_memory_enc_pgtable() work right in TDX guest.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-27-kirill.shutemov@linux.intel.com
There are a few MSRs and control register bits that the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent security
guarantees. Fortunately, TDX ensures that these are all in the correct
state before the kernel loads, which means the kernel does not need to
modify them.
The conditions to avoid are:
* Any writes to the EFER MSR
* Clearing CR4.MCE
This theoretically makes the guest boot more fragile. If, for instance,
EFER was set up incorrectly and a WRMSR was performed, it will trigger
early exception panic or a triple fault, if it's before early
exceptions are set up. However, this is likely to trip up the guest
BIOS long before control reaches the kernel. In any case, these kinds
of problems are unlikely to occur in production environments, and
developers have good debug tools to fix them quickly.
Change the common boot code to work on TDX and non-TDX systems.
This should have no functional effect on non-TDX systems.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-24-kirill.shutemov@linux.intel.com
Secondary CPU startup is currently performed with something called
the "INIT/SIPI protocol". This protocol requires assistance from
VMMs to boot guests. As should be a familiar story by now, that
support can not be provded to TDX guests because TDX VMMs are
not trusted by guests.
To remedy this situation a new[1] "Multiprocessor Wakeup Structure"
has been added to to an existing ACPI table (MADT). This structure
provides the physical address of a "mailbox". A write to the mailbox
then steers the secondary CPU to the boot code.
Add ACPI MADT wake structure parsing support and wake support. Use
this support to wake CPUs whenever it is present instead of INIT/SIPI.
While this structure can theoretically be used on 32-bit kernels,
there are no 32-bit TDX guest kernels. It has not been tested and
can not practically *be* tested on 32-bit. Make it 64-bit only.
1. Details about the new structure can be found in ACPI v6.4, in the
"Multiprocessor Wakeup Structure" section.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-22-kirill.shutemov@linux.intel.com
Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by the start up IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by the VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX in which VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
the APs. More details about this wakeup model can be found in ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".
Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(because it boots the AP in 64-bit mode). To handle it, extend the
trampoline code to support 64-bit mode firmware handoff. Also, extend
IDT and GDT pointers to support 64-bit mode hand off.
There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.
The ACPI table parser for the MADT multiprocessor wake up structure and
the wakeup method that uses this structure will be added by the following
patch in this series.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-21-kirill.shutemov@linux.intel.com
TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting them
into TDCALLs to call the host.
But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support the #VE-based emulation during
this boot window, add a minimal early #VE handler support in early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver.
The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exceptions() (like
normal exception failures). At runtime I/O-related #VE exceptions (along
with other types) handled by virt_exception_kernel().
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-19-kirill.shutemov@linux.intel.com
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
TDX brings a new exception -- Virtualization Exception (#VE). Handling
of #VE structurally very similar to handling #GP.
Extract two helpers from exc_general_protection() that can be reused for
handling #VE.
No functional changes.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-7-kirill.shutemov@linux.intel.com
Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with VMM, TDX
specification defines a new instruction called TDCALL.
In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.
A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A leaf of TDCALL used to communicate
with the VMM is called TDVMCALL.
Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).
__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL instruction).
__tdx_hypercall() - Used by the guest to request services from
the VMM (via TDVMCALL leaf of TDCALL).
Also define an additional wrapper _tdx_hypercall(), which adds error
handling support for the TDCALL failure.
The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.
Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.
For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".
Based on previous patch by Sean Christopherson.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are both isolated from the legacy
VMX operation where the host kernel runs.
A CPU-attested software module (called 'TDX module') runs in SEAM VMX
root to manage and protect VMs running in SEAM VMX non-root. SEAM VMX
root is also used to host another CPU-attested software module (called
'P-SEAMLDR') to load and update the TDX module.
Host kernel transits to either P-SEAMLDR or TDX module via the new
SEAMCALL instruction, which is essentially a VMExit from VMX root mode
to SEAM VMX root mode. SEAMCALLs are leaf functions defined by
P-SEAMLDR and TDX module around the new SEAMCALL instruction.
A guest kernel can also communicate with TDX module via TDCALL
instruction.
TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI.
RAX is used to carry both the SEAMCALL leaf function number (input) and
the completion status (output). Additional GPRs (RCX, RDX, R8-R11) may
be further used as both input and output operands in individual leaf.
TDCALL and SEAMCALL share the same ABI and require the largely same
code to pass down arguments and retrieve results.
Define an assembly macro that can be used to implement C wrapper for
both TDCALL and SEAMCALL.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-3-kirill.shutemov@linux.intel.com
In preparation of extending cc_platform_has() API to support TDX guest,
use CPUID instruction to detect support for TDX guests in the early
boot code (via tdx_early_init()). Since copy_bootdata() is the first
user of cc_platform_has() API, detect the TDX guest status before it.
Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit in a valid TDX guest platform.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-2-kirill.shutemov@linux.intel.com
Show value of gap end in the kernel log which equates to number of physical
address bits used by system.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220406195149.228164-4-steve.wahl@hpe.com
The UV5 platform synchronizes the TSCs among all chassis, and will not
proceed to OS boot without achieving synchronization. Previous UV
platforms provided a register indicating successful synchronization.
This is no longer available on UV5. On this platform TSC_ADJUST
should not be reset by the kernel.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220406195149.228164-3-steve.wahl@hpe.com
Version 2 of the GHCB specification provides a Non Automatic Exit (NAE)
event type that can be used by the SEV-SNP guest to communicate with the
PSP without risk from a malicious hypervisor who wishes to read, alter,
drop or replay the messages sent.
SNP_LAUNCH_UPDATE can insert two special pages into the guest’s memory:
the secrets page and the CPUID page. The PSP firmware populates the
contents of the secrets page. The secrets page contains encryption keys
used by the guest to interact with the firmware. Because the secrets
page is encrypted with the guest’s memory encryption key, the hypervisor
cannot read the keys. See SEV-SNP firmware spec for further details on
the secrets page format.
Create a platform device that the SEV-SNP guest driver can bind to get
the platform resources such as encryption key and message id to use to
communicate with the PSP. The SEV-SNP guest driver provides a userspace
interface to get the attestation report, key derivation, extended
attestation report etc.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-43-brijesh.singh@amd.com
Version 2 of GHCB specification provides SNP_GUEST_REQUEST and
SNP_EXT_GUEST_REQUEST NAE that can be used by the SNP guest to
communicate with the PSP.
While at it, add a snp_issue_guest_request() helper that will be used by
driver or other subsystem to issue the request to PSP.
See SEV-SNP firmware and GHCB spec for more details.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-42-brijesh.singh@amd.com
For debugging purposes it is very useful to have a way to see the full
contents of the SNP CPUID table provided to a guest. Add an sev=debug
kernel command-line option to do so.
Also introduce some infrastructure so that additional options can be
specified via sev=option1[,option2] over time in a consistent manner.
[ bp: Massage, simplify string parsing. ]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-41-brijesh.singh@amd.com
SEV-SNP guests will be provided the location of special 'secrets' and
'CPUID' pages via the Confidential Computing blob. This blob is
provided to the run-time kernel either through a boot_params field that
was initialized by the boot/compressed kernel, or via a setup_data
structure as defined by the Linux Boot Protocol.
Locate the Confidential Computing blob from these sources and, if found,
use the provided CPUID page/table address to create a copy that the
run-time kernel will use when servicing CPUID instructions via a #VC
handler.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-40-brijesh.singh@amd.com
Initial/preliminary detection of SEV-SNP is done via the Confidential
Computing blob. Check for it prior to the normal SEV/SME feature
initialization, and add some sanity checks to confirm it agrees with
SEV-SNP CPUID/MSR bits.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-39-brijesh.singh@amd.com
CPUID instructions generate a #VC exception for SEV-ES/SEV-SNP guests,
for which early handlers are currently set up to handle. In the case
of SEV-SNP, guests can use a configurable location in guest memory
that has been pre-populated with a firmware-validated CPUID table to
look up the relevant CPUID values rather than requesting them from
hypervisor via a VMGEXIT. Add the various hooks in the #VC handlers to
allow CPUID instructions to be handled via the table. The code to
actually configure/enable the table will be added in a subsequent
commit.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-33-brijesh.singh@amd.com
This code will also be used later for SEV-SNP-validated CPUID code in
some cases, so move it to a common helper.
While here, also add a check to terminate in cases where the CPUID
function/subfunction is indexed and the subfunction is non-zero, since
the GHCB MSR protocol does not support non-zero subfunctions.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-32-brijesh.singh@amd.com
Due to
103a4908ad ("x86/head/64: Disable stack protection for head$(BITS).o")
kernel/head{32,64}.c are compiled with -fno-stack-protector to allow
a call to set_bringup_idt_handler(), which would otherwise have stack
protection enabled with CONFIG_STACKPROTECTOR_STRONG.
While sufficient for that case, there may still be issues with calls to
any external functions that were compiled with stack protection enabled
that in-turn make stack-protected calls, or if the exception handlers
set up by set_bringup_idt_handler() make calls to stack-protected
functions.
Subsequent patches for SEV-SNP CPUID validation support will introduce
both such cases. Attempting to disable stack protection for everything
in scope to address that is prohibitive since much of the code, like the
SEV-ES #VC handler, is shared code that remains in use after boot and
could benefit from having stack protection enabled. Attempting to inline
calls is brittle and can quickly balloon out to library/helper code
where that's not really an option.
Instead, re-enable stack protection for head32.c/head64.c, and make the
appropriate changes to ensure the segment used for the stack canary is
initialized in advance of any stack-protected C calls.
For head64.c:
- The BSP will enter from startup_64() and call into C code
(startup_64_setup_env()) shortly after setting up the stack, which
may result in calls to stack-protected code. Set up %gs early to allow
for this safely.
- APs will enter from secondary_startup_64*(), and %gs will be set up
soon after. There is one call to C code prior to %gs being setup
(__startup_secondary_64()), but it is only to fetch 'sme_me_mask'
global, so just load 'sme_me_mask' directly instead, and remove the
now-unused __startup_secondary_64() function.
For head32.c:
- BSPs/APs will set %fs to __BOOT_DS prior to any C calls. In recent
kernels, the compiler is configured to access the stack canary at
%fs:__stack_chk_guard [1], which overlaps with the initial per-cpu
'__stack_chk_guard' variable in the initial/"master" .data..percpu
area. This is sufficient to allow access to the canary for use
during initial startup, so no changes are needed there.
[1] 3fb0fdb3bb ("x86/stackprotector/32: Make the canary into a regular percpu variable")
[ bp: Massage commit message. ]
Suggested-by: Joerg Roedel <jroedel@suse.de> #for 64-bit %gs set up
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-24-brijesh.singh@amd.com
To provide a more secure way to start APs under SEV-SNP, use the SEV-SNP
AP Creation NAE event. This allows for guest control over the AP register
state rather than trusting the hypervisor with the SEV-ES Jump Table
address.
During native_smp_prepare_cpus(), invoke an SEV-SNP function that, if
SEV-SNP is active, will set/override apic->wakeup_secondary_cpu. This
will allow the SEV-SNP AP Creation NAE event method to be used to boot
the APs. As a result of installing the override when SEV-SNP is active,
this method of starting the APs becomes the required method. The override
function will fail to start the AP if the hypervisor does not have
support for AP creation.
[ bp: Work in forgotten review comments. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-23-brijesh.singh@amd.com
Add the needed functionality to change pages state from shared
to private and vice-versa using the Page State Change VMGEXIT as
documented in the GHCB spec.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-22-brijesh.singh@amd.com
probe_roms() accesses the memory range (0xc0000 - 0x10000) to probe
various ROMs. The memory range is not part of the E820 system RAM range.
The memory range is mapped as private (i.e encrypted) in the page table.
When SEV-SNP is active, all the private memory must be validated before
accessing. The ROM range was not part of E820 map, so the guest BIOS
did not validate it. An access to invalidated memory will cause a
exception yet, so validate the ROM memory regions before it is accessed.
[ bp: Massage commit message. ]
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-21-brijesh.singh@amd.com
The encryption attribute for the .bss..decrypted section is cleared in the
initial page table build. This is because the section contains the data
that need to be shared between the guest and the hypervisor.
When SEV-SNP is active, just clearing the encryption attribute in the
page table is not enough. The page state needs to be updated in the RMP
table.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-20-brijesh.singh@amd.com
early_set_memory_{encrypted,decrypted}() are used for changing the page
state from decrypted (shared) to encrypted (private) and vice versa.
When SEV-SNP is active, the page state transition needs to go through
additional steps.
If the page is transitioned from shared to private, then perform the
following after the encryption attribute is set in the page table:
1. Issue the page state change VMGEXIT to add the page as a private
in the RMP table.
2. Validate the page after its successfully added in the RMP table.
To maintain the security guarantees, if the page is transitioned from
private to shared, then perform the following before clearing the
encryption attribute from the page table.
1. Invalidate the page.
2. Issue the page state change VMGEXIT to make the page shared in the
RMP table.
early_set_memory_{encrypted,decrypted}() can be called before the GHCB
is setup so use the SNP page state MSR protocol VMGEXIT defined in the
GHCB specification to request the page state change in the RMP table.
While at it, add a helper snp_prep_memory() which will be used in
probe_roms(), in a later patch.
[ bp: Massage commit message. ]
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-19-brijesh.singh@amd.com
The SEV-SNP guest is required by the GHCB spec to register the GHCB's
Guest Physical Address (GPA). This is because the hypervisor may prefer
that a guest uses a consistent and/or specific GPA for the GHCB associated
with a vCPU. For more information, see the GHCB specification section
"GHCB GPA Registration".
[ bp: Cleanup comments. ]
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-18-brijesh.singh@amd.com
The SEV-SNP guest is required by the GHCB spec to register the GHCB's
Guest Physical Address (GPA). This is because the hypervisor may prefer
that a guest use a consistent and/or specific GPA for the GHCB associated
with a vCPU. For more information, see the GHCB specification section
"GHCB GPA Registration".
If hypervisor can not work with the guest provided GPA then terminate the
guest boot.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-17-brijesh.singh@amd.com
Version 2 of the GHCB specification added the advertisement of features
that are supported by the hypervisor. If the hypervisor supports SEV-SNP
then it must set the SEV-SNP features bit to indicate that the base
functionality is supported.
Check that feature bit while establishing the GHCB; if failed, terminate
the guest.
Version 2 of the GHCB specification adds several new Non-Automatic Exits
(NAEs), most of them are optional except the hypervisor feature. Now
that the hypervisor feature NAE is implemented, bump the GHCB maximum
supported protocol version.
While at it, move the GHCB protocol negotiation check from the #VC
exception handler to sev_enable() so that all feature detection happens
before the first #VC exception.
While at it, document why the GHCB page cannot be setup from
load_stage2_idt().
[ bp: Massage commit message. ]
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-13-brijesh.singh@amd.com
The SEV-ES guest calls sev_es_negotiate_protocol() to negotiate the GHCB
protocol version before establishing the GHCB. Cache the negotiated GHCB
version so that it can be used later.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-12-brijesh.singh@amd.com
The GHCB specification defines the reason code for reason set 0. The
reason codes defined in the set 0 do not cover all possible causes for a
guest to request termination.
The reason sets 1 to 255 are reserved for the vendor-specific codes.
Reserve the reason set 1 for the Linux guest. Define the error codes for
reason set 1 so that one can have meaningful termination reasons and thus
better guest failure diagnosis.
While at it, change sev_es_terminate() to accept a reason set parameter.
[ bp: Massage commit message. ]
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-11-brijesh.singh@amd.com
sme_enable() handles feature detection for both SEV and SME. Future
patches will also use it for SEV-SNP feature detection/setup, which
will need to be done immediately after the first #VC handler is set up.
Move it now in preparation.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-9-brijesh.singh@amd.com
When SRBDS is mitigated by TSX OFF, update_srbds_msr() will still read
and write to MSR_IA32_MCU_OPT_CTRL even when that MSR is not supported
due to not having loaded the appropriate microcode.
Check for X86_FEATURE_SRBDS_CTRL which is set only when the respective
microcode which adds MSR_IA32_MCU_OPT_CTRL is loaded.
Based on a patch by Thadeu Lima de Souza Cascardo <cascardo@canonical.com>.
[ bp: Massage commit message. ]
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220401074517.1848264-1-ricardo.canuelo@collabora.com
In mce_threshold_create_device(), if threshold_create_bank() fails, the
previously allocated threshold banks array @bp will be leaked because
the call to mce_threshold_remove_device() will not free it.
This happens because mce_threshold_remove_device() fetches the pointer
through the threshold_banks per-CPU variable but bp is written there
only after the bank creation is successful, and not before, when
threshold_create_bank() fails.
Add a helper which unwinds all the bank creation work previously done
and pass into it the previously allocated threshold banks array for
freeing.
[ bp: Massage. ]
Fixes: 6458de97fc ("x86/mce/amd: Straighten CPU hotplug path")
Co-developed-by: Alviro Iskandar Setiawan <alviro.iskandar@gnuweeb.org>
Signed-off-by: Alviro Iskandar Setiawan <alviro.iskandar@gnuweeb.org>
Co-developed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220329104705.65256-3-ammarfaizi2@gnuweeb.org
amd_cache_northbridges() is exported by amd_nb.c and is called by
amd64-agp.c and amd64_edac.c modules at module_init() time so that NB
descriptors are properly cached before those drivers can use them.
However, the init_amd_nbs() initcall already does call
amd_cache_northbridges() unconditionally and thus makes sure the NB
descriptors are enumerated.
That initcall is a fs_initcall type which is on the 5th group (starting
from 0) of initcalls that gets run in increasing numerical order by the
init code.
The module_init() call is turned into an __initcall() in the MODULE=n
case and those are device-level initcalls, i.e., group 6.
Therefore, the northbridges caching is already finished by the time
module initialization starts and thus the correct initialization order
is retained.
Unexport amd_cache_northbridges(), update dependent modules to
call amd_nb_num() instead. While at it, simplify the checks in
amd_cache_northbridges().
[ bp: Heavily massage and *actually* explain why the change is ok. ]
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220324122729.221765-1-nchatrad@amd.com
Paolo reported that the instruction sequence that is used to replace:
call __static_call_return0
namely:
66 66 48 31 c0 data16 data16 xor %rax,%rax
decodes to something else on i386, namely:
66 66 48 data16 dec %ax
31 c0 xor %eax,%eax
Which is a nonsensical sequence that happens to have the same outcome.
*However* an important distinction is that it consists of 2
instructions which is a problem when the thing needs to be overwriten
with a regular call instruction again.
As such, replace the instruction with something that decodes the same
on both i386 and x86_64.
Fixes: 3f2a8fc4b1 ("static_call/x86: Add __static_call_return0()")
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220318204419.GT8939@worktop.programming.kicks-ass.net
Add local variables to reduce repetition later. No functional change
intended.
Link: https://lore.kernel.org/r/20220304035110.988712-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It doesn't make any sense to disable non-executable mappings -
security-wise or else.
So rip out that switch and move the remaining code into setup.c and
delete setup_nx.c
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-6-bp@alien8.de
There should be no need to disable SMEP anymore.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-5-bp@alien8.de
Those were added as part of the SMAP enablement but SMAP is currently
an integral part of kernel proper and there's no need to disable it
anymore.
Rip out that functionality. Leave --uaccess default on for objtool as
this is what objtool should do by default anyway.
If still needed - clearcpuid=smap.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-4-bp@alien8.de
That chicken bit was added by
4f88651125 ("[PATCH] i386: allow disabling X86_FEATURE_SEP at boot")
but measuring int80 vsyscall performance on 32-bit doesn't matter
anymore.
If still needed, one can boot with
clearcpuid=sep
to disable that feature for testing.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-3-bp@alien8.de
Having to give the X86_FEATURE array indices in order to disable a
feature bit for testing is not really user-friendly. So accept the
feature bit names too.
Some feature bits don't have names so there the array indices are still
accepted, of course.
Clearing CPUID flags is not something which should be done in production
so taint the kernel too.
An exemplary cmdline would then be something like:
clearcpuid=de,440,smca,succory,bmi1,3dnow
("succory" is wrong on purpose). And it says:
[ ... ] Clearing CPUID bits: de 13:24 smca (unknown: succory) bmi1 3dnow
[ Fix CONFIG_X86_FEATURE_NAMES=n build error as reported by the 0day
robot: https://lore.kernel.org/r/202203292206.ICsY2RKX-lkp@intel.com ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-2-bp@alien8.de
- Make the prctl() for enabling dynamic XSTATE components correct so it
adds the newly requested feature to the permission bitmap instead of
overwriting it. Add a selftest which validates that.
- Unroll string MMIO for encrypted SEV guests as the hypervisor cannot
emulate it.
- Handle supervisor states correctly in the FPU/XSTATE code so it takes
the feature set of the fpstate buffer into account. The feature sets
can differ between host and guest buffers. Guest buffers do not contain
supervisor states. So far this was not an issue, but with enabling
PASID it needs to be handled in the buffer offset calculation and in
the permission bitmaps.
- Avoid a gazillion of repeated CPUID invocations in by caching the values
early in the FPU/XSTATE code.
- Enable CONFIG_WERROR for X86.
- Make the X86 defconfigs more useful by adapting them to Y2022 reality.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJJWwwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoT3mEACA9xkNjECn/MHN3B0X5wTPhVyw9+TJ
OdfpqL7C9pbAU1s2mwf3TyicrCOqx8nlnOYB/mXgfRGnbZqmUeGQFpZFM587dm/I
r/BtouAzSASjnaW7SijT3gnRTqMPVNTcLOTUEVjnTa7zatw+t4rH1uxE9dLqEq9B
cKMtsBOJyTTbj4ie3ngkUS2PQngNNHLJ4oQGZW4wCA5snLuwF1LlgcZJy8Zkrlpo
D58h/ZV6K2/tI7INWLINlqGnxaL2B/Ld4zXsFH+t05XGh+JOiq8ueLi5tdfEPG9f
/pzuGia0Cv6WBv+jOHLCBe2kfgvBx+Y8Goi0tqL0hwKCGjpZlQkhRccrjbVSAPhW
2SfxOD1pulTwI1J75csYXjTc/heJvAv/ZpZSz3wldM3fyiwnmgfWKlMYqG6Xb9+T
2OHwEUJHJQnon/f25+yb9dWI7HYMw2fEIqu3CgbRyOviObcB9MM1uKVErkCYAUWY
W7Q8ShjNPrUguCPbw4YFPIwaazuhRbR8t2kRvfBOyTYwh3jo6U3eRL72Cov84uik
hnFtUdiusWtvV59ngZelREmd3iVKif2hxx7EoGDY/VV2Ru4C2X/xgJemKJeKSR/f
gm6pp8wbPSC4TBJOfP6IwYtoZKyu03miIeupPPUDxx0hLbx5j2e6EgVM5NVAeJFF
fu4MEkGvStZc+w==
=GK27
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of x86 fixes and updates:
- Make the prctl() for enabling dynamic XSTATE components correct so
it adds the newly requested feature to the permission bitmap
instead of overwriting it. Add a selftest which validates that.
- Unroll string MMIO for encrypted SEV guests as the hypervisor
cannot emulate it.
- Handle supervisor states correctly in the FPU/XSTATE code so it
takes the feature set of the fpstate buffer into account. The
feature sets can differ between host and guest buffers. Guest
buffers do not contain supervisor states. So far this was not an
issue, but with enabling PASID it needs to be handled in the buffer
offset calculation and in the permission bitmaps.
- Avoid a gazillion of repeated CPUID invocations in by caching the
values early in the FPU/XSTATE code.
- Enable CONFIG_WERROR in x86 defconfig.
- Make the X86 defconfigs more useful by adapting them to Y2022
reality"
* tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Consolidate size calculations
x86/fpu/xstate: Handle supervisor states in XSTATE permissions
x86/fpu/xsave: Handle compacted offsets correctly with supervisor states
x86/fpu: Cache xfeature flags from CPUID
x86/fpu/xsave: Initialize offset/size cache early
x86/fpu: Remove unused supervisor only offsets
x86/fpu: Remove redundant XCOMP_BV initialization
x86/sev: Unroll string mmio with CC_ATTR_GUEST_UNROLL_STRING_IO
x86/config: Make the x86 defconfigs a bit more usable
x86/defconfig: Enable WERROR
selftests/x86/amx: Update the ARCH_REQ_XCOMP_PERM test
x86/fpu/xstate: Fix the ARCH_REQ_XCOMP_PERM implementation
* Documentation improvements
* Prevent module exit until all VMs are freed
* PMU Virtualization fixes
* Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
* Other miscellaneous bugfixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJIGV8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroO5FQgAhls4+Nu+NqId/yvvyNxr3vXq0dHI
hLlHtvzgGzZisZ7y2bNeyIpJVBDT5LCbrptPD/5eTvchVswDh0+kCVC0Uni5ugGT
tLT/Pv9Oq9e0X7aGdHRyuHIivIFDC20zIZO2DV48Lrj/+r6DafB2Fghq2XQLlBxN
p8KislvuqAAos543BPC1+Lk3dhOLuZ8qcFD8wGRlcCwjNwYaitrQ16rO04cLfUur
OwIks1I6TdI2JpLBhm6oWYVG/YnRsoo4bQE8cjdQ6yNSbwWtRpV33q7X6onw8x8K
BEeESoTnMqfaxIF/6mPl6bnDblVHFp6Xhld/vJcgeWQTdajFtuFE/K4sCA==
=xnQ6
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
- Documentation improvements
- Prevent module exit until all VMs are freed
- PMU Virtualization fixes
- Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
- Other miscellaneous bugfixes
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
KVM: x86: fix sending PV IPI
KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
KVM: x86: Remove redundant vm_entry_controls_clearbit() call
KVM: x86: cleanup enter_rmode()
KVM: x86: SVM: fix tsc scaling when the host doesn't support it
kvm: x86: SVM: remove unused defines
KVM: x86: SVM: move tsc ratio definitions to svm.h
KVM: x86: SVM: fix avic spec based definitions again
KVM: MIPS: remove reference to trap&emulate virtualization
KVM: x86: document limitations of MSR filtering
KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
KVM: x86: Trace all APICv inhibit changes and capture overall status
KVM: x86: Add wrappers for setting/clearing APICv inhibits
KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
KVM: X86: Handle implicit supervisor access with SMAP
KVM: X86: Rename variable smap to not_smap in permission_fault()
...
If guest kernel is configured with nopvspin, or CONFIG_PARAVIRT_SPINLOCK
is disabled, or guest find its has dedicated pCPUs from realtime hint
feature, the pvspinlock will be disabled, and vCPU preemption check
is disabled too.
Hoever, KVM still can emulating HLT for vCPU for both cases. Checking if a vCPU
is preempted or not can still boost performance in IPI-heavy scenarios such as
unixbench file copy and pipe-based context switching tests: Here the vCPU is
running with a dedicated pCPU, so the guest kernel has nopvspin but is
emulating HLT for the vCPU:
Testcase Base with patch
System Benchmarks Index Values INDEX INDEX
Dhrystone 2 using register variables 3278.4 3277.7
Double-Precision Whetstone 822.8 825.8
Execl Throughput 1296.5 941.1
File Copy 1024 bufsize 2000 maxblocks 2124.2 2142.7
File Copy 256 bufsize 500 maxblocks 1335.9 1353.6
File Copy 4096 bufsize 8000 maxblocks 4256.3 4760.3
Pipe Throughput 1050.1 1054.0
Pipe-based Context Switching 243.3 352.0
Process Creation 820.1 814.4
Shell Scripts (1 concurrent) 2169.0 2086.0
Shell Scripts (8 concurrent) 7710.3 7576.3
System Call Overhead 672.4 673.9
======== =======
System Benchmarks Index Score 1467.2 1483.0
Move the setting of pv_ops.lock.vcpu_is_preempted to kvm_guest_init, so
that it does not depend on pvspinlock.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1646815610-43315-1-git-send-email-lirongqing@baidu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If apic_id is less than min, and (max - apic_id) is greater than
KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
the new apic_id does not fit the bitmask. In this case __send_ipi_mask
should send the IPI.
This is mostly theoretical, but it can happen if the apic_ids on three
iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.
Fixes: aaffcfd1e8 ("KVM: X86: Implement PV IPIs in linux guest")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Features:
- kprobes: rethook: x86: replace kretprobe trampoline with rethook
Current release - regressions:
- sfc: avoid null-deref on systems without NUMA awareness
in the new queue sizing code
Current release - new code bugs:
- vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices
- eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl
when interface is down
Previous releases - always broken:
- openvswitch: correct neighbor discovery target mask field
in the flow dump
- wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak
- rxrpc: fix call timer start racing with call destruction
- rxrpc: fix null-deref when security type is rxrpc_no_security
- can: fix UAF bugs around echo skbs in multiple drivers
Misc:
- docs: move netdev-FAQ to the "process" section of the documentation
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmJF3S0ACgkQMUZtbf5S
IruIvA/+NZx+c+fBBrbjOh63avRL7kYIqIDREf+v6lh4ZXmbrp22xalcjIdxgWeK
vAiYfYmzZblWAGkilcvPG3blCBc+9b+YE+pPJXFe60Huv3eYpjKfgTKwQOg/lIeM
8MfPP7eBwcJ/ltSTRtySRl9LYgyVcouP9rAVJavFVYrvuDYunwhfChswVfGCYon8
42O4nRwrtkTE1MjHD8HS3YxvwGlo+iIyhsxgG/gWx8F2xeIG22H6adzjDXcCQph8
air/awrJ4enYkVMRokGNfNppK9Z3vjJDX5xha3CREpvXNPe0F24cAE/L8XqyH7+r
/bXP5y9VC9mmEO7x4Le3VmDhOJGbCOtR89gTlevftDRdSIrbNHffZhbPW48tR7o8
NJFlhiSJb4HEMN0q7BmxnWaKlbZUlvLEXLuU5ytZE/G7i+nETULlunfZrCD4eNYH
gBGYhiob2I/XotJA9QzG/RDyaFwDaC/VARsyv37PSeBAl/yrEGAeP7DsKkKX/ayg
LM9ItveqHXK30J0xr3QJA8s49EkIYejjYR3l0hQ9esf9QvGK99dE/fo44Apf3C3A
Lz6XpnRc5Xd7tZ9Aopwb3FqOH6WR9Hq9Qlbk0qifsL/2sRbatpuZbbDK6L3CR3Ir
WFNcOoNbbqv85kCKFXFjj0jdpoNa9Yej8XFkMkVSkM3sHImYmYQ=
=5Bvy
-----END PGP SIGNATURE-----
Merge tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull more networking updates from Jakub Kicinski:
"Networking fixes and rethook patches.
Features:
- kprobes: rethook: x86: replace kretprobe trampoline with rethook
Current release - regressions:
- sfc: avoid null-deref on systems without NUMA awareness in the new
queue sizing code
Current release - new code bugs:
- vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices
- eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl when
interface is down
Previous releases - always broken:
- openvswitch: correct neighbor discovery target mask field in the
flow dump
- wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak
- rxrpc: fix call timer start racing with call destruction
- rxrpc: fix null-deref when security type is rxrpc_no_security
- can: fix UAF bugs around echo skbs in multiple drivers
Misc:
- docs: move netdev-FAQ to the 'process' section of the
documentation"
* tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
vxlan: do not feed vxlan_vnifilter_dump_dev with non vxlan devices
openvswitch: Add recirc_id to recirc warning
rxrpc: fix some null-ptr-deref bugs in server_key.c
rxrpc: Fix call timer start racing with call destruction
net: hns3: fix software vlan talbe of vlan 0 inconsistent with hardware
net: hns3: fix the concurrency between functions reading debugfs
docs: netdev: move the netdev-FAQ to the process pages
docs: netdev: broaden the new vs old code formatting guidelines
docs: netdev: call out the merge window in tag checking
docs: netdev: add missing back ticks
docs: netdev: make the testing requirement more stringent
docs: netdev: add a question about re-posting frequency
docs: netdev: rephrase the 'should I update patchwork' question
docs: netdev: rephrase the 'Under review' question
docs: netdev: shorten the name and mention msgid for patch status
docs: netdev: note that RFC postings are allowed any time
docs: netdev: turn the net-next closed into a Warning
docs: netdev: move the patch marking section up
docs: netdev: minor reword
docs: netdev: replace references to old archives
...
Use the offset calculation to do the size calculation which avoids yet
another series of CPUID instructions for each invocation.
[ Fix the FP/SSE only case which missed to take the xstate
header into account, as
Reported-by: kernel test robot <oliver.sang@intel.com> ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/87o81pgbp2.ffs@tglx