mirror of
https://gitee.com/bianbu-linux/linux-6.6
synced 2025-06-29 23:43:21 -04:00
In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them. On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS). If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system. Tracking the information in userspace leads to all sorts of problems. It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking. This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name) Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'. Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc. The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage. CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas. The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct. One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process. This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount. [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/ Changes for prctl(2) manual page (in the options section): PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. Currently, arg2 must be one of: PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'. This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled. [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author] Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com Signed-off-by: Colin Cross <ccross@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Landley <rob@landley.net> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Shaohua Li <shli@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
278 lines
9.4 KiB
C
278 lines
9.4 KiB
C
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
|
|
#ifndef _LINUX_PRCTL_H
|
|
#define _LINUX_PRCTL_H
|
|
|
|
#include <linux/types.h>
|
|
|
|
/* Values to pass as first argument to prctl() */
|
|
|
|
#define PR_SET_PDEATHSIG 1 /* Second arg is a signal */
|
|
#define PR_GET_PDEATHSIG 2 /* Second arg is a ptr to return the signal */
|
|
|
|
/* Get/set current->mm->dumpable */
|
|
#define PR_GET_DUMPABLE 3
|
|
#define PR_SET_DUMPABLE 4
|
|
|
|
/* Get/set unaligned access control bits (if meaningful) */
|
|
#define PR_GET_UNALIGN 5
|
|
#define PR_SET_UNALIGN 6
|
|
# define PR_UNALIGN_NOPRINT 1 /* silently fix up unaligned user accesses */
|
|
# define PR_UNALIGN_SIGBUS 2 /* generate SIGBUS on unaligned user access */
|
|
|
|
/* Get/set whether or not to drop capabilities on setuid() away from
|
|
* uid 0 (as per security/commoncap.c) */
|
|
#define PR_GET_KEEPCAPS 7
|
|
#define PR_SET_KEEPCAPS 8
|
|
|
|
/* Get/set floating-point emulation control bits (if meaningful) */
|
|
#define PR_GET_FPEMU 9
|
|
#define PR_SET_FPEMU 10
|
|
# define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */
|
|
# define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */
|
|
|
|
/* Get/set floating-point exception mode (if meaningful) */
|
|
#define PR_GET_FPEXC 11
|
|
#define PR_SET_FPEXC 12
|
|
# define PR_FP_EXC_SW_ENABLE 0x80 /* Use FPEXC for FP exception enables */
|
|
# define PR_FP_EXC_DIV 0x010000 /* floating point divide by zero */
|
|
# define PR_FP_EXC_OVF 0x020000 /* floating point overflow */
|
|
# define PR_FP_EXC_UND 0x040000 /* floating point underflow */
|
|
# define PR_FP_EXC_RES 0x080000 /* floating point inexact result */
|
|
# define PR_FP_EXC_INV 0x100000 /* floating point invalid operation */
|
|
# define PR_FP_EXC_DISABLED 0 /* FP exceptions disabled */
|
|
# define PR_FP_EXC_NONRECOV 1 /* async non-recoverable exc. mode */
|
|
# define PR_FP_EXC_ASYNC 2 /* async recoverable exception mode */
|
|
# define PR_FP_EXC_PRECISE 3 /* precise exception mode */
|
|
|
|
/* Get/set whether we use statistical process timing or accurate timestamp
|
|
* based process timing */
|
|
#define PR_GET_TIMING 13
|
|
#define PR_SET_TIMING 14
|
|
# define PR_TIMING_STATISTICAL 0 /* Normal, traditional,
|
|
statistical process timing */
|
|
# define PR_TIMING_TIMESTAMP 1 /* Accurate timestamp based
|
|
process timing */
|
|
|
|
#define PR_SET_NAME 15 /* Set process name */
|
|
#define PR_GET_NAME 16 /* Get process name */
|
|
|
|
/* Get/set process endian */
|
|
#define PR_GET_ENDIAN 19
|
|
#define PR_SET_ENDIAN 20
|
|
# define PR_ENDIAN_BIG 0
|
|
# define PR_ENDIAN_LITTLE 1 /* True little endian mode */
|
|
# define PR_ENDIAN_PPC_LITTLE 2 /* "PowerPC" pseudo little endian */
|
|
|
|
/* Get/set process seccomp mode */
|
|
#define PR_GET_SECCOMP 21
|
|
#define PR_SET_SECCOMP 22
|
|
|
|
/* Get/set the capability bounding set (as per security/commoncap.c) */
|
|
#define PR_CAPBSET_READ 23
|
|
#define PR_CAPBSET_DROP 24
|
|
|
|
/* Get/set the process' ability to use the timestamp counter instruction */
|
|
#define PR_GET_TSC 25
|
|
#define PR_SET_TSC 26
|
|
# define PR_TSC_ENABLE 1 /* allow the use of the timestamp counter */
|
|
# define PR_TSC_SIGSEGV 2 /* throw a SIGSEGV instead of reading the TSC */
|
|
|
|
/* Get/set securebits (as per security/commoncap.c) */
|
|
#define PR_GET_SECUREBITS 27
|
|
#define PR_SET_SECUREBITS 28
|
|
|
|
/*
|
|
* Get/set the timerslack as used by poll/select/nanosleep
|
|
* A value of 0 means "use default"
|
|
*/
|
|
#define PR_SET_TIMERSLACK 29
|
|
#define PR_GET_TIMERSLACK 30
|
|
|
|
#define PR_TASK_PERF_EVENTS_DISABLE 31
|
|
#define PR_TASK_PERF_EVENTS_ENABLE 32
|
|
|
|
/*
|
|
* Set early/late kill mode for hwpoison memory corruption.
|
|
* This influences when the process gets killed on a memory corruption.
|
|
*/
|
|
#define PR_MCE_KILL 33
|
|
# define PR_MCE_KILL_CLEAR 0
|
|
# define PR_MCE_KILL_SET 1
|
|
|
|
# define PR_MCE_KILL_LATE 0
|
|
# define PR_MCE_KILL_EARLY 1
|
|
# define PR_MCE_KILL_DEFAULT 2
|
|
|
|
#define PR_MCE_KILL_GET 34
|
|
|
|
/*
|
|
* Tune up process memory map specifics.
|
|
*/
|
|
#define PR_SET_MM 35
|
|
# define PR_SET_MM_START_CODE 1
|
|
# define PR_SET_MM_END_CODE 2
|
|
# define PR_SET_MM_START_DATA 3
|
|
# define PR_SET_MM_END_DATA 4
|
|
# define PR_SET_MM_START_STACK 5
|
|
# define PR_SET_MM_START_BRK 6
|
|
# define PR_SET_MM_BRK 7
|
|
# define PR_SET_MM_ARG_START 8
|
|
# define PR_SET_MM_ARG_END 9
|
|
# define PR_SET_MM_ENV_START 10
|
|
# define PR_SET_MM_ENV_END 11
|
|
# define PR_SET_MM_AUXV 12
|
|
# define PR_SET_MM_EXE_FILE 13
|
|
# define PR_SET_MM_MAP 14
|
|
# define PR_SET_MM_MAP_SIZE 15
|
|
|
|
/*
|
|
* This structure provides new memory descriptor
|
|
* map which mostly modifies /proc/pid/stat[m]
|
|
* output for a task. This mostly done in a
|
|
* sake of checkpoint/restore functionality.
|
|
*/
|
|
struct prctl_mm_map {
|
|
__u64 start_code; /* code section bounds */
|
|
__u64 end_code;
|
|
__u64 start_data; /* data section bounds */
|
|
__u64 end_data;
|
|
__u64 start_brk; /* heap for brk() syscall */
|
|
__u64 brk;
|
|
__u64 start_stack; /* stack starts at */
|
|
__u64 arg_start; /* command line arguments bounds */
|
|
__u64 arg_end;
|
|
__u64 env_start; /* environment variables bounds */
|
|
__u64 env_end;
|
|
__u64 *auxv; /* auxiliary vector */
|
|
__u32 auxv_size; /* vector size */
|
|
__u32 exe_fd; /* /proc/$pid/exe link file */
|
|
};
|
|
|
|
/*
|
|
* Set specific pid that is allowed to ptrace the current task.
|
|
* A value of 0 mean "no process".
|
|
*/
|
|
#define PR_SET_PTRACER 0x59616d61
|
|
# define PR_SET_PTRACER_ANY ((unsigned long)-1)
|
|
|
|
#define PR_SET_CHILD_SUBREAPER 36
|
|
#define PR_GET_CHILD_SUBREAPER 37
|
|
|
|
/*
|
|
* If no_new_privs is set, then operations that grant new privileges (i.e.
|
|
* execve) will either fail or not grant them. This affects suid/sgid,
|
|
* file capabilities, and LSMs.
|
|
*
|
|
* Operations that merely manipulate or drop existing privileges (setresuid,
|
|
* capset, etc.) will still work. Drop those privileges if you want them gone.
|
|
*
|
|
* Changing LSM security domain is considered a new privilege. So, for example,
|
|
* asking selinux for a specific new context (e.g. with runcon) will result
|
|
* in execve returning -EPERM.
|
|
*
|
|
* See Documentation/userspace-api/no_new_privs.rst for more details.
|
|
*/
|
|
#define PR_SET_NO_NEW_PRIVS 38
|
|
#define PR_GET_NO_NEW_PRIVS 39
|
|
|
|
#define PR_GET_TID_ADDRESS 40
|
|
|
|
#define PR_SET_THP_DISABLE 41
|
|
#define PR_GET_THP_DISABLE 42
|
|
|
|
/*
|
|
* No longer implemented, but left here to ensure the numbers stay reserved:
|
|
*/
|
|
#define PR_MPX_ENABLE_MANAGEMENT 43
|
|
#define PR_MPX_DISABLE_MANAGEMENT 44
|
|
|
|
#define PR_SET_FP_MODE 45
|
|
#define PR_GET_FP_MODE 46
|
|
# define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
|
|
# define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */
|
|
|
|
/* Control the ambient capability set */
|
|
#define PR_CAP_AMBIENT 47
|
|
# define PR_CAP_AMBIENT_IS_SET 1
|
|
# define PR_CAP_AMBIENT_RAISE 2
|
|
# define PR_CAP_AMBIENT_LOWER 3
|
|
# define PR_CAP_AMBIENT_CLEAR_ALL 4
|
|
|
|
/* arm64 Scalable Vector Extension controls */
|
|
/* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
|
|
#define PR_SVE_SET_VL 50 /* set task vector length */
|
|
# define PR_SVE_SET_VL_ONEXEC (1 << 18) /* defer effect until exec */
|
|
#define PR_SVE_GET_VL 51 /* get task vector length */
|
|
/* Bits common to PR_SVE_SET_VL and PR_SVE_GET_VL */
|
|
# define PR_SVE_VL_LEN_MASK 0xffff
|
|
# define PR_SVE_VL_INHERIT (1 << 17) /* inherit across exec */
|
|
|
|
/* Per task speculation control */
|
|
#define PR_GET_SPECULATION_CTRL 52
|
|
#define PR_SET_SPECULATION_CTRL 53
|
|
/* Speculation control variants */
|
|
# define PR_SPEC_STORE_BYPASS 0
|
|
# define PR_SPEC_INDIRECT_BRANCH 1
|
|
# define PR_SPEC_L1D_FLUSH 2
|
|
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
|
|
# define PR_SPEC_NOT_AFFECTED 0
|
|
# define PR_SPEC_PRCTL (1UL << 0)
|
|
# define PR_SPEC_ENABLE (1UL << 1)
|
|
# define PR_SPEC_DISABLE (1UL << 2)
|
|
# define PR_SPEC_FORCE_DISABLE (1UL << 3)
|
|
# define PR_SPEC_DISABLE_NOEXEC (1UL << 4)
|
|
|
|
/* Reset arm64 pointer authentication keys */
|
|
#define PR_PAC_RESET_KEYS 54
|
|
# define PR_PAC_APIAKEY (1UL << 0)
|
|
# define PR_PAC_APIBKEY (1UL << 1)
|
|
# define PR_PAC_APDAKEY (1UL << 2)
|
|
# define PR_PAC_APDBKEY (1UL << 3)
|
|
# define PR_PAC_APGAKEY (1UL << 4)
|
|
|
|
/* Tagged user address controls for arm64 */
|
|
#define PR_SET_TAGGED_ADDR_CTRL 55
|
|
#define PR_GET_TAGGED_ADDR_CTRL 56
|
|
# define PR_TAGGED_ADDR_ENABLE (1UL << 0)
|
|
/* MTE tag check fault modes */
|
|
# define PR_MTE_TCF_NONE 0UL
|
|
# define PR_MTE_TCF_SYNC (1UL << 1)
|
|
# define PR_MTE_TCF_ASYNC (1UL << 2)
|
|
# define PR_MTE_TCF_MASK (PR_MTE_TCF_SYNC | PR_MTE_TCF_ASYNC)
|
|
/* MTE tag inclusion mask */
|
|
# define PR_MTE_TAG_SHIFT 3
|
|
# define PR_MTE_TAG_MASK (0xffffUL << PR_MTE_TAG_SHIFT)
|
|
/* Unused; kept only for source compatibility */
|
|
# define PR_MTE_TCF_SHIFT 1
|
|
|
|
/* Control reclaim behavior when allocating memory */
|
|
#define PR_SET_IO_FLUSHER 57
|
|
#define PR_GET_IO_FLUSHER 58
|
|
|
|
/* Dispatch syscalls to a userspace handler */
|
|
#define PR_SET_SYSCALL_USER_DISPATCH 59
|
|
# define PR_SYS_DISPATCH_OFF 0
|
|
# define PR_SYS_DISPATCH_ON 1
|
|
/* The control values for the user space selector when dispatch is enabled */
|
|
# define SYSCALL_DISPATCH_FILTER_ALLOW 0
|
|
# define SYSCALL_DISPATCH_FILTER_BLOCK 1
|
|
|
|
/* Set/get enabled arm64 pointer authentication keys */
|
|
#define PR_PAC_SET_ENABLED_KEYS 60
|
|
#define PR_PAC_GET_ENABLED_KEYS 61
|
|
|
|
/* Request the scheduler to share a core */
|
|
#define PR_SCHED_CORE 62
|
|
# define PR_SCHED_CORE_GET 0
|
|
# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */
|
|
# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */
|
|
# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
|
|
# define PR_SCHED_CORE_MAX 4
|
|
# define PR_SCHED_CORE_SCOPE_THREAD 0
|
|
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
|
|
# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
|
|
|
|
#define PR_SET_VMA 0x53564d41
|
|
# define PR_SET_VMA_ANON_NAME 0
|
|
|
|
#endif /* _LINUX_PRCTL_H */
|