linux/kernel
Nhat Pham 8cba9576df hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory.  This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers.  The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc.  fair.

What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max.  Similar procedure is done to other memory limits
(memory.low for e.g).  However, this is rather cumbersome and buggy. 
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller).  Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.

Some caveats to consider:
  * This feature is only available on cgroup v2.
  * There is no hugetlb pool management involved in the memory
    controller. As stated above, hugetlb folios are only charged towards
    the memory controller when it is used. Host overcommit management
    has to consider it when configuring hard limits.
  * Failure to charge towards the memcg results in SIGBUS. This could
    happen even if the hugetlb pool still has pages (but the cgroup
    limit is hit and reclaim attempt fails).
  * When this feature is enabled, hugetlb pages contribute to memory
    reclaim protection. low, min limits tuning must take into account
    hugetlb memory.
  * Hugetlb pages utilized while this option is not selected will not
    be tracked by the memory controller (even if cgroup v2 is remounted
    later on).

Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:34:17 -07:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 2023-09-16 11:16:00 +01:00
cgroup hugetlb: memcg: account hugetlb-backed memory in memory controller 2023-10-18 14:34:17 -07:00
configs Kbuild updates for v6.6 2023-09-05 11:01:47 -07:00
debug printk changes for 6.6 2023-09-04 13:20:19 -07:00
dma swiotlb: fix the check whether a device has used software IO TLB 2023-09-27 11:19:15 +02:00
entry entry: Remove empty addr_limit_user_check() 2023-08-23 10:32:39 +02:00
events mm/gup: adapt get_user_page_vma_remote() to never return NULL 2023-10-18 14:34:15 -07:00
futex mm/mm_init.c: remove obsolete macro HASH_SMALL 2023-08-18 10:12:07 -07:00
gcov gcov: shut up missing prototype warnings for internal stubs 2023-08-18 10:18:58 -07:00
irq Boring updates for the interrupt subsystem: 2023-08-28 14:33:11 -07:00
kcsan mm: delete checks for xor_unlock_is_negative_byte() 2023-10-18 14:34:17 -07:00
livepatch livepatch: Make 'klp_stack_entries' static 2023-06-05 13:56:52 +02:00
locking - An extensive rework of kexec and crash Kconfig from Eric DeVolder 2023-08-29 14:53:51 -07:00
module module/decompress: use vmalloc() for zstd decompression workspace 2023-08-29 09:39:08 -07:00
power PM: hibernate: Fix the exclusive get block device in test_resume mode 2023-09-12 11:45:15 +02:00
printk Revert "printk: export symbols for debug modules" 2023-09-07 14:19:42 +02:00
rcu rcu: dynamically allocate the rcu-kfree shrinker 2023-10-04 10:32:24 -07:00
sched sched/numa, mm: make numa migrate functions to take a folio 2023-10-16 15:44:38 -07:00
time Fix false positive "softirq work is pending" messages on -rt 2023-09-02 09:01:48 -07:00
trace tracing/user_events: Align set_bit() address for all archs 2023-09-30 16:25:41 -04:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.kexec crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
Kconfig.locks
Kconfig.preempt
Makefile v6.5-rc1-modules-next 2023-06-28 15:51:08 -07:00
acct.c audit/stable-6.6 PR 20230829 2023-08-30 08:17:35 -07:00
async.c
audit.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
audit.h audit: correct audit_filter_inodes() definition 2023-07-21 12:17:25 -04:00
audit_fsnotify.c audit: fix potential double free on error path from fsnotify_add_inode_mark 2022-08-22 18:50:06 -04:00
audit_tree.c
audit_watch.c audit_init_parent(): constify path 2022-09-01 17:39:30 -04:00
auditfilter.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
auditsc.c Including fixes from netfilter and bpf. 2023-09-07 18:33:07 -07:00
backtracetest.c
bounds.c mm: multi-gen LRU: minimal implementation 2022-09-26 19:46:09 -07:00
capability.c lsm: constify the 'target' parameter in security_capget() 2023-08-08 16:48:47 -04:00
cfi.c cfi: Switch to -fsanitize=kcfi 2022-09-26 10:13:13 -07:00
compat.c sched_getaffinity: don't assume 'cpumask_size()' is fully initialized 2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c locking/atomic: treewide: use raw_atomic*_<op>() 2023-06-05 09:57:20 +02:00
cpu.c cpu/hotplug: Prevent self deadlock on CPU hot-unplug 2023-08-30 12:24:22 +02:00
cpu_pm.c cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}() 2023-01-13 11:48:15 +01:00
crash_core.c Crash: add lock to serialize crash hotplug handling 2023-09-29 17:20:48 -07:00
crash_dump.c
cred.c cred: convert printks to pr_<level> 2023-08-18 10:18:49 -07:00
delayacct.c delayacct: track delays from IRQ/SOFTIRQ 2023-04-18 16:39:34 -07:00
dma.c
exec_domain.c
exit.c mm: remove remnants of SPLIT_RSS_COUNTING 2023-10-04 10:32:20 -07:00
extable.c
fail_function.c kernel/fail_function: fix memory leak with using debugfs_lookup() 2023-02-08 13:36:22 +01:00
fork.c mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl 2023-10-06 14:44:11 -07:00
freezer.c freezer,sched: Rewrite core freezer logic 2022-09-07 21:53:50 +02:00
gen_kheaders.sh Revert "kheaders: substituting --sort in archive creation" 2023-05-28 16:20:21 +09:00
groups.c
hung_task.c kernel/hung_task.c: set some hung_task.c variables storage-class-specifier to static 2023-04-08 13:45:37 -07:00
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c trace: Add trace_ipi_send_cpu() 2023-03-24 11:01:29 +01:00
jump_label.c jump_label: Prevent key->enabled int overflow 2022-12-01 15:53:05 -08:00
kallsyms.c kallsyms: Change func signature for cleanup_symbol_name() 2023-08-25 15:00:36 -07:00
kallsyms_internal.h kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[] 2022-11-12 18:47:36 -08:00
kallsyms_selftest.c Modules changes for v6.6-rc1 2023-08-29 17:32:32 -07:00
kallsyms_selftest.h kallsyms: Add self-test facility 2022-11-15 00:42:02 -08:00
kcmp.c
kcov.c kcov: add prototypes for helper functions 2023-06-09 17:44:17 -07:00
kexec.c crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
kexec_core.c crash: add generic infrastructure for crash hotplug support 2023-08-24 16:25:13 -07:00
kexec_elf.c
kexec_file.c integrity-v6.6 2023-08-30 09:16:56 -07:00
kexec_internal.h panic, kexec: make __crash_kexec() NMI safe 2022-09-11 21:55:06 -07:00
kheaders.c kheaders: Use array declaration instead of char 2023-03-24 20:10:59 -07:00
kprobes.c kernel: kprobes: Use struct_size() 2023-08-23 09:38:17 +09:00
ksyms_common.c kallsyms: make kallsyms_show_value() as generic function 2023-06-08 12:27:20 -07:00
ksysfs.c crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
kthread.c mm: remove remnants of SPLIT_RSS_COUNTING 2023-10-04 10:32:20 -07:00
latencytop.c latencytop: use the last element of latency_record of system 2022-09-11 21:55:12 -07:00
module_signature.c
notifier.c notifiers: add tracepoints to the notifiers infrastructure 2023-04-08 13:45:38 -07:00
nsproxy.c nsproxy: Convert nsproxy.count to refcount_t 2023-08-21 11:29:12 -07:00
padata.c padata: use alignment when calculating the number of worker threads 2023-03-14 17:06:44 +08:00
panic.c panic: Reenable preemption in WARN slowpath 2023-09-15 11:28:08 +02:00
params.c kernel: params: Remove unnecessary ‘0’ values from err 2023-07-10 12:47:01 -07:00
pid.c pidfd: prevent a kernel-doc warning 2023-09-19 13:21:33 -07:00
pid_namespace.c memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid_sysctl.h memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
profile.c kernel/profile.c: simplify duplicated code in profile_setup() 2022-09-11 21:55:12 -07:00
ptrace.c mm: make __access_remote_vm() static 2023-10-18 14:34:15 -07:00
range.c
reboot.c kernel/reboot: Add SYS_OFF_MODE_RESTART_PREPARE mode 2022-10-04 15:59:36 +02:00
regset.c
relay.c kernel: relay: remove unnecessary NULL values from relay_open_buf 2023-08-18 10:18:55 -07:00
resource.c dax/kmem: Fix leak of memory-hotplug resources 2023-02-17 14:58:01 -08:00
resource_kunit.c
rseq.c rseq: Extend struct rseq with per-memory-map concurrency ID 2022-12-27 12:52:12 +01:00
scftorture.c scftorture: Pause testing after memory-allocation failure 2023-07-14 15:02:57 -07:00
scs.c scs: add support for dynamic shadow call stacks 2022-11-09 18:06:35 +00:00
seccomp.c seccomp: Add missing kerndoc notations 2023-08-17 12:32:15 -07:00
signal.c signal: print comm and exe name on fatal signals 2023-08-18 10:18:50 -07:00
smp.c smp: Reduce NMI traffic from CSD waiters to CSD destination 2023-07-10 14:19:04 -07:00
smpboot.c cpu/hotplug: Remove unused state functions 2023-05-15 13:45:00 +02:00
smpboot.h
softirq.c sched/core: introduce sched_core_idle_cpu() 2023-07-13 15:21:50 +02:00
stackleak.c stackleak: allow to specify arch specific stackleak poison function 2023-04-20 11:36:35 +02:00
stacktrace.c
static_call.c
static_call_inline.c static_call: Add call depth tracking support 2022-10-17 16:41:16 +02:00
stop_machine.c
sys.c mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl 2023-10-06 14:44:11 -07:00
sys_ni.c x86/shstk: Introduce map_shadow_stack syscall 2023-08-02 15:01:51 -07:00
sysctl-test.c kernel/sysctl-test: use SYSCTL_{ZERO/ONE_HUNDRED} instead of i_{zero/one_hundred} 2022-09-08 16:56:45 -07:00
sysctl.c v6.5-rc1-sysctl-next 2023-06-28 16:05:21 -07:00
task_work.c task_work: add kerneldoc annotation for 'data' argument 2023-09-19 13:21:32 -07:00
taskstats.c genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
torture.c torture: Stop right-shifting torture_random() return values 2023-08-14 15:01:08 -07:00
tracepoint.c tracepoint: Allow livepatch module add trace event 2023-02-18 14:34:36 -05:00
tsacct.c
ucount.c sysctl: Add size to register_sysctl 2023-08-15 15:26:17 -07:00
uid16.c
uid16.h
umh.c sysctl: fix unused proc_cap_handler() function warning 2023-06-29 15:19:43 -07:00
up.c
user-return-notifier.c
user.c kernel/user: Allow user_struct::locked_vm to be usable for iommufd 2022-11-30 20:16:49 -04:00
user_namespace.c userns: fix a struct's kernel-doc notation 2023-02-02 22:50:04 -08:00
usermode_driver.c
utsname.c
utsname_sysctl.c utsname: simplify one-level sysctl registration for uts_kern_table 2023-04-13 11:49:35 -07:00
vhost_task.c vhost: Fix worker hangs due to missed wake up calls 2023-06-08 15:43:09 -04:00
watch_queue.c watch_queue: prevent dangling pipe pointer 2023-06-06 10:47:04 +02:00
watchdog.c watchdog/hardlockup: avoid large stack frames in watchdog_hardlockup_check() 2023-08-18 10:19:00 -07:00
watchdog_buddy.c watchdog/hardlockup: move SMP barriers from common code to buddy code 2023-06-19 16:25:28 -07:00
watchdog_perf.c watchdog/perf: add a weak function for an arch to detect if perf can use NMIs 2023-06-09 17:44:21 -07:00
workqueue.c workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init() 2023-09-18 08:50:31 -10:00
workqueue_internal.h workqueue: Drop the special locking rule for worker->flags and worker_pool->flags 2023-08-07 15:57:22 -10:00