A panic from a watchdog hard lockup
Coming in to work this Monday morning, I was told that four physical machines had crashed last week and needed to be diagnosed and fixed.
Symptoms:
```
 #0 [ffff88103fbc59f0] machine_kexec at ffffffff81059beb
 #1 [ffff88103fbc5a50] __crash_kexec at ffffffff81105822
 #2 [ffff88103fbc5b20] panic at ffffffff81680541
 #3 [ffff88103fbc5ba0] nmi_panic at ffffffff81085abf
 #4 [ffff88103fbc5bb0] watchdog_overflow_callback at ffffffff8112f879
 #5 [ffff88103fbc5bc8] __perf_event_overflow at ffffffff81174d2e
 #6 [ffff88103fbc5c00] perf_event_overflow at ffffffff81175974
 #7 [ffff88103fbc5c10] intel_pmu_handle_irq at ffffffff81009d88
 #8 [ffff88103fbc5e38] perf_event_nmi_handler at ffffffff8168ed6b
 #9 [ffff88103fbc5e58] nmi_handle at ffffffff816901b7
#10 [ffff88103fbc5eb0] do_nmi at ffffffff816903c3
#11 [ffff88103fbc5ef0] end_repeat_nmi at ffffffff8168f5d3
    [exception RIP: update_curr+15]
    RIP: ffffffff810ce3cf  RSP: ffff88103fbc3db8  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff88092b2ed200  RCX: 0000000000000001
    RDX: 0000000000000001  RSI: ffff88092b2ed200  RDI: ffff880f6afb8600
    RBP: ffff88103fbc3dd0   R8: ffff88103d2b7500   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880f6afb8600
    R13: 0000000000000001  R14: 0000000000000003  R15: ffff8813bf7f5548
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff88103fbc3db8] update_curr at ffffffff810ce3cf
#13 [ffff88103fbc3dd8] enqueue_entity at ffffffff810d042d
#14 [ffff88103fbc3e20] unthrottle_cfs_rq at ffffffff810d16f4
#15 [ffff88103fbc3e58] distribute_cfs_runtime at ffffffff810d1932
#16 [ffff88103fbc3ea0] sched_cfs_period_timer at ffffffff810d1acf
#17 [ffff88103fbc3ed8] __hrtimer_run_queues at ffffffff810b4d72
#18 [ffff88103fbc3f30] hrtimer_interrupt at ffffffff810b5310
#19 [ffff88103fbc3f80] local_apic_timer_interrupt at ffffffff81051037
#20 [ffff88103fbc3f98] smp_apic_timer_interrupt at ffffffff81699f0f
#21 [ffff88103fbc3fb0] apic_timer_interrupt at ffffffff8169845d
--- <IRQ stack> ---
#22 [ffff8801699a3de8] apic_timer_interrupt at ffffffff8169845d
    [exception RIP: native_safe_halt+6]
    RIP: ffffffff81060fe6  RSP: ffff8801699a3e98  RFLAGS: 00000286
    RAX: 00000000ffffffed  RBX: ffff88103fbcd080  RCX: 0100000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000046
    RBP: ffff8801699a3e98   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 00099b9bb0645f00
    R13: ffff88103fbcfde0  R14: f21bf8c4662d3c34  R15: 0000000000000082
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#23 [ffff8801699a3ea0] default_idle at ffffffff810347ff
#24 [ffff8801699a3ec0] arch_cpu_idle at ffffffff81035146
#25 [ffff8801699a3ed0] cpu_startup_entry at ffffffff810e82f5
#26 [ffff8801699a3f28] start_secondary at ffffffff8104f0da
```
Affected versions:
Confirmed so far on Linux 3.10.0-514.26.2.el7.
Solution:
Wait for upstream patch c06f04c70489b9deea3212af8375e2f0c2f0b184 to be merged.
Root cause[^patch]:
distribute_cfs_runtime() intentionally only hands out enough runtime to bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take the runtime they need only once they actually get to run. However, if they get to run sufficiently quickly, the period timer is still in distribute_cfs_runtime() and no runtime is available, causing them to throttle. Then distribute has to handle them again, and this can go on until distribute has handed out all of the runtime 1ns at a time, which takes far too long.
Diagnosis:
```
crash> bt
PID: 0      TASK: ffff880169986dd0  CPU: 7   COMMAND: "swapper/7"
 #0 [ffff88103fbc59f0] machine_kexec at ffffffff81059beb
 #1 [ffff88103fbc5a50] __crash_kexec at ffffffff81105822
 #2 [ffff88103fbc5b20] panic at ffffffff81680541
 #3 [ffff88103fbc5ba0] nmi_panic at ffffffff81085abf
 #4 [ffff88103fbc5bb0] watchdog_overflow_callback at ffffffff8112f879
 #5 [ffff88103fbc5bc8] __perf_event_overflow at ffffffff81174d2e
 #6 [ffff88103fbc5c00] perf_event_overflow at ffffffff81175974
 #7 [ffff88103fbc5c10] intel_pmu_handle_irq at ffffffff81009d88
 #8 [ffff88103fbc5e38] perf_event_nmi_handler at ffffffff8168ed6b
 #9 [ffff88103fbc5e58] nmi_handle at ffffffff816901b7
#10 [ffff88103fbc5eb0] do_nmi at ffffffff816903c3
#11 [ffff88103fbc5ef0] end_repeat_nmi at ffffffff8168f5d3
    [exception RIP: update_curr+15]
    RIP: ffffffff810ce3cf  RSP: ffff88103fbc3db8  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff88092b2ed200  RCX: 0000000000000001
    RDX: 0000000000000001  RSI: ffff88092b2ed200  RDI: ffff880f6afb8600
    RBP: ffff88103fbc3dd0   R8: ffff88103d2b7500   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880f6afb8600
    R13: 0000000000000001  R14: 0000000000000003  R15: ffff8813bf7f5548
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff88103fbc3db8] update_curr at ffffffff810ce3cf
#13 [ffff88103fbc3dd8] enqueue_entity at ffffffff810d042d
#14 [ffff88103fbc3e20] unthrottle_cfs_rq at ffffffff810d16f4
#15 [ffff88103fbc3e58] distribute_cfs_runtime at ffffffff810d1932
#16 [ffff88103fbc3ea0] sched_cfs_period_timer at ffffffff810d1acf
#17 [ffff88103fbc3ed8] __hrtimer_run_queues at ffffffff810b4d72
#18 [ffff88103fbc3f30] hrtimer_interrupt at ffffffff810b5310
#19 [ffff88103fbc3f80] local_apic_timer_interrupt at ffffffff81051037
#20 [ffff88103fbc3f98] smp_apic_timer_interrupt at ffffffff81699f0f
#21 [ffff88103fbc3fb0] apic_timer_interrupt at ffffffff8169845d
--- <IRQ stack> ---
#22 [ffff8801699a3de8] apic_timer_interrupt at ffffffff8169845d
    [exception RIP: native_safe_halt+6]
    RIP: ffffffff81060fe6  RSP: ffff8801699a3e98  RFLAGS: 00000286
    RAX: 00000000ffffffed  RBX: ffff88103fbcd080  RCX: 0100000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000046
    RBP: ffff8801699a3e98   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 00099b9bb0645f00
    R13: ffff88103fbcfde0  R14: f21bf8c4662d3c34  R15: 0000000000000082
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#23 [ffff8801699a3ea0] default_idle at ffffffff810347ff
#24 [ffff8801699a3ec0] arch_cpu_idle at ffffffff81035146
#25 [ffff8801699a3ed0] cpu_startup_entry at ffffffff810e82f5
#26 [ffff8801699a3f28] start_secondary at ffffffff8104f0da
```
Judging from the shape of the backtrace, the system was not doing anything special before it died: the CPU was idle, then began handling a timer interrupt; while that interrupt was being handled an NMI came in, the NMI handler ran, and the panic happened inside it. The function in the exception stack worth a closer look is watchdog_overflow_callback, because both the log output and the panic itself take place inside it. Its description[^watchdog]:
This function is invoked from Non-Maskable Interrupt (NMI) context. If a CPU is busy, this function executes periodically and it checks whether watchdog_timer_fn has incremented the CPU-specific counter during the past interval. If the counter has not been incremented, watchdog_overflow_callback assumes that the CPU is ‘locked up’ in a section of kernel code where interrupts are disabled, and a panic occurs unless ‘panic on hard lockup’ is explicitly disabled via the nmi_watchdog=nopanic parameter on the kernel command line.
Looking at the implementation of watchdog_overflow_callback, the code path behind its is_hardlockup check comes down to comparing the following two values:
```
crash> px hrtimer_interrupts_saved:7
per_cpu(hrtimer_interrupts_saved, 7) = $14 = 0xa50fb
crash> px hrtimer_interrupts:7
per_cpu(hrtimer_interrupts, 7) = $15 = 0xa50fb
```
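For reference, the comparison it makes is essentially the following (a minimal sketch paraphrased from kernel/watchdog.c of the 3.10 series, trimmed to the part that matters here, not copied verbatim from this exact source tree):

```c
/* Sketch of the hard-lockup check, paraphrased from kernel/watchdog.c. */
static int is_hardlockup(void)
{
	unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

	/* The hrtimer-driven watchdog bumps hrtimer_interrupts every time it
	 * runs; if it has not moved since the last NMI sample, this CPU has
	 * not taken a timer tick for a full watchdog window. */
	if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
		return 1;

	__this_cpu_write(hrtimer_interrupts_saved, hrint);
	return 0;
}
```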
hrtimer_interrupts is a per-CPU variable incremented by the watchdog's hrtimer callback (watchdog_timer_fn, as the description above says); since it equals hrtimer_interrupts_saved, the watchdog concluded that CPU #7 had locked up inside a stretch of kernel code with interrupts disabled. Going back to the backtrace, note the two RFLAGS values: in the frame below the IRQ stack, RFLAGS is 00000286, so IF (the interrupt-enable flag) is set; in the NMI exception frame it is 00000002, so IF is clear. That is the classic picture of an interrupt handler being preempted by an NMI, and it suggests the ISR had been running long enough to trip the watchdog.
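A quick sanity check on those two RFLAGS values; IF is bit 9 (0x200) of RFLAGS, and this is plain user-space arithmetic with nothing taken from the vmcore except the two constants:

```c
#include <stdio.h>

int main(void)
{
	unsigned long idle_rflags = 0x286; /* frame below the IRQ stack  */
	unsigned long nmi_rflags  = 0x002; /* frame in the NMI exception */

	/* IF is bit 9 (0x200) of RFLAGS */
	printf("IF before the timer IRQ: %lu\n", (idle_rflags >> 9) & 1); /* 1 */
	printf("IF when the NMI hit:     %lu\n", (nmi_rflags  >> 9) & 1); /* 0 */
	return 0;
}
```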
The IRQ-side portion of the stack looks like this:
```
    [exception RIP: update_curr+15]
    RIP: ffffffff810ce3cf  RSP: ffff88103fbc3db8  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff88092b2ed200  RCX: 0000000000000001
    RDX: 0000000000000001  RSI: ffff88092b2ed200  RDI: ffff880f6afb8600
    RBP: ffff88103fbc3dd0   R8: ffff88103d2b7500   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880f6afb8600
    R13: 0000000000000001  R14: 0000000000000003  R15: ffff8813bf7f5548
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff88103fbc3db8] update_curr at ffffffff810ce3cf
#13 [ffff88103fbc3dd8] enqueue_entity at ffffffff810d042d
#14 [ffff88103fbc3e20] unthrottle_cfs_rq at ffffffff810d16f4
#15 [ffff88103fbc3e58] distribute_cfs_runtime at ffffffff810d1932
#16 [ffff88103fbc3ea0] sched_cfs_period_timer at ffffffff810d1acf
#17 [ffff88103fbc3ed8] __hrtimer_run_queues at ffffffff810b4d72
#18 [ffff88103fbc3f30] hrtimer_interrupt at ffffffff810b5310
#19 [ffff88103fbc3f80] local_apic_timer_interrupt at ffffffff81051037
#20 [ffff88103fbc3f98] smp_apic_timer_interrupt at ffffffff81699f0f
#21 [ffff88103fbc3fb0] apic_timer_interrupt at ffffffff8169845d
--- <IRQ stack> ---
```
apic_timer_interrupt runs with interrupts disabled, and the comment on hrtimer_interrupt repeats the point, which strengthens the earlier guess that the ISR (apic_timer_interrupt and everything it calls) is what tripped the NMI watchdog.
```
crash> dis -s hrtimer_interrupt | head -n 10
FILE: kernel/hrtimer.c
LINE: 1292

  1287  /*
  1288   * High resolution timer interrupt
  1289   * Called with interrupts disabled
  1290   */
  1291  void hrtimer_interrupt(struct clock_event_device *dev)
* 1292  {
  1293          struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
```
The call chain from the backtrace (update_curr <- enqueue_entity <- unthrottle_cfs_rq <- distribute_cfs_runtime <- sched_cfs_period_timer) suggests this has something to do with the CFS Bandwidth Control feature[^cfs_bandwidth] (confirmed with dis -s unthrottle_cfs_rq). The next step is to recover the argument values from the stack and compare them, using the x86_64 calling convention (the first three integer arguments travel in rdi, rsi and rdx; r12-r15 are callee-saved, so a callee's prologue pushes preserve whatever the caller kept there) together with the function prototypes.
First pair: the distribute_cfs_runtime function. Prototype:
```
crash> dis -s distribute_cfs_runtime
3423 static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
3424                 u64 remaining, u64 expires)
```
caller:
```
#16 [ffff88103fbc3ea0] sched_cfs_period_timer at ffffffff810d1acf
```
```
crash> dis 0xffffffff810d1acf -r | tail
...
0xffffffff810d1ac1 <sched_cfs_period_timer+193>:        mov    %r13,%rsi
0xffffffff810d1ac4 <sched_cfs_period_timer+196>:        mov    %r15,%rdx
0xffffffff810d1ac7 <sched_cfs_period_timer+199>:        mov    %r12,%rdi
0xffffffff810d1aca <sched_cfs_period_timer+202>:        callq  0xffffffff810d1840 <distribute_cfs_runtime>
```
callee:
```
crash> dis distribute_cfs_runtime | head -n 20
0xffffffff810d1840 <distribute_cfs_runtime>:    nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810d1845 <distribute_cfs_runtime+5>:  push   %rbp
0xffffffff810d1846 <distribute_cfs_runtime+6>:  mov    %rsp,%rbp
0xffffffff810d1849 <distribute_cfs_runtime+9>:  push   %r15      // 3rd
0xffffffff810d184b <distribute_cfs_runtime+11>: push   %r14
0xffffffff810d184d <distribute_cfs_runtime+13>: mov    %rdx,%r14
0xffffffff810d1850 <distribute_cfs_runtime+16>: push   %r13      // 2nd
0xffffffff810d1852 <distribute_cfs_runtime+18>: push   %r12      // 1st
0xffffffff810d1854 <distribute_cfs_runtime+20>: mov    %rsi,%r12
0xffffffff810d1857 <distribute_cfs_runtime+23>: push   %rbx
0xffffffff810d1858 <distribute_cfs_runtime+24>: sub    $0x10,%rsp
```
```
crash> bt -f
...
#15 [ffff88103fbc3e58] distribute_cfs_runtime at ffffffff810d1932
    ffff88103fbc3e60: ffff88103d2b7500 f21bf8c4662d3c34
    ffff88103fbc3e70: ffff8813bf7f5580 ffff8813bf7f5548
    ffff88103fbc3e80: 0000000002625a00 ffff8813bf7f5640
    ffff88103fbc3e90: 00099ba6210dc705 ffff88103fbc3ed0
    ffff88103fbc3ea0: ffffffff810d1acf
```
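Lining the raw stack words up with the callee's push order (my own annotation of the frame above, not crash output):

```
ffff88103fbc3ea0: ffffffff810d1acf   return address into sched_cfs_period_timer
ffff88103fbc3e98: ffff88103fbc3ed0   saved rbp
ffff88103fbc3e90: 00099ba6210dc705   saved r15  -> 3rd argument, expires
ffff88103fbc3e88: ffff8813bf7f5640   saved r14
ffff88103fbc3e80: 0000000002625a00   saved r13  -> 2nd argument, remaining
ffff88103fbc3e78: ffff8813bf7f5548   saved r12  -> 1st argument, cfs_b
ffff88103fbc3e70: ffff8813bf7f5580   saved rbx
```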
Reading the saved registers off that frame: struct cfs_bandwidth *cfs_b is ffff8813bf7f5548, u64 remaining is 0x0000000002625a00 (40,000,000 ns, i.e. 40 ms), and u64 expires is 0x00099ba6210dc705.
Second pair: the unthrottle_cfs_rq function.
Prototype: unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
caller:
```
crash> dis -r ffffffff810d1932 | tail
...
0xffffffff810d192a <distribute_cfs_runtime+234>:        mov    %r15,%rdi
0xffffffff810d192d <distribute_cfs_runtime+237>:        callq  0xffffffff810d1610 <unthrottle_cfs_rq>
```
callee:
```
crash> dis -r ffffffff810d16f4 | head -n 20
0xffffffff810d1610 <unthrottle_cfs_rq>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810d1615 <unthrottle_cfs_rq+5>:       push   %rbp
0xffffffff810d1616 <unthrottle_cfs_rq+6>:       mov    %rsp,%rbp
0xffffffff810d1619 <unthrottle_cfs_rq+9>:       push   %r15
```
```
#14 [ffff88103fbc3e20] unthrottle_cfs_rq at ffffffff810d16f4
    ffff88103fbc3e28: ffff88103fe56c40 00000000008cc0b3
    ffff88103fbc3e38: ffff8813bf7f5640 00099ba6210dc705
    ffff88103fbc3e48: ffff88103d2b7400 ffff88103fbc3e98
    ffff88103fbc3e58: ffffffff810d1932
```
So struct cfs_rq *cfs_rq is ffff88103d2b7400 (the caller passed its r15 in rdi, and unthrottle_cfs_rq's prologue saved that r15 just below the saved rbp).
Check the relevant fields of that structure:
```
crash> cfs_rq.runtime_remaining,runtime_expires ffff88103d2b7400
  runtime_remaining = 1
  runtime_expires = 2704412611823365
```
Get the address of its cfs_rq.throttled_list list head:
```
crash> cfs_rq.throttled_list ffff88103d2b7400 -ox
struct cfs_rq {
  [ffff88103d2b7500] struct list_head throttled_list;
}
```
Confirm how much runtime the throttled cfs_rqs are still short of in total:
```
crash> list -H ffff88103d2b7500 -o cfs_rq.throttled_list -s cfs_rq.runtime_remaining | grep -c runtime_remaining
22
crash> list -H ffff88103d2b7500 -o cfs_rq.throttled_list -s cfs_rq.throttled,runtime_remaining
ffff88103d2b6200
  throttled = 1
  runtime_remaining = -2535
ffff88103d2b6800
  throttled = 1
  runtime_remaining = -2337
ffff88103d2b7e00
  throttled = 1
  runtime_remaining = -2706
ffff880f30f33e00
  throttled = 1
  runtime_remaining = -2441
ffff88103d2b5600
  throttled = 1
  runtime_remaining = -2356
ffff88103d2b6a00
  throttled = 1
  runtime_remaining = -2365
ffff88103d2b7a00
  throttled = 1
  runtime_remaining = -2260
ffff88103d2b5400
  throttled = 1
  runtime_remaining = -2404
ffff88103d2b6c00
  throttled = 1
  runtime_remaining = -2421
ffff88103d2b7600
  throttled = 1
  runtime_remaining = -2429
ffff88103d2b4200
  throttled = 1
  runtime_remaining = -2357
ffff88103d2b7800
  throttled = 1
  runtime_remaining = -2359
ffff88103d2b6000
  throttled = 1
  runtime_remaining = -2416
ffff88103d2b7200
  throttled = 1
  runtime_remaining = -2353
ffff88103d2b6e00
  throttled = 1
  runtime_remaining = -2263
ffff8813be33a800
  throttled = 1
  runtime_remaining = -3394
ffff880f30f33c00
  throttled = 1
  runtime_remaining = -2599
ffff880f30f31000
  throttled = 1
  runtime_remaining = -2546
ffff8813be33ae00
  throttled = 1
  runtime_remaining = -3157
ffff88103d2b4c00
  throttled = 1
  runtime_remaining = -2337
ffff88103d2b6600
  throttled = 1
  runtime_remaining = -2284
ffff8813bf7f5540
  throttled = 1767994478
  runtime_remaining = 0
```
One entry clearly stands out; its values look nothing like the other cfs_rqs above.
```
crash> struct cfs_rq.throttled,runtime_remaining ffff8813bf7f5540
  throttled = 1767994478
  runtime_remaining = 0
```
Thanks to wuzhouhui for the correction: ffff8813bf7f5540 is actually a cfs_bandwidth structure rather than a cfs_rq (the walk has wrapped around to the list head), which is why its values look so odd.
Add up the deficits to see how much runtime is needed in total:
```
crash> pd 2535+2337+2706+2441+2356+2365+2260+2404+2421+2429+2357+2359+2416+2353+2263+3394+2599+2546+3157+2337+2284+0
$2 = 52319
```
Using the same trick, recover the current value of the remaining parameter of static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining, u64 expires) as it stood after being updated inside the loop:
```
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/kernel/sched/fair.c: 3438
0xffffffff810d190e <distribute_cfs_runtime+206>:        sub    %rcx,%rdx
0xffffffff810d1911 <distribute_cfs_runtime+209>:        cmp    %rdx,%r12
0xffffffff810d1914 <distribute_cfs_runtime+212>:        cmovbe %r12,%rdx
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/kernel/sched/fair.c: 3441
0xffffffff810d1918 <distribute_cfs_runtime+216>:        sub    %rdx,%r12
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/kernel/sched/fair.c: 3443
0xffffffff810d191b <distribute_cfs_runtime+219>:        add    %rcx,%rdx
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/kernel/sched/fair.c: 3447
0xffffffff810d191e <distribute_cfs_runtime+222>:        test   %rdx,%rdx
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/kernel/sched/fair.c: 3443
0xffffffff810d1921 <distribute_cfs_runtime+225>:        mov    %rdx,0xd8(%r15)
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/kernel/sched/fair.c: 3447
0xffffffff810d1928 <distribute_cfs_runtime+232>:        jle    0xffffffff810d18bf <distribute_cfs_runtime+127>
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/kernel/sched/fair.c: 3448
0xffffffff810d192a <distribute_cfs_runtime+234>:        mov    %r15,%rdi
0xffffffff810d192d <distribute_cfs_runtime+237>:        callq  0xffffffff810d1610 <unthrottle_cfs_rq>
```
```
crash> l 3438
3433
3434            raw_spin_lock(&rq->lock);
3435            if (!cfs_rq_throttled(cfs_rq))
3436                    goto next;
3437
3438            runtime = -cfs_rq->runtime_remaining + 1;
3439            if (runtime > remaining)
3440                    runtime = remaining;
3441            remaining -= runtime;
3442
3443            cfs_rq->runtime_remaining += runtime;
```
As the disassembly shows, remaining is kept in r12, and none of the instructions between the update at line 3441 and the call modify r12 before unthrottle_cfs_rq is invoked.
```
crash> dis unthrottle_cfs_rq
0xffffffff810d1610 <unthrottle_cfs_rq>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810d1615 <unthrottle_cfs_rq+5>:       push   %rbp
0xffffffff810d1616 <unthrottle_cfs_rq+6>:       mov    %rsp,%rbp
0xffffffff810d1619 <unthrottle_cfs_rq+9>:       push   %r15
0xffffffff810d161b <unthrottle_cfs_rq+11>:      push   %r14
0xffffffff810d161d <unthrottle_cfs_rq+13>:      push   %r13
0xffffffff810d161f <unthrottle_cfs_rq+15>:      push   %r12
```
```
#14 [ffff88103fbc3e20] unthrottle_cfs_rq at ffffffff810d16f4
    ffff88103fbc3e28: ffff88103fe56c40 00000000008cc0b3   // r12
    ffff88103fbc3e38: ffff8813bf7f5640 00099ba6210dc705
    ffff88103fbc3e48: ffff88103d2b7400 ffff88103fbc3e98   // r15 | rbp
    ffff88103fbc3e58: ffffffff810d1932

crash> pd 0x00000000008cc0b3
$3 = 9224371
```
So the r12 saved in unthrottle_cfs_rq's frame gives us distribute_cfs_runtime's current remaining: 9224371.
```
crash> cfs_bandwidth.throttled_cfs_rq ffff8813bf7f5548
  throttled_cfs_rq = {
    next = 0xffff88103d2b6300,
    prev = 0xffff88103d2b6700
  }

crash> list 0xffff88103d2b6300 | wc -l
22
```
The deficits summed over the throttled cfs_rqs above come to only 52319 ns, while the local remaining in distribute_cfs_runtime is still 9224371 ns, far more than is needed to cover them. Because each pass only tops every cfs_rq up to 1 ns and the groups immediately overrun and get throttled again (exactly the behaviour described in the patch quoted above), the kernel cannot get out of the while loop below:
```
crash> l do_sched_cfs_period_timer
3471
...
3513
3514        /*
3515         * This check is repeated as we are holding onto the new bandwidth
3516         * while we unthrottle. This can potentially race with an unthrottled
3517         * group trying to acquire new bandwidth from the global pool.
3518         */
3519        while (throttled && runtime > 0) {
3520                raw_spin_unlock(&cfs_b->lock);
3521                /* we can't nest cfs_b->lock while distributing bandwidth */
3522                runtime = distribute_cfs_runtime(cfs_b, runtime,
3523                                                 runtime_expires);
3524                raw_spin_lock(&cfs_b->lock);
3525
3526                throttled = !list_empty(&cfs_b->throttled_cfs_rq);
3527        }
3528
3529        /* return (any) remaining runtime */
3530        cfs_b->runtime = runtime;
3531        /*
3532         * While we are ensured activity in the period following an
3533         * unthrottle, this also covers the case in which the new bandwidth is
3534         * insufficient to cover the existing bandwidth deficit. (Forcing the
3535         * timer to remain active while there are any throttled entities.)
3536         */
3537        cfs_b->idle = 0;
3538
3539        return 0;
3540
3541   out_deactivate:
3542        cfs_b->timer_active = 0;
3543        return 1;
3544 }
```
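To get a feeling for how long that loop can spin, here is a toy user-space model. It is only a sketch under loudly stated assumptions: the 22 throttled cfs_rqs and the remaining value come from this vmcore, but the idea that each group re-throttles with a tiny deficit before the loop can exit, and the 2 ns figure, are purely illustrative:

```c
#include <stdio.h>

/* Toy model of the distribute_cfs_runtime() livelock. Assumptions (mine, for
 * illustration only): 22 throttled cfs_rqs, each of which runs, overruns by a
 * couple of nanoseconds and re-throttles before the while loop in
 * do_sched_cfs_period_timer() gets a chance to exit. */
int main(void)
{
	long remaining = 9224371;   /* ns still held by the period timer       */
	long deficit[22];           /* runtime_remaining of each throttled rq  */
	long passes = 0;

	for (int i = 0; i < 22; i++)
		deficit[i] = -2;    /* illustrative small overrun              */

	while (remaining > 0) {     /* mirrors: while (throttled && runtime>0) */
		for (int i = 0; i < 22 && remaining > 0; i++) {
			long runtime = -deficit[i] + 1; /* top up to 1 ns only */
			if (runtime > remaining)
				runtime = remaining;
			remaining -= runtime;
			deficit[i] = -2;    /* the rq runs and re-throttles    */
		}
		passes++;
	}
	printf("passes over the throttled list: %ld\n", passes); /* ~140000 */
	return 0;
}
```

Even this crude model needs on the order of 10^5 passes over the throttled list, each pass unthrottling and re-enqueuing 22 cfs_rqs, all from hrtimer context with interrupts disabled; that is easily enough to starve the CPU of timer ticks for the whole watchdog window, which is the behaviour the upstream patch above addresses.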
References:
[^cfs_bandwidth]: CFS Bandwidth Control