Previously, CFS computed load at the granularity of each CPU's run queue (per-CPU runqueue). Once the group scheduler is used, however, every cgroup has its own per-CPU run queues, so when the kernel needs to know how much each cgroup contributes to the overall system load, the old scheme is no longer sufficient. The per-CPU-runqueue approach also has the drawback that the computed load fluctuates too much.
```c
/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;

	/*
	 * If in_iowait is set, the code below may not trigger any cpufreq
	 * utilization updates, so do it here explicitly with the IOWAIT flag
	 * passed.
	 */
	if (p->in_iowait)
		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
	...
```
```c
/* Update task and its cfs_rq load average */
static inline void update_load_avg(struct sched_entity *se, int flags)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);
	u64 now = cfs_rq_clock_task(cfs_rq);
	struct rq *rq = rq_of(cfs_rq);
	int cpu = cpu_of(rq);
	int decayed;

	/*
	 * Track task load average for carrying it to new CPU after migrated, and
	 * track group sched_entity load average for task_h_load calc in migration
	 */
	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
		__update_load_avg_se(now, cpu, cfs_rq, se);	// here
	...
}
```
```c
/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
{
	struct sched_avg *sa = &se->avg;

	sa->last_update_time = 0;
	/*
	 * sched_avg's period_contrib should be strictly less then 1024, so
	 * we give it 1023 to make sure it is almost a period (1024us), and
	 * will definitely be update (after enqueue).
	 */
	sa->period_contrib = 1023;
	/*
	 * Tasks are intialized with full load to be seen as heavy tasks until
	 * they get a chance to stabilize to their real load level.
	 * Group entities are intialized with zero load to reflect the fact that
	 * nothing has been attached to the task group yet.
	 */
	if (entity_is_task(se))
		sa->load_avg = scale_load_down(se->load.weight);
	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
	/*
	 * At this point, util_avg won't be used in select_task_rq_fair anyway
	 */
	sa->util_avg = 0;
	sa->util_sum = 0;
	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
```
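To see what these initial values amount to, here is a small stand-alone sketch (my own, not kernel code) that plugs in the constants of this era, assuming a nice-0 task whose scaled-down weight is 1024 and LOAD_AVG_MAX = 47742:

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed constants: 1024 is the scaled-down weight of a nice-0 task,
 * 47742 is LOAD_AVG_MAX, the largest load_sum an entity can accumulate. */
#define NICE_0_WEIGHT	1024
#define LOAD_AVG_MAX	47742

int main(void)
{
	/* Mirrors what init_entity_runnable_average() sets for a new task. */
	uint64_t load_avg = NICE_0_WEIGHT;
	uint64_t load_sum = load_avg * LOAD_AVG_MAX;

	printf("initial load_avg = %llu, load_sum = %llu\n",
	       (unsigned long long)load_avg, (unsigned long long)load_sum);
	return 0;
}
```

In other words, a freshly created task enters the scheduler looking maximally heavy, and only settles toward its real load level after it has actually been tracked for a while, which is exactly what the comment in the kernel source describes.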
0x02 core path: ___update_load_avg
This is the key entry function of per-entity load tracking. Its implementation changed in 4.12 (the optimization was introduced in the patch[^Optimize]), but the underlying idea is the same. The developers represent the runnable average an se has contributed over its history as a geometric series. To do this, the entity's runnable time is first divided into slices of roughly 1024 us (about 1 ms), and the slice that lies N ms in the past is labeled $$p_N$$; $$p_0$$, for example, is the current slice. The diagram below illustrates this:
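Restating the scheme as a formula (with $$u_i$$ standing for the entity's runnable contribution during slice $$p_i$$; the symbol is mine, not the kernel's), the tracked sum is the geometrically decayed series

$$
\text{load\_sum} = u_0 + u_1 y + u_2 y^2 + \cdots + u_n y^n, \qquad y^{32} = \frac{1}{2}
$$

so a slice's contribution halves every 32 ms, and the whole sum can be updated incrementally: multiply the previous sum by $$y$$ and add the newest slice.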
```c
static __always_inline int
___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
	u64 delta;

	delta = now - sa->last_update_time;	// length of the new sample window (ns)
	/*
	 * This should only happen when time goes backwards, which it
	 * unfortunately does during sched clock init when we swap over to TSC.
	 */
	if ((s64)delta < 0) {
		sa->last_update_time = now;
		return 0;
	}

	/*
	 * Use 1024ns as the unit of measurement since it's a reasonable
	 * approximation of 1us and fast to compute.
	 */
	delta >>= 10;	// convert the window from ns to (roughly) us
	if (!delta)
		return 0;

	/*
	 * running is a subset of runnable (weight) so running can't be set if
	 * runnable is clear. But there are some corner cases where the current
	 * se has been already dequeued but cfs_rq->curr still points to it.
	 * This means that weight will be 0 but not running for a sched_entity
	 * but also for a cfs_rq if the latter becomes idle. As an example,
	 * this happens during idle_balance() which calls
	 * update_blocked_averages()
	 */
	if (!weight)
		running = 0;

	/*
	 * Now we know we crossed measurement unit boundaries. The *_avg
	 * accrues by two steps:
	 *
	 * Step 1: accumulate *_sum since last_update_time. If we haven't
	 * crossed period boundaries, finish.
	 */
	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))	// look here !!! accumulate the load since the last update
		return 0;

	/*
	 * Step 2: update *_avg.
	 */
	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
	if (cfs_rq) {
		cfs_rq->runnable_load_avg =
			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
	}
	sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);

	return 1;
}
```
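A note on the divisor `LOAD_AVG_MAX - 1024 + sa->period_contrib`: LOAD_AVG_MAX is what load_sum converges to for an entity that is runnable during every microsecond of every past period, i.e. the limit of $$1024\,(1 + y + y^2 + \cdots)$$. Subtracting the full 1024 of the current period and adding back the `period_contrib` microseconds that have actually elapsed makes the divisor equal to the maximum sum attainable right now, so `load_avg` stays normalized to the entity's weight and `util_avg` to 1024. Below is a minimal user-space sketch (my own, not kernel code) that checks the limit of the series numerically:

```c
#include <stdio.h>
#include <math.h>

/* Assumed constants from this era of PELT: ~1024 us periods and a decay
 * factor y chosen so that y^32 == 1/2. */
#define PELT_PERIOD_US	1024.0
#define LOAD_AVG_PERIOD	32

int main(void)
{
	double y = pow(0.5, 1.0 / LOAD_AVG_PERIOD);	/* y ~= 0.97857 */

	/* Closed form of 1024 * (1 + y + y^2 + ...): the most load_sum an
	 * always-runnable entity can accumulate. */
	double limit = PELT_PERIOD_US / (1.0 - y);

	/* The kernel works in fixed point and truncates each decayed term,
	 * so its constant LOAD_AVG_MAX (47742) sits slightly below this. */
	printf("analytic limit of the series: %.1f\n", limit);
	return 0;
}
```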
The next function, accumulate_sum, is the real core: it computes the se's load contribution as the sum of three parts. d1 is the remainder of the (incomplete) period farthest from the current time, d2 covers the complete periods in between, and d3 is the portion of the elapsed time that falls into the (incomplete) current period. The diagram below illustrates d1, d2 and d3:
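To make the split concrete, here is a small self-contained sketch (the helper name `split_delta` and its struct are mine, not the kernel's) of how an elapsed time `delta` decomposes into d1, d2 and d3, given how many microseconds `period_contrib` of the old period were already accounted for:

```c
#include <stdio.h>
#include <stdint.h>

#define PELT_PERIOD_US	1024	/* one PELT period, ~1 ms */

struct delta_split {
	uint32_t d1;	/* us that finish the old, partially filled period */
	uint64_t d2;	/* number of whole periods fully contained in delta */
	uint32_t d3;	/* us spilling into the new, current period */
};

/* Hypothetical helper: decompose delta (in us) the way accumulate_sum does,
 * where period_contrib (0..1023) is how much of the previous period was
 * already accounted for. */
static struct delta_split split_delta(uint64_t delta, uint32_t period_contrib)
{
	struct delta_split s;

	s.d1 = PELT_PERIOD_US - period_contrib;	/* rest of the old period */
	if (delta < s.d1) {
		/* No period boundary crossed: the kernel simply bumps
		 * period_contrib in this case. */
		s.d1 = 0;
		s.d2 = 0;
		s.d3 = (uint32_t)delta;
	} else {
		delta -= s.d1;
		s.d2 = delta / PELT_PERIOD_US;			/* full periods */
		s.d3 = (uint32_t)(delta % PELT_PERIOD_US);	/* new partial period */
	}
	return s;
}

int main(void)
{
	/* Example: 3000 us elapsed, 200 us of the old period already used. */
	struct delta_split s = split_delta(3000, 200);

	printf("d1=%u us, d2=%llu full periods, d3=%u us\n",
	       s.d1, (unsigned long long)s.d2, s.d3);
	return 0;
}
```

In the kernel's accumulate_sum, the old sum and d1 are then decayed by the number of crossed periods, the d2 contribution is computed in closed form from LOAD_AVG_MAX rather than by looping over every period, d3 is added undecayed, and d3 becomes the new period_contrib. The decay itself is done by decay_load, shown next.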
```c
/*
 * Approximate:
 *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
 */
static u64 decay_load(u64 val, u64 n)
{
	unsigned int local_n;

	if (unlikely(n > LOAD_AVG_PERIOD * 63))
		return 0;

	/* after bounds checking we can collapse to 32-bit */
	local_n = n;

	/*
	 * As y^PERIOD = 1/2, we can combine
	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
	 * With a look-up table which covers y^n (n<PERIOD)
	 *
	 * To achieve constant time decay_load.
	 */
	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
		val >>= local_n / LOAD_AVG_PERIOD;
		local_n %= LOAD_AVG_PERIOD;
	}

	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
	return val;
}
```
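To make the constant-time trick concrete, here is a small user-space sketch (my own, not kernel code) of the idea behind runnable_avg_yN_inv: entry n holds $$y^n$$ scaled by $$2^{32}$$, so one integer multiply-and-shift applies the fractional part of the decay, while whole multiples of 32 periods are handled with a plain right shift because $$y^{32} = 1/2$$:

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define LOAD_AVG_PERIOD	32

/* Decay val by n periods: val * y^n, with y^32 == 1/2.
 * (The kernel's bounds check for very large n is omitted here.) */
static uint64_t toy_decay(uint64_t val, unsigned int n, const uint32_t *tbl)
{
	/* Whole half-lives are just right shifts ... */
	val >>= n / LOAD_AVG_PERIOD;
	/* ... and the remainder is one table multiply: (val * tbl[r]) >> 32,
	 * which stays within 64 bits for PELT-sized values. */
	return (val * (uint64_t)tbl[n % LOAD_AVG_PERIOD]) >> 32;
}

int main(void)
{
	uint32_t tbl[LOAD_AVG_PERIOD];
	double y = pow(0.5, 1.0 / LOAD_AVG_PERIOD);

	/* Rebuild the table: entry n is y^n in 0.32 fixed point.  Entry 0
	 * would be 2^32, which does not fit in 32 bits, so clamp it the way
	 * the kernel table does (0xffffffff). */
	for (int n = 0; n < LOAD_AVG_PERIOD; n++) {
		double v = pow(y, n) * 4294967296.0;
		tbl[n] = (v >= 4294967295.0) ? 0xffffffffu : (uint32_t)v;
	}

	/* 32 periods roughly halve the value, 64 roughly quarter it. */
	printf("decay(47742, 32) = %llu\n",
	       (unsigned long long)toy_decay(47742, 32, tbl));
	printf("decay(47742, 64) = %llu\n",
	       (unsigned long long)toy_decay(47742, 64, tbl));
	return 0;
}
```

Splitting the exponent this way is what keeps decay_load O(1): the shift covers whole half-lives and a 32-entry table covers everything in between.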