Support for group scheduling makes managing a multi-tenant system much more convenient: on a single machine, CPU time can be divided fairly among users. However, that alone is not enough. Group scheduling only guarantees fairness *between* users; if an administrator wants to cap the maximum CPU a user may consume, bandwidth control is required. This feature was introduced on top of CFS group scheduling, and it lets an administrator bound how much CPU time a group may use within a given period.
0x00 CFS bandwidth control design
Let's first look at how the authors designed[^roadmap] this system (with a focus on enterprise use), examining the design of bandwidth control and then of CFS bandwidth control.

The bandwidth control design starts from two main considerations:
The actual amount of CPU time available to a group is highly variable as it is dependent on the presence and execution patterns of other groups; a machine can then not be predictably partitioned without intimately understanding the behaviors of all co-scheduled applications.
The maximum amount of CPU time available to a group is not predictable. While this is closely related to the first point, the distinction is worth noting as this directly affects capacity planning.
We have now opted for global specification of both enforcement interval (cpu.cfs_period_us) and allowable bandwidth (cpu.cfs_quota_us). By specifying these, the group as a whole will be limited to cpu.cfs_quota_us units of CPU time within the period of cpu.cfs_period_us.
Of note is that these limits are hierarchical. Unlike SCHED_RT, we do not currently perform feasibility evaluation regarding the defined limits: if a child has a more permissive bandwidth allowance than its parent, it will be indirectly throttled when the parent's quota is exhausted.
Additionally, there is the global control: /proc/sys/kernel/sched_cfs_bandwidth_slice_us
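As a quick usage illustration (mine, not the paper's): with the cgroup-v1 cpu controller mounted, capping a group at half a CPU amounts to writing the two interface files above. The group path /sys/fs/cgroup/cpu/demo and the helper write_val() are assumptions made for this sketch:

```c
#include <stdio.h>

/* Hypothetical cgroup-v1 path; adjust to your mount point and group name. */
#define GROUP "/sys/fs/cgroup/cpu/demo"

static int write_val(const char *file, long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), GROUP "/%s", file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

int main(void)
{
	/* period = 100ms, quota = 50ms: the group gets at most half a CPU */
	if (write_val("cpu.cfs_period_us", 100000) ||
	    write_val("cpu.cfs_quota_us", 50000)) {
		perror("cfs bandwidth setup");
		return 1;
	}
	return 0;
}
```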
The paper weighs two candidate designs; the hybrid global pool was adopted from v4 of the CFS bandwidth patch series onward.

With a purely local (per-CPU) pool design, computing and storing the remaining runtime on a large SMP system becomes a many-to-many relationship, and the lock contention this creates is expensive. Conversely, tracking quota only globally still fails to solve the problems described above; its one advantage is that, as long as quota remains, accounting consumed time is efficient, because the update touches a local per-CPU variable without taking a lock. The authors therefore settled on a hybrid scheme to improve performance.
To each task_group a new cfs_bandwidth structure has been added. This tracks (globally) the allocated and consumed quota within a period.
However, consumption does not occur against this pool directly; as in the local pool approach above there is a local, per cfs_rq, store of granted and consumed quota. This quota is acquired from the global pool in a (user configurable) batch size. When there is no quota available to re-provision a running cfs_rq, it is locally throttled until the next quota refresh.
Bandwidth refresh is a periodic operation that occurs once per quota period within which all throttled run-queues are unthrottled and the global bandwidth pool is replenished.
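To make the refresh step concrete, here is my simplified paraphrase of that periodic operation, not verbatim kernel source; the function name do_period_refresh is invented for the sketch, and in the kernel this work is split across do_sched_cfs_period_timer() and distribute_cfs_runtime():

```c
/*
 * Simplified sketch of one bandwidth period, assuming the cfs_bandwidth
 * and cfs_rq fields described below; locking, clock and error handling
 * are omitted.
 */
static void do_period_refresh(struct cfs_bandwidth *cfs_b)
{
	struct cfs_rq *cfs_rq, *tmp;

	cfs_b->runtime = cfs_b->quota;	/* replenish the global pool */

	/* hand runtime back to throttled run-queues until the pool runs dry */
	list_for_each_entry_safe(cfs_rq, tmp, &cfs_b->throttled_cfs_rq,
				 throttled_list) {
		u64 want = -cfs_rq->runtime_remaining + 1;	/* just enough to run again */
		u64 grant = min(want, cfs_b->runtime);

		cfs_b->runtime -= grant;
		cfs_rq->runtime_remaining += grant;
		if (cfs_rq->runtime_remaining > 0)
			unthrottle_cfs_rq(cfs_rq);	/* re-admit its entities */
		if (!cfs_b->runtime)
			break;
	}
}
```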
0x01 Implementation analysis
1: Core data structures:
As the design discussion in the paper suggests, the global CPU runtime pool is implemented by the cfs_bandwidth structure, which sits as the last field of task_group. Quota is granted per period; as I understand it, while a task group is scheduled normally, $$period \ge quota \ge runtime$$. Consumed time is not reflected in the global pool immediately: each cfs_rq's local pool records the quota it has acquired and consumed, and that quota is drawn from the global pool in a pre-configured batch size. When no quota is left to refill a cfs_rq's local pool, the task_group is throttled until the next quota redistribution.
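A glance at the fields makes the hybrid design concrete. The layout below is abridged from kernel sources of roughly the v3.2 era (field sets shift between versions), so treat it as a reference sketch rather than an exact listing:

```c
struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
	raw_spinlock_t lock;
	ktime_t period;		/* enforcement interval (cpu.cfs_period_us) */
	u64 quota;		/* allowed runtime per period (cpu.cfs_quota_us) */
	u64 runtime;		/* runtime remaining in the global pool */
	u64 runtime_expires;	/* when the current period's runtime expires */

	int idle, timer_active;
	struct hrtimer period_timer;		/* drives the periodic refresh */
	struct list_head throttled_cfs_rq;	/* run-queues waiting for quota */

	/* statistics */
	int nr_periods, nr_throttled;
	u64 throttled_time;
#endif
};

/* and, per cfs_rq, the local pool: */
struct cfs_rq {
	/* ... */
#ifdef CONFIG_CFS_BANDWIDTH
	int runtime_enabled;
	u64 runtime_expires;
	s64 runtime_remaining;	/* locally cached slice; may go negative */

	u64 throttled_timestamp;
	int throttled, throttle_count;
	struct list_head throttled_list;
#endif
};
```

With these in hand, the consumption path is easy to read. __account_cfs_rq_runtime() charges each chunk of executed time against the local pool: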
```c
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, unsigned long delta_exec)
{
	/* dock delta_exec before expiring quota (as it could span periods) */
	cfs_rq->runtime_remaining -= delta_exec;	/* charge the time the task just ran to the local pool */
	expire_cfs_rq_runtime(cfs_rq);	/* check the local pool's expiry; if the slice has not expired, extend it */

	if (likely(cfs_rq->runtime_remaining > 0))	/* branch hint: the local pool usually still has time left */
		return;

	/*
	 * if we're unable to extend our runtime we resched so that the active
	 * hierarchy can be throttled
	 */
	/* the local pool is empty: try to borrow from the global pool; if that fails, mark curr for rescheduling */
	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
		resched_task(rq_of(cfs_rq)->curr);
}
```
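The "extend it" step above deserves a closer look. My paraphrase of expire_cfs_rq_runtime() (simplified from the same kernel era; treat the details as approximate): if the local deadline has passed but the global one has not, the local sched_clock is merely running ahead of the clock that issued the runtime, so the local expiry is nudged forward by a tick instead of the remaining quota being discarded.

```c
/* Simplified paraphrase of expire_cfs_rq_runtime(); not verbatim kernel source. */
static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

	/* local deadline still in the future: nothing to do */
	if ((s64)(rq_of(cfs_rq)->clock - cfs_rq->runtime_expires) < 0)
		return;

	if (cfs_rq->runtime_remaining < 0)
		return;

	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
		/* only our sched_clock is ahead: extend the local deadline by a tick */
		cfs_rq->runtime_expires += TICK_NSEC;
	} else {
		/* the global pool has moved on to a new period: our slice is stale */
		cfs_rq->runtime_remaining = 0;
	}
}
```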
In assign_cfs_rq_runtime(), the pair of operations `expires = cfs_b->runtime_expires;` followed by `cfs_rq->runtime_expires = expires;`, together with `cfs_rq->runtime_remaining += amount;`, shows exactly how the local pool borrows time from the global pool:
```c
/* returns 0 on failure to allocate runtime */
static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct task_group *tg = cfs_rq->tg;
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
	u64 amount = 0, min_amount, expires;

	/* note: this is a positive sum as runtime_remaining <= 0 */
	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;

	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota == RUNTIME_INF)
		amount = min_amount;	/* no limit configured: grant freely */
	else {
		/*
		 * If the bandwidth pool has become inactive, then at least one
		 * period must have elapsed since the last consumption.
		 * Refresh the global state and ensure bandwidth timer becomes
		 * active.
		 */
		if (!cfs_b->timer_active) {
			__refill_cfs_bandwidth_runtime(cfs_b);	/* refill the global pool */
			__start_cfs_bandwidth(cfs_b);
		}

		if (cfs_b->runtime > 0) {
			amount = min(cfs_b->runtime, min_amount);
			cfs_b->runtime -= amount;	/* take a slice out of the global pool */
			cfs_b->idle = 0;
		}
	}
	expires = cfs_b->runtime_expires;
	raw_spin_unlock(&cfs_b->lock);

	cfs_rq->runtime_remaining += amount;	/* credit the borrowed slice to the local pool */
	/*
	 * we may have advanced our local expiration to account for allowed
	 * spread between our sched_clock and the one on which runtime was
	 * issued.
	 */
	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
		cfs_rq->runtime_expires = expires;

	return cfs_rq->runtime_remaining > 0;
}
```
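For concreteness: with the default slice of 5ms (sched_cfs_bandwidth_slice_us = 5000), a cfs_rq whose runtime_remaining has dropped to -1ms asks the global pool for min_amount = 5ms + 1ms = 6ms, enough to cover the overrun plus one fresh slice. Note also the return value: assign_cfs_rq_runtime() reports whether the local pool ended up positive, which is exactly what __account_cfs_rq_runtime() uses to decide whether to force a reschedule.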
```c
/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;

	/*
	 * With group scheduling, scheduling entities form a hierarchy, so we
	 * must walk every level; without group scheduling there is no
	 * hierarchy to traverse.
	 */
	for_each_sched_entity(se) {
		if (se->on_rq)			/* this se is already on a run-queue */
			break;
		cfs_rq = cfs_rq_of(se);		/* the cfs_rq this se belongs to */
		enqueue_entity(cfs_rq, se, flags);	/* enqueue_entity does the actual insertion */
		/* ... */
	}
	/* ... */
}
```
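On this path, enqueue_entity() ends up calling check_enqueue_throttle() when it puts the first entity onto a cfs_rq, which is where bandwidth enforcement hooks into wakeups: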
```c
/*
 * When a group wakes up we want to make sure that its quota is not already
 * expired/exceeded, otherwise it may be allowed to steal additional ticks of
 * runtime as update_curr() throttling cannot trigger until it's on-rq.
 */
static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
{
	if (!cfs_bandwidth_used())	/* bandwidth control is compiled out or unused */
		return;

	/* an active group must be handled by the update_curr()->put() path */
	if (!cfs_rq->runtime_enabled || cfs_rq->curr)	/* runtime accounting disabled, or the group is running */
		return;

	/* ensure the group is not already throttled */
	if (cfs_rq_throttled(cfs_rq))	/* already throttled: nothing more to do */
		return;

	/* update runtime allocation */
	account_cfs_rq_runtime(cfs_rq, 0);
	if (cfs_rq->runtime_remaining <= 0)
		throttle_cfs_rq(cfs_rq);	/* out of quota: throttle this cfs_rq */
}
```