Commands such as uptime and top display the load average; the three numbers, from left to right, are the 1-minute, 5-minute, and 15-minute load averages:
$ uptime
10:16:25 up 3 days, 19:23, 2 users, load average: 0.00, 0.01, 0.05
The concept of load average originated on UNIX systems. Although the exact formula varies between implementations, it always measures the number of processes currently using a CPU plus the number waiting for a CPU; in short, the number of runnable processes. Load average can therefore serve as a reference indicator of a CPU bottleneck: if it exceeds the number of CPUs, the CPUs may be oversubscribed.
But that is not how Linux does it!
On Linux, load average counts not only the processes using a CPU and those waiting for a CPU, but also processes in uninterruptible sleep. Processes typically enter uninterruptible sleep while waiting on I/O devices or the network. The Linux designers' reasoning was that uninterruptible sleep should always be brief and the process will soon resume running, so it can be treated as equivalent to runnable. However, no matter how brief, uninterruptible sleep is still sleep, and in the real world it is not always brief: a large number of processes in uninterruptible sleep, or long stretches of it, usually means an I/O device has hit a bottleneck.

As everyone knows, a sleeping process does not need a CPU; even if every CPU is idle, a sleeping process cannot run. The number of sleeping processes is therefore entirely unsuitable as a measure of CPU load, and by counting uninterruptible-sleep processes into load average, Linux has subverted the metric's original meaning. As a result, on Linux the load average metric is of little use by itself: you cannot tell what it represents. When you see a high load average, you do not know whether there are too many runnable processes or too many processes in uninterruptible sleep, and hence whether the CPUs are oversubscribed or an I/O device is the bottleneck.
Reference: https://en.wikipedia.org/wiki/Load_(computing)
"Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system."
Source code:
RHEL6
kernel/sched.c:

static void calc_load_account_active(struct rq *this_rq)
{
	long nr_active, delta;

	nr_active = this_rq->nr_running;
	nr_active += (long) this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
		atomic_long_add(delta, &calc_load_tasks);
	}
}
RHEL7
kernel/sched/core.c:

static long calc_load_fold_active(struct rq *this_rq)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running;
	nr_active += (long) this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}

	return delta;
}
RHEL7
kernel/sched/core.c:

/*
 * Global load-average calculations
 *
 * We take a distributed and async approach to calculating the global load-avg
 * in order to minimize overhead.
 *
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *	nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
 *
 * Due to a number of reasons the above turns in the mess below:
 *
 *  - for_each_possible_cpu() is prohibitively expensive on machines with
 *    serious number of cpus, therefore we need to take a distributed approach
 *    to calculating nr_active.
 *
 *        \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
 *                      = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
 *
 *    So assuming nr_active := 0 when we start out -- true per definition, we
 *    can simply take per-cpu deltas and fold those into a global accumulate
 *    to obtain the same result. See calc_load_fold_active().
 *
 *    Furthermore, in order to avoid synchronizing all per-cpu delta folding
 *    across the machine, we assume 10 ticks is sufficient time for every
 *    cpu to have completed this task.
 *
 *    This places an upper-bound on the IRQ-off latency of the machine. Then
 *    again, being late doesn't loose the delta, just wrecks the sample.
 *
 *  - cpu_rq()->nr_uninterruptible isn't accurately tracked per-cpu because
 *    this would add another cross-cpu cacheline miss and atomic operation
 *    to the wakeup path. Instead we increment on whatever cpu the task ran
 *    when it went into uninterruptible state and decrement on whatever cpu
 *    did the wakeup. This means that only the sum of nr_uninterruptible over
 *    all cpus yields the correct result.
 *
 *  This covers the NO_HZ=n code, for extra head-aches, see the comment below.
 */
Source: http://www.bubuko.com/infodetail-3498886.html