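Source Code Analysis of the Android System-Level Watchdog Mechanism

1: Why do we need a watchdog?

I first came across the word "watchdog" in a university microcontroller textbook, in the section on the watchdog timer. In the early days of microcontrollers, the chips were easily disturbed by their working environment, which could send the program running off course. The watchdog was introduced as a protection mechanism: the program must feed the dog within every fixed interval, and if it fails to, the watchdog triggers a reset. Roughly, once the system is running, the watchdog counter is started and counts automatically. If the program does not clear the counter within a set time, the counter overflows and raises a watchdog interrupt, which resets the system.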
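To make the feed-or-reset idea concrete, here is a minimal simulation of such a counter in plain Java. It is only a sketch: the class name WatchdogTimerSketch, the tick-based counter, and the overflow threshold are all made up for illustration and do not correspond to any real microcontroller register layout.

```java
/** Minimal simulation of a hardware watchdog counter (all names are illustrative). */
final class WatchdogTimerSketch {
    private final int overflowAt; // tick count at which the counter "overflows"
    private int counter;          // ticks elapsed since the last feed
    private int resets;           // number of resets triggered so far

    WatchdogTimerSketch(int overflowAt) {
        if (overflowAt <= 0) {
            throw new IllegalArgumentException("overflowAt must be positive");
        }
        this.overflowAt = overflowAt;
    }

    /** The program calls this periodically to prove it is still alive. */
    void feed() {
        counter = 0;
    }

    /** Advances the timer by one tick; returns true if this tick caused a reset. */
    boolean tick() {
        counter++;
        if (counter >= overflowAt) {
            counter = 0; // a reset also restarts the counter
            resets++;
            return true;
        }
        return false;
    }

    int resetCount() {
        return resets;
    }

    public static void main(String[] args) {
        WatchdogTimerSketch wdt = new WatchdogTimerSketch(3);
        // A healthy program feeds the dog every second tick, so no reset happens.
        for (int i = 0; i < 6; i++) {
            wdt.tick();
            if (i % 2 == 1) {
                wdt.feed();
            }
        }
        System.out.println("resets while feeding: " + wdt.resetCount());
        // A hung program stops feeding, and the third tick overflows the counter.
        for (int i = 0; i < 3; i++) {
            wdt.tick();
        }
        System.out.println("resets after hang: " + wdt.resetCount());
    }
}
```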

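A phone is really just an extremely powerful microcontroller: it runs N times faster, has N times the storage, runs a large number of threads, and relies on all sorts of hardware and software working together. It is the one-in-ten-thousand case we have to worry about: the system might deadlock, or heavy interference might send the program off course, and either way things end badly. So the phone needs a watchdog mechanism too.

2: The Android system-level watchdog

Watchdogs come in hardware and software forms. A hardware watchdog is a timer circuit like the one in a microcontroller; a software watchdog is a similar mechanism that we implement ourselves. To keep the system stable, Android includes a software watchdog of its own. Its job is to make sure the system services keep working: it monitors many services, restarts the system when a core service misbehaves, and saves the state at the time of the failure.

Next, let's look at how Android's Watchdog is designed.

Note: this article is based on the Android 6.0 source.

The Watchdog source is located at: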

frameworks/base/services/core/java/com/android/server/Watchdog.java

The Watchdog is initialized in SystemServer:

/frameworks/base/services/java/com/android/server/SystemServer.java

SystemServer obtains the Watchdog instance and initializes it like this:

Slog.i(TAG, "Init Watchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService);

Initialization then goes through the following two methods: first the constructor, then init:

private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.
        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
    }
    public void init(Context context, ActivityManagerService activity) {
        mResolver = context.getContentResolver();
        mActivity = activity;
        // Register the receiver for reboot broadcasts
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

Reading the source, however, we can see that Watchdog extends Thread, so it also has to be started somewhere. That is the line below, which runs as part of ActivityManagerService's systemReady step:

Watchdog.getInstance().start();
TAG: HandlerChecker

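The code above uses an important class, HandlerChecker. This is the mechanism Watchdog uses to check the foreground thread, main thread, UI thread, I/O thread, and display thread. It is not long, so here it is in full. Its principle is to use the MessageQueue of each Handler's Looper to decide whether that thread is stuck. All of these threads, of course, run inside the SystemServer process.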

public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;
        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }
        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }
        // Record the start time and post this checker to the target thread
        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }
            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }
            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }
        public boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }
        // Return the completion state
        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
        public Thread getThread() {
            return mHandler.getLooper().getThread();
        }
        public String getName() {
            return mName;
        }
        public String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }
        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }
            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

From the code above, we can see that one core call is:

mHandler.getLooper().getQueue().isPolling()

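This method is implemented in MessageQueue; the code is pasted below. Its doc comment says that it returns whether the looper's thread is currently polling for more work, and that this is a good signal that the loop is still alive. From the HandlerChecker source we can see that when the checker has no monitors and isPolling() returns true, scheduleCheckLocked marks the check as completed and returns immediately.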

/**
     * Returns whether this looper's thread is currently polling for more work to do.
     * This is a good signal that the loop is still alive rather than being stuck
     * handling a callback.  Note that this method is intrinsically racy, since the
     * state of the loop can change before you get the result back.
     *
     * <p>This method is safe to call from any thread.
     *
     * @return True if the looper is currently polling for events.
     * @hide
     */
    public boolean isPolling() {
        synchronized (this) {
            return isPollingLocked();
        }
    }

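If that shortcut is not taken, either because the looper is busy or because the checker has monitors to run, the checker posts itself to the front of the queue and sets mCompleted to false, indicating that a message has been sent and is waiting to be handled. If the looper is not blocked, the checker's own run method will be called very soon.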
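The same "post a probe and see whether it comes back" idea can be tried outside Android. The sketch below is my own stand-in, not the framework code: a single-thread ExecutorService plays the role of the Handler/Looper thread, the completed flag mirrors mCompleted, and the names QueueProbeSketch and probe are hypothetical. One difference worth noting is that the executor queue is FIFO, whereas HandlerChecker uses postAtFrontOfQueue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Plain-Java stand-in for the HandlerChecker probe (names are illustrative). */
final class QueueProbeSketch {
    private final ExecutorService worker;      // stands in for the Handler/Looper thread
    private volatile boolean completed = true; // mirrors mCompleted

    QueueProbeSketch(ExecutorService worker) {
        this.worker = worker;
    }

    /** Posts a probe task unless one is already in flight. */
    void scheduleCheck() {
        if (!completed) {
            return; // a probe is still waiting to run
        }
        completed = false;
        worker.execute(() -> completed = true);
    }

    boolean isCompleted() {
        return completed;
    }

    /** Demo helper: reports whether the probe ran within waitMillis. */
    static boolean probe(boolean blockWorker, long waitMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            if (blockWorker) {
                // Simulate a stuck thread: this task occupies the worker until released.
                worker.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            QueueProbeSketch checker = new QueueProbeSketch(worker);
            checker.scheduleCheck();
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
            while (!checker.isCompleted() && System.nanoTime() < deadline) {
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return checker.isCompleted();
        } finally {
            release.countDown();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle worker answered: " + probe(false, 2000));
        System.out.println("blocked worker answered: " + probe(true, 200));
    }
}
```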

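What does run do? As the end of the HandlerChecker source shows, it iterates over the checker's monitors and calls monitor() on each one (monitor is explained below), then sets mCompleted back to true. If any monitor blocks, mCompleted stays false.

Then, when the system asks for the completion state, execution enters the else branch, computes the elapsed time, and returns the corresponding state code: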

// Return the completion state
        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
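The thresholds are easier to see when the same branching is written as a pure function. This is a sketch only: the numeric values of the four constants are not shown in the listing above, so the ones below are my assumption (all that matters is that they increase with severity), and the 60-second waitMax in main is taken from the timeout figures quoted later in this article.

```java
/** Pure-function version of the completion-state branching (constant values are assumed). */
final class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Maps the elapsed time of an unfinished check onto one of the four states. */
    static int stateFor(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;      // less than half the timeout has passed
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;  // between half the timeout and the full timeout
        }
        return OVERDUE;          // the full timeout has passed
    }

    public static void main(String[] args) {
        long waitMax = 60_000L; // assumed 60-second timeout
        long[] samples = {0L, 29_999L, 30_000L, 59_999L, 60_000L};
        for (long latency : samples) {
            System.out.println(latency + " ms -> state " + stateFor(false, latency, waitMax));
        }
    }
}
```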

OK, by now we know the two things used to judge whether a thread is stuck:

MessageQueue.isPolling
Monitor.monitor
TAG:Monitor
public interface Monitor {
        void monitor();
    }

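Monitor is an interface, and quite a few classes implement it (the original article showed a search result listing them here).

With so many implementers, we don't even need to guess: they must all register themselves with the Watchdog. Where do they register? The code below shows it: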

mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }
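The isAlive() guard means that every monitor has to be registered before the Watchdog thread is started. The following plain-Java sketch shows that rule in isolation. The class RegistrationSketch, its nested Monitor interface, and the helper lateAddIsRejected are all hypothetical, and I throw IllegalStateException where the real code throws a plain RuntimeException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Sketch of the "register before start" rule (all names are illustrative). */
final class RegistrationSketch extends Thread {
    interface Monitor {
        void monitor();
    }

    private final List<Monitor> monitors = new ArrayList<>();
    private final CountDownLatch stop = new CountDownLatch(1);

    RegistrationSketch() {
        super("watchdog-sketch");
        setDaemon(true);
    }

    /** Accepts monitors only while the thread has not been started. */
    void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new IllegalStateException("Monitors can't be added once the thread is running");
            }
            monitors.add(monitor);
        }
    }

    int monitorCount() {
        synchronized (this) {
            return monitors.size();
        }
    }

    @Override
    public void run() {
        // Stay alive until shutdown() is called; a real watchdog would loop here.
        try {
            stop.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void shutdown() {
        stop.countDown();
    }

    /** Demo helper: returns true if a registration after start() was rejected. */
    static boolean lateAddIsRejected() {
        RegistrationSketch watchdog = new RegistrationSketch();
        watchdog.addMonitor(() -> { }); // allowed: the thread is not running yet
        watchdog.start();
        try {
            watchdog.addMonitor(() -> { }); // too late
            return false;
        } catch (IllegalStateException expected) {
            return watchdog.monitorCount() == 1;
        } finally {
            watchdog.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("late registration rejected: " + lateAddIsRejected());
    }
}
```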

So each class that implements this interface only needs to call Watchdog's addMonitor. Let's see how ActivityManagerService does it; its source path is:

/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
2381        Watchdog.getInstance().addMonitor(this);
19655    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
19656    public void monitor() {
19657        synchronized (this) { }
19658    }

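As we can see, AMS implements the interface and registers itself with Watchdog at line 2381. Its monitor method does nothing but synchronize on itself, to confirm that its lock can still be acquired, that is, that it has not deadlocked.

It doesn't do much, but it is enough: if AMS's lock is stuck, the call never returns, so an outside caller can tell whether AMS has hung.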
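Here is a small plain-Java experiment showing why an empty synchronized block works as a liveness probe. It is a sketch under my own assumptions: Service is a hypothetical stand-in for AMS, and the helper monitorReturns runs the probe on a separate thread and waits on it with a timeout, which is roughly the role HandlerChecker and Watchdog play together.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Shows that monitor() only returns when the service's lock is free (names are illustrative). */
final class LockProbeSketch {
    /** Hypothetical service whose monitor() mirrors the AMS pattern. */
    static final class Service {
        void monitor() {
            synchronized (this) { }
        }
    }

    /** Returns true if monitor() came back within waitMillis. */
    static boolean monitorReturns(boolean holdLock, long waitMillis) {
        Service service = new Service();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch monitorDone = new CountDownLatch(1);

        // Simulates a stuck service thread that never lets go of the lock.
        Thread holder = new Thread(() -> {
            synchronized (service) {
                lockHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "lock-holder");
        holder.setDaemon(true);

        // Plays the role of the thread on which the checker runs its monitors.
        Thread prober = new Thread(() -> {
            service.monitor();
            monitorDone.countDown();
        }, "prober");
        prober.setDaemon(true);

        try {
            if (holdLock) {
                holder.start();
                lockHeld.await(); // make sure the lock is really taken first
            }
            prober.start();
            return monitorDone.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // let the holder finish so the demo can exit cleanly
        }
    }

    public static void main(String[] args) {
        System.out.println("lock free, monitor returned: " + monitorReturns(false, 2000));
        System.out.println("lock held, monitor returned: " + monitorReturns(true, 200));
    }
}
```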

Good, now we know how deadlocks in other services are detected. Let's see how Watchdog's run method ties the whole mechanism together.

TAG: Watchdog.run

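The run method is an infinite loop: it repeatedly iterates over all HandlerCheckers, schedules their checks, waits 30 seconds, and evaluates the state. See the comments below for details: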

341    @Override
342    public void run() {
343        boolean waitedHalf = false;
344        while (true) {
345            final ArrayList<HandlerChecker> blockedCheckers;
346            final String subject;
347            final boolean allowRestart;
348            int debuggerWasConnected = 0;
349            synchronized (this) {
350                long timeout = CHECK_INTERVAL;
351                // Make sure we (re)spin the checkers that have become idle within
352                // this wait-and-check interval
                   // Here we iterate over all HandlerCheckers, schedule their checks, and record the start time
353                for (int i=0; i<mHandlerCheckers.size(); i++) {
354                    HandlerChecker hc = mHandlerCheckers.get(i);
355                    hc.scheduleCheckLocked();
356                }
357
358                if (debuggerWasConnected > 0) {
359                    debuggerWasConnected--;
360                }
361
362                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
363                // wait while asleep. If the device is asleep then the thing that we are waiting
364                // to timeout on is asleep as well and won't have a chance to run, causing a false
365                // positive on when to kill things.
366                long start = SystemClock.uptimeMillis();
                   // Wait 30 seconds. uptimeMillis() is used so that time spent asleep is not counted; while the device sleeps, the system services sleep too
367                while (timeout > 0) {
368                    if (Debug.isDebuggerConnected()) {
369                        debuggerWasConnected = 2;
370                    }
371                    try {
372                        wait(timeout);
373                    } catch (InterruptedException e) {
374                        Log.wtf(TAG, e);
375                    }
376                    if (Debug.isDebuggerConnected()) {
377                        debuggerWasConnected = 2;
378                    }
379                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
380                }
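381                // Evaluate the checkers' state: iterate over all HandlerCheckers and take the largest return value
382                final int waitState = evaluateCheckerCompletionLocked();
                   // There are four possible results. COMPLETED: the message has been handled and the thread is not blocked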
383                if (waitState == COMPLETED) {
384                    // The monitors have returned; reset
385                    waitedHalf = false;
386                    continue;
                   // WAITING: handling has taken 0 to 29 seconds; keep going
387                } else if (waitState == WAITING) {
388                    // still waiting but within their configured intervals; back off and recheck
389                    continue;
                   // WAITED_HALF: handling has taken 30 to 59 seconds; the thread may be blocked, so dump the current stack traces through AMS
390                } else if (waitState == WAITED_HALF) {
391                    if (!waitedHalf) {
392                        // We've waited half the deadlock-detection interval.  Pull a stack
393                        // trace and wait another half.
394                        ArrayList<Integer> pids = new ArrayList<Integer>();
395                        pids.add(Process.myPid());
396                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
397                                NATIVE_STACKS_OF_INTEREST);
398                        waitedHalf = true;
399                    }
400                    continue;
401                }
402                // OVERDUE: handling has taken more than 60 seconds. Reaching this point means a 60-second timeout has occurred; everything below deals with it
403                // something is overdue!
404                blockedCheckers = getBlockedCheckersLocked();
405                subject = describeCheckersLocked(blockedCheckers);
406                allowRestart = mAllowRestart;
407            }
408
409            // If we got here, that means that the system is most likely hung.
410            // First collect stack traces from all threads of the system process.
411            // Then kill this process so that the system will restart.
412            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
413

....... (various records are saved here)

// Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
 
            waitedHalf = false;
        }
    }
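The body of evaluateCheckerCompletionLocked is not pasted in this article, but the comment in the listing says it walks all HandlerCheckers and takes the largest return value. A minimal sketch of that "worst state wins" step is below. The constant values and the names EvaluateSketch and worstState are my assumptions, chosen only so that a larger number means a worse state.

```java
/** Sketch of "take the largest state across all checkers" (constant values are assumed). */
final class EvaluateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Returns the most severe state among the given per-checker states. */
    static int worstState(int[] checkerStates) {
        int state = COMPLETED; // with no checkers, nothing is blocked
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        // One slow checker is enough to push the overall state up.
        int[] states = {COMPLETED, WAITED_HALF, WAITING};
        System.out.println("overall state: " + worstState(states));
    }
}
```

Because a single overdue checker makes the overall state OVERDUE, one stuck thread is enough to send the loop into its timeout handling.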

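In the run listing, reaching line 412 means the preparations before restarting the system have begun.

The following steps are performed:

Write the event log

Output, in append mode, the stack traces of system_server and 3 native processes

Output the kernel stack traces

Dump all blocked threads

Write the dropbox entry

Check whether a debugger is attached; if not, kill the system process so that the system restarts, and log: *** WATCHDOG KILLING SYSTEM PROCESS: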

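3: Summary

That is how the Android system-level Watchdog works. It is well designed; had I designed it myself, I doubt I would have thought of using a lock probe like Monitor to detect deadlocks.

To summarize:

Watchdog is a thread that monitors whether the system services are running normally and have not deadlocked

HandlerChecker checks a Handler and its monitors

A monitor detects deadlock by trying to acquire a lock

After a 30-second timeout, logs are written; after a 60-second timeout, the system is restarted (unless a debugger is attached)

Source: http://www.jianshu.com/p/5c18c4e8c826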
