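1: Why do we need a watchdog?
I first came across the term "watchdog" in a university textbook on microcontrollers, in the section on the watchdog timer. In the early days of microcontrollers, the chips were easily disturbed by external interference, which could send the running program off the rails. The watchdog was introduced as a protection mechanism: the program must "feed the dog" at a regular interval, and if it fails to do so, the watchdog triggers a reset. Roughly, once the system is running, the watchdog counter is started and counts automatically; if the counter is not cleared within a set time, it overflows, raises a watchdog interrupt, and resets the system.
A phone is really just an extremely powerful microcontroller: it runs many times faster, has many times the storage, runs a large number of threads, and relies on all sorts of hardware and software working together. Better safe than sorry: the system might deadlock, or heavy interference might derail a program on the phone too, and either case can end very badly. So the phone needs a watchdog mechanism as well.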
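The feed-or-reset principle described above can be sketched in a few lines of Java. This is only an illustration of the idea, not code from any real watchdog: the class name `SimpleWatchdogTimer` and its methods are hypothetical, and the current time is passed in explicitly so the behavior is easy to follow.

```java
/** Hypothetical software watchdog: it must be fed within timeoutMillis or it reports expiry. */
final class SimpleWatchdogTimer {
    private final long timeoutMillis;
    private long lastFedAt;

    SimpleWatchdogTimer(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedAt = nowMillis;
    }

    /** "Feeding the dog": restart the countdown from the given time. */
    void feed(long nowMillis) {
        lastFedAt = nowMillis;
    }

    /** True when nobody fed the dog in time; a real watchdog would reset the system here. */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastFedAt > timeoutMillis;
    }
}

final class SimpleWatchdogTimerDemo {
    public static void main(String[] args) {
        SimpleWatchdogTimer dog = new SimpleWatchdogTimer(1000, 0);
        dog.feed(800);
        System.out.println(dog.isExpired(1500)); // false: only 700 ms since the last feed
        System.out.println(dog.isExpired(1801)); // true: 1001 ms since the last feed
    }
}
```

A hardware watchdog does the same thing with a counter and a reset line instead of a method call.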
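2: The Android system-level watchdog
Watchdogs come in hardware and software flavors. A hardware watchdog is the timer circuit found on a microcontroller; a software watchdog is one we implement ourselves along the same lines. To keep the system stable, Android has a software watchdog of its own. Its job is to make sure the system services keep working: it monitors many services, restarts the system when a core service misbehaves, and saves the state at the time of failure.
Let's look at how the Android Watchdog is designed.
Note: this article is based on the Android 6.0 source.
The Watchdog source lives at: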
frameworks/base/services/core/java/com/android/server/Watchdog.java
The Watchdog is initialized in SystemServer:
/frameworks/base/services/java/com/android/server/SystemServer.java
SystemServer initializes the Watchdog like this:
- Slog.i(TAG, "Init Watchdog");
- final Watchdog watchdog = Watchdog.getInstance();
- watchdog.init(context, mActivityManagerService);
Initialization runs through the following methods: first the constructor, then init:
- private Watchdog() {
- super("watchdog");
- // Initialize handler checkers for each common thread we want to check. Note
- // that we are not currently checking the background thread, since it can
- // potentially hold longer running operations with no guarantees about the timeliness
- // of operations there.
- // The shared foreground thread is the main checker. It is where we
- // will also dispatch monitor checks and do other work.
- mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
- "foreground thread", DEFAULT_TIMEOUT);
- mHandlerCheckers.add(mMonitorChecker);
- // Add checker for main thread. We only do a quick check since there
- // can be UI running on the thread.
- mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
- "main thread", DEFAULT_TIMEOUT));
- // Add checker for shared UI thread.
- mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
- "ui thread", DEFAULT_TIMEOUT));
- // And also check IO thread.
- mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
- "i/o thread", DEFAULT_TIMEOUT));
- // And the display thread.
- mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
- "display thread", DEFAULT_TIMEOUT));
- // Initialize monitor for Binder threads.
- addMonitor(new BinderThreadMonitor());
- }
- public void init(Context context, ActivityManagerService activity) {
- mResolver = context.getContentResolver();
- mActivity = activity;
- // Register the receiver for reboot broadcasts
- context.registerReceiver(new RebootRequestReceiver(),
- new IntentFilter(Intent.ACTION_REBOOT),
- android.Manifest.permission.REBOOT, null);
- }
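Reading the source, however, shows that the Watchdog class extends Thread, so it also has to be started somewhere. That is the line below, which runs as part of the systemReady handling of ActivityManagerService: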
- Watchdog.getInstance().start();
- TAG: HandlerChecker
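The code above relies on an important class, HandlerChecker. It is the mechanism the Watchdog uses to check the foreground, main, UI, I/O, and display threads. The code is short, so here it is in full. The idea is to use the MessageQueue behind each Handler's Looper to decide whether the thread is stuck. All of these threads run inside the SystemServer process.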
- public final class HandlerChecker implements Runnable {
- private final Handler mHandler;
- private final String mName;
- private final long mWaitMax;
- private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
- private boolean mCompleted;
- private Monitor mCurrentMonitor;
- private long mStartTime;
- HandlerChecker(Handler handler, String name, long waitMaxMillis) {
- mHandler = handler;
- mName = name;
- mWaitMax = waitMaxMillis;
- mCompleted = true;
- }
- public void addMonitor(Monitor monitor) {
- mMonitors.add(monitor);
- }
- // Schedule a check and record its start time
- public void scheduleCheckLocked() {
- if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
- // If the target looper has recently been polling, then
- // there is no reason to enqueue our checker on it since that
- // is as good as it not being deadlocked. This avoid having
- // to do a context switch to check the thread. Note that we
- // only do this if mCheckReboot is false and we have no
- // monitors, since those would need to be executed at this point.
- mCompleted = true;
- return;
- }
- if (!mCompleted) {
- // we already have a check in flight, so no need
- return;
- }
- mCompleted = false;
- mCurrentMonitor = null;
- mStartTime = SystemClock.uptimeMillis();
- mHandler.postAtFrontOfQueue(this);
- }
- public boolean isOverdueLocked() {
- return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
- }
- // Get the completion state
- public int getCompletionStateLocked() {
- if (mCompleted) {
- return COMPLETED;
- } else {
- long latency = SystemClock.uptimeMillis() - mStartTime;
- if (latency < mWaitMax/2) {
- return WAITING;
- } else if (latency < mWaitMax) {
- return WAITED_HALF;
- }
- }
- return OVERDUE;
- }
- public Thread getThread() {
- return mHandler.getLooper().getThread();
- }
- public String getName() {
- return mName;
- }
- public String describeBlockedStateLocked() {
- if (mCurrentMonitor == null) {
- return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
- } else {
- return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
- + " on " + mName + " (" + getThread().getName() + ")";
- }
- }
- @Override
- public void run() {
- final int size = mMonitors.size();
- for (int i = 0 ; i < size ; i++) {
- synchronized (Watchdog.this) {
- mCurrentMonitor = mMonitors.get(i);
- }
- mCurrentMonitor.monitor();
- }
- synchronized (Watchdog.this) {
- mCompleted = true;
- mCurrentMonitor = null;
- }
- }
- }
In the code above, one key call is:
mHandler.getLooper().getQueue().isPolling()
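This method is implemented in MessageQueue; its code is below. The doc comment says that it returns whether the looper's thread is currently polling for more work, and that this is a good signal that the loop is still alive. In HandlerChecker, if the checker has no monitors and this call returns true, scheduleCheckLocked marks the check as completed and returns immediately.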
- /**
- * Returns whether this looper's thread is currently polling for more work to do.
- * This is a good signal that the loop is still alive rather than being stuck
- * handling a callback. Note that this method is intrinsically racy, since the
- * state of the loop can change before you get the result back.
- *
- * <p>This method is safe to call from any thread.
- *
- * @return True if the looper is currently polling for events.
- * @hide
- */
- public boolean isPolling() {
- synchronized (this) {
- return isPollingLocked();
- }
- }
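Otherwise (the looper is busy, or the checker has monitors to run), the checker posts itself to the front of the queue and sets mCompleted to false, marking that a message has been sent and is waiting to be handled. If the looper is not blocked, the checker's run method is called shortly afterwards.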
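The "post yourself and see whether you get to run" idea can be tried outside Android. The sketch below is a simplified illustration, not the framework code: it assumes a single-thread `ExecutorService` as a stand-in for the Handler and Looper, and the names `ThreadProbe` and `ThreadProbeDemo` are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical probe: posts itself to a worker thread and records whether it got to run. */
final class ThreadProbe implements Runnable {
    private final ExecutorService worker; // stand-in for the Handler/Looper being watched
    private boolean completed = true;     // plays the role of mCompleted: no check in flight
    private long startNanos;

    ThreadProbe(ExecutorService worker) {
        this.worker = worker;
    }

    /** Starts a check unless one is already in flight. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        startNanos = System.nanoTime();
        worker.execute(this);
    }

    /** Runs on the watched thread; getting here proves that thread is still making progress. */
    @Override
    public synchronized void run() {
        completed = true;
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    /** True if a check is in flight and has waited longer than waitMaxMillis. */
    synchronized boolean isOverdue(long waitMaxMillis) {
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        return !completed && waited > waitMaxMillis;
    }
}

final class ThreadProbeDemo {
    /** Returns {completedWhileBlocked, overdueWhileBlocked, completedAfterRelease}. */
    static boolean[] probeBlockedWorker() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // Simulate a stuck thread: the worker sits in this task until it is released.
        worker.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        ThreadProbe probe = new ThreadProbe(worker);
        probe.scheduleCheck();
        boolean[] result = new boolean[3];
        try {
            Thread.sleep(100);
            result[0] = probe.isCompleted();
            result[1] = probe.isOverdue(20);
            release.countDown();
            worker.shutdown(); // queued tasks, including the probe, still run
            worker.awaitTermination(5, TimeUnit.SECONDS);
            result[2] = probe.isCompleted();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            release.countDown();
            worker.shutdown();
        }
        return result;
    }
}
```

While the worker is stuck the probe never runs, so it stays incomplete and eventually becomes overdue; once the worker is released the probe runs and the check completes.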
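What does run do? (See line 166 of the HandlerChecker source.) It iterates over its monitors and calls monitor() on each one (monitor is covered below). If any monitor blocks, mCompleted stays false.
When the Watchdog then asks for the completion state, the method takes the else branch, computes the elapsed time, and returns the matching state code: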
- // Get the completion state
- public int getCompletionStateLocked() {
- if (mCompleted) {
- return COMPLETED;
- } else {
- long latency = SystemClock.uptimeMillis() - mStartTime;
- if (latency < mWaitMax/2) {
- return WAITING;
- } else if (latency < mWaitMax) {
- return WAITED_HALF;
- }
- }
- return OVERDUE;
- }
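To make the thresholds concrete, the same classification can be written as a pure function. This is a sketch rather than the framework code: the class name `CompletionState` is hypothetical, the numeric values of the four constants are assumed, and the 60,000 ms timeout used in the usage note comes from the 60-second figure quoted in this article.

```java
/** Hypothetical stand-alone version of the completion-state classification. */
final class CompletionState {
    // Assumed values; what matters is that a larger number means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /**
     * @param completed     whether the posted check has already run
     * @param latencyMillis time since the check was scheduled
     * @param waitMaxMillis the checker's timeout
     */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }
}
```

With a 60,000 ms timeout, a check that is still pending after 29,999 ms is WAITING, one pending for 30,000 to 59,999 ms is WAITED_HALF, and anything from 60,000 ms onwards is OVERDUE.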
That covers how a stuck thread is detected. Two checks are involved:
- MessageQueue.isPolling
- Monitor.monitor
- TAG:Monitor
- public interface Monitor {
- void monitor();
- }
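Monitor is an interface with quite a few implementations; a search through the source turns up many classes that implement it.
With that many implementers, there is no need to guess: they all register themselves with the Watchdog. Where do they register? The code below shows it: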
- mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
- "foreground thread", DEFAULT_TIMEOUT);
- mHandlerCheckers.add(mMonitorChecker);
- public void addMonitor(Monitor monitor) {
- synchronized (this) {
- if (isAlive()) {
- throw new RuntimeException("Monitors can't be added once the Watchdog is running");
- }
- mMonitorChecker.addMonitor(monitor);
- }
- }
So a class that implements the interface only needs to call addMonitor. Let's look at how ActivityManagerService does it. Its source path is:
- /frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
- 2381 Watchdog.getInstance().addMonitor(this);
- 19655 /** In this method we try to acquire our lock to make sure that we have not deadlocked */
- 19656 public void monitor() {
- 19657 synchronized (this) { }
- 19658 }
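As shown, AMS implements the interface and registers itself with the Watchdog at line 2381. Its monitor method does nothing but synchronize on itself, to confirm that it is not deadlocked.
That is not much work, but it is enough: it lets an outside caller find out whether AMS is stuck.
Now that we know how a deadlock in another service is detected, let's see how the Watchdog's run method ties the mechanism together.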
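Why an empty synchronized block is enough can be seen in a small experiment. The sketch below is illustrative only: `FakeService`, `monitorReturnsWithin`, and `probeWhileLockHeld` are hypothetical names, and a second thread that simply holds the lock stands in for a deadlocked service.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockMonitorSketch {
    /** Hypothetical service whose state is guarded by its own intrinsic lock. */
    static final class FakeService {
        /** Same shape as the AMS monitor(): it returns only once the lock can be acquired. */
        void monitor() {
            synchronized (this) { }
        }
    }

    /** Returns true if service.monitor() finished within timeoutMillis. */
    static boolean monitorReturnsWithin(FakeService service, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Holds the service lock on another thread while probing, to simulate a stuck service. */
    static boolean probeWhileLockHeld(FakeService service, long timeoutMillis) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await(); // make sure the lock is really held before probing
            return monitorReturnsWithin(service, timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();
        }
    }
}
```

With the lock free, `monitorReturnsWithin` comes back almost immediately; with the lock held elsewhere, the probe blocks inside monitor() and the call times out, which is exactly the signal the Watchdog looks for.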
TAG: Watchdog.run
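The run method is an infinite loop: it repeatedly walks all HandlerCheckers, schedules a check on each, waits 30 seconds, and evaluates the resulting state. See the comments below for details: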
- 341 @Override
- 342 public void run() {
- 343 boolean waitedHalf = false;
- 344 while (true) {
- 345 final ArrayList<HandlerChecker> blockedCheckers;
- 346 final String subject;
- 347 final boolean allowRestart;
- 348 int debuggerWasConnected = 0;
- 349 synchronized (this) {
- 350 long timeout = CHECK_INTERVAL;
- 351 // Make sure we (re)spin the checkers that have become idle within
- 352 // this wait-and-check interval
- // Here we iterate over all HandlerCheckers, schedule a check on each, and record its start time
- 353 for (int i=0; i<mHandlerCheckers.size(); i++) {
- 354 HandlerChecker hc = mHandlerCheckers.get(i);
- 355 hc.scheduleCheckLocked();
- 356 }
- 357
- 358 if (debuggerWasConnected > 0) {
- 359 debuggerWasConnected--;
- 360 }
- 361
- 362 // NOTE: We use uptimeMillis() here because we do not want to increment the time we
- 363 // wait while asleep. If the device is asleep then the thing that we are waiting
- 364 // to timeout on is asleep as well and won't have a chance to run, causing a false
- 365 // positive on when to kill things.
- 366 long start = SystemClock.uptimeMillis();
- // Wait 30 seconds. uptimeMillis() is used so that time spent asleep is not counted; while the device sleeps, the system services sleep too
- 367 while (timeout > 0) {
- 368 if (Debug.isDebuggerConnected()) {
- 369 debuggerWasConnected = 2;
- 370 }
- 371 try {
- 372 wait(timeout);
- 373 } catch (InterruptedException e) {
- 374 Log.wtf(TAG, e);
- 375 }
- 376 if (Debug.isDebuggerConnected()) {
- 377 debuggerWasConnected = 2;
- 378 }
- 379 timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
- 380 }
- 381 // Evaluate the checkers' state: this iterates over all HandlerCheckers and takes the largest return value
- 382 final int waitState = evaluateCheckerCompletionLocked();
- // The largest value is one of four states. COMPLETED: the message has been handled and the thread is not blocked
- 383 if (waitState == COMPLETED) {
- 384 // The monitors have returned; reset
- 385 waitedHalf = false;
- 386 continue;
- // WAITING: the message has been pending for 0~29 seconds; keep going
- 387 } else if (waitState == WAITING) {
- 388 // still waiting but within their configured intervals; back off and recheck
- 389 continue;
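- // WAITED_HALF: the message has been pending for 30~59 seconds; the thread may be blocked, so dump the stack traces of the current process through ActivityManagerService.dumpStackTraces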
- 390 } else if (waitState == WAITED_HALF) {
- 391 if (!waitedHalf) {
- 392 // We've waited half the deadlock-detection interval. Pull a stack
- 393 // trace and wait another half.
- 394 ArrayList<Integer> pids = new ArrayList<Integer>();
- 395 pids.add(Process.myPid());
- 396 ActivityManagerService.dumpStackTraces(true, pids, null, null,
- 397 NATIVE_STACKS_OF_INTEREST);
- 398 waitedHalf = true;
- 399 }
- 400 continue;
- 401 }
- 402 // OVERDUE: the message has been pending for more than 60 seconds. Reaching this point means the 60-second timeout has been hit; everything below handles that case
- 403 // something is overdue!
- 404 blockedCheckers = getBlockedCheckersLocked();
- 405 subject = describeCheckersLocked(blockedCheckers);
- 406 allowRestart = mAllowRestart;
- 407 }
- 408
- 409 // If we got here, that means that the system is most likely hung.
- 410 // First collect stack traces from all threads of the system process.
- 411 // Then kill this process so that the system will restart.
- 412 EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
- 413
....... (saving of various records)
- // Only kill the process if the debugger is not attached.
- if (Debug.isDebuggerConnected()) {
- debuggerWasConnected = 2;
- }
- if (debuggerWasConnected >= 2) {
- Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
- } else if (debuggerWasConnected > 0) {
- Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
- } else if (!allowRestart) {
- Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
- } else {
- Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS:" + subject);
- for (int i=0; i<blockedCheckers.size(); i++) {
- Slog.w(TAG, blockedCheckers.get(i).getName() + "stack trace:");
- StackTraceElement[] stackTrace
- = blockedCheckers.get(i).getThread().getStackTrace();
- for (StackTraceElement element: stackTrace) {
- Slog.w(TAG, "at" + element);
- }
- }
- Slog.w(TAG, "*** GOODBYE!");
- Process.killProcess(Process.myPid());
- System.exit(10);
- }
- waitedHalf = false;
- }
- }
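The inner wait loop of run deserves a closer look: wait(timeout) can return early, so the loop recomputes the remaining time after every wakeup until the full interval has passed. A plain-Java sketch of that pattern follows. It is an illustration under assumptions: `WaitLoopSketch` and `waitFullInterval` are hypothetical names, and `System.nanoTime()` stands in for `SystemClock.uptimeMillis()`.

```java
/** Hypothetical helper showing the "wait, then recompute the remaining time" loop. */
final class WaitLoopSketch {
    /**
     * Blocks for at least intervalMillis, waiting again after any early wakeup.
     * Returns the elapsed time in milliseconds.
     */
    static long waitFullInterval(Object lock, long intervalMillis) {
        final long start = System.nanoTime();
        long elapsed = 0;
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // The exception clears the interrupt flag; like the loop above,
                    // this sketch simply keeps waiting for the rest of the interval.
                }
                elapsed = (System.nanoTime() - start) / 1_000_000L;
                remaining = intervalMillis - elapsed;
            }
        }
        return elapsed;
    }
}
```

Calling `WaitLoopSketch.waitFullInterval(new Object(), 150)` returns only after at least 150 ms, however many times the wait is cut short.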
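In the run method, reaching line 412 means the Watchdog is preparing to restart the system.
It does the following:
Writes to the EventLog
Appends the stack traces of system_server and 3 native processes to the trace output
Dumps the kernel stacks
Dumps all blocked threads
Writes a dropbox entry
Checks for a debugger; if none is attached, kills the process to restart the system and logs: *** WATCHDOG KILLING SYSTEM PROCESS: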
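The decision logic of one pass through the run loop can be condensed into two small functions. This is a sketch under assumptions, not the framework code: the class `WatchdogLoopSketch`, the `Action` enum, and the numeric state values are hypothetical, and the debugger bookkeeping is reduced to a single boolean.

```java
/** Hypothetical condensed model of one pass through the Watchdog loop. */
final class WatchdogLoopSketch {
    // Assumed values; a larger number means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    enum Action { CONTINUE, DUMP_STACKS, LOG_ONLY, KILL }

    /** The overall state is the worst (largest) state reported by any checker. */
    static int evaluate(int... checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** Decides what the loop does for a given overall state. */
    static Action nextAction(int waitState, boolean waitedHalf,
                             boolean debuggerConnected, boolean allowRestart) {
        if (waitState == COMPLETED || waitState == WAITING) {
            return Action.CONTINUE;
        }
        if (waitState == WAITED_HALF) {
            // Stack traces are dumped only the first time the halfway mark is seen.
            return waitedHalf ? Action.CONTINUE : Action.DUMP_STACKS;
        }
        // OVERDUE: kill the process unless a debugger is attached or restarts are disabled.
        return (debuggerConnected || !allowRestart) ? Action.LOG_ONLY : Action.KILL;
    }
}
```

Taking the maximum means a single stuck checker is enough to push the whole system to OVERDUE, even when every other thread is healthy.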
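3: Summary
That is how the Android system-level Watchdog works. It is well designed; had I been designing it, using a lock through Monitor to detect deadlock would not have occurred to me.
To sum up:
The Watchdog is a thread that checks whether the system services are running normally and have not deadlocked
HandlerChecker checks a Handler and its monitors
A monitor detects deadlock by acquiring a lock
A 30-second timeout produces log output; a 60-second timeout restarts the system (unless a debugger is attached)
Source: http://www.jianshu.com/p/5c18c4e8c826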