
Android S Watchdog: Principles and Source Code Analysis

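Note: this analysis is based on Android S. If you spot a mistake, please point it out in the comments so that we can all learn from it.

Source reference: Android source code

Note: code without an explicit source path comes from Watchdog.java. The files referenced are: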
frameworks/base/services/java/com/android/server/SystemServer.java
frameworks/base/services/core/java/com/android/server/Watchdog.java
frameworks/native/libs/binder/IPCThreadState.cpp
           

watchdog.java

  • I. Overview
  • II. Watchdog initialization
    • 2.1 startBootstrapServices
    • 2.2 getInstance (obtaining the instance)
    • 2.3 The Watchdog class and its constructor, which registers the monitored services and threads
      • 2.3.1 HandlerChecker
      • 2.3.2 addMonitor
      • 2.3.3 BinderThreadMonitor
        • 2.3.3.1 blockUntilThreadAvailable
    • 2.4 init
      • 2.4.1 RebootRequestReceiver
      • 2.4.2 rebootSystem
  • III. Watchdog detection mechanism
    • 3.1 run()
    • 3.2 scheduleCheckLocked()
    • 3.3 evaluateCheckerCompletionLocked()
    • 3.4 getCompletionStateLocked()
  • IV. Watchdog handling flow
    • 4.1 run()
      • 4.1.1 getBlockedCheckersLocked()
      • 4.1.2 describeCheckersLocked()
      • 4.1.3 describeBlockedStateLocked()
    • 4.2 AMS.dumpStackTraces()
    • 4.3 OS.processCpuTracker()
      • 4.3.1 WD.printCurrentState()
    • 4.4 WD.doSysRq()
    • 4.5 dropBox()
    • 4.6 killProcess()
  • V. Summary
    • 5.1 Two ways to be monitored by the watchdog
    • 5.2 Cases where the watchdog fires but system_server is not killed
    • 5.3 Monitoring Handler threads
    • 5.4 Monitoring synchronization locks
    • 5.5 Output

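I. Overview

Android devices have a hardware watchdog that periodically checks whether critical hardware is working. Similarly, the framework layer has a software Watchdog that periodically checks whether critical system services have deadlocked. Its main job is to determine whether core system services and important threads are stuck in a blocked state.

In the framework, the Watchdog performs two kinds of checks:

  • Monitor Checker: checks services that registered a monitor() callback, typically to detect a lock that is held for too long;
  • Looper Checker: checks Handler threads whose message processing has stalled.

II. Watchdog initialization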

2.1 startBootstrapServices

frameworks/base/services/java/com/android/server/SystemServer.java

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
t.traceBegin("startBootstrapServices");
// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
t.traceBegin("StartWatchdog");
// Create the watchdog as part of the bootstrap services
final Watchdog watchdog = Watchdog.getInstance();//793
watchdog.start();//794
t.traceEnd();

// Complete the watchdog setup with an ActivityManager instance and listen for reboots
// Do this only after the ActivityManagerService is properly started as a system process
// Registers the reboot broadcast receiver
t.traceBegin("InitWatchdog");
watchdog.init(mSystemContext, mActivityManagerService);//1000
t.traceEnd();
           

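Summary:

The watchdog is initialized while the system_server process starts up. The main steps are:

  • create the Watchdog object, which itself extends Thread;
  • call start() to begin working;
  • register the reboot broadcast receiver.

2.2 getInstance (obtaining the instance)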

frameworks/base/services/core/java/com/android/server/Watchdog.java

public static Watchdog getInstance() {//315
if (sWatchdog == null) {
// Singleton: create the instance on first use [see section 2.3]
sWatchdog = new Watchdog();
}
return sWatchdog;
}

2.3 The Watchdog class and its constructor, which registers the monitored services and threads

frameworks/base/services/core/java/com/android/server/Watchdog.java

/** This class calls its monitor every minute. Killing this process if they don't return **/
// Watchdog is itself a thread
public class Watchdog extends Thread {
/* This handler will be used to post message back onto the main thread */
// List of all HandlerChecker objects; for the HandlerChecker type see [section 2.3.1]
private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();//147
...
private Watchdog() {//323
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
// Add the foreground thread to the list of checkers
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
// Add the main thread to the list of checkers
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// And the animation thread.
mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
"animation thread", DEFAULT_TIMEOUT));
// And the surface animation thread.
mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
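// Initialize the binder thread monitor. It detects whether the number of executing
// binder threads has reached the process limit (mMaxThreads).
// When the limit is reached, the monitor blocks until mThreadCountDecrement is signalled.
// See [section 2.3.2]
addMonitor(new BinderThreadMonitor());//356 see [section 2.3.3]
mInterestingJavaPids.add(Process.myPid());
// See the notes on DEFAULT_TIMEOUT.
// assert is a Java keyword, not a Linux macro. DB is initialized to false, so the
// assertion checks that DEFAULT_TIMEOUT > WRAPPED_PID_TIMEOUT_MILLIS.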
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
}
           

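Summary: Watchdog extends Thread, and the thread it creates is named "watchdog". The mHandlerCheckers list holds the HandlerChecker objects for the foreground (fg), main, ui, io, display, animation, and surface animation threads. The constructor also sets up the binder thread monitor.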

2.3.1 HandlerChecker

frameworks/base/services/core/java/com/android/server/Watchdog.java

/**
* Used for checking status of handle threads and scheduling monitor callbacks.
*/
public final class HandlerChecker implements Runnable {//158
private final Handler mHandler;// Handler of the monitored thread
private final String mName;// checker name
private final long mWaitMax;// maximum time to wait for a check to complete
// Monitors that are run during a check
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
// Monitors waiting to be merged into mMonitors
private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
private boolean mCompleted;// completion state; set to false when a check starts
private Monitor mCurrentMonitor;// the Monitor currently being checked
private long mStartTime;// start time of the check
private int mPauseCount;// pause count; if greater than 0, the HandlerChecker is paused

HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
}
           

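Summary: a HandlerChecker stores the Handler, the name of the thread being checked, and the maximum wait time.

HandlerChecker is a Runnable, so its run() method is the core of the class.

Reference [1]: the blog post "implements Runnable vs. extends Thread in Java" explains both approaches in detail. This code uses Runnable.

Runnable has two advantages here: a class that implements Runnable can still extend another class, and several threads can share one Runnable object, which is useful when a group of threads needs access to the same resource.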
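The difference can be made concrete with a small standalone sketch. It is not taken from Watchdog.java; BaseTask and CountingTask are hypothetical names. It shows a class that extends another class while implementing Runnable, with one instance shared by several threads:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical base class: a Runnable implementation is still free to extend it.
class BaseTask {
    protected final String name;

    BaseTask(String name) {
        this.name = name;
    }
}

// One CountingTask instance can be handed to many threads, so they share its state.
class CountingTask extends BaseTask implements Runnable {
    private final AtomicInteger runs = new AtomicInteger();

    CountingTask(String name) {
        super(name);
    }

    @Override
    public void run() {
        runs.incrementAndGet();
    }

    int runs() {
        return runs.get();
    }

    // Runs the same task object on `threads` threads and returns how often run() executed.
    static int runShared(int threads) {
        CountingTask task = new CountingTask("shared");
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task, "worker-" + i);
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return task.runs();
    }
}
```

With extends Thread, each thread would be its own object and the class could not extend anything else, which is why HandlerChecker, a plain Runnable posted to an existing Handler, fits the first model.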

2.3.2 addMonitor

frameworks/base/services/core/java/com/android/server/Watchdog.java

public class Watchdog extends Thread {
    public final class HandlerChecker implements Runnable {
        ...
        void addMonitorLocked(Monitor monitor) {//176
            // We don't want to update mMonitors when the Handler is in the middle of checking
            // all monitors. We will update mMonitors on the next schedule if it is safe
            // For the constructor call above, this queues the BinderThreadMonitor in mMonitorQueue
            mMonitorQueue.add(monitor);
        }//addMonitorLocked_end
    }//HandlerChecker_end

    // addMonitor is a method of Watchdog itself, not of HandlerChecker
    public void addMonitor(Monitor monitor) {//419
        synchronized (this) {
            // mMonitorChecker is of type HandlerChecker
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }//addMonitor_end
}//Watchdog_end
           

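Note: addMonitorLocked adds the monitor to the HandlerChecker's mMonitorQueue member. In the constructor call above, the object being added is the BinderThreadMonitor.

addMonitor(): used to monitor services that implement the Watchdog.Monitor interface. A timeout here means either that the "android.fg" thread is slow at processing messages or that a monitor cannot acquire its lock.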
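The deferred-update idea behind mMonitorQueue can be sketched outside the framework. The class below is a simplified model, not the real HandlerChecker: DeferredMonitorList, beginCheck and finishCheck are made-up names, and a plain Runnable stands in for Watchdog.Monitor.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of HandlerChecker's two lists; not the framework class.
class DeferredMonitorList {
    private final List<Runnable> monitors = new ArrayList<>();     // iterated during a check
    private final List<Runnable> monitorQueue = new ArrayList<>(); // pending additions
    private boolean completed = true;                              // true: no check in flight

    // Mirrors addMonitorLocked(): new monitors only ever go to the pending queue.
    synchronized void addMonitor(Runnable monitor) {
        monitorQueue.add(monitor);
    }

    // Mirrors the start of scheduleCheckLocked(): merge the queue only when idle.
    synchronized void beginCheck() {
        if (completed) {
            monitors.addAll(monitorQueue);
            monitorQueue.clear();
        }
        completed = false;
    }

    // Mirrors the end of run(): the check is over, so the next schedule may merge again.
    synchronized void finishCheck() {
        completed = true;
    }

    synchronized int activeCount() {
        return monitors.size();
    }

    synchronized int pendingCount() {
        return monitorQueue.size();
    }
}
```

Because the list that run() iterates is only modified between checks, a monitor registered while a check is in flight simply takes effect one cycle later.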

2.3.3 BinderThreadMonitor

/** Monitor for checking the availability of binder threads. The monitor will block until
* there is a binder thread available to process in coming IPCs to make sure other processes
* can still communicate with the service.
*/
private static final class BinderThreadMonitor implements Watchdog.Monitor {//304
@Override
public void monitor() {
Binder.blockUntilThreadAvailable();// see [section 2.3.3.1]
}
}

Note: blockUntilThreadAvailable ultimately calls into IPCThreadState and waits until a binder thread is free.
           

2.3.3.1 blockUntilThreadAvailable

frameworks/native/libs/binder/IPCThreadState.cpp

void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
mProcess->mWaitingForThreads++;
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
static_cast<unsigned long>(mProcess->mMaxThreads));
// Wait until the number of executing binder threads drops below the process limit (mMaxThreads)
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
mProcess->mWaitingForThreads--;
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
           

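Note: following the call chain of addMonitor(new BinderThreadMonitor());//356 shows that the binder thread monitor is added to the HandlerChecker of the android.fg thread (mMonitorChecker), which then checks whether the binder threads are working normally.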
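For readers more familiar with Java than with pthreads, the wait in blockUntilThreadAvailable can be approximated as follows. This is only an analogy under stated assumptions: ThreadBudget and its methods are hypothetical, and a timeout is added so that the sketch always terminates, whereas the native code waits without one.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Java model of the condition wait in IPCThreadState::blockUntilThreadAvailable().
class ThreadBudget {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition countDecremented = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadBudget(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A worker starts handling a request.
    void beginWork() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    // A worker finishes and wakes anyone waiting for a free thread.
    void endWork() {
        lock.lock();
        try {
            executing--;
            countDecremented.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Returns true once executing < maxThreads, false if timeoutMillis elapses first.
    boolean awaitAvailable(long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (executing >= maxThreads) {
                if (nanos <= 0) {
                    return false;
                }
                try {
                    nanos = countDecremented.awaitNanos(nanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

If every worker is busy and none ever finishes, the wait never returns in the untimed native version. That is exactly the situation the watchdog wants to notice: monitor() stays blocked on the android.fg thread until the checker becomes overdue.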

2.4 init

frameworks/base/services/core/java/com/android/server/Watchdog.java

/**
* Registers a {@link BroadcastReceiver} to listen to reboot broadcasts and trigger reboot.
* Should be called during boot after the ActivityManagerService is up and registered
* as a system service so it can handle registration of a {@link BroadcastReceiver}.
*/
public void init(Context context, ActivityManagerService activity) {//370
mActivity = activity;
// Register the reboot broadcast receiver [see section 2.4.1]
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
}
           

2.4.1 RebootRequestReceiver

final class RebootRequestReceiver extends BroadcastReceiver {//289
@Override
public void onReceive(Context c, Intent intent) {
if (intent.getIntExtra("nowait", 0) != 0) {
// [see section 2.4.2]
rebootSystem("Received ACTION_REBOOT broadcast");
return;
}
Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
}
}
           

2.4.2 rebootSystem

/**
* Perform a full reboot of the system.
*/
void rebootSystem(String reason) {//484
Slog.i(TAG, "Rebooting system because: " + reason);
IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
try {
// Perform the reboot through the PowerManager
pms.reboot(false, reason, false);
} catch (RemoteException ex) {
}
}
           

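Note: the framework-level reboot is ultimately carried out by PowerManagerService. The reboot flow itself will be covered separately.

III. Watchdog detection mechanism

Calling Watchdog.getInstance().start() enters the run() method of the watchdog thread. The method has two parts:

First half [section 3.1]: detects whether a timeout has occurred;

Second half [section 4.1]: outputs diagnostic information once a timeout has occurred.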

3.1 run()

@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final List<HandlerChecker> blockedCheckers;
// Reason for the timeout, used in the log output
final String subject;
// Whether a restart is allowed; defaults to true and can be changed through Watchdog.setAllowRestart
final boolean allowRestart;
// Countdown that remembers whether a debugger was attached recently
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;//CHECK_INTERVAL=30s
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
// Step 1: every 30s, schedule a check on every HandlerChecker
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// Schedule the check on each checker; each one records its mStartTime [see section 3.2]
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
// In short: uptimeMillis() does not advance while the device is asleep, so time spent
// asleep cannot lead to a false restart.

// Step 2: wait 30s for the checks to complete
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);// an interrupt is caught and logged, then the wait continues
// Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
// Evaluate the state of the checkers [see section 3.3]
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// First time the half-way state is reached
Slog.i(TAG, "WAITED_HALF");
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
// Dump the traces of system_server and the interesting native processes [see section 4.2]
ActivityManagerService.dumpStackTraces(pids, null, null,
getInterestingNativePids(), null);
waitedHalf = true;
}
continue;
}
// something is overdue!
// Reaching this point means the watchdog has timed out [see section 4.1]
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
           

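What this method does:

1. Calls scheduleCheckLocked() on every checker:

  • if the checker has no monitors (true for every thread except android.fg) and its looper is polling, mCompleted is set to true;
  • if the previous check has not finished yet, it returns immediately.

2. Waits 30s, then calls evaluateCheckerCompletionLocked() to evaluate the state of the checkers;

3. Acts on waitState:

  • COMPLETED or WAITING: everything is normal and the loop continues;
  • WAITED_HALF (blocked for more than 30s) for the first time: dump the traces of system_server and the interesting native processes;
  • OVERDUE: output more information and enter the handling flow.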
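Step 2 relies on a small idiom: wait() may return early, so the loop recomputes the remaining time from a start timestamp until the full interval has passed. A minimal sketch of that idiom follows. IntervalWaiter is a hypothetical class, and System.nanoTime() is used here as a stand-in for SystemClock.uptimeMillis(), which is only available on Android and differs in how it treats device sleep.

```java
// Sketch of the "wait the full interval even after early wake-ups" loop in run().
class IntervalWaiter {
    private final Object lock = new Object();

    // Blocks for at least intervalMillis and returns the measured elapsed milliseconds.
    long waitFullInterval(long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining); // may return early (notify or spurious wake-up)
                } catch (InterruptedException e) {
                    // Watchdog logs the exception and keeps waiting; this sketch just continues.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000L;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    // Wakes the waiter early, to show that the loop goes back to waiting.
    void poke() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}
```

Measuring against a fixed start time, rather than restarting a fresh 30s wait after each wake-up, keeps the check period stable no matter how often the thread is woken.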

3.2 scheduleCheckLocked()

public final class HandlerChecker implements Runnable {
...
public void scheduleCheckLocked() {//182
if (mCompleted) {
// Safe to update monitors in queue, Handler is not in the middle of work
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())|| (mPauseCount > 0)) {
// Don't schedule until after resume OR
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if we have no monitors since those would need to
// be executed at this point.
mCompleted = true;// the target looper is polling (or the checker is paused), so return
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
// Record the current time
mStartTime = SystemClock.uptimeMillis();
// Post this checker at the front of the message queue; see run() below
mHandler.postAtFrontOfQueue(this);//208
}
...
@Override
public void run() {//247
// Once we get here, we ensure that mMonitors does not change even if we call
// #addMonitorLocked because we first add the new monitors to mMonitorQueue and
// move them to mMonitors on the next schedule when mCompleted is true, at which
// point we have completed execution of this method.
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
// Call back into the monitor() method of the specific service
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
...
}
           

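What this method does:

scheduleCheckLocked() posts the HandlerChecker to the front of the message queue of the monitored thread, so that this thread executes HandlerChecker.run(). run() calls each monitor() and sets mCompleted = true when it finishes. If the message currently being handled keeps the thread busy, so that run() never gets a chance to execute or finish, the watchdog fires.

On line 208, postAtFrontOfQueue(this) takes a Runnable; through the message mechanism, the looper eventually calls back HandlerChecker.run(),

which iterates over all registered monitors. Each service implements the monitor() method of that interface.

Possible pitfalls: if other messages keep being posted with postAtFrontOfQueue(), the checker may not get a chance to run; or each monitor takes a little time and together they add up to more than 1 minute. Both cases produce unusual watchdogs.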
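The Looper Checker principle, where a check only completes if the watched thread gets around to running it, can be reproduced with plain java.util.concurrent. In this sketch a single-thread executor plays the role of the Handler thread. LooperCheckSketch is a hypothetical class, and execute() only appends to the queue, so it is a stand-in for postAtFrontOfQueue() rather than an equivalent.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Model of a HandlerChecker: the check completes only if the watched thread runs it.
class LooperCheckSketch {
    private volatile boolean completed = true;

    // Stand-in for mHandler.postAtFrontOfQueue(this).
    void scheduleCheck(ExecutorService watchedThread) {
        completed = false;
        watchedThread.execute(() -> completed = true);
    }

    boolean isCompleted() {
        return completed;
    }

    // Returns {completed while the thread is blocked, completed after it is released}.
    static boolean[] demo() {
        ExecutorService watched = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        try {
            // A long-running message occupies the watched thread.
            watched.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // shutdownNow() in the finally block may interrupt this task
                }
            });
            LooperCheckSketch checker = new LooperCheckSketch();
            checker.scheduleCheck(watched);
            boolean whileBlocked = checker.isCompleted();
            release.countDown();              // the stuck message finishes
            watched.execute(done::countDown); // queued behind the check
            boolean finished;
            try {
                finished = done.await(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                finished = false;
            }
            return new boolean[] { whileBlocked, finished && checker.isCompleted() };
        } finally {
            watched.shutdownNow();
        }
    }
}
```

While the first task holds the thread, the check sits in the queue and completed stays false, which is the state the watchdog thread observes when it evaluates the checkers.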

3.3 evaluateCheckerCompletionLocked()

private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// [see section 3.4]
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
           

Summary: returns the largest (worst) wait state among all checkers in mHandlerCheckers.

3.4 getCompletionStateLocked()

public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
// mWaitMax defaults to 60s
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
           

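Summary: the state is derived from the elapsed time:

  • COMPLETED=0: the check has completed;
  • WAITING=1: waited less than half of DEFAULT_TIMEOUT, i.e. less than 30s;
  • WAITED_HALF=2: waited between 30s and 60s;
  • OVERDUE=3: waited 60s or longer.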
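The two methods from sections 3.3 and 3.4 reduce to a pure function plus a maximum, which makes them easy to try out in isolation. The sketch below restates them outside the framework; CompletionState, of and worst are illustrative names, while the constant values follow the list above.

```java
// Standalone restatement of getCompletionStateLocked() and evaluateCheckerCompletionLocked().
class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // State of one checker, given how long its current check has been in flight.
    static int of(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // The worst checker decides the overall state, since the constants are ordered by severity.
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

For example, with a 60s limit a checker that has been waiting 45s is WAITED_HALF, and a single OVERDUE checker makes the overall result OVERDUE even if every other checker has completed.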

IV. Watchdog handling flow

4.1 run()

@Override
public void run() {//573
boolean waitedHalf = false;
while (true) {
...
// something is overdue!
// Get the blocked checkers [see section 4.1.1]
blockedCheckers = getBlockedCheckersLocked();//637
// Build the description [see section 4.1.2]
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
report.append(MemoryPressureUtil.currentPsiState());

ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
// Second dump of the Java and native stack traces [see section 4.2]
final File stack = ActivityManagerService.dumpStackTraces(
pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
tracesFileException);

// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
// Sleep 5s so that the stack traces are fully written to the file
SystemClock.sleep(5000);

// Collect the CPU usage around the hang [see section 4.3]
processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
// [see section 4.4]
doSysRq('w');
doSysRq('l');

// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
// Write the dropbox entry [see section 4.5]
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
// If a watched thread hangs before init() is called, we don't have a
// valid mActivity. So we can't log the error to dropbox.
if (mActivity != null) {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null, null,
subject, report.toString(), stack, null);
}
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
subject);
}
};
dropboxThread.start();

try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}

IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
// Report the blocked state to the activity controller
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;// an IActivityController can make the watchdog keep waiting instead of killing
}
}
} catch (RemoteException e) {
}
}

// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
// Dump the stack of each blocked thread
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");

if (!Build.IS_USER && isCrashLoopFound()
&& !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
breakCrashLoop();// break out of a crash loop
}
// Kill the system_server process [see section 4.6]
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
...
}
           

Summary:

When the watchdog detects a hang, it collects information through:

  • AMS.dumpStackTraces: dumps the stacks of Java and native processes
  • ProcessCpuTracker: records the CPU usage around the hang
  • doSysRq
  • dropBox

4.1.1 getBlockedCheckersLocked()

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
// Iterate over all checkers
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// Collect every checker that has not completed and is overdue
if (hc.isOverdueLocked()) {
checkers.add(hc);
}
}
return checkers;
}
           

4.1.2 describeCheckersLocked()

private String describeCheckersLocked(List<HandlerChecker> checkers) {//513
StringBuilder builder = new StringBuilder(128);
for (int i=0; i<checkers.size(); i++) {
if (builder.length() > 0) {
builder.append(", ");
}
// Append the description of each checker
builder.append(checkers.get(i).describeBlockedStateLocked());
}
return builder.toString();
}


String describeBlockedStateLocked() {//237
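// No monitor is running: the handler thread itself is blocked
if (mCurrentMonitor == null) {
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else {
// A monitor is running (only the foreground thread's checker has monitors)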
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
           

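Note: every handler thread or monitor that has been running for more than 1 minute is recorded.

  • "Blocked in handler" means the thread has spent more than 1 minute processing its current message;
  • "Blocked in monitor" means the thread has spent more than 1 minute processing its current message, or the monitor cannot acquire its lock.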

4.1.3 describeBlockedStateLocked()

String describeBlockedStateLocked() { 
if (mCurrentMonitor == null) { 
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")"; 
} else { 
return "Blocked in monitor " + mCurrentMonitor.getClass().getName() + " on " + mName + " (" + getThread().getName() + ")"; 
} 
}
           

Note: describeBlockedStateLocked returns a description of the blocked state: which HandlerChecker is blocked, or which monitor the handler is currently executing.
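To see what the resulting subject line looks like, the string building can be reproduced in a few lines. This is a standalone sketch: BlockedSubject is a hypothetical helper, the parameters replace the fields mName, getThread().getName() and mCurrentMonitor, and the example values in the usage note are illustrative.

```java
import java.util.List;

// Rebuilds the watchdog subject string from plain values.
class BlockedSubject {
    // monitorClassName == null models mCurrentMonitor == null.
    static String describe(String checkerName, String threadName, String monitorClassName) {
        if (monitorClassName == null) {
            return "Blocked in handler on " + checkerName + " (" + threadName + ")";
        }
        return "Blocked in monitor " + monitorClassName
                + " on " + checkerName + " (" + threadName + ")";
    }

    // Same separator as describeCheckersLocked().
    static String join(List<String> descriptions) {
        return String.join(", ", descriptions);
    }
}
```

Calling describe("main thread", "main", null) yields "Blocked in handler on main thread (main)". This is the text that ends up in the event log and in the "*** WATCHDOG KILLING SYSTEM PROCESS" message, so it is usually the first thing to search for in a log.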

4.2 AMS.dumpStackTraces()

frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

public static File dumpStackTraces(ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids,
ArrayList<Integer> nativePids, StringWriter logExceptionCreatingFile) {
return dumpStackTraces(firstPids, processCpuTracker, lastPids, nativePids,
logExceptionCreatingFile, null);
}

...

// Trace directory given by ANR_TRACE_DIR (/data/anr)
final File tracesDir = new File(ANR_TRACE_DIR);//3997
// Each set of ANR traces is written to a separate file and dumpstate will process
// all such files and add them to a captured bug report if they're recent enough.
maybePruneOldTraces(tracesDir);

// NOTE: We should consider creating the file in native code atomically once we've
// gotten rid of the old scheme of dumping and lot of the code that deals with paths
// can be removed.
File tracesFile;
try {
// Create the traces file
tracesFile = createAnrDumpFile(tracesDir);
} catch (IOException e) {
Slog.w(TAG, "Exception creating ANR dump file:", e);
if (logExceptionCreatingFile != null) {
logExceptionCreatingFile.append("----- Exception creating ANR dump file -----\n");
e.printStackTrace(new PrintWriter(logExceptionCreatingFile));
}
return null;
}

// Write the traces
Pair<Long, Long> offsets = dumpStackTraces(
tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids);
if (firstPidOffsets != null) {
if (offsets == null) {
firstPidOffsets[0] = firstPidOffsets[1] = -1;
} else {
firstPidOffsets[0] = offsets.first; // Start offset to the ANR trace file
firstPidOffsets[1] = offsets.second; // End offset to the ANR trace file
}
}
return tracesFile;
}
           

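Summary: this dumps the traces of system_server and of the native processes of interest. In Android 6.0 these were the three native processes mediaserver, sdcard, and surfaceflinger.

(This needs further analysis: in Android S the native pids are supplied by getInterestingNativePids(), and it has not been confirmed here which processes that list contains.)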

4.3 OS.processCpuTracker()

frameworks/base/core/java/com/android/internal/os/ProcessCpuTracker.java

//WD.processCpuTracker()
processCpuTracker.update();//662
report.append(processCpuTracker.printCurrentState(anrTime));//【见小节4.3.1】
report.append(tracesFileException.getBuffer());

public class ProcessCpuTracker {//51
private static final String TAG = "ProcessCpuTracker";
private static final boolean DEBUG = false;
private static final boolean localLOGV = DEBUG || false;
...
@UnsupportedAppUsage  // restricts access to this framework member from outside apps
// Constructor of ProcessCpuTracker
public ProcessCpuTracker(boolean includeThreads) {//314
mIncludeThreads = includeThreads;
long jiffyHz = Os.sysconf(OsConstants._SC_CLK_TCK);
mJiffyMillis = 1000/jiffyHz;
}
...
}
           

On the UnsupportedAppUsage annotation in Android R: https://blog.csdn.net/shanbl_linux_android/article/details/106094124

4.3.1 WD.printCurrentState()

final public String printCurrentState(long now) {
final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
buildWorkingProcs();
StringWriter sw = new StringWriter();
PrintWriter pw = new FastPrintWriter(sw, false, 1024);
pw.print("CPU usage from ");
if (now > mLastSampleTime) {
pw.print(now-mLastSampleTime);
pw.print("ms to ");
pw.print(now-mCurrentSampleTime);
pw.print("ms ago");
} else {
pw.print(mLastSampleTime-now);
pw.print("ms to ");
pw.print(mCurrentSampleTime-now);
pw.print("ms later");
}
pw.print(" (");
pw.print(sdf.format(new Date(mLastSampleWallTime)));
pw.print(" to ");
pw.print(sdf.format(new Date(mCurrentSampleWallTime)));
pw.print(")");
long sampleTime = mCurrentSampleTime - mLastSampleTime;
long sampleRealTime = mCurrentSampleRealTime - mLastSampleRealTime;
long percAwake = sampleRealTime > 0 ? ((sampleTime*100) / sampleRealTime) : 0;
if (percAwake != 100) {
pw.print(" with ");
pw.print(percAwake);
pw.print("% awake");
}
pw.println(":");
final int totalTime = mRelUserTime + mRelSystemTime + mRelIoWaitTime
+ mRelIrqTime + mRelSoftIrqTime + mRelIdleTime;
if (DEBUG) Slog.i(TAG, "totalTime " + totalTime + " over sample time "
+ (mCurrentSampleTime-mLastSampleTime));
int N = mWorkingProcs.size();
for (int i=0; i<N; i++) {
Stats st = mWorkingProcs.get(i);
printProcessCPU(pw, st.added ? " +" : (st.removed ? " -": " "),
st.pid, st.name, (int)st.rel_uptime,
st.rel_utime, st.rel_stime, 0, 0, 0, st.rel_minfaults, st.rel_majfaults);
if (!st.removed && st.workingThreads != null) {
int M = st.workingThreads.size();
for (int j=0; j<M; j++) {
Stats tst = st.workingThreads.get(j);
printProcessCPU(pw,
tst.added ? " +" : (tst.removed ? " -": " "),
tst.pid, tst.name, (int)st.rel_uptime,
tst.rel_utime, tst.rel_stime, 0, 0, 0, 0, 0);
}
}
}
printProcessCPU(pw, "", -1, "TOTAL", totalTime, mRelUserTime, mRelSystemTime,
mRelIoWaitTime, mRelIrqTime, mRelSoftIrqTime, 0, 0);
pw.flush();
return sw.toString();
}
           

4.4 WD.doSysRq()

private void doSysRq(char c) {//736
try {
FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
sysrq_trigger.write(c);
sysrq_trigger.close();
} catch (IOException e) {
Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
}
}
           

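Summary: writing a character to the node /proc/sysrq-trigger makes the kernel act on it. Here 'w' dumps all blocked threads and 'l' writes the backtraces of all CPUs to the kernel log.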
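The same write can be expressed a little more compactly with try-with-resources, which closes the writer even when the write fails. The following is a sketch rather than the framework code: SysRqSketch is a hypothetical class, and the path is a parameter so that the method can be tried against an ordinary file instead of /proc/sysrq-trigger, which normally requires root.

```java
import java.io.FileWriter;
import java.io.IOException;

// Sketch of doSysRq() with the target path made configurable.
class SysRqSketch {
    // Writes a single command character to the given path and reports success.
    static boolean trigger(String path, char command) {
        try (FileWriter writer = new FileWriter(path)) {
            writer.write(command);
            return true;
        } catch (IOException e) {
            // The real method logs the failure with Slog.w and carries on.
            return false;
        }
    }
}
```

On a device the call would be trigger("/proc/sysrq-trigger", 'w'). Failure is deliberately non-fatal, since the watchdog still has to finish its remaining steps when the node is not writable.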

4.5 dropBox()

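The dropbox is covered in detail in the separate article on the dropBox source code; its files are written to /data/system/dropbox. When the watchdog fires, the dropbox entry has the tag system_server_watchdog and contains the traces together with the blocked-state information.

4.6 killProcess()

Process.killProcess is covered in detail in the article on how killing a process works; it kills the target process by sending it signal 9.

Killing system_server causes the zygote process to kill itself, which in turn makes init restart zygote. This is what shows up as a framework restart on the phone.

V. Summary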

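The watchdog is a thread named "watchdog" that is started with the bootstrap services of the system_server process:

  • How it works: when something has been blocked for more than 1 minute, the watchdog fires and kills system_server, which restarts the framework.
  • mHandlerCheckers is the list of all HandlerChecker objects, including the handlers of the foreground, main, ui, i/o, display, animation, and surface animation threads;
  • the mMonitors list of mMonitorChecker records all Monitors the watchdog is currently watching; all of these monitors run on the foreground thread.

5.1 Two ways to be monitored by the watchdog

  1. addThread(): monitors a Handler thread, with a default timeout of 60s. A timeout here usually means that the corresponding handler thread is slow at processing messages;
  2. addMonitor(): monitors a service that implements the Watchdog.Monitor interface. A timeout here means either that the "android.fg" thread is slow at processing messages or that a monitor cannot acquire its lock.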

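5.2 Cases where the watchdog fires but system_server is not killed

  • monkey: an IActivityController is set and intercepts the systemNotResponding event, as monkey does;
  • hang: after running the am hang command, no restart happens;
  • debugger: while a debugger is attached, no restart happens.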
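These three exceptions correspond to the branches at the end of run() in section 4.1 and can be condensed into one predicate. The sketch below is a restatement for illustration only: KillDecision and shouldKill are hypothetical names, and a null controllerResult stands for "no IActivityController is set".

```java
// Condenses the final branches of Watchdog.run() into a single decision.
class KillDecision {
    // controllerResult: null when no controller is set, otherwise the value returned by
    // systemNotResponding() (1 = keep waiting, -1 = kill system).
    static boolean shouldKill(Integer controllerResult, int debuggerWasConnected,
            boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return false; // the controller (for example monkey) asked to keep waiting
        }
        if (debuggerWasConnected > 0) {
            return false; // a debugger is attached or was attached recently
        }
        return allowRestart; // false when restarts were disabled, as in the am hang case
    }
}
```

Only when all three checks pass does the real code log "*** WATCHDOG KILLING SYSTEM PROCESS" and call Process.killProcess().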

5.3 Monitoring Handler threads

Threads monitored by the Watchdog: by default DEFAULT_TIMEOUT=60s; it is 10s only when debugging, which makes potential ANR problems easier to find.

No.  Thread name                  Handler                                  Description               Timeout
1    main                         new Handler(Looper.getMainLooper())      main thread               60s
2    android.fg                   FgThread.getHandler()                    foreground thread         60s
3    android.ui                   UiThread.getHandler()                    UI thread                 60s
4    android.io                   IoThread.getHandler()                    I/O thread                60s
5    android.display              DisplayThread.getHandler()               display thread            60s
6    android.animation            AnimationThread.getHandler()             animation thread          60s
7    android.surface animation    SurfaceAnimationThread.getHandler()      surface animation thread  60s
8    ActivityManagerService       AMS.MainHandler                          AMS thread                60s
9    PowerManagerService          PMS.PowerManagerHandler                  PMS thread                60s
10   PackageManagerService        PKMS.PackageHandler                      PKMS thread               10min
11   PermissionManagerService
12   RollbackManagerServiceImpl

The watchdog currently monitors the 12 threads above in the system_server process:

  • for the first 9 threads, Looper message processing must not take longer than 1 minute;
  • for the PackageManager thread, processing must not take longer than 10 minutes.

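5.4 Monitoring synchronization locks

Every system service that the Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method. These monitors run on the android.fg thread.

The main classes implementing this interface (12) are: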

  • ActivityManagerService
  • WindowManagerService
  • InputManagerService
  • PowerManagerService
  • NetworkManagementService
  • MountService
  • NativeDaemonConnector
  • BinderThreadMonitor
  • MediaProjectionManagerService
  • MediaRouterService
  • MediaSessionService
  • TvRemoteService
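A typical monitor() implementation does nothing more than acquire and release the service's main lock: if some thread holds that lock forever, monitor() never returns and the foreground checker becomes overdue. The following self-contained sketch demonstrates the effect. LockProbeService and its nested Monitor interface are hypothetical stand-ins for a real service and for Watchdog.Monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical service whose monitor() probes the service lock.
class LockProbeService {
    interface Monitor {
        void monitor();
    }

    private final Object lock = new Object();

    // Acquiring and releasing the lock proves that nobody is holding it indefinitely.
    final Monitor probe = () -> {
        synchronized (lock) {
            // nothing to do: getting here is the whole check
        }
    };

    // Holds the lock on a helper thread for holdMillis and reports whether
    // monitor() managed to return within waitMillis.
    boolean probeReturnsWithin(long holdMillis, long waitMillis) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch probed = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    Thread.sleep(holdMillis);
                } catch (InterruptedException ignored) {
                    // releasing the lock early is fine for this sketch
                }
            }
        });
        Thread checker = new Thread(() -> {
            probe.monitor();
            probed.countDown();
        });
        holder.setDaemon(true);
        checker.setDaemon(true);
        try {
            holder.start();
            held.await(); // make sure the lock is taken before probing
            checker.start();
            return probed.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In the framework the role of the checker thread is played by android.fg, and the wait limit is the 60s mWaitMax of the foreground HandlerChecker.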

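5.5 Output

When a check stays blocked for 1 minute, the watchdog outputs:

  1. AMS.dumpStackTraces: the traces of system_server and the native processes of interest
    • this is called twice: first after 30s of blocking, then again after 1 minute;
  2. ProcessCpuTracker: the CPU usage around the hang (the code in section 4.3.1 prints "CPU usage from ...");
    • the nodes /proc/%d/task (list of all threads in a process) and /proc/%d/stack (kernel stack) belong to the kernel-stack dump of earlier Android releases and are not read by the code shown here
  3. doSysRq: makes the kernel dump all blocked threads and write the backtraces of all CPUs to the kernel log;
    • node /proc/sysrq-trigger
  4. dropBox: writes a file to /data/system/dropbox containing the traces and the blocked-state information
  5. system_server is killed, which makes zygote kill itself and thereby restarts the framework.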