
Android S Watchdog: Principles and Source Code Analysis

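Note: this analysis is based on Android S. If you find any mistakes, please point them out in the comments so that we can learn together.

Source code referenced in this article: the Android source tree.

Note: any code shown without a source path comes from Watchdog.java. The files involved are: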
frameworks/base/services/java/com/android/server/SystemServer.java
frameworks/base/services/core/java/com/android/server/Watchdog.java
frameworks/native/libs/binder/IPCThreadState.cpp
           

watchdog.java

  • 1. Overview
  • 2. Watchdog initialization
    • 2.1 startBootstrapServices
    • 2.2 getInstance: obtaining the instance
    • 2.3 The Watchdog class and its constructor: registering the watched services and threads
      • 2.3.1 HandlerChecker
      • 2.3.2 addMonitor
      • 2.3.3 BinderThreadMonitor
        • 2.3.3.1 blockUntilThreadAvailable
    • 2.4 init
      • 2.4.1 RebootRequestReceiver
      • 2.4.2 rebootSystem
  • 3. Watchdog detection mechanism
    • 3.1 run()
    • 3.2 scheduleCheckLocked()
    • 3.3 evaluateCheckerCompletionLocked()
    • 3.4 getCompletionStateLocked()
  • 4. Watchdog handling flow
    • 4.1 run()
      • 4.1.1 getBlockedCheckersLocked()
      • 4.1.2 describeCheckersLocked()
      • 4.1.3 describeBlockedStateLocked()
    • 4.2 AMS.dumpStackTraces()
    • 4.3 OS.processCpuTracker()
      • 4.3.1 WD.printCurrentState()
    • 4.4 WD.doSysRq()
    • 4.5 dropBox()
    • 4.6 killProcess()
  • 5. Summary
    • 5.1 Two ways to put something under watchdog supervision
    • 5.2 Cases in which the watchdog fires but system_server is not killed
    • 5.3 Watched Handler threads
    • 5.4 Monitoring of synchronization locks
    • 5.5 Output

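1. Overview

Android has a hardware watchdog that periodically checks whether critical hardware is working properly. Similarly, the framework layer has a software Watchdog that periodically checks whether critical system services have deadlocked. Its main job is to determine whether core system services and important threads are stuck in a blocked state.

In the framework, Watchdog performs two kinds of checks:

  • Monitor Checker: calls the monitor() method of registered services to detect locks that cannot be acquired;
  • Looper Checker: detects handler threads whose message queue has stopped making progress.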

2. Watchdog initialization

2.1 startBootstrapServices

frameworks/base/services/java/com/android/server/SystemServer.java

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
t.traceBegin("startBootstrapServices");
// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
t.traceBegin("StartWatchdog");
// The watchdog is created here, among the bootstrap services
final Watchdog watchdog = Watchdog.getInstance();//793
watchdog.start();//794
t.traceEnd();

// Complete the watchdog setup with an ActivityManager instance and listen for reboots
// Do this only after the ActivityManagerService is properly started as a system process
// init() registers the receiver for the reboot broadcast
t.traceBegin("InitWatchdog");
watchdog.init(mSystemContext, mActivityManagerService);//1000
t.traceEnd();
           

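Summary:

The watchdog is initialized while the system_server process starts up. The main steps are:

  • create the Watchdog object, which itself extends Thread;
  • call start() so that it begins working;
  • register the reboot broadcast receiver.

2.2 getInstance: obtaining the instance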

frameworks/base/services/core/java/com/android/server/Watchdog.java

public static Watchdog getInstance() {//315
    if (sWatchdog == null) {
        // Singleton: create the instance on first use (see section 2.3)
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}

2.3 The Watchdog class and its constructor: registering the watched services and threads

frameworks/base/services/core/java/com/android/server/Watchdog.java

/** This class calls its monitor every minute. Killing this process if they don't return **/
// Watchdog is essentially a thread
public class Watchdog extends Thread {
/* This handler will be used to post message back onto the main thread */
// List of all HandlerChecker objects; for the HandlerChecker type see section 2.3.1
private final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();//147
...
private Watchdog() {//323
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
// The foreground thread's checker is added to the list first; it also serves as mMonitorChecker
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// And the animation thread.
mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
"animation thread", DEFAULT_TIMEOUT));
// And the surface animation thread.
mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
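// The binder thread monitor checks whether all binder threads of the process are busy,
// i.e. whether the number of executing threads has reached the limit (16 according to this analysis).
// If so, the monitor blocks until mThreadCountDecrement is signalled.
// See section 2.3.2 for addMonitor and section 2.3.3 for BinderThreadMonitor.
addMonitor(new BinderThreadMonitor());//356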
mInterestingJavaPids.add(Process.myPid());
// See the notes on DEFAULT_TIMEOUT.
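// assert is a Java keyword; the check only runs when assertions are enabled.
// DB is initialized to false, so the condition requires DEFAULT_TIMEOUT to exceed WRAPPED_PID_TIMEOUT_MILLIS.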
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
}
           

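Summary: Watchdog extends Thread, and the thread it creates is named "watchdog". The mHandlerCheckers list holds the HandlerChecker objects for the foreground (fg), main, ui, i/o, display, animation and surface animation threads, and the constructor also sets up monitoring of the binder threads.

2.3.1 HandlerChecker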

frameworks/base/services/core/java/com/android/server/Watchdog.java

/**
* Used for checking status of handle threads and scheduling monitor callbacks.
*/
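public final class HandlerChecker implements Runnable {//158
private final Handler mHandler; // Handler of the watched thread
private final String mName; // name of the watched thread
private final long mWaitMax; // maximum time to wait for a check
// Monitors (watched services) run by this checker
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
// Monitors added while a check is in flight; merged into mMonitors later
private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
private boolean mCompleted; // completion flag; set to false when a check starts
private Monitor mCurrentMonitor; // the Monitor currently being checked
private long mStartTime; // time at which the check started
private int mPauseCount; // pause counter; a value above 0 means this HandlerChecker is paused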

HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
}
           

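Summary: a HandlerChecker stores the Handler of the watched thread, the thread's name and the maximum wait time.

HandlerChecker is a Runnable, so its run() method is the core of the class.

Reference [1], the blog post "implements Runnable" vs "extends Thread" in Java, covers both approaches in detail. This code uses Runnable, which has two advantages: a class that implements Runnable can still extend another class, and several threads can share one Runnable object, which is useful when a group of threads needs to access the same resource.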

2.3.2 addMonitor

frameworks/base/services/core/java/com/android/server/Watchdog.java

public class Watchdog extends Thread {
    public final class HandlerChecker implements Runnable {
        ...
        void addMonitorLocked(Monitor monitor) {//176
            // We don't want to update mMonitors when the Handler is in the middle of checking
            // all monitors. We will update mMonitors on the next schedule if it is safe
            // Here the BinderThreadMonitor created above is appended to mMonitorQueue
            mMonitorQueue.add(monitor);
        }
        ...
    }

    public void addMonitor(Monitor monitor) {//419
        synchronized (this) {
            // mMonitorChecker is a HandlerChecker (the foreground thread's checker)
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }
    ...
}
           

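Note: addMonitorLocked() appends a monitor to the HandlerChecker's mMonitorQueue member; the queue is merged into mMonitors the next time a check is scheduled. In this case the object being queued is the BinderThreadMonitor.

addMonitor() is used to watch services that implement the Watchdog.Monitor interface. A timeout here means either that the "android.fg" thread is slow at processing messages or that a monitor cannot acquire its lock.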
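The deferred registration behind addMonitorLocked() can be sketched in plain Java. This is only an illustration, not the AOSP code: the class MonitorQueueSketch and its method names are hypothetical, and only the two-list idea (mMonitors plus mMonitorQueue, guarded by a completion flag) is taken from the listing above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two-list bookkeeping in HandlerChecker.
class MonitorQueueSketch {
    interface Monitor {
        void monitor();
    }

    private final List<Monitor> monitors = new ArrayList<>();     // iterated during a check
    private final List<Monitor> monitorQueue = new ArrayList<>(); // waiting to be merged
    private boolean completed = true;                             // true when no check is in flight

    // Like addMonitorLocked(): never touches the list a running check iterates over.
    synchronized void addMonitor(Monitor monitor) {
        monitorQueue.add(monitor);
    }

    // Like the first step of scheduleCheckLocked(): merge only when it is safe.
    synchronized void scheduleCheck() {
        if (completed) {
            monitors.addAll(monitorQueue);
            monitorQueue.clear();
        }
        completed = false;
    }

    // Called once a check has run to completion.
    synchronized void finishCheck() {
        completed = true;
    }

    synchronized int activeCount() { return monitors.size(); }

    synchronized int pendingCount() { return monitorQueue.size(); }

    public static void main(String[] args) {
        MonitorQueueSketch checker = new MonitorQueueSketch();
        checker.addMonitor(() -> { });
        checker.scheduleCheck();        // idle, so the queued monitor becomes active
        checker.addMonitor(() -> { });  // a check is in flight, so this one stays queued
        System.out.println(checker.activeCount() + " active, " + checker.pendingCount() + " pending");
    }
}
```

The point of the second list is that run() can iterate over mMonitors without holding the Watchdog lock for the whole pass.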

2.3.3 BinderThreadMonitor

/** Monitor for checking the availability of binder threads. The monitor will block until
* there is a binder thread available to process in coming IPCs to make sure other processes
* can still communicate with the service.
*/
private static final class BinderThreadMonitor implements Watchdog.Monitor {//304
@Override
public void monitor() {
Binder.blockUntilThreadAvailable(); // see section 2.3.3.1
}
}

Note: blockUntilThreadAvailable() ends up in IPCThreadState, where it waits until a binder thread is free.
           

2.3.3.1 blockUntilThreadAvailable

frameworks/native/libs/binder/IPCThreadState.cpp

void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
mProcess->mWaitingForThreads++;
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
static_cast<unsigned long>(mProcess->mMaxThreads));
// Wait until the number of executing binder threads drops below the process's limit (mMaxThreads)
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
mProcess->mWaitingForThreads--;
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
           

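Note: following the call chain of addMonitor(new BinderThreadMonitor());//356 shows that the binder thread monitor is added to the HandlerChecker of the android.fg thread (mMonitorChecker), which then checks whether the binder threads are working normally.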
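The same wait-on-a-condition pattern can be written in Java to make the C++ easier to follow. This is a rough analogue under stated assumptions, not how binder is implemented: ThreadPoolGate, beginWork() and endWork() are hypothetical names, with ReentrantLock and Condition standing in for the pthread mutex and condition variable.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical Java analogue of IPCThreadState::blockUntilThreadAvailable().
class ThreadPoolGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition threadCountDecrement = lock.newCondition();
    private final int maxThreads;
    private int executingThreads;

    ThreadPoolGate(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A pool thread starts handling a request.
    void beginWork() {
        lock.lock();
        try {
            executingThreads++;
        } finally {
            lock.unlock();
        }
    }

    // A pool thread finishes; wake up anyone waiting for a free thread.
    void endWork() {
        lock.lock();
        try {
            executingThreads--;
            threadCountDecrement.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Blocks while every thread in the pool is busy.
    void blockUntilThreadAvailable() {
        lock.lock();
        try {
            while (executingThreads >= maxThreads) {
                threadCountDecrement.awaitUninterruptibly();
            }
        } finally {
            lock.unlock();
        }
    }

    int executing() {
        lock.lock();
        try {
            return executingThreads;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ThreadPoolGate gate = new ThreadPoolGate(1);
        gate.beginWork();                  // the only thread is now busy
        new Thread(gate::endWork).start(); // it finishes on another thread
        gate.blockUntilThreadAvailable();  // returns once a thread is free again
        System.out.println("executing threads: " + gate.executing());
    }
}
```

If no thread ever becomes free, the call never returns, and that is exactly what lets the watchdog notice binder thread exhaustion: the monitor hangs on the android.fg thread until the timeout expires.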

2.4 init

frameworks/base/services/core/java/com/android/server/Watchdog.java

/**
* Registers a {@link BroadcastReceiver} to listen to reboot broadcasts and trigger reboot.
* Should be called during boot after the ActivityManagerService is up and registered
* as a system service so it can handle registration of a {@link BroadcastReceiver}.
*/
public void init(Context context, ActivityManagerService activity) {//370
mActivity = activity;
// Register the receiver for the reboot broadcast (see section 2.4.1)
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
}
           

2.4.1 RebootRequestReceiver

final class RebootRequestReceiver extends BroadcastReceiver {//289
@Override
public void onReceive(Context c, Intent intent) {
if (intent.getIntExtra("nowait", 0) != 0) {
// See section 2.4.2
rebootSystem("Received ACTION_REBOOT broadcast");
return;
}
Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
}
}
           

2.4.2 rebootSystem

/**
* Perform a full reboot of the system.
*/
void rebootSystem(String reason) {//484
Slog.i(TAG, "Rebooting system because: " + reason);
IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
try {
// Perform the reboot through the PowerManager binder interface
pms.reboot(false, reason, false);
} catch (RemoteException ex) {
}
}
           

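Note: the reboot is ultimately carried out by PowerManagerService. The reboot flow itself will be covered in a separate article.

3. Watchdog detection mechanism

Calling Watchdog.getInstance().start() enters the run() method of the watchdog thread. The method has two parts:

First half (section 3.1): detects whether a timeout has occurred;

Second half (section 4.1): outputs diagnostic information once a timeout has occurred.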

3.1 run()

@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final List<HandlerChecker> blockedCheckers;
// Describes the cause of the timeout; used in the log output
final String subject;
// Whether a restart is allowed; true by default, can be changed through Watchdog.setAllowRestart()
final boolean allowRestart;
// Countdown that remembers whether a debugger was attached recently
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL; // CHECK_INTERVAL = 30 s
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
// Step 1: every 30 s, schedule a check on every HandlerChecker
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// Each checker records its mStartTime when the check is scheduled (see section 3.2)
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
// In short: uptimeMillis() only advances while the device is awake, so time spent
// asleep cannot trigger a spurious restart.

// Step 2: wait 30 s for the checks to complete
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout); // an interrupt is caught below and the wait simply continues
// Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
// Evaluate the state of the checkers (see section 3.3)
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// First time in the half-waited state
Slog.i(TAG, "WAITED_HALF");
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
// Dump the traces of system_server and the interesting native processes (see section 4.2)
ActivityManagerService.dumpStackTraces(pids, null, null,
getInterestingNativePids(), null);
waitedHalf = true;
}
continue;
}
// something is overdue!
// Reaching this point means the watchdog has timed out (see section 4.1)
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
           

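The main work of this method:

1. Run scheduleCheckLocked() on every checker.

  • If a checker has no monitors (true for every checker except the one on the android.fg thread) and its looper is polling, mCompleted is set to true;
  • If the previous check has not completed yet, the method returns immediately.

2. Wait 30 s, then call evaluateCheckerCompletionLocked() to evaluate the state of the checkers.

3. Act on waitState:

  • COMPLETED or WAITING: everything is normal, keep looping;
  • WAITED_HALF (more than 30 s) for the first time: dump the traces of system_server and the interesting native processes;
  • OVERDUE: output more information.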
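The inner wait loop of run() recomputes the remaining time after every wake-up, so an early return from wait() never shortens the interval. A minimal sketch of that loop outside Android follows; IntervalWaitSketch and waitFullInterval() are hypothetical names, and System.nanoTime() is used here only as a stand-in for SystemClock.uptimeMillis(), which is not available off-device.

```java
// Hypothetical illustration of the "wait the full interval" loop in Watchdog.run().
class IntervalWaitSketch {
    // Waits on lock until at least intervalMs have elapsed; returns the elapsed milliseconds.
    static long waitFullInterval(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        long elapsedMs = 0;
        boolean interrupted = false;
        synchronized (lock) {
            long remaining = intervalMs;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    interrupted = true; // remember the interrupt and keep waiting
                }
                // Recompute from the start time, as run() does with CHECK_INTERVAL.
                elapsedMs = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMs - elapsedMs;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        long waited = waitFullInterval(new Object(), 200);
        System.out.println("waited at least 200 ms: " + (waited >= 200));
    }
}
```

Unlike this sketch, Watchdog logs the interrupt with Log.wtf() and does not restore the interrupt flag.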

3.2 scheduleCheckLocked()

public final class HandlerChecker implements Runnable {
...
public void scheduleCheckLocked() {//182
if (mCompleted) {
// Safe to update monitors in queue, Handler is not in the middle of work
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())|| (mPauseCount > 0)) {
// Don't schedule until after resume OR
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if we have no monitors since those would need to
// be executed at this point.
mCompleted = true; // nothing to schedule: the checker is paused, or the target looper is polling
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
// Record the current time
mStartTime = SystemClock.uptimeMillis();
// Post this checker at the front of the message queue; see run() below
mHandler.postAtFrontOfQueue(this);//208
}
...
@Override
public void run() {//247
// Once we get here, we ensure that mMonitors does not change even if we call
// #addMonitorLocked because we first add the new monitors to mMonitorQueue and
// move them to mMonitors on the next schedule when mCompleted is true, at which
// point we have completed execution of this method.
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
// Call back into the monitor() method of the concrete service
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
...
}
           

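What this method does:

It posts the HandlerChecker at the very front of the watched thread's message queue, so that HandlerChecker.run() is executed on that thread. run() calls monitor() on each monitor and sets mCompleted = true when it finishes. If the message currently being handled keeps the thread busy so that run() never gets its turn, or if a monitor() call never returns, the watchdog fires.

At line 208, postAtFrontOfQueue(this) takes a Runnable as its argument; through the message mechanism, HandlerChecker.run() is eventually called back.

run() iterates over all Monitor objects; each service implements the monitor() method of that interface.

Possible problems: if other messages keep being posted with postAtFrontOfQueue(), the checker may never get a chance to run; or each monitor may take a little time and the total may exceed one minute. Both cases trigger the watchdog even though nothing is actually deadlocked.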
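The looper check boils down to posting a marker task to the watched thread and seeing whether it runs in time. A minimal sketch using a single-thread executor in place of an Android Handler is shown below; LooperCheckSketch, respondsWithin() and busyFor() are hypothetical names. One known difference: an executor appends the task at the tail of its queue, whereas Watchdog uses postAtFrontOfQueue().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the looper check, with an executor standing in for a Handler thread.
class LooperCheckSketch {
    // Posts a marker task and reports whether the worker ran it within timeoutMs.
    static boolean respondsWithin(ExecutorService worker, long timeoutMs) {
        CountDownLatch ran = new CountDownLatch(1);
        worker.execute(ran::countDown); // plays the role of HandlerChecker.run()
        try {
            return ran.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // A task that keeps the worker busy, like a slow message on a looper.
    static Runnable busyFor(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        System.out.println("idle worker responds: " + respondsWithin(worker, 1000));
        worker.execute(busyFor(500));
        System.out.println("busy worker responds: " + respondsWithin(worker, 50));
        worker.shutdownNow();
    }
}
```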

3.3 evaluateCheckerCompletionLocked()

private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// See section 3.4
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
           

Summary: returns the highest (worst) wait state among all checkers in mHandlerCheckers.

3.4 getCompletionStateLocked()

public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
// mWaitMax is 60 s by default
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
           

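Summary: the state is derived from the time elapsed since the check started:

  • COMPLETED = 0: the check has completed;
  • WAITING = 1: the wait is shorter than half of DEFAULT_TIMEOUT, i.e. under 30 s;
  • WAITED_HALF = 2: the wait is between 30 s and 60 s;
  • OVERDUE = 3: the wait is 60 s or longer.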
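The thresholds above, together with the Math.max() aggregation from section 3.3, can be checked with a small standalone function. The class CompletionStateSketch and its method names are hypothetical; the constants and comparisons follow the listings in sections 3.3 and 3.4.

```java
// Hypothetical standalone version of the state logic in sections 3.3 and 3.4.
class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Same comparisons as getCompletionStateLocked(), with the inputs passed in explicitly.
    static int completionState(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // Same aggregation as evaluateCheckerCompletionLocked(): the worst state wins.
    static int worstState(int... states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // DEFAULT_TIMEOUT in milliseconds
        int fg = completionState(false, 45_000, waitMax);  // WAITED_HALF
        int main = completionState(true, 45_000, waitMax); // COMPLETED
        System.out.println("overall state: " + worstState(fg, main));
    }
}
```

Because the states are ordered by severity, a single slow checker is enough to move the whole watchdog into WAITED_HALF or OVERDUE.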

4. Watchdog handling flow

4.1 run()

@Override
public void run() {//573
boolean waitedHalf = false;
while (true) {
...
// something is overdue!
// Collect the blocked checkers (see section 4.1.1)
blockedCheckers = getBlockedCheckersLocked();//637
// Build the description (see section 4.1.2)
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
report.append(MemoryPressureUtil.currentPsiState());

ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
// Second dump: the stack traces of the Java and native processes (see section 4.2)
final File stack = ActivityManagerService.dumpStackTraces(
pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
tracesFileException);

// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
// Sleep 5 s so that the stack traces can be written out completely
SystemClock.sleep(5000);

// Append a CPU usage snapshot to the report (see section 4.3)
processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
// See section 4.4
doSysRq('w');
doSysRq('l');

// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
// Write the DropBox entry (see section 4.5)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
// If a watched thread hangs before init() is called, we don't have a
// valid mActivity. So we can't log the error to dropbox.
if (mActivity != null) {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null, null,
subject, report.toString(), stack, null);
}
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
subject);
}
};
dropboxThread.start();

try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}

IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
// Report the stuck state to the activity controller
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
// The activity controller asked us to keep waiting
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue; // with an IActivityController set, the watchdog can be told to keep waiting
}
} catch (RemoteException e) {
}
}

// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
// Dump the stacks of the blocked threads
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");

if (!Build.IS_USER && isCrashLoopFound()
&& !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
breakCrashLoop(); // break out of a watchdog crash loop
}
// Kill the system_server process (see section 4.6)
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
...
}
           

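Summary:

When the watchdog detects a problem, it collects information through:

  • AMS.dumpStackTraces: dumps the stack traces of the Java and native processes;
  • os.ProcessCpuTracker: records a CPU usage snapshot;
  • doSysRq: asks the kernel to dump blocked threads and CPU backtraces;
  • dropBox: stores the report in DropBox.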

4.1.1 getBlockedCheckersLocked()

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
// Iterate over all checkers
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// Collect every checker that has not completed and is overdue
if (hc.isOverdueLocked()) {
checkers.add(hc);
}
}
return checkers;
}
           

4.1.2 describeCheckersLocked()

private String describeCheckersLocked(List<HandlerChecker> checkers) {//513
StringBuilder builder = new StringBuilder(128);
for (int i=0; i<checkers.size(); i++) {
if (builder.length() > 0) {
builder.append(", ");
}
// Append the description of each blocked checker
builder.append(checkers.get(i).describeBlockedStateLocked());
}
return builder.toString();
}


String describeBlockedStateLocked() {//237
// No monitor is running: the handler thread itself is blocked
if (mCurrentMonitor == null) {
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else {
// A monitor is running (monitors run on the foreground thread)
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
           

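Note: every handler thread or monitor that has been running for more than one minute is recorded.

  • "Blocked in handler" means the thread has spent more than one minute handling its current message;
  • "Blocked in monitor" means the thread has spent more than one minute handling its current message, or a monitor cannot acquire its lock.

4.1.3 describeBlockedStateLocked()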

String describeBlockedStateLocked() { 
if (mCurrentMonitor == null) { 
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")"; 
} else { 
return "Blocked in monitor " + mCurrentMonitor.getClass().getName() + " on " + mName + " (" + getThread().getName() + ")"; 
} 
}
           

Note: describeBlockedStateLocked() returns a description of the blocked state: either which HandlerChecker's handler is blocked, or which monitor that handler is currently executing.

4.2 AMS.dumpStackTraces()

frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

public static File dumpStackTraces(ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids,
ArrayList<Integer> nativePids, StringWriter logExceptionCreatingFile) {
return dumpStackTraces(firstPids, processCpuTracker, lastPids, nativePids,
logExceptionCreatingFile, null);
}

...

// ANR_TRACE_DIR is /data/anr; each dump gets its own file in that directory
final File tracesDir = new File(ANR_TRACE_DIR);//3997
// Each set of ANR traces is written to a separate file and dumpstate will process
// all such files and add them to a captured bug report if they're recent enough.
maybePruneOldTraces(tracesDir);

// NOTE: We should consider creating the file in native code atomically once we've
// gotten rid of the old scheme of dumping and lot of the code that deals with paths
// can be removed.
File tracesFile;
try {
// Create the traces file
tracesFile = createAnrDumpFile(tracesDir);
} catch (IOException e) {
Slog.w(TAG, "Exception creating ANR dump file:", e);
if (logExceptionCreatingFile != null) {
logExceptionCreatingFile.append("----- Exception creating ANR dump file -----\n");
e.printStackTrace(new PrintWriter(logExceptionCreatingFile));
}
return null;
}

// Write the traces
Pair<Long, Long> offsets = dumpStackTraces(
tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids);
if (firstPidOffsets != null) {
if (offsets == null) {
firstPidOffsets[0] = firstPidOffsets[1] = -1;
} else {
firstPidOffsets[0] = offsets.first; // Start offset to the ANR trace file
firstPidOffsets[1] = offsets.second; // End offset to the ANR trace file
}
}
return tracesFile;
}
           

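Summary: on Android 6.0 this dumped the traces of system_server plus three native processes: mediaserver, sdcard and surfaceflinger.

Whether the newer AMS still dumps exactly these processes has not been verified here and needs further analysis.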

4.3 OS.processCpuTracker()

frameworks/base/core/java/com/android/internal/os/ProcessCpuTracker.java

//WD.processCpuTracker()
processCpuTracker.update();//662
report.append(processCpuTracker.printCurrentState(anrTime)); // see section 4.3.1
report.append(tracesFileException.getBuffer());

public class ProcessCpuTracker {//51
private static final String TAG = "ProcessCpuTracker";
private static final boolean DEBUG = false;
private static final boolean localLOGV = DEBUG || false;
...
@UnsupportedAppUsage // marks a non-SDK interface whose use by apps is restricted
// Constructor of ProcessCpuTracker
public ProcessCpuTracker(boolean includeThreads) {//314
mIncludeThreads = includeThreads;
long jiffyHz = Os.sysconf(OsConstants._SC_CLK_TCK);
mJiffyMillis = 1000/jiffyHz;
}
...
}
           

On the @UnsupportedAppUsage annotation in Android R: https://blog.csdn.net/shanbl_linux_android/article/details/106094124

4.3.1 WD.printCurrentState()

final public String printCurrentState(long now) {
final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
buildWorkingProcs();
StringWriter sw = new StringWriter();
PrintWriter pw = new FastPrintWriter(sw, false, 1024);
pw.print("CPU usage from ");
if (now > mLastSampleTime) {
pw.print(now-mLastSampleTime);
pw.print("ms to ");
pw.print(now-mCurrentSampleTime);
pw.print("ms ago");
} else {
pw.print(mLastSampleTime-now);
pw.print("ms to ");
pw.print(mCurrentSampleTime-now);
pw.print("ms later");
}
pw.print(" (");
pw.print(sdf.format(new Date(mLastSampleWallTime)));
pw.print(" to ");
pw.print(sdf.format(new Date(mCurrentSampleWallTime)));
pw.print(")");
long sampleTime = mCurrentSampleTime - mLastSampleTime;
long sampleRealTime = mCurrentSampleRealTime - mLastSampleRealTime;
long percAwake = sampleRealTime > 0 ? ((sampleTime*100) / sampleRealTime) : 0;
if (percAwake != 100) {
pw.print(" with ");
pw.print(percAwake);
pw.print("% awake");
}
pw.println(":");
final int totalTime = mRelUserTime + mRelSystemTime + mRelIoWaitTime
+ mRelIrqTime + mRelSoftIrqTime + mRelIdleTime;
if (DEBUG) Slog.i(TAG, "totalTime " + totalTime + " over sample time "
+ (mCurrentSampleTime-mLastSampleTime));
int N = mWorkingProcs.size();
for (int i=0; i<N; i++) {
Stats st = mWorkingProcs.get(i);
printProcessCPU(pw, st.added ? " +" : (st.removed ? " -": " "),
st.pid, st.name, (int)st.rel_uptime,
st.rel_utime, st.rel_stime, 0, 0, 0, st.rel_minfaults, st.rel_majfaults);
if (!st.removed && st.workingThreads != null) {
int M = st.workingThreads.size();
for (int j=0; j<M; j++) {
Stats tst = st.workingThreads.get(j);
printProcessCPU(pw,
tst.added ? " +" : (tst.removed ? " -": " "),
tst.pid, tst.name, (int)st.rel_uptime,
tst.rel_utime, tst.rel_stime, 0, 0, 0, 0, 0);
}
}
}
printProcessCPU(pw, "", -1, "TOTAL", totalTime, mRelUserTime, mRelSystemTime,
mRelIoWaitTime, mRelIrqTime, mRelSoftIrqTime, 0, 0);
pw.flush();
return sw.toString();
}
           

4.4 WD.doSysRq()

private void doSysRq(char c) {//736
try {
FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
sysrq_trigger.write(c);
sysrq_trigger.close();
} catch (IOException e) {
Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
}
}
           

Summary: writing a character to the /proc/sysrq-trigger node makes the kernel act on it: 'w' dumps all blocked threads and 'l' prints the backtraces of all CPUs to the kernel log.
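The write itself is an ordinary single-character file write. The sketch below shows it with the path passed in as a parameter, so it can be tried against a temporary file instead of the real node, which normally needs elevated privileges. SysRqSketch, writeTrigger() and demoOnTempFile() are hypothetical names.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of doSysRq(): write one command character to a trigger file.
class SysRqSketch {
    // Writes the command to triggerPath; returns false if the write fails.
    static boolean writeTrigger(String triggerPath, char command) {
        try (FileWriter trigger = new FileWriter(triggerPath)) {
            trigger.write(command);
            return true;
        } catch (IOException e) {
            return false; // Watchdog only logs a warning in this case
        }
    }

    // Exercises writeTrigger() against a temporary file and returns what was written.
    static String demoOnTempFile(char command) {
        try {
            Path tmp = Files.createTempFile("sysrq-trigger", ".txt");
            try {
                writeTrigger(tmp.toString(), command);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("wrote: " + demoOnTempFile('w'));
    }
}
```

On a device the call would be writeTrigger("/proc/sysrq-trigger", 'w'), and it fails quietly when the process lacks permission, just as doSysRq() does.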

4.5 dropBox()

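DropBox is covered in detail in the DropBox source article; its files are written to /data/system/dropbox. When the watchdog fires, the generated DropBox entry has the tag system_server_watchdog and contains the traces plus the corresponding blocked information.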

4.6 killProcess()

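Process.killProcess() is covered in detail in the article on how process killing is implemented: it kills the target process by sending it signal 9.

Once system_server is killed, the zygote process kills itself as well, which makes init restart zygote. This is what shows up as a restart of the phone's framework.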

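5. Summary

Watchdog is a thread named "watchdog" that is started with the bootstrap services of the system_server process:

  • While it runs, a block of more than one minute triggers the watchdog, which kills system_server and so restarts the upper layers;
  • mHandlerCheckers lists all HandlerChecker objects, including those for the foreground, main, ui, i/o, display, animation and surface animation threads;
  • mMonitorChecker.mMonitors lists all Monitors the watchdog is currently watching; all of these monitors run on the foreground thread.

5.1 Two ways to put something under watchdog supervision

  1. addThread(): watches a Handler thread, with a default timeout of 60 s. A timeout here usually means the handler thread is slow at processing messages;
  2. addMonitor(): watches a service that implements the Watchdog.Monitor interface. A timeout here means either that the "android.fg" thread is slow at processing messages or that a monitor cannot acquire its lock.

5.2 Cases in which the watchdog fires but system_server is not killed:

  • monkey: an IActivityController is set and intercepts the systemNotResponding event, as monkey does;
  • hang: the am hang command was executed, so no restart;
  • debugger: a debugger is attached, so no restart.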
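These exceptions correspond to the checks at the end of run() in section 4.1 and can be condensed into one decision function. The sketch below is hypothetical: KillDecisionSketch, Outcome and decide() are illustrative names, and mapping the am hang case onto allowRestart == false is an assumption based on the "Restart not allowed" branch of the listing.

```java
// Hypothetical condensation of the kill decision at the end of Watchdog.run().
class KillDecisionSketch {
    enum Outcome { KEEP_WAITING, SKIP_DEBUGGER, SKIP_RESTART_DISABLED, KILL }

    // controllerResult is null when no IActivityController is set, otherwise the
    // value returned by systemNotResponding() (1 = keep waiting, -1 = kill system).
    static Outcome decide(Integer controllerResult, int debuggerWasConnected, boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Outcome.KEEP_WAITING;          // e.g. monkey asked to continue
        }
        if (debuggerWasConnected > 0) {
            return Outcome.SKIP_DEBUGGER;         // debugger attached now or recently
        }
        if (!allowRestart) {
            return Outcome.SKIP_RESTART_DISABLED; // restart disabled via setAllowRestart(false)
        }
        return Outcome.KILL;                      // kill system_server
    }

    public static void main(String[] args) {
        System.out.println(decide(1, 0, true));
        System.out.println(decide(null, 2, true));
        System.out.println(decide(null, 0, false));
        System.out.println(decide(null, 0, true));
    }
}
```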

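5.3 Watched Handler threads

Threads watched by Watchdog: DEFAULT_TIMEOUT is 60 s by default and is only 10 s when debugging (DB = true), which makes it easier to uncover potential ANR problems.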

No. | Thread name | Handler | Description | Timeout
1 | main | new Handler(Looper.getMainLooper()) | main thread | 60 s
2 | android.fg | FgThread.getHandler() | foreground thread | 60 s
3 | android.ui | UiThread.getHandler() | UI thread | 60 s
4 | android.io | IoThread.getHandler() | I/O thread | 60 s
5 | android.display | DisplayThread.getHandler() | display thread | 60 s
6 | android.anim | AnimationThread.getHandler() | animation thread | 60 s
7 | android.anim.lf | SurfaceAnimationThread.getHandler() | surface animation thread | 60 s
8 | ActivityManagerService | AMS.MainHandler | AMS thread | 60 s
9 | PowerManagerService | PMS.PowerManagerHandler | PMS thread | 60 s
10 | PackageManagerService | PKMS.PackageHandler | PKMS thread | 10 min
11 | PermissionManagerService | not listed | not listed | not listed
12 | RollbackManagerServiceImpl | not listed | not listed | not listed

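Watchdog currently watches the 12 threads above in the system_server process:

  • the Looper of each of the first 9 threads must not spend more than 1 minute handling a message;
  • the PackageManager thread must not spend more than 10 minutes.

5.4 Monitoring of synchronization locks

Every system service that Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method. The monitors run on the android.fg thread.

The main classes in the system that implement this interface (12 of them) are: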

  • ActivityManagerService
  • WindowManagerService
  • InputManagerService
  • PowerManagerService
  • NetworkManagementService
  • MountService
  • NativeDaemonConnector
  • BinderThreadMonitor
  • MediaProjectionManagerService
  • MediaRouterService
  • MediaSessionService
  • TvRemoteService

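5.5 Output

When a check stays blocked for one minute, the watchdog outputs:

  1. AMS.dumpStackTraces: the traces of system_server and the interesting native processes
    • This is done twice: first after 30 s of waiting, then again at the 1 min timeout;
  2. os.ProcessCpuTracker: a CPU usage snapshot appended to the report. Older Android releases instead dumped the kernel stacks of all system_server threads at this step:
    • the /proc/%d/task node lists all threads in the process
    • the /proc/%d/stack node gives the kernel stack
  3. doSysRq: makes the kernel dump all blocked threads and print the backtraces of all CPUs to the kernel log;
    • node /proc/sysrq-trigger
  4. dropBox: writes a file to /data/system/dropbox containing the traces and the blocked information;
  5. system_server is killed, which makes the zygote process kill itself and restarts the upper-layer framework.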