JVM coredump分析系列（2）使用Arthas抓取火焰圖導緻的crash分析

JvmtiExport::post_compiled_method_load 導緻的 crash

前一段時間，有使用者反映使用 Arthas 抓取火焰圖後 JVM 程序經常會出現 crash 導緻程序挂掉，他們crash後的 hs_err_<pid>.log 檔案（hs_err_<pid>.log 檔案為 JVM 虛拟機記錄緻命錯誤的日志，檔案預設生成在啟動目錄下，也可以通過 -XX:ErrorFile=/var/log/java/java_error%p.log 的指令來配置路徑）顯示如下：

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f4d2c8b9130, pid=1131915, tid=0x00007f4d72743700
  ......
# Core dump written. Default location: /opt/core or core.1131915
  ......
Stack: [0x00007f9541147000,0x00007f9541248000],  sp=0x00007f9541246b48,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x821193]  JvmtiExport::post_compiled_method_load(nmethod*)+0xdb
V  [libjvm.so+0xbcf683]  JvmtiDeferredEvent::post()+0x9b3
V  [libjvm.so+0xbd013c]  ServiceThread::service_thread_entry(JavaThread*, Thread*)+0x31c
V  [libjvm.so+0x7c4f22]  JavaThread::thread_main_inner()+0x212
V  [libjvm.so+0x9b52d8]  java_start(Thread*)+0xf8
C  [libpthread.so.0+0x8f4b]

Arthas 等工具使用了JVMTI[1]（全稱 JVM Tool Interface，是 JVM 虛拟機所提供的 native 程式設計接口，提供了可用于 debug 和 profiler 的接口，包括但不限于：分析、調試、監控、線程分析和覆寫率工具等）來做監控、增強等功能，追究此處 crash 原因需要我們先梳理下堆棧處 JVMTI 相關的邏輯。

我們從堆棧處可以看到 ServiceThread 線程，在 JVM 中除了業務線程之外還會有一些虛拟機自己建立的線程，服務線程 ServiceThread 就是其中一種，我們通過 jstack pid 指令檢視 JVM 程序線程堆棧的時候就可以看到其相關資訊：

"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x0000ffff9c0c6800 nid=0x6fe runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

ServiceThread 線程會幫 JVM 程序做一些雜活如：記憶體檢查等，其中 JVMTI 事件的發生也是由它處理的。大家都知道 Hotspot 會将熱點方法 method 編譯成本地方法 nmethod，即 native method（這也是 Hotspot 名稱的由來），我們檢視編譯熱點代碼的方法 ciEnv::register_method 中的邏輯：

# src/share/vm/ci/ciEnv.cpp
void ciEnv::register_method(ciMethod* target,
                            int entry_bci,
                            CodeOffsets* offsets,
                            int orig_pc_offset,
                            CodeBuffer* code_buffer,
                            int frame_words,
                            OopMapSet* oop_map_set,
                            ExceptionHandlerTable* handler_table,
                            ImplicitExceptionTable* inc_table,
                            AbstractCompiler* compiler,
                            int comp_level,
                            bool has_unsafe_access,
                            bool has_wide_vectors,
                            RTMState  rtm_state) {
VM_ENTRY_MARK;
nmethod* nm = NULL;
{
    ......

    nm =  nmethod::new_nmethod(method,
                               compile_id(),
                               entry_bci,
                               offsets,
                               orig_pc_offset,
                               debug_info(), dependencies(), code_buffer,
                               frame_words, oop_map_set,
                               handler_table, inc_table,
                               compiler, comp_level);
    ......

    if (nm != NULL) {
        // JVMTI -- compiled method notification (must be done outside lock)
        nm->post_compiled_method_load_event();
    } else {
        // The CodeCache is full. Print out warning and disable compilation.
        record_failure("code cache is full");
        CompileBroker::handle_full_code_cache();
    }
}

可以發現如果 nmethod 編譯成功，那麼将會通過 nm->post_compiled_method_load_event(); 将對應的 compiled_method_load_event 事件（此事件表示 nmethod 已被編譯加載，也是我們 crash 日志裡面出錯的事件類型）添加到一個名為 JvmtiDeferredEventQueue 的隊列之中，代碼邏輯如下：

# src/share/vm/code/nmethod.cpp
  
void nmethod::post_compiled_method_load_event() {
......
    if (JvmtiExport::should_post_compiled_method_load()) {
        // Let the Service thread (which is a real Java thread) post the event
        MutexLockerEx ml(Service_lock, Mutex::_no_safepoint_check_flag);
        JvmtiDeferredEventQueue::enqueue(
            JvmtiDeferredEvent::compiled_method_load_event(this));
    }
}

而 ServiceThread 線程則會一直在 while 循環中周遊 JvmtiDeferredEventQueue 隊列是否有事件産生：

# src/share/vm/runtime/serviceThread.cpp
void ServiceThread::service_thread_entry(JavaThread* jt, TRAPS) {
    while (true) {
        ......
        {
            ......
            while (!(sensors_changed = LowMemoryDetector::has_pending_requests()) && 
                !(has_jvmti_events = JvmtiDeferredEventQueue::has_events()) && 
                !(has_gc_notification_event = GCNotifier::has_event()) && 
                !(has_dcmd_notification_event = DCmdFactory::has_pending_jmx_notification()) && 
                !(acs_notify = AllocationContextService::should_notify())) {
                    // wait until one of the sensors has pending requests, or there is a
                    // pending JVMTI event or JMX GC notification to post
                    Service_lock->wait(Mutex::_no_safepoint_check_flag);
            }

            if (has_jvmti_events) {
                // Get the event under the Service_lock
                jvmti_event = JvmtiDeferredEventQueue::dequeue();
                _jvmti_event = &jvmti_event;
            }
  
            if (has_jvmti_events) {
                _jvmti_event->post();
                _jvmti_event = NULL;  // reset
            }
        }
        ......
    }
}

我們可以發現如果隊列中有事件産生（即JvmtiDeferredEventQueue::has_events()），ServiceThread 就會将事件出隊（即JvmtiDeferredEventQueue::dequeue()），并将事件進行 post 釋出以供外部的 JVMTI 用戶端（或稱為代理，即常說的 Agent）監聽，外部的 JVMTI 用戶端（Arthas 就是一種 JVMTI 用戶端）監聽到後可以做一些相應的增強處理，我們檢視 post 的具體邏輯：

# src/share/vm/prims/jvmtiImpl.cpp
void JvmtiDeferredEvent::post() {
    assert(Thread::current()->is_service_thread(), "Service thread must post enqueued events");
    switch(_type) {
        case TYPE_COMPILED_METHOD_LOAD: {
            nmethod* nm = _event_data.compiled_method_load;
            JvmtiExport::post_compiled_method_load(nm);
            break;
        }
        case TYPE_COMPILED_METHOD_UNLOAD: {
            nmethod* nm = _event_data.compiled_method_unload.nm;
            JvmtiExport::post_compiled_method_unload( 
            _event_data.compiled_method_unload.method_id, 
            _event_data.compiled_method_unload.code_begin);
            // done with the deferred event so unlock the nmethod
            nmethodLocker::unlock_nmethod(nm);
            break;
        }
        ......
    }
}

此時我們就可以找到 hs_err_<pid>.log 檔案中報錯的那一行代碼：JvmtiExport::post_compiled_method_load，結合我們 crash 的原因 SIGSEGV (0xb) at pc=0x00007fa2444251a0, pid=1723332, tid=0x00007fa2de684700（SIGSEGV：SIG 是信号名的通用字首，SEGV 是 segmentation violation 存儲器區段錯誤，此錯誤通常是由于程式通路了不屬于它的記憶體空間造成的），是以我們有理由懷疑 JvmtiExport::post_compiled_method_load(nm); 中 post 的 nm，即 nmethod* nm = _event_data.compiled_method_load; 中取到的 nm 可能有問題，為了驗證猜想，我們可以使用 GDB 去調試 JVM 程序 crash 時産生的 Coredump 檔案。

Coredump 檔案産生的路徑寫在了 hs_err_<pid>.log 檔案中：Core dump written. Default location: /opt/core or core.1723332，如果你發現此處顯示 # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again，那麼代表你運作 JVM 程序的使用者限制了 Coredump 檔案的産生，你可以使用 ulimit -c 來檢視目前的限制檔案大小，使用 ulimit -c unlimited 來放開相應的限制。

我們可以發現，0x0000000000000000 确實是個無效的位址，這意味着這個 nmethod 可能在被 post 之前就被解除安裝了（方法所屬的類被解除安裝），我們遇到此類問題可以第一時間去 OpenJDK 上遊社群的 JDK BUG SYSTEM[2] 快速檢索此類問題是否已經被社群解決，以減少重複工作。

經過一番搜尋，我們發現 JDK-8173361 : various crashes in JvmtiExport::post_compiled_method_load[3]，遇到的問題與我們遇到的問題簡直一模一樣，也是由于 nmethod 在 post 前已經被 unload 解除安裝，GC 将其回收掉了，是以 nmethod->_jmethod_id 的内容為空進而引發了 SIGSEGV 的 crash，社群的 patch[4] 修複方式為：

将 nmethod 入隊到 JvmtiDeferredEventQueue 前，通過 No_Safepoint_Verifier 與 Mutex::_no_safepoint_check_flag 防止 JavaThread 進入 safepoint 安全點；
将 nmethod 入隊到 JvmtiDeferredEventQueue 後，新增 oops_do 方法讓 ServiceThread 周遊通路 JvmtiDeferredEventQueue，将 JvmtiDeferredEventQueue 作為 GC 的 Root。

當時 OpenJDK 上遊社群隻在 JDK14 當中修複了這個問題，并将其 backport 回合到了 JDK13.0.7、JDK13.0.6 與 JDK11.0.10 三個版本中，即當時 JDK8 中是依舊存在這個問題的，是以畢昇JDK發起了 backport 到 JDK8 的 patch，并成功将其回合到 OpenJDK 8u352 的版本中，如果你使用的 OpenJDK 版本大于等于 8u352，那麼放心，你不會遇到此類問題。

JvmtiExport::post_compiled_method_unload 導緻的 crash

使用者更新到我們修複此問題的 JDK 版本之後，這個問題成功的消失了，當我以為已經完美閉環的時候，意外還是發生了。使用者反映，之前 JvmtiExport::post_compiled_method_load 的 crash 消失了，但是程序跑一段時間又會出現一個新的 crash：

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fa2444251a0, pid=1723332, tid=0x00007fa2de684700
......
# C  0x00007fa2444251a0
......
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007fa2444251a0

Registers:
RAX=0x00007fa2444251a0, RBX=0x00007fa3287a3090, RCX=0x00007fa346274818, RDX=0x00007fa3315819a0
......
Stack: [0x00007fa2de584000,0x00007fa2de685000],  sp=0x00007fa2de683c48,  free space=1023k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x00007fa2444251a0
V  [libjvm.so+0x812481]  JvmtiDeferredEvent::post()+0xf1
V  [libjvm.so+0xa3e598]  ServiceThread::service_thread_entry(JavaThread*, Thread*)+0x398
V  [libjvm.so+0xaf3851]  JavaThread::thread_main_inner()+0xf1
V  [libjvm.so+0x99a888]  java_start(Thread*)+0xf8
C  [libpthread.so.0+0x7dc5]  start_thread+0xc5
......

我們觀察 hs_err_<pid>.log 檔案可以發現，這又是一個 SIGSEGV 問題，而且 pc 寄存器指向的位址 0x00007fa2444251a0 與 sig 信号中的 si_addr 的位址、RAX 寄存器中的位址相同，看上去大機率是 RAX 寄存器持有一個錯誤的值，然後直接 return 傳回，導緻 pc 寄存器跑飛進而 crash。

繼續使用 GDB 調試新出現的 Coredump 檔案，為了友善排查問題，我們可以挂載對應 JDK 版本的 debuginfo（debuginfo 可以傳達有關程式或庫的重要資訊，包括将可執行代碼映射回原始源代碼的行表、檔案名和包含的檔案、程式中函數變量和類型的描述等等，可以供 debug 調試時使用，由于 debuginfo 檔案比較大，一般 JDK 發行時都會将 debuginfo 相關檔案去除以精簡 JDK 的體積。

OpenJDK 的 debuginfo 可以通過系統 yum install 安裝，或者通過 rpm 檔案手動安裝如：redhat openjdk-debuginfo[5]，如果無法安裝請聯系系統發行商或自行更換 yum 源；如果使用畢昇JDK，可以通過鲲鵬社群-畢昇JDK産品頁[6]進行下載下傳，使用方法請見 debuginfo 的使用指導[7]；如果使用其他發行版 JDK，請自行聯系相應 JDK 發行商擷取），debuginfo 中儲存着相應的 symbols 資訊友善我們直接對應到出問題的代碼位置，我們使用 bt 檢視 crash 時的堆棧資訊：

其中0-12幀都是 JVM 處理異常信号并 crash 處理的邏輯，可以忽略不計，我們可以重點關注第14幀的報錯，關聯了 debuginfo 後我們直接可以關聯到源代碼的位置，即圖中 at /usr1/jenkins/HuaweiJDK/open source/openjdk/jdk8u-jdk8u342-ga/hotspot/src/share/wm/prims/jvmtiExport.cpp:779 ，檢視對應的源碼：

我們有很大理由懷疑，極有可能是此處的 callback 函數指針跑飛了，此處的 callback 是幹什麼的呢？我們上文中提到 ServiceThread 會将事件（compiled_method_load、compiled_method_unload 都屬于事件）進行 post 釋出以供外部的 JVMTI 用戶端監聽使用，此處的 callback 就是外部的 JVMTI 用戶端注冊的回調函數，外部的 JVMTI 用戶端利用這個回調函數來執行自己相應的操作，即此處的 callback 指向的位址是由外部的 JVMTI 用戶端控制的，我們繼續調試：

發現此時的 callback 的确是個無效的位址，很明顯這不是一個 JVM 虛拟機自身 BUG 造成的問題，而是由外部 JVMTI用戶端/JVMTI 工具引發的錯誤。由于使用者目前加載了多個 Agent，有 Arthas、Vmtool、以及數個自己編寫的 Agent，是以我們需要縮小排查的範圍，上文中我們提到了如果 nmethod 編譯成功，那麼将會通過 nm->post_compiled_method_load_event() 将對應的 compiled_method_load_event 事件添加到一個名為 JvmtiDeferredEventQueue 的隊列之中（unload 與 load 同理），其實此處展開來講，JVM 是預設關閉入隊 JvmtiDeferredEventQueue 的操作的（節省性能開銷），檢視相關邏輯（以 unload 為例）：

# src/share/vm/code/nmethod.cpp
void nmethod::post_compiled_method_unload() {
    ......
    if (_jmethod_id != NULL && JvmtiExport::should_post_compiled_method_unload()) {
        ......
        if (SafepointSynchronize::is_at_safepoint()) {
            // Don't want to take the queueing lock. Add it as pending and
            // it will get enqueued later.
            JvmtiDeferredEventQueue::add_pending_event(event);
        } else {
            MutexLockerEx ml(Service_lock, Mutex::_no_safepoint_check_flag);
            JvmtiDeferredEventQueue::enqueue(event);
        }
    }
    ......
}

我們可以發現是否可以入隊 JvmtiDeferredEventQueue 取決于 JvmtiExport::should_post_compiled_method_unload() 邏輯，而 JvmtiExport::should_post_compiled_method_unload() 取決于 JVMTI_SUPPORT_FLAG(should_post_compiled_method_unload)，此變量如果未手動設定的話在 JVM 中是預設關閉的：

# src/share/vm/prims/jvmtiExport.cpp
bool              JvmtiExport::_should_post_compiled_method_unload        = false;

當然我們是可以手動将 JvmtiExport::_should_post_compiled_method_unload 值設為 true 來開啟此功能的，追蹤此處設定的邏輯：

# src/share/vm/prims/jvmtiEventController.cpp
static const jlong  COMPILED_METHOD_LOAD_BIT = (((jlong)1) << (JVMTI_EVENT_COMPILED_METHOD_LOAD - TOTAL_MIN_EVENT_TYPE_VAL));
static const jlong  COMPILED_METHOD_UNLOAD_BIT = (((jlong)1) << (JVMTI_EVENT_COMPILED_METHOD_UNLOAD - TOTAL_MIN_EVENT_TYPE_VAL));
  
JvmtiExport::set_should_post_compiled_method_load((any_env_thread_enabled & COMPILED_METHOD_LOAD_BIT) != 0);
JvmtiExport::set_should_post_compiled_method_unload((any_env_thread_enabled & COMPILED_METHOD_UNLOAD_BIT) != 0);

我們可以看到 JvmtiExport::_should_post_compiled_method_unload 值是由 JVMTI_EVENT_COMPILED_METHOD_UNLOAD 值來決定的，那 JVMTI_EVENT_COMPILED_METHOD_UNLOAD 又是由誰來決定的呢？繼續追蹤邏輯代碼：

# hotspot/linux_aarch64_compiler2/generated/jvmtifiles/jvmtiEnter.cpp
// Check Event Capabilities
const bool JvmtiUtil::has_event_capability(jvmtiEvent event_type, const jvmtiCapabilities* capabilities_ptr) {
 switch (event_type) {
  ......
  case JVMTI_EVENT_COMPILED_METHOD_LOAD:
   return capabilities_ptr->can_generate_compiled_method_load_events != 0;
  case JVMTI_EVENT_COMPILED_METHOD_UNLOAD:
   return capabilities_ptr->can_generate_compiled_method_load_events != 0;
  ......
 }
 // if it does not have a capability it is required
 return JNI_TRUE;
}

發現所有變量的設定都來源于 jvmtiCapabilities，比較熟悉 JVMTI 的朋友看到這裡可能就瞬間警覺起來，jvmtiCapabilities[8] 不就是在編寫 JVMTI的用戶端/Agent 時聲明的，需要開啟 JVMTI 相關能力的配置集嘛，我們檢視 JVMTI 官方的相關介紹[9]也可以發現：

這是可以和我們的代碼對得上的，這就意味着，隻有在JVMTI的工具中将 can_generate_compiled_method_load_events 值設定為1，JVMTI 才會将 nmethod 的 load、unload 事件 post 給用戶端的 callback 使用，是以我們隻需要排查 JVMTI 工具中究竟是誰開啟了 can_generate_compiled_method_load_events，就可以鎖定相關的工具。

但是事情總沒有想象中的那麼簡單，使用者在排查了相關 Agent 的代碼後發現，反映并未找到有 Agent 手動注冊/開啟了 can_generate_compiled_method_load_events，這很明顯是講不通的，我們檢視了多個環境的 crash 日志，發現幾乎每個日志都有一個共同點那就是都包含 Arthas 的線程，說明當時都在使用 Arthas：

使用者此時使用的 Arthas 版本為 3.3.5，詢問使用者使用 Arthas 的用途，發現使用者業務中使用了 Arthas 去定時抓取火焰圖，檢視 Arthas 使用火焰圖的相關邏輯[10]：

可以發現，此時 Arthas 是直接将業界火焰圖工具 Async-Profiler 內建了進來，而 Async-Profiler 也是一種 JVMTI 工具，根據 Arthas 3.3.5 的曆史送出記錄：

我們可以發現，此版本的 Arthas 內建的是 Async-Profiler 1.7 版本，去 GitHub 找到 Async-Profiler 1.7[11] 的源碼，檢索果然發現了 can_generate_compiled_method_load_events的使用[12]：

此時我們離真相就隻有一步之遙了，我們基本可以斷定是 Async-Profiler 1.7 中與 CompiledMethodUnload 相關的 callback 邏輯寫出了問題，為了避免重複工作，我們優先檢視 Async-Profiler 社群有沒有修複這個錯誤，果然我們發現社群在 Async-Profiler 2.6 版本的 patch：Avoid JVM crashes related to CompiledMethodLoad bugs and stack walking during GC[13] 中修複了這個錯誤，當然這個 patch 比較大，除了修複 nmethod load 與 unload 的邏輯還修複了 CodeCache 相關的邏輯，檢視 patch，其中關于 unload 的調用已被社群删除：

我們使用2.6版本 Async-Profiler 的動态庫so替換 Arthas 中內建的1.7版本的 Async-Profiler 動态庫後問題成功被解決，當然我們後來注意到 Arthas 社群也在 Arthas 3.5.6 版本中将 Async-Profiler 更新到了2.6版本：upgrade async-profiler to 2.6. #2089[14]。

是以如果你使用的 Arthas 版本<3.5.6，那麼可以通過手動替換2.6及以上版本的 Async-Profiler 動态庫即可；
如果你使用的 Arthas 版本>=3.5.6，那麼恭喜你，你不會遇到此類問題。

為幫助開發者們提升面試技能、有機會入職BATJ等大廠公司，特别制作了這個專輯——這一次整體放出。

大緻内容包括了： Java 集合、JVM、多線程、并發程式設計、設計模式、Spring全家桶、Java、MyBatis、ZooKeeper、Dubbo、Elasticsearch、Memcached、MongoDB、Redis、MySQL、RabbitMQ、Kafka、Linux、Netty、Tomcat等大廠面試題等、等技術棧！

歡迎大家關注公衆号【Java爛豬皮】，回複【666】，擷取以上最新Java後端架構VIP學習資料以及視訊學習教程，然後一起學習，一文在手，面試我有。

每一個專欄都是大家非常關心，和非常有價值的話題，如果我的文章對你有所幫助，還請幫忙點贊、好評、轉發一下，你的支援會激勵我輸出更高品質的文章，非常感謝！

JVM coredump分析系列（2）使用Arthas抓取火焰圖導緻的crash分析

JvmtiExport::post_compiled_method_load 導緻的 crash

JvmtiExport::post_compiled_method_unload 導緻的 crash

繼續閱讀

馬上沒bug之xcodebuild stack overflow

Java診斷工具 Arthas是什麼安裝以及開始使用常見指令以及使用問題場景解決

AvoidCrash內建和簡單封裝 unrecognized selector sent to instance 崩潰攔截

C語言中的數組越界

ants crash problem

java.lang.IllegalStateException: ActionBarContextView can only be used with android:layout_width="

Python的marshal子產品簡介

Python的pickle子產品詳解

P2psim源代碼分析三

抓包工具tcpdump使用方法彙總

flutter 崩潰收集一、flutter 崩潰收集的方式二、Flutter異常收集最佳實踐三、Flutter crash 收集平台

封裝一個簡單的崩潰監聽類封裝一個簡單的崩潰監聽類

優化你程式的大小 - 微觀優化

分布式transaction中lock的設計和實作

[crash分析][arm]應用程式Segmentation fault

使用Windbg調試.Net應用程式