天天看點

記一次線上TransactionTimedOutException問題排查問題排查過程總結

問題排查過程

在某天的一個下午線上突然出現郵件預警,有個賬戶消費接口傳回500,異常msg為Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022。

一開始通過查庫發現該賬号的消費日志和賬戶的餘額是正常的,于是乎将該次請求的trace_log導出來發現在出現TransactionTimedOutException前出現了feign調用逾時的異常,異常資訊如下:

feign.RetryableException: Read timed out executing POST xxx
	at feign.FeignException.errorExecuting(FeignException.java:65)
	at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:105)
	at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:77)
	at feign.hystrix.HystrixInvocationHandler$1.run$original$4jr1J8Wx(HystrixInvocationHandler.java:107)
	at feign.hystrix.HystrixInvocationHandler$1.run$original$4jr1J8Wx$accessor$sQ1jtBAA(HystrixInvocationHandler.java)
	at feign.hystrix.HystrixInvocationHandler$1$auxiliary$MLdm5cdf.call(Unknown Source)
	at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86)
	at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java)
	at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302)
	at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298)
    ...
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
    ...
           

這裡feign調用就是往mq發送消息,于是去代碼裡檢視發送消息是在哪個位置去進行發送的,僞代碼如下:

@Service
public class AccountService {

  @Resource
  private MessageService messageService;

  @Transactional(timeout = 5,rollbackFor = Exception.class)
  public void consume() {
    // 消費扣餘額,記錄消費日志
    ...
    // 發消息
    messageService.sendMessage();
  }
}



@Service
public class MessageService {

    public void sendMessage() {
    ...
    TransactionSynchronizationManager.registerSynchronization(
        new TransactionSynchronizationAdapter() {
          @Override
          public void afterCommit() {
            //發消息
            ...
          }
        });
  }

}
           

可以看到consume方法标注了@Transactional并且設定了事務逾時時間為5s。往MQ發送消息利用了TransactionSynchronizationManager.registerSynchronization管理目前線程的事務,重寫了afterCommit方法,讓事務在送出後再執行消息的發送。排查到這裡,一開始我懷疑是TransactionSynchronizationManager.registerSynchronization的用法問題導緻,是以寫了個小Demo進行測試,具體代碼如下:

@Transactional(timeout = 5, rollbackFor = Exception.class)
public void test() {
    TestEntity entity = new TestEntity();
    entity.setName("test");
    entity.setAge(1);
    testMapper.insert(entity);
    testService.testSynchronization();
}



public void testSynchronization() {
    TransactionSynchronizationManager.registerSynchronization(
        new TransactionSynchronizationAdapter() {
          @Override
          public void afterCommit() {
            try {
              Thread.sleep(6000);
            } catch (InterruptedException e) {
              e.printStackTrace();
            }
            System.out.println("test synchronization");
          }
        });
}
           

經過調試後發現盡管在afterCommit方法裡讓線程休眠了5秒,但是資料依然新增成功,并且接口傳回200,到這裡我的首次推斷被推翻。

重新回到trace_log去排查,發現通過feign調用往mq發送消息前,有經過4次的Mysql讀寫操作,這4次加起來也就幾ms,真正進行feign調用發消息前也就經過了3ms,也就是業務代碼耗時是不存在的。繼續往下排查,發現在feign調用逾時後又擷取了JDBC的連接配接,具體資訊如下:

o.m.s.t.SpringManagedTransaction - JDBC Connection [org.apache.shardingsphere.shardingjdbc.jdbc.core.connection.ShardingConnection@6814b439] will be managed by Spring
c.f.e.c.mybatis.SQLExecutionWatcher - update, elapsedTime=4(ms)
o.s.jdbc.datasource.DataSourceUtils - Returning JDBC Connection to DataSource
c.f.e.w.p.w.e.ExceptionTrackingHandler - Resolving exception from handler org.springframework.transaction.TransactionTimedOutException: Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022
org.springframework.transaction.TransactionTimedOutException: Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022
	at org.springframework.transaction.support.ResourceHolderSupport.checkTransactionTimeout(ResourceHolderSupport.java:141)
	at org.springframework.transaction.support.ResourceHolderSupport.getTimeToLiveInMillis(ResourceHolderSupport.java:130)
	at org.springframework.transaction.support.ResourceHolderSupport.getTimeToLiveInSeconds(ResourceHolderSupport.java:114)
	at org.mybatis.spring.transaction.SpringManagedTransaction.getTimeout(SpringManagedTransaction.java:137)
    ...
	at org.mybatis.spring.SqlSessionTemplate.insert(SqlSessionTemplate.java:278)
	at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:58)
	at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:59)
	at com.sun.proxy.$Proxy170.insertSelective(Unknown Source)
           

從報錯的堆棧資訊中可以看到這裡有執行insert方法,跟蹤代碼發現在進行feign調用發送消息時有進行異常捕獲,若出現異常則将消息存到表裡。那麼到這裡問題就顯而易見了,問題如下:

在首次操作資料庫建立連接配接開啟事務時通過org.mybatis.spring.transaction.SpringManagedTransaction#getTimeout方法計算出事務的TimeToLive。經過一系列的Mysql讀寫操作後通過TransactionSynchronizationManager.registerSynchronization進行同步器的注冊。事務送出後執行了afterCommit方法,該方法内部通過feign調用進行消息發送。由于feign調用逾時導緻異常被捕獲後執行發送失敗的消息入庫操作。進行入庫操作時擷取JDBC連接配接時判斷事務已經過了逾時時間,是以出現TransactionTimedOutException異常。

追蹤下getTimeout方法源碼,org.mybatis.spring.transaction.SpringManagedTransaction#getTimeout

public Integer getTimeout() throws SQLException {
    ConnectionHolder holder = (ConnectionHolder) TransactionSynchronizationManager.getResource(dataSource);
    if (holder != null && holder.hasTimeout()) {
      return holder.getTimeToLiveInSeconds();
    } 
    return null;
  }
           

org.springframework.transaction.support.TransactionSynchronizationManager#getResource

public static Object getResource(Object key) {
		Object actualKey = TransactionSynchronizationUtils.unwrapResourceIfNecessary(key);
        // 根據key值擷取連接配接
        // 從TransactionSynchronizationManager.resources拿
        // resources的結構為ThreadLocal<Map<Object, Object>>
		Object value = doGetResource(actualKey);
		if (value != null && logger.isTraceEnabled()) {
			logger.trace("Retrieved value [" + value + "] for key [" + actualKey + "] bound to thread [" +
					Thread.currentThread().getName() + "]");
		}
		return value;
	}


public abstract class TransactionSynchronizationManager {

     //線程上下文中儲存着【線程池對象:ConnectionHolder】的Map對象。線程可以通過該屬性擷取到同一個Connection對象。
    private static final ThreadLocal<Map<Object, Object>> resources = new NamedThreadLocal<>("Transactional resources");
    //事務同步器,是Spring交由程式員進行擴充的代碼,每個線程可以注冊N個事務同步器。
    private static final ThreadLocal<Set<TransactionSynchronization>> synchronizations = new NamedThreadLocal<>("Transaction synchronizations");
    // 事務的名稱  
    private static final ThreadLocal<String> currentTransactionName = new NamedThreadLocal<>("Current transaction name");
    // 事務是否是隻讀  
    private static final ThreadLocal<Boolean> currentTransactionReadOnly = new NamedThreadLocal<>("Current transaction read-only status");
    // 事務的隔離級别
    private static final ThreadLocal<Integer> currentTransactionIsolationLevel = new NamedThreadLocal<>("Current transaction isolation level");
    // 事務是否開啟   actual:真實的
    private static final ThreadLocal<Boolean> actualTransactionActive = new NamedThreadLocal<>("Actual transaction active");
}
           

在afterCommit方法中通過feign調用發送消息時是在主線程進行發送的,導緻getTimeout方法内部通過checkTransactionTimeout方法進行判斷時直接抛異常。

由于發送消息的方法隻在主線程進行發送的,雖然不會影響資料的準确性,但是假設發送消息因網絡波動耗時了一會兒會導緻接口響應延遲,讓使用者體驗不好。是以這塊需要進行解耦,進行異步發送。那麼該如何解決這個問題呢?

  • 将發送消息的方法放在事務外,利用@Async進行異步發送。
  • 利用@[email protected]進行異步發送

總結

每一次問題的排查、解決都是一次提升,共勉!