問題排查過程
在某天的一個下午線上突然出現郵件預警,有個賬戶消費接口傳回500,異常msg為Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022。
一開始通過查庫發現該賬号的消費日志和賬戶的餘額是正常的,于是乎将該次請求的trace_log導出來發現在出現TransactionTimedOutException前出現了feign調用逾時的異常,異常資訊如下:
feign.RetryableException: Read timed out executing POST xxx
at feign.FeignException.errorExecuting(FeignException.java:65)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:105)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:77)
at feign.hystrix.HystrixInvocationHandler$1.run$original$4jr1J8Wx(HystrixInvocationHandler.java:107)
at feign.hystrix.HystrixInvocationHandler$1.run$original$4jr1J8Wx$accessor$sQ1jtBAA(HystrixInvocationHandler.java)
at feign.hystrix.HystrixInvocationHandler$1$auxiliary$MLdm5cdf.call(Unknown Source)
at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86)
at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298)
...
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
...
這裡feign調用就是往mq發送消息,于是去代碼裡檢視發送消息是在哪個位置去進行發送的,僞代碼如下:
@Service
public class AccountService {
@Resource
private MessageService messageService;
@Transactional(timeout = 5,rollbackFor = Exception.class)
public void consume() {
// 消費扣餘額,記錄消費日志
...
// 發消息
messageService.sendMessage();
}
}
@Service
public class MessageService {
public void sendMessage() {
...
TransactionSynchronizationManager.registerSynchronization(
new TransactionSynchronizationAdapter() {
@Override
public void afterCommit() {
//發消息
...
}
});
}
}
可以看到consume方法标注了@Transactional并且設定了事務逾時時間為5s。往MQ發送消息利用了TransactionSynchronizationManager.registerSynchronization管理目前線程的事務,重寫了afterCommit方法,讓事務在送出後再執行消息的發送。排查到這裡,一開始我懷疑是TransactionSynchronizationManager.registerSynchronization的用法問題導緻,是以寫了個小Demo進行測試,具體代碼如下:
@Transactional(timeout = 5, rollbackFor = Exception.class)
public void test() {
TestEntity entity = new TestEntity();
entity.setName("test");
entity.setAge(1);
testMapper.insert(entity);
testService.testSynchronization();
}
public void testSynchronization() {
TransactionSynchronizationManager.registerSynchronization(
new TransactionSynchronizationAdapter() {
@Override
public void afterCommit() {
try {
Thread.sleep(6000);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("test synchronization");
}
});
}
經過調試後發現盡管在afterCommit方法裡讓線程休眠了5秒,但是資料依然新增成功,并且接口傳回200,到這裡我的首次推斷被推翻。
重新回到trace_log去排查,發現通過feign調用往mq發送消息前,有經過4次的Mysql讀寫操作,這4次加起來也就幾ms,真正進行feign調用發消息前也就經過了3ms,也就是業務代碼耗時是不存在的。繼續往下排查,發現在feign調用逾時後又擷取了JDBC的連接配接,具體資訊如下:
o.m.s.t.SpringManagedTransaction - JDBC Connection [org.apache.shardingsphere.shardingjdbc.jdbc.core.connection.ShardingConnection@6814b439] will be managed by Spring
c.f.e.c.mybatis.SQLExecutionWatcher - update, elapsedTime=4(ms)
o.s.jdbc.datasource.DataSourceUtils - Returning JDBC Connection to DataSource
c.f.e.w.p.w.e.ExceptionTrackingHandler - Resolving exception from handler org.springframework.transaction.TransactionTimedOutException: Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022
org.springframework.transaction.TransactionTimedOutException: Transaction timed out: deadline was Sun Dec 18 17:14:02 CST 2022
at org.springframework.transaction.support.ResourceHolderSupport.checkTransactionTimeout(ResourceHolderSupport.java:141)
at org.springframework.transaction.support.ResourceHolderSupport.getTimeToLiveInMillis(ResourceHolderSupport.java:130)
at org.springframework.transaction.support.ResourceHolderSupport.getTimeToLiveInSeconds(ResourceHolderSupport.java:114)
at org.mybatis.spring.transaction.SpringManagedTransaction.getTimeout(SpringManagedTransaction.java:137)
...
at org.mybatis.spring.SqlSessionTemplate.insert(SqlSessionTemplate.java:278)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:58)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:59)
at com.sun.proxy.$Proxy170.insertSelective(Unknown Source)
從報錯的堆棧資訊中可以看到這裡有執行insert方法,跟蹤代碼發現在進行feign調用發送消息時有進行異常捕獲,若出現異常則将消息存到表裡。那麼到這裡問題就顯而易見了,問題如下:
在首次操作資料庫建立連接配接開啟事務時通過org.mybatis.spring.transaction.SpringManagedTransaction#getTimeout方法計算出事務的TimeToLive。經過一系列的Mysql讀寫操作後通過TransactionSynchronizationManager.registerSynchronization進行同步器的注冊。事務送出後執行了afterCommit方法,該方法内部通過feign調用進行消息發送。由于feign調用逾時導緻異常被捕獲後執行發送失敗的消息入庫操作。進行入庫操作時擷取JDBC連接配接時判斷事務已經過了逾時時間,是以出現TransactionTimedOutException異常。
追蹤下getTimeout方法源碼,org.mybatis.spring.transaction.SpringManagedTransaction#getTimeout
public Integer getTimeout() throws SQLException {
ConnectionHolder holder = (ConnectionHolder) TransactionSynchronizationManager.getResource(dataSource);
if (holder != null && holder.hasTimeout()) {
return holder.getTimeToLiveInSeconds();
}
return null;
}
org.springframework.transaction.support.TransactionSynchronizationManager#getResource
public static Object getResource(Object key) {
Object actualKey = TransactionSynchronizationUtils.unwrapResourceIfNecessary(key);
// 根據key值擷取連接配接
// 從TransactionSynchronizationManager.resources拿
// resources的結構為ThreadLocal<Map<Object, Object>>
Object value = doGetResource(actualKey);
if (value != null && logger.isTraceEnabled()) {
logger.trace("Retrieved value [" + value + "] for key [" + actualKey + "] bound to thread [" +
Thread.currentThread().getName() + "]");
}
return value;
}
public abstract class TransactionSynchronizationManager {
//線程上下文中儲存着【線程池對象:ConnectionHolder】的Map對象。線程可以通過該屬性擷取到同一個Connection對象。
private static final ThreadLocal<Map<Object, Object>> resources = new NamedThreadLocal<>("Transactional resources");
//事務同步器,是Spring交由程式員進行擴充的代碼,每個線程可以注冊N個事務同步器。
private static final ThreadLocal<Set<TransactionSynchronization>> synchronizations = new NamedThreadLocal<>("Transaction synchronizations");
// 事務的名稱
private static final ThreadLocal<String> currentTransactionName = new NamedThreadLocal<>("Current transaction name");
// 事務是否是隻讀
private static final ThreadLocal<Boolean> currentTransactionReadOnly = new NamedThreadLocal<>("Current transaction read-only status");
// 事務的隔離級别
private static final ThreadLocal<Integer> currentTransactionIsolationLevel = new NamedThreadLocal<>("Current transaction isolation level");
// 事務是否開啟 actual:真實的
private static final ThreadLocal<Boolean> actualTransactionActive = new NamedThreadLocal<>("Actual transaction active");
}
在afterCommit方法中通過feign調用發送消息時是在主線程進行發送的,導緻getTimeout方法内部通過checkTransactionTimeout方法進行判斷時直接抛異常。
由于發送消息的方法隻在主線程進行發送的,雖然不會影響資料的準确性,但是假設發送消息因網絡波動耗時了一會兒會導緻接口響應延遲,讓使用者體驗不好。是以這塊需要進行解耦,進行異步發送。那麼該如何解決這個問題呢?
- 将發送消息的方法放在事務外,利用@Async進行異步發送。
- 利用@[email protected]進行異步發送
總結
每一次問題的排查、解決都是一次提升,共勉!