天天看點

SpringBoot項目中ES High Level Rest Client 逾時問題排查及解決

作者:蝸牛學技術

問題描述

最近項目中遇到一個問題,在某一個時刻應用系統出現了大量的ES通路逾時的現象,異常日志主要為:

java.io.IOException: Connection reset by peer
	at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:793) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:218) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:205) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1454) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1439) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
	at org.elasticsearch.client.IndicesClient.exists(IndicesClient.java:785) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
	at com.free4inno.scheduler.adapter.service.elasticsearch.GenericEsService.isIndexExists(GenericEsService.java:60) ~[classes!/:0.0.1-SNAPSHOT]
	at com.free4inno.scheduler.adapter.service.elasticsearch.MetricsToEsService.getTodayIndex(MetricsToEsService.java:56) ~[classes!/:0.0.1-SNAPSHOT]
	at com.free4inno.scheduler.adapter.service.elasticsearch.MetricsToEsService.insert(MetricsToEsService.java:49) ~[classes!/:0.0.1-SNAPSHOT]
	at com.free4inno.scheduler.adapter.service.Scheduler.sendMetrics(Scheduler.java:64) [classes!/:0.0.1-SNAPSHOT]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_261]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_261]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_261]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_261]
	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
	at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_261]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_261]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_261]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_261]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_261]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_261]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_261]
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_261]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_261]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_261]
	at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:1.8.0_261]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378) ~[?:1.8.0_261]
	at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:231) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
	... 1 more           

異常現象表現為:

①第一次連接配接出現timeout,後續連接配接正常且timeout時長為配置的socket時長。

②抛出異常時間段不固定。

問題定位

這個問題在之前開發多叢集管理平台的時候也出現,當時因為存在③跨版本且還沒有正式投産的情況下,并沒有太關注的此類問題,覺得此問題可能是版本相容性。項目組報上來問題才認真地排查了一番。

應用系統通過ES High Level Rest Client用戶端直連且版本相同,可以排查③的情況。起初懷疑是查詢或者寫入量大,背景處理不及時的原因導緻異常。于是檢視監控資料和日志,監控顯示在幾次出現異常的時間點上各項名額這都很正常,沒有出現流量陡增的情況,結合現象②,可以排除是因為服務端壓力的原因。

接着把影響異常的源頭定位在用戶端。ES用戶端和服務端的連接配接采用的是長連接配接,查閱源碼發現用戶端建立了client連接配接池,每個client持有一個http連接配接,并且開啟http的keep-alive政策複用連接配接。正是因為這個複用探活的原因導緻該異常發生。下面具體分析該問題。

問題分析

在未處理前restClient用戶端建立示意代碼:

final RestClientBuilder restClientBuilder =  RestClient.builder(redisHostList).setRequestConfigCallBack(requestConfigBuilder -> {;

      requestConfigBuilder.setConnectTimeout(5000);

      requestConfigBuilder.setSocketTimeout(60000);

      requestConfigBuilder.setConnectionRequestTimeout(500);

      return requestConfigBuilder;

      }).setHttpClientConfigCallback((httpAsyncClientBuilder) -> {

          httpAsyncClientBuilder.disableAuthCacheing();

       return httpAsyncClientBuilder;    final RestHighLevelClient restHighLevelClient = new RestHighLevelClient(restClientBuilder);
           

用戶端與服務端的連接配接示意圖:

SpringBoot項目中ES High Level Rest Client 逾時問題排查及解決

ES high-level-client 對長連接配接的實作是把逾時時間設定為-1,意味着用戶端永遠不逾時,服務端裝置為了資源的使用率會檢測與此裝置的連接配接是否在使用,如果一個連接配接長時間沒有使用,服務端會主動把這個連接配接關閉,而此時用戶端不知情,還處在連接配接狀态,可以說目前處于半連接配接狀态。當有流量進來的時候使用了該連接配接就會發現與服務端連接配接不上,産生timeout,用戶端也斷開此連結,如上圖 1 所示。用戶端使用新連接配接 2 與服務端通信。是以問題的根源在于用戶端沒有及時發現連接配接的不可用并斷開,是以需要設定讓用戶端主動對tcp連接配接進行探測保活。

解決方案

一、用戶端顯式的設定setKeepAliveStrategy
httpAsyncClientBuilder.setKeepAliveStrategy((httpResponse,httpContext) -> TimeUtils.MINUTES.toMillis(3))

二、顯示開啟 tcp keepalive
httpAsyncClientBuilder.setDefaultIOReactorConfig(IOReactorConfig.sustom().setSoKeepAlive(true).build())

三、在系統層面設定tcp keepalive探測保活時間
net.ipv4.tcp_keepalive_time = 60           

來源:https://www.modb.pro/db/388569