
Alibaba Cloud Data Lake Analytics: net_write_timeout error during a one-click warehouse sync

Today a user of Alibaba Cloud Data Lake Analytics (DLA) hit a data-sync failure while using the one-click warehouse feature. The exception was:

The whole JobGroup: Sync failed! State: JobGroupState(state=CANCELLED, message=[#1054] Failed syncing: gameinfolog_2018_05_06! message: mppQueryId=20200325_090656_18425_uuvx6, mppErrorCode=30101, because Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout' on the server.), 107 out of 775 jobs are finished

After following up, we learned that the user was syncing a self-managed MySQL database into Data Lake Analytics via the one-click warehouse feature. The MySQL instance had 8 cores and 32 GB of RAM, and the database held 775 tables totaling roughly 3 TB of data.
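The error message itself suggests raising net_write_timeout, the number of seconds MySQL waits for a block to be written to a client connection before aborting it. As a first check you can inspect and raise it on the server; the statements below are a generic sketch (the 600-second value is only an illustration), and as the rest of this post shows, this setting was not the root cause here.

SHOW GLOBAL VARIABLES LIKE 'net_write_timeout';   -- default is 60 seconds
SET GLOBAL net_write_timeout = 600;               -- applies to new connections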

Analysis

The backend exception stack trace is shown below. It shows that the connection to MySQL was dropped unexpectedly in the middle of the sync.

Caused by: java.io.EOFException: Can not read response from server. Expected to read 239 bytes, read 172 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3014)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3522)
        ... 21 more           
The first guess was that the sync job was opening too many connections and putting the MySQL instance under pressure, so the connection settings for the sync were lowered:

total-allowed-connections=100
connections-per-job=10

But the same failure occurred again.
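To confirm whether connection pressure is really the problem, a quick check (a generic sketch, not part of the original troubleshooting) is to watch the connection counters on the MySQL side while the sync runs:

SHOW GLOBAL STATUS LIKE 'Threads_connected';      -- connections open right now
SHOW GLOBAL STATUS LIKE 'Max_used_connections';   -- peak since server start
SHOW FULL PROCESSLIST;                            -- what each connection is doing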

  • This pointed to some other cause. Because the MySQL instance was self-managed and poorly monitored, all we could do was ask the user for the MySQL error log. It showed that MySQL had restarted right at the time of each failure, so the restarts were most likely what triggered the exception above. But why did MySQL keep restarting?
hongdeMacBook-Pro:Downloads hongshen$ grep shutdown error.log.1 
2020-03-25 09:34:33 25035 [Note] InnoDB: Database was not shutdown normally!
2020-03-25 11:35:07 31719 [Note] InnoDB: Database was not shutdown normally!
2020-03-25 14:03:33 7018 [Note] InnoDB: Database was not shutdown normally!
2020-03-25 15:02:10 10453 [Note] InnoDB: Database was not shutdown normally!
2020-03-25 15:09:33 11194 [Note] InnoDB: Database was not shutdown normally!
2020-03-25 16:28:36 15479 [Note] InnoDB: Database was not shutdown normally!
2020-03-25 16:35:42 16069 [Note] InnoDB: Database was not shutdown normally!
2020-03-25 17:07:15 17923 [Note] InnoDB: Database was not shutdown normally!
2020-03-26 00:36:17 7953 [Note] InnoDB: Database was not shutdown normally!
2020-03-26 00:43:20 8545 [Note] InnoDB: Database was not shutdown normally!           
  • The restart times correlated closely with the sync runs, so the sync was probably triggering the restarts. Yet DLA's one-click warehouse had previously been tested syncing much larger volumes from Alibaba Cloud RDS without any problem, so why did it fail this time? The only difference was that this user ran a self-managed MySQL instance, so we suspected a misconfigured MySQL setting. We asked the user for the MySQL configuration and the system log (syslog), and found an oom-killer event at every MySQL restart, with MySQL configured to use more memory than the host had. We could therefore conclude that MySQL's memory usage was too high and triggered the machine's oom-killer (a sketch of the relevant memory checks follows the log below).
hongdeMacBook-Pro:Downloads hongshen$ grep oom-killer syslog.1 
Mar 25 09:34:30 iZwz91gspuopjn2g9y7p2bZ kernel: [778504.166514] cron invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
Mar 25 11:35:03 iZwz91gspuopjn2g9y7p2bZ kernel: [785737.242006] fpmmm.php invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Mar 25 14:03:29 iZwz91gspuopjn2g9y7p2bZ kernel: [794642.922230] mysqld invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Mar 25 15:02:07 iZwz91gspuopjn2g9y7p2bZ kernel: [798160.021297] mysqld invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Mar 25 15:09:30 iZwz91gspuopjn2g9y7p2bZ kernel: [798602.927184] mysqld invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Mar 25 16:28:33 iZwz91gspuopjn2g9y7p2bZ kernel: [803345.721144] mysqld invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Mar 25 16:28:33 iZwz91gspuopjn2g9y7p2bZ kernel: [803345.734002] ntpd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Mar 25 16:35:38 iZwz91gspuopjn2g9y7p2bZ kernel: [803771.450129] mysqld invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Mar 25 17:07:11 iZwz91gspuopjn2g9y7p2bZ kernel: [805664.819003] mysqld invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Mar 26 00:36:13 iZwz91gspuopjn2g9y7p2bZ kernel: [832606.199291] mysqld invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
Mar 26 00:36:14 iZwz91gspuopjn2g9y7p2bZ kernel: [832606.207690] in:imklog invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Mar 26 00:43:16 iZwz91gspuopjn2g9y7p2bZ kernel: [833029.210693] mysqld invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0           
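For this failure mode the interesting settings are the ones that determine MySQL's memory footprint. A minimal sketch of what to check on a self-managed instance (the exact variables inspected in this case are not recorded in the post):

SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';  -- the largest single consumer
SHOW GLOBAL VARIABLES LIKE 'max_connections';          -- per-connection buffers scale with this
SHOW GLOBAL VARIABLES LIKE 'sort_buffer_size';
SHOW GLOBAL VARIABLES LIKE 'join_buffer_size';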

Solution

The user had configured MySQL to use more memory than the host could provide. After the user reduced MySQL's memory configuration, the problem was resolved.
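The post does not say which parameter was over-sized, but on a self-managed instance the usual culprit is innodb_buffer_pool_size set close to, or above, the physical RAM, leaving no headroom for per-connection buffers and the OS. As a purely illustrative example, "reducing MySQL's memory" on a 32 GB host might look like this in my.cnf:

[mysqld]
# leave room for the OS and per-connection buffers; ~60-70% of RAM is a common starting point
innodb_buffer_pool_size = 20G
# keep the number of connections (and thus per-connection buffers) bounded
max_connections = 200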