问题背景:
是一套从其他项目刚交接过来的MongoDB,登录上去了解下环境,我的天呐,不看不知道到,一看吓一跳,其中一个分片的副本集延时两个月前就有问题了。
rs_01:SECONDARY> db.printSlaveReplicationInfo()
source: xxxxx02:30001
syncedTo: Sat Jan 30 2016 00:54:17 GMT+0800 (CST)
= 5226316 secs ago (1451.75hrs)
source: xxxxx03:30001
syncedTo: Tue Feb 02 2016 20:12:44 GMT+0800 (CST)
= 4897609 secs ago (1360.45hrs)
原因分析:
登录副本集主节点,查看日志,狂刷屏,下面是一直打印的错误日志信息,日志大小已经达到上百GB:
/lib64/libc.so.6(clone+0x6d) [0x3437cd526d]
Wed Mar 30 15:53:39.121 [conn375182574] assertion 10334 BSONObj size: 1852142352 (0x1073656E) is invalid. Size must be between 0 and 16793600(16MB) First element: Status: ?type=100 ns:local.oplog.rs query:{}
Wed Mar 30 15:53:39.121 [conn375182574] ntoskip:0 ntoreturn:1
Wed Mar 30 15:53:39.121 [conn375182574] end connection 192.168.XX.XX:55806 (990 connections now open)
Wed Mar 30 15:53:39.122 [initandlisten] connection accepted from 192.168.XX.XX:55807 #375182575 (991 connections now open)
Wed Mar 30 15:53:39.122 [conn375182575] Assertion: 10334:BSONObj size: 1852142352 (0x1073656E) is invalid. Size must be between 0 and 16793600(16MB) First element: Status: ?type=100
0xde9a71 0xdab23b 0xdab77c 0x6ee52f 0x8c9adc 0xa8a52f 0xa8c27a 0xa87161 0xa8c7be 0xa8f628 0xa02af4 0xa03eb0 0x6eba58 0xdd603e 0x3438c0683d 0x3437cd526d
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde9a71]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xdab23b]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod [0xdab77c]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x5bf) [0x6ee52f]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo11BasicCursor7currentEv+0x5c) [0x8c9adc]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZNK5mongo21ResponseBuildStrategy7currentEbPNS_13ResultDetailsE+0x11f) [0xa8a52f]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo20OrderedBuildStrategy11handleMatchEPNS_13ResultDetailsE+0xaa) [0xa8c27a]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo20QueryResponseBuilder8addMatchEv+0xb1) [0xa87161]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_12ChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0x3be) [0xa8c7be]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1ac8) [0xa8f628]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod [0xa02af4]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x410) [0xa03eb0]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6eba58]
/opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdd603e]
/lib64/libpthread.so.0 [0x3438c0683d]
/lib64/libc.so.6(clone+0x6d) [0x3437cd526d]
从这个日志信息,基本可以了解到:
1、MongoDB读取到数据文件后发现存在BSON对象的大小超过16MB的非法对象所以报错;
2、报错对象是:local.oplog
直接登录对应分片主节点,确认:
问题解决:
通过上面的分析,已经定位到了问题:local.oplog 损坏。那么解决问题就简单了,选择合适的时间重建 oplog,下面是相关步骤:
查看主节点日志,MongoDB的日志已经正常打印了,错误已经没有出现。
由于日志被清理过多次,已经无法找到最早一次出现问题的时间了,后来了解到,此DB在之前断电重启过几次。
后续安排:
1、部署监控;
2、业务空闲时间重新初始化其分片对应的副本集;
3、2项完成后,也对其分片主节点进行降级重新初始化。