天天看点

[故障处理] MongoDB Assertion: 10334:BSONObj size: 1852142352 (0x1073656E) is invalid 故障处理一例

问题背景:

    是一套从其他项目刚交接过来的MongoDB,登录上去了解下环境,我的天呐,不看不知道到,一看吓一跳,其中一个分片的副本集延时两个月前就有问题了。

  rs_01:SECONDARY> db.printSlaveReplicationInfo()

  source:   xxxxx02:30001

           syncedTo: Sat Jan 30 2016 00:54:17 GMT+0800 (CST)

                   = 5226316 secs ago (1451.75hrs)

  source:   xxxxx03:30001

           syncedTo: Tue Feb 02 2016 20:12:44 GMT+0800 (CST)

                   = 4897609 secs ago (1360.45hrs)

原因分析:

    登录副本集主节点,查看日志,狂刷屏,下面是一直打印的错误日志信息,日志大小已经达到上百GB:

/lib64/libc.so.6(clone+0x6d) [0x3437cd526d]

Wed Mar 30 15:53:39.121 [conn375182574] assertion 10334 BSONObj size: 1852142352 (0x1073656E) is invalid. Size must be between 0 and 16793600(16MB) First element: Status: ?type=100 ns:local.oplog.rs query:{}

Wed Mar 30 15:53:39.121 [conn375182574]  ntoskip:0 ntoreturn:1

Wed Mar 30 15:53:39.121 [conn375182574] end connection 192.168.XX.XX:55806 (990 connections now open)

Wed Mar 30 15:53:39.122 [initandlisten] connection accepted from 192.168.XX.XX:55807 #375182575 (991 connections now open)

Wed Mar 30 15:53:39.122 [conn375182575] Assertion: 10334:BSONObj size: 1852142352 (0x1073656E) is invalid. Size must be between 0 and 16793600(16MB) First element: Status: ?type=100

0xde9a71 0xdab23b 0xdab77c 0x6ee52f 0x8c9adc 0xa8a52f 0xa8c27a 0xa87161 0xa8c7be 0xa8f628 0xa02af4 0xa03eb0 0x6eba58 0xdd603e 0x3438c0683d 0x3437cd526d 

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde9a71]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xdab23b]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod [0xdab77c]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x5bf) [0x6ee52f]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo11BasicCursor7currentEv+0x5c) [0x8c9adc]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZNK5mongo21ResponseBuildStrategy7currentEbPNS_13ResultDetailsE+0x11f) [0xa8a52f]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo20OrderedBuildStrategy11handleMatchEPNS_13ResultDetailsE+0xaa) [0xa8c27a]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo20QueryResponseBuilder8addMatchEv+0xb1) [0xa87161]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_12ChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0x3be) [0xa8c7be]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1ac8) [0xa8f628]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod [0xa02af4]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x410) [0xa03eb0]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6eba58]

 /opt/mongodb-linux-x86_64-2.4.11/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdd603e]

 /lib64/libpthread.so.0 [0x3438c0683d]

 /lib64/libc.so.6(clone+0x6d) [0x3437cd526d]

   从这个日志信息,基本可以了解到:

   1、MongoDB读取到数据文件后发现存在BSON对象的大小超过16MB的非法对象所以报错;

   2、报错对象是:local.oplog

   直接登录对应分片主节点,确认:

问题解决:

    通过上面的分析,已经定位到了问题:local.oplog 损坏。那么解决问题就简单了,选择合适的时间重建 oplog,下面是相关步骤:

   查看主节点日志,MongoDB的日志已经正常打印了,错误已经没有出现。

   由于日志被清理过多次,已经无法找到最早一次出现问题的时间了,后来了解到,此DB在之前断电重启过几次。

后续安排:

1、部署监控;

2、业务空闲时间重新初始化其分片对应的副本集;

3、2项完成后,也对其分片主节点进行降级重新初始化。