Using the smartctl command line

Command-line usage:
All of the servers we currently use have LSI RAID cards. When the disk is a SAS disk, smartctl must be given the following -d option:
smartctl -d megaraid,$deviceid /dev/$diskname
When the disk is a SATA disk, smartctl must be given this -d option instead:
smartctl -d sat+megaraid,$deviceid /dev/$diskname
The RAID card tool can be used to check the disk interface type:
megacli -cfgdsply -aall | grep -i 'PD Type'
If no RAID card is in use, the -d option is not needed.
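
The rule above can be wrapped in a small helper. This is only a sketch: the function name smartctl_dev_arg and its two parameters are made up for illustration, and it assumes the interface type has already been read from the RAID tool output.

```shell
# Build the smartctl -d argument for a disk behind an LSI MegaRAID card.
# $1: interface type reported by the RAID tool (SAS or SATA, any case)
# $2: MegaRAID device id
# Prints nothing for any other type (no RAID card: no -d option needed).
# NOTE: smartctl_dev_arg is a hypothetical helper, not part of smartctl.
smartctl_dev_arg() {
    case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
        SAS)  printf -- '-d megaraid,%s\n' "$2" ;;
        SATA) printf -- '-d sat+megaraid,%s\n' "$2" ;;
        *)    return 0 ;;
    esac
}
```

It could then be used as: smartctl -a $(smartctl_dev_arg SATA "$deviceid") /dev/$diskname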

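Command-line return value
After smartctl finishes, its return value can be read from the shell variable $?. If the disk is completely healthy the return value is 0; otherwise bits are set according to the type of error.
The bits are described as follows: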
Bit 0: Command line did not parse.
Bit 1: Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode (see -n option above).
Bit 2: Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure (see -b option above).
Bit 3: SMART status check returned "DISK FAILING".
Bit 4: We found prefail Attributes <= threshold.
Bit 5: SMART status check returned "DISK OK" but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
Bit 6: The device error log contains records of errors.
Bit 7: The device self-test log contains records of errors. [ATA only] Failed self-tests outdated by a newer successful extended self-test are ignored.
To see which bits are set:
status=$?
for ((i=0; i<8; i++)); do
echo "Bit $i: $((status & 2**i && 1))"
done
Pay particular attention to whether bits 3, 4, 6, 7, and 5 are set; any other bit being set only calls for a notice.
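
For logs and alerts it can help to turn the numeric status into readable text. The following is a minimal sketch, not a smartctl feature: the function name decode_smartctl_status is invented here, and the short descriptions are abbreviations of the bit list above.

```shell
# Print one line for every bit set in a smartctl exit status.
# $1: the exit status (0-255), e.g. captured with status=$?
# The body runs in a subshell so its variables do not leak to the caller.
decode_smartctl_status() (
    status=$1
    i=0
    for name in \
        "command line did not parse" \
        "device open failed" \
        "SMART or ATA command failed" \
        "SMART status: DISK FAILING" \
        "prefail attributes <= threshold" \
        "attributes <= threshold in the past" \
        "error log contains errors" \
        "self-test log contains errors"
    do
        # Test bit i of the status.
        if [ $(( (status >> i) & 1 )) -eq 1 ]; then
            echo "bit $i: $name"
        fi
        i=$((i + 1))
    done
)
```

Typical use would be to run smartctl, save $? immediately, and pass the saved value to the function.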

Attribute information displayed by smartctl:
Taking one of the servers in our company as an example:
[[email protected] ~]#smartctl -A -P use /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 086 086 016 Pre-fail Always - 10813449
2 Throughput_Performance 0x0005 132 132 054 Pre-fail Offline - 105
3 Spin_Up_Time 0x0007 117 117 024 Pre-fail Always - 615 (Average 615)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 314
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 112 112 020 Pre-fail Offline - 39
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 23637
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 313
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 478
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 478
194 Temperature_Celsius 0x0002 222 222 000 Old_age Always - 27 (Min/Max 5/70)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
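1. The ATTRIBUTE_NAME list can differ between disk vendors, and each is only a subset of the full S.M.A.R.T. attribute list. For the complete list and the meaning of each attribute, see: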
http://en.wikipedia.org/wiki/S.M.A.R.T.#8
2. The field we need to watch is WHEN_FAILED.
The rule for what the WHEN_FAILED field displays:
if(VALUE <= THRESH)
WHEN_FAILED = "FAILING_NOW";
else if (WORST <= THRESH)
WHEN_FAILED = "in_the_past"(or past);
else
WHEN_FAILED = "-";
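In other words, when the WHEN_FAILED field of an attribute is "-", the attribute is healthy and has never been in a failed state.
Likewise, when bit 4 or bit 5 of the smartctl return value is set, look for the ATTRIBUTE_NAME rows whose WHEN_FAILED is not "-"; those are the attributes with a problem.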
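
Finding the rows whose WHEN_FAILED is not "-" is easy to automate. A minimal sketch, assuming the ten-column layout shown in the sample output above (WHEN_FAILED is the ninth whitespace-separated field); the function name failed_attributes is made up for illustration.

```shell
# Read `smartctl -A` output on stdin and print "ATTRIBUTE_NAME WHEN_FAILED"
# for every attribute row whose WHEN_FAILED column (field 9) is not "-".
# Rows are recognised by a numeric ID in field 1, which skips the banner
# and the column header line.
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $2, $9 }'
}
```

For example: smartctl -A /dev/sdb | failed_attributes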

A simple smartctl monitoring scheme
Run a smartctl scan on every disk once every half hour:
smartctl -a /dev/$devname
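Check the smartctl return value every time.
If bit 2 of the return value is set, running smartctl -x -b warn /dev/$devname shows which commands the device does not support, for example: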
Warning: device does not support SCT Data Table command
Warning: device does not support SCT Error Recovery Control command
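If bit 4 or bit 5 of the return value is set, check the START OF READ SMART DATA SECTION part of the smartctl output (the attributes described in the previous section) and record every ATTRIBUTE_NAME whose WHEN_FAILED field is not "-".
If bit 6 of the return value is set, record the output of smartctl -l xerror /dev/$devname.
If bit 7 of the return value is set, record the output of smartctl -l xselftest /dev/$devname.
If bit 3 is set, the SMART health check has failed.
Of the bits above, all except bit 5 should ideally raise an alert in real time.
If any other bit is set, a real-time alert is not required.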
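
The alerting policy can be sketched as a bit mask. This is one reading of the scheme above, not a definitive implementation: it assumes that bits 2, 3, 4, 6, and 7 alert in real time and that everything else (bits 0, 1, and 5) produces only a notice. REALTIME_MASK, classify_smartctl_status, and scan_disk are hypothetical names.

```shell
# Assumed real-time bits: bit 2 (4) + bit 3 (8) + bit 4 (16)
# + bit 6 (64) + bit 7 (128) = 220. Adjust if your policy differs.
REALTIME_MASK=220

# $1: smartctl exit status. Prints "ok", "alert" (real time) or "notice".
classify_smartctl_status() {
    if [ "$1" -eq 0 ]; then
        echo ok
    elif [ $(( $1 & REALTIME_MASK )) -ne 0 ]; then
        echo alert
    else
        echo notice
    fi
}

# Scan one disk and print "<device> <class> status=<n>".
# $1: device name such as sdb. The smartctl output is discarded here;
# a real script would keep it for the follow-up checks described above.
scan_disk() {
    smartctl -a "/dev/$1" > /dev/null 2>&1
    status=$?
    echo "$1 $(classify_smartctl_status "$status") status=$status"
}
```

A cron entry with the schedule */30 * * * * that calls scan_disk for each disk would give the half-hourly scan; how alerts and notices are delivered is left to the caller.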

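Some notes on ATTRIBUTE_NAME:
Because disks from different vendors do not provide exactly the same set of ATTRIBUTE_NAME entries, and because I do not yet understand some of the fields well enough, alerts are for now not broken down by ATTRIBUTE_NAME.
For example, Throughput_Performance, which we care about, is included in the SMART data of the Hitachi disks in our company but not in that of the Seagate disks.
A finer-grained monitoring scheme will have to wait until the attributes behind each ATTRIBUTE_NAME are understood in more depth.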

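SSD lifetime monitoring
SSD lifetime monitoring mainly watches the following attributes:
Media_Wearout_Indicator: wear level, indicating how much of the NAND erase/write endurance on the SSD has been used.
Reallocated_Sector_Ct: number of bad blocks that have developed since the drive left the factory.
Host_Writes_32MiB: amount of data written, counted in units of 32 MiB.
Available_Reservd_Space: remaining reserved space on the SSD.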
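
These four attributes can be pulled out of the attribute table with a short filter. A minimal sketch, assuming the same ten-column `smartctl -A` layout as the sample output earlier and that the drive reports the attributes under exactly these names; the function name ssd_wear_attributes is made up for illustration.

```shell
# Read `smartctl -A` output on stdin and print "NAME VALUE RAW_VALUE"
# (fields 2, 4 and 10) for the SSD wear-related attributes.
# Only the first word of RAW_VALUE is printed.
ssd_wear_attributes() {
    awk '
        $2 == "Media_Wearout_Indicator" ||
        $2 == "Reallocated_Sector_Ct"   ||
        $2 == "Host_Writes_32MiB"       ||
        $2 == "Available_Reservd_Space" { print $2, $4, $10 }
    '
}
```

For example: smartctl -A /dev/$devname | ssd_wear_attributes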