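Preface:
Since computers began to spread in the last century, enterprises have demanded ever stronger data security and have paid ever more attention to the continuity of their business systems. For a provider of enterprise data storage, or a builder of private-cloud projects, the customer's requirements are our construction target, and whatever the customer focuses on sets the direction in which we update our solutions and technology.
If a hard disk fails, how can we help the customer replace it without shutting the server down?
Server vendors already offer a way to do this: out-of-band management, such as IPMI on Dell servers and the BMC on Inspur servers, can manage disks, for example out-of-band disk boot, configuring arrays, clearing arrays, and locating disks. However, this only works on recently released servers; the out-of-band management on older servers has no built-in function for lighting a disk's locate LED.
So how do we find the bay of a failed disk on older equipment and replace the disk online?
This is where the RAID controller's command-line management tool comes in. The following describes how to locate a failed disk and how to add a disk to the RAID controller. A Dell server is used as the example throughout this article.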
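1. Install perccli, the DELL server RAID management tool
This method applies to every RAID card version used in DELL servers
1.1. Description:
Package name: perccli.zip
Package version: 7.1-007.0127, A05
Package category: SAS RAID
Package release date: 17 May 2018
This package version supports the PERC H740 and H840 as well as earlier models
Dell download page: https://www.dell.com/support/home/zh-cn/drivers/driversdetails?driverid=f48c2
Note: the package offered at this address is an RPM. The author has converted it to a deb package with the fakeroot alien command; after unzipping, the deb can be installed directly
deb package download: https://edisk.eflycloud.com/s/CD44wGDTbGQr32J // download password: ruijiang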
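1.2. Common commands:
# ./perccli64 /c0/eall/sall show // list the physical drives
# ./perccli64 /c0/vall show // list the virtual disks, i.e. the arrays
# ./perccli64 /c0 show preservedCache // show preserved cache left behind by missing virtual disks
# ./perccli64 /c0/fall show all // show foreign configuration information
# ./perccli64 /c0/v11 delete preservedcache // clear the preserved cache of virtual disk 11 on controller 0
# ./perccli64 /c0/fall delete // clear the foreign disk configuration
# ./perccli64 /c0/fall import [preview] // import the foreign disk configuration
# ./perccli64 /c0 add vd r0 drives=32:10 wb ra // build a RAID 0 from drive 32:10 with write-back and read-ahead (32:10 == EID:Slt)
# ./perccli64 /c0 add vd r5 size=all drives=32:01,32:02,32:03 // build a RAID 5 from the three listed drives
# ./perccli64 /c0 add vd r1 size=all drives=32:01,32:02 // build a RAID 1 from the two listed drives (32:01 == EID:Slt)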
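1.3. Installation
1) Get the package: https://edisk.eflycloud.com/s/CD44wGDTbGQr32J // download password: ruijiang
2) mkdir -p /opt/MegaRAID/perccli // create the perccli install directory /opt/MegaRAID/perccli
3) unzip /opt/MegaRAID/perccli/perccli.zip -d /opt/MegaRAID/perccli // unpack into the install directory (copy perccli.zip there first)
4) dpkg -i /opt/MegaRAID/perccli/Linux/perccli_007.0127.0000.0000-2_all.deb // install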
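After the steps above, a quick check can confirm that the binary is in place before moving on. This is only a minimal sketch: perccli_installed is a hypothetical helper name, and the default path is an assumption based on the directory this article later runs ./perccli64 from.

```shell
# Hypothetical helper: succeed if the given path (default: the location used
# later in this article, an assumption) exists and is executable.
perccli_installed() {
    [ -x "${1:-/opt/MegaRAID/perccli/perccli64}" ]
}

if perccli_installed; then
    echo "perccli64 found"
else
    echo "perccli64 not found or not executable" >&2
fi
```

Pass a different path as the first argument if the package was installed somewhere else.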
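2. Locate the failed disk's bay
Example: the failed disk is /dev/sdc
2.1. Look up the failed disk's device information
- Record the DID that corresponds to the device node
/dev/sdc is at [0:0:4:0], so its DID is 4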
root@nodeserver1:/opt/MegaRAID/perccli# lsscsi
[0:0:2:0] disk ATA WDC WDS100T2G0A- 0000 /dev/sda
[0:0:3:0] disk ATA INTEL SSDSC2BB80 0101 /dev/sdb
[0:0:4:0] disk ATA WDC WDS100T2G0A- 0000 /dev/sdc
[0:0:5:0] disk ATA INTEL SSDSC2KB96 0110 /dev/sdd
[0:0:6:0] disk ATA INTEL SSDSC2KB96 0110 /dev/sde
[0:0:7:0] disk ATA INTEL SSDSC2KB96 0110 /dev/sdf
[0:2:0:0] disk DELL PERC H730 Mini 4.27 /dev/sdg
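- Meaning
[0:0:4:0] : [host:channel:target:LUN], the SCSI address printed by lsscsi. In this example the host number matches the controller index and the target number matches the DID.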
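The lookup in [2.1] can be scripted. The sketch below is an illustration only: scsi_target_of is a hypothetical helper name, and it assumes that lsscsi prints the bracketed address in the first column and the device node in the last, as in the listing above. It also assumes, as this article does, that the target number equals the DID on this controller.

```shell
# Hypothetical helper: read lsscsi-style text on stdin and print the target
# number (third field of [H:C:T:L]) for the device node given as $1.
scsi_target_of() {
    awk -v dev="$1" '
        $NF == dev {
            addr = substr($1, 2, length($1) - 2)   # drop the [ and ]
            split(addr, part, ":")                  # part[3] is the target
            print part[3]
            exit
        }
    '
}

# Two rows copied from the lsscsi listing in this article.
sample='[0:0:3:0] disk ATA INTEL SSDSC2BB80 0101 /dev/sdb
[0:0:4:0] disk ATA WDC WDS100T2G0A- 0000 /dev/sdc'

printf '%s\n' "$sample" | scsi_target_of /dev/sdc   # prints 4
```

On a live system the same function can be fed directly: lsscsi | scsi_target_of /dev/sdc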
2.2. Query the RAID cards in the server
root@nodeserver1:~# cd /opt/MegaRAID/perccli
root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 show
- The output below shows a single RAID card, with index 0
---------------------
Status Code = 0
---------------------
Status = Success
Description = None
Number of Controllers = 1
Host Name = nodeserver1
Operating System = Linux4.15.0-29-generic
System Overview :
===============
------------------------------------------------------------------------
Ctl Model        Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS EHS ASOs Hlth
------------------------------------------------------------------------
  0 PERCH730Mini     8   8   1     0   1     0 Opt On  3  N      0 Opt
------------------------------------------------------------------------
Ctl=Controller Index|DGs=Drive groups|VDs=Virtual drives|Fld=Failed
PDs=Physical drives|DNOpt=DG NotOptimal|VNOpt=VD NotOptimal|Opt=Optimal
Msng=Missing|Dgd=Degraded|NdAtn=Need Attention|Unkwn=Unknown
sPR=Scheduled Patrol Read|DS=DimmerSwitch|EHS=Emergency Hot Spare
Y=Yes|N=No|ASOs=Advanced Software Options|BBU=Battery backup unit
Hlth=Health|Safe=Safe-mode boot
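When a server has more than one RAID card, the controller indexes can be pulled out of this output instead of read by eye. The following is a rough sketch under the assumption that the System Overview table keeps the layout shown above (a header row starting with Ctl Model, then rows between two dashed rules); list_controller_indices is a hypothetical helper name, not a perccli feature.

```shell
# Hypothetical helper: read `perccli64 show` output on stdin and print the
# Ctl index of every controller row in the System Overview table.
list_controller_indices() {
    awk '
        $1 == "Ctl" && $2 == "Model" { in_table = 1; next }     # header row
        in_table && /^ *-+ *$/ { if (++rules == 2) exit; next }  # second rule closes the table
        in_table && rules == 1 { print $1 }                      # data row: first column is Ctl
    '
}

# Lines copied from the output above.
sample='Number of Controllers = 1
------------------------------------------------------------------------
Ctl Model Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS EHS ASOs Hlth
------------------------------------------------------------------------
0 PERCH730Mini 8 8 1 0 1 0 Opt On 3 N 0 Opt
------------------------------------------------------------------------
Ctl=Controller Index|DGs=Drive groups|VDs=Virtual drives|Fld=Failed'

printf '%s\n' "$sample" | list_controller_indices   # prints 0
```

On a live system: ./perccli64 show | list_controller_indices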
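2.3. Query the disks behind the RAID card
2.3.1. Syntax:
- root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c$x/eall/sall show
Replace $x with the controller index (0, 1, ...) taken from the Ctl column of the System Overview table in step [2.2]. Do not take it from the Status Code line, which only reports whether the command succeeded.
2.3.2. Example: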
root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c0/eall/sall show
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive Information :
=================
-------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model                  Sp
-------------------------------------------------------------------------------
32:0      0 Onln   0 278.875 GB SAS  HDD N   N  512B ST300MP0026            U
32:1      1 Onln   0 278.875 GB SAS  HDD N   N  512B AL13SXB300N            U
32:2      2 JBOD   -   931.0 GB SATA SSD N   N  512B WDC WDS100T2G0A-00JH30 U
32:3      3 JBOD   - 744.625 GB SATA SSD N   N  512B INTEL SSDSC2BB800G7    U
32:4      4 JBOD   -   931.0 GB SATA SSD N   N  512B WDC WDS100T2G0A-00JH30 U
32:5      5 JBOD   -  893.75 GB SATA SSD N   N  512B INTEL SSDSC2KB960G8    U
32:6      6 JBOD   -  893.75 GB SATA SSD N   N  512B INTEL SSDSC2KB960G8    U
32:7      7 JBOD   -  893.75 GB SATA SSD N   N  512B INTEL SSDSC2KB960G8    U
-------------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
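Of these, the key values are:
1) EID: Enclosure Device ID
2) DID: Device ID
3) Slt: Slot No.
2.3.3. Determine the failed disk's controller (c), enclosure (e) and slot (s) coordinates
Combining the information from steps [2.1] and [2.3.2]: the DID of /dev/sdc is 4, and the row with DID 4 has EID:Slt 32:4 on controller 0, so the failed disk /dev/sdc is at /c0/e32/s4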
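The same derivation can be sketched in shell. Both function names below are hypothetical, and the parsing assumes the drive table keeps the column order shown in [2.3.2], with EID:Slt first and DID second.

```shell
# Hypothetical helper: read `perccli64 /cX/eall/sall show` output on stdin and
# print the EID:Slt of the drive whose DID equals $1.
slot_for_did() {
    awk -v did="$1" '$1 ~ /^[0-9]+:[0-9]+$/ && $2 == did { print $1; exit }'
}

# Hypothetical helper: turn a controller index and an EID:Slt pair into the
# /cX/eY/sZ form that perccli expects.
drive_path() {
    printf '/c%s/e%s/s%s\n' "$1" "${2%%:*}" "${2##*:}"
}

# Two rows copied from the drive table in this article.
sample='32:3 3 JBOD - 744.625 GB SATA SSD N N 512B INTEL SSDSC2BB800G7 U
32:4 4 JBOD - 931.0 GB SATA SSD N N 512B WDC WDS100T2G0A-00JH30 U'

slot="$(printf '%s\n' "$sample" | slot_for_did 4)"
drive_path 0 "$slot"   # prints /c0/e32/s4
```

On a live system: ./perccli64 /c0/eall/sall show | slot_for_did 4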
3. Light up the failed disk
3.1. Syntax:
- root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c$x/e$y/s$z start locate
3.2. Placeholders
- c$x = controller index
- e$y = EID
- s$z = Slt
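To avoid typing a wrong coordinate, the command line can be assembled and checked before it is run. The wrapper below is a sketch, not part of perccli: locate_cmd is a hypothetical name, and it only prints the command so that the operator can review it first.

```shell
# Hypothetical wrapper: validate the action and the three coordinates, then
# print (not run) the matching perccli64 locate command.
# Usage: locate_cmd start|stop CONTROLLER EID SLOT
locate_cmd() {
    action="$1" ctl="$2" eid="$3" slt="$4"
    case "$action" in
        start|stop) ;;
        *) echo "action must be start or stop" >&2; return 1 ;;
    esac
    for n in "$ctl" "$eid" "$slt"; do
        case "$n" in
            ''|*[!0-9]*) echo "not a number: '$n'" >&2; return 1 ;;
        esac
    done
    printf './perccli64 /c%s/e%s/s%s %s locate\n' "$ctl" "$eid" "$slt" "$action"
}

locate_cmd start 0 32 4   # prints ./perccli64 /c0/e32/s4 start locate
locate_cmd stop 0 32 4    # prints ./perccli64 /c0/e32/s4 stop locate
```

Printing rather than executing is a deliberate choice here: lighting the wrong bay leads to pulling the wrong disk, so a review step costs little.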
3.3. Light the disk
root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c0/e32/s4 start locate
Controller = 0
Status = Success
Description = Start Drive Locate Succeeded.
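3.4. The disk's LED now blinks continuously
DELL servers: the LED keeps blinking, on -> off -> on -> off, at a steady rate
4. Turn off the disk's blinking LED
Once the bay has been identified, turn off the blinking LED and then pull the disk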
root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c0/e32/s4 stop locate
Controller = 0
Status = Success
Description = Stop Drive Locate Succeeded.
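Every perccli command shown in this article prints a Status line, so a script can check it before moving on to the next step. The snippet below is a minimal sketch: perccli_ok is a hypothetical helper name, and it assumes a successful call always prints a line that begins with Status = Success, as in the transcripts above.

```shell
# Hypothetical helper: succeed only if the perccli output on stdin contains a
# line that starts with "Status = Success".
perccli_ok() {
    grep -Eq '^Status[[:space:]]*=[[:space:]]*Success'
}

# Output copied from the stop locate call above.
sample='Controller = 0
Status = Success
Description = Stop Drive Locate Succeeded.'

if printf '%s\n' "$sample" | perccli_ok; then
    echo "locate stopped, safe to continue"
else
    echo "perccli reported a problem" >&2
fi
```

On a live system: ./perccli64 /c0/e32/s4 stop locate | perccli_ok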