
How to Hot-Swap a Failed Server Disk Online: Locating the Failed Disk

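Preface:

With the spread of computers since the last century, enterprises have placed ever higher demands on data security and ever more weight on the continuity of their business systems. For a provider of enterprise data storage or a builder of private cloud projects, the customer's requirements are our construction goals, and what the customer focuses on sets the direction in which our solutions and technology are upgraded.

If a disk fails, how do we help the customer replace it without shutting down the server?

Server vendors already provide a way to do this: out-of-band management, such as IPMI on Dell servers and the BMC on Inspur servers, can manage disks, for example out-of-band disk boot, configuring arrays, clearing arrays, and locating disks. However, this only works on recently released servers; the out-of-band management on older servers does not include the disk locate-LED function.

So how do we identify the bay of a failed disk on older equipment so that it can be replaced online?

This is where the RAID controller's command-line management tool comes in. The following describes how to locate the failed disk and how to add a disk to the RAID controller. A Dell server is used as the example in this article.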

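1. Install perccli, the RAID management tool for DELL servers

This method applies to all versions of RAID cards on DELL servers

1.1. Notes:

Package name: perccli.zip
Package version: 7.1-007.0127, A05
Package category: SAS RAID
Package release date: 17 May 2018
This package version supports PERC H740 and H840 as well as earlier models
Dell official download URL: https://www.dell.com/support/home/zh-cn/drivers/driversdetails?driverid=f48c2
Note: the package downloaded from this URL is an RPM installer. The author has converted the rpm package to a deb package with the fakeroot and alien commands; it can be unzipped and installed directly
deb package download: https://edisk.eflycloud.com/s/CD44wGDTbGQr32J   // download password: ruijiang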

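1.2. Common commands:

# ./perccli64 /c0/eall/sall show                                 Show the list of physical disks
# ./perccli64 /c0/vall show                                      Show the list of virtual disks, i.e. the array information
# ./perccli64 /c0 show preservedCache                            Show preserved cache left behind by missing virtual disks
# ./perccli64 /c0/fall show all                                  Show foreign disk configuration information
# ./perccli64 /c0/v11 delete preservedcache                      Clear the preserved cache of virtual disk 11 on controller 0
# ./perccli64 /c0/fall delete                                    Clear the foreign disk configuration
# ./perccli64 /c0/fall import [preview]                          Import the foreign disk configuration
# ./perccli64 /c0 add vd r0 drives=32:10 wb ra                   Create a RAID 0 from disk 32:10 (32:10 == EID:Slt)
# ./perccli64 /c0 add vd r5 size=all drives=32:01,32:02,32:03    Create a RAID 5 from the 3 listed disks
# ./perccli64 /c0 add vd r1 size=all drives=32:01,32:02          Create a RAID 1 from the 2 listed disks (32:01 == EID:Slt)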

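1.3. Installation

1. Get the package: https://edisk.eflycloud.com/s/CD44wGDTbGQr32J   // download password: ruijiang
2. mkdir -p /opt/MegaRAID/perccli                          // create the perccli install directory /opt/MegaRAID/perccli
3. unzip /opt/MegaRAID/perccli/perccli.zip -d /opt/MegaRAID/perccli      // unzip into the install directory (assumes perccli.zip was saved there)
4. dpkg -i /opt/MegaRAID/perccli/Linux/perccli_007.0127.0000.0000-2_all.deb      // install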

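2. Locate the position of the failed disk

In this example the failed disk is /dev/sdc

2.1. Look up the failed disk's device information

  • Record the DID value for the device

/dev/sdc is at [0:0:4:0], so its DID value is 4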

root@nodeserver1:/opt/MegaRAID/perccli# lsscsi
[0:0:2:0]    disk    ATA      WDC WDS100T2G0A- 0000  /dev/sda
[0:0:3:0]    disk    ATA      INTEL SSDSC2BB80 0101  /dev/sdb
[0:0:4:0]    disk    ATA      WDC WDS100T2G0A- 0000  /dev/sdc
[0:0:5:0]    disk    ATA      INTEL SSDSC2KB96 0110  /dev/sdd
[0:0:6:0]    disk    ATA      INTEL SSDSC2KB96 0110  /dev/sde
[0:0:7:0]    disk    ATA      INTEL SSDSC2KB96 0110  /dev/sdf
[0:2:0:0]    disk    DELL     PERC H730 Mini   4.27  /dev/sdg      
  • Meaning
[0:0:4:0] : [host:channel:target:lun]; for the JBOD disks on this server, host is the controller ID and target is the disk's DID
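
The lookup in [2.1] can also be scripted. The following is a minimal sketch, not part of the original procedure: `scsi_target_of` is a hypothetical helper name, and it assumes (as observed on this server) that the third field of the lsscsi address equals the DID reported by perccli.

```shell
# Hypothetical helper: print the SCSI target ID (third field of [H:C:T:L])
# for one block device. Reads `lsscsi` output on stdin; $1 is the device node.
scsi_target_of() {
    awk -v dev="$1" '
        $NF == dev {
            addr = substr($1, 2, length($1) - 2)   # strip the surrounding [ ]
            split(addr, f, ":")                    # f[1]=host f[2]=channel f[3]=target f[4]=lun
            print f[3]
        }'
}
```

Usage on a live system would be `lsscsi | scsi_target_of /dev/sdc`, which for the listing above prints the target field of the /dev/sdc row.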

2.2. Query the RAID cards in the server

root@nodeserver1:~# cd /opt/MegaRAID/perccli
root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 show      
  • The query below shows only one RAID card, with index 0
---------------------
Status Code = 0
---------------------
Status = Success
Description = None
Number of Controllers = 1
Host Name = nodeserver1
Operating System  = Linux4.15.0-29-generic
System Overview :
===============
------------------------------------------------------------------------
Ctl Model        Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS EHS ASOs Hlth
------------------------------------------------------------------------
  0 PERCH730Mini     8   8   1     0   1     0 Opt On  3  N      0 Opt
------------------------------------------------------------------------
Ctl=Controller Index|DGs=Drive groups|VDs=Virtual drives|Fld=Failed
PDs=Physical drives|DNOpt=DG NotOptimal|VNOpt=VD NotOptimal|Opt=Optimal
Msng=Missing|Dgd=Degraded|NdAtn=Need Attention|Unkwn=Unknown
sPR=Scheduled Patrol Read|DS=DimmerSwitch|EHS=Emergency Hot Spare
Y=Yes|N=No|ASOs=Advanced Software Options|BBU=Battery backup unit
Hlth=Health|Safe=Safe-mode boot      
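
When several servers have to be checked, the controller index can be pulled out of this output with a small filter. This is only a sketch under the assumption that the System Overview table keeps the layout shown above (a header row starting with `Ctl Model`, framed by dashed lines); `controller_indices` is a hypothetical name.

```shell
# Hypothetical helper: print the Ctl column of the System Overview table.
# Reads the output of `perccli64 show` on stdin.
controller_indices() {
    awk '
        $1 == "Ctl" && $2 == "Model" { in_table = 1; next }           # header row
        in_table && /^-+/ { if (seen_sep++) in_table = 0; next }      # 2nd dashed line ends the table
        in_table { print $1 }                                         # data row: first column is the index
    '
}
```

Usage would be `./perccli64 show | controller_indices`, giving one index per line.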

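2.3. Query the disks under the RAID card

2.3.1. Syntax:

  • root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c$x/eall/sall show

Replace $x with the controller index, which is the value in the Ctl column of the output from step [2.2] (0 in this example)

2.3.2. Example: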

root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c0/eall/sall show
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive Information :
=================
-------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model                  Sp
-------------------------------------------------------------------------------
32:0      0 Onln   0 278.875 GB SAS  HDD N   N  512B ST300MP0026            U
32:1      1 Onln   0 278.875 GB SAS  HDD N   N  512B AL13SXB300N            U
32:2      2 JBOD   -   931.0 GB SATA SSD N   N  512B WDC WDS100T2G0A-00JH30 U
32:3      3 JBOD   - 744.625 GB SATA SSD N   N  512B INTEL SSDSC2BB800G7    U
32:4      4 JBOD   -   931.0 GB SATA SSD N   N  512B WDC WDS100T2G0A-00JH30 U
32:5      5 JBOD   -  893.75 GB SATA SSD N   N  512B INTEL SSDSC2KB960G8    U
32:6      6 JBOD   -  893.75 GB SATA SSD N   N  512B INTEL SSDSC2KB960G8    U
32:7      7 JBOD   -  893.75 GB SATA SSD N   N  512B INTEL SSDSC2KB960G8    U
-------------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded      

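The key values are:

1. EID: Enclosure Device ID
2. DID: Device ID
3. Slt: Slot No.

2.3.3. Determine the failed disk's c, e, and s coordinates

DID 4 from step [2.1] matches the row with EID:Slt 32:4 in step [2.3.2], so on controller 0 the failed disk /dev/sdc is at: c0/e32/s4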
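
The matching of DID to EID:Slt can be sketched as a filter over the drive table. `drive_path_for_did` is a hypothetical name, and the sketch assumes every drive row starts with a numeric `EID:Slt` field followed by the DID, as in the output above.

```shell
# Hypothetical helper: turn a DID into a /cX/eY/sZ drive path.
# $1 = controller index, $2 = DID; reads `perccli64 /cX/eall/sall show` on stdin.
drive_path_for_did() {
    awk -v ctl="$1" -v did="$2" '
        $1 ~ /^[0-9]+:[0-9]+$/ && $2 == did {     # drive rows only, matching DID
            split($1, a, ":")                      # a[1]=EID, a[2]=Slt
            printf "/c%s/e%s/s%s\n", ctl, a[1], a[2]
        }'
}
```

Usage would be `./perccli64 /c0/eall/sall show | drive_path_for_did 0 4`.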

3. Light up the failed disk

3.1. Syntax:

  • root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c$x/e$y/s$z start locate

3.2. Placeholder reference

  • c$x = controllerID
  • e$y = EID
  • s$z = Slt

3.3. Start the locate LED

root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c0/e32/s4 start locate
Controller = 0
Status = Success
Description = Start Drive Locate Succeeded.      

3.4. The disk's LED now keeps blinking

Dell server: the LED blinks continuously, on -> off -> on -> off

4. Turn off the disk's locate LED

Once the bay has been confirmed, the locate LED can be turned off and the disk pulled

root@nodeserver1:/opt/MegaRAID/perccli# ./perccli64 /c0/e32/s4 stop locate
Controller = 0
Status = Success
Description = Stop Drive Locate Succeeded.
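
Steps 3 and 4 can be tied together so that the LED is not left blinking by accident. The wrapper below is a sketch only: `blink_until_confirmed` and the `PERCCLI` variable are hypothetical names, and the default binary path is assumed from the working directory used in this article.

```shell
# Path to the perccli binary; assumed default, override via the environment.
PERCCLI="${PERCCLI:-/opt/MegaRAID/perccli/perccli64}"

# Hypothetical helper: start the locate LED, wait for Enter, then stop it.
# $1 = drive path, e.g. /c0/e32/s4
blink_until_confirmed() {
    drive="$1"
    "$PERCCLI" "$drive" start locate || return 1
    printf 'Drive %s is blinking; press Enter once the bay is identified: ' "$drive" >&2
    read -r _reply || true            # continue even if stdin is closed
    "$PERCCLI" "$drive" stop locate
}
```

Usage would be `blink_until_confirmed /c0/e32/s4`, run as root on the server while someone checks the drive bays.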