在 ESXi 中用 PERCCli 换 RAID 中的盘

背景

最近有一台机器的盘出现了报警,需要换掉,然后重建 RAID5 阵列。iDRAC 出现报错:

  1. Disk 2 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly.
  2. Virtual Disk 1 on Integrated RAID Controller 1 has become degraded.
  3. Error occurred on Disk2 in Backplane 1 of Integrated RAID Controller 1 : (Error 2)

安装 PERCCli

首先,因为系统是 VMware ESXi 6.7,所以在DELL 官网下载对应的文件。按照里面的 README 安装 vib:

esxcli software vib install -v /vmware-perccli-007.1420.vib

需要注意的是,如果复制上去 Linux 版本的 PERCCli,虽然也可以运行,但是找不到控制器。安装好以后,就可以运行 /opt/lsi/perccli/perccli 。接着,运行 perccli show all,可以看到类似下面的信息:

$ perccli show all
--------------------------------------------------------------------------------
EID:Slt DID State  DG     Size Intf Med SED PI SeSz Model               Sp Type
--------------------------------------------------------------------------------
32:2      2 Failed  1 3.637 TB SATA HDD N   N  512B ST4000NM0033-9ZM170 U  -
32:4      4 UGood   F 3.637 TB SATA HDD N   N  512B ST4000NM0033-9ZM170 U  -
--------------------------------------------------------------------------------

其中 E32S2 是 Failed 的盘,属于 Disk Group 1;E32S4 是新插入的盘,准备替换掉 E32S2,目前不属于任何的 Disk Group。查看一下 Disk Group :perccli /c0/dall show

$ perccli /c0/dall show
-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace TR
-----------------------------------------------------------------------------
 1 -   -   -        -   RAID5 Dgrd   N    7.276 TB dflt N  N   dflt N      N
 1 0   -   -        -   RAID5 Dgrd   N    7.276 TB dflt N  N   dflt N      N
 1 0   0   32:1     1   DRIVE Onln   N    3.637 TB dflt N  N   dflt -      N
 1 0   1   32:2     2   DRIVE Failed N    3.637 TB dflt N  N   dflt -      N
 1 0   2   32:3     3   DRIVE Onln   N    3.637 TB dflt N  N   dflt -      N

可以看到 DG1 处于 Degraded 状态,然后 E32S4 处于 Failed 状态。参考了一下 PERCCli 文档,它告诉我们要这么做:

perccli /cx[/ex]/sx set offline
perccli /cx[/ex]/sx set missing
perccli /cx /dall show
perccli /cx[/ex]/sx insert dg=a array=b row=c
perccli /cx[/ex]/sx start rebuild

具体到我们这个情景,就是把 E32S2 设为 offline,然后用 E32S4 来替换它:

perccli /c0/e32/s2 set offline
perccli /c0/e32/s2 set missing
perccli /cx /dall show
perccli /cx/e32/s4 insert dg=1 array=0 row=2
perccli /cx/e32/s4 start rebuild

完成以后的状态:

TOPOLOGY :
========

---------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT     Size PDC  PI SED DS3  FSpace TR
---------------------------------------------------------------------------
 1 -   -   -        -   RAID5 Dgrd  N  7.276 TB dflt N  N   dflt N      N
 1 0   -   -        -   RAID5 Dgrd  N  7.276 TB dflt N  N   dflt N      N
 1 0   0   32:1     1   DRIVE Onln  N  3.637 TB dflt N  N   dflt -      N
 1 0   1   32:4     4   DRIVE Rbld  Y  3.637 TB dflt N  N   dflt -      N
 1 0   2   32:3     3   DRIVE Onln  N  3.637 TB dflt N  N   dflt -      N
---------------------------------------------------------------------------

可以看到 E32S4 替换了原来 E32S2 的位置,并且开始重建。查看重建进度:

$ perccli /c0/32/s4 show rebuild
-----------------------------------------------------
Drive-ID   Progress% Status      Estimated Time Left
-----------------------------------------------------
/c0/e32/s4         3 In progress -
-----------------------------------------------------
$ perccli show all
Need Attention :
==============

Controller 0 :
============

-------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model               Sp Type
-------------------------------------------------------------------------------
32:4      4 Rbld   1 3.637 TB SATA HDD N   N  512B ST4000NM0033-9ZM170 U  -
-------------------------------------------------------------------------------

然后,查看一下出错的盘:

$ perccli /c0/e32/s2 show all
Drive /c0/e32/s2 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 6
Drive Temperature =  36C (96.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No

果然有错误,但是也看不到更多信息了。

坏块统计:

$ perccli /c0 show badblocks
Detailed Status :
===============

-------------------------------------------------------------
Ctrl Status Ctrl_Prop       Value ErrMsg               ErrCd
-------------------------------------------------------------
   0 Failed Bad Block Count -     BadBlockCount failed     2
-------------------------------------------------------------

经过检查以后,发现 E32S2 盘的 SMART 并没有报告什么问题,所以也没有把盘取走,而是作为 hot spare 当备用:

$ perccli /c0/e32/s2 add hotsparedrive DG=1
$ perccli /c0/d1 show
TOPOLOGY :
========

---------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT     Size PDC  PI SED DS3  FSpace TR
---------------------------------------------------------------------------
 1 -   -   -        -   RAID5 Dgrd  N  7.276 TB dflt N  N   dflt N      N
 1 0   -   -        -   RAID5 Dgrd  N  7.276 TB dflt N  N   dflt N      N
 1 0   0   32:1     1   DRIVE Onln  N  3.637 TB dflt N  N   dflt -      N
 1 0   1   32:4     4   DRIVE Rbld  Y  3.637 TB dflt N  N   dflt -      N
 1 0   2   32:3     3   DRIVE Onln  N  3.637 TB dflt N  N   dflt -      N
 1 -   -   32:2     2   DRIVE DHS   -  3.637 TB -    -  -   -    -      N
---------------------------------------------------------------------------

DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Optl=Optimal|Dgrd=Degraded
Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready

这样就可以做后备盘,当别的盘坏的时候,作为备用。

相关软件下载

可以在这里寻找 StorCLI 版本。

StorCLI:

MegaCLI:

PercCLI:

comments powered by Disqus