
Handling an Abnormal Local Disk Status on an ODA Server Compute Node

2025/5/15  Source: https://blog.csdn.net/q947817003/article/details/147857346

Recently, during routine system inspection, a local-disk alarm was found on the compute node of a customer's ODA server (this applies to models such as ODA X8 and X9). The local disks are 240 GB M.2 SSDs (card form factor), which have no external indicator visible from the server's front panel; opening the chassis cover does reveal a green LED on each M.2 card, but routine inspections normally don't go that far.

This makes the problem quite easy to miss. The two M.2 SSDs are mirrored in a RAID1 set, and OS-level routine checks generally won't notice the fault either; it has to be checked with specific commands such as cat /proc/mdstat and odaadmcli show localdisk. Meanwhile, the STORAGE menu in ILOM still shows the disk status as normal, although the system log may contain disk INSERT/REMOVE entries that can serve as a clue.
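
For routine inspections, the two checks above can be combined into a small script that flags a degraded mirror. This is only a minimal sketch, assuming it runs as root, that odaadmcli is in the PATH, and that the OS RAID1 device is md126 as on this node:

#!/bin/bash
# Minimal local-disk health check for an ODA compute node (sketch).

# Flag any local disk that the appliance tooling reports as missing or failed
if odaadmcli show localdisk | grep -qiE 'MISSING|FAILED'; then
    echo "WARNING: odaadmcli reports an abnormal local disk"
fi

# A healthy two-disk RAID1 shows [UU]; [U_] or [_U] means one mirror is gone
if grep -A2 '^md126' /proc/mdstat | grep -qE '\[U_\]|\[_U\]'; then
    echo "WARNING: md126 is running degraded"
fi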

In this case, the alarm cleared after the disk was reseated and the host rebooted.

The handling process was as follows:

Status check:

[root@hisdb2 ~]# odaadmcli show localdisk
        NAME            PATH            TYPE            STATUS                 STATE_IN_ILOM
 
        lpd_0           sda             SSD             GOOD                   GOOD           
        lpd_1           N/A             SSD             MISSING                GOOD    ==== this disk has failed

[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1] 
md126 : active raid1 sda[0]
      234425344 blocks super external:/md127/0 [2/1] [U_]  == normally [UU]; here it is [U_]
      
md127 : inactive sda[0](S)
      5201 blocks super external:imsm
       
unused devices: <none>
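
To see exactly which member has dropped out of the mirror, mdadm can report the array details directly. This is a sketch; the device names follow the output above, and the exact fields vary with the mdadm version:

# Detailed view of the degraded RAID1: look for a degraded state and a slot
# marked "removed" where the failed M.2 disk used to be
mdadm --detail /dev/md126

# The IMSM metadata container can be inspected the same way
mdadm --detail /dev/md127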

Logs in ILOM: entries showing the disk being INSERTED/REMOVED can be seen, which suggests the disk connection was unstable.

After reseating the disk and rebooting, the system recovered:

Note that the ODA services depend on the clusterware, so CRS autostart must not be disabled (a quick way to verify the setting is shown after the log below):

May  8 17:41:46 hisdb2 su: (to grid) root on none
May  8 17:41:46 hisdb2 su: (to root) root on none
May  8 17:41:46 hisdb2 su: (to root) root on none
May  8 17:43:14 hisdb2 init.oak: 2025-05-08 17:43:14.460969204:[init.oak]:[Waiting for Cluster Ready Services. Diagnostics in /tmp/crsctl.4142]
May  8 17:45:45 hisdb2 init.oak: 2025-05-08 17:45:45.619750299:[init.oak]:[Waiting for Cluster Ready Services. Diagnostics in /tmp/crsctl.4142]
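
If there is any doubt about the autostart setting, it can be verified with crsctl before rebooting. The Grid home path below is only a placeholder and should be replaced with the actual Grid Infrastructure home on the node:

# GRID_HOME is a placeholder; point it at the node's actual Grid Infrastructure home
GRID_HOME=/u01/app/19.0.0.0/grid

# Show whether Clusterware (Oracle High Availability Services) autostart is enabled
$GRID_HOME/bin/crsctl config crs

# If autostart had been disabled, it can be re-enabled as root:
# $GRID_HOME/bin/crsctl enable crs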

After the reboot, both M.2 disks were recognized again and the system automatically resynchronized the data to repair the RAID. The output is as follows:

[root@hisdb2 ~]# odaadmcli show localdisk
        NAME            PATH            TYPE            STATUS                  STATE_IN_ILOM
 
        lpd_0           sda             SSD             GOOD                    GOOD           
        lpd_1           sdb             SSD             GOOD                    GOOD           
[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1] 
md126 : active raid1 sdb[1] sda[0]
      234425344 blocks super external:/md127/0 [2/1] [U_]
      [=====>...............]  recovery = 25.8% (60691392/234425344) finish=14.3min speed=202388K/sec
      
md127 : inactive sda[1](S) sdb[0](S)
      10402 blocks super external:imsm
       
unused devices: <none>
[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1] 
md126 : active raid1 sdb[1] sda[0]
      234425344 blocks super external:/md127/0 [2/1] [U_]
      [======>..............]  recovery = 32.2% (75616576/234425344) finish=13.2min speed=199081K/sec
      
md127 : inactive sda[1](S) sdb[0](S)
      10402 blocks super external:imsm
       
unused devices: <none>
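
The resync progress can be followed without re-running the command by hand; for example (a sketch):

# Refresh the software-RAID status every 5 seconds until recovery completes
watch -n 5 cat /proc/mdstat

# Or query the array directly for its state and rebuild percentage
mdadm --detail /dev/md126 | grep -E 'State|Rebuild Status'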

Final state:

[root@hisdb2 ~]# cat /proc/mdstat
Personalities : [raid1] 
md126 : active raid1 sdb[1] sda[0]
      234425344 blocks super external:/md127/0 [2/2] [UU]
      
md127 : inactive sda[1](S) sdb[0](S)
      10402 blocks super external:imsm
       
unused devices: <none>
[root@hisdb2 ~]#  odaadmcli show localdisk
        NAME            PATH            TYPE            STATUS                  STATE_IN_ILOM
 
        lpd_0           sda             SSD             GOOD                    GOOD           
        lpd_1           sdb             SSD             GOOD                    GOOD 
 

Relevant entries from the messages log for reference:

May  8 18:16:20 hisdb2 kernel: md/raid1:md126: active with 1 out of 2 mirrors
May  8 18:16:20 hisdb2 kernel: md126: detected capacity change from 0 to 240051552256
May  8 18:16:20 hisdb2 kernel: md126: p1 p2 p3
May  8 18:16:20 hisdb2 systemd: Starting MD Metadata Monitor on /dev/md127...
May  8 18:16:20 hisdb2 systemd: Started MD Metadata Monitor on /dev/md127.
May  8 18:16:20 hisdb2 kernel: md: recovery of RAID array md126
May  8 18:16:21 hisdb2 kernel: EXT4-fs (md126p2): mounted filesystem with ordered data mode. Opts: (null)
May  8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May  8 18:16:21 hisdb2 kernel: md: md126 still in use.
May  8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May  8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May  8 18:16:21 hisdb2 kernel: md: md126 still in use.
May  8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May  8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May  8 18:16:21 hisdb2 kernel: md: md126 still in use.
May  8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May  8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May  8 18:16:21 hisdb2 kernel: md: md126 still in use.
May  8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May  8 18:16:21 hisdb2 kernel: md: md126: recovery interrupted.
May  8 18:16:21 hisdb2 kernel: md: md126 still in use.
May  8 18:16:21 hisdb2 kernel: md: recovery of RAID array md126
May  8 18:16:22 hisdb2 systemd: Stopped MD Metadata Monitor on /dev/md127.
May  8 18:16:24 hisdb2 systemd: Starting MD Metadata Monitor on /dev/md127...
May  8 18:16:24 hisdb2 systemd: Started MD Metadata Monitor on /dev/md127.
May  8 18:16:24 hisdb2 systemd-fsck: /dev/md126p2: clean, 67/128016 files, 148390/512000 blocks
May  8 18:16:24 hisdb2 kernel: EXT4-fs (md126p2): mounted filesystem with ordered data mode. Opts: (null)
May  8 18:16:24 hisdb2 kernel: FAT-fs (md126p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
May  8 18:17:21 hisdb2 systemd: rc-local.service: control process exited, code=exited status=127
May  8 18:31:12 hisdb2 systemd: Starting Cleanup of Temporary Directories...
May  8 18:31:12 hisdb2 systemd: Started Cleanup of Temporary Directories.
May  8 18:36:11 hisdb2 kernel: md: md126: recovery done.
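
To pull the relevant RAID and disk events out of the system log in one pass, a filter along these lines can be used (the log path assumes the standard Oracle Linux layout on this node):

# Filter kernel software-RAID and local-disk messages from the system log
grep -iE 'md12[67]|raid1|sd[ab]' /var/log/messages | tail -n 50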


At the same time, the Linux lsblk command also shows that both disks back the system partitions:

[root@hisdb2 ~]# lsblk
NAME                         MAJ:MIN   RM   SIZE RO TYPE  MOUNTPOINT

sdb                            8:16     0 223.6G  0 disk  
└─md126                        9:126    0 223.6G  0 raid1 
  ├─md126p2                  259:1      0   500M  0 md    /boot
  ├─md126p3                  259:2      0 222.6G  0 md    
  │ ├─VolGroupSys-LogVolOpt  252:20     0    30G  0 lvm   /opt
  │ ├─VolGroupSys-LogVolSwap 252:1      0    24G  0 lvm   [SWAP]
  │ ├─VolGroupSys-LogVolU01  252:21     0    40G  0 lvm   /u01
  │ └─VolGroupSys-LogVolRoot 252:0      0    30G  0 lvm   /
  └─md126p1                  259:0      0   500M  0 md    /boot/efi
sda                            8:0      0 223.6G  0 disk  
└─md126                        9:126    0 223.6G  0 raid1 
  ├─md126p2                  259:1      0   500M  0 md    /boot
  ├─md126p3                  259:2      0 222.6G  0 md    
  │ ├─VolGroupSys-LogVolOpt  252:20     0    30G  0 lvm   /opt
  │ ├─VolGroupSys-LogVolSwap 252:1      0    24G  0 lvm   [SWAP]
  │ ├─VolGroupSys-LogVolU01  252:21     0    40G  0 lvm   /u01
  │ └─VolGroupSys-LogVolRoot 252:0      0    30G  0 lvm   /
  └─md126p1                  259:0      0   500M  0 md    /boot/efi
