Server Data Recovery Case: RAID 5 Array Crash After Disks Went Offline and the Hot Spare Failed to Activate

Server Data Recovery Environment:

Four SAS hard drives in a brand-name server form a RAID 5 array, with one additional disk configured as a hot spare. The operating system on top of the array is Red Hat Linux, on which an Oracle-based OA system is deployed.

Server Failure & Initial Inspection:

After one disk in the RAID 5 array went offline, the hot spare did not automatically start a rebuild. A second disk then went offline and the array crashed. Because Oracle no longer provided follow-up support for the OA system deployed on the server, the user contacted our data recovery center and asked us to recover the data and restore the operating system.

Hardware engineers numbered all the disks in the failed server, removed them, and tested them. Testing showed that the hot spare had never been activated, had no physical fault, and showed no obvious sign of synchronization.

Server Data Recovery & Operating System Recovery Process:

1. All disks from the failed server were fully imaged in read-only mode. During imaging, the disk that went offline second was found to have more than a dozen bad sectors; no bad sectors were found on the other disks. After imaging, all disks were returned to the original server in their numbered positions. All subsequent analysis and recovery work was done on the image files to avoid secondary damage to the data on the original disks.

2. The RAID 5 structure was analyzed from the image files to determine the array parameters: disk order, stripe block size, and parity layout (backward parity, Adaptec).
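The parameters from this step are what a virtual reassembly needs in order to find each stripe unit. The sketch below is only an illustration and not the tool used in this case. It assumes that "backward parity" corresponds to the left-asymmetric rotation (parity starts on the last disk and moves one disk toward the first on each row) and that disks are numbered from 0 in the order determined by the analysis. The function name locate_unit is hypothetical.

```python
from typing import Tuple


def locate_unit(logical_unit: int, n_disks: int) -> Tuple[int, int, int]:
    """Map a logical stripe unit to (row, data_disk, parity_disk).

    Assumption: left-asymmetric ("backward parity") RAID 5 layout.
    """
    if n_disks < 3:
        raise ValueError("RAID 5 needs at least three member disks")
    data_per_row = n_disks - 1
    row, index = divmod(logical_unit, data_per_row)
    # Parity sits on the last disk in row 0 and steps backward each row.
    parity_disk = (n_disks - 1) - (row % n_disks)
    # Data units fill the remaining disks in ascending order, skipping parity.
    data_disk = index if index < parity_disk else index + 1
    return row, data_disk, parity_disk


if __name__ == "__main__":
    for unit in range(12):
        print(unit, locate_unit(unit, 4))
```

The byte offset of a unit inside a member image would then be row multiplied by the block size found in the analysis, plus any start offset the controller reserves, which is not stated in this case.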

North Asia Enterprise Data Recovery – RAID 5 Data Recovery

3. Using the parameters obtained in the previous step, the RAID was reassembled virtually and the data was verified. Compressed archives larger than 200 MB decompressed without error, which confirmed that the structure was correct.
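Large archives make a useful check because one file spans many stripe rows, so a wrong disk order or block size almost always breaks its checksum. The sketch below shows one way such a check could be automated. It assumes the archives are in ZIP format, which the case does not state, and the function name archive_is_intact is hypothetical.

```python
import io
import zipfile
import zlib


def archive_is_intact(data: bytes) -> bool:
    """Return True if every member of a ZIP archive passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() reads every member and returns the name of the
            # first one that fails its check, or None if all are good.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, EOFError):
        # Damaged headers or truncated streams also count as a failure.
        return False


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("payload.bin", bytes(range(256)) * 1200)
    print(archive_is_intact(buffer.getvalue()))
```

For archives read from a reassembled volume, the same function could be given the file contents, or adapted to pass a file path to zipfile.ZipFile instead of an in-memory buffer.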

4. Based on this RAID structure, the virtual RAID was written out to a single hard disk. The file system on it opened with no obvious errors.

5. With the user's authorization, the RAID was rebuilt on the original disks. During the rebuild, the offline disk on which bad sectors had been found was replaced with a new hard disk.

6. The single disk holding the recovered data was connected to the failed server over USB. The server was booted from the Linux SystemRescueCd, and the whole disk was written back with the dd command.

7. After the write-back finished, the operating system was started but could not boot. The error message was: /etc/rc.d/rc.sysinit:Line 1:/sbin/pidof:Permission denied.

8. Suspecting a problem with this file's permissions, the engineers rebooted into SystemRescueCd and checked it. The file's timestamps, permissions, and size were all clearly wrong, which pointed to a damaged inode.

9. The root partition in the reassembled data was analyzed again and the faulty /sbin/pidof was located. The error was traced to the bad sectors found on the offline disk.

10. The three intact disks were used to reconstruct the damaged areas of the disk that went offline second. After this repair, the file system still showed the error. A second check of the inode table found that some inodes in the damaged area of that disk were abnormal.

Although the UID recorded in these inodes was still valid, the attributes, size, and first allocated block were all wrong. Every available analysis was tried, but none could recover the damaged inode directly. The only options were to repair the inode by hand or to copy over an identical file.
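Filling in a damaged area from the other members relies on the RAID 5 property that the XOR of all blocks in a stripe row, parity included, is zero, so any one block equals the XOR of the rest. The following is a minimal sketch of that operation on in-memory blocks, not the procedure used in this case. The function name rebuild_block is hypothetical.

```python
from typing import Sequence


def rebuild_block(surviving: Sequence[bytes]) -> bytes:
    """XOR the same-offset blocks from all other members of one stripe row."""
    if not surviving:
        raise ValueError("need at least one surviving block")
    size = len(surviving[0])
    rebuilt = bytearray(size)
    for block in surviving:
        if len(block) != size:
            raise ValueError("all blocks in a stripe row must be the same size")
        for i, value in enumerate(block):
            rebuilt[i] ^= value
    return bytes(rebuilt)


if __name__ == "__main__":
    d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
    parity = rebuild_block([d0, d1, d2])
    print(rebuild_block([d0, d2, parity]) == d1)
```

The result is only correct where the other members were still in sync for that row. A disk that dropped out of the array earlier may hold stale data in areas written after it went offline.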

11. For every file that might be affected, the North Asia Enterprise Security data recovery engineers used the log to determine the original inode information for the inode block, and then corrected the inodes.

12. After the fix, the root partition was written back again with dd. Running fsck -fn /dev/sda5 still reported errors.

The error messages showed that multiple inodes in the file system shared the same data blocks. Low-level analysis guided by these messages found that old and new inode information overlapped.
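The condition fsck reported here, one data block claimed by more than one inode, can be illustrated with a small lookup. The sketch below works on a hypothetical in-memory mapping from inode number to the block numbers it claims. It does not parse any on-disk file system structure, and the function name find_shared_blocks and the sample numbers are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_shared_blocks(inode_blocks: Dict[int, Iterable[int]]) -> Dict[int, List[int]]:
    """Return {block: [inodes]} for every block claimed by more than one inode."""
    owners = defaultdict(set)
    for inode, blocks in inode_blocks.items():
        for block in blocks:
            owners[block].add(inode)
    return {
        block: sorted(inodes)
        for block, inodes in owners.items()
        if len(inodes) > 1
    }


if __name__ == "__main__":
    # Hypothetical data: inodes 12 and 13 both claim block 101.
    sample = {12: [100, 101], 13: [101, 102], 14: [200]}
    print(find_shared_blocks(sample))
```

Deciding which inode is the stale one still requires knowing which file each inode belongs to, which is the manual work described in the next step.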

13. The inodes were sorted out according to the files they belonged to, and the erroneous inodes were cleared. Running fsck -fn /dev/sda5 again still reported a small number of errors. The messages showed that most of the affected inodes were in the doc directory and did not affect system startup, so fsck -fy /dev/sda5 was run directly to force the repair.

14. After the repair, the system was restarted and booted successfully to the desktop.

15. The Oracle database service and the OA system were started without errors.

16. The user tested the recovered operating system and data (the OA system and the Oracle database) several times and confirmed that the restored data was complete and usable. The data recovery work was complete.