
Troubleshooting a production HDFS cluster that entered safe mode


01

Problem phenomenon

When one of the HDFS datanodes went down, the blocks on that datanode were reported as corrupt, causing HDFS to enter safe mode.

On the HDFS web UI home page, the banner showed that Safe Mode is ON: HDFS had entered safe mode.

02

What is Safe Mode

HDFS safe mode is a special state of the HDFS file system. In this state, HDFS only accepts read requests and rejects change requests such as deletions and modifications, and the namenode does not replicate or delete the underlying blocks.

Essentially, safe mode is a self-protective state of HDFS: the purpose of entering it is to guarantee data consistency and prevent data loss across the file system, which is why users are limited to reading data and cannot change it.

03

When does HDFS enter safe mode?

Passive entry

Passive entry is generally triggered manually by administrators, for example during cluster O&M or scale-out.

You can run the following command to enter safe mode:

hdfs dfsadmin -safemode enter

After the processing is complete, you can run the following command to exit the safe mode:

hdfs dfsadmin -safemode leave
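
For reference, you can query the current state at any time with the standard dfsadmin subcommand below; it prints either "Safe mode is ON" or "Safe mode is OFF":

hdfs dfsadmin -safemode get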

Active entry

Comparatively speaking, active entry covers more scenarios. In these cases, HDFS puts itself into a self-protective state in order to preserve the data consistency of the entire file system and prevent data loss under abnormal conditions.

When the namenode starts, HDFS first enters safe mode. As the datanodes start, they report their available blocks and other status to the namenode, and once the whole system meets the safety thresholds, HDFS automatically leaves safe mode. While HDFS is in safe mode, the namenode does not replicate any blocks; whether a block meets the minimum replication factor is judged purely from the replicas reported by the datanodes as they start, so no extra replication is performed during startup to satisfy that check.
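
In startup or scheduling scripts that must wait for this automatic exit, the same dfsadmin subcommand offers a blocking variant; a minimal sketch:

# Blocks until the namenode has left safe mode, then continues
hdfs dfsadmin -safemode wait && echo "namenode has left safe mode"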

When does the system leave safe mode, and what conditions need to be met?

● The number of datanodes that have started successfully and maintain regular heartbeats with the namenode must reach the threshold specified by dfs.namenode.safemode.min.datanodes;

● The percentage of blocks that meet the minimum replication factor, out of the total number of blocks in the system, must exceed the threshold specified by dfs.namenode.safemode.threshold-pct (the other conditions must also be met). The default is 0.999f, meaning that more than 99.9% of blocks must meet the minimum replication factor before safe mode can be left. A value less than or equal to 0 means the namenode does not wait for any blocks before leaving; a value greater than 1 keeps HDFS in safe mode permanently.

● Naturally, if the number of datanodes that started successfully and keep heartbeating with the namenode has not reached its threshold, the percentage of blocks meeting the minimum replication factor will generally not reach its threshold either.

● The minimum replication factor used in this check is specified by dfs.namenode.replication.min, which defaults to 1.

● dfs.namenode.safemode.extension: after the percentage of available blocks and the number of available datanodes meet their thresholds, the cluster leaves safe mode only if those conditions still hold after this additional period. The unit is milliseconds; the default in Apache Hadoop is 30000 (30 seconds), though some distributions or clusters set a much smaller value. This setting exists to confirm that the cluster is genuinely stable rather than leaving safe mode the instant the thresholds are first met.
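
For reference, the effective values of these thresholds on a running cluster can be read with hdfs getconf; a minimal sketch using the standard parameter names listed above:

hdfs getconf -confKey dfs.namenode.safemode.min.datanodes
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct
hdfs getconf -confKey dfs.namenode.replication.min
hdfs getconf -confKey dfs.namenode.safemode.extension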

In reality, the most common reasons why HDFS enters safe mode are as follows:

● Some datanodes fail to start, or their heartbeats to the namenode fail because of network problems;

● Some of the disk volumes on which a datanode stores HDFS data are corrupted, so the data on those volumes cannot be read;

● The disk partitions on which some datanodes store HDFS data are full, so the data on those volumes cannot be read or written normally.
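
A quick first check for these causes is to compare the live and dead datanode counts reported by the namenode and to look at disk usage on each datanode host; a minimal sketch (the grep pattern assumes the English headings printed by recent Hadoop versions):

# Live vs. dead datanodes as seen by the namenode
hdfs dfsadmin -report | grep -E 'Live datanodes|Dead datanodes'

# On each datanode host: check whether the data disks are full
df -h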

List of HDFS safe mode parameters

The parameters discussed above are summarized here (defaults are the Apache Hadoop defaults):

Parameter                              Default     Meaning
dfs.namenode.safemode.min.datanodes    0           Minimum number of live datanodes required before leaving safe mode
dfs.namenode.safemode.threshold-pct    0.999f      Minimum fraction of blocks that must meet the minimum replication factor
dfs.namenode.replication.min           1           Minimum replication factor a block must satisfy to count as "safe"
dfs.namenode.safemode.extension        30000 ms    How long the above conditions must continue to hold before safe mode is exited

04

How to fix it

Analyzing the cause

When HDFS enters safe mode, you first need to analyze why it did so, using one of the following approaches:

1. On the HDFS web UI, check the overall cluster state and the state of each datanode.

2. Check the relevant logs, usually under /var/log/xxx, and look through the detailed namenode and datanode logs for useful clues.
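
For point 2, a minimal log-checking sketch, assuming a packaged installation where the namenode log sits under a path like /var/log/hadoop-hdfs (the directory and file name vary by distribution, so treat them as placeholders):

# Placeholder path and file name; adjust to your installation
grep -iE 'safe ?mode|corrupt|missing block' /var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log | tail -n 50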

Fix the issue

After confirming the reason for entering safe mode through the above checks, you can perform targeted fixes:

● For example, if a datanode failed to start, try to repair it and bring that datanode back up;

● For example, if the disk partitions a datanode uses to store HDFS data are full, try to expand the disk space;

● For example, if a storage volume on a datanode has failed, try to repair it; if it cannot be repaired, replace the volume (the data on that volume will be lost);

● Note that if some datanodes are damaged beyond repair and cannot be started, or some datanode disk volumes have failed completely and cannot be repaired, the blocks on them, and the HDFS files containing those blocks, are lost. You may need to ask the business team to backfill the data (re-pull it from upstream, or re-run the jobs that produced it); if the business team cannot backfill it, the data is permanently lost and cannot be recovered;

● You can list the lost blocks and the HDFS files they belong to with the following commands, and record the output to hand over to the business team so they can decide whether the data needs to be backfilled (a sketch of saving this output to files appears after this list): hdfs fsck / -list-corruptfileblocks and hdfs fsck / -files -blocks -locations;

● If no data has been lost, then once the cluster has been repaired and restarted as described above, HDFS will exit safe mode on its own and serve reads and writes normally.

● If data has been lost, you need to manually exit safe mode and delete the HDFS files that correspond to the corrupted/lost blocks, using the following commands:

1. Exit safe mode (data can only be deleted after leaving safe mode): sudo -u hdfs hdfs dfsadmin -safemode leave;

2. Delete the HDFS files that correspond to the lost blocks (this checks the file system and automatically deletes the files whose blocks are missing): sudo -u hdfs hdfs fsck / -delete;

3. After the files corresponding to the lost blocks have been deleted, the percentage of blocks meeting the minimum replication factor reaches the threshold again (the total number of files drops, so the total number of blocks drops, and the percentage of successfully reported blocks rises accordingly). After a restart the cluster can then exit safe mode normally and serve reads and writes again.
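
As noted above when listing the lost blocks, it is worth saving the fsck output before deleting anything, so that the business team has a record of exactly which paths were affected; a minimal sketch (the output file names are only examples):

# Save the list of files with corrupt/missing blocks for later backfill decisions
sudo -u hdfs hdfs fsck / -list-corruptfileblocks > corrupt_files_$(date +%F).txt

# Optional: detailed block and location information for the same paths
sudo -u hdfs hdfs fsck / -files -blocks -locations > fsck_detail_$(date +%F).txt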

05

Complete handling process in production

Note that all of the following commands need to be executed as the hdfs user; you can switch to it with su - hdfs.

1. Exit safe mode first

sudo -u hdfs hdfs dfsadmin -safemode leave

2. Check the current HDFS status

hdfs dfsadmin -report
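
The full report is fairly long; if you only need the headline numbers, it can be filtered (the patterns assume the English summary lines printed by recent Hadoop versions):

hdfs dfsadmin -report | grep -E 'Missing blocks|Blocks with corrupt replicas|Dead datanodes'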

3. List the paths of all corrupted file blocks in the HDFS file system

hdfs fsck / -list-corruptfileblocks

4. Check the overall health of the file system

hdfs fsck /

5. Inspect the details of a damaged file (blocks and locations)

hdfs fsck /path/to/corrupt/file -locations -blocks -files

6. Delete the files with missing or corrupted blocks

hdfs fsck / -delete

This command deletes the files with missing and corrupted blocks under the path "/". You may also want to narrow it to a specific path based on the results of step 4.
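
For example, rather than running the delete against the whole namespace, you can restrict it to the directory that actually contains the corrupt files identified earlier; the path below is purely illustrative:

hdfs fsck /warehouse/example_db/example_table -delete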

7. Check again whether the file system is healthy ("The filesystem under path '/' is HEALTHY"). Deleting blocks takes time, so if the status is still unhealthy, check again a little later.

hdfs fsck / -list-corruptfileblocks

8. In some cases, files are not deleted by step 6, so you can remove them from HDFS directly with the following command:

hdfs dfs -rm "/File/Path/of/the/missing/blocks"
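
Note that if the HDFS trash feature is enabled, hdfs dfs -rm first moves the file into the user's .Trash directory rather than freeing the blocks immediately; for files that are already corrupt you may prefer to bypass the trash (use with care):

hdfs dfs -rm -skipTrash "/File/Path/of/the/missing/blocks"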

After the preceding steps are completed, HDFS will, under normal circumstances, return to regular operation.

06

Summary

● For HDFS data availability and durability, we recommend keeping at least 2 block replicas in production (a sketch of checking and changing the replication factor follows this list).

● Pay attention to the disk usage of each datanode in HDFS, and expand the capacity in time if the threshold is exceeded.

● If HDFS enters safe mode, analyze the cause first; never force it out of safe mode blindly.

● If HDFS entered safe mode because blocks are damaged, try to restore those blocks from other replicas first, and do not delete the damaged blocks straight away. Only if the data cannot be recovered should you delete the damaged blocks and then try to restore the data from upstream, for example by re-running the jobs that produced it.

● Be cautious when changing safe mode parameters in production, and do not tweak these parameters simply to force HDFS out of safe mode.
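
Relating to the first point above, a minimal sketch for checking the cluster's default replication factor and raising the replication of an existing directory; the path is purely illustrative:

# Effective default replication factor of the cluster
hdfs getconf -confKey dfs.replication

# Raise replication of an existing directory tree to 3 and wait for it to finish
hdfs dfs -setrep -w 3 /warehouse/example_db/example_table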

Author | Feng Chengyang, senior big data development engineer

Source | WeChat public account: Micro Carp technical team

Source: https://mp.weixin.qq.com/s/3Aa1fIbBmZ4oZXX9booSTQ
