
【Stability】The team's quick-troubleshooting "Three-Character Classic": have you learned it?

Author: JD Cloud developer

background

Online failures are an inevitable part of technical growth; they teach valuable lessons and make us more experienced. Yet not every team, or every engineer, handles faults in a reasonable and scientific way. Based on day-to-day practice and personal experience, I have compiled this "Three-Character Classic" checklist for quickly troubleshooting problems (or suspected problems), together with correct ✅ and incorrect ❌ cases. The checklist helps you troubleshoot quickly without worrying that, under high pressure, you will make mistakes or miss critical steps. With it in hand you have better control of the scene and avoid losses caused by negligence. Let us stay calm in the face of failures, investigate in an orderly manner, and continuously improve our technical level and problem-solving ability.

The mnemonic

Note: the steps below are not a strict order. Adjust them to the actual situation and run several in parallel; for example, if the rough link where the problem sits is clear, stop the bleeding first. If it is not clear, report the problem and divide the work, locate the approximate link, and then stop the bleeding quickly.

Every measure and action taken during fault handling treats restoring the business as the highest priority. A quick stop-the-bleeding fix that restores the site outranks finding the cause of the failure and every other step.

Don't panic, report first

First organize the meeting, clear division of labor

Describe phenomena, not conclusions

Stop bleeding first, then localize

Look at the monitoring, look at the logs

Find a pattern and experiment first

Look at the input, look at the output

Stay on site and give feedback

Don't panic, report first

1. When handling an emergency (an online problem or a suspected one), report it to the group first.

2. Draw on the strength of the whole team: brainstorming finds the root cause faster and leads to the right corrective measures.

✅ Positive example:

  • Suppose a team member receives feedback from the business or another team about an online problem. He first escalates the issue to the TL/architect within the group,
  • then shares the details with the team through an online meeting or instant messaging. Team members quickly join the discussion, brainstorm, and analyze the causes and possible solutions together.
  • Through this joint effort the root cause is found and the problem solved. Along the way the team exercises the power of collaboration and raises the efficiency of problem solving.

❌Counterexample:

  • Suppose a team member receives the same feedback about an online problem. Instead of escalating it to the group, he tries to solve it alone. The problem is not solved in time, and the online losses grow.

First organize the meeting, clear division of labor

1. The fault commander (the problem finder or the TL/architect) has the authority to convene the necessary business, product, development, and other resources and quickly organize a meeting.

2. The fault commander clarifies the responsibilities of each role; a clear division of labor markedly improves the efficiency of fault handling.

✅ Positive example:

  • The fault commander (the problem finder or the TL/architect) gathers the relevant business, product, and development team members and quickly convenes a meeting. At the meeting the roles and responsibilities of each member are made explicit,
  • covering product staff, developers, testers, and so on, with a clear division of labor. Everyone knows exactly what they need to accomplish, which effectively improves troubleshooting efficiency.
  • By brainstorming, the team quickly identifies the root cause and takes the appropriate action to resolve it.

❌Counterexample:

  • The fault commander (the problem finder or the TL/architect) fails to convene the relevant business, product, and development members in time and tries to solve the problem alone. Lacking the necessary business knowledge and experience, he cannot solve it promptly, while the other members, unaware of the problem's seriousness, pay it little attention. The commander then has to spend extra time and energy coordinating resources, the problem may never be solved effectively, and the progress of the whole project suffers.

Describe phenomena, not conclusions

1. Have the problem finder describe the observed phenomenon (time, scope of impact, degree of impact) rather than a judged conclusion, because a wrong conclusion will mislead everyone's investigation.

2. Avoid injecting too much personal judgment and opinion while describing the phenomenon, because it can steer the investigation in the wrong direction. Stay objective and rational so the problem can be analyzed properly.

3. Watch your own way of thinking as well, and do not let your brain become a racetrack for other people's thoughts. When discussing the issue you may offer opinions and suggestions, but make sure they rest on facts and evidence, not personal subjective assumption.

✅ Positive example:

Suppose a software development team encounters a performance issue and the problem finder describes the following phenomenon:

  • Time: Between 9 a.m. and 10 a.m. on August 18, 2023, users noticed significant performance degradation during use.
  • Scope of influence: affects the main functional modules such as user login and data query.
  • Impact degree: Due to the performance degradation, the response time of users' requests increases significantly, and some users cannot complete operations normally.

In this example, the problem finder provides specific information about the phenomenon, allowing team members to quickly locate the problem and take appropriate action to optimize and fix it.

❌Counterexample:

Suppose a software development team encounters a performance issue and the problem finder only gives his own judgment:

  • When: August 18, 2023, between 9am and 10am.
  • Scope: This could be an issue with database connection pooling.
  • Degree of impact: Very serious, resulting in most users being unable to use the system normally.

In this example the problem finder provides no concrete information about the phenomenon, only his own conclusion. That can leave team members confused and unable to pinpoint the source of the problem, and because conclusions can be wrong, it can also mislead them into wasting time and effort.
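The phenomenon-first format above (time, scope of impact, degree of impact) can be captured in a small report template, so the reporter states facts rather than conclusions. This is an illustrative sketch; the class and field names are invented, not part of any real tooling:

```python
from dataclasses import dataclass

@dataclass
class IncidentReport:
    """A phenomenon-only report: observed facts, no root-cause guesses."""
    time_window: str   # when the symptom was observed
    scope: str         # which modules/users are affected
    impact: str        # how badly they are affected

    def summary(self) -> str:
        return (f"Time: {self.time_window} | "
                f"Scope: {self.scope} | "
                f"Impact: {self.impact}")

report = IncidentReport(
    time_window="2023-08-18 09:00-10:00",
    scope="login and data-query modules",
    impact="request latency up sharply; some users cannot complete operations",
)
print(report.summary())
```

Note that none of the fields invite a diagnosis; a "suspected cause" field, if added at all, should be clearly separated from the facts.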

Stop bleeding first, then localize

1. Every measure and action taken during fault handling treats restoring the business as the highest priority; restoring the site with a stop-the-bleeding plan outranks finding the cause of the failure.

2. Rapid hemostasis: use traffic switching, rollbacks, and similar means to restore system functions quickly and avoid service interruption.

3. Daily emergency plans: prepare plans in advance, including disaster-recovery strategies and failover processes for key services, so that when a failure occurs the plan can be activated immediately and the business impact reduced.

4. Do not dwell on the cause of the fault while handling it. Finding the cause is the key to solving the problem, but in an emergency the priority is to stop the bleeding and restore the business. Only once the business runs stably is there time and energy to analyze the cause in depth and fix the problem at its root.

✅ Positive example:

  • Case 1: A team hits a system failure, and the problem finder quickly takes stop-the-bleeding measures to restore system functionality. Because the team prepared detailed emergency plans in its daily work, a similar failure can be met by activating the plan immediately, reducing the business impact. Here, restoring the business was the top priority, quick hemostasis ensured the system's stability and availability, and the daily plans kept the team prepared for future failures.
  • Case 2: Errors start appearing during a release, and everything was normal before the release? Don't investigate anything yet: roll back first, and troubleshoot once things are back to normal.
  • Case 3: An application has run stably for a long time, but its processes suddenly start to exit? A memory leak is likely; restart some of the machines and observe whether the bleeding stops.

❌Counterexample:

  • A team hits a system failure, and the problem finder spends a long time hunting for the cause, ignoring the importance of stopping the bleeding and restoring the business quickly. The team also has no detailed emergency plan, so handling is chaotic and the system cannot be restored quickly. Being too focused on the cause while neglecting quick hemostasis makes fault handling inefficient, the system is not restored in time, and the business takes a heavy hit.
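The stop-the-bleeding decisions in the cases above can be sketched as a tiny decision helper. The function name and flags are invented for illustration; a real decision of course weighs much more context:

```python
def first_aid_action(just_deployed: bool, process_exiting: bool) -> str:
    """Pick the fastest stop-the-bleeding move; root-cause hunting comes later.

    Mirrors the cases above: errors right after a release -> roll back first;
    a long-stable app whose processes start dying (likely a leak) -> restart
    a subset of machines and observe. Anything else: mitigate, then locate.
    """
    if just_deployed:
        return "rollback"          # Case 2: revert first, debug after recovery
    if process_exiting:
        return "rolling-restart"   # Case 3: restart some machines, keep one for analysis
    return "mitigate-then-locate"

print(first_aid_action(just_deployed=True, process_exiting=False))  # rollback
```

The point of encoding it at all is that under pressure a pre-agreed rule beats on-the-spot debate.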

Look at the monitoring, look at the logs

1. Collect and analyze UMP performance metrics, Logbook error logs, MDC system status, and other information to judge the problem more accurately.

2. Communicate with relevant teams, jointly analyze the cause of the problem, and formulate solutions. In the process of problem solving, it is necessary to remain calm and patient, follow the best practices of fault handling, and ensure that the problem is solved in a timely and effective manner.

✅ Positive example:

  • In a production environment the system's performance suddenly degrades. As the oncall engineer, after receiving the report you first collect system status, performance metrics, and other information through the monitoring tools. Based on that information you analyze the likely causes and communicate and collaborate with the relevant teams. Finally a solution is worked out and implemented, the performance degradation is resolved, and the production environment returns to normal.

❌Counterexample:

  • In a production environment the system suddenly fails and the business stops working. The oncall engineer, after receiving the report, neither narrows down the general direction nor collects information through the monitoring tools. He tries to fix the problem directly, but without accurate information and analysis the root cause is hard to find, the problem is not solved effectively, and the normal operation of the business suffers.

Find a pattern and experiment first

1. Through various monitoring tools, such as UMP (traffic, tp99, availability) and log analysis, we can find patterns and understand the performance of the system before and after the launch. For example, we can compare yesterday's log data with last week's log data to see if there is a similar problem. At the same time, we can also monitor changes in UMP traffic to determine whether the system is affected by anomalies.

2. If similar suspicious signs existed before, the problem may have nothing to do with today's release. Keep digging to find the root cause.

3. For each question point, we should experiment and verify according to priority. This ensures that we solve the most critical problems first and avoid affecting the proper functioning of the system.

4. If the problem is found to exist during the test, we should adjust the plan in time and re-conduct the test. This helps us find the root cause of the problem more quickly and take appropriate action to fix it.

5. During the whole investigation process, we should maintain communication and collaboration. If you encounter a difficult or uncertain situation, you can communicate with other team members to solve the problem together.
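The comparison in step 1, today's logs against the same window yesterday or last week, can be sketched as a baseline check. The function, the hour-to-count data shape, and the 2x threshold are all assumptions for illustration:

```python
def anomalous_hours(today: dict, last_week: dict, ratio: float = 2.0) -> list:
    """Flag hours whose error count is far above the same hour last week.

    `today` / `last_week` map hour -> error count. An hour is suspicious when
    today's count exceeds `ratio` times last week's baseline (floored at 1 to
    avoid a zero baseline on quiet hours).
    """
    flagged = []
    for hour, count in sorted(today.items()):
        baseline = max(last_week.get(hour, 0), 1)
        if count > ratio * baseline:
            flagged.append(hour)
    return flagged

today = {9: 120, 10: 15, 11: 14}
last_week = {9: 10, 10: 12, 11: 13}
print(anomalous_hours(today, last_week))  # [9]
```

If the same hours were already anomalous last week, the problem likely predates today's release, which is exactly the conclusion step 2 draws.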

✅ Positive example:

  • In a production environment the system's performance suddenly degrades. The oncall engineer collects UMP metrics and log data through the monitoring tools and finds that similar symptoms appeared yesterday and last week, while UMP traffic shows no significant change from before. From this he can preliminarily conclude that the problem is unrelated to today's release and has some other cause. The suspect points are prioritized and tested one by one, and the culprit turns out to be an incorrectly set configuration parameter. After adjusting the configuration and re-testing, the problem is resolved and system performance returns to normal.

❌Counterexample:

  • In a production environment the system suddenly fails and the business stops working. The oncall engineer skips detailed monitoring and analysis and tries to fix the problem directly. Without accurate information and analysis the problem is not solved effectively, and a wrong operation may even cause more serious errors. The oncall engineer must adjust the plan in time and re-test to make sure the problem is handled correctly.

Look at the input, look at the output

1. First confirm which inputs and outputs need to be compared. These may include request parameters, response data, and so on; be precise about what is being compared. During the comparison, watch for differences in parameter values; when a discrepancy appears, analyze the possible causes further, such as a parameter-passing error or an interface logic problem.

2. If the problem turns out to be interface logic, try rolling back one or N machines to the previous version, then test whether the interface works again. If the issue persists, troubleshoot further and fix the code.

3. According to the results of comparison and the process of investigation, summarize lessons learned and put forward suggestions for improvement to avoid similar problems from happening again.
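The parameter comparison in step 1 can be sketched as a small diff helper; the function name and the dictionary-shaped parameters are assumptions for illustration:

```python
def diff_params(expected: dict, actual: dict) -> dict:
    """Return the keys whose values differ between expected and actual
    request/response parameters, including keys missing on either side."""
    _MISSING = object()
    diffs = {}
    for key in set(expected) | set(actual):
        e, a = expected.get(key, _MISSING), actual.get(key, _MISSING)
        if e != a:
            diffs[key] = {
                "expected": None if e is _MISSING else e,
                "actual": None if a is _MISSING else a,
            }
    return diffs

before = {"userId": 42, "page": 1, "size": 20}
after = {"userId": 42, "page": 1, "size": 50}
print(diff_params(before, after))  # {'size': {'expected': 20, 'actual': 50}}
```

Running the same request against a rolled-back machine and the current version, then diffing the responses, is one concrete way to apply step 2.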

✅ Positive example:

  • In a production environment the system's performance suddenly degrades. As the O&M engineer, you roll back one machine (or drain its traffic) and compare, and you find that the input parameters are not as expected, so the interface cannot handle requests correctly. Careful troubleshooting traces it to an erroneous configuration parameter; once it is fixed, performance returns to normal and the business runs normally.

❌Counterexample:

  • In a production environment the system suddenly fails and the business stops working. The O&M engineer does no comparison or troubleshooting at all and tries to fix the problem directly. Without accurate information and analysis the problem is not solved effectively, and a wrong operation may even cause more serious errors. The engineer must adjust the plan in time and re-test to make sure the problem is handled correctly.

Stay on site and give feedback

1. While troubleshooting, it is vital to preserve the scene and record the measures taken and the solutions tried (for example, do not restart all the machines; keep one machine untouched as the scene of the incident).

2. Record it in detail, including the measures taken and the solutions tried.

3. No progress is also progress, and timely feedback is also required.
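The keep-one-machine rule in point 1 can be sketched as a restart plan: take one host out of the rotation as the preserved scene and batch the rest for a rolling restart. The function and host names are invented for illustration:

```python
def restart_plan(hosts, batch_size=2):
    """Keep the first host out of the restart rotation to preserve the
    scene; split the rest into batches for a rolling restart."""
    if not hosts:
        return None, []
    preserved, rest = hosts[0], hosts[1:]
    batches = [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return preserved, batches

preserved, batches = restart_plan(["app-01", "app-02", "app-03", "app-04", "app-05"])
print(preserved)  # app-01
print(batches)    # [['app-02', 'app-03'], ['app-04', 'app-05']]
```

The preserved host should also have its traffic drained, so it stops the bleeding for users while its snapshots stay available for analysis.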

✅ Positive example:

  • In a production environment the system's performance suddenly degrades. The on-duty O&M engineer keeps one machine and drains its traffic, restarts the other machines in batches to stop the bleeding and restore service quickly, and assigns others to analyze the preserved machine, dumping application snapshots (typically thread stacks and heap dumps) to find the cause.

❌Counterexample:

  • In a production environment the system's performance suddenly degrades, and the on-duty O&M engineer restarts all the machines directly. The bleeding stops and service is restored quickly, but because no machine was preserved, the key on-site information is lost and the root-cause analysis may be unable to continue.

If you find any of this information incorrect, or have a more suitable checklist, please feel free to correct me and get in touch so I can complete it. Thank you very much for your feedback and help! I will make corrections and provide more accurate information as soon as possible.

Author: JD Logistics Feng Zhiwen

Source: JD Cloud Developer Community, "Ape Qi Said Tech" column. Please indicate the source when reprinting.