
A network card driver bug troubleshooting process

Author: Flash Gene

Preface

In daily O&M there are always thorny faults and problems, especially compatibility issues where multiple systems converge, or unknown bugs lurking at those convergence points, and these make troubleshooting considerably harder.

Starting from a small incident, this article walks through the troubleshooting of a fault involving multiple converging components around an ESXi host, and along the way describes the environmental conditions that the stability of the ESXi base system depends on.


Troubleshooting stage 1: the fault appears and recurs randomly

1. The fault emerges

Background: One afternoon in one of our offices, several people could not connect to the wireless network
Cause: The network authentication server used for wireless authentication could not reach one of the AD domain controllers, so users' AD status could not be obtained and their connections were refused
Temporary measure: Restarting the network service on the authentication server restored connectivity

Symptoms:

The network authentication server could not communicate with some destination IP addresses (one of the AD domain controllers was among them)

The authentication server could still reach other destination IP addresses (the alarm system's IP was among the reachable ones, which is why no server alarm was generated)

Some VMs on the same ESXi host showed similar symptoms, while others were normal
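A sketch of the basic reachability checks behind these symptoms, run from the authentication server (the IP addresses are placeholders, not the real ones from the incident):

    ping -c 3 10.0.0.10      # one of the AD domain controllers: no reply during the fault
    ping -c 3 10.0.0.50      # another destination on the same network: replies normally
    ip route get 10.0.0.10   # check which route and interface the unreachable destination uses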

2. Random reproduction

Question: Why is this problem occurring and how can it be reproduced?

Since most of the core business systems run on ESXi, identifying the specific cause of the failure and avoiding similar risks became the top priority.

1. Packet capture analysis

We also inspected the other VMs on the host in detail and found that some of them likewise could not reach certain IP addresses (their local services did not depend on remote IP communication, so those VM services were not affected).

  • 1. The cascade switch captured the ICMP reply packets sent by the server, but the same replies could not be found on the uplink core switch
  • 2. To rule out a malformed packet structure on the server side, we used tcpreplay (a pcap replay tool that can resend packets captured by Ethereal, Wireshark, and similar tools, either as-is or after arbitrary modification) to rewrite and replay the packets in a lab environment; the packets were sent and received normally, which showed that the packet structure itself was fine (in fact, the lab environment differed from the faulty environment in other ways at the time) — see the replay sketch after this list
  • 3. After the switch port was bounced, the faulty virtual machine recovered
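A minimal sketch of that replay step, assuming a Linux test box with the tcpreplay suite installed; the capture file, interface, and MAC addresses are placeholders:

    # Rewrite the captured packets to fit the lab topology (values are illustrative)
    tcprewrite --infile=server_icmp.pcap --outfile=replay.pcap \
               --enet-smac=00:50:56:aa:bb:cc --enet-dmac=00:11:22:33:44:55
    # Replay the rewritten capture out of the lab NIC
    tcpreplay --intf1=eth0 replay.pcap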

The conclusion at the time was that packet loss was caused by a problem on the cascade or uplink switch (this conclusion later proved to be wrong).

2. Reproduction attempts

Although we had a preliminary conclusion, we still needed to reproduce the failure to pin down the exact problem link.

While troubleshooting on the ESXi side, we found an exception log entry at the time of the fault that looked related:

ixgben: indrv_uplinkreset: vmnic0 device reset started
           

Searching VMware forum threads, we found a command that triggers the same log entry (key point 1):

VMkernel Sys Info Shell (vsish):
    vsish -e set /net/pNics/vmnic0/reset 1
           

Running this command reproduced the fault in the previously faulty environment, but not in the test environment.
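For reference, a minimal sketch of how the trigger and the log check were combined on the ESXi host (vmnic0 is simply the uplink under test):

    # Trigger a reset of the physical uplink from the VMkernel Sys Info Shell
    vsish -e set /net/pNics/vmnic0/reset 1
    # Confirm that the same ixgben reset message shows up in the VMkernel log
    grep -i "device reset" /var/log/vmkernel.log | tail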

The conclusion at the time was that the reset of vmnic0 is the trigger, but the other conditions required for the failure were still unknown.

Troubleshooting stage 2: intertwined and intricate

Question: Under what conditions does the vmnic reset action trigger this fault?

To work out which factors ("X factors") must be present for the fault to trigger, we enumerated more than a dozen potentially related conditions and arranged them into over ten sets of test scenarios for confirmation.


Enumerated X factors: physical server model, NIC model, ESXi version, NIC load balancing (teaming) mode, port group configuration, vSwitch configuration, switch model, switch OS version, VM operating system, number of VMs, VM role, network segment, server cabling, etc.

After no fewer than 50 permutation-and-combination tests, the results for different X factors contradicted and overlapped one another, and the pattern and frequency of recurrence were not fixed, so the fault still could not be reproduced reliably; nevertheless, the range of trigger conditions was gradually narrowed down.
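Much of this per-host inventory can be read straight from the ESXi CLI; a small sketch, assuming vmnic0 and vSwitch0 are the uplink and vSwitch under test:

    esxcli system version get                    # ESXi version and build
    esxcli network nic list                      # NIC models and link state
    esxcli network nic get -n vmnic0             # driver name and version of the uplink
    esxcli network vswitch standard policy failover get -v vSwitch0   # teaming / load balancing mode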

Conclusion at the time: the trigger conditions were roughly narrowed to the following scope (not final)

  • Server: Dell R740xd
  • NIC: Intel(R) Ethernet Controller 10G X550
  • Driver: ixgben NIC driver
  • Switch: 35-series switches
  • Virtual machine: Linux

Troubleshooting stage 3: stable reproduction and finding the key

1. Stable reproduction

Question: Which hidden factors among these conditions determine whether the fault recurs reliably?

One of the most consistent observations was that every reproduction involved a Linux virtual machine; Windows VMs never reproduced the fault. Paying close attention to this peculiarity during troubleshooting, after multiple combination tests we noticed a common feature: in the low-level network information that ESXi shows for each VM, the SubType of every affected VM is "6" (key point 2).


Further research confirmed that SubType=6 means the virtual NIC type is vlance ("[AMD] 79c970 [PCnet32]"). The vlance type appears because "Other Linux" was selected in the VM creation wizard for these early Linux VMs, in which case ESXi creates the virtual NIC as vlance.
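The SubType value can also be read without the vSphere UI; a sketch of two quick host-side checks (the VM path below is a placeholder):

    # List vSwitch ports with Type/SubType and the owning VM; SubType 6 corresponds to vlance
    net-stats -l
    # The VM's .vmx file tells a similar story: vlance VMs typically have no
    # ethernetN.virtualDev entry (or carry "vlance") instead of e1000/vmxnet3
    grep -i virtualDev "/vmfs/volumes/datastore1/testvm/testvm.vmx"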

We prepared a VM with a vlance NIC in the lab, and the combination "Dell server + 'Route based on IP hash' teaming + vlance" finally reproduced the fault reliably when the vmnic reset was triggered. From the affected VM, roughly half of the target networks were reachable and half were not; with any teaming mode other than "Route based on IP hash" the fault could not be reproduced.
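Inside the reproduced Linux guest, the adapter type can be confirmed independently of the ESXi-side SubType value (a sketch; eth0 is the guest's primary interface):

    # A vlance virtual NIC appears as the AMD PCnet32 device
    lspci | grep -i ethernet      # expect something like "AMD ... 79c970 [PCnet32 LANCE]"
    ethtool -i eth0               # expect "driver: pcnet32" (e1000/vmxnet3 would mean another vNIC type)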

2. Finding the key

Question: Now that the fault can be reproduced reliably, which link in the path causes part of the traffic to be blocked?

Meanwhile, after escalating the Cisco after-sales ticket, the network team, with the help of senior support engineers, obtained an important piece of information through EPC (Embedded Packet Capture, an embedded capture tool that is flexible and targeted, well suited to online captures during network troubleshooting): during the fault, frames from the VM to the unreachable destination IPs arrive at the switch with a VLAN tag of 0, while frames to the destinations that remain reachable carry the normal VLAN ID (key point 3).
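For reference, a minimal EPC sketch of the kind of capture the network team ran, assuming a Catalyst-class switch running IOS-XE; the capture name and interface are illustrative:

    monitor capture CAP interface TenGigabitEthernet1/0/1 in
    monitor capture CAP match any
    monitor capture CAP start
    ! ... trigger the vmnic reset and reproduce the fault, then:
    monitor capture CAP stop
    ! inspect the 802.1Q VLAN ID of the captured frames
    show monitor capture CAP buffer detailed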


The conclusion at the time: the packets arriving during the fault did not carry the correct VLAN ID, so the Layer 3 switch could not forward them.

Troubleshooting stage 4: the clouds part and the truth surfaces

1. The clouds part

1. Confirming the fault factor:

Question: What causes VLAN tag exceptions?

Taking the loss of the VLAN tag as the key factor, we went through the release notes of each ESXi version on the VMware website. The notes for the relevant 6.x releases describe a VLAN-tag-related bug and give a temporary workaround, while the 7.0 notes state that the bug has been fixed.


After reproducing the failure in the lab environment, applying VMware's temporary workaround restored service as expected.

Analyzing some of the parameters of the command in that workaround, its effect is to restart the VLAN-related functions in the lower networking layer of ESXi (similar to our earlier operations of restarting the NIC or the switch port).

In addition, after upgrading ESXi to 7.0 in the lab, the fault could no longer be reproduced (the upgrade also brings a newer NIC driver, as it later turned out).

The conclusion at the time: the VLAN tag loss was verified as the problem, but the bug itself was still not explained in detail.

2. Confirming the fault link:

Question: The problem seems to have been found, but what exactly causes the VLAN tag to be lost?

From the virtual machine to the switch, traffic also passes through several links inside the ESXi layer: virtual machine, port group, vSwitch, and physical NIC.

We used the pktcap-uw packet capture tool (an advanced capture and analysis tool available on ESXi 5.5 and later) to capture packets at each link along this path; see the sketch below and the results in the table that follows.
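A sketch of the captures on the ESXi side, assuming vmnic0 is the uplink under test; the switch-port ID is a placeholder taken from net-stats -l or esxtop:

    # Capture at the VM's vSwitch port (port ID is illustrative)
    pktcap-uw --switchport 50331662 -o /tmp/vmport.pcap
    # Capture on the physical uplink in the transmit direction (frames handed to the NIC)
    pktcap-uw --uplink vmnic0 --dir 1 -o /tmp/vmnic0_tx.pcap

What the frames looked like after the physical NIC (on the wire) was checked from the switch side with EPC, as described earlier.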

Test scenarios and results:

  • Virtual machine to port group: packets normal, VLAN tag normal
  • Port group to vSwitch: packets normal, VLAN tag normal
  • vSwitch to physical NIC: packets normal, VLAN tag normal
  • Physical NIC to switch: packets normal, VLAN tag abnormal

The conclusion at the time: the VLAN tag was being lost at the physical NIC.

2. The truth surfaces

1. Confirming the cause of the failure:

Question: Why does the physical NIC cause the VLAN tag to be lost?

Focusing on the 10 GbE X550 NIC used in the Dell servers, we found an article on Lenovo's official forum noting that version 1.8.7 of the 10G ixgben driver fixes a VLAN-tag-related bug; our driver version was 1.7.10 (key point 4).


We therefore upgraded the NIC driver on the faulty host to a newer version for testing, and the fault scenario could no longer be reproduced.
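A sketch of the driver check and upgrade on the host, assuming the vendor's offline bundle has already been copied to a datastore (the bundle path is a placeholder):

    esxcli network nic get -n vmnic0              # confirm Driver: ixgben, Version: 1.7.10
    esxcli software vib list | grep -i ixgben     # the installed driver VIB
    esxcli software vib update -d /vmfs/volumes/datastore1/ixgben-offline-bundle.zip
    # reboot the host afterwards so the new driver takes effect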

To further confirm the correlation, several more combination tests were conducted:

Test scenarios and results:

  • Move the bonded uplink of the faulty host to a Gigabit NIC: cannot be reproduced
  • Upgrade ESXi 6.x to 8.0 (the NIC driver is automatically upgraded to 1.13.10): cannot be reproduced
  • Downgrade the NIC driver on ESXi 8.0 to 1.7.10: fault reproduced
  • Install the 10 GbE X550 (driver 1.7.10) in an HP server: fault reproduced

At this point, the doubts raised at each stage of the troubleshooting process could all be reasonably explained.

Q1: Why did the fault reproduce on some hosts with the same Dell model, VMware configuration, and network configuration, but not on others?

This type of fault only affects VMs with vlance NICs.

Q2: Why did Linux VMs fail on the same host while Windows VMs did not?

The faulty Linux VMs use the vlance NIC type (a result of selecting "Other" as the guest OS version when the VM was created), while the Windows VMs use the E1000 NIC type.

Q3: Why can the fault not be reproduced when the load balancing mode is anything other than "IP hash over a trunk"?

Because the "IP hash + trunk based" working mode involves the network layer VLAN tag, triggering the IXGBEN 1.7.10 version driver bug, other collocations such as: IP hash + access, source virtual port, source MAC and other load balancing modes do not involve the network layer VLAN tag

Q4: Why, when the fault occurs, can a VM reach roughly half of its target IPs but not the other half?

The IP-hash algorithm hashes the source and destination IP addresses and distributes flows across the physical uplinks; when one of the NICs is reset, the destinations hashed onto that uplink (roughly half) are affected.
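As a rough illustration of why the split is about half and half: the IP-hash policy derives the uplink from an XOR of the source and destination addresses modulo the number of uplinks, so each destination always lands on the same uplink. A simplified sketch (the addresses are arbitrary and the arithmetic only approximates the documented behaviour):

    # Two uplinks in the team; flows whose hash lands on the reset uplink form the unreachable half
    src=0xC0A80A15          # 192.168.10.21 (source VM) written as hex
    dst=0x0A0A0A0A          # 10.10.10.10 (one destination) written as hex
    uplinks=2
    echo $(( (src ^ dst) % uplinks ))   # index of the uplink chosen for this src/dst pair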

Q5: Why could the fault not be reproduced on the HP hosts?

The HP hosts did not have a 10 GbE X550 NIC running ixgben 1.7.10 (once one was installed, the fault reproduced there too).

Q6: Why is setting the ESXi tagging option to On not the root solution?

Because enabling it is similar to enabling promiscuous mode, which exposes additional security risks.

Q7: Why is the VLAN tag loss not simply an ESXi 6.7 bug (the release notes say 7.0 fixes a similar bug)?

Because the fault can still be reproduced on higher ESXi versions as long as ixgben 1.7.10 is used.

Summary:

Through the complete troubleshooting process, the root cause was finally determined: this type of failure is triggered when the following four conditions are all met at the same time

  • Condition 1: An X550 NIC driven by ixgben 1.7.10
  • Condition 2: The "Route based on IP hash" load balancing mode
  • Condition 3: The VM's virtual NIC type is vlance
  • Condition 4: The vmnic of the ESXi host is reset

Seeing is not necessarily believing. In complex fault scenarios there are many interfering factors that cloud our judgment. Continuously permuting and combining the conditions to find what the differing cases have in common, and then looking for the differences within that commonality, is one way to filter out the noise; only by stripping away the false and keeping the true can we find the root cause behind a problem and provide a genuinely effective solution.

About the Author:

Jun, currently an IT specialist.

Source: WeChat public account: Auction yard

Source: https://mp.weixin.qq.com/s/pTX2LuqCZO46lqd9i6gsEw
