
Interpreting an ESX/ESXi host purple diagnostic screen (1004250)

<a href="https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;cmd=displayKC&amp;externalId=1004250" target="_blank">https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;cmd=displayKC&amp;externalId=1004250</a>

This article provides information to decode ESX/ESXi host purple screen errors.

An ESX/ESXi purple screen error appears similar to:

Note: This article uses the information in this purple screen as an example.

The VMkernel is the operating system core of ESX/ESXi. The kernel handles resource scheduling and device I/O. Device I/O is handled by the VMware network and storage stacks, which serve as a layer between the virtual file system, network devices, and the device drivers that control physical devices.

If the VMkernel experiences an error, the error appears in a purple diagnostic screen. The purple diagnostic screen looks similar to:

VMware ESX Server [Releasebuild-98103]

PCPU 1 locked up. Failed to ack TLB invalidate.

frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c

es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff

eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff

ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff

*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc

0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48

0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0

0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2

0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0

0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0

0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0

0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0 

VMK uptime: 7:05:43:45.014 TSC: 1751259712918392

Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log

Here is a breakdown of each section of the above purple diagnostic screen:

The Product and Build:

<code>VMware ESX Server [Releasebuild-98103]</code>

This section of the purple diagnostic screen identifies the product and build that has experienced the error. In this example, the product is VMware ESX Server build 98103.
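When several screens are being compared, it can help to pull the product name and build number out of the banner line programmatically. The following is a minimal sketch, not a VMware tool: the function name `parse_banner` is hypothetical, and the pattern is inferred only from the single sample line above, so other releases may format the banner differently.

```python
import re

# Pattern inferred from the one sample banner in this article (an assumption).
# The closing bracket is optional because it is sometimes cut off on screen.
BANNER_RE = re.compile(
    r"^(?P<product>.+?)\s*\[Releasebuild-(?P<build>\d+)\]?\s*$"
)


def parse_banner(line):
    """Return (product, build_number), or None if the line does not match."""
    match = BANNER_RE.match(line.strip())
    if match is None:
        return None
    return match.group("product"), int(match.group("build"))


print(parse_banner("VMware ESX Server [Releasebuild-98103]"))
```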

The Error Message:

<code>PCPU 1 locked up. Failed to ack TLB invalidate</code>

This section of the purple diagnostic screen identifies the reported error message. Only a finite set of error messages can be reported. These error messages are discussed later in this article.

The CPU Registers:

<code>frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff</code>

Note: The preceding links were correct as of March 28, 2013. If you find the links to be broken, provide feedback on the article and a VMware employee will update the article as necessary. 

The Physical CPU:

<code>*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc</code>

This section of the purple diagnostic screen identifies the physical CPU that was running instructions during the VMkernel error. In the example, the * beside the 0 indicates that physical CPU 0 was running an operation at the time of the failure. Newer versions of ESX do not use the *; the letters CPU precede the number instead. For example, if the same error occurred on a newer version of VMware ESX, the line would appear as:

<code>CPU0:1037/helper1-4 cpu1:1107/vmm0:Fagi cpu2:1121/vmware-vm cpu3:1122/mks:Franc</code>

This section of the purple diagnostic screen also describes the world (process) that was running on the CPU at the time of the error. In the above example, the userworld running was helper1-4. 

Note: The name of the process may be truncated.
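The layout of this line (physical CPU number, world ID, world name, with a * on the CPU that was active) can be split apart as sketched below. This is only an illustration for the older starred format shown in the example. The function name `parse_pcpu_line` is hypothetical, and the code assumes that world names contain no spaces.

```python
import re

# One entry looks like "*0:1037/helper1-4": optional *, PCPU number,
# world ID, then the (possibly truncated) world name.
ENTRY_RE = re.compile(
    r"^(?P<star>\*)?(?P<pcpu>\d+):(?P<world_id>\d+)/(?P<name>.+)$"
)


def parse_pcpu_line(line):
    """Return one dict per physical CPU listed on the line."""
    entries = []
    for token in line.split():
        match = ENTRY_RE.match(token)
        if match is None:
            continue  # skip anything that is not a PCPU entry
        entries.append({
            "pcpu": int(match.group("pcpu")),
            "world_id": int(match.group("world_id")),
            "name": match.group("name"),
            "active": match.group("star") is not None,
        })
    return entries


line = "*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc"
for entry in parse_pcpu_line(line):
    print(entry)
```

Only the entry for physical CPU 0 has `active` set, which matches the reading of the example given above.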

The Stack Trace:

<code>0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48 0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0 0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2 0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0 0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0 0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0 0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0</code>

The stack trace shows what the VMkernel was doing at the time of the error. In this example, it was trying to invalidate entries in the translation lookaside buffer (TLB), the CPU cache of page table translations. Evaluating the actions of the kernel at the time of the error is a vital part of diagnosing purple screen errors.
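To read the stack trace as a sequence of function calls, each frame can be reduced to its symbol name. The sketch below assumes the frame layout seen in this example (frame address, instruction pointer in brackets, symbol plus offset, then stack words) and assumes the first line is the innermost call. The names `parse_frame` and `call_chain` are hypothetical.

```python
import re

# Layout assumed from the sample: 0xADDR:[0xIP]Symbol+0xOFFSET stack: ...
FRAME_RE = re.compile(
    r"^(?P<frame>0x[0-9a-f]+):\[(?P<ip>0x[0-9a-f]+)\]"
    r"(?P<symbol>\w+)\+(?P<offset>0x[0-9a-f]+)\s+stack:\s*(?P<words>.*)$"
)


def parse_frame(line):
    """Return symbol, instruction pointer and offset for one frame line."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "ip": int(match.group("ip"), 16),
        "offset": int(match.group("offset"), 16),
    }


def call_chain(lines):
    """List symbol names in screen order, collapsing consecutive repeats."""
    chain = []
    for line in lines:
        frame = parse_frame(line)
        if frame and (not chain or chain[-1] != frame["symbol"]):
            chain.append(frame["symbol"])
    return chain


SAMPLE = [
    "0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48",
    "0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0",
    "0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2",
    "0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0",
    "0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0",
    "0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0",
    "0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0",
]
print(" <- ".join(call_chain(SAMPLE)))
```

A chain reduced this way is easier to compare across several purple screens than the raw addresses, which can differ between builds.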

The Uptime:

<code>VMK uptime: 7:05:43:45.014 TSC: 1751259712918392</code>

This section indicates how long the server has been running since the last boot. In this example, the ESX host had been running for 7 days, 5 hours, 43 minutes, and 45.014 seconds. The TSC value is the number of CPU clock cycles that have elapsed since the server was started.
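The uptime field has the form days:hours:minutes:seconds, so it converts directly to a duration. As a rough sanity check, dividing the TSC value by the uptime in seconds gives an average tick rate, which for this example works out to roughly 2.8 GHz. The sketch below assumes the line layout shown above and assumes the TSC ticked at a constant rate since boot. The function name `parse_uptime` is hypothetical.

```python
from datetime import timedelta


def parse_uptime(line):
    """Split 'VMK uptime: D:HH:MM:SS.mmm TSC: N' into (timedelta, int)."""
    parts = line.split()
    # parts: ["VMK", "uptime:", "7:05:43:45.014", "TSC:", "1751259712918392"]
    days, hours, minutes, seconds = parts[2].split(":")
    uptime = timedelta(
        days=int(days),
        hours=int(hours),
        minutes=int(minutes),
        seconds=float(seconds),
    )
    return uptime, int(parts[4])


uptime, tsc = parse_uptime("VMK uptime: 7:05:43:45.014 TSC: 1751259712918392")
print("uptime:", uptime)
# Average TSC ticks per second over the whole uptime (an estimate only).
print("approx. GHz:", round(tsc / uptime.total_seconds() / 1e9, 2))
```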

The Core Dump:

<code>Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log</code>

This section of the purple diagnostic screen indicates that the contents of the VMkernel memory are being copied to the vmkcore partition. 

The VMkernel error message shown on the purple screen can be used to identify the cause of the issue. The number of error messages that can be produced is finite. This is a list of known VMkernel error messages.

Type: Console Oops

Example Error: <code>COS Error: Oops</code>

Type: Lost Heartbeat

Example Error: <code>Lost Heartbeat</code>

Type: Assert

Example Error: <code>ASSERT bora/vmkernel/main/pframe_int.h:527 </code>

Type: Not Implemented

Example Error: <code>NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83</code>

Type: Spin count exceeded / Possible deadlock

Example Error: <code>Spin count exceeded (iplLock) - possible deadlock</code>

Type: Failed to ack TLB invalidate

Example Error: <code>PCPU 1 locked up. Failed to ack TLB invalidate.</code>
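Because the set of messages is finite, a captured error line can be matched against the types listed above with simple substring checks. The following is a sketch only: the marker strings are taken from the example errors in this article, and the function name `classify_error` is hypothetical. Real messages may vary between releases.

```python
# (marker substring, type name) pairs taken from the examples in this article.
# NOT_IMPLEMENTED is checked before ASSERT so each marker is tested in a fixed,
# predictable order.
KNOWN_TYPES = [
    ("COS Error: Oops", "Console Oops"),
    ("Lost Heartbeat", "Lost Heartbeat"),
    ("NOT_IMPLEMENTED", "Not Implemented"),
    ("ASSERT", "Assert"),
    ("Spin count exceeded", "Spin count exceeded / Possible deadlock"),
    ("Failed to ack TLB invalidate", "Failed to ack TLB invalidate"),
]


def classify_error(message):
    """Return the type name for a purple screen message, or None if unknown."""
    for marker, type_name in KNOWN_TYPES:
        if marker in message:
            return type_name
    return None


print(classify_error("PCPU 1 locked up. Failed to ack TLB invalidate."))
print(classify_error("Spin count exceeded (iplLock) - possible deadlock"))
```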

A purple diagnostic screen can also take the form of an exception. An exception handler is a computer hardware mechanism designed to handle a condition that changes the normal flow of execution (division by zero, page fault, and so on). Handlers produce no trace, so determining whether a handler faulted requires logging or single-step debugging. This is a list of common exceptions:

Type: Exception 13 (General Protection Fault)

Example Error: <code>#GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303</code>

Type: Exception 14 (Page Fault)

Example Error: <code>#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e</code>

Type: Exception 18 (Machine Check Exception)

Example Error: <code>Machine Check Exception: Unable to continue</code>

Example Error: <code>Hardware (Machine) Error</code>
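The Exception 13 and Exception 14 examples above carry the exception number, the world, and the faulting address in one line. A sketch of extracting those fields follows. It covers only the two spellings shown in this article, "Exception(13)" and "Exception type 14", and the function name `parse_exception` is hypothetical.

```python
import re

# Matches both sample spellings: "#GP Exception(13) ..." and
# "#PF Exception type 14 ...". Other variants are not handled.
EXCEPTION_RE = re.compile(
    r"#(?P<mnemonic>[A-Z]+) Exception\s*"
    r"(?:\((?P<paren>\d+)\)|type (?P<typed>\d+))"
    r" in world (?P<world_id>\d+):(?P<world>\S+)"
    r" @ (?P<address>0x[0-9a-fA-F]+)"
)


def parse_exception(line):
    """Return the fields of an exception line, or None if it does not match."""
    match = EXCEPTION_RE.search(line)
    if match is None:
        return None
    return {
        "mnemonic": match.group("mnemonic"),
        "number": int(match.group("paren") or match.group("typed")),
        "world_id": int(match.group("world_id")),
        "world": match.group("world"),
        "address": int(match.group("address"), 16),
    }


print(parse_exception("#GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303"))
print(parse_exception("#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e"))
```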

For more information, see:

<a href="https://kb.vmware.com/selfservice/search.do?cmd=displayKC&amp;docType=kc&amp;docTypeID=DT_KB_1_1&amp;externalId=1008524" target="_blank">Collecting diagnostic information for VMware products (1008524)</a>

<a href="http://www.vmware.com/support/policies/howto.html" target="_blank">How to Submit a Support Request</a>

If you experience multiple purple diagnostic screens on the same VMware ESX host, you can compare them to judge whether the issue is more likely related to hardware or to software. Look for patterns in these sections of the purple diagnostic screen:

The error message and the stack trace: 

If the error message and stack vary greatly between VMkernel errors, this indicates that the software is not always hitting the same error. Although inconclusive, this may indicate a hardware issue.

If the error message and the stack are always identical between VMkernel errors, this indicates that the software is always hitting the same error. Although inconclusive, this may indicate a software issue.

For more information about the error message you are experiencing, refer to the above section about the specific error message.

The physical CPU: 

If the physical CPU value remains the same across multiple VMkernel errors, this indicates that the software is always failing on the same physical CPU. Although inconclusive, this may indicate a CPU issue.

The world: 

If the world value remains the same across multiple VMkernel errors, this indicates that the VMkernel is failing when receiving instructions from the same world. Although inconclusive, this may indicate that the world is sending instructions that trigger the VMkernel error.
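The comparison described above can be sketched as a check of which fields stay constant across a set of purple screens. The record layout (keys `error`, `stack`, `pcpu`, `world`) and the function names are hypothetical, and the second sample record is invented purely for illustration. The output is a list of hints in the same inconclusive spirit as the guidance above, not a diagnosis. Any difference in error or stack is treated as "varying", which is cruder than the "vary greatly" wording in the text.

```python
FIELDS = ("error", "stack", "pcpu", "world")


def recurring_fields(screens):
    """Map each field to True if its value is identical on every screen."""
    if len(screens) < 2:
        raise ValueError("need at least two purple screens to compare")
    return {f: len({s[f] for s in screens}) == 1 for f in FIELDS}


def hints(screens):
    """Turn the recurring fields into hedged hints."""
    same = recurring_fields(screens)
    notes = []
    if same["error"] and same["stack"]:
        notes.append("identical error and stack: may indicate a software issue")
    else:
        notes.append("varying error or stack: may indicate a hardware issue")
    if same["pcpu"]:
        notes.append("same physical CPU each time: may indicate a CPU issue")
    if same["world"]:
        notes.append("same world each time: that world may trigger the error")
    return notes


# Stacks are tuples so they can be compared as set members.
screens = [
    {"error": "PCPU 1 locked up. Failed to ack TLB invalidate.",
     "stack": ("Panic", "TLBDoInvalidate"), "pcpu": 0, "world": "helper1-4"},
    {"error": "PCPU 1 locked up. Failed to ack TLB invalidate.",
     "stack": ("Panic", "TLBDoInvalidate"), "pcpu": 2, "world": "helper1-4"},
]
for note in hints(screens):
    print(note)
```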

This is a complete list of exceptions:

Exception Type 0 #DE: Divide Error

Exception Type 1 #DB: Debug Exception

Exception Type 2 NMI: Non-Maskable Interrupt

Exception Type 3 #BP: Breakpoint Exception

Exception Type 4 #OF: Overflow (INTO instruction)

Exception Type 5 #BR: Bounds check (BOUND instruction)

Exception Type 6 #UD: Invalid Opcode

Exception Type 7 #NM: Coprocessor not available

Exception Type 8 #DF: Double Fault

Exception Type 10 #TS: Invalid TSS

Exception Type 11 #NP: Segment Not Present

Exception Type 12 #SS: Stack Segment Fault

Exception Type 13 #GP: General Protection Fault

Exception Type 14 #PF: Page Fault

Exception Type 16 #MF: Coprocessor error

Exception Type 17 #AC: Alignment Check

Exception Type 18 #MC: Machine Check Exception

Exception Type 19 #XF: SIMD Floating-Point Exception

Exception Type 20-31: Reserved

Exception Type 32-255: User-defined (clock scheduler)
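For scripts that process purple screen text, the list above can be kept as a lookup table keyed by exception number. The sketch below copies the entries as they appear in this article. The function name `describe_exception` is hypothetical, and numbers that the list does not mention (9 and 15) simply return None.

```python
# Entries copied from the list in this article: number -> (mnemonic, name).
EXCEPTIONS = {
    0: ("#DE", "Divide Error"),
    1: ("#DB", "Debug Exception"),
    2: ("NMI", "Non-Maskable Interrupt"),
    3: ("#BP", "Breakpoint Exception"),
    4: ("#OF", "Overflow (INTO instruction)"),
    5: ("#BR", "Bounds check (BOUND instruction)"),
    6: ("#UD", "Invalid Opcode"),
    7: ("#NM", "Coprocessor not available"),
    8: ("#DF", "Double Fault"),
    10: ("#TS", "Invalid TSS"),
    11: ("#NP", "Segment Not Present"),
    12: ("#SS", "Stack Segment Fault"),
    13: ("#GP", "General Protection Fault"),
    14: ("#PF", "Page Fault"),
    16: ("#MF", "Coprocessor error"),
    17: ("#AC", "Alignment Check"),
    18: ("#MC", "Machine Check Exception"),
    19: ("#XF", "SIMD Floating-Point Exception"),
}


def describe_exception(number):
    """Return a short description for an exception number, or None."""
    if number in EXCEPTIONS:
        mnemonic, name = EXCEPTIONS[number]
        return f"{mnemonic}: {name}"
    if 20 <= number <= 31:
        return "Reserved"
    if 32 <= number <= 255:
        return "User-defined (clock scheduler)"
    return None  # not covered by the list above


print(describe_exception(13))
print(describe_exception(14))
```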

Note: The preceding links were correct as of October 2, 2015. If you find a link is broken, provide feedback and a VMware employee will update the link.


Determining if virtual machine and ESX host unresponsiveness is caused by hardware issues (1003560)

<a href="http://kb.vmware.com/kb/1005184" target="_blank">Decoding Machine Check Exception (MCE) output after a purple screen error (1005184)</a>

<a href="http://kb.vmware.com/kb/1006802" target="_blank">Understanding an "Oops" purple diagnostic screen (1006802)</a>

<a href="http://kb.vmware.com/kb/1008524" target="_blank">Collecting diagnostic information for VMware products (1008524)</a>

<a href="http://kb.vmware.com/kb/1009525" target="_blank">Understanding a "Lost Heartbeat" purple diagnostic screen (1009525)</a>

<a href="http://kb.vmware.com/kb/1019956" target="_blank">Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956)</a>

<a href="http://kb.vmware.com/kb/1020105" target="_blank">Understanding a "Spin count exceeded" purple diagnostic screen (1020105)</a>

<a href="http://kb.vmware.com/kb/1020181" target="_blank">Understanding Exception 13 and Exception 14 purple diagnostic screen events in ESX 3.x/4.x and ESXi 3.x/4.x/5.x (1020181)</a>

<a href="http://kb.vmware.com/kb/1020214" target="_blank">Understanding a Failed to ack TLB invalidate purple diagnostic screen (1020214)</a>

<a href="http://kb.vmware.com/kb/1033242" target="_blank">Interpreting an ESX/ESXi host purple screen (Japanese) (1033242)</a>

<a href="http://kb.vmware.com/kb/2007269" target="_blank">ESXi 4.0 hosts may experience a purple screen after vCenter Server is upgraded to 5.0 (2007269)</a>

<a href="http://kb.vmware.com/kb/2077746" target="_blank">Interpreting an ESX/ESXi host purple diagnostic screen (Chinese) (2077746)</a>

<a href="http://kb.vmware.com/kb/2086258" target="_blank">VMware ESXi host fails and the purple diagnostic screen shows the following error: PF Exception 14 in world (Japanese) (2086258)</a>

<a href="http://kb.vmware.com/kb/2109424" target="_blank">ESXi 5.5 host with IPv6 enabled stops and the purple diagnostic screen shows find_pfxlist_reachable_router (Japanese) (2109424)</a>

<a href="http://kb.vmware.com/kb/2145091" target="_blank">Interpreting an ESX/ESXi host purple diagnostic screen (German) (2145091)</a>

This article is reposted from the 學海無涯 blog on 51CTO. Original link: http://blog.51cto.com/549687/1912033. To repost it, contact the original author.

520feng2007
