天天看點

如何閱讀作業系統trace檔案

How to interpret OS system traces (文檔 ID 1429678.1)

修改時間:2013-9-4

如何閱讀作業系統trace檔案

類型:REFERENCE

In this Document

Purpose

Scope

Details

What are system calls:

It's all about system calls

Invocation of Unix trace commands

Linux:

Solaris, AIX:

HP-UX:

What does the output look like ?

System calls to pay attention to:

Typical output of system traces:

Example:

While in network socket, the same descriptors applies to socket system calls when client connect to server:

File descriptor example:

open() returns a file descriptor:

Read() the file descriptor:

Close() the file descriptor:

Signals:

Signals are printed as a signal symbol and a signal string. An excerpt from tracing and interrupting the command ``sleep 666'' is:

OR:

When we suspect errors:

Errors reported always at the return value of the system call. For example:

How do we get more information about the error:

As the system calls are Unix functions, they're explained in detail in the online manual pages for every Unix platform.

For example, in shmat() man pages (using: man shmat), detailed information about each error has been inserted.

More information:

</Internal_Only>

References

Applies to:

Oracle Database Products > Oracle Database > Oracle Database

Generic UNIX

***Checked for relevance on 04-Sep-2013***

Purpose

This document highlights the main purpose of using system traces and how to generate/parse/identify the major error when reviewing system trace files.

Scope

OS/DBA engineers who are interested in identifying the root cause of an issue by using operating system traces.

This document may require understanding of basic OS concepts and how to get help from online manuals (man pages)

Details

What are system calls:

System calls are how a program requests a service from an operating system's kernel that it does not normally have permission to run.

System calls provide the interface between a process and the operating system.

It's all about system calls

Most operations interacting with the operating system require permissions not available to a user level process, e.g. I/O performed with a device present on the system, or any form of communication with other processes requires the use of system calls.

Invocation of Unix trace commands

Linux:

strace -frT -o <filename> -p <pid> | command

Solaris, AIX:

truss -aef -o <filename> -p <pid> | command

HP-UX:

tusc -aef -o <filename> -p <pid> | command

What does the output look like ?

18:07:47 gettimeofday({1330963667, 912233}, NULL) = 0 <0.000814>

18:07:47 times({tms_utime=168, tms_stime=136, tms_cutime=0, tms_cstime=0}) = 429707655 <0.000100>

18:07:47 gettimeofday({1330963667, 925627}, NULL) = 0 <0.000085>

18:07:47 _llseek(10, 0, [9399], SEEK_CUR) = 0 <0.000170>

18:07:47 write(10, "\n*** 2012-03-05 18:07:47.925\n", 29) = 29 <0.000290>

18:07:47 write(13, "3?C2Cc3~0T\n", 11) = 11 <0.000167>

18:07:47 _llseek(10, 0, [9428], SEEK_CUR) = 0 <0.000378>

18:07:47 write(10, "2012-03-05 18:07:47.912233 : nsd"..., 57) = 57 <0.000109>

18:07:47 write(13, "i8{0TvfOgQb+NN1\"MP1;\n", 21) = 21 <0.000105>

18:07:47 write(15, "\1v\0\0\6\0\0\0\0\0\0\0\1lD\0\0\200\2\0\0\0\0\4\0\0\350\20w\21\220\0"..., 374) = 374 <0.000148>

18:07:47 gettimeofday({1330963667, 928374}, NULL) = 0 <0.000086>

18:07:47 gettimeofday({1330963667, 928579}, NULL) = 0 <0.000086>

18:07:47 times({tms_utime=168, tms_stime=136, tms_cutime=0, tms_cstime=0}) = 429707655 <0.000086>

18:07:47 poll([{fd=14, events=POLLIN|POLLRDNORM}, {fd=15, events=POLLIN|POLLRDNORM}], 2, 3000) = 1 ([{fd=15, revents=POLLIN|POLLRDNORM}]) <0.002218>

18:07:47 gettimeofday({1330963667, 931608}, NULL) = 0 <0.000215>

18:07:47 gettimeofday({1330963667, 932044}, NULL) = 0 <0.000149>

18:07:47 read(15, "\0\266\0\0\6\0\0\0\0\0\0\0\0\254\244\0\0\200\7\0\0\0\0\0\0\0\0\0\0\0\5\0"..., 8208) = 182 <0.000154>

18:07:47 gettimeofday({1330963667, 932893}, NULL) = 0 <0.000150>

18:07:47 _llseek(10, 0, [9485], SEEK_CUR) = 0 <0.000165>

18:07:47 write(10, "2012-03-05 18:07:47.932893 : nsd"..., 59) = 59 <0.000711>

18:07:47 write(13, "iC{0Txq25\"2;\n", 13) = 13 <0.000167>

18:07:47 gettimeofday({1330963667, 935025}, NULL) = 0 <0.000322>

18:07:47 times({tms_utime=168, tms_stime=137, tms_cutime=0, tms_cstime=0}) = 429707656 <0.000148>

18:07:47 poll([{fd=14, events=POLLIN|POLLRDNORM}, {fd=15, events=POLLIN|POLLRDNORM}], 2, 3000) = 0 (Timeout) <3.014527>

18:07:50 gettimeofday({1330963670, 957436}, NULL) = 0 <0.000222>

18:07:50 gettimeofday({1330963670, 958448}, NULL) = 0 <0.000412>

18:07:50 getrusage(RUSAGE_SELF, {ru_utime={1, 682744}, ru_stime={1, 371791}, ...}) = 0 <0.000163>

18:07:50 gettimeofday({1330963670, 959385}, NULL) = 0 <0.000150>

18:07:50 getrusage(RUSAGE_SELF, {ru_utime={1, 682744}, ru_stime={1, 371791}, ...}) = 0 <0.000148>

18:07:50 open("/proc/3967/stat", O_RDONLY|O_LARGEFILE) = 18 <0.000211>

18:07:50 read(18, "3967 (oracle) S 1 3967 3967 0 -1"..., 999) = 203 <0.000269>

18:07:50 close(18) = 0 <0.000483>

System calls to pay attention to:

open()       - opens a file      
close()      - closes a file descriptor (a file, socket)      
read()       - read data      
write()      - write data      
mbind()      - sets  the  NUMA memory policy      
execve()     - executes or runs a new program (not a fork)      
socket()     - anything with "socket" (i.e. so_socket,socket) 

* Note that you may get unwanted hits, so also look for "INET" as well      
connect()    - connects to a socket      
bind()       - server only.  Bind a socket to an ip and port      
listen()     - listen for connections      

Typical output of system traces:

system_call(arguments) = exit code/return value

Example:

open("/proc/loadavg", O_RDONLY) = 4

lseek(4, 0, SEEK_SET) = 0

read(4, "0.03 0.07 0.02 1/122 11930\n", 1023) = 27

fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0

mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fe8000

write(1, " 19:37:34 up 1 day, 7:43, 3 us"..., 69) = 69

munmap(0xb7fe8000, 4096) = 0

While in network socket, the same descriptors applies to socket system calls when client connect to server:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9

connect(9, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.141.160.245")}, 16) = 0

write(9, "\0\357\0\0\1\0\0\0\0019\1,\f\1\10\0\177\377\177\10\0\0"..., 239) = 239

read(9, "\0\10\0\0\v\0\0\0", 2064) = 8

close(9) = 0 Note about file descriptors:

What is a file descriptor ? A file descriptor is an index for an entry in a kernel-resident data structure containing the details of all open files.

In POSIX this data structure is called a file descriptor table, and each process has its own file descriptor table.

The user application passes the abstract key to the kernel through a system call, and the kernel will access the file on behalf of the application, based on the key.

The application itself cannot read or write the file descriptor table directly.

File descriptor example:

open() returns a file descriptor:

open("/lib/libnsl.so.1", O_RDONLY) = 3

Read() the file descriptor:

read( 3, "\177ELF\1\3\0\0 <\0\000"..., 512) = 512

Close() the file descriptor:

close( 3)

Signals:

Signals are printed as a signal symbol and a signal string. An excerpt from tracing and interrupting the command ``sleep 666'' is:

sigsuspend([] <unfinished ...>

--- SIGINT (Interrupt) ---

+++ killed by SIGINT +++

OR:

--- SIGINT (Interrupt) @ 0 (0) ---

+++ killed by SIGINT +++

When we suspect errors:

Errors reported always at the return value of the system call. For example:

shmget(0xcf856443, 4096, 0) = -1 ENOENT (No such file or directory)

shmat(16777323, 0x60000000, 040000) Err#22 EINVAL

open("/usr/lib64/libnuma.so", O_RDONLY) = -1 ENOENT (No such file or directory)

semtimedop(100663335, 0xFFFFFC7FFFDFDAC8, 1, 0xFFFFFC7FFFDFDAE0) Err#11 EAGAIN

semtimedop(327681, 0xbfd11e98, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

How do we get more information about the error:

As the system calls are Unix functions, they're explained in detail in the online manual pages for every Unix platform.

For example, in shmat() man pages (using: man shmat), detailed information about each error has been inserted.

void *shmat(int shmid, const void *shmaddr, int shmflg);

ERRORS

When shmat() fails, errno is set to one of the following:

EACCES The calling process does not have the required permissions ...

EINVAL Invalid shmid value, unaligned (i.e., not page-aligned ...

ENOMEM Could not allocate memory for the descriptor or for the page tables.

When shmdt() fails, errno is set as follows:

EINVAL There is no shared memory segment attached at shmaddr; ... Note about manpages:

when using man pages, most of system call are categorized under category (2), in order to differentiate between open command and open() system call we need to use set 2 at at the man command:

$ man 2 open

you may refer to the manual of man command for more information:      
$ man man 
      
More information:

- "man" pages for strace, truss, and tusc.

- How To Use Truss With Opatch? ( Note:470225.1 )

References

NOTE:470225.1 - How To Use Truss With Opatch?

轉載于:https://blog.51cto.com/aimospang/1317133