
Analysis of a Slow Screen-On Issue Caused by Sensor Diag Logging

1. Problem Description

1.1 Symptom

Unlocking the phone stutters.

1.2 Conclusion

Slow diag data logging causes the system to stutter.

2. Problem Analysis

2.1 Log Analysis

In the large volume of logs there are many Slow Looper entries, all showing that DisplayPowerController's DisplayControllerHandler timed out while handling messages (key log 1 below shows a single message taking wall=3340ms), together with a large number of dvm_lock_sample entries pointing at registerListenerImpl. During screen-on both system_server and systemui register quite a few sensors, so the flood of abnormally slow registerListenerImpl calls made slow sensor registration the prime suspect:

Key log 1:

11-19 20:03:43.936  1000  1683  1903 W Looper  : Slow Looper PowerManagerService: Long Msg: seq=93258 plan=20:03:40.597 late=0ms wall=3340ms running=9ms runnable=6ms h=com.android.server.display.DisplayPowerController$DisplayControllerHandler w=1

Key log 2:

11-19 20:41:21.495  1000  2834 18488 I dvm_lock_sample: [com.android.systemui,0,AsyncTask #43,1751,SystemSensorManager.java,183,boolean android.hardware.SystemSensorManager.registerListenerImpl(android.hardware.SensorEventListener, android.hardware.Sensor, int, android.os.Handler, int, int),,183,boolean android.hardware.SystemSensorManager.registerListenerImpl(android.hardware.SensorEventListener, android.hardware.Sensor, int, android.os.Handler, int, int),100]

A suspicion is only a suspicion, though. Was there any direct evidence that slow sensor registration was the cause? From these logs alone there was no way to tell where the time was actually going.

2.2 Reproduction Analysis

While I was trying to find a way to reproduce it, I happened to be pulled into a group chat about a stutter issue whose members wanted a framework engineer to take a look. It turned out they had a way to reproduce it with fairly high probability: open XiaoAi (the assistant itself is actually irrelevant), set up fingerprint unlock, and then keep unlocking with the fingerprint over and over until the stutter shows up.

I used this method and the issue did indeed reproduce very easily. Once it reproduced, the next step was to find out exactly where things were stuck, and it was time to bring out every tool in the box:

1. Find out why system_server is stuck

Since it is system_server that stutters, we capture its Java traces at the moment of the stutter with kill -3 pid (the pid of system_server; SIGQUIT makes ART dump all thread stacks). The key threads in the trace I captured are as follows:

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7610b3c8 self=0x7153469c00
  | sysTid=1636 nice=-2 cgrp=default sched=0/0 handle=0x71d94d7ed0
  | state=S schedstat=( 12607677644 3411181337 21636 ) utm=905 stm=355 core=5 HZ=100
  | stack=0x7fc32bf000-0x7fc32c1000 stackSize=8192KB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/1636/stack)
  native: #00 pc 00000000000c14a4  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4)
  native: #01 pc 000000000007bb2c  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+132)
  native: #02 pc 000000000008a338  /system/lib64/libhidlbase.so (android::hardware::IPCThreadState::transact(int, unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int)+1772)
  native: #03 pc 0000000000085824  /system/lib64/libhidlbase.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+72)
  native: #04 pc 000000000000c8d8  /system/lib64/android.hardware.sensors@1.0.so (android::hardware::sensors::V1_0::BpHwSensors::_hidl_activate(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, int, bool)+236)
  native: #05 pc 000000000001b8e4  /system/lib64/libsensorservice.so (android::SensorDevice::activateLocked(void*, int, int)+784)
  native: #06 pc 000000000001ca78  /system/lib64/libsensorservice.so (android::SensorDevice::activate(void*, int, int)+72)
  native: #07 pc 0000000000031530  /system/lib64/libsensorservice.so (android::SensorService::enable(android::sp<android::SensorService::SensorEventConnection> const&, int, long, long, int, android::String16 const&)+1972)
  native: #08 pc 0000000000026194  /system/lib64/libsensorservice.so (android::SensorService::SensorEventConnection::enableDisable(int, bool, long, long, int)+216)
  native: #09 pc 00000000001a84f8  /system/lib64/libandroid_runtime.so ((anonymous namespace)::nativeEnableSensor(_JNIEnv*, _jclass*, long, int, int, int)+116)
  at android.hardware.SystemSensorManager$BaseEventQueue.nativeEnableSensor(Native method)
  at android.hardware.SystemSensorManager$BaseEventQueue.enableSensor(SystemSensorManager.java:761)
  at android.hardware.SystemSensorManager$BaseEventQueue.addSensor(SystemSensorManager.java:686)
  at android.hardware.SystemSensorManager.registerListenerImpl(SystemSensorManager.java:192)
  - locked <0x088006b6> (a java.util.HashMap)
  at android.hardware.SensorManager.registerListener(SensorManager.java:823)
  at android.hardware.SensorManager.registerListener(SensorManager.java:730)
  at miui.util.ProximitySensorWrapper.registerListener(ProximitySensorWrapper.java:96)
  - locked <0x01fcf5a1> (a java.util.ArrayList)
  at com.android.server.policy.MiuiScreenOnProximityLock.aquire(MiuiScreenOnProximityLock.java:115)
  - locked <0x0f945324> (a com.android.server.policy.MiuiScreenOnProximityLock)
  at com.android.server.policy.BaseMiuiPhoneWindowManager.startedWakingUp(BaseMiuiPhoneWindowManager.java:424)
  at com.android.server.power.Notifier$2.run(Notifier.java:460)
  at android.os.Handler.handleCallback(Handler.java:883)
  at android.os.Handler.dispatchMessage(Handler.java:100)
  at android.os.Looper.loop(Looper.java:224)
  at com.android.server.SystemServer.run(SystemServer.java:556)
  at com.android.server.SystemServer.main(SystemServer.java:362)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:539)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:928)

"PowerManagerService" prio=5 tid=34 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x143c1770 self=0x713aabf800
  | sysTid=1834 nice=-4 cgrp=default sched=0/0 handle=0x70d5919d50
  | state=S schedstat=( 1568883166 1516633035 9751 ) utm=101 stm=55 core=6 HZ=100
  | stack=0x70d5817000-0x70d5819000 stackSize=1039KB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/1834/stack)
  native: #00 pc 000000000007067c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  native: #01 pc 0000000000073be8  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+140)
  native: #02 pc 00000000000d7a68  /apex/com.android.runtime/lib64/bionic/libc.so (NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+212)
  native: #03 pc 000000000002fb24  /system/lib64/libsensorservice.so (android::SensorService::createSensorEventConnection(android::String8 const&, int, android::String16 const&)+80)
  native: #04 pc 000000000001177c  /system/lib64/libsensor.so (android::SensorManager::createEventQueue(android::String8, int)+96)
  native: #05 pc 00000000001a81fc  /system/lib64/libandroid_runtime.so ((anonymous namespace)::nativeInitSensorEventQueue(_JNIEnv*, _jclass*, long, _jobject*, _jobject*, _jstring*, int)+160)
  at android.hardware.SystemSensorManager$BaseEventQueue.nativeInitBaseEventQueue(Native method)
  at android.hardware.SystemSensorManager$BaseEventQueue.<init>(SystemSensorManager.java:666)
  at android.hardware.SystemSensorManager$SensorEventQueue.<init>(SystemSensorManager.java:798)
  at android.hardware.SystemSensorManager.registerListenerImpl(SystemSensorManager.java:191)
  - locked <0x05548e8d> (a java.util.HashMap)
  at android.hardware.SensorManager.registerListener(SensorManager.java:823)
  at com.android.server.display.AutomaticBrightnessController.setLightSensorEnabled(AutomaticBrightnessController.java:511)
  at com.android.server.display.AutomaticBrightnessController.configure(AutomaticBrightnessController.java:367)
  at com.android.server.display.DisplayPowerController.updatePowerState(DisplayPowerController.java:976)
  at com.android.server.display.DisplayPowerController.access$600(DisplayPowerController.java:92)
  at com.android.server.display.DisplayPowerController$DisplayControllerHandler.handleMessage(DisplayPowerController.java:2027)
  at android.os.Handler.dispatchMessage(Handler.java:107)
  at android.os.Looper.loop(Looper.java:224)
  at android.os.HandlerThread.run(HandlerThread.java:67)
  at com.android.server.ServiceThread.run(ServiceThread.java:45)

We can see that the PowerManagerService thread got stuck while registering a sensor during screen-on: it is waiting on a lock. Tracking down who holds that lock, it turns out to be the main thread, which is holding it while activating the corresponding sensor through the sensor HAL. So the next step is to look at the remote end and find out why it is stuck.

2. Find out why android.hardware.sensors@1.0-service is stuck

For a native process we usually capture a trace with debuggerd -b pid (the pid of the android.hardware.sensors@1.0-service process). The key part of the trace I captured is as follows:

"HwBinder:804_1" sysTid=859
  #00 pc 00000000000c2aa8  /apex/com.android.runtime/lib64/bionic/libc.so (write+8) (BuildId: 084c8a81b8c78e19cd9a1ff6208e77cf)
  #01 pc 0000000000019eac  /vendor/lib64/libdiag.so (log_commit+68) (BuildId: 18b6916c12e26ffe8854638aa34a6050)
  #02 pc 0000000000005920  /vendor/lib64/libsnsdiaglog.so (sensors_diag_log::submit_log_packet(sensors_diag_logid, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&)+380) (BuildId: 52c9f5d16b9d183648710f49b3da4510)
  #03 pc 000000000000531c  /vendor/lib64/libsnsdiaglog.so (sensors_diag_log::log_request_msg(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, char*)+556) (BuildId: 52c9f5d16b9d183648710f49b3da4510)
  #04 pc 00000000000444b4  /vendor/lib64/sensors.ssc.so (ssc_sensor::send_sensor_config_request()+1292) (BuildId: 761371c66dd776f38c4ae0949238b119)
  #05 pc 0000000000043ee0  /vendor/lib64/sensors.ssc.so (ssc_sensor::activate()+528) (BuildId: 761371c66dd776f38c4ae0949238b119)
  #06 pc 000000000003c0a4  /vendor/lib64/sensors.ssc.so (sensors_hal::activate(int, int)+380) (BuildId: 761371c66dd776f38c4ae0949238b119)
  #07 pc 0000000000007d60  /vendor/lib64/hw/android.hardware.sensors@1.0-impl.so (android::hardware::sensors::V1_0::implementation::Sensors::activate(int, bool)+28) (BuildId: df21e23627085ff60f326db9a312a9ce)
  #08 pc 000000000000f9d4  /system/lib64/vndk-29/android.hardware.sensors@1.0.so (android::hardware::sensors::V1_0::BnHwSensors::_hidl_activate(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+204) (BuildId: 6fcc0903c06dcd887cd709c04289ca69)
  #09 pc 0000000000010c18  /system/lib64/vndk-29/android.hardware.sensors@1.0.so (android::hardware::sensors::V1_0::BnHwSensors::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+948) (BuildId: 6fcc0903c06dcd887cd709c04289ca69)
  #10 pc 0000000000084dd8  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+72) (BuildId: 3684cb22817553ecd1e96bdb428ff263)
  #11 pc 0000000000088778  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::IPCThreadState::getAndExecuteCommand()+1040) (BuildId: 3684cb22817553ecd1e96bdb428ff263)
  #12 pc 0000000000089990  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::IPCThreadState::joinThreadPool(bool)+96) (BuildId: 3684cb22817553ecd1e96bdb428ff263)
  #13 pc 00000000000988d0  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::PoolThread::threadLoop()+24) (BuildId: 3684cb22817553ecd1e96bdb428ff263)
  #14 pc 00000000000135f0  /system/lib64/vndk-sp-29/libutils.so (android::Thread::_threadLoop(void*)+328) (BuildId: ae8cf46c4124b6f522dd7de2efc17c1c)
  #15 pc 00000000000d6cb0  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: 084c8a81b8c78e19cd9a1ff6208e77cf)
  #16 pc 0000000000074eac  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 084c8a81b8c78e19cd9a1ff6208e77cf)

We use the addr2line tool to map these addresses to source lines:

pzc@pzc-OptiPlex-7050:~/sd/下载/out/raphael-symbols-9.11.16-internal-8ad115968f/out/target/product/raphael/symbols/vendor/lib64$ addr2line -f -e libdiag.so 0000000000019eac
_ZL5writeiPKvU17pass_object_size0m
out/soong/.intermediates/bionic/libc/libc.llndk/android_arm64_armv8-a_vendor_shared/gen/include/bits/fortify/unistd.h:201

pzc@pzc-OptiPlex-7050:~/sd/下载/out/raphael-symbols-9.11.16-internal-8ad115968f/out/target/product/raphael/symbols/vendor/lib64$ addr2line -f -e libsnsdiaglog.so 0000000000005920
_ZN16sensors_diag_log17submit_log_packetE18sensors_diag_logidRNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE
vendor/qcom/proprietary/commonsys-intf/sensors-see/sensors-diag-log/src/sensor_diag_log.cpp:304

The corresponding code:

sensor_diag_log.cpp
263 sensors_diag_log_return_code sensors_diag_log::submit_log_packet(
264     sensors_diag_logid internal_log_id,
265     string &pb_encoded_sns_diag_sensor_log)
266 {
    .......
304     log_commit(log_pkt);
    .......
309     return return_code;
310 }

And the place where log_commit is defined:

diag_lsm_log.c
void log_commit(void *ptr)
{
    if (ptr)
    {
        log_header_type *phdr_ptr = NULL;
        log_commit_last = (void *) ptr;

        phdr_ptr = (log_header_type *)ptr;
        ptr = ((byte *) ptr - (LOG_DIAGPKT_OFFSET+DIAG_REST_OF_DATA_POS));
        if (-1 != diag_fd)
        {
            int NumberOfBytesWritten = 0;
            if ((NumberOfBytesWritten = write(diag_fd, (const void *) ptr, DIAG_REST_OF_DATA_POS+LOG_DIAGPKT_OFFSET + phdr_ptr->len)) != 0)
            {
                DIAG_LOGE("Diag_LSM_log: Write failed in %s, bytes written: %d, error: %d\n",
                          __func__, NumberOfBytesWritten, errno);
                log_commit_to_cs_fail++;
            }
            DiagSvc_Free(ptr, GEN_SVC_ID);
        }
    }
    return;
}

From the logs and the trace we can pin the stall down to the write call itself. When a Linux system call blocks for this long, the thread has usually gone into the D (uninterruptible sleep) state, so the next step is to see where exactly write is stuck inside the kernel. First find the android.hardware.sensors@1.0-service process and list its threads:

raphael:/ # ps -ef | grep sensor
system         604     1 0 16:12:41 ?     00:00:00 sscrpcd sensorspd
system         775     1 0 16:12:41 ?     00:00:37 android.hardware.sensors@1.0-service
system         802     1 0 16:12:41 ?     00:00:00 vendor.qti.hardware.sensorscalibrate@1.0-service
system         820     1 0 16:12:41 ?     00:00:11 vendor.xiaomi.hardware.citsensorservice@1.1-service
system         926     1 0 16:12:42 ?     00:00:02 sensors.qti
root         13571 19214 2 10:18:01 pts/1 00:00:00 grep sensor
raphael:/ # cd /pr
proc/             product/          product_services
raphael:/ # cd /proc/775/task/
raphael:/proc/775/task # ls
10249  10251  10253  13307  13308  13309  13321  13322  13323  1842  1845  1846  1847  1848  1849  1850  5737  5738  5740  775  840  842  848
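Instead of cat-ing every entry by hand, the same check can be scripted. The helper below is my own illustration (it is not part of the original workflow): it walks /proc/<pid>/task, prints each thread's state taken from its stat file, and dumps the kernel stack of any thread it finds in the D state (reading /proc/<tid>/stack requires root).

/*
 * dstate.c - illustrative helper (hypothetical, not from the original article):
 * list the state of every thread of a process and show the kernel stack of
 * any thread that is in the D (uninterruptible sleep) state.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

static void dump_file(const char *path)
{
    char line[512];
    FILE *f = fopen(path, "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char taskdir[64];
    snprintf(taskdir, sizeof(taskdir), "/proc/%s/task", argv[1]);

    DIR *dir = opendir(taskdir);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;

        char path[512], comm[64] = "?", state = '?';
        snprintf(path, sizeof(path), "%s/%s/stat", taskdir, de->d_name);
        FILE *f = fopen(path, "r");
        if (f) {
            /* /proc/<tid>/stat starts with: pid (comm) state ... */
            fscanf(f, "%*d (%63[^)]) %c", comm, &state);
            fclose(f);
        }
        printf("tid %s (%s) state=%c\n", de->d_name, comm, state);

        if (state == 'D') {   /* uninterruptible sleep: show the kernel stack */
            snprintf(path, sizeof(path), "%s/%s/stack", taskdir, de->d_name);
            dump_file(path);
        }
    }
    closedir(dir);
    return 0;
}

Running it against pid 775 on the stuck device should surface the same D-state threads that the manual commands below reveal.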

We reproduce the issue, then go through the threads with the smaller tids one by one (some of them are HwBinder threads), dump their kernel stacks and check their states:

cat 1850/stack && cat 5737/stack && cat 5738/stack && cat 5740/stack && cat 775/stack && cat 840/stack && cat 842/stack && cat 848/stack && ps -AT | grep " D "

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] diagchar_read+0x134/0x1640
[<0000000000000000>] __vfs_read+0x44/0x128
[<0000000000000000>] vfs_read+0xa0/0x138
[<0000000000000000>] SyS_read+0x54/0xb8
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] do_sigtimedwait+0x158/0x338
[<0000000000000000>] SyS_rt_sigtimedwait+0xd0/0x178
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] diagchar_write+0xcf4/0x1538
[<0000000000000000>] __vfs_write+0x44/0x130
[<0000000000000000>] vfs_write+0xc8/0x188
[<0000000000000000>] SyS_write+0x54/0xb8
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] do_sigtimedwait+0x158/0x338
[<0000000000000000>] SyS_rt_sigtimedwait+0xd0/0x178
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] diagchar_write+0x7a0/0x1538
[<0000000000000000>] __vfs_write+0x44/0x130
[<0000000000000000>] vfs_write+0xc8/0x188
[<0000000000000000>] SyS_write+0x54/0xb8
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] futex_wait_queue_me+0xbc/0x108
[<0000000000000000>] futex_wait+0x278/0x448
[<0000000000000000>] do_futex+0x138/0x17e0
[<0000000000000000>] SyS_futex+0x130/0x1b0
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] do_sys_poll+0x5c4/0x700
[<0000000000000000>] SyS_ppoll+0x174/0x268
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

system         775   775     1  476464   6812 diagchar_write      D sensors@1.0-ser
system         775  1842     1  476464   6812 diagchar_read       D sensors@1.0-ser
system         820  1548     1  197708   7136 diagchar_read       D citsensorservic
system         926   944     1  124940   4252 diagchar_read       D sensors.qti
system        1071  1079     1   50644   4660 diagchar_read       D cnss_diag
radio         1441  1484     1 1440532  29140 diagchar_read       D qcrild
radio         1456  1485     1 1298900  29172 diagchar_read       D qcrild
radio         2317  2325     1  182856   3584 diagchar_read       D ims_rtp_daemon
shell        14636 14636     1   39684   3988 diagfwd_write       D diag_mdlog_syst
shell        16121 16121     1   30352   2568 diagchar_open       D diag_mdlog_syst

From this output we can see several things:

1. android.hardware.sensors@1.0-service has two threads in the D state, one of them stuck in write. This alone explains why the system stutters: this thread is stuck, which in turn blocks system_server, and the whole system hangs.

2. Several other threads are also in the D state inside diagchar_read/diagchar_write, and one more is stuck in diagchar_open; that last process was started by shell.

Check the corresponding kernel log:

[238325.121834] init: Received control message 'start' for 'diag_mdlog_stop' from pid: 16096 (start diag_mdlog_stop)
[238325.127831] init: Received control message 'start' for 'diag_mdlog_stop' from pid: 16097 (start diag_mdlog_stop)
[238325.163024] init: Received control message 'start' for 'diag_mdlog_start' from pid: 16099 (start diag_mdlog_start)
[238325.171552] init: Received control message 'start' for 'diag_mdlog_start' from pid: 16100 (start diag_mdlog_start)
[238330.007418] init: Received control message 'start' for 'diag_mdlog_stop' from pid: 16123 (start diag_mdlog_stop)
[238330.039798] init: Received control message 'start' for 'diag_mdlog_start' from pid: 16124 (start diag_mdlog_start)
[238332.430154] init: Service 'diag_mdlog_stop' (pid 16121) exited with status 0
[238332.439974] init: Service 'diag_mdlog_start' (pid 14636) exited with status 0

Here "start diag_mdlog_start" and "start diag_mdlog_stop" are being issued over and over again, which is not normal. These two commands start and stop capturing subsystem (diag) logs. Searching for these keywords leads to the code that issues them:

@Override
public void onEventOccur(ExceptionEvent event) {
    if (event == null) {
        Utils.logE(TAG, "event is null!");
        return;
    }
    FeatureEvent featureEvent = (FeatureEvent) event;
    if (FEATURE_SUBCLASS.contains(featureEvent.getSubClass())) {
        ThreadPool.execute(new Runnable() {
            @Override
            public void run() {
                SystemProperties.set("debug.offline_log.enable", "true");
                SystemProperties.set("debug.offline_log.module", "sensor");
                CaptureLogUtils.getInstance().runCommand("start", "diag_mdlog_start");
                try {
                    Thread.sleep(50000);
                } catch (InterruptedException e) {
                    Utils.logD(TAG, "captureSensorLog failed: " + e.getMessage());
                }
                SystemProperties.set("debug.offline_log.enable", "false");
                SystemProperties.set("debug.offline_log.module", "null");
                CaptureLogUtils.getInstance().runCommand("start", "diag_mdlog_stop");
                Handler h = mManager.getWorkHandler();
                Message msg = h.obtainMessage(MQSEventManager.MSG_HANDLE_FEATURE_EVENTS, 0, event);
                h.sendMessage(msg);
            }
        });
    }
}

At this point we can already more or less guess the root cause: it has to be this repeatedly triggered diag log capture feature. Now let's go back to the two observations above and analyze them further:

1. android.hardware.sensors@1.0-service has two threads in the D state, one of them stuck in write; this explains why the system stutters (it blocks system_server, which stalls the whole system).

First, using the kernel stacks above, let's see where exactly the write is stuck in the kernel:

[<0000000000000000>] __switch_to+0xd8/0xf8
[<0000000000000000>] diagchar_write+0x7a0/0x1538
[<0000000000000000>] __vfs_write+0x44/0x130
[<0000000000000000>] vfs_write+0xc8/0x188
[<0000000000000000>] SyS_write+0x54/0xb8
[<0000000000000000>] el0_svc_naked+0x34/0x38

These two traces belong to threads in the D state, so we download the matching vmlinux and use gdb to find out exactly where they are stuck:

pzc@pzc-OptiPlex-7050:~/code/J11/prebuilts/gdb/linux-x86/bin$ ./gdb
GNU gdb (GDB) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) file /home/pzc/sd/vmlinux-811877d
Reading symbols from /home/pzc/sd/vmlinux-811877d...done.
(gdb) list diagchar_write+0x7a0/0x1538
Function "diagchar_write+0x7a0/0x1538" not defined.
(gdb) list diagchar_write+0x7a0
Function "diagchar_write+0x7a0" not defined.
(gdb) list *diagchar_write+0x7a0
0xffffff800863e210 is in diagchar_write (/home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c:3577).
3572    /home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c: 没有那个文件或目录.
(gdb) list *diagchar_read+0xf4
0xffffff800863c524 is in diagchar_read (/home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c:3632).
3627    in /home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c
(gdb)

After resolving the symbols we find that the write path is stuck at diagchar_core.c line 3577 and the read path at line 3632. The corresponding code is:

diagchar_core.c
3518 static int diag_user_process_apps_data(const char __user *buf, int len,
3519                                        int pkt_type)
3520 {
     ......
3577     mutex_lock(&driver->apps_data_mutex);
     ......
3600     return 0;
3601 }
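Note that this mutex is driver-global, so every user-space client that writes to the diag char device is serialized on the same lock. A minimal sketch of that pattern (my own illustration, not the actual diag driver code; forward_to_peripheral is a hypothetical stand-in for the real forwarding step):

#include <linux/fs.h>
#include <linux/module.h>
#include <linux/mutex.h>

/* one lock shared by every process that writes to the device */
static DEFINE_MUTEX(apps_data_mutex);

/* hypothetical stand-in for handing the packet on towards the peripheral */
static ssize_t forward_to_peripheral(const char __user *buf, size_t count)
{
	return (ssize_t)count;
}

static ssize_t demo_diag_write(struct file *file, const char __user *buf,
			       size_t count, loff_t *ppos)
{
	ssize_t ret;

	mutex_lock(&apps_data_mutex);		/* all writers serialize here */
	ret = forward_to_peripheral(buf, count); /* may block for a long time */
	mutex_unlock(&apps_data_mutex);
	return ret;
}

static const struct file_operations demo_diag_fops = {
	.owner = THIS_MODULE,
	.write = demo_diag_write,
};

If whoever holds the lock stalls inside the critical section, every other diag client, including the sensor HAL's binder threads, queues up behind the mutex in the D state, which is exactly the pile-up visible in the ps output above.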

Since it is hard to tell from the logs alone who is holding this lock, we need to trigger a ramdump and continue the analysis there.

2.3 Kernel Ramdump Analysis

Logs by themselves are not quite enough, so we reproduce the issue again and trigger a ramdump for analysis:

For how to capture and load a ramdump, see: RAMDUMP抓取方法 (how to capture a RAMDUMP).

1. Load the ramdump

crash vmlinux DDRCS0_0.BIN@0x0000000080000000,DDRCS0_1.BIN@0x0000000100000000,DDRCS1_0.BIN@0x0000000180000000,DDRCS1_1.BIN@0x0000000200000000 --kaslr 0x000000074f800000

2. Check processes in the D (UN) state

crash> ps | grep UN
     76      2   2  ffffffe0336c0000  UN   0.0       0      0  [kworker/u16:1]
    249      2   0  ffffffe030f40000  UN   0.0       0      0  [kworker/u16:5]
    340      2   1  ffffffe032fef000  UN   0.0       0      0  [kworker/u16:9]
    364      2   3  ffffffe02b07c000  UN   0.0       0      0  [kworker/u16:12]
   1134      1   4  ffffffdfff54f000  UN   0.1   50644   5980  cnss_diag
   1519      1   6  ffffffdfd9d08000  UN   0.4 1299028  43208  qcrild
   1600      1   4  ffffffdfc14c7000  UN   0.1  476464   9640  sensors@1.0-ser
   1624      1   6  ffffffdfc1549000  UN   0.1  197708   8860  citsensorservic
   2377      1   0  ffffffdf76f6a000  UN   0.1  182856   5872  ims_rtp_daemon
   5441      1   1  ffffffdf39246000  UN   0.1  476464   9640  HwBinder:789_1
   7756      1   3  ffffffdf79799000  UN   0.0  108400   3844  diag_mdlog_syst
   7764      1   6  ffffffdf87ad3000  UN   0.0  108400   3844  diag_mdlog_syst

3. Look at the trace of the D-state thread HwBinder:789_1:

crash> bt 5441
PID: 5441   TASK: ffffffdf39246000  CPU: 1   COMMAND: "HwBinder:789_1"
 #0 [ffffff801913bb70] __switch_to at ffffff8757886b64
 #1 [ffffff801913bc10] __schedule at ffffff8758a56ba8
 #2 [ffffff801913bc30] schedule_preempt_disabled at ffffff8758a570d0
 #3 [ffffff801913bcb0] __mutex_lock at ffffff8758a58b88
 #4 [ffffff801913bcc0] __mutex_lock_slowpath at ffffff8758a581ac
 #5 [ffffff801913bcd0] mutex_lock at ffffff8758a58194
 #6 [ffffff801913bda0] diagchar_write at ffffff8757e3e20c
 #7 [ffffff801913be30] __vfs_write at ffffff8757a72898
 #8 [ffffff801913be70] vfs_write at ffffff8757a72c54
 #9 [ffffff801913beb0] sys_write at ffffff8757a72e20
#10 [ffffff801913bff0] el0_svc_naked at ffffff875788487c
     PC: 0000007ba1887aa8   LR: 0000007ba0cc6eb0   SP: 0000007b92c3d8e0
    X29: 0000007b92c3d900  X28: 0000007ba1321e80  X27: 0000007ba067000c
    X26: 0000007ba083a028  X25: 0000007ba1230440  X24: 0000007b92c3e020
    X23: 0000007ba0ab9118  X22: 00000000b630e2ae  X21: 0000000000000078
    X20: 0000007b9f251808  X19: 0000007b9f251800  X18: 0000007b90abe000
    X17: 0000007ba1887aa0  X16: 0000007ba0cd9930  X15: 0000c93977e960fb
    X14: 00245bdc48d5ee00  X13: 656d6f6465700912  X12: 0000007b9f23bc00
    X11: 001a00000000b630  X10: bd9711000000810d   X9: 0000000000000084
     X8: 0000000000000040   X7: e1d98509120a2606   X6: aa726574656d6f64
     X5: 0000007b9f25188c   X4: 0000007ba12277f8   X3: 0000007b9f251810
     X2: 000000000000008c   X1: 0000007b9f251800   X0: 000000000000000d
    ORIG_X0: 000000000000000d  SYSCALLNO: 40  PSTATE: 00000000

4. Find the code at the point where the lock is taken:

crash> sym ffffff8757e3e20c
ffffff8757e3e20c (t) diagchar_write+1948  /home/pzc/code/F11Q/kernel/msm-4.14/drivers/char/diag/diagchar_core.c: 3576

The lock being acquired there is:

mutex_lock(&apps_data_mutex);

5. Find out who holds this lock

The definition of apps_data_mutex:

static struct mutex apps_data_mutex;

This is a global mutex, so we can check which task owns it like this:

crash> p apps_data_mutex
apps_data_mutex = $1 = {
  owner = {
    counter = -136948846591
  },
  wait_lock = {
    {
      rlock = {
        raw_lock = {
          owner = 92,
          next = 92
        }
      }
    }
  },
  osq = {
    tail = {
      counter = 0
    }
  },
  wait_list = {
    next = 0xffffff801913bc48,
    prev = 0xffffff801913bc48
  }
}

In the owner field we have counter = -136948846591; this value encodes the task that currently owns the lock. View -136948846591 in hexadecimal:

printf %x -136948846591
ffffffe01d367001
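On this msm-4.14 kernel, mutex.owner packs the owning task_struct pointer together with three low-order flag bits (waiters/handoff/pickup), so the owner is recovered by masking those bits off; that is why the trailing 1 in ffffffe01d367001 gets dropped. A small userspace illustration of the arithmetic, using the counter value read from the dump:

#include <stdio.h>
#include <stdint.h>

/* the low three bits of mutex.owner are flag bits, not part of the pointer */
#define MUTEX_FLAGS 0x07ULL

static uint64_t owner_task(int64_t counter)
{
	return (uint64_t)counter & ~MUTEX_FLAGS;
}

int main(void)
{
	int64_t counter = -136948846591LL;	/* owner.counter from the ramdump */

	/* prints ffffffe01d367000, the task_struct address to feed to crash */
	printf("%llx\n", (unsigned long long)owner_task(counter));
	return 0;
}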

Look at the corresponding task_struct:

struct task_struct ffffffe01d367000
struct task_struct {
  ......
  pid = 789,
  tgid = 789,
  ......
}

So thread 789 is the one holding the lock. Let's see what thread 789 is doing:

crash> bt 789
PID: 789    TASK: ffffffe01d367000  CPU: 1   COMMAND: "sensors@1.0-ser"
 #0 [ffffff800c92bc10] __switch_to at ffffff8757886b64
 #1 [ffffff800c92bcb0] __schedule at ffffff8758a56ba8
 #2 [ffffff800c92bcd0] schedule at ffffff8758a56ff4
 #3 [ffffff800c92bda0] diagchar_write at ffffff8757e3e760
 #4 [ffffff800c92be30] __vfs_write at ffffff8757a72898
 #5 [ffffff800c92be70] vfs_write at ffffff8757a72c54
 #6 [ffffff800c92beb0] sys_write at ffffff8757a72e20
 #7 [ffffff800c92bff0] el0_svc_naked at ffffff875788487c
     PC: 0000007ba1887aa8   LR: 0000007ba0cc6eb0   SP: 0000007fd405cff0
    X29: 0000007fd405d010  X28: 0000000000000000  X27: 0000000000000001
    X26: 0000000000000000  X25: 0000007ba1230440  X24: 0000007ba204b020
    X23: 0000007ba0ab9118  X22: 00000000b619242b  X21: 0000000000000077
    X20: 0000007b9f251008  X19: 0000007b9f251000  X18: 0000007ba2aec000
    X17: 0000007ba1887aa0  X16: 0000007ba0cd9930  X15: 0000b7fc04297d3a
    X14: 001f972848ddbe00  X13: 6d69786f72700912  X12: 0000007b9f23d400
    X11: 0008020a04220010  X10: 0108041a00000202   X9: 0000000000000083
     X8: 0000000000000040   X7: 6d6109120a2506a2   X6: 7974696d69786f72
     X5: 0000007b9f25108b   X4: 0000007ba1227777   X3: 0000007b9f251010
     X2: 000000000000008b   X1: 0000007b9f251000   X0: 000000000000000d
    ORIG_X0: 000000000000000d  SYSCALLNO: 40  PSTATE: 00000000

Next, find out what it was doing right before it called schedule():

crash> sym ffffff8757e3e760
ffffff8757e3e760 (t) diagchar_write+3312  /home/pzc/code/F11Q/kernel/msm-4.14/drivers/char/diag/diagchar_core.c: 3070

static int diag_process_apps_data_hdlc(unsigned char *buf, int len,
                                       int pkt_type)
{
        ......
        wait_event_interruptible(driver->hdlc_wait_q,
                                 (data->flushed == 0));
        ......
}

This code sleeps until it is woken up, and the wake-up condition is data->flushed == 0, so I had to find out where flushed gets set to 1 and where it gets set back to 0. After adding various logs I found that under normal conditions data->flushed stays 0, but there is a periodic work item that sets it to 1, and it only goes back to 0 after diagfwd_mux_write_done runs.
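In other words the handshake looks roughly like the sketch below; the names follow the snippets above, but this is my own simplification rather than the real driver code:

#include <linux/sched.h>
#include <linux/wait.h>

struct diag_apps_data_t {
	int flushed;		/* 1: buffer handed off, writers must wait */
};

static DECLARE_WAIT_QUEUE_HEAD(hdlc_wait_q);

/* writer path (diag_process_apps_data_hdlc): sleep until the buffer is back */
static int wait_for_buffer(struct diag_apps_data_t *data)
{
	return wait_event_interruptible(hdlc_wait_q, data->flushed == 0);
}

/* completion path (reached via diagchar_close -> diagfwd_mux_write_done) */
static void buffer_write_done(struct diag_apps_data_t *data)
{
	data->flushed = 0;
	wake_up_interruptible(&hdlc_wait_q);
}

If nothing ever runs the completion side, any writer arriving while flushed == 1 sleeps here indefinitely, which is exactly the state the sensor HAL thread was stuck in.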

So the next question was when diagfwd_mux_write_done actually gets called. After adding dump_backtrace, I found it is called on the diagchar_close path:

[  139.272177] Call trace:
[  139.272206] dump_backtrace+0x0/0x268
[  139.272215] show_stack+0x14/0x20
[  139.272231] dump_stack+0xc4/0x100
[  139.272244] diagfwd_mux_write_done+0x1f0/0x2f8
[  139.272254] diag_md_close_peripheral+0x118/0x138
[  139.272261] diag_mux_close_peripheral+0x8c/0xa8
[  139.272268] diag_remove_client_entry+0x4d0/0x538
[  139.272275] diagchar_close+0x50/0x60
[  139.272287] __fput+0xbc/0x1b8

Now we have to find out why diagchar_close was not called in time. After adding a lot of logging, a suspicious stall showed up:

[  180.092586] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0
[  180.408500] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0
[  180.718953] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0
[  181.032199] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0
[  181.337216] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -11, retry_count=0
[  181.640242] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -11, retry_count=0
[  181.956562] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0
[  182.269410] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0
[  182.599262] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

The corresponding code:

int diagfwd_write(uint8_t peripheral, uint8_t type, void *buf, int len)
{
        ......
        while (retry_count < max_retries) {
                err = 0;
                err = fwd_info->p_ops->write(fwd_info->ctxt, buf, len);
                if (err && err != -ENODEV) {
                        usleep_range(100000, 101000);
                        pr_err("pzc diag: In %s, diagfwd_write usleep_range----- err = %d, retry_count=%d\n",
                               __func__, err, retry_count);
                        retry_count++;
                        continue;
                }
                break;
        }
        ......
}

We also found the matching call stack in the ramdump (note that err = -12 and -11 are -ENOMEM and -EAGAIN: the write towards the peripheral keeps failing and being retried with a 100 ms sleep each time):

crash> bt 7756
PID: 7756   TASK: ffffffdf79799000  CPU: 3   COMMAND: "diag_mdlog_syst"
 #0 [ffffff800d26b7f0] __switch_to at ffffff8757886b64
 #1 [ffffff800d26b890] __schedule at ffffff8758a56ba8
 #2 [ffffff800d26b8b0] schedule at ffffff8758a56ff4
 #3 [ffffff800d26b950] schedule_timeout at ffffff8758a5b42c
 #4 [ffffff800d26b9c0] wait_for_common at ffffff8758a57ecc
 #5 [ffffff800d26b9d0] wait_for_completion_timeout at ffffff8758a57f9c
 #6 [ffffff800d26ba90] __glink_spi_send at ffffff87586e8660
 #7 [ffffff800d26baa0] glink_spi_send at ffffff87586e83b4
 #8 [ffffff800d26bab0] rpmsg_send at ffffff87586e0038
 #9 [ffffff800d26baf0] diag_rpmsg_write at ffffff8757e4a7f8
#10 [ffffff800d26bb50] diagfwd_write at ffffff8757e451c8
#11 [ffffff800d26bbf0] diag_send_msg_mask_update at ffffff8757e59c04
#12 [ffffff800d26bc60] diag_cmd_set_all_msg_mask at ffffff8757e5c6e4
#13 [ffffff800d26bc70] diag_process_apps_masks at ffffff8757e5aae4
#14 [ffffff800d26bcd0] diag_process_apps_pkt at ffffff8757e419ec
#15 [ffffff800d26bda0] diagchar_write at ffffff8757e3e468
#16 [ffffff800d26be30] __vfs_write at ffffff8757a72898
#17 [ffffff800d26be70] vfs_write at ffffff8757a72c54
#18 [ffffff800d26beb0] sys_write at ffffff8757a72e20
#19 [ffffff800d26bff0] el0_svc_naked at ffffff875788487c

After all of the above analysis we have a rough picture of the whole flow:
           

1. When the diagchar driver initializes, it sets up a work item that periodically sets flushed to 1.

2. Our diag_mdlog_system process stalls in diagchar_write while recording logs, so it cannot get to diagchar_close in time, which is what would set flushed back to 0 and call wake_up_interruptible.

3. Meanwhile, when system_server registers a sensor it makes a binder call to android.hardware.sensors@1.0-service. Handling that registration also calls diagchar_write (to log the sensor request), and because flushed is already 1 the sensor HAL blocks in wait_event_interruptible and enters the D state. The whole system then hangs until diagchar_close finally runs.
