在正式解释什么是 fd 泄露的时候, 先看看三份 log, 是否有眼熟而不知所措感觉? 结合公司同事的深入研究, 总结了多种实际案例, 才有了这篇文章, 以后 FD 泄露问题在也不慌了.
- log 1: Could not read input channel file descriptors from parcel
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: FATAL EXCEPTION: main
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: Process: com.miui.weather2, PID: 20556
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: java.lang.RuntimeException: Could not read input channel file descriptors from parcel.
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.view.InputChannel.nativeReadFromParcel(Native Method)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.view.InputChannel.readFromParcel(InputChannel.java:148)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.view.InputChannel$1.createFromParcel(InputChannel.java:39)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.view.InputChannel$1.createFromParcel(InputChannel.java:37)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at com.Android.internal.view.InputBindResult.<init>(InputBindResult.java:68)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at com.Android.internal.view.InputBindResult$1.createFromParcel(InputBindResult.java:112)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at com.Android.internal.view.InputBindResult$1.createFromParcel(InputBindResult.java:110)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at com.Android.internal.view.IInputMethodManager$Stub$Proxy.startInputOrWindowGainedFocus(IInputMethodManager.java:723)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.view.inputmethod.InputMethodManager.startInputInner(InputMethodManager.java:1295)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.view.inputmethod.InputMethodManager.onPostWindowFocus(InputMethodManager.java:1543)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.view.ViewRootImpl$ViewRootHandler.handleMessage(ViewRootImpl.java:4069)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.os.Handler.dispatchMessage(Handler.java:106)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.os.Looper.loop(Looper.java:171)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at Android.App.ActivityThread.main(ActivityThread.java:6642)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at java.lang.reflect.Method.invoke(Native Method)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at com.Android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:518)
- 06-22 20:34:43.035 10037 20556 20556 E AndroidRuntime: at com.Android.internal.os.ZygoteInit.main(ZygoteInit.java:873)
- log 2:Could not allocate JNI Env
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: FATAL EXCEPTION: main
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: Process: com.xiaomi.bluetooth, PID: 2308
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: java.lang.OutOfMemoryError: Could not allocate JNI Env
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at java.lang.Thread.nativeCreate(Native Method)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at java.lang.Thread.start(Thread.java:730)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at com.Android.bluetooth.ble.c.dk(SynchronizedGattCallback.java:54)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at com.Android.bluetooth.ble.m.dk(GattPeripheral.java:97)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at com.Android.bluetooth.ble.m.eN(GattPeripheral.java:227)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at com.Android.bluetooth.ble.m.eq(GattPeripheral.java:221)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at com.Android.bluetooth.ble.z.run(PeripheralConnectionManager.java:462)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at Android.os.Handler.handleCallback(Handler.java:754)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at Android.os.Handler.dispatchMessage(Handler.java:95)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at Android.os.Looper.loop(Looper.java:160)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at Android.App.ActivityThread.main(ActivityThread.java:6202)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at java.lang.reflect.Method.invoke(Native Method)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at com.Android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:874)
- 06-22 11:59:30.335 2308 2308 E AndroidRuntime: at com.Android.internal.os.ZygoteInit.main(ZygoteInit.java:764)
- log 3:unable to open database file (code 14)
- Android.database.SQLite.SQLiteCantOpenDatabaseException: unable to open database file (code 14)
- at Android.database.SQLite.SQLiteConnection.nativeExecuteForChangedRowCount(Native Method)
- at Android.database.SQLite.SQLiteConnection.executeForChangedRowCount(SQLiteConnection.java:735)
- at Android.database.SQLite.SQLiteSession.executeForChangedRowCount(SQLiteSession.java:754)
- at Android.database.SQLite.SQLiteStatement.executeUpdateDelete(SQLiteStatement.java:64)
- at Android.database.SQLite.SQLiteDatabase.updateWithOnConflict(SQLiteDatabase.java:1653)
- at Android.database.SQLite.SQLiteDatabase.update(SQLiteDatabase.java:1599)
- at com.Android.providers.telephony.TelephonyProvider.update(TelephonyProvider.java:2704)
- at Android.content.ContentProvider$Transport.update(ContentProvider.java:357)
- at Android.content.ContentResolver.update(ContentResolver.java:1688)
- at com.Android.internal.telephony.SubscriptionController.setCarrierText(SubscriptionController.java:1202)
- at com.Android.internal.telephony.SubscriptionControllerInjectorBase$DatabaseHandler.handleMessage(SubscriptionControllerInjectorBase.java:201)
- at Android.os.Handler.dispatchMessage(Handler.java:106)
- at Android.os.Looper.loop(Looper.java:164)
- at Android.os.HandlerThread.run(HandlerThread.java:65)
相信大多人都觉得这类问题不好解决, 更有人觉得这种问题直接加个 try catch, 结果下个版本接中上报. 因为上面的 Log 基本上都不是案发现场, 正式开始需要先补一波 FD 泄露的基础知识. 在 Android 进程系列第一篇 --- 进程基础中的 2.3 小节也有简单介绍.
一, FD 的相关概念
概念: Fd 的全称是 File descriptor, 在 Linux OS 里, 所有都可以抽象成文件, 比如普通的文件, 目录, 块设备, 字符设备, socket, 管道等等. 当通过一些系统调用 (如 open/socket 等), 会返回一个 fd(就是一个数字) 给你, 然后根据这个 fd 对应的文件进行操作, 比如读, 写.
1,FD 从何而来?
我们说 fd 是一个数字, 那么这个数字是怎么计算出来的? 在内核进程结构体 task_struct 中为每个进程维护了一个数组, 数组下标就是 fd, 里面存储的是对这个文件的描述了. 里面就有 files 指针, 维护着所有打开的文件信息:
- struct task_struct {
- ...
- /* Open file information: */
- struct files_struct *files;
- ...
- }
- /*
- * Open file table structure
- */
- struct files_struct {
- /*
- * read mostly part
- */
- atomic_t count;
- bool resize_in_progress;
- wait_queue_head_t resize_wait;
- struct fdtable __rcu *fdt;
- struct fdtable fdtab;
- /*
- * written part on a separate cache line in SMP
- */
- spinlock_t file_lock ____cacheline_aligned_in_smp;
- unsigned int next_fd;
- unsigned long close_on_exec_init[1];
- unsigned long open_fds_init[1];
- unsigned long full_fds_bits_init[1];
- struct file __rcu * fd_array[NR_OPEN_DEFAULT];
- };
files_struct 中维护一个 fdtable,fdtable 里的 fd 就是一个数组, file 结构体就是为打开文件的信息了.
- struct fdtable {
- unsigned int max_fds;
- struct file __rcu **fd; /* current fd array */
- unsigned long *close_on_exec;
- unsigned long *open_fds;
- unsigned long *full_fds_bits;
- struct rcu_head rcu;
- };
2, 阈值
Linux 默认对每个进程最大能打开的 fd 的个数是 1024(软限制是 1024, 硬限制是 4096), 你可以通过 / proc/$pid/limits 查看 Max open files:
fd 阈值查看
当一个进程打开的文件数超过这个软限制值 1024 将无法再打开文件了. 所以就报出各种问题, 和 OOM 问题一样, crash 堆栈有可能只是压死骆驼的最后一根稻草, 并不是真实案发现场. 所以用完 fd 后需要 close 关闭这个 fd, 那么这个 fd 对应的数字就被系统回收了, 下一次的 open 才会被重新利用.
软限制和硬限制的区别
硬限制是可以在任何时候任何进程中设置 但硬限制只能由超级用户提起
软限制是内核实际执行的限制, 任何进程都可以将软限制设置为任意小于等于对进程限制的硬限制的操作 fd
如果觉得 fd 不够用了, 也可以用下面方式调整.
- getrlimit(RLIMIT_NOFILE, &rlim);
- setrlimit(RLIMIT_NOFILE, &rlim);
- ulimit -n 2048
- Android.system.Os.getrlimit(OsConstants.RLIMIT_NOFILE);
- egg:
- void modifyfdlimit() {
- rlimit fdLimit;
- fdLimit.rlim_cur = 30000;
- fdLimit.rlim_max = 30000;
- if (-1 == setrlimit(RLIMIT_NOFILE, & fdLimit)) {
- printf("Set max fd open count fai. /nl");
- char cmdBuffer[ 64];
- sprintf(cmdBuffer, "ulimit -n %d", 30000);
- if (-1 == system(cmdBuffer)) {
- printf("%s failed. /n", cmdBuffer);
- exit(0);
- }
- if (-1 == getrlimit(RLIMIT_NOFILE, & fdLimit)){
- printf("Ulimit fd number failed.");
- exit(0);
- }
- }
- }
3, 打开的 fd 查看
使用 ls -la /proc/$pid/fd 查看
- nitrogen:/ # pidof system_server
- 1956
- nitrogen:/ # ls -la /proc/1956/f
- fd/ fdinfo/
- nitrogen:/ # ls -la /proc/1956/fd
- total 0
- dr-x------ 2 system system 0 2018-11-02 10:57 .
- dr-xr-xr-x 9 system system 0 2018-11-02 10:57 ..
- lrwx------ 1 system system 64 2018-11-02 12:26 0 -> /dev/null
- lrwx------ 1 system system 64 2018-11-02 12:26 1 -> /dev/null
- lr-x------ 1 system system 64 2018-11-02 12:26 10 -> /system/framework/QPerformance.jar
- lrwx------ 1 system system 64 2018-11-02 12:26 100 -> anon_inode:[timerfd]
- lrwx------ 1 system system 64 2018-11-02 12:26 101 -> anon_inode:[timerfd]
- lrwx------ 1 system system 64 2018-11-02 12:26 102 -> anon_inode:[timerfd]
- lrwx------ 1 system system 64 2018-11-02 12:26 103 -> anon_inode:[timerfd]
- lrwx------ 1 system system 64 2018-11-02 12:26 104 -> anon_inode:[timerfd]
- lrwx------ 1 system system 64 2018-11-02 12:26 105 -> anon_inode:[eventpoll]
- lr-x------ 1 system system 64 2018-11-02 12:26 106 -> anon_inode:inotify
- lr-x------ 1 system system 64 2018-11-02 12:26 107 -> pipe:[36681]
- l-wx------ 1 system system 64 2018-11-02 12:26 108 -> pipe:[36681]
- lrwx------ 1 system system 64 2018-11-02 12:26 109 -> anon_inode:[eventfd]
- lr-x------ 1 system system 64 2018-11-02 12:26 11 -> /system/framework/core-oj.jar
- lrwx------ 1 system system 64 2018-11-02 12:26 110 -> anon_inode:[eventpoll]
- lrwx------ 1 system system 64 2018-11-02 12:26 111 -> socket:[35371]
- lrwx------ 1 system system 64 2018-11-02 12:26 112 -> socket:[35372]
- lr-x------ 1 system system 64 2018-11-02 12:26 113 -> /system/media/theme/defau
二, Fd Leak 案例
现在看几种 FD 泄露问题的案例, FD 泄露问题的特点是:
同一个问题可能出现不同堆栈, 比较隐晦
Fd 泄漏时内存可能不会出现不足, 就算触发 GC 也不一定能够回收已经创建的文件句柄
日志关键字:
- ashmem_create_region failed for 'indirect ref table': Too many open files
- "Too many open files"
- "Could not allocate JNI Env"
- "Could not allocate dup blob fd"
- "Could not read input channel file descriptors from parcel"
- "pthread_create"
- "InputChannel is not initialized"
- "Could not open input channel pair"
当你看到上面几种 crash 的堆栈之后, 就需要往 fd 泄露的方向上去思考了
1,Resource 相关
使用输入输出流没有关闭的可能会出问题, FileInputStream,FileOutputStream,FileReader,FileWriter 等, 因为每打开一个文件需要 fd. 一些输入流也提供了基于 fd 的构造方法
- public FileInputStream(FileDescriptor fdObj) {
- this(fdObj, false /* isFdOwner */);
- }
下面是一种泄露案例
- frameworks/base/services/core/java/com/Android/server/pm/ResmonWhitelistPackage.java
- 10final class ResmonWhitelistPackage {
- 11 private final File mSystemDir;
- 12 private final File mWhitelistFile;
- 13
- 14 final ArrayList<String> mPackages = new ArrayList<String>();
- 15
- 16 ResmonWhitelistPackage() {
- 17 mSystemDir = new File("/system/", "etc");
- 18 mWhitelistFile = new File(mSystemDir, "resmonwhitelist.txt");
- 19 }
- 20
- 21 void readList() {
- ....
- 25 try {
- 26 /// M: Clear white list record before update it
- 27 mPackages.clear();
- 28 BufferedReader br = new BufferedReader(new FileReader(mWhitelistFile));
- 29 String line = br.readLine();
- 30 while (line != null) {
- 31 mPackages.add(line);
- 32 line = br.readLine();
- 33 }
- 34 br.close();
- 35 } catch (IOException e) {
- 36 //Log.e(PackageManagerService.TAG, "IO Exception happened while reading resmon whitelist");
- 37 e.printStackTrace();
- 38 }
- 39 }
- 40}
br.close 并不是在 finally 语句中, 可能会出现未关闭的可能. 如果代码写的风骚一点, 也有办法. 从 Java 7 build 105 版本开始, Java 7 的编译器和运行环境支持新的 try-with-resources 语句, 称为 ARM 块(Automatic Resource Management) , 自动资源管理.
- private static void customBufferStreamCopy(File source, File target) {
- try (InputStream fis = new FileInputStream(source);
- OutputStream fos = new FileOutputStream(target)){
- byte[] buf = new byte[8192];
- int i;
- while ((i = fis.read(buf)) != -1) {
- fos.write(buf, 0, i);
- }
- }
- catch (Exception e) {
- e.printStackTrace();
- }
- }
代码清晰, 且不会发生泄露.
2,HandlerThread 相关
使用 HandlerThread 不小心也会发生 fd 泄露, 看看这个案例
2.1, 现象
systemui 总是 crash, 发生问题系统版本 Android O
2.2, 初步分析
- pid: 18465, tid: 32737, name: async_sensor>>> com.Android.systemui <<<signal 5 (SIGTRAP), code -32763 (PTRACE_EVENT_STOP), fault addr 0x3e800007fe1
- x0 fffffffffffffffc x1 000000735df1ec38 x2 0000000000000010 x3 00000000ffffffff
- x4 0000000000000000 x5 0000000000000008 x6 0000007428971000 x7 0000000000bb3876
- x8 0000000000000016 x9 7fffffffffffffff x10 000000000000000c x11 0000000000000000
- x12 000000735df1ed38 x13 000000005b20a831 x14 002cd0c4e58dc31b x15 0000fdf7aa690a91
- x16 00000074245e7498 x17 000000742453bd00 x18 0000000000000004 x19 000000735df1f588
- x20 000000738727b708 x21 00000000ffffffff x22 000000735df1f588 x23 000000738727b660
- x24 0000000000000028 x25 000000000000000c x26 0000000014a000b0 x27 0000007385815300
- x28 00000000710507c8 x29 000000735df1ebe0 x30 000000742453bd38
- sp 000000735df1ebc0 pc 00000074245866dc pstate 0000000060000000
- v0 00000000000000000000000000000000 v1 00000000000000000000000000000001
- v2 00000000000000002065766974616e3c v3 00000000000000000000000000000000
- v4 00000000000000008020080200000000 v5 00000000000000004000000000000000
- v6 00000000000000000000000000000000 v7 00000000000000008020080280200802
- v8 00000000000000000000000000000000 v9 00000000000000000000000000000000
- v10 00000000000000000000000000000000 v11 00000000000000000000000000000000
- v12 00000000000000000000000000000000 v13 00000000000000000000000000000000
- v14 00000000000000000000000000000000 v15 00000000000000000000000000000000
- v16 40100401401004014010040140100401 v17 a0080000a00a0000a800aa0040404000
- v18 80200800000000008020080200000000 v19 000000000000000000000000ebad8083
- v20 000000000000000000000000ebad8084 v21 000000000000000000000000ebad8085
- v22 000000000000000000000000ebad8086 v23 000000000000000000000000ebad8087
- v24 000000000000000000000000ebad8088 v25 000000000000000000000000ebad8089
- v26 000000000000000000000000ebad808a v27 000000000000000000000000ebad808b
- v28 000000000000000000000000ebad808c v29 000000000000000000000000ebad808d
- v30 000000000000000000000000ebad808e v31 00000000000000000000000041e00000
- fpsr 00000013 fpcr 00000000
- backtrace:
- #00 pc 000000000006a6dc /system/lib64/libc.so (__epoll_pwait+8)
- #01 pc 000000000001fd34 /system/lib64/libc.so (epoll_pwait+52)
- #02 pc 0000000000015d08 /system/lib64/libutils.so (Android::Looper::pollInner(int)+144)
- #03 pc 0000000000015bf0 /system/lib64/libutils.so (Android::Looper::pollOnce(int, int*, int*, void**)+108)
- #04 pc 0000000000111bac /system/lib64/libandroid_runtime.so (Android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
- #05 pc 0000000000c005cc /system/framework/arm64/boot-framework.oat (offset 0x9cb000) (Android.App.NativeActivity.onWindowFocusChangedNative [DEDUPED]+140)
- #06 pc 0000000001773f00 /system/framework/arm64/boot-framework.oat (offset 0x9cb000) (Android.os.MessageQueue.next+192)
乍看是处理消息的时候挂了? 继续查看 log 发现
- 06-15 22:00:33.921 1000 2155 2335 E Parcel : fcntl(F_DUPFD_CLOEXEC) failed in Parcel::read, i is 0, fds[i] is -1, fd_count is 2, error: Too many open files
- 06-15 22:00:33.921 1000 2155 2335 E Surface : dequeueBuffer: IGraphicBufferProducer::requestBuffer failed: -22
- 06-15 22:00:33.921 1000 2155 2335 I Adreno : DequeueBuffer: dequeueBuffer failed
- 06-15 22:00:33.921 1000 2155 2335 E Parcel : fcntl(F_DUPFD_CLOEXEC) failed in Parcel::read, i is 0, fds[i] is -1, fd_count is 2, error: Too many open files
- 06-15 22:00:33.921 1000 2155 2335 E Surface : dequeueBuffer: IGraphicBufferProducer::requestBuffer failed: -22
- 06-15 22:00:33.921 1000 2155 2335 I Adreno : DequeueBuffer: dequeueBuffer failed
- 06-15 22:00:33.921 1000 2155 2335 E Parcel : fcntl(F_DUPFD_CLOEXEC) failed in Parcel::read, i is 0, fds[i] is -1, fd_count is 2, error: Too many open files
- 06-15 22:00:33.921 1000 2155 2335 E Surface : dequeueBuffer: IGraphicBufferProducer::requestBuffer failed: -22
- 06-15 22:00:33.921 1000 2155 2335 I Adreno : DequeueBuffer: dequeueBuffer failed
- 06-15 22:00:33.922 1000 2155 2335 E Parcel : fcntl(F_DUPFD_CLOEXEC) failed in Parcel::read, i is 0, fds[i] is -1, fd_count is 2, error: Too many open files
- 06-15 22:00:33.922 1000 2155 2335 E Surface : dequeueBuffer: IGraphicBufferProducer::requestBuffer failed: -22
- 06-15 22:00:33.922 1000 2155 2335 I Adreno : DequeueBuffer: dequeueBuffer failed
- 06-15 22:00:33.922 1000 2155 2335 E OpenGLRenderer: GL error: GL_INVALID_OPERATION
- 06-15 22:00:33.922 1000 2155 2335 F OpenGLRenderer: glCopyTexSubImage2D error! GL_INVALID_OPERATION (0x502
状态栏 open fd 超过 1024, 看 log 有很多上面这种 log, 这个是真实案发现场吗?
2.3, 深入分析
O 上发生 NE 时会将 fd 信息打印到 tombstone 文件中, 看 fd 信息确实已经满了, 多为 anon_inode:[eventfd]和 anon_inode:dmabuf,
- backtrace:
- #00 pc 000000000006a6dc /system/lib64/libc.so (__epoll_pwait+8)
- #01 pc 000000000001fd34 /system/lib64/libc.so (epoll_pwait+52)
- #02 pc 0000000000015d08 /system/lib64/libutils.so (Android::Looper::pollInner(int)+144)
- #03 pc 0000000000015bf0 /system/lib64/libutils.so (Android::Looper::pollOnce(int, int*, int*, void**)+108)
- #04 pc 0000000000111bac /system/lib64/libandroid_runtime.so (Android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
- #05 pc 0000000000c005cc /system/framework/arm64/boot-framework.oat (offset 0x9cb000) (Android.App.NativeActivity.onWindowFocusChangedNative [DEDUPED]+140)
- #06 pc 0000000001773f00 /system/framework/arm64/boot-framework.oat (offset 0x9cb000) (Android.os.MessageQueue.next+192)
- ....
- fd 556: anon_inode:[eventpoll]
- fd 557: anon_inode:[eventpoll]
- fd 558: anon_inode:[eventfd]
- fd 559: anon_inode:[eventpoll]
- fd 560: anon_inode:[eventfd]
- fd 561: anon_inode:[eventpoll]
- fd 562: anon_inode:[eventfd]
- fd 563: anon_inode:[eventfd]
- fd 564: anon_inode:[eventpoll]
- fd 565: anon_inode:[eventfd]
- fd 566: anon_inode:[eventpoll]
- fd 567: /dev/ashmem
- ..... // 省略千行
- fd 1022: anon_inode:dmabuf
- fd 1023: socket:[3549620]
通过 trace 分析, 还有一个关键的异常 log, 看到 systemui 进程有很多个 async_sensor 线程, 为什么这个线程这么多呢?
pid: 11019, tid: 2301, name: async_sensor>>> com.Android.systemui <<<pid: 11019, tid: 2431, name: async_sensor>>> com.Android.systemui <<<pid: 11019, tid: 2522, name: async_sensor>>> com.Android.systemui <<<pid: 11019, tid: 2542, name: async_sensor>>> com.Android.systemui <<<pid: 11019, tid: 2600, name: async_sensor>>> com.Android.systemui <<<.....// 省略若干
pid: 11019, tid: 5693, name: async_sensor>>> com.Android.systemui <<<搜查代码 async_sensor 是什么? 发现 async_sensor 是个 HanderThread
image.PNG
看在 DozeFactory 中, 有 new AsyncSensorManager 的地方:
image.PNG
继续查看 assembleMachine 方法在哪里调用的, DozeService 中有调用 assembleMachine 的地方
image.PNG
回头在结合 log 发现, DozeService 被频繁的启动, 看来一步步的接近真相了. 这个问题看来是锁屏同事造成的问题.
- isTest=false, canDoze=true, userId=0
- 06-15 21:48:11.888 1000 1313 1384 I DreamController: Stopping dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:48:13.337 1000 1313 1384 I DreamController: Starting dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:48:24.519 1000 1313 1384 I DreamController: Stopping dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:48:33.577 1000 1313 1384 I DreamController: Starting dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:48:50.640 1000 1313 1384 I DreamController: Stopping dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:48:56.540 1000 1313 1384 I DreamController: Starting dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:49:28.240 1000 1313 1384 I DreamController: Stopping dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:49:29.207 1000 1313 1384 I DreamController: Starting dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:49:30.206 1000 1313 1384 I DreamController: Stopping dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:49:33.332 1000 1313 1384 I DreamController: Starting dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:49:37.435 1000 1313 1384 I DreamController: Stopping dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:49:50.529 1000 1313 1384 I DreamController: Starting dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:50:20.010 1000 1313 1384 I DreamController: Stopping dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- 06-15 21:50:30.148 1000 1313 1384 I DreamController: Starting dream: name=ComponentInfo{
- com.Android.systemui/com.Android.keyguard.doze.DozeService
- }, isTest=false, canDoze=true, userId=0
- ......
2.4, 修复方案
最终转给锁屏同事, 将此修复
image.PNG
所以本问题的 RootCase 就是频繁的启动了 DozeService, 创建了大量的 HandlerThread 导致 fd 泄露, 那么为什么 HandlerThread 和 fd 泄露有关系呢? 跟踪源码发现 HandlerThread 创建会引起 Looper 的创建, 每一个 Looper 在创建的时候会打开两个 fd, 一个是 eventfd, 另外一个是 mEpolled, 这个和 tombstone 文件中打印的 fd 也对上了.
image.PNG
总结, 这种泄露问题如果分析的时候, 如果不知道 HandlerThread 会创建两个 fd 的基本知识, 那么这个问题比较难以分析.
2,Thread.start 相关
线程启动的时候, 可能也会有 fd 泄露的风险, 不过这种错误不太容易犯下, 如果你真是在一个循环中创建 1024 线程, 那么立刻见效, 程序死掉.
image.PNG
trace1
- java.lang.OutOfMemoryError: Could not allocate JNI Env
- at java.lang.Thread.nativeCreate(Native Method)
- at java.lang.Thread.start(Thread.java:729)
- at com.Android.server.Wi-Fi.WifiNative.startHal(WifiNative.java:1639)
- at com.Android.server.Wi-Fi.WifiStateMachine.setupDriverForSoftAp(WifiStateMachine.java:3970)
- at com.Android.server.Wi-Fi.WifiStateMachine.-wrap9(WifiStateMachine.java)
- at com.Android.server.Wi-Fi.WifiStateMachine$InitialState.processMessage(WifiStateMachine.java:4480)
- at com.Android.internal.util.StateMachine$SmHandler.processMsg(StateMachine.java:980)
- at com.Android.internal.util.StateMachine$SmHandler.handleMessage(StateMachine.java:799)
- at Android.os.Handler.dispatchMessage(Handler.java:102)
- at Android.os.Looper.loop(Looper.java:163)
- at Android.os.HandlerThread.run(HandlerThread.java:61)
trace2
- java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again
- at java.lang.Thread.nativeCreate(Native Method)
- at java.lang.Thread.start(Thread.java:733)
- at com.tencent.mm.sdk.f.b$a.start(SourceFile:61)
- at com.tencent.mm.am.a.bU(SourceFile:60)
- at com.tencent.mm.ui.MMAppMgr$8.tC(SourceFile:315)
- at com.tencent.mm.sdk.platformtools.am.handleMessage(SourceFile:69)
- at com.tencent.mm.sdk.platformtools.aj.handleMessage(SourceFile:173)
- at com.tencent.mm.sdk.platformtools.aj.dispatchMessage(SourceFile:128)
- at Android.os.Looper.loop(Looper.java:176)
- at Android.App.ActivityThread.main(ActivityThread.java:6701)
- at java.lang.reflect.Method.invoke(Native Method)
- at com.Android.internal.os.Zygote$MethodAndArgsCaller.run(Zygote.java:246)
- at com.Android.internal.os.ZygoteInit.main(ZygoteInit.java:783)
3,Inputchannel 相关
Inputchannel 也会可能出现 fd 泄露问题, 如下:
image.PNG
3.1 在 Activity 中不断弹 Dialog
- public class Main2Activity extends Activity {
- @Override
- protected void onCreate(Bundle savedInstanceState) {
- super.onCreate(savedInstanceState);
- setContentView(R.layout.activity_main2);
- }
- public void onClick(View view) {
- for (int i = 0; i < 1024; i++) {
- AlertDialog.Builder builder = new AlertDialog.Builder(this);
- builder.setTitle("fd").setIcon(R.drawable.ic_launcher_background).create();
- builder.show();
- }
- }
- }
不过一会, 这个 App 就死了, 报出了下面的问题
- 11-02 17:38:22.263 9351-9351/com.example.wangjing.rebootdemo E/AndroidRuntime: FATAL EXCEPTION: main
- Process: com.example.wangjing.rebootdemo, PID: 9351
- java.lang.IllegalStateException: Could not execute method for Android:onClick
- at Android.view.View$DeclaredOnClickListener.onClick(View.java:5391)
- at Android.view.View.performClick(View.java:6311)
- at Android.view.View$PerformClick.run(View.java:24833)
- at Android.os.Handler.handleCallback(Handler.java:794)
- at Android.os.Handler.dispatchMessage(Handler.java:99)
- at Android.os.Looper.loop(Looper.java:173)
- at Android.App.ActivityThread.main(ActivityThread.java:6653)
- at java.lang.reflect.Method.invoke(Native Method)
- at com.Android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:547)
- at com.Android.internal.os.ZygoteInit.main(ZygoteInit.java:821)
- Caused by: java.lang.reflect.InvocationTargetException
- at java.lang.reflect.Method.invoke(Native Method)
- at Android.view.View$DeclaredOnClickListener.onClick(View.java:5386)
- at Android.view.View.performClick(View.java:6311)
- at Android.view.View$PerformClick.run(View.java:24833)
- at Android.os.Handler.handleCallback(Handler.java:794)
- at Android.os.Handler.dispatchMessage(Handler.java:99)
- at Android.os.Looper.loop(Looper.java:173)
- at Android.App.ActivityThread.main(ActivityThread.java:6653)
- at java.lang.reflect.Method.invoke(Native Method)
- at com.Android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:547)
- at com.Android.internal.os.ZygoteInit.main(ZygoteInit.java:821)
- Caused by: java.lang.RuntimeException: Could not read input channel file descriptors from parcel.
- at Android.view.InputChannel.nativeReadFromParcel(Native Method)
- at Android.view.InputChannel.readFromParcel(InputChannel.java:148)
- at Android.view.IWindowSession$Stub$Proxy.addToDisplay(IWindowSession.java:804)
- at Android.view.ViewRootImpl.setView(ViewRootImpl.java:770)
- at Android.view.WindowManagerGlobal.addView(WindowManagerGlobal.java:356)
- at Android.view.WindowManagerImpl.addView(WindowManagerImpl.java:94)
- at Android.App.Dialog.show(Dialog.java:330)
- at Android.App.AlertDialog$Builder.show(AlertDialog.java:1114)
- at com.example.wangjing.rebootdemo.Main2Activity.onClick(Main2Activity.java:28)
- at java.lang.reflect.Method.invoke(Native Method)
- at Android.view.View$DeclaredOnClickListener.onClick(View.java:5386)
- at Android.view.View.performClick(View.java:6311)
- at Android.view.View$PerformClick.run(View.java:24833)
- at Android.os.Handler.handleCallback(Handler.java:794)
- at Android.os.Handler.dispatchMessage(Handler.java:99)
- at Android.os.Looper.loop(Looper.java:173)
- at Android.App.ActivityThread.main(ActivityThread.java:6653)
- at java.lang.reflect.Method.invoke(Native Method)
- at com.Android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:547)
- at com.Android.internal.os.ZygoteInit.main(ZygoteInit.java:821)
查看一下这个进程的 fd 信息, 很多的 socket, 正常情况下, 这些东西是没有的
- jason:/ # ps -ef |grep "wangjing"
- u0_a161 13134 9462 3 17:43:12 ? 00:00:04 com.example.wangjing.rebootdemo
- root 13385 13380 3 17:45:26 pts/1 00:00:00 grep wangjing
- jason:/ # ls -la proc/13134/fd/
- total 0
- dr-x------ 2 u0_a161 u0_a161 0 2018-11-02 17:43 .
- dr-xr-xr-x 9 u0_a161 u0_a161 0 2018-11-02 17:43 ..
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 0 -> /dev/null
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 1 -> /dev/null
- lr-x------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 10 -> /system/framework/com.nxp.nfc.nq.jar
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 100 -> socket:[17760179]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 101 -> socket:[17753761]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 102 -> socket:[17773237]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 103 -> socket:[17760182]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 104 -> socket:[17760184]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 105 -> socket:[17773239]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 106 -> socket:[17776657]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 107 -> socket:[17774959]
- lrwx------ 1 u0_a161 u0_a161 64 2018-11-02 17:45 108 -> socket:[17776659]
- ......
5,Bitmap 相关
bitmap 也是需要 fd 的, 如下图, 没有关闭, 可能引发 fd 泄露的可能.
image.PNG
- trace1
- java.lang.RuntimeException: Could not allocate dup blob fd.
- at Android.graphics.Bitmap.nativeCreateFromParcel(Native Method)
- at Android.graphics.Bitmap.access$100(Bitmap.java:36)
- at Android.graphics.Bitmap$1.createFromParcel(Bitmap.java:1528)
- at Android.graphics.Bitmap$1.createFromParcel(Bitmap.java:1520)
- at Android.widget.RemoteViews$BitmapCache.<init>(RemoteViews.java:954)
- at Android.widget.RemoteViews.<init>(RemoteViews.java:1820)
- at Android.widget.RemoteViews.<init>(RemoteViews.java:1812)
- at Android.widget.RemoteViews.clone(RemoteViews.java:1905)
- at Android.App.Notification.cloneInto(Notification.java:1534)
- at Android.App.Notification.clone(Notification.java:1508)
- at Android.service.notification.StatusBarNotification.clone(StatusBarNotification.java:161)
- at com.Android.server.notification.NotificationManagerService$NotificationListeners.notifyPostedLocked(NotificationManagerService.java:3557)
- at com.Android.server.notification.NotificationManagerService$8.run(NotificationManagerService.java:2337)
- at Android.os.Handler.handleCallback(Handler.java:815)
- at Android.os.Handler.dispatchMessage(Handler.java:104)
- at Android.os.Looper.loop(Looper.java:207)
- at com.Android.server.SystemServer.run(SystemServer.java:410)
- at com.Android.server.SystemServer.main(SystemServer.java:255)
- at java.lang.reflect.Method.invoke(Native Method)
- at com.Android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:933)
- at com.Android.internal.os.ZygoteInit.main(ZygoteInit.java:782)
三, 总结
通过上面的五个案例, 总结常见的 fd 泄露的情景, 一般出现下面的 log, 就需要怀疑是否有 fd 泄露的情况.
- "Too many open files"
- "Could not allocate JNI Env"
- "Could not allocate dup blob fd"
- "Could not read input channel file descriptors from parcel"
- "pthread_create"
- "InputChannel is not initialized"
- "Could not open input channel pair"
大批量的打开 "anon_inode:[eventpoll]" 和 "pipe" 或者 "anon_inode:[eventfd]", 超过 100 个 eventpoll, 通常情况下是开启了太多的 HandlerThread/Looper/MessageQueue, 线程忘记关闭, 或者 looper 没有释放. 可以抓取 hprof 进行快速分析
对于 system server, 如果有大批量的 socket 打开, 可能是因为 Input Channel 没有关闭, 此类同样抓取 hprof, 查看 system server 中 WindowState 的情况.
大量的打开 "/dev/ashmem", 如果是 Context provider, 或者其他 App, 很可能是打开数据库没有关闭, 或者数据库链接频繁打开忘记关闭. 这个时候查看这个进程的 maps, cat proc/pid/maps, 即可看到这个 ashmem 的 name, 然后进一步可知道在哪里泄露.
3.1, 容易复现
1. 查看 fd 信息 adb shell ls -a -l /proc/<pid>/fd ,lsof
2. 查看进程线程信息: ps -t <pid>, 或者抓进程 trace, kill -3 <pid>
3. 抓取 hprof 定位资源使用情况
3.2, 难复现
1. 对于应用自身 fd 泄漏发生 JE 时可以在复写 UncatchHandlerException 在应用 crash 的时候通过 readlink 的方式读取 / proc/self/fd 的信息, 在后面发生的时候可以以获取 fd 信息
2.O 之后 NE 的 Tombstone 文件中有 open files, 可以查看打开的 fd 信息
3. 抓取进程的 ps 信息或者 trace 信息
4. 如果是 inputchannel 类型的, 有可能是窗口类型的, 因此可以查看 Windows 情况, dumpsys Windows
来源: http://www.jianshu.com/p/1f9cff12b84f