分类

调试定位云音乐搜索框冻结的问题

2018-03-22 10:03 debug

去年年底发布云音乐之后, 就收到社区反馈说, 使用搜狗输入法在网易云音乐搜索框中输入时, 很容易导致播放器窗口没有响应. 这个问题在开发过程中, 在我的机器上并没有遇到过.

跟踪

找一个可以重现问题的机器.

先启动播放器, 在输入框中多次输入文字之后, 就可以重现上面说到的窗口冻结的问题了. 现在使用 gdb 来跟踪一下调用栈.

获取 netease-cloud-music 的进程信息, 如下图, 其中 PID 为 24374 的是主进程 (browser process), 而 2437624415 这两个是 renderer 进程, 不需要关注它们.

$ ps aux | grep netease-cloud
debian  24374  7.8  1.3 3632692 162220 pts/0  Sl+  09:16   0:02 netease-cloud-music
debian  24376  0.0  0.4 510564 52852 pts/0    S+   09:16   0:00 /usr/bin/netease-cloud-music --type=zygote --no-sandbox --lang=en-US --log-file=/home/debian/.cache/netease-cloud-music/Cef/console.log --log-severity=error
debian  24415 13.1  1.6 2192980 197784 pts/0  Sl+  09:16   0:03 /usr/bin/netease-cloud-music --type=zygote --no-sandbox --lang=en-US --log-file=/home/debian/.cache/netease-cloud-music/Cef/console.log --log-severity=error
debian  24510  0.0  0.0  12792   992 pts/2    S+   09:16   0:00 grep --color=auto netease-cloud

现在启动 gdb, 然后把它绑到上面定位的那个 browser process, 可以看到, 在这个进程里存在数十个线程.

$ gdb /usr/bin/netease-cloud-music
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/netease-cloud-music...(no debugging symbols found)...done.
(gdb) attach 245374
Attaching to program: /usr/bin/netease-cloud-music, process 245374
ptrace: No such process.
(gdb) attach 24374
Attaching to program: /usr/bin/netease-cloud-music, process 24374
[New LWP 24375]
[New LWP 24377]
[New LWP 24378]
[New LWP 24379]
[New LWP 24380]
[New LWP 24381]
[New LWP 24382]
[New LWP 24383]
[New LWP 24384]
[New LWP 24385]
[New LWP 24386]
[New LWP 24387]
[New LWP 24388]
[New LWP 24389]
[New LWP 24390]
[New LWP 24391]
[New LWP 24392]
[New LWP 24393]
[New LWP 24394]
[New LWP 24395]
[New LWP 24396]
[New LWP 24397]
[New LWP 24398]
[New LWP 24399]
[New LWP 24400]
[New LWP 24401]
[New LWP 24403]
[New LWP 24404]
[New LWP 24405]
[New LWP 24406]
[New LWP 24407]
[New LWP 24408]
[New LWP 24409]
[New LWP 24410]
[New LWP 24411]
[New LWP 24412]
[New LWP 24413]
[New LWP 24414]
[New LWP 24427]
[New LWP 24432]
[New LWP 24433]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fbb983b46ad in read () at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.

现在看一下调用栈:

(gdb) bt
#0  0x00007fbb983b46ad in read () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fbb0c141a8b in Dtk::Core::DFileSystemWatcherPrivate::_q_readFromInotify() () at /usr/lib/x86_64-linux-gnu/libdtkcore.so.2
#2  0x00007fbb995cc7b9 in QMetaObject::activate(QObject*, int, int, void**) (sender=sender@entry=0x55ef77514728, signalOffset=<optimized out>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x7fff381a5bd0) at kernel/qobject.cpp:3730
#3  0x00007fbb995ccd77 in QMetaObject::activate(QObject*, QMetaObject const*, int, void**) (sender=sender@entry=0x55ef77514728, m=m@entry=0x7fbb997ec6c0 <QSocketNotifier::staticMetaObject>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x7fff381a5bd0) at kernel/qobject.cpp:3595
#4  0x00007fbb99643e0e in QSocketNotifier::activated(int, QSocketNotifier::QPrivateSignal) (this=this@entry=0x55ef77514728, _t1=<optimized out>) at .moc/moc_qsocketnotifier.cpp:135
#5  0x00007fbb995d8a42 in QSocketNotifier::event(QEvent*) (this=0x55ef77514728, e=0x7fff381a5e30) at kernel/qsocketnotifier.cpp:260
#6  0x00007fbb9dad328c in QApplicationPrivate::notify_helper(QObject*, QEvent*) (this=<optimized out>, receiver=0x55ef77514728, e=0x7fff381a5e30) at kernel/qapplication.cpp:3804
#7  0x00007fbb9dad837f in QApplication::notify(QObject*, QEvent*) (this=0x55ef774a4030, receiver=0x55ef77514728, e=0x7fff381a5e30) at kernel/qapplication.cpp:3561
#8  0x00007fbb995a12a8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) (receiver=0x55ef77514728, event=event@entry=0x7fff381a5e30) at kernel/qcoreapplication.cpp:1015
#9  0x00007fbb995f410d in QCoreApplication::sendEvent(QObject*, QEvent*) (event=0x7fff381a5e30, receiver=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qcoreapplication.h:225
#10 0x00007fbb995f410d in socketNotifierSourceDispatch(GSource*, GSourceFunc, gpointer) (source=0x55ef774d9370) at kernel/qeventdispatcher_glib.cpp:100
#11 0x00007fbb9b519887 in g_main_dispatch (context=0x55ef77386130) at ../../../../glib/gmain.c:3148
#12 0x00007fbb9b519887 in g_main_context_dispatch (context=context@entry=0x55ef77386130) at ../../../../glib/gmain.c:3813
#13 0x00007fbb9b519ab8 in g_main_context_iterate (context=context@entry=0x55ef77386130, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../../../../glib/gmain.c:3886
#14 0x00007fbb9b519b5c in g_main_context_iteration (context=0x55ef77386130, may_block=may_block@entry=1) at ../../../../glib/gmain.c:3947
#15 0x00007fbb995f3c7f in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) (this=0x55ef774ddec0, flags=...) at kernel/qeventdispatcher_glib.cpp:417
#16 0x00007fbb9959f57a in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) (this=this@entry=0x7fff381a6030, flags=..., flags@entry=...) at kernel/qeventloop.cpp:204
#17 0x00007fbb995a762d in QCoreApplication::exec() () at kernel/qcoreapplication.cpp:1285
#18 0x000055ef7575330c in main ()

sysdeps/unix/syscall-template.S:84 这部分汇编文件可以在 glibc 的源代码中找到:

/* This is a "normal" system call stub: if there is an error,
   it returns -1 and sets errno.  */

T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
	ret
T_PSEUDO_END (SYSCALL_SYMBOL)

可以看到, 这是一个正常的系统调用. 然后看一下上一个调用, 是: Dtk::Core::DFileSystemWatcherPrivate::_q_readFromInotify() () at /usr/lib/x86_64-linux-gnu/libdtkcore.so.2 这里并没有打印出具体调用了 dtkcore 里面的那个函数, 为此, 可以安装 dtkcore 的 debug 包:

$ sudo apt install libdtkcore2-dbgsym

安装完之后, 在 gdb 里面先 detach 刚刚调试的进程, 然后重新用 attach 24374 命令绑定到它.

(gdb) bt
#0  0x00007fbb983b46ad in read () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fbb0c141a8b in Dtk::Core::DFileSystemWatcherPrivate::_q_readFromInotify() (this=0x55ef775146f0) at filesystem/dfilesystemwatcher_linux.cpp:144
#2  0x00007fbb995cc7b9 in QMetaObject::activate(QObject*, int, int, void**) (sender=sender@entry=0x55ef77514728, signalOffset=<optimized out>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x7fff381a5590) at kernel/qobject.cpp:3730
#3  0x00007fbb995ccd77 in QMetaObject::activate(QObject*, QMetaObject const*, int, void**) (sender=sender@entry=0x55ef77514728, m=m@entry=0x7fbb997ec6c0 <QSocketNotifier::staticMetaObject>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x7fff381a5590) at kernel/qobject.cpp:3595
#4  0x00007fbb99643e0e in QSocketNotifier::activated(int, QSocketNotifier::QPrivateSignal) (this=this@entry=0x55ef77514728, _t1=<optimized out>) at .moc/moc_qsocketnotifier.cpp:135
#5  0x00007fbb995d8a42 in QSocketNotifier::event(QEvent*) (this=0x55ef77514728, e=0x7fff381a57f0) at kernel/qsocketnotifier.cpp:260
#6  0x00007fbb9dad328c in QApplicationPrivate::notify_helper(QObject*, QEvent*) (this=<optimized out>, receiver=0x55ef77514728, e=0x7fff381a57f0) at kernel/qapplication.cpp:3804
#7  0x00007fbb9dad837f in QApplication::notify(QObject*, QEvent*) (this=0x55ef774a4030, receiver=0x55ef77514728, e=0x7fff381a57f0) at kernel/qapplication.cpp:3561
#8  0x00007fbb995a12a8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) (receiver=0x55ef77514728, event=event@entry=0x7fff381a57f0) at kernel/qcoreapplication.cpp:1015
#9  0x00007fbb995f410d in QCoreApplication::sendEvent(QObject*, QEvent*) (event=0x7fff381a57f0, receiver=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qcoreapplication.h:225
#10 0x00007fbb995f410d in socketNotifierSourceDispatch(GSource*, GSourceFunc, gpointer) (source=0x55ef774d9370) at kernel/qeventdispatcher_glib.cpp:100
#11 0x00007fbb9b519887 in g_main_dispatch (context=0x55ef77386130) at ../../../../glib/gmain.c:3148
#12 0x00007fbb9b519887 in g_main_context_dispatch (context=context@entry=0x55ef77386130) at ../../../../glib/gmain.c:3813
#13 0x00007fbb9b519ab8 in g_main_context_iterate (context=context@entry=0x55ef77386130, block=block@entry=0, dispatch=dispatch@entry=1, self=<optimized out>) at ../../../../glib/gmain.c:3886
#14 0x00007fbb9b519b5c in g_main_context_iteration (context=0x55ef77386130, may_block=0) at ../../../../glib/gmain.c:3947
#15 0x00007fbb8f5a641a in  () at /usr/lib/x86_64-linux-gnu/qcef/libcef.so
#16 0x00007fbb8f5cd59a in  () at /usr/lib/x86_64-linux-gnu/qcef/libcef.so
#17 0x00007fbb8f4dc41a in  () at /usr/lib/x86_64-linux-gnu/qcef/libcef.so
#18 0x00007fbb995cc85c in QtPrivate::QSlotObjectBase::call(QObject*, void**) (a=0x7fff381a5ab0, r=0x55ef7764e700, this=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qobject_impl.h:124
#19 0x00007fbb995cc85c in QMetaObject::activate(QObject*, int, int, void**) (sender=sender@entry=0x55ef7764e700, signalOffset=<optimized out>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x0)
    at kernel/qobject.cpp:3715
#20 0x00007fbb995ccd77 in QMetaObject::activate(QObject*, QMetaObject const*, int, void**) (sender=sender@entry=0x55ef7764e700, m=m@entry=0x7fbb997ec780 <QTimer::staticMetaObject>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x0) at kernel/qobject.cpp:3595
#21 0x00007fbb99643fb0 in QTimer::timeout(QTimer::QPrivateSignal) (this=this@entry=0x55ef7764e700) at .moc/moc_qtimer.cpp:198
#22 0x00007fbb995d8e18 in QTimer::timerEvent(QTimerEvent*) (this=0x55ef7764e700, e=<optimized out>) at kernel/qtimer.cpp:247
#23 0x00007fbb995cd2ab in QObject::event(QEvent*) (this=0x55ef7764e700, e=<optimized out>) at kernel/qobject.cpp:1278
#24 0x00007fbb9dad328c in QApplicationPrivate::notify_helper(QObject*, QEvent*) (this=<optimized out>, receiver=0x55ef7764e700, e=0x7fff381a5e00) at kernel/qapplication.cpp:3804
#25 0x00007fbb9dad837f in QApplication::notify(QObject*, QEvent*) (this=0x55ef774a4030, receiver=0x55ef7764e700, e=0x7fff381a5e00) at kernel/qapplication.cpp:3561
#26 0x00007fbb995a12a8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) (receiver=0x55ef7764e700, event=event@entry=0x7fff381a5e00) at kernel/qcoreapplication.cpp:1015
#27 0x00007fbb995f2c1e in QCoreApplication::sendEvent(QObject*, QEvent*) (event=0x7fff381a5e00, receiver=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qcoreapplication.h:225
#28 0x00007fbb995f2c1e in QTimerInfoList::activateTimers() (this=0x55ef774fdf90) at kernel/qtimerinfo_unix.cpp:637
#29 0x00007fbb995f3141 in timerSourceDispatch(GSource*, GSourceFunc, gpointer) (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:176
#30 0x00007fbb9b519887 in g_main_dispatch (context=0x55ef77386130) at ../../../../glib/gmain.c:3148
#31 0x00007fbb9b519887 in g_main_context_dispatch (context=context@entry=0x55ef77386130) at ../../../../glib/gmain.c:3813
#32 0x00007fbb9b519ab8 in g_main_context_iterate (context=context@entry=0x55ef77386130, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../../../../glib/gmain.c:3886
#33 0x00007fbb9b519b5c in g_main_context_iteration (context=0x55ef77386130, may_block=may_block@entry=1) at ../../../../glib/gmain.c:3947
#34 0x00007fbb995f3c7f in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) (this=0x55ef774ddec0, flags=...) at kernel/qeventdispatcher_glib.cpp:417
#35 0x00007fbb9959f57a in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) (this=this@entry=0x7fff381a6030, flags=..., flags@entry=...) at kernel/qeventloop.cpp:204
#36 0x00007fbb995a762d in QCoreApplication::exec() () at kernel/qcoreapplication.cpp:1285
#37 0x000055ef7575330c in main ()

现在来看, 已经可以打印出源文件的行号了: Dtk::Core::DFileSystemWatcherPrivate::_q_readFromInotify() (this=0x55ef775146f0) at filesystem/dfilesystemwatcher_linux.cpp:144

在 dtkcore 项目里, 我们来看一下第144号是什么:

void DFileSystemWatcherPrivate::_q_readFromInotify()
{
    Q_Q(DFileSystemWatcher);
//    qDebug() << "QInotifyFileSystemWatcherEngine::readFromInotify";

    int buffSize = 0;
    ioctl(inotifyFd, FIONREAD, (char *) &buffSize);
    QVarLengthArray<char, 4096> buffer(buffSize);
    buffSize = read(inotifyFd, buffer.data(), buffSize);
    char *at = buffer.data();
    char * const end = at + buffSize;

    QList<inotify_event *> eventList;
    QHash<int, QString> pathForId;
    /// only save event: IN_MOVE_TO
    QMap<int, QString> cookieToFilePath;
    QMap<int, QString> cookieToFileName;
    QSet<int> hasMoveFromByCookie;
    while (at < end) {
        inotify_event *event = reinterpret_cast<inotify_event *>(at);
        QString path;

是这里: buffSize = read(inotifyFd, buffer.data(), buffSize); 读取 inotify 文件句柄时, 被阻塞了.

后来了解了一下, 同事当时为何要把 QFilesystemWatcher 复制一份出来, 原来是 QFileSystemWatcher 提供的接口太少. 这个问题在几年前开发云音乐客户端时 我也有遇到过, 当时的做法是写了一个类似的文件系统监控接口, 只是功能更全面了些:

class FilesystemWatcher : public QObject {
  Q_OBJECT

 public:
  explicit FilesystemWatcher(QObject* parent = nullptr);
  ~FilesystemWatcher() override;

  struct WatchItem {
    int watch_id = 0;
    QString root_dir;
    QString dir;
    bool recursive = false;
  };

 signals:
  void fileAdded(const QString& root_dir, const QString& filepath);
  void fileDeleted(const QString& root_dir, const QString& filepath);

  void folderAdded(const QString& root_dir, const QString& dir);
  void folderDeleted(const QString& root_dir, const QString& dir);

  // Add all of files in |root_dir| to watch list.
  // If |recursive| is true, also sub directories are added to list.
  // Emit onWatchFolder() signal.
  void watchFolder(const QString& root_dir, bool recursive);
  void onWatchFolder(const QString& root_dir, bool ok);

  // Remove |root_dir| from watch list.
  void rmWatchFolder(const QString& root_dir);
};

看到这里, 就想到了解决方式, 那就是使用非阻塞的方法来读取 inotify 文件句柄. 在 DFileSystemWatcher 类里面找一下 inotify_init, 定位到了以下代码:

/*!
    Constructs a new file system watcher object with the given \a parent.
*/
DFileSystemWatcher::DFileSystemWatcher(QObject *parent)
    : QObject(parent)
    , DObject()
{
    int fd = -1;
#ifdef IN_CLOEXEC
    fd = inotify_init1(IN_CLOEXEC);
#endif
    if (fd == -1) {
        fd = inotify_init();
    }

    if (fd != -1)
        d_d_ptr.reset(new DFileSystemWatcherPrivate(fd, this));
}

修改一下 inotify 句柄被初始化时的方式, 改为非阻塞式的.

/*!
    Constructs a new file system watcher object with the given \a parent.
*/
DFileSystemWatcher::DFileSystemWatcher(QObject *parent)
    : QObject(parent)
    , DObject()
{
    int fd = -1;
#ifdef IN_CLOEXEC
    fd = inotify_init1(IN_CLOEXEC | O_NONBLOCK);
#endif
    if (fd == -1) {
        fd = inotify_init1(O_NONBLOCK);
    }

    if (fd != -1)
        d_d_ptr.reset(new DFileSystemWatcherPrivate(fd, this));
}

有关 inotify_init1() 这个接口的详细信息可以读取一下手册:

$ man inotify
$ man inotify_init

修改完之后, 为 dtkcore 生成新的 deb 包, 然后安装, 重新验证. 验证数十次之后, 搜狗输入法输入时导致主窗口冻结的问题再没有出现过.

反思

但是问题来了, 云音乐项目并没有使用 dtkcore 的任何代码, 怎么会调用到 DFilesystemWatcher 呢? 另外, 这个输入法的问题, 在我经常使用的 openbox 环境里面, 并不能重现.

经询问得知, 他们为 linux deepin 的桌面环境写了一个自动切换主题的插件, 该插件会通过 DFilesystemWatcher监控一个配置文件的变化. 在这个桌面环境里, 所有的 Qt 进程都会 自动载入这个插件. 但问题在于, 这个监控的操作, 是在 UI 线程里做的.

开发过界面程序的工程师大都知道, UI 线程被卡住冻结, 是个经常出现的问题, 原因就在于, 在这个线程里做了耗时的操作, 导致鼠标/键盘事件不能及时在 MessageLoop 里面被处理掉. 常见的解决方式就是把 I/O 相关的操作, 都放到后台的工作线程里. chrome 浏览器就是 典型的例子, 上面 gdb 有显示, 在主进程里, 有三四十个线程, 其中的很大一部分 就是 I/O 工作线程, 用于完成 http 请求, DNS 请求, SQLite 读写操作, 以及本地的缓存文件等.