Thu 9 April 2015
This morning I found a process that was stuck waiting for an HTTP fetch to complete, and had been stuck for 8 hours. Obviously the fetch had not been successful, and some sort of timeout was broken as well, but I wanted the process to carry on executing for the time being. What to do?
(This particular program was a Perl program, but I don't see why the same technique wouldn't apply to almost anything.)
The fetch is never going to complete successfully, but that's OK: the program in question is designed to tolerate failed fetches. If we could only close the socket it's trying to read, the read would return and the process would continue. It turns out that's quite easy.
First, find the file descriptor of the socket. In my case the remote host is known because it appears in the last line of the process's log output (it's gingerspice.example.com), and the PID (10029) can be found in any number of ways (the pidfile, ps, pgrep, whatever).
[jes@scaryspice ~]$ sudo lsof -p 10029 | grep gingerspice
perl 10029 nobody 4u IPv4 51480959 0t0 TCP scaryspice.example.com:59521->gingerspice.example.com:11014 (ESTABLISHED)
So there we see the file descriptor is 4 (that's the "4u" in the FD column of the lsof output; the "u" means it is open for both reading and writing). Now we want to attach to the process with gdb:
[jes@scaryspice ~]$ sudo gdb /usr/bin/perl 10029
GNU gdb (GDB) CentOS
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff28c52000
0x000000339d80da70 in __read_nocancel () from /lib64/libpthread.so.0
(gdb)
(I redacted a whole load of boring output there.) We can see the process was in the __read_nocancel function from libpthread, which is consistent with our understanding that it was blocked on a read.
Now, using the gdb shell, we simply close the file descriptor and detach our debugger, and then the read will return and execution will continue:
(gdb) call close(4)
$1 = 0
(gdb) quit
A debugging session is active.
Inferior 1 [process 10029] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/bin/perl, process 10029
[jes@scaryspice ~]$
And now our HTTP fetching program continues on its merry way! Hope this is useful to someone. Give us a shout if there's a better way to do it.
Update 2015-04-11: Marko points out you can use
gdb -p 10029 --batch -ex 'call close(4)'
to close the socket in one go without invoking the gdb shell.
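If you find yourself doing this more than once, the one-liner is easy to wrap in a script. Here's a minimal Python sketch of that idea. It is not part of Marko's suggestion: the function names are made up for illustration, and it assumes gdb is on the PATH and that you are allowed to attach to the target process (hence the sudo earlier).

```python
import shlex
import subprocess


def gdb_call_argv(pid, expression):
    """Build the argv for a one-shot gdb run that evaluates `call <expression>`."""
    # -p attaches to the PID, --batch makes gdb exit once its commands have run,
    # -ex supplies one command to execute.
    return ["gdb", "-p", str(pid), "--batch", "-ex", "call " + expression]


def as_shell_command(argv):
    """Render an argv list as a copy-and-pasteable shell command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


def call_in_process(pid, expression):
    """Hypothetical helper: really run gdb against a live process.

    Returns gdb's exit status. Attaching stops the target while gdb works,
    so treat this with the same care as the interactive session.
    """
    return subprocess.run(gdb_call_argv(pid, expression)).returncode


if __name__ == "__main__":
    # Only prints the command; nothing is attached to.
    # Some gdb versions may want the return type spelled out: "(int)close(4)".
    print(as_shell_command(gdb_call_argv(10029, "close(4)")))
```

Keeping the argv builder separate from the function that actually runs gdb means you can check exactly what is about to be executed before pointing it at a real process.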
Update 2015-04-13: Jilles Tjoelker (http://www.stack.nl/~jilles/) adds:
I noticed your technique can be dangerous when used on multi-threaded applications. If another thread opens a new file descriptor and gets the same descriptor number as the forcibly closed hung connection, the originally hung thread might perform operations on it, which then affect the new object. When the forcibly closed connection is closed normally, the new file descriptor is actually closed, and the other thread will not work properly. If new file descriptors are opened frequently, the "poisoning" can spread, receiving and sending data from and to the wrong place.
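The descriptor-number reuse behind this warning is easy to see for yourself. Here's a minimal Python sketch, assuming a POSIX system (where a new descriptor gets the lowest free number) and a process in which nothing else is opening files at the same moment. The function name is invented for the demo, and the pipe and /dev/null merely stand in for the hung socket and for whatever the other thread happens to open.

```python
import os


def reused_descriptor_number():
    """Close a descriptor, open something unrelated, and return both numbers."""
    read_end, write_end = os.pipe()
    # Stand-in for the forced close(4) done from the debugger.
    os.close(read_end)
    # Stand-in for another thread opening a file shortly afterwards.
    newcomer = os.open(os.devnull, os.O_RDONLY)
    try:
        return read_end, newcomer
    finally:
        os.close(newcomer)
        os.close(write_end)


if __name__ == "__main__":
    old, new = reused_descriptor_number()
    print("closed descriptor:", old, "newly opened descriptor:", new)
```

Any code still holding the old number would now be reading from, writing to, or closing the newcomer, which is exactly the hazard described above.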
Fortunately, there is an easy solution for sockets: use shutdown() instead of close(), for example 'call shutdown(4, 2)'. The shutdown() function closes the connection (or one direction of the connection), while keeping the socket valid. Also, it actually wakes up threads in a blocking operation in most implementations, where the close() probably relies on the debugger forcing all threads to the user boundary (restarting all system calls and refetching the underlying objects from all file descriptors).
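The wake-up behaviour of shutdown() can be tried out without a debugger. Here's a minimal Python sketch under a few assumptions: it uses a local socket pair rather than a real TCP connection, the function name is invented, and the short sleep is only a crude way of giving the reader thread time to block. On the systems I'd expect this to run on, socket.SHUT_RDWR is the same value 2 used in 'call shutdown(4, 2)'.

```python
import socket
import threading
import time


def shutdown_wakes_reader(timeout=5.0):
    """Block a thread in recv(), shut the socket down, and report what happened."""
    ours, theirs = socket.socketpair()
    result = {}

    def reader():
        # Blocks until data arrives or the socket is shut down.
        result["data"] = ours.recv(1024)

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    time.sleep(0.2)  # give the reader a moment to block in recv()

    # The socket stays open, so its descriptor number cannot be reused yet.
    ours.shutdown(socket.SHUT_RDWR)
    thread.join(timeout)
    woke_up = not thread.is_alive()

    ours.close()
    theirs.close()
    return woke_up, result.get("data")


if __name__ == "__main__":
    print(shutdown_wakes_reader())
```

The reader sees an ordinary end-of-file (an empty read) and the descriptor remains valid until the program closes it itself, which is what avoids the reuse problem.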
Some operating systems such as FreeBSD and OpenBSD have a tcpdrop(8) utility to close a TCP connection without needing to attach a debugger (which is always a somewhat risky operation).