Further investigation showed it wasn't Emacs' fault. The child process
wrote ordinary text output correctly, but each time Emacs called `read'
on the pty end, the data started with a NUL (zero) character. The
problem didn't go away when the pty/tty pair was closed, but it was
tied to one specific pty/tty pair. I did some testing using
the shell:

felid:~$ (sleep 10; od -av > /dev/tty1) < /dev/ptyp1 &
[2] 24124
felid:~$ (while :; do sleep 1; echo -n 'foo'; done) > /dev/ttyp1
0000000 nul f o o f o o f o o f o o f o o
0000020 f o o f o o nul f o o nul f o o nul f
0000040 o o nul f o o nul f o o nul f o o nul f
0000060 o o nul f o o nul f o o nul f o o nul f
[etc.]

The second command was entered about two seconds after the first. As
you can see, the first seven writes were merged into a single read with
just one NUL in front, so the NUL was being added per read, not per
write: in some sense it was the reading that was broken, not the writing.

Another surprise was that when simply opening /dev/ptyp1 and reading, the
`read' call returned EIO until /dev/ttyp1 had been opened for the first
time. (Both `cat < /dev/ptyp1' and tracing Emacs confirmed this.) When
/dev/ttyp1 was subsequently closed, the pty read blocked as it should.
In Emacs' case, this means that Emacs busy-loops on `read' until the
child process opens /dev/ttyp1.

`fuser' didn't show any process as using /dev/ptyp1 or /dev/ttyp1, so I
started using GDB on the kernel. Everything looked fine. Then I tried
`ls -lL /proc/[0-9]*/fd' and found that there was one process using
/dev/ttyp1 all this time. (BTW, should `fuser' be changed to show
processes using ptys?)

I killed that process, and everything went back to normal again. The
process was an old Emacs `bash' subprocess that had been reparented to
`init' and failed to die.

I can't reproduce this, but I have seen the same problem with older
kernels. I don't understand enough of the tty code to find whatever is
causing it.

Normally, if a process keeps the tty end open and the pty end is closed,
the tty is dissociated from the pty, so that the tty doesn't have access
to the next use of that pty/tty channel. I've checked and that works
fine, so I can't reproduce the bug described above.

At least I know how to work around the problem now: search for a rogue
process that still has /dev/ttyp1 (or whatever) open, and kill it.
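
That search can be sketched as a small shell function. This is only an
illustration: it assumes a Linux /proc in which every /proc/PID/fd entry
is a symlink naming the open file (older kernels may need the `ls -lL'
approach instead). The name `find_tty_holders' and its optional second
argument are hypothetical, not part of any existing tool.

```shell
# find_tty_holders DEVICE [PROCDIR]
# Print the PID of each process that has DEVICE open, one per line.
# PROCDIR defaults to /proc; it is a parameter only so the function
# can be tried against a fake directory tree (an assumption made for
# this sketch).
find_tty_holders() {
    local dev=$1 proc=${2:-/proc} fd target pid
    for fd in "$proc"/[0-9]*/fd/*; do
        # readlink fails for descriptors we may not inspect (or if the
        # glob matched nothing); skip those quietly.
        target=$(readlink "$fd" 2>/dev/null) || continue
        if [ "$target" = "$dev" ]; then
            pid=${fd#"$proc"/}      # strip the leading "PROCDIR/"
            echo "${pid%%/*}"       # keep only the PID component
        fi
    done | sort -un
}
```

Running `find_tty_holders /dev/ttyp1' as root (so that other users'
descriptors are visible) lists the PIDs to look at before killing
anything. It compares path names only, so a process that opened the
same device under another name would be missed.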
Enjoy,
-- Jamie