The Go os/exec package lets you pass a Context when executing a command (via exec.CommandContext), so the command is terminated early if the context is canceled. This comes with a caveat: only the spawned process is killed. Its children are simply reparented and keep running.
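To see the caveat in action, here is a minimal sketch. The helper name orphanSurvivesCancel and the shell one-liner are illustrative assumptions, and it presumes a POSIX system with sh and sleep on the PATH. The shell starts a background sleep, prints its PID, and the Go side checks whether that grandchild outlives the canceled context.

```go
package main

import (
	"bufio"
	"context"
	"os/exec"
	"strconv"
	"strings"
	"syscall"
)

// orphanSurvivesCancel (hypothetical helper) starts a shell that spawns a
// background `sleep`, cancels the context, and reports whether the
// grandchild is still alive afterwards.
func orphanSurvivesCancel() (bool, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// The shell prints the PID of its background child, then blocks in `wait`.
	cmd := exec.CommandContext(ctx, "sh", "-c", "sleep 30 & echo $!; wait")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return false, err
	}
	if err := cmd.Start(); err != nil {
		return false, err
	}

	// Read the grandchild's PID from the shell's stdout.
	line, err := bufio.NewReader(stdout).ReadString('\n')
	if err != nil {
		return false, err
	}
	childPid, err := strconv.Atoi(strings.TrimSpace(line))
	if err != nil {
		return false, err
	}

	cancel()       // os/exec kills the shell only
	_ = cmd.Wait() // reaps the shell; its "killed" error is ignored here

	// Signal 0 only checks for existence, it does not deliver a signal.
	alive := syscall.Kill(childPid, 0) == nil

	// Clean up the orphan so the demo itself does not leak a process.
	_ = syscall.Kill(childPid, syscall.SIGKILL)
	return alive, nil
}
```

On a typical Linux system this should report that the sleep process is still alive after cancellation, which is exactly the leak described above.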

We can work around this by sending SIGKILL to the entire process group instead of only to the process. Note that this is OS-specific and will not work on non-POSIX operating systems such as Windows.

Caveats

  • If you are running this in a Docker container, make sure that you have something (e.g. dumb-init) running on PID 1 that will wait(2) on the zombie processes. Otherwise, this will still leak entries in the process table.
  • This mirrors the behavior of the Go stdlib, which sends SIGKILL on context cancellation. Alternate strategies (e.g. SIGTERM with SIGKILL fallback) may be worth considering, depending on the use case.
import (
	"context"
	"os/exec"
	"syscall"
)

// runCommandAndKillChildrenOnTermination runs the specified exec.Cmd and ensures that all of its children are killed when the command terminates.
// Termination can occur because the process itself exits or because the context is canceled.
// This only works on POSIX systems and has only been tested on Linux (Ubuntu 22.04) with Go version 1.18.
//
// cmd is expected to have no context set (i.e. to be created with exec.Command rather than exec.CommandContext), although a context being set should be handled gracefully.
// With the context set on cmd, we could save ourselves the ceremony and just signal the process group after cmd.Wait returns.
// However, this does not work reliably: there are edge cases where a [WaitDelay](https://cs.opensource.google/go/go/+/refs/tags/go1.21.0:src/os/exec/exec.go;l=286) of zero results in cmd.Wait blocking forever on the orphaned subprocesses.
//
// Outline:
// 1. Put the child process into its own process group, which its children inherit
// 2. Wait for the process to finish in a separate goroutine
// 3. Select on the context being canceled and the command terminating
// 4. After either, send a SIGKILL to the process group
func runCommandAndKillChildrenOnTermination(ctx context.Context, cmd *exec.Cmd) error {
	// Request that the process group id be set (Setpgid: true) to the PID of the newly spawned process (Pgid: 0)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true, Pgid: 0}

	if err := cmd.Start(); err != nil {
		return err
	}

	pgid := cmd.Process.Pid

	// Buffered channel to make sure the writer goroutine doesn't block, in case the reader completes early because of the cancellation
	cmdDone := make(chan error, 1)
	// Spin off a separate goroutine to wait for the command and report the status back
	go func() {
		defer close(cmdDone)

		err := cmd.Wait()
		cmdDone <- err
	}()

	// Wait for either the command to terminate, or the context to be canceled
	var isCancellation bool
	var cmdErr error
	select {
	case cmdErr = <-cmdDone:
		isCancellation = false
	case <-ctx.Done():
		isCancellation = true
	}

	// Once we are finished with the command execution, we want to kill all the children
	// This also applies if the command exited by itself, as it may have been SIGKILLed for other reasons
	// In case of cancellation, this will also kill the command process itself
	// Kill all processes in the group, equivalent to `kill -9 -$PGID` (note the "-" to signal the group)
	// ESRCH (no such process) is possible if the process exited and left no children behind
	// In that case do nothing and treat it like a regular exit
	if err := syscall.Kill(-pgid, syscall.SIGKILL); err != nil && err != syscall.ESRCH {
		return err
	}

	if isCancellation {
		// An alternative here is to wait on cmdDone, as the process must terminate either way
		// This would allow bubbling up the "actual" process error (i.e. SIGKILL received, in case of cancellation)
		// However, there appears to be little value in it and it merely introduces more complexity and another point of failure
		return ctx.Err()
	}
	return cmdErr
}