The simp-core acceptance tests fail at different stages (yum installs, puppet agent runs, etc.) depending on the server they run on. In all of these cases, the log lines look something like:
Some of the tests have 'workarounds' for this at specific checkpoints (begin/rescue blocks with retry). However, since the failure can occur on *any* host action (host.install, on(), retry_on(), etc.), those workarounds are insufficient.
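For reference, the checkpoint workarounds follow roughly the pattern below. This is a minimal sketch, not the actual test code: the helper name `with_retries` is hypothetical, and `IOError` is only a stand-in for whatever connection error the logs actually report. It shows why the approach does not scale: every host action would need to be wrapped individually.

```ruby
# Hypothetical helper illustrating the begin/rescue/retry pattern.
# `errors` lists the exception classes treated as transient; IOError is
# an assumption, not the error taken from the failing test logs.
def with_retries(max_attempts: 3, delay: 0, errors: [IOError])
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *errors => e
    # Give up and re-raise once the attempt budget is exhausted.
    raise if attempts >= max_attempts
    warn "attempt #{attempts} failed (#{e.class}), retrying"
    sleep(delay) if delay > 0
    retry
  end
end

# Simulated host action that fails twice before succeeding.
calls = 0
result = with_retries(max_attempts: 3, errors: [IOError]) do |attempt|
  calls += 1
  raise IOError, 'simulated dropped connection' if attempt < 3
  "ok after #{attempt} attempts"
end
```

In a real test the block body would be the host action itself, so a failure in any unwrapped call would still abort the run.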
Some failures occur within the first 10 minutes of a test run, so the session TMOUT setting of 900 seconds (15 minutes) doesn't seem to be the source of the problem.
I have tried increasing the ClientAliveInterval (to 2400 seconds) in the sshd configuration, but that did not solve the problem.
I have also explicitly set the ssh keepalive in the nodeset, but that did not solve the problem either. (Since the default is true, I wasn't really expecting this one to make a difference.)