I suppose if you want to be really thorough, you could run the tests with ptrace and single-step the threads to make different interleavings at instruction level. Has anyone seen that sort of thing done? Is there any alternatives for black-box testing, if you are not able to instrument the code like what is done here?
I've done that for testing async signal handlers, but the combinatorics are much more favorable there. If the base thread runs n instructions, you just need n runs through that run 0..n instructions before inserting the signal, and then the signal handler runs to completion and then so does the base thread. O(n^2) total time. But with t threads each with n instructions to run, and they can interrupt each other at any boundary...it's not approachable for reasonable values of n. You need to reduce by singling out the operations with interesting behavior and simulating, I think.