That's right. There's a problem which I referred to as the toothpaste problem - you can squeeze the needed branchiness around but you can't make it go away entirely (at least, I couldn't).
There used to be 4 stages (stage 1 was the marks, stage 2 was bits-to-indexes, stage 3 was the tape construction and stage 4 was the 'clean everything up and do all the hard stuff'). It's possible - though awkward - to do tape construction branchlessly, but the gyrations required were expensive and weird and it just delayed the reckoning.
I built a prototype of the 'gather all the X at once and handle it in one go' and the logic to gather that stuff was more expensive than just handling everything.
In my wacky world of branch free coding (which I've been doing a while) there are worse things than missing a branch. The idea that you can accumulate an array branchlessly (i.e. always put the thing you have in a location somewhere and bump a pointer when you need to) seems pretty cool, but branch miss is not the only hazard. This technique of branchlessly writing a log is a anti-pattern I've tried over and over again and a stream of unpredictable writes is just as big a pain as a bunch of unpredictable branches - it causes the pipeline to come to a screaming halt. If you can get it going somehow (new SIMD tricks? better algorithm?) I'd be impressed.
Get my email from Daniel or hit me up in Twitter DMs if you want to discuss further.
... and yes, agreed that Daniel has done a great job making this into a real thing. This would have stayed a bunch of code sketches if I had been the only one working on it. In terms of quality, the project now is well on the way to being commercial quality code rather than the research prototype that I did. I understand I have you to thank for that as well!
There used to be 4 stages (stage 1 was the marks, stage 2 was bits-to-indexes, stage 3 was the tape construction and stage 4 was the 'clean everything up and do all the hard stuff'). It's possible - though awkward - to do tape construction branchlessly, but the gyrations required were expensive and weird and it just delayed the reckoning.
I built a prototype of the 'gather all the X at once and handle it in one go' and the logic to gather that stuff was more expensive than just handling everything.
In my wacky world of branch free coding (which I've been doing a while) there are worse things than missing a branch. The idea that you can accumulate an array branchlessly (i.e. always put the thing you have in a location somewhere and bump a pointer when you need to) seems pretty cool, but branch miss is not the only hazard. This technique of branchlessly writing a log is a anti-pattern I've tried over and over again and a stream of unpredictable writes is just as big a pain as a bunch of unpredictable branches - it causes the pipeline to come to a screaming halt. If you can get it going somehow (new SIMD tricks? better algorithm?) I'd be impressed.
Get my email from Daniel or hit me up in Twitter DMs if you want to discuss further.