I'm confused: surely the fair comparison would be to say that A/B tests pick the best option 100% of the time after 'reaching statistical significance'[1]; at least that's apples to apples.
[1] Like others here, I believe testing until you see significance is Doing It Wrong.
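To make [1] concrete, here's a quick simulation sketch (the function names and parameters are mine, just for illustration): both arms have the exact same conversion rate, so any "significant" result is a false positive. If you peek after every batch and stop the moment p < 0.05, the false-positive rate ends up far above the nominal 5%; testing once at a fixed sample size stays close to 5%.

```python
import math
import random

def two_sided_p(succ_a, n_a, succ_b, n_b):
    """Two-proportion pooled z-test, two-sided p-value."""
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (succ_a / n_a - succ_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(trials=500, peeks=20, batch=100,
                        base_rate=0.1, alpha=0.05, peek=True):
    """Both arms share the same true rate, so every 'significant'
    result is a false positive. With peek=True we stop at the first
    p < alpha; with peek=False we test once, after all the data."""
    hits = 0
    for _ in range(trials):
        sa = sb = na = nb = 0
        significant = False
        for _ in range(peeks):
            sa += sum(random.random() < base_rate for _ in range(batch))
            sb += sum(random.random() < base_rate for _ in range(batch))
            na += batch
            nb += batch
            if peek and two_sided_p(sa, na, sb, nb) < alpha:
                significant = True
                break
        if not peek:
            significant = two_sided_p(sa, na, sb, nb) < alpha
        hits += significant
    return hits / trials

random.seed(1)
peeking = false_positive_rate(peek=True)
fixed = false_positive_rate(peek=False)
print(f"peek after every batch: {peeking:.2f}")  # well above alpha
print(f"single test at the end: {fixed:.2f}")    # close to alpha
```

The point being: "testing until you see significance" is effectively running many correlated hypothesis tests and reporting the first lucky one, which is why the apples-to-apples comparison above only makes sense with a fixed stopping rule (or a sequential test designed for peeking).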