
We stored hundreds of petabytes on cheap SATA drives with random fragment placement, using Reed-Solomon 6+3 coding (half the space of three replicas but the same durability). Never lost a byte.
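If you want to sanity-check the "half the space, same durability" claim, here's a back-of-the-envelope sketch. The per-fragment failure probability p is purely illustrative (not the original system's measured numbers), and fragments/replicas are assumed to fail independently before repair:

    # Illustrative comparison of 6+3 Reed-Solomon vs. 3x replication,
    # assuming each fragment/replica independently fails with probability p
    # before it can be repaired. p is a made-up number for illustration.
    from math import comb

    def stripe_loss(p, data=6, parity=3):
        # A 6+3 stripe is lost only if more than `parity` of its 9 fragments fail.
        n = data + parity
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(parity + 1, n + 1))

    def replica_loss(p, copies=3):
        # A 3-replica object is lost only if all copies fail.
        return p**copies

    p = 1e-3
    print(stripe_loss(p))    # ~1.3e-10 (need 4 of 9 fragments gone)
    print(replica_loss(p))   # 1e-9
    # Storage overhead: 9/6 = 1.5x for the stripe vs. 3x for replication.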

Speed of recovery is crucial, because that's your window of vulnerability to multiple failures. For example, try RAID 5 on giant drives: losing a second drive during recovery becomes very likely.
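To put rough numbers on that window (mine, purely illustrative): an outright second whole-drive failure during a two-day rebuild is rarer than it feels, but on 10 TB-class drives the rebuild has to read so many bits that a single unrecoverable read error, which also kills a RAID 5 rebuild, is almost certain at the commonly quoted 1-in-10^14 spec:

    # Rough sketch of the RAID 5 rebuild-window risk, assuming independent
    # failures and typical published specs (my numbers, not the poster's).
    from math import expm1, log1p

    def second_drive_failure_prob(surviving_drives, rebuild_hours, mtbf_hours=1_000_000):
        # P(at least one surviving drive fails outright before the rebuild
        # finishes), with exponentially distributed drive lifetimes.
        return -expm1(-surviving_drives * rebuild_hours / mtbf_hours)

    def ure_during_rebuild_prob(surviving_drives, drive_tb, ure_per_bit=1e-14):
        # A rebuild reads every surviving drive end to end, so one
        # unrecoverable read error (URE) anywhere fails it.
        bits_read = surviving_drives * drive_tb * 1e12 * 8
        return -expm1(bits_read * log1p(-ure_per_bit))

    # 8-drive RAID 5 of 10 TB drives, ~48 hour rebuild:
    print(second_drive_failure_prob(7, rebuild_hours=48))  # ~3e-4
    print(ure_during_rebuild_prob(7, drive_tb=10))         # ~0.996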



No need to be rude. EDIT: The offensive part was removed

What was the probability of failure of your drives? My guess is you just didn't hit the threshold for your failure rate. The maths checks out (PhD here). Seriously, do the calculation.


We lost drives all the time. In fact we moved so much data that we needed checksums to catch the ~1e-13 undetected data errors.

We seriously did do the calculations (done by serious PhDs) and we seriously did not lose data.

I’m sure you are imagining a system that doesn’t work. But that doesn’t mean only a RAID-like setup can work.

And by the way, could you explain how to calculate the chance of data loss without taking recovery time into account?


To clarify, the assumptions I'm making for the calculation are:

1) a fixed probability of a server failing

2) a fixed erasure coding scheme used for all files

3) uncorrelated server failures

4) each erasure fragment is stored on a random server


It boils down to the following:

You can calculate a probability L of losing a given file.

Because we've assumed totally uncorrelated failures, this is the same for all files, and the probability of losing NO files out of T files is (1 - L)^T.

As you can see, this approaches 0, meaning Pr(losing at least one file) approaches 1 as T increases.

Using the probability of file loss quoted for Sia, which I would say is too low, but let's ignore that: they get L = 10^-19.

This leads to T = ~10^19 files before you expect to lose data. If you're erasure coding at the byte level, that's 10 exabytes.
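To make that concrete, here's a quick numerical sketch of 1 - (1 - L)^T at the quoted L. The only thing beyond the assumptions above is using log1p/expm1, since 1 - 1e-19 rounds to 1.0 in ordinary floating point:

    # Sketch of the threshold argument, assuming independent per-file loss
    # with the quoted L = 1e-19.
    from math import expm1, log1p

    def prob_any_loss(L, T):
        # P(lose at least one of T files) = 1 - (1 - L)^T,
        # computed via log1p/expm1 because 1 - L underflows to 1.0.
        return -expm1(T * log1p(-L))

    L = 1e-19
    for T in (1e12, 1e15, 1e18, 1e19, 1e20):
        print(f"T = {T:.0e} files: P(any loss) = {prob_any_loss(L, T):.3g}")
    # ~1e-7 at 10^12 files, ~0.1 at 10^18, ~0.63 at 10^19, ~1.0 at 10^20:
    # negligible well below the threshold, near-certain above it.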

I expect your probability of failure is much lower than that of random nodes on a distributed global network of volunteers. So yes, ~petabyte scale is below the threshold, but there is a threshold.



