This is more a documentation nit than an outright failure, but I originally thought it was a bug. The problem lies in how Reed-Solomon is 'sold'. Not specifically by the README or the blog post it links to, but, for example, on Wikipedia, and in the logical leap the blog post invites: that IaaS providers use Reed-Solomon to 'error correct'. Combine that with what 'error correct' is generally assumed to mean, and you end up with a tool that misleads.
Because Reed-Solomon, at least as this library implements it, doesn't do any error correction at all. It does erasure coding, and nothing else: "I can recreate the original input if you give me X out of the Y chunks I made; those chunks must have no errors, or I silently produce erroneous output."
That last clause is the big surprise. I feel the docs should spell this out quite clearly.
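To make that contract concrete at the API level, here is a sketch based on my reading of the sample code. I haven't run this exact snippet, and the create/encodeParity/decodeMissing names are simply what I find in the ReedSolomon class:

import com.backblaze.erasure.ReedSolomon;
import java.util.Arrays;

public class ContractDemo {
    public static void main(String[] args) {
        final int dataShards = 4, parityShards = 2, shardSize = 1024;
        byte[][] shards = new byte[dataShards + parityShards][shardSize];
        // ... fill shards[0..3] with input data ...
        ReedSolomon codec = ReedSolomon.create(dataShards, parityShards);
        codec.encodeParity(shards, 0, shardSize);

        // Erasure: blank out a shard AND say that you did. Recoverable.
        boolean[] present = new boolean[dataShards + parityShards];
        Arrays.fill(present, true);
        shards[1] = new byte[shardSize];
        present[1] = false;
        codec.decodeMissing(shards, present, 0, shardSize); // shard 1 restored
        present[1] = true;

        // Corruption: flip a bit, but tell the decoder nothing is missing.
        // There is nothing for it to reconstruct; the bad bit survives.
        shards[1][42] ^= 1;
        codec.decodeMissing(shards, present, 0, shardSize); // silent no-op
    }
}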
Some shell exploration to further explain what I mean and 'prove' the statement "this tool does not error correct at all" (requires a POSIXy OS):
dd if=/dev/urandom of=sample.data bs=16M count=8 iflag=fullblock
java -cp . com.backblaze.erasure.SampleEncoder sample.data
# this makes 6 files
java -cp . com.backblaze.erasure.SampleDecoder sample.data
# test the base case
diff sample.data.decoded sample.data
# prints nothing, indicating files identical
# now replace one arbitrary byte in one chunk file.
mv sample.data.1 sample.data.1.orig
head -c 1234 sample.data.1.orig >sample.data.1
head -c 1 sample.data.1.orig >>sample.data.1 # corrupt one byte: byte 1235 becomes a copy of byte 1
tail -c +1236 sample.data.1.orig >>sample.data.1
#Confirm there's now a difference
diff sample.data.1 sample.data.1.orig
# should print: Binary files sample.data.1 and sample.data.1.orig differ
# Attempt to restore the file
java -cp . com.backblaze.erasure.SampleDecoder sample.data
# tool completes with no error.
diff sample.data.decoded sample.data
# prints: Binary files sample.data.decoded and sample.data differ

It's not necessarily the project's job to ensure users use it properly, but surely it'd be a good idea if the docs explained: look, here's a footgun, please don't blow your head off with it.
I'm just guessing at how to actually make this system useful for real error recovery. My best guess is this:
After making the chunks, perform the following process:
- Calculate a hash for each chunk and store it someplace; perhaps, like the file size, as a header in the chunk files.
- Double-check that, right now, every chunk is valid. I don't know how to do this; running the decoder presumably just picks 4 out of the 6 chunks (in the default 4+2 setup the sample code uses), but the code could be updated with a verify mode that cross-checks all chunks. (A sketch of the hashing step follows this list.)
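To make the hashing idea concrete, here's a minimal sketch using only the JDK's MessageDigest; the sidecar-file layout is just my invention for illustration. (As for the cross-check: unless I've misread the code, ReedSolomon already has an isParityCorrect method that could be the basis for such a verify mode.)

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class HashChunks {
    public static void main(String[] args) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < 6; i++) {      // the 6 chunk files the encoder made
            Path chunk = Paths.get("sample.data." + i);
            // Store the digest next to the chunk; a header inside the chunk
            // file (alongside the existing size header) would also work.
            byte[] digest = sha.digest(Files.readAllBytes(chunk));
            Files.write(Paths.get(chunk + ".sha256"), digest);
        }
    }
}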
Then to decode:
- Confirm the hash of each chunk. If a chunk does not match its hash, delete that chunk entirely.
- Now run the decode operation. (A sketch of this verify-then-decode step follows.)
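And the decode side, demoting any chunk that fails its hash to "missing", which is the one failure mode the decoder genuinely handles (again a sketch; loading the shards and calling decodeMissing would work as in SampleDecoder):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;

public class VerifyThenDecode {
    public static void main(String[] args) throws Exception {
        final int totalShards = 6;
        boolean[] shardPresent = new boolean[totalShards];
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < totalShards; i++) {
            Path chunk = Paths.get("sample.data." + i);
            Path hash = Paths.get("sample.data." + i + ".sha256");
            if (Files.exists(chunk) && Files.exists(hash)) {
                byte[] digest = sha.digest(Files.readAllBytes(chunk));
                // A corrupt chunk is treated exactly like a deleted one.
                shardPresent[i] = Arrays.equals(digest, Files.readAllBytes(hash));
            } // else: leave shardPresent[i] false -- the chunk is "missing"
        }
        // ... then load the surviving shards and call decodeMissing with
        // shardPresent, as SampleDecoder does, instead of trusting the disk.
    }
}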
Applying that principle to the shell sample: if you inject rm sample.data.1 right before running the decoder, the tool still produces a decoded file (it can deal with a missing chunk), and this time it is identical to the original input.
But, is this the intended way to use the tool, or is such a process beyond the scope of what the library is meant for?
I could contribute an updated sample and a few extra lines to the README spelling out that this tool does what you'd expect in the face of chunks that are completely gone, but otherwise does no error correcting, plus an expanded sample that uses a hashing scheme to turn bit flips into missing chunks. Presumably the SHA-256 implementation baked into the JDK is sufficient for such a sample, i.e. no new dependencies would be introduced just to cater to the sample.