This is more a documentation nit than an outright failure, but I originally thought it was a bug. The problem lies in how Reed-Solomon is 'sold'. Not specifically by the README or the blog post it links to, but, for example, on Wikipedia, and in the logical leap the blog post invites: that IaaS providers use Reed-Solomon to 'error correct'. Combine that with what 'error correct' is generally assumed to mean, and you end up with a tool that misleads.
Because Reed-Solomon, at least as this library implements it, doesn't do any error correction at all. It does erasure coding, and nothing else: "I can recreate the original input if you give me X out of the Y chunks I made; those chunks must have no errors, or I silently produce erroneous output."
That last clause is the big surprise. I feel the docs should spell this out quite clearly.
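To make that contract concrete at the API level, here is a sketch based on my reading of the sample code. I haven't run this exact snippet, and the create/encodeParity/decodeMissing names are simply what I find in the ReedSolomon class:

import com.backblaze.erasure.ReedSolomon;
import java.util.Arrays;

public class ContractDemo {
    public static void main(String[] args) {
        final int dataShards = 4, parityShards = 2, shardSize = 1024;
        byte[][] shards = new byte[dataShards + parityShards][shardSize];
        // ... fill shards[0..3] with input data ...
        ReedSolomon codec = ReedSolomon.create(dataShards, parityShards);
        codec.encodeParity(shards, 0, shardSize);

        // Erasure: blank out a shard AND say that you did. Recoverable.
        boolean[] present = new boolean[dataShards + parityShards];
        Arrays.fill(present, true);
        shards[1] = new byte[shardSize];
        present[1] = false;
        codec.decodeMissing(shards, present, 0, shardSize); // shard 1 restored
        present[1] = true;

        // Corruption: flip a bit, but tell the decoder nothing is missing.
        // There is nothing for it to reconstruct; the bad bit survives.
        shards[1][42] ^= 1;
        codec.decodeMissing(shards, present, 0, shardSize); // silent no-op
    }
}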
Some shell exploration to further explain what I mean and 'prove' the statement "this tool does not error correct at all" (requires a POSIXy OS):
dd if=/dev/urandom of=sample.data bs=16M count=8 iflag=fullblock
java -cp . com.backblaze.erasure.SampleEncoder sample.data
# this makes 6 files
java -cp . com.backblaze.erasure.SampleDecoder sample.data
# test the base case
diff sample.data.decoded sample.data
# prints nothing, indicating files identical
# now replace one arbitrary byte in one chunk file.
mv sample.data.1 sample.data.1.orig
head -c 1234 sample.data.1.orig >sample.data.1
head -c 1 sample.data.1.orig >>sample.data.1 # corrupt one byte: byte 1235 becomes a copy of byte 1
tail -c +1236 sample.data.1.orig >>sample.data.1
#Confirm there's now a difference
diff sample.data.1 sample.data.1.orig
# should print: Binary files sample.data.1 and sample.data.1.orig differ
# Attempt to restore the file
java -cp . com.backblaze.erasure.SampleDecoder sample.data
# tool completes with no error.
diff sample.data.decoded sample.data
# prints: Binary files sample.data.decoded and sample.data differ

It's not necessarily the project's job to ensure users use it properly, but surely it'd be a good idea if the docs explained: look, here's a footgun, please don't blow your head off with it.
I'm just guessing at how to actually make this system useful for real error recovery. My best guess is this:
After making the chunks, perform the following process:
- Calculate a hash for each chunk and store it someplace; perhaps, like the file size, as a header in the chunk files.
- Double-check that, right now, every chunk is valid. I don't know how to do this; running the decoder presumably just picks 4 out of the 6 chunks (in the default 4+2 setup the sample code uses), but the code could be updated with a verify mode that cross-checks all chunks. (A sketch of the hashing step follows this list.)
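To make the hashing idea concrete, here's a minimal sketch using only the JDK's MessageDigest; the sidecar-file layout is just my invention for illustration. (As for the cross-check: unless I've misread the code, ReedSolomon already has an isParityCorrect method that could be the basis for such a verify mode.)

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class HashChunks {
    public static void main(String[] args) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < 6; i++) {      // the 6 chunk files the encoder made
            Path chunk = Paths.get("sample.data." + i);
            // Store the digest next to the chunk; a header inside the chunk
            // file (alongside the existing size header) would also work.
            byte[] digest = sha.digest(Files.readAllBytes(chunk));
            Files.write(Paths.get(chunk + ".sha256"), digest);
        }
    }
}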
Then to decode:
- Confirm the hash of each chunk. If a chunk does not match its hash, delete that chunk entirely.
- Now run the decode operation. (A sketch of this verify-then-decode step follows.)
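And the decode side, demoting any chunk that fails its hash to "missing", which is the one failure mode the decoder genuinely handles (again a sketch; loading the shards and calling decodeMissing would work as in SampleDecoder):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;

public class VerifyThenDecode {
    public static void main(String[] args) throws Exception {
        final int totalShards = 6;
        boolean[] shardPresent = new boolean[totalShards];
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < totalShards; i++) {
            Path chunk = Paths.get("sample.data." + i);
            Path hash = Paths.get("sample.data." + i + ".sha256");
            if (Files.exists(chunk) && Files.exists(hash)) {
                byte[] digest = sha.digest(Files.readAllBytes(chunk));
                // A corrupt chunk is treated exactly like a deleted one.
                shardPresent[i] = Arrays.equals(digest, Files.readAllBytes(hash));
            } // else: leave shardPresent[i] false -- the chunk is "missing"
        }
        // ... then load the surviving shards and call decodeMissing with
        // shardPresent, as SampleDecoder does, instead of trusting the disk.
    }
}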
Applying that principle to the shell sample: if you inject rm sample.data.1 right before running the decoder, the tool still produces a decoded file (it can deal with a missing chunk), and this time it is identical to the original input.
But, is this the intended way to use the tool, or is such a process beyond the scope of what the library is meant for?
I could contribute an updated sample and a few extra lines to the README spelling out that this tool does what you'd expect in the face of chunks that are completely gone, but otherwise does no error correcting, plus an expanded sample that uses a hashing scheme to turn bit flips into missing chunks. Presumably the SHA-256 implementation baked into the JDK is sufficient for such a sample, i.e. no new dependencies would be introduced just to cater to the sample.