It is far easier to understand error-detection codes when we establish a parallel with check digits, an everyday device. Even more so for software developers, since almost everyone has implemented a modulo-11 check digit, or at least some check digit algorithm, at some point in their career.
This is not true for error-correcting codes. Simple codes like Hamming can be understood directly in their binary incarnations, but there is no material that bridges a complex code like Reed-Solomon with the realm of decimal numbers. This text tries to fill that gap.
If you are just looking for an implementation of an error-correcting check digit, you can go straight to this GitHub repository, where I offer a Python implementation.
Generating a Reed-Solomon code is similar to generating a CRC code, and also similar to modulo-11 check digit generation. In all cases, the arithmetic is modular: the "clock-hand arithmetic" where absurd-looking equalities like 11+2=1 can be true.
In a wall clock, the set of possible values has twelve elements. An arithmetic with 12 elements, or 10 elements, can make a ring, but not a field. In order to have a field, the element count must be a prime or a prime power. The nearest prime above 10 is 11, so we adopt the field GF(11).
But why can't there be a GF(12) or GF(10) field? One reason is that certain tricks only work in a Galois field. In GF(11), multiplying any nonzero element by 1, 2, 3... yields all field elements. Using 3 as a generator:
3 x 1  = 3
3 x 2  = 6
3 x 3  = 9
3 x 4  = 12 % 11 = 1
3 x 5  = 15 % 11 = 4
3 x 6  = 18 % 11 = 7
3 x 7  = 21 % 11 = 10
3 x 8  = 24 % 11 = 2
3 x 9  = 27 % 11 = 5
3 x 10 = 30 % 11 = 8
3 x 11 = 33 % 11 = 0
3 x 12 = 36 % 11 = 3
...
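A minimal Python sketch of this trick (names are mine, not from the repository): multiplying a fixed element by 1, 2, 3... visits every element of GF(11) exactly once, while the same idea can fail modulo 12, which is only a ring.

```python
def multiples(g, modulus):
    # the set of g*1, g*2, ..., g*(modulus-1), reduced modulo the modulus
    return {(g * k) % modulus for k in range(1, modulus)}

print(sorted(multiples(3, 11)))   # all of 1..10 appear
print(sorted(multiples(8, 12)))   # 8 shares a factor with 12: only {0, 4, 8}
```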
In a 12-element "clock arithmetic", the trick above only works for generators that share no common factor with 12. Another trick is to generate all nonzero elements using powers. This time, our generator will be 2:
2 ^ 0  = 1
2 ^ 1  = 2
2 ^ 2  = 4
2 ^ 3  = 8
2 ^ 4  = 16 % 11 = 5
2 ^ 5  = 32 % 11 = 10
2 ^ 6  = 64 % 11 = 9
2 ^ 7  = 128 % 11 = 7
2 ^ 8  = 256 % 11 = 3
2 ^ 9  = 512 % 11 = 6
2 ^ 10 = 1024 % 11 = 1
2 ^ 11 = 2048 % 11 = 2
...
Even in a Galois field, not every nonzero element is a power generator. In the case of GF(11), the generators are 2, 6, 7 and 8. Power generators are important because they allow for near-miraculous simplifications of polynomial equations.
In discrete math, the inverse of power is the discrete logarithm. For example, in a field GF(11) and generator 2, the discrete log of 5 is 4, since (as can be seen above) 2^4=5. It is difficult to calculate the discrete log, but since Reed-Solomon uses rather small fields, we can use a lookup table.
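Since the field is so small, both directions fit in tiny lookup tables. A sketch (EXP and LOG are my names for them):

```python
P, G = 11, 2                                  # GF(11), generator 2
EXP = [pow(G, n, P) for n in range(P - 1)]    # EXP[n] = 2^n mod 11
LOG = {v: n for n, v in enumerate(EXP)}       # discrete log lookup

print(EXP)       # [1, 2, 4, 8, 5, 10, 9, 7, 3, 6]
print(LOG[5])    # 4, since 2^4 = 16 % 11 = 5
```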
There is more than one way to calculate RS digits. Most implementations are based on the BCH code, since they target binary data. Here we will adopt the "classical" way, suggested by the original authors, which also happens to be very similar to the modulo-11 check digit calculation.
Before the long and boring explanation, here is a Google Sheets link where you can play with values and get your hands dirty: Click here to open.
In the example sheet, the original message is 3141592, with 7 symbols. We want 3 check digits, obtaining a code that detects all errors of up to 2 digits and corrects all 1-digit errors.
The formula for each RS digit is a polynomial whose coefficients are the message symbols. For the message 3141592, the polynomial is:
3*x^7 + 1*x^6 + 4*x^5 + 1*x^4 + 5*x^3 + 9*x^2 + 2*x^1
In the sheet, we calculate every polynomial term in one cell (see block C5:I7), and each RS digit (column J) is the sum of one row (5, 6, and 7). All formulas are laced with MOD(x,11) since all the arithmetic is modular.
The "x" value in the polynomial is different for every RS digit, and it must follow some orderly sequence. The original RS paper suggests a handful of such sequences. We have adopted the powers of the generator 2: 2^0, 2^1, 2^2, which are simply 1, 2, and 4. The "exes" can be found in the block B5:B7. Using this particular sequence will help us later on.
(In retrospect, it would have been better to begin the sequence with 2^1, since that guarantees the first RS digit also detects digit-swapping errors, which are common when numbers are typed by a human. This is the sort of thing a code for binary messages does not have to worry about, but it is relevant for decimal messages since there is a human factor involved.)
Each "x" value is again raised to some power during the polynomial evaluation, so it is still working as a generator. This guarantees the "weights" multiplied by each message digit are well spread, and that a single change in the message affects all RS check digits.
The polynomial thing may sound scary, but it ends up being similar to the modulo-11 checkdigit. Up to now, the only big difference is the creation of 3 checkdigits instead of just one.
In both RS and modulo-11 check digits, a check digit can assume 11 different values, including 10. This is expected, given the use of a GF(11) field. In the case of modulo-11, some implementations use "X" to represent the value 10, but most simply cast it to 0 or 1, effectively weakening the code.
According to the sheet example, the RS code for 3141592 is 313, so the final message bundle to be transmitted is 3141592-313.
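The encoding described above fits in a few lines of Python. This is a minimal illustration of the sheet's math, not the repository code; the function name is mine:

```python
P = 11  # GF(11)

def rs_digits(message, nchecks=3):
    # evaluate the message polynomial at x = 2^0, 2^1, 2^2 (= 1, 2, 4)
    xs = [pow(2, i, P) for i in range(nchecks)]
    digits = []
    for x in xs:
        acc = 0
        # rightmost digit gets power 1, leftmost gets the highest power
        for power, digit in enumerate(reversed(message), start=1):
            acc = (acc + digit * pow(x, power, P)) % P
        digits.append(acc)
    return digits

print(rs_digits([3, 1, 4, 1, 5, 9, 2]))   # [3, 1, 3]
```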
When an RS-protected message is received, the first thing is to check for errors. The method is the same as for CRC and check digits: redo the same math and check whether it generates different RS digits.
In the sheet, you can "corrupt" any digits of the message in the C10:I10 range to simulate what happens in case of errors. For example, if the received message is 3141692-313, the RS digits recalculated at the RX side are 491, different from 313. Since they differ, there are errors. Note that all three digits changed. This sweeping change is not 100% guaranteed, but it is what happens in most cases.
What if the RS code itself was corrupted? For example, what if the received message was 3141592-333? In this case, the RX math would determine the RS digits should be 313 instead of 333. Since the RS code diverges in a single digit, chances are the error is in the code itself, not in the main message. If the error-correcting algorithm fails, this possibility is then considered.
Back to the corrupted message 3141692-313, whose RX-side RS code is 491. In order to discover the position and the magnitude of the error, we start by finding the syndrome. The syndrome is the difference between the received and the calculated RS codes. The math is carried out digit by digit, modulo 11, with no borrow or carry between digits:
4 - 3 = 1
9 - 1 = 8
1 - 3 = -2 = 9 (mod 11)
The syndrome 189 is the RS code of the hypothetical message 0000100 (and also many other messages). Note that 0000100 is exactly the modulo-11 difference between the corrupted message 3141692 and the original 3141592. 0000100 contains both the magnitude and the position of the error we are looking for.
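A minimal Python sketch of this check (the rs_digits helper re-implements the sheet's encoding; names are mine):

```python
P = 11

def rs_digits(message, nchecks=3):
    # evaluate the message polynomial at x = 1, 2, 4 in GF(11)
    xs = [pow(2, i, P) for i in range(nchecks)]
    return [sum(d * pow(x, p, P)
                for p, d in enumerate(reversed(message), start=1)) % P
            for x in xs]

received_msg, received_rs = [3, 1, 4, 1, 6, 9, 2], [3, 1, 3]
recalculated = rs_digits(received_msg)                    # [4, 9, 1]
syndrome = [(a - b) % P for a, b in zip(recalculated, received_rs)]
print(syndrome)                                           # [1, 8, 9]
print(rs_digits([0, 0, 0, 0, 1, 0, 0]))                   # also [1, 8, 9]
```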
This means: there is a "kinship" between the syndrome and the error content of the RX message. All we need to do is to discover the error value given the syndrome. This is difficult because there are many, many values whose RS code is also 189.
One more thing we know: the error value has a single nonzero digit. Our RS code has 3 digits, so it can fix a single error. In this particular case, we can just test different digits in different places until we zero the syndrome (there are just 63 possibilities: 7 positions times 9 alternative digits). If none of these 63 variations has an RS code of 189, it means there are two or more errors and the message is uncorrectable.
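The exhaustive search can be sketched like this (a minimal illustration under the same assumptions as before; function names are mine). Instead of zeroing the syndrome explicitly, it recalculates the RS digits of each candidate and compares with the received ones, which is equivalent:

```python
P = 11

def rs_digits(message, nchecks=3):
    # evaluate the message polynomial at x = 1, 2, 4 in GF(11)
    xs = [pow(2, i, P) for i in range(nchecks)]
    return [sum(d * pow(x, p, P)
                for p, d in enumerate(reversed(message), start=1)) % P
            for x in xs]

def correct_single_error(message, received_rs):
    if rs_digits(message) == received_rs:
        return message                       # no error at all
    # 7 positions x 9 alternative digits = 63 candidates
    for pos in range(len(message)):
        for digit in range(10):
            if digit == message[pos]:
                continue
            candidate = message[:pos] + [digit] + message[pos + 1:]
            if rs_digits(candidate) == received_rs:
                return candidate
    return None                              # two or more errors

print(correct_single_error([3, 1, 4, 1, 6, 9, 2], [3, 1, 3]))
# [3, 1, 4, 1, 5, 9, 2]
```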
This exhaustive search is not practical when the RS code is long enough to correct multiple errors. In that case, we need to do it the 'right' way, using sheer math.
The core technique is to create an equation system using the polynomial representation of the error. If we could correct 5 errors, the polynomial would have 5 terms. In our humble example, we can correct at most 1 error, so the polynomial has a single term:
a * x ^ b

a = error magnitude
b = error position
When we encoded the message, we calculated each RS digit using a different value for "x": 1, 2 and 4. We do the same now, but this time we know the result (the syndrome digit) and the unknowns are the polynomial coefficients:
a * (1) ^ b = 1
a * (2) ^ b = 8
a * (4) ^ b = 9
For "n" unknowns, we need a system of "n" equations. In the system above, there are 2 unknowns and 3 equations. We know we can resolve "a" and "b", but how?
If the RS code had 4 digits and we wanted to find 2 errors, the polynomial would have two terms and 4 unknowns. The syndrome would have 4 digits and we would create a system of 4 equations.
Back to the example. The value of "a" is a low-hanging fruit:
a * (1) ^ b = 1
a = 1
Now we replace the value of "a" in another equation, in order to find "b":
a * (2) ^ b = 8
1 * 2 ^ b = 8
2 ^ b = 8        (the discrete log of 8 is 3)
b = 3
Finding "b" using the third equation would not be as easy. It is interesting to resolve it because it shows off some tricks of Galois fields:
a * (4) ^ b = 9
1 * 4 ^ b = 9
(2 ^ 2) ^ b = 9      (since 4 = 2^2)
2 ^ (2b) = 9
2 ^ (2b) = 2 ^ 6     (the discrete log of 9 is 6)
2b = 6               (mod 10 here, no longer mod 11; see note)
b = 3
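The algebraic route can be sketched in Python too (a minimal illustration; names are mine). The magnitude "a" comes straight from the first syndrome digit, since there x = 1; then "b" is the discrete log of s1/a:

```python
P, G = 11, 2
EXP = [pow(G, n, P) for n in range(P - 1)]   # EXP[n] = 2^n mod 11
LOG = {v: n for n, v in enumerate(EXP)}      # discrete log lookup

def solve_single_error(syndrome):
    s0, s1 = syndrome[0], syndrome[1]
    a = s0                                   # from a * 1^b = s0
    inv_a = pow(a, P - 2, P)                 # modular inverse (Fermat)
    b = LOG[(s1 * inv_a) % P]                # from 2^b = s1 / a
    return a, b

print(solve_single_error([1, 8, 9]))         # (1, 3): the error is 1 * x^3
```

Here b = 3 points at the digit multiplied by x^3 (the third digit from the right), and subtracting the magnitude a = 1 from that digit, mod 11, restores the original message.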
Having used powers of the generator 2 to encode the RS digits yielded a big payoff here. We were able to simplify the equation and remove the powers.
NOTE: the modulus changed from 11 to 10 in the last step. When we manipulate exponents under modular arithmetic, their modulus is the Euler totient of the original modulus. The totient of 11 is 10. So, when we transition from the exponential equation to the linear form 2b=6, we have to downgrade to modulo 10. (The example was not the best, since the change of modulus didn't affect the result.)
Unfortunately, since there is no discrete logarithm function in the spreadsheet, the mathematical way of correcting errors could not be implemented there.
As we've seen, correcting a single error is easy, either by exhaustive search or by math. When there are multiple correctable errors, the equation system shows multiple-term polynomials like these:

a * x ^ (4b) + c * x ^ (4d) = 3
a * x ^ (8b) + c * x ^ (8d) = 9

They bring two challenges. First, it is difficult to simplify non-linear equations containing sums. Second, it is difficult to solve a big equation system efficiently. The original Reed-Solomon paper left these problems open. In time they were solved, but it took many researchers and many decades.
The "linearization" of polynomial equations uses tricks involving generators and discrete logarithm. As we've seen before, every element of the field except 0 can be represented as the n-th power of a generator. Let's work an equation in this direction:
a * x ^ (4b) + c * x ^ (4d) = 3

Supposing we will evaluate this using x = 2^3, making a = 2^A and c = 2^C, and knowing that 2^8 = 3, 2^2 = 4 and 2^3 = 8:

2^A . (2^3)^(4b) + 2^C . (2^3)^(4d) = 2^8
2^A . 2^(12b) + 2^C . 2^(12d) = 2^8
2^(A + 12b) + 2^(C + 12d) = 2^8 (mod 11)

or

2^(A + 12b) + 2^(C + 12d) - 2^8 = 0
2^(A + 12b) + 2^(C + 12d) + 2^3 = 0    (-3 = +8 = 2^3, mod 11)
At this point, we still have a non-linear equation, but the base of every power is the generator. And, in discrete math, we can use discrete logs in place of the original values:
Given x such that x = 2^y, we define the mapping function N(x) such that

N(x) = log(x) + 1 = y + 1
N(0) = 0
Wouldn't it be awesome if, using the mapping function above, we could convert to a linear equation in a single step?
2^(A + 12b) + 2^(C + 12d) + 2^3 = 0 (mod 11)
N(2^(A + 12b)) + N(2^(C + 12d)) + N(2^3) = 0 (mod 10)
A + 12b + C + 12d + 3 = 0 (mod 10, wrong!)
But there is a way to mix sums and powers in the domain of discrete logs. It takes an additional tool called Zech logarithm, whose definition is:
Z(n) = discrete log(1 + g^n)

g = some field generator
n = exponent of the n-th power of the generator
As happens with discrete logs, Zech logs are difficult to calculate, but we can just use a lookup table when the field is small. This is how Z(x) can be used to linearize a discrete non-linear equation:
a^i + a^j = a^(i + Z(j - i))

Z(x) = Zech logarithm

Converting a sum of powers into a power of a sum:

a^i + a^j
=> ((N(a^j) + Z(N(a^i) - N(a^j)) - 1) mod 10) + 1
=> ((j + 1 + Z(i + 1 - j - 1) - 1) mod 10) + 1
=> ((j + Z(i - j)) mod 10) + 1
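The Zech table for GF(11) is small enough to build on the spot. A sketch (names are mine), including a check of the sum-of-powers identity on one example:

```python
P, G = 11, 2
EXP = [pow(G, n, P) for n in range(P - 1)]   # EXP[n] = 2^n mod 11
LOG = {v: n for n, v in enumerate(EXP)}      # discrete log lookup

# Z(n) = dlog(1 + g^n); undefined when 1 + g^n = 0 (here, at n = 5)
ZECH = {n: LOG.get((1 + EXP[n]) % P) for n in range(P - 1)}

i, j = 4, 1                                  # 2^4 + 2^1 = 5 + 2 = 7
lhs = (EXP[i] + EXP[j]) % P
rhs = EXP[(j + ZECH[(i - j) % (P - 1)]) % (P - 1)]
print(lhs, rhs)                              # 7 7
```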
With these tools, we can convert the whole equation system into a linear system, and solve it using standard algorithms.