Floating Point Programming 101

Floating point numbers are stored in IEEE floating point format. The IEEE format is designed for maximum speed, not for exact arithmetic. Therefore, the first thing to remember about floating point is that floating point computations are approximations of the "real" values. Consider the following two equations:

c := a / 10.0;

c := a * 0.1;

These two equations are mathematically identical; however, they can generate slightly different results in IEEE floating point math. Only the last digit of the result, in base 2, might differ. The two computations differ because IEEE numbers are stored in base 2 form, and 0.1 cannot be represented exactly in base 2. It can only be approximated, whereas 10.0 can be represented exactly.
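The difference can be observed directly. The following is a minimal sketch in Python, whose float type is a 64-bit IEEE number on common platforms; the function names and the choice of a = 3.0 are illustrative assumptions, not part of the original example.

```python
import sys
from decimal import Decimal


def divide_by_ten(a: float) -> float:
    """Compute c := a / 10.0 (10.0 is stored exactly)."""
    return a / 10.0


def times_one_tenth(a: float) -> float:
    """Compute c := a * 0.1 (0.1 is stored only approximately)."""
    return a * 0.1


a = 3.0  # illustrative input for which the two results differ
quotient = divide_by_ten(a)
product = times_one_tenth(a)

# Mathematically identical, yet the stored results are not equal.
print(quotient == product)

# The gap is tiny: no larger than one unit in the last binary digit.
print(abs(product - quotient) <= sys.float_info.epsilon * abs(quotient))

# Decimal(x) shows the exact value that is stored for a float x.
print(Decimal(0.1))   # not exactly one tenth
print(Decimal(10.0))  # exactly ten
```

Not every input shows a difference; for many values of a the two expressions round to the same result, which is what makes the discrepancy easy to overlook.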

A 64-bit IEEE number has about 15 significant decimal digits of precision; a 32-bit number has about 6. You cannot expect this level of accuracy from your algorithms, because floating point add/subtract and multiply/divide each introduce a round-off error at every operation. Add/subtract can have greater round-off errors than multiply/divide, because the operands must be scaled to the same power of two before the operation, and low-order digits of the smaller operand can be lost in the process. Scaling is not necessary for multiply/divide operations.
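Both effects, accumulated round-off and digits lost to operand scaling, can be illustrated with a short sketch. It is written in Python under the assumption that float is a 64-bit IEEE number; the helper name repeated_add and the sample values are hypothetical choices for illustration.

```python
import sys


def repeated_add(step: float, count: int) -> float:
    """Add `step` to a running total `count` times."""
    total = 0.0
    for _ in range(count):
        total += step  # each addition may introduce a round-off error
    return total


# Ten additions of 0.1 do not land exactly on 1.0.
total = repeated_add(0.1, 10)
print(total == 1.0)
print(abs(total - 1.0))  # small, but not zero

# Scaling to a common exponent can discard the smaller operand entirely:
# near 1.0e16 adjacent 64-bit values are 2 apart, so adding 1.0 is lost.
big_plus_one = 1.0e16 + 1.0
print(big_plus_one == 1.0e16)

# Decimal digits the 64-bit format can always represent faithfully.
print(sys.float_info.dig)
```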

So what accuracy can you expect? This is a nontrivial analysis, but a very safe bet is about 9-10 digits for a 64-bit IEEE number. Scientific numerical simulation code can achieve this accuracy, and simpler code can expect more. You should certainly never expect more than 14 digits.

What does all this mean to you? It means you should not, and cannot, use all available digits of precision given by a specific floating point format for your results. You should allow for round-off errors in computations. The source example given above is a good one, because our compilers perform exactly this kind of optimization, replacing a division by a constant with a multiplication, since multiplication is faster than division on virtually all processors.

Be very careful when converting floating point numbers to integers. This is where most people get into trouble. For practical purposes 1.999999999999 is numerically the same as 2.0, but depending on how you convert it to an integer you can get a result of 1 or 2. If you convert 1.999999999999 to an integer without rounding, you get 1. If the number is rounded first, you get 2. Slight differences in computation, such as those in the first example, can therefore cause an integer conversion to return 1 instead of 2. In Modula-2, the language-provided conversion procedures from floating point to integer truncate rather than round. In Ada95, a type conversion to an integer type rounds to the nearest integer, so check which behavior the conversion you are using actually has.
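The two conversion behaviors can be compared side by side. This is a sketch in Python rather than Modula-2 or Ada95; int() and round() are Python built-ins, while the wrapper names are hypothetical and exist only to label the two behaviors.

```python
def to_int_truncated(x: float) -> int:
    """Convert by discarding the fractional part (truncation toward zero)."""
    return int(x)


def to_int_rounded(x: float) -> int:
    """Convert by rounding to the nearest integer.

    Note: Python's round() sends exact halves to the even neighbor,
    which may differ from the rounding rule of other languages.
    """
    return round(x)


value = 1.999999999999  # realistically "2.0" after some round-off

print(to_int_truncated(value))  # conversion without rounding
print(to_int_rounded(value))    # conversion with rounding
```

If a result is expected to be a whole number apart from round-off, rounding before conversion is usually the safer choice; truncation is appropriate only when discarding the fraction is really what the computation intends.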

You should never use the equal and not-equal operators with floating point numbers, because the values are approximations. If you need to test for equality, choose a delta value and consider two numbers identical when they differ by no more than that delta.
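A delta comparison might look like the following sketch. The function name nearly_equal and the value of DELTA are assumptions chosen for illustration; a suitable delta depends on the magnitude of the numbers being compared and on how much round-off the computation accumulates.

```python
import math


def nearly_equal(x: float, y: float, delta: float) -> bool:
    """Treat x and y as identical if they differ by no more than delta."""
    return abs(x - y) <= delta


DELTA = 1.0e-9  # illustrative tolerance, roughly the 9-10 digit safe bet

total = 0.1 + 0.2

print(total == 0.3)                    # direct comparison is unreliable
print(nearly_equal(total, 0.3, DELTA))  # comparison within a delta

# Python's standard library offers a similar test that uses a relative
# tolerance by default, which scales with the size of the operands.
print(math.isclose(total, 0.3))
```

A fixed delta works well when the values have a known, similar magnitude. For values that range widely in size, a tolerance relative to the operands is often the better fit.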