Computer Science for Floating-point Arithmetic

Type 0.1 + 0.2 into a calculator and it outputs 0.3, as expected. But that result is quietly rounded to match your expectations. In software, decimal arithmetic is imprecise: unless special libraries (or methods) are used, a program will output "odd" results.

These are not computer bugs. They are the inevitable consequence of two facts:

  • There are infinitely many real numbers. Representing them all exactly would require an infinite amount of memory, which is impossible.

  • Squeezing infinitely many numbers into a finite number of bits therefore requires approximate representations.

Logically, approximations lead to imprecise, "odd" results such as the above, which ultimately lead to bugs:

  • cents (money) get lost

  • transactions are rejected because of imprecise comparisons

  • totals don't match

  • scientific calculations produce subtly incorrect results

This page is here to:

  1. Explain relatively concisely why this happens

  2. Explain what scientific notation (...123e-16) is and why it shows up in computation results

  3. Show what can be done to mitigate the problem

  4. Show how to test for related potential bugs

Quick Summary

Just in case, here's the shortest possible explanation:

  • Even "human" formats don't always allow for exact representation

    • Think of ⅓: in decimal it is the infinitely repeating 0.33333...

    • Since we must cut off at some point, we lose precision.

    • Multiplying 0.33333 * 3 yields 0.99999, not 1. Precision is lost.

  • Computers process information as 1s and 0s

  • In binary, the same happens all the time

    • E.g., 0.1 in binary starts with a few digits, followed by the pattern 0011 repeating infinitely

    • The 0011 must be cut off at some point - and that's where precision is lost

    • Just like in decimal, doing math with imprecise representations leads to imprecise results

    • Each additional operation can make the imprecision grow bigger and bigger

That's it. The rest of the page is just a more thorough explanation of the same.
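The summary above can be observed directly in any language with standard 64-bit floats; here is a quick Python sketch:

```python
# 0.1 and 0.2 are stored as binary approximations, so their sum
# carries a tiny error that the raw output does not hide.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```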

1. Humans vs. Computers

Humans have 10 fingers. Hence, we have 10 different digits - from 0 to 9.

Computers, at the lowest level, process information as electric charges in a chip - either there's a charge, or none. 0 or 1.

Summary:

  • Base 10, aka the decimal system. 0 to 9. For humans.

  • Base 2, aka binary. 0 or 1. For computers.

2. Bits and Bytes

We want to "translate" human numbers (base 10) to computer numbers (base 2). It is done like so:

| Decimal | Binary | Bits used |
| --- | --- | --- |
| 0 | 0 | 1 bit |
| 1 | 1 | 1 bit. Thus 1 bit can store 2 patterns - 0 or 1. |
| 2 | 10 | 2 bits |
| 3 | 11 | 2 bits. Thus 2 bits can store 4 patterns - 00, 01, 10, 11. |
| 4 | 100 | 3 bits |
| 5 | 101 | 3 bits |
| 8 | 1000 | 4 bits |
| 16 | 10000 | 5 bits |
| 128 | 10000000 | 8 bits, aka 1 byte. |
| 255 | 11111111 | 8 bits, aka 1 byte. Can store a number from 0 to 255 (a total of 256 numbers). |

Summary:

  • 1 bit is the smallest unit of storage

  • 1 byte is the smallest practical unit - enough to hold at least 1 typed character ('a', '2', '$')

  • Add 1 bit to double the number of patterns

  • 1 bit - 2 patterns

  • 2 bits - 4 patterns (e.g., 01)

  • 8 bits - 256 patterns (one byte) (e.g., 01011100)

  • Mathematically: n bits yield 2^n patterns (2 to the nth power)

  • This doubling with every bit starts yielding huge numbers rather quickly

Also:

  • 4 bytes (32 bits) are typically used to store integers

    • Range −2,147,483,648 to 2,147,483,647 (billions)

  • 8 bytes (64 bits) are typically used to store bigger integers (longs)

    • Range −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (quintillions)
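A quick sketch of the doubling rule and the signed 32-bit range in Python:

```python
# Each extra bit doubles the number of representable patterns: 2**n.
for n in (1, 2, 8, 32, 64):
    print(n, "bits ->", 2 ** n, "patterns")

# A signed 32-bit integer splits its 2**32 patterns around zero:
print(-(2 ** 31), "to", 2 ** 31 - 1)  # -2147483648 to 2147483647
```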

Note: 32 bits and 64 bits are important milestones. Keep them in mind as you'll see them later.

Dealing with huge numbers such as above can be cumbersome. Is there a faster, more convenient way to represent them? See the next section.

3. Handling long numbers with scientific notation

It's useful, and almost necessary, to have a unified format to represent numbers, including very small and very big ones.

  • To an engineer building a highway, it does not matter whether it’s 10 meters or 10.0001 meters wide.

  • On the contrary, 0.0001 meters (a tenth of a millimeter) is a huge difference to a microchip designer.

To satisfy the engineer and the chip designer, a number format has to provide accuracy for numbers at very different magnitudes. Also, when numbers get very big or very small, it becomes hard (and error-prone) to do math with them, even "on paper".

How much is 300000000 * 0.00000015 ?

A bit difficult, isn't it? This can be rewritten in scientific notation.

The format is a * 10^n. Here, 10 is a constant (decimal system, 10 fingers). The problem becomes 3 * 10^8 multiplied by 1.5 * 10^-7: multiply the coefficients (3 * 1.5 = 4.5) and add the exponents (8 - 7 = 1), giving 4.5 * 10^1 = 45. Much easier.

Scientific notation is a fast, compact format not just for humans, but for machines too!

But it must be written in base 2 format - a * 2^n.

Same as decimal, but we replace the constant 10 with 2.
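As a sketch, here is the pencil-and-paper rule (multiply the coefficients, add the exponents) in Python, which also accepts scientific notation directly as e-notation literals:

```python
# (3 x 10^8) * (1.5 x 10^-7): multiply coefficients, add exponents.
coeff = 3 * 1.5               # 4.5
exp = 8 + (-7)                # 1
print(f"{coeff} x 10^{exp}")  # 4.5 x 10^1, i.e. 45

# Python reads scientific notation directly as e-notation:
print(3e8)     # 300000000.0
print(1.5e-7)  # 1.5e-07
```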

Note: the general formula is value = coefficient × base^exponent. In binary, the base is 2: value = coefficient × 2^exponent.

"Floating Point"

The "floating" in floating-point refers to the fact that the decimal (or binary) point can "float" — it’s not fixed in one place.

  • In regular numbers, the point is fixed: 123.45 → the decimal is after the 3.

  • In floating-point, you store coefficient × base^exponent, so the "point" moves depending on the exponent:

Example:

| Number | Scientific / Floating-point |
| --- | --- |
| 123.45 | 1.2345 × 10² → decimal point "floats" 2 places |
| 0.00123 | 1.23 × 10⁻³ → decimal point "floats" -3 places |

Summary:

Drawing

All numbers can be represented in this format with any base. We generally only care about base 10 or base 2 (binary). But in binary, some fractions are only approximate. Why? Find out in the next section.

4. Approximate representations

There is more than one way to represent numbers. For example, fractions:

⅓ + ⅓ + ⅓ = 1

However, as you may remember from school, 1/3 cannot be represented exactly in decimal: it is 0.33333333..., infinitely repeating.

We cannot go on forever, so we must truncate at some point. We lose precision when we do this. How much precision we decide to lose depends on context.

Suppose we decide to truncate after the 4th digit after the decimal.

As such, adding 0.3333 + 0.3333 + 0.3333 will yield 0.9999. Never a full 1. This may be good enough for many general purposes, but not enough for manufacturing microchips.
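The same truncation can be reproduced exactly with Python's decimal module, which works in base 10:

```python
from decimal import Decimal

# 1/3 truncated after the 4th decimal digit:
third = Decimal("0.3333")
print(third + third + third)  # 0.9999, never a full 1
```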

Note: as you can see, even the decimal system fails to represent some numbers exactly.

5. Floating-point representation

Binary fractions face the same problem - some numbers can’t be represented exactly, just like 1/3 in decimal.

In binary floating-point, all integers are represented exactly only up to a fixed limit (about 9 quadrillion for double-precision). Beyond that, integers are rounded.

However, only fractions whose denominators are powers of 2 (1/2, 1/4, 3/8, etc.) can be exactly represented in floats or doubles. All others (0.1, 0.22, 0.357, etc.) are approximations.

For example, the binary representation of 0.1 is 0.000110011001100110011..., with the pattern 0011 repeating forever. Once stored in a finite number of bits, the sequence must be truncated, and the truncated sequence is no longer 0.1, but something very close to it.

These bits represent the binary scientific notation of the number. (See section 3)
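You can inspect the value a double actually stores for 0.1 with Python's decimal module (constructing a Decimal from a float reproduces the float's exact stored value):

```python
from decimal import Decimal

# The exact decimal value of the double nearest to 0.1:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The same value in binary scientific notation (hex mantissa, power of 2):
print((0.1).hex())  # 0x1.999999999999ap-4
```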

Note: the main takeaway is that since computers must cut numbers off at some point, precision is lost.

6. Doing math with approximations

As such, 0.1 + 0.2 actually results in 0.30000000000000004, not 0.3.

We can now see that the tiny difference is big enough for an equality check (0.1 + 0.2 == 0.3) to yield "false".
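A quick Python sketch, printing both doubles to 20 decimal places to expose the difference that strict equality trips over:

```python
total = 0.1 + 0.2

# The two doubles differ in their low bits, so strict equality fails:
print(format(total, ".20f"))  # 0.30000000000000004441
print(format(0.3, ".20f"))    # 0.29999999999999998890
print(total == 0.3)           # False
```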

You should also now understand the example at the start: the calculator quietly rounds the tiny error away for display, while the raw computation keeps it.

7. Single vs. Double precision

More bits mean more precision and a larger representable range.

  • 32-bit (single precision / float)

    • Faster, uses less memory

    • ~7 decimal digits of precision

    • Rounding errors appear sooner

  • 64-bit (double precision / double)

    • More memory, slightly slower

    • ~16 decimal digits of precision

    • Much smaller rounding errors

Most modern programs and tests use double precision by default. Notice that with 64 bits, imprecision starts much further away from the decimal point.

| Precision | Example value | Stored value (approx.) |
| --- | --- | --- |
| 32-bit (float) | 0.1 | 0.10000000149011612 |
| 64-bit (double) | 0.1 | 0.10000000000000000555 |
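The single-precision row can be reproduced with only the standard library: struct can round-trip a value through a 32-bit float to show single-precision rounding.

```python
import struct

# Pack 0.1 into 4 bytes (single precision), then unpack it back:
as_float32 = struct.unpack("f", struct.pack("f", 0.1))[0]
print(as_float32)  # 0.10000000149011612
print(0.1)         # 0.1 -- the double is far closer to the true value
```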

Why this matters in testing:

  • The same calculation may produce different results in float vs double

  • Equality checks (==) are more fragile with lower precision

  • Tests that pass locally may fail on:

    • different hardware

    • GPUs vs CPUs

    • systems using single precision internally

Expect double precision in most general-purpose programming and backend software.

Note: a 128-bit floating-point format exists, but it is a niche tool for very high-precision scientific computing, not typical commercial software. If you were doing rocket science, you'd probably not be on this page :).

8. What to do about imprecisions

Most popular languages do not try to "fix" floating-point math. Instead, they provide tools and conventions to work around imprecision when it matters.

The common strategies are:

  • Use tolerances instead of equality

  • Use specialized precision types for money and exact math

Custom tolerance example
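A minimal Python sketch using the stdlib's math.isclose (the tolerance values here are illustrative, not prescriptive; pick tolerances that fit your domain):

```python
import math

total = 0.1 + 0.2

print(total == 0.3)              # False: strict equality is fragile
print(math.isclose(total, 0.3))  # True: compares within a relative tolerance
print(abs(total - 0.3) < 1e-9)   # True: a hand-rolled absolute tolerance
```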

Special types for decimal math
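A minimal sketch with Python's decimal.Decimal; note that values are constructed from strings, not floats, so the base-10 digits are stored exactly:

```python
from decimal import Decimal

# Binary floats drift when summing money:
print(0.10 + 0.20)  # 0.30000000000000004

# Decimal stores base-10 digits exactly:
price = Decimal("0.10")
tax = Decimal("0.20")
print(price + tax)                     # 0.30
print(price + tax == Decimal("0.30"))  # True
```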

In real, robust applications, special libraries are used to avoid rounding errors. Simply put, they produce results the way a calculator would.

| Language | Default floating type | How imprecision is handled |
| --- | --- | --- |
| Java | double (64-bit) | Use tolerances for comparisons; use BigDecimal for exact decimal math |
| C# | double (64-bit) | Use tolerances; use decimal for financial / exact calculations |
| Python | float (64-bit) | Use tolerances; use decimal.Decimal or fractions.Fraction |
| JavaScript | Number (64-bit double) | Use tolerances; use libraries (BigInt, decimal libs) for exact math |

9. How to test for floating-point errors

See the dedicated page.

References

  1. What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg. The "classic" paper on the topic. Complex. Available on docs.oracle.com or in the PDF below.
