Computer Science for Floating-point Arithmetic

Type 0.1 + 0.2 into a calculator and it outputs 0.3, as expected. But that result is quietly rounded to match your expectations. In software, decimal arithmetic is imprecise: unless special libraries (or methods) are used, a program will output "odd" results.

These are not computer bugs. They are the inevitable consequence of two facts:

  • There are infinitely many real numbers. Representing them all exactly would require an infinite amount of memory, which is impossible.

  • Squeezing infinitely many numbers into a finite number of bits therefore requires approximate representations.

Logically, approximations lead to imprecise, "odd" results such as the above, which ultimately lead to bugs:

  • cents (money) get lost

  • transactions are rejected because of imprecise comparisons

  • totals don't match

  • scientific calculations produce subtly incorrect results

This page is here to:

  1. Explain relatively concisely why this happens

  2. Explain what scientific notation (...123e-16) is and why it shows up in computation results

  3. Show what can be done to mitigate the problem

  4. Show how to test for related potential bugs

Quick Summary

Just in case, here's the shortest possible explanation:

  • Even "human" formats don't always allow for exact representation

    • Think of ⅓: in decimal it is the infinitely repeating 0.33333...

    • Since we must cut off at some point, we lose precision.

    • Multiplying 0.33333 * 3 yields 0.99999, not 1. Precision is lost.

  • Computers process information as 1s and 0s

  • In binary, the same happens all the time

    • E.g., 0.1 in binary starts with a few digits, followed by the pattern 0011 repeating infinitely

    • The 0011 must be cut off at some point - and that's where precision is lost

    • Just like in decimal, doing math with imprecise representations leads to imprecise results

    • Each additional operation can make the imprecision grow bigger and bigger

That's it. The rest of the page is just a more thorough explanation of the same.
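The summary above can be observed directly in any language with standard 64-bit floats; here is a quick Python sketch:

```python
# 0.1 and 0.2 are stored as binary approximations, so their sum
# carries a tiny error that the raw output does not hide.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```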

1. Humans vs. Computers

Humans have 10 fingers. Hence, we have 10 different digits - from 0 to 9.

Computers, at the lowest level, process information as electric charges in a chip - either there's a charge, or none. 0 or 1.

Summary:

  • Base 10, aka the decimal system. 0 to 9. For humans.

  • Base 2, aka binary. 0 or 1. For computers.

2. Bits and Bytes

We want to "translate" human numbers (base 10) to computer numbers (base 2). It is done like so:

| Decimal | Binary | Bits used |
| --- | --- | --- |
| 0 | 0 | 1 bit |
| 1 | 1 | 1 bit. Thus 1 bit can store 2 patterns - 0 or 1. |
| 2 | 10 | 2 bits |
| 3 | 11 | 2 bits. Thus 2 bits can store 4 patterns - 00, 01, 10, 11. |
| 4 | 100 | 3 bits |
| 5 | 101 | 3 bits |
| 8 | 1000 | 4 bits |
| 16 | 10000 | 5 bits |
| 128 | 10000000 | 8 bits, aka 1 byte. |
| 255 | 11111111 | 8 bits, aka 1 byte. Can store a number from 0 to 255 (a total of 256 numbers). |

Summary:

  • 1 bit is the smallest unit of storage

  • 1 byte is the smallest practical unit - enough to hold at least 1 typed character ('a', '2', '$')

  • Add 1 bit to double the number of patterns

  • 1 bit - 2 patterns

  • 2 bits - 4 patterns (e.g., 01)

  • 8 bits - 256 patterns (one byte) (e.g., 01011100)

  • Mathematically: n bits yield 2^n patterns (2 to the nth power)

  • This doubling with every bit starts yielding huge numbers rather quickly

Also:

  • 4 bytes (32 bits) are typically used to store integers

    • Range −2,147,483,648 to 2,147,483,647 (billions)

  • 8 bytes (64 bits) are typically used to store bigger integers (longs)

    • Range −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (quintillions)
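A quick sketch of the doubling rule and the signed 32-bit range in Python:

```python
# Each extra bit doubles the number of representable patterns: 2**n.
for n in (1, 2, 8, 32, 64):
    print(n, "bits ->", 2 ** n, "patterns")

# A signed 32-bit integer splits its 2**32 patterns around zero:
print(-(2 ** 31), "to", 2 ** 31 - 1)  # -2147483648 to 2147483647
```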

Note: 32 bits and 64 bits are important milestones. Keep them in mind as you'll see them later.

Dealing with huge numbers such as above can be cumbersome. Is there a faster, more convenient way to represent them? See the next section.

3. Handling long numbers with scientific notation

It's useful, and almost necessary, to have a unified format to represent numbers, including very small and very big ones.

  • To an engineer building a highway, it does not matter whether it’s 10 meters or 10.0001 meters wide.

  • On the contrary, 0.0001 meters (a tenth of a millimeter) is a huge difference to a microchip designer.

To satisfy the engineer and the chip designer, a number format has to provide accuracy for numbers at very different magnitudes. Also, when numbers get very big or very small, it becomes hard (and error-prone) to do math with them, even "on paper".

How much is 300000000 * 0.00000015 ?

A bit difficult, isn't it? This can be rewritten in scientific notation.

The format is a * 10^n. Here, 10 is a constant (decimal system, 10 fingers). The problem becomes 3 * 10^8 multiplied by 1.5 * 10^-7: multiply the coefficients (3 * 1.5 = 4.5) and add the exponents (8 - 7 = 1), giving 4.5 * 10^1 = 45. Much easier.

Scientific notation is a fast, compact format not just for humans, but for machines too!

But it must be written in base 2 format - a * 2^n.

Same as decimal, but we replace the constant 10 with 2.
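As a sketch, here is the pencil-and-paper rule (multiply the coefficients, add the exponents) in Python, which also accepts scientific notation directly as e-notation literals:

```python
# (3 x 10^8) * (1.5 x 10^-7): multiply coefficients, add exponents.
coeff = 3 * 1.5               # 4.5
exp = 8 + (-7)                # 1
print(f"{coeff} x 10^{exp}")  # 4.5 x 10^1, i.e. 45

# Python reads scientific notation directly as e-notation:
print(3e8)     # 300000000.0
print(1.5e-7)  # 1.5e-07
```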

Note: the general formula is value = coefficient × base^exponent. In binary, the base is 2: value = coefficient × 2^exponent.

"Floating Point"

The "floating" in floating-point refers to the fact that the decimal (or binary) point can "float" — it’s not fixed in one place.

  • In regular numbers, the point is fixed: 123.45 → the decimal is after the 3.

  • In floating-point, you store coefficient × base^exponent, so the "point" moves depending on the exponent:

Example:

| Number | Scientific / Floating-point |
| --- | --- |
| 123.45 | 1.2345 × 10² → decimal point "floats" 2 places |
| 0.00123 | 1.23 × 10⁻³ → decimal point "floats" -3 places |

Summary:

Drawing

All numbers can be represented in this format with any base. We generally only care about base 10 or base 2 (binary). But in binary, some fractions are only approximate. Why? Find out in the next section.

4. Approximate representations

There is more than one way to represent numbers. For example, fractions:

⅓ + ⅓ + ⅓ = 1

However, as you may remember from school, 1/3 cannot be represented exactly in decimal: it is 0.33333333..., infinitely repeating.

We cannot go on forever, so we must truncate at some point. We lose precision when we do this. How much precision we decide to lose depends on context.

Suppose we decide to truncate after the 4th digit after the decimal.

As such, adding 0.3333 + 0.3333 + 0.3333 will yield 0.9999. Never a full 1. This may be good enough for many general purposes, but not enough for manufacturing microchips.
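The same truncation can be reproduced exactly with Python's decimal module, which works in base 10:

```python
from decimal import Decimal

# 1/3 truncated after the 4th decimal digit:
third = Decimal("0.3333")
print(third + third + third)  # 0.9999, never a full 1
```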

Note: as you can see, even the decimal system fails to represent some numbers exactly.

5. Floating-point representation

Binary fractions face the same problem - some numbers can’t be represented exactly, just like 1/3 in decimal.

In binary floating-point, all integers are represented exactly only up to a fixed limit (about 9 quadrillion for double-precision). Beyond that, integers are rounded.

However, only fractions whose denominators are powers of 2 (1/2, 1/4, 3/8, etc.) can be exactly represented in floats or doubles. All others (0.1, 0.22, 0.357, etc.) are approximations.

For example, the binary representation of 0.1 is 0.000110011001100110011..., with the pattern 0011 repeating forever. Once stored in a finite number of bits, the sequence must be truncated, and the truncated sequence is no longer 0.1, but something very close to it.

These bits represent the binary scientific notation of the number. (See section 3)
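You can inspect the value a double actually stores for 0.1 with Python's decimal module (constructing a Decimal from a float reproduces the float's exact stored value):

```python
from decimal import Decimal

# The exact decimal value of the double nearest to 0.1:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The same value in binary scientific notation (hex mantissa, power of 2):
print((0.1).hex())  # 0x1.999999999999ap-4
```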

Note: the main takeaway is that since computers must cut numbers off at some point, precision is lost.

6. Doing math with approximations

As such, 0.1 + 0.2 actually results in 0.30000000000000004, not 0.3.

We can now see that the tiny difference is big enough for an equality check (0.1 + 0.2 == 0.3) to yield "false".
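A quick Python sketch, printing both doubles to 20 decimal places to expose the difference that strict equality trips over:

```python
total = 0.1 + 0.2

# The two doubles differ in their low bits, so strict equality fails:
print(format(total, ".20f"))  # 0.30000000000000004441
print(format(0.3, ".20f"))    # 0.29999999999999998890
print(total == 0.3)           # False
```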

You should also now understand the example at the start: the calculator quietly rounds the tiny error away for display, while the raw computation keeps it.

7. Single vs. Double precision

More bits mean more precision and a larger representable range.

  • 32-bit (single precision / float)

    • Faster, uses less memory

    • ~7 decimal digits of precision

    • Rounding errors appear sooner

  • 64-bit (double precision / double)

    • More memory, slightly slower

    • ~16 decimal digits of precision

    • Much smaller rounding errors

Most modern programs and tests use double precision by default. Notice that with 64 bits, imprecision starts much further away from the decimal point.

| Precision | Example value | Stored value (approx.) |
| --- | --- | --- |
| 32-bit (float) | 0.1 | 0.10000000149011612 |
| 64-bit (double) | 0.1 | 0.10000000000000000555 |
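The single-precision row can be reproduced with only the standard library: struct can round-trip a value through a 32-bit float to show single-precision rounding.

```python
import struct

# Pack 0.1 into 4 bytes (single precision), then unpack it back:
as_float32 = struct.unpack("f", struct.pack("f", 0.1))[0]
print(as_float32)  # 0.10000000149011612
print(0.1)         # 0.1 -- the double is far closer to the true value
```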

Why this matters in testing:

  • The same calculation may produce different results in float vs double

  • Equality checks (==) are more fragile with lower precision

  • Tests that pass locally may fail on:

    • different hardware

    • GPUs vs CPUs

    • systems using single precision internally

Expect double precision in most general-purpose programming and backend software.

Note: a 128-bit floating-point format exists, but it is a niche tool for very high-precision scientific computing, not typical commercial software. If you were doing rocket science, you'd probably not be on this page :).

8. What to do about imprecisions

Most popular languages do not try to "fix" floating-point math. Instead, they provide tools and conventions to work around imprecision when it matters.

The common strategies are:

  • Use tolerances instead of equality

  • Use specialized precision types for money and exact math

Custom tolerance example
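A minimal Python sketch using the stdlib's math.isclose (the tolerance values here are illustrative, not prescriptive; pick tolerances that fit your domain):

```python
import math

total = 0.1 + 0.2

print(total == 0.3)              # False: strict equality is fragile
print(math.isclose(total, 0.3))  # True: compares within a relative tolerance
print(abs(total - 0.3) < 1e-9)   # True: a hand-rolled absolute tolerance
```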

Special types for decimal math
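A minimal sketch with Python's decimal.Decimal; note that values are constructed from strings, not floats, so the base-10 digits are stored exactly:

```python
from decimal import Decimal

# Binary floats drift when summing money:
print(0.10 + 0.20)  # 0.30000000000000004

# Decimal stores base-10 digits exactly:
price = Decimal("0.10")
tax = Decimal("0.20")
print(price + tax)                     # 0.30
print(price + tax == Decimal("0.30"))  # True
```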

In real, robust applications, special libraries are used to avoid rounding errors. Simply put, they produce results the way a calculator would.

| Language | Default floating type | How imprecision is handled |
| --- | --- | --- |
| Java | double (64-bit) | Use tolerances for comparisons; use BigDecimal for exact decimal math |
| C# | double (64-bit) | Use tolerances; use decimal for financial / exact calculations |
| Python | float (64-bit) | Use tolerances; use decimal.Decimal or fractions.Fraction |
| JavaScript | Number (64-bit double) | Use tolerances; use libraries (BigInt, decimal libs) for exact math |

9. How to test for floating-point errors

See the dedicated page.

References

  1. What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg. The "classic" paper on the topic. Complex. Available on docs.oracle.com or in the PDF below.
