Computer Science for Floating-point Arithmetic
For 0.1 + 0.2, a calculator outputs 0.3 as expected. But the result is quietly rounded to match your expectations. In software, decimal operations are imprecise: unless special libraries are used, a program will output "odd" results.

These are not computer bugs. This is the inevitable consequence of:
The fact that there are infinitely many real numbers. To represent them, we'd need an infinite amount of memory, and that's impossible.
Squeezing infinitely many numbers into finite memory thus requires limiting ourselves to a finite set of approximate representations.
Logically, approximations lead to imprecise, "odd" results such as the above, which ultimately lead to bugs:
cents (money) can get lost
transactions can be rejected because of imprecise comparisons
totals can mismatch
scientific calculations can produce subtly incorrect results
This page is here to explain:
Relatively concisely, why this happens
What scientific notation is (the ...123e-16 you see in results) and why it shows up in computation results
What can be done to mitigate the problem
How to test for related potential bugs
Quick Summary
Just in case, here's the shortest possible explanation:
Even "human" formats don't always allow for exact representation
Think of how ⅓ is represented in decimal: an infinite 0.33333...
Since we must cut it off at some point, we lose precision.
Multiplying 0.33333 * 3 will yield 0.99999, not 1. Precision lost.
Computers process information as 1s and 0s.
In binary, the same thing happens all the time.
E.g., 0.1 in binary is a sequence of 1s and 0s, followed by an infinitely repeating 0011.
The 0011 must be cut off at some point - and that's where precision is lost.
Just like in decimal, doing math with imprecise representations leads to imprecise results.
The more operations, the bigger the imprecision may grow.
That's it. The rest of the page is just a more thorough explanation of the same.
1. Humans vs. Computers
Humans have 10 fingers. Hence, we have 10 different digits - from 0 to 9.
Computers, at the lowest level, process information as electric charges in a chip - either there's a charge, or none. 0 or 1.
Summary:
Base 10, aka the decimal system. 0 to 9. For humans.
Base 2, aka binary. 0 or 1. For computers.
2. Bits and Bytes
We want to "translate" human numbers (base 10) to computer numbers (base 2). It is done like so:
Decimal | Binary | Storage
0 | 0 | 1 bit
1 | 1 | 1 bit. Thus 1 bit can store 2 patterns - 0 or 1.
2 | 10 | 2 bits
3 | 11 | 2 bits. Thus 2 bits can store 4 patterns - 00, 01, 10, 11.
4 | 100 | 3 bits
5 | 101 | 3 bits
… | … | …
8 | 1000 | 4 bits
… | … | …
16 | 10000 | 5 bits
… | … | …
128 | 10000000 | 8 bits, aka 1 byte
… | … | …
255 | 11111111 | 8 bits, aka 1 byte. Can store a number from 0 to 255 (a total of 256 numbers)
Summary:
1 bit is the smallest unit of storage
1 byte is the smallest practical unit - enough to hold at least 1 typed character ('a', '2', '$')
Add 1 bit to double the number of patterns
1 bit - 2 patterns
2 bits - 4 patterns (e.g., 01)
8 bits (one byte) - 256 patterns (e.g., 01011100)
Mathematically: n bits yield 2^n patterns (2 to the nth power)
This doubling with every bit starts yielding huge numbers rather quickly
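The doubling rule is easy to verify yourself. A small Python sketch (illustrative, not from the original page):

```python
# Each extra bit doubles the number of patterns: n bits -> 2**n patterns.
for n in (1, 2, 8):
    print(n, "bits ->", 2 ** n, "patterns")

# bin() shows the binary pattern of a number; 255 needs all 8 bits of a byte.
print(bin(255))  # 0b11111111
print(2 ** 8)    # 256, so one byte stores the values 0..255
```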
Also:
4 bytes (32 bits) are typically used to store integers
Range −2,147,483,648 to 2,147,483,647 (billions)
8 bytes (64 bits) are typically used to store bigger integers (longs)
Range −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (quintillions)
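The ranges above follow directly from the bit counts. A quick check in Python (signed n-bit integers span -2^(n-1) to 2^(n-1) - 1):

```python
# Signed integers reserve one bit for the sign, so an n-bit integer
# covers the range -2**(n-1) .. 2**(n-1) - 1.
print(-2 ** 31, "to", 2 ** 31 - 1)  # 32-bit int range (billions)
print(-2 ** 63, "to", 2 ** 63 - 1)  # 64-bit long range (quintillions)
```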
32 bits and 64 bits are important milestones. Keep them in mind as you'll see them later.
Dealing with huge numbers such as above can be cumbersome. Is there a faster, more convenient way to represent them? See the next section.
3. Handling long numbers with scientific notation
It's useful, and almost necessary, to have a unified format to represent numbers. Including very small numbers and very big numbers.
To an engineer building a highway, it does not matter whether it’s 10 meters or 10.0001 meters wide.
On the contrary, 0.0001 meters (a tenth of a millimeter) is a huge difference to a microchip designer.
To satisfy the engineer and the chip designer, a number format has to provide accuracy for numbers at very different magnitudes. Also, when numbers get very big or very small, it becomes hard (and error-prone) to do math with them, even "on paper".
How much is 300000000 * 0.00000015 ?
A bit difficult, isn't it? This can be rewritten in scientific notation.
The format is a * 10^n. Here, 10 is a constant (decimal system, 10 fingers). Rewritten, the problem becomes 3 × 10^8 * 1.5 × 10^-7: multiply the coefficients (3 * 1.5 = 4.5) and add the exponents (8 + (-7) = 1), giving 4.5 × 10^1 = 45. Much easier.
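Most languages accept scientific notation directly in source code and can print numbers in it. A small Python sketch (note the product is computed in floating point, so it is only approximately 45):

```python
# Scientific-notation literals: 3e8 means 3 * 10**8.
a = 3e8       # 300000000.0
b = 1.5e-7    # 0.00000015
print(a * b)  # approximately 45 (floating point, so possibly not exactly 45.0)

# Formatting a number in scientific notation:
print(f"{300000000:.1e}")  # 3.0e+08
```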
The scientific notation is a fast, compact format not just for humans, but for machines too!
But it must be written in base 2 format - a * 2^n.
Same as decimal, but we replace the constant 10 with 2.
Notice that binary representations use ≈, i.e., the binary representation is approximate. This is important and will become evident later.
The general formula is expressed as: number = coefficient × base^exponent
"Floating Point"
The "floating" in floating-point refers to the fact that the decimal (or binary) point can "float" — it’s not fixed in one place.
In regular numbers, the point is fixed:
123.45 → the decimal point is after the 3.
In floating-point, you store coefficient × base^exponent, so the "point" moves depending on the exponent.
Example:
123.45 = 1.2345 × 10^2 → decimal point "floats" 2 places
0.00123 = 1.23 × 10^-3 → decimal point "floats" -3 places
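Python exposes exactly this coefficient/exponent split for its (base-2) floats via `math.frexp`, which may help make the idea concrete:

```python
import math

# frexp splits a float into coefficient * 2**exponent,
# with the coefficient in the range [0.5, 1).
m, e = math.frexp(0.1)
print(m, e)  # 0.8 -3, because 0.8 * 2**-3 == 0.1

# Scaling by a power of 2 is exact, so the split reconstructs the
# stored value perfectly:
print(m * 2 ** e == 0.1)  # True
```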
Summary:
All numbers can be represented in this format with any base. We generally only care about base 10 or base 2 (binary). But in binary, some fractions are only approximate. Why? Find out in the next section.
4. Approximate representations
There is more than one way to represent numbers. For example, fractions:
However, as you may remember from school, 1/3 cannot be represented exactly in decimal: it is 0.33333333..., infinitely repeating.
We cannot go on forever, so we must truncate at some point. We lose precision when we do this. How much precision we decide to lose depends on context.
Suppose we decide to truncate after the 4th digit after the decimal.
As such, adding 0.3333 + 0.3333 + 0.3333 will yield 0.9999. Never a full 1. This may be good enough for many general purposes, but not enough for manufacturing microchips.
As you can see, even the decimal system fails to represent some numbers exactly.
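The truncation above can be reproduced exactly with Python's `decimal` module, which does base-10 arithmetic (a small illustrative sketch):

```python
from decimal import Decimal

# 1/3 truncated after the 4th decimal digit:
third = Decimal("0.3333")

# Adding three truncated thirds never reaches a full 1.
print(third + third + third)  # 0.9999
```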
5. Floating-point representation
Binary fractions face the same problem - some numbers can’t be represented exactly, just like 1/3 in decimal.
In binary floating-point, all integers are represented exactly only up to a fixed limit (2^53, about 9 quadrillion, for double precision). Beyond that, integers are rounded.
Fractions are worse: only those whose denominators are powers of 2 (1/2, 1/4, 3/8, etc.) can be represented exactly in floats or doubles. All others (0.1, 0.22, 0.357, etc.) are approximations.
For example, here is the full binary representation of 0.1
The truncated sequence is no longer 0.1, but something very close to it.
These bits represent the binary scientific notation of the number. (See section 3)
The main takeaway is that since computers must cut off numbers at some point, precision is lost.
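You can see the stored approximation of 0.1 in any Python session; the default printout hides it, but asking for more digits reveals it:

```python
# repr() shows the shortest string that round-trips to the same bits,
# which hides the error:
print(0.1)  # 0.1

# Asking for 20 digits after the point reveals the stored approximation:
print(f"{0.1:.20f}")  # 0.10000000000000000555
```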
6. Doing math with approximations
As such, 0.1 + 0.2 actually results in 0.30000000000000004, not 0.3.
We can now see that the tiny difference is big enough for the equality check to yield "false".
You should also now understand the example at the start:
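In Python, the whole story fits in a few lines (the tolerance-based comparison previews the mitigation discussed in section 8):

```python
import math

result = 0.1 + 0.2
print(result)         # 0.30000000000000004
print(result == 0.3)  # False: the tiny difference breaks exact equality

# Comparing with a tolerance instead of == gives the expected answer:
print(math.isclose(result, 0.3))  # True
```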
7. Single vs. Double precision
More bits mean more precision and a larger representable range.
32-bit (single precision / float)
Faster, uses less memory
~7 decimal digits of precision
Rounding errors appear sooner
64-bit (double precision / double)
More memory, slightly slower
~16 decimal digits of precision
Much smaller rounding errors
Most modern programs and tests use double precision by default. Notice that with 64 bits, imprecision starts much further away from the decimal point.
Format | Input | Stored value
32-bit (float) | 0.1 | 0.10000000149011612
64-bit (double) | 0.1 | 0.10000000000000000555
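Python's floats are always 64-bit doubles, but the standard `struct` module can round-trip a value through a 32-bit float, reproducing the table above (illustrative sketch):

```python
import struct

# Pack 0.1 into a 32-bit float ('f') and unpack it back into a double,
# exposing the single-precision approximation.
as_float32 = struct.unpack("f", struct.pack("f", 0.1))[0]
print(as_float32)     # 0.10000000149011612 (32-bit precision)
print(f"{0.1:.20f}")  # 0.10000000000000000555 (64-bit precision)
```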
Why this matters in testing:
The same calculation may produce different results in float vs double
Equality checks (==) are more fragile with lower precision
Tests that pass locally may fail on:
different hardware
GPUs vs CPUs
systems using single precision internally
Expect double precision with most general-purpose programming and backend software.
A 128-bit floating-point format exists, but it is niche for very high-precision scientific computing, not typical commercial software. If you were doing rocket science, you'd probably not be on this page :).
8. What to do about imprecisions
Most popular languages do not try to "fix" floating-point math. Instead, they provide tools and conventions to work around imprecision when it matters.
The common strategies are:
Use tolerances instead of equality
Use specialized precision types for money and exact math
Custom tolerance example
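A minimal sketch in Python; the epsilon value and the helper name `nearly_equal` are illustrative choices, and the right tolerance depends on your domain:

```python
import math

EPSILON = 1e-9  # tolerance chosen for this example

def nearly_equal(a: float, b: float, eps: float = EPSILON) -> bool:
    """Compare floats with an absolute tolerance instead of ==."""
    return abs(a - b) < eps

print((0.1 + 0.2) == 0.3)            # False: exact equality fails
print(nearly_equal(0.1 + 0.2, 0.3))  # True: tolerance absorbs the error

# The standard library offers the same idea with a relative tolerance:
print(math.isclose(0.1 + 0.2, 0.3))  # True
```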
Special types for decimal math
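Using Python's `decimal.Decimal` as an example of such a type (note the construction from strings; `Decimal(0.1)` would inherit the float's error):

```python
from decimal import Decimal

# Floats drift:
print(0.1 + 0.2)  # 0.30000000000000004

# Decimal does exact base-10 arithmetic, which is what you want for money:
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```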
In real, robust applications, special libraries are used to avoid rounding errors - simply put, to produce results the way a calculator would.
Language | Default type | Recommendation
Java | double (64-bit) | Use tolerances for comparisons; use BigDecimal for exact decimal math
C# | double (64-bit) | Use tolerances; use decimal for financial / exact calculations
Python | float (64-bit) | Use tolerances; use decimal.Decimal or fractions.Fraction
JavaScript | Number (64-bit double) | Use tolerances; use libraries (BigInt, decimal libs) for exact math
9. How to test for floating-point errors
See the dedicated page.
References
Floating-point guide (simple): https://floating-point-gui.de/formats/fp/
What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg. The "classic" paper on the topic. Complex. Available on docs.oracle.com or in the PDF below.
RapidTables: Binary to Decimal converter