
Number representation


Apart from the concrete names of Julia library functions, everything in this chapter is valid for all modern programming languages and computer systems.


All data in computers are stored as sequences of bits. For concrete number types, the bitstring function returns this information as a sequence of 0 and 1. The sizeof function returns the number of bytes in the binary representation.
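For illustration, a few calls (the values here are chosen arbitrarily):

```julia
# Inspect raw bit patterns and storage sizes of a few values
println(bitstring(UInt8(5)))     # bits of an 8-bit unsigned integer
println(sizeof(Int64))           # number of bytes: 8
println(length(bitstring(1.0)))  # a Float64 occupies 64 bits
```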


Integer numbers

For example, with T_int = Int16 and i = T_int(1), sizeof(i) returns 2 and bitstring(i) returns "0000000000000001".

Positive integer numbers are represented by their representation in the binary system. For a negative number n, the binary representation of its "two's complement" 2^N − |n| (where N is the number of available bits) is stored.
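A quick check of the two's complement rule, using Int8 (N = 8) for brevity:

```julia
# For Int8, N = 8: the pattern stored for -3 is 2^8 - 3 = 253 = 0b11111101
println(bitstring(Int8(-3)))                      # "11111101"
println(parse(Int, bitstring(Int8(-3)); base=2))  # 253 == 2^8 - 3
```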

typemin and typemax return the smallest and largest numbers which can be represented in a given number type.


Unless the possible range of the representation [−2^(N−1), 2^(N−1)−1] is exceeded, addition, multiplication and subtraction of integers are exact. If it is exceeded, operation results wrap around into the opposite sign region.

For example, typemax(Int16) + Int16(10) gives -32759.
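The wrap-around can be reproduced directly:

```julia
# Int16 holds values in [-32768, 32767]; exceeding the range wraps around
println(typemin(Int16), " ", typemax(Int16))  # -32768 32767
println(typemax(Int16) + Int16(10))           # -32759: wrapped into the negative range
```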

Floating point numbers

How does this work for floating point numbers? For example, adding 0.1 + 0.2 gives

0.30000000000000004

But this should be 0.3. What is happening?

Real number representation

  • Let us think about the representation of real numbers. Usually we write them as decimal fractions, and we cut the representation off if the number of digits is infinite.

  • Any real number x ∈ ℝ can be expressed via the representation formula x = ± (∑_{i=0}^{∞} d_i β^(−i)) · β^e with base β ∈ ℕ, β ≥ 2, significand (or mantissa) digits d_i ∈ ℕ, 0 ≤ d_i < β, and exponent e ∈ ℤ

  • The representation is infinite for periodic decimal numbers and irrational numbers.


Scientific notation

The scientific notation of real numbers is derived from this representation in the case of β = 10. Let e.g. x = 6.022 · 10^23 = 6.022e23. Then

  • β=10

  • d=(6,0,2,2,0)

  • e=23

This representation is not unique, e.g. x1 = 0.6022 · 10^24 = 0.6022e24 = x with

  • β=10

  • d=(0,6,0,2,2,0)

  • e=24


IEEE754 standard

This is the actual standard format for storing floating point numbers. It was developed in the 1980s.

  • β = 2, therefore d_i ∈ {0,1}

  • Truncation to fixed finite size: x = ± (∑_{i=0}^{t−1} d_i β^(−i)) · β^e

  • t : significand (mantissa) length

  • Normalization: assume d_0 = 1 ⇒ save one bit for the storage of the significand. This requires a normalization step after operations which adjusts significand and exponent of the result.

  • k: exponent size. Define L, U: −β^(k−1) + 1 = L ≤ e ≤ U = β^(k−1)

  • Extra bit for sign

  • storage size: (t − 1) + k + 1

  • Standardized for most modern languages

  • Hardware support usually for 64bit and 32bit

precision    Julia      C/C++         k     t     bits
quadruple    n/a        long double   15    113   128
double       Float64    double        11    53    64
single       Float32    float         8     24    32
half         Float16    n/a           5     11    16

The storage sequence is: Sign bit, exponent, mantissa.


Storage layout for a normalized Float32 number (d_0 = 1):

  • bit 1: sign; 0 → +, 1 → −

  • bits 2…9: k = 8 exponent bits

    • the value e + 2^(k−1) − 1 = e + 127 is stored ⇒ no need for a sign bit in the exponent

  • bits 10…32: 23 = t − 1 mantissa bits d_1 … d_23

  • d_0 = 1 is not stored ⇒ "hidden bit"


Julia allows one to obtain the significand and the exponent of a floating point number.

For example, for x0 = 2.0, significand(x0) returns 1.0 and exponent(x0) returns 1.
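As a sketch of these two functions (the values below are illustrations, not taken from the original notebook):

```julia
# significand and exponent decompose x as significand(x) * 2.0^exponent(x)
x = 6.25
println(significand(x))  # 1.5625
println(exponent(x))     # 2, since 6.25 == 1.5625 * 2^2
println(exponent(0.1))   # -4, since 0.1 ≈ 1.6 * 2^-4
```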
  • We can calculate the length of the exponent k from the maximum representable floating point number by taking the base-2 logarithm of its exponent:

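One way to do this (exponent_size is a helper name assumed here, not necessarily the notebook's):

```julia
# exponent(floatmax(T)) is one below a power of two;
# k = log2(exponent(floatmax(T)) + 1) + 1 recovers the exponent size
exponent_size(T) = Int(log2(exponent(floatmax(T)) + 1)) + 1
println(exponent_size(Float64))  # 11
println(exponent_size(Float32))  # 8
println(exponent_size(Float16))  # 5
```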
  • The size of the significand t is calculated from the overall size of the representation, minus the size of the exponent, minus the size of the sign bit, plus 1 for the "hidden bit".

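Correspondingly (again with hypothetical helper names; exponent_size as sketched for the previous step):

```julia
exponent_size(T) = Int(log2(exponent(floatmax(T)) + 1)) + 1
# total bits, minus exponent bits, minus sign bit, plus one hidden bit
significand_size(T) = 8 * sizeof(T) - exponent_size(T) - 1 + 1
println(significand_size(Float64))  # 53
println(significand_size(Float32))  # 24
println(significand_size(Float16))  # 11
```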

This allows us to define a more readable variant of the bitstring representation for floats.

  • The sign bit is the first bit in the representation:

  • Next comes the exponent:

  • And finally, the significand:

  • Put them together:

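Putting the pieces together might look like this (floatbits is an assumed name; the notebook's own helpers may differ):

```julia
# Split bitstring(x) into sign, exponent and mantissa fields,
# separated by underscores for readability
function floatbits(x::AbstractFloat)
    T = typeof(x)
    k = Int(log2(exponent(floatmax(T)) + 1)) + 1  # exponent size
    s = bitstring(x)
    string(s[1], "_", s[2:k+1], "_", s[k+2:end])
end
println(floatbits(Float16(0.1)))  # "0_01011_1001100110"
```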

Julia floating point types

T = Float16

Type Float16:

  • size of exponent: 5

  • size of significand: 11

x = Float16(0.1)
  • Binary representation: 0_01011_1001100110

  • Exponent e=-4

  • Stored: e + 15 = 11

  • d_0 = 1 assumed implicitly.

  • Numbers which are exactly represented in the decimal system may not be exactly represented in the binary system!

  • Such numbers are always rounded to a nearby representable floating point number.

For example, the Float16 approximation x_per of 0.3 displays as Float16(0.2998); its readable bit representation is "0_01101_0011001100".

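This rounding can be observed directly; converting to Float64 shows the exact stored value:

```julia
# 0.3 has an infinite (periodic) binary expansion; Float16 keeps only 11
# significand bits, so the stored value is a nearby dyadic rational
x_per = Float16(0.3)
println(x_per)           # 0.2998
println(Float64(x_per))  # 0.2998046875, the exactly stored value
```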
Floating point limits
  • Finite size of representation ⇒ there are minimal and maximal possible numbers which can be represented

  • symmetry wrt. 0 because of sign bit

  • smallest positive denormalized number: d_i = 0, i = 0…t−2, d_{t−1} = 1 ⇒ x_min = 2^(1−t) · 2^L

  • smallest positive normalized number: d_0 = 1, d_i = 0, i = 1…t−1 ⇒ x_min = 2^L

  • largest positive normalized number: d_i = 1, i = 0…t−1 ⇒ x_max = (2 − 2^(1−t)) · 2^U

  • Largest representable number:

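Julia provides these limits directly (Float64 shown; the same functions work for the other float types):

```julia
println(nextfloat(zero(Float64)))  # smallest positive denormalized: 5.0e-324
println(floatmin(Float64))         # smallest positive normalized: 2.2250738585072014e-308
println(floatmax(Float64))         # largest finite: 1.7976931348623157e308
println(typemax(Float64))          # Inf
```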
Machine precision
  • There cannot be more than 2^(t+k) floating point numbers ⇒ almost all real numbers have to be approximated

  • Let x be an exact value and x̃ be its approximation. Then |(x̃ − x)/x| < ϵ is the best accuracy estimate we can get, where

    • ϵ = 2^(1−t) (truncation)

    • ϵ = ½ · 2^(1−t) (rounding)

  • Also: ϵ is the smallest representable number such that 1 + ϵ > 1.

  • Relative errors show up in particular when

    • subtracting two close numbers

    • adding smaller numbers to larger ones

How do operations work?

E.g. Addition

  • Adjust exponent of number to be added:

    • Until both exponents are equal, add 1 to exponent, shift mantissa to right bit by bit

  • Add both numbers

  • Normalize result

The smallest number one can add to 1 can have at most t bit shifts of the normalized mantissa until the mantissa becomes 0, so its value must be ≥ 2^(−t).
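For Float64 (t = 53) this can be checked directly:

```julia
# 2^-53 is shifted out completely during exponent adjustment
# (with round-to-even), while 2^-52 survives
println(1.0 + 2.0^-53 == 1.0)  # true
println(1.0 + 2.0^-52 > 1.0)   # true
```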

Machine epsilon
  • Smallest floating point number ϵ such that 1+ϵ>1 in floating point arithmetic

  • In exact math it is true that from 1 + ε = 1 it follows that 0 + ε = 0, and vice versa. In floating point computations this is not true.

For Float16, ϵ = Float16(0.000977) = 2^(−10), with the readable bit representation "0_00101_0000000000".
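The standard function eps returns this value; e.g. for Float16 (t = 11):

```julia
println(eps(Float16))                      # Float16(0.000977) == 2^-10
println(Float16(1) + eps(Float16) > 1)     # true
println(Float16(1) + eps(Float16)/2 == 1)  # true: ϵ/2 is lost in rounding
```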

Density of floating point numbers

How dense are floating point numbers on the real axis?

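One way to quantify the density is to count the representable numbers in successive binades (a sketch using nextfloat; Float16 keeps the counts small):

```julia
# Count Float16 values in the half-open interval [a, b)
function count_floats(a::Float16, b::Float16)
    n = 0
    x = a
    while x < b
        n += 1
        x = nextfloat(x)
    end
    n
end
println(count_floats(Float16(1), Float16(2)))  # 1024 values in [1, 2)
println(count_floats(Float16(2), Float16(4)))  # 1024 values in [2, 4): the spacing doubles
```

Within each binade [2^e, 2^(e+1)) the numbers are equally spaced; the spacing doubles from one binade to the next, so floating point numbers are dense near 0 and sparse for large magnitudes.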