4.1  Floating-Point Decomposition

Floating-point arithmetic is based on the observation that every nonzero real number $ x$ admits a representation of the form

$\displaystyle x = \pm m \cdot 2^e,$

where $ e$ is an integer, called the exponent of $ x$, and $ m$ is a number in the range $ 1 \leq m < 2$, called the significand of $ x$. These components are defined as follows.

Definition 4.1.1   (sgn, expo, sig) Let $ x \in \mathbb{R}$. If $ x \neq 0$, then

$\displaystyle sgn(x) = \left\{\begin{array}{ll}
1 & \mbox{if $x > 0$}\\
-1 & \mbox{if $x < 0$,}
\end{array}\right.$

$ expo(x)$ is the unique integer that satisfies

$\displaystyle 2^{expo(x)} \leq \vert x\vert < 2^{expo(x)+1},$

and

$\displaystyle sig(x) = \vert x\vert 2^{-expo(x)}.$

If $ x = 0$, then $ sgn(x) = expo(x) = sig(x) = 0$.

The decomposition property is immediate.

Lemma 4.1.1   (fp-rep, fp-abs) For all $ x \in \mathbb{R}$, $ x = sgn(x)sig(x)2^{expo(x)}$.

Lemma 4.1.2   (expo-minus, sig-minus) For all $ x \in \mathbb{R}$,

(a) $ \mathit{expo}(-x) = \mathit{expo}(x)$;

(b) $ \mathit{sig}(-x) = \mathit{sig}(x)$.

The definition of $ expo$ may be restated as follows:

  (expo<=, expo>=) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{Z}$,

(a) If $ 2^n \leq x$, then $ n \leq expo(x)$;

(b) If $ 0 < \vert x\vert < 2^{n+1}$, then $ expo(x) \leq n$.

PROOF:

(a) If $ n > expo(x)$, then $ n \geq expo(x)+1$, which implies

$\displaystyle \vert x\vert < 2^{expo(x)+1} \leq 2^n.$

(b) If $ n < expo(x)$, then $ n+1 \leq expo(x)$, which implies

$\displaystyle 2^{n+1} \leq 2^{expo(x)} \leq \vert x\vert.$

Corollary 4.1.4   (expo-2**n) For all $ n \in \mathbb{Z}$, $ expo(2^n) = n$.

  (expo-monotone) Let $ x \in \mathbb{R}$ and $ y \in \mathbb{R}$. If $ 0 < \vert x\vert \leq \vert y\vert$, then $ expo(x) \leq expo(y)$.

PROOF: since $ \vert x\vert \leq \vert y\vert < 2^{expo(y)+1}$, the lemma follows from Lemma 4.1.3(b). 


The width of a bit vector is determined by its exponent.

  (bvecp-expo) For all $ x \in \mathbb{N}$, $ x$ is a bit vector of width $ expo(x)+1$.

PROOF: This is just a restatement of the second inequality of Definition 4.1.1

  (mod-expo) For all nonzero $ x \in \mathbb{N}$,

$\displaystyle mod(x,2^{expo(x)}) = x - 2^{expo(x)}.$

PROOF: This is an instance of Lemma 1.2.16


:

We have the following bounds on $ sig$.

  (sig-lower-bound, sig-upper-bound) Let $ x \in \mathbb{R}$. If $ x \neq 0$, then $ 1 \leq sig(x) < 2$.

PROOF: Definition 4.1.1 yields

$\displaystyle 1 = 2^{expo(x)}/2^{expo(x)} \leq \vert x\vert/2^{expo(x)} < 2^{expo(x)+1}/2^{expo(x)} = 2.$

Corollary 4.1.9   (expo-sig) For all $ x \in \mathbb{R}$, $ expo(sig(x)) = 1$.

  (sig-self) Let $ x \in \mathbb{R}$. If $ 1 \leq x < 2$, then $ sig(x) = x$.

PROOF: Since $ 2^0 \leq \vert x\vert = x < 2^1$, $ expo(x) = 0$, and hence $ sig(x) = x/2^0 = x$

Corollary 4.1.11   (sig-sig) For all $ x \in \mathbb{R}$, $ sig(sig(x)) = sig(x)$.

  (fp-rep-unique) Let $ x \in \mathbb{R}$. If $ \vert x\vert = 2^{e}m$, where $ e \in \mathbb{Z}$, $ m \in
\mathbb{R}$, and $ 1 \leq m < 2$, then $ m = sig(x)$ and $ e = expo(x)$.

PROOF: Since $ 1 \leq m < 2$, $ 2^e \leq 2^em < 2^e2 = 2^{e+1}$, where $ 2^em = \vert x\vert$. It follows from Definition 4.1.1 that $ e = expo(x)$, and therefore $ sig(x) = \vert x\vert/2^e = m$


Changing the sign of a number does not affect its exponent or significand.

Lemma 4.1.13   (sgn-minus, expo-minus, sig-minus) For all $ x \in \mathbb{R}$,


(a) $ sgn(-x) = -sgn(x)$
(b) $ expo(-x) = expo(x)$
(c) $ sig(-x) = sig(x)$.

A shift does not affect the sign or significand.

  (sgn-shift, expo-shift, sig-shift) If $ x \in \mathbb{R}$, $ x \neq 0$, and $ n \in \mathbb{Z}$, then


(a) $ sgn(2^{n}x) = sgn(x)$
(b) $ expo(2^{n}x) = expo(x) + n$
(c) $ sig(2^{n}x) = sig(x)$.

PROOF:

(a) $ sgn(2^{n}x) = 2^{n}x/\vert 2^{n}x\vert = x/\vert x\vert = sgn(x)$.

(b) $ 2^{expo(x)} \leq \vert x\vert < 2^{expo(x)+1} \Rightarrow 2^{expo(x)+x} \leq \vert 2^nx\vert < 2^{expo(x)+n+1}$.

(c) $ sig(2^{n}x) = \vert 2^{n}x\vert 2^{-(expo(x)+n)} = \vert x\vert 2^{-expo(x)} = sig(x)$


We have the following formulas for the components of a product.

  (sgn-prod, expo-prod, sig-prod) Let $ x \in \mathbb{R}$ and $ y \in \mathbb{R}$. If $ xy \neq 0$, then


(a) $ sgn(xy) = sgn(x)sgn(y)$

(b) $ expo(xy) = \left\{\begin{array}{ll}
expo(x)+expo(y) & \mbox{if $sig(x)sig(y)<2$}\\
expo(x)+expo(y)+1 & \mbox{if $sig(x)sig(y) \geq 2$}
\end{array}\right.$

(c) $ sig(xy) = \left\{\begin{array}{ll}
sig(x)sig(y) & \mbox{if $sig(x)sig(y)<2$}\\
sig(x)sig(y)/2 & \mbox{if $sig(x)sig(y) \geq 2$.}
\end{array}\right.$

PROOF:

(a) $ sgn(xy) = xy/\vert xy\vert = (x/\vert x\vert)(y/\vert y\vert) = sgn(x)sgn(y)$.

(b) Since $ 2^{expo(x)} \leq \vert x\vert < 2^{expo(x)+1}$ and $ 2^{expo(y)} \leq \vert y\vert < 2^{expo(y)+1}$, we have

$\displaystyle 2^{expo(x)+expo(y)}$ $\displaystyle =$ $\displaystyle 2^{expo(x)}2^{expo(y)}$  
  $\displaystyle \leq$ $\displaystyle \vert xy\vert$  
  $\displaystyle <$ $\displaystyle 2^{expo(x)+1}2^{expo(y)+1}$  
  $\displaystyle =$ $\displaystyle 2^{expo(x)+expo(y)+2}.$  

If $ sig(x)sig(y) = \vert x\vert 2^{-expo(x)}\vert y\vert 2^{-expo(y)} < 2$, then

$\displaystyle \vert xy\vert < 2\cdot 2^{expo(x)}2^{expo(y)} = 2^{expo(x) + expo(y) + 1},$

and by Definition 4.1.1, $ expo(xy) = expo(x) + expo(y)$.

On the other hand, if $ sig(x)sig(y) \geq 2$, then similarly,

$\displaystyle \vert xy\vert \geq 2^{expo(x) + expo(y) + 1},$

and Definition 4.1.1 yields $ expo(xy) = expo(x) + expo(y) + 1$.

(c) If $ sig(x)sig(y) < 2$, then

$\displaystyle sig(xy)$ $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-expo(xy)}$  
  $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-(expo(x)+expo(y))}$  
  $\displaystyle =$ $\displaystyle \vert x\vert 2^{-expo(x)}\vert y\vert 2^{-expo(y)}$  
  $\displaystyle =$ $\displaystyle sig(x)sig(y).$  

Otherwise,
$\displaystyle sig(xy)$ $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-expo(xy)}$  
  $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-(expo(x)+expo(y)+1)}$  
  $\displaystyle =$ $\displaystyle \vert x\vert 2^{-expo(x)}\vert y\vert 2^{-expo(y)}/2$  
  $\displaystyle =$ $\displaystyle sig(x)sig(y)/2.$  

David Russinoff 2017-08-01