next up previous contents
Next: Exactness Up: Floating-Point Representation Previous: Floating-Point Representation   Contents

Sign, Exponent, and Significand

The three components of a rational number are defined as follows.

Definition 4.1.1   (sgn, expo, sig) Let $ x \in \mathbb{Q}$ . If $ x \neq 0$ , then

$\displaystyle sgn(x) = \left\{\begin{array}{ll}
1 & \mbox{if $x > 0$}\\
-1 & \mbox{if $x < 0$,}
\end{array}\right.$

$ expo(x)$ is the unique integer that satisfies

$\displaystyle 2^{expo(x)} \leq \vert x\vert < 2^{expo(x)+1},$

and

$\displaystyle sig(x) = \vert x\vert 2^{-expo(x)}.$

If $ x=0$ , then $ sgn(x) = expo(x) = sig(x) = 0$ .

The decomposition property is immediate.

Lemma 4.1.1   (fp-rep, fp-abs) For all $ x \in \mathbb{Q}$ , $ x = sgn(x)sig(x)2^{expo(x)}$ .

The definition of $ expo$ may be restated as follows:

  (expo<=, expo>=) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{Z}$ ,

(a) If $ 2^n \leq x$ , then $ n \leq expo(x)$ .
(b) If $ 0 < \vert x\vert < 2^{n+1}$ , then $ expo(x) \leq n$ .

PROOF:

(a) If $ n > expo(x)$ , then $ n \geq expo(x)+1$ , which implies

$\displaystyle \vert x\vert < 2^{expo(x)+1} \leq 2^n.$

(b) If $ n < expo(x)$ , then $ n+1 \leq expo(x)$ , which implies

$\displaystyle 2^{n+1} \leq 2^{expo(x)} \leq \vert x\vert.$

Corollary 4.1.3   (expo-2**n) For all $ n \in \mathbb{Z}$ , $ expo(2^n) = n$ .

  (expo-monotone) Let $ x \in \mathbb{Q}$ and $ y \in \mathbb{Q}$ . If $ 0 < \vert x\vert \leq \vert y\vert$ , then $ expo(x) \leq expo(y)$ .

PROOF: since $ \vert x\vert \leq \vert y\vert < 2^{expo(y)+1}$ , the lemma follows from Lemma 4.1.2(b). 


The width of a bit vector is determined by its exponent.

  (bvecp-expo) For all $ x \in \mathbb{N}$ , $ x$ is a bit vector of width $ expo(x)+1$ .

PROOF: This is just a restatement of the second inequality of Definition 4.1.1

  (mod-expo) For all nonzero $ x \in \mathbb{N}$ ,

$\displaystyle mod(x,2^{expo(x)}) = x - 2^{expo(x)}.$

PROOF: This is an instance of Lemma 1.2.17


:

We have the following bounds on $ sig$ .

  (sig-lower-bound, sig-upper-bound) Let $ x \in \mathbb{Q}$ . If $ x \neq 0$ , then $ 1 \leq sig(x) < 2$ .

PROOF: Definition 4.1.1 yields

$\displaystyle 1 = 2^{expo(x)}/2^{expo(x)} \leq \vert x\vert/2^{expo(x)} < 2^{expo(x)+1}/2^{expo(x)} = 2.$

  (already-sig) Let $ x \in \mathbb{Q}$ . If $ 1 \leq x < 2$ , then $ sig(x) = x$ .

PROOF: Since $ 2^0 \leq \vert x\vert = x < 2^1$ , $ expo(x) = 0$ , and hence $ sig(x) = x/2^0 = x$

Corollary 4.1.9   (sig-sig) For all $ x \in \mathbb{Q}$ , $ sig(sig(x)) = sig(x)$ .

  (fp-rep-unique) Let $ x \in \mathbb{Q}$ . If $ \vert x\vert = 2^{e}m$ , where $ e \in \mathbb{Z}$ , $ m \in
\mathbb{Q}$ , and $ 1 \leq m < 2$ , then $ m = sig(x)$ and $ e = expo(x)$ .

PROOF: Since $ 1 \leq m < 2$ , $ 2^e \leq 2^em < 2^e2 = 2^{e+1}$ , where $ 2^em = \vert x\vert$ . It follows from Definition 4.1.1 that $ e = expo(x)$ , and therefore $ sig(x) = \vert x\vert/2^e = m$


Changing the sign of a number does not affect its exponent or significand.

Lemma 4.1.11   (sgn-minus, expo-minus, sig-minus) For all $ x \in \mathbb{Q}$ ,


(a) $ sgn(-x) = -sgn(x)$
(b) $ expo(-x) = expo(x)$
(c) $ sig(-x) = sig(x)$ .

A shift does not affect the sign or significand.

  (sgn-shift, expo-shift, sig-shift) If $ x \in \mathbb{Q}$ , $ x \neq 0$ , and $ n \in \mathbb{Z}$ , then


(a) $ sgn(2^{n}x) = sgn(x)$
(b) $ expo(2^{n}x) = expo(x) + n$
(c) $ sig(2^{n}x) = sig(x)$ .

PROOF:

(a) $ sgn(2^{n}x) = 2^{n}x/\vert 2^{n}x\vert = x/\vert x\vert = sgn(x)$ .

(b) $ 2^{expo(x)} \leq \vert x\vert < 2^{expo(x)+1} \Rightarrow 2^{expo(x)+x} \leq \vert 2^nx\vert < 2^{expo(x)+n+1}$ .

(c) $ sig(2^{n}x) = \vert 2^{n}x\vert 2^{-(expo(x)+n)} = \vert x\vert 2^{-expo(x)} = sig(x)$


We have the following formulas for the components of a product.

  (sgn-prod, expo-prod, sig-prod) Let $ x \in \mathbb{Q}$ and $ y \in \mathbb{Q}$ . If $ xy \neq 0$ , then


(a) $ sgn(xy) = sgn(x)sgn(y)$

(b) $ expo(xy) = \left\{\begin{array}{ll}
expo(x)+expo(y) & \mbox{if $sig(x)sig(y)<2$}\\
expo(x)+expo(y)+1 & \mbox{if $sig(x)sig(y) \geq 2$}
\end{array}\right.$

(c) $ sig(xy) = \left\{\begin{array}{ll}
sig(x)sig(y) & \mbox{if $sig(x)sig(y)<2$}\\
sig(x)sig(y)/2 & \mbox{if $sig(x)sig(y) \geq 2$.}
\end{array}\right.$

PROOF:

(a) $ sgn(xy) = xy/\vert xy\vert = (x/\vert x\vert)(y/\vert y\vert) = sgn(x)sgn(y)$ .

(b) Since $ 2^{expo(x)} \leq \vert x\vert < 2^{expo(x)+1}$ and $ 2^{expo(y)} \leq \vert y\vert < 2^{expo(y)+1}$ , we have

$\displaystyle 2^{expo(x)+expo(y)}$ $\displaystyle =$ $\displaystyle 2^{expo(x)}2^{expo(y)}$  
  $\displaystyle \leq$ $\displaystyle \vert xy\vert$  
  $\displaystyle <$ $\displaystyle 2^{expo(x)+1}2^{expo(y)+1}$  
  $\displaystyle =$ $\displaystyle 2^{expo(x)+expo(y)+2}.$  

If $ sig(x)sig(y) = \vert x\vert 2^{-expo(x)}\vert y\vert 2^{-expo(y)} < 2$ , then

$\displaystyle \vert xy\vert < 2\cdot 2^{expo(x)}2^{expo(y)} = 2^{expo(x) + expo(y) + 1},$

and by Definition 4.1.1, $ expo(xy) = expo(x) + expo(y)$ .

On the other hand, if $ sig(x)sig(y) \geq 2$ , then similarly,

$\displaystyle \vert xy\vert \geq 2^{expo(x) + expo(y) + 1},$

and Definition 4.1.1 yields $ expo(xy) = expo(x) + expo(y) + 1$ .

(c) If $ sig(x)sig(y) < 2$ , then

$\displaystyle sig(xy)$ $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-expo(xy)}$  
  $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-(expo(x)+expo(y))}$  
  $\displaystyle =$ $\displaystyle \vert x\vert 2^{-expo(x)}\vert y\vert 2^{-expo(y)}$  
  $\displaystyle =$ $\displaystyle sig(x)sig(y).$  

Otherwise,
$\displaystyle sig(xy)$ $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-expo(xy)}$  
  $\displaystyle =$ $\displaystyle \vert xy\vert 2^{-(expo(x)+expo(y)+1)}$  
  $\displaystyle =$ $\displaystyle \vert x\vert 2^{-expo(x)}\vert y\vert 2^{-expo(y)}/2$  
  $\displaystyle =$ $\displaystyle sig(x)sig(y)/2.$  


next up previous contents
Next: Exactness Up: Floating-Point Representation Previous: Floating-Point Representation   Contents
David Russinoff 2007-01-02