Next: Rounding Up: Floating-Point Representation Previous: Exactness   Contents

Floating-Point Formats

A floating-point format is a scheme for encoding rationals as bit vectors, derived from the decomposition expressed by Lemma 4.1.1. A format may be characterized by two positive integers $ p$ and $ q$ , where

(1) $ p$ is the number of bits allocated to the significand, specifying the precision with which representable numbers are differentiated, and

(2) $ q$ is the number of bits allocated to the exponent, determining the range of representable numbers.

Under this plan, a floating-point encoding takes the form of a bit vector of width $ p+q+1$ consisting of a $ p$ -bit significand field, a $ q$ -bit exponent field, and a 1-bit sign field. A common optimization of this scheme is the omission of the most significant bit (MSB) of the significand field, which must then be derived contextually; the width of an encoding is thereby reduced to $ p+q$ .

In fact, there are two classes of floating-point formats in common use: those in which the leading significand bit appears explicitly, and those in which it is implicit. Representations of the first type include most implementations of the single extended ($ p = 32$ , $ q = 11$ ) and double extended ($ p = 64$ , $ q = 15$ ) formats provided by IEEE Standard 754 [IEEE85], as well as the higher-precision formats that are typically used for internal computations in floating-point units. Under this scheme, as diagrammed in Figure 4.1, an encoding $ x$ is a concatenation of the form

$\displaystyle \{$esgnf$\displaystyle (x,p,q),$   eexpof$\displaystyle (x,p,q)[q-1:0],$   esigf$\displaystyle (x,p)[p-1:0]\},$

the components of which are formally defined as follows.

Definition 4.3.1   (esgnf, eexpof, esigf) Let $ p \in \mathbb{N}$ and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ . If $ x$ is a bit vector of width $ p+q+1$ , then

(a) esgnf$ (x,p,q) = x[p+q]$
(b) eexpof$ (x,p,q) = x[p+q-1:p]$
(c) esigf$ (x,p) = x[p-1:0]$ .

Figure 4.1: A Floating-Point Format with Explicit MSB
[Figure: bit-field layout. From the most significant end: the sign bit at index $p+q$, the exponent field at indices $p+q-1$ down to $p$, and the significand field at indices $p-1$ down to $0$.]

The significand field of such an encoding is interpreted with an implied radix point following the most significant bit. That is, if

esigf$\displaystyle (x,p) = \verb!b!\beta_0\beta_1\cdots\beta_{p-1},$

then the value represented by this component is

esigf$\displaystyle (x,p)/2^{p-1} = \verb!b!\beta_0.\beta_1\cdots\beta_{p-1}.$

The rational number encoded by $ x$ is the signed product of this value and a power of 2 determined by the exponent field. Since it is desirable for the range of exponents to be centered at 0, this field is interpreted with a bias of $ 2^{q-1}-1$ , i.e., the value represented is

$\displaystyle 2^{\mbox{\scriptsize {\it eexpof}}(x,p,q) - (2^{q-1}-1)},$

where the exponent lies in the range

$\displaystyle 1-2^{q-1} \leq$   eexpof$\displaystyle (x,p,q) - (2^{q-1}-1) \leq 2^{q-1}.$

The decoding function for this format, therefore, is defined as follows.

Definition 4.3.2   (edecode) Let $ p \in
\mathbb{N}$ and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ . If $ x$ is a bit vector of width $ p+q+1$ , then

$\displaystyle edecode(x,p,q) = (-1)^{\mbox{\scriptsize {\it esgnf}(x,p,q)}}\mbox{\it esigf}(x,p)
2^{\mbox{\scriptsize {\it eexpof}}(x,p,q)+1-p-bias(q)},$

where

$\displaystyle bias(q) = 2^{q-1}-1.$
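As an illustration of Definitions 4.3.1 and 4.3.2, the following Python sketch extracts the three fields of an explicit-MSB encoding and evaluates edecode as an exact rational. The helper `bits` and the use of `fractions.Fraction` are assumptions of the sketch and are not part of the formal development.

```python
from fractions import Fraction

def bias(q):
    """Exponent bias 2^(q-1) - 1 for a q-bit exponent field."""
    return 2 ** (q - 1) - 1

def bits(x, hi, lo):
    """Bit slice x[hi:lo], inclusive at both ends (hypothetical helper)."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)

def esgnf(x, p, q):
    return bits(x, p + q, p + q)

def eexpof(x, p, q):
    return bits(x, p + q - 1, p)

def esigf(x, p):
    return bits(x, p - 1, 0)

def edecode(x, p, q):
    """Value of a (p+q+1)-bit encoding with explicit MSB, as a Fraction."""
    scale = Fraction(2) ** (eexpof(x, p, q) + 1 - p - bias(q))
    return (-1) ** esgnf(x, p, q) * esigf(x, p) * scale

# Double extended format (p = 64, q = 15): sign bit 1, exponent field equal
# to the bias, significand field 1.1000...b.
x = (1 << 79) | (bias(15) << 64) | (0b11 << 62)
print(edecode(x, 64, 15))  # -3/2
```

Exact rational arithmetic is used here so that the sketch mirrors the definition without introducing rounding of its own.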

The most commonly used formats, however, are based on an implicit MSB. These include the IEEE basic single ($ p=24$ , $ q=8$ ) and double ($ p=53$ , $ q = 11$ ) formats, at least one of which must be implemented by any IEEE-compliant floating-point unit. An encoding with respect to this scheme, as diagrammed in Figure 4.2, consists of the fields defined below.

Definition 4.3.3   (isgnf, iexpof, isigf) Let $ p \in
\mathbb{N}$ and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ , and let $ x$ be a bit vector of width $ p+q$ .

(a) isgnf$ (x,p,q) = x[p+q-1]$
(b) iexpof$ (x,p,q) = x[p+q-2:p-1]$
(c) isigf$ (x,p) = x[p-2:0]$ .

Figure 4.2: A Floating-Point Format with Implicit MSB
[Figure: bit-field layout. From the most significant end: the sign bit at index $p+q-1$, the exponent field at indices $p+q-2$ down to $p-1$, and the significand field at indices $p-2$ down to $0$.]

The value of the implicit MSB for an encoding $ x$ may be either 0 or 1, and is determined by iexpof$ (x,p,q)$ . If this field has any value other than the extremes 0 and $ 2^q-1$ , then $ x$ is said to be a normal encoding, as characterized by the following predicate.

Definition 4.3.4   (nencodingp) Let $ p \in
\mathbb{N}$ and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ , and let $ x$ be a $ (p+q)$ -bit vector. Then

$\displaystyle nencodingp(x,p,q) \Leftrightarrow 0 <$   iexpof$\displaystyle (x,p,q) < 2^q-1.$

The implicit bit of a normal encoding is always 1. Consequently, the decoding function is the same as that for an encoding with explicit MSB, except that esigf$ (x,p)$ is replaced by

$\displaystyle \{1\verb!'b!1,$   isigf$\displaystyle (x,p)[p-2:0]\} = 2^{p\mbox{-}1}+\mbox{\it isigf}(x,p).$

Definition 4.3.5   (ndecode) Let $ x \in \mathbb{N}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ . If $ nencodingp(x,p,q)$ , then

$\displaystyle ndecode(x,p,q) = (-1)^{\mbox{\scriptsize {\it isgnf}}(x,p,q)}(2^{p-1}+\mbox{\it isigf}(x,p))2^{\mbox{\scriptsize {\it iexpof}}(x,p,q)+1-p-bias(q)}.$
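The decoding of a normal encoding may be sketched in Python as follows and compared against the host's own single-precision encoding. The helper names `bits` and `float32_bits` are assumptions of the sketch; `struct` is used only to obtain the bit pattern of a Python float in the IEEE single format ($p=24$, $q=8$).

```python
import struct
from fractions import Fraction

def bias(q):
    return 2 ** (q - 1) - 1

def bits(x, hi, lo):
    """Bit slice x[hi:lo], inclusive at both ends (hypothetical helper)."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)

def isgnf(x, p, q):
    return bits(x, p + q - 1, p + q - 1)

def iexpof(x, p, q):
    return bits(x, p + q - 2, p - 1)

def isigf(x, p):
    return bits(x, p - 2, 0)

def nencodingp(x, p, q):
    """Normal encodings have an exponent field strictly between 0 and 2^q - 1."""
    return 0 < iexpof(x, p, q) < 2 ** q - 1

def ndecode(x, p, q):
    if not nencodingp(x, p, q):
        raise ValueError("not a normal encoding")
    sig = 2 ** (p - 1) + isigf(x, p)          # restore the implicit leading 1
    scale = Fraction(2) ** (iexpof(x, p, q) + 1 - p - bias(q))
    return (-1) ** isgnf(x, p, q) * sig * scale

def float32_bits(v):
    """Bit pattern of v in the IEEE single format (p = 24, q = 8)."""
    return struct.unpack(">I", struct.pack(">f", v))[0]

x = float32_bits(-6.25)
print(hex(x), ndecode(x, 24, 8))  # 0xc0c80000 -25/4
```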

Lemma 4.3.1   (sgn-ndecode, expo-ndecode, sig-ndecode) Let $ x \in \mathbb{N}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . Assume that $ nencodingp(x,p,q)$ and let $ \hat{x} = ndecode(x,p,q)$ . Then

(a) $ sgn(\hat{x}) = (-1)^{\mbox{\scriptsize {\it isgnf}}(x,p,q)}$
(b) $ expo(\hat{x}) =$   iexpof$ (x,p,q)-bias(q)$
(c) $ sig(\hat{x}) = 1 +$   isigf$ (x,p)/2^{p-1}$ .

PROOF: Part (a) is a trivial consequence of Definition 4.3.5.

By Lemma 2.2.1,

$\displaystyle 2^{p-1} \leq 2^{p-1} +$   isigf$\displaystyle (x,p) < 2^{p-1} + 2^{p-1} = 2^p,$

and thus, by Definition 4.1.1,

$\displaystyle expo(2^{p-1} +$   isigf$\displaystyle (x,p)) = p-1.$

Now by Lemma 4.1.12,
$\displaystyle expo(\hat{x})$ $\displaystyle =$ $\displaystyle expo(2^{p-1} +$   isigf$\displaystyle (x,p))+$iexpof$\displaystyle (x,p,q)+1-p-bias(q)$  
  $\displaystyle =$ iexpof$\displaystyle (x,p,q)-bias(q).$  

Finally, by Lemma 4.1.12 and Definition 4.1.1,
$\displaystyle sig(\hat{x})$ $\displaystyle =$ $\displaystyle sig(2^{p-1} +$   isigf$\displaystyle (x,p))$  
  $\displaystyle =$ $\displaystyle (2^{p-1} +$   isigf$\displaystyle (x,p))/2^{p-1}$  
  $\displaystyle =$ $\displaystyle 1 +$   isigf$\displaystyle (x,p)/2^{p-1}.$  

The following predicate characterizes the rational numbers that are representable by normal encodings.

Definition 4.3.6   (nrepp) Let $ r \in \mathbb{Q}$ , $ p \in
\mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . Then $ nrepp(r,p,q)$ if and only if all of the following are true:

(a) $ r \neq 0$ ,
(b) $ 0 < expo(r)+bias(q) < 2^q-1$ , and
(c) $ r$ is $ p$ -exact.

The normal encoding of a representable number is derived as follows.

Definition 4.3.7   (nencode) Let $ r \in \mathbb{Q}$ , $ p \in
\mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ . If $ nrepp(r,p,q)$ , then

$\displaystyle nencode(r,p,q) = \{neg(r), (expo(r)+bias(q))[q-1:0], (2^{p-1}(sig(r)-1))[p-2:0]\},$

where

$\displaystyle neg(r) = \left\{\begin{array}{ll}
0 & \mbox{if $r > 0$}\\
1 & \mbox{if $r < 0$.}
\end{array}\right.$
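Definitions 4.3.6 and 4.3.7 may be sketched in Python as follows. The functions `expo`, `sig`, and `exactp` are stand-ins, written for this sketch, for the notions of exponent, significand, and exactness introduced in earlier sections; their implementations over `fractions.Fraction` are assumptions and not part of the formal development.

```python
from fractions import Fraction

def bias(q):
    return 2 ** (q - 1) - 1

def expo(r):
    """The integer e with 2^e <= |r| < 2^(e+1), for a nonzero rational r."""
    r = abs(Fraction(r))
    e = r.numerator.bit_length() - r.denominator.bit_length()
    return e if Fraction(2) ** e <= r else e - 1

def sig(r):
    """Significand |r| / 2^expo(r), a rational in [1, 2)."""
    return abs(Fraction(r)) / Fraction(2) ** expo(r)

def exactp(r, n):
    """r is n-exact when 2^(n-1) * sig(r) is an integer."""
    return (Fraction(2) ** (n - 1) * sig(r)).denominator == 1

def nrepp(r, p, q):
    return (r != 0
            and 0 < expo(r) + bias(q) < 2 ** q - 1
            and exactp(r, p))

def nencode(r, p, q):
    if not nrepp(r, p, q):
        raise ValueError("not representable as a normal encoding")
    neg = 1 if r < 0 else 0
    expo_field = expo(r) + bias(q)
    sig_field = int(2 ** (p - 1) * (sig(r) - 1))   # drop the implicit leading 1
    return (neg << (p + q - 1)) | (expo_field << (p - 1)) | sig_field

print(hex(nencode(Fraction(-25, 4), 24, 8)))  # 0xc0c80000
```

The concatenation of the three fields is realized with shifts and bitwise OR, which is valid because each field value fits within its allotted width whenever nrepp holds.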

The next two lemmas establish an inverse relation between the encoding and decoding functions, from which it follows that the numbers that admit normal encodings are precisely those that satisfy nrepp.

Lemma 4.3.2   (nrepp-ndecode, nencode-ndecode) Let $ x \in \mathbb{N}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . If $ nencodingp(x,p,q)$ , then

$\displaystyle nrepp(ndecode(x,p,q),p,q)$

and

$\displaystyle nencode(ndecode(x,p,q),p,q) = x.$

PROOF: Let $ \hat{x} = ndecode(x,p,q)$ . It is clear from Definition 4.3.5 that $ \hat{x} \neq 0$ . By Lemma 4.3.1,

$\displaystyle expo(\hat{x})+bias(q) =$   iexpof$\displaystyle (x,p,q)$

is a $ q$ -bit vector, and

$\displaystyle 2^{p-1}sig(\hat{x}) = 2^{p-1}(1+$isigf$\displaystyle (x,p)/2^{p-1}) = 2^{p-1}+$isigf$\displaystyle (x,p) \in \mathbb{Z},$

i.e., $ \hat{x}$ is $ p$ -exact. This establishes $ nrepp(\hat{x},p,q)$ .

It is also clear from Definition 4.3.5 that

isgnf$\displaystyle (x,p,q) = \left\{\begin{array}{ll}
0 & \mbox{if $\hat{x} > 0$}\\
1 & \mbox{if $\hat{x} < 0$.}
\end{array}\right.$

Thus, by Definitions 4.3.7 and 4.3.3 and Lemmas 2.4.9 and 2.2.11,
$\displaystyle {nencode(\hat{x},p,q)}$
  $\displaystyle =$ $\displaystyle \{$isgnf$\displaystyle (x,p,q), (expo(\hat{x})+bias(q))[q-1:0], (2^{p-1}(sig(\hat{x})-1))[p-2:0]\}$  
  $\displaystyle =$ $\displaystyle \{$isgnf$\displaystyle (x,p,q),$   iexpof$\displaystyle (x,p,q)[q-1:0],$   isigf$\displaystyle (x,p)[p-2:0]\}$  
  $\displaystyle =$ $\displaystyle \{x[p+q-1], x[p+q-2:p-1], x[p-2:0]\}$  
  $\displaystyle =$ $\displaystyle x[p+q-1:0]$  
  $\displaystyle =$ $\displaystyle x.$  

Lemma 4.3.3   (nencodingp-nencode, ndecode-nencode) Let $ r \in \mathbb{Q}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . If $ nrepp(r,p,q)$ , then

$\displaystyle nencodingp(nencode(r,p,q),p,q)$

and

$\displaystyle ndecode(nencode(r,p,q),p,q) = r.$

PROOF: Let $ x = nencode(r,p,q)$ . By Lemma 2.4.1, $ x$ is a $ (p+q)$ -bit vector and by Lemma 2.4.7,

isgnf$\displaystyle (x,p,q) = x[p+q-1] = \left\{\begin{array}{ll}
0 & \mbox{if $r > 0$}\\
1 & \mbox{if $r < 0$,}
\end{array}\right.$

iexpof$\displaystyle (x,p,q) = x[p+q-2:p-1] = (expo(r)+bias(q))[q-1:0],$

and

isigf$\displaystyle (x,p) = x[p-2:0] = (2^{p-1}(sig(r)-1))[p-2:0].$

Since $ expo(r)+bias(q)$ is a $ q$ -bit vector,

$\displaystyle (expo(r)+bias(q))[q-1:0] = expo(r)+bias(q)$

by Lemma 2.2.11.

Since $ r$ is $ p$ -exact,

$\displaystyle 2^{p-1}(sig(r)-1) = 2^{p-1}sig(r)-2^{p-1} \in \mathbb{Z}$

and by Lemma 4.1.7, $ 2^{p-1}(sig(r)-1) < 2^{p-1}$ , which implies

$\displaystyle (2^{p-1}(sig(r)-1))[p-2:0] = 2^{p-1}(sig(r)-1).$

Finally, according to Definition 4.3.5,
$\displaystyle ndecode(x,p,q)$ $\displaystyle =$ $\displaystyle (-1)^{\mbox{\scriptsize {\it isgnf}}(x,p,q)}(2^{p-1}+\mbox{\it isigf}(x,p))2^{\mbox{\scriptsize {\it iexpof}}(x,p,q)+1-p-bias(q)}$  
  $\displaystyle =$ $\displaystyle sgn(r)2^{p-1}sig(r)2^{expo(r)+bias(q)+1-p-bias(q)}$  
  $\displaystyle =$ $\displaystyle sgn(r)sig(r)2^{expo(r)}$  
  $\displaystyle =$ $\displaystyle r.$  

We shall have occasion to refer to the smallest positive number that admits a normal representation.

Definition 4.3.8   (spn) For all $ q \in \mathbb{N}$ , $ spn(q) = 2^{1-bias(q)}$ .
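As a quick check of Definition 4.3.8, the following Python sketch evaluates $spn(q)$ for the IEEE single and double exponent widths. The comparison with `sys.float_info.min` assumes that the host's Python float is an IEEE double, which is an assumption of the sketch rather than a guarantee of the language.

```python
import sys
from fractions import Fraction

def bias(q):
    return 2 ** (q - 1) - 1

def spn(q):
    """Smallest positive normal: 2^(1 - bias(q))."""
    return Fraction(2) ** (1 - bias(q))

# Single format (q = 8): the smallest positive normal is 2^-126.
print(spn(8) == Fraction(1, 2 ** 126))

# Double format (q = 11): compare with the host's smallest normal float,
# assuming that float is an IEEE double.
print(spn(11) == Fraction(sys.float_info.min))
```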

Lemma 4.3.4   (positive-spn, nrepp-spn, spn-smallest) Let $ p \in \mathbb{N}$ and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ .

(a) $ spn(q) > 0$
(b) $ nrepp(spn(q),p,q)$
(c) If $ r \in \mathbb{Q}$ , $ r>0$ , and $ nrepp(r,p,q)$ , then $ r \ge spn(q)$ .

PROOF: It is clear that $ spn(q)$ is positive and satisfies Definition 4.3.6. Moreover, if $ r>0$ and $ nrepp(r,p,q)$ , then since $ expo(r) > -bias(q)$ , $ r \geq 2^{1-bias(q)}$ by Lemma 4.1.2.


Of the two values of the exponent field that lie outside of the range of normal encodings, the upper extreme $ 2^q-1$ is reserved for the encoding of infinities and other non-numerical entities, which will not be discussed here, while an exponent field of 0 is used to encode numerical values that lie below the normal range. If both the exponent and significand fields are 0, then the encoded value itself is 0. In the remaining case, the encoding is said to be denormal.

Definition 4.3.9   (dencodingp) Let $ p \in
\mathbb{N}$ and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ , and let $ x$ be a $ (p+q)$ -bit vector. Then

$\displaystyle dencodingp(x,p,q) \Leftrightarrow$   iexpof$\displaystyle (x,p,q) = 0$    and    isigf$\displaystyle (x,p) \neq 0.$

There are two differences between the decoding formulas for denormal and normal representations:

(1) For a denormal encoding, the implicit MSB is taken to be 0 rather than 1, so that the value represented by the significand field is isigf$ (x,p)/2^{p-1}$ .

(2) The power of 2 represented by the zero exponent field of a denormal encoding, which might be expected to be $ 2^{-bias(q)}$ , is instead the same as the minimum value of the normal range, $ 2^{1-bias(q)}$ .

Definition 4.3.10   (ddecode) Let $ x \in \mathbb{N}$ , $ p \in
\mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ . If $ dencodingp(x,p,q)$ , then

$\displaystyle ddecode(x,p,q) = (-1)^{\mbox{\scriptsize {\it isgnf}}(x,p,q)}\mbox{\it isigf}(x,p)2^{2\mbox{-}p\mbox{-}bias(q)}.$
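Definitions 4.3.9 and 4.3.10 may be sketched in Python as follows. As before, the helper `bits` and the use of `fractions.Fraction` are assumptions of the sketch.

```python
from fractions import Fraction

def bias(q):
    return 2 ** (q - 1) - 1

def bits(x, hi, lo):
    """Bit slice x[hi:lo], inclusive at both ends (hypothetical helper)."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)

def isgnf(x, p, q):
    return bits(x, p + q - 1, p + q - 1)

def iexpof(x, p, q):
    return bits(x, p + q - 2, p - 1)

def isigf(x, p):
    return bits(x, p - 2, 0)

def dencodingp(x, p, q):
    """Denormal encodings have a zero exponent field and a nonzero significand field."""
    return iexpof(x, p, q) == 0 and isigf(x, p) != 0

def ddecode(x, p, q):
    if not dencodingp(x, p, q):
        raise ValueError("not a denormal encoding")
    # The implicit MSB is 0 and the scale is that of the smallest normal exponent.
    return (-1) ** isgnf(x, p, q) * isigf(x, p) * Fraction(2) ** (2 - p - bias(q))

# Single format (p = 24, q = 8): sign 1, exponent field 0, significand field 3.
print(ddecode(0x80000003, 24, 8) == Fraction(-3, 2 ** 149))  # True
```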

Lemma 4.3.5   (sgn-ddecode, expo-ddecode, sig-ddecode) Let $ x \in \mathbb{N}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . Assume that $ dencodingp(x,p,q)$ and let $ \hat{x} = ddecode(x,p,q)$ . Then

(a) $ sgn(\hat{x}) = (-1)^{\mbox{\scriptsize {\it isgnf}}(x,p,q)}$
(b) $ expo(\hat{x}) = expo($isigf$ (x,p))-bias(q) + 2 - p$
(c) $ sig(\hat{x}) = sig($isigf$ (x,p))$ .

PROOF: (a) is trivial; (b) and (c) follow from Lemmas 4.1.11 and 4.1.12.


The class of rationals that are representable as denormal encodings is recognized by the following predicate.

Definition 4.3.11   (drepp) Let $ r \in \mathbb{Q}$ , $ p \in
\mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . Then $ drepp(r,p,q)$ if and only if all of the following are true:

(a) $ r \neq 0$ ,
(b) $ 2-p \leq expo(r)+bias(q) \leq 0$ , and
(c) $ r$ is $ (p-2+2^{q-1}+expo(r))$ -exact.

If a number is so representable, then its encoding is constructed as follows.

Definition 4.3.12   (dencode) Let $ r \in \mathbb{Q}$ , $ p \in
\mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>0$ and $ q>0$ . If $ drepp(r,p,q)$ , then

$\displaystyle dencode(r,p,q) = \{neg(r), q\verb!'b!0, (2^{p\mbox{-}2+expo(r)+bias(q)}sig(r))[p\mbox{-}2:0]\},$

where

$\displaystyle neg(r) = \left\{\begin{array}{ll}
0 & \mbox{if $r > 0$}\\
1 & \mbox{if $r < 0$.}
\end{array}\right.$
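Definitions 4.3.11 and 4.3.12 may be sketched in Python as follows. The functions `expo`, `sig`, and `exactp` are stand-ins written for this sketch for the notions of earlier sections; their implementations are assumptions and not part of the formal development.

```python
from fractions import Fraction

def bias(q):
    return 2 ** (q - 1) - 1

def expo(r):
    """The integer e with 2^e <= |r| < 2^(e+1), for a nonzero rational r."""
    r = abs(Fraction(r))
    e = r.numerator.bit_length() - r.denominator.bit_length()
    return e if Fraction(2) ** e <= r else e - 1

def sig(r):
    return abs(Fraction(r)) / Fraction(2) ** expo(r)

def exactp(r, n):
    """r is n-exact when 2^(n-1) * sig(r) is an integer."""
    return (Fraction(2) ** (n - 1) * sig(r)).denominator == 1

def drepp(r, p, q):
    if r == 0:
        return False
    e = expo(r) + bias(q)
    return 2 - p <= e <= 0 and exactp(r, p - 2 + 2 ** (q - 1) + expo(r))

def dencode(r, p, q):
    if not drepp(r, p, q):
        raise ValueError("not representable as a denormal encoding")
    neg = 1 if r < 0 else 0
    sig_field = int(Fraction(2) ** (p - 2 + expo(r) + bias(q)) * sig(r))
    # The exponent field consists of q zero bits, so it contributes nothing.
    return (neg << (p + q - 1)) | sig_field

print(hex(dencode(Fraction(-3, 2 ** 149), 24, 8)))  # 0x80000003
```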

Next, we examine the relationship between the decoding and encoding functions.

Lemma 4.3.6   (drepp-ddecode, dencode-ddecode) Let $ x \in \mathbb{N}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . If $ dencodingp(x,p,q)$ , then

$\displaystyle drepp(ddecode(x,p,q),p,q)$

and

$\displaystyle dencode(ddecode(x,p,q),p,q) = x.$

PROOF: Let $ \hat{x} = ddecode(x,p,q)$ . Since $ 1 \leq$   isigf$ (x,p) < 2^{p-1}$ ,

$\displaystyle 2^{2-p-bias(q)} \leq \vert\hat{x}\vert =$   isigf$\displaystyle (x,p)2^{2\mbox{-}p\mbox{-}bias(q)} < 2^{1-bias(q)},$

and by Lemma 4.1.2,

$\displaystyle 2-p-bias(q) \leq expo(\hat{x}) < 1-bias(q),$

which is equivalent to Definition 4.3.11(b). In order to prove (c), we must show, according to Definition 4.2.1, that

$\displaystyle 2^{p-2+2^{q-1}+expo(\hat{x})-1}sig(\hat{x}) = 2^{p-2+bias(q)+expo(\hat{x})}sig(\hat{x}) \in \mathbb{Z}.$

But
$\displaystyle 2^{p-2+bias(q)+expo(\hat{x})}sig(\hat{x})$ $\displaystyle =$ $\displaystyle 2^{p-2+bias(q)+expo(\hat{x})}\vert\hat{x}\vert 2^{-expo(\hat{x})}$  
  $\displaystyle =$ $\displaystyle 2^{p-2+bias(q)}\vert\hat{x}\vert$  
  $\displaystyle =$ $\displaystyle 2^{p-2+bias(q)}$isigf$\displaystyle (x,p)2^{2\mbox{-}p\mbox{-}bias(q)}$  
  $\displaystyle =$ isigf$\displaystyle (x,p) \in \mathbb{Z}.$  

This establishes $ drepp(\hat{x},p,q)$ .

Now by Definition 4.3.10,

isgnf$\displaystyle (x,p,q) = \left\{\begin{array}{ll}
0 & \mbox{if $\hat{x} > 0$}\\
1 & \mbox{if $\hat{x} < 0$.}
\end{array}\right.$

Therefore, by Definitions 4.3.9, 4.3.12, and 4.3.3 and Lemmas 2.4.9 and 2.2.11,
$\displaystyle {dencode(\hat{x},p,q)}$
  $\displaystyle =$ $\displaystyle \{$isgnf$\displaystyle (x,p,q), q\verb!'b!0, (2^{p\mbox{-}2+expo(\hat{x})+bias(q)}sig(\hat{x}))[p\mbox{-}2:0]\}$  
  $\displaystyle =$ $\displaystyle \{$isgnf$\displaystyle (x,p,q),$   iexpof$\displaystyle (x,p,q)[q-1:0],$   isigf$\displaystyle (x,p)[p-2:0]\}$  
  $\displaystyle =$ $\displaystyle \{x[p+q-1], x[p+q-2:p-1], x[p-2:0]\}$  
  $\displaystyle =$ $\displaystyle x[p+q-1:0]$  
  $\displaystyle =$ $\displaystyle x.$  

Lemma 4.3.7   (dencodingp-dencode, ddecode-dencode) Let $ r \in \mathbb{Q}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ . If $ drepp(r,p,q)$ , then

$\displaystyle dencodingp(dencode(r,p,q),p,q)$

and

$\displaystyle ddecode(dencode(r,p,q),p,q) = r.$

PROOF: Let $ x = dencode(r,p,q)$ . By Lemma 2.4.1, $ x$ is a $ (p+q)$ -bit vector and by Lemma 2.4.7,

isgnf$\displaystyle (x,p,q) = x[p+q-1] = \left\{\begin{array}{ll}
0 & \mbox{if $r > 0$}\\
1 & \mbox{if $r < 0$,}
\end{array}\right.$

iexpof$\displaystyle (x,p,q) = x[p+q-2:p-1] = 0,$

and

isigf$\displaystyle (x,p) = x[p-2:0] = (2^{p-2+expo(r)+bias(q)}sig(r))[p-2:0].$

Since $ r$ is $ (p-2+2^{q-1}+expo(r))$ -exact,

$\displaystyle 2^{p-2+expo(r)+bias(q)}sig(r) = 2^{(p-2+2^{q-1}+expo(r))-1}sig(r) \in \mathbb{Z}$

and since $ expo(r) + bias(q) \leq 0$ ,

$\displaystyle 2^{p-2+expo(r)+bias(q)}sig(r) < 2^{p-2}\cdot 2 = 2^{p-1}$

by Lemma 4.1.7, which implies

$\displaystyle (2^{p-2+expo(r)+bias(q)}sig(r))[p-2:0] = 2^{p-2+expo(r)+bias(q)}sig(r).$

Finally, according to Definition 4.3.10,
$\displaystyle ddecode(x,p,q)$ $\displaystyle =$ $\displaystyle (-1)^{\mbox{\scriptsize {\it isgnf}}(x,p,q)}\mbox{\it isigf}(x,p)2^{2\mbox{-}p\mbox{-}bias(q)}$  
  $\displaystyle =$ $\displaystyle sgn(r)2^{p-2+expo(r)+bias(q)}sig(r)2^{2\mbox{-}p\mbox{-}bias(q)}$  
  $\displaystyle =$ $\displaystyle sgn(r)sig(r)2^{expo(r)}$  
  $\displaystyle =$ $\displaystyle r.$  

The smallest positive denormal is computed by the following function:

Definition 4.3.13   (spd) For all $ p \in
\mathbb{N}$ and $ q \in \mathbb{N}$ , $ spd(p,q) = 2^{2-bias(q)-p}$ .

Lemma 4.3.8   (positive-spd, drepp-spd, spd-smallest) Let $ p \in \mathbb{N}$ and $ q \in \mathbb{N}$ with $ p>1$ and $ q>0$ .

(a) $ spd(p,q) > 0$
(b) $ drepp(spd(p,q),p,q)$
(c) If $ r \in \mathbb{Q}$ , $ r>0$ , and $ drepp(r,p,q)$ , then $ r \geq spd(p,q)$ .

PROOF: It is clear that $ spd(p,q)$ is positive. To show that $ spd(p,q)$ is $ (p-2+2^{q-1}+expo(spd(p,q)))$ -exact, we need only observe that

$\displaystyle p-2+2^{q-1}+expo(spd(p,q)) = p-2+2^{q-1}+2-(2^{q-1}-1)-p = 1.$

Finally, since

$\displaystyle expo(spd(p,q)) + bias(q) = 2-p \leq 0,$

$ drepp(spd(p,q),p,q)$ holds and moreover, $ spd(p,q)$ is the smallest positive $ r$ that satisfies $ 2-p \leq expo(r)+bias(q)$ .


Every number with a denormal representation is a multiple of the smallest positive denormal.

Lemma 4.3.9   (spd-mult) Let $ r \in \mathbb{Q}$ , $ p \in \mathbb{N}$ , and $ q \in \mathbb{N}$ with $ r>0$ , $ p>1$ , and $ q>0$ . Then $ drepp(r,p,q)$ if and only if $ r = m \cdot spd(p,q)$ for some $ m \in \mathbb{N}$ , $ 1 \leq m < 2^{p-1}$ .

PROOF: For $ 1 \leq m \leq 2^{p-1}$ , let $ a_m = m \cdot spd(p,q)$ . Then $ a_1 = spd(p,q)$ and

$\displaystyle a_{2^{p-1}} = 2^{p-1}spd(p,q) = 2^{p-1}2^{2-bias(q)-p} = 2^{1-bias(q)} = spn(q).$

We shall show, by induction on $ m$ , that $ drepp(a_m,p,q)$ for $ 1 \leq m<2^{p-1}$ . First note that for all such $ m$ ,
$\displaystyle {fp^+(a_m,p+expo(a_m)-expo(spn(q)))}$
  $\displaystyle =$ $\displaystyle a_m + 2^{expo(a_m)+1-(p+expo(a_m)-expo(spn(q)))}$  
  $\displaystyle =$ $\displaystyle a_m + 2^{expo(spn(q))-(p-1)}$  
  $\displaystyle =$ $\displaystyle a_m + spd(p,q)$  
  $\displaystyle =$ $\displaystyle a_{m+1}.$  

Suppose that $ drepp(a_{m-1},p,q)$ for some $ m$ , $ 1 < m < 2^{p-1}$ . Then $ a_{m-1}$ is $ (p+expo(a_{m-1})-expo(spn(q)))$ -exact, and by Lemma 4.2.15, so is $ a_m$ . But since $ expo(a_m) \geq expo(a_{m-1})$ , it follows from Lemma 4.2.5 that $ a_m$ is also $ (p+expo(a_m)-expo(spn(q)))$ -exact. Since

$\displaystyle a_m < a_{2^{p-1}} = spn(q) = 2^{1-bias(q)},$

$ expo(a_m) < 1-bias(q)$ , i.e., $ expo(a_m) + bias(q) \leq 0$ , and hence, $ drepp(a_m,p,q)$ .

Now suppose that $ z \in \mathbb{Q}$ and $ drepp(z,p,q)$ . Let $ m = \lfloor z/a_1 \rfloor$ . Clearly, $ 1 \leq m < 2^{p-1}$ , and $ a_m \leq z < a_{m+1}$ . It follows from Lemma 4.2.16 that $ expo(z) = expo(a_m)$ , and consequently, $ z$ is $ (p+expo(a_m)-expo(spn(q)))$ -exact. Thus, by Lemma 4.2.15, $ z = a_m$ .
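The characterization of denormals as multiples of the smallest positive denormal can be checked empirically for small formats. The following Python sketch compares drepp against the multiples of $spd(p,q)$ on a grid of spacing $spd(p,q)/2$ that extends past $spn(q)$. The functions `expo`, `sig`, `exactp`, and `check_spd_mult` are written for this sketch and are assumptions; such a check is an illustration and not a substitute for the proof.

```python
from fractions import Fraction

def bias(q):
    return 2 ** (q - 1) - 1

def expo(r):
    """The integer e with 2^e <= |r| < 2^(e+1), for a nonzero rational r."""
    r = abs(Fraction(r))
    e = r.numerator.bit_length() - r.denominator.bit_length()
    return e if Fraction(2) ** e <= r else e - 1

def sig(r):
    return abs(Fraction(r)) / Fraction(2) ** expo(r)

def exactp(r, n):
    return (Fraction(2) ** (n - 1) * sig(r)).denominator == 1

def drepp(r, p, q):
    if r == 0:
        return False
    e = expo(r) + bias(q)
    return 2 - p <= e <= 0 and exactp(r, p - 2 + 2 ** (q - 1) + expo(r))

def spd(p, q):
    """Smallest positive denormal: 2^(2 - bias(q) - p)."""
    return Fraction(2) ** (2 - bias(q) - p)

def check_spd_mult(p, q):
    """True if drepp agrees with 'r = m * spd, 1 <= m < 2^(p-1)' on the grid."""
    half = spd(p, q) / 2
    for k in range(1, 2 ** (p + 1)):
        r = k * half
        m = r / spd(p, q)
        is_multiple = m.denominator == 1 and 1 <= m < 2 ** (p - 1)
        if drepp(r, p, q) != is_multiple:
            return False
    return True

print(check_spd_mult(4, 3), check_spd_mult(5, 4))  # True True
```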


A common task performed by floating-point units is the conversion of an encoding from one format to another, which generally requires the rebiasing of the exponent field. This operation may, of course, be viewed in terms of bit vector addition. If $ e$ is the value of an exponent field of width $ m$ , then the actual exponent of the encoded number is $ e-bias(m)$ . The result of rebiasing this value for a field of width $ n$ is given by the following definition.

Definition 4.3.14   (rebias-expo) For all $ e \in \mathbb{N}$ , $ m \in \mathbb{N}$ , and $ n \in \mathbb{N}$ ,

$\displaystyle rebias$-$\displaystyle expo(e,m,n) = e+bias(n)-bias(m).$
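Definition 4.3.14 translates directly into Python. In the sketch below, the name `rebias_expo` is the obvious transliteration of rebias-expo and is the only assumption.

```python
def bias(q):
    return 2 ** (q - 1) - 1

def rebias_expo(e, m, n):
    """Rebias the value e of an m-bit exponent field for an n-bit field."""
    return e + bias(n) - bias(m)

# Single (8-bit) to double (11-bit) exponent field, and back.
print(rebias_expo(127, 8, 11))   # 1023
print(rebias_expo(1023, 11, 8))  # 127
```

Both calls rebias an exponent field equal to the bias, that is, an actual exponent of 0, which must again be represented by the bias of the target format.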

When the target exponent field is wider than that of the source, rebiasing is always possible.

Lemma 4.3.10   (rebias-up) Let $ m \in \mathbb{N}$ and $ n \in \mathbb{N}$ with $ 1 < m < n$ . If $ e$ is an $ m$ -bit vector, then

$\displaystyle rebias$-$\displaystyle expo(e,m,n) = \{e[m-1], \{(n-m)\{\verb!~!e[m-1]\}\}, e[m-2:0]\}.$

PROOF: First suppose that $ e[m-1] = 1$ . Then by Lemmas 2.2.11 and 2.3.13,

$\displaystyle e = e[m-1:0] = 2^{m-1} + e[m-2:0]$

and

$\displaystyle rebias$-$\displaystyle expo(e,m,n) = 2^{n-1} - 2^{m-1} + 2^{m-1} + e[m-2:0] = 2^{n-1} + e[m-2:0].$

On the other hand, by Definition 2.4.1 and Lemma 2.4.18,

$\displaystyle \{e[m-1], \{(n-m)\{\verb!~!e[m-1]\}\}, e[m-2:0]\}$
  $\displaystyle =$ $\displaystyle 2^{n-1}\cdot 1 + 2^{m-1}\{(n-m)\{0\}\} + e[m-2:0]$  
  $\displaystyle =$ $\displaystyle 2^{n-1} + e[m-2:0].$  

Now suppose $ e[m-1] = 0$ . Then

$\displaystyle rebias$-$\displaystyle expo(e,m,n) = 2^{n-1} - 2^{m-1} + e = 2^{n-1} - 2^{m-1} + e[m-2:0]$

and by Definition 2.4.1 and Lemma 2.4.19,

$\displaystyle \{e[m-1], \{(n-m)\{\verb!~!e[m-1]\}\}, e[m-2:0]\}$
  $\displaystyle =$ $\displaystyle 2^{n-1}\cdot 0 + 2^{m-1}\{(n-m)\{1\}\} + e[m-2:0]$  
  $\displaystyle =$ $\displaystyle 2^{m-1}(2^{n-m}-1) + e[m-2:0]$  
  $\displaystyle =$ $\displaystyle 2^{n-1} - 2^{m-1} + e[m-2:0].$  
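The bit-level form of upward rebiasing, in which the top bit of the source field is kept, followed by $n-m$ copies of its complement and then the remaining $m-1$ low bits, can be compared exhaustively with the arithmetic definition for small widths. The function names `rebias_up_bits` and `check_rebias_up` are assumptions of this Python sketch.

```python
def bias(q):
    return 2 ** (q - 1) - 1

def rebias_expo(e, m, n):
    return e + bias(n) - bias(m)

def rebias_up_bits(e, m, n):
    """Concatenate e[m-1], n-m copies of its complement, and e[m-2:0]."""
    top = (e >> (m - 1)) & 1
    low = e & ((1 << (m - 1)) - 1)
    fill = 0 if top else (1 << (n - m)) - 1   # n-m copies of the complemented top bit
    return (top << (n - 1)) | (fill << (m - 1)) | low

def check_rebias_up(max_width):
    """Compare both forms for all 1 < m < n <= max_width and all m-bit e."""
    return all(rebias_expo(e, m, n) == rebias_up_bits(e, m, n)
               for n in range(3, max_width + 1)
               for m in range(2, n)
               for e in range(2 ** m))

print(rebias_up_bits(127, 8, 11))  # 1023
print(check_rebias_up(9))          # True
```

The concatenated form requires no adder, which is presumably why it is of interest in hardware; the sketch only confirms that the two forms agree on the tested widths.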

Corollary 4.3.11   (bvecp-rebias-up) Let $ m \in \mathbb{N}$ and $ n \in \mathbb{N}$ with $ 0 < m \leq n$ . If $ e$ is an $ m$ -bit vector, then $ rebias$-$ expo(e,m,n)$ is an $ n$ -bit vector.

Now suppose that $ e$ is an $ n$ -bit biased exponent to be rebiased to fit into a smaller $ m$ -bit field. In order for this to be possible,

$\displaystyle rebias$-$\displaystyle expo(e,n,m) = e+bias(m)-bias(n)$

must be an $ m$ -bit vector, i.e.,

$\displaystyle 0 \leq e+bias(m)-bias(n) = e + 2^{m-1} - 2^{n-1} < 2^m,$

or equivalently,

$\displaystyle 2^{n-1}-2^{m-1} \leq e < 2^{n-1}+2^{m-1}.$

Lemma 4.3.12   (rebias-down) Let $ m \in \mathbb{N}$ and $ n \in \mathbb{N}$ with $ 1 < m < n$ . If $ e$ is an $ n$ -bit vector and

$\displaystyle 2^{n-1}-2^{m-1} \leq e < 2^{n-1}+2^{m-1},$

then

$\displaystyle rebias$-$\displaystyle expo(e,n,m) = \{e[n-1], e[m-2:0]\}.$

PROOF: The hypothesis implies that $ e$ is an $ n$ -bit vector, and hence, by Lemmas 2.2.11, 2.3.13, and 2.2.19,

$\displaystyle e = e[n-1:0] = 2^{n-1}e[n-1] + 2^{m-1}e[n-2:m-1] + e[m-2:0].$

Suppose first that $ e[n-1] = 1$ . Then $ e[n-2:m-1] = 0$ , for otherwise $ e[n-2:m-1] \geq 1$ and $ e \geq 2^{n-1} + 2^{m-1}$ . Thus, $ e = 2^{n-1} + e[m-2:0]$ and
$\displaystyle rebias$-$\displaystyle expo(e,n,m)$ $\displaystyle =$ $\displaystyle 2^{m-1} + e[m-2:0]$  
  $\displaystyle =$ $\displaystyle 2^{m-1}e[n-1] + e[m-2:0]$  
  $\displaystyle =$ $\displaystyle \{e[n-1], e[m-2:0]\}.$  

Now suppose $ e[n-1] = 0$ . Then $ e[n-2:m-1] = 2^{n-m}-1$ , for otherwise, by Lemma 2.2.1, $ e[n-2:m-1] \leq 2^{n-m}-2$ and
$\displaystyle e$ $\displaystyle =$ $\displaystyle 2^{m-1}e[n-2:m-1] + e[m-2:0]$  
  $\displaystyle \leq$ $\displaystyle 2^{m-1}(2^{n-m}-2) + e[m-2:0]$  
  $\displaystyle <$ $\displaystyle 2^{n-1} - 2^m + 2^{m-1}$  
  $\displaystyle =$ $\displaystyle 2^{n-1} - 2^{m-1}.$  

Therefore,

$\displaystyle e = 2^{m-1}(2^{n-m}-1) + e[m-2:0] = 2^{n-1} - 2^{m-1} + e[m-2:0]$

and
$\displaystyle rebias$-$\displaystyle expo(e,n,m)$ $\displaystyle =$ $\displaystyle e[m-2:0]$  
  $\displaystyle =$ $\displaystyle 2^{m-1}e[n-1] + e[m-2:0]$  
  $\displaystyle =$ $\displaystyle \{e[n-1], e[m-2:0]\}.$  

Corollary 4.3.13   (bvecp-rebias-down) Let $ m \in \mathbb{N}$ and $ n \in \mathbb{N}$ with $ 0 < m \leq n$ . If $ e$ is an $ n$ -bit vector and

$\displaystyle 2^{n-1}-2^{m-1} \leq e < 2^{n-1}+2^{m-1},$

then $ rebias$-$ expo(e,n,m)$ is an $ m$ -bit vector.
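The downward case can be checked in the same manner. The following Python sketch confirms, for small widths, that the rebiased value fits in $m$ bits exactly when $e$ satisfies the stated bounds, and that in that case it coincides with the concatenation of $e[n-1]$ and $e[m-2:0]$. The function names `rebias_down_bits`, `in_range`, and `check_rebias_down` are assumptions of the sketch.

```python
def bias(q):
    return 2 ** (q - 1) - 1

def rebias_expo(e, m, n):
    return e + bias(n) - bias(m)

def rebias_down_bits(e, n, m):
    """Concatenate e[n-1] and e[m-2:0]."""
    top = (e >> (n - 1)) & 1
    low = e & ((1 << (m - 1)) - 1)
    return (top << (m - 1)) | low

def in_range(e, n, m):
    """The bounds on e under which rebiasing to m bits is possible."""
    return 2 ** (n - 1) - 2 ** (m - 1) <= e < 2 ** (n - 1) + 2 ** (m - 1)

def check_rebias_down(max_width):
    for n in range(3, max_width + 1):
        for m in range(2, n):
            for e in range(2 ** n):
                r = rebias_expo(e, n, m)
                fits = 0 <= r < 2 ** m
                if fits != in_range(e, n, m):
                    return False
                if fits and r != rebias_down_bits(e, n, m):
                    return False
    return True

print(rebias_down_bits(1023, 11, 8))  # 127
print(check_rebias_down(9))           # True
```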


David Russinoff 2007-01-02