6.6  Denormal Rounding

The rounding of numbers that lie below the normal range of a floating-point format is conveniently defined by the following function. Note that its arguments include the format itself, both parameters of which are required to compute the precision of the result.

Definition 6.6.1   (drnd) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$.

$\displaystyle \mathit{drnd}(x,{\cal R}, F) = {\cal R}(x, \mathit{prec}(F) + \mathit{expo}(x) - \mathit{expo}(\mathit{spn}(F))).$
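
As an illustration only, Definition 6.6.1 can be transcribed into exact rational arithmetic. The following Python sketch is not part of the formal development. The helper names `expo`, `rnd`, and `drnd`, the encoding of rounding modes as strings, and the representation of a format by its precision `p` and exponent width `q` alone (with $ \mathit{bias}(F) = 2^{q-1}-1$) are assumptions made for the example.

```python
from fractions import Fraction
from math import floor, ceil

MODES = ("RTZ", "RAZ", "RDN", "RUP", "RNE", "RNA")

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for a nonzero rational x."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def rnd(x, mode, n):
    """Round x to n bits of precision under one of the six modes in MODES."""
    if mode not in MODES:
        raise ValueError(mode)
    x = Fraction(x)
    if x == 0:
        return x
    unit = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / unit                   # equals 2**(n-1) * sig(x)
    lo, hi = floor(scaled), ceil(scaled)
    toward_zero = {"RTZ": True, "RAZ": False, "RDN": x > 0, "RUP": x < 0}
    if mode in toward_zero:
        m = lo if toward_zero[mode] else hi
    elif scaled - lo != Fraction(1, 2):      # RNE or RNA, not a tie
        m = lo if scaled - lo < Fraction(1, 2) else hi
    elif mode == "RNA":                      # tie: round away from zero
        m = hi
    else:                                    # RNE tie: choose the even multiple
        m = lo if lo % 2 == 0 else hi
    return (1 if x > 0 else -1) * m * unit

def drnd(x, mode, p, q):
    """Denormal rounding for a format with precision p and exponent width q."""
    x = Fraction(x)
    if x == 0:
        return x
    expo_spn = 2 - 2 ** (q - 1)              # expo(spn(F)) = 1 - bias(F)
    return rnd(x, mode, p + expo(x) - expo_spn)

# Toy format (an assumption for illustration): p = 3, q = 3,
# so spn = 1/4 and spd = 1/16.
x = Fraction(5, 32)                          # midway between 1/8 and 3/16
results = {mode: drnd(x, mode, 3, 3) for mode in MODES}
tiny = drnd(Fraction(1, 64), "RNE", 3, 3)    # a nonzero input that rounds to 0
```

In this toy format the denormals are 1/16, 1/8, and 3/16, so every mode maps 5/32 to one of its two neighbors, while the last line gives an instance of a non-zero argument that is rounded to 0.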

It is not true in general that $ sgn(\mathit{drnd}(x,{\cal R}, F)) = sgn(x)$, since a non-zero denormal may be rounded to 0. However, we do have the following analog of Lemma 6.5.5.

Lemma 6.6.1   (drnd-minus) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ {\cal R}$ be a common rounding mode and let

$\displaystyle \hat{\cal R} = \left\{\begin{array}{ll}
\mathit{RDN} & \mbox{if ${\cal R} = \mathit{RUP}$}\\
\mathit{RUP} & \mbox{if ${\cal R} = \mathit{RDN}$}\\
{\cal R} & \mbox{otherwise.}\end{array} \right.$

Then

$\displaystyle \mathit{drnd}(-x,{\cal R}, F) = -\mathit{drnd}(x,\hat{\cal R}, F).$

PROOF: This follows from Lemmas 6.5.5 and 4.1.13.


A denormal is always rounded to a representable number, which may be denormal, 0, or the smallest representable normal.

Lemma 6.6.2   (drnd-exactp-a) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ {\cal R}$ be a common rounding mode. Then one of the following is true:

(a) $ \mathit{drnd}(x, {\cal R}, F) = 0$.

(b) $ \mathit{drnd}(x, {\cal R}, F) = \mathit{sgn}(x)\mathit{spn}(F)$.

(c) $ \mathit{drnd}(x, {\cal R}, F)$ is representable as a denormal in $ F$.

PROOF: We may assume $ x>0$, as the general result follows easily from this case.

Let $ p = \mathit{prec}(F)$. Suppose first that $ p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)) \leq 0$. Then by Lemmas 6.5.3, 6.1.1, and 6.2.4, either

$\displaystyle \mathit{drnd}(x,{\cal R}, F) = \mathit{RTZ}(x,p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F))) = 0$

or
$\displaystyle \mathit{drnd}(x,{\cal R},F)$ $\displaystyle =$ $\displaystyle \mathit{RAZ}(x,p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))$  
  $\displaystyle =$ $\displaystyle 2^{\mathit{expo}(x) + 1 - (p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))}$  
  $\displaystyle =$ $\displaystyle 2^{\mathit{expo}(\mathit{spn}(F))-(p-1)}$  
  $\displaystyle =$ $\displaystyle \mathit{spd}(F).$  

We may assume, therefore, that $ p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)) \geq 1$, so that

$\displaystyle \mathit{expo}(x) \geq \mathit{expo}(\mathit{spn}(F))-p+1 = 1-\mathit{bias}(F)-p+1$

and hence,

$\displaystyle \mathit{expo}(x) + \mathit{bias}(F) \geq 2-p.$

Since $ x < \mathit{spn}(F) = 2^{1-\mathit{bias}(F)}$, $ \mathit{expo}(x) \leq -\mathit{bias}(F)$, i.e.,

$\displaystyle \mathit{expo}(x) + \mathit{bias}(F) \leq 0.$

By Lemma 6.5.6, $ \mathit{drnd}(x, {\cal R}, F)$ is $ (p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))$-exact. If $ \mathit{expo}(\mathit{drnd}(x,{\cal R},F)) = \mathit{expo}(x)$, then $ \mathit{drnd}(x, {\cal R}, F)$ is representable as a denormal. If not, then Lemma 6.5.10 implies $ \mathit{drnd}(x,{\cal R},F) = 2^{\mathit{expo}(x)+1}$. In this case, either $ \mathit{expo}(x) = -\mathit{bias}(F)$ and

$\displaystyle \mathit{drnd}(x,{\cal R},F) = 2^{1-\mathit{bias}(F)} = \mathit{spn}(F),$

or $ \mathit{expo}(x)+\mathit{bias}(F) < 0$ and

$\displaystyle \mathit{expo}(\mathit{drnd}(x,{\cal R},F))+\mathit{bias}(F) = 1 + \mathit{expo}(x) + \mathit{bias}(F) \leq 0,$

which implies that $ \mathit{drnd}(x, {\cal R}, F)$ is representable as a denormal. 

Lemma 6.6.3   (drnd-exactp-b) If $ x$ is representable as a denormal in $ F$ and $ {\cal R}$ is a common rounding mode, then

$\displaystyle \mathit{drnd}(x,{\cal R}, F) = x.
$

PROOF: Let $ p = \mathit{prec}(F)$. By Definition 5.3.5, $ x$ is $ (p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))$-exact. Therefore, by Lemma 6.5.7,

$\displaystyle \mathit{drnd}(x,{\cal R}, F) = {\cal R}(x, p + \mathit{expo}(x) - \mathit{expo}(\mathit{spn}(F))) = x.$

Lemma 6.6.4   (drnd-exactp-c) If $ F$ is a format, $ {\cal R}$ is a common rounding mode, and $ x \in \mathbb{R}$ with

$\displaystyle 2 - \mathit{bias}(F) - \mathit{prec}(F) \leq \mathit{expo}(x) \leq -\mathit{bias}(F),
$

then $ \mathit{drnd}(x,{\cal R}, F) = x$ iff $ x$ is $ (\mathit{expo}(x) + \mathit{bias}(F) + \mathit{prec}(F) - 1)$-exact.

PROOF: Suppose $ \mathit{drnd}(x,{\cal R}, F) = x$. Since $ 0 < \vert x\vert < \mathit{spn}(F)$, Lemma 6.6.2 guarantees that $ x$ is representable as a denormal, which implies that $ x$ is $ (\mathit{expo}(x) + \mathit{bias}(F) + \mathit{prec}(F) - 1)$-exact.

On the other hand, if $ x$ is $ (\mathit{expo}(x) + \mathit{bias}(F) + \mathit{prec}(F) - 1)$-exact, then $ x$ is representable as a denormal by Definition 5.3.5 and $ \mathit{drnd}(x,{\cal R}, F) = x$ by Lemma 6.6.3.

Lemma 6.6.5   (drnd-exactp-d, drnd-exactp-e) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ a$ be representable as a denormal in $ F$ and let $ {\cal R}$ be a common rounding mode.

(a) If $ a \geq x$, then $ a \geq \mathit{drnd}(x, {\cal R}, F)$.

(b) If $ a \leq x$, then $ a \leq \mathit{drnd}(x, {\cal R}, F)$.

PROOF: By Definition 5.3.5, $ a$ is $ (\mathit{prec}(F)+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))$-exact. The result follows from Lemma 6.5.8.


The defining characteristics of the directed rounding modes are inherited by denormal rounding.

Lemma 6.6.6   (drnd-rtz, drnd-raz, drnd-rup, drnd-rdn)
Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$.

(a) $ \vert\mathit{drnd}(x,\mathit{RTZ}, F)\vert \leq \vert x\vert$.

(b) $ \vert\mathit{drnd}(x,RAZ, F)\vert \geq \vert x\vert$.

(c) $ \mathit{drnd}(x,\mathit{RDN}, F) \leq x$.

(d) $ \mathit{drnd}(x,\mathit{RUP}, F) \geq x$.

PROOF: This is a consequence of Lemmas 6.1.5, 6.2.5, 6.5.2, and 6.5.1.

Denormal rounding error is bounded by the distance between successive representable numbers (see Lemma 5.3.5).

Lemma 6.6.7   (drnd-diff) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ {\cal R}$ be a common rounding mode. Then

$\displaystyle \vert x-\mathit{drnd}(x,{\cal R},F)\vert < \mathit{spd}(F).$

PROOF: Let $ p = \mathit{prec}(F)$. By Lemma 6.5.9,

$\displaystyle \vert x-\mathit{drnd}(x,{\cal R},F)\vert$ $\displaystyle =$ $\displaystyle \vert x - {\cal R}(x, p + \mathit{expo}(x) - \mathit{expo}(\mathit{spn}(F)))\vert$  
  $\displaystyle <$ $\displaystyle 2^{\mathit{expo}(x) + 1 - (p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))}$  
  $\displaystyle =$ $\displaystyle 2^{\mathit{expo}(\mathit{spn}(F))-(p-1)}$  
  $\displaystyle =$ $\displaystyle \mathit{spd}(F).$  
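
The statements of (drnd-exactp-a), (drnd-exactp-b), and (drnd-diff) lend themselves to a brute-force sanity check on a small format. The Python sketch below is only such a check, not a proof. The helpers `expo`, `rnd`, and `drnd`, the string names of the modes, the function `check_denormal_lemmas`, and the description of a format by precision `p` and exponent width `q` are hypothetical choices made for this example.

```python
from fractions import Fraction
from math import floor, ceil

MODES = ("RTZ", "RAZ", "RDN", "RUP", "RNE", "RNA")

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for a nonzero rational x."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def rnd(x, mode, n):
    """Round x to n bits of precision under one of the six modes in MODES."""
    if mode not in MODES:
        raise ValueError(mode)
    x = Fraction(x)
    if x == 0:
        return x
    unit = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / unit                   # equals 2**(n-1) * sig(x)
    lo, hi = floor(scaled), ceil(scaled)
    toward_zero = {"RTZ": True, "RAZ": False, "RDN": x > 0, "RUP": x < 0}
    if mode in toward_zero:
        m = lo if toward_zero[mode] else hi
    elif scaled - lo != Fraction(1, 2):      # RNE or RNA, not a tie
        m = lo if scaled - lo < Fraction(1, 2) else hi
    elif mode == "RNA":                      # tie: round away from zero
        m = hi
    else:                                    # RNE tie: choose the even multiple
        m = lo if lo % 2 == 0 else hi
    return (1 if x > 0 else -1) * m * unit

def drnd(x, mode, p, q):
    """Denormal rounding for a format with precision p and exponent width q."""
    x = Fraction(x)
    if x == 0:
        return x
    return rnd(x, mode, p + expo(x) - (2 - 2 ** (q - 1)))

def check_denormal_lemmas(p, q, grid):
    """Test every x = k/grid with 0 < |x| < spn; return the number of cases."""
    expo_spn = 2 - 2 ** (q - 1)
    spn = Fraction(2) ** expo_spn
    spd = Fraction(2) ** (expo_spn + 1 - p)
    limit = ceil(spn * grid)
    checked = 0
    for k in range(1 - limit, limit):
        if k == 0:
            continue
        x = Fraction(k, grid)
        sgn = 1 if x > 0 else -1
        for mode in MODES:
            d = drnd(x, mode, p, q)
            d_denormal = d != 0 and abs(d) < spn and (d / spd).denominator == 1
            assert d == 0 or d == sgn * spn or d_denormal   # drnd-exactp-a
            assert abs(x - d) < spd                         # drnd-diff
            if (x / spd).denominator == 1:                  # x is a denormal
                assert d == x                               # drnd-exactp-b
            checked += 1
    return checked

cases = check_denormal_lemmas(3, 3, 256)
```

With `p = 3` and `q = 3`, the grid of multiples of 1/256 contains 126 non-zero values below $ \mathit{spn}(F) = 1/4$ in absolute value, each tested under all six modes.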

Naturally, unbiased denormal rounding returns the representable number that is closest to its argument.

Lemma 6.6.8   (drnd-rne-diff, drnd-rna-diff) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ a$ be representable as a denormal in $ F$. Then

$\displaystyle \vert x-\mathit{drnd}(x,RNE, F)\vert \leq \vert x-a\vert
$

and

$\displaystyle \vert x-\mathit{drnd}(x,RNA, F)\vert \leq \vert x-a\vert.$

PROOF: By Definition 5.3.5, $ a$ is $ (\mathit{prec}(F)+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))$-exact. The result follows from Lemmas 6.3.14 and 6.3.35.

Lemma 6.6.9   (drnd-rto) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ {\cal R}$ be a common rounding mode. If

$\displaystyle n \geq \mathit{prec}(F) + \mathit{expo}(x) - \mathit{expo}(\mathit{spn}(F)) + 2,$

then

$\displaystyle \mathit{drnd}(RTO(x,n),{\cal R}, F) = \mathit{drnd}(x,{\cal R}, F).$

PROOF: See Lemma 6.5.17.


The next lemma, which pertains to the detection of floating-point underflow, warrants some motivation. Let $ x$ be the precise numerical result of an arithmetic operation, to be rounded according to a mode $ {\cal R}$ and encoded in a format $ F$ with precision $ p$. Most implementations first compute the value $ r = \mathit{rnd}(x, {\cal R}, p)$, using an internal format with a sufficiently wide exponent field to accommodate this result. According to the x86 architectural definition, underflow occurs when $ \vert r\vert < \mathit{spn}(F)$. If this occurs and the underflow mask is set, then the value $ d = \mathit{drnd}(x, {\cal R}, F)$ is computed and returned, and the underflow flag is set iff $ d \neq x$.

There are, however, implementations that compute $ d$ directly in the event that $ \vert x\vert < \mathit{spn}(F)$, without computing $ r$. The requirement of correctly setting the underflow flag then presents a problem, since $ r$ may lie below the normal range when $ d$ does not. Thus, for such an implementation, if $ \vert x\vert < \vert d\vert = \mathit{spn}(F)$, then extra logic is required to determine whether $ \vert r\vert < \mathit{spn}(F)$. On the other hand, there is no such ambiguity requiring extra logic for the case $ \vert d\vert < \mathit{spn}(F)$, since the following lemma guarantees that $ \vert r\vert < \mathit{spn}(F)$ as well.

Lemma 6.6.10   (rnd-drnd-up) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ {\cal R}$ be a common rounding mode. If $ \vert\mathit{rnd}(x,{\cal R}, \mathit{prec}(F))\vert = \mathit{spn}(F)$, then $ \vert\mathit{drnd}(x,{\cal R}, F)\vert = \mathit{spn}(F)$.

PROOF: Let $ p = \mathit{prec}(F)$. By Lemma 6.5.10, $ \mathit{expo}(x) = \mathit{expo}(\mathit{spn}(F)) - 1$, and hence

$\displaystyle p + \mathit{expo}(x) - \mathit{expo}(\mathit{spn}(F)) = p - 1.$

Since $ \mathit{drnd}(x,{\cal R}, F) = \mathit{rnd}(x,{\cal R}, p + \mathit{expo}(x) - \mathit{expo}(\mathit{spn}(F)))$, the claim follows from Lemma 6.5.18 with $ m = p-1$ and $ n = p$.

The next lemma pertains to the setting of the precision flag in the event of underflow and is relevant to the formal architectural specifications discussed in Part III: if $ d$ is exact, then so is $ r$:

Lemma 6.6.11   (rnd-drnd-exactp) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ {\cal R}$ be a common rounding mode. If $ \mathit{drnd}(x,{\cal R}, F) = x$, then $ \mathit{rnd}(x,{\cal R}, \mathit{prec}(F)) = x$.

PROOF: Let $ p = \mathit{prec}(F)$. By Lemma 6.6.2, $ x$ is representable as a denormal in $ F$, which implies that $ x$ is $ (\mathit{expo}(x) + \mathit{bias}(F) + p - 1)$-exact. Since $ \mathit{expo}(x) < \mathit{expo}(\mathit{spn}(F)) = 1 - \mathit{bias}(F)$, we have $ \mathit{expo}(x) + \mathit{bias}(F) + p - 1 < p$. By Lemma 4.2.5, $ x$ is therefore $ p$-exact, and the claim follows from Lemma 6.5.7.
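
The underflow scenario described above, together with (rnd-drnd-up) and (rnd-drnd-exactp), can be exercised numerically. The following Python sketch is a hedged illustration rather than a model of any actual hardware. The helpers `expo`, `rnd`, `drnd`, and `check_underflow_lemmas`, the string names of the modes, and the toy format with `p = 3` and `q = 3` are assumptions introduced here.

```python
from fractions import Fraction
from math import floor, ceil

MODES = ("RTZ", "RAZ", "RDN", "RUP", "RNE", "RNA")

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for a nonzero rational x."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def rnd(x, mode, n):
    """Round x to n bits of precision under one of the six modes in MODES."""
    if mode not in MODES:
        raise ValueError(mode)
    x = Fraction(x)
    if x == 0:
        return x
    unit = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / unit                   # equals 2**(n-1) * sig(x)
    lo, hi = floor(scaled), ceil(scaled)
    toward_zero = {"RTZ": True, "RAZ": False, "RDN": x > 0, "RUP": x < 0}
    if mode in toward_zero:
        m = lo if toward_zero[mode] else hi
    elif scaled - lo != Fraction(1, 2):      # RNE or RNA, not a tie
        m = lo if scaled - lo < Fraction(1, 2) else hi
    elif mode == "RNA":                      # tie: round away from zero
        m = hi
    else:                                    # RNE tie: choose the even multiple
        m = lo if lo % 2 == 0 else hi
    return (1 if x > 0 else -1) * m * unit

def drnd(x, mode, p, q):
    """Denormal rounding for a format with precision p and exponent width q."""
    x = Fraction(x)
    if x == 0:
        return x
    return rnd(x, mode, p + expo(x) - (2 - 2 ** (q - 1)))

def check_underflow_lemmas(p, q, grid):
    """Test rnd-drnd-up and rnd-drnd-exactp for x = k/grid, 0 < |x| < spn."""
    spn = Fraction(2) ** (2 - 2 ** (q - 1))
    limit = ceil(spn * grid)
    checked = 0
    for k in range(1 - limit, limit):
        if k == 0:
            continue
        x = Fraction(k, grid)
        for mode in MODES:
            r, d = rnd(x, mode, p), drnd(x, mode, p, q)
            if abs(r) == spn:
                assert abs(d) == spn         # rnd-drnd-up
            if d == x:
                assert r == x                # rnd-drnd-exactp
            checked += 1
    return checked

# The problematic case: d reaches spn = 1/4 although r stays below it.
x = Fraction(29, 128)
r = rnd(x, "RNE", 3)
d = drnd(x, "RNE", 3, 3)
cases = check_underflow_lemmas(3, 3, 256)
```

For $ x = 29/128$ under $ \mathit{RNE}$, the sketch yields $ r = 7/32 < \mathit{spn}(F)$ while $ d = 1/4 = \mathit{spn}(F)$, which is the situation in which an implementation that computes only $ d$ needs extra logic to detect underflow.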

We have the following alternative formulation of drnd.

Lemma 6.6.12   (drnd-rewrite) Let $ F$ be a format and let $ x \in \mathbb{R}$, $ \vert x\vert < \mathit{spn}(F)$. Let $ {\cal R}$ be a common rounding mode. Then

$\displaystyle \mathit{drnd}(x,{\cal R}, F) = {\cal R}(x + \mathit{sgn}(x)\mathit{spn}(F), \mathit{prec}(F)) - sgn(x)\mathit{spn}(F).$

PROOF: Let $ p = \mathit{prec}(F)$. We first consider the case $ x \geq 0$ and apply Lemma 6.5.13, substituting $ \mathit{spn}(F)$ for $ x$, $ x$ for $ y$, and $ p + \mathit{expo}(x) - \mathit{expo}(\mathit{spn}(F))$ for $ k$. Thus,

$\displaystyle k' = k + \mathit{expo}(\mathit{spn}(F)) - \mathit{expo}(x) = p > 1,$

and $ \mathit{spn}(F)$ is $ k'$-exact by Lemma 4.2.6. Since

$\displaystyle 2^{\mathit{expo}(\mathit{spn}(F))} = \mathit{spn}(F) \leq \mathit{spn}(F) + x < 2 \cdot \mathit{spn}(F) = 2^{\mathit{expo}(\mathit{spn}(F))+1},$

$ \mathit{expo}(\mathit{spn}(F)+x) = \mathit{expo}(\mathit{spn}(F))$ and therefore

$\displaystyle k'' = k + \mathit{expo}(\mathit{spn}(F)+x) - \mathit{expo}(x) = p$

as well. Thus, we have

$\displaystyle \mathit{spn}(F) + {\cal R}(x,k) = {\cal R}(\mathit{spn}(F) + x,k'')$

and
$\displaystyle \mathit{drnd}(x,{\cal R},F)$ $\displaystyle =$ $\displaystyle {\cal R}(x,p+\mathit{expo}(x)-\mathit{expo}(\mathit{spn}(F)))$  
  $\displaystyle =$ $\displaystyle {\cal R}(x,k)$  
  $\displaystyle =$ $\displaystyle {\cal R}(\mathit{spn}(F)+x,k'') - \mathit{spn}(F)$  
  $\displaystyle =$ $\displaystyle {\cal R}(\mathit{spn}(F)+x,p) - \mathit{spn}(F).$  

The result may be extended to $ x<0$ by invoking Lemmas 6.5.5 and 6.6.1: if $ \hat{\cal R}$ is defined as in these lemmas, then
$\displaystyle \mathit{drnd}(x,{\cal R}, F)$ $\displaystyle =$ $\displaystyle -\mathit{drnd}(-x, \hat{\cal R}, F)$  
  $\displaystyle =$ $\displaystyle -(\hat{\cal R}(-x + sgn(-x)\mathit{spn}(F), p) - sgn(-x)\mathit{spn}(F))$  
  $\displaystyle =$ $\displaystyle -\hat{\cal R}(-x + sgn(-x)\mathit{spn}(F), p) + sgn(-x)\mathit{spn}(F)$  
  $\displaystyle =$ $\displaystyle {\cal R}(x+sgn(x)\mathit{spn}(F), p) - sgn(x)\mathit{spn}(F).$  
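
The alternative formulation (drnd-rewrite) can be compared against the definition on a small format. The Python sketch below is a sanity check under stated assumptions, not a substitute for the proof. The helpers `expo`, `rnd`, `drnd`, and `check_rewrite`, the string names of the modes, and the description of a format by precision `p` and exponent width `q` are hypothetical names chosen for the example.

```python
from fractions import Fraction
from math import floor, ceil

MODES = ("RTZ", "RAZ", "RDN", "RUP", "RNE", "RNA")

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for a nonzero rational x."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def rnd(x, mode, n):
    """Round x to n bits of precision under one of the six modes in MODES."""
    if mode not in MODES:
        raise ValueError(mode)
    x = Fraction(x)
    if x == 0:
        return x
    unit = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / unit                   # equals 2**(n-1) * sig(x)
    lo, hi = floor(scaled), ceil(scaled)
    toward_zero = {"RTZ": True, "RAZ": False, "RDN": x > 0, "RUP": x < 0}
    if mode in toward_zero:
        m = lo if toward_zero[mode] else hi
    elif scaled - lo != Fraction(1, 2):      # RNE or RNA, not a tie
        m = lo if scaled - lo < Fraction(1, 2) else hi
    elif mode == "RNA":                      # tie: round away from zero
        m = hi
    else:                                    # RNE tie: choose the even multiple
        m = lo if lo % 2 == 0 else hi
    return (1 if x > 0 else -1) * m * unit

def drnd(x, mode, p, q):
    """Denormal rounding for a format with precision p and exponent width q."""
    x = Fraction(x)
    if x == 0:
        return x
    return rnd(x, mode, p + expo(x) - (2 - 2 ** (q - 1)))

def check_rewrite(p, q, grid):
    """Compare drnd with the shifted rounding for x = k/grid, 0 < |x| < spn."""
    spn = Fraction(2) ** (2 - 2 ** (q - 1))
    limit = ceil(spn * grid)
    checked = 0
    for k in range(1 - limit, limit):
        if k == 0:
            continue
        x = Fraction(k, grid)
        shift = spn if x > 0 else -spn       # sgn(x) * spn(F)
        for mode in MODES:
            assert drnd(x, mode, p, q) == rnd(x + shift, mode, p) - shift
            checked += 1
    return checked

cases = check_rewrite(3, 3, 256)
```

Adding $ \mathit{sgn}(x)\mathit{spn}(F)$ moves the argument into the lowest normal binade, where rounding to $ \mathit{prec}(F)$ bits uses the same spacing $ \mathit{spd}(F)$ as denormal rounding, which is why the two computations agree in every case tested.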

Lemma 6.6.12 is used in the proof of the final lemma of this chapter, which pertains to the rounding of numbers smaller than the smallest representable denormal.

Lemma 6.6.13   (drnd-tiny-a, drnd-tiny-b, drnd-tiny-c)
Let $ F$ be a format, $ x \in \mathbb{R}$, and $ {\cal R}$ a common rounding mode.

(a) If $ 0 < x < \frac{1}{2}\mathit{spd}(F)$, then $ \mathit{drnd}(x,{\cal R}, F) = \left\{\begin{array}{ll}
\mathit{spd}(F) & \mbox{if ${\cal R} = RAZ$ or ${\cal R} = RUP$}\\
0 & \mbox{otherwise;}\end{array}\right.$

(b) If $ x = \frac{1}{2}\mathit{spd}(F)$, then $ \mathit{drnd}(x,{\cal R}, F) = \left\{\begin{array}{ll}
\mathit{spd}(F) & \mbox{if ${\cal R} = RAZ$, ${\cal R} = RUP$, or ${\cal R} = RNA$}\\
0 & \mbox{otherwise;}\end{array}\right.$

(c) If $ \frac{1}{2}\mathit{spd}(F) < x < \mathit{spd}(F)$, then $ \mathit{drnd}(x,{\cal R}, F) = \left\{\begin{array}{ll}
0 & \mbox{if ${\cal R} = RTZ$ or ${\cal R} = RDN$}\\
\mathit{spd}(F) & \mbox{otherwise.}\end{array}\right.$

PROOF: Let $ p = \mathit{prec}(F)$, $ q = \mathit{expw}(F)$, $ a = \mathit{spn}(F) = 2^{2-2^{q-1}}$ and

$\displaystyle b = \mathit{fp}^+(a,p) = a + 2^{\mathit{expo}(\mathit{spn}(F))+1-p} = a + 2^{3-2^{q-1}-p} = a + \mathit{spd}(F).$

By Lemma 6.6.12,

$\displaystyle \mathit{drnd}(x,{\cal R},F) = {\cal R}(a+x,p) - a.
$

Case 1: $ {\cal R} = RAZ$ or $ {\cal R} = \mathit{RUP}$

By Lemma 6.2.5,

$\displaystyle {\cal R}(a+x,p) = RAZ(a+x,p) \geq a+x > a,$

and hence, by Lemmas 6.2.11 and 4.2.16,

$\displaystyle \mathit{RAZ}(a+x,p) \geq b.$

On the other hand, since

$\displaystyle b = a + \mathit{spd}(F) > a + x,
$

Lemma 6.2.14 implies $ b \geq RAZ(a+x,p)$, and therefore $ RAZ(a+x,p) = b$, and

$\displaystyle \mathit{drnd}(x,{\cal R},F) = {\cal R}(a+x,p) - a = b - a = \mathit{spd}(F).
$

Case 2: $ {\cal R} = \mathit{RTZ}$ or $ {\cal R} = \mathit{RDN}$

First note that by Lemma 4.2.20,

$\displaystyle fp^-(b,p) = fp^-(fp^+(a,p),p) = a.$

Now by Lemma 6.1.5,

$\displaystyle {\cal R}(a+x,p) = \mathit{RTZ}(a+x,p) \leq a+x < b,$

and hence, by Lemmas 6.1.9 and 4.2.21,

$\displaystyle \mathit{RTZ}(a+x,p) \leq a.$

On the other hand, Lemma 6.1.11 implies $ a \leq \mathit{RTZ}(a+x,p)$, and therefore $ \mathit{RTZ}(a+x,p) = a$ and

$\displaystyle \mathit{drnd}(x,{\cal R},F) = {\cal R}(a+x,p) - a = a - a = 0.
$

Case 3: $ {\cal R} = RNE$ or $ {\cal R} = RNA$

By Lemmas 6.3.1 and 6.3.22, $ {\cal R}(a+x,p)$ is either $ a$ or $ b$, and hence $ \mathit{drnd}(x, {\cal R}, F)$ is either 0 or $ \mathit{spd}(F)$, respectively.

Since

$\displaystyle \vert(a+x) - RAZ(a+x,p)\vert = b - (a + x) = \mathit{spd}(F) - x
$

and

$\displaystyle \vert(a+x) - \mathit{RTZ}(a+x,p)\vert = \vert(a+x)-a\vert = x,
$

the claims (a) and (c) follow from Lemmas 6.3.10 and 6.3.31. For the proof of (b), suppose $ x = \frac{1}{2}\mathit{spd}(F)$. Then

$\displaystyle a+x = 2^{2-2^{q-1}} + 2^{2-2^{q-1}-p} = 2^{2-2^{q-1}}(1 + 2^{-p})
$

and $ \mathit{sig}(a+x) = 1 + 2^{-p}$. Thus, $ a+x$ is $ (p+1)$-exact but not $ p$-exact. The case $ {\cal R} = \mathit{RNA}$ now follows from Lemma 6.3.21, and since $ b$ is not $ (p-1)$-exact, the case $ {\cal R} = \mathit{RNE}$ follows from Lemma 6.3.39.
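
The three cases of (drnd-tiny-a), (drnd-tiny-b), and (drnd-tiny-c) can be tabulated for a concrete format by listing, for a sample $ x$ in each range, the modes under which the result is $ \mathit{spd}(F)$ rather than 0. The Python sketch below does this for a toy format with `p = 3` and `q = 3`. The helpers `expo`, `rnd`, `drnd`, and `modes_rounding_up`, the string names of the modes, and the choice of format are assumptions made for illustration.

```python
from fractions import Fraction
from math import floor, ceil

MODES = ("RTZ", "RAZ", "RDN", "RUP", "RNE", "RNA")

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for a nonzero rational x."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def rnd(x, mode, n):
    """Round x to n bits of precision under one of the six modes in MODES."""
    if mode not in MODES:
        raise ValueError(mode)
    x = Fraction(x)
    if x == 0:
        return x
    unit = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / unit                   # equals 2**(n-1) * sig(x)
    lo, hi = floor(scaled), ceil(scaled)
    toward_zero = {"RTZ": True, "RAZ": False, "RDN": x > 0, "RUP": x < 0}
    if mode in toward_zero:
        m = lo if toward_zero[mode] else hi
    elif scaled - lo != Fraction(1, 2):      # RNE or RNA, not a tie
        m = lo if scaled - lo < Fraction(1, 2) else hi
    elif mode == "RNA":                      # tie: round away from zero
        m = hi
    else:                                    # RNE tie: choose the even multiple
        m = lo if lo % 2 == 0 else hi
    return (1 if x > 0 else -1) * m * unit

def drnd(x, mode, p, q):
    """Denormal rounding for a format with precision p and exponent width q."""
    x = Fraction(x)
    if x == 0:
        return x
    return rnd(x, mode, p + expo(x) - (2 - 2 ** (q - 1)))

def modes_rounding_up(x, p, q):
    """Return the set of modes R for which drnd(x, R, F) equals spd(F)."""
    spd = Fraction(2) ** (3 - 2 ** (q - 1) - p)
    return {mode for mode in MODES if drnd(x, mode, p, q) == spd}

spd = Fraction(1, 16)                          # spd(F) for p = 3, q = 3
below = modes_rounding_up(spd / 4, 3, 3)       # case (a): 0 < x < spd/2
tie = modes_rounding_up(spd / 2, 3, 3)         # case (b): x = spd/2
above = modes_rounding_up(3 * spd / 4, 3, 3)   # case (c): spd/2 < x < spd
```

The three sets grow from $ \{\mathit{RAZ}, \mathit{RUP}\}$ in case (a), by the addition of $ \mathit{RNA}$ in case (b), and by the further addition of $ \mathit{RNE}$ in case (c), in agreement with the lemma.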


As a consequence of Lemma 6.6.13 (a), for any given rounding mode, two sufficiently small numbers produce the same rounded result.

Corollary 6.6.14   (drnd-tiny-equal) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$. Let $ F$ be a format and $ {\cal R}$ a common rounding mode. If

$\displaystyle 0 < x < \frac{1}{2}\mathit{spd}(F)$

and

$\displaystyle 0 < y < \frac{1}{2}\mathit{spd}(F),$

then

$\displaystyle \mathit{drnd}(x,{\cal R}, F) = \mathit{drnd}(y,{\cal R}, F).$

David Russinoff 2017-08-01