6.3  Unbiased Rounding

Next, we examine the mode that the IEEE standard designates as “round to nearest”, which may round in either direction, selecting the representable number that is closest to its argument. This mode is also sometimes called “round to nearest even” because of the manner in which it resolves the ambiguous case of a midpoint, i.e., a number that is equidistant from two successive representable numbers.

Definition 6.3.1   (rne) Given $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, let $ z = \lfloor 2^{n-1}sig(x)\rfloor$, and $ f = 2^{n-1}sig(x) - z$. Then

$\displaystyle RNE(x,n) = \left\{\begin{array}{ll}
\mathit{RTZ}(x,n) & \mbox{if $f < 1/2$}\\
RAZ(x,n) & \mbox{if $f > 1/2$}\\
\mathit{RTZ}(x,n) & \mbox{if $f = 1/2$ and $z$ is even}\\
RAZ(x,n) & \mbox{if $f = 1/2$ and $z$ is odd}. \end{array} \right. $

Example: Let $ x = (101.101)_2$ and $ n = 5$. Then

$\displaystyle z = \lfloor 2^{n-1}sig(x) \rfloor = \lfloor (10110.1)_2 \rfloor
= (10110)_2$

and

$\displaystyle f = 2^{n-1}sig(x) - z = (10110.1)_2 - (10110)_2 = (0.1)_2 = 1/2,$

indicating a “tie”, i.e., that $ x$ is equidistant from two successive 5-exact numbers. Since $ z$ is even, the tie is broken in favor of the lesser of the two:

$\displaystyle RNE(x,n) = \mathit{RTZ}(x,n) = (101.1)_2.$
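The example can be replayed with exact rational arithmetic. The following Python sketch is illustrative only: the helper names (`expo`, `scaled`, `sgn`, `rtz`, `raz`, `rne`) are ours, and `rtz` and `raz` assume that $ \mathit{RTZ}(x,n) = sgn(x)\lfloor 2^{n-1}sig(x)\rfloor 2^{expo(x)+1-n}$ and that RAZ has the same form with a ceiling in place of the floor, which is consistent with the formulas used later in this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: z is the integer part, f the fractional part.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

x = Fraction(0b101101, 8)   # (101.101)_2 = 45/8
z = floor(scaled(x, 5))     # (10110)_2 = 22
f = scaled(x, 5) - z        # 1/2, i.e., a tie
result = rne(x, 5)          # z is even, so the tie is resolved by RTZ
```

Here `result` is `Fraction(11, 2)`, i.e., $ (101.1)_2$, while `raz(x, 5)` is `Fraction(23, 4)`, i.e., $ (101.11)_2$.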

As with all rounding modes, the value of RNE is always that of either $ \mathit{RTZ}$ or RAZ.

Lemma 6.3.1   (rne-choice) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, either

$\displaystyle RNE(x,n) = \mathit{RTZ}(x,n)$

or

$\displaystyle RNE(x,n) = RAZ(x,n).$

We list several properties of RNE that may be derived from Lemma 6.3.1 and the corresponding properties of $ \mathit{RTZ}$ and RAZ.

Lemma 6.3.2   (sgn-rne) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$, then

$\displaystyle sgn(RNE(x,n)) = sgn(x).$

Lemma 6.3.3   (rne-exactp-a) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, $ RNE(x,n)$ is $ n$-exact.

Lemma 6.3.4   (rne-exactp-b) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$ and $ x$ is $ n$-exact, then

$\displaystyle RNE(x,n) = x.$

Lemma 6.3.5   (rne-exactp-c,rne-exactp-d) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$, $ n > 0$. Suppose $ a$ is $ n$-exact.

(a) If $ a \geq x$, then $ a \geq RNE(x,n)$.
(b) If $ a \leq x$, then $ a \leq RNE(x,n)$.

Lemma 6.3.6   (expo-rne) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, if $ \vert RNE(x,n)\vert \neq 2^{expo(x)+1}$, then

$\displaystyle expo(RNE(x,n)) = expo(x).$

Lemma 6.3.7   (rne<=raz, rne>=rtz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ x \geq 0$, then

$\displaystyle \mathit{RTZ}(x,n) \leq RNE(x,n) \leq RAZ(x,n).$

PROOF: This follows from Lemma 6.3.1 and Lemmas 6.1.5 and 6.2.5, which together ensure that $ \mathit{RTZ}(x,n) \leq RAZ(x,n)$.

Lemma 6.3.8   (rne-shift) For all $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ k \in \mathbb{Z}$,

$\displaystyle RNE(2^kx,n) = 2^k RNE(x,n).$

PROOF: It is clear from Definition 6.3.1 that the choice between $ \mathit{RTZ}(x,n)$ and $ RAZ(x,n)$ depends only on $ sig(x)$. Thus, for example, if $ RNE(x,n) = \mathit{RTZ}(x,n)$, then since $ sig(2^kx) = sig(x)$, $ RNE(2^kx,n) = \mathit{RTZ}(2^kx,n)$ as well, and by Lemma 6.1.4,

$\displaystyle RNE(2^kx,n) = \mathit{RTZ}(2^kx,n) = 2^k\mathit{RTZ}(x,n) = 2^k RNE(x,n).$

Lemma 6.3.9   (rne-minus) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle RNE(-x,n) = -RNE(x,n).$

PROOF: This may be derived from Lemmas 6.1.3 and 6.2.2 by following the same reasoning as used in the proof of Lemma 6.3.8.
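The shift and negation identities are easy to spot-check numerically. The sketch below is not a proof and is only as reliable as its assumptions: the helper names are ours, and `rtz` and `raz` assume the floor and ceiling forms of RTZ and RAZ that appear in the formulas of this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: ties go to RTZ when z is even, otherwise to RAZ.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def shift_and_minus_hold(x, n, k):
    """Check RNE(2^k x, n) = 2^k RNE(x, n) and RNE(-x, n) = -RNE(x, n)."""
    p = Fraction(2) ** k
    return rne(p * x, n) == p * rne(x, n) and rne(-x, n) == -rne(x, n)

samples = [Fraction(i, 16) for i in range(1, 100)]
all_ok = all(shift_and_minus_hold(x, n, k)
             for x in samples for n in (1, 2, 3, 4) for k in range(-3, 4))
```

Both identities depend on the fact that the choice between RTZ and RAZ is made from $ sig(x)$ alone, which neither scaling by a power of 2 nor negation changes.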


In the computation of $ RNE(x,n)$, the choice between $ \mathit{RTZ}(x,n)$ and $ RAZ(x,n)$ is governed by their relative distances from $ x$.

Lemma 6.3.10   (rne-rtz, rne-raz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$.

(a) If $ \vert x - \mathit{RTZ}(x,n)\vert < \vert x - RAZ(x,n)\vert$, then $ RNE(x,n) = \mathit{RTZ}(x,n)$.

(b) If $ \vert x - \mathit{RTZ}(x,n)\vert > \vert x - RAZ(x,n)\vert$, then $ RNE(x,n) = RAZ(x,n)$.

PROOF: We may assume that $ 2^{n-1}sig(x) \notin \mathbb{Z}$, for otherwise

$\displaystyle \mathit{RTZ}(x,n) = RAZ(x,n) = RNE(x,n) = x.$

Let $ f = 2^{n-1}sig(x)-\lfloor 2^{n-1}sig(x)\rfloor$. Then
$\displaystyle \begin{array}{rcl}
\vert x - \mathit{RTZ}(x,n)\vert & = & \vert x\vert - \vert\mathit{RTZ}(x,n)\vert\\
 & = & 2^{expo(x)+1-n}(2^{n-1}sig(x)-\lfloor 2^{n-1}sig(x)\rfloor)\\
 & = & 2^{expo(x)+1-n}f
\end{array}$

and
$\displaystyle \begin{array}{rcl}
\vert x - RAZ(x,n)\vert & = & \vert RAZ(x,n)\vert - \vert x\vert\\
 & = & 2^{expo(x)+1-n}(\lceil 2^{n-1}sig(x) \rceil - 2^{n-1}sig(x))\\
 & = & 2^{expo(x)+1-n}(1-f).
\end{array}$

Thus, (a) and (b) correspond to $ f < 1/2$ and $ f > 1/2$, respectively. 

Lemma 6.3.11   (rne-down, rne-up) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$. Assume that $ n > 0$, $ a>0$, and $ a$ is $ n$-exact.

(a) If $ a \leq x < a + 2^{expo(a)-n}$, then $ RNE(x, n) = a$.

(b) If $ a + 2^{expo(a)-n} < x \leq fp^+(a, n)$, then $ RNE(x, n) = fp^+(a, n)$.

PROOF: In either case, by Lemmas 6.1.12 and 6.2.15, $ \mathit{RTZ}(x, n) = a$ and $ RAZ(x, n) = fp^+(a, n) = a + 2^{expo(a)+1-n}$. The claims follow from Lemma 6.3.10.


Lemma 6.3.12   (rne-up-2) Let $ x \in \mathbb{R}$, $ k \in \mathbb{N}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$, with $ 0 < m < n$ and $ \vert x\vert < 2^k$. If $ \vert\mathit{RNE}(x, n)\vert = 2^k$, then $ \vert\mathit{RNE}(x, m)\vert = 2^k$.

PROOF: We may assume that $ x>0$. By Lemmas 4.2.19 and 6.3.5, $ x > fp^-(2^k, n)$. Let $ a = fp^-(2^k, m)$. Then $ \mathit{expo}(a) = k-1$. Since $ 2^{k-n} \leq 2^{k-m-1} = 2^{k-m} - 2^{k-m-1}$,

$\displaystyle fp^-(2^k,n) = 2^k - 2^{k-n} \geq 2^k - (2^{k-m} - 2^{k-m-1}) = a + 2^{\mathit{expo}(a)-m}$

and the claim follows from Lemma 6.3.11(b). 

No $ n$-exact number can be closer to $ x$ than is $ RNE(x,n)$.

Lemma 6.3.13   (rne-nearest) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ n > 0$ and $ y$ is $ n$-exact, then

$\displaystyle \vert x-y\vert \geq \vert x-RNE(x,n)\vert.$

PROOF: Assume $ \vert x-y\vert < \vert x-RNE(x,n)\vert$. We shall only consider the case $ x>0$, as the case $ x<0$ is handled similarly.

First suppose $ RNE(x,n) = \mathit{RTZ}(x,n)$. Since $ RNE(x,n) \leq x$, we must have $ y > RNE(x,n)$ and hence $ y > x$ by Lemma 6.1.11. But since $ RAZ(x,n) - x \geq x-RNE(x,n)$ by Lemma 6.3.10, we also have $ y < RAZ(x,n)$, and hence $ y < x$ by Lemma 6.2.14.

In the remaining case, $ RNE(x,n) = RAZ(x,n) > x$. Now $ y < RNE(x,n)$ and by Lemma 6.2.14, $ y < x$. But in this case, Lemma 6.3.10 implies $ y > \mathit{RTZ}(x,n)$, and hence $ y > x$ by Lemma 6.1.11.
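The nearest-number property can be checked by brute force for a small precision. The following sketch enumerates the positive $ n$-exact numbers in a range of exponents and compares the smallest distance to $ x$ with the distance achieved by RNE. The helper names are ours, `rtz` and `raz` assume the floor and ceiling forms used in this section, and `n_exact_numbers` assumes that a positive number is $ n$-exact when it has the form $ m2^{e+1-n}$ with $ 2^{n-1} \leq m < 2^n$.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: ties go to RTZ when z is even, otherwise to RAZ.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def n_exact_numbers(n, e_min, e_max):
    """Positive n-exact numbers with exponent between e_min and e_max."""
    return [m * Fraction(2) ** (e + 1 - n)
            for e in range(e_min, e_max + 1)
            for m in range(2 ** (n - 1), 2 ** n)]

n = 3
candidates = n_exact_numbers(n, -4, 5)
samples = [Fraction(i, 64) for i in range(16, 1024)]   # 1/4 <= x < 16
closest_ok = all(min(abs(x - y) for y in candidates) == abs(x - rne(x, n))
                 for x in samples)
```

The exponent range is chosen wide enough that every candidate that could be nearest to a sample, including $ RNE(x,n)$ itself, appears in the list.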


Consequently, the maximum RNE rounding error is half the distance between successive representable numbers.

Lemma 6.3.14   (rne-diff) If $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ n > 0$, then

$\displaystyle \vert x-RNE(x,n)\vert \leq 2^{expo(x)-n}.$

PROOF: By Lemma 6.3.9, we may assume $ x>0$. Let

$\displaystyle a = \mathit{RTZ}(x,n)+2^{expo(x)+1-n} = \mathit{fp}^+(\mathit{RTZ}(x,n),n).$

If the statement fails, then since $ \mathit{RTZ}(x,n)$ and $ RAZ(x,n)$ are both $ n$-exact, Lemma 6.3.13 implies

$\displaystyle \mathit{RTZ}(x,n) < x-2^{expo(x)-n} < x+2^{expo(x)-n} < RAZ(x,n),$

and hence $ a < RAZ(x,n)$. Then by Lemmas 4.2.16 and 6.2.14, we have $ a<x$, contradicting Lemma 6.1.7.


Lemma 6.3.15   (rne-diff-cor) If $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ n > 0$, then

$\displaystyle \vert x-RNE(x,n)\vert \leq 2^{-n}\vert x\vert.$

PROOF: This follows from Lemma 6.3.14 and Definition 4.1.1.
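Both error bounds can be observed on sample inputs. The sketch below checks the absolute bound $ 2^{expo(x)-n}$ and the relative bound $ 2^{-n}\vert x\vert$ over a grid of rationals. It is a numerical illustration under our own helper names, with `rtz` and `raz` assuming the floor and ceiling forms used in this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: ties go to RTZ when z is even, otherwise to RAZ.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def within_bounds(x, n):
    """Absolute bound 2^(expo(x)-n) and relative bound 2^(-n)|x|."""
    err = abs(x - rne(x, n))
    return (err <= Fraction(2) ** (expo(x) - n)
            and err <= Fraction(2) ** (-n) * abs(x))

samples = [Fraction(i, 37) for i in range(-400, 401) if i != 0]
bounds_ok = all(within_bounds(x, n) for x in samples for n in range(1, 7))
worst_relative = max(abs(x - rne(x, 3)) / abs(x) for x in samples)
```

The relative bound is the weaker of the two, since $ 2^{expo(x)} \leq \vert x\vert$.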

Lemma 6.3.16   (rne-force) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ x \neq 0$, $ n > 0$, $ y$ is $ n$-exact, and $ \vert x - y\vert < 2^{\mathit{expo}(x)-n}$, then $ y = \mathit{RNE}(x,n)$.

PROOF: We shall consider the case $ x>0$; the case $ x<0$ then follows from Lemma 6.3.9.

Let $ e = \mathit{expo}(x)$ and $ z = \mathit{RNE}(x,n)$. By Lemma 6.3.6, $ \mathit{expo}(z) \geq e$. We also have $ \mathit{expo}(y) \geq e$, for otherwise $ y < 2^e$ and by Lemma 4.2.21,

$\displaystyle y \leq \mathit{fp}^-(2^e,n) = 2^e - 2^{e-n} \leq x - 2^{e-n},$

contradicting $ \vert x-y\vert < 2^{e-n}$. By Lemma 6.3.14,

$\displaystyle \vert y-z\vert \leq \vert x-y\vert + \vert x-z\vert < 2^{e-n} + 2^{e-n} = 2^{e+1-n}.$

If $ y<z$, then since

$\displaystyle z < y + 2^{e+1-n} \leq y + 2^{\mathit{expo}(y)+1-n} = \mathit{fp}^+(y, n),$

Lemma 4.2.16 yields a contradiction. If $ z<y$, then a similar contradiction arises, and therefore $ y=z$.

Lemma 6.3.17   (rne-monotone) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ 0 \leq x \leq y$ and $ n > 0$, then

$\displaystyle RNE(x,n) \leq RNE(y,n).$

PROOF: Suppose $ x < y$ and $ RNE(x,n) > RNE(y,n)$. Then Lemma 6.3.13 implies $ x>RNE(y,n)$, for otherwise $ x \leq RNE(y,n) < RNE(x,n)$ and $ \vert x-RNE(y,n)\vert < \vert x-RNE(x,n)\vert$. Similarly, $ y < RNE(x,n)$, and thus $ RNE(y,n) < x \leq y < RNE(x,n)$. Applying Lemma 6.3.13 again, we have $ x-RNE(y,n) \geq RNE(x,n) - x$, and hence $ 2x \geq RNE(x,n) + RNE(y,n)$. Similarly, $ RNE(x,n)-y \geq y-RNE(y,n)$, and hence $ RNE(x,n)+RNE(y,n) \geq 2y$. Consequently, $ 2x \geq 2y$, contradicting $ x < y$.
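Monotonicity can be observed by rounding an increasing sequence of nonnegative inputs and confirming that the results never decrease. This is a sketch under our own helper names, with `rtz` and `raz` assuming the floor and ceiling forms used in this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: ties go to RTZ when z is even, otherwise to RAZ.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

samples = [Fraction(i, 29) for i in range(0, 600)]   # increasing, x >= 0
rounded = [rne(x, 3) for x in samples]
monotone_ok = all(a <= b for a, b in zip(rounded, rounded[1:]))
```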


A midpoint with respect to a precision $ n$ may be characterized as a number that is $ (n+1)$-exact but not $ n$-exact. By virtue of the following result, the term rounding boundary is sometimes used as well.

Lemma 6.3.18   (rne-rne-lemma) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ 0<x<y$ and $ RNE(x,n) \neq RNE(y,n)$, then for some $ a \in \mathbb{R}$, $ x \leq a \leq y$ and $ a$ is $ (n+1)$-exact.

PROOF: By Lemma 6.3.17, $ RNE(x,n) < RNE(y,n)$. Let $ e = expo(RNE(x,n))$,

$\displaystyle a = \mathit{fp}^+(RNE(x,n),n+1) = RNE(x,n) + 2^{e-n}$

and

$\displaystyle b = \mathit{fp}^+(RNE(x,n),n) = RNE(x,n) + 2^{e+1-n} = a+2^{e-n}.$

Since $ RNE(x,n)$ and $ RNE(y,n)$ are both $ n$-exact by Lemma 6.3.3, Lemma 4.2.15 implies that $ a$ is $ (n+1)$-exact and $ b$ is $ n$-exact and consequently, by Lemma 4.2.16, $ a$ is not $ n$-exact and $ RNE(y,n) \geq b$. Thus,

$\displaystyle RNE(x,n) < a < b \leq RNE(y,n).$

Moreover, $ x \leq a \leq y$, for if $ x > a$, then $ \vert x-b\vert < \vert x-RNE(x,n)\vert$, contradicting Lemma 6.3.13, and similarly, if $ y < a$, then $ \vert y-RNE(x,n)\vert < \vert y-RNE(y,n)\vert$.

Lemma 6.3.19   (rne-rne) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, $ a \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ k \in \mathbb{N}$ with $ n \geq k > 0$. If $ a$ is $ (n+1)$-exact, $ 0<a<x$, and $ 0 < y <a+2^{expo(a)-n}$, then $ RNE(x,k) \geq RNE(y,k)$.

PROOF: By Lemma 6.3.17, we may assume $ x < y$, so that $ a<x<y<a+2^{expo(a)-n}$. By Lemmas 4.2.15 and 4.2.16, $ a$ and $ a+2^{expo(a)-n}$ are successive $ (n+1)$-exact numbers, and hence $ RNE(x,k) = RNE(y,k)$ by Lemma 6.3.18.


We also have the following partial converse of Lemma 6.3.18.

Lemma 6.3.20   (rne-boundary) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ n > 0$, $ 0<x <a < y$ and $ a$ is $ (n+1)$-exact but not $ n$-exact, then

$\displaystyle RNE(x,n) < RNE(y,n).$

PROOF: Let $ e = expo(a)$. Since $ a$ is not $ n$-exact, $ a \neq 2^e$. Let

$\displaystyle b = \mathit{fp}^-(a,n+1) = a - 2^{e-n}.$

By Lemma 4.2.20,

$\displaystyle \mathit{fp}^+(b,n+1) = b + 2^{expo(b)-n} = a = b + 2^{e-n},$

and it follows that $ expo(b) = e$.

By hypothesis, $ 2^{n-e}a \in \mathbb{Z}$ but $ 2^{n-1-e}a \notin \mathbb{Z}$, and therefore, $ 2^{n-e}a$ is odd, i.e., $ 2^{n-e}a = 2k+1$, where $ k \in \mathbb{Z}$. Now since

$\displaystyle 2^{n-1-e}b = 2^{n-1-e}(a-2^{e-n}) = \frac{1}{2}(2^{n-e}a-1) = k \in \mathbb{Z},$

$ b$ is $ n$-exact. Let

$\displaystyle c = \mathit{fp}^+(b,n) = b + 2^{e+1-n} = a + 2^{e-n}.$

By Lemma 4.2.15, $ c$ is $ n$-exact as well.

If $ RNE(x,n) > b$, then by Lemma 4.2.16, $ RNE(x,n) \geq c$, which implies $ \vert b-x\vert < \vert RNE(x,n) - x\vert$, contradicting Lemma 6.3.13. Similarly, if $ RNE(y,n) < c$, then $ RNE(y,n) \leq b$ and $ \vert c-y\vert < \vert RNE(y,n) - y\vert$. Thus,

$\displaystyle RNE(x,n) \leq b < a < c \leq RNE(y,n).$

The meaning of “round to nearest even” is that in the case of a midpoint $ x$, $ RNE(x,n)$ is defined to be the “rounder” of the two nearest $ n$-exact numbers, i.e., the one that is $ (n-1)$-exact.

Lemma 6.3.21   (rne-midpoint) Let $ n \in \mathbb{N}$, $ n>1$, and $ x \in \mathbb{R}$. If $ x$ is $ (n+1)$-exact but not $ n$-exact, then $ RNE(x,n)$ is $ (n-1)$-exact.

PROOF: Again we may assume $ x>0$. Let $ z = \lfloor 2^{n-1}sig(x)\rfloor$ and $ f = 2^{n-1}sig(x) - z$. Since $ 2^{n-1}sig(x) \notin \mathbb{Z}$, $ 0<f<1$. But $ 2^{n}sig(x) = 2z+2f \in \mathbb{Z}$, hence $ 2f \in \mathbb{Z}$ and $ f=\frac{1}{2}$.

If $ z$ is even, then

$\displaystyle RNE(x,n) = \mathit{RTZ}(x,n) = z2^{expo(x)+1-n}$

and by Lemma 6.1.6,

$\displaystyle 2^{n-2-expo(RNE(x,n))}RNE(x,n) = 2^{n-2-expo(x)}z2^{expo(x)+1-n} = z/2 \in \mathbb{Z}.$

If $ z$ is odd, then

$\displaystyle RNE(x,n) = RAZ(x,n) = (z+1)2^{expo(x)+1-n}.$

We may assume $ RAZ(x,n) \neq 2^{expo(x)+1}$, and hence by Corollary 6.2.10,
$\displaystyle \begin{array}{rcl}
2^{n-2-expo(RNE(x,n))}RNE(x,n) & = & 2^{n-2-expo(x)}(z+1)2^{expo(x)+1-n}\\
 & = & (z+1)/2\\
 & \in & \mathbb{Z}.
\end{array}$

One consequence of this result is that a midpoint is sometimes rounded up and sometimes down, and therefore, over the course of a long series of computations and approximations, rounding error is less likely to accumulate to a significant degree in one particular direction than it would be if the choice were made more consistently. The cost of this feature is a more complicated definition, requiring a more expensive implementation.
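A small experiment makes the cancellation visible. The sketch below rounds the four midpoints between successive 3-exact numbers in $ [4,8]$ and sums the signed errors, once with RNE and once with a hypothetical rule `tie_away` that always resolves a tie away from 0. The helper names are ours, and `rtz` and `raz` assume the floor and ceiling forms used in this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: ties go to RTZ when z is even, otherwise to RAZ.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def tie_away(x, n):
    """Round to nearest, resolving every tie away from zero."""
    f = scaled(x, n) - floor(scaled(x, n))
    return rtz(x, n) if f < Fraction(1, 2) else raz(x, n)

n = 3
midpoints = [Fraction(9, 2), Fraction(11, 2), Fraction(13, 2), Fraction(15, 2)]
rne_errors = [rne(x, n) - x for x in midpoints]     # -1/2, +1/2, -1/2, +1/2
away_errors = [tie_away(x, n) - x for x in midpoints]  # +1/2 each
rne_bias = sum(rne_errors)      # 0
away_bias = sum(away_errors)    # 2
```

On this input the RNE errors alternate in sign and cancel, while the consistent rule accumulates an error of 2.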

When the goal of a computation is provable accuracy rather than IEEE-compliance, a simpler version of “round to nearest” may be appropriate. The critical feature of this mode then becomes the relative error bound guaranteed by Lemma 6.3.14, since this is likely to be the basis for any formal error analysis. The following definition presents an alternative to RNE that respects the same error bound (see Lemma 6.3.35) but admits a simpler implementation and is therefore commonly used for internal floating-point calculations.

Definition 6.3.2   (rna) Given $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, let $ z = \lfloor 2^{n-1}sig(x)\rfloor$, and $ f = 2^{n-1}sig(x) - z$. Then

$\displaystyle RNA(x,n) = \left\{\begin{array}{ll}
\mathit{RTZ}(x,n) & \mbox{if $f < 1/2$}\\
RAZ(x,n) & \mbox{if $f \geq 1/2$.} \end{array} \right. $

Example: Let $ x = (101.101)_2$ and $ n = 5$. Since

$\displaystyle f = 2^{n-1}sig(x) - \lfloor 2^{n-1}sig(x) \rfloor = (10110.1)_2 - (10110)_2 = (0.1)_2 = 1/2,$

$\displaystyle RNA(x,n) = RAZ(x,n) = (101.11)_2.$
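The same input can be run through both definitions to see where they part ways. In this sketch the helper names are ours, and `rtz` and `raz` assume the floor and ceiling forms used in this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: ties go to RTZ when z is even, otherwise to RAZ.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def rna(x, n):
    # Definition 6.3.2: only the fractional part f is examined.
    f = scaled(x, n) - floor(scaled(x, n))
    return rtz(x, n) if f < Fraction(1, 2) else raz(x, n)

x = Fraction(0b101101, 8)   # (101.101)_2
rna_result = rna(x, 5)      # 23/4 = (101.11)_2, tie rounded away from 0
rne_result = rne(x, 5)      # 11/2 = (101.1)_2, tie rounded to even z
```

The simpler test in `rna` reflects the claim that RNA admits a simpler implementation: the parity of $ z$ is never consulted.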

Naturally, many of the properties of RNE are held by $ RNA$ as well. We list some of them here, omitting the proofs, which are essentially the same as those given above for RNE.

Lemma 6.3.22   (rna-choice) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, either

$\displaystyle RNA(x,n) = \mathit{RTZ}(x,n)$

or

$\displaystyle RNA(x,n) = RAZ(x,n).$

Lemma 6.3.23   (sgn-rna) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$, then

$\displaystyle sgn(RNA(x,n)) = sgn(x).$

Lemma 6.3.24   (rna-exactp-a) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, $ RNA(x,n)$ is $ n$-exact.

Lemma 6.3.25   (rna-exactp-b) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$ and $ x$ is $ n$-exact, then

$\displaystyle RNA(x,n) = x.$

Lemma 6.3.26   (rna-exactp-c,rna-exactp-d) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$, $ n > 0$. Suppose $ a$ is $ n$-exact.

(a) If $ a \geq x$, then $ a \geq RNA(x,n)$.
(b) If $ a \leq x$, then $ a \leq RNA(x,n)$.

Lemma 6.3.27   (expo-rna) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, if $ \vert RNA(x,n)\vert \neq 2^{expo(x)+1}$, then

$\displaystyle expo(RNA(x,n)) = expo(x).$

Lemma 6.3.28   (rna<=raz, rna>=rtz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ x \geq 0$, then

$\displaystyle \mathit{RTZ}(x,n) \leq RNA(x,n) \leq RAZ(x,n).$

Lemma 6.3.29   (rna-shift) For all $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ k \in \mathbb{Z}$,

$\displaystyle RNA(2^kx,n) = 2^k RNA(x,n).$

Lemma 6.3.30   (rna-minus) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle RNA(-x,n) = -RNA(x,n).$

Lemma 6.3.31   (rna-rtz, rna-raz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$.

(a) If $ \vert x - \mathit{RTZ}(x,n)\vert < \vert x - RAZ(x,n)\vert$, then $ RNA(x,n) = \mathit{RTZ}(x,n)$.

(b) If $ \vert x - \mathit{RTZ}(x,n)\vert > \vert x - RAZ(x,n)\vert$, then $ RNA(x,n) = RAZ(x,n)$.

Corollary 6.3.32   (rna-down, rna-up) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$. Assume that $ n > 0$, $ a>0$, and $ a$ is $ n$-exact.

(a) If $ a \leq x < a + 2^{expo(a)-n}$, then $ RNA(x, n) = a$.

(b) If $ a + 2^{expo(a)-n} < x \leq fp^+(a, n)$, then $ RNA(x, n) = fp^+(a, n)$.

Corollary 6.3.33   (rna-up-2) Let $ x \in \mathbb{R}$, $ k \in \mathbb{N}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$. If $ 0 < m < n$, $ \vert x\vert < 2^k$, and $ \vert\mathit{RNA}(x, n)\vert = 2^k$, then $ \vert\mathit{RNA}(x, m)\vert = 2^k$.

Lemma 6.3.34   (rna-nearest) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ n > 0$ and $ y$ is $ n$-exact, then

$\displaystyle \vert x-y\vert \geq \vert x-RNA(x,n)\vert.$

Lemma 6.3.35   (rna-diff) If $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ n > 0$, then

$\displaystyle \vert x-RNA(x,n)\vert \leq 2^{expo(x)-n}.$

Lemma 6.3.36   (rna-monotone) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ 0 \leq x \leq y$ and $ n > 0$, then

$\displaystyle RNA(x,n) \leq RNA(y,n).$

Lemma 6.3.37   (rna-rna-lemma) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ 0<x<y$ and $ RNA(x,n) \neq RNA(y,n)$, then for some $ a \in \mathbb{R}$, $ x \leq a \leq y$ and $ a$ is $ (n+1)$-exact.

Corollary 6.3.38   (rna-rna) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, $ a \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ k \in \mathbb{N}$ with $ n \geq k > 0$. If $ a$ is $ (n+1)$-exact, $ 0<a<x$, and $ 0 < y <a+2^{expo(a)-n}$, then $ RNA(x,k) \geq RNA(y,k)$.

The difference between RNE and $ RNA$ is that the latter always rounds a midpoint away from 0.

Lemma 6.3.39   (rna-midpoint) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ x$ is $ (n+1)$-exact but not $ n$-exact, then

$\displaystyle RNA(x,n) = RAZ(x,n) = x + sgn(x)2^{expo(x)-n}.$

PROOF: By Lemmas 6.3.9 and 6.2.2, we may assume $ x>0$. Let $ z = \lfloor 2^{n-1}sig(x)\rfloor$ and $ f = 2^{n-1}sig(x) - z$. Since $ 2^{n-1}sig(x) \notin \mathbb{Z}$, $ 0<f<1$. But $ 2^{n}sig(x) = 2z+2f \in \mathbb{Z}$, hence $ 2f \in \mathbb{Z}$ and $ f=\frac{1}{2}$. Therefore, according to Definition 6.3.2, $ RNA(x,n) = RAZ(x,n)$. The second equality is a restatement of Lemma 6.2.19.


There is one case of a midpoint for which RNE and RNA are guaranteed to produce the same result: if the greater of the two representable numbers that are equidistant from $ x$ is a power of 2, i.e., $ x = 2^{expo(x)+1} - 2^{expo(x)-n}$, then both modes round to this number.

Lemma 6.3.40   (rne-power-2, rtz-power-2, rna-power-2) Let $ n \in \mathbb{N}$, $ n>1$, and $ x \in \mathbb{R}$, $ x>0$. If $ x + 2^{expo(x)-n} \geq 2^{expo(x)+1}$, then

$\displaystyle RNE(x,n) = RNA(x,n) = 2^{expo(x)+1} = \mathit{RTZ}(x+2^{expo(x)-n},n).$

PROOF: Suppose $ RNE(x,n) \neq 2^{expo(x)+1}$. Then Lemma 6.3.6 implies

$\displaystyle RNE(x,n) < 2^{expo(x)+1}$

and by Lemmas 4.2.16 and 6.3.14,
$\displaystyle \begin{array}{rcl}
2^{expo(x)+1} & \geq & RNE(x,n)+2^{expo(x)+1-n}\\
 & \geq & x - 2^{expo(x)-n} + 2^{expo(x)+1-n}\\
 & = & x + 2^{expo(x)-n}\\
 & \geq & 2^{expo(x) + 1}.
\end{array}$

It follows that $ x = 2^{expo(x)+1} - 2^{expo(x)-n}$, which is easily seen to be $ (n+1)$-exact but not $ n$-exact, while $ RNE(x,n) = 2^{expo(x)+1} - 2^{expo(x)+1-n}$ is $ n$-exact but not $ (n-1)$-exact, contradicting Lemma 6.3.21.

Now suppose $ RNA(x,n) \neq 2^{expo(x)+1}$. Using Lemmas 6.3.27 and 6.3.35, we may show in the same way as above that $ x = 2^{expo(x)+1} - 2^{expo(x)-n}$. Once again, $ x$ is $ (n+1)$-exact but not $ n$-exact, and hence, by Lemma 6.3.39,

$\displaystyle RNA(x,n) = x + 2^{expo(x)-n} = 2^{expo(x)+1},$

a contradiction.

Finally, suppose $ 2^{expo(x)+1} \neq \mathit{RTZ}(x+2^{expo(x)-n},n)$. Since $ 2^{expo(x)+1}$ is $ n$-exact, $ 2^{expo(x)+1} < \mathit{RTZ}(x+2^{expo(x)-n},n)$ by Lemma 6.1.11. But then by Lemma 4.2.16,

$\displaystyle \mathit{RTZ}(x+2^{expo(x)-n},n) \geq 2^{expo(x)+1} + 2^{expo(x)+2-n} > x+ 2^{expo(x)-n},$

contradicting Lemma 6.1.5.


The additive property shared by $ \mathit{RTZ}$ and RAZ that is described in Lemmas 6.1.16 and 6.2.21, respectively, does not hold for RNE in precisely the same form. For example, let $ x = 2 = (10)_2$, $ y = 5 = (101)_2$, and $ k = 2$. Then

$\displaystyle k + expo(x) - expo(y) = 2 + 1 - 2 = 1$

and

$\displaystyle k + expo(x+y) - expo(y) = 2 + 2 - 2 = 2.$

Although $ x$ is clearly $ (k+expo(x)-expo(y))$-exact,

$\displaystyle x + RNE(y,k) = (10)_2 + RNE((101)_2,2) = (10)_2 + (100)_2 = (110)_2,$

while

$\displaystyle RNE(x+y,k + expo(x+y) - expo(y)) = RNE((111)_2,2) = (1000)_2.$

However, this property is shared by $ RNA$, and a slightly weaker version holds for RNE.
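The counterexample, and the fact that RNA does satisfy the identity on the same input, can be confirmed by direct computation. In this sketch the helper names are ours, the variable `k2` stands for the precision $ k + expo(x+y) - expo(y)$, and `rtz` and `raz` assume the floor and ceiling forms used in this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1: ties go to RTZ when z is even, otherwise to RAZ.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def rna(x, n):
    # Definition 6.3.2: ties always go to RAZ.
    f = scaled(x, n) - floor(scaled(x, n))
    return rtz(x, n) if f < Fraction(1, 2) else raz(x, n)

x, y, k = Fraction(2), Fraction(5), 2
k2 = k + expo(x + y) - expo(y)                      # 2 + 2 - 2 = 2
lhs_rne, rhs_rne = x + rne(y, k), rne(x + y, k2)    # 6 versus 8
lhs_rna, rhs_rna = x + rna(y, k), rna(x + y, k2)    # 8 and 8
```

Both $ y = 5$ and $ x + y = 7$ are ties at precision 2, but the integer parts $ z$ are 2 and 3, so RNE resolves them in opposite directions while RNA resolves both away from 0.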

Lemma 6.3.41   (plus-rne,plus-rna) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ k \in \mathbb{Z}$ with $ x \geq 0$ and $ y \geq 0$. Let

$\displaystyle k' = k+expo(x)-expo(y)$

and

$\displaystyle k'' = k+expo(x+y)-expo(y).$

(a) If $ x$ is $ (k'-1)$-exact, then

$\displaystyle x+RNE(y,k) = RNE(x+y,k'').$

(b) If $ x$ is $ k'$-exact, then

$\displaystyle x+RNA(y,k) = RNA(x+y,k'').$

PROOF:

(a) Applying Lemmas 6.1.16 and 6.2.21, we need only show that either

$\displaystyle RNE(y,k) = \mathit{RTZ}(y,k)$    and $\displaystyle RNE(x+y,k'') = \mathit{RTZ}(x+y,k'')$

or

$\displaystyle RNE(y,k) = RAZ(y,k)$    and $\displaystyle RNE(x+y,k'') = RAZ(x+y,k'').$

Let $ z_1 = \lfloor 2^{k-1}sig(y)\rfloor$, $ f_1 = 2^{k-1}sig(y) - z_1$, $ z_2 = \lfloor 2^{k''-1}sig(x+y)\rfloor$, and $ f_2 = 2^{k''-1}sig(x+y) - z_2$. According to Definition 6.3.1, it will suffice to show that

$\displaystyle 2^{k''-1}sig(x+y) - 2^{k-1}sig(y) = 2\ell,$

for some $ \ell \in \mathbb{Z}$, for then Lemma 1.1.4 will imply that $ z_2 = z_1 + 2\ell$ and $ f_2 = f_1$. But

$\displaystyle \begin{array}{rcl}
2^{k''-1}sig(x+y) - 2^{k-1}sig(y) & = & 2^{k''-expo(x+y) - 1}(x+y) - 2^{k-expo(y)-1}y\\
 & = & 2^{k-expo(y) - 1}(x+y) - 2^{k-expo(y)-1}y\\
 & = & 2^{k-expo(y) - 1}x\\
 & = & 2^{k'-expo(x) - 1}x\\
 & = & 2\ell,
\end{array}$

where

$\displaystyle \ell = 2^{(k'-1)-expo(x)-1}x \in \mathbb{Z}$

by Lemma 4.2.1.

(b) Here we must show that either

$\displaystyle RNA(y,k) = \mathit{RTZ}(y,k)$    and $\displaystyle RNA(x+y,k'') = \mathit{RTZ}(x+y,k'')$

or

$\displaystyle RNA(y,k) = RAZ(y,k)$    and $\displaystyle RNA(x+y,k'') = RAZ(x+y,k'').$

According to Definition 6.3.2, this is true whenever $ f_1 = f_2$. Thus, we need only show that

$\displaystyle 2^{k''-1}sig(x+y) - 2^{k-1}sig(y) = 2^{k'-expo(x) - 1}x \in \mathbb{Z},$

which is equivalent to the hypothesis that $ x$ is $ k'$-exact. 


The rounding constant $ {\cal C}$ (see the discussion preceding Lemma 6.2.24) for both of the modes $ RNE$ and $ RNA$ is a simple power of 2, equal to half the value of the least significant bit of the rounded result. That is, if the rounding precision is $ n$ and the unrounded result is

$\displaystyle x = (1.\beta_1\beta_2\cdots)_2 \times 2^e,$

then $ {\cal C} = 2^{e-n}$, as illustrated below.


[Figure: the rounding constant $ {\cal C} = 2^{e-n}$ shown in relation to the bits of $ x = (1.\beta_1\beta_2\cdots)_2 \times 2^e$.]

The following lemma exposes the extra expense of implementing $ RNE$ as compared to $ RNA$. While the correctly rounded result is given by $ \mathit{RTZ}(x+{\cal C},n)$ in most cases, special attention is required for the computation of $ RNE$ in the case where it differs from $ RNA$, i.e., when $ x$ is $ (n+1)$-exact and $ \beta_n = 1$. In this case, the least significant bit must be forced to 0. This is accomplished by truncating $ x+{\cal C}$ to $ n-1$ bits rather than $ n$.

Lemma 6.3.42   (rne-imp, rna-imp) If $ n \in \mathbb{N}$, $ n>1$, $ x \in \mathbb{R}$, and $ x>0$, then

$\displaystyle RNE(x,n) = \left\{\begin{array}{ll}
\mathit{RTZ}(x+2^{expo(x)-n},n-1) & \mbox{if $x$ is $(n+1)$-exact but not $n$-exact}\\
\mathit{RTZ}(x+2^{expo(x)-n},n) & \mbox{otherwise} \end{array} \right. $

and

$\displaystyle RNA(x,n) = \mathit{RTZ}(x+2^{expo(x)-n},n).$

PROOF: If $ x + 2^{expo(x)-n} \geq 2^{expo(x)+1}$, then by Lemma 6.3.40,

$\displaystyle RNE(x,n) = RNA(x,n) = 2^{expo(x)+1} = \mathit{RTZ}(x+2^{expo(x)-n},n).$

But then, by Lemmas 6.1.15, 4.2.6, and 6.1.10,
$\displaystyle \begin{array}{rcl}
\mathit{RTZ}(x+2^{expo(x)-n},n-1) & = & \mathit{RTZ}(\mathit{RTZ}(x+2^{expo(x)-n},n), n-1)\\
 & = & \mathit{RTZ}(2^{expo(x)+1},n-1)\\
 & = & 2^{expo(x)+1}\\
 & = & \mathit{RTZ}(x+2^{expo(x)-n},n).
\end{array}$

Thus, we may assume $ x+2^{expo(x)-n} < 2^{expo(x)+1}$, and it follows from Lemmas 6.3.6, 6.3.14, 6.3.27, and 6.3.35 that

$\displaystyle expo(RNE(x,n)) = expo(RNA(x,n)) = expo(x+2^{expo(x)-n}) = expo(x).$

Case 1: $ x$ is $ n$-exact

By Lemma 6.1.11, $ \mathit{RTZ}(x+2^{expo(x)-n},n) \geq x$. But since

$\displaystyle \mathit{RTZ}(x+2^{expo(x)-n},n) \leq x+2^{expo(x)-n} < x+2^{expo(x)+1-n},$

Lemma 4.2.16 yields $ \mathit{RTZ}(x+2^{expo(x)-n},n) \leq x$, and hence

$\displaystyle \mathit{RTZ}(x+2^{expo(x)-n},n) = x = RNE(x,n) = RNA(x,n).$

Case 2: $ x$ is not $ (n+1)$-exact

We have $ RNE(x,n) > x-2^{expo(x)-n}$, for otherwise Lemma 6.3.14 would imply

$\displaystyle RNE(x,n) = x-2^{expo(x)-n},$

and since $ RNE(x,n)$ is $ (n+1)$-exact, so would be

$\displaystyle RNE(x,n)+2^{expo(RNE(x,n))-n} = x-2^{expo(x)-n}+2^{expo(RNE(x,n))-n}=x.$

Since $ RNE(x,n) \leq x+2^{expo(x)-n}$, $ RNE(x,n) \leq \mathit{RTZ}(x+2^{expo(x)-n},n)$ by Lemma 6.1.11. But since

$\displaystyle \mathit{RTZ}(x+2^{expo(x)-n},n) \leq x+2^{expo(x)-n} < RNE(x,n)+2^{expo(x)+1-n},$

$ \mathit{RTZ}(x+2^{expo(x)-n},n) \leq RNE(x,n)$.

The same argument applies to $ RNA(x,n)$, but with Lemma 6.3.35 invoked in place of Lemma 6.3.14.



Case 3: $ x$ is $ (n+1)$-exact but not $ n$-exact

The identity for $ RNA(x,n)$ is given by Lemma 6.3.39. To prove the claim for $ RNE$, we first consider the case $ RNE(x,n) > x$. Since $ RNE(x,n)$ is $ (n+1)$-exact, $ RNE(x,n) \geq x+2^{expo(x)-n}$, hence $ RNE(x,n) = x+2^{expo(x)-n}$, and by Lemma 6.3.21,

$\displaystyle \mathit{RTZ}(x+2^{expo(x)-n},n-1) = \mathit{RTZ}(RNE(x,n),n-1) = RNE(x,n).$

Now suppose $ RNE(x,n) < x$. Then $ RNE(x,n) < x+2^{expo(x)-n}$ implies $ RNE(x,n) \leq \mathit{RTZ}(x+2^{expo(x)-n},n-1)$. But since

$\displaystyle \begin{array}{rcl}
\mathit{RTZ}(x+2^{expo(x)-n},n-1) & \leq & x + 2^{expo(x)-n}\\
 & = & x - 2^{expo(x)-n} + 2^{expo(x)+1-n}\\
 & < & RNE(x,n)+2^{expo(x)+2-n},
\end{array}$

we have $ \mathit{RTZ}(x+2^{expo(x)-n},n-1) \leq RNE(x,n)$.
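The implementation strategy described before the lemma, adding the rounding constant and truncating to $ n$ bits, or to $ n-1$ bits when $ x$ is a midpoint in the RNE case, can be compared against the definitions on sample inputs. In this sketch the helper names are ours; `rtz` and `raz` assume the floor and ceiling forms used in this section, and `exactp` assumes that $ x$ is $ n$-exact when $ 2^{n-1-expo(x)}x$ is an integer.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def rna(x, n):
    # Definition 6.3.2.
    f = scaled(x, n) - floor(scaled(x, n))
    return rtz(x, n) if f < Fraction(1, 2) else raz(x, n)

def exactp(x, n):
    """Assumed test for n-exactness: 2**(n-1-expo(x)) * x is an integer."""
    return scaled(x, n).denominator == 1

def rne_imp(x, n):
    c = Fraction(2) ** (expo(x) - n)            # rounding constant
    if exactp(x, n + 1) and not exactp(x, n):   # midpoint: drop one more bit
        return rtz(x + c, n - 1)
    return rtz(x + c, n)

def rna_imp(x, n):
    return rtz(x + Fraction(2) ** (expo(x) - n), n)

samples = [Fraction(i, 32) for i in range(1, 1025)]   # positive inputs only
imp_ok = all(rne_imp(x, n) == rne(x, n) and rna_imp(x, n) == rna(x, n)
             for x in samples for n in range(2, 6))
```

The extra exactness test in `rne_imp` is the additional expense of RNE relative to RNA.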


As a consequence of the preceding lemma, $ RNA(x,m)$ depends only on the most significant $ m+1$ bits of $ x$.

Lemma 6.3.43   (rna-imp-cor) Let $ x \in \mathbb{R}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$. If $ n>m>0$, then

$\displaystyle RNA(\mathit{RTZ}(x,n),m) = RNA(x,m).$

PROOF: By Lemmas 6.3.30 and 6.1.3, we may assume that $ x>0$. Furthermore, it will suffice to consider the case $ n=m+1$, because then for $ n>m+1$,

$\displaystyle \begin{array}{rcl}
RNA(\mathit{RTZ}(x,n),m) & = & RNA(\mathit{RTZ}(\mathit{RTZ}(x,n),m+1),m)\\
 & = & RNA(\mathit{RTZ}(x,m+1),m)\\
 & = & RNA(x,m).
\end{array}$

Thus, according to Lemmas 6.3.42 and 6.1.6, our goal is to prove

$\displaystyle \mathit{RTZ}(\mathit{RTZ}(x,m+1)+2^{expo(x)-n},m) = \mathit{RTZ}(x+2^{expo(x)-n},m),$

but after applying Lemma 6.1.5, we need only show

$\displaystyle \mathit{RTZ}(\mathit{RTZ}(x,m+1)+2^{expo(x)-n},m) \geq \mathit{RTZ}(x+2^{expo(x)-n},m).$

Let

$\displaystyle y = \mathit{RTZ}(x,m+1)+2^{expo(x)-n} = \mathit{fp}^+(\mathit{RTZ}(x,m+1),m+1).$

By Lemmas 6.1.9 and 4.2.15, $ y$ is $ (m+1)$-exact. Now since $ \mathit{RTZ}(x+2^{expo(x)-n},m)$ is also $ (m+1)$-exact and Lemmas 6.1.5 and 6.1.7 imply
$\displaystyle \begin{array}{rcl}
\mathit{RTZ}(x+2^{expo(x)-n},m) & \leq & x + 2^{expo(x)-n}\\
 & < & \mathit{RTZ}(x,m+1) + 2^{expo(x)-n} + 2^{expo(x)-n}\\
 & = & y + 2^{expo(x)-n}\\
 & = & \mathit{fp}^+(y,m+1),
\end{array}$

Lemma 4.2.16 yields $ \mathit{RTZ}(x+2^{expo(x)-n},m) \leq y$. Finally, by Lemma 6.1.11,

$\displaystyle \mathit{RTZ}(x+2^{expo(x)-n},m) \leq \mathit{RTZ}(y,m).$
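The statement that RNA is unaffected by a prior truncation to a higher precision can be checked on sample inputs. The sketch below also records one input on which the analogous computation with RNE gives a different answer. The helper names are ours, and `rtz` and `raz` assume the floor and ceiling forms used in this section.

```python
from fractions import Fraction
from math import floor, ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); expo(0) is taken to be 0."""
    x, e = abs(Fraction(x)), 0
    while x != 0 and x >= Fraction(2) ** (e + 1):
        e += 1
    while x != 0 and x < Fraction(2) ** e:
        e -= 1
    return e

def scaled(x, n):
    """2**(n-1) * sig(x), where sig(x) = |x| / 2**expo(x)."""
    return abs(Fraction(x)) * Fraction(2) ** (n - 1 - expo(x))

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    # Assumed form of RTZ: floor of the scaled significand.
    return sgn(x) * floor(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def raz(x, n):
    # Assumed form of RAZ: ceiling of the scaled significand.
    return sgn(x) * ceil(scaled(x, n)) * Fraction(2) ** (expo(x) + 1 - n)

def rne(x, n):
    # Definition 6.3.1.
    z = floor(scaled(x, n))
    f = scaled(x, n) - z
    if f < Fraction(1, 2) or (f == Fraction(1, 2) and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def rna(x, n):
    # Definition 6.3.2.
    f = scaled(x, n) - floor(scaled(x, n))
    return rtz(x, n) if f < Fraction(1, 2) else raz(x, n)

samples = [Fraction(i, 64) for i in range(1, 513)]
cor_ok = all(rna(rtz(x, n), m) == rna(x, m)
             for x in samples for m in range(1, 4) for n in range(m + 1, 7))

# With RNE the result can change: truncation may turn a value just above
# a midpoint into an exact tie.
x = Fraction(53, 10)
rne_direct = rne(x, 2)              # 6
rne_after_rtz = rne(rtz(x, 3), 2)   # rtz gives 5, a tie, which RNE rounds to 4
```

For RNA the truncated bits matter only through the test $ f \geq 1/2$, which is decided by the first discarded bit; RNE must also distinguish $ f = 1/2$ from $ f > 1/2$, and truncation destroys that information.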

David Russinoff 2017-08-01