6.2  Rounding Away from Zero

The dual of truncation is defined similarly, using the ceiling instead of the floor.

Definition 6.2.1   (raz) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{Z}$,

$\displaystyle RAZ(x,n) = sgn(x)\lceil 2^{n-1}sig(x) \rceil 2^{expo(x)-n+1}.$

Example: Let $ x = 45/8 = (101.101)_2$ and $ n = 5$. Then $ sgn(x) = 1$, $ expo(x) = 2$, $ sig(x) = (1.01101)_2$,

$\displaystyle \lceil 2^{n-1}sig(x) \rceil 2^{1-n} = \lceil (10110.1)_2 \rceil 2^{-4}
= (10111)_2 \cdot 2^{-4} = (1.0111)_2,$

and

$\displaystyle RAZ(x,n) =\lceil 2^{n-1}sig(x) \rceil 2^{1-n}2^{expo(x)}
= (1.0111)_2 \cdot 2^{2} = (101.11)_2.$

Note that this value is the smallest 5-exact number not exceeded by $ x$.
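As an illustrative aid, outside the formal development, Definition 6.2.1 can be transcribed into exact rational arithmetic. The sketch below uses Python's `fractions.Fraction`. The helper names `expo`, `sgn`, `sig`, and `raz` mirror the notation of the text but are assumptions of this sketch, and $ RAZ(0,n)$ is taken to be 0, as the factor $ sgn(x)$ dictates.

```python
from fractions import Fraction
from math import ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    # The bit-length difference is either the exponent or one more than it.
    return e if Fraction(2) ** e <= x else e - 1

def sgn(x):
    return (x > 0) - (x < 0)

def sig(x):
    """Significand of nonzero x, a rational in [1, 2)."""
    return abs(Fraction(x)) / Fraction(2) ** expo(x)

def raz(x, n):
    """RAZ(x, n) = sgn(x) * ceil(2**(n-1) * sig(x)) * 2**(expo(x)-n+1)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    return sgn(x) * ceil(Fraction(2) ** (n - 1) * sig(x)) * Fraction(2) ** (expo(x) - n + 1)

x = Fraction(45, 8)        # (101.101)_2
print(expo(x), sig(x))     # exponent and significand of x
print(raz(x, 5))           # x rounded away from zero to 5 bits
```

For $ x = 45/8$ and $ n = 5$ this reproduces the value $ (101.11)_2 = 23/4$ computed in the example.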

The next three lemmas list simple properties of RAZ that it shares with $ \mathit{RTZ}$.

Lemma 6.2.1   (sgn-raz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$, then

$\displaystyle sgn(RAZ(x,n)) = sgn(x).$

Lemma 6.2.2   (raz-minus) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle RAZ(-x,n) = -RAZ(x,n).$

Lemma 6.2.3   (raz-shift) For all $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ k \in \mathbb{Z}$,

$\displaystyle RAZ(2^kx,n) = 2^k RAZ(x,n).$

PROOF: By Lemma 4.1.14,

$\displaystyle RAZ(2^kx,n)$ $\displaystyle =$ $\displaystyle sgn(2^kx)\lceil 2^{n-1}sig(2^kx) \rceil 2^{expo(2^kx)-n+1}$  
  $\displaystyle =$ $\displaystyle sgn(x)\lceil 2^{n-1}sig(x) \rceil 2^{expo(x)+k-n+1}$  
  $\displaystyle =$ $\displaystyle 2^k RAZ(x,n).$  

The case of non-positive precision, $ n \leq 0$, is less than intuitive.

Lemma 6.2.4   (raz-neg-bits) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{Z}$, if $ n \leq 0$, then

$\displaystyle RAZ(x,n) = sgn(x)2^{expo(x)+1-n}.$

PROOF: By Lemma 4.1.8,

$\displaystyle 0 < 2^{n-1}sig(x) \leq sig(x)/2 < 1,$

and hence, by Lemma 1.1.9,

$\displaystyle RAZ(x,n) = sgn(x)\lceil 2^{n-1}sig(x) \rceil 2^{expo(x)+1-n} = sgn(x)2^{expo(x)+1-n}.$
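The identity (raz-neg-bits) can be spot-checked numerically. The following is a minimal sketch in exact rational arithmetic; the Python helpers `expo` and `raz` are assumptions of this illustration, not part of the formal library.

```python
from fractions import Fraction
from math import ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def raz(x, n):
    """RAZ(x, n), using 2**(n-1) * sig(x) = 2**(n-1-expo(x)) * |x|."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = ceil(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

x = Fraction(45, 8)
for n in (0, -1, -2):
    # For n <= 0 the ceiling is 1, so only the power of 2 survives.
    print(n, raz(x, n), Fraction(2) ** (expo(x) + 1 - n))
```

With $ x = 45/8$ the two printed values agree on every line: each decrement of $ n$ doubles the result.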

We have the following bounds on $ RAZ(x,n)$.

Lemma 6.2.5   (raz-lower-bound) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle \vert RAZ(x,n)\vert \geq \vert x\vert.$

PROOF: By Lemmas 1.1.9 and 4.1.1,

$\displaystyle \vert RAZ(x,n)\vert \geq 2^{n-1}sig(x) 2^{expo(x)-n+1} = sig(x)2^{expo(x)} = \vert x\vert.$

Lemma 6.2.6   (raz-upper-bound,raz-upper-2)
If $ x \in \mathbb{R}$, $ x \neq 0$, and $ n \in \mathbb{N}$, then

$\displaystyle \vert RAZ(x,n)\vert < \vert x\vert + 2^{expo(x)-n+1} \leq \vert x\vert(1+2^{1-n}).$

PROOF: By Definitions 6.2.1 and 1.1.1 and Lemma 4.1.1,

$\displaystyle \vert RAZ(x,n)\vert < (2^{n-1}sig(x)+1)2^{expo(x)-n+1} = \vert x\vert + 2^{expo(x)-n+1}.$

The second inequality follows from Definition 4.1.1.

Corollary 6.2.7   (raz-diff) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle \vert RAZ(x,n)-x\vert < 2^{expo(x)-n+1}.$

PROOF: By Lemmas 6.2.1 and 6.2.5,

$\displaystyle \vert RAZ(x,n)-x\vert = \vert\vert RAZ(x,n)\vert-\vert x\vert\vert = \vert RAZ(x,n)\vert-\vert x\vert.$

The corollary now follows from Lemma 6.2.6.
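The bounds (raz-lower-bound), (raz-upper-bound), and (raz-diff) can be exercised together on a grid of rationals. This is only a sanity check under the stated assumptions (hypothetical Python helpers `expo`, `raz`, and `bounds_hold`), not a substitute for the proofs.

```python
from fractions import Fraction
from math import ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def raz(x, n):
    """RAZ(x, n), using 2**(n-1) * sig(x) = 2**(n-1-expo(x)) * |x|."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = ceil(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

def bounds_hold(x, n):
    """|x| <= |RAZ| < |x| + ulp <= |x|(1 + 2**(1-n)) and |RAZ - x| < ulp, for x != 0."""
    x = Fraction(x)
    r = raz(x, n)
    ulp = Fraction(2) ** (expo(x) - n + 1)
    chain = abs(x) <= abs(r) < abs(x) + ulp <= abs(x) * (1 + Fraction(2) ** (1 - n))
    return chain and abs(r - x) < ulp

samples = [Fraction(p, q) for p in range(-40, 41) if p != 0 for q in (1, 3, 7, 16)]
print(all(bounds_hold(x, n) for x in samples for n in range(0, 8)))
```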


Unlike $ \mathit{RTZ}$, RAZ is not guaranteed to preserve the exponent of its argument, but the only exception is the case in which a number is rounded up to a power of 2.

Lemma 6.2.8   (raz-expo-upper) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle \vert RAZ(x,n)\vert \leq 2^{expo(x) + 1}.$

PROOF: By Lemma 4.1.8,

$\displaystyle \vert RAZ(x,n)\vert$ $\displaystyle =$ $\displaystyle \lceil 2^{n-1}sig(x) \rceil 2^{expo(x)-n+1}$  
  $\displaystyle \leq$ $\displaystyle \lceil 2^{n} \rceil 2^{expo(x)-n+1}$  
  $\displaystyle =$ $\displaystyle 2^{n} 2^{expo(x)-n+1}$  
  $\displaystyle =$ $\displaystyle 2^{expo(x) + 1}.$  

Lemma 6.2.9   (expo-raz-lower-bound,expo-raz-upper-bound)
For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle expo(x) \leq expo(RAZ(x,n)) \leq expo(x)+1.$

PROOF: The first inequality follows from Lemmas 6.2.5 and 4.1.5, and the second from Lemmas 6.2.8 and 4.1.3.

Corollary 6.2.10   (expo-raz) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, if $ \vert RAZ(x,n)\vert \neq 2^{expo(x)+1}$, then

$\displaystyle expo(RAZ(x,n)) = expo(x).$
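Both behaviours of the exponent are easy to exhibit. In the sketch below (hypothetical Python helpers `expo` and `raz`, exact rational arithmetic), $ (1.1111)_2$ rounded to 3 bits is carried up to a power of 2, so its exponent grows by one, while the running example $ 45/8$ keeps its exponent.

```python
from fractions import Fraction
from math import ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def raz(x, n):
    """RAZ(x, n), using 2**(n-1) * sig(x) = 2**(n-1-expo(x)) * |x|."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = ceil(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

x = Fraction(31, 16)                         # (1.1111)_2
print(raz(x, 3), expo(x), expo(raz(x, 3)))   # rounded up to a power of 2

y = Fraction(45, 8)                          # (101.101)_2
print(raz(y, 5), expo(y), expo(raz(y, 5)))   # exponent preserved
```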

The standard rounding mode properties may now be derived.

Lemma 6.2.11   (raz-exactp-a) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, $ RAZ(x,n)$ is $ n$-exact.

PROOF: By Corollary 6.2.10 and Lemma 4.2.6, we may assume that

$\displaystyle expo(RAZ(x,n)) = expo(x).$

Consequently, it suffices to observe that

$\displaystyle RAZ(x,n)2^{n-1-expo(x)} = sgn(x)\lceil 2^{n-1}sig(x) \rceil \in \mathbb{Z}.$

Lemma 6.2.12   (raz-diff-expo) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$ and $ x$ is not $ n$-exact, then

$\displaystyle expo(RAZ(x,n)-x) \leq expo(x)-n.$

PROOF: According to Lemma 6.2.11, the hypothesis implies that $ x-RAZ(x,n) \neq 0$, and hence Lemma 4.1.3 applies. 

Lemma 6.2.13   (raz-exactp-b) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$ and $ x$ is $ n$-exact, then

$\displaystyle RAZ(x,n) = x.$

PROOF: By Definition 4.2.1 and Lemma 1.1.10,

$\displaystyle \lceil 2^{n-1}sig(x) \rceil = 2^{n-1}sig(x),$

and hence by Definition 6.2.1 and Lemma 4.1.1,
$\displaystyle RAZ(x,n)$ $\displaystyle =$ $\displaystyle sgn(x)\lceil 2^{n-1}sig(x) \rceil 2^{expo(x)-n+1}$  
  $\displaystyle =$ $\displaystyle sgn(x)2^{n-1}sig(x)2^{expo(x)-n+1}$  
  $\displaystyle =$ $\displaystyle sgn(x)sig(x)2^{expo(x)}$  
  $\displaystyle =$ $\displaystyle x.$  

Lemma 6.2.14   (raz-exactp-c) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$, $ n > 0$. If $ a$ is $ n$-exact and $ a \geq x$, then $ a \geq RAZ(x,n)$.

PROOF: If $ x<0$, then

$\displaystyle x = -\vert x\vert \geq -\vert RAZ(x,n)\vert = RAZ(x,n)$

by Lemmas 6.2.1 and 6.2.5. Therefore, we may assume that $ x \geq 0$. Suppose $ a < RAZ(x,n)$. Then by Lemmas 4.2.16 and 4.1.5,

$\displaystyle RAZ(x,n) \geq a + 2^{expo(a)-n+1} \geq x + 2^{expo(x)-n+1},$

contradicting Lemma 6.2.6.

Lemma 6.2.15   (raz-squeeze) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$. Assume that $ a$ is $ n$-exact, where $ 0<n$ and $ 0 < a < x \leq fp^+(a,n)$. Then $ \mathit{RAZ}(x, n) = fp^+(a,n)$.

PROOF: By Lemmas 6.2.5, 6.2.11, and 4.2.16, $ \mathit{RAZ}(x, n) \geq fp^+(a,n)$. By Lemmas 6.2.14 and 4.2.16, $ \mathit{RAZ}(x, n) \leq fp^+(a,n)$.

Lemma 6.2.16   (raz-up) Let $ x \in \mathbb{R}$, $ k \in \mathbb{N}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$, with $ 0 < m < n$ and $ \vert x\vert < 2^k$. If $ \vert\mathit{RAZ}(x, n)\vert = 2^k$, then $ \vert\mathit{RAZ}(x, m)\vert = 2^k$.

PROOF: We may assume that $ x>0$. Suppose $ \mathit{RAZ}(x, n) = 2^k$. By Lemmas 6.2.13 and 4.2.19, $ x > fp^-(2^k,n) > fp^-(2^k,m)$, and by Lemmas 6.2.15 and 4.2.20, $ \vert\mathit{RAZ}(x, m)\vert = 2^k$.

Lemma 6.2.17   (raz-monotone) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ x \leq y$, then $ RAZ(x,n) \leq RAZ(y,n)$.

PROOF: First suppose $ x>0$. By Lemma 6.2.5, $ RAZ(y,n) \geq y \geq x$. Since $ RAZ(x,n)$ is $ n$-exact by Lemma 6.2.11, Lemma 6.2.14 implies

$\displaystyle RAZ(y,n) \geq RAZ(x,n).$

Now suppose $ x \leq 0$. By Lemma 6.2.1, we may assume that $ x \leq y < 0$. Thus, since $ 0 < -y \leq -x$, we have $ RAZ(-y,n) \leq RAZ(-x,n)$ and by Lemma 6.2.2,

$\displaystyle RAZ(x,n) = -RAZ(-x,n) \leq -RAZ(-y,n) = RAZ(y,n).$
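Monotonicity can be observed by rounding a sorted list of rationals and confirming that the results are still sorted. The sketch below is illustrative only; `expo`, `raz`, and `is_monotone` are hypothetical Python helpers, and the sample grid is arbitrary.

```python
from fractions import Fraction
from math import ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def raz(x, n):
    """RAZ(x, n), using 2**(n-1) * sig(x) = 2**(n-1-expo(x)) * |x|."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = ceil(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

def is_monotone(xs, n):
    """True if rounding the sorted inputs to n bits yields a nondecreasing list."""
    rounded = [raz(x, n) for x in sorted(xs)]
    return all(a <= b for a, b in zip(rounded, rounded[1:]))

samples = [Fraction(p, q) for p in range(-30, 31) for q in (1, 3, 8)]
print(all(is_monotone(samples, n) for n in range(1, 6)))
```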

If $ x$ is not $ n$-exact, then $ \mathit{RTZ}(x,n)$ and $ RAZ(x,n)$ are successive $ n$-exact numbers.

Lemma 6.2.18   (rtz-raz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ x>0$, $ n > 0$, and $ x$ is not $ n$-exact, then

$\displaystyle RAZ(x,n) = \mathit{RTZ}(x,n) + 2^{expo(x)+1-n} = fp^+(\mathit{RTZ}(x,n), n).$

PROOF: Let $ a = \mathit{fp}^+(\mathit{RTZ}(x,n), n)$. By Lemma 6.1.6, $ a = \mathit{RTZ}(x,n) + 2^{expo(x)+1-n}$, since $ expo(\mathit{RTZ}(x,n)) = expo(x)$. Since $ \mathit{RTZ}(x,n)$ and $ RAZ(x,n)$ are both $ n$-exact and $ \mathit{RTZ}(x,n) < RAZ(x,n)$, Lemmas 4.2.15 and 4.2.16 imply that $ a$ is $ n$-exact and $ RAZ(x,n) \geq a$. But by Lemma 6.1.7, $ a > x$, and therefore, by Lemma 6.2.14, $ a \geq RAZ(x,n)$.
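The relation (rtz-raz) says that, for a positive $ x$ that is not $ n$-exact, the two roundings differ by exactly one unit in the last place, $ 2^{expo(x)+1-n}$. A minimal check, assuming hypothetical Python helpers `expo`, `rtz`, `raz`, and `exactp` written from the definitions:

```python
from fractions import Fraction
from math import ceil, floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def _round(x, n, op):
    """sgn(x) * op(2**(n-1) * sig(x)) * 2**(expo(x)-n+1), with 0 mapped to 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = op(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

def rtz(x, n):
    return _round(x, n, floor)   # truncation: floor of the scaled significand

def raz(x, n):
    return _round(x, n, ceil)    # away from zero: ceiling instead

def exactp(x, n):
    """x is n-exact iff 2**(n-1) * sig(x) is an integer (0 counts as exact)."""
    x = Fraction(x)
    return x == 0 or (abs(x) * Fraction(2) ** (n - 1 - expo(x))).denominator == 1

def successive(x, n):
    """For x > 0 not n-exact: RAZ(x, n) == RTZ(x, n) + 2**(expo(x)+1-n)?"""
    return raz(x, n) == rtz(x, n) + Fraction(2) ** (expo(x) + 1 - n)

samples = [Fraction(p, 16) for p in range(1, 200)]
print(all(successive(x, n) for n in range(1, 6) for x in samples if not exactp(x, n)))
```

For instance, with $ x = 45/8$ and $ n = 3$ the two roundings are 5 and 6.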


The next four results correspond to Lemmas 6.1.14, 6.1.15, 6.1.16, and 6.1.17 of the preceding section.

Lemma 6.2.19   (raz-midpoint) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ x$ is $ (n+1)$-exact but not $ n$-exact, then

$\displaystyle RAZ(x,n) = x + sgn(x)2^{expo(x)-n}.$

PROOF: For the case $ n = 0$, we have $ sig(x) = 1$ by Definition 4.2.1, and by Lemmas 4.1.1 and 6.2.4,

$\displaystyle x + sgn(x)2^{expo(x)-n}$ $\displaystyle =$ $\displaystyle sgn(x)2^{expo(x)} + sgn(x)2^{expo(x)}$  
  $\displaystyle =$ $\displaystyle sgn(x)2^{expo(x)+1}$  
  $\displaystyle =$ $\displaystyle RAZ(x,n).$  

Thus, we may assume $ n > 0$, and by Lemma 6.2.2, we may also assume $ x>0$. Let $ a = x-2^{expo(x)-n}$ and $ b = x+2^{expo(x)-n}$. As noted in the proof of Lemma 6.1.14, $ a$ and $ b$ are both $ n$-exact and $ b = \mathit{fp}^+(a,n)$. Now by Lemma 6.2.14, $ b \geq RAZ(x,n)$, but if $ b = \mathit{fp}^+(a,n) > RAZ(x,n)$, then since $ RAZ(x,n)$ is $ n$-exact, Lemma 4.2.16 would imply $ a \geq RAZ(x,n)$, contradicting $ a<x$. Therefore, $ b = RAZ(x,n)$.
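The midpoint rule (raz-midpoint) can be checked on every 6-bit significand with a set low bit, which is 6-exact but not 5-exact. The sketch assumes hypothetical Python helpers `expo`, `raz`, `exactp`, and `midpoint_rule`; the fixed exponent 2 is an arbitrary choice.

```python
from fractions import Fraction
from math import ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def raz(x, n):
    """RAZ(x, n), using 2**(n-1) * sig(x) = 2**(n-1-expo(x)) * |x|."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = ceil(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

def exactp(x, n):
    """x is n-exact iff 2**(n-1) * sig(x) is an integer (0 counts as exact)."""
    x = Fraction(x)
    return x == 0 or (abs(x) * Fraction(2) ** (n - 1 - expo(x))).denominator == 1

def midpoint_rule(x, n):
    """x + sgn(x) * 2**(expo(x) - n), the value claimed for RAZ at a midpoint."""
    x = Fraction(x)
    step = Fraction(2) ** (expo(x) - n)
    return x + step if x > 0 else x - step

n = 5
# Odd 6-bit significands X give x = +/- X/8, which is 6-exact but not 5-exact.
mids = [s * Fraction(X, 8) for X in range(33, 64, 2) for s in (1, -1)]
print(all(exactp(x, n + 1) and not exactp(x, n) for x in mids))
print(all(raz(x, n) == midpoint_rule(x, n) for x in mids))
```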

Lemma 6.2.20   (raz-raz) Let $ x \in \mathbb{R}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$. If $ m \leq n$, then

$\displaystyle RAZ(RAZ(x,n),m) = RAZ(x,m).$

PROOF: We may assume $ x>0$. Consider first the case

$\displaystyle RAZ(x,n) = 2^{expo(x)+1}.$

In this case, $ RAZ(x,n)$ is $ m$-exact, so that

$\displaystyle RAZ(RAZ(x,n),m) = RAZ(x,n) = 2^{expo(x)+1}.$

By Lemma 6.2.8, we need only show that $ RAZ(x,m) \geq
2^{expo(x)+1}$. But since $ m \leq n$, $ RAZ(x,m)$ is $ n$-exact, and since $ RAZ(x,m) \geq x$, $ RAZ(x,m) \geq RAZ(x,n)$ by Lemma 6.2.14.

Thus, we may assume $ RAZ(x,n) < 2^{expo(x)+1}$. By Corollary 6.2.10,

$\displaystyle expo(RAZ(x,n)) = expo(x),$

and hence by Lemma 1.1.14,
$\displaystyle {RAZ(RAZ(x,n),m)}$
  $\displaystyle =$ $\displaystyle \lceil 2^{m-1-expo(x)} (\lceil 2^{n-1-expo(x)}x\rceil 2^{expo(x)+1-n})\rceil 2^{expo(x)+1-m}$  
  $\displaystyle =$ $\displaystyle \lceil \lceil 2^{n-1-expo(x)}x\rceil/2^{n-m} \rceil 2^{expo(x)+1-m}$  
  $\displaystyle =$ $\displaystyle \lceil 2^{n-1-expo(x)}x/2^{n-m} \rceil 2^{expo(x)+1-m}$  
  $\displaystyle =$ $\displaystyle \lceil 2^{m-1-expo(x)}x\rceil 2^{expo(x)+1-m}$  
  $\displaystyle =$ $\displaystyle RAZ(x,m).$  
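The identity (raz-raz) states that rounding away from zero in two stages, first to $ n$ bits and then to $ m \leq n$ bits, gives the same result as rounding directly to $ m$ bits. The sketch below checks this on an arbitrary grid, restricted to $ m \geq 1$ as an assumption of the illustration; `expo` and `raz` are hypothetical Python helpers.

```python
from fractions import Fraction
from math import ceil

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def raz(x, n):
    """RAZ(x, n), using 2**(n-1) * sig(x) = 2**(n-1-expo(x)) * |x|."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = ceil(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

x = Fraction(45, 8)
print(raz(x, 5), raz(raz(x, 5), 3), raz(x, 3))   # two-stage vs. direct rounding

samples = [Fraction(p, q) for p in range(-40, 41) for q in (1, 3, 7, 16)]
ok = all(raz(raz(x, n), m) == raz(x, m)
         for x in samples for n in range(1, 8) for m in range(1, n + 1))
print(ok)
```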

Lemma 6.2.21   (plus-raz) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ k \in \mathbb{Z}$. If $ x \geq 0$, $ y \geq 0$, and $ x$ is $ (k+expo(x)-expo(y))$-exact, then

$\displaystyle x+RAZ(y,k) = RAZ(x+y,k+expo(x+y)-expo(y)).$

PROOF: Let $ n = k+expo(x)-expo(y)$. Since $ x$ is $ n$-exact,

$\displaystyle x2^{k-1-expo(y)} = x2^{n-1-expo(x)} \in \mathbb{Z}.$

Let $ k' = k+expo(x+y)-expo(y)$. Then by Lemma 1.1.13,
$\displaystyle x+RAZ(y,k)$ $\displaystyle =$ $\displaystyle x+\lceil 2^{k-1-expo(y)}y \rceil 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle (x2^{k-1-expo(y)}+\lceil 2^{k-1-expo(y)}y \rceil) 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle \lceil 2^{k-1-expo(y)}(x+y) \rceil 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle \lceil 2^{k'-1-expo(x+y)}(x+y) \rceil 2^{expo(x+y)+1-k'}$  
  $\displaystyle =$ $\displaystyle RAZ(x+y,k').$  

Lemma 6.2.22   (minus-rtz-raz) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ k \in \mathbb{Z}$. If $ x>y>0$, $ k + expo(x-y) - expo(y) > 0$, and $ x$ is $ (k+expo(x)-expo(y))$-exact, then

$\displaystyle x-\mathit{RTZ}(y,k) = RAZ(x-y,k+expo(x-y)-expo(y)).$

PROOF: Let $ n = k+expo(x)-expo(y)$. Since $ x$ is $ n$-exact,

$\displaystyle x2^{k-1-expo(y)} = x2^{n-1-expo(x)} \in \mathbb{Z}.$

Let $ k' = k+expo(x-y)-expo(y)$. Then by Lemma 1.1.4,
$\displaystyle x-\mathit{RTZ}(y,k)$ $\displaystyle =$ $\displaystyle x-\lfloor 2^{k-1-expo(y)}y \rfloor 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle -(\lfloor 2^{k-1-expo(y)}y \rfloor - x2^{k-1-expo(y)}) 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle -\lfloor 2^{k-1-expo(y)}(y-x) \rfloor 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle -\lfloor 2^{k'-1-expo(y-x)}(y-x) \rfloor 2^{expo(y-x)+1-k'}$  
  $\displaystyle =$ $\displaystyle -RAZ(y-x,k')$  
  $\displaystyle =$ $\displaystyle RAZ(x-y,k').$  
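A concrete instance may help in reading (plus-raz) and (minus-rtz-raz). Take $ x = 6 = (110)_2$, $ y = 45/64 = (0.101101)_2$, and $ k = 3$, so that $ k+expo(x)-expo(y) = 6$ and $ x$ is 6-exact. The sketch below evaluates both sides of each identity in exact arithmetic; the Python helpers `expo`, `rtz`, and `raz` are assumptions of this illustration.

```python
from fractions import Fraction
from math import ceil, floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def _round(x, n, op):
    """sgn(x) * op(2**(n-1) * sig(x)) * 2**(expo(x)-n+1), with 0 mapped to 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = op(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

def rtz(x, n):
    return _round(x, n, floor)

def raz(x, n):
    return _round(x, n, ceil)

x, y, k = Fraction(6), Fraction(45, 64), 3

# plus-raz: x + RAZ(y, k) == RAZ(x + y, k + expo(x + y) - expo(y))
plus_lhs = x + raz(y, k)
plus_rhs = raz(x + y, k + expo(x + y) - expo(y))

# minus-rtz-raz: x - RTZ(y, k) == RAZ(x - y, k + expo(x - y) - expo(y))
minus_lhs = x - rtz(y, k)
minus_rhs = raz(x - y, k + expo(x - y) - expo(y))

print(plus_lhs, plus_rhs)
print(minus_lhs, minus_rhs)
```

Both pairs agree: the sum is $ 27/4$ and the difference is $ 43/8$. Subtracting a truncated value rounds the difference away from zero, which is why $ \mathit{RTZ}$ on the left pairs with RAZ on the right.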

The following result combines Lemmas 6.1.17 and 6.2.22.

Lemma 6.2.23   (rtz-plus-minus) Let $ x \in \mathbb{R}$ and $ y \in \mathbb{R}$ such that $ x \neq 0$, $ y \neq 0$, and $ x+y \neq 0$. Let $ k \in \mathbb{Z}$,

$\displaystyle k' = k+expo(x)-expo(y),$

and

$\displaystyle k'' = k+expo(x+y)-expo(y).$

If $ k''>0$ and $ x$ is $ k'$-exact, then

$\displaystyle x+\mathit{RTZ}(y,k) = \left\{\begin{array}{ll}
\mathit{RTZ}(x+y,k'') & \mbox{if $sgn(x+y) = sgn(y)$}\\
RAZ(x+y,k'') & \mbox{if $sgn(x+y)\neq sgn(y)$}. \end{array} \right.$

PROOF: By Lemmas 6.1.3 and 6.2.2, we may assume that $ x>0$. The case $ y > 0$ is handled by Lemma 6.1.16. For the case $ y<0$, Lemmas 6.1.17 and 6.2.22 cover the subcases $ -y>x$ and $ -y<x$, respectively. 


We turn now to the problem of bit-level implementation of rounding. Truncation, according to Lemma 6.1.18, is equivalent to a bit slice operation, which may be implemented as a logical operation using Corollary 6.1.21. Other rounding modes may be reduced to the case of truncation by a method known as constant injection. Let $ x$ be $ m$-exact with $ expo(x) = e$, say

$\displaystyle x = (1.\beta_1\beta_2\cdots\beta_{m-1})_2\cdot 2^e,$

to be rounded to $ n$ bits, where $ n \leq m$, according to a rounding mode $ {\cal R}$. Our goal is to construct a rounding constant $ {\cal C}$, depending on $ m$, $ n$, $ e$, and $ {\cal R}$, such that

$\displaystyle {\cal R}(x,n) = \mathit{RTZ}(x+{\cal C},n).$

The appropriate constant for the case $ {\cal R} = RAZ$ is

$\displaystyle {\cal C} = 2^e(2^{-(n-1)} - 2^{-(m-1)}) = 2^{e+1}(2^{-n}-2^{-m}),$

which consists of a string of 1's at the bit positions corresponding to the least significant $ m-n$ bits of $ x$, as illustrated below.


[Diagram: the addition $ x + {\cal C}$, with the 1's of $ {\cal C}$ aligned beneath the bits $ \beta_n \cdots \beta_{m-1}$ of $ x$ and 0's elsewhere, both operands scaled by $ 2^e$.]

As suggested by the diagram, the addition $ x+{\cal C}$ generates a carry into the position of $ \beta_{n-1}$ unless $ \beta_n = \cdots =
\beta_{m-1} = 0$, i.e., unless $ x$ is $ n$-exact, and $ x$ is rounded up accordingly. This observation is formalized by the following lemma.

Lemma 6.2.24   (raz-imp) Let $ x \in \mathbb{R}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$. If $ x$ is $ m$-exact, $ x>0$, and $ m \geq n > 0$, then

$\displaystyle RAZ(x,n) = \mathit{RTZ}(x+2^{expo(x)+1}(2^{-n}-2^{-m}),n).$

PROOF: Let $ a = \mathit{RTZ}(x+2^{expo(x)+1}(2^{-n}-2^{-m}),n)$. Since $ a$ and $ RAZ(x,n)$ are both $ n$-exact and

$\displaystyle a < x+2^{expo(x)+1-n} \leq RAZ(x,n) + 2^{expo(RAZ(x,n))+1-n},$

$ a \leq RAZ(x,n)$ by Lemma 4.2.16.

If $ x$ is $ n$-exact, then $ a \geq \mathit{RTZ}(x,n) = x = RAZ(x,n)$, and hence $ a = RAZ(x,n)$. Thus, we may assume $ x$ is not $ n$-exact. But then since $ x>\mathit{RTZ}(x,n)$ and $ x$ is $ m$-exact,

$\displaystyle x \geq \mathit{RTZ}(x,n) + 2^{expo(x)+1-m}$

and hence

$\displaystyle x+2^{expo(x)+1}(2^{-n}-2^{-m}) \geq \mathit{RTZ}(x,n) + 2^{expo(x)+1-n} = RAZ(x,n),$

which implies $ a \geq RAZ(x,n)$.
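Constant injection as in (raz-imp) can be sketched both on rationals and on integer significands. In the integer form, an $ m$-bit significand $ X$ (with $ 2^{m-1} \leq X < 2^m$) represents $ x = X \cdot 2^{e-(m-1)}$, the constant becomes a string of $ m-n$ ones, and truncation clears the low $ m-n$ bits. The function names `raz_by_injection` and `raz_bits`, and the choice $ m = 6$, $ e = 2$, are assumptions of this illustration rather than anything prescribed by the text.

```python
from fractions import Fraction
from math import ceil, floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e if Fraction(2) ** e <= x else e - 1

def _round(x, n, op):
    """sgn(x) * op(2**(n-1) * sig(x)) * 2**(expo(x)-n+1), with 0 mapped to 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = expo(x)
    mag = op(abs(x) * Fraction(2) ** (n - 1 - e)) * Fraction(2) ** (e + 1 - n)
    return mag if x > 0 else -mag

def rtz(x, n):
    return _round(x, n, floor)

def raz(x, n):
    return _round(x, n, ceil)

def raz_by_injection(x, m, n):
    """RTZ(x + C, n) with C = 2**(expo(x)+1) * (2**-n - 2**-m), for m-exact x > 0."""
    c = Fraction(2) ** (expo(x) + 1) * (Fraction(2) ** -n - Fraction(2) ** -m)
    return rtz(x + c, n)

def raz_bits(X, m, n):
    """Integer form: add m-n ones to the m-bit significand X, then clear the low m-n bits."""
    ones = (1 << (m - n)) - 1
    return ((X + ones) >> (m - n)) << (m - n)

m = 6
checks = []
for X in range(1 << (m - 1), 1 << m):       # all 6-bit significands
    x = Fraction(X, 8)                      # expo(x) = 2, so x = X * 2**(2 - 5)
    for n in range(1, m + 1):
        checks.append(raz(x, n) == raz_by_injection(x, m, n))
        checks.append(raz(x, n) == Fraction(raz_bits(X, m, n), 8))
print(all(checks))
```

When the injected carry propagates out of the leading bit, as for $ X = (111111)_2$ and $ n = 3$, the integer result is $ 2^m$, which corresponds to rounding up to the power of 2 above $ x$.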


David Russinoff 2017-08-01