6.4  Odd Rounding

It is often possible to produce the rounded result of a computation without explicitly generating every bit of the precise result. For example, a single-precision multiplication returns a 24-bit approximation to a 48-bit product, and one would like to avoid computing all 48 of those bits. Naturally, the information required depends on the rounding mode. For $ \mathit{RTZ}$, the most significant 24 bits of the result are sufficient; for $ RNA$, 25 are needed. For RAZ or RNE rounding, no result bits may be ignored, but we shall show that for any of the modes of interest, a correctly rounded $ m$-bit result may always be recovered from an $ (m+1)$-bit truncation together with an appended sticky bit, which simply indicates whether any accuracy was lost in the truncation. Although the operation of computing this $ (m+2)$-bit intermediate result is not conventionally viewed as a rounding mode itself, it satisfies the axioms at the beginning of this chapter and shares some other important properties with the modes discussed in the preceding sections, and therefore, we find this view to be useful.

The following function, which might be called “round to odd”, computes an $ n$-bit rounded result.

Definition 6.4.1   (rto) If $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ n>1$, then

$\displaystyle RTO(x,n) = \left\{\begin{array}{ll}
x & \mbox{if $x$ is $(n\!-\...
...athit{RTZ}(x,n-1)+sgn(x)2^{expo(x)+1-n} & \mbox{otherwise}. \end{array} \right.$

  (sgn-rto) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$, then

$\displaystyle sgn(RTO(x,n)) = sgn(x).$

PROOF: This is an obvious consequence of Lemma 6.1.2

  (rto-minus) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle RTO(-x,n) = -RTO(x,n).$

PROOF: If $ x$ is $ (n-1)$-exact, then so is $ -x$, and

$\displaystyle RTO(-x,n) = -x = -stick(x,n).$

Otherwise, by Lemmas 6.1.3 and 4.1.13,
$\displaystyle RTO(-x,n)$ $\displaystyle =$ $\displaystyle \mathit{RTZ}(-x,n-1)+sgn(-x)2^{expo(-x)+1-n}$  
  $\displaystyle =$ $\displaystyle -\mathit{RTZ}(x,n-1)-sgn(x)2^{expo(x)+1-n}$  
  $\displaystyle =$ $\displaystyle RTO(x,n).$  

  (rto-shift) For all $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ k \in \mathbb{Z}$,

$\displaystyle RTO(2^kx,n) = 2^k RTO(x,n).$

PROOF: If $ x$ is $ (n-1)$-exact, then so is $ -x$, and

$\displaystyle RTO(2^kx,n) = 2^kx = 2^kstick(x,n).$

Otherwise, by Lemmas 6.1.4 and 4.1.14,
$\displaystyle RTO(2^kx,n)$ $\displaystyle =$ $\displaystyle \mathit{RTZ}(2^kx,n-1)+sgn(2^kx)2^{expo(2^kx)+1-n}$  
  $\displaystyle =$ $\displaystyle 2^k\mathit{RTZ}(x,n-1)+2^ksgn(x)2^{expo(x)+1-n}$  
  $\displaystyle =$ $\displaystyle 2^kRTO(x,n).$  

  (expo-rto) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$, then

$\displaystyle expo(RTO(x,n)) = expo(x).$

PROOF: By Lemmas 6.4.2 and 6.4.1, we may assume that $ x>0$. If $ x$ is $ (n-1)$-exact, the claim is trivial. Suppose $ x$ is not $ (n-1)$-exact. By Lemma 6.1.6, $ \mathit{RTZ}(x,n-1) < 2^{expo(x)+1}$. Since $ \mathit{RTZ}(x,n-1)$ and $ 2^{expo(x)+1}$ are both $ (n-1)$-exact, Lemma 4.2.16 implies

$\displaystyle 2^{expo(x)+1}$ $\displaystyle \geq$ $\displaystyle \mathit{fp}^+(\mathit{RTZ}(x,n-1), n-1)$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x,n-1) + 2^{expo(x) + 1 - (n-1)}$  
  $\displaystyle >$ $\displaystyle \mathit{RTZ}(x,n-1) + 2^{expo(x) + 1 - n}$  
  $\displaystyle =$ $\displaystyle RTO(x,n),$  

and it follows that $ expo(RTO(x,n) = expo(x)$

  (rto-exactp-a) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, $ RTO(x,n)$ is $ n$-exact.

PROOF: If $ x$ is $ (n-1)$-exact, then $ RTO(x,n) = x$. Suppose $ x$ is not $ (n-1)$-exact. By Lemma 6.1.9, $ \mathit{RTZ}(x,n-1)$ is $ (n-1)$-exact, i.e.,

$\displaystyle 2^{(n-1)-1-expo(\mathit{RTZ}(x,n-1)}\mathit{RTZ}(x,n-1) = 2^{n-2-expo(x)}\mathit{RTZ}(x,n-1) \in \mathbb{Z}.$

Thus, by Lemma 6.4.4,
$\displaystyle {2^{n-1-expo(RTO(x,n))}RTO(x,n)}$
  $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}(\mathit{RTZ}(x,n-1) + sgn(x)2^{expo(x)+1-n}$  
  $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}\mathit{RTZ}(x,n-1) + sgn(x)$  
  $\displaystyle =$ $\displaystyle 2 \cdot 2^{n-2-expo(x)}\mathit{RTZ}(x,n-1) + sgn(x)$  
  $\displaystyle \in$ $\displaystyle \mathbb{Z}.$  

  (rto-exactp-b) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$ and x is $ n$-exact, then

$\displaystyle RTO(x,n) = x.$

PROOF: We may assume that $ x$ is not $ (n-1)$-exact, and hence Lemma 6.1.14 yuelds

$\displaystyle \mathit{RTZ}(x,n-1) = x - sgn(x)2^{expo(x)+1-n}.$

Thus,

$\displaystyle RTO(x,n) = \mathit{RTZ}(x,n-1) + sgn(x)2^{expo(x)+1-n} = x.$

  (rto-monotone) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ x \leq y$, then $ RTO(x,n) \leq RTO(y,n)$.

PROOF: Using Lemmas 6.4.2 and 6.4.1, we may assume that $ 0<x<y$. Suppose first that $ x$ is $ (n-1)$-exact. If $ y$ is also $ (n-1)$-exact, then the claim is trivial, and if not, then by Lemmas 6.1.10 and 6.1.13,

$\displaystyle RTO(x,n)$ $\displaystyle =$ $\displaystyle x$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x,n-1)$  
  $\displaystyle \leq$ $\displaystyle \mathit{RTZ}(y,n-1)$  
  $\displaystyle <$ $\displaystyle \mathit{RTZ}(y,n-1) + 2^{expo(y) + 1-n}$  
  $\displaystyle =$ $\displaystyle RTO(y,n).$  

Similarly, if neither $ x$ nor $ y$ is $ (n-1)$-exact, then Lemmas 6.1.13 and 4.1.5 imply
$\displaystyle RTO(x,n)$ $\displaystyle =$ $\displaystyle \mathit{RTZ}(x,n-1) 2^{expo(x) + 1-n}$  
  $\displaystyle \leq$ $\displaystyle \mathit{RTZ}(y,n-1)$  
  $\displaystyle <$ $\displaystyle \mathit{RTZ}(y,n-1) + 2^{expo(y) + 1-n}$  
  $\displaystyle =$ $\displaystyle RTO(y,n).$  

In the remaining case, $ y$ is $ (n-1)$-exact and $ x$ is not. Now by Lemmas 6.1.9 and 4.2.5, $ \mathit{RTZ}(x,n-1)$ and $ y$ are both $ n$-exact, and hence, by Lemmas 4.2.16 and 6.1.6,
$\displaystyle RTO(y,n)$ $\displaystyle =$ $\displaystyle y$  
  $\displaystyle \geq$ $\displaystyle \mathit{fp}^+(\mathit{RTZ}(x,n-1),n)$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x,n-1) + 2{expo(x) + 1 - n}$  
  $\displaystyle =$ $\displaystyle RTO(x,n).$  

The following property, which is not shared by any of the other modes that we have considered, is the basis of this mode's utility.

  (rto-exactp-c) Let $ x \in \mathbb{R}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$. If $ n>m>0$, then $ RTO(x,n)$ is $ m$-exact if and only if $ x$ is $ m$-exact.

PROOF: According to Lemmas 6.4.2 and 6.4.1, we may assume that $ x>0$. But clearly we may also assume that $ x$ is not $ (n-1)$-exact, and hence, by Lemma 4.2.5, $ x$ is not $ m$-exact. On the other hand, $ \mathit{RTZ}(x,n-1)$ is $ (n-1)$-exact, and since

$\displaystyle RTO(x,n)$ $\displaystyle =$ $\displaystyle \mathit{RTZ}(x,n-1) + 2^{expo(x) + 1 - n}$  
  $\displaystyle <$ $\displaystyle \mathit{RTZ}(x,n-1) + 2^{expo(x) + 1 - (n-1)}$  
  $\displaystyle =$ $\displaystyle \mathit{fp}^+(\mathit{RTZ}(x,n-1),n-1),$  

Lemma 4.2.16 implies that $ RTO(x,n)$ is not $ (n-1)$-exact. Applying Lemma 4.2.5 again, we conclude that $ RTO(x,n)$ is not $ m$-exact. 


For directed rounding, an $ m$-bit rounded result may be derived from an $ (m+1)$-bit odd rounding.

  (rtz-rto, raz-rto) Let $ m \in \mathbb{N}$, $ n \in \mathbb{N}$, and $ x \in \mathbb{R}$. If $ n>m>0$, then

$\displaystyle \mathit{RTZ}(RTO(x,n),m) = \mathit{RTZ}(x,m)$

and

$\displaystyle RAZ(RTO(x,n),m) = RAZ(x,m).$

PROOF: We may assume that $ x>0$ and $ x$ is not $ (n-1)$-exact; the other cases follow trivially. First, note that by Lemmas 6.4.5 and 6.4.8, $ RTO(x,n)$ is $ n$-exact but not $ (n-1)$-exact, and therefore, according to Lemmas 6.1.14 and 6.4.4,

$\displaystyle \mathit{RTZ}(RTO(x,n),n-1)$ $\displaystyle =$ $\displaystyle RTO(x,n) - 2^{expo(RTO(x,n)) - (n-1)}$  
  $\displaystyle =$ $\displaystyle RTO(x,n) - 2^{expo(x) +1-n}$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x,n-1).$  

Thus, by Lemma 6.1.15, for any $ m < n$,

$\displaystyle \mathit{RTZ}(RTO(x,n),m) = \mathit{RTZ}(\mathit{RTZ}(x,n-1),m) = \mathit{RTZ}(x,m).$

The corresponding result for RAZ may be similarly derived. 


For unbiased rounding, one extra bit is required.

  (rne-rto, rna-rto) Let $ m \in \mathbb{N}$, $ n \in \mathbb{N}$, and $ x \in \mathbb{R}$. If $ n>m+1$ and $ m>0$, then

$\displaystyle RNE(RTO(x,n),m) = RNE(x,m)$

and

$\displaystyle RNA(RTO(x,n),m) = RNA(x,m).$

PROOF: The second equation follows easily from Lemmas 6.3.43 and 6.4.9:

$\displaystyle RNA(RTO(x,n),m)$ $\displaystyle =$ $\displaystyle RNA(\mathit{RTZ}(RTO(x,n),m+1),m)$  
  $\displaystyle =$ $\displaystyle RNA(\mathit{RTZ}(x,m+1),m)$  
  $\displaystyle =$ $\displaystyle RNA(x,m).$  

To prove the first equation using Lemma 6.4.9, it will suffice to show that if $ \mathit{RTZ}(x,m+1) = \mathit{RTZ}(y,m+1)$ and $ RAZ(x,m+1)=RAZ(y,m+1)$, then $ RNE(x,m)=RNE(y,m)$. Without loss of generality, we may assume $ x \leq y$. Suppose $ RNE(x,m) \neq RNE(y,m)$. Then by Lemma 6.3.18, for some $ (m+1)$-exact $ a$, $ x \leq a \leq y$. But this implies $ x=a$, for otherwise $ \mathit{RTZ}(x,m+1) \leq x <
a \leq \mathit{RTZ}(y,m+1)$. Similarly, $ y=a$, for otherwise $ RAZ(x,m+1)
\leq a < y \leq RAZ(y,m+1)$. Thus, $ x = y$, a contradiction. 


Lemma 6.4.11   (rto-rto) Let $ m \in \mathbb{N}$, $ n \in \mathbb{N}$, and $ x \in \mathbb{R}$. If $ n \geq m > 1$, then

$\displaystyle RTO(RTO(x,n),m) = RTO(x,m).$

The following important property is essential for floating-point addition, as it allows a rounded sum or difference of unaligned numbers to be derived without computing the full result explicitly.

  (rto-plus) Let $ x \in \mathbb{R}$ and $ y \in \mathbb{R}$ such that $ x \neq 0$, $ y \neq 0$, and $ x+y \neq 0$. Let $ k \in \mathbb{N}$,

$\displaystyle k' = k+expo(x)-expo(y),$

and

$\displaystyle k” = k+expo(x+y)-expo(y).$

If $ k>1$, $ k'>1$, $ k”>1$, and $ x$ is $ (k'-1)$-exact, then

$\displaystyle x+RTO(y,k) = RTO(x+y,k”).$

PROOF: Since $ x$ is $ (k'-1)$-exact,

$\displaystyle 2^{k-2-expo(y)}x = 2^{(k'-1)-1-expo(x)}x \in \mathbb{Z}.$

Thus,
$\displaystyle \mbox{$y$ is $(k-1)$-exact}$ $\displaystyle \Leftrightarrow$ $\displaystyle 2^{k-2-expo(y)}y \in \mathbb{Z}$  
  $\displaystyle \Leftrightarrow$ $\displaystyle 2^{k-2-expo(y)}y + 2^{k-2-expo(y)}x \in \mathbb{Z}$  
  $\displaystyle \Leftrightarrow$ $\displaystyle 2^{k”-2-expo(x+y)}(x+y) \in \mathbb{Z}$  
  $\displaystyle \Leftrightarrow$ $\displaystyle \mbox{$x+y$ is $(k”-1)$-exact}$$\displaystyle .$  

If $ y$ is $ (k-1)$-exact, then

$\displaystyle x+RTO(y,k) = x+y = RTO(x+y,k”).$

Thus, we may assume that $ y$ is not $ (k-1)$-exact. We invoke Corollary 6.2.23:

$\displaystyle x+\mathit{RTZ}(y,k) = \left\{\begin{array}{ll}
\mathit{RTZ}(x+y,...
...gn(y)$}\\
RAZ(x+y,k”) & \mbox{if $sgn(x+y)\neq sgn(y)$}. \end{array} \right.$

Now if $ sgn(x+y) = sgn(y)$, then
$\displaystyle x+RTO(y,k)$ $\displaystyle =$ $\displaystyle x + \mathit{RTZ}(y,k-1) + sgn(y)2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x+y,k”-1) + sgn(x+y)2^{expo(x+y)+1-k”}$  
  $\displaystyle =$ $\displaystyle RTO(x+y,k”).$  

On the other hand, if $ sgn(x+y) \neq sgn(y)$, then
$\displaystyle x+RTO(y,k)$ $\displaystyle =$ $\displaystyle x + \mathit{RTZ}(y,k-1) + sgn(y)2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle RAZ(x+y,k”-1) - sgn(x+y)2^{expo(x+y)+1-k”}$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x+y,k”-1) + sgn(x+y)2^{expo(x+y)+1-k”}$  
  $\displaystyle =$  

David Russinoff 2017-08-01