next up previous contents
Next: IEEE Rounding Up: Rounding Previous: Unbiased Rounding   Contents


Sticky Rounding

It is often possible to produce the rounded result of a computation without explicitly generating every bit of the precise result. For example, a single-precision multiplication returns a 24-bit approximation to a 48-bit product, and one would like to avoid computing all 48 of those bits. Naturally, the information required depends on the rounding mode. For trunc, obviously, the most significant 24 bits of the result are sufficient; for $ near^+$ , 25 are needed. For away or near rounding, no result bits may be ignored, but we shall show that for any of the modes of interest, a correctly rounded $ m$ -bit result may always be recovered from an $ (m+1)$ -bit truncation together with an appended sticky bit, which simply indicates whether any accuracy was lost in the truncation. Although the operation of computing this $ (m+2)$ -bit intermediate result is not conventionally viewed as a rounding mode itself, it shares some important properties with the modes discussed in the preceding sections and therefore, we find this view to be useful.

The following function computes an $ n$ -bit “sticky-rounded” result.

Definition 5.4.1   (sticky) If $ x \in \mathbb{Q}$ , $ n \in \mathbb{N}$ , and $ n>1$ , then

$\displaystyle \;sticky(x,n) = \left\{\begin{array}{ll}
x & \mbox{if $x$ is $(n...
...}\\
trunc(x,n-1)+sgn(x)2^{expo(x)+1-n} & \mbox{otherwise}. \end{array} \right.$

  (sgn-sticky) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ . If $ n>0$ , then

$\displaystyle sgn(sticky(x,n)) = sgn(x).$

PROOF: This is an obvious consequence of Lemma 5.1.2

  (sticky-minus) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ ,

$\displaystyle sticky(-x,n) = -sticky(x,n).$

PROOF: If $ x$ is $ (n-1)$ -exact, then so is $ -x$ , and

$\displaystyle sticky(-x,n) = -x = -stick(x,n).$

Otherwise, by Lemmas 5.1.3 and 4.1.11,
$\displaystyle sticky(-x,n)$ $\displaystyle =$ $\displaystyle trunc(-x,n-1)+sgn(-x)2^{expo(-x)+1-n}$  
  $\displaystyle =$ $\displaystyle -trunc(x,n-1)-sgn(x)2^{expo(x)+1-n}$  
  $\displaystyle =$ $\displaystyle sticky(x,n).$  

  (sticky-shift) For all $ x \in \mathbb{Q}$ , $ n \in \mathbb{N}$ , and $ k \in \mathbb{Z}$ ,

$\displaystyle sticky(2^kx,n) = 2^k sticky(x,n).$

PROOF: If $ x$ is $ (n-1)$ -exact, then so is $ -x$ , and

$\displaystyle sticky(2^kx,n) = 2^kx = 2^kstick(x,n).$

Otherwise, by Lemmas 5.1.4 and 4.1.12,
$\displaystyle sticky(2^kx,n)$ $\displaystyle =$ $\displaystyle trunc(2^kx,n-1)+sgn(2^kx)2^{expo(2^kx)+1-n}$  
  $\displaystyle =$ $\displaystyle 2^ktrunc(x,n-1)+2^ksgn(x)2^{expo(x)+1-n}$  
  $\displaystyle =$ $\displaystyle 2^ksticky(x,n).$  

  (expo-sticky) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ . If $ n>0$ , then

$\displaystyle expo(sticky(x,n)) = expo(x).$

PROOF: By Lemmas 5.4.2 and 5.4.1, we may assume that $ x>0$ . If $ x$ is $ (n-1)$ -exact, the claim is trivial. Suppose $ x$ is not $ (n-1)$ -exact. By Lemma 5.1.6, $ trunc(x,n-1) < 2^{expo(x)+1}$ . Since $ trunc(x,n-1)$ and $ 2^{expo(x)+1}$ are both $ (n-1)$ -exact, Lemma 4.2.15 implies

$\displaystyle 2^{expo(x)+1}$ $\displaystyle \geq$ fp$\displaystyle ^+(trunc(x,n-1), n-1)$  
  $\displaystyle =$ $\displaystyle trunc(x,n-1) + 2^{expo(x) + 1 - (n-1)}$  
  $\displaystyle >$ $\displaystyle trunc(x,n-1) + 2^{expo(x) + 1 - n}$  
  $\displaystyle =$ $\displaystyle sticky(x,n),$  

and it follows that $ expo(sticky(x,n) = expo(x)$

  (sticky-exactp-a) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ , $ sticky(x,n)$ is $ n$ -exact.

PROOF: If $ x$ is $ (n-1)$ -exact, then $ sticky(x,n) = x$ . Suppose $ x$ is not $ (n-1)$ -exact. By Lemma 5.1.9, $ trunc(x,n-1)$ is $ (n-1)$ -exact, i.e.,

$\displaystyle 2^{(n-1)-1-expo(trunc(x,n-1)}trunc(x,n-1) = 2^{n-2-expo(x)}trunc(x,n-1) \in \mathbb{Z}.$

Thus, by Lemma 5.4.4,
$\displaystyle {2^{n-1-expo(sticky(x,n))}sticky(x,n)}$
  $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}(trunc(x,n-1) + sgn(x)2^{expo(x)+1-n}$  
  $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}trunc(x,n-1) + sgn(x)$  
  $\displaystyle =$ $\displaystyle 2 \cdot 2^{n-2-expo(x)}trunc(x,n-1) + sgn(x)$  
  $\displaystyle \in$ $\displaystyle \mathbb{Z}.$  

  (sticky-exactp-b) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ . If $ n>0$ and x is $ n$ -exact, then

$\displaystyle sticky(x,n) = x.$

PROOF: We may assume that $ x$ is not $ (n-1)$ -exact, and hence Lemma 5.1.14 yuelds

$\displaystyle trunc(x,n-1) = x - sgn(x)2^{expo(x)+1-n}.$

Thus,

$\displaystyle sticky(x,n) = trunc(x,n-1) + sgn(x)2^{expo(x)+1-n} = x.$

  (sticky-monotone) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , and $ n \in \mathbb{N}$ . If $ x \leq y$ , then $ sticky(x,n) \leq sticky(y,n)$ .

PROOF: Using Lemmas 5.4.2 and 5.4.1, we may assume that $ 0<x<y$ . Suppose first that $ x$ is $ (n-1)$ -exact. If $ y$ is also $ (n-1)$ -exact, then the claim is trivial, and if not, then by Lemmas 5.1.11 and 5.1.13,

$\displaystyle sticky(x,n)$ $\displaystyle =$ $\displaystyle x$  
  $\displaystyle =$ $\displaystyle trunc(x,n-1)$  
  $\displaystyle \leq$ $\displaystyle trunc(y,n-1)$  
  $\displaystyle <$ $\displaystyle trunc(y,n-1) + 2^{expo(y) + 1-n}$  
  $\displaystyle =$ $\displaystyle sticky(y,n).$  

Similarly, if neither $ x$ nor $ y$ is $ (n-1)$ -exact, then Lemmas 5.1.13 and 4.1.4 imply
$\displaystyle sticky(x,n)$ $\displaystyle =$ $\displaystyle trunc(x,n-1) 2^{expo(x) + 1-n}$  
  $\displaystyle \leq$ $\displaystyle trunc(y,n-1)$  
  $\displaystyle <$ $\displaystyle trunc(y,n-1) + 2^{expo(y) + 1-n}$  
  $\displaystyle =$ $\displaystyle sticky(y,n).$  

In the remaining case, $ y$ is $ (n-1)$ -exact and $ x$ is not. Now by Lemmas 5.1.9 and 4.2.5, $ trunc(x,n-1)$ and $ y$ are both $ n$ -exact, and hence, by Lemmas 4.2.15 and 5.1.6,
$\displaystyle sticky(y,n)$ $\displaystyle =$ $\displaystyle y$  
  $\displaystyle \geq$ fp$\displaystyle ^+(trunc(x,n-1),n)$  
  $\displaystyle =$ $\displaystyle trunc(x,n-1) + 2{expo(x) + 1 - n}$  
  $\displaystyle =$ $\displaystyle sticky(x,n).$  

The following property, which is not held by any of the other modes that we have considered, is the basis of this mode's utility.

  (sticky-exactp-m) Let $ x \in \mathbb{Q}$ , $ m \in \mathbb{N}$ , and $ n \in \mathbb{N}$ . If $ n>m>0$ , then $ sticky(x,n)$ is $ m$ -exact if and only if $ x$ is $ m$ -exact.

PROOF: According to Lemmas 5.4.2 and 5.4.1, we may assume that $ x>0$ . But clearly we may also assume that $ x$ is not $ (n-1)$ -exact, and hence, by Lemma 4.2.5, $ x$ is not $ m$ -exact. On the other hand, $ trunc(x,n-1)$ is $ (n-1)$ -exact, and since

$\displaystyle sticky(x,n)$ $\displaystyle =$ $\displaystyle trunc(x,n-1) + 2^{expo(x) + 1 - n}$  
  $\displaystyle <$ $\displaystyle trunc(x,n-1) + 2^{expo(x) + 1 - (n-1)}$  
  $\displaystyle =$ fp$\displaystyle ^+(trunc(x,n-1),n-1),$  

Lemma 4.2.15 implies that $ sticky(x,n)$ is not $ (n-1)$ -exact. Applying Lemma 4.2.5 again, we conclude that $ sticky(x,n)$ is not $ m$ -exact. 


For directed rounding, an $ m$ -bit rounded result may be derived from an $ (m+1)$ -bit sticky rounding.

  (trunc-sticky, away-sticky) Let $ m \in \mathbb{N}$ , $ n \in \mathbb{N}$ , and $ x \in \mathbb{Q}$ . If $ n>m>0$ , then

$\displaystyle trunc(sticky(x,n),m) = trunc(x,m)$

and

$\displaystyle away(sticky(x,n),m) = away(x,m).$

PROOF: We may assume that $ x>0$ and $ x$ is not $ (n-1)$ -exact; the other cases follow trivially. First, note that by Lemmas 5.4.5 and 5.4.8, $ sticky(x,n)$ is $ n$ -exact but not $ (n-1)$ -exact, and therefore, according to Lemmas 5.1.14 and 5.4.4,

$\displaystyle trunc(sticky(x,n),n-1)$ $\displaystyle =$ $\displaystyle sticky(x,n) - 2^{expo(sticky(x,n)) - (n-1)}$  
  $\displaystyle =$ $\displaystyle sticky(x,n) - 2^{expo(x) +1-n}$  
  $\displaystyle =$ $\displaystyle trunc(x,n-1).$  

Thus, by Lemma 5.1.15, for any $ m < n$ ,

$\displaystyle trunc(sticky(x,n),m) = trunc(trunc(x,n-1),m) = trunc(x,m).$

The corresponding result for away may be similarly derived. 


For unbiased rounding, one extra bit is required.

  (near-sticky, near+-sticky) Let $ m \in \mathbb{N}$ , $ n \in \mathbb{N}$ , and $ x \in \mathbb{Q}$ . If $ n>m+1$ and $ m>0$ , then

$\displaystyle near(sticky(x,n),m) = near(x,m)$

and

$\displaystyle near^+(sticky(x,n),m) = near^+(x,m).$

PROOF: The second equation follows easily from Lemmas 5.3.35 and 5.4.9:

$\displaystyle near^+(sticky(x,n),m)$ $\displaystyle =$ $\displaystyle near^+(trunc(sticky(x,n),m+1),m)$  
  $\displaystyle =$ $\displaystyle near^+(trunc(x,m+1),m)$  
  $\displaystyle =$ $\displaystyle near^+(x,m).$  

To prove the first equation using Lemma 5.4.9, it will suffice to show that if $ trunc(x,m+1) = trunc(y,m+1)$ and $ away(x,m+1)=away(y,m+1)$ , then $ near(x,m)=near(y,m)$ . Without loss of generality, we may assume $ x \leq y$ . Suppose $ near(x,m) \neq near(y,m)$ . Then by Lemma 5.3.14, for some $ (m+1)$ -exact $ a$ , $ x \leq a \leq y$ . But this implies $ x=a$ , for otherwise $ trunc(x,m+1) \leq x <
a \leq trunc(y,m+1)$ . Similarly, $ y=a$ , for otherwise $ away(x,m+1)
\leq a < y \leq away(y,m+1)$ . Thus, $ x=y$ , a contradiction. 


Lemma 5.4.11   (sticky-sticky) Let $ m \in \mathbb{N}$ , $ n \in \mathbb{N}$ , and $ x \in \mathbb{Q}$ . If $ n \geq m > 1$ , then

$\displaystyle sticky(sticky(x,n),m) = sticky(x,m).$

The following property is essential for computing a rounded sum or difference.

  (sticky-plus) Let $ x \in \mathbb{Q}$ and $ y \in \mathbb{Q}$ such that $ x \neq 0$ , $ y \neq 0$ , and $ x+y \neq 0$ . Let $ k \in \mathbb{N}$ ,

$\displaystyle k' = k+expo(x)-expo(y),$

and

$\displaystyle k” = k+expo(x+y)-expo(y).$

If $ k>1$ , $ k'>1$ , $ k”>1$ , and $ x$ is $ (k'-1)$ -exact, then

$\displaystyle x+sticky(y,k) = sticky(x+y,k”).$

PROOF: Since $ x$ is $ (k'-1)$ -exact,

$\displaystyle 2^{k-2-expo(y)}x = 2^{(k'-1)-1-expo(x)}x \in \mathbb{Z}.$

Thus,
$\displaystyle \mbox{$y$ is $(k-1)$-exact}$ $\displaystyle \Leftrightarrow$ $\displaystyle 2^{k-2-expo(y)}y \in \mathbb{Z}$  
  $\displaystyle \Leftrightarrow$ $\displaystyle 2^{k-2-expo(y)}y + 2^{k-2-expo(y)}x \in \mathbb{Z}$  
  $\displaystyle \Leftrightarrow$ $\displaystyle 2^{k”-2-expo(x+y)}(x+y) \in \mathbb{Z}$  
  $\displaystyle \Leftrightarrow$ $\displaystyle \mbox{$x+y$ is $(k”-1)$-exact}$$\displaystyle .$  

If $ y$ is $ (k-1)$ -exact, then

$\displaystyle x+sticky(y,k) = x+y = sticky(x+y,k”).$

Thus, we may assume that $ y$ is not $ (k-1)$ -exact. We invoke Corollary 5.2.21:

$\displaystyle x+trunc(y,k) = \left\{\begin{array}{ll}
trunc(x+y,k”) & \mbox{if...
...gn(y)$}\\
away(x+y,k”) & \mbox{if $sgn(x+y)\neq sgn(y)$}. \end{array} \right.$

Now if $ sgn(x+y) = sgn(y)$ , then
$\displaystyle x+sticky(y,k)$ $\displaystyle =$ $\displaystyle x + trunc(y,k-1) + sgn(y)2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle trunc(x+y,k”-1) + sgn(x+y)2^{expo(x+y)+1-k”}$  
  $\displaystyle =$ $\displaystyle sticky(x+y,k”).$  

On the other hand, if $ sgn(x+y) \neq sgn(y)$ , then
$\displaystyle x+sticky(y,k)$ $\displaystyle =$ $\displaystyle x + trunc(y,k-1) + sgn(y)2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle away(x+y,k”-1) - sgn(x+y)2^{expo(x+y)+1-k”}$  
  $\displaystyle =$ $\displaystyle trunc(x+y,k”-1) + sgn(x+y)2^{expo(x+y)+1-k”}$  
  $\displaystyle =$  


next up previous contents
Next: IEEE Rounding Up: Rounding Previous: Unbiased Rounding   Contents
David Russinoff 2007-01-02