next up previous contents
Next: Denormal Rounding Up: Rounding Previous: Sticky Rounding   Contents


IEEE Rounding

The IEEE standard prescribes four rounding modes: “round to nearest” (near), “round toward 0” (trunc), “round toward +INFINITY”, and “round toward -INFINITY”. The last two are formalized here by the functions inf and minf.

Definition 5.5.1   (inf) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ ,

inf$\displaystyle (x,n) = \left\{\begin{array}{ll}
away(x,n) & \mbox{if $x \geq 0$}\\
trunc(x,n) & \mbox{if $x < 0$.}\end{array} \right.$

  (inf-lower-bound) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ ,

inf$\displaystyle (x,n) \geq x.$

PROOF: If $ x \geq 0$ , then

inf$\displaystyle (x,n) = away(x,n) = \vert away(x,n)\vert \geq \vert x\vert = x;$

if $ x<0$ , then

inf$\displaystyle (x,n) = trunc(x,n) = -\vert trunc(x,n)\vert \geq -\vert x\vert = x.$

Definition 5.5.2   (minf) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ ,

minf$\displaystyle (x,n) = \left\{\begin{array}{ll}
trunc(x,n) & \mbox{if $x \geq 0$}\\
away(x,n) & \mbox{if $x < 0$.}
\end{array} \right.$

  (minf-upper-bound) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ ,

minf$\displaystyle (x,n) \leq x.$

PROOF: If $ x \geq 0$ , then

minf$\displaystyle (x,n) = trunc(x,n) = \vert trunc(x,n)\vert \leq \vert x\vert = x;$

if $ x<0$ , then

inf$\displaystyle (x,n) = away(x,n) = -\vert away(x,n)\vert \leq -\vert x\vert = x.$

In this section, we collect a set of general results that pertain to all of the IEEE rounding modes, most of which are essentially restatements of lemmas that are proved in earlier sections. Since these results also hold for two of the other modes that we have discussed, away and $ near^+$ , we shall state them as generally as possible. For this purpose, we define a common rounding mode to be any of the six, i.e., trunc, away, near, $ near^+$ , inf, or minf.

  (rnd-choice) Let $ {\cal R}$ be a common rounding mode. For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ , either

$\displaystyle {\cal R}(x,n) = trunc(x,n)$

or

$\displaystyle {\cal R}(x,n) = away(x,n).$

PROOF: This is an immediate consequence of Definitions 5.3.1, 5.3.2, 5.5.1, and 5.5.2


Most of following results are consequences of Lemma 5.5.3 and the corresponding lemmas pertaining to trunc and away.

  (sgn-rnd) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ . Let $ {\cal R}$ be a common rounding mode. If $ n>0$ , then

$\displaystyle sgn({\cal R}(x,n)) = sgn(x).$

PROOF: See Lemmas 5.1.2 and 5.2.2


Note that inf and minf do not share the oddness property held by the other modes. A generalization is given by the following lemma.

  (rnd-minus) Let $ {\cal R}$ be a common rounding mode and let

$\displaystyle \hat{\cal R} = \left\{\begin{array}{ll}
\mbox{\it minf} & \mbox{i...
...{\cal R} = \mbox{\it minf}$}\\
{\cal R} & \mbox{otherwise.}\end{array} \right.$

For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ ,

$\displaystyle {\cal R}(-x,n) = -\hat{\cal R}(x,n).$

PROOF: Suppose, for example, that $ {\cal R} =$   inf and $ x>0$ . Then since $ -x < 0$ ,

$\displaystyle {\cal R}(-x,n) =$   inf$\displaystyle (-x,n) = trunc(-x,n) = -trunc(x,n) = -\hat{\cal R}(x,n).$

The other cases are handled similarly. 

  (rnd-exactp-a) Let $ {\cal R}$ be a common rounding mode. For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ , $ {\cal R}(x,n)$ is $ n$ -exact.

PROOF: See Lemmas 5.1.9 and 5.2.11

  (rnd-exactp-b) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ and let $ {\cal R}$ be a common rounding mode. If $ n>0$ and x is $ n$ -exact, then

$\displaystyle {\cal R}(x,n) = x.$

PROOF: See Lemmas 5.1.11 and 5.2.13

  (rnd-exactp-c,rnd-exactp-d) Let $ x \in \mathbb{Q}$ , $ a \in \mathbb{Q}$ , and $ n \in \mathbb{N}$ , $ n>0$ , and let $ {\cal R}$ be a common rounding mode. Suppose $ a$ is $ n$ -exact.

(a) If $ a \geq x$ , then $ a \geq {\cal R}(x,n)$ .
(b) If $ a \leq x$ , then $ a \leq {\cal R}(x,n)$ .

PROOF: See Lemmas 5.1.12 and 5.2.14

  (rnd<=away, rnd>=trunc) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ and let $ {\cal R}$ be a common rounding mode. If $ x>0$ , then

$\displaystyle trunc(x,n) \leq {\cal R}(x,n) \leq away(x,n).$

PROOF: This is an immediate consequence of Lemmas 5.1.5 and 5.2.5:

$\displaystyle trunc(x,n) = \vert trunc(x,n)\vert \leq \vert x\vert \leq \vert away(x,n)\vert = away(x,n).$

  (rnd-diff) Let $ x \in \mathbb{Q}$ , $ n \in \mathbb{N}$ , $ n>0$ , and let $ {\cal R}$ be a common rounding mode. Then

$\displaystyle \vert x - {\cal R}(x,n)\vert < 2^{expo(x)-n+1}.$

PROOF: See Lemmas 5.1.8 and 5.2.7

  (expo-rnd) Let $ {\cal R}$ be a common rounding mode. For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{N}$ , if $ \vert{\cal R}(x,n)\vert \neq 2^{expo(x)+1}$ , then

$\displaystyle expo({\cal R}(x,n)) = expo(x).$

PROOF: See Lemmas 5.1.6 and 5.2.10

  (rnd-monotone) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , and $ n \in \mathbb{N}$ . Let $ {\cal R}$ be a common rounding mode. If $ x \leq y$ and $ n>0$ , then

$\displaystyle {\cal R}(x,n) \leq {\cal R}(y,n).$

PROOF: The modes trunc, away, near, and $ near^+$ are covered by Lemmas 5.1.13, 5.2.15, 5.3.13, and 5.3.30. For the modes inf and minf, we may assume, using Lemma 5.5.4, that $ x$ and $ y$ are either both positive or both negative. It follows that either

$\displaystyle {\cal R}(x,n) = trunc(x,n) \leq trunc(y,n) = {\cal R}(y,n)$

or

$\displaystyle {\cal R}(x,n) = away(x,n) \leq away(y,n) = {\cal R}(y,n).$

  (rnd-shift) Let $ {\cal R}$ be a common rounding mode. For all $ x \in \mathbb{Q}$ , $ n \in \mathbb{N}$ , and $ k \in \mathbb{Z}$ ,

$\displaystyle {\cal R}(2^kx,n) = 2^k {\cal R}(x,n).$

PROOF: The modes trunc, away, near, and $ near^+$ are covered by Lemmas 5.1.4, 5.2.4, 5.3.8, and 5.3.20. For the modes inf and minf, we may assume, using Lemma 5.5.5, that $ x>0$ , in which case

inf$\displaystyle (2^kx,n) = away(2^kx,n) = 2^kaway(x,n) = 2^k$inf$\displaystyle (x,n)$

and

minf$\displaystyle (2^kx,n) = trunc(2^kx,n) = 2^ktrunc(x,n) = 2^k$minf$\displaystyle (x,n).$

  (plus-rnd) Let $ {\cal R}$ be a common rounding mode, $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , and $ k \in \mathbb{Z}$ with $ x \geq 0$ and $ y \geq 0$ . Let $ k' = k+expo(x)-expo(y)$ and $ k” = k+expo(x+y)-expo(y)$ . If $ x$ is $ (k'-1)$ -exact, then

$\displaystyle x+{\cal R}(y,k) = {\cal R}(x+y,k+expo(x+y)-expo(y)).$

PROOF: See Lemmas 5.1.16 and 5.2.19

  (rnd-sticky) Let $ {\cal R}$ be a common rounding mode, $ m \in \mathbb{N}$ , $ n \in \mathbb{N}$ , and $ x \in \mathbb{Q}$ . If $ n \geq m+2$ and $ m>0$ , then

$\displaystyle {\cal R}(x,m) = {\cal R}(sticky(x,n),m).$

PROOF: The modes trunc and away are handled by Lemma 5.4.9; near and $ near^+$ are covered by Lemma 5.4.10; inf and minf are reduced to trunc and away by Lemma 5.4.1


For the sake of simplicity, our main result pertaining to constant injection is formulated in the context of bit vector rounding.

  (rnd-const-thm) Let $ {\cal R}$ be a common rounding mode, $ n \in \mathbb{N}$ , $ n>1$ , and $ x \in \mathbb{N}$ with $ expo(x) \geq n$ . Then

$\displaystyle {\cal R}(x,n) = trunc(x+{\cal C},\nu),$

where

$\displaystyle {\cal C} = \left\{\begin{array}{ll}
2^{expo(x)-n} & \mbox{if ${\c...
...mbox{if ${\cal M} = trunc$ or ${\cal M} = \mbox{\it minf}$}\end{array} \right.$

and

$\displaystyle \nu = \left\{\begin{array}{ll}
n-1 & \mbox{if ${\cal R} = near$ ...
...is $(n+1)$-exact but not $n$-exact}\\
n & \mbox{otherwise.}\end{array} \right.$

PROOF: For the modes away and inf, the identity follows from Lemma 5.2.22, with $ m=expo(x)+1$ , and Lemma 4.2.7. For near and $ near^+$ , it reduces to Lemma 5.3.34. For trunc and minf, the lemma is trivial. 


Our final result provides an alternative to Lemma 5.5.16 as an implementation of rounding.

  (roundup-thm) Let $ {\cal R}$ be a common rounding mode, $ n \in \mathbb{N}$ , and $ x \in \mathbb{N}$ with $ expo(x) \geq n > 1$ . If any of the following conditions holds, then $ {\cal R}(x,n) =$   fp$ ^+(trunc(x,n),n)$ :
(1) $ {\cal R} = near^+$ and $ x[expo(x)-n] = 1$

(2) $ {\cal R} = near$ , $ x[expo(x)-n] = 1$ , and either
(a) $ x$ is not $ (n+1)$ -exact or
(b) $ x[expo(x)+1-n] = 1$

(3) $ {\cal R} =$   inf or $ {\cal R} = away$ and $ x$ is not $ n$ -exact.

In all other cases, $ {\cal R}(x,n) = trunc(x,n)$ .

PROOF: Suppose first that $ x$ is $ n$ -exact. By Lemma 4.2.10, $ x[expo(x)-n:0] = 0$ , and it follows from Lemma 2.3.12 that

$\displaystyle x[expo(x)-n] = x[expo(x)-n:0][expo(x)-n] = 0[expo(x)-n] = 0.$

Consequently, none of the conditions (1)-(3) holds, and by Lemma 5.5.7,

$\displaystyle {\cal R}(x,n) = x = trunc(x,n).$

Therefore, we may assume that $ x$ is not $ n$ -exact. Note that by Lemma 5.2.16,

$\displaystyle away(x,n) = {\it fp}^+(trunc(x,n),n).$

The cases $ {\cal R} = trunc$ , $ {\cal R} =$   minf , $ {\cal R} = away$ , and $ {\cal R} =$   inf now follow trivially.

We turn to the cases $ {\cal R} = near$ and $ {\cal R} = near^+$ . Let

$\displaystyle z = \lfloor 2^{n-1}sig(x) \rfloor = \lfloor 2^{n-1-expo(x)}x \rfloor = \lfloor x/2^{expo(x)+1-n} \rfloor$

and

$\displaystyle f = 2^{n-1}sig(x) - z = 2^{n-1-expo(x)}x - z.$

Since $ x < 2^{expo(x)+1}$ , Lemma 1.2.7 implies

$\displaystyle z = \lfloor mod(x,2^{expo(x)+1})/2^{expo(x)+1-n} \rfloor = x[expo(x):expo(x)+1-n].$

By Lemmas 2.2.11 and 2.2.19,

$\displaystyle x = x[expo(x):0] = 2^{expo(x)+1-n}x[expo(x):expo(x)+1-n] + x[expo(x)-n:0].$

Therefore,
$\displaystyle f$ $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}x - x[expo(x):expo(x)+1-n]$  
  $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}(x - 2^{expo(x)+1-n}x[expo(x):expo(x)+1-n])$  
  $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}x[expo(x)-n:0].$  

Using Lemma 2.3.13, we may further reduce this to
$\displaystyle f$ $\displaystyle =$ $\displaystyle 2^{n-1-expo(x)}(2^{expo(x)-n}x[expo(x)-n] + x[expo(x)-n-1:0])$  
  $\displaystyle =$ $\displaystyle \frac{1}{2}x[expo(x)-n] + 2^{n-1-expo(x)}x[expo(x)-n-1:0].$  

By Lemma 2.2.1,

$\displaystyle 2^{n-1-expo(x)}x[expo(x)-n-1:0] < 2^{n-1-expo(x)}2^{expo(x)-n} = 1/2.$

This leads to the following observations:

(i)
$ f \geq 1/2$ if and only if $ x[expo(x)-n] = 1$ .

(ii)
$ f > 1/2$ if and only if $ x[expo(x)-n] = 1$ and $ x[expo(x)-n-1:0] \neq 0$ , and according to Lemma 4.2.10, the latter condition holds if and only if $ x$ is not $ (n+1)$ -exact.

Referring to Definition 5.3.2, we see that (i) is sufficient to complete the proof for the case $ {\cal R} = near^+$ .

For the case $ {\cal R} = near$ , it is clear from Definition 5.3.1 and (i) and (ii) above that we need only show that $ z$ is odd if and only if $ x[expo(x)+1-n] = 1$ . But according to Lemmas 2.3.2 and 2.3.12,

$\displaystyle mod(z,2) = z[0] = x[expo(x):expo(x)+1-n][0] = x[expo(x)+1-n].$


next up previous contents
Next: Denormal Rounding Up: Rounding Previous: Sticky Rounding   Contents
David Russinoff 2007-01-02