6.1  Truncation

The most basic rounding mode, which the IEEE standard calls “round toward 0”, may be described as a three-step operation on the significand of its first argument. Thus, $ \mathit{RTZ}(x,n)$ has the same sign and exponent as $ x$, while its significand is computed from $ sig(x)$ as follows:

  1. Shift the binary point of $ sig(x)$ by $ n-1$ bits to the right.

  2. Extract the floor of the result.

  3. Shift the binary point by $ n-1$ bits to the left.

In other words, $ \mathit{RTZ}(x,n)$ is the result of replacing $ sig(x)$ by $ \lfloor 2^{n-1}sig(x) \rfloor 2^{1-n}$. This is the content of the following definition.

Definition 6.1.1   (rtz) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{Z}$,

$\displaystyle \mathit{RTZ}(x,n) = sgn(x)\lfloor 2^{n-1}sig(x) \rfloor 2^{expo(x)-n+1}.$

Example: Let $ x = 45/8 = (101.101)_2$ and $ n = 5$. Then $ sgn(x) = 1$, $ expo(x) = 2$, $ sig(x) = (1.01101)_2$,

$\displaystyle \lfloor 2^{n-1}sig(x) \rfloor 2^{1-n} = \lfloor (10110.1)_2 \rfloor 2^{-4}
= 10110 \cdot 2^{-4} = 1.011,$

and

$\displaystyle \mathit{RTZ}(x,n) =\lfloor 2^{n-1}sig(x) \rfloor 2^{1-n}2^{expo(x)}
= 1.011 \cdot 2^{2} = 101.1.$

Note that this value is the largest 5-exact number that does not exceed $ x$.

Although the second argument of any rounding mode is normally positive, the following lemma is occasionally useful.

  (rtz-neg-bits) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{Z}$, if $ n \leq 0$, then

$\displaystyle \mathit{RTZ}(x,n) = 0.$

PROOF: By Lemma 4.1.8,

$\displaystyle 0 < 2^{n-1}sig(x) \leq sig(x)/2 < 1,$

which implies $ \lfloor 2^{n-1}sig(x) \rfloor = 0$


Like all rounding modes, RTZ preserves sign.

Lemma 6.1.2   (sgn-rtz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$, then

$\displaystyle sgn(\mathit{RTZ}(x,n)) = sgn(x).$

This particular mode also has the property of oddness, which, we shall see, is not common to all IEEE modes.

Lemma 6.1.3   (rtz-minus) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle \mathit{RTZ}(-x,n) = -\mathit{RTZ}(x,n).$

All of the modes that we shall consider commute with the shift operation.

  (rtz-shift) For all $ x \in \mathbb{R}$, $ n \in \mathbb{N}$, and $ k \in \mathbb{Z}$,

$\displaystyle \mathit{RTZ}(2^kx,n) = 2^k \mathit{RTZ}(x,n).$

PROOF: By Lemma 4.1.14,

$\displaystyle \mathit{RTZ}(2^kx,n)$ $\displaystyle =$ $\displaystyle sgn(2^kx)\lfloor 2^{n-1}sig(2^kx) \rfloor 2^{expo(2^kx)-n+1}$  
  $\displaystyle =$ $\displaystyle sgn(x)\lfloor 2^{n-1}sig(x) \rfloor 2^{expo(x)+k-n+1}$  
  $\displaystyle =$ $\displaystyle 2^k \mathit{RTZ}(x,n).$  

The inequality stated in the next lemma is the defining characteristic of $ \mathit{RTZ}$ as a rounding mode.

  (rtz-upper-bound) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle \vert\mathit{RTZ}(x,n)\vert \leq \vert x\vert.$

PROOF: By Definition 1.1.1 and Lemma 4.1.1,

$\displaystyle \vert\mathit{RTZ}(x,n)\vert \leq 2^{n-1}sig(x) 2^{expo(x)-n+1} = sig(x)2^{expo(x)} = \vert x\vert.$

As we have noted, RTZ also preserves exponent.

  (expo-rtz) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$, then

$\displaystyle expo(\mathit{RTZ}(x,n)) = expo(x).$

PROOF: Since $ \vert\mathit{RTZ}(x,n)\vert \leq \vert x\vert$, $ expo(\mathit{RTZ}(x,n)) \leq expo(x)$ by Lemma 4.1.5. But by Lemmas 4.1.8 and 1.1.3,

$\displaystyle \vert\mathit{RTZ}(x,n)\vert = \lfloor 2^{n-1}sig(x) \rfloor 2^{expo(x)-n+1}
\geq 2^{n-1} 2^{expo(x)-n+1} = 2^{expo(x)},$

and hence, by Lemma 4.1.3,

$\displaystyle expo(\mathit{RTZ}(x,n)) \geq expo(x).$

The following inequality complements Lemma 6.1.5, confining $ \mathit{RTZ}(x,n)$ to an interval.

  (rtz-lower-bound,rtz-lower-2)
If $ x \in \mathbb{R}$, $ x \neq 0$, and $ n \in \mathbb{N}$, then

$\displaystyle \vert\mathit{RTZ}(x,n)\vert > \vert x\vert - 2^{expo(x)-n+1} \geq \vert x\vert(1-2^{1-n}).$

PROOF: By Definitions 6.1.1 and 1.1.1 and Lemma 4.1.1,

$\displaystyle \vert\mathit{RTZ}(x,n)\vert > (2^{n-1}sig(x)-1)2^{expo(x)-n+1} = \vert x\vert - 2^{expo(x)-n+1}.$

The second inequality follows from Definition 4.1.1.. 

  (rtz-diff) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$,

$\displaystyle \vert x-\mathit{RTZ}(x,n)\vert < 2^{expo(x)-n+1}.$

PROOF: By Lemmas 6.1.2 and 6.1.5,

$\displaystyle \vert x-\mathit{RTZ}(x,n)\vert = \vert\vert x\vert-\vert\mathit{RTZ}(x,n)\vert\vert = \vert x\vert-\vert\mathit{RTZ}(x,n)\vert.$

The corollary now follows from Lemma 6.1.7


  (rtz-exactp-a) For all $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$, $ \mathit{RTZ}(x,n)$ is $ n$-exact.

PROOF: Since $ expo(\mathit{RTZ}(x,n)) = expo(x)$, it suffices to observe that

$\displaystyle \mathit{RTZ}(x,n)2^{n-1-expo(x)} = sgn(x)\lfloor 2^{n-1}sig(x) \rfloor \in \mathbb{Z}.$

  (rtz-exactp-b) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ n > 0$ and x is $ n$-exact, then

$\displaystyle \mathit{RTZ}(x,n) = x.$

PROOF: By Definition 4.2.1 and Lemma 1.1.1,

$\displaystyle \lfloor 2^{n-1}sig(x) \rfloor = 2^{n-1}sig(x),$

and hence by Definition 6.1.1 and Lemma 4.1.1,
$\displaystyle \mathit{RTZ}(x,n)$ $\displaystyle =$ $\displaystyle sgn(x)\lfloor 2^{n-1}sig(x) \rfloor 2^{expo(x)-n+1}$  
  $\displaystyle =$ $\displaystyle sgn(x)2^{n-1}sig(x)2^{expo(x)-n+1}$  
  $\displaystyle =$ $\displaystyle sgn(x)sig(x)2^{expo(x)}$  
  $\displaystyle =$ $\displaystyle x.$  

  (rtz-exactp-c) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ a$ is $ n$-exact and $ a \leq x$, then $ a \leq \mathit{RTZ}(x,n)$.

PROOF: If $ x<0$, then $ x \leq \mathit{RTZ}(x,n)$ by Lemma 6.1.5. Therefore, we may assume that $ x \geq 0$. Suppose $ a > \mathit{RTZ}(x,n)$. Then by Lemma 4.2.16,

$\displaystyle \mathit{RTZ}(x,n) \leq a - 2^{expo(\mathit{RTZ}(x,n))-n+1} = a - 2^{expo(x)-n+1} \leq x - 2^{expo(x)-n+1},$

contradicting Lemma 6.1.7

  (rtz-squeeze) Let $ x \in \mathbb{R}$, $ a \in \mathbb{R}$and $ n \in \mathbb{N}$. Assume that $ a$ is $ n$-exact, where $ 0<n$ and $ 0 < a \leq x < fp^+(a,n)$. Then $ \mathit{RTZ}(x, n) = a$.

PROOF: By Lemmas 6.1.5 and 6.1.11, $ a \leq \mathit{RTZ}(x, n) < fp^+(a,n)$. The claim follows from Lemma 4.2.16 and 6.1.9

  (rtz-monotone) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$. If $ x \leq y$, then $ \mathit{RTZ}(x,n) \leq \mathit{RTZ}(y,n)$.

PROOF: First suppose $ x>0$. By Lemma 6.1.5, $ \mathit{RTZ}(x,n) \leq x \leq y$. Since $ \mathit{RTZ}(x,n)$ is $ n$-exact by Lemma 6.1.9, Lemma 6.1.11 implies

$\displaystyle \mathit{RTZ}(x,n) \leq \mathit{RTZ}(y,n).$

Now suppose $ x \leq 0$. By Lemma 6.1.2, we may assume that $ x \leq y < 0$. Thus, since $ 0 < -y \leq -x$, we have $ \mathit{RTZ}(-y,n) \leq \mathit{RTZ}(-x,n)$ and by Lemma 6.1.3,

$\displaystyle \mathit{RTZ}(x,n) = -\mathit{RTZ}(-x,n) \leq -\mathit{RTZ}(-y,n) = \mathit{RTZ}(y,n).$

Note that with lemmas 6.1.9, 6.1.10, and 6.1.13, we have established $ \mathit{RTZ}$ as a true rounding mode, as defined by the axioms listed at the beginning of this chapter.

If $ x$ is $ (n+1)$-exact but not $ n$-exact, then $ x$ is equidistant from two successive $ n$-exact numbers. In this case, we have an explicit formula for $ \mathit{RTZ}(x,n)$.

  (rtz-midpoint) Let $ x \in \mathbb{R}$ and $ n \in \mathbb{N}$. If $ x$ is $ (n+1)$-exact but not $ n$-exact, then

$\displaystyle \mathit{RTZ}(x,n) = x - sgn(x)2^{expo(x)-n}.$

PROOF: For the case $ n = 0$, we have $ sig(x) = 1$ by Definition 4.2.1, and by Lemmas 4.1.1 and 6.1.1,

$\displaystyle x - sgn(x)2^{expo(x)-n} = sgn(x)2^{expo(x)} - sgn(x)2^{expo(x)} = 0 = \mathit{RTZ}(x,n).$

Thus, we may assume $ n > 0$, and by Lemma 6.1.3, we may also assume $ x>0$.

Let $ a = x-2^{expo(x)-n}$ and $ b = x+2^{expo(x)-n}$. Since $ x>2^{expo(x)}$, $ x \geq 2^{expo(x)}+2^{expo(x)+1-n}$ by Lemma 4.2.16, and hence $ a \geq 2^{expo(x)}$ and $ expo(a) = expo(x)$. It follows that $ b = \mathit{fp}^+(a,n)$.

By hypothesis, $ x2^{n-expo(x)} \in \mathbb{Z}$ but $ x2^{n-expo(x)}/2 = x2^{n-1-expo(x)} \notin \mathbb{Z}$, and therefore, $ x2^{n-expo(x)}$ is odd. Let $ x2^{n-expo(x)} = 2k+1$. Then

$\displaystyle a2^{n-1-expo(a)} = (x-2^{expo(x)-n})2^{n-1-expo(x)} = (2k+1)/2 - 1/2 = k \in \mathbb{Z}.$

Thus, $ a$ is $ n$-exact, and by Lemma 4.2.15, so is $ a+2^{expo(a)+1-n}=b$. Now by Lemma 6.1.11, $ a \leq \mathit{RTZ}(x,n)$, but if $ a < \mathit{RTZ}(x,n)$, then since $ \mathit{RTZ}(x,n)$ is $ n$-exact, Lemma 4.2.16 would imply $ b \leq \mathit{RTZ}(x,n)$, contradicting $ x<b$. Therefore, $ a = \mathit{RTZ}(x,n)$


The following lemma states an intuitively obvious property of truncation, which may be derived directly from Definition 6.1.1.

  (rtz-rtz) Let $ x \in \mathbb{R}$, $ m \in \mathbb{N}$, and $ n \in \mathbb{N}$. If $ m \leq n$, then

$\displaystyle \mathit{RTZ}(\mathit{RTZ}(x,n),m) = \mathit{RTZ}(x,m).$

PROOF: We assume $ x \geq 0$; the case $ x<0$ follows easily from Lemma 6.1.3. Applying Lemma 1.1.5, we have

$\displaystyle {\mathit{RTZ}(\mathit{RTZ}(x,n),m)}$
  $\displaystyle =$ $\displaystyle \lfloor 2^{m-1-expo(x)} (\lfloor 2^{n-1-expo(x)}x\rfloor 2^{expo(x)+1-n})\rfloor 2^{expo(x)+1-m}$  
  $\displaystyle =$ $\displaystyle \lfloor \lfloor 2^{n-1-expo(x)}x\rfloor/2^{n-m} \rfloor 2^{expo(x)+1-m}$  
  $\displaystyle =$ $\displaystyle \lfloor 2^{n-1-expo(x)}x/2^{n-m} \rfloor 2^{expo(x)+1-m}$  
  $\displaystyle =$ $\displaystyle \lfloor 2^{m-1-expo(x)}x\rfloor 2^{expo(x)+1-m}$  
  $\displaystyle =$  

Figure 6.1 is provided as a visual aid in understanding the following lemma, which formulates the precise conditions under which a truncated sum may be computed by truncating one of the summands in advance of the addition.

Figure 6.1: Lemma 6.1.16
\begin{figure}\begin{picture}(120,115)(-58,-2)\setlength{\unitlength}{2mm}
\p...
...                           }}_{k+expo(x+y)-expo(y)}$}}
\end{picture}\end{figure}

  (plus-rtz) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ k \in \mathbb{Z}$. If $ x \geq 0$, $ y \geq 0$, and $ x$ is $ (k+expo(x)-expo(y))$-exact, then

$\displaystyle x+\mathit{RTZ}(y,k) = \mathit{RTZ}(x+y,k+expo(x+y)-expo(y)).$

PROOF: Let $ n = k+expo(x)-expo(y)$. Since $ x$ is $ n$-exact,

$\displaystyle x2^{k-1-expo(y)} = x2^{n-1-expo(x)} \in \mathbb{Z}.$

Let $ k' = k+expo(x+y)-expo(y)$. Then by Lemma 1.1.4,
$\displaystyle x+\mathit{RTZ}(y,k)$ $\displaystyle =$ $\displaystyle x+\lfloor 2^{k-1-expo(y)}y \rfloor 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle (x2^{k-1-expo(y)}+\lfloor 2^{k-1-expo(y)}y \rfloor) 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle \lfloor 2^{k-1-expo(y)}(x+y) \rfloor 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle \lfloor 2^{k'-1-expo(x+y)}(x+y) \rfloor 2^{expo(x+y)+1-k'}$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x+y,k').$  

Lemma 6.1.16 holds for subtraction as well, but only if the summands are properly ordered.

  (minus-rtz) Let $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ k \in \mathbb{Z}$. If $ y>x>0$, $ k + expo(x-y) - expo(y) > 0$, and $ x$ is $ (k+expo(x)-expo(y))$-exact, then

$\displaystyle x-\mathit{RTZ}(y,k) = \mathit{RTZ}(x-y,k+expo(x-y)-expo(y)).$

PROOF: Let $ n = k+expo(x)-expo(y)$. Since $ x$ is $ n$-exact,

$\displaystyle x2^{k-1-expo(y)} = x2^{n-1-expo(x)} \in \mathbb{Z}.$

Let $ k' = k+expo(x-y)-expo(y)$. Then by Lemma 1.1.4,
$\displaystyle x-\mathit{RTZ}(y,k)$ $\displaystyle =$ $\displaystyle x-\lfloor 2^{k-1-expo(y)}y \rfloor 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle -(\lfloor 2^{k-1-expo(y)}y \rfloor - x2^{k-1-expo(y)}) 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle -\lfloor 2^{k-1-expo(y)}(y-x) \rfloor 2^{expo(y)+1-k}$  
  $\displaystyle =$ $\displaystyle -\lfloor 2^{k'-1-expo(y-x)}(y-x) \rfloor 2^{expo(y-x)+1-k'}$  
  $\displaystyle =$ $\displaystyle -\mathit{RTZ}(y-x,k')$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x-y,k').$  

Truncation of a bit vector may be described as a shifted bit slice.

  (bits-rtz) Let $ x \in \mathbb{N}$ and $ k \in \mathbb{N}$, and let $ n = expo(x)+1$. If $ 0<k \leq n$, then

$\displaystyle \mathit{RTZ}(x,k) = 2^{n-k}x[n-1:n-k].$

PROOF: By Lemmas 2.2.5 and 2.2.16,

$\displaystyle \mathit{RTZ}(x,k)$ $\displaystyle =$ $\displaystyle \lfloor 2^{k-1-expo(x)}x \rfloor 2^{expo(x)+1-k}$  
  $\displaystyle =$ $\displaystyle \lfloor x/2^{n-k} \rfloor 2^{n-k}$  
  $\displaystyle =$ $\displaystyle 2^{n-k}\lfloor x/2^{n-k} \rfloor [k-1:0]$  
  $\displaystyle =$ $\displaystyle 2^{n-k}x[n-1:n-k].$  

  (bits-rtz-bits) Let $ x \in \mathbb{N}$ $ i \in \mathbb{N}$, $ j \in \mathbb{N}$, and $ k \in \mathbb{N}$. If $ 0<k \leq \mathit{expo}(x)+1$ and $ \mathit{expo}(x)+1-k \leq j \leq i \leq \mathit{expo}(x)$, then

$\displaystyle \mathit{RTZ}(x,k)[i:j] = x[i:j]
$

PROOF: Let $ n = \mathit{expo}(x)+1$. By Lemmas 6.1.18, 2.2.18, and 2.2.23,

$\displaystyle \mathit{RTZ}(x,k)[i:j] = (2^{n-k}x[n-1:n-k]([i:j] = x[n-1:n-k][i+k-n:j+k-n] = x[i:j]. $

  (rtz-split) Let $ x \in \mathbb{N}$, $ m \in \mathbb{N}$, and $ k \in \mathbb{N}$, and let $ n = expo(x)+1$. If $ x \geq 0$ and $ 0<k < m \leq n$, then

$\displaystyle \mathit{RTZ}(x,m) = \mathit{RTZ}(x,k) + 2^{n-m} x[n-k-1:n-m].$

PROOF: By Lemmas 6.1.18 and2.2.22,

$\displaystyle \mathit{RTZ}(x,m)$ $\displaystyle =$ $\displaystyle 2^{n-m} x[n-1:n-m]$  
  $\displaystyle =$ $\displaystyle 2^{n-m}(2^{m-k}x[n-1:n-k] + x[n-k-1:n-m])$  
  $\displaystyle =$ $\displaystyle 2^{n-k}x[n-1:n-k] + 2^{n-m}x[n-k-1:n-m]$  
  $\displaystyle =$ $\displaystyle \mathit{RTZ}(x,k) + 2^{n-m} x[n-k-1:n-m].$  

  (rtz-logand) Let $ x \in \mathbb{N}$, $ m \in \mathbb{N}$, and $ k \in \mathbb{N}$, and let $ n = \mathit{expo}(x)$. If $ 0<k<n\leq m$, then

$\displaystyle \mathit{RTZ}(x,k) = x \;\verb!&!\; (2^{m} - 2^{n-k}).$

PROOF: By Lemmas 6.1.18 and 3.1.11,

$\displaystyle \mathit{RTZ}(x,k) = 2^{n-k}x[n-1:n-k] = x \;\verb!&!\; (2^{n}-2^{n-k}),$

and by Lemmas 3.1.2 2.2.5, and 3.1.13,

$\displaystyle x \;\verb!&!\; (2^{n}-2^{n-k}) = \left(x \;\verb!&!\; (2^{n}-2^{n...
...0] \;\verb!&!\; (2^{n}-2^{n-k})[n-1:0] = x\;\verb!&!\; (2^{n}-2^{n-k})[n-1:0].
$

But since

$\displaystyle 2^m - 2^{n-k} = 2^n(2^{m-n}-1) + 2^n - 2^{n-k},$

$\displaystyle (2^{m}-2^{n-k})[n-1:0] = mod(2^{m}-2^{n-k}, 2^n) = 2^n - 2^{n-k}.$

David Russinoff 2017-08-01