next up previous contents
Next: Floating-Point Formats Up: Floating-Point Representation Previous: Sign, Exponent, and Significand   Contents

Exactness

In order for the significand of a rational number, which has the form

$\displaystyle \verb!b!1.\beta_1\beta_2 \cdots,$

to be represented by an $ n$ -bit field, we must have $ \beta_k = 0$ for all $ k \geq n$ , or equivalently, a right shift of the radix point by $ n-1$ places must result in an integer. This motivates our definition of exactness.

Definition 4.2.1   (exactp) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{Z}$ . Then

$\displaystyle exactp(x,n) \Leftrightarrow sig(x)2^{n-1} \in \mathbb{Z}.$

If $ exactp(x,n)$ , then we shall say that $ x$ is $ n$ -exact.

The definition may be restated as follows.

Lemma 4.2.1   (exactp2) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{Z}$ , $ x$ is $ n$ -exact if and only if $ x2^{n-1-expo(x)} \in \mathbb{Z}$ .

The next three lemmas are consequences of the observation that $ n$ -exactness of $ x$ is a property of $ sig(x)$ .

Lemma 4.2.2   (exactp-sig) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{Z}$ , $ x$ is $ n$ -exact if and only if $ sig(x)$ is $ n$ -exact.

  (exactp-minus, exactp-neg) For all $ x \in \mathbb{Q}$ and $ n \in \mathbb{Z}$ , $ x$ is $ n$ -exact if and only if $ -x$ is $ n$ -exact.

PROOF: By Lemma 4.1.11, $ sig(-x) = sig(x)$

  (exactp-shift) For all $ x \in \mathbb{Q}$ , $ n \in \mathbb{Z}$ , and $ k \in \mathbb{Z}$ , $ x$ is $ n$ -exact if and only if $ 2^kx$ is $ n$ -exact.

PROOF: By Lemma 4.1.12, $ sig(2^kx) = sig(x)$


It is clear that is $ sig(x)$ is representable in a given field of bits, then it is also representable in any wider field.

  (exactp-<=) For all $ x \in \mathbb{Q}$ , $ n \in \mathbb{Z}$ , and $ m \in \mathbb{Z}$ , if $ m < n$ and $ x$ is $ m$ -exact, then $ x$ is $ n$ -exact.

PROOF: Since $ 2^{n-m} \in \mathbb{Z}$ and $ x\cdot 2^{m-1-expo(x)} \in \mathbb{Z}$ ,

$\displaystyle x\cdot 2^{n-1-expo(x)} = 2^{n-m} \cdot x\cdot 2^{m-1-expo(x)} \in \mathbb{Z}.$

Any power of 2 has a 1-bit significand.

  (exactp-2**n) Let $ n \in \mathbb{Z}$ and $ m \in \mathbb{Z}$ . If $ m>0$ , then $ 2^n$ is $ m$ -exact.

PROOF: By Lemma 4.1.3,

$\displaystyle 2^n\cdot 2^{m-1-expo(2^n)} = 2^n\cdot 2^{m-1-n} = 2^{m-1} \in \mathbb{Z}.$

A bit vector is exact with respect to its width.

  (bvecp-exactp) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{Z}$ . If $ x$ is a bit vector of width $ n$ , then $ x$ is $ n$ -exact.

PROOF: Since 0 is $ n$ -exact for all $ n \in \mathbb{Z}$ , we may assume $ x>0$ , and therefore $ n>0$ . Now since $ 0 < x < 2^n$ , Lemma 4.1.2 implies $ expo(x) \leq n-1$ , and hence $ x\cdot 2^{n-1-expo(x)}\in \mathbb{Z}$


We have the following formula for the exactness of a product.

  (exactp-prod) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , $ m \in \mathbb{Z}$ , and $ n \in \mathbb{Z}$ . If $ x$ is $ m$ -exact and $ y$ is $ n$ -exact, then $ xy$ is $ (m+n)$ -exact.

PROOF: If $ x2^{m-1-expo(x)}$ and $ y2^{n-1-expo(y)}$ are integers, then by Lemma 4.1.13, so is

$\displaystyle x2^{m-1-expo(x)}y2^{n-1-expo(y)}2^{expo(x)+expo(y)+1-expo(xy)} =
xy2^{m+n-1-expo(xy)}.$

In particular, if $ x$ is $ n$ -exact, then $ x^2$ is $ 2n$ -exact. The converse also holds.

  (exactp-x2) Let $ x \in \mathbb{Q}$ and $ n \in \mathbb{Z}$ . If $ x^2$ is $ 2n$ -exact, then $ x$ is $ n$ -exact.

PROOF: We shall require an elementary fact of arithmetic:

If $ r \in \mathbb{Q}$ and $ 2r^2 \in \mathbb{Z}$ , then $ r \in \mathbb{Z}$ .
In order to establish this result, let $ r = a/b$ , where $ a$ and $ b$ are integers with no common prime factor, and assume that $ 2r^2 = 2a^2/b^2 \in
\mathbb{Z}$ . If $ b$ were even, say $ b = 2c$ , then $ 2a^2/b^2 =
a^2/2c^2 \in \mathbb{Z}$ , implying that $ a$ is even, which is impossible. Thus, $ b$ cannot have a prime factor that does not also divide $ a$ , and hence $ b = 1$ .

Now to show that $ x$ is $ n$ -exact, i.e., $ x2^{n-1-expo(x)} \in \mathbb{Z}$ , it will suffice to show that

$\displaystyle 2\left(x2^{n-1-expo(x)}\right)^2 = x^22^{2n-1-2 expo(x)} \in \mathbb{Z}.$

But since $ x^2$ is $ 2n$ -exact, we have $ x^22^{2n-1-expo(x^2)} \in \mathbb{Z}$ , and since Lemma 4.1.13 implies $ expo(x^2) \geq 2expo(x)$ ,

$\displaystyle x^22^{2n-1-2 expo(x)} = x^22^{2n-1-expo(x^2)}2^{expo(x^2)-2expo(x)} \in \mathbb{Z}.$

Exactness of a bit vector may be formulated in various ways.

  (exact-bits-1, exact-bits-2, exact-bits-3) Let $ x \in \mathbb{N}$ , $ k \in \mathbb{N}$ , and $ n \in \mathbb{N}$ . If $ expo(x) = n-1$ and $ k<n$ , then the following are equivalent:

(a) $ x$ is $ (n-k)$ -exact
(b) $ x/2^k \in \mathbb{Z}$
(c) $ x[n-1:k] = x/2^k$
(d) $ x[k-1:0] = 0$ .

PROOF: The equivalence between (a) and (b) follows from Lemma 4.2.1, since

$\displaystyle x2^{(n-k)-1-expo(x)} = x2^{n-k-1-(n-1)} = x/2^k.$

Now by Lemmas 2.2.11 and 2.2.19,

$\displaystyle x = x[n-1:0] = 2^kx[n-1:k] + x[k-1:0],$

and hence

$\displaystyle x/2^k = x[n-1:k] + x[k-1:0]/2^k.$

But by Lemma 2.2.1, $ 0 \leq x[k-1:0]/2^k < 1$ , which implies that (b), (c), and (d) are all equivalent. 


Corollary 4.2.11   (exact-k+1) Let $ x \in \mathbb{N}$ , $ k \in \mathbb{N}$ , and $ n \in \mathbb{N}$ such that $ expo(x) = n-1$ and $ k<n$ , Assume that $ x$ is $ (n-k)$ -exact. Then $ x$ is $ (n-k-1)$ -exact if and only if $ x[k] = 0$ .

The next lemma gives a formula for exactness of a difference.

  (exactp-diff) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , $ n \in \mathbb{Z}$ , and $ k \in \mathbb{Z}$ . Assume that $ n>0$ and $ n>k$ . If x and y are both n-exact and $ expo(x-y)+k \leq min(expo(x),expo(y))$ , then $ x-y$ is $ (n-k)$ -exact.

PROOF: Since $ x$ is $ n$ -exact and $ expo(x-y)+k \leq expo(x)$ ,

$\displaystyle x2^{n-1-(expo(x-y)+k)} = x2^{n-1-expo(x)}2^{expo(x)-(expo(x-y)+k}\in \mathbb{Z}$

by Lemma 4.2.1.. Similarly, $ y2^{n-1-(expo(x-y)+k)} \in \mathbb{Z}$ . Thus,

$\displaystyle (x-y)2^{(n-k)-1-expo(x-y)} = x2^{n-1-(expo(x-y)+k)} -
y2^{n-1-(expo(x-y)+k)} \in \mathbb{Z}.$

Lemma 4.2.12 is often applied with $ k = 0$ , in the following weaker form.

Corollary 4.2.13   (exactp-diff-cor) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , and $ n \in \mathbb{Z}$ with $ n>0$ . If x and y are both n-exact and $ \vert x-y\vert \leq min(\vert x\vert,\vert y\vert)$ , then $ x-y$ is $ n$ -exact.

If $ x$ is positive and $ n$ -exact, then the least $ n$ -exact number exceeding $ x$ is computed as follows.

Definition 4.2.2   (fp+) Let $ x \in \mathbb{Q}$ , $ x>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . Then

fp$\displaystyle ^+(x,n) = x + 2^{expo(x)+1-n}.$

  (fp+1) Let $ x \in \mathbb{Q}$ , $ x>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . If $ x$ is $ n$ -exact, then fp$ ^+(x,n)$ is $ n$ -exact.

PROOF: Let $ e = expo(x)$ and $ x^+ =$   fp$ ^+(x,n) = x + 2^{e+1-n}$ . Since $ x$ is $ n$ -exact, $ x\cdot 2^{n-1-e} \in \mathbb{Z}$ , and by Lemma 4.1.2, $ x < 2^{e+1}$ . Thus,

$\displaystyle x\cdot 2^{n-1-e} < 2^{e+1}\cdot 2^{n-1-e} = 2^n,$

which implies $ x\cdot 2^{n-1-e} \leq 2^n-1$ . Therefore, $ x \leq 2^{e+1}-2^{e+1-n}$ and $ x^+ = x + 2^{e+1-n} \leq 2^{e+1}$ . If $ x^+ = 2^{e+1}$ , then $ x^+$ is $ n$ -exact by Lemma 4.2.6. Otherwise, $ x^+ < 2^{e+1}$ , $ expo(x^+) = e$ , and

$\displaystyle x^+ \cdot 2^{n-1-expo(x^+)} = (x + 2^{e+1-n})2^{n-1-e} = x\cdot 2^{e+1-n}+1 \in \mathbb{Z}.$

  (fp+2) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , $ y>x>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . If $ x$ and $ y$ are $ n$ -exact, then $ y \geq$   fp$ ^+(x,n)$ .

PROOF: Let $ e = expo(x)$ and $ x^+ =$   fp$ ^+(x,n) = x + 2^{e+1-n}$ . Since $ x$ is $ n$ -exact, $ x\cdot 2^{n-1-e} \in \mathbb{Z}$ . Similar, since $ y$ is $ n$ -exact, $ y\cdot 2^{n-1-expo(y)} \in \mathbb{Z}$ . But $ expo(y) \geq e$ by Corollary 4.1.4, and hence

$\displaystyle y\cdot 2^{n-1-e} = y\cdot 2^{n-1-expo(y)}\cdot 2^{expo(y)-e} \in \mathbb{Z}$

as well. Now since $ y > x$ , $ y\cdot 2^{n-1-e} > x\cdot 2^{n-1-e}$ , which implies

$\displaystyle y\cdot 2^{n-1-e} \geq x\cdot 2^{n-1-e} + 1.$

Thus,

$\displaystyle y \geq (x\cdot 2^{n-1-e} + 1)/2^{n-1-e} = x + 2^{e+1-n} = x^+.$

  (fp+expo) Let $ x \in \mathbb{Q}$ , $ x>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . If $ x$ is $ n$ -exact and $ expo(fp^+(x,n)) \neq expo(x)$ , then $ fp^+(x,n) = 2^{expo(x)+1}$ .

PROOF: Since $ fp^+(x,n) > x$ , Lemma 4.1.4 implies $ expo(fp^+(x,n)) \geq expo(x)$ . Therefore, we have $ expo(fp^+(x,n)) > expo(x)$ , which, according to Lemma 4.1.2, implies $ fp^+(x,n) \geq 2^{expo(x)+1}$ . On the other hand, since $ 2^{expo(x)+1}$ is $ n$ -exact by Lemma 4.2.6, Lemma 4.2.15 implies $ 2^{expo(x)+1} \geq fp^+(x,n)$

  (expo-diff-min) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , and $ n \in \mathbb{N}$ with $ n>0$ . If x and y are n-exact and $ x \neq y$ , then

$\displaystyle expo(x-y) \geq min(expo(x),expo(y)) + 1 - n$

PROOF: Since the case $ xy<0$ is trivial, we shall assume that $ xy>0$ and, without loss of generality, that $ y>x>0$ . But then by Lemma 4.2.15,

Similarly, the following definition computes the greatest $ n$ -exact number that is less than a given $ n$ -exact number.

Definition 4.2.3   (fp-) Let $ x \in \mathbb{Q}$ , $ x>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . Then

fp$\displaystyle ^-(x,n) = \left\{\begin{array}{ll}
x - 2^{expo(x)-n} & \mbox{if $...
...}$}\\
x - 2^{expo(x)+1-n} & \mbox{if $x \neq 2^{expo(x)}$.}
\end{array}\right.$

.

  (fp-1) Let $ x \in \mathbb{Q}$ , $ x>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . If $ x$ is $ n$ -exact, then fp$ ^-(x,n)$ is $ n$ -exact.

PROOF: Let $ e = expo(x)$ and $ x^- =$   fp$ ^-(x,n)$ .

Suppose first that $ x = 2^e$ . Then $ x^- = x-2^{e-n}$ , and since

$\displaystyle 2^e > x^- \geq 2^e - 2^{e-1} = 2^{e-1},$

Definition 4.1.1 implies $ expo(x^-) = e-1$ . Therefore,

$\displaystyle x^-2^{n-1-expo(x^-)} = x^-2^{n-e} = (x-2^{e-n})2^{n-e} = (2^e-2^{e-n})2^{n-e} = 2^n-1 \in \mathbb{Z},$

and $ x^-$ is $ n$ -exact by Lemma 4.2.1.

Now suppose $ x \neq 2^e$ . Then by Lemma 4.2.14,

$\displaystyle x \geq$   fp$\displaystyle ^+(2^e,n) = 2^e + 2^{e+1-n}.$

Now $ x^- = x - 2^{e+1-n} \geq 2^e$ , and hence $ expo(x^-) = e$ . Thus,

$\displaystyle x^- 2^{n-1-expo(x^-)} = x^-2^{n-1-e} = (x-2^{e+1-n})2^{n-1-e} = x2^{n-1-e} - 1 \in \mathbb{Z},$

which implies $ x^-$ is $ n$ -exact. 

  (fp+-, fp-+) Let $ x \in \mathbb{Q}$ , $ x>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . If $ x$ is $ n$ -exact, then

fp$\displaystyle ^+($fp$\displaystyle ^-(x,n),n) =$   fp$\displaystyle ^-($fp$\displaystyle ^+(x,n),n) = x.$

PROOF: Let $ e = expo(x)$ $ x^- =$   fp$ ^-(x,n)$ , and $ x^+ =$   fp$ ^+(x,n)$ . We shall prove first that fp$ ^+(x^-,n) = x$ .

As noted in the proof of Lemma 4.2.18,

$\displaystyle expo(x^-) = \left\{\begin{array}{ll}
e-1 & \mbox{if $x = 2^e$}\\
e & \mbox{if $x > 2^e$.}
\end{array}\right.$

If $ x = 2^e$ , then

fp$\displaystyle ^+(x^-,n) = x^- + 2^{(e-1)+1-n} = x - 2^{e-n} + 2^{e-n} = x.$

But if $ x > 2^e$ , then

fp$\displaystyle ^+(x^-,n) = x^- + 2^{e+1-n} = x - 2^{e+1-n} + 2^{e+1-n} = x.$

Next, we prove that fp$ ^-(x^+,n) = x$ . Note that $ x^+ = x + 2^{e+1-n}$ . If $ x^+ < 2^{e+1}$ , then $ expo(x^+) = e$ , and since $ x^+ \neq 2^e$ ,

fp$\displaystyle ^-(x^+,n) = x^+ - 2^{e+1-n} = x + 2^{e+1-n} - 2^{e+1-n} = x.$

We may assume, therefore, that $ x^+ \geq 2^{e+1}$ . By Lemma 4.2.14, since $ 2^{e+1}$ is $ n$ -exact, $ 2^{e+1} \geq x^+$ , and hence, $ x^+ = 2^{e+1}$ . Thus,

fp$\displaystyle ^-(x^+,n) = x^+ - 2^{e+1-n} = x + 2^{e+1-n} - 2^{e+1-n} = x.$

  (fp-2) Let $ x \in \mathbb{Q}$ , $ y \in \mathbb{Q}$ , $ x>y>0$ , and $ n \in \mathbb{N}$ , $ n>0$ . If $ x$ and $ y$ are $ n$ -exact, then $ y \leq$   fp$ ^-(x,n)$ .

PROOF: Suppose $ y >$   fp$ ^-(x,n)$ . Then since fp$ ^-(x,n)$ and $ y$ are both $ n$ -exact, Lemma 4.2.14 implies $ y \geq$   fp$ ^+($fp$ ^-(x,n),n) = x$ , a contradiction. 


next up previous contents
Next: Floating-Point Formats Up: Floating-Point Representation Previous: Sign, Exponent, and Significand   Contents
David Russinoff 2007-01-02