13.1  Quotient Refinement

The first step of each algorithm is the computation of an initial approximation $ y_0$ of $ \frac{1}{b}$ as an application of the function rcp24. An initial approximation of the quotient $ \frac{a}{b}$ is then computed as

$\displaystyle q_0 = \mathit{RNE}(ay_0, p).
$

The accuracy of $ q_0$ may be derived from that of $ y_0$:

  (init-approx) Let $ a>0$, $ b > 0$, and $ p > 1$. Assume that $ \vert 1 - by\vert \leq \epsilon$, and let $ q = \mathit{RNE}(ay, p)$. Then

$\displaystyle \left\vert 1 - \frac{b}{a}q\right\vert \leq \epsilon + 2^{-p}(1 + \epsilon).
$

PROOF: Since

$\displaystyle \left\vert 1 - \frac{b}{a}ay\right\vert = \vert 1 - by\vert \leq \epsilon
$

and

$\displaystyle \vert ay - q\vert \leq 2^{-p}a\vert y\vert \leq 2^{-p}\frac{a}{b}(1 + \epsilon),
$

$\displaystyle \left\vert 1 - \frac{b}{a}q\right\vert \leq \left\vert 1 - \frac{...
...ight\vert + \frac{b}{a}\vert ay - q\vert
\leq \epsilon + 2^{-p}(1 + \epsilon). $

Each of the initial values $ y_0$ and $ q_0$ undergoes a series of refinements, culminating in the final rounded quotient $ q$. Each refinement $ q_{k+1}$ of the quotient is computed from the preceding approximation $ q_k$ and the current reciprocal approximation $ y$ as follows:

$\displaystyle r_k$ $\displaystyle =$ $\displaystyle \mathit{RNE}(a - bq_k, p)$  
$\displaystyle q_{k+1}$ $\displaystyle =$ $\displaystyle \mathit{RNE}(q_k + r_ky, p)$  

In the final step, the input rounding mode $ {\cal R}$ is used instead of RNE:
$\displaystyle r_k$ $\displaystyle =$ $\displaystyle \mathit{RNE}(a - bq_k, p)$  
$\displaystyle q$ $\displaystyle =$ $\displaystyle {\cal R}(q_k + r_ky, p)$  

Our main lemma, due to Markstein [MAR90], ensures that the final quotient is correctly rounded under certain assumptions pertaining to the accuracy of the reciprocal and quotient approximations from which it is derived.

  (r-exactp, markstein) Let $ a$, $ b$, $ y$, and $ q$ be $ p$-exact, where $ p > 1$, $ 1 \leq a < 2$, and $ 1 \leq b < 2$. Assume that the following inequalities hold:

(i) $ \vert\frac{a}{b} - q\vert < 2^{e+1-p}$, where $ e = \left\{\begin{array}{rl}
0 & \mbox{if $a > b$}\\
-1 & \mbox{if $a \leq b$;}\end{array}\right.$


(ii) $ \vert 1 - by\vert < 2^{-p}$.

Let $ r = a - bq$. Then $ r$ is $ p$-exact, and for any IEEE rounding mode $ {\cal R}$,

$\displaystyle {\cal R}(q + ry, p) = {\cal R}\left(\frac{a}{b}, p\right).
$

PROOF: We may assume $ r \neq 0$, for otherwise $ \frac{a}{b} = q = q+ry$ and the claim holds trivially. We may also assume $ a \neq b$, for otherwise,

$\displaystyle \vert 1 - q\vert = \left\vert\frac{a}{b} - q\right\vert < 2^{e+1-p} = 2^{-p}
$

implies $ q = 1 = \frac{a}{b}$ and $ r=0$. It follows from the bounds on $ a$ and $ b$ that $ e = \mathit{expo}(\frac{a}{b})$. We shall show that $ e = \mathit{expo}(q)$ as well.

If $ a>b$, then

$\displaystyle \frac{a}{b} \geq \frac{b+2^{1-p}}{b} = 1 + \frac{2^{1-p}}{b} > 1 + 2^{-p}
$

and

$\displaystyle q > \frac{a}{b} - 2^{e+1-p} = \frac{a}{b} - 2^{1-p} > 1 + 2^{-p} - 2^{1-p} = 1 - 2^{-p},
$

which implies $ q \geq 1$. On the other hand,

$\displaystyle q < \frac{a}{b} + 2^{1-p} \leq 2 - 2^{1-p} + 2^{1-p} = 2,
$

and hence, $ \mathit{expo}(q) = 0 = e$.

Similarly, if $ a < b$, then

$\displaystyle \frac{a}{b} \geq \frac{1}{2-2^{1-p}} = \frac{(1-2^{-p})+2^{-p}}{2-2^{1-p}} = \frac{1}{2} + \frac{2^{-p-1}}{1-2^{-p}}> \frac{1}{2} + 2^{-p-1}
$

and

$\displaystyle q > \frac{a}{b} - 2^{e+1-p} = \frac{a}{b} - 2^{-p} > \frac{1}{2} + 2^{-p-1} - 2^{-p} = \frac{1}{2} - 2^{-p-1},
$

which implies $ q \geq \frac{1}{2}$. On the other hand,

$\displaystyle \frac{a}{b} \leq \frac{a}{a+2^{1-p}} = 1 - \frac{2^{1-p}}{a+2^{1-p}} \leq 1 - \frac{2^{1-p}}{2} = 1 - 2^{-p},
$

$\displaystyle q < \frac{a}{b} + 2^{-p} < 1 - 2^{-p} + 2^{-p} = 1,
$

and $ \mathit{expo}(q) = -1 = e$.

To establish $ p$-exactness of $ r$, since

$\displaystyle \vert r\vert = b\left\vert\frac{a}{b} - q\right\vert < 2\cdot 2^{e+1-p} = 2^{e+2-p},
$

either $ r=0$ or $ \mathit{expo}(r) \leq e + 1 - p$, and it suffices to show that

$\displaystyle 2^{p-1-(e+1-p)}r = 2^{2p-2-e}r = 2^{2p-2-e}a - 2^{2p-2-e}bq \in \mathbb{Z}.
$

Since $ a$ is $ p$-exact, $ \mathit{expo}(a) = 0$, and $ 2p-2-e \geq 2p-2 \geq p-1$, $ 2^{2p-2-e}a \in \mathbb{Z}$; since $ b$ and $ q$ are $ p$-exact, $ \mathit{expo}(b) = 0$, and $ \mathit{expo}(q) \geq e$,

$\displaystyle 2^{2p-2-e}bq = (2^{p-1}b) (2^{p-1-e}q) \in \mathbb{Z}.
$

For the proof of the second claim, we shall focus on the case $ r>0$; the case $ r < 0$ is similar. Let $ q' = q + 2^{e+1-p}$. The quotient $ \frac{a}{b}$ lies in the interval $ (q, q')$, and its rounded value is either $ q$ or $ q'$. For the directed rounding modes (RUP, RDN, and $ \mathit{RTZ}$), we need only show that $ q+ry$ also belongs to this interval, i.e., $ ry < 2^{1+e-p}$. Since $ \frac{a}{b} = q + \frac{r}{b}$, this condition may be expressed as

$\displaystyle \frac{r}{b} < 2^{e+1-p} \Rightarrow ry < 2^{e+1-p},$ (13.1)

or

$\displaystyle 2^{p-1-e} r < b \Rightarrow 2^{p-1-e}r < \frac{1}{y}.
$

Since (ii) implies

$\displaystyle \frac{1}{y} = \frac{b}{by} > \frac{b}{1+2^{-p}},
$

this will follow from

$\displaystyle 2^{p-1-e} r < b \Rightarrow 2^{p-1-e}r \leq \frac{b}{1+2^{-p}}.
$

If $ b>1$, then since $ 2^{p-1-e}r$ and $ b$ are both $ p$-exact and $ \mathit{expo}(b) = 0$, we have

$\displaystyle 2^{p-1-e} r < b \Rightarrow 2^{p-1-e} r \leq b-2^{1-p},
$

and it will suffice to show that

$\displaystyle b - 2^{1-p} \leq \frac{b}{1 + 2^{-p}},
$

but this reduces to $ b \leq 2 + 2^{1-p}$, and we have assumed that $ b<2$.

On the other hand, if $ b = 1$, then

$\displaystyle 2^{p-1-e} r < 1 \Rightarrow 2^{p-1-e} r \leq 1-2^{-p},
$

and we need only show that

$\displaystyle 1 - 2^{-p} \leq \frac{1}{1 + 2^{-p}},
$

which is trivial.

For the remaining case, $ {\cal R} = \mathit{RNE}$, the proof may be completed by showing that $ \frac{a}{b}$ and $ q+ry$ lie on the same side of the midpoint $ m = q + 2^{e-p}$ of the interval $ (q, q')$. Note that $ \frac{a}{b} = m$ is impossible, for if this were true, then since $ a = \frac{a}{b} \cdot b$ is $ p$-exact, Lemma 4.2.10 would imply that $ \frac{a}{b} = m$ is also $ p$-exact, but this is not the case. Thus, we must show that

$\displaystyle \frac{r}{b} < 2^{e-p} \Rightarrow ry < 2^{e-p}$ (13.2)

and

$\displaystyle \frac{r}{b} > 2^{e-p} \Rightarrow ry > 2^{e-p}$ (13.3)

The proof of (13.2) is the same as that of (13.1), and we may similarly show that (13.3) is a consequence of

$\displaystyle b + 2^{1-p} \geq \frac{b}{1 - 2^{-p}}.
$

But this is equivalent to $ b \leq 2 - 2^{1-p}$, which follows from the assumptions that $ b$ is $ p$-exact and $ b<2$


IEEE-compliance of the algorithms of interest will be proved as applications of Lemma 13.1.2 by establishing the two hypotheses (i) and (ii) for appropriate values of $ y$ and $ q$.

The next lemma consists of two results. The first specifies the accuracy of an intermediate quotient approximation;13.1 the second addresses the final approximation, supplying the first inequality (i) required by Lemma 13.1.2.

  (quotient-refinement-1, quotient-refinement-2) Let $ 1 \leq a < 2$, $ 1 \leq b < 2$, and $ p > 0$. Assume that $ \vert 1 - by\vert \leq \epsilon$ and $ \vert 1 - \frac{b}{a}q_0\vert \leq \delta$. Let $ r = \mathit{RNE}(a-bq_0, p)$ and $ q = \mathit{RNE}(q_0 + ry, p)$. Then

$\displaystyle \left\vert 1 - \frac{b}{a}q\right\vert \leq 2^{-p} + (1 + 2^{-p})\delta\epsilon + 2^{-p}\delta(1 + \epsilon) + 2^{-2p}\delta(1 + \epsilon).
$

If $ \delta\epsilon + 2^{-p}\delta(1 + \epsilon) < 2^{-p-1}$, then

$\displaystyle \left\vert q - \frac{a}{b}\right\vert < 2^{e+1-p},
$

where $ e = \left\{\begin{array}{rl}
0 & \mbox{if $a > b$}\\
-1 & \mbox{if $a \leq b$.}\end{array}\right.$

PROOF: Let $ u = 1 - by$, $ v = 1 - \frac{b}{a}q_0$, $ r' = av = a - bq_0$, and $ q' = q_0 + ry$. Then

$\displaystyle q_0 + r'y = \frac{a}{b}(1 - v) + \frac{av}{b}(1 - u) = \frac{a}{b}(1 - uv)
$

and

$\displaystyle q' = q_0 + r'y + (r - r')y = \frac{a}{b}(1 - uv) + (r - r')y,
$

where

$\displaystyle \vert(r - r')y\vert \leq 2^{-p}\vert r'y\vert \leq 2^{-p}\cdot a\delta \cdot \frac{1}{b}(1+\epsilon) = \frac{a}{b}\cdot 2^{-p}\delta(1+\epsilon).
$

Thus,

$\displaystyle q' \leq \frac{a}{b}(1+\delta\epsilon) + \frac{a}{b}\cdot 2^{-p}\delta(1+\epsilon)
$

and

$\displaystyle \left\vert q' - \frac{a}{b}\right\vert \leq \frac{a}{b}(\delta\epsilon + 2^{-p}\delta(1 + \epsilon)).
$

For the proof of the first claim, we have
$\displaystyle \left\vert q - \frac{a}{b}\right\vert$ $\displaystyle \leq$ $\displaystyle \vert q - q'\vert + \left\vert q' - \frac{a}{b}\right\vert$  
  $\displaystyle \leq$ $\displaystyle 2^{-p}q' + \left\vert q' - \frac{a}{b}\right\vert$  
  $\displaystyle \leq$ $\displaystyle \frac{a}{b}(2^{-p}(1+\delta\epsilon) + 2^{-2p}\delta(1 + \epsilon) + \delta\epsilon + 2^{-p}\delta(1+\epsilon))$  
  $\displaystyle =$ $\displaystyle \frac{a}{b}(2^{-p} + (1 + 2^{-p})\delta\epsilon + \cdot 2^{-p}\delta(1 + \epsilon) + \cdot 2^{-2p}\delta(1 + \epsilon)).$  

For the proof of the second claim, since $ \frac{a}{b} \leq 2^{e+1}$, we have

$\displaystyle \left\vert q' - \frac{a}{b}\right\vert \leq 2^{e+1}(\delta\epsilon + 2^{-p}\delta(1 + \epsilon)) < 2^{e+1}2^{-p-1} = 2^{e-p}
$

and

$\displaystyle \left\vert q - \frac{a}{b}\right\vert \leq \vert q - q'\vert + \left\vert q' - \frac{a}{b}\right\vert < 2^{expo(q')-p} + 2^{e-p}.
$

If $ \mathit{expo}(q') \leq e$, then the claim follows trivially. Thus, we may assume that $ \mathit{expo}(q') > e$. But then

$\displaystyle 2^{e+1} \leq q' = \frac{a}{b} + \left(q' - \frac{a}{b}\right) < 2^{e+1} + 2^{e-p}
$

implies $ q = 2^{e+1}$. It follows that $ \frac{a}{b} \leq q \leq q'$ and

$\displaystyle \left\vert q - \frac{a}{b}\right\vert = \leq \left\vert q' - \frac{a}{b}\right\vert < 2^{e-p}. $

David Russinoff 2017-08-01