# 6.3  Unbiased Rounding

Next, we examine the mode that the IEEE standard designates as “round to nearest”, which may round in either direction, selecting the representable number that is closest to its argument. This mode is also sometimes called “round to nearest even” because of the manner in which it resolves the ambiguous case of a midpoint, i.e., a number that is equidistant from two successive representable numbers.

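Definition 6.3.1   (rne) Given $x \in \mathbb{R}$ and $n \in \mathbb{N}$, let $z = \lfloor 2^{n-1}\mathit{sig}(x) \rfloor$ and $f = 2^{n-1}\mathit{sig}(x) - z$. Then

$$\mathit{RNE}(x, n) = \begin{cases} \mathit{RTZ}(x, n) & \text{if } f < 1/2 \text{ or } f = 1/2 \text{ and } z \text{ is even} \\ \mathit{RAZ}(x, n) & \text{if } f > 1/2 \text{ or } f = 1/2 \text{ and } z \text{ is odd.} \end{cases}$$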

Example: Let and . Then

and

indicating a “tie”, i.e., that is equidistant from two successive 5-exact numbers. Since is even, the tie is broken in favor of the lesser of the two:

Like all rounding modes, the value of RNE is always that of either RTZ or RAZ.

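Lemma 6.3.1   (rne-choice) For all $x \in \mathbb{R}$ and $n \in \mathbb{N}$, either

$$\mathit{RNE}(x, n) = \mathit{RTZ}(x, n)$$

or

$$\mathit{RNE}(x, n) = \mathit{RAZ}(x, n).$$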

We list several properties of RNE that may be derived from Lemma 6.3.1 and the corresponding properties of RTZ and RAZ.

Lemma 6.3.2   (sgn-rne) Let $x \in \mathbb{R}$ and $n \in \mathbb{N}$. If $n > 0$, then $\mathit{sgn}(\mathit{RNE}(x, n)) = \mathit{sgn}(x)$.

Lemma 6.3.3   (rne-exactp-a) For all $x \in \mathbb{R}$ and $n \in \mathbb{N}$, $\mathit{RNE}(x, n)$ is $n$-exact.

Lemma 6.3.4   (rne-exactp-b) Let $x \in \mathbb{R}$ and $n \in \mathbb{N}$. If $n > 0$ and $x$ is $n$-exact, then $\mathit{RNE}(x, n) = x$.

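Lemma 6.3.5   (rne-exactp-c,rne-exactp-d) Let $x \in \mathbb{R}$, $a \in \mathbb{R}$, and $n \in \mathbb{N}$, $n > 0$. Suppose $a$ is $n$-exact.

 (a) If $a \geq x$, then $a \geq \mathit{RNE}(x, n)$. (b) If $a \leq x$, then $a \leq \mathit{RNE}(x, n)$.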

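Lemma 6.3.6   (expo-rne) For all $x \in \mathbb{R}$ and $n \in \mathbb{Z}^+$, if $|\mathit{RNE}(x, n)| \neq 2^{\mathit{expo}(x)+1}$, then $\mathit{expo}(\mathit{RNE}(x, n)) = \mathit{expo}(x)$.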
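To make Definition 6.3.1 concrete, the following Python sketch evaluates RNE over exact rationals and spot-checks the choice, sign, and exactness properties listed above. It is an illustration only, not part of the formal development: the helper names (`expo`, `parts`, `rtz`, `raz`, `rne`, `exactp`) and the sample inputs 37 and 39 are assumptions made here, and `rtz` and `raz` are assumed to match the floor and ceiling forms of truncation and rounding away from zero from the earlier sections.

```python
from fractions import Fraction
from math import floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero x."""
    e = 0
    while Fraction(2) ** e > abs(x):
        e -= 1
    while Fraction(2) ** (e + 1) <= abs(x):
        e += 1
    return e

def parts(x, n):
    """Return (z, f, ulp) with 2**(n-1)*sig(x) = z + f and ulp = 2**(expo(x)-n+1)."""
    ulp = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / ulp
    z = floor(scaled)
    return z, scaled - z, ulp

def sgn(x):
    return (x > 0) - (x < 0)

def rtz(x, n):
    """Round toward zero (assumed form of RTZ)."""
    x = Fraction(x)
    if x == 0:
        return x
    z, f, ulp = parts(x, n)
    return sgn(x) * z * ulp

def raz(x, n):
    """Round away from zero (assumed form of RAZ)."""
    x = Fraction(x)
    if x == 0:
        return x
    z, f, ulp = parts(x, n)
    return sgn(x) * (z + (f != 0)) * ulp

def rne(x, n):
    """Definition 6.3.1: choose RTZ or RAZ from f and the parity of z."""
    x = Fraction(x)
    if x == 0:
        return x
    z, f, ulp = parts(x, n)
    half = Fraction(1, 2)
    if f < half or (f == half and z % 2 == 0):
        return rtz(x, n)
    return raz(x, n)

def exactp(x, n):
    """True when x is n-exact, i.e. 2**(n-1)*sig(x) is an integer."""
    x = Fraction(x)
    return x == 0 or parts(x, n)[1] == 0

# 37 = 100101b is a tie between the 5-exact numbers 36 and 38; z = 18 is even.
print(rne(37, 5))   # 36
# 39 = 100111b is a tie between 38 and 40; z = 19 is odd.
print(rne(39, 5))   # 40

# Spot-check the choice, sign, and exactness properties on a small sample.
for k in range(-200, 201):
    x = Fraction(k, 8)
    for n in range(1, 6):
        r = rne(x, n)
        assert r in (rtz(x, n), raz(x, n))
        assert sgn(r) == sgn(x)
        assert exactp(r, n)
```

Exact rational arithmetic is used so that ties are detected exactly; a binary floating-point model of the same definition would need extra care at midpoints.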

PROOF: This follows from Lemmas 6.1.5 and 6.2.5, which together ensure that

PROOF: It is clear from Definition 6.3.1 that the choice between and depends only on . Thus, for example, if , then since , as well, and by Lemma 6.1.4,

PROOF: This may be derived from Lemmas 6.1.3 and 6.2.2 by following the same reasoning as used in the proof of Lemma 6.3.8.

In the computation of $\mathit{RNE}(x, n)$, the choice between $\mathit{RTZ}(x, n)$ and $\mathit{RAZ}(x, n)$ is governed by their relative distances from $x$.

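Lemma 6.3.10   (rne-rtz, rne-raz) Let $x \in \mathbb{R}$ and $n \in \mathbb{N}$.

(a) If $|x - \mathit{RTZ}(x, n)| < |x - \mathit{RAZ}(x, n)|$, then $\mathit{RNE}(x, n) = \mathit{RTZ}(x, n)$.

(b) If $|x - \mathit{RTZ}(x, n)| > |x - \mathit{RAZ}(x, n)|$, then $\mathit{RNE}(x, n) = \mathit{RAZ}(x, n)$.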

PROOF: We may assume that , for otherwise

Let . Then

and

Thus, (a) and (b) correspond to and , respectively.

Corollary 6.3.11   (rne-down, rne-up) Let , , and . Assume that , , and is -exact.

(a) If , then .

(b) If , then .

PROOF: In either case, by Lemmas 6.1.12 and 6.2.15, and . The claims follow from Lemma 6.3.10

Corollary 6.3.12   (rne-up-2) Let , , , and , with and . If , then .

PROOF: We may assume that . By Lemmas 4.2.19 and 6.3.5, . Let . Then . Since ,

and the claim follows from Lemma 6.3.11(b).

No $n$-exact number can be closer to $x$ than is $\mathit{RNE}(x, n)$.

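Lemma 6.3.13   (rne-nearest) Let $x \in \mathbb{R}$, $y \in \mathbb{R}$, and $n \in \mathbb{N}$. If $n > 0$ and $y$ is $n$-exact, then $|x - y| \geq |x - \mathit{RNE}(x, n)|$.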

PROOF: Assume . We shall only consider the case , as the case is handled similarly.

First suppose . Since , we must have and hence by Lemma 6.1.11. But since by Lemma 6.3.10, we also have , and hence by Lemma 6.2.14.

In the remaining case, . Now and by Lemma 6.2.14, . But in this case, Lemma 6.3.10 implies , and hence by Lemma 6.1.11

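Consequently, the maximum RNE rounding error is half the distance between successive representable numbers.

Lemma 6.3.14   (rne-diff) If $x \in \mathbb{R}$, $n \in \mathbb{N}$, and $n > 0$, then $|x - \mathit{RNE}(x, n)| \leq 2^{\mathit{expo}(x) - n}$.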

PROOF: By Lemma 6.3.9, we may assume . Let

If the statement fails, then since and are both -exact, Lemma 6.3.13 implies

and hence . Then by Lemmas 4.2.16 and 6.2.14, we have , contradicting Lemma 6.1.7

PROOF: This follows from Lemma 6.3.14 and Definition 4.1.1.
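The half-spacing error bound discussed above can be checked numerically. The sketch below assumes the bound takes the form $|x - \mathit{RNE}(x, n)| \leq 2^{\mathit{expo}(x) - n}$ and that the derived relative bound is $2^{-n}$; both forms, the compact `rne` helper, and the sample grid are assumptions of this illustration rather than statements taken from the text.

```python
from fractions import Fraction
from math import floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero x."""
    e = 0
    while Fraction(2) ** e > abs(x):
        e -= 1
    while Fraction(2) ** (e + 1) <= abs(x):
        e += 1
    return e

def rne(x, n):
    """Round to nearest, ties to even, written directly on the magnitude."""
    x = Fraction(x)
    if x == 0:
        return x
    ulp = Fraction(2) ** (expo(x) - n + 1)   # spacing of n-exact numbers near x
    scaled = abs(x) / ulp                    # equals 2**(n-1) * sig(x)
    z = floor(scaled)
    f = scaled - z
    half = Fraction(1, 2)
    up = f > half or (f == half and z % 2 == 1)
    mag = (z + up) * ulp
    return mag if x > 0 else -mag

def abs_bound(x, n):
    """Half the spacing between successive n-exact numbers near x."""
    return Fraction(2) ** (expo(x) - n)

for n in range(1, 7):
    for k in range(1, 400):
        x = Fraction(k, 16)
        err = abs(x - rne(x, n))
        assert err <= abs_bound(x, n)             # absolute bound
        assert err / abs(x) <= Fraction(2) ** -n  # relative bound

# A midpoint attains the absolute bound exactly.
print(abs(37 - rne(37, 5)), abs_bound(Fraction(37), 5))   # 1 1
```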

Lemma 6.3.16   (rne-force) Let , , and . If , , y is n-exact, and , then .

PROOF: We shall consider the case ; the case then follows from Lemma 6.3.9.

Let . and . By Lemma 6.3.6, . We also have , for otherwise and by Lemma 4.2.21,

If , then since

Lemma 4.2.16 implies . But if , then we have a similar contradiction.

PROOF: Suppose and . Then Lemma 6.3.13 implies , otherwise and . Similarly, , and thus . Applying Lemma 6.3.13 again, we have , and hence . Similarly, , and hence . Consequently, , contradicting

A midpoint with respect to a precision $n$ may be characterized as a number that is $(n+1)$-exact but not $n$-exact. By virtue of the following result, the term rounding boundary is sometimes used as well.

Lemma 6.3.18   (rne-rne-lemma) Let , , and . If and , then for some , and a is -exact.

PROOF: By Lemma 6.3.17, . Let ,

and

Since and are both -exact by Lemma 6.3.3, Lemma 4.2.15 implies that is -exact and is -exact and consequently, by Lemma 4.2.16, is not -exact and . Thus,

Moreover, , for if , then , contradicting Lemma 6.3.13 and similarly if , then

Corollary 6.3.19   (rne-rne) Let , , , , and with . If is -exact, , and , then .

PROOF: By Lemma 6.3.17, we may assume , so that . By Lemmas 4.2.15 and 4.2.16, and are successive -exact numbers, and hence by Lemma 6.3.18

We also have the following partial converse of Lemma 6.3.18.

Lemma 6.3.20   (rne-boundary) Let , , , and . If , and is -exact but not -exact, then

PROOF: Let . Since is not -exact, . Let

By Lemma 4.2.20,

and it follows that .

By hypothesis, but , and therefore, is odd, i.e., , where . Now since

is -exact. Let

By Lemma 4.2.15, is -exact as well.

If , then by Lemma 4.2.16, , which implies , contradicting Lemma 6.3.13. Similarly, if , then and . Thus,

The meaning of “round to nearest even” is that in the case of a midpoint $x$, $\mathit{RNE}(x, n)$ is defined to be the “rounder” of the two nearest $n$-exact numbers, i.e., the one that is $(n-1)$-exact.

Lemma 6.3.21   (rne-midpoint) Let $x \in \mathbb{R}$, $n \in \mathbb{N}$, and $n > 1$. If $x$ is $(n+1)$-exact but not $n$-exact, then $\mathit{RNE}(x, n)$ is $(n-1)$-exact.

PROOF: Again we may assume . Let and . Since , . But , hence and .

If is even, then

and by Lemma 6.1.6,

If is odd, then

We may assume , and hence by Corollary 6.2.10,

One consequence of this result is that a midpoint is sometimes rounded up and sometimes down. Therefore, over the course of a long series of computations and approximations, rounding error is less likely to accumulate to a significant degree in one particular direction than it would be if the choice were made more consistently. The cost of this feature is a more complicated definition, requiring a more expensive implementation.
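The alternation of rounding direction at midpoints can be seen on a small case. The sketch below rounds the four midpoints between the 3-exact numbers 4, 5, 6, 7, 8 and sums the signed errors; a tie rule that always rounds away from zero is included purely for contrast. The helper `nearest` and its `ties_to_even` flag are hypothetical names introduced for this illustration.

```python
from fractions import Fraction
from math import floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero x."""
    e = 0
    while Fraction(2) ** e > abs(x):
        e -= 1
    while Fraction(2) ** (e + 1) <= abs(x):
        e += 1
    return e

def nearest(x, n, ties_to_even):
    """Round x to n bits; ties go to even if ties_to_even, else away from zero."""
    x = Fraction(x)
    if x == 0:
        return x
    ulp = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / ulp
    z = floor(scaled)
    f = scaled - z
    half = Fraction(1, 2)
    up = f > half or (f == half and (z % 2 == 1 or not ties_to_even))
    mag = (z + up) * ulp
    return mag if x > 0 else -mag

midpoints = [Fraction(9, 2), Fraction(11, 2), Fraction(13, 2), Fraction(15, 2)]

even = [nearest(x, 3, True) for x in midpoints]
away = [nearest(x, 3, False) for x in midpoints]

print([int(r) for r in even])                       # [4, 6, 6, 8]
print(sum(r - x for r, x in zip(even, midpoints)))  # 0
print([int(r) for r in away])                       # [5, 6, 7, 8]
print(sum(r - x for r, x in zip(away, midpoints)))  # 2
```

On this sample the signed errors of the ties-to-even rule cancel, while the always-away rule accumulates a bias of one half per midpoint.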

When the goal of a computation is provable accuracy rather than IEEE-compliance, a simpler version of “round to nearest” may be appropriate. The critical feature of this mode then becomes the relative error bound guaranteed by Lemma 6.3.14, since this is likely to be the basis for any formal error analysis. The following definition presents an alternative to RNE that respects the same error bound (see Lemma 6.3.35) but admits a simpler implementation and is therefore commonly used for internal floating-point calculations.

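Definition 6.3.2   (rna) Given $x \in \mathbb{R}$ and $n \in \mathbb{N}$, let $z = \lfloor 2^{n-1}\mathit{sig}(x) \rfloor$ and $f = 2^{n-1}\mathit{sig}(x) - z$. Then

$$\mathit{RNA}(x, n) = \begin{cases} \mathit{RTZ}(x, n) & \text{if } f < 1/2 \\ \mathit{RAZ}(x, n) & \text{if } f \geq 1/2. \end{cases}$$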

Example: Let and . Since

Naturally, many of the properties of RNE are held by RNA as well. We list some of them here, omitting the proofs, which are essentially the same as those given above for RNE.

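Lemma 6.3.22   (rna-choice) For all $x \in \mathbb{R}$ and $n \in \mathbb{N}$, either

$$\mathit{RNA}(x, n) = \mathit{RTZ}(x, n)$$

or

$$\mathit{RNA}(x, n) = \mathit{RAZ}(x, n).$$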

Lemma 6.3.23   (sgn-rna) Let $x \in \mathbb{R}$ and $n \in \mathbb{N}$. If $n > 0$, then $\mathit{sgn}(\mathit{RNA}(x, n)) = \mathit{sgn}(x)$.

Lemma 6.3.24   (rna-exactp-a) For all $x \in \mathbb{R}$ and $n \in \mathbb{N}$, $\mathit{RNA}(x, n)$ is $n$-exact.

Lemma 6.3.25   (rna-exactp-b) Let $x \in \mathbb{R}$ and $n \in \mathbb{N}$. If $n > 0$ and $x$ is $n$-exact, then $\mathit{RNA}(x, n) = x$.

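Lemma 6.3.26   (rna-exactp-c,rna-exactp-d) Let $x \in \mathbb{R}$, $a \in \mathbb{R}$, and $n \in \mathbb{N}$, $n > 0$. Suppose $a$ is $n$-exact.

 (a) If $a \geq x$, then $a \geq \mathit{RNA}(x, n)$. (b) If $a \leq x$, then $a \leq \mathit{RNA}(x, n)$.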

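Lemma 6.3.27   (expo-rna) For all $x \in \mathbb{R}$ and $n \in \mathbb{Z}^+$, if $|\mathit{RNA}(x, n)| \neq 2^{\mathit{expo}(x)+1}$, then $\mathit{expo}(\mathit{RNA}(x, n)) = \mathit{expo}(x)$.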

Lemma 6.3.28   (rna<=raz, rna>=rtz) Let and . If , then

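Lemma 6.3.29   (rna-shift) For all $x \in \mathbb{R}$, $n \in \mathbb{N}$, and $k \in \mathbb{Z}$, $\mathit{RNA}(2^k x, n) = 2^k \mathit{RNA}(x, n)$.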

Lemma 6.3.30   (rna-minus) For all $x \in \mathbb{R}$ and $n \in \mathbb{N}$, $\mathit{RNA}(-x, n) = -\mathit{RNA}(x, n)$.

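Lemma 6.3.31   (rna-rtz, rna-raz) Let $x \in \mathbb{R}$ and $n \in \mathbb{N}$.

(a) If $|x - \mathit{RTZ}(x, n)| < |x - \mathit{RAZ}(x, n)|$, then $\mathit{RNA}(x, n) = \mathit{RTZ}(x, n)$.

(b) If $|x - \mathit{RTZ}(x, n)| > |x - \mathit{RAZ}(x, n)|$, then $\mathit{RNA}(x, n) = \mathit{RAZ}(x, n)$.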

Corollary 6.3.32   (rna-down, rna-up) Let , , and . Assume that , , and is -exact.

(a) If , then .

(b) If , then .

Corollary 6.3.33   (rna-up-2) Let , , , and . , , and , then .

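Lemma 6.3.34   (rna-nearest) Let $x \in \mathbb{R}$, $y \in \mathbb{R}$, and $n \in \mathbb{N}$. If $n > 0$ and $y$ is $n$-exact, then $|x - y| \geq |x - \mathit{RNA}(x, n)|$.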

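Lemma 6.3.35   (rna-diff) If $x \in \mathbb{R}$, $n \in \mathbb{N}$, and $n > 0$, then $|x - \mathit{RNA}(x, n)| \leq 2^{\mathit{expo}(x) - n}$.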

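Lemma 6.3.36   (rna-monotone) Let $x \in \mathbb{R}$, $y \in \mathbb{R}$, and $n \in \mathbb{N}$. If $n > 0$ and $x \leq y$, then $\mathit{RNA}(x, n) \leq \mathit{RNA}(y, n)$.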

Lemma 6.3.37   (rna-rna-lemma) Let , , and . If and , then for some , and a is -exact.

Corollary 6.3.38   (rna-rna) Let , , , , and with . If is -exact, , and , then .
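Several of the RNA properties listed above lend themselves to a quick numerical check. The sketch below assumes that RNA selects the truncated value when $f < 1/2$ and the value rounded away from zero otherwise, and it tests negation, scaling by powers of two, the half-spacing error bound, and monotonicity on a sample grid. The function names and the grid are assumptions of this illustration.

```python
from fractions import Fraction
from math import floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero x."""
    e = 0
    while Fraction(2) ** e > abs(x):
        e -= 1
    while Fraction(2) ** (e + 1) <= abs(x):
        e += 1
    return e

def rna(x, n):
    """Round to nearest, ties away from zero (assumed reading of RNA)."""
    x = Fraction(x)
    if x == 0:
        return x
    ulp = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / ulp
    z = floor(scaled)
    f = scaled - z
    mag = (z if f < Fraction(1, 2) else z + 1) * ulp
    return mag if x > 0 else -mag

samples = [Fraction(k, 16) for k in range(1, 400)]   # ascending, positive

for n in range(1, 6):
    for x in samples:
        assert rna(-x, n) == -rna(x, n)                        # negation
        for k in (-3, 2):
            assert rna(2 ** Fraction(k) * x, n) == 2 ** Fraction(k) * rna(x, n)  # scaling
        assert abs(x - rna(x, n)) <= Fraction(2) ** (expo(x) - n)   # error bound
    for lo, hi in zip(samples, samples[1:]):
        assert rna(lo, n) <= rna(hi, n)                        # monotonicity

print(rna(37, 5))   # 38: the tie between 36 and 38 goes away from zero
```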

The difference between RNE and RNA is that the latter always rounds a midpoint away from 0.

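Lemma 6.3.39   (rna-midpoint) Let $x \in \mathbb{R}$ and $n \in \mathbb{N}$. If $x$ is $(n+1)$-exact but not $n$-exact, then

$$\mathit{RNA}(x, n) = \mathit{RAZ}(x, n) = x + \mathit{sgn}(x)\, 2^{\mathit{expo}(x) - n}.$$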

PROOF: By Lemmas 6.3.9 and 6.2.2, we may assume . Let and . Since , . But , hence and . Therefore, according to Definition 6.3.2, . The second inequality is a restatement of Lemma 6.2.19
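The midpoint behaviour of the two modes can be compared exhaustively for a small precision. The sketch below enumerates every midpoint in $[1, 2)$ for $n = 3$, checks that the ties-away result equals $x + 2^{\mathit{expo}(x) - n}$ for positive $x$, and records where it agrees with ties-to-even. The closed form and the helper `nearest` are assumptions of this illustration.

```python
from fractions import Fraction
from math import floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero x."""
    e = 0
    while Fraction(2) ** e > abs(x):
        e -= 1
    while Fraction(2) ** (e + 1) <= abs(x):
        e += 1
    return e

def nearest(x, n, ties_to_even):
    """RNE if ties_to_even is true, otherwise RNA (ties away from zero)."""
    x = Fraction(x)
    if x == 0:
        return x
    ulp = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / ulp
    z = floor(scaled)
    f = scaled - z
    half = Fraction(1, 2)
    up = f > half or (f == half and (z % 2 == 1 or not ties_to_even))
    mag = (z + up) * ulp
    return mag if x > 0 else -mag

n = 3
agree = []
for m in range(2 ** (n - 1), 2 ** n):
    x = Fraction(2 * m + 1, 2 ** n)     # midpoint between m/4 and (m+1)/4
    away = nearest(x, n, False)
    even = nearest(x, n, True)
    assert away == x + Fraction(2) ** (expo(x) - n)   # always the upper neighbour
    assert nearest(-x, n, False) == -away             # symmetric for negative x
    agree.append(even == away)

print(agree)   # [False, True, False, True]
```

The two modes agree exactly on the midpoints whose lower neighbour has an odd significand, which is where ties-to-even also rounds up.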

There is one case of a midpoint for which RNE and RNA are guaranteed to produce the same result: if the greater of the two representable numbers that are equidistant from is a power of 2, i.e., , then both modes round to this number.

PROOF: Suppose . Then Lemma 6.3.6 implies

and by Lemmas 4.2.16 and 6.3.14,

It follows that , which is easily seen to be -exact but not -exact, while is -exact but not -exact, contradicting Lemma 6.3.21.

Now suppose . Using Lemmas 6.3.27 and 6.3.35, we may show in the same way as above that . Once again, is -exact but not -exact, and hence, by Lemma 6.3.39,

Finally, suppose . Since is -exact, by Lemma 6.1.11. But then by Lemma 4.2.16,

The additive property shared by RTZ and RAZ that is described in Lemmas 6.1.16 and 6.2.21, respectively, does not hold for RNE in precisely the same form. For example, let , , and . Then

and

Although is clearly -exact,

while

However, this property is shared by RNA, and a slightly weaker version holds for RNE.

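Lemma 6.3.41   (plus-rne,plus-rna) Let $x \in \mathbb{R}$, $y \in \mathbb{R}$, and $k \in \mathbb{Z}$ with $x \geq 0$ and $y \geq 0$. Let

$$k' = k + \mathit{expo}(x) - \mathit{expo}(y)$$

and

$$k'' = k + \mathit{expo}(x + y) - \mathit{expo}(y).$$

(a) If $x$ is $(k' - 1)$-exact, then

$$x + \mathit{RNE}(y, k) = \mathit{RNE}(x + y, k'').$$

(b) If $x$ is $k'$-exact, then

$$x + \mathit{RNA}(y, k) = \mathit{RNA}(x + y, k'').$$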

PROOF:

(a) Applying Lemma 6.1.16 and 6.2.21, we need only show that either

and

or

and

Let , , , and . According to Definition 6.3.1, it will suffice to show that

for some , for then Lemma 1.1.4 will imply that and . But

where

by Lemma 4.2.1.

(b) Here we must show that either

and

or

and

According to Definition 6.3.2, this is true whenever . Thus, we need only show that

which is equivalent to the hypothesis that is -exact.
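The failure of the additive property for RNE, and its survival for RNA, can be reproduced with small numbers. The values $x = 1$, $y = 5/2$, $k = 2$ below are chosen for this illustration and are not the example from the text. The sketch also assumes that the precisions involved are $k' = k + \mathit{expo}(x) - \mathit{expo}(y)$ and $k'' = k + \mathit{expo}(x+y) - \mathit{expo}(y)$, and that the weaker RNE version requires $x$ to be $(k'-1)$-exact.

```python
from fractions import Fraction
from math import floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero x."""
    e = 0
    while Fraction(2) ** e > abs(x):
        e -= 1
    while Fraction(2) ** (e + 1) <= abs(x):
        e += 1
    return e

def nearest(x, n, ties_to_even):
    x = Fraction(x)
    if x == 0:
        return x
    ulp = Fraction(2) ** (expo(x) - n + 1)
    scaled = abs(x) / ulp
    z = floor(scaled)
    f = scaled - z
    half = Fraction(1, 2)
    up = f > half or (f == half and (z % 2 == 1 or not ties_to_even))
    mag = (z + up) * ulp
    return mag if x > 0 else -mag

def rne(x, n):
    return nearest(x, n, True)

def rna(x, n):
    return nearest(x, n, False)

def exactp(x, n):
    """True when x is n-exact."""
    x = Fraction(x)
    return x == 0 or (abs(x) / Fraction(2) ** (expo(x) - n + 1)).denominator == 1

k, y = 2, Fraction(5, 2)

# x = 1 is k'-exact but not (k'-1)-exact, so only the RNA form applies.
x = Fraction(1)
k1 = k + expo(x) - expo(y)        # k'  = 1
k2 = k + expo(x + y) - expo(y)    # k'' = 2
assert exactp(x, k1) and not exactp(x, k1 - 1)
print(x + rne(y, k), rne(x + y, k2))   # 3 4  (property fails for RNE)
print(x + rna(y, k), rna(x + y, k2))   # 4 4  (property holds for RNA)

# x = 2 is (k'-1)-exact, and the RNE form holds.
x = Fraction(2)
k1 = k + expo(x) - expo(y)        # k'  = 2
k2 = k + expo(x + y) - expo(y)    # k'' = 3
assert exactp(x, k1 - 1)
assert x + rne(y, k) == rne(x + y, k2) == 4
```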

The rounding constant (see the discussion preceding Lemma 6.2.24) for both of the modes RNE and RNA is a simple power of 2, equal to half the value of the least significant bit of the rounded result. That is, if the rounding precision is $n$ and the unrounded result is

then , as illustrated below.

The following lemma exposes the extra expense of implementing RNE as compared to RNA. While the correctly rounded result is given by $\mathit{RTZ}(x + C, n)$ in most cases, where $C$ is the rounding constant, special attention is required for the computation of $\mathit{RNE}(x, n)$ in the case of a midpoint, where it may differ from $\mathit{RNA}(x, n)$, i.e., when $x$ is $(n+1)$-exact but not $n$-exact. In this case, the least significant bit must be forced to 0. This is accomplished by truncating $x + C$ to $n - 1$ bits rather than $n$.
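The add-then-truncate scheme described in this paragraph can be sketched and compared against direct definitions of the two modes. The sketch assumes $C = 2^{\mathit{expo}(x) - n}$ (half the least significant bit of an $n$-bit result), that RNA is obtained by truncating $x + C$ to $n$ bits, and that RNE truncates to $n - 1$ bits exactly when $x$ is a midpoint. These identities are a reading of the paragraph above, restricted here to $x > 0$ and $n > 1$, and are tested numerically rather than taken as given; all function names are hypothetical.

```python
from fractions import Fraction
from math import floor

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero x."""
    e = 0
    while Fraction(2) ** e > abs(x):
        e -= 1
    while Fraction(2) ** (e + 1) <= abs(x):
        e += 1
    return e

def rtz(x, n):
    """Truncate positive x to n bits."""
    ulp = Fraction(2) ** (expo(x) - n + 1)
    return floor(x / ulp) * ulp

def exactp(x, n):
    return (x / Fraction(2) ** (expo(x) - n + 1)).denominator == 1

def nearest(x, n, ties_to_even):
    """Reference model: RNE if ties_to_even, else RNA, for positive x."""
    ulp = Fraction(2) ** (expo(x) - n + 1)
    scaled = x / ulp
    z = floor(scaled)
    f = scaled - z
    half = Fraction(1, 2)
    up = f > half or (f == half and (z % 2 == 1 or not ties_to_even))
    return (z + up) * ulp

def rna_by_constant(x, n):
    c = Fraction(2) ** (expo(x) - n)      # rounding constant
    return rtz(x + c, n)

def rne_by_constant(x, n):
    c = Fraction(2) ** (expo(x) - n)
    midpoint = exactp(x, n + 1) and not exactp(x, n)
    # At a midpoint, truncating to n-1 bits forces the last result bit to 0.
    return rtz(x + c, n - 1 if midpoint else n)

for n in range(2, 6):
    for k in range(1, 400):
        x = Fraction(k, 16)
        assert rna_by_constant(x, n) == nearest(x, n, False)
        assert rne_by_constant(x, n) == nearest(x, n, True)

print(rne_by_constant(Fraction(9, 2), 3), rna_by_constant(Fraction(9, 2), 3))   # 4 5
```

The only extra work for RNE in this sketch is the midpoint test and the change of truncation width, which mirrors the extra expense the paragraph refers to.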

PROOF: If , then by Lemma 6.3.40,

But then, by Lemmas 6.1.15, 4.2.6, and 6.1.10,

Thus, we may assume , and it follows from Lemmas 6.3.6, 6.3.14, 6.3.27, and 6.3.35 that

Case 1: is -exact

By Lemma 6.1.11, . But since

Lemma 4.2.16 yields , and hence

Case 2: is not -exact

We have , for otherwise Lemma 6.3.14 would imply

and since is -exact, so would be

Since , by Lemma 6.1.11. But since

.

The same argument applies to , but with Lemma 6.3.35 invoked in place of Lemma 6.3.14.

Case 3: is -exact but not -exact

The identity for is given by Lemma 6.3.39. To prove the claim for , we first consider the case . Since is -exact, , hence , and by Lemma 6.3.21,

Now suppose . Then implies . But since

we have

As a consequence of the preceding lemma, depends only on the most significant bits of .

PROOF: By Lemmas 6.3.30 and 6.1.3, we may assume that . Furthermore, it will suffice to consider the case , because then for ,

Thus, according to Lemmas 6.3.42 and 6.1.6, our goal is to prove

but after applying Lemma 6.1.5, we need only show

Let

By Lemmas 6.1.9 and 4.2.15, is -exact. Now since is also -exact and Lemmas 6.1.5 and 6.1.7 imply

Lemma 4.2.16 yields . Finally, by Lemma 6.1.11,

David Russinoff 2017-08-01