

Rounding

The objective of floating-point rounding is an approximation of an arbitrary rational number by one that is representable with respect to a given floating-point format. In the usual case, when the number to be approximated lies within the exponent range of the target format, this amounts to an $n$-exact approximation, where $n$ is the number of bits of precision provided by the format's significand field.

We define a rounding mode to be a mapping ${\cal R}: \mathbb{Q} \times \mathbb{N} \rightarrow \mathbb{Q}$ such that for all $x \in \mathbb{Q}$, $y \in \mathbb{Q}$, and $n \in \mathbb{N}$, the following axioms are satisfied:

(1) ${\cal R}(x,n)$ is $n$-exact.

(2) If $x$ is $n$-exact, then ${\cal R}(x,n) = x$.

(3) If $x \leq y$, then ${\cal R}(x,n) \leq {\cal R}(y,n)$.
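The three axioms can be spot-checked numerically. The following is a minimal sketch, not part of the original text, that assumes the usual definitions from the preceding material: $expo(x)$ is the integer $e$ with $2^e \leq \vert x\vert < 2^{e+1}$, and a nonzero $x$ is $n$-exact when $x \cdot 2^{n-1-expo(x)}$ is an integer. The names `expo`, `is_exact`, `toward_zero`, `satisfies_axioms`, and `SAMPLES` are hypothetical helpers introduced only for illustration, and $n \geq 1$ is assumed.

```python
from fractions import Fraction
import math


def expo(x):
    """Return the integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    if x == 0:
        raise ValueError("expo is undefined at 0")
    # The bit-length difference is either the exponent or one more than it.
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    return e


def is_exact(x, n):
    """Assumed definition: x is n-exact if x * 2**(n-1-expo(x)) is an integer."""
    x = Fraction(x)
    if x == 0:
        return True
    return (x * Fraction(2) ** (n - 1 - expo(x))).denominator == 1


def toward_zero(x, n):
    """Hypothetical candidate mode: drop all but the leading n significand bits."""
    x = Fraction(x)
    if x == 0:
        return x
    scale = Fraction(2) ** (n - 1 - expo(x))
    r = math.floor(abs(x) * scale) / scale
    return r if x > 0 else -r


def satisfies_axioms(R, samples, n):
    """Check axioms (1)-(3) for R on a finite sample; this is not a proof."""
    xs = sorted(Fraction(s) for s in samples)
    for x in xs:
        if not is_exact(R(x, n), n):               # axiom (1)
            return False
        if is_exact(x, n) and R(x, n) != x:        # axiom (2)
            return False
    # Axiom (3): checking adjacent sorted pairs suffices by transitivity.
    return all(R(a, n) <= R(b, n) for a, b in zip(xs, xs[1:]))


# Multiples of 1/16 in [-4, 4], used as a small test grid.
SAMPLES = [Fraction(k, 16) for k in range(-64, 65)]
```

Calling `satisfies_axioms(toward_zero, SAMPLES, 3)` exercises all three axioms on the grid, whereas the identity map `lambda x, n: Fraction(x)` fails axiom (1) as soon as the grid contains a value that is not 3-exact, such as 17/16. Exact rational arithmetic via `fractions.Fraction` is used so that the check itself introduces no rounding.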

One implication of these conditions is that if $x>0$, then ${\cal R}(x,n) > 0$: since $2^{expo(x)}$ is $n$-exact and $x \geq 2^{expo(x)}$, axioms (3) and (2) give

$\displaystyle {\cal R}(x,n) \geq {\cal R}(2^{expo(x)},n) = 2^{expo(x)} > 0.$

Similarly, if $x<0$, then ${\cal R}(x,n) < 0$. Another consequence is that the approximation given by ${\cal R}$ is optimal in the sense that there can exist no $n$-exact number in the open interval between $x$ and ${\cal R}(x,n)$. For example, if $y$ is $n$-exact and $x < y$, then ${\cal R}(x,n) \leq {\cal R}(y,n) = y$.
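The optimality claim can be spelled out as a short contradiction argument; what follows is a sketch that uses only axioms (2) and (3). Suppose that $y$ is $n$-exact and $x < y < {\cal R}(x,n)$. Then

$\displaystyle {\cal R}(x,n) \leq {\cal R}(y,n) = y < {\cal R}(x,n),$

which is impossible. Symmetrically, if ${\cal R}(x,n) < y < x$, then

$\displaystyle {\cal R}(x,n) < y = {\cal R}(y,n) \leq {\cal R}(x,n),$

again a contradiction. Hence no $n$-exact $y$ lies strictly between $x$ and ${\cal R}(x,n)$.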

In the first two sections of this chapter, we examine the two basic rounding modes trunc and away, characterized by the inequalities

$\displaystyle \vert trunc(x,n)\vert \leq \vert x\vert$

and

$\displaystyle \vert away(x,n)\vert \geq \vert x\vert,$

respectively. By the optimality property noted above, for any given arguments $x$ and $n$, either ${\cal R}(x,n) = trunc(x,n)$ or ${\cal R}(x,n) = away(x,n)$. Consequently, any rounding mode ${\cal R}$ may be defined in terms of these two. In Sections 5.3, 5.4, and 5.5, we discuss various other modes, including those that are prescribed by the IEEE standard as well as others that are commonly used in implementations of floating-point operations.
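As an illustration of this dichotomy, the sketch below implements $trunc$ and $away$ with exact rational arithmetic and checks that a third mode always agrees with one of them. It is not part of the original text and rests on assumptions: $trunc$ and $away$ are taken to be the floor and ceiling, respectively, of the scaled magnitude $\vert x\vert \cdot 2^{n-1-expo(x)}$, which matches the characterizing inequalities above, and `near` is a hypothetical round-to-nearest mode with ties to even, chosen only as an example of "another mode". $n \geq 1$ is assumed.

```python
from fractions import Fraction
import math


def expo(x):
    """Return the integer e with 2**e <= |x| < 2**(e+1); x must be nonzero."""
    x = abs(Fraction(x))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    return e


def _round_magnitude(x, n, pick):
    """Scale |x| so its n-bit significand is an integer, apply pick, rescale."""
    x = Fraction(x)
    if x == 0:
        return x
    scale = Fraction(2) ** (n - 1 - expo(x))
    r = pick(abs(x) * scale) / scale
    return r if x > 0 else -r


def trunc(x, n):
    """Assumed definition: round toward zero, so |trunc(x, n)| <= |x|."""
    return _round_magnitude(x, n, math.floor)


def away(x, n):
    """Assumed definition: round away from zero, so |away(x, n)| >= |x|."""
    return _round_magnitude(x, n, math.ceil)


def near(x, n):
    """Hypothetical example mode: nearest n-exact value, ties to even.

    Python's round() on a Fraction rounds half to even and returns an int.
    """
    return _round_magnitude(x, n, round)


def is_trunc_or_away(R, x, n):
    """True if R(x, n) coincides with trunc(x, n) or away(x, n)."""
    return R(x, n) in (trunc(x, n), away(x, n))
```

For instance, with $x = 17/16$ and $n = 3$, `trunc` returns 1 and `away` returns 5/4, and `near` selects the former; for the tie $x = 9/8$, `near` also selects `trunc`, because the truncated significand is even.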

Other considerations are involved in the rounding of results that lie outside of the normal range of a format. In the case of overflow, which occurs when the result of a computation exceeds the representable range, the standard prescribes rounding either to the maximum representable number or to infinity. The rules that govern this choice are quite arbitrary from a mathematical perspective and will not be discussed here. The more interesting case of underflow, involving a non-zero result that lies below the normal range, is the subject of Section 5.6.



David Russinoff 2007-01-02