6  Rounding

The objective of floating-point rounding is an approximation of an arbitrary real number by one that is representable with respect to a given floating-point format. We define a rounding mode to be a mapping $ {\cal R}: \mathbb{R} \times \mathbb{N} \rightarrow \mathbb{R}$ such that for all $ x \in \mathbb{R}$, $ y \in \mathbb{R}$, and $ n \in \mathbb{N}$, the following axioms are satisfied:

  1. $ {\cal R}(x,n)$ is $ n$-exact.

  2. If $ x$ is $ n$-exact, then $ {\cal R}(x,n) = x$.

  3. If $ x \leq y$, then $ {\cal R}(x,n) \leq {\cal R}(y,n)$.
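The three axioms can be spot-checked numerically. The following is a minimal sketch, assuming the definitions used elsewhere in the text: $ 2^{expo(x)} \leq \vert x\vert < 2^{expo(x)+1}$, a nonzero $ x$ is $ n$-exact when $ x \cdot 2^{n-1-expo(x)}$ is an integer, and 0 is $ n$-exact. The names `is_exact`, `round_down`, and `check_axioms` are hypothetical, and `round_down` (rounding toward $ -\infty$) is only one candidate mode chosen for illustration. The sketch works on Python floats, so it is meaningful only for small $ n$ (well below 53) and for $ n \geq 1$.

```python
import math

def is_exact(x, n):
    """True if the float x is n-exact (assumed definition: sig(x) * 2**(n-1) is an integer)."""
    if x == 0:
        return True
    m, _ = math.frexp(x)      # x = m * 2**e with 0.5 <= |m| < 1, so sig(x) = 2*|m|
    return (m * 2**n).is_integer()

def round_down(x, n):
    """Candidate mode (an illustration only): the largest n-exact number <= x."""
    if x == 0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(math.floor(m * 2**n), e - n)

def check_axioms(mode, xs, n):
    """Spot-check the three axioms for `mode` on the sample points xs."""
    xs = sorted(xs)
    for x in xs:
        r = mode(x, n)
        assert is_exact(r, n)             # axiom 1: the result is n-exact
        if is_exact(x, n):
            assert r == x                 # axiom 2: n-exact inputs are fixed
    for x, y in zip(xs, xs[1:]):
        assert mode(x, n) <= mode(y, n)   # axiom 3: monotonicity (adjacent samples suffice)
    return True

samples = [-7.5, -5.3, -1.0, -0.3, 0.0, 0.1, 1.0, 5.3, 6.0, 7.9, 100.0]
check_axioms(round_down, samples, 3)
```

A check of this kind can only refute a candidate mode on the chosen samples; it does not prove that the axioms hold for all real arguments.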

One implication of these conditions is that if $ x>0$, then $ {\cal R}(x,n) > 0$, since $ x \geq 2^{expo(x)}$ implies

$\displaystyle {\cal R}(x,n) \geq {\cal R}(2^{expo(x)},n) = 2^{expo(x)} > 0.$

Similarly, if $ x<0$, then $ {\cal R}(x,n) < 0$. Another consequence is that the approximation given by $ {\cal R}$ is optimal in the sense that there can exist no $ n$-exact number in the open interval between $ x$ and $ {\cal R}(x,n)$. For example, if $ y$ is $ n$-exact and $ x < y$, then $ {\cal R}(x,n) \leq {\cal R}(y,n) = y$.
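As a concrete illustration of this optimality property, take $ n = 3$ and $ x = 5.3$. This worked example assumes the definition of $ n$-exactness in terms of $ expo$, under which the 3-exact numbers in the interval $ [4, 8)$ are exactly 4, 5, 6, and 7. Axioms 2 and 3 then give

$\displaystyle 5 = {\cal R}(5,3) \leq {\cal R}(5.3,3) \leq {\cal R}(6,3) = 6,$

and since $ {\cal R}(5.3,3)$ is 3-exact by Axiom 1 and no 3-exact number lies strictly between 5 and 6, the only possible values are $ {\cal R}(5.3,3) = 5$ and $ {\cal R}(5.3,3) = 6$.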

In the first two sections of this chapter, we examine the two basic rounding modes RTZ (“round toward 0”) and RAZ (“round away from 0”), characterized by the inequalities

$\displaystyle \vert\mathit{RTZ}(x,n)\vert \leq \vert x\vert$

and

$\displaystyle \vert\mathit{RAZ}(x,n)\vert \geq \vert x\vert,$

respectively. Since no $ n$-exact number can lie strictly between $ x$ and $ {\cal R}(x,n)$, it follows that for any rounding mode $ {\cal R}$ and arguments $ x$ and $ n$, either $ {\cal R}(x,n) = \mathit{RTZ}(x,n)$ or $ {\cal R}(x,n) = \mathit{RAZ}(x,n)$. It is natural, therefore, to define other rounding modes in terms of these two. In Sections 6.3, 6.4, and 6.5, we discuss the modes that are prescribed by the IEEE standard as well as others that are commonly used in implementations of floating-point operations.
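The two basic modes can be sketched in exact rational arithmetic. The formal definitions are given in the sections that follow; the code below merely assumes the usual formulation, in which the magnitude of $ x$ is scaled by $ 2^{n-1-expo(x)}$, truncated (RTZ) or rounded up (RAZ) to an integer, and scaled back, for $ n \geq 1$. The function names `expo`, `rtz`, and `raz` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
import math

def expo(x):
    """Integer e with 2**e <= |x| < 2**(e+1), for nonzero rational x (assumed definition)."""
    x = abs(Fraction(x))
    if x == 0:
        raise ValueError("expo is not defined for 0 in this sketch")
    # The bit-length difference is within 1 of the answer; the loops correct it.
    e = x.numerator.bit_length() - x.denominator.bit_length()
    while Fraction(2) ** e > x:
        e -= 1
    while Fraction(2) ** (e + 1) <= x:
        e += 1
    return e

def _round_magnitude(x, n, to_int):
    """Round |x| to a multiple of 2**(expo(x)-n+1) using to_int, then restore the sign."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    unit = Fraction(2) ** (expo(x) - n + 1)   # spacing of n-exact numbers near x
    sign = 1 if x > 0 else -1
    return sign * to_int(abs(x) / unit) * unit

def rtz(x, n):
    """Round toward 0: truncate the scaled magnitude."""
    return _round_magnitude(x, n, math.floor)

def raz(x, n):
    """Round away from 0: take the ceiling of the scaled magnitude."""
    return _round_magnitude(x, n, math.ceil)

x = Fraction(53, 10)                 # 5.3
print(rtz(x, 3), raz(x, 3))          # 5 6
print(rtz(-x, 3), raz(-x, 3))        # -5 -6
```

Using `Fraction` rather than floats keeps every intermediate value exact, so the inequalities $ \vert\mathit{RTZ}(x,n)\vert \leq \vert x\vert \leq \vert\mathit{RAZ}(x,n)\vert$ can be checked without any rounding error introduced by the host arithmetic.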

Considerations other than $ n$-exactness are involved in the rounding of results that lie outside of the normal range of a format. In the case of overflow, which occurs when the result of a computation exceeds the representable range, the standard prescribes rounding either to the maximum representable number or to infinity. The rules that govern this choice, which are quite arbitrary from a mathematical perspective, are deferred to Part III. The more interesting case of underflow, involving a non-zero result that lies below the normal range, is the subject of Section 6.6.

David Russinoff 2017-08-01