A floating-point format is a scheme for encoding rationals as bit
vectors, derived from the decomposition expressed by Lemma 4.1.1.
A format may be characterized by two positive integers
and
, where
In fact, there are two classes of floating-point formats in common
use: those in which the leading significand bit appears explicitly,
and those in which it is implicit. Representations of the first type
include most implementations of the single extended (
,
) and double extended (
,
) formats
provided by IEEE Standard 754 [IEEE85], as well as the
higher-precision formats that are typically used for internal
computations in floating-point units. Under this scheme, as
diagrammed in Figure 4.3, an encoding
is a concatenation of
the form
the components of which are formally defined as follows.
|
(a)
esgnf |
|
(a)
eexpof |
|
(a)
esigf |
The significand field of such an encoding is interpreted with an implied radix point following the most significant bit. That is, if
then the value represented by this component is
The rational number encoded by
where the the exponent lies in the range
The decoding function for this format, therefore, is defined as follows.
The most commonly used formats, however, are based on an implicit MSB.
These include the IEEE basic single (
,
) and double (
,
) formats, at least one of
which must be implemented by any IEEE-compliant floating-point unit. An
encoding with respect to this scheme, as diagrammed in
Figure 4.3, consists of the fields defined below.
|
(a)
isgnf |
|
(b)
iexpof |
|
(c)
isigf |
The value of the implicit MSB for an encoding
may be either 0 or 1,
and is determined by
iexpof
. If this field has
any value other than the extremes 0 and
, then
is said to
be a normal encoding, as characterized by the following
predicate.
The implicit bit of a normal encoding is always 1. Consequently, the
decoding function is the same as that for an encoding with explicit
MSB, except that
esigf
is replaced by
|
(a)
|
|
(b)
|
|
(c)
|
PROOF: Part (a) is a trivial consequence of Definition 4.3.5.
By Lemma 2.2.1,
and thus, by Definition 4.1.1,
Now by Lemma 4.1.12,
| iexpof |
The following predicate characterizes the rational numbers that are representable by normal encodings.
|
(a) |
|
(b)
|
|
(c) |
The normal encoding of a representable number is derived as follows.
The next two lemmas establish an inverse relation between the encoding and decoding functions, from which it follows that the numbers that admit normal encodings are precisely those that satify nrepp.
PROOF: Let
. It is clear from Definition 4.3.5
that
. By Lemma 4.3.1,
is a
i.e.,
It also clear from Definition 4.3.5 that
Thus, by Definitions 4.3.7 and 4.3.3 and Lemmas 2.4.9 and 2.2.11,
PROOF: Let
.
By Lemma 2.4.1,
is a
-bit vector and by Lemma 2.4.7,
and
Since
is a
-bit vector,
by Lemma 2.2.11.
Since
is
-exact,
and by Lemma 4.1.7,
Finally, according to Definition 4.3.5,
We shall have occasion to refer to the smallest positive number that admits a normal representation.
PROOF: It is clear that
is positive and satisfies Definition 4.3.6.
Moreover, if
and
, then since
,
by Lemma 4.1.2.
Of the two values of the exponent field that lie outside of the range
of normal encodings, the upper extreme
is reserved for the
encoding of infinities and other non-numerical entities, which will
not be discussed here, while an exponent field of 0 is used to encode
numerical values that lie below the normal range. If both the
exponent and significand fields are 0, then the encoded value itself
is 0. In the remaining case, the encoding is said to be denormal.
Theere are two differences between the decoding formulas for denormal and normal representations:
|
(a)
|
|
(b)
|
|
(c)
|
PROOF: (a) is trivial; (b) and (c) follow from Lemmas 4.1.11 and 4.1.12.
The class of rationals that are representable as denormal encodings is recognized by the following predicate.
|
(a) |
|
(b)
|
|
(c) |
If a number is so representable, then its encoding is constructed as follows.
Next, we examine the relationship between the decoding and encoding functions.
PROOF: Let
. Since
isigf
,
and by Lemma 4.1.2,
which is equivalent to Definition 4.3.11(b). In order to prove (c), we must show, according to Definition 4.2.1, that
But
| isigf |
Now by Definition 4.3.10,
Therefore, by Definitions 4.3.9, 4.3.12, and 4.3.3 and Lemmas 2.4.9 and 2.2.11,
PROOF: Let
.
By Lemma 2.4.1,
is a
-bit vector and by Lemma 2.4.7,
and
Since
is
-exact,
and since
by Lemma 4.1.7, which implies
Finally, according to Definition 4.3.10,
The smallest positive denormal is computed by the following function:
PROOF: It is clear that
is positive. To show that
is
-exact, we need only observe that
Finally, since
Every number with a denormal representation is a multiple of the smallest positive denormal.
PROOF: For
, let
. Then
and
We shall show, by induction on
Now suppose that
and
. Let
.
Clearly,
, and
. It follows from Lemma 4.2.16
that
, and consequently,
is
-exact.
Thus, by Lemma 4.2.15,
.
A common task performed by floating-point units is the conversion of
an encoding from one format to another, which generally requires the
rebiasing of the exponent field. This operation may, of course, be
viewed in terms of bit vector addition. If
is the value of an
exponent field of width
, then the actual exponent of the encoded
number is
. The result of rebiasing this value for a field
of width
is given by the following definition.
When the target exponent field is wider than that of the source, rebiasing is always possible.
PROOF: First suppose that
. Then by Lemmas 2.2.11 and 2.3.13,
and
On the other hand, by Definition 2.4.1 and Lemma 2.4.18,
Now suppose![]()
![]()
![]()
![]()
and by Definition 2.4.1 and Lemma 2.4.19,
Now suppose that
is an
-bit biased exponent to be rebiased to fit into
a smaller
-bit field. In order for this to be possible,
must be an
or equivalently,
PROOF: The hypothesis implies that
is an
-bit vector, and hence, by
Lemmas 2.2.11, 2.3.13, and 2.2.19,
Suppose first that
and
then