How to Print Floating-Point Numbers Accurately - Computer Science ...
How to Print Floating-Point Numbers Accurately - Computer Science ...
How to Print Floating-Point Numbers Accurately - Computer Science ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>How</strong> <strong>to</strong> <strong>Print</strong> <strong>Floating</strong>-<strong>Point</strong> <strong>Numbers</strong> <strong>Accurately</strong><br />
Guy L. Steele Jr.<br />
Thinking Machines Corporation<br />
245 First Street<br />
Cambridge, Massachusetts 02142<br />
Abstract<br />
gls0think. corn<br />
We present algorithms for accurately converting<br />
floating-point numbers <strong>to</strong> decimal representation. The<br />
key idea is <strong>to</strong> carry along with the computation an ex-<br />
plicit representation of the required rounding accuracy.<br />
We begin with the simpler problem of converting<br />
fixed-point fractions. A modification of the well-known<br />
algorithm for radix-conversion of fixed-point fractions<br />
by multiplication explicitly determines when <strong>to</strong> termi-<br />
nate the conversion process; a variable number of digits<br />
are produced. The algorithm has these properties:<br />
l No information is lost; the original fraction can be<br />
recovered from the output by rounding.<br />
l No “garbage digits” are produced.<br />
l The output is correctly rounded.<br />
0 It is never necessary <strong>to</strong> propagate carries on<br />
rounding.<br />
We then derive two algorithms for free-format out-<br />
put of floating-point numbers. The first simply scales<br />
the given floating-point number <strong>to</strong> an appropriate frac-<br />
tional range and then applies the algorithm for fractions.<br />
This is quite fast and simple <strong>to</strong> code but has inaccura-<br />
cies stemming from round-off errors and oversimplifica-<br />
tion. The second algorithm guarantees mathematical<br />
accuracy by using multiple-precision integer arithmetic<br />
and handling special cases. Both algorithms produce no<br />
more digits than necessary (intuitively, the “1.3 prints<br />
as 1.2999999” problem does not occur).<br />
Finally, we modify the free-format conversion al-<br />
gorithm for use in jized-format applications. Infor-<br />
mation may be lost if the fixed format provides <strong>to</strong>o<br />
few digit positions, but the output is always correctly<br />
rounded. On the other hand, no “garbage digits” are<br />
ever produced, even if the fixed format specifies <strong>to</strong>o<br />
many digit positions (intuitively, the u4/3 prints as<br />
1.333333328366279602” problem does not occur).<br />
Permission <strong>to</strong> copy without fee all or part of this material is granted<br />
provided that the copies are not made or distributed for direct com-<br />
mercial advantage, the ACM copyright notice and the title of the<br />
publication and its date appear, and notice is given that copying is<br />
by permission of the Association for Computing Machinery. To copy<br />
otherwise, or <strong>to</strong> republish, requires a fee and/or specific permission.<br />
01990 ACM 0-89791-364-7/90/0006/0112 $1.50<br />
I<br />
Proceedings of the ACM SIGPLAN’SO Conference on<br />
Programming Language Design and Implementation.<br />
White Plains, New York, June 20-22, 1990.<br />
I<br />
112<br />
Jon L White<br />
Lucid, Inc.<br />
707 Laurel Street<br />
Menlo Park, California 94025<br />
1 Introduction<br />
j onl0lucid. corn<br />
Isn’t it a pain when you ask a computer <strong>to</strong> divide 1.0 by<br />
10.0 and it prints 0.0999999? Most often the arithmetic<br />
is not <strong>to</strong> blame, but the printing algorithm.<br />
Most people use decimal numerals. Most contempo-<br />
rary computers, however, use a non-decimal internal<br />
representation, usually based on powers of two. This<br />
raises the problem of conversion between the internal<br />
representation and the familiar decimal form.<br />
The external decimal representation is typically used<br />
in two different ways. One is for communication with<br />
people. This divides in<strong>to</strong> two cases. For j&d-format<br />
output, a fixed number of digits is chosen before the<br />
conversion process takes place. In this situation the pri-<br />
mary goal is controlling the precise number of charac-<br />
ters produced so that tabular formats will be correctly<br />
aligned. Contrariwise, in free-format output the goal<br />
is <strong>to</strong> print no more digits than are necessary <strong>to</strong> com-<br />
municate the value. Examples of such applications are<br />
Fortran list-directed output and such interactive lan-<br />
guage systems as Lisp and APL. Here the number of<br />
digits <strong>to</strong> be produced may be dependent on the value<br />
<strong>to</strong> be printed, and so the conversion process itself must<br />
determine how many digits <strong>to</strong> output.<br />
The second way in which the external decimal rep-<br />
resentation is used is for inter-machine communication.<br />
The decimal representation is a convenient and conven-<br />
tional machine-independent format for transferring nu-<br />
merical information from one computer <strong>to</strong> another. Us-<br />
ing decimal representation avoids incompatibilities of<br />
word length, field format, radix (binary versus hexadec-<br />
imal, for example), sign representation, and so on. Here<br />
the main objective is <strong>to</strong> transport the numerical value<br />
as accurately as possible; that the representation is also<br />
easy for people <strong>to</strong> read is a bonus. (The IEEE 754<br />
floating-point standard [IEEE851 has ameliorated this<br />
problem by providing standard binary formats, but the<br />
VAX, IBM 370, and Cray-1 are still with us and will be<br />
for yet a while longer.)
We deal here with the output problem: methods of<br />
converting from the internal representation <strong>to</strong> the exter-<br />
nal representation. The corresponding input problem is<br />
also of interest and is addressed by another paper in thii<br />
conference [ClingerSO]. The two problems are not quite<br />
identical because we make the asymmetrical assumption<br />
that the internal representation has a fixed number of<br />
digits but the external representation may have a vary-<br />
ing number of digits. (Compare this with Matula’s ex-<br />
cellent work, in which both representations are assumed<br />
<strong>to</strong> be of iixed precision [Matula68] [Matula’70].)<br />
In the following discussion we assume the existence<br />
of a program that solves the input problem: it reads<br />
a number in external representation and produces the<br />
input representation whose exact value is closest, of all<br />
possible internal representation values, <strong>to</strong> the exact nu-<br />
merical value of the external number. In other words,<br />
this ideal input routine is assumed <strong>to</strong> round correctly in<br />
all cases.<br />
We present algorithms for fixed-point-fraction output<br />
and floating-point output that decide dynamically what<br />
is an appropriate number of output digits <strong>to</strong> produce<br />
for the external representation. We recommend the last<br />
algorithm, Dragon4, for general use in compilers and<br />
run;time systems; it is completely accurate and accom-<br />
modates a wide variety of output formats.<br />
We express the algorithms in a general form for con-<br />
verting from any radix b (the internal radii) <strong>to</strong> any<br />
other radix B (the ezternal radix); b and B may be any<br />
two integers greater than 1, and either may be larger<br />
than the other. The reader may find it helpful <strong>to</strong> think<br />
of B as being 10, and b as being either 2 or 16. Infor-<br />
mally we use the terms “decimal” and “external radix”<br />
interchangeably, and similarly “binary” and “internal<br />
radix”. We also speak interchangeably of digits being<br />
‘output” or Uprintedn.<br />
2 Properties of Radix Conversion<br />
We assume that a number <strong>to</strong> be converted from one<br />
radix <strong>to</strong> another is perfectly accurate and that the goal<br />
is <strong>to</strong> express that exact value. Thus we ignore issues of<br />
errors (such as floating-point round-off or truncation) in<br />
the calculation of that value.<br />
The decimal representation used <strong>to</strong> communicate a<br />
numerical value should be accurate. This is especially<br />
important when transferring values between comput-<br />
ers. In particular, if the source and destination com-<br />
puters use identical floating-point representations, we<br />
would like the value <strong>to</strong> be communicated exactly; the<br />
destination computer should reconstruct the exact bit<br />
pattern that the source computer had. In this case we<br />
require that conversion from internal form <strong>to</strong> external<br />
form and back be an identity function; we call this the<br />
113<br />
internal identity requirement. We would also like con-<br />
version from external form <strong>to</strong> internal form and back <strong>to</strong><br />
be an identity; we call this the external identity requim-<br />
ment. <strong>How</strong>ever, there are some difficulties with these<br />
requirements.<br />
The first difficulty is that an exact representation is<br />
not always possible. A finite integer in any radix can<br />
be converted in<strong>to</strong> a finite integer in any other radix.<br />
This is not true, however, of finite-length fractions; a<br />
fraction that has a finite representation in one radix may<br />
have an indefinitely repeating representation in another<br />
radix. This occurs only if there is some prime number<br />
p that divides the one radii but not the other; then<br />
l/p (among other numbers) has a finite representation<br />
in the first radix but not in the second. (It should be<br />
noted, by the way, that this relationship between radices<br />
is different from the relationship of incommensumbility<br />
as Matula defines it [Matula70]. For example, radices<br />
10 and 20 are incommensurable, but there is no prime<br />
that divides one and not the other.)<br />
It does happen that any binary, octal, or hexadeci-<br />
mal number has a finite decimal representation (because<br />
there is no prime that divides 2k but not 10). Therefore<br />
when B = 10 and b is a power of two we can guarantee<br />
the internal identity requirement by printing an exact<br />
decimal representation of a given binary number and<br />
using an ideal rounding input routine.<br />
This solution is gross overkill, however. The exact<br />
decimal representation of an n-bit binary number may<br />
require n decimal digits. <strong>How</strong>ever, es a rule fewer than<br />
n digits are needed by the rounding input routine <strong>to</strong><br />
reconstruct the exact binary value. For example, <strong>to</strong><br />
print a binary floating-point number with a 27-bit frac-<br />
tion requires up <strong>to</strong> 30 decimal digits. (A 27-bit frac-<br />
tion can represent a 30-bit binary fraction whose dec-<br />
imal representation has a non-sero leading digit; this<br />
is discussed further below.) <strong>How</strong>ever, 10 decimal dig-<br />
its always suffice <strong>to</strong> communicate the value accurately<br />
enough for the input routine <strong>to</strong> reconstruct the 27-<br />
bit binary representation [Matula68], and many par-<br />
ticular values can be expressed accurately with fewer<br />
than 10 digits. For example, the 27-bit binary floating-<br />
point number .110011001100110011001100110~ x 2-‘,<br />
which is equal <strong>to</strong> the 29-bit binary fixed-point frac-<br />
tion .00011001100110011001100110011~, is also equal <strong>to</strong><br />
the decimal fraction .09999999962747097015380859375,<br />
which has 29 digits. <strong>How</strong>ever, the decimal fraction 0.1<br />
is closer in value <strong>to</strong> the binary number than <strong>to</strong> any other<br />
27-bit floating-point binary number, and so suffices <strong>to</strong><br />
satisfy the internal identity requirement. Moreover, 0.1<br />
is more concise and more readable.<br />
We would like <strong>to</strong> produce enough digits <strong>to</strong> preserve<br />
the information content of a binary number, but no<br />
more. The difficulty is that the number of digits needed
depends not only on the precision of the binary number<br />
but on its value.<br />
Consider for a moment the superficially easier prob-<br />
lem of converting a binary floating-point number <strong>to</strong><br />
octal. Suppose for concreteness that our binary num-<br />
bers have a six-bit significand. It would seem that two<br />
octal digits should be enough <strong>to</strong> express the value of<br />
the binary number, because one octal digit expresses<br />
three bits of information. <strong>How</strong>ever, the precise oc-<br />
tal representation of the binary floating-point number<br />
.101101~ x 2-l is .264s x 8O. The exponent of the bi-<br />
nary floating-point number specifies a shifting of the<br />
significand so that the binary point is Tn the middle”<br />
of an octal digit. The first octal digit thus represents<br />
only two bits of the original binary number, and the<br />
third only one bit. We find that three octal digits are<br />
needed because of the difference in “graininess” of the<br />
exponent (shifting) specitications of the two radices.<br />
The number N of radix-B digits that may be required<br />
in general <strong>to</strong> represent an n-digit radix-b floating-point<br />
number is derived in [Matula68]. The condition is:<br />
B*-l > bn<br />
This may be reformulated as:<br />
N>--!f--+1<br />
log, B<br />
and if N is <strong>to</strong> be minimized we therefore have:<br />
N = 2 + [n/&g, B)]<br />
No more than this number of digits need ever be output,<br />
but there may be some numbers that require exactly<br />
that many.<br />
As an example, in converting a binary floating-point<br />
number <strong>to</strong> decimal on the PDP-10 [DEC73] we have<br />
b = 2, B = 10, and n = 27. One might naively sup-<br />
pose that 27 bits equals 9 octal digits, and if nine octal<br />
digits suffice surely 9 decimal digits should suffice also.<br />
<strong>How</strong>ever, using Matula’s formula above, we find that <strong>to</strong><br />
print such a number may require 2 + L27/(log, lo)] =<br />
2 + /27/(3.3219...)J = 2 + [8.1278...] = 10 decimal<br />
digits. There are indeed some numbers that require 10<br />
digits, such as the one which is greater than the rounded<br />
27-bit floating-point approximation <strong>to</strong> 0.1 by two units<br />
in the last place. On the PDP-10 this is the number<br />
represented in octal as 1756314631508, which represents<br />
the value .63146315s x 8 -l. It is intuitively easy <strong>to</strong> see<br />
why ten digits might be needed: splitting the number<br />
in<strong>to</strong> an integer part (which is zero) and a fractional part<br />
must shift three zero bits in<strong>to</strong> the significand from the<br />
left, producing a 30-bit fraction, and 9 decimal digits is<br />
not enough <strong>to</strong> represent all 30-bit fractions.<br />
The condition specified above assumes that the con-<br />
versions in both directions round correctly. There are<br />
114<br />
other possibilities, such. as using truncation (always<br />
rounding <strong>to</strong>wards zero) instead of true rounding. If the<br />
input conversion rounds but the output conversion trun-<br />
cates, the internal identity requirement can still be met,<br />
but more external digits may be required; the condition<br />
is BN-’ > 2 x bn - 1. If conversion truncates in both<br />
directions, however, it is possible that the internal iden-<br />
tity requirement cannot be met, and that repeated con-<br />
versions may cause a value <strong>to</strong> “drift” very far from its<br />
original value. This implies that if a data base is main-<br />
tained in decimal, and is updated by converting it <strong>to</strong> bi-<br />
nary, processing it, and then converting back <strong>to</strong> decimal,<br />
values not updated may nevertheless be changed, after<br />
many iterations, by a substantial amount [Matula70].<br />
That is why we assume and require proper rounding.<br />
Rounding can guarantee the integrity of the data (in<br />
the form of the internal identity requirement) and also<br />
will minimize the number of output digits. Truncation<br />
cannot and will not.<br />
The external identity requirement obviously cannot<br />
be strictly satisfied, because by assumption there are a<br />
finite number of internally representable values and an<br />
infinite number of external values. For every internal<br />
value there are many corresponding external representa-<br />
tions. From among the several external representations<br />
that correspond <strong>to</strong> a given internal representation, we<br />
prefer <strong>to</strong> get one that is closest in value but also as short<br />
as possible.<br />
Satisfaction of the internal identity requirement im-<br />
plies that an external consistency requirement be met:<br />
if an internal value is printed out, then reading that ex-<br />
act same external value and then printing it once more<br />
will produce the same external representation. This<br />
is important <strong>to</strong> the user of an interactive system. If<br />
one types PRINT 3.1415926535 and the system replies<br />
3.1415926, the user might think “Well and good; the<br />
computer has truncated it, and that’8 all I need type<br />
from now on.” The user would be rightly upset <strong>to</strong><br />
then type in PRINT 3.1415926 and receive the response<br />
3.1415925! Many language implementations indeed ex-<br />
hibit such undesirable “drifting” behavior.<br />
Let us formulate our desires. First, in the interest<br />
of brevity and readability, conversion <strong>to</strong> external form<br />
should produce as few digits as possible. Now suppose,<br />
for some internal number, that there are several exter-<br />
nal representations that can be used: all have the same<br />
number of digits. all can be used <strong>to</strong> reconstruct the in-<br />
ternal value precisely, and no shorter external number<br />
suffices for the reconstruction. In this case we prefer the<br />
one that is closest <strong>to</strong> the value of the binary number.<br />
(As we shall see, all the external number8 must be alike<br />
except in the last digit; satisfying this criterion there-<br />
fore involves only the correct choice for the last digit <strong>to</strong><br />
be output.)
On the other hand, suppose that the number of dig-<br />
its <strong>to</strong> be output has been predetermined (fixed-format<br />
output); then we desire that the printed value be as<br />
close as possible <strong>to</strong> the binary value. In short, the ex-<br />
ternal form should always be correctly rounded. (The<br />
Fortran 77 standard requires that floating-point output<br />
be correctly rounded [ANSI76, 13.9.5.21; however, many<br />
Fortran implementations do round not correctly, espe-<br />
cially for numbers of very large or very small magni-<br />
tude.)<br />
We first present and prove an algorithm for printing<br />
fired-point fractions that has four useful properties:<br />
(a) Information preservation. No information is lost<br />
in the conversion. If we know the length n of the<br />
original radix-b fraction, we can recover the original<br />
fraction from the radix-B output by converting back<br />
<strong>to</strong> radix-b and rounding <strong>to</strong> n digits.<br />
(b) Minimum-length output. No more digits than nec-<br />
essary are output <strong>to</strong> achieve (a). (This implies that<br />
the last digit printed is non-zero.)<br />
(c) Correct rounding. The output is correctly rounded;<br />
that is, the output generated is the closest approx-<br />
imation <strong>to</strong> the original fraction among all radix-B<br />
fractions of the same length, or is one of two closest<br />
approximations, and in the latter case it is easy <strong>to</strong><br />
choose either of the two by any arbitrary criterion.<br />
(d) Left-<strong>to</strong>-right genenrtion. It is never necessary <strong>to</strong><br />
propagate carries. Once a digit has been generated,<br />
it is the correct one.<br />
These properties are characterized still more rigorously<br />
in the description of the algorithm itself.<br />
From thii algorithm we then derive a similar method<br />
for printing integers. The two methods are then com-<br />
bined <strong>to</strong> produce two algorithms for printing floating-<br />
point numbers, one for free-format applications and one<br />
for fixed-format applications.<br />
3 Fixed-<strong>Point</strong> Fraction Output<br />
Following [Knuth68], we use the notation 1x1 <strong>to</strong> mean<br />
the greatest integer not greater than z. Thus 13.51 = 3<br />
and I-3.51 = -4. If z is an integer, then \xJ = 2.<br />
Similarly, we use the notation [z] <strong>to</strong> mean the smallest<br />
integer not less than x; [z] z - ]--zj . Also following<br />
[Knuth68], we use the notation {z} <strong>to</strong> mean z mod 1,<br />
or z - lz], the fractional part of Z. We always indicate<br />
numerical multiplication of a and b explicitly as a x b,<br />
because we use simple juxtaposition <strong>to</strong> indicate digits<br />
in a place-value notation.<br />
The idea behind the algorithm is very simple. The<br />
potential error in a fraction of limited precision may be<br />
described as being equal <strong>to</strong> one-half the minimal non-<br />
zero value of the least significant digit of the fraction.<br />
The algorithm merely initializes a variable M <strong>to</strong> this<br />
115<br />
value <strong>to</strong> represent the precision of the fraction. When-<br />
ever the fraction is multiplied by B, the error M is<br />
also multiplied by B. Therefore M always has a value<br />
against which the residual fraction may be meaningfully<br />
compared <strong>to</strong> decide whether the precision has been ex-<br />
hausted.<br />
Algorithm (FP)s:<br />
Finite-Precision Fixed-<strong>Point</strong> Fraction Prin<strong>to</strong>ut’<br />
Given:<br />
l An n-digit radix-b fraction f, 0 < f < 1:<br />
f = .f-lf-2f-a. .-f-n = c f-i x b-’<br />
i=l<br />
l A radix B (an integer greater than<br />
Output: The digits Fi of an N-digit (N<br />
fraction F:<br />
such that:<br />
N<br />
1).<br />
2 1) radix-B<br />
F = . F-1F-,F-a.. . F-N = c Fmi x B-’<br />
(a) IF - fl < 7<br />
(b) N is the smallest integer 2 1 such that (a) can be<br />
true.<br />
(4 IF - fl 5 7<br />
(d) Each digit is output before the next is generated;<br />
it is never necessary <strong>to</strong> “back up” for corrections<br />
Procedure: see Table 1. m<br />
The algorithmic procedure is expressed in pseudo-<br />
ALGOL. We have used the loop-while-repeat (“n + f<br />
loop”) construct credited <strong>to</strong> Dahl [Knuth74]. The loop<br />
is exited from the while-point if the while-condition<br />
is ever false; thus one may read “while B:” as<br />
“if not B then exitloop fi”. The cases statement<br />
is <strong>to</strong> be taken as a guarded if construction [Dijkstra76];<br />
the branch labeled by the true condition is executed,<br />
and if both conditions are true (here, if R = i) either<br />
branch may be chosen. The ambiguity in the cases<br />
statement here reflects the point of ambiguity when<br />
‘It is difficult <strong>to</strong> give short, meaningfnl names <strong>to</strong> each of a<br />
group of algorithms that are similar. Here we give each algorithm<br />
a long name that has enough adjectives and other qualifiers <strong>to</strong><br />
distinguish it Born the others, but these long names are unwieldy.<br />
For convenience, and as a bit of a joke, two kinds of abbreviated<br />
names are used here. Algorithms that are intermediate versions<br />
along a path of development have names such as “(FP)3” that are<br />
effectively contracted acronyms for the long names. Algorithms<br />
that are in “final form” have names of the form “Dragonk” for<br />
integers ic; these algorithms have long -es whose acronyms<br />
form sequences of letters “F” and “P” that specify the shape of<br />
so-called “dragon curves” [Gardner??].<br />
i=l
Table 1: Procedure for Finite-Precision Fixed-<strong>Point</strong><br />
Fraction Prin<strong>to</strong>ut ((FP)s)<br />
begin<br />
k e0;<br />
R + f;<br />
M c b-“/2;<br />
loop<br />
k+-k+l;<br />
Ut 1RxBJ;<br />
R+-(Rx B);<br />
Me-MxB;<br />
whileRLMandR): F-k+-U+l;<br />
endcases;<br />
N ck;<br />
end<br />
two radix-B representations of length n are equidistant<br />
from f. A given implementation may use any decision<br />
method when R = 3, such as “always U”, which effec-<br />
tively means <strong>to</strong> round down; “always U + l”, meaning <strong>to</strong><br />
round up; or WU E 0 ( mod 2) then U else U + l”,<br />
meaning <strong>to</strong> round so that the last digit is even. Note,<br />
however, that using “always U” does not always round<br />
down if rounding up will allow one fewer digit <strong>to</strong> be<br />
printed.<br />
This method is a generalization of the one presented<br />
by Taran<strong>to</strong> [Taran<strong>to</strong>59] and mentioned in exercise 4.4.3<br />
of [Knuth69].’<br />
4 Proof of Algorithm (FP)z<br />
(The reader who is more interested in practical appli-<br />
cations than in details of proof may wish <strong>to</strong> skip this<br />
section.)<br />
In order <strong>to</strong> demonstrate that Algorithm (FP)s satis-<br />
fies the four claimed properties (a)-(d), it is useful first<br />
<strong>to</strong> prove, by induction on k, the following invariants true<br />
at the <strong>to</strong>p of the loop body:<br />
M _ b-” x Bk<br />
k-<br />
2<br />
2By the way, the paper that was forward-referenced in the<br />
answer <strong>to</strong> exercise 4.4.3 in [Knuth81] was M early draft of this<br />
paper that contained only Algorithm (FP)3, its proof, and a not-<br />
yet-correct generalization <strong>to</strong> floating-point numbers.<br />
(i)<br />
116<br />
RkXBsk+~F-iXBsi=f<br />
i=l<br />
By & and Rk we mean the values of M and R at the<br />
<strong>to</strong>p of the loop as a function of k (the values of M and<br />
R change if and only if k is incremented). Invariant (i)<br />
is easily verified; we shall take it for granted. Invariant<br />
(G) is a little more complicated.<br />
Basis. After the first two assignments in the proce-<br />
dure, k = 0 and & = f . The summation in (ii) has no<br />
summands and is therefore zero, yielding<br />
&xB-‘+O=f<br />
which is true, because initially R = f.<br />
Induction. Suppose (ii) is true for k. We note that<br />
the only path backward in the loop sets F-(k+I) +<br />
[Rk x R] . It follows that:<br />
k<br />
f= Rk X Bsk+CF-iX B-’<br />
i=l<br />
= (B x Rk) x B- @+I) + -&F-; x B-’<br />
i=l<br />
=({RkxB}+lRkxBJ)xB- (k+l) + kFwi x B-”<br />
= (Rk+l + F--(k+l)) X B-(k+l) $ 2 F-i X B-’<br />
= Rk+l X B-(“+l) + C F-i X B-’<br />
k+l<br />
i=l<br />
which establishes the desired invariant.<br />
The procedure definitely terminates, because M ini-<br />
tially has a strictly positive value and is multiplied by<br />
B (which is greater than 1) each time through the loop.<br />
Eventually we have M > 4, at which point R 2 M<br />
and R 5 1 - M cannot both be true and the loop must<br />
terminate. (Of course, the loop may terminate before<br />
M > i, depending on R.)<br />
When the procedure terminates, there may be one of<br />
two cases, depending on RN. Because of the rounding<br />
step, we have one of the following:<br />
f=F+RNxB-N ifRN
f-F=R*xB-N if&s+<br />
F-f=R’xB-N if&z+<br />
From this we have:<br />
,F-f,=R*xB-“57<br />
proving property (c) (correct rounding).<br />
To prove property (a) (information preservation), we<br />
consider the two cases MN > i and MN 5 i. If we<br />
have MN > 3, then<br />
because MN =<br />
IF- fl< 7 1 - MN we have<br />
R* < MN, and so<br />
IF-fl=R’xB-“
Procedure: If f = 0 then simply print “0.0”. Otherwise<br />
proceed as follows. First compute u’ = u Q B-“, where<br />
“a” denotes floating-point multiplication and where z<br />
is chosen so as <strong>to</strong> make Y’ < bp. (If exponential for-<br />
mat is <strong>to</strong> be used, one normally chooses z so that the<br />
result either is between l/B and 1 or is between 1 and<br />
B.) Next, print the integer part of v’ by any suitable<br />
method, such as the usual division-remainder technique;<br />
after that print a decimal point. Then take the frac-<br />
tional part f of v’, let n = p - [log* v’J - 1, and apply<br />
Algorithm (FP)8 <strong>to</strong> f and n. Finally, if 2 was not zero,<br />
the letter “E” and a representation of the scale fac<strong>to</strong>r z<br />
are printed. I<br />
We note that if a floating-point number is the same<br />
size (in bits, or radix-b digits, or whatever) as an inte-<br />
ger, then arithmetic on integers of that size is often suffi-<br />
ciently accurate, because the bits used for the floating-<br />
point exponent are usually more than enough for the<br />
extra digits of precision required.<br />
This method for floating-point output is not entirely<br />
accurate, but is very simple <strong>to</strong> code, typically requir-<br />
ing only single-precision integer arithmetic. For appli-<br />
cations where producing pleasant output is important,<br />
program and data space must be minimized, and the in-<br />
ternal identity requirement may be relaxed, this method<br />
is excellent. An example of such an application might<br />
be an implementation of BASIC for a microcomputer,<br />
where pleasant output and memory conservation are<br />
more important than absolutely guaranteed accuracy.<br />
[The last sentence was written circa 1981. Nowadays all<br />
but the tiniest computers have the memory space for the<br />
full algorithm.] This algorithm was used for many years<br />
in the MacLisp interpreter and found <strong>to</strong> be satisfac<strong>to</strong>ry<br />
for most purposes.<br />
There are two problems that prevent Algorithm<br />
Dragon2 from being completely accurate for floating-<br />
point output.<br />
The first problem is that the spacing between adj,<br />
cent floating-point numbers of a given precision is not<br />
uniform. Let t) be a floating-point number, and let V-<br />
and v+ be respectively the next-lower and next-higher<br />
floating-point numbers of the same precision. For most<br />
values of v, we have v+ - v = v - v-. <strong>How</strong>ever, if v is an<br />
integral power of b, then instead v+ - v = b x (v - v-);<br />
that is, the gap below v is smaller than the gap above<br />
v by a fac<strong>to</strong>r of b.<br />
The second problem is that the scaling may cause<br />
loss of information. There may be two numbers, very<br />
close <strong>to</strong>gether in value, which when scaled by the same<br />
power of B result in the same scaled number, because of<br />
rounding. The problem is inherent in the floating-point<br />
representation. See [Matula68] for further discussion of<br />
this phenomenon.<br />
118<br />
Table 2: Procedure for Indefinite Precision Integer<br />
Prin<strong>to</strong>ut ((IP)2)<br />
begin<br />
k +- 0;<br />
R e d;<br />
M tb";<br />
s t 1;<br />
loop<br />
StSx B;<br />
kck+l;<br />
while(2xR)+M>2xS:<br />
repeat;<br />
Hck-1;<br />
assert S = B(“+‘)<br />
loop<br />
k+-k-l;<br />
S t S/B;<br />
u + [RISJ;<br />
RcRmodS;<br />
while2xR>Mand2xRs(2xS)-M:<br />
Dk + u;<br />
repeat;<br />
cases<br />
2xR5S: Dk+U;<br />
2xRLS: Dk+U+l;<br />
endcases;<br />
N + k;<br />
end<br />
6 Conversion of Integers<br />
To avoid the round-off errors introduced by scaling<br />
floating-point numbers, we develop a completely ac-<br />
curate floating-point output routine that performs no<br />
floating-point computations. As a first step, we exhibit<br />
an algorithm for printing integers that has a tetmina-<br />
tion criterion similar <strong>to</strong> that of Algorithm (FP)‘.<br />
Algorithm (IP)‘:<br />
Indefinite Precision Integer Prin<strong>to</strong>ut<br />
Given:<br />
l An h-digit radix-b integer d, accurate <strong>to</strong> position<br />
n (n 2 0):<br />
d = d,,dh-1. . . d,,+ld,(n zeros). = 2 a$ x b’<br />
l AradixB.
Output: Integers H and N (H 2 N 1 1) as an H-<br />
digit radix-B integer D:<br />
D = DHDH-I.. . DN+IDN(N zeros). = 5 Di x B’<br />
such that:<br />
(a) [D-d! < f<br />
(b) N is the largest integer, and H the smallest, such<br />
that la1 can be true.<br />
(c) ID -ii,‘< B”<br />
(d) Each digit is” output before the next is generated.<br />
Procedure: see Table 2. I<br />
In Algorithm (IP)l all quantities are integers. We<br />
avoid the use of fractional quantities by introducing a<br />
scaling fac<strong>to</strong>r S and logically replacing the quantities R<br />
and M by R/S and M/S. Whereas in Algorithm (FP)3<br />
the variables R and M were repeatedly multiplied by B,<br />
in Algorithm (IP)a the variable S is repeatedly divided<br />
by B.<br />
7 Free-Format <strong>Floating</strong>-<strong>Point</strong> Output<br />
From these two algorithms, (FP)’ and (IP)‘, it is now<br />
easy <strong>to</strong> synthesize an algorithm for “perfect” conversion<br />
of a (positive) fixed-precision floating-point number <strong>to</strong> a<br />
free-format floating-point number in another radix. We<br />
shall assume that a floating-point number f is repre-<br />
sented as a tuple of three integers: the exponent e, the<br />
“fraction” or “significand” m, and the precision p. To-<br />
gether these integers represent the mathematical value<br />
f= m x b(‘-P). We require m < bp; a normalized rep-<br />
resentation additionally requires m 2 b(P-l), but we do<br />
not depend on this.<br />
We define the function S/L& of two integer arguments<br />
z and n (Z > 0):<br />
shi&,(x,n) z lx x b”J<br />
This function is intended <strong>to</strong> be trivial <strong>to</strong> implement in<br />
radix-b arithmetic: it is the result of shifting x by n<br />
radix-b digit positions, <strong>to</strong> the left for positive n (intro-<br />
ducing low-order zeros) or <strong>to</strong> the right for negative n<br />
(discarding digits shifted <strong>to</strong> the right of the “radix-b<br />
point”).<br />
Algorithm (FPP)‘:<br />
Fixed-Precision Positive <strong>Floating</strong>-<strong>Point</strong> Prin<strong>to</strong>ut<br />
Given:<br />
a Three radix-b integers e, f, and p, representing the<br />
number v = f x bte+‘), withpz OandO < f < bP,<br />
l AradixB.<br />
i=N<br />
119<br />
Output: Integers H and N and digits DI, (H > k 2 N)<br />
such that if one defines the value<br />
V = 5 Di x B’<br />
i=N<br />
then:<br />
v- +v<br />
(4 -
Table 3: Fixed-Precision<br />
<strong>Floating</strong>-<strong>Point</strong> Prin<strong>to</strong>ut (( FPP)‘)<br />
begin<br />
Positive Table 4: Procedure Simple-Fizup<br />
procedure Simple-Fizup;<br />
begin<br />
assert f # 0 assert R/S = v<br />
R + Shifib(f) m=(e - no)); assert M-/S = M+/S = bte-p)<br />
S t shifib(l,max(O, -(e - p))); iff= shi&(l,p - 1) then<br />
assert R/S = f x b(‘-p) = v M+ + Shift,(M+,l);<br />
M- t shiflb(l, mu(e - p, 0)); R t- Shift,(R,l);<br />
M++M-; s t shiftb(S, 1);<br />
Simple-F&up; fi;<br />
Htk-1; k t 0;<br />
loop loop<br />
assert (R/S) x B” + 2 Di x B’ = u<br />
i=k<br />
kc&-l;<br />
U + I@ x B)/S_(;<br />
Rt(Rx<br />
M-tM-xB;<br />
M+tM+xB;<br />
B)modS;<br />
lowc2xR(2xS)-M+;<br />
while (not low) and (not high) :<br />
Dk + u;<br />
repeat;<br />
comment Let Vk =<br />
H<br />
C Di x B’.<br />
i=k+l<br />
v- +v<br />
assert low * - < (u X B” + Vk) 5 v<br />
2<br />
v+ + v<br />
assert high =$ u 5 ((u -k 1) x Bk + vk) < 2<br />
cases<br />
low and not high : Dk t U;<br />
high and not low : Dk t- U + 1;<br />
low and high :<br />
cases<br />
2xR5S: Dk+U;<br />
2xRzS: DktU$l;<br />
endcases;<br />
endcases;<br />
N tk;<br />
end;<br />
120<br />
assert (R/S) x Bk = v<br />
assert (M-/S) x Bk = v - v-<br />
assert (M+/S) x Bk = v+ -2)<br />
while R < [S/B] :<br />
&+--k-l;<br />
RtRxB;<br />
M-tM-xB;<br />
M+tM+xB;<br />
repeat;<br />
assert<br />
loop<br />
k = min(O, 1 + [log, vJ)<br />
assert (R/S) x Bk = v<br />
assert (M-/S) x Bk = v - v-<br />
assert (M+/S) x Bk = v+ -v<br />
while(2xR)+M+>2xS:<br />
ScSx<br />
k+-k+1;<br />
repeat;<br />
B;<br />
end:<br />
assert k = 1 + [ logB x$+]
The formula for v+ does not have <strong>to</strong> be conditional,<br />
even when the representation for v+ requires a larger<br />
exponent e than v does in order <strong>to</strong> satisfy f < bp. The<br />
formula for v- , however, must take in<strong>to</strong> account the sit-<br />
uation where f = b(J’- ‘1, because v- may then use the<br />
next-smaller value for e, and so the gap size is smaller by<br />
a fac<strong>to</strong>r of b. There may be some question as <strong>to</strong> whether<br />
this is the correct criterion for floating-point numbers<br />
that are not normalized. Observe, however, that this<br />
is the correct criterion for IEEE standard floating-point<br />
format [IEEE85], because all such numbers are normal-<br />
ized except those with the smallest possible exponent,<br />
so if v is denormalized then v- must also be denormal-<br />
ized and have the same value for the exponent e.<br />
To compensate for the phenomenon of unequal gaps,<br />
the variables M- and M+ are given the value that M<br />
would have in Algorithm (IP)‘, and then adjustments<br />
are made <strong>to</strong> M+, R, and S if necessary. The simplest<br />
way <strong>to</strong> account for unequal gaps would be <strong>to</strong> divide<br />
M- by b. <strong>How</strong>ever, this might cause M- <strong>to</strong> have a<br />
non-integral value in some situations. To avoid this,<br />
we instead scale the entire algorithm by an additional<br />
fac<strong>to</strong>r of b, by multiplying M+, R, and S by b.<br />
The presence of unequal gaps also induces an asym-<br />
metry in the termination test that reveals a previously<br />
hidden problem. In Algorithm (IP)‘, with M- = M+ =<br />
M, it was the case that if R > (2 x S) - M+ and<br />
2 x R < S, then necessarily R < M- also. Here this<br />
is not necessarily so, because M- may be smaller than<br />
M+ by a fac<strong>to</strong>r of b. The interpretation of this is that<br />
in some situations one may have 2 x R < S but nev-<br />
ertheless must round up <strong>to</strong> the digit U + 1, because<br />
the termination test succeeds relative <strong>to</strong> M+ but fails<br />
relative <strong>to</strong> M- .<br />
This problem is solved by rewriting the termination<br />
test in<strong>to</strong> two parts. The boolean variable low is true<br />
if the loop may be exited because M- is satisfied (in<br />
which case the digit U may be used). The boolean vari-<br />
able high is true if the loop may be exited because M+<br />
is satisfied (in which case the digit U + 1 may be used).<br />
If both variables are true, then either digit of U and<br />
U + 1 may be used, as far as information-preservation<br />
is concerned; in this case, and this case only, the com-<br />
parison of 2 x R <strong>to</strong> S should be done <strong>to</strong> satisfy the<br />
correct-rounding property.<br />
Algorithm (FPP) a is conceptually divided in<strong>to</strong> four<br />
parts. First, the variables R, S, M-, and M+ are ini-<br />
tialized. Second, if the gaps are unequal then the neces-<br />
sary adjustment is made. Third, the radix-B weight H<br />
of the first digit is determined by two loops. The first<br />
loop is needed for positive H, and repeatedly multiplies<br />
S by B; the second loop is needed for negative H, and<br />
repeatedly multiplies R, M-, and M+ by B. (Divisions<br />
by B are of course avoided <strong>to</strong> prevent roundoff errors.)<br />
121<br />
Fourth, the digits are generated by the last loop; this<br />
loop should be compared with the one in Table 1.<br />
8 Fixed-Format <strong>Floating</strong>-<strong>Point</strong> Output<br />
Algorithm (FPP)r simply generates digits, s<strong>to</strong>pping<br />
only after producing the appropriate number of digits<br />
for precise free-format output. It needs more flexible<br />
means of cutting off digit generation for fixed-format<br />
applications. Moreover, it is desirable <strong>to</strong> intersperse<br />
formatting with digit generation rather than buffering<br />
digits and then re-traversing the buffer.<br />
It is not difficult <strong>to</strong> adapt Algorithm (FPP)’ for fixedformat<br />
output, where the number of digits <strong>to</strong> be output<br />
is predetermined independently of the floating-point<br />
value <strong>to</strong> be printed. Using this algorithm avoids giving<br />
the illusion of more internal precision than is actually<br />
present, because it cuts off the conversion after sufficiently<br />
many digits have been produced; the remaining<br />
character<br />
ros.<br />
positions are then padded with blanks or ze-<br />
As an example, suppose that a Fortran program<br />
is written <strong>to</strong> calculate z, and that it indeed calculates<br />
a floating-point number that is the best possible<br />
approximation <strong>to</strong> z for that floating-point format,<br />
precise <strong>to</strong>, say, 27 bits. <strong>How</strong>ever, it then prints<br />
the value with format F20.18, producing the output<br />
u3.1415Q265160560607Q”.<br />
This is indeed accurately the value of that floatingpoint<br />
number <strong>to</strong> that many places, but the last ten<br />
digits are in some sense not justified, because the internal<br />
representation is not that precise. Moreover,<br />
this output certainly does not represent the value of<br />
A accurately; in format F20.18, u should be printed<br />
as “3.141592653589793238". The free-format printing<br />
procedure described above would cut off the conversion<br />
after nine digits, producing “3.14159265” and no<br />
more. The fixed-format procedure <strong>to</strong> be developed below<br />
would therefore produce “3.14159265<br />
9,<br />
or “3.141592650000000000”. Either of these is much<br />
less misleading as <strong>to</strong> internal precision.<br />
The fixed-format algorithm works essentially uses the<br />
free-format output method, differing only in when conversion<br />
is cut off. If conversion is cut off by exhaustion<br />
of precision, then any remaining character positions are<br />
padded with blanks or zeros. If conversion is cut off by<br />
the format specification, the only problem is producing<br />
correctly rounded output. This is easily almost-solved<br />
by noting that the remainder R can correctly direct<br />
rounding at any digit position, not just the last. In<br />
terms of the programs, almost all that is necessary is <strong>to</strong><br />
terminate a conversion loop prematurely if appropriate,<br />
and the following cases statement will properly round<br />
the last digit produced.
This doesn’t solve the entire problem, however, be-<br />
cause if conversion is cut off by the format specification<br />
then the early rounding may require propagation of car-<br />
ries; that is, the proof of property (d) fails <strong>to</strong> hold. This<br />
can be taken care of by adjusting the values of M- and<br />
M+ appropriately. In principle this adjustment some-<br />
times requires the use of non-integral values that cannot<br />
be represented exactly in radix b; in practice such values<br />
can be rounded <strong>to</strong> get the proper effect.<br />
As noted above, it is indeed not difficult <strong>to</strong> adapt<br />
the simplified algorithm for a particular fixed format.<br />
<strong>How</strong>ever, straightforwardly adapting it <strong>to</strong> handle a va-<br />
riety of fixed formats is clumsy. We tried <strong>to</strong> do this,<br />
and found that we had <strong>to</strong> introduce many switches and<br />
flags <strong>to</strong> control the various aspects of formatting, such<br />
as <strong>to</strong>tal field width, number of digits before and after<br />
the decimal point, width of exponent field, scale fac<strong>to</strong>r<br />
(as for “P” format specifiers in Fortran), and so on. The<br />
result was a tangled mess, very difficult <strong>to</strong> understand<br />
and maintain.<br />
We finally untangled the mess by completely separat-<br />
ing the generation of digits from the formatting of the<br />
digits. We added a few interface parameters <strong>to</strong> produce<br />
a digit genera<strong>to</strong>r (called Dragon4) <strong>to</strong> which a variety<br />
of formatting processes could be easily interfaced. The<br />
genera<strong>to</strong>r and the formatter execute as coroutines, and<br />
it is assumed that the “user” process executes as a third<br />
coroutine.<br />
For purposes of exposition we have borrowed cer-<br />
tain features of Hoare’s CSP (Communicating Sequen-<br />
tial Processes) notation [Hoare78]. The statement<br />
GENERATE! (zr, . . . , z,,)<br />
means that a message containing values zl,. . . , z, is <strong>to</strong><br />
be sent <strong>to</strong> the GENERATE process; the process that exe-<br />
cutes such a statement then continues its own execution<br />
without waiting for a reply. The statement<br />
FORMAT?(zl,...,z,)<br />
means that a message containing values 21,. . . , z, is<br />
<strong>to</strong> be received from the FORMAT process; the process<br />
that executes such a statement waits if necessary for a<br />
message <strong>to</strong> arrive. (The notation is symmetrical in that<br />
the sender and receiver of a message must each know<br />
the other’s name.) We do not borrow any of the control<br />
structure notations of CSP, and we do not worry about<br />
the matter of processes failing.<br />
To perform floating-point output one must effectively<br />
execute three coroutines <strong>to</strong>gether. In the CSP notation<br />
one writes:<br />
[ USER :: user process ]<br />
FORMAT :: formatting process ]<br />
GENERATE :: Drogon‘j ]<br />
122<br />
The interface convention is that the “user process” must<br />
first send the message<br />
FORMAT! (a, e, f, p, B)<br />
<strong>to</strong> initiate floating-point conversion and then may re-<br />
ceive characters from the FORMAT process until a “0”<br />
character is received. The *@” character serves as a ter-<br />
mina<strong>to</strong>r and should be discarded. Any of the formatting<br />
processes shown below may be used; <strong>to</strong> get free-format<br />
conversion, for example, one would execute<br />
[ USER :: user process 1<br />
FORMAT :: Free-Format ]<br />
GENERATE :: Dragon4 ]<br />
and <strong>to</strong> get fixed-format exponential formatting one<br />
would execute<br />
[ USER :: user process ]<br />
FORMAT :: Fized-Format Exponential 1<br />
GENERATE :: Dragon4 ]<br />
We have found it easy <strong>to</strong> write new formatting routines.<br />
9 Implementations<br />
In 1981 we coded a version of Algorithm Dragon4 in<br />
Pascal, including the formatting routines shown here as<br />
well as formatters for Fortran I%, F, and G formats and for<br />
PL/I-style picture formats such as $$$ , $$$ , $$9.99CR.<br />
This suite has been tested thoroughly. This algorithm<br />
has also been used in various Lisp implementations for<br />
both free-format and fixed-format floating-point output.<br />
A portable and performance-tuned implementation in C<br />
is in progress.<br />
10 His<strong>to</strong>rical Note and Acknowledgments<br />
The work reported here was done in the late 1970’s.<br />
This is the first published appearance, but earlier vex-<br />
sions of this paper have been circulated “unpublished”<br />
through the grapevine for the last decade. The algo-<br />
rithms have been used for years in a variety of language<br />
implementations, perhaps most notably in Zetalisp.<br />
Why wasn’t it published earlier? It just didn’t seem<br />
like that big a deal; but we were wrong. Over the years<br />
floating-point printers have continued <strong>to</strong> be inaccurate<br />
and we kept getting requests for the unpublished draft.<br />
We thank Donald Knuth for giving us that last required<br />
push <strong>to</strong> submit it for publication.<br />
The paper has been almost completely revised for pre-<br />
sentation at this conference. An intermediate version of<br />
the algorithm, Dragon3 (Free-Format Perfect Positive<br />
<strong>Floating</strong>-<strong>Point</strong> Prin<strong>to</strong>ut), has been omitted from this<br />
presentation for reasons of space; it is of interest only
in being closest in form <strong>to</strong> what was actually first im-<br />
plemented in MacLisp. We have allowed a few anachro-<br />
nisms <strong>to</strong> remain in this presentation, such as the sugges-<br />
tion that a tiny version of the printing algorithm might<br />
be required for microcomputer implementations of BA-<br />
SIC. You will find that most of the references date back<br />
<strong>to</strong> the 1960’s and 19’70’s.<br />
Helpful and valuable comments on drafts of this paper<br />
were provided by Jon Bentley, Barbara K. Steele, and<br />
Daniel Weinreb. We are also grateful <strong>to</strong> Donald Knuth<br />
and Will Clinger for their encouragement.<br />
The first part of this work was done in the mid <strong>to</strong> late<br />
1970’s while both authors were at the Massachusetts<br />
Institute of Technology, in the Labora<strong>to</strong>ry for <strong>Computer</strong><br />
<strong>Science</strong> and the Artificial Intelligence Labora<strong>to</strong>ry. From<br />
1978 <strong>to</strong> 1980, Steele was supported by a Fanny and John<br />
Hertz Fellowship.<br />
This work was carried forward by Steele at Carnegie-<br />
Mellon University, Tartan Labora<strong>to</strong>ries, and Thinking<br />
Machines Corporation, and by White at IBM T. J. Wat-<br />
son Research and Lucid, Inc. We thank these institu-<br />
tions for their support.<br />
This work was supported in part by the Defense Ad-<br />
vanced Research Projects Agency, Department of De-<br />
fense, ARPA Order 3597, moni<strong>to</strong>red by the Air Force<br />
Avionics Labora<strong>to</strong>ry under contract F33615-78-C-1551.<br />
The views and conclusions contained in this document<br />
are those of the authors and should not be interpreted<br />
as representing the official policies, either expressed or<br />
implied, of the Defense Advanced Research Projects<br />
Agency or the U.S. Government.<br />
11 References<br />
[ANSI761 American National Standards Institute. Draft<br />
proposed AN.9 Fortmn (BSR X3.9). Reprinted az ACM<br />
SIGPLAN Notices 11, 3 (March 1976).<br />
[Clinger90] Cling er, William D. <strong>How</strong> <strong>to</strong> read floating point<br />
numbers accurately. Proc. ACM SIGPLAN ‘90 Confer-<br />
ence on Progr amming Language Design and Implementa-<br />
tion (White Plains, New York, June 1990).<br />
[DEC73] Digital Equipment Corporation. DecSystem 10<br />
Assembly Language Handbook. Third edition. (Maynard,<br />
Massachusetts, 1973).<br />
[Dijkstra?G] Dijkzt ra, Edsger W. A Discipline of Program-<br />
ming. Prentice-Hall (Englewood Cliffs, New Jersey, 1976).<br />
[Gardner77] Gardner, Martin. “The Dragon Curve and<br />
Other Problems.” In Mathematical Magic Show. Knopf<br />
(New York, 1977), 203-222.<br />
[Hoare78] Hoarc, C. A. R. “Communicating Sequential<br />
Processes.” Communications of the ACM 21, 8 (August<br />
1978), 666-677.<br />
[IEEE811 IEEE <strong>Computer</strong> Society Standard Committee,<br />
Microprocessor Standards Subcommittee, <strong>Floating</strong>-<strong>Point</strong><br />
123<br />
Working Group. “A Proposed Standard for Binary<br />
<strong>Floating</strong>-<strong>Point</strong> Arithmetic.” <strong>Computer</strong> 14, 3 (March<br />
1981), 51-62.<br />
[IEEE851 IEEE. IEEE Standard for Binary <strong>Floating</strong>-<strong>Point</strong><br />
Arithmetic. ANSI/IEEE Std 754-1985 (New York, 1985).<br />
[Jensen741 Jensen, Kathleen, and Wirth, Niklaus. PAS-<br />
CAL User Manual and Report. Second edition. Springer-<br />
Verlag (New York, 1974).<br />
[Knuth68] Knuth, Donald E. The Art of <strong>Computer</strong> Pro-<br />
gramming, Volume 1: Fundamental Algorithms. Addison-<br />
Wesley (Reading, Massachusetts, 1968).<br />
[Knuth69] Knuth, Donald E. The Art of <strong>Computer</strong> Pro-<br />
gramming, Volume L: Seminumerical Algorithms. First<br />
edition. Addison-Wesley (Reading, Massachusetts, 1969).<br />
[Knuth74] Knuth, DonaId E. “Structured Programming<br />
with GO TO Statements.” Computing Survey3 6, 4 (De-<br />
cember 1974).<br />
[Knuth81] Knuth, Donald E. The Art of <strong>Computer</strong> Pro-<br />
gramming, Volume 2: Seminumerical Algorithms. Second<br />
edition. Addison-Wesley (Reading, Massachusetts, 1981).<br />
[MatuIa68] MatuIa, David W. “In-and-Out Conversions.”<br />
Communications of the ACM 11, 1 (January 1968), 47-<br />
50.<br />
[Matnla’lO] Matula, David W. “A Formalization of<br />
<strong>Floating</strong>-<strong>Point</strong> Numeric Base Conversion.” IEEE Trand-<br />
actions on <strong>Computer</strong>s C-19, 8 (August 1970), 681-692.<br />
[Moon741 Moon, David A. MacLiap Reference Manual, Re-<br />
vision 0. Massachusetts Institute of Technology, Project<br />
MAC (Cambridge, Massachusetts, April 1974).<br />
[Taran<strong>to</strong>59] Taran<strong>to</strong>, Donald. “Binary Conversion, with<br />
Fixed Decimal Precision, of a Decimal Fraction.” Com-<br />
munications of the ACM 2, 7 (July 1959), 27.
Table 5: Procedure Dragon4 (Formatter-Feeding Pre<br />
cess for <strong>Floating</strong>-<strong>Point</strong> Prin<strong>to</strong>ut, Performing F’ree-<br />
Format Perfect Positive <strong>Floating</strong>-<strong>Point</strong> Prin<strong>to</strong>ut)<br />
process Dragond;<br />
begin<br />
FORMAT? (b, e, f, p, B, Cu<strong>to</strong>flMode, Cu<strong>to</strong>flPlace);<br />
assert Cu<strong>to</strong>fMode = Urehtive” * Cu<strong>to</strong>ffPlace 5 0<br />
Round WpFlag t false;<br />
if f = 0 then FORMAT ! (0, k) else<br />
R + Wb(f, m=(e - P, 0));<br />
S t &i&(1, max(O, -(e - p)));<br />
M- t shi,&(l, max(e -p, 0));<br />
M+ + M-;<br />
Fizup;<br />
loop<br />
ktk-1;<br />
u t l(R x WSJ i<br />
Rc(Rx B)modS;<br />
M- tM- xB;<br />
M+tM+xB;<br />
low t 2 x RC M-;<br />
if Round UpFlag<br />
then high c 2 x R >(2 x S) - M+<br />
else high + 2 x R > (2 x S) - M+ fi;<br />
while not low and not high<br />
and k # Cu<strong>to</strong>#Place :<br />
FORMAT ! (V, k);<br />
repeat;<br />
cases<br />
low and not high : FORMAT! (V, k);<br />
high and not low : FORMAT! (U + 1,k);<br />
(low and high) or (not low and not high) :<br />
cases<br />
2 x R 2 S: FORMAT! (U, k);<br />
2x RZS: FORMAT!(U+l,k);<br />
endcases;<br />
endcases;<br />
fi;<br />
comment Henceforth this process will generate as<br />
many u- 1” digits as the caller desires, along<br />
with appropriate values of k.<br />
loop k + k - 1; FORMAT! (-1, k) repeat;<br />
end;<br />
124<br />
Table 6: Procedure Pisup<br />
procedure F&up;<br />
begin<br />
iff= shift,(l,p - 1) then<br />
comment Account for unequal gaps.<br />
M+ t shi&,(M+, 1);<br />
R t shi&.(R, 1);<br />
S t shi&(S, 1);<br />
fi;<br />
k t 0;<br />
loop<br />
while R < [S/B] :<br />
ktk-1;<br />
RtRxB;<br />
M-tM-xB;<br />
M+tM+xB;<br />
repeat;<br />
loop<br />
loop<br />
while (2 x R) + M+ 12 x S :<br />
StSxB;<br />
ktk+l;<br />
repeat ;<br />
comment Perform any necessary adjustment<br />
of M- and M+ <strong>to</strong> take in<strong>to</strong> account the for-<br />
matting requirements.<br />
case Cu<strong>to</strong>fiMode of<br />
%omnal” : Cu<strong>to</strong>ffPlace c k;<br />
“absolute* : Cu<strong>to</strong>ffAdjust;<br />
“relative” :<br />
Cu<strong>to</strong>ffPlace t k + Cu<strong>to</strong>flPlace;<br />
Cu<strong>to</strong>fiAdjust;<br />
endcase;<br />
while(2xR)+M+>2xS:<br />
repeat;<br />
end;<br />
Table 7: Procedure fill<br />
procedure fill(k, c);<br />
comment Send k copies of the character c <strong>to</strong> the<br />
USER process. No characters are sent if k = 0.<br />
for i from 1 <strong>to</strong> k do USER! (c) od;
Table 8: Procedure Cu<strong>to</strong>fAdjust<br />
procedure Cu<strong>to</strong>ffAdjust;<br />
begin<br />
a t Cu<strong>to</strong>ffPlace - k;<br />
Y + s;<br />
cases<br />
a?O: forjc1<strong>to</strong>adoytyxB;<br />
aOAw~max(d+1,2)<br />
ccw-d-l;<br />
GENERATE! (b,e, f,p, B, “absolute”, -d);<br />
GENERATE ? (U, k);<br />
ifk < c then<br />
ifk 0 then fiU(c - 1, u “); USER! (“on) fi;<br />
USER! (‘V’);<br />
fiZi(min(-k, d), “0”);<br />
else jiZr(c - k - 1, u “) fi;<br />
loop<br />
whilek>-d:<br />
Digit Chart U);<br />
ifk = 0 then USER! (“.“) fi;<br />
GENERATE<br />
repeat;<br />
? (U, k) ;<br />
else fiZZ(w, u*n) A;<br />
USER ! (“0”);<br />
end:
Table 12: Formatting process for free-format exponen-<br />
tial output<br />
process Free-Fomnal Exponential;<br />
begin<br />
USER ? (h et f, P, B);<br />
GENERATE! (b,e, f,p,B, “normal”, 0);<br />
GENERATE ? (U, expt);<br />
DigiiChar(U);<br />
USER! (“2’);<br />
loop<br />
GENERATE ? (U, k);<br />
whileU#-1:<br />
DigitChar(<br />
repeat ;<br />
if k = ezpt - 1 then USER ! (“0”) fi;<br />
USER! (“E”);<br />
if ezpt < 0 then USER ! (“-“); ezpl t - ezpi ii;<br />
j t 1;<br />
loop j t j x I?; while j 5 ezpt : repeat;<br />
loop<br />
i +- LVBJ;<br />
DigitChar(jezpZ/j]);<br />
ezpept t ezpt mod j;<br />
whilej>l:<br />
repeat;<br />
USER! (“0”);<br />
end;<br />
126<br />
Table 13: Formatting process for fixed-format exponen-<br />
tial output<br />
process Fixed-Format Exponential;<br />
begin<br />
USER ? (he, f, P, B, wl d, 2);<br />
assertd~OAx>lAw>d+x+3<br />
ctw-d-x-3;<br />
GENERATE! (b, e, f,p, B, “relative”, -d);<br />
GENERATE ? (U, ezpept);<br />
j t 1;<br />
a t 0;<br />
loop<br />
j +- j x B;<br />
cr+--a+1;<br />
while j 5 Iezptl :<br />
repeat;<br />
ifa< x then<br />
j;ZZ(c - 1, u “);<br />
DigitChar(<br />
USER! (“2’);<br />
for q t 1 <strong>to</strong> d do<br />
GENERATE ? (U, k) ;<br />
DigiZChar(U);<br />
od;<br />
USER ! (‘92”);<br />
if ezpt < 0 then<br />
USER ! (‘Q’)<br />
else<br />
USER ! (“+“)<br />
fi;<br />
ezpt e 1 expt 1;<br />
fiZZ(x - a, UO”);<br />
loop<br />
i +-- UIBJ;<br />
DigiiChar( [expt/jJ);<br />
ezpt t expt mod j;<br />
whilej>l:<br />
repeat;<br />
else $ZZ(w, u*n) fi;<br />
USER ! (‘W’);<br />
end