How to Print Floating-Point Numbers Accurately - Computer Science ...

How to Print Floating-Point Numbers Accurately 

Guy L. Steele Jr. 

Thinking Machines Corporation 

245 First Street 

Cambridge, Massachusetts 02142 

Abstract 

gls0think. corn 

We present algorithms for accurately converting 

floating-point numbers to decimal representation. The 

key idea is to carry along with the computation an ex- 

plicit representation of the required rounding accuracy. 

We begin with the simpler problem of converting 

fixed-point fractions. A modification of the well-known 

algorithm for radix-conversion of fixed-point fractions 

by multiplication explicitly determines when to termi- 

nate the conversion process; a variable number of digits 

are produced. The algorithm has these properties: 

l No information is lost; the original fraction can be 

recovered from the output by rounding. 

l No “garbage digits” are produced. 

l The output is correctly rounded. 

0 It is never necessary to propagate carries on 

rounding. 

We then derive two algorithms for free-format out- 

put of floating-point numbers. The first simply scales 

the given floating-point number to an appropriate frac- 

tional range and then applies the algorithm for fractions. 

This is quite fast and simple to code but has inaccura- 

cies stemming from round-off errors and oversimplifica- 

tion. The second algorithm guarantees mathematical 

accuracy by using multiple-precision integer arithmetic 

and handling special cases. Both algorithms produce no 

more digits than necessary (intuitively, the “1.3 prints 

as 1.2999999” problem does not occur). 

Finally, we modify the free-format conversion al- 

gorithm for use in jized-format applications. Infor- 

mation may be lost if the fixed format provides too 

few digit positions, but the output is always correctly 

rounded. On the other hand, no “garbage digits” are 

ever produced, even if the fixed format specifies too 

many digit positions (intuitively, the u4/3 prints as 

1.333333328366279602” problem does not occur). 

Permission to copy without fee all or part of this material is granted 

provided that the copies are not made or distributed for direct com- 

mercial advantage, the ACM copyright notice and the title of the 

publication and its date appear, and notice is given that copying is 

by permission of the Association for Computing Machinery. To copy 

otherwise, or to republish, requires a fee and/or specific permission. 

01990 ACM 0-89791-364-7/90/0006/0112 $1.50 

I 

Proceedings of the ACM SIGPLAN’SO Conference on 

Programming Language Design and Implementation. 

White Plains, New York, June 20-22, 1990. 

I 

112 

Jon L White 

Lucid, Inc. 

707 Laurel Street 

Menlo Park, California 94025 

1 Introduction 

j onl0lucid. corn 

Isn’t it a pain when you ask a computer to divide 1.0 by 

10.0 and it prints 0.0999999? Most often the arithmetic 

is not to blame, but the printing algorithm. 

Most people use decimal numerals. Most contempo- 

rary computers, however, use a non-decimal internal 

representation, usually based on powers of two. This 

raises the problem of conversion between the internal 

representation and the familiar decimal form. 

The external decimal representation is typically used 

in two different ways. One is for communication with 

people. This divides into two cases. For j&d-format 

output, a fixed number of digits is chosen before the 

conversion process takes place. In this situation the pri- 

mary goal is controlling the precise number of charac- 

ters produced so that tabular formats will be correctly 

aligned. Contrariwise, in free-format output the goal 

is to print no more digits than are necessary to com- 

municate the value. Examples of such applications are 

Fortran list-directed output and such interactive lan- 

guage systems as Lisp and APL. Here the number of 

digits to be produced may be dependent on the value 

to be printed, and so the conversion process itself must 

determine how many digits to output. 

The second way in which the external decimal rep- 

resentation is used is for inter-machine communication. 

The decimal representation is a convenient and conven- 

tional machine-independent format for transferring nu- 

merical information from one computer to another. Us- 

ing decimal representation avoids incompatibilities of 

word length, field format, radix (binary versus hexadec- 

imal, for example), sign representation, and so on. Here 

the main objective is to transport the numerical value 

as accurately as possible; that the representation is also 

easy for people to read is a bonus. (The IEEE 754 

floating-point standard [IEEE851 has ameliorated this 

problem by providing standard binary formats, but the 

VAX, IBM 370, and Cray-1 are still with us and will be 

for yet a while longer.)

We deal here with the output problem: methods of 

converting from the internal representation to the exter- 

nal representation. The corresponding input problem is 

also of interest and is addressed by another paper in thii 

conference [ClingerSO]. The two problems are not quite 

identical because we make the asymmetrical assumption 

that the internal representation has a fixed number of 

digits but the external representation may have a vary- 

ing number of digits. (Compare this with Matula’s ex- 

cellent work, in which both representations are assumed 

to be of iixed precision [Matula68] [Matula’70].) 

In the following discussion we assume the existence 

of a program that solves the input problem: it reads 

a number in external representation and produces the 

input representation whose exact value is closest, of all 

possible internal representation values, to the exact nu- 

merical value of the external number. In other words, 

this ideal input routine is assumed to round correctly in 

all cases. 

We present algorithms for fixed-point-fraction output 

and floating-point output that decide dynamically what 

is an appropriate number of output digits to produce 

for the external representation. We recommend the last 

algorithm, Dragon4, for general use in compilers and 

run;time systems; it is completely accurate and accom- 

modates a wide variety of output formats. 

We express the algorithms in a general form for con- 

verting from any radix b (the internal radii) to any 

other radix B (the ezternal radix); b and B may be any 

two integers greater than 1, and either may be larger 

than the other. The reader may find it helpful to think 

of B as being 10, and b as being either 2 or 16. Infor- 

mally we use the terms “decimal” and “external radix” 

interchangeably, and similarly “binary” and “internal 

radix”. We also speak interchangeably of digits being 

‘output” or Uprintedn. 

2 Properties of Radix Conversion 

We assume that a number to be converted from one 

radix to another is perfectly accurate and that the goal 

is to express that exact value. Thus we ignore issues of 

errors (such as floating-point round-off or truncation) in 

the calculation of that value. 

The decimal representation used to communicate a 

numerical value should be accurate. This is especially 

important when transferring values between comput- 

ers. In particular, if the source and destination com- 

puters use identical floating-point representations, we 

would like the value to be communicated exactly; the 

destination computer should reconstruct the exact bit 

pattern that the source computer had. In this case we 

require that conversion from internal form to external 

form and back be an identity function; we call this the 

113 

internal identity requirement. We would also like con- 

version from external form to internal form and back to 

be an identity; we call this the external identity requim- 

ment. However, there are some difficulties with these 

requirements. 

The first difficulty is that an exact representation is 

not always possible. A finite integer in any radix can 

be converted into a finite integer in any other radix. 

This is not true, however, of finite-length fractions; a 

fraction that has a finite representation in one radix may 

have an indefinitely repeating representation in another 

radix. This occurs only if there is some prime number 

p that divides the one radii but not the other; then 

l/p (among other numbers) has a finite representation 

in the first radix but not in the second. (It should be 

noted, by the way, that this relationship between radices 

is different from the relationship of incommensumbility 

as Matula defines it [Matula70]. For example, radices 

10 and 20 are incommensurable, but there is no prime 

that divides one and not the other.) 

It does happen that any binary, octal, or hexadeci- 

mal number has a finite decimal representation (because 

there is no prime that divides 2k but not 10). Therefore 

when B = 10 and b is a power of two we can guarantee 

the internal identity requirement by printing an exact 

decimal representation of a given binary number and 

using an ideal rounding input routine. 

This solution is gross overkill, however. The exact 

decimal representation of an n-bit binary number may 

require n decimal digits. However, es a rule fewer than 

n digits are needed by the rounding input routine to 

reconstruct the exact binary value. For example, to 

print a binary floating-point number with a 27-bit frac- 

tion requires up to 30 decimal digits. (A 27-bit frac- 

tion can represent a 30-bit binary fraction whose dec- 

imal representation has a non-sero leading digit; this 

is discussed further below.) However, 10 decimal dig- 

its always suffice to communicate the value accurately 

enough for the input routine to reconstruct the 27- 

bit binary representation [Matula68], and many par- 

ticular values can be expressed accurately with fewer 

than 10 digits. For example, the 27-bit binary floating- 

point number .110011001100110011001100110~ x 2-‘, 

which is equal to the 29-bit binary fixed-point frac- 

tion .00011001100110011001100110011~, is also equal to 

the decimal fraction .09999999962747097015380859375, 

which has 29 digits. However, the decimal fraction 0.1 

is closer in value to the binary number than to any other 

27-bit floating-point binary number, and so suffices to 

satisfy the internal identity requirement. Moreover, 0.1 

is more concise and more readable. 

We would like to produce enough digits to preserve 

the information content of a binary number, but no 

more. The difficulty is that the number of digits needed

depends not only on the precision of the binary number 

but on its value. 

Consider for a moment the superficially easier prob- 

lem of converting a binary floating-point number to 

octal. Suppose for concreteness that our binary num- 

bers have a six-bit significand. It would seem that two 

octal digits should be enough to express the value of 

the binary number, because one octal digit expresses 

three bits of information. However, the precise oc- 

tal representation of the binary floating-point number 

.101101~ x 2-l is .264s x 8O. The exponent of the bi- 

nary floating-point number specifies a shifting of the 

significand so that the binary point is Tn the middle” 

of an octal digit. The first octal digit thus represents 

only two bits of the original binary number, and the 

third only one bit. We find that three octal digits are 

needed because of the difference in “graininess” of the 

exponent (shifting) specitications of the two radices. 

The number N of radix-B digits that may be required 

in general to represent an n-digit radix-b floating-point 

number is derived in [Matula68]. The condition is: 

B*-l > bn 

This may be reformulated as: 

N>--!f--+1 

log, B 

and if N is to be minimized we therefore have: 

N = 2 + [n/&g, B)] 

No more than this number of digits need ever be output, 

but there may be some numbers that require exactly 

that many. 

As an example, in converting a binary floating-point 

number to decimal on the PDP-10 [DEC73] we have 

b = 2, B = 10, and n = 27. One might naively sup- 

pose that 27 bits equals 9 octal digits, and if nine octal 

digits suffice surely 9 decimal digits should suffice also. 

However, using Matula’s formula above, we find that to 

print such a number may require 2 + L27/(log, lo)] = 

2 + /27/(3.3219...)J = 2 + [8.1278...] = 10 decimal 

digits. There are indeed some numbers that require 10 

digits, such as the one which is greater than the rounded 

27-bit floating-point approximation to 0.1 by two units 

in the last place. On the PDP-10 this is the number 

represented in octal as 1756314631508, which represents 

the value .63146315s x 8 -l. It is intuitively easy to see 

why ten digits might be needed: splitting the number 

into an integer part (which is zero) and a fractional part 

must shift three zero bits into the significand from the 

left, producing a 30-bit fraction, and 9 decimal digits is 

not enough to represent all 30-bit fractions. 

The condition specified above assumes that the con- 

versions in both directions round correctly. There are 

114 

other possibilities, such. as using truncation (always 

rounding towards zero) instead of true rounding. If the 

input conversion rounds but the output conversion trun- 

cates, the internal identity requirement can still be met, 

but more external digits may be required; the condition 

is BN-’ > 2 x bn - 1. If conversion truncates in both 

directions, however, it is possible that the internal iden- 

tity requirement cannot be met, and that repeated con- 

versions may cause a value to “drift” very far from its 

original value. This implies that if a data base is main- 

tained in decimal, and is updated by converting it to bi- 

nary, processing it, and then converting back to decimal, 

values not updated may nevertheless be changed, after 

many iterations, by a substantial amount [Matula70]. 

That is why we assume and require proper rounding. 

Rounding can guarantee the integrity of the data (in 

the form of the internal identity requirement) and also 

will minimize the number of output digits. Truncation 

cannot and will not. 

The external identity requirement obviously cannot 

be strictly satisfied, because by assumption there are a 

finite number of internally representable values and an 

infinite number of external values. For every internal 

value there are many corresponding external representa- 

tions. From among the several external representations 

that correspond to a given internal representation, we 

prefer to get one that is closest in value but also as short 

as possible. 

Satisfaction of the internal identity requirement im- 

plies that an external consistency requirement be met: 

if an internal value is printed out, then reading that ex- 

act same external value and then printing it once more 

will produce the same external representation. This 

is important to the user of an interactive system. If 

one types PRINT 3.1415926535 and the system replies 

3.1415926, the user might think “Well and good; the 

computer has truncated it, and that’8 all I need type 

from now on.” The user would be rightly upset to 

then type in PRINT 3.1415926 and receive the response 

3.1415925! Many language implementations indeed ex- 

hibit such undesirable “drifting” behavior. 

Let us formulate our desires. First, in the interest 

of brevity and readability, conversion to external form 

should produce as few digits as possible. Now suppose, 

for some internal number, that there are several exter- 

nal representations that can be used: all have the same 

number of digits. all can be used to reconstruct the in- 

ternal value precisely, and no shorter external number 

suffices for the reconstruction. In this case we prefer the 

one that is closest to the value of the binary number. 

(As we shall see, all the external number8 must be alike 

except in the last digit; satisfying this criterion there- 

fore involves only the correct choice for the last digit to 

be output.)

On the other hand, suppose that the number of dig- 

its to be output has been predetermined (fixed-format 

output); then we desire that the printed value be as 

close as possible to the binary value. In short, the ex- 

ternal form should always be correctly rounded. (The 

Fortran 77 standard requires that floating-point output 

be correctly rounded [ANSI76, 13.9.5.21; however, many 

Fortran implementations do round not correctly, espe- 

cially for numbers of very large or very small magni- 

tude.) 

We first present and prove an algorithm for printing 

fired-point fractions that has four useful properties: 

(a) Information preservation. No information is lost 

in the conversion. If we know the length n of the 

original radix-b fraction, we can recover the original 

fraction from the radix-B output by converting back 

to radix-b and rounding to n digits. 

(b) Minimum-length output. No more digits than nec- 

essary are output to achieve (a). (This implies that 

the last digit printed is non-zero.) 

(c) Correct rounding. The output is correctly rounded; 

that is, the output generated is the closest approx- 

imation to the original fraction among all radix-B 

fractions of the same length, or is one of two closest 

approximations, and in the latter case it is easy to 

choose either of the two by any arbitrary criterion. 

(d) Left-to-right genenrtion. It is never necessary to 

propagate carries. Once a digit has been generated, 

it is the correct one. 

These properties are characterized still more rigorously 

in the description of the algorithm itself. 

From thii algorithm we then derive a similar method 

for printing integers. The two methods are then com- 

bined to produce two algorithms for printing floating- 

point numbers, one for free-format applications and one 

for fixed-format applications. 

3 Fixed-Point Fraction Output 

Following [Knuth68], we use the notation 1x1 to mean 

the greatest integer not greater than z. Thus 13.51 = 3 

and I-3.51 = -4. If z is an integer, then \xJ = 2. 

Similarly, we use the notation [z] to mean the smallest 

integer not less than x; [z] z - ]--zj . Also following 

[Knuth68], we use the notation {z} to mean z mod 1, 

or z - lz], the fractional part of Z. We always indicate 

numerical multiplication of a and b explicitly as a x b, 

because we use simple juxtaposition to indicate digits 

in a place-value notation. 

The idea behind the algorithm is very simple. The 

potential error in a fraction of limited precision may be 

described as being equal to one-half the minimal non- 

zero value of the least significant digit of the fraction. 

The algorithm merely initializes a variable M to this 

115 

value to represent the precision of the fraction. When- 

ever the fraction is multiplied by B, the error M is 

also multiplied by B. Therefore M always has a value 

against which the residual fraction may be meaningfully 

compared to decide whether the precision has been ex- 

hausted. 

Algorithm (FP)s: 

Finite-Precision Fixed-Point Fraction Printout’ 

Given: 

l An n-digit radix-b fraction f, 0 < f < 1: 

f = .f-lf-2f-a. .-f-n = c f-i x b-’ 

i=l 

l A radix B (an integer greater than 

Output: The digits Fi of an N-digit (N 

fraction F: 

such that: 

N 

1). 

2 1) radix-B 

F = . F-1F-,F-a.. . F-N = c Fmi x B-’ 

(a) IF - fl < 7 

(b) N is the smallest integer 2 1 such that (a) can be 

true. 

(4 IF - fl 5 7 

(d) Each digit is output before the next is generated; 

it is never necessary to “back up” for corrections 

Procedure: see Table 1. m 

The algorithmic procedure is expressed in pseudo- 

ALGOL. We have used the loop-while-repeat (“n + f 

loop”) construct credited to Dahl [Knuth74]. The loop 

is exited from the while-point if the while-condition 

is ever false; thus one may read “while B:” as 

“if not B then exitloop fi”. The cases statement 

is to be taken as a guarded if construction [Dijkstra76]; 

the branch labeled by the true condition is executed, 

and if both conditions are true (here, if R = i) either 

branch may be chosen. The ambiguity in the cases 

statement here reflects the point of ambiguity when 

‘It is difficult to give short, meaningfnl names to each of a 

group of algorithms that are similar. Here we give each algorithm 

a long name that has enough adjectives and other qualifiers to 

distinguish it Born the others, but these long names are unwieldy. 

For convenience, and as a bit of a joke, two kinds of abbreviated 

names are used here. Algorithms that are intermediate versions 

along a path of development have names such as “(FP)3” that are 

effectively contracted acronyms for the long names. Algorithms 

that are in “final form” have names of the form “Dragonk” for 

integers ic; these algorithms have long -es whose acronyms 

form sequences of letters “F” and “P” that specify the shape of 

so-called “dragon curves” [Gardner??]. 

i=l

Table 1: Procedure for Finite-Precision Fixed-Point 

Fraction Printout ((FP)s) 

begin 

k e0; 

R + f; 

M c b-“/2; 

loop 

k+-k+l; 

Ut 1RxBJ; 

R+-(Rx B); 

Me-MxB; 

whileRLMandR): F-k+-U+l; 

endcases; 

N ck; 

end 

two radix-B representations of length n are equidistant 

from f. A given implementation may use any decision 

method when R = 3, such as “always U”, which effec- 

tively means to round down; “always U + l”, meaning to 

round up; or WU E 0 ( mod 2) then U else U + l”, 

meaning to round so that the last digit is even. Note, 

however, that using “always U” does not always round 

down if rounding up will allow one fewer digit to be 

printed. 

This method is a generalization of the one presented 

by Taranto [Taranto59] and mentioned in exercise 4.4.3 

of [Knuth69].’ 

4 Proof of Algorithm (FP)z 

(The reader who is more interested in practical appli- 

cations than in details of proof may wish to skip this 

section.) 

In order to demonstrate that Algorithm (FP)s satis- 

fies the four claimed properties (a)-(d), it is useful first 

to prove, by induction on k, the following invariants true 

at the top of the loop body: 

M _ b-” x Bk 

k- 

2 

2By the way, the paper that was forward-referenced in the 

answer to exercise 4.4.3 in [Knuth81] was M early draft of this 

paper that contained only Algorithm (FP)3, its proof, and a not- 

yet-correct generalization to floating-point numbers. 

(i) 

116 

RkXBsk+~F-iXBsi=f 

i=l 

By & and Rk we mean the values of M and R at the 

top of the loop as a function of k (the values of M and 

R change if and only if k is incremented). Invariant (i) 

is easily verified; we shall take it for granted. Invariant 

(G) is a little more complicated. 

Basis. After the first two assignments in the proce- 

dure, k = 0 and & = f . The summation in (ii) has no 

summands and is therefore zero, yielding 

&xB-‘+O=f 

which is true, because initially R = f. 

Induction. Suppose (ii) is true for k. We note that 

the only path backward in the loop sets F-(k+I) + 

[Rk x R] . It follows that: 

k 

f= Rk X Bsk+CF-iX B-’ 

i=l 

= (B x Rk) x B- @+I) + -&F-; x B-’ 

i=l 

=({RkxB}+lRkxBJ)xB- (k+l) + kFwi x B-” 

= (Rk+l + F--(k+l)) X B-(k+l) $ 2 F-i X B-’ 

= Rk+l X B-(“+l) + C F-i X B-’ 

k+l 

i=l 

which establishes the desired invariant. 

The procedure definitely terminates, because M ini- 

tially has a strictly positive value and is multiplied by 

B (which is greater than 1) each time through the loop. 

Eventually we have M > 4, at which point R 2 M 

and R 5 1 - M cannot both be true and the loop must 

terminate. (Of course, the loop may terminate before 

M > i, depending on R.) 

When the procedure terminates, there may be one of 

two cases, depending on RN. Because of the rounding 

step, we have one of the following: 

f=F+RNxB-N ifRN

f-F=R*xB-N if&s+ 

F-f=R’xB-N if&z+ 

From this we have: 

,F-f,=R*xB-“57 

proving property (c) (correct rounding). 

To prove property (a) (information preservation), we 

consider the two cases MN > i and MN 5 i. If we 

have MN > 3, then 

because MN = 

IF- fl< 7 1 - MN we have 

R* < MN, and so 

IF-fl=R’xB-“

Procedure: If f = 0 then simply print “0.0”. Otherwise 

proceed as follows. First compute u’ = u Q B-“, where 

“a” denotes floating-point multiplication and where z 

is chosen so as to make Y’ < bp. (If exponential for- 

mat is to be used, one normally chooses z so that the 

result either is between l/B and 1 or is between 1 and 

B.) Next, print the integer part of v’ by any suitable 

method, such as the usual division-remainder technique; 

after that print a decimal point. Then take the frac- 

tional part f of v’, let n = p - [log* v’J - 1, and apply 

Algorithm (FP)8 to f and n. Finally, if 2 was not zero, 

the letter “E” and a representation of the scale factor z 

are printed. I 

We note that if a floating-point number is the same 

size (in bits, or radix-b digits, or whatever) as an inte- 

ger, then arithmetic on integers of that size is often suffi- 

ciently accurate, because the bits used for the floating- 

point exponent are usually more than enough for the 

extra digits of precision required. 

This method for floating-point output is not entirely 

accurate, but is very simple to code, typically requir- 

ing only single-precision integer arithmetic. For appli- 

cations where producing pleasant output is important, 

program and data space must be minimized, and the in- 

ternal identity requirement may be relaxed, this method 

is excellent. An example of such an application might 

be an implementation of BASIC for a microcomputer, 

where pleasant output and memory conservation are 

more important than absolutely guaranteed accuracy. 

[The last sentence was written circa 1981. Nowadays all 

but the tiniest computers have the memory space for the 

full algorithm.] This algorithm was used for many years 

in the MacLisp interpreter and found to be satisfactory 

for most purposes. 

There are two problems that prevent Algorithm 

Dragon2 from being completely accurate for floating- 

point output. 

The first problem is that the spacing between adj, 

cent floating-point numbers of a given precision is not 

uniform. Let t) be a floating-point number, and let V- 

and v+ be respectively the next-lower and next-higher 

floating-point numbers of the same precision. For most 

values of v, we have v+ - v = v - v-. However, if v is an 

integral power of b, then instead v+ - v = b x (v - v-); 

that is, the gap below v is smaller than the gap above 

v by a factor of b. 

The second problem is that the scaling may cause 

loss of information. There may be two numbers, very 

close together in value, which when scaled by the same 

power of B result in the same scaled number, because of 

rounding. The problem is inherent in the floating-point 

representation. See [Matula68] for further discussion of 

this phenomenon. 

118 

Table 2: Procedure for Indefinite Precision Integer 

Printout ((IP)2) 

begin 

k +- 0; 

R e d; 

M tb"; 

s t 1; 

loop 

StSx B; 

kck+l; 

while(2xR)+M>2xS: 

repeat; 

Hck-1; 

assert S = B(“+‘) 

loop 

k+-k-l; 

S t S/B; 

u + [RISJ; 

RcRmodS; 

while2xR>Mand2xRs(2xS)-M: 

Dk + u; 

repeat; 

cases 

2xR5S: Dk+U; 

2xRLS: Dk+U+l; 

endcases; 

N + k; 

end 

6 Conversion of Integers 

To avoid the round-off errors introduced by scaling 

floating-point numbers, we develop a completely ac- 

curate floating-point output routine that performs no 

floating-point computations. As a first step, we exhibit 

an algorithm for printing integers that has a tetmina- 

tion criterion similar to that of Algorithm (FP)‘. 

Algorithm (IP)‘: 

Indefinite Precision Integer Printout 

Given: 

l An h-digit radix-b integer d, accurate to position 

n (n 2 0): 

d = d,,dh-1. . . d,,+ld,(n zeros). = 2 a$ x b’ 

l AradixB.

Output: Integers H and N (H 2 N 1 1) as an H- 

digit radix-B integer D: 

D = DHDH-I.. . DN+IDN(N zeros). = 5 Di x B’ 

such that: 

(a) [D-d! < f 

(b) N is the largest integer, and H the smallest, such 

that la1 can be true. 

(c) ID -ii,‘< B” 

(d) Each digit is” output before the next is generated. 

Procedure: see Table 2. I 

In Algorithm (IP)l all quantities are integers. We 

avoid the use of fractional quantities by introducing a 

scaling factor S and logically replacing the quantities R 

and M by R/S and M/S. Whereas in Algorithm (FP)3 

the variables R and M were repeatedly multiplied by B, 

in Algorithm (IP)a the variable S is repeatedly divided 

by B. 

7 Free-Format Floating-Point Output 

From these two algorithms, (FP)’ and (IP)‘, it is now 

easy to synthesize an algorithm for “perfect” conversion 

of a (positive) fixed-precision floating-point number to a 

free-format floating-point number in another radix. We 

shall assume that a floating-point number f is repre- 

sented as a tuple of three integers: the exponent e, the 

“fraction” or “significand” m, and the precision p. To- 

gether these integers represent the mathematical value 

f= m x b(‘-P). We require m < bp; a normalized rep- 

resentation additionally requires m 2 b(P-l), but we do 

not depend on this. 

We define the function S/L& of two integer arguments 

z and n (Z > 0): 

shi&,(x,n) z lx x b”J 

This function is intended to be trivial to implement in 

radix-b arithmetic: it is the result of shifting x by n 

radix-b digit positions, to the left for positive n (intro- 

ducing low-order zeros) or to the right for negative n 

(discarding digits shifted to the right of the “radix-b 

point”). 

Algorithm (FPP)‘: 

Fixed-Precision Positive Floating-Point Printout 

Given: 

a Three radix-b integers e, f, and p, representing the 

number v = f x bte+‘), withpz OandO < f < bP, 

l AradixB. 

i=N 

119 

Output: Integers H and N and digits DI, (H > k 2 N) 

such that if one defines the value 

V = 5 Di x B’ 

i=N 

then: 

v- +v 

(4 -

Table 3: Fixed-Precision 

Floating-Point Printout (( FPP)‘) 

begin 

Positive Table 4: Procedure Simple-Fizup 

procedure Simple-Fizup; 

begin 

assert f # 0 assert R/S = v 

R + Shifib(f) m=(e - no)); assert M-/S = M+/S = bte-p) 

S t shifib(l,max(O, -(e - p))); iff= shi&(l,p - 1) then 

assert R/S = f x b(‘-p) = v M+ + Shift,(M+,l); 

M- t shiflb(l, mu(e - p, 0)); R t- Shift,(R,l); 

M++M-; s t shiftb(S, 1); 

Simple-F&up; fi; 

Htk-1; k t 0; 

loop loop 

assert (R/S) x B” + 2 Di x B’ = u 

i=k 

kc&-l; 

U + I@ x B)/S_(; 

Rt(Rx 

M-tM-xB; 

M+tM+xB; 

B)modS; 

lowc2xR(2xS)-M+; 

while (not low) and (not high) : 

Dk + u; 

repeat; 

comment Let Vk = 

H 

C Di x B’. 

i=k+l 

v- +v 

assert low * - < (u X B” + Vk) 5 v 

2 

v+ + v 

assert high =$ u 5 ((u -k 1) x Bk + vk) < 2 

cases 

low and not high : Dk t U; 

high and not low : Dk t- U + 1; 

low and high : 

cases 

2xR5S: Dk+U; 

2xRzS: DktU$l; 

endcases; 

endcases; 

N tk; 

end; 

120 

assert (R/S) x Bk = v 

assert (M-/S) x Bk = v - v- 

assert (M+/S) x Bk = v+ -2) 

while R < [S/B] : 

&+--k-l; 

RtRxB; 

M-tM-xB; 

M+tM+xB; 

repeat; 

assert 

loop 

k = min(O, 1 + [log, vJ) 

assert (R/S) x Bk = v 

assert (M-/S) x Bk = v - v- 

assert (M+/S) x Bk = v+ -v 

while(2xR)+M+>2xS: 

ScSx 

k+-k+1; 

repeat; 

B; 

end: 

assert k = 1 + [ logB x$+]

The formula for v+ does not have to be conditional, 

even when the representation for v+ requires a larger 

exponent e than v does in order to satisfy f < bp. The 

formula for v- , however, must take into account the sit- 

uation where f = b(J’- ‘1, because v- may then use the 

next-smaller value for e, and so the gap size is smaller by 

a factor of b. There may be some question as to whether 

this is the correct criterion for floating-point numbers 

that are not normalized. Observe, however, that this 

is the correct criterion for IEEE standard floating-point 

format [IEEE85], because all such numbers are normal- 

ized except those with the smallest possible exponent, 

so if v is denormalized then v- must also be denormal- 

ized and have the same value for the exponent e. 

To compensate for the phenomenon of unequal gaps, 

the variables M- and M+ are given the value that M 

would have in Algorithm (IP)‘, and then adjustments 

are made to M+, R, and S if necessary. The simplest 

way to account for unequal gaps would be to divide 

M- by b. However, this might cause M- to have a 

non-integral value in some situations. To avoid this, 

we instead scale the entire algorithm by an additional 

factor of b, by multiplying M+, R, and S by b. 

The presence of unequal gaps also induces an asym- 

metry in the termination test that reveals a previously 

hidden problem. In Algorithm (IP)‘, with M- = M+ = 

M, it was the case that if R > (2 x S) - M+ and 

2 x R < S, then necessarily R < M- also. Here this 

is not necessarily so, because M- may be smaller than 

M+ by a factor of b. The interpretation of this is that 

in some situations one may have 2 x R < S but nev- 

ertheless must round up to the digit U + 1, because 

the termination test succeeds relative to M+ but fails 

relative to M- . 

This problem is solved by rewriting the termination 

test into two parts. The boolean variable low is true 

if the loop may be exited because M- is satisfied (in 

which case the digit U may be used). The boolean vari- 

able high is true if the loop may be exited because M+ 

is satisfied (in which case the digit U + 1 may be used). 

If both variables are true, then either digit of U and 

U + 1 may be used, as far as information-preservation 

is concerned; in this case, and this case only, the com- 

parison of 2 x R to S should be done to satisfy the 

correct-rounding property. 

Algorithm (FPP) a is conceptually divided into four 

parts. First, the variables R, S, M-, and M+ are ini- 

tialized. Second, if the gaps are unequal then the neces- 

sary adjustment is made. Third, the radix-B weight H 

of the first digit is determined by two loops. The first 

loop is needed for positive H, and repeatedly multiplies 

S by B; the second loop is needed for negative H, and 

repeatedly multiplies R, M-, and M+ by B. (Divisions 

by B are of course avoided to prevent roundoff errors.) 

121 

Fourth, the digits are generated by the last loop; this 

loop should be compared with the one in Table 1. 

8 Fixed-Format Floating-Point Output 

Algorithm (FPP)r simply generates digits, stopping 

only after producing the appropriate number of digits 

for precise free-format output. It needs more flexible 

means of cutting off digit generation for fixed-format 

applications. Moreover, it is desirable to intersperse 

formatting with digit generation rather than buffering 

digits and then re-traversing the buffer. 

It is not difficult to adapt Algorithm (FPP)’ for fixedformat 

output, where the number of digits to be output 

is predetermined independently of the floating-point 

value to be printed. Using this algorithm avoids giving 

the illusion of more internal precision than is actually 

present, because it cuts off the conversion after sufficiently 

many digits have been produced; the remaining 

character 

ros. 

positions are then padded with blanks or ze- 

As an example, suppose that a Fortran program 

is written to calculate z, and that it indeed calculates 

a floating-point number that is the best possible 

approximation to z for that floating-point format, 

precise to, say, 27 bits. However, it then prints 

the value with format F20.18, producing the output 

u3.1415Q265160560607Q”. 

This is indeed accurately the value of that floatingpoint 

number to that many places, but the last ten 

digits are in some sense not justified, because the internal 

representation is not that precise. Moreover, 

this output certainly does not represent the value of 

A accurately; in format F20.18, u should be printed 

as “3.141592653589793238". The free-format printing 

procedure described above would cut off the conversion 

after nine digits, producing “3.14159265” and no 

more. The fixed-format procedure to be developed below 

would therefore produce “3.14159265 

9, 

or “3.141592650000000000”. Either of these is much 

less misleading as to internal precision. 

The fixed-format algorithm works essentially uses the 

free-format output method, differing only in when conversion 

is cut off. If conversion is cut off by exhaustion 

of precision, then any remaining character positions are 

padded with blanks or zeros. If conversion is cut off by 

the format specification, the only problem is producing 

correctly rounded output. This is easily almost-solved 

by noting that the remainder R can correctly direct 

rounding at any digit position, not just the last. In 

terms of the programs, almost all that is necessary is to 

terminate a conversion loop prematurely if appropriate, 

and the following cases statement will properly round 

the last digit produced.

This doesn’t solve the entire problem, however, be- 

cause if conversion is cut off by the format specification 

then the early rounding may require propagation of car- 

ries; that is, the proof of property (d) fails to hold. This 

can be taken care of by adjusting the values of M- and 

M+ appropriately. In principle this adjustment some- 

times requires the use of non-integral values that cannot 

be represented exactly in radix b; in practice such values 

can be rounded to get the proper effect. 

As noted above, it is indeed not difficult to adapt 

the simplified algorithm for a particular fixed format. 

However, straightforwardly adapting it to handle a va- 

riety of fixed formats is clumsy. We tried to do this, 

and found that we had to introduce many switches and 

flags to control the various aspects of formatting, such 

as total field width, number of digits before and after 

the decimal point, width of exponent field, scale factor 

(as for “P” format specifiers in Fortran), and so on. The 

result was a tangled mess, very difficult to understand 

and maintain. 

We finally untangled the mess by completely separat- 

ing the generation of digits from the formatting of the 

digits. We added a few interface parameters to produce 

a digit generator (called Dragon4) to which a variety 

of formatting processes could be easily interfaced. The 

generator and the formatter execute as coroutines, and 

it is assumed that the “user” process executes as a third 

coroutine. 

For purposes of exposition we have borrowed cer- 

tain features of Hoare’s CSP (Communicating Sequen- 

tial Processes) notation [Hoare78]. The statement 

GENERATE! (zr, . . . , z,,) 

means that a message containing values zl,. . . , z, is to 

be sent to the GENERATE process; the process that exe- 

cutes such a statement then continues its own execution 

without waiting for a reply. The statement 

FORMAT?(zl,...,z,) 

means that a message containing values 21,. . . , z, is 

to be received from the FORMAT process; the process 

that executes such a statement waits if necessary for a 

message to arrive. (The notation is symmetrical in that 

the sender and receiver of a message must each know 

the other’s name.) We do not borrow any of the control 

structure notations of CSP, and we do not worry about 

the matter of processes failing. 

To perform floating-point output one must effectively 

execute three coroutines together. In the CSP notation 

one writes: 

[ USER :: user process ] 

FORMAT :: formatting process ] 

GENERATE :: Drogon‘j ] 

122 

The interface convention is that the “user process” must 

first send the message 

FORMAT! (a, e, f, p, B) 

to initiate floating-point conversion and then may re- 

ceive characters from the FORMAT process until a “0” 

character is received. The *@” character serves as a ter- 

minator and should be discarded. Any of the formatting 

processes shown below may be used; to get free-format 

conversion, for example, one would execute 

[ USER :: user process 1 

FORMAT :: Free-Format ] 

GENERATE :: Dragon4 ] 

and to get fixed-format exponential formatting one 

would execute 

[ USER :: user process ] 

FORMAT :: Fized-Format Exponential 1 

GENERATE :: Dragon4 ] 

We have found it easy to write new formatting routines. 

9 Implementations 

In 1981 we coded a version of Algorithm Dragon4 in 

Pascal, including the formatting routines shown here as 

well as formatters for Fortran I%, F, and G formats and for 

PL/I-style picture formats such as $$$ , $$$ , $$9.99CR. 

This suite has been tested thoroughly. This algorithm 

has also been used in various Lisp implementations for 

both free-format and fixed-format floating-point output. 

A portable and performance-tuned implementation in C 

is in progress. 

10 Historical Note and Acknowledgments 

The work reported here was done in the late 1970’s. 

This is the first published appearance, but earlier vex- 

sions of this paper have been circulated “unpublished” 

through the grapevine for the last decade. The algo- 

rithms have been used for years in a variety of language 

implementations, perhaps most notably in Zetalisp. 

Why wasn’t it published earlier? It just didn’t seem 

like that big a deal; but we were wrong. Over the years 

floating-point printers have continued to be inaccurate 

and we kept getting requests for the unpublished draft. 

We thank Donald Knuth for giving us that last required 

push to submit it for publication. 

The paper has been almost completely revised for pre- 

sentation at this conference. An intermediate version of 

the algorithm, Dragon3 (Free-Format Perfect Positive 

Floating-Point Printout), has been omitted from this 

presentation for reasons of space; it is of interest only

in being closest in form to what was actually first im- 

plemented in MacLisp. We have allowed a few anachro- 

nisms to remain in this presentation, such as the sugges- 

tion that a tiny version of the printing algorithm might 

be required for microcomputer implementations of BA- 

SIC. You will find that most of the references date back 

to the 1960’s and 19’70’s. 

Helpful and valuable comments on drafts of this paper 

were provided by Jon Bentley, Barbara K. Steele, and 

Daniel Weinreb. We are also grateful to Donald Knuth 

and Will Clinger for their encouragement. 

The first part of this work was done in the mid to late 

1970’s while both authors were at the Massachusetts 

Institute of Technology, in the Laboratory for Computer 

Science and the Artificial Intelligence Laboratory. From 

1978 to 1980, Steele was supported by a Fanny and John 

Hertz Fellowship. 

This work was carried forward by Steele at Carnegie- 

Mellon University, Tartan Laboratories, and Thinking 

Machines Corporation, and by White at IBM T. J. Wat- 

son Research and Lucid, Inc. We thank these institu- 

tions for their support. 

This work was supported in part by the Defense Ad- 

vanced Research Projects Agency, Department of De- 

fense, ARPA Order 3597, monitored by the Air Force 

Avionics Laboratory under contract F33615-78-C-1551. 

The views and conclusions contained in this document 

are those of the authors and should not be interpreted 

as representing the official policies, either expressed or 

implied, of the Defense Advanced Research Projects 

Agency or the U.S. Government. 

11 References 

[ANSI761 American National Standards Institute. Draft 

proposed AN.9 Fortmn (BSR X3.9). Reprinted az ACM 

SIGPLAN Notices 11, 3 (March 1976). 

[Clinger90] Cling er, William D. How to read floating point 

numbers accurately. Proc. ACM SIGPLAN ‘90 Confer- 

ence on Progr amming Language Design and Implementa- 

tion (White Plains, New York, June 1990). 

[DEC73] Digital Equipment Corporation. DecSystem 10 

Assembly Language Handbook. Third edition. (Maynard, 

Massachusetts, 1973). 

[Dijkstra?G] Dijkzt ra, Edsger W. A Discipline of Program- 

ming. Prentice-Hall (Englewood Cliffs, New Jersey, 1976). 

[Gardner77] Gardner, Martin. “The Dragon Curve and 

Other Problems.” In Mathematical Magic Show. Knopf 

(New York, 1977), 203-222. 

[Hoare78] Hoarc, C. A. R. “Communicating Sequential 

Processes.” Communications of the ACM 21, 8 (August 

1978), 666-677. 

[IEEE811 IEEE Computer Society Standard Committee, 

Microprocessor Standards Subcommittee, Floating-Point 

123 

Working Group. “A Proposed Standard for Binary 

Floating-Point Arithmetic.” Computer 14, 3 (March 

1981), 51-62. 

[IEEE851 IEEE. IEEE Standard for Binary Floating-Point 

Arithmetic. ANSI/IEEE Std 754-1985 (New York, 1985). 

[Jensen741 Jensen, Kathleen, and Wirth, Niklaus. PAS- 

CAL User Manual and Report. Second edition. Springer- 

Verlag (New York, 1974). 

[Knuth68] Knuth, Donald E. The Art of Computer Pro- 

gramming, Volume 1: Fundamental Algorithms. Addison- 

Wesley (Reading, Massachusetts, 1968). 


gramming, Volume L: Seminumerical Algorithms. First 

edition. Addison-Wesley (Reading, Massachusetts, 1969). 

[Knuth74] Knuth, DonaId E. “Structured Programming 

with GO TO Statements.” Computing Survey3 6, 4 (De- 

cember 1974). 


gramming, Volume 2: Seminumerical Algorithms. Second 

edition. Addison-Wesley (Reading, Massachusetts, 1981). 

[MatuIa68] MatuIa, David W. “In-and-Out Conversions.” 

Communications of the ACM 11, 1 (January 1968), 47- 

50. 

[Matnla’lO] Matula, David W. “A Formalization of 

Floating-Point Numeric Base Conversion.” IEEE Trand- 

actions on Computers C-19, 8 (August 1970), 681-692. 

[Moon741 Moon, David A. MacLiap Reference Manual, Re- 

vision 0. Massachusetts Institute of Technology, Project 

MAC (Cambridge, Massachusetts, April 1974). 

[Taranto59] Taranto, Donald. “Binary Conversion, with 

Fixed Decimal Precision, of a Decimal Fraction.” Com- 

munications of the ACM 2, 7 (July 1959), 27.

Table 5: Procedure Dragon4 (Formatter-Feeding Pre 

cess for Floating-Point Printout, Performing F’ree- 

Format Perfect Positive Floating-Point Printout) 

process Dragond; 

begin 

FORMAT? (b, e, f, p, B, CutoflMode, CutoflPlace); 

assert CutofMode = Urehtive” * CutoffPlace 5 0 

Round WpFlag t false; 

if f = 0 then FORMAT ! (0, k) else 

R + Wb(f, m=(e - P, 0)); 

S t &i&(1, max(O, -(e - p))); 

M- t shi,&(l, max(e -p, 0)); 

M+ + M-; 

Fizup; 

loop 

ktk-1; 

u t l(R x WSJ i 

Rc(Rx B)modS; 

M- tM- xB; 

M+tM+xB; 

low t 2 x RC M-; 

if Round UpFlag 

then high c 2 x R >(2 x S) - M+ 

else high + 2 x R > (2 x S) - M+ fi; 

while not low and not high 

and k # Cuto#Place : 

FORMAT ! (V, k); 

repeat; 

cases 

low and not high : FORMAT! (V, k); 

high and not low : FORMAT! (U + 1,k); 

(low and high) or (not low and not high) : 

cases 

2 x R 2 S: FORMAT! (U, k); 

2x RZS: FORMAT!(U+l,k); 

endcases; 

endcases; 

fi; 

comment Henceforth this process will generate as 

many u- 1” digits as the caller desires, along 

with appropriate values of k. 

loop k + k - 1; FORMAT! (-1, k) repeat; 

end; 

124 

Table 6: Procedure Pisup 

procedure F&up; 

begin 

iff= shift,(l,p - 1) then 

comment Account for unequal gaps. 

M+ t shi&,(M+, 1); 

R t shi&.(R, 1); 

S t shi&(S, 1); 

fi; 

k t 0; 

loop 

while R < [S/B] : 

ktk-1; 

RtRxB; 

M-tM-xB; 

M+tM+xB; 

repeat; 

loop 

loop 

while (2 x R) + M+ 12 x S : 

StSxB; 

ktk+l; 

repeat ; 

comment Perform any necessary adjustment 

of M- and M+ to take into account the for- 

matting requirements. 

case CutofiMode of 

%omnal” : CutoffPlace c k; 

“absolute* : CutoffAdjust; 

“relative” : 

CutoffPlace t k + CutoflPlace; 

CutofiAdjust; 

endcase; 

while(2xR)+M+>2xS: 

repeat; 

end; 

Table 7: Procedure fill 

procedure fill(k, c); 

comment Send k copies of the character c to the 

USER process. No characters are sent if k = 0. 

for i from 1 to k do USER! (c) od;

Table 8: Procedure CutofAdjust 

procedure CutoffAdjust; 

begin 

a t CutoffPlace - k; 

Y + s; 

cases 

a?O: forjc1toadoytyxB; 

aOAw~max(d+1,2) 

ccw-d-l; 

GENERATE! (b,e, f,p, B, “absolute”, -d); 

GENERATE ? (U, k); 

ifk < c then 

ifk 0 then fiU(c - 1, u “); USER! (“on) fi; 

USER! (‘V’); 

fiZi(min(-k, d), “0”); 

else jiZr(c - k - 1, u “) fi; 

loop 

whilek>-d: 

Digit Chart U); 

ifk = 0 then USER! (“.“) fi; 

GENERATE 

repeat; 

? (U, k) ; 

else fiZZ(w, u*n) A; 

USER ! (“0”); 

end:

Table 12: Formatting process for free-format exponen- 

tial output 

process Free-Fomnal Exponential; 

begin 

USER ? (h et f, P, B); 

GENERATE! (b,e, f,p,B, “normal”, 0); 

GENERATE ? (U, expt); 

DigiiChar(U); 

USER! (“2’); 

loop 

GENERATE ? (U, k); 

whileU#-1: 

DigitChar( 

repeat ; 

if k = ezpt - 1 then USER ! (“0”) fi; 

USER! (“E”); 

if ezpt < 0 then USER ! (“-“); ezpl t - ezpi ii; 

j t 1; 

loop j t j x I?; while j 5 ezpt : repeat; 

loop 

i +- LVBJ; 

DigitChar(jezpZ/j]); 

ezpept t ezpt mod j; 

whilej>l: 

repeat; 

USER! (“0”); 

end; 

126 

Table 13: Formatting process for fixed-format exponen- 

tial output 

process Fixed-Format Exponential; 

begin 

USER ? (he, f, P, B, wl d, 2); 

assertd~OAx>lAw>d+x+3 

ctw-d-x-3; 

GENERATE! (b, e, f,p, B, “relative”, -d); 

GENERATE ? (U, ezpept); 

j t 1; 

a t 0; 

loop 

j +- j x B; 

cr+--a+1; 

while j 5 Iezptl : 

repeat; 

ifa< x then 

j;ZZ(c - 1, u “); 

DigitChar( 

USER! (“2’); 

for q t 1 to d do 

GENERATE ? (U, k) ; 

DigiZChar(U); 

od; 

USER ! (‘92”); 

if ezpt < 0 then 

USER ! (‘Q’) 

else 

USER ! (“+“) 

fi; 

ezpt e 1 expt 1; 

fiZZ(x - a, UO”); 

loop 

i +-- UIBJ; 

DigiiChar( [expt/jJ); 

ezpt t expt mod j; 

whilej>l: 

repeat; 

else $ZZ(w, u*n) fi; 

USER ! (‘W’); 

end

How to Print Floating-Point Numbers Accurately - Computer Science ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?