FAST Forth Native-Language Embedded Computers

David M. Sanders<br />

San Francisco, California<br />

I. Introduction<br />

Many existing <strong>Forth</strong> compilers generate the code within<br />

a compiled word in the form of a sequence of pointers to<br />

other words. An "inner" interpreter then fetches these<br />

pointers, one at a time, and executes the appropriate<br />

routines that, when taken together, simulate the action of<br />

the <strong>Forth</strong> machine model. However, an increasing number<br />

of <strong>Forth</strong> implementations generate machine code directly<br />

during the compilation of <strong>Forth</strong> words. While the resulting<br />

code tends to take more space, it executes considerably<br />

faster. There is no coded "inner interpreter" in this type of<br />

<strong>Forth</strong> implementation; the inner interpreter is the processor's<br />

silicon-encoded, instruction-fetching mechanism.<br />
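The pointer-sequence scheme described above can be pictured with a small model. The following is only a sketch in Python, not code from any actual Forth system: the word set, the dictionary layout, and names such as `inner_interpreter` and `run` are all assumptions made for illustration.

```python
# Toy model of threaded code: a compiled word is a list of references
# to other words, and an inner interpreter walks that list.
# Every name below is hypothetical.

def prim_dup(m):
    m["data"].append(m["data"][-1])

def prim_swap(m):
    m["data"][-1], m["data"][-2] = m["data"][-2], m["data"][-1]

def prim_mul(m):
    b, a = m["data"].pop(), m["data"].pop()
    m["data"].append(a * b)

def prim_add(m):
    b, a = m["data"].pop(), m["data"].pop()
    m["data"].append(a + b)

PRIMITIVES = {"DUP": prim_dup, "SWAP": prim_swap, "*": prim_mul, "+": prim_add}

# Each compiled word is stored as a sequence of "pointers" (names here).
COLON_WORDS = {
    "SQUARE": ["DUP", "*"],
    "SUM-OF-SQUARES": ["SQUARE", "SWAP", "SQUARE", "+"],
}

def inner_interpreter(m, word):
    """Fetch pointers one at a time and execute the routine each one names."""
    if word in PRIMITIVES:
        PRIMITIVES[word](m)
        return
    thread, ip = COLON_WORDS[word], 0
    while True:
        if ip == len(thread):              # end of thread acts as an exit
            if not m["ret"]:
                return
            thread, ip = m["ret"].pop()    # resume the calling word
            continue
        name = thread[ip]
        ip += 1
        if name in PRIMITIVES:
            PRIMITIVES[name](m)
        else:
            m["ret"].append((thread, ip))  # nest into another compiled word
            thread, ip = COLON_WORDS[name], 0

def run(word, stack):
    m = {"data": list(stack), "ret": []}
    inner_interpreter(m, word)
    return m["data"]
```

In this model every step through a thread costs a fetch and a dispatch in software. A compiler that generates machine code directly has no such loop, because the processor's own instruction fetch does that work.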

This article discusses certain optimization techniques<br />

for machine code generated by a <strong>Forth</strong> compiler. The Intel<br />

386 processor is used to illustrate the techniques, but many<br />

of them are applicable to a wide variety of processors,<br />

including the Motorola 68000 family. The discussion and<br />

examples are based upon 386 assembly language<sup>1</sup>; in<br />

practice, the optimizations would probably be carried out<br />

directly upon machine-code instruction codes, without<br />

the use of an assembler.<br />

"Local optimizations" that<br />

are particularly applicable to<br />

<strong>Forth</strong> compilers can greatly<br />

reduce the amount of<br />

machine code generated.<br />

In this article, I focus principally on techniques that are<br />

particularly applicable to <strong>Forth</strong> compilers, rather than on<br />

techniques that are common to compilers in general. The<br />

techniques discussed relate to certain "local" optimizations<br />

that, when employed together, can significantly reduce the<br />

amount of machine code that would otherwise be generated.<br />

In particular, I look at optimizations related to three areas:<br />

1. optimizations related to buffering the top of the data<br />

stack in registers;<br />

2. optimizations related to updating the data-stack pointer;<br />

and<br />

3. optimizations related to combining generated code<br />

instruction sequences into shorter and faster code<br />

instructions.<br />

I also look into some ramifications for <strong>Forth</strong> itself, in the<br />

form of some extensions.<br />

II. Buffering the Stack Top in Registers<br />

The 386 has a subroutine stack, which would be used<br />

by <strong>Forth</strong> as the return stack. When a machine-code-compiled<br />

<strong>Forth</strong> word calls another compiled word, an<br />

assembly-language CALL instruction is generated; the<br />

compilation of an exit from a <strong>Forth</strong> word would generate<br />

an assembly-language RET.<br />

The 386 has no explicit stacks aside from the subroutine<br />

stack; the <strong>Forth</strong> data stack would need to be implemented<br />

by using a register such as EDI as the stack-pointer<br />

register, and incrementing and/or decrementing the pointer<br />

register as necessary.<sup>2</sup><br />
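A naive code generator along these lines, with no register buffering yet, might look like the following. This is a sketch under stated assumptions rather than the author's compiler: it is written in Python purely to model the generated assembly text, it assumes 32-bit cells and a data stack that grows downward in memory, and the function names and the labels `DUP`, `TIMES`, and `TIMES10` are hypothetical.

```python
CELL = 4  # bytes per stack cell (32-bit cells assumed)

def push_literal(value):
    """Emit code that pushes a literal onto the in-memory data stack.

    Assumption: the stack grows downward, so EDI is decremented first.
    """
    return [f"SUB EDI, {CELL}", f"MOV DWORD PTR [EDI], {value}"]

def compile_definition(name, body, known_words):
    """Translate a word body into 386 assembly text, one token at a time."""
    code = [f"{name}:"]
    for token in body:
        if isinstance(token, int):
            code += push_literal(token)       # literal goes to the data stack
        elif token in known_words:
            code.append(f"CALL {token}")      # return address goes to the 386 stack
        else:
            raise ValueError(f"unknown word: {token}")
    code.append("RET")                        # exit from the compiled word
    return code
```

For example, `compile_definition("TIMES10", [10, "TIMES"], {"TIMES"})` yields a label, two instructions that push the literal through EDI, a `CALL`, and a `RET`. Every data-stack operation in this version touches memory, which is the cost that buffering the stack top in registers is meant to reduce.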

Ideally, the topmost part of the data stack should be<br />

buffered in one or more registers, so that the number of<br />

memory operations needed to manipulate the stack can be<br />

reduced. The number of memory operations can be further<br />

reduced if a variable number of items from the top of the<br />

data stack can be buffered in registers. Such a reduction in<br />

the number of generated memory accesses comes at the<br />

expense of compiler complexity. In this case, the stack-pointer<br />

register (EDI is used in the examples in this article)<br />

points to the top of the in-memory portion of the data stack.<br />
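One way to picture buffering a variable number of stack items is to have the compiler carry a small record of which registers currently hold the top items. The sketch below is a Python model, not the article's implementation: the class `StackCache`, its methods, the downward-growing stack, and above all the choice of EAX and EBX as buffer registers are assumptions picked only to keep the example short.

```python
CELL = 4
CACHE_REGS = ("EAX", "EBX")  # arbitrary picks for illustration only

class StackCache:
    """Compile-time record of which registers hold the top stack items."""

    def __init__(self):
        self.cached = []  # registers holding stack items; last entry is the top
        self.code = []    # generated assembly text

    def _free_reg(self):
        for reg in CACHE_REGS:
            if reg not in self.cached:
                return reg
        return None

    def _spill_deepest(self):
        """Write the deepest buffered item to memory and free its register."""
        reg = self.cached.pop(0)
        self.code += [f"SUB EDI, {CELL}", f"MOV [EDI], {reg}"]
        return reg

    def _fill_one(self):
        """Load the top in-memory item into a register, below the buffered items."""
        reg = self._free_reg()
        self.code += [f"MOV {reg}, [EDI]", f"ADD EDI, {CELL}"]
        self.cached.insert(0, reg)

    def literal(self, value):
        reg = self._free_reg() or self._spill_deepest()
        self.code.append(f"MOV {reg}, {value}")
        self.cached.append(reg)

    def add(self):
        while len(self.cached) < 2:       # make sure both operands are in registers
            self._fill_one()
        top = self.cached.pop()
        self.code.append(f"ADD {self.cached[-1]}, {top}")

    def flush(self):
        """Return to the state in which every item lives in memory."""
        while self.cached:
            self._spill_deepest()

def compile_tokens(tokens):
    cache = StackCache()
    for tok in tokens:
        if tok == "+":
            cache.add()
        else:
            cache.literal(tok)
    cache.flush()
    return cache.code
```

With this model, `compile_tokens([2, 3, "+", 4, "+"])` produces five register-only instructions followed by a single store through EDI when the state is flushed. A generator that kept every item in memory would emit a store for each literal and a load for each operand.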

During compilation, the compiler must itself keep track<br />

of the stack "state," which is used to indicate the registers<br />

that currently contain items from the top of the data stack.<br />

For the 386, the ideal registers to use for that purpose are<br />

1 For those of you not familiar with Intel assembler notation, the destination operand comes before the source operand. For example:<br />

MOV EDX, EAX<br />

This instruction sets EDX (the destination) to the contents of EAX (the source). In addition, when a register is used as a base or index register, the register name is enclosed in square brackets []. As an additional example:<br />

MOV [EDI]+4, EAX<br />

This instruction sets the memory location addressed by [contents of EDI plus four] to the contents of EAX. Most instructions permit a memory location to be used as either a destination or a source.<br />

2 On the Intel 8088, which was used in early IBM PCs, a significant speed-up in instruction execution time could often be realized by using string instructions such as STOS, even without the REP instruction prefix, whenever possible. However, in designing the 386, Intel deliberately optimized instructions such as MOV [EDI], EAX so that any speed-up resulting from using string instructions without the REP prefix is completely negated, even if the pointer register (EDI) must be incremented or decremented in a separate instruction.<br />

<strong>Forth</strong> Dimensions, March/April 1994, page 33
