19.11.2014 Views

The Fortress Language Specification - CiteSeerX


supercalifragilisticexpialidocious = 0.142857142857142857142857 TIMES &
GREEK_SMALL_LETTER_&
&UPSILON_WITH_DIALYTICA_AND_TONOS

becomes

supercalifragilisticexpialidocious = 0.142857142857142857142857 TIMES &
GREEK_SMALL_LETTER_UPSILON_WITH_DIALYTICA_AND_TONOS
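The pasting shown in this example can be sketched in Python. This is only a rough sketch: the exact rule is stated in the preceding section, which is not reproduced here, so the code assumes (from the example alone) that an ampersand directly after a word character at the end of a line is pasted to an ampersand directly before a word character at the start of the next line. The name `paste_words` is hypothetical.

```python
import re

# Assumed rule, inferred from the example only: "<word>&" at the end of a line
# followed by "&<word>" at the start of the next line is pasted into one word.
# An "&" preceded by a space (as in "TIMES &") is left untouched.
_PASTE = re.compile(r"(?<=\w)&[ \t]*\r?\n[ \t]*&(?=\w)")


def paste_words(source: str) -> str:
    """Join words that were split across line breaks with paired ampersands."""
    return _PASTE.sub("", source)


example = (
    "supercalifragilisticexpialidocious = 0.142857142857142857142857 TIMES &\n"
    "GREEK_SMALL_LETTER_&\n"
    "&UPSILON_WITH_DIALYTICA_AND_TONOS\n"
)
pasted = paste_words(example)
```

In this sketch the trailing `&` after `TIMES` survives because it follows a space, while the `_&` / `&U` pair and the line break between them are removed.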

E.2 Preprocessing of Names of Unicode Characters

After a program encoded as a sequence of ASCII characters has been processed for word pasting across line breaks as described in the previous section, this step converts restricted words into corresponding Unicode characters. It also converts some other characters, as discussed below.

First, the program is analyzed to determine the boundaries of string literals and comments. There are three modes of processing: outside any comment or string literal, inside a string literal, and inside a comment. Within a comment, we also keep track of the “nesting depth” (which is 0 when not within a comment). All processing proceeds from left to right.

Outside any comment or string literal, encountering an unescaped string literal delimiter changes processing to the mode for within a string literal (however, it is a static error if the string literal delimiter is the right double quotation mark), and encountering the opening comment delimiter “(*” changes processing to the mode for within a comment, incrementing the nesting depth (to 1). All other characters are ignored, except to note that they are outside any comment or string literal.

Within a string literal, all characters, including comment delimiters, are ignored (except to note that they are within a string literal), with one exception: an unescaped string literal delimiter switches processing back to the mode for outside any comment or string literal.

Inside a comment, all characters, including unescaped string delimiters, are ignored (except to note that they are within a comment) other than the two-character opening and closing comment delimiters “(*” and “*)”. Whenever the opening comment delimiter is encountered, the nesting depth is incremented, and each time the closing comment delimiter is encountered, the nesting depth is decremented, until it becomes 0. At that point, processing returns to the mode for outside any comment or string literal.

This step partitions the characters into those within string literals (including the string literal delimiters) and those not within string literals. Note that character literal delimiters are ignored in this step. Thus, we require string literal delimiters to be escaped within character literals.
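The three-mode scan described above can be sketched as a small state machine. This is a minimal illustration, not the specification's implementation. It makes several assumptions: the string literal delimiters are taken to be `"`, U+201C, and U+201D; a backslash is assumed to escape the character that follows it; and the names `classify`, `OUTSIDE`, `STRING`, and `COMMENT` are hypothetical.

```python
OUTSIDE, STRING, COMMENT = "outside", "string", "comment"

# Assumption: these three characters act as string literal delimiters.
STRING_DELIMS = {'"', "\u201c", "\u201d"}


def classify(text: str) -> list:
    """Return one mode label per character of `text`, scanning left to right."""
    modes = []
    mode, depth, i, n = OUTSIDE, 0, 0, len(text)
    while i < n:
        ch, two = text[i], text[i:i + 2]
        if mode == COMMENT:
            # Only the two-character comment delimiters matter in a comment.
            if two == "(*":
                depth += 1
                modes += [COMMENT, COMMENT]
                i += 2
            elif two == "*)":
                depth -= 1
                modes += [COMMENT, COMMENT]
                i += 2
                if depth == 0:
                    mode = OUTSIDE
            else:
                modes.append(COMMENT)
                i += 1
        elif ch == "\\" and i + 1 < n:
            # Assumed escape rule: a backslash protects the next character,
            # so an escaped delimiter never changes the mode.
            modes += [mode, mode]
            i += 2
        elif mode == STRING:
            modes.append(STRING)  # the closing delimiter counts as "string"
            if ch in STRING_DELIMS:
                mode = OUTSIDE
            i += 1
        elif ch in STRING_DELIMS:
            if ch == "\u201d":
                raise SyntaxError("right double quotation mark opens a string")
            mode = STRING
            modes.append(STRING)  # the opening delimiter counts as "string"
            i += 1
        elif two == "(*":
            mode, depth = COMMENT, 1
            modes += [COMMENT, COMMENT]
            i += 2
        else:
            modes.append(OUTSIDE)
            i += 1
    return modes


def outside_text(text: str) -> str:
    """Keep only the characters outside any comment or string literal."""
    return "".join(c for c, m in zip(text, classify(text)) if m == OUTSIDE)
```

For instance, `outside_text('a "(* x" (* c (* "d *) e *) b')` keeps only `a`, `b`, and the spaces around the string and the comment: the `(*` inside the string does not open a comment, and the `"` inside the nested comment does not open a string.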

The characters outside of string literals are partitioned into contiguous subsequences formed by the restricted words, and all the characters between the restricted words and string literals separated by whitespace. That is, no subsequence considered has any whitespace characters or ampersands (ampersands being part of whitespace). Each subsequence is considered separately.

For a restricted word, the general rule is that we try to replace the restricted word with a single Unicode character that it “names”. But we never do the replacement if the character is a printable ASCII character, a control character, or a left or right double quotation mark (i.e., characters with code points up to and including U+009F, or with code point U+201C or U+201D). We call such characters protected characters. Protecting the backslash and double quotation mark characters is necessary to maintain the boundaries for string literals, and protecting the printable ASCII characters ensures that the ASCII conversion process is idempotent. Protecting the control characters makes sense because most of them are forbidden from valid Fortress programs, and those that aren’t are available directly in ASCII.
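The protected-character rule can be illustrated with a short Python sketch. It is not the specification's procedure. As an assumption, it uses only one possible source of names, the official Unicode character names with underscores standing in for spaces, looked up through Python's `unicodedata` module. The names `is_protected` and `replace_restricted_word` are hypothetical.

```python
import unicodedata


def is_protected(ch: str) -> bool:
    """True for code points up to U+009F and for U+201C / U+201D."""
    cp = ord(ch)
    return cp <= 0x9F or cp in (0x201C, 0x201D)


def replace_restricted_word(word: str) -> str:
    """Replace `word` by the character it names, unless that is protected.

    Assumption: only Unicode character names (underscores for spaces) are
    consulted; the other name sources are not modeled in this sketch.
    """
    try:
        ch = unicodedata.lookup(word.replace("_", " "))
    except KeyError:
        return word  # not a known character name: leave the word unchanged
    return word if is_protected(ch) else ch
```

Under these assumptions, `GREEK_SMALL_LETTER_ALPHA` becomes `α`, while `LATIN_CAPITAL_LETTER_A` and `LEFT_DOUBLE_QUOTATION_MARK` are left as written because the characters they name are protected.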

There are four sources for determining whether a restricted word is a “name” for a Unicode character. Because these sources overlap in some cases, and not necessarily in compatible ways, the order in which we try these names is important.

First, Fortress explicitly provides short ASCII names for many characters, especially ones that programmers might most commonly want. For operators, these names are given in Appendix F. For example, here are some common ones:
