21.07.2015

GAWK: Effective AWK Programming

The empty string "" (a string without any characters) has a special meaning as the value of RS. It means that records are separated by one or more blank lines and nothing else. See Section 3.7 [Multiple-Line Records], page 49, for more details.

If you change the value of RS in the middle of an awk run, the new value is used to delimit subsequent records, but the record currently being processed, as well as records already processed, are not affected.

After the end of the record has been determined, gawk sets the variable RT to the text in the input that matched RS. When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Chapter 2 [Regular Expressions], page 24). In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string. This general rule is actually at work in the usual case, where RS contains just a newline: a record ends at the beginning of the next matching string (the next newline in the input), and the following record starts just after the end of this string (at the first character of the following line). The newline, because it matches RS, is not part of either record.

When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.

The following example illustrates both of these features. It sets RS equal to a regular expression that matches either a newline or a series of one or more uppercase letters with optional leading and/or trailing whitespace:

    $ echo record 1 AAAA record 2 BBBB record 3 |
    > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
    >       { print "Record =", $0, "and RT =", RT }'
    ⊣ Record = record 1 and RT = AAAA
    ⊣ Record = record 2 and RT = BBBB
    ⊣ Record = record 3 and RT =
    ⊣

The final line of output has an extra blank line. This is because the value of RT is a newline, and the print statement supplies its own terminating newline. See Section 13.3.8 [A Simple Stream Editor], page 248, for a more useful example of RS as a regexp and RT.

If you set RS to a regular expression that allows optional trailing text, such as ‘RS = "abc(XYZ)?"’, it is possible, due to implementation constraints, that gawk may match the leading part of the regular expression, but not the trailing part, particularly if the input text that could match the trailing part is fairly long. gawk attempts to avoid this problem, but currently, there's no guarantee that this will never happen.

NOTE: Remember that in awk, the ‘^’ and ‘$’ anchor metacharacters match the beginning and end of a string, and not the beginning and end of a line. As a result, something like ‘RS = "^[[:upper:]]"’ can only match at the beginning of a file. This is because gawk views the input file as one long string that happens to contain newline characters in it. It is thus best to avoid anchor characters in the value of RS.

The use of RS as a regular expression and the RT variable are gawk extensions; they are not available in compatibility mode (see Section 11.2 [Command-Line Options], page 177). In compatibility mode, only the first character of the value of RS is used to determine the end of the record.
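
As a further illustration of the behavior described above, here is a minimal sketch, assuming gawk is installed and on the PATH. The function names `rt_demo` and `rs_change_demo` and the sample input are hypothetical and chosen only for this example. The first function prints RT between brackets so that any whitespace matched by RS stays visible; the second changes RS while the first record is being processed, so only later records are split with the new separator.

```shell
#!/bin/sh
# Sketch only: requires gawk (RT and regexp RS are gawk extensions).
# Function names and sample input are hypothetical.

# RS is a regexp: a comma or semicolon followed by optional spaces,
# or a newline. RT receives the exact text that matched each time.
rt_demo() {
    printf 'one, two;three\n' |
    gawk 'BEGIN { RS = "[,;] *|\n" }
          # Show a newline in RT as the two characters \n for readability.
          { printf "record=[%s] RT=[%s]\n", $0, (RT == "\n" ? "\\n" : RT) }'
}

# RS starts as the default newline. It is changed while record 1 is
# being processed, so record 1 is unaffected and the rest of the input
# is split on ";" instead.
rs_change_demo() {
    printf 'a b\nc;d;e' |
    gawk 'NR == 1 { RS = ";" }
          { print NR ": " $0 }'
}

rt_demo
# record=[one] RT=[, ]
# record=[two] RT=[;]
# record=[three] RT=[\n]

rs_change_demo
# 1: a b
# 2: c
# 3: d
# 4: e
```

Printing RT inside brackets is a deliberate choice here: with a plain `print`, leading or trailing spaces captured in RT are easy to overlook in the output.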
