21.07.2015 Views

GAWK: Effective AWK Programming

GAWK: Effective AWK Programming

GAWK: Effective AWK Programming

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

30 <strong>G<strong>AWK</strong></strong>: <strong>Effective</strong> <strong>AWK</strong> <strong>Programming</strong>specification for Extended Regular Expressions (EREs). POSIX EREs are based on theregular expressions accepted by the traditional egrep utility.Character classes are a new feature introduced in the POSIX standard. A characterclass is a special notation for describing lists of characters that have a specific attribute,but the actual characters can vary from country to country and/or from character set tocharacter set. For example, the notion of what is an alphabetic character differs betweenthe United States and France.A character class is only valid in a regexp inside the brackets of a character list. Characterclasses consist of ‘[:’, a keyword denoting the class, and ‘:]’. Table 2.1 lists thecharacter classes defined by the POSIX standard.Class[:alnum:][:alpha:][:blank:][:cntrl:][:digit:][:graph:][:lower:][:print:][:punct:][:space:][:upper:][:xdigit:]MeaningAlphanumeric characters.Alphabetic characters.Space and TAB characters.Control characters.Numeric characters.Characters that are both printable and visible. (A space is printable butnot visible, whereas an ‘a’ is both.)Lowercase alphabetic characters.Printable characters (characters that are not control characters).Punctuation characters (characters that are not letters, digits, control characters,or space characters).Space characters (such as space, TAB, and formfeed, to name a few).Uppercase alphabetic characters.Characters that are hexadecimal digits.Table 2.1: POSIX Character ClassesFor example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to matchalphanumeric characters. If your character set had other alphabetic characters in it, thiswould not match them, and if your character set collated differently from ASCII, this mightnot even match the ASCII alphanumeric characters. With the POSIX character classes, youcan write /[[:alnum:]]/ to match the alphabetic and numeric characters in your characterset.Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that arerepresented with more than one character. They can also have several characters that areequivalent for collating, or sorting, purposes. (For example, in French, a plain “e” and agrave-accented “è” are equivalent.) These sequences are:Collating symbolsMulticharacter collating elements enclosed between ‘[.’ and ‘.]’. For example,if ‘ch’ is a collating element, then [[.ch.]] is a regexp that matches thiscollating element, whereas [ch] is a regexp that matches either ‘c’ or ‘h’.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!