09.07.2015 Views

Attribute Grammars and Parsers for Chunk-Based Binary File Formats

Attribute Grammars and Parsers for Chunk-Based Binary File Formats

Attribute Grammars and Parsers for Chunk-Based Binary File Formats

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

List of FiguresFigure 1. <strong>File</strong> <strong>for</strong>mat (Layout) of a RIFF container <strong>for</strong> a Webp Image........................................................... 2Figure 2. ILBM <strong>File</strong> ham.pic Displayed .......................................................................................................... 6Figure 3. An <strong>Attribute</strong> Grammar <strong>for</strong> an ILBM <strong>Chunk</strong>-based <strong>File</strong> Format ...................................................... 7Figure 4. A parse tree <strong>for</strong> the ILBM file ham.pic ........................................................................................... 8Figure 5. <strong>Binary</strong> Array Grammar <strong>for</strong> RIFF WAVEFORM PCM <strong>File</strong> Format ................................................... 10Figure 6. Top-Level Grammar Rule ............................................................................................................. 10Figure 7.Top-level Java program of a recursive descent parser ................................................................. 11Figure 8. An ANTLR Grammar <strong>for</strong> the ILBM <strong>File</strong> Format. ............................................................................ 16Figure 9. A Java Application <strong>for</strong> Parsing an ILBM <strong>File</strong> ................................................................................. 17Figure 10. Output of the ILBM Parser Applied to the <strong>File</strong> BLK.IFF .............................................................. 18Figure 11. Lexer <strong>for</strong> all Context-free <strong>Binary</strong> Array <strong>Grammars</strong> .................................................................... 19Figure 12. Data Translation <strong>and</strong> Other Gammar Utilities ........................................................................... 21Figure 13. Datatype rules used by all binary array attribute grammars ..................................................... 23Figure 14. The ILBM <strong>File</strong> Format Grammar ................................................................................................. 26Figure 15. The Wave_Form <strong>File</strong> Format Grammar ..................................................................................... 32Figure 16. Parse tree shown as a nested list. .............................................................................................. 33Figure 17. Parse tree shown as a graph. ..................................................................................................... 33Figure 18. ILBM <strong>File</strong> Docking.bru ................................................................................................................ 34Figure 19. Parse Tree Created by ANTLR4 .................................................................................................. 35Figure 20. Pretty Printed Parse Tree of ILBM Formated <strong>File</strong> Docking.bru .................................................. 38vi


1 Introduction1.1 BackgroundAutomated tools are required <strong>for</strong> identifying <strong>and</strong> validating the <strong>for</strong>mats of the huge number offiles ingested into digital data <strong>and</strong> record archives. Automated tools are also needed <strong>for</strong>viewing/playing text <strong>and</strong> binary file <strong>for</strong>mats, <strong>and</strong> <strong>for</strong> converting legacy <strong>and</strong> obsolete file <strong>for</strong>matsto st<strong>and</strong>ard or current <strong>for</strong>mats. Technologies such as data description languages have emerged toaddress these preservation challenges [Dunckley et al 2007].Context-free grammars have been used to specify the syntax of programming languages.<strong>Attribute</strong> grammars provide a framework <strong>for</strong> <strong>for</strong>mally specifying the semantics of a languagebased on its context-free grammar <strong>and</strong> <strong>for</strong> addressing the mildly context-sensitive features ofprogramming languages such as agreement of the data types of variables in expressions with datatype declarations. Such grammars have parsing/translation algorithms that can be used <strong>for</strong>syntax-checking <strong>and</strong> interpretation or translation of the languages the grammars define.1.2 PurposeThe research question addressed by this research is whether it is possible to extend the contextfreegrammars used to specify the syntax of programming languages to the specification ofbinary file <strong>for</strong>mats <strong>and</strong> to use these grammars with parsers <strong>for</strong> validating the file <strong>for</strong>mats ofbinary files.1.3 ScopeThe next section discusses the traditional approach to specifying file <strong>for</strong>mats <strong>and</strong> two of themajor families of binary file <strong>for</strong>mats. In section 3, extensions to the concepts of context-freegrammars <strong>and</strong> attribute grammars are described that enable the specification of binary file<strong>for</strong>mats. In section 4, examples of binary array attribute grammars <strong>for</strong> a chunk-based binary file<strong>for</strong>mats are presented. In section 5, recursive descent parsers <strong>for</strong> these kinds of grammars aredescribed. In section 6, experience in using ANTLR, a parser generator <strong>for</strong> LL(k) stringgrammars, in generating parsers <strong>for</strong> recognizing the <strong>for</strong>mats of binary files is discussed. Thisincludes the extension of ANTLR to work with grammars with data types other than strings <strong>and</strong>the creation of parse trees represented as pretty printed nested lists. In section 7, related researchis described. Section 8 summarizes the results.2 Specification of <strong>Binary</strong> <strong>File</strong> <strong>Formats</strong>Traditionally, a file (or record) layout was a <strong>for</strong>m showing how fields were positioned in a file(or record). The fields are named <strong>and</strong> have as attributes a data type, length <strong>and</strong> sometimes aconstant value or range of values. Fields are offset at addresses relative to the beginning of a file.1


chunks in describing the <strong>for</strong>mat, <strong>for</strong> instance, "atoms" in QuickTime/MP4, "segments" in JPEG,“boxes” in JPEG 2000, <strong>and</strong> “tagged data representations” in AutoCAD DXF.The Audio Interchange <strong>File</strong> Format (AIFF) developed by Apple computer in 1988 was based onElectronic Arts Interchange <strong>File</strong> Format. Apple Core Audio Format (CAF), Apple’s replacement<strong>for</strong>mat <strong>for</strong> AIFF, remains a chunk-based <strong>for</strong>mat [Apple Computer 2005]. The ResourceInterchange <strong>File</strong> Format (RIFF) introduced by IBM <strong>and</strong> Microsoft [1991] is also based onElectronic Arts’ Interchange <strong>File</strong> Format. The Microsoft container <strong>for</strong>mats like Audio-VideoInterleave (AVI) <strong>and</strong> Wave<strong>for</strong>m PCM (WAV) use RIFF as their basis. WebP, a picture <strong>for</strong>matrecently introduced by Google[2010], also uses RIFF as a container.Microsoft’s Advanced Systems Format (ASF) [Microsoft 2004] is also a chunk-based container<strong>for</strong>mat. The most common file <strong>for</strong>mats contained within an ASF file are Windows Media Audio(WMA) <strong>and</strong> Windows Media Video (WMV). Microsoft’s <strong>Binary</strong> Interchange <strong>File</strong> Format(Microsoft Excel’s <strong>File</strong> Format) is chunk-based [Rentz 2008].Another family of binary file <strong>for</strong>mats is directory-based. A directory-based file <strong>for</strong>mat consists ofone or more directory tables which contain one or more directory entries. A directory entryspecifies where the actual data is located. This is the scheme used in TIFF files, OLE (MicrosoftObject Linking <strong>and</strong> Embedding) files, OASIS OpenDocument <strong>and</strong> Microsoft Open Office files.3 Context-Free <strong>Grammars</strong> <strong>and</strong> <strong>Attribute</strong> <strong>Grammars</strong>Context-free grammars have been widely used to define programming languages. A context-freegrammar G is a quadruple where:N is a finite set of non-terminal symbols,Σ is a finite set of terminal symbols,S ∈ N is the start symbol,P is a set of production rules of the <strong>for</strong>m A → {N ∪ Σ}* where A ∈ NThe set of all strings over a vocabulary Σ is denoted by Σ*. Formally, a string is said to directlyderive in G a string ', denoted G ', if ' can be obtained from by replacing a substring with, where is a production rule in G. A string is said to derive ' in G, denoted G* ', if0 G . . . G ' n <strong>for</strong> some 0 , . . . , n such that 0 = <strong>and</strong> n = '.The language generated by a context-free grammar G is the set L(G) = {w: w ∈ Σ* <strong>and</strong> S G*w}.3


4.2 RIFF Wave<strong>for</strong>m PCM Audio <strong>File</strong> Format (WAVE)The wave<strong>for</strong>m PCM <strong>File</strong> <strong>for</strong>mat was originally specified in an extended BNF notation [IBM &Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF ruleswith the cksize inserted. To convert this grammar into a context-free binary array grammar, datatypes are added to non-terminals which would be replaced by a binary value were an additionallexical production rule applied. -> “RIFF” “WAVE”[][][][] ->“fmt “ -> -> { | } -> data -> LIST “wavl” { | } -> slnt -> factxksize ->cue + -> ->“plst” + -> “LIST” “adtl” 9


-> “labl” -> “note” -> “ltxt” + -> “file” +Figure 5. <strong>Binary</strong> Array Grammar <strong>for</strong> RIFF WAVEFORM PCM <strong>File</strong> Format5 Recursive Descent <strong>Parsers</strong> <strong>for</strong> <strong>Binary</strong> <strong>File</strong> <strong>Grammars</strong>A recursive descent parser is a top-down parser built from a set of recursive procedures whereeach procedure implements one of the production rules of the grammar. A predictive parser is arecursive descent parser that does not require backtracking. Predictive parsing is possible only<strong>for</strong> the class of LL(k) grammars, which are the context-free grammars <strong>for</strong> which there existssome positive integer k that allows a recursive descent parser to decide which production to useby examining only the next k tokens of input.Figure 6 shows the top-level grammar rule of the binary array attribute grammar <strong>for</strong> the ILBMfile <strong>for</strong>mat. The interpretation of this rule is: The first 4 bytes of the file <strong>for</strong>mat contain thecharacters “FORM”. Next is a 4-byte (32-bit) unsigned integer indicating a chunk size that is thesize of the remainder of the file. The next 4 bytes contain the characters “ILBM”. This must befollowed by a property<strong>Chunk</strong>, a data<strong>Chunk</strong> <strong>and</strong> a BODY chunk.Figure 6. Top-Level Grammar Rule10


Figure 7 shows the top-level Java program named ilbm in a top-down recursive descent parser<strong>for</strong> the ILBM grammar. It is manually constructed from the top-level grammar rule. There arealso Java programs corresponding to , <strong>and</strong> .The property chunks of an ILBM file must contain a BitmapHeader chunk. The attributefoundBMHD() is passed back to the program with a value true or false, indicating whether aBitmapHeader (BMHD) was found.The recursive descent parsers <strong>for</strong> the binary array grammars defined in this paper do not requirea separate lexical analyzer (lexer) to identify the data of a file prior to parsing. The data typesare predicted by the grammar rules. The top-down parser defines which data types are expectednext so that the tokens (data type values) are identified while parsing.Figure 7.Top-level Java program of a recursive descent parserThe semantic rules of an L-attributed grammar can be included in the recursive descent parser tocheck <strong>for</strong> context-sensitive aspects of the grammar such as chunk size or image dimensions <strong>and</strong>to use these values in parsing the chunk data or images. The pointers to file addresses that occurin the file <strong>for</strong>mats might be h<strong>and</strong>led by pushing the current address in the file onto a pushdown11


stack. When the parser exhausts the procedures implementing production rules corresponding tothe pointed to location, the topmost address on the pushdown stack is popped <strong>and</strong> the procedurecorresponding to the nonterminal at that position is executed.Recursive descent parsers with a pushdown stack can be used with binary array grammars tocreate file <strong>for</strong>mat recognizers. A recognizer (syntax checker or validator) is a parser that reads afile <strong>and</strong> generates an error if the file does not con<strong>for</strong>m to the syntax specified by the grammar.6. A Parser Generator <strong>for</strong> <strong>Binary</strong> <strong>File</strong> <strong>Formats</strong>What we want is a parser generator whose input is a binary array attribute grammar <strong>for</strong> aparticular binary file <strong>for</strong>mat, <strong>and</strong> whose generated output is the Java source code of a recursivedescent parser <strong>for</strong> the class of binary file <strong>for</strong>mats specified by the grammar. Such a parsergenerator does not yet exist, but there is a widely used parser generator <strong>for</strong> attribute grammars<strong>for</strong> string-based languages.6.1 ANTLRANTLR (ANother Tool <strong>for</strong> Language Recognition) is a parser generator that uses LL(k) parsing[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies alanguage <strong>and</strong> generates as output source code <strong>for</strong> a parser <strong>for</strong> that language. ANTLR supportsgenerating code in a number of the programming languages including C, Java, JavaScript <strong>and</strong>Python.ANTLR also generates a Lexical analyzer (lexer) from lexical rules in the grammar. A lexer is aprogram that converts a sequence of characters into tokens. A lexer is not needed to parse binaryfile <strong>for</strong>mats, because the data types predicted by the parser are the tokens of the grammar.6.2 Using ANTLR to Generate a Parser <strong>for</strong> a <strong>Binary</strong> <strong>File</strong> FormatIt is possible to use ANTLR to test our binary array grammars in recognizing the chunk-based<strong>and</strong> directory-based binary file <strong>for</strong>mats. This is accomplished by writing a lexer that treats eachinput byte in a file as a character token. Then functions are created <strong>for</strong> each data type in thebinary file grammar that convert the appropriate number of character tokens to binary data types,e.g., int 16, int32, etc. This is effective, if someone clumsy, <strong>and</strong> has allowed us to test ourattribute grammars <strong>for</strong> binary file <strong>for</strong>mats as well as to demonstrate the feasibility of creating aparser generator <strong>for</strong> binary array grammars.12


6.2.1 An ANTLR Grammar <strong>for</strong> an ILBM <strong>Chunk</strong>-based <strong>File</strong> FormatFigure 8 shows a specification <strong>for</strong> the ILBM file <strong>for</strong>mat that is input to ANTLR. The output ofANTLR is a parser <strong>for</strong> ILBM files.grammar iblm;options {language = Java; // The programming language of the output parserTokenLabelType = CommonToken; }// The text contained in @header is placed at the top of the parser file.@header {package gtri.org.grammars.chunk;import java.io.UnsupportedEncodingException;}// The text in @lexer::header is placed at the top of the lexer file.@lexer::header {package gtri.org.grammars.chunk;import java.io.UnsupportedEncodingException;}// The text in the @members section is placed inside the parser class.@members {// getUByteValue inputs the token of a Byte terminal <strong>and</strong> returns an// integer value.public int getUByteValue(Token tok) throws UnsupportedEncodingException{int intValue = 0;char charValue = tok.getText().charAt(0);intValue = (int) charValue;intValue = intValue & 0x00ff;return intValue;} }// Grammar Rules used to create the Parser. Start symbol ilbmiblm returns [boolean result]:FORM ckSize ILBM// Print out ILBM chunksize.{ System.out.println("ID: ILBM Size: " + $ckSize.value);}// Semantic predicate that ensures that at least on BMHD is found.property<strong>Chunk</strong>s {$property<strong>Chunk</strong>s.foundBMHD == true}?data<strong>Chunk</strong>s?body?{ $result = true; };// The bmhd, cmap <strong>and</strong> camg property chunks can occur in any order.// At least one BMHD data chunk must be present.property<strong>Chunk</strong>s returns [boolean foundBMHD]: (bmhd { $foundBMHD = true; }| cmap| camg)+;bmhd: BMHD ckSize{System.out.println("ID: BMHD Size: " + $ckSize.value);}bitMapHeader;bitMapHeader: width = uWordValue13


height = uWordValuexPosition = wordValueyPosition = wordValuenPanes = uByteValuemasking = uByteValuecompression = uByteValuereserved1 = uByteValuetransparentColor = uWordValuexAspect = uByteValueyAspect = uByteValuepageWidth = wordValuepageHeight = wordValue// Print out values in the bitmap structure.{System.out.println("Width: " + $width.value);System.out.println("Height: " + $height.value);System.out.println("X Position: " + $xPosition.value);System.out.println("Y Position: " + $yPosition.value);System.out.println("Number of Panes: " + $nPanes.value);System.out.println("Masking:" + $masking.value);System.out.println("Compression:" + $compression.value);System.out.println("Reserved1:" + $reserved1.value);System.out.println(" TransParent Color:" +$transparentColor.value);System.out.println("X Aspest:" + $xAspect.value);System.out.println("Y Aspect:" + $yAspect.value);System.out.println("Page Width:" + $pageWidth.value);System.out.println("Page Height:" + $pageHeight.value);};cmap returns [long value]// Initialize the index value i in the cmap function.@init { int i = 0; }:CMAP ckSize// Print out chunk size.{ $value = $ckSize.value;System.out.println("ID: CMAP Size: " + $ckSize.value);}(colorRegister// Print out color values.{ System.out.println(" Color[" + i++ + "]: {" +$colorRegister.redValue +", " + $colorRegister.greenValue +", " + $colorRegister.blueValue +"}");})+;colorRegister returns [int redValue, int greenValue, int blueValue]: red = uByteValue green = uByteValue blue = uByteValue{ $redValue = $red.value;$greenValue = $green.value;$blueValue = $blue.value;};camg: CAMG ckSize longValue{ System.out.println("ID: CAMG Size: " + $ckSize.value);14


System.out.println(" View Modes: 0x" +Long.toHexString($longValue.value));};// The crng <strong>and</strong> ccrt data chunksdata<strong>Chunk</strong>s: crng| ccrt;crng: CRNG ckSize{ System.out.println("ID: CRNG Size: " + $ckSize.value); }cRange;cRange: pad1 = wordValue {$pad1.value == 0}?rate = wordValueactive = wordValuelow = uByteValuehigh = uByteValue ;ccrt: CCRT ckSize{ System.out.println("ID: CCRT Size: " + $ckSize.value); }cycleInfo;cycleInfo: direction = wordValuestart = uByteValueend = uByteValueseconds = longValuemicroseconds = longValuepad = wordValue {$pad.value == 0}? ;body: BODY ckSize{ System.out.println("ID: BODY Size: " + $ckSize.value); }{$ckSize.value > 0}? data{ System.out.println(" Number of bytes: " + $data.value); }{$data.value == $ckSize.value}?;// defines the data chunk of the body chunkdata returns [long value]: (b+=BYTE)+ {$b.size() > 0}?{ $value = $b.size(); };// The cksize is the value of a long integer.ckSize returns [long value]: longValue{$value = $longValue.value;};// defines the longValue data typein terms of BYTE terminalslongValue returns [long value] throws UnsupportedEncodingException: h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE{$value =(long)( getUByteValue($h1)


System.out.println("Unsupported Encoding Exception"); }// defines the unsigned word data typeuWordValue returns [int value]: b1 = BYTE b2 = BYTE{$value =(int)( getUByteValue($b1)


6.2.2 The Lexer <strong>and</strong> ParserWhen provided the described in the previous section, ANTLR generates the ilbmlexer <strong>and</strong>ilbmparser Java programs. The Java application shown in Figure 9 uses these programs to parsethe ILBM file BLK.IFF (a black rectangle) <strong>and</strong> extract the metadata shown in Figure 10.package org.gtri.grammars.chunk;import java.io.IOException;import org.antlr.runtime.ANTLR<strong>File</strong>Stream;import org.antlr.runtime.ANTLR<strong>File</strong>Stream;import org.antlr.runtime.ANTLRStringStream;import org.antlr.runtime.CharStream;import org.antlr.runtime.CommonTokenStream;import org.antlr.runtime.RecognitionException;import org.antlr.runtime.TokenStream;import gtri.org.grammars.chunk.iblmLexer;import gtri.org.grammars.chunk.iblmParser;public class Test {static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";public static void main(String[] args) throws RecognitionException, IOException {ANTLR<strong>File</strong>Stream stream = new ANTLR<strong>File</strong>Stream(filePath, "ISO-8859-1");iblmLexer lexer = new iblmLexer(stream);TokenStream tokenStream = new CommonTokenStream(lexer);iblmParser parser = new iblmParser(tokenStream);System.out.println("FILE: " + filePath);System.out.println("Successfully parsed = " + parser.iblm());}}Figure 9. A Java Application <strong>for</strong> Parsing an ILBM <strong>File</strong>17


FILE: C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFFID: ILBM Size: 2556ID: BMHD Size: 20Width: 200Height: 144X Position: 0Y Position: 0Number of Panes: 6Masking:0Compression:1Reserved1:0TransParent Color:0X Aspest:3Y Aspect:3Page Width:200Page Height:144ID: CMAP Size: 48Color[0]: {0, 0, 0}Color[1]: {0, 128, 128}Color[2]: {0, 255, 255}Color[3]: {255, 0, 255}Color[4]: {0, 255, 0}Color[5]: {255, 0, 0}Color[6]: {0, 0, 255}Color[7]: {255, 255, 0}Color[8]: {128, 128, 128}Color[9]: {192, 192, 192}Color[10]: {128, 0, 0}Color[11]: {0, 128, 0}Color[12]: {128, 128, 0}Color[13]: {0, 0, 128}Color[14]: {128, 0, 128}Color[15]: {255, 255, 255}ID: CAMG Size: 4View Modes: 0x800ID: BODY Size: 2448Number of bytes: 2448Successfully parsed = trueFigure 10. Output of the ILBM Parser Applied to the <strong>File</strong> BLK.IFF6.3 Improvements to the Lexical Analysis of Data TypesThe ANTLR Lexer returns tokens made up of characters. For the simple sentence “This is asimple string.” could return ten tokens, one token <strong>for</strong> each word, one token <strong>for</strong> each space <strong>and</strong>one token <strong>for</strong> the period. The number of tokens the lexer returned would depend on the grammarit was built from. The previous version of the ilbm grammar also created a lexer that groupsbytes intermixed with certain strings. Examples of some of the strings that were returned were“FORM”, “ILBM”, “CAMG”, “BODY”. In some cases this worked fine. But in other cases itcaused problems. Whenever the lexer encountered a ‘C’ it expected the next byte to be an ‘M’,‘A’, ‘R’, or another ‘C’. The lexer returned an error if the ASCII equivalent of the next byte was18


not what it expected. This error would state what it was expecting <strong>and</strong> what it encounteredinstead. The problem occurred because the same byte that can represent the ASCII ‘C’ could alsooccur as part of a set of bytes that were supposed to represent a number. Another problem thatoccurred with the previous grammar, was that the specification <strong>for</strong> this file type allowed <strong>for</strong>unknown chunks as long as they had a chunk id that consisted of four capital letters followed bythe size of the chunk <strong>and</strong> the data in the chunk.The lexer created from a chunk-based binary array attribute grammar, needs to simply return anarray of bytes. It needs to leave it up to the parser as to whether a group of bytes represents anASCII string or a number. Figure 11 shows a grammar that ANTLR uses to create a Lexer thatreturns an array of bytes.lexer grammar binaryArray;//Terminal rules used to create the LexerBYTE: . // The period is used to match any byte;Figure 11. Lexer <strong>for</strong> all Context-free <strong>Binary</strong> Array <strong>Grammars</strong>Applications store data to a file using groups of bytes. Some applications store the mostsignificant byte of numeric data first. This is called big-endian representation. Other applicationsstore the least significant byte first. This is called little-endian representation. The applicationsthat store data in ILBM files use little-endian representation. To make use of this stored numericdata, the data types that the bytes represent have to be translated into the appropriate numericrepresentation.A set of JAVA utilities have been created that can be referenced by binary attribute grammarrules to return values of these big <strong>and</strong> little-endian data types. There are also data types thatreturn a string of characters <strong>and</strong> one that returns the hex representation of the bytes as they werestored in the file. Many chunk-based file specifications allow <strong>for</strong> the addition of unknownchunks. These chunks of data may contain data that is used by one application but can be ignoredby other applications. The data types of the data in these chunks are not known, but the size ofthe array of data that they contain is known. In this case, if the data contains only ASCIIprintable characters, the ASCII representation is returned. In the case that the data contain nonprintablecharacters, the hex representation of the data is returned. There is a utility <strong>for</strong> checkingwhether a number is odd or even. This utility is needed because odd-sized chunks in chunk-basedfiles often have a pad byte added. These functions are shown in Figure 12.19


public static boolean isOdd(long value) {return ((value % 2) != 0);}public static boolean isPrintable(String str) {if (str == null) {return false;}int sz = str.length();if (str.charAt(sz - 1) == 0)sz--;<strong>for</strong> (int i = 0; i < sz; i++) {if (isPrintable(str.charAt(i)) == false) {return false;}}return true;}public static boolean isPrintable(char ch) {return ((ch >= 32 && ch = 8 && ch = 12 && ch


}public static long uBytesToDWord(String str, String byteOrder) {if (byteOrder.toUpperCase().equals("LE")) {str = new StringBuffer(str).reverse().toString();}byte[] b = str.getBytes();return (long) (uByte(b[0])


Each of the rules <strong>for</strong> a numeric data type take a parameter byteOrder. This parameter is thenpassed on to the translation utilities. This allows the bytes of numeric data to be correctlytranslated. The byteOrder is either big-endian or little-endian depending on how the data wasoriginally stored in the file. For example, ilbm files are stored in big-endian order while wavefiles are stored in little-endian order.As long as chunk-based binary array grammars are built on top of the simplified lexer grammar<strong>and</strong> the data type grammar, ANTLR 4 can be used to automatically create a lexer <strong>and</strong> parser thatcan be used with chunk-based binary array grammars.grammar dataType;// The binaryArray grammar is a simple lexer grammar with BYTE as the only// Token type.import binaryArray;// The following rules are a work around.// data types instead of Antlr4's text only data type.// They allow other binary grammars read <strong>and</strong> return// data types instead of Antlr4's text only Token type.// These parser <strong>and</strong> lexer rules simulate a lexer that// inputs bytes <strong>and</strong> returns data type Token on dem<strong>and</strong>.// This rule inputs byte array of a given size. It uses a// long as the size of the array to input <strong>and</strong> returns// a string of all printable charactors of size or the// hex representation of the array as a string that is// length of size. dtData [long size] returns [String value]: byteArray[$size];{ if(Utilities.isPrintable($text))}else$value = $text;$value = Utilities.uBytesToHex($text);dtString [long size] returns [String value]: byteArray[$size];{$value = $text;}dtZString [long size] returns [String value]: byteArray[$size - 1]22


{$value = $text;}dtUByte{Utilities.isValidZString($text)}?;dtHexString [long size] returns [String value]: byteArray[$size]{$value = Utilities.uBytesToHex($text);};dtLong [String byteOrder] returns [long value]: byteArray[4]{$value = Utilities.uBytesToLong($text, $byteOrder);};dtDWord [String byteOrder] returns [long value]: byteArray[4]{$value = Utilities.uBytesToDWord($text, $byteOrder);};dtUWord [String byteOrder] returns [int value]: byteArray[2]{$value = Utilities.uBytesToUWord($text, $byteOrder);};dtWord [String byteOrder] returns [int value]: byteArray[2]{$value = Utilities.uBytesToWord($text, $byteOrder);};dtUbyte returns [int value]: byteArray[1]{$value = Utilities.uByte($text);};byteArray [long size]locals [long index = 0]: ({$index < $size}?BYTE{$index += 1;})+;Figure 13. Datatype rules used by all binary array attribute grammars23


The single lexer rule <strong>and</strong> all the data type rules are a data type grammar. This makes it possibleto simply import the data type grammar into the specific grammar <strong>for</strong> each chunk-based file<strong>for</strong>mat. A grammar <strong>for</strong> a file <strong>for</strong>mat needs rules that define the layout (structure) of the file<strong>for</strong>mat plus the data type (lexical) rules to h<strong>and</strong>le the data type definition <strong>and</strong> the translation ofnumeric data into a usable <strong>for</strong>m. A rewritten version of the ilbm grammar is shown in Figure 14.It imports the data type rules.grammar ilbm;import dataType;// Nonterminal rules used to create the Parserilbmlocals [String byteOrder = "BE", boolean foundBMHD]: <strong>for</strong>mID = ckID;{$<strong>for</strong>mID.text.equals("FORM")}? // check that first ckID == FORMckSizeilbmID = ckID{$ilbmID.text.equals("ILBM")}? // check that second ckID == ILBMpropertiesLoopbodypropertiesLooplocals [String ckName]: chunkName1 = ckID;{$ckName = $chunkName1.text;}()*{!$ckName.equals("BODY")}? // check <strong>for</strong> the end of the loop.property[$ckName]chunkName2 = ckID {$ckName = $chunkName2.text;}catch [FailedPredicateException fpe]{}// Do nothing; ckName == BODY// This exception is used as an exit <strong>for</strong> the propertyLoop// It only occurrs when the next $ckName is equal to "BODY".// The BODY is where the actual picture is stored.24


property[String currentID]: {$currentID.equals("BMHD")}? bmhd| {$currentID.equals("CMAP")}? cmap| {$currentID.equals("CAMG")}? camg| {$currentID.equals("CRNG")}? crng| additionalProperty[currentID];bmhd: ckSize {$ckSize.value == 20}?bitMapHeader{$ilbm::foundBMHD = true;};bitMapHeader: width = dtUWord[$ilbm::byteOrder]height = dtUWord[$ilbm::byteOrder]xPosition = dtWord[$ilbm::byteOrder]yPosition = dtWord [$ilbm::byteOrder]nPanes = dtUBytemasking = dtUBytecompression = dtUBytereserved1 = dtUBytetransparentColor = dtUWord[$ilbm::byteOrder]xAspect = dtUByteyAspect = dtUBytepageWidth = dtWord[$ilbm::byteOrder]pageHeight = dtWord[$ilbm::byteOrder];cmaplocals [int count, int index]@init {$count = 1;$index = 0;}: ckSize {($ckSize.value % 3) == 0}?({$count


colorRegister [int index]: red = dtUByte green = dtUByte blue = dtUByte;camg: ckSize dtLong[$ilbm::byteOrder]{Long.toHexString($dtLong.value);};crng: ckSizecRange;cRange: pad1 = dtWord[$ilbm::byteOrder] {$pad1.value == 0}?rate = dtWord[$ilbm::byteOrder]active = dtWord[$ilbm::byteOrder]low = dtUBytehigh = dtUByte;additionalProperty [String currentID]: ckSizedtData[$ckSize.value];body: {$ilbm::foundBMHD}?ckSize dtHexString[$ckSize.value];ckSize returns [long value]: dtLong[$ilbm::byteOrder]{$value = $dtLong.value;};ckID: dtString[4] {$text.equals($text.toUpperCase())}?;Figure 14. The ILBM <strong>File</strong> Format Grammar26


A wave_<strong>for</strong>m grammar is shown in Figure 15. It also imports the datatype rules.grammar wave_<strong>for</strong>m; // Define a grammar called wave_<strong>for</strong>mimport dataType;// Nonterminal rules used to create the Parserwave_<strong>for</strong>mlocals [String byteOrder = "LE", boolean foundFmtCk]: riff = dtString[4] {$riff.value.equals("RIFF")}?dtDWord[$byteOrder]wave = dtString[4] {$wave.value.equals("WAVE")}?chunkOrListLoop;chunkOrListLoop: (ckID = dtString[4]chunkOrList[$ckID.value])+;chunkOrList[String ckID]: {$ckID.equals("fmt ")}? fmt_ck {$wave_<strong>for</strong>m::foundFmtCk = true;}| {$ckID.equals("fact")}? fact_ck| {$ckID.equals("data") && $wave_<strong>for</strong>m::foundFmtCk}? data_ck| {$ckID.equals("cue ")}? cue_ck| {$ckID.equals("plst")}? playlist_ck| {$ckID.equals("LIST")}?(ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]typeList = dtString[4]listType[$ckSize.value, $typeList.value])| unknown<strong>Chunk</strong>;fmt_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]common_fields<strong>for</strong>mat_specific_fields[$common_fields.<strong>for</strong>matTag,$ckSize.value - 14]({Utilities.isOdd($ckSize.value)}? pad_byte)?;common_fields returns [int <strong>for</strong>matTag]27


: // Format categorywFormatTag = dtUWord[$wave_<strong>for</strong>m::byteOrder]{$<strong>for</strong>matTag = $wFormatTag.value;}// Number of channelswChannels = dtUWord[$wave_<strong>for</strong>m::byteOrder]// Sampling ratedwSamplesPerSec = dtDWord[$wave_<strong>for</strong>m::byteOrder]// For buffer estimationdwAvgBytesPerSec = dtDWord[$wave_<strong>for</strong>m::byteOrder]// Data block sizewBlockAlign = dtUWord[$wave_<strong>for</strong>m::byteOrder];/** consist of zero or more bytes of parameters.* Which parameters occur depends on the WAVE <strong>for</strong>mat category. Validation* should allow <strong>for</strong> (<strong>and</strong> ignore) <strong>and</strong> unknown parameters. */<strong>for</strong>mat_specific_fields [int <strong>for</strong>matTag, long sizeRest]returns[int bitsPerSample]: ({$sizeRest > 0}?({$<strong>for</strong>matTag == 1}? (pcm_specific_fields[$sizeRest]{$bitsPerSample = $pcm_specific_fields.bitsPerSample;})| dtHexString[$sizeRest])|);pcm_specific_fields [long sizeRest] returns[int bitsPerSample]: wBitsPerSample = dtUWord[$wave_<strong>for</strong>m::byteOrder] // Sample rate{$bitsPerSample = $wBitsPerSample.value;}({($sizeRest - 2) > 0}? dtHexString[$sizeRest - 2])?;fact_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]dw<strong>File</strong>Size = dtDWord[$wave_<strong>for</strong>m::byteOrder];data_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]dtHexString[$ckSize.value]({Utilities.isOdd($ckSize.value)}? pad_byte)?28


;cue_ck: dtDWord[$wave_<strong>for</strong>m::byteOrder]dwCuePoints = dtDWord[$wave_<strong>for</strong>m::byteOrder]cue_point_array[$dwCuePoints.value];cue_point_array [long size]locals [long index]: ({$index < $size}?cue_point{$index += 1;})+;cue_point: dwName = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwPosition = dtDWord[$wave_<strong>for</strong>m::byteOrder]fcc<strong>Chunk</strong> = dtString[4]dw<strong>Chunk</strong>Start = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwBlockStart = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwSampleOffset = dtDWord[$wave_<strong>for</strong>m::byteOrder];playlist_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwSegments = dtDWord[$wave_<strong>for</strong>m::byteOrder]play_segment_array[$dwSegments.value];play_segment_array [long size]locals [long index]: ({$index < $size}?play_segment{$index += 1;})+;play_segment: dwName = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwLength = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwLoops = dtDWord[$wave_<strong>for</strong>m::byteOrder];29


listType[long size, String typeList]: {$typeList.equals("adtl")}? assoc_data_list[$size]| {$typeList.equals("wavl") && $wave_<strong>for</strong>m::foundFmtCk}?wave_list[$size];wave_list[long size]locals[long index = 4]| {$typeList.equals("INFO")}? info_list[$size]| unknown<strong>Chunk</strong>Loop[size]: ({$index < $size}?;waveListType = dtString[4] dataOrSlnt[$waveListType.value]{$index += 4 + $dataOrSlnt.text.length();})+dataOrSlnt[String dataType]: {$dataType.equals("data")}? data_ck;slnt_ck| {$dataType.equals("slnt")}? slnt_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder];dtHexString[$ckSize.value]({Utilities.isOdd($ckSize.value)}? pad_byte)?assoc_data_list[long size]locals[long index = 4]: ({$index < $size}?;assoc_dataType = dtString[4] assoc_data_type[$assoc_dataType.value]{$index += 4 + $assoc_data_type.text.length();})+assoc_data_type[String assoc_dataType]: {$assoc_dataType.equals("labl")}? labl_ck;labl_ck| {$assoc_dataType.equals("note")}? note_ck| {$assoc_dataType.equals("ltxt")}? ltxt_ck| {$assoc_dataType.equals("file")}? file_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwName = dtDWord[$wave_<strong>for</strong>m::byteOrder]30


labal = dtZString[$ckSize.value - 4]({Utilities.isOdd($ckSize.value)}? pad_byte)?;note_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwName = dtDWord[$wave_<strong>for</strong>m::byteOrder]comment = dtZString[$ckSize.value - 4]({Utilities.isOdd($ckSize.value)}? pad_byte)?;ltxt_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwName = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwSampleLength = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwPurpose = dtString[4]wCountry = dtUWord[$wave_<strong>for</strong>m::byteOrder]wLanguage = dtUWord[$wave_<strong>for</strong>m::byteOrder]wDialect = dtUWord[$wave_<strong>for</strong>m::byteOrder]wCodePage = dtUWord[$wave_<strong>for</strong>m::byteOrder]({$ckSize.value - 20 > 0}? dtData[$ckSize.value - 20])?({Utilities.isOdd($ckSize.value)}? pad_byte)?;file_ck: ckSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwName = dtDWord[$wave_<strong>for</strong>m::byteOrder]dwMedType = dtDWord[$wave_<strong>for</strong>m::byteOrder]({$ckSize.value - 8 > 0}? dtData[$ckSize.value - 8])?({Utilities.isOdd($ckSize.value)}? pad_byte)?;info_list [long size]locals [long index = 4]: ({$index < $size}?typeInfo = dtString[4]info_type[$typeInfo.value]{$index += $info_type.text.length() + 4;})+;info_type[String typeInfo]31


: {$typeInfo.equals("ISFT")}? isft| {$typeInfo.equals("IENG")}? ieng| {$typeInfo.equals("ICRD")}? icrd| unknown<strong>Chunk</strong>;isft: strSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]software = dtZString[$strSize.value]({Utilities.isOdd($strSize.value)}? pad_byte)?;ieng: strSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]engineer = dtZString[$strSize.value]({Utilities.isOdd($strSize.value)}? pad_byte)?;icrd: strSize = dtDWord[$wave_<strong>for</strong>m::byteOrder]creationDate = dtZString[$strSize.value]({Utilities.isOdd($strSize.value)}? pad_byte)?;unknown<strong>Chunk</strong>: dtDWord[$wave_<strong>for</strong>m::byteOrder]dtData[$dtDWord.value]({Utilities.isOdd($dtDWord.value)}? pad_byte)?;pad_byte: dtUByte;unknown<strong>Chunk</strong>Loop [long size]locals [long index = 4]: ({$index < $size}?dtString[4]unknown<strong>Chunk</strong>{$index += $unknown<strong>Chunk</strong>.text.length() + 4;})+;Figure 15. The Wave_Form <strong>File</strong> Format Grammar32


6.4 Generating Parse Trees <strong>for</strong> <strong>Binary</strong> Array <strong>Grammars</strong>There are two ways to generate parse trees using ANTLR. One way is to create a tree grammar.The other way is to embed print statements as actions in the grammar itself. Both methods areawkward <strong>and</strong> make it difficult to both read <strong>and</strong> specify grammars. With the advent of ANTLR4[Parr 2012], this has changed. ANTLR4 has a built in parse tree walker. It automatically createsnodes <strong>for</strong> each rule in the grammar <strong>and</strong> a terminal node <strong>for</strong> each token.6.4.1 How ANTLR4 Generates Parse Trees <strong>for</strong> String <strong>Attribute</strong> <strong>Grammars</strong>ANTLR4 provides a comm<strong>and</strong> line parameter that allows the user to specify whether they wouldlike the parse tree displayed as a nested list style tree or as a graphical parse tree in a graphicalwindow. Figure 16 shows a parse tree displayed as a nested list.(prog (stat (expr 193) \r\n) (stat a = (expr 5) \r\n) (stat b = (expr 6) \r\n) (stat (expr (expr a) +(expr (expr b) * (expr 2))) \r\n) (stat (expr (expr ( (expr (expr 1) + (expr 2)) )) * (expr 3)) \r\n))Figure 16. Parse tree shown as a nested list.Figure 17 shows the parse tree output in a graphical user interface window.Figure 17. Parse tree shown as a graph.33


6.4.2 Using ANTLR4 to Generate Parse Trees <strong>for</strong> <strong>Binary</strong> <strong>Attribute</strong> <strong>Grammars</strong>There are two problems with using the ANTLR4 function <strong>for</strong> displaying the parse tree as anested list. Since the parse tree is displayed as one continuous string, the parse tree becomesalmost unreadable when the input file is longer than a couple of lines. If one uses the ANTLR4parse tree function on an attribute grammar that has been extended to enable binary data types,the output becomes even more unreadable. In string grammars, a token is a printable string. In abinary array grammar, the byte is not always a printable character. When a printable character<strong>for</strong> a byte is unavailable, a space is used by ANTLR4. Also, in the case of binary grammars, datatypes are represented as byte arrays of different lengths. Even a string is read as an array of bytetokens where each token value is separated by a space. The data type long integer is an array of 4bytes. The decimal number that the bytes of data represent is what is important, not theindividual bytes.Figure 18 shows a display of the ILBM <strong>for</strong>mated file Docking.bru.Figure 18. ILBM <strong>File</strong> Docking.bruFigure 19 shows a nested list style parse tree created from a parse of the file using the ILBMANTLR grammar shown in Figure 14. This single string requires 3 pages to printout <strong>and</strong> isalmost unreadable. Most of the terminal bytes are represented as spaces or odd characters. Theactual size of the image is shown to be 6594 bytes. However, to conserve space in this paper, inFigure 19 only the first three bytes <strong>and</strong> the last three bytes of the image are shown separated byan ellipsis.34


(ilbm (ckID (dtString (byteArray F O R M))) (ckSize (dtLong (byteArray D)))(ckID (dtString (byteArray I L B M))) (propertiesLoop (ckID (dtString (byteArray B M HD))) (property (bmhd (ckSize (dtLong (byteArray ))) (bitMapHeader (dtWord(byteArray ¹)) (dtWord (byteArray H)) (dtWord (byteArray )) (dtWord (byteArray)) (dtUByye (byteArray )) (dtUByye (byteArray )) (dtUByye (byteArray )) (dtUByye(byteArray €)) (dtWord (byteArray )) (dtUByye (byteArray )) (dtUByye (byteArray)) (dtWord (byteArray ¹)) (dtWord (byteArray ?))))) (ckID (dtString (byteArray A NN O))) (property (additionalProperty (ckSize (dtLong (byteArray H))) (dtData(byteArray $ V E R : W r i t t e n b y A S D G ' s A r t D e p a r t m e n tP r o f e s s i o n a l I F F 3 . 0 . 4 ( 0 5 . 0 2 . 9 5 ) )))) (ckID (dtString(byteArray C M A P))) (property (cmap (ckSize (dtLong (byteArray )))(colorRegister (dtUByye (byteArray ÿ)) (dtUByye (byteArray ÿ)) (dtUByye (byteArrayÿ))) (colorRegister (dtUByye (byteArray )) (dtUByye (byteArray )) (dtUByye(byteArray ))) (colorRegister (dtUByye (byteArray ÿ)) (dtUByye (byteArray ))(dtUByye (byteArray ))) (colorRegister (dtUByye (byteArray ÿ)) (dtUByye (byteArrayf)) (dtUByye (byteArray ))))) (ckID (dtString (byteArray C A M G))) (property (camg(ckSize (dtLong (byteArray ))) (dtLong (byteArray € )))) (ckID (dtString(byteArray D P I ))) (property (additionalProperty (ckSize (dtLong (byteArray))) (dtData (byteArray ¥ ,)))) (ckID (dtString (byteArray B O D Y)))) (body(ckSize (dtLong (byteArray ))) (dtHexString (byteArray © © … © ))))Figure 19. Parse Tree Created by ANTLR4The parse tree of binary array grammars can be made more readable by (1) pretty printing thenested lists so that the levels of the tree can be shown, <strong>and</strong> (2) by outputting the values returnedby the datatype functions.6.4.3 Extensions to ANTLR4 to Generate Pretty Printed Parse Trees with DataType ValuesANTLR4 creates the nested list style parse tree by using a built-in parse tree walker. WhenANTLR4 generates a parser, it creates a RuleContext <strong>for</strong> every rule. Each RuleContext has aRuleNode interface. The RuleNode contains a list of children, <strong>and</strong> an index into both a list ofRules <strong>and</strong> a list of RuleNames. A RuleNode child can be either a RuleNode or a TerminalNode.The TerminalNode is associated with text that matches its TokenType. When an ANTLR4created parser calls its top rule, it returns the rule’s RuleContext. To get the parse tree string theuser calls the toStringTree method of the returned context. The toStringTree method creates astring with its ruleName as the head of the list followed by a space then appends resulting stringsfrom its recursive calls to the toStringTree method of all its children. The recursive calling stopswhen it reaches a TerminalNode, which simply returns a string containing its matching text.A group of utilities, called prettyPrintTree, was created to pretty print a tree. This overloadedfunction behaves in a manner similar to the toStringTree method. This method takes a TreeNodeas a parameter <strong>and</strong> walks the RuleNodes until it finds a RuleNode with “dt” as the start of itsruleName. As shown in Figure 14, every data type rule starts with “dt”. All the data type rulescall binaryArray to build the data type from the stored data. Every data type rule has a return35


value called “value”. When the data type rule is reached it returns the string value of its “value”attribute. Another difference between the toStringTree method <strong>and</strong> the prettyPrintTree method isthat while ruleName is followed by a space in toStringTree, it is followed by a newline character<strong>and</strong> is indented according to the depth of its embedding in prettyPrintTree. The prettyPrintTreeutility methods are shown in Appendix C.Though the method is called prettyPrintTree, it does not actually print the parse tree. It <strong>for</strong>matsthe parse tree <strong>for</strong> printing or displaying in a window.Figure 20 shows the parse tree that is result of calling the prettyPrintTree utility method with thetop-level RuleContext returned from parsing the DOCKING.BRU ilbm file. Only a few of the6594 bytes in the image are shown in the figure. The pretty printed parse tree is easier to read<strong>and</strong> underst<strong>and</strong> than the nested lists shown in Figure 19.(ilbm(ckID(dtStringFORM))(ckSize(dtLong 6724))(ckID(dtStringILBM))(propertiesLoop(ckID(dtStringBMHD))(property(bmhd(ckSize(dtLong 20))(bitMapHeader(dtWord 706)(dtWord 328)(dtWord 0)(dtWord 0)(dtUByye 2)(dtUByye 0)(dtUByye 1)(dtUByye 194)(dtWord 0)(dtUByye 20)(dtUByye 11)(dtWord 706)(dtWord 450))))(ckID(dtStringANNO))(property(additionalProperty36


(ckSize(dtLong 72))(dtData$VER: Written by ASDG's Art Department Professional IFF3.0.4 (05.)))(ckID(dtStringCMAP))(property(cmap(ckSize(dtLong 12))(colorRegister(dtUByye 195)(dtUByye 195)(dtUByye 195))(colorRegister(dtUByye 0)(dtUByye 0)(dtUByye 0))(colorRegister(dtUByye 195)(dtUByye 0)(dtUByye 0))(colorRegister(dtUByye 195)(dtUByye 102)(dtUByye 0))))(ckID(dtStringCAMG))(property(camg(ckSize(dtLong 4))(dtLong 49792)))(ckID(dtStringDPI ))(property(additionalProperty(ckSize(dtLong 4))(dtData0x00c2a5012c)))(ckID(dtStringBODY)))(body(ckSize(dtLong 6594))(dtHexString0xc2a900c2a900c2a900c2a900c2a900c2a900c2a900c39b000101c3bec3bd00017c380c39700c2a900c39b0001c3bfc3bec3bd00027fc3bfc280c39800c2a900c39c0023fc3bfc3bec3bd00027fc3bfc3bec39800c2a900c39d000307c3bfc3bfc3bec3b37


. . .00c39c00023fc3bfc3bec3bd00027fc3bfc3bec39800c2a900c39b0001c3bfc3becbd00027fc3bfc280c39800c2a900c39b000103c3bec3bd00017fc380c39700c2a90c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a9000)))Figure 20. Pretty Printed Parse Tree of ILBM Formated <strong>File</strong> Docking.bru6.4.4 How ANTLR4’s <strong>Attribute</strong> <strong>Grammars</strong>, Semantic Predicates <strong>and</strong> ActionClauses Help with Parsing <strong>and</strong> ValidationAlong with the ability to pass attribute values to <strong>and</strong> from rules, ANTLR4 allows rules to createlocal variables that can be seen <strong>and</strong> modified by rules below them in the parse tree. It is best topass variables to <strong>and</strong> from rules using its parameters. But, this is not always efficient. Many file<strong>for</strong>mats use a single byte order <strong>for</strong> the entire file. It is inconvenient to pass the byte order fromevery rule down to the data type level. By setting a local variable at the top level of a grammar,all rules in the grammar can access the variable by prefacing the variable name with the top-levelrule’s name followed by two colons. Any rule can access one of its ancestor’s local variables inthis way. As seen in Figures 14 <strong>and</strong> 15, the ilbm grammar <strong>and</strong> wave_<strong>for</strong>m grammars each havea top-level variable defined as a string called byteOrder. In the ilbm grammar byteOrder is set to“BE” <strong>for</strong> big endian <strong>and</strong> in the wave_<strong>for</strong>m grammar the byteOrder is set to “LE” <strong>for</strong> little endian.Semantic predicates can be used in two ways. They can be used in guiding the parser or invalidating the file <strong>for</strong>mat. The property rule below is an example of semantic predicates beingused to guide the parse.property[String currentID]: {$currentID.equals("BMHD")}? bmhd| {$currentID.equals("CMAP")}? cmap| {$currentID.equals("CAMG")}? camg| {$currentID.equals("CRNG")}? crng| additionalProperty[currentID] ;The value of the semantic predicate currentID is not available until it is read from the file.Without this value, there no way to determine which rule should be called next. When thecurrentID is equal to the string “BMHD”, there are only two possible following rules—the bmhdrule or the additionalProperty rule. Because the bmhd path occurs first, it is chosen. If all thecurrentID tests fail, the additionalProperty rule would be the only possible rule. In that case, thevalue of chunk size is used. If the chunk size evaluates to odd, there is a pad byte at the end ofthe chunk.Another use of semantic predicates is in validation of file <strong>for</strong>mats. The top-level rule of thewave_<strong>for</strong>m grammar shown below uses semantic predicates to support validation. Semantic38


predicates used in this way are similar to assert statements in the C programming language. If thepredicates evaluate to false, a failed predicate exception is thrown by the parser <strong>and</strong> everythingfollowing the predicate is ignored. The local foundFmtCk variable at the top-level is set to truewhen the <strong>for</strong>mat chunk rule is successful. Later, the foundFmtCk will be used as a semanticpredicate to validate that the <strong>for</strong>mat chunk comes be<strong>for</strong>e the data chunk or be<strong>for</strong>e the LISTchunk with the list type “wavl”.wave_<strong>for</strong>mlocals [String byteOrder = "LE", boolean foundFmtCk]: riff = dtString[4] {$riff.value.equals("RIFF")}?dtDWord[$byteOrder]wave = dtString[4] {$wave.value.equals("WAVE")}?chunkOrListLoop ;ANTLR4’s action clauses can also be used to guide the parser or to aid in validating the file<strong>for</strong>mat. In action clauses, mathematical computations can be per<strong>for</strong>med <strong>and</strong> variables can beassigned values. An if statement inside an action clause can be used to throw an exception or tosend a message to the st<strong>and</strong>ard output or to the error output. The bmhd rule below shows thetop-level variable ilbm foundBMHD being set to true after successfully parsing the bitmapheader chunk. The body rule shows the foundBMHD value being tested <strong>for</strong> the purpose ofvalidating the file <strong>for</strong>mat.bmhdbody: ckSize {$ckSize.value == 20}?bitMapHeader{$ilbm::foundBMHD = true;} ;: {$ilbm::foundBMHD}?ckSize dtHexString[$ckSize.value] ;Shown below is the unknown<strong>Chunk</strong>Loop rule that is found in the wave<strong>for</strong>m grammar. It iscalled when a list is encountered that has an unknown list type. There is no way to know howmany chunks it contains or what the chunks are. The action clause increments the index so theloop can be terminated when the size of the accumulated chunks equal the size of the list chunk.unknown<strong>Chunk</strong>Loop [long size]locals [long index = 4]: ({$index < $size}?dtString[4]unknown<strong>Chunk</strong>{$index += $unknown<strong>Chunk</strong>.text.length() + 4;})+ ;39


7 Related ResearchResearchers have developed a number of data description languages to support validation of file<strong>for</strong>mats. EAST (Enhanced ADA Subset) is a data description language developed by theConsultative Committee <strong>for</strong> Space Data Systems [2010]. The Data Entity DictionarySpecification Language (DEDSL) can be used in conjunction with EAST <strong>for</strong> defining semanticin<strong>for</strong>mation. The EAST description is used to interpret <strong>and</strong> provide access to in<strong>for</strong>mation inbinary <strong>and</strong> text files.ASN.1 (Abstract Syntax Notation One) is an international st<strong>and</strong>ard which aims a specifying thestructure of telecommunication <strong>and</strong> computer networking protocols between heterogeneoussystems [Larmouth, 1999; Dubuission 2000; Walkin 2006].DATASCRIPT [Back 2002] supports specifying <strong>and</strong> parsing binary data <strong>and</strong> has been used tomanipulate Java jar files <strong>and</strong> ELF object files. The PADS/C data description language can beused to define the file <strong>for</strong>mats of text <strong>and</strong> binary files [Fisher & Gruber 2005]. The PADS/Ccompiler compiles the description into tools that can be used to recognize, manipulate <strong>and</strong>trans<strong>for</strong>m the data into other <strong>for</strong>mats.The Data Format Description Language (DFDL) is being developed by a Working Group of theOpen Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XMLSchema using annotations to carry the extra in<strong>for</strong>mation necessary to describe non-XML physical representations. Version 1 of the language specification was published in 2011<strong>and</strong> a parser is being implemented.Each of the data description languages described above can be used to define data types <strong>and</strong> filestructures of binary files. However, they are based on file structure declarations such as those ofC or Java. The binary file <strong>for</strong>mat grammar described in this paper most closely resembles DFDL.However, the binary file grammar described in this paper is the only data description languagebased on <strong>for</strong>mal grammars that is used <strong>for</strong> creating parsers <strong>for</strong> file <strong>for</strong>mats.JHOVE (JSTOR/Harvard Object Validation Environment) is an API written in Java <strong>for</strong> thevalidation of the <strong>for</strong>mats of digital files [Harvard University Library 2011]. JHOVE supportsvalidation of the following <strong>for</strong>mats: AIFF, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, WAVE,XML <strong>and</strong> ASCII <strong>and</strong> UTF-8 encoded text. JHOVE2 is a rewrite of JHOVE. It currently containsvalidation modules <strong>for</strong> the following <strong>for</strong>mats: ICC Color Profile, SGML, ESRI Shapefile, TIFF,WAVE, XML <strong>and</strong> UTF-8 encoded text [Cali<strong>for</strong>nia Digital Library 2011].The research reported in this paper is similar in intent to that of the JHOVE projects—validationof binary file <strong>for</strong>mats. As a matter of fact, binary array grammars <strong>and</strong> parsers are being40


constructed <strong>for</strong> the chunk-based file <strong>for</strong>mats AIFF, JPEG, JPEG 2000, WAVE <strong>and</strong> ESRIShapefile. However, the research reported herein differs from that of the JHOVE projects in thatwhat is sought is a technology <strong>for</strong> generating validators <strong>for</strong> binary file <strong>for</strong>mats from grammarsspecifying the binary file <strong>for</strong>mat.8. ConclusionThe research question addressed by his research is whether it is possible to extend the contextfreegrammars used to specify the syntax of programming languages to the specification ofbinary file <strong>for</strong>mats <strong>and</strong> to use these grammars with parsers <strong>for</strong> validating the file <strong>for</strong>mats ofbinary files. A major family of binary file <strong>for</strong>mats, chunk-based, was described. Then extensionsto the concepts of context-free grammars <strong>and</strong> attribute grammars were described that enable thespecification of binary file <strong>for</strong>mats. Examples of attribute array grammars <strong>for</strong> two chunk-basedbinary file <strong>for</strong>mats were then presented. Recursive descent parsers <strong>for</strong> this class of grammars wasthen described. Experience in using ANTLR, a parser generator <strong>for</strong> LL(k) string grammars, ingenerating parsers <strong>for</strong> recognizing the <strong>for</strong>mats of binary files was discussed. Finally, relatedresearch in data description languages <strong>for</strong> binary file <strong>for</strong>mats was described <strong>and</strong> the JHOVE2project which is creating Java-based tools <strong>for</strong> validating file <strong>for</strong>mats was described.It is concluded that it is possible to extend context-free grammars to the specification of chunkbasedbinary file <strong>for</strong>mats. Furthermore, these grammars can be used with recursive descentparsers <strong>for</strong> parsing the file <strong>for</strong>mats of chunk-based binary files. It remains to be determinedwhether these attribute grammars based on binary array grammars are adequate to define otherfamilies of binary file <strong>for</strong>mats.A capability to validate binary file <strong>for</strong>mats assumes that the file <strong>for</strong>mat of a file is alreadyknown. In this research, we use a file type identifier based on the UNIX file comm<strong>and</strong> <strong>and</strong> amagic file that we have created [Underwood 2009]. We have samples of the chunk-based file<strong>for</strong>mats discussed in this report including those referenced in Appendices A <strong>and</strong> B. We also havefile signature tests (magic tests) <strong>for</strong> all these file types.We would like to have a parser generator (also called a compiler–compiler) that would take asinput an attribute grammar <strong>for</strong> a binary file <strong>for</strong>mat <strong>and</strong> generate a parser or interpreter thatrecognizes the file <strong>for</strong>mat or that interprets the file <strong>for</strong>mat <strong>and</strong> displays or plays the contents. Theclosest parser generator that we have found to having these capabilities is ANTLR <strong>and</strong> ANTLR4(www.antlr.org/ ). ANTLR will generate a top-down recursive descent parser from a context-freeattribute grammar. However, it works with strings, <strong>and</strong> generates a separate lexical analyzer.41


Appendices A <strong>and</strong> B show the names of 86 file <strong>for</strong>mats that belong to the family of “chunkbased”file <strong>for</strong>mats. We have specifications <strong>for</strong> most of these file <strong>for</strong>mats <strong>and</strong> samples of fileswith these <strong>for</strong>mats.These file <strong>for</strong>mats have a similar structure. We have developed binary array grammars <strong>for</strong> two ofthese binary file <strong>for</strong>mats. We should be able to do so <strong>for</strong> all of them. We used these grammarswith the extensions to ANTLR4 to generate top-down recursive descent parsers <strong>for</strong> the file<strong>for</strong>mats defined with these grammars.In this paper it is shown how to extend ANTLR4 to generate parsers <strong>for</strong> binary array attributegrammars by creating lexical rules <strong>and</strong> functions <strong>for</strong> binary data types. It is also shown how togenerate parse trees represented as nested lists that can be pretty printed.In this paper, it has not yet been demonstrated how to create a parser that validates binary file<strong>for</strong>mats with error messages <strong>for</strong> noncompliant syntax or semantics. Validation of binary file<strong>for</strong>mats via binary array attribute grammars will be demonstrated in a <strong>for</strong>thcoming report onbinary array attribute grammars <strong>for</strong> directory-based binary file <strong>for</strong>mats.The significance of success in this research task is that if binary file <strong>for</strong>mats can be specifiedwith binary array attribute grammars, then only one parsing generator is needed to generate theparsers <strong>for</strong> verifying that a file con<strong>for</strong>ms to a particular <strong>for</strong>mat. Similarly, the same parsinggenerator can possibly be used <strong>for</strong> conversion of legacy file <strong>for</strong>mats to current or st<strong>and</strong>ard<strong>for</strong>mats. Finally, the same parser generator can possibly be used <strong>for</strong> generating viewers/players<strong>for</strong> most file <strong>for</strong>mats. This would increase the likelihood of preserving <strong>and</strong> making available intothe indefinite future those digital records encoded in binary file <strong>for</strong>mats.42


References[3GPP2 2003] 3GPP2 C.S0050, 3GPP2 <strong>File</strong> <strong>Formats</strong> <strong>for</strong> Multimedia Services, <strong>File</strong> Format <strong>for</strong>Multimedia Services <strong>for</strong> cdma2000. www.3gpp2.org/Public_html/specs/index.cfm[Aho 1968] A. V. Aho. Indexed <strong>Grammars</strong> – An Extension of Context-free <strong>Grammars</strong>. Journalof the ACM, Vol. 15, Issue 4, Oct. 1968.[Alias-wavefront 1999] Alias-Wavefront. Maya <strong>File</strong> <strong>Formats</strong>, Maya 2.0, 1999.http://caad.arch.ethz.ch/info/maya/manual/[Amigan 2011] Amigan Software. IFF FORM <strong>and</strong> <strong>Chunk</strong> Registryhttp://amigan.1emu.net/reg/iff.html[Anisimovsky 2002a] Valery V. Anisimovsky. Electronic Arts Audio <strong>File</strong> <strong>Formats</strong> Description,May 1, 2002. www.extractor.ru/articles/electronic_arts_audio_file_<strong>for</strong>mats_description[Anisimovsky 2002b] Valery V. Anisimovsky. Electronic Arts New Audio <strong>File</strong> <strong>Formats</strong>Description, May 1 2002. www.wotsit.org/list.asp?search=electronic+arts&button=GO%21[Apple 1989] Audio Interchange <strong>File</strong> Format: "AIFF" A St<strong>and</strong>ard <strong>for</strong> Sampled Sound <strong>File</strong>sVersion 1.3 Apple Computer, Inc., January 1989.www.aesnashville.org/PDFs/AIFF%20Specs.pdf[Apple 1991] AIFF-C Apple Computer, Inc. Draft 08/26/91http://www-mmsp.ece.mcgill.ca/Documents/Audio<strong>Formats</strong>/AIFF/Docs/AIFF-C.9.26.91.pdf[Apple 1996] Apple Computer Inc. QuickTime <strong>File</strong> Format Specification, May 1996.www.idea2ic.com/<strong>File</strong>_<strong>Formats</strong>/QuickTime%20<strong>File</strong>%20Format.pdf[Apple 2001] Apple Computer, Inc. QuickTime <strong>File</strong> Format, 2001-03-01.https://developer.apple.com/st<strong>and</strong>ards/classicquicktime.html[Apple 2005] Apple. Core Audio Format Specification, 2005http://developer.apple.com/library/mac/#documentation/MusicAudio/Reference/CAFSpec/CAF_spec/CAF_spec.html[Apple 2007] Apple Inc. QuickTime <strong>File</strong> Format Specification, 2007.http://developer.apple.com/library/mac/#documentation/QuickTime/QTFF/QTFFPreface/qtffPreface.html[Autodesk 1997] Autodesk Ltd. 3D Studio <strong>File</strong> Format (3ds). Document Revision 0.93 - January1997 www.dcs.ed.ac.uk/home/mxr/gfx/3d/3DS.spec43


[Autodesk 2009] Autodesk. DXF Reference, January 2009.http://usa.autodesk.com/adsk/servlet/item?linkID=10809853&id=12272454&siteID=123112[Back 2002] G. Back. (2002) A specification <strong>and</strong> scripting language <strong>for</strong> binary data. GenerativeProgramming <strong>and</strong> Component Engineering, Vol. 2487, Lecture Notes in Computer Science, 66-77.[Barth 1979] G. Barth. Fast recognition of context-sensitive structures. Journal of Computing, V22, 1979, pp. 243-256.[Baum & Noel 2002]D. Baum & G. Noel. The Simms Interchange <strong>File</strong> Format, 2002http://simtech.source<strong>for</strong>ge.net/tech/iff.html[Bayless 1987] James Bayless. WORD (word processing FORM used by ProWrite) NewHorizons Software, Inc. 1987. http://amigan.1emu.net/reg/WORD.txt[Beham 1995] H. Beham. DigiTrekker Module Format (*.DTM) rev1.0, 1995.ftp://ftp.modl<strong>and</strong>.com/pub/documents/<strong>for</strong>mat_documentation/DigiTrekker%20(.dtm).txt[Blank Aspect 2010a] blankaspect.org. NLF (Nested List <strong>File</strong>) <strong>for</strong>mat, December 2010.http://onda.source<strong>for</strong>ge.net/nlf/nlfFormat.html[Blank Aspect 2010b] blankaspect.org. Onda lossless audio compression algorithm & Onda file<strong>for</strong>mats, March 2010. http://onda.source<strong>for</strong>ge.net/ondaAlgorithmAnd<strong>File</strong><strong>Formats</strong>.html[Cali<strong>for</strong>nia Digital Library 2011] Cali<strong>for</strong>nia Digital Library. JHOVE2 User’s Guide. 2011https://bytebucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide_20110222.pdf[Caligari 1998] Caligri. Caligari trueSpace file <strong>for</strong>mat, Document Version 6, July2002. http://www.caligari.com/download/zip/scncob65.zip[CCITT 1992] CCITT. Terminal Equipment And Protocols <strong>for</strong> Telematic Services In<strong>for</strong>mationTechnology Digital Compression <strong>and</strong> Coding of Continuous-Tone Still Images Requirements<strong>and</strong> Guidelines Recommendation T.81, September 1992. http://www.w3.org/Graphics/JPEG/itut81.pdf[CCSDS 2010] Consultative Committee <strong>for</strong> Space Data Systems. (2010) The Data DescriptionLanguage EAST Specification. http://public.ccsds.org/publications/archive/644x0b3.pdfCompuphase. The FLIC <strong>File</strong> Format. www.compuphase.com/flic.htm[Corel 1998] Corel Corporation, CMX <strong>File</strong> Format Ver. 8.0, 31 March 1998.44


[Crazy Talk] Crazy Talk. Animator Pro file <strong>Formats</strong>. www.martinreddy.net/gfx/anim/FLC.txt[Cunniff <strong>and</strong> Orr] Ross Cunniff <strong>and</strong> John Orr. 2D St<strong>and</strong>ard Object Format: FORM DR2D.Taliesin, Inc. http://amigan.1emu.net/reg/DR2D.txt[Cuturicu <strong>and</strong> Fromme 1999] Cuturicu, Cristi <strong>and</strong> Oliver Fromme. .CRYX's note about the JPEGdecoding algorithm., 1999. www.coco3.com/text/doc_JPEG.txt[Dubuission 2000] O. Dubuission. ASN.1 Communication between heterogeneous systems.2000. www.oss.com/asn1/resources/books-whitepapers-pubs/dubuisson-asn1-book.PDF[Dunckley et al 2007] Dunckley, M., Rankin, S., Conway, E. & Giaretta, D. (2007) The use offile description languages <strong>for</strong> file <strong>for</strong>mat identification <strong>and</strong> validation. Proc. PV 2007Conference: Ensuring the Long-Term Preservation <strong>and</strong> Value Adding to Scientific <strong>and</strong> TechnicalData, DLR, Oberpfaffenhofen/Munich, Germany. http://epubs.cclrc.ac.uk/workdetails?w=50089[EBU 2001] European Broadcasting Union. Specification of the EBU Broadcast Wave Format,Tech 3285, July 2001. http://tech.ebu.ch/docs/tech/tech3285.pdf[Ecma 2009] Ecma International. JPEG <strong>File</strong> Interchange Format (JFIF). ECMA TR/98 1st ed.,Ecma International, 2009. www.ecma-international.org/publications/techreports/E-TR-098.htm[ESRI 1998] ESRI Shapefile Technical Description. July 1998.www.esri.com/library/whitepapers/pdfs/shapefile.pdf[ETSI 2012] ETSI TS 26.244; Transparent end-to-end packet switched streaming service (PSS);3GPP file <strong>for</strong>mat (3GP) 2012 www.3gpp.org/ftp/Specs/html-info/26244.htmFrans Faase. BFF: A Grammar <strong>for</strong> <strong>Binary</strong> <strong>File</strong> <strong>Formats</strong>. www.iwriteiam.nl/Ha_BFF.html[Fercoq 1996] Robin Fercoq, 3D-Studio Material-Library <strong>File</strong> Format (.mli), Autodesk Ltd.,May 1996. www.mediatel.lu/workshop/graphic/3D_file<strong>for</strong>mat/h_3ds2.html[Ferguson 1995] Stuart Ferguson. Lightwave 3d Layered Object <strong>File</strong> Format. Lightwave, 10/4/95www.mediatel.lu/workshop/graphic/3D_file<strong>for</strong>mat/lightwave/h_lwlo.html[Fisher & Gruber 2005] Fisher, K. & Gruber, R. (2005). PADS: A domain specific language <strong>for</strong>processing ad hoc data. ACM Conference on Programming language Design <strong>and</strong>Implementation. ACM Press. http://www.padsproj.org/papers/pldi.pdf[Foo 1995] Kenneth Foo, Audio Manager 4 Module Format. TechnoMaestro/RDG, 199545


[Frost 1997a] Martin Frost. Z-machine Common Save-<strong>File</strong> Format St<strong>and</strong>ard also called Quetzal:Quetzal Unifies Efficiently The Z-Machine Archive Language version 1.3b 29th September1997. www.gnelson.demon.co.uk/zspec/quetzal.html[Frost 1997b] Martin Frost. Z-machine Common Save-<strong>File</strong> Format St<strong>and</strong>ard. also calledQuetzal: Quetzal Unifies Efficiently The Z-Machine Archive Language version 1.4 (03-Nov-97)http://ifarchive.heanet.ie/if-archive/infocom/interpreters/specification/savefile_14.txt[Glanville 2003] R. S. Glanville. Format of Anim8or .an8 <strong>File</strong>s - V0.85, 21-Sep-03.http://www.anim8or.com/resources/an8_<strong>for</strong>mat.txt[Google 2010] Google. WebP RIFF Container, 2010.http://code.google.com/speed/webp/docs/riff_container.html[Haal<strong>and</strong>] Mike Haal<strong>and</strong>. Flic <strong>File</strong>s (.FLI) Format description.http://www.martinreddy.net/gfx/anim/FLI.txt[Hamilton 1992] Eric Hamilton. JPEG <strong>File</strong> Interchange Format: Version 1.02., C-CubeMicrosystems; Milpitas, CA; September 1, 1992. www.jpeg.org/public/jfif.pdf[Harvard University Library 2011] JHOVE Documentation.http://jhove.source<strong>for</strong>ge.net/documentation.html[Hayes <strong>and</strong> Morrison 1985] Steve Hayes <strong>and</strong> Jerry Morrison, "8SVX" IFF 8-Bit Sampled Voice.Electronic Arts, February 7, 1985. http://amigan.1emu.net/reg/8SVX.txt[Hopcroft et al 1995] J.E. Hopcroft, R. Motwani, <strong>and</strong> J.D. Ullman. Introduction to AutomataTheory, Languages <strong>and</strong> Computation, Second Edition. Addison Wesley, ISBN: 0-201-44124-1,1995.[Houghtaling 1996] R. James Houghtaling. ANI (Windows95 Animated Cursor <strong>File</strong> Format),August 1996. www.wotsit.org/list.asp?search=ani&button=GO%21[IBM & Microsoft 1991] IBM <strong>and</strong> Microsoft. Multimedia Programming Interface <strong>and</strong> DataSpecifications 1.0, August 1991. www.kk.iij4u.or.jp/~kondo/wave/mpidata.txtwww.tactilemedia.com/info/MCI_Control_Info.htmlhttp://www-mmsp.ece.mcgill.ca/documents/audio<strong>for</strong>mats/wave/Docs/riffmci.pdf[Int. MIDI Assoc 2008] The International MIDI Association. St<strong>and</strong>ard MIDI-<strong>File</strong> Format Spec.1.1, 2008 http://duskblue.org/proj/toymidi/midi<strong>for</strong>mat.pdf[ISO/IEC JTC 1/SC 29/WG 1] JPEG 2000 Part I : Core Coding System, Final Committee DraftVersion 1.0, March, 2000. http://www.jpeg.org/public/fcd15444-1.pdf46


[ISO/IEC 10918-1:1994] In<strong>for</strong>mation technology -- Digital compression <strong>and</strong> coding ofcontinuous-tone still images-Requirements <strong>and</strong> Guidelines.www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=18902[ISO/IEC IS 10918-3] Digital Compression <strong>and</strong> Coding of Continuous-Tone Still Images -Extensions. Also ITU-T T.84 Annex F www.jpeg.org/public/spiff.pdf[ISO/IEC 13818-7:2006] In<strong>for</strong>mation technology -- Generic coding of moving pictures <strong>and</strong>associated audio in<strong>for</strong>mation -- Part 7: Advanced Audio Coding (AAC)[ISO/IEC 14495-1:1999] In<strong>for</strong>mation technology -- Lossless <strong>and</strong> near-lossless compression ofcontinuous-tone still images: Baseline, December 1, 1999.[ISO/IEC 14495-2:2003] In<strong>for</strong>mation technology -- Lossless <strong>and</strong> near-lossless compression ofcontinuous-tone still images: Extensions, April 2, 2003[ISO/IEC 14496-3:2009] - In<strong>for</strong>mation technology -- Coding of audio-visual objects -- Part 3:Audio http://webstore.iec.ch/preview/info_isoiec14496-3%7Bed4.0%7Den.pdf[ISO/IEC 14496-12:2012]. In<strong>for</strong>mation technology -- Coding of audio-visual objects -- Part 12:ISO base media file <strong>for</strong>mat (technically identical to ISO/IEC 15444-12:2012).http://st<strong>and</strong>ards.iso.org/ittf/PubliclyAvailableSt<strong>and</strong>ards/index.html[ISO/IEC 14496-14:2003] In<strong>for</strong>mation technology -- Coding of audio-visual objects -- Part 14:MP4 file <strong>for</strong>mat.[ISO/IEC 15444-1:2004] In<strong>for</strong>mation technology -- JPEG 2000 image coding system -- Part 1:Core Coding System. Annex I JP2 <strong>File</strong> Format syntax Sept 15 2004.www.jpeg.org/jpeg2000/CDs15444.html[ISO/IEC 15444-2:2000] In<strong>for</strong>mation technology -- JPEG 2000 image coding system -- Part 2:Extensions, Dec 7 2000, Final Committee Draft (FCD) version available aswww.jpeg.org/public/fcd15444-2.pdf[ISO/IEC 15444-12:2012] In<strong>for</strong>mation technology -- JPEG 2000 image coding system - ISOBase Media Format. http://st<strong>and</strong>ards.iso.org/ittf/PubliclyAvailableSt<strong>and</strong>ards/index.html[ISO/IEC 15948:2003] In<strong>for</strong>mation technology — Computer graphics <strong>and</strong> image processing —Portable Network Graphics (PNG): Functional specification. W3C Recommendation 10November 2003 www.w3.org/TR/2003/REC-PNG-20031110/[ITU 1996] T-REC-T.84. In<strong>for</strong>mation technology -- Digital compression <strong>and</strong> coding ofcontinuous-tone still images: Extensions, 1996. www.jpeg.org/public/spiff.pdf47


[Jasc 1998] Jasc Software . Paint Shop Pro (PSP) <strong>File</strong> Format Specification, Jasc Software,November, 1998 (Paint Shop Pro (PSP) <strong>File</strong> Format Version 3.0, Paint Shop Pro ApplicationVersion 5.0) www.jasc.com/support/kb/articles/pspspec.asp[JEIDA 1998] Japan Electronic Industry Development Association. Digital Still Camera Image<strong>File</strong> Format St<strong>and</strong>ard (Exchangeable image file <strong>for</strong>mat <strong>for</strong> Digital Still Camera: Exif), Version2.1, JEIDA-49-1998, December 1998 www.exif.org/dcf-exif.PDF[JEITA 2002] Japan Electronic In<strong>for</strong>mation Technology Industries Association. Exchangeableimage file <strong>for</strong>mat <strong>for</strong> Digital Still Camera: Exif Version 2.2. JEITA CP-3451. April 2002.www.exif.org/Exif2-2.PDFwww.kodak.com/global/plugins/acrobat/en/service/digCam/exifSt<strong>and</strong>ard2.pdf[Kirvan 1994] S. Kirvan. Imagine 3.0 Object Format, Impulse Inc., May 1994.http://www.textfiles.com/programming/FORMATS/t3d_doc1.txt[Knuth 1968] D. E. Knuth. Semantics of Context-Free <strong>Grammars</strong>. Mathematical Systems Theory,vol. 2, pp. 127-145, 1968.[Knuth 1971] D. E. Knuth. Semantics of Context-Free <strong>Grammars</strong> (corrections). MathematicalSystems Theory, vol. 5, pp. 95_96, 1971.[Koster 1971] C.H.A. Koster. Affix grammars. In ALGOL 68 Implementation, J.E.L. Peck (ed.)North-Holl<strong>and</strong> Publ. Co., Amsterdam, 1971, pp. 95-109.[Kuznetkov 2001] Introduction to Advanced Streaming Format version 1.0, January, 2001.http://avifile.source<strong>for</strong>ge.net/asf-1.0.htm[Larmouth 1999] J. Larmouth. ASN.1 Complete. Morgan Kaufmann.1999.www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf[Lemone 1993] Karen Lemone, Design of Compilers: techniques of Programming LanguageTranslation, CRC Press, 1993. http://web.cs.wpi.edu/~kal/courses/compilers/project/index.htmlhttp://web.cs.wpi.edu/~kal/courses/compilers/module4/mysa.html[LightWave 1996] LightWave 3D Object <strong>File</strong> Format Oct 16, 1996http://s<strong>and</strong>box.de/osg/lightwave.htm[Lightwave 2001a] Lightwave. Lightwave 3D Object <strong>File</strong>s (LWo2], November 9, 2001http://muskrat.middlebury.edu/lt/cr/ongoing/sciviz/LightWave/SDK/docs/filefmts/lwsc.html[Losch 1997] Philip Losch. CINEMA 4D® — <strong>File</strong> Format Description, MAXON Computer,1997. http://paulbourke.net/data<strong>for</strong>mats/cinema4d/cinema4d.pdf48


[Luxology 2006] Luxology. “Appendix H: LXO file <strong>for</strong>mat Extensions” modo 202 Scripting <strong>and</strong>Comm<strong>and</strong>s. 2006 www.kxcad.net/modo/Scripting_<strong>and</strong>_Comm<strong>and</strong>s.pdf[Maishein] Max Maischein. PIC(PRO) Format.http://147.91.177.212/extra/file<strong>for</strong>mat/graphics/pic/picpro.txt[Martin 2007] Michael Martin. SVAF <strong>for</strong>mat description, Version 0.2, January 2007.http://svaf.source<strong>for</strong>ge.net/svaf.html[Microsoft 1992] Microsoft Multimedia St<strong>and</strong>ards Update, New Multimedia Data Types <strong>and</strong>Data Techniques July 29, 1992 Revision: 1.0.97http://netghost.narod.ru/gff/vendspec/micriff/ms_riff.txt[Microsoft 1994] Microsoft Corporation. Microsoft Multimedia St<strong>and</strong>ards Update: NewMultimedia Data Types <strong>and</strong> Data Techniques., Revision 3.0, April 15, 1994.http://download.microsoft.com/download/9/8/6/9863C72A-A3AA-4DDB-B1BA-CA8D17EFD2D4/RIFFNEW.pdf[Microsoft & RealNetwork 1998] Microsoft Corporation <strong>and</strong> RealNetworks, Inc. AdvancedStreaming Format (ASF) Specification. Public Specification Version 1.0, February 26, 1998.http://avifile.source<strong>for</strong>ge.net/asf0298rtf.htm http://avifile.source<strong>for</strong>ge.net/asf0298rtf.zip[Microsoft 2003b] Microsoft Corporation. Multiple Channel Audio Data <strong>and</strong> WAVE <strong>File</strong>shttp://www-mmsp.ece.mcgill.ca/Documents/Audio<strong>Formats</strong>/WAVE/Docs/multichaudP.pdf[Microsoft 2004] Microsoft. Advanced Systems Format (ASF) Specification, Revision 01.20.03,December 2004. www.microsoft.com/windows/windowsmedia/<strong>for</strong>pros/<strong>for</strong>mat/asfspec.aspxwww.microsoft.com/en-us/download/details.aspx?displaylang=en&id=14995[Microsoft 2005] Microsoft, AVI RIFF <strong>File</strong> Reference, 2005 http://msdn.microsoft.com/enus/library/ms899421.aspx[Morrison 1985] Jerry Morrison. “EA IFF 85” St<strong>and</strong>ard <strong>for</strong> Interchange Format <strong>File</strong>s,Electronic Arts, January 14, 1985 www.martinreddy.net/gfx/2d/IFF.txt[Morrison 1986a] Jerry Morrison. "ILBM" IFF Interleaved Bitmap , Electronic Arts, January 17,1986 www.fine-view.com/jp/labs/doc/ilbm.txt[Morrison 1986b] J. Morrison. “SMUS" IFF Simple Musical Score”, Electronic Arts, Inc.,February 5, 1986. http://amigan.1emu.net/reg/SMUS.txt[MultimediaWiki 2006] Smush Animation Format, August 2006http://wiki.multimedia.cx/index.php?title=SAN#<strong>Chunk</strong>_Forma49


[MultimeduiaWiki 2008a] Electronic Arts CMV, April 2008http://wiki.multimedia.cx/index.php?title=Electronic_Arts_CMV[MultimediaWiki 2008b] Electronic Arts DCT, April 2008http://wiki.multimedia.cx/index.php?title=Electronic_Arts_DCT[MultimediaWiki 2008c] Electronic Arts MAD, April 2008http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MAD[MultimediaWiki 2008d] Electronic Arts MPC, April 2008http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MPC[MultimediaWiki 2008e] Electronic Arts VP6, April 2008http://wiki.multimedia.cx/index.php?title=Electronic_Arts_VP6[MultimediaWiki 2008f] Electronic Arts TGV, August 2008http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGV[MultimediaWiki 2008g] Electronic Arts TGQ, November 2008http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGQ[MultimediaWiki 2008h] American Laser Games MM, January 2008http://wiki.multimedia.cx/index.php?title=American_Laser_Games_MM[MultimediaWiki 2009a] Electronic Arts TQI, February 2009http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TQI[MultimediaWiki 2009b] SANM, October 2009.http://wiki.multimedia.cx/index.php?title=SANM[Novikov <strong>and</strong> Fillipov] Igor E. Novikov <strong>and</strong> Valek Fillipov. Cdrloader.py, A cdr import filterwritten in Python in the open source graphics file convertor, UniConvertor.http://sk1project.org/modules.php?name=Products&product=cdrexplorer[OGF 2011] OGF DFDL WG. Data Format Description Language (DFDL) v1.0 Specification,January 31, 2011. www.ogf.org/documents/GFD.174.pdf[Oppenheim 1988] Dave Oppenheim. St<strong>and</strong>ard MIDI <strong>File</strong>s 0.06, March 1, 1988www.ibiblio.org/emusic-l/info-docs-FAQs/MIDI-doc/MIDI-SMF.txt[Parr & Quong 1995] T. J. Parr & R. W. Quong. ANTLR: A predicated- ll(k) parser generator.Software: Practice <strong>and</strong> Experience. 25, 7 (July 1995), 789-810.http://www.antlr.org/article/1055550346383/antlr.pdf50


[Parr 2007] T. Parr. The Definitive ANTLR Reference: Building Domain-Specific Languages.Pragmatic, 2007[Parr 2012] T. Parr. The Definitive ANTLR4 Reference. Pragmatic, 2012.[Peters] Christoph Peters. The Ultimate 3D model file <strong>for</strong>mat (v. 2.0)http://ultimate3d.org/U3DModel<strong>File</strong>FormatV2.htm[Pfeiffer 2003] S. Pfeiffer. The Ogg Encapsulation Format Version 0, RFC 3533, May 2003.http://xiph.org/ogg/doc/rfc3533.txt[Plotkin] Andrew Plotkin . Blorb: An IF Resource Collection Format St<strong>and</strong>ard, Version 2.0www.eblong.com/zarf/blorb/www.ifarchive.org/if-archive/infocom/interpreters/specification/blorb_<strong>for</strong>mat.txt[R<strong>and</strong>elshofer 2011] Werner R<strong>and</strong>elshofer. Class RIFFParser, January 6, 2011www.r<strong>and</strong>elshofer.ch/multishow/javadoc/ch/r<strong>and</strong>elshofer/media/riff/RIFFParser.html[R<strong>and</strong>ers-Pehrson 2001] Glenn R<strong>and</strong>ers-Pehrson (ed.) MNG (Multiple-image Network Graphics)Format Version 1.0, 2001 www.libpng.org/pub/mng/spec/#Credits[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel <strong>File</strong>Format, Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003, OpenOffice.org, April 2008.http://sc.openoffice.org/excelfile<strong>for</strong>mat.pdf[REWiki 2009]reWiki. .IFF. 2009 http://rewiki.regengedanken.de/wiki/.IFF[RFC 3072] M. Wildgrube. Structured Data Exchange Format (SDXF) Request <strong>for</strong> Comments:3072, March 2001 www.ietf.org/rfc/rfc3072.txt[Shaw et al 1985] Steve Shaw, Jerry Morrison <strong>and</strong> Bob Burns, FTXT: IFF Formatted Text, Draft2.6, Electronic Arts. November 15, 1985 http://<strong>for</strong>tunaty.net/com/textfiles/computers/ftxt[Soras 1995] L. de Soras. Official <strong>for</strong>mats of the Graoumf Tracker modules .GT2 v1 - v3 <strong>and</strong>old GT modules GTK, 31/08/95. http://www.wotsit.org/list.asp?search=GTK&button=GO%21[Sparta 1988] Sparta, Inc., ANIM: An IFF Format For CEL Animations, 4 May 1988http://amigan.1emu.net/reg/ANIM.txt[Tamir 2005] Giora Tamir. C# RIFF Parser, June 2005www.codeproject.com/KB/files/riffparser.aspx51


[Thirunarayan 2008] Krishnaprasad Thirunarayan. <strong>Attribute</strong> <strong>Grammars</strong> <strong>and</strong> their Applications.In: Encyclopedia of In<strong>for</strong>mation Science <strong>and</strong> Technology, Second Edition, Editor: MehdiKhosrow-Pour, In<strong>for</strong>mation Science Publishing, pp. 268-273, 2008.www.cs.wright.edu/~tkprasad/papers/<strong>Attribute</strong>-<strong>Grammars</strong>.pdf[Underwood 2009] W. Underwood. Extensions of the UNIX file Comm<strong>and</strong> <strong>and</strong> Magic <strong>File</strong> <strong>for</strong><strong>File</strong> Type Identification. PERPOS TR ITTL/CSITD 09-02 Georgia Tech Research Institute,September 2009.[Ung 1996] David Ung. Retargetable loader, Honours thesis, Computer Science <strong>and</strong> ElectricalEngineering, The University of Queensl<strong>and</strong>, 1996http://itee.uq.edu.au/~cristina/students/david/david.html[Walkin 2006] L. Walkin. Using the Open Source ANS.1 Compiler. 2006http://lionet.info/asn1c/asn1c-usage.pdf[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.Automata, Languages <strong>and</strong> Programming. Lecture Notes in Computer Science #14, J. Loeckx (e.)Springer-Verlag, Berlin, 1974, pp. 9-16.[Wikipedia 2012] Wikipedia. DivX http://en.wikipedia.org/wiki/DivX[Wilhelm <strong>and</strong> Mauer 1995] R. Wilhelm <strong>and</strong> D. Maurer. Compiler Design. Addison-Wesley,1995. (chapter 10) (<strong>Attribute</strong> <strong>Grammars</strong>)[Xiph 2010] Xiph.Org Foundation. Vorbis I specification, February 3, 2010http://xiph.org/vorbis/doc/Vorbis_I_spec.html[Zappe 1994] H. Zappe. Oktalyzer, 1994.http://www.wotsit.org/list.asp?search=okt&button=GO%2152


Appendix A: <strong>Binary</strong> Array <strong>Grammars</strong> <strong>for</strong> <strong>Chunk</strong>-based <strong>Binary</strong> <strong>File</strong><strong>Formats</strong>This appendix sketches binary array grammars <strong>for</strong> some of binary chunk-based file <strong>for</strong>mats.A.1 Electronic Arts Interchange <strong>File</strong> Format (IFF)Electronic Arts <strong>and</strong> Commodore-Amiga introduced the first chunk-based file <strong>for</strong>mat in thespecification <strong>for</strong> the Interchange <strong>File</strong> Format in 1985[Morrison 1985, Amiga 2011].A.1.1 Interleaved Bitmap (ILBM)A binary array attribute grammar <strong>for</strong> this file <strong>for</strong>mat was presented in section 4.1 of this report.[Morrison 1986]A.1.2 8-bit Sampled Voice (8SVX)The file <strong>for</strong>mat specification <strong>for</strong> the 9-bit sampled voice file <strong>for</strong>mat uses C programminglanguage data structures to define the <strong>for</strong>mat [Hayes <strong>and</strong> Morrison 1985]. The following is a reexpressionof that specification as a context-free binary array grammar. It does not yet includesynthesized <strong>and</strong> inherited attributes. → “FORM” “8SVX” → → < bodyck> → < bodyck> → < bodyck> → < bodyck> → < bodyck> →< voice8header> < bodyck> → < releaseck>< bodyck> → “NAME” CHAR[kk] → ‘(c) ‘ CHAR[ll] → “AUTH”CHAR[mm] → → → “ANNO” CHAR[i]→ “ATAK” [nn] → “RLSE” [m] → < dest FIXED> → “BODY” samples[n]53


A.2 St<strong>and</strong>ard Midi <strong>File</strong> FormatOppenheim [1988] described the midi file <strong>for</strong>mat using the pseudo EBNF notation shown below[Int. MIDI Assoc 2008]. → → “MThd” → → → “MTrk” → + → → | | → → 0xF0 → 0xF7 → 0xFF The “length of header data” is a 32-bit big-endian integer with value 6. Header data consists ofthree 16-bit integers. The pseudo EBNF can be more precisely specified with the followingcontext -free binary array grammar. This grammar does not yet include synthesized <strong>and</strong> inheritedattributes → → “MThd” → → 1 | 2| 3 → → “MTrk” → + → → | | → → 0xF0 [k] → 0xF7 [m] → 0xFF [n]The datatype VARLENQUAN is a Variable –Length Quantity. “These numbers are represented7 bits per byte, most significant bits first. All bytes except the last have bit 7 set, <strong>and</strong> the last bytehas bit 7 clear.”A.3 Audio Interchange <strong>File</strong> FormatThe Audio Interchange <strong>File</strong> Format (AIFF) developed by Apple Computer in 1988 was based onElectronic Arts’ Interchange <strong>File</strong> Format. The file <strong>for</strong>mat is specified using C data structure54


definitions [Apple 1989]. The following grammar recasts the specification as a context-freepseudo EBNF grammar. It does not yet include the data types per a binary array grammar orsynthesized <strong>and</strong> inherited attributes. -> “FORM”“AIFF” -> -> COMM” -> “SSND”[k-8] -> “MARK”[mm] -> -> -> “INST” -> -> 0 | 1 | 2 -> “MIDI”[nn] -> “AESD”[24] -> “APPL”[kk-4] -> “COMT”[i-4] ->[i]55


A.4 Resource Interchange <strong>File</strong> <strong>Formats</strong>The Resource Interchange <strong>File</strong> Format (RIFF) introduced by IBM <strong>and</strong> Microsoft[1991] is alsobased on Electronic Arts’ Interchange <strong>File</strong> Format. The Microsoft container <strong>for</strong>mats like Audio-Video Interleave (AVI) <strong>and</strong> Wave<strong>for</strong>m PCM (WAV) use RIFF as their basis [Microsoft 1992][R<strong>and</strong>elshofer 2011]. WebP, a picture <strong>for</strong>mat introduced in (2010) by Google, also uses RIFF asa container.A.4.1 RIFF-Wave<strong>for</strong>m (WAVE)This file <strong>for</strong>mat was discussed in section 6.3 of this report [IBM & Microsoft 1991].A.4.2 WAVEFORMATEXThe WAVEFORMATEX type must contain both a fact chunk <strong>and</strong> an extended wave <strong>for</strong>matdescription within the 'fmt' chunk [Microsoft 1994]. The fact chunk specifies the length of thefile in samples. Wave_Format-PCM has value 0x0001 <strong>for</strong> wFormat Tag. WAVEFORMATEXhas different values <strong>for</strong> wFormat Tag. It also adds cbsize, the count in bytes of the extra size tothe common fields. -> [m]A.5 JPEGThe Joint Photographic Experts Group (JPEG) group issued the first JPEG st<strong>and</strong>ard in 1992 asITU-T Recommendation T.81 [CCITT 1992] <strong>and</strong> in 1994 as [ISO/IEC 10918-1:1994]The JPEG st<strong>and</strong>ard specifies the codec, which defines how an image is compressed into a streamof bytes <strong>and</strong> decompressed back into an image, but not the file <strong>for</strong>mat used to contain thatstream. The Exif <strong>and</strong> JFIF st<strong>and</strong>ards define the commonly used file <strong>for</strong>mats <strong>for</strong> interchange ofJPEG-compressed images.A.5.1 JPEG <strong>File</strong> Interchange Format (JFIF)A JPEG image consists of a sequence of segments, each beginning with a marker, each of whichbegins with a 0xFF byte followed by a byte indicating what kind of marker it is. Some markersconsist of just those two bytes; others are followed by two bytes indicating the length of marker-56


data that follows. The length includes the two bytes <strong>for</strong> the length, but not the two bytes <strong>for</strong> themarker. [Hamilton 1992, Cuturicu <strong>and</strong> Fromme 1999, Ecma 2009]APP0 is the Application Segment <strong>for</strong> JFIF. → < segments> → 0xFFE0 “JFIF” > [tw x th] → → [n-2] → 0xFF → 0xFFD8 → 0xFFD9 → 0x00A.5.2 JPEG/ExifAPP1 is the Application Segment <strong>for</strong> EXIF <strong>for</strong> JPEG files. [JEIDA 1998, JEITA 2002] <strong>Attribute</strong>In<strong>for</strong>mation <strong>for</strong> JPEG compressed files is stored in the Tag in<strong>for</strong>mation <strong>for</strong>mat of TIFF 6.0. → [APP2>] → 0xFFE1 “Exif” →”II”|”MM”)(0x2A → | | | | → ”0201” → 0xFFE2”FPXR” → 0xFFDB[64]


→ 0xFFDA → 0xFFD8 → 0xFFD9 → 0x00A.5.3 SPIFF (Still Picture Interchange <strong>File</strong> Format)SPIFF is a JPEG file <strong>for</strong>mat that is intended to replace the defacto JFIF (JPEG <strong>File</strong> InterchangeFormat). SPIFF includes all of the features of JFIF <strong>and</strong> adds more functionality. The SPIFF file<strong>for</strong>mat is specified in annex F of ITU-T Recommendation T.84. [ITU 1998] → < DirectoryEntries> → 0xFFE8 “SPIFF” → → 0xFFE8 → 0xFFE8 [n-6] → 0xFFD8 → 0xFFD9 → 0x00A.6 Advanced Systems FormatMicrosoft’s Advanced Systems Format (ASF) version 1 [Kuznetkov 2001] is also a chunk-basedcontainer <strong>for</strong>mat. It was developed about 1995-96, but its specification was never published.58


Microsoft’s Advanced System Format version 2 (Public Version 1) was published in 1998[Microsoft & RealNetwork 1998] with a revision in 2004 [Microsoft 2004].The <strong>for</strong>mat does not specify with which codec the video or audio should be encoded. It justspecifies the structure of the video/audio stream. The most common file <strong>for</strong>mats (codecs?)contained within an ASF file are Windows Media Audio (WMA) <strong>and</strong> Windows Media Video(WMV). → → → → 59


[m][n] → [n] → [m]A.6.1 Windows Media Audio 7-9When the Stream Type of the Stream Properties Object has the value ASF_Audio_Media =F8699E40-5B4D-11CF-A8FD-00805F5C442B, the ASF audio media type that populates theType-Specific Data field of the Stream Properties Object is represented using the followingstructure.[m] → [n]The specifies the unique ID (FormatTag) of the codec used to encode the audio data.The following values indicate codecs <strong>for</strong> Windows Media Audio 7-9.Codec name Codec ID/ Notes60


Windows MediaAudioWindows MediaAudio 9 ProfessionalWindows MediaAudio 9 LosslessFormat tag0x1610x1620x163Versions 7, 8, <strong>and</strong> 9 Series9 Series9 SeriesA.6.2 Windows Media Video 7-9.1When the Stream Type of the Stream Properties Object has the value ASF_Video_Media =BC19EFC0-5B4D-11CF-A8FD-00805F5C442B, the ASF video media type that populates theType-Specific Data field of the Stream Properties Object is represented using the followingstructure. [Microsoft 2004] →


WMV3 Windows Media Video 9A.7 Portable Network Graphics[ISO/IEC 15948:2003] -> → 0x89”PNG”0xODOA1A0A -> “IHDR” -> “PLTE” -> → ”IDAT”[n] → cksize UINT32>”IEND”A.8 Apple QuickTime Movie[Apple 2007] → → [atomsize-16] → → 62


→ → →


A.9 ESRI Shapefile[ESRI 1998]→ [5] → | → → | | | | | | | | | |64


| | | → → → → → → → → → → → → → → → → [NumPoints] → → [NumParts][NumPoints] → [NumParts][NumPoints> → → [NumPoints][BNumPoints] → 65


[NumParts][NumPoints][NumPoints] → [NumParts][NumPoints][NumPoints] → → [NumPoints][NumPoints][NumPoints] → [NumParts][NumPoints][NumPoints][NumPoints → [NumParts][NumPoints]66


[NumPoints][NumPoints → [NumParts][NumParts][NumPoints][NumPoints][NumPoints67


Appendix B: Other <strong>Binary</strong> <strong>Chunk</strong>-based <strong>File</strong> <strong>Formats</strong><strong>File</strong> Format Name Reference CommentsCEL Animation (ANIM) Sparta 1988 Electronic Arts IFFIFF Formatted Text (FTXT) Shaw et al 1985 “IFF Simple Musical Score Morrison 1986b “(SMUS)Planar Bitmap (PBM) REWiki 2009 “Audio Interchange <strong>File</strong> Format - Apple 1991CompressedWAVEFORMATEXTENSIBLE Microsoft 2003bResource Interchange <strong>File</strong><strong>Formats</strong> (RIFF)Broadcast Wave Format (BWF) EBU 2001 “Windows Audio-Video Microsoft 2005 “Interleave (AVI)Animated Windows Cursor(ANI)Houghtaling 1996 “RIFF MIDIfile (RMID) IBM & Microsoft 1991 “DIVX Media Format (DMF) Wikipedia 2012 “Device Independent Bitmap IBM & Microsoft 1991 “(RIFF RDIB)WebP Google 2010 “JPEG Titled Image Pyramid ISO/IEC 10918-3:1997(JTIP)JPEG-LS ISO/IEC 14495-1:1999ISO/IEC 14495-2:2003Multiple-Image Network R<strong>and</strong>ers-Pehrson 2001GraphicsJPEGn Network Graphics R<strong>and</strong>ers-Pehrson 2001<strong>Binary</strong> Interchange <strong>File</strong> Format Rentz 2008 Excel <strong>File</strong> Format,Versions 2, 3, 4, 5, 95, 97,2000, XP, 20033D Studio <strong>File</strong> <strong>for</strong>mat – 3ds Autodesk 19973D Studio Material Library <strong>File</strong> Fercoq 1996Anim8or Glanville 2003Animator Pro CelCrazy Talk68


Animator Pro FliHaal<strong>and</strong>Animator FLCComuphaseAnimator Pro PICMaischeinAudio Manager Module Foo 1995Autodesk Drawing Interchange Autodesk 2009<strong>File</strong>s (DXF) (<strong>Binary</strong>)Caligari Truespace Scene/Object Caligari 1998(<strong>Binary</strong>)CorelDraw Vector Graphics <strong>File</strong> Novikov <strong>and</strong> Fillipov(Versions 3.0-X3)Corel Metafile Exchange Image Corel 1998(CMX)DigiTrekker Module Format Beham 1995 Music ModulesGraoumf Tracker Modules Soras 1995 “Oktalyzer Zappa 1994 “Electronic Arts Music (.asf) Anisimovsky 2002b EA Multimedia Game <strong>File</strong><strong>Formats</strong>, Anisimovky2002aElectronic ARTS CMV Multimedia Wiki 2008a “Electronic Arts DCT Multimedia Wiki 2008b “Electronic Arts MAD Multimedia Wiki 2008c “Electronic Arts MPC Multimedia Wiki 2008d “Electronic Arts VP6 Multimedia Wiki 2008e “Electronic Arts TGV Multimedia Wiki 2008f “Electronic Arts TGQ Multimedia Wiki 2008g “Electronic Arts TQI Multimedia Wiki 2009a “Imagine 3D Data Description Kirvan 1994(TDDD)LWOB Lightwave 1996 Lightwave 3D ObjectsLWO2 Lightwave 2001a “LWLO Ferguson 1995 “Luxology Modo 3D Object Luxology 2006(LXOB)Pant Shop Pro (PSP) Jasc 1998ProVector 2D Drawing Cunni9ff <strong>and</strong> OrrProWrite Document Bayless 1987Blorb Z-machine Resource <strong>File</strong> Plotkin Interactive Fiction69


Quetzal Frost 1997a, 1997b Interactive FictionApple QuickTime Apple 1996, 2001Apple itunes Video ISO/IEC 14496-14:2003Apple itunes AAC Audio ISO/IEC 13818-7:2006ISO/IEC 14496-3:2009Apple itunes AAC AESProduction AudioISO/IEC 13818-7:2006ISO/IEC 14496-3:2009JPEG 2000 ISO/IEC 15444-1:2004JPEG 2000 Codestream ISO/IEC JTC 1/SC 29/WG 1Annex A: Codestream SyntaxISO Base Media Format ISO/IEC 14496-12:2012ISO/IEC 15444-12:2012MPEG-4 ISO/IEC 14496-14:2003Motion JPEG 2000 ISO/IEC 15444-3ITU-T T.8023GGP Media ETSI 20123GPP2 Media 3GPP2 2003Maya IFF Bitmap Alias-Wavefront 1999Apple Core Audio <strong>for</strong>mat Apple 2005American Laser Games (MM) Multimedia Wiki 2008hSmush Animation Format(SAN)Multimedia Wiki 2006Multimedia Wiki 2009bStructured Data eXchange RFC 3072Format (SDXF)Ogg Vorbis Pfeiffer 2003, Xiph 2010Simple Vector Array Format Martin 2007Nested List <strong>File</strong> (NLF) Blank Aspect 2010a 2010bCinema4D Losch 1997The Sims Interchange <strong>File</strong>FormatBaum & Noel 200270


Appendix C: Utilities to Pretty Print a Nested List Style ParseTreeprivate static int depth = 0;public static String prettyPrintTree(Tree t) throws Exception {return prettyPrintTree(t, (List) null);}public static String prettyPrintTree(Tree t, Parser recog)throws Exception {String[] ruleNames = recog != null ? recog.getRuleNames() : null;List ruleNamesList =ruleNames != null ? Arrays.asList(ruleNames) : null;return prettyPrintTree(t, ruleNamesList);}public static String prettyPrintTree(Tree t, List ruleNames)throws Exception {String ruleName = Trees.getNodeText(t, ruleNames);StringBuilder buf = new StringBuilder();depth++;buf.append("(");buf.append(ruleName);buf.append(' ');if (ruleName.startsWith("dt")) {// Since t is a Rule Node <strong>and</strong> it starts with dt,// it must be the context of one of the data type rules.// All the data type rules have a return field named value.Object value = t.getClass().getField("value").get(t);if (value instanceof String) {String str = (String) value;if (depth + str.length() > 0) {buf.append(prettyPrintTree(str));} else {buf.append(str);}} else {71


uf.append(t.getClass().getField("value").get(t));}} else if (t.getChildCount() > 0) {<strong>for</strong> (int i = 0; i < t.getChildCount(); i++) {buf.append('\n');<strong>for</strong> (int j = 0; j < depth; j++) {buf.append(" ");}buf.append(prettyPrintTree(t.getChild(i), ruleNames));}}buf.append(")");depth--;return buf.toString();}public static String prettyPrintTree(String str) throws Exception {StringBuilder buf = new StringBuilder();int subLength = 70 - depth;do {buf.append('\n');<strong>for</strong> (int j = 0; j < depth; j++) {buf.append(" ");}if (str.length() > subLength) {buf.append(str.substring(0, subLength));str = str.substring(subLength + 1);} else {buf.append(str);}} while (depth + str.length() > 70);return buf.toString();}72

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!